1<?xml version="1.0"?>
2<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
3               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
4  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
5  <!ENTITY version SYSTEM "version.xml">
6]>
7<chapter id="clusters">
8  <title>Clusters</title>
9  <section id="clusters-and-shaping">
10    <title>Clusters and shaping</title>
11    <para>
12      In text shaping, a <emphasis>cluster</emphasis> is a sequence of
13      characters that needs to be treated as a single, indivisible
14      unit. A single letter or symbol can be a cluster of its
15      own. Other clusters correspond to longer subsequences of the
16      input code points &mdash; such as a ligature or conjunct form
17      &mdash; and require the shaper to ensure that the cluster is not
18      broken during the shaping process.
19    </para>
20    <para>
21      A cluster is distinct from a <emphasis>grapheme</emphasis>,
      which is the smallest functional unit of a writing system or
23      script.
24    </para>
25    <para>
26      The definitions of the two terms are similar. However, clusters
27      are only relevant for script shaping and glyph layout. In
28      contrast, graphemes are a property of the underlying script, and
29      are of interest when client programs implement orthographic
30      or linguistic functionality.
31    </para>
32    <para>
33      For example, two individual letters are often two separate
34      graphemes. When two letters form a ligature, however, they
35      combine into a single glyph. They are then part of the same
36      cluster and are treated as a unit by the shaping engine &mdash;
37      even though the two original, underlying letters remain separate
38      graphemes.
39    </para>
40    <para>
      HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
      with graphemes, although client programs using HarfBuzz may
      still need to work with graphemes for other purposes, such as
      cursor positioning.
44    </para>
45    <para>
46      During the shaping process, there are several shaping operations
47      that may merge adjacent characters (for example, when two code
48      points form a ligature or a conjunct form and are replaced by a
49      single glyph) or split one character into several (for example,
50      when decomposing a code point through the
51      <literal>ccmp</literal> feature). Operations like these alter
52      clusters; HarfBuzz tracks the changes to ensure that no clusters
53      get lost or broken during shaping.
54    </para>
55    <para>
56      HarfBuzz records cluster information independently from how
57      shaping operations affect the individual glyphs returned in an
58      output buffer. Consequently, a client program using HarfBuzz can
59      utilize the cluster information to implement features such as:
60    </para>
61    <itemizedlist>
62      <listitem>
63	<para>
64	  Correctly positioning the cursor within a shaped text run,
65	  even when characters have formed ligatures, composed or
66	  decomposed, reordered, or undergone other shaping operations.
67	</para>
68      </listitem>
69      <listitem>
70	<para>
71	  Correctly highlighting a text selection that includes some,
72	  but not all, of the characters in a word.
73	</para>
74      </listitem>
75      <listitem>
76	<para>
77	  Applying text attributes (such as color or underlining) to
78	  part, but not all, of a word.
79	</para>
80      </listitem>
81      <listitem>
82	<para>
83	  Generating output document formats (such as PDF) with
84	  embedded text that can be fully extracted.
85	</para>
86      </listitem>
87      <listitem>
88	<para>
89	  Determining the mapping between input characters and output
90	  glyphs, such as which glyphs are ligatures.
91	</para>
92      </listitem>
93      <listitem>
94	<para>
95	  Performing line-breaking, justification, and other
96	  line-level or paragraph-level operations that must be done
97	  after shaping is complete, but which require examining
98	  character-level properties.
99	</para>
100      </listitem>
101    </itemizedlist>
102  </section>
103  <section id="working-with-harfbuzz-clusters">
104    <title>Working with HarfBuzz clusters</title>
105    <para>
106      When you add text to a HarfBuzz buffer, each code point must be
107      assigned a <emphasis>cluster value</emphasis>.
108    </para>
109    <para>
110      This cluster value is an arbitrary number; HarfBuzz uses it only
111      to distinguish between clusters. Many client programs will use
112      the index of each code point in the input text stream as the
113      cluster value. This is for the sake of convenience; the actual
114      value does not matter.
115    </para>
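    <para>
      The convention described above can be sketched in a few lines
      of Python. This is only an illustration of the bookkeeping a
      client program might do before filling a buffer; the helper
      name <literal>initial_cluster_values</literal> is hypothetical
      and is not part of the HarfBuzz API.
    </para>
    <programlisting><![CDATA[
```python
def initial_cluster_values(text):
    """Pair each code point with its index in the input text.

    Assumption: the client uses code point indices as cluster values.
    This is a common convention, not a requirement.
    """
    return [(ch, index) for index, ch in enumerate(text)]


# Python strings iterate by code point, so each letter gets its own value.
print(initial_cluster_values("ABCDE"))
```
]]></programlisting>
    <para>
      Any other numbering scheme would serve equally well, as long
      as the client program can map the values back to positions in
      its own text.
    </para>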
116    <para>
117      Some of the shaping operations performed by HarfBuzz &mdash;
118      such as reordering, composition, decomposition, and substitution
119      &mdash; may alter the cluster values of some characters. The
120      final cluster values in the buffer at the end of the shaping
121      process will indicate to client programs which subsequences of
122      glyphs represent a cluster and, therefore, must not be
123      separated.
124    </para>
125    <para>
126      In addition, client programs can query the final cluster values
127      to discern other potentially important information about the
128      glyphs in the output buffer (such as whether or not a ligature
129      was formed).
130    </para>
131    <para>
132      For example, if the initial sequence of cluster values was:
133    </para>
134    <programlisting>
135      0,1,2,3,4
136    </programlisting>
137    <para>
138      and the final sequence of cluster values is:
139    </para>
140    <programlisting>
141      0,0,3,3
142    </programlisting>
143    <para>
144      then there are two clusters in the output buffer: the first
145      cluster includes the first two glyphs, and the second cluster
146      includes the third and fourth glyphs. It is also evident that a
147      ligature or conjunct has been formed, because there are fewer
148      glyphs in the output buffer (four) than there were code points
149      in the input buffer (five).
150    </para>
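    <para>
      The reasoning in this example can be reproduced with a short
      Python sketch that groups adjacent output glyphs by their
      final cluster value. The function name
      <literal>group_glyphs_by_cluster</literal> is an assumption
      made for illustration; the two lists are the cluster values
      shown above.
    </para>
    <programlisting><![CDATA[
```python
from itertools import groupby


def group_glyphs_by_cluster(final_clusters):
    """Group adjacent glyph indices that share a final cluster value."""
    groups = []
    for value, run in groupby(enumerate(final_clusters), key=lambda pair: pair[1]):
        groups.append((value, [index for index, _ in run]))
    return groups


initial = [0, 1, 2, 3, 4]  # one cluster value per input code point
final = [0, 0, 3, 3]       # one cluster value per output glyph

print(group_glyphs_by_cluster(final))  # [(0, [0, 1]), (3, [2, 3])]
print(len(final) < len(initial))       # True: fewer glyphs than code points
```
]]></programlisting>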
151    <para>
152      Although client programs using HarfBuzz are free to assign
      initial cluster values in any manner they choose, HarfBuzz
154      does offer some useful guarantees if the cluster values are
155      assigned in a monotonic (either non-decreasing or non-increasing)
156      order.
157    </para>
158    <para>
159      For buffers in the left-to-right (LTR)
160      or top-to-bottom (TTB) text flow direction,
161      HarfBuzz will preserve the monotonic property: client programs
162      are guaranteed that monotonically increasing initial cluster
163      values will be returned as monotonically increasing final
164      cluster values.
165    </para>
166    <para>
167      For buffers in the right-to-left (RTL)
168      or bottom-to-top (BTT) text flow direction,
169      the directionality of the buffer itself is reversed for final
170      output as a matter of design. Therefore, HarfBuzz inverts the
171      monotonic property: client programs are guaranteed that
172      monotonically increasing initial cluster values will be
173      returned as monotonically <emphasis>decreasing</emphasis> final
174      cluster values.
175    </para>
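    <para>
      A client program could verify these two guarantees with a
      check along the following lines. This is a minimal sketch
      that assumes the initial cluster values were assigned in
      increasing order; the function names and the direction
      strings are hypothetical and do not correspond to HarfBuzz
      identifiers.
    </para>
    <programlisting><![CDATA[
```python
def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


def is_non_increasing(values):
    return all(a >= b for a, b in zip(values, values[1:]))


def matches_direction(final_clusters, direction):
    """Check the final cluster order expected for a text flow direction.

    Assumes the initial cluster values were monotonically increasing.
    """
    if direction in ("LTR", "TTB"):
        return is_non_decreasing(final_clusters)
    if direction in ("RTL", "BTT"):
        return is_non_increasing(final_clusters)
    raise ValueError("unknown direction: " + direction)


print(matches_direction([0, 0, 3, 3], "LTR"))  # True
print(matches_direction([3, 3, 0, 0], "RTL"))  # True
```
]]></programlisting>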
176    <para>
177      Client programs can adjust how HarfBuzz handles clusters during
178      shaping by setting the
179      <literal>cluster_level</literal> of the
180      buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
181      clustering support for this property:
182    </para>
183    <itemizedlist>
184      <listitem>
185	<para><emphasis>Level 0</emphasis> is the default and
186	reproduces the behavior of the old HarfBuzz library.
187	</para>
188	<para>
189	  The distinguishing feature of level 0 behavior is that, at
190	  the beginning of processing the buffer, all code points that
191	  are categorized as <emphasis>marks</emphasis>,
192	  <emphasis>modifier symbols</emphasis>, or
193	  <emphasis>Emoji extended pictographic</emphasis> modifiers,
194	  as well as the <emphasis>Zero Width Joiner</emphasis> and
195	  <emphasis>Zero Width Non-Joiner</emphasis> code points, are
196	  assigned the cluster value of the closest preceding code
	  point from a <emphasis>different</emphasis> category.
198	</para>
199	<para>
200	  In essence, whenever a base character is followed by a mark
201	  character or a sequence of mark characters, those marks are
202	  reassigned to the same initial cluster value as the base
203	  character. This reassignment is referred to as
204	  "merging" the affected clusters. This behavior is based on
205	  the Grapheme Cluster Boundary specification in <ulink
206	  url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
207	  Technical Report 29</ulink>.
208	</para>
209	<para>
210	  Client programs can specify level 0 behavior for a buffer by
211	  setting its <literal>cluster_level</literal> to
212	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
213	</para>
214      </listitem>
215      <listitem>
216	<para>
217	  <emphasis>Level 1</emphasis> tweaks the old behavior
218	  slightly to produce better results. Therefore, level 1
	  clustering is recommended for code that does not need to
	  remain backward compatible with the old HarfBuzz.
221	</para>
222	<para>
223	  Level 1 differs from level 0 by not merging the
224	  clusters of marks and other modifier code points with the
225	  preceding "base" code point's cluster. By preserving the
226	  separate cluster values of these marks and modifier code
227	  points, script shapers can perform additional operations
228	  that might lead to improved results (for example, reordering
229	  a sequence of marks).
230	</para>
231	<para>
232	  Client programs can specify level 1 behavior for a buffer by
233	  setting its <literal>cluster_level</literal> to
234	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
235	</para>
236      </listitem>
237      <listitem>
238	<para>
239	  <emphasis>Level 2</emphasis> differs significantly in how it
240	  treats cluster values. In level 2, HarfBuzz never merges
241	  clusters.
242	</para>
243	<para>
244	  This difference can be seen most clearly when HarfBuzz processes
245	  ligature substitutions and glyph decompositions. In level 0
246	  and level 1, ligatures and glyph decomposition both involve
247	  merging clusters; in level 2, neither of these operations
248	  triggers a merge.
249	</para>
250	<para>
251	  Client programs can specify level 2 behavior for a buffer by
252	  setting its <literal>cluster_level</literal> to
253	  <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
254	</para>
255      </listitem>
256    </itemizedlist>
257    <para>
258      As mentioned earlier, client programs using HarfBuzz often
259      assign initial cluster values in a buffer by reusing the indices
260      of the code points in the input text. This gives a sequence of
261      cluster values that is monotonically increasing (for example,
262      0,1,2,3,4).
263    </para>
264    <para>
265      It is not <emphasis>required</emphasis> that the cluster values
266      in a buffer be monotonically increasing. However, if the initial
267      cluster values in a buffer are monotonic and the buffer is
268      configured to use cluster level 0 or 1, then HarfBuzz
269      guarantees that the final cluster values in the shaped buffer
270      will also be monotonic. No such guarantee is made for cluster
271      level 2.
272    </para>
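    <para>
      The properties of the three levels, as described so far, can
      be summarized in a small lookup table. The sketch below is
      only a summary aid: the constant names are the ones given
      above, while the <literal>ClusterLevel</literal> tuple and its
      field names are hypothetical.
    </para>
    <programlisting><![CDATA[
```python
from collections import namedtuple

# Field names are illustrative; only the constant names come from HarfBuzz.
ClusterLevel = namedtuple(
    "ClusterLevel", ["constant", "merges_marks_first", "keeps_monotone"]
)

CLUSTER_LEVELS = {
    0: ClusterLevel("HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES", True, True),
    1: ClusterLevel("HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS", False, True),
    2: ClusterLevel("HB_BUFFER_CLUSTER_LEVEL_CHARACTERS", False, False),
}


def guarantees_monotone_output(level):
    """True if monotonic input cluster values stay monotonic after shaping."""
    return CLUSTER_LEVELS[level].keeps_monotone


for level, info in sorted(CLUSTER_LEVELS.items()):
    print(level, info.constant, guarantees_monotone_output(level))
```
]]></programlisting>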
273    <para>
274      In levels 0 and 1, HarfBuzz implements the following conceptual
275      model for cluster values:
276    </para>
277    <itemizedlist spacing="compact">
278      <listitem>
279	<para>
280          If the sequence of input cluster values is monotonic, the
281	  sequence of cluster values will remain monotonic.
282	</para>
283      </listitem>
284      <listitem>
285	<para>
286          Each cluster value represents a single cluster.
287	</para>
288      </listitem>
289      <listitem>
290	<para>
291          Each cluster contains one or more glyphs and one or more
292          characters.
293	</para>
294      </listitem>
295    </itemizedlist>
296    <para>
297      In practice, this model offers several benefits. Assuming that
298      the initial cluster values were monotonically increasing
299      and distinct before shaping began, then, in the final output:
300    </para>
301    <itemizedlist spacing="compact">
302      <listitem>
303	<para>
304	  All adjacent glyphs having the same final cluster
305	  value belong to the same cluster.
306	</para>
307      </listitem>
308      <listitem>
309	<para>
310          Each character belongs to the cluster that has the highest
311	  cluster value <emphasis>not larger than</emphasis> its
312	  initial cluster value.
313	</para>
314      </listitem>
315    </itemizedlist>
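    <para>
      The second property gives client programs a simple way to
      find the cluster of any input character. The following Python
      sketch applies that rule with a binary search. It assumes
      that the initial cluster values were distinct and increasing;
      the function name <literal>cluster_of_character</literal> is
      hypothetical.
    </para>
    <programlisting><![CDATA[
```python
from bisect import bisect_right


def cluster_of_character(final_clusters, initial_value):
    """Return the final cluster value that a character belongs to.

    The character belongs to the cluster with the highest cluster
    value that is not larger than the character's initial value.
    """
    distinct = sorted(set(final_clusters))
    position = bisect_right(distinct, initial_value)
    if position == 0:
        raise ValueError("no cluster value at or below %r" % initial_value)
    return distinct[position - 1]


final = [0, 1, 1, 1, 1, 4]             # final cluster values, one per glyph
print(cluster_of_character(final, 3))  # 1
print(cluster_of_character(final, 4))  # 4
```
]]></programlisting>
    <para>
      Sorting the distinct values first means the same lookup also
      works when the final values are in decreasing order.
    </para>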
316  </section>
317
318  <section id="a-clustering-example-for-levels-0-and-1">
319    <title>A clustering example for levels 0 and 1</title>
320    <para>
321      The basic shaping operations affect clusters in a predictable
322      manner when using level 0 or level 1:
323    </para>
324    <itemizedlist>
325      <listitem>
326	<para>
327	  When two or more clusters <emphasis>merge</emphasis>, the
328	  resulting merged cluster takes as its cluster value the
329	  <emphasis>minimum</emphasis> of the incoming cluster values.
330	</para>
331      </listitem>
332      <listitem>
333	<para>
334	  When a cluster <emphasis>decomposes</emphasis>, all of the
335	  resulting child clusters inherit as their cluster value the
336	  cluster value of the parent cluster.
337	</para>
338      </listitem>
339      <listitem>
340	<para>
341	  When a character is <emphasis>reordered</emphasis>, the
342	  reordered character and all clusters that the character
343	  moves past as part of the reordering are merged into one cluster.
344	</para>
345      </listitem>
346    </itemizedlist>
347    <para>
348      The functionality, guarantees, and benefits of level 0 and level
349      1 behavior can be seen with some examples. First, let us examine
350      what happens with cluster values when shaping involves cluster
351      merging with ligatures and decomposition.
352    </para>
353
354    <para>
      Let us say we start with the following character sequence (top row) and
356      initial cluster values (bottom row):
357    </para>
358    <programlisting>
359      A,B,C,D,E
360      0,1,2,3,4
361    </programlisting>
362    <para>
363      During shaping, HarfBuzz maps these characters to glyphs from
364      the font. For simplicity, let us assume that each character maps
365      to the corresponding, identical-looking glyph:
366    </para>
367    <programlisting>
368      A,B,C,D,E
369      0,1,2,3,4
370    </programlisting>
371    <para>
372      Now if, for example, <literal>B</literal> and <literal>C</literal>
373      form a ligature, then the clusters to which they belong
374      &quot;merge&quot;. This merged cluster takes for its cluster
375      value the minimum of all the cluster values of the clusters that
      went into the ligature. In this case, we get:
377    </para>
378    <programlisting>
379      A,BC,D,E
380      0,1 ,3,4
381    </programlisting>
382    <para>
383      because 1 is the minimum of the set {1,2}, which were the
384      cluster values of <literal>B</literal> and
385      <literal>C</literal>.
386    </para>
387    <para>
388      Next, let us say that the <literal>BC</literal> ligature glyph
389      decomposes into three components, and <literal>D</literal> also
390      decomposes into two components. Whenever a cluster decomposes,
391      its components each inherit the cluster value of their parent:
392    </para>
393    <programlisting>
394      A,BC0,BC1,BC2,D0,D1,E
395      0,1  ,1  ,1  ,3 ,3 ,4
396    </programlisting>
397    <para>
398      Next, if <literal>BC2</literal> and <literal>D0</literal> form a
399      ligature, then their clusters (cluster values 1 and 3) merge into
400      <literal>min(1,3) = 1</literal>:
401    </para>
402    <programlisting>
403      A,BC0,BC1,BC2D0,D1,E
404      0,1  ,1  ,1    ,1 ,4
405    </programlisting>
406    <para>
407      Note that the entirety of cluster 3 merges into cluster 1, not
408      just the <literal>D0</literal> glyph. This reflects the fact
409      that the cluster <emphasis>must</emphasis> be treated as an
410      indivisible unit.
411    </para>
412    <para>
413      At this point, cluster 1 means: the character sequence
414      <literal>BCD</literal> is represented by glyphs
415      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
416      further.
417    </para>
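    <para>
      The whole walk-through can be replayed with a small Python
      model of the merge and decompose rules. This is a conceptual
      sketch, not HarfBuzz code: glyphs are modeled as
      (name, cluster value) pairs, and the helper names
      <literal>merge_clusters</literal>,
      <literal>form_ligature</literal>, and
      <literal>decompose</literal> are chosen here for illustration.
    </para>
    <programlisting><![CDATA[
```python
def merge_clusters(glyphs, start, end):
    """Merge the clusters of glyphs[start:end] into their minimum value.

    Every glyph that shares a cluster value with the merged range is
    updated too, so that whole clusters merge, not just single glyphs.
    """
    touched = {cluster for _, cluster in glyphs[start:end]}
    lowest = min(touched)
    return [(name, lowest if cluster in touched else cluster)
            for name, cluster in glyphs]


def form_ligature(glyphs, start, end, name):
    """Replace glyphs[start:end] with one ligature glyph."""
    merged = merge_clusters(glyphs, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(glyphs, index, part_names):
    """Replace one glyph with several; each part inherits its cluster."""
    _, cluster = glyphs[index]
    parts = [(part, cluster) for part in part_names]
    return glyphs[:index] + parts + glyphs[index + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = form_ligature(glyphs, 1, 3, "BC")            # B and C form a ligature
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])  # BC decomposes
glyphs = decompose(glyphs, 4, ["D0", "D1"])           # D decomposes
glyphs = form_ligature(glyphs, 3, 5, "BC2D0")         # BC2 and D0 form a ligature

print([name for name, _ in glyphs])
print([cluster for _, cluster in glyphs])  # [0, 1, 1, 1, 1, 4]
```
]]></programlisting>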
418  </section>
419  <section id="reordering-in-levels-0-and-1">
420    <title>Reordering in levels 0 and 1</title>
421    <para>
422      Another common operation in the more complex shapers is glyph
423      reordering. In order to maintain a monotonic cluster sequence
424      when glyph reordering takes place, HarfBuzz merges the clusters
425      of everything in the reordering sequence.
426    </para>
427    <para>
428      For example, let us again start with the character sequence (top
429      row) and initial cluster values (bottom row):
430    </para>
431    <programlisting>
432      A,B,C,D,E
433      0,1,2,3,4
434    </programlisting>
435    <para>
436      If <literal>D</literal> is reordered to the position immediately
437      before <literal>B</literal>, then HarfBuzz merges the
438      <literal>B</literal>, <literal>C</literal>, and
439      <literal>D</literal> clusters &mdash; all the clusters between
440      the final position of the reordered glyph and its original
441      position. This means that we get:
442    </para>
443    <programlisting>
444      A,D,B,C,E
445      0,1,1,1,4
446    </programlisting>
447    <para>
448      as the final cluster sequence.
449    </para>
450    <para>
451      Merging this many clusters is not ideal, but it is the only
452      sensible way for HarfBuzz to maintain the guarantee that the
453      sequence of cluster values remains monotonic and to retain the
454      true relationship between glyphs and characters.
455    </para>
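    <para>
      The reordering rule can be modeled in the same spirit. The
      Python sketch below merges every cluster between the original
      and final positions of the moved glyph before moving it. It
      is a conceptual model only; the function name
      <literal>reorder_with_merge</literal> and the
      (name, cluster value) representation are assumptions made for
      this example.
    </para>
    <programlisting><![CDATA[
```python
def reorder_with_merge(glyphs, source, target):
    """Move glyphs[source] so that it ends up at index target.

    All clusters spanned by the move are first merged into the
    minimum of their cluster values, which keeps the sequence monotonic.
    """
    low, high = min(source, target), max(source, target)
    touched = {cluster for _, cluster in glyphs[low:high + 1]}
    lowest = min(touched)
    merged = [(name, lowest if cluster in touched else cluster)
              for name, cluster in glyphs]
    moved = merged.pop(source)
    merged.insert(target, moved)
    return merged


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
result = reorder_with_merge(glyphs, 3, 1)  # D moves to just before B

print([name for name, _ in result])        # ['A', 'D', 'B', 'C', 'E']
print([cluster for _, cluster in result])  # [0, 1, 1, 1, 4]
```
]]></programlisting>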
456  </section>
457  <section id="the-distinction-between-levels-0-and-1">
458    <title>The distinction between levels 0 and 1</title>
459    <para>
460      The preceding examples demonstrate the main effects of using
461      cluster levels 0 and 1. The only difference between the two
462      levels is this: in level 0, at the very beginning of the shaping
463      process, HarfBuzz merges the cluster of each base character
464      with the clusters of all Unicode marks (combining or not) and
465      modifiers that follow it.
466    </para>
467    <para>
468      For example, let us start with the following character sequence
469      (top row) and accompanying initial cluster values (bottom row):
470    </para>
471    <programlisting>
472      A,acute,B
473      0,1    ,2
474    </programlisting>
475    <para>
476      The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
477      using cluster level 0 on this sequence, then the
478      <literal>A</literal> and <literal>acute</literal> clusters will
479      merge, and the result will become:
480    </para>
481    <programlisting>
482      A,acute,B
483      0,0    ,2
484    </programlisting>
485    <para>
486      This merger is performed before any other script-shaping
487      steps.
488    </para>
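    <para>
      As an illustration, this initial merging step could be
      modeled as follows. The sketch is a deliberate
      simplification: it treats only Unicode marks (general
      categories beginning with <literal>M</literal>) and the two
      zero-width joiner code points as attaching to the preceding
      code point, and it leaves out the modifier symbols and Emoji
      modifiers mentioned earlier. The function names are
      hypothetical.
    </para>
    <programlisting><![CDATA[
```python
import unicodedata

ZWNJ = "\u200c"  # Zero Width Non-Joiner
ZWJ = "\u200d"   # Zero Width Joiner


def attaches_to_previous(ch):
    """Simplified test: Unicode marks plus ZWJ and ZWNJ."""
    return unicodedata.category(ch).startswith("M") or ch in (ZWNJ, ZWJ)


def merge_marks_into_bases(text):
    """Return level 0 style initial cluster values for the text.

    Each code point starts with its index as its cluster value; marks
    then take over the value of the code point they follow.
    """
    clusters = []
    for index, ch in enumerate(text):
        if clusters and attaches_to_previous(ch):
            clusters.append(clusters[-1])
        else:
            clusters.append(index)
    return clusters


# "A", COMBINING ACUTE ACCENT (U+0301), "B"
print(merge_marks_into_bases("A\u0301B"))  # [0, 0, 2]
```
]]></programlisting>
    <para>
      Under level 1, the same input would simply keep the values
      0,1,2 at this stage.
    </para>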
489    <para>
490      This initial cluster merging is the default behavior of the
491      Windows shaping engine, and the old HarfBuzz codebase copied
492      that behavior to maintain compatibility. Consequently, it has
493      remained the default behavior in the new HarfBuzz codebase.
494    </para>
495    <para>
      However, this initial cluster-merging behavior makes it
      impossible for client programs to implement some features
      (such as coloring diacritic marks differently from their base
      characters). That is why, in level 1, HarfBuzz does not perform
      the initial merging step.
501    </para>
502    <para>
503      For client programs that rely on HarfBuzz cluster values to
504      perform cursor positioning, level 0 is more convenient. But
505      relying on cluster boundaries for cursor positioning is wrong: cursor
506      positions should be determined based on Unicode grapheme
507      boundaries, not on shaping-cluster boundaries. As such, using
508      level 1 clustering behavior is recommended.
509    </para>
510    <para>
511      One final facet of levels 0 and 1 is worth noting. HarfBuzz
512      currently does not allow any
513      <emphasis>multiple-substitution</emphasis> GSUB lookups to
514      replace a glyph with zero glyphs (in other words, to delete a
515      glyph).
516    </para>
517    <para>
518      But, in some other situations, glyphs can be deleted. In
519      those cases, if the glyph being deleted is the last glyph of its
520      cluster, HarfBuzz makes sure to merge the deleted glyph's
521      cluster with a neighboring cluster.
522    </para>
523    <para>
      This is done primarily to ensure that the first cluster in a
      run always carries the cluster value that points to the start
      of the text for that run; more than one client program
      currently relies on this guarantee.
528    </para>
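    <para>
      One way to picture this guarantee is the following Python
      sketch. It is not taken from the HarfBuzz sources: the choice
      of which neighbor to merge with is an assumption made for
      illustration, and the model only covers a left-to-right run
      with non-decreasing cluster values. The function name
      <literal>delete_glyph</literal> is hypothetical here.
    </para>
    <programlisting><![CDATA[
```python
def delete_glyph(glyphs, index):
    """Delete glyphs[index] without losing the start-of-run cluster value.

    Assumed policy (illustrative only):
      * if another glyph still carries the cluster value, or a preceding
        cluster exists to absorb the deleted characters, values are
        left unchanged;
      * if the deleted glyph was the first glyph in the run and alone in
        its cluster, the following cluster takes the minimum value.
    """
    _, cluster = glyphs[index]
    remaining = glyphs[:index] + glyphs[index + 1:]
    survives = any(value == cluster for _, value in remaining)
    if survives or index > 0 or not remaining:
        return remaining
    next_cluster = remaining[0][1]
    lowest = min(cluster, next_cluster)
    return [(name, lowest if value == next_cluster else value)
            for name, value in remaining]


glyphs = [("A", 0), ("B", 1), ("C", 2)]
print(delete_glyph(glyphs, 0))  # [('B', 0), ('C', 2)]
```
]]></programlisting>
    <para>
      In this model, the first remaining cluster still has the
      value 0, so it continues to point to the start of the text.
    </para>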
529    <para>
530      Incidentally, Apple's CoreText does something different to
531      maintain the same promise: it inserts a glyph with id 65535 at
532      the beginning of the glyph string if the glyph corresponding to
533      the first character in the run was deleted. HarfBuzz might do
534      something similar in the future.
535    </para>
536  </section>
537  <section id="level-2">
538    <title>Level 2</title>
539    <para>
540      HarfBuzz's level 2 cluster behavior uses a significantly
541      different model than that of level 0 and level 1.
542    </para>
543    <para>
544      The level 2 behavior is easy to describe, but it may be
545      difficult to understand in practical terms. In brief, level 2
546      performs no merging of clusters whatsoever.
547    </para>
548    <para>
549      This means that there is no initial base-and-mark merging step
550      (as is done in level 0), and it means that reordering moves and
551      ligature substitutions do not trigger a cluster merge.
552    </para>
553    <para>
554      Only one shaping operation directly affects clusters when using
555      level 2:
556    </para>
557    <itemizedlist>
558      <listitem>
559	<para>
560	  When a cluster <emphasis>decomposes</emphasis>, all of the
561	  resulting child clusters inherit as their cluster value the
562	  cluster value of the parent cluster.
563	</para>
564      </listitem>
565    </itemizedlist>
566    <para>
567      When glyphs do form a ligature (or when some other feature
      substitutes multiple glyphs with one glyph), the cluster value
569      of the first glyph is retained as the cluster value for the
570      resulting ligature.
571    </para>
572    <para>
573      This occurrence sounds similar to a cluster merge, but it is
574      different. In particular, no subsequent characters &mdash;
575      including marks and modifiers &mdash; are affected. They retain
576      their previous cluster values.
577    </para>
578    <para>
579      Level 2 cluster behavior is ultimately less complex than level 0
580      or level 1, but there are several cases for which processing
581      cluster values produced at level 2 may be tricky.
582    </para>
583    <section id="ligatures-with-combining-marks-in-level-2">
584      <title>Ligatures with combining marks in level 2</title>
585      <para>
586	The first example of how HarfBuzz's level 2 cluster behavior
587	can be tricky is when the text to be shaped includes combining
588	marks attached to ligatures.
589      </para>
590      <para>
591	Let us start with an input sequence with the following
592	characters (top row) and initial cluster values (bottom row):
593      </para>
594      <programlisting>
595	A,acute,B,breve,C,circumflex
596	0,1    ,2,3    ,4,5
597      </programlisting>
598      <para>
599	If the sequence <literal>A,B,C</literal> forms a ligature,
600	then these are the cluster values HarfBuzz will return under
601	the various cluster levels:
602      </para>
603      <para>
604	Level 0:
605      </para>
606      <programlisting>
607	ABC,acute,breve,circumflex
608	0  ,0    ,0    ,0
609      </programlisting>
610      <para>
611	Level 1:
612      </para>
613      <programlisting>
614	ABC,acute,breve,circumflex
615	0  ,0    ,0    ,5
616      </programlisting>
617      <para>
618	Level 2:
619      </para>
620      <programlisting>
621	ABC,acute,breve,circumflex
622	0  ,1    ,3    ,5
623      </programlisting>
624      <para>
625	Making sense of the level 2 result is the hardest for a client
626	program, because there is nothing in the cluster values that
627	indicates that <literal>B</literal> and <literal>C</literal>
628	formed a ligature with <literal>A</literal>.
629      </para>
630      <para>
631	In contrast, the "merged" cluster values of the mark glyphs
632	that are seen in the level 0 and level 1 output are evidence
633	that a ligature substitution took place.
634      </para>
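      <para>
	The three listings above can be reproduced with a small
	Python model. This is a sketch under stated assumptions, not
	HarfBuzz code: in levels 0 and 1 it merges every cluster from
	the first to the last ligature component (which is what
	pulls the intervening marks into the merged cluster), and in
	level 2 the ligature simply keeps the cluster value of its
	first component. All names in the sketch are hypothetical.
      </para>
      <programlisting><![CDATA[
```python
NAMES = ["A", "acute", "B", "breve", "C", "circumflex"]
IS_MARK = [False, True, False, True, False, True]
LIGATURE_COMPONENTS = [0, 2, 4]  # indices of A, B and C


def premerge_marks(clusters, is_mark):
    """Level 0 only: each mark takes the value of the item before it."""
    merged = []
    for value, mark in zip(clusters, is_mark):
        merged.append(merged[-1] if mark and merged else value)
    return merged


def ligate(names, clusters, components, level):
    """Replace the component glyphs with one ligature glyph."""
    first, last = components[0], components[-1]
    values = list(clusters)
    if level in (0, 1):
        # Merge every cluster spanned by the ligature into the minimum.
        touched = set(values[first:last + 1])
        lowest = min(touched)
        values = [lowest if v in touched else v for v in values]
    ligature = "".join(names[i] for i in components)
    glyphs = list(zip(names[:first], values[:first]))
    glyphs.append((ligature, values[first]))
    glyphs.extend((names[i], values[i])
                  for i in range(first + 1, len(names))
                  if i not in components)
    return glyphs


def clusters_after_ligature(level):
    clusters = list(range(len(NAMES)))
    if level == 0:
        clusters = premerge_marks(clusters, IS_MARK)
    shaped = ligate(NAMES, clusters, LIGATURE_COMPONENTS, level)
    return [value for _, value in shaped]


for level in (0, 1, 2):
    print(level, clusters_after_ligature(level))
```
]]></programlisting>
      <para>
	In this model, only the level 2 output keeps all four
	cluster values distinct, which is why it carries no trace of
	the ligature.
      </para>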
635    </section>
636    <section id="reordering-in-level-2">
637      <title>Reordering in level 2</title>
638      <para>
639	Another example of how HarfBuzz's level 2 cluster behavior
640	can be tricky is when glyphs reorder. Consider an input sequence
641	with the following characters (top row) and initial cluster
642	values (bottom row):
643      </para>
644      <programlisting>
645	A,B,C,D,E
646	0,1,2,3,4
647      </programlisting>
648      <para>
649	Now imagine <literal>D</literal> moves before
650	<literal>B</literal> in a reordering operation. The cluster
651	values will then be:
652      </para>
653      <programlisting>
654	A,D,B,C,E
655	0,3,1,2,4
656      </programlisting>
657      <para>
658	Next, if <literal>D</literal> forms a ligature with
659	<literal>B</literal>, the output is:
660      </para>
661      <programlisting>
662	A,DB,C,E
663	0,3 ,2,4
664      </programlisting>
665      <para>
666	However, in a different scenario, in which the shaping rules
667	of the script instead caused <literal>A</literal> and
668	<literal>B</literal> to form a ligature
669	<emphasis>before</emphasis> the <literal>D</literal> reordered, the
670	result would be:
671      </para>
672      <programlisting>
673	AB,D,C,E
674	0 ,3,2,4
675      </programlisting>
676      <para>
677	There is no way for a client program to differentiate between
678	these two scenarios based on the cluster values
679	alone. Consequently, client programs that use level 2 might
680	need to undertake additional work in order to manage cursor
681	positioning, text attributes, or other desired features.
682      </para>
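      <para>
	The ambiguity can be demonstrated with a short Python model
	of the two scenarios. As before, this is a conceptual sketch
	rather than HarfBuzz code; the helper names
	<literal>move</literal> and
	<literal>ligate_keep_first</literal> are hypothetical, and
	glyphs are modeled as (name, cluster value) pairs.
      </para>
      <programlisting><![CDATA[
```python
def move(glyphs, source, target):
    """Level 2 reordering: the glyph moves, cluster values are untouched."""
    reordered = list(glyphs)
    reordered.insert(target, reordered.pop(source))
    return reordered


def ligate_keep_first(glyphs, start, end):
    """Level 2 ligature: keep the cluster value of the first glyph."""
    name = "".join(n for n, _ in glyphs[start:end])
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


def cluster_values(glyphs):
    return [cluster for _, cluster in glyphs]


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Scenario 1: D moves before B, then D and B form a ligature.
first = ligate_keep_first(move(start, 3, 1), 1, 3)

# Scenario 2: A and B form a ligature, then D moves before C.
second = move(ligate_keep_first(start, 0, 2), 2, 1)

print([n for n, _ in first], cluster_values(first))
print([n for n, _ in second], cluster_values(second))
print(cluster_values(first) == cluster_values(second))  # True
```
]]></programlisting>
      <para>
	Both scenarios end with the cluster values 0,3,2,4, even
	though the glyphs they describe are different.
      </para>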
683    </section>
684    <section id="other-considerations-in-level-2">
685      <title>Other considerations in level 2</title>
686      <para>
687	There may be other problems encountered with ligatures under
	level 2, such as when the direction of the text is forced to
689	the opposite of its natural direction (for example, Arabic text
690	that is forced into left-to-right directionality). But,
691	generally speaking, these other scenarios are minor corner
692	cases that are too obscure for most client programs to need to
693	worry about.
694      </para>
695    </section>
696  </section>
697</chapter>
698