1<?xml version="1.0"?>
2<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
3               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
4  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
5  <!ENTITY version SYSTEM "version.xml">
6]>
7<chapter id="clusters">
8  <title>Clusters</title>
9  <section id="clusters">
10    <title>Clusters</title>
11    <para>
12      In text shaping, a <emphasis>cluster</emphasis> is a sequence of
13      characters that needs to be treated as a single, indivisible
14      unit.
15    </para>
16    <para>
17      A cluster is distinct from a <emphasis>grapheme</emphasis>,
18      which is the smallest unit of a writing system or script,
19      because clusters are only relevant for script shaping and the
20      layout of glyphs.
21    </para>
22    <para>
23      For example, a grapheme may be a letter, a number, a logogram,
24      or a symbol. When two letters form a ligature, however, they
25      combine into a single glyph. They are therefore part of the same
26      cluster and are treated as a unit &mdash; even though the two
27      original, underlying letters are separate graphemes.
28    </para>
29    <para>
30      During the shaping process, there are several shaping operations
31      that may merge adjacent characters (for example, when two code
32      points form a ligature or a conjunct form and are replaced by a
33      single glyph) or split one character into several (for example,
34      when decomposing a code point through the
35      <literal>ccmp</literal> feature).
36    </para>
37    <para>
38      HarfBuzz tracks clusters independently from how these
39      shaping operations affect the individual glyphs that comprise the
40      output HarfBuzz returns in a buffer. Consequently,
41      a client program using HarfBuzz can utilize the cluster
42      information to implement features such as:
43    </para>
44    <itemizedlist>
45      <listitem>
46	<para>
47	  Correctly positioning the cursor within a shaped text run,
48	  even when characters have formed ligatures, composed or
49	  decomposed, reordered, or undergone other shaping operations.
50	</para>
51      </listitem>
52      <listitem>
53	<para>
54	  Correctly highlighting a text selection that includes some,
55	  but not all, of the characters in a word.
56	</para>
57      </listitem>
58      <listitem>
59	<para>
60	  Applying text attributes (such as color or underlining) to
61	  part, but not all, of a word.
62	</para>
63      </listitem>
64      <listitem>
65	<para>
66	  Generating output document formats (such as PDF) with
67	  embedded text that can be fully extracted.
68	</para>
69      </listitem>
70      <listitem>
71	<para>
72	  Determining the mapping between input characters and output
73	  glyphs, such as which glyphs are ligatures.
74	</para>
75      </listitem>
76      <listitem>
77	<para>
78	  Performing line-breaking, justification, and other
79	  line-level or paragraph-level operations that must be done
80	  after shaping is complete, but which require character-level
81	  properties.
82	</para>
83      </listitem>
84    </itemizedlist>
85    <para>
86      When you add text to a HarfBuzz buffer, each code point must be
87      assigned a <emphasis>cluster value</emphasis>.
88    </para>
89    <para>
90      This cluster value is an arbitrary number; HarfBuzz uses it only
91      to distinguish between clusters. Many client programs will use
92      the index of each code point in the input text stream as the
93      cluster value. This is for the sake of convenience; the actual
94      value does not matter.
95    </para>
96    <para>
97      Client programs can choose how HarfBuzz handles clusters during
98      shaping by setting the
99      <literal>cluster_level</literal> of the
100      buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
101      clustering support for this property:
102    </para>
103    <itemizedlist>
104      <listitem>
105	<para><emphasis>Level 0</emphasis> is the default and
106	reproduces the behavior of the old HarfBuzz library.
107	</para>
108	<para>
109	  The distinguishing feature of level 0 behavior is that, at
110	  the beginning of processing the buffer, all code points that
111	  are categorized as <emphasis>marks</emphasis>,
112	  <emphasis>modifier symbols</emphasis>, or
113	  <emphasis>Emoji extended pictographic</emphasis> modifiers,
114	  as well as the <emphasis>Zero Width Joiner</emphasis> and
115	  <emphasis>Zero Width Non-Joiner</emphasis> code points, are
116	  assigned the cluster value of the closest preceding code
117	  point from <emphasis>different</emphasis> category.
118	</para>
119	<para>
120	  In essence, whenever a base character is followed by a mark
121	  character or a sequence of mark characters, those marks are
122	  reassigned to the same initial cluster value as the base
123	  character. This reassignment is referred to as
124	  "merging" the affected clusters. This behavior is based on
125	  the Grapheme Cluster Boundary specification in <ulink
126	  url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
127	  Technical Report 29</ulink>.
128	</para>
129	<para>
130	  Client programs can specify level 0 behavior for a buffer by
131	  setting its <literal>cluster_level</literal> to
132	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
133	</para>
134      </listitem>
135      <listitem>
136	<para>
137	  <emphasis>Level 1</emphasis> tweaks the old behavior
138	  slightly to produce better results. Therefore, level 1
139	  clustering is recommended for code that is not required to
140	  implement backward compatibility with the old HarfBuzz.
141	</para>
142	<para>
143	  Level 1 differs from level 0 by not merging the
144	  clusters of marks and other modifier code points with the
145	  preceding "base" code point's cluster. By preserving the
146	  separate cluster values of these marks and modifier code
147	  points, script shapers can perform additional operations
148	  that might lead to improved results (for example, reordering
149	  a sequence of marks).
150	</para>
151	<para>
152	  Client programs can specify level 1 behavior for a buffer by
153	  setting its <literal>cluster_level</literal> to
154	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
155	</para>
156      </listitem>
157      <listitem>
158	<para>
159	  <emphasis>Level 2</emphasis> differs significantly in how it
160	  treats cluster values. In level 2, HarfBuzz never merges
161	  clusters.
162	</para>
163	<para>
164	  This difference can be seen most clearly when HarfBuzz processes
165	  ligature substitutions and glyph decompositions. In level 0
166	  and level 1, ligatures and glyph decomposition both involve
167	  merging clusters; in level 2, neither of these operations
168	  triggers a merge.
169	</para>
170	<para>
171	  Client programs can specify level 2 behavior for a buffer by
172	  setting its <literal>cluster_level</literal> to
173	  <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
174	</para>
175      </listitem>
176    </itemizedlist>
177    <para>
178      As mentioned earlier, client programs using HarfBuzz often
179      assign initial cluster values in a buffer by reusing the indices
180      of the code points in the input text. This gives a sequence of
181      cluster values that is monotonically increasing (for example,
182      0,1,2,3,4,5).
183    </para>
184    <para>
185      It is not <emphasis>required</emphasis> that the cluster values
186      in a buffer be monotonically increasing. However, if the initial
187      cluster values in a buffer are monotonic and the buffer is
188      configured to use cluster level 0 or 1, then HarfBuzz
189      guarantees that the final cluster values in the shaped buffer
190      will also be monotonic. No such guarantee is made for cluster
191      level 2.
192    </para>
193    <para>
194      In levels 0 and 1, HarfBuzz implements the following conceptual
195      model for cluster values:
196    </para>
197    <itemizedlist spacing="compact">
198      <listitem>
199	<para>
200          If the sequence of input cluster values is monotonic, the
201	  sequence of cluster values will remain monotonic.
202	</para>
203      </listitem>
204      <listitem>
205	<para>
206          Each cluster value represents a single cluster.
207	</para>
208      </listitem>
209      <listitem>
210	<para>
211          Each cluster contains one or more glyphs and one or more
212          characters.
213	</para>
214      </listitem>
215    </itemizedlist>
216    <para>
217      In practice, this model offers several benefits. Assuming that
218      the initial cluster values were monotonically increasing
219      and distinct before shaping began, then, in the final output:
220    </para>
221    <itemizedlist spacing="compact">
222      <listitem>
223	<para>
224	  All adjacent glyphs having the same final cluster
225	  value belong to the same cluster.
226	</para>
227      </listitem>
228      <listitem>
229	<para>
230          Each character belongs to the cluster that has the highest
231	  cluster value <emphasis>not larger than</emphasis> its
232	  initial cluster value.
233	</para>
234      </listitem>
235    </itemizedlist>
236
237  </section>
238  <section id="a-clustering-example-for-levels-0-and-1">
239    <title>A clustering example for levels 0 and 1</title>
240    <para>
241      The guarantees and benefits of level 0 and level 1 can be seen
242      with some examples. First, let us examine what happens with cluster
243      values when shaping involves cluster merging with ligatures and
244      decomposition.
245    </para>
246    <para>
247      Let's say we start with the following character sequence (top row) and
248      initial cluster values (bottom row):
249    </para>
250    <programlisting>
251      A,B,C,D,E
252      0,1,2,3,4
253    </programlisting>
254    <para>
255      During shaping, HarfBuzz maps these characters to glyphs from
256      the font. For simplicity, let us assume that each character maps
257      to the corresponding, identical-looking glyph:
258    </para>
259    <programlisting>
260      A,B,C,D,E
261      0,1,2,3,4
262    </programlisting>
263    <para>
264      Now if, for example, <literal>B</literal> and <literal>C</literal>
265      form a ligature, then the clusters to which they belong
266      &quot;merge&quot;. This merged cluster takes for its cluster
267      value the minimum of all the cluster values of the clusters that
268      went in to the ligature. In this case, we get:
269    </para>
270    <programlisting>
271      A,BC,D,E
272      0,1 ,3,4
273    </programlisting>
274    <para>
275      because 1 is the minimum of the set {1,2}, which were the
276      cluster values of <literal>B</literal> and
277      <literal>C</literal>.
278    </para>
279    <para>
280      Next, let us say that the <literal>BC</literal> ligature glyph
281      decomposes into three components, and <literal>D</literal> also
282      decomposes into two components. These components each inherit the
283      cluster value of their parent:
284    </para>
285    <programlisting>
286      A,BC0,BC1,BC2,D0,D1,E
287      0,1  ,1  ,1  ,3 ,3 ,4
288    </programlisting>
289    <para>
290      Next, if <literal>BC2</literal> and <literal>D0</literal> form a
291      ligature, then their clusters (cluster values 1 and 3) merge into
292      <literal>min(1,3) = 1</literal>:
293    </para>
294    <programlisting>
295      A,BC0,BC1,BC2D0,D1,E
296      0,1  ,1  ,1    ,1 ,4
297    </programlisting>
298    <para>
299      At this point, cluster 1 means: the character sequence
300      <literal>BCD</literal> is represented by glyphs
301      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
302      further.
303    </para>
304  </section>
305  <section id="reordering-in-levels-0-and-1">
306    <title>Reordering in levels 0 and 1</title>
307    <para>
308      Another common operation in the more complex shapers is glyph
309      reordering. In order to maintain a monotonic cluster sequence
310      when glyph reordering takes place, HarfBuzz merges the clusters
311      of everything in the reordering sequence.
312    </para>
313    <para>
314      For example, let us again start with the character sequence (top
315      row) and initial cluster values (bottom row):
316    </para>
317    <programlisting>
318      A,B,C,D,E
319      0,1,2,3,4
320    </programlisting>
321    <para>
322      If <literal>D</literal> is reordered to before <literal>B</literal>,
323      then HarfBuzz merges the <literal>B</literal>,
324      <literal>C</literal>, and <literal>D</literal> clusters, and we
325      get:
326    </para>
327    <programlisting>
328      A,D,B,C,E
329      0,1,1,1,4
330    </programlisting>
331    <para>
332      This is clearly not ideal, but it is the only sensible way to
333      maintain a monotonic sequence of cluster values and retain the
334      true relationship between glyphs and characters.
335    </para>
336  </section>
337  <section id="the-distinction-between-levels-0-and-1">
338    <title>The distinction between levels 0 and 1</title>
339    <para>
340      The preceding examples demonstrate the main effects of using
341      cluster levels 0 and 1. The only difference between the two
342      levels is this: in level 0, at the very beginning of the shaping
343      process, HarfBuzz also merges clusters between any base character
344      and all Unicode marks (combining or not) that follow it.
345    </para>
346    <para>
347      For example, let us start with the following character sequence
348      (top row) and accompanying initial cluster values (bottom row):
349    </para>
350    <programlisting>
351      A,acute,B
352      0,1    ,2
353    </programlisting>
354    <para>
355      The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
356      using cluster level 0 on this sequence, then the
357      <literal>A</literal> and <literal>acute</literal> clusters will
358      merge, and the result will become:
359    </para>
360    <programlisting>
361      A,acute,B
362      0,0    ,2
363    </programlisting>
364    <para>
365      This initial cluster merging is the default behavior of the
366      Windows shaping engine, and the old HarfBuzz codebase copied
367      that behavior to maintain compatibility. Consequently, it has
368      remained the default behavior in the new HarfBuzz codebase.
369    </para>
370    <para>
371      But this initial cluster-merging behavior makes it impossible to
372      color diacritic marks differently from their base
373      characters. That is why, in level 1, HarfBuzz does not perform
374      the initial merging step.
375    </para>
376    <para>
377      For client programs that rely on HarfBuzz cluster values to
378      perform cursor positioning, level 0 is more convenient. But
379      relying on cluster boundaries for cursor positioning is wrong: cursor
380      positions should be determined based on Unicode grapheme
381      boundaries, not on shaping-cluster boundaries. As such, level 1
382      clusters are preferred.
383    </para>
384    <para>
385      One last note about levels 0 and 1. HarfBuzz currently does not allow a
386      <literal>MultipleSubst</literal> lookup to replace a glyph with zero
387      glyphs (in other words, to delete a glyph). But, in some other situations,
388      glyphs can be deleted. In those cases, if the glyph being deleted is
389      the last glyph of its cluster, HarfBuzz makes sure to merge the cluster
390      with a neighboring cluster.
391    </para>
392    <para>
393      This is done primarily to make sure that the starting cluster of the
394      text always has the cluster index pointing to the start of the text
395      for the run; more than one client currently relies on this
396      guarantee.
397    </para>
398    <para>
399      Incidentally, Apple's CoreText does something else to maintain the
400      same promise: it inserts a glyph with id 65535 at the beginning of
401      the glyph string if the glyph corresponding to the first character
402      in the run was deleted. HarfBuzz might do something similar in the
403      future.
404    </para>
405  </section>
406  <section id="level-2">
407    <title>Level 2</title>
408    <para>
409      HarfBuzz's level 2 cluster behavior uses a significantly
410      different model than that of level 0 and level 1.
411    </para>
412    <para>
413      The level 2 behavior is easy to describe, but it may be
414      difficult to understand in practical terms. In brief, level 2
415      performs no merging of clusters whatsoever.
416    </para>
417    <para>
418      When glyphs form a ligature (or when some other feature
419      substitutes multiple glyphs with one glyph), the cluster value
420      of the first glyph is retained as the cluster value for the
421      ligature. However, no subsequent clusters &mdash; including
422      marks and modifiers &mdash; are affected.
423    </para>
424    <para>
425      Level 2 cluster behavior is less complex than level 0 or level
426      1, but there are a few cases in which processing cluster values
427      produced at level 2 may be tricky.
428    </para>
429    <section id="ligatures-with-combining-marks-in-level-2">
430      <title>Ligatures with combining marks in level 2</title>
431      <para>
432	The first example of how HarfBuzz's level 2 cluster behavior
433	can be tricky is when the text to be shaped includes combining
434	marks attached to ligatures.
435      </para>
436      <para>
437	Let us start with an input sequence with the following
438	characters (top row) and initial cluster values (bottom row):
439      </para>
440      <programlisting>
441	A,acute,B,breve,C,circumflex
442	0,1    ,2,3    ,4,5
443      </programlisting>
444      <para>
445	If the sequence <literal>A,B,C</literal> forms a ligature,
446	then these are the cluster values HarfBuzz will return under
447	the various cluster levels:
448      </para>
449      <para>
450	Level 0:
451      </para>
452      <programlisting>
453	ABC,acute,breve,circumflex
454	0  ,0    ,0    ,0
455      </programlisting>
456      <para>
457	Level 1:
458      </para>
459      <programlisting>
460	ABC,acute,breve,circumflex
461	0  ,0    ,0    ,5
462      </programlisting>
463      <para>
464	Level 2:
465      </para>
466      <programlisting>
467	ABC,acute,breve,circumflex
468	0  ,1    ,3    ,5
469      </programlisting>
470      <para>
471	Making sense of the level 2 result is the hardest for a client
472	program, because there is nothing in the cluster values that
473	indicates that <literal>B</literal> and <literal>C</literal>
474	formed a ligature with <literal>A</literal>.
475      </para>
476      <para>
477	In contrast, the "merged" cluster values of the mark glyphs
478	that are seen in the level 0 and level 1 output are evidence
479	that a ligature substitution took place.
480      </para>
481    </section>
482    <section id="reordering-in-level-2">
483      <title>Reordering in level 2</title>
484      <para>
485	Another example of how HarfBuzz's level 2 cluster behavior
486	can be tricky is when glyphs reorder. Consider an input sequence
487	with the following characters (top row) and initial cluster
488	values (bottom row):
489      </para>
490      <programlisting>
491	A,B,C,D,E
492	0,1,2,3,4
493      </programlisting>
494      <para>
495	Now imagine <literal>D</literal> moves before
496	<literal>B</literal> in a reordering operation. The cluster
497	values will then be:
498      </para>
499      <programlisting>
500	A,D,B,C,E
501	0,3,1,2,4
502      </programlisting>
503      <para>
504	Next, if <literal>D</literal> forms a ligature with
505	<literal>B</literal>, the output is:
506      </para>
507      <programlisting>
508	A,DB,C,E
509	0,3 ,2,4
510      </programlisting>
511      <para>
512	However, in a different scenario, in which the shaping rules
513	of the script instead caused <literal>A</literal> and
514	<literal>B</literal> to form a ligature
515	<emphasis>before</emphasis> the <literal>D</literal> reordered, the
516	result would be:
517      </para>
518      <programlisting>
519	AB,D,C,E
520	0 ,3,2,4
521      </programlisting>
522      <para>
523	There is no way for a client program to differentiate between
524	these two scenarios based on the cluster values
525	alone. Consequently, client programs that use level 2 might
526	need to undertake additional work in order to manage cursor
527	positioning, text attributes, or other desired features.
528      </para>
529    </section>
530    <section id="other-considerations-in-level-2">
531      <title>Other considerations in level 2</title>
532      <para>
533	There may be other problems encountered with ligatures under
534	level 2, such as if the direction of the text is forced to
535	opposite of its natural direction (for example, left-to-right
536	Arabic). But, generally speaking, these other scenarios are
537	minor corner cases that are too obscure for most client
538	programs to need to worry about.
539      </para>
540    </section>
541  </section>
542</chapter>
543