<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
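    <para>
      As a minimal sketch of this simple case, the following Python
      function places each glyph at the running horizontal position
      and then moves forward by that glyph's advance. The advance
      table, its values, and the function name are hypothetical
      stand-ins for metrics that would come from a real font; no
      HarfBuzz API is used here.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units; a real shaper reads
# these from the font's metrics tables.
TOY_ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(text, advances, default_advance=500):
    """Place each character at the current pen position, then advance."""
    x = 0
    positions = []
    for ch in text:
        positions.append((ch, x))
        # Move the pen forward by this glyph's advance width.
        x += advances.get(ch, default_advance)
    return positions, x


positions, total_width = layout_simple("Hi!", TOY_ADVANCES)
```
]]></programlisting>
    <para>
      The sketch treats one character as one glyph, which is only
      reasonable for the simplest scripts and fonts.
    </para>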
    <para>
      For <emphasis>complex scripts</emphasis>, however, any
      combination of several shaping operations may be required, and
      the rules for how and when they are applied vary from script to
      script. HarfBuzz and other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given complex script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given complex script might involve
          multiple contextual-substitution operations, each of which
          applies to different target glyphs and patterns and is
          performed in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          moves the horizontal and/or vertical position of a
          glyph. This positioning move is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some complex
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
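    <para>
      To make the reordering operation concrete, the following Python
      sketch moves a Devanagari left-side vowel sign in front of the
      consonant it follows in logical order. It is a deliberately
      simplified illustration: it works on codepoints as stand-ins
      for glyphs, the one-entry table is hand-picked, the function
      name is hypothetical, and a real shaping model moves the sign
      relative to a whole syllable rather than a single preceding
      codepoint.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hand-picked for illustration; a real shaper derives this set from
# Unicode property data rather than hard-coding it.
PRE_BASE_VOWEL_SIGNS = {"\u093f"}  # DEVANAGARI VOWEL SIGN I


def reorder_pre_base(text):
    """Swap each pre-base vowel sign with the codepoint before it."""
    out = list(text)
    for i in range(1, len(out)):
        if out[i] in PRE_BASE_VOWEL_SIGNS:
            # Logical order is consonant + sign; visual order is sign + consonant.
            out[i - 1], out[i] = out[i], out[i - 1]
    return "".join(out)


logical = "\u0915\u093f"  # KA followed by VOWEL SIGN I
visual = reorder_pre_base(logical)
```
]]></programlisting>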
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
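    <para>
      The General Category values described above can be inspected
      with Python's standard <literal>unicodedata</literal> module.
      The sample codepoints and the helper name
      <literal>is_mark</literal> are choices made for this
      illustration only.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata


def is_mark(ch):
    """True if the codepoint's Major category is Mark (Mn, Mc, or Me)."""
    return unicodedata.category(ch).startswith("M")


SAMPLES = [
    "A",       # LATIN CAPITAL LETTER A: Lu
    "\u02b0",  # MODIFIER LETTER SMALL H: Lm
    "\u0301",  # COMBINING ACUTE ACCENT: Mn
    "\u093e",  # DEVANAGARI VOWEL SIGN AA: Mc
    "\u20dd",  # COMBINING ENCLOSING CIRCLE: Me
]

for ch in SAMPLES:
    # The two-letter value is the Major category followed by the minor one.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```
]]></programlisting>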
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties, which provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these scripts
      is specified by regular expressions over the Letter and Mark
      codepoints that take the UISC and UIPC properties into account.
    </para>
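    <para>
      The idea of a syllable grammar can be sketched by mapping each
      codepoint to a one-letter class and running a regular
      expression over the resulting class string. Everything in the
      following Python example is an assumption made for
      illustration: the four-entry class table is assigned by hand
      rather than read from the UISC and UIPC data, and the pattern
      is a toy, not the grammar of any actual shaping model.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re

# Toy classes assigned by hand: C = consonant, H = virama, M = vowel sign.
TOY_CLASSES = {
    "\u0915": "C",  # DEVANAGARI LETTER KA
    "\u0937": "C",  # DEVANAGARI LETTER SSA
    "\u094d": "H",  # DEVANAGARI SIGN VIRAMA
    "\u093f": "M",  # DEVANAGARI VOWEL SIGN I
}

# Toy grammar: zero or more consonant+virama pairs, a final consonant,
# and an optional vowel sign.
SYLLABLE = re.compile(r"(?:CH)*CM?")


def split_syllables(text):
    """Split text into toy syllables; unmatched codepoints stand alone."""
    # One class letter per codepoint, so indices line up with the text.
    classes = "".join(TOY_CLASSES.get(ch, "X") for ch in text)
    syllables = []
    pos = 0
    while pos < len(text):
        match = SYLLABLE.match(classes, pos)
        end = match.end() if match else pos + 1
        syllables.append(text[pos:end])
        pos = end
    return syllables


# KA + VIRAMA + SSA + VOWEL SIGN I, followed by a lone KA.
parts = split_syllables("\u0915\u094d\u0937\u093f\u0915")
```
]]></programlisting>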

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
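    <para>
      A rough Python sketch of segmenting text into runs by script
      follows. Python's standard library does not expose the Unicode
      Script property, so this example guesses a script from the
      first word of each letter's character name, which is only a
      heuristic. Non-letters (spaces, punctuation, marks) inherit the
      script of the preceding letter. The function names are
      hypothetical, and direction and language are left out entirely.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata
from itertools import groupby


def toy_script(ch):
    """Guess a script from the character name; None for non-letters."""
    if unicodedata.category(ch).startswith("L"):
        # For example "LATIN SMALL LETTER A" yields "LATIN". Heuristic only.
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None


def split_runs(text):
    """Return (script, substring) pairs with one script per run."""
    resolved = []
    current = None
    for ch in text:
        # Non-letters take the script of the run they follow.
        current = toy_script(ch) or current
        resolved.append(current)
    runs = []
    pos = 0
    for script, group in groupby(resolved):
        length = len(list(group))
        runs.append((script, text[pos:pos + length]))
        pos += length
    return runs


runs = split_runs("abc \u05e9\u05dc\u05d5\u05dd")
```
]]></programlisting>
    <para>
      A production segmenter would use real Script property data and
      would also split on changes in direction, language, and font.
    </para>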
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides the following shaping models:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          non-complex scripts, and may also be used as a fallback for
          handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, Telugu, and Sinhala.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports complex scripts not covered by one of
          the above, script-specific shaping models, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
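    <para>
      The general idea of a finite-state machine that matches a
      glyph sequence and triggers an operation can be sketched in a
      few lines of Python. The two-state machine below replaces the
      glyph pair "f", "i" with a single ligature glyph. The glyph
      names, the states, and the function name are hypothetical, and
      the code does not represent the actual table formats that AAT
      or Graphite fonts use.
    </para>
    <programlisting language="python"><![CDATA[
```python
def apply_ligature_fsm(glyphs):
    """Replace each "f", "i" glyph pair with a single "f_i" glyph."""
    START, SAW_F = 0, 1
    state = START
    out = []
    for glyph in glyphs:
        if state == SAW_F and glyph == "i":
            # The pattern matched: substitute the ligature for the pending "f".
            out[-1] = "f_i"
            state = START
        else:
            out.append(glyph)
            # Remember whether this glyph could begin the pattern.
            state = SAW_F if glyph == "f" else START
    return out


shaped = apply_ligature_fsm(["f", "i", "f", "f", "i"])
```
]]></programlisting>
    <para>
      Note that the machine operates on glyph names rather than on
      Unicode codepoints, mirroring the way AAT rules are expressed.
    </para>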
  </section>
</chapter>