1<?xml version="1.0"?>
2<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
3               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
4  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
5  <!ENTITY version SYSTEM "version.xml">
6]>
7<chapter id="buffers-language-script-and-direction">
8  <title>Buffers, language, script and direction</title>
9  <para>
10    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
11    buffer. In this chapter, we'll look at how to set up a buffer with
12    the text that we want and how to customize the properties of the
13    buffer. We'll also look at a piece of lower-level machinery that
14    you will need to understand before proceeding: the functions that
15    HarfBuzz uses to retrieve Unicode information.
16  </para>
17  <para>
18    After shaping is complete, HarfBuzz puts its output back
19    into the buffer. But getting that output requires setting up a
20    face and a font first, so we will look at that in the next chapter
21    instead of here.
22  </para>
23  <section id="creating-and-destroying-buffers">
24    <title>Creating and destroying buffers</title>
25    <para>
26      As we saw in our <emphasis>Getting Started</emphasis> example, a
27      buffer is created and
28      initialized with <function>hb_buffer_create()</function>. This
29      produces a new, empty buffer object, instantiated with some
30      default values and ready to accept your Unicode strings.
31    </para>
32    <para>
33      HarfBuzz manages the memory of objects (such as buffers) that it
34      creates, so you don't have to. When you have finished working on
35      a buffer, you can call <function>hb_buffer_destroy()</function>:
36    </para>
37    <programlisting language="C">
38      hb_buffer_t *buf = hb_buffer_create();
39      ...
40      hb_buffer_destroy(buf);
41    </programlisting>
42    <para>
      This will destroy the object and free its associated memory,
      unless some other part of the program holds a reference to this
      buffer. If you acquire a HarfBuzz buffer from another subsystem
      and want to ensure that it is not freed when someone else
      destroys it, you should increase its reference count:
48    </para>
49    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
53    </programlisting>
54    <para>
55      And then decrease it once you're done with it:
56    </para>
57    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
60    </programlisting>
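    <para>
      The reference-counting pattern behind these calls can be
      illustrated with a minimal sketch. The
      <literal>toy_buffer_t</literal> type and its three functions
      are hypothetical stand-ins written for this example; they are
      not HarfBuzz's actual implementation, and the sketch makes no
      attempt at thread safety:
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted object.
 * This is an illustration of the pattern only. */
typedef struct toy_buffer {
  unsigned int ref_count;
  int *freed_flag;  /* optional: set to 1 when memory is released */
} toy_buffer_t;

/* Create an object that starts with a single reference. */
toy_buffer_t *
toy_buffer_create (int *freed_flag)
{
  toy_buffer_t *buf = malloc (sizeof *buf);
  if (!buf)
    return NULL;
  buf->ref_count = 1;
  buf->freed_flag = freed_flag;
  if (freed_flag)
    *freed_flag = 0;
  return buf;
}

/* Take an additional reference and return the same object. */
toy_buffer_t *
toy_buffer_reference (toy_buffer_t *buf)
{
  assert (buf != NULL && buf->ref_count > 0);
  buf->ref_count++;
  return buf;
}

/* Drop one reference; release the memory when none remain. */
void
toy_buffer_destroy (toy_buffer_t *buf)
{
  if (!buf)
    return;
  assert (buf->ref_count > 0);
  if (--buf->ref_count == 0)
  {
    if (buf->freed_flag)
      *buf->freed_flag = 1;
    free (buf);
  }
}
```
]]></programlisting>
    <para>
      In this sketch, creation sets the count to one, each reference
      call adds one, and each destroy call removes one. Memory is
      released only by the destroy call that brings the count to zero.
    </para>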
61    <para>
62      While we are on the subject of reference-counting buffers, it is
63      worth noting that an individual buffer can only meaningfully be
64      used by one thread at a time.
65    </para>
66    <para>
67      To throw away all the data in your buffer and start from scratch,
68      call <function>hb_buffer_reset(buf)</function>. If you want to
69      throw away the string in the buffer but keep the options, you can
70      instead call <function>hb_buffer_clear_contents(buf)</function>.
71    </para>
72  </section>
73
74  <section id="adding-text-to-the-buffer">
75    <title>Adding text to the buffer</title>
76    <para>
77      Now we have a brand new HarfBuzz buffer. Let's start filling it
78      with text! From HarfBuzz's perspective, a buffer is just a stream
79      of Unicode code points, but your input string is probably in one of
80      the standard Unicode character encodings (UTF-8, UTF-16, or
81      UTF-32). HarfBuzz provides convenience functions that accept
82      each of these encodings:
83      <function>hb_buffer_add_utf8()</function>,
84      <function>hb_buffer_add_utf16()</function>, and
85      <function>hb_buffer_add_utf32()</function>. Other than the
86      character encoding they accept, they function identically.
87    </para>
88    <para>
89      You can add UTF-8 text to a buffer by passing in the text array,
90      the array's length, an offset into the array for the first
91      character to add, and the length of the segment to add:
92    </para>
93    <programlisting language="C">
      void
      hb_buffer_add_utf8 (hb_buffer_t  *buf,
                          const char   *text,
                          int           text_length,
                          unsigned int  item_offset,
                          int           item_length);
99    </programlisting>
100    <para>
101      So, in practice, you can say:
102    </para>
103    <programlisting language="C">
104      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
105    </programlisting>
106    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is NUL-terminated. Similarly, you can use
      <literal>-1</literal> as the final argument if you want to add
      everything from the offset through the end of the array.
114    </para>
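    <para>
      How the two <literal>-1</literal> shortcuts resolve can be
      sketched as follows. The helper
      <function>resolve_lengths()</function> and its result type are
      hypothetical names used only for this illustration. The sketch
      assumes that a <literal>-1</literal> text length means
      "measure up to the terminating NUL" and that a
      <literal>-1</literal> item length means "everything from the
      offset to the end of the text":
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef struct {
  int text_length;  /* length of the whole array, in bytes */
  int item_length;  /* length of the segment to add, in bytes */
} resolved_lengths_t;

resolved_lengths_t
resolve_lengths (const char *text, int text_length,
                 unsigned int item_offset, int item_length)
{
  resolved_lengths_t r;

  /* -1 as text_length: the array is NUL-terminated, so measure it. */
  r.text_length = (text_length == -1) ? (int) strlen (text) : text_length;
  assert (item_offset <= (unsigned int) r.text_length);

  /* -1 as item_length: take everything from item_offset to the end. */
  r.item_length = (item_length == -1)
                    ? r.text_length - (int) item_offset
                    : item_length;
  return r;
}
```
]]></programlisting>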
115    <para>
      Whatever <parameter>item_offset</parameter> and
117      <parameter>item_length</parameter> you provide, HarfBuzz will also
118      attempt to grab the five characters <emphasis>before</emphasis>
119      the offset point and the five characters
120      <emphasis>after</emphasis> the designated end. These are the
121      before and after "context" segments, which are used internally
122      for HarfBuzz to make shaping decisions. They will not be part of
123      the final output, but they ensure that HarfBuzz's
124      script-specific shaping operations are correct. If there are
125      fewer than five characters available for the before or after
126      contexts, HarfBuzz will just grab what is there.
127    </para>
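    <para>
      The context rule described above can be expressed as a small
      calculation. This is a sketch only: the
      <literal>span_t</literal> type and the two helper functions are
      hypothetical, and for simplicity every index counts code points
      in an already-decoded array rather than bytes of UTF-8:
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define CONTEXT_LENGTH 5  /* context size described in the text */

/* Hypothetical half-open range: [start, start + length). */
typedef struct {
  size_t start;
  size_t length;
} span_t;

/* Code points immediately before the item, at most CONTEXT_LENGTH. */
span_t
pre_context (size_t text_length, size_t item_offset)
{
  span_t s;
  assert (item_offset <= text_length);
  s.length = item_offset < CONTEXT_LENGTH ? item_offset : CONTEXT_LENGTH;
  s.start = item_offset - s.length;
  return s;
}

/* Code points immediately after the item, at most CONTEXT_LENGTH. */
span_t
post_context (size_t text_length, size_t item_offset, size_t item_length)
{
  span_t s;
  size_t end, available;
  assert (item_offset <= text_length);
  assert (item_length <= text_length - item_offset);
  end = item_offset + item_length;
  available = text_length - end;
  s.start = end;
  s.length = available < CONTEXT_LENGTH ? available : CONTEXT_LENGTH;
  return s;
}
```
]]></programlisting>
    <para>
      When fewer than five code points are available on either side,
      the corresponding span simply shrinks to what is there.
    </para>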
128    <para>
129      For longer text runs, such as full paragraphs, it might be
130      tempting to only add smaller sub-segments to a buffer and
131      shape them in piecemeal fashion. Generally, this is not a good
132      idea, however, because a lot of shaping decisions are
133      dependent on this context information. For example, in Arabic
134      and other connected scripts, HarfBuzz needs to know the code
135      points before and after each character in order to correctly
136      determine which glyph to return.
137    </para>
138    <para>
139      The safest approach is to add all of the text available, then
140      use <parameter>item_offset</parameter> and
141      <parameter>item_length</parameter> to indicate which characters you
142      want shaped, so that HarfBuzz has access to any context.
143    </para>
144    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      encodings. It is particularly important to note, however, that
      <function>hb_buffer_add_codepoints()</function> does not check
      the validity of the text it is given. The UTF functions replace
      invalid input with the replacement code point, but here it is up
      to you to ensure that every value is a valid Unicode code point.
152    </para>
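    <para>
      If you add raw code points yourself, a simple pre-check might
      look like the following sketch. The function names are
      hypothetical, and the check covers only the basics: values
      beyond <literal>U+10FFFF</literal> and surrogate code points
      are replaced with <literal>U+FFFD</literal>. Any deeper
      validation is left to your application:
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CODEPOINT 0xFFFDu

/* Nonzero if cp is a Unicode scalar value: in range, not a surrogate. */
int
is_unicode_scalar (uint32_t cp)
{
  return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

/* Replace invalid entries in place; return how many were replaced. */
size_t
sanitize_codepoints (uint32_t *text, size_t length, uint32_t replacement)
{
  size_t i, replaced = 0;
  assert (text != NULL || length == 0);
  for (i = 0; i < length; i++)
  {
    if (!is_unicode_scalar (text[i]))
    {
      text[i] = replacement;
      replaced++;
    }
  }
  return replaced;
}
```
]]></programlisting>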
153
154  </section>
155
156  <section id="setting-buffer-properties">
157    <title>Setting buffer properties</title>
158    <para>
159      Buffers containing input characters still need several
160      properties set before HarfBuzz can shape their text correctly.
161    </para>
162    <para>
163      Initially, all buffers are set to the
164      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
165      type. After adding text, the buffer should be set to
166      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
167      indicates that it contains un-shaped input
168      characters. After shaping, the buffer will have the
169      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
170    </para>
171    <para>
172      <function>hb_buffer_add_utf8()</function> and the
173      other UTF functions set the content type of their buffer
174      automatically. But if you are reusing a buffer you may want to
175      check its state with
176      <function>hb_buffer_get_content_type(buffer)</function>. If
177      necessary you can set the content type with
178    </para>
179    <programlisting language="C">
180      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
181    </programlisting>
182    <para>
183      to prepare for shaping.
184    </para>
185    <para>
186      Buffers also need to carry information about the script,
187      language, and text direction of their contents. You can set
188      these properties individually:
189    </para>
190    <programlisting language="C">
191      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
192      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
193      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
194    </programlisting>
195    <para>
      However, since these properties are often repeated for
197      multiple text runs, you can also save them in a
198      <literal>hb_segment_properties_t</literal> for reuse:
199    </para>
200    <programlisting language="C">
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
205    </programlisting>
206    <para>
207      HarfBuzz also provides getter functions to retrieve a buffer's
208      direction, script, and language properties individually.
209    </para>
210    <para>
211      HarfBuzz recognizes four text directions in
212      <type>hb_direction_t</type>: left-to-right
213      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
214      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
215      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>).  For the
216      script property, HarfBuzz uses identifiers based on the
217      <ulink
218      url="https://unicode.org/iso15924/">ISO 15924
219      standard</ulink>. For languages, HarfBuzz uses tags based on the
220      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
221    </para>
222    <para>
223      Helper functions are provided to convert character strings into
224      the necessary script and language tag types.
225    </para>
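    <para>
      ISO 15924 script codes are four letters long, such as
      <literal>Latn</literal> or <literal>Arab</literal>. As a rough
      sketch of how such a code can be packed into one 32-bit value,
      in the manner of HarfBuzz's <literal>HB_TAG</literal> macro,
      consider the hypothetical helper
      <function>make_tag()</function> below. In real code you would
      normally rely on the library's own helper functions rather than
      packing tags by hand:
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Pack four characters into a 32-bit tag, first character in the
 * most significant byte. Hypothetical helper for illustration. */
uint32_t
make_tag (char c1, char c2, char c3, char c4)
{
  return ((uint32_t) (unsigned char) c1 << 24) |
         ((uint32_t) (unsigned char) c2 << 16) |
         ((uint32_t) (unsigned char) c3 << 8)  |
          (uint32_t) (unsigned char) c4;
}
```
]]></programlisting>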
226    <para>
227      Two additional buffer properties to be aware of are the
228      "invisible glyph" and the replacement code point. The
229      replacement code point is inserted into buffer output in place of
230      any invalid code points encountered in the input. By default, it
231      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
232      point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
233    </para>
234    <programlisting language="C">
235      hb_buffer_set_replacement_codepoint(buf, replacement);
236    </programlisting>
237    <para>
238      passing in the replacement Unicode code point as the
239      <parameter>replacement</parameter> parameter.
240    </para>
241    <para>
242      The invisible glyph is used to replace all output glyphs that
243      are invisible. By default, the standard space character
244      <literal>U+0020</literal> is used; you can replace this (for
245      example, when using a font that provides script-specific
246      spaces) with
247    </para>
248    <programlisting language="C">
249      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
250    </programlisting>
251    <para>
252      Do note that in the <parameter>replacement_glyph</parameter>
253      parameter, you must provide the glyph ID of the replacement you
254      wish to use, not the Unicode code point.
255    </para>
256    <para>
257      HarfBuzz supports a few additional flags you might want to set
258      on your buffer under certain circumstances. The
259      <literal>HB_BUFFER_FLAG_BOT</literal> and
260      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
261      that the buffer represents the beginning or end (respectively)
262      of a text element (such as a paragraph or other block). Knowing
263      this allows HarfBuzz to apply certain contextual font features
264      when shaping, such as initial or final variants in connected
265      scripts.
266    </para>
267    <para>
268      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
269      tells HarfBuzz not to hide glyphs with the
270      <literal>Default_Ignorable</literal> property in Unicode. This
271      property designates control characters and other non-printing
272      code points, such as joiners and variation selectors. Normally
273      HarfBuzz replaces them in the output buffer with zero-width
274      space glyphs (using the "invisible glyph" property discussed
275      above); setting this flag causes them to be printed, which can
276      be helpful for troubleshooting.
277    </para>
278    <para>
279      Conversely, setting the
280      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
281      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
282      glyphs from the output buffer entirely. Finally, setting the
283      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
284      flag tells HarfBuzz not to insert the dotted-circle glyph
285      (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
286      inserted into buffer output when broken character sequences are
287      encountered (such as combining marks that are not attached to a
288      base character).
289    </para>
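    <para>
      The three ways of treating <literal>Default_Ignorable</literal>
      glyphs can be contrasted in a small model. Everything in this
      sketch is hypothetical: the <literal>TOY_</literal> flag names,
      the glyph record, and the rule that the "preserve" flag wins
      when both flags are set are assumptions made for illustration,
      not a description of HarfBuzz's internals:
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags modelling the two buffer flags in the text. */
enum {
  TOY_PRESERVE_DEFAULT_IGNORABLES = 1u << 0,
  TOY_REMOVE_DEFAULT_IGNORABLES   = 1u << 1
};

/* Hypothetical output record: a glyph ID plus a marker telling
 * whether it came from a Default_Ignorable code point. */
typedef struct {
  uint32_t glyph;
  int default_ignorable;
} toy_glyph_t;

/* Apply the policy in place and return the new glyph count.
 *   no flag:  ignorables become the invisible glyph
 *   preserve: ignorables are left untouched
 *   remove:   ignorables are dropped from the output */
size_t
apply_ignorable_policy (toy_glyph_t *glyphs, size_t count,
                        unsigned int flags, uint32_t invisible_glyph)
{
  size_t i, out = 0;
  assert (glyphs != NULL || count == 0);
  for (i = 0; i < count; i++)
  {
    if (glyphs[i].default_ignorable &&
        !(flags & TOY_PRESERVE_DEFAULT_IGNORABLES))
    {
      if (flags & TOY_REMOVE_DEFAULT_IGNORABLES)
        continue;                       /* drop this glyph */
      glyphs[i].glyph = invisible_glyph; /* hide this glyph */
    }
    glyphs[out++] = glyphs[i];
  }
  return out;
}
```
]]></programlisting>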
290  </section>
291
292  <section id="customizing-unicode-functions">
293    <title>Customizing Unicode functions</title>
294    <para>
295      HarfBuzz requires some simple functions for accessing
296      information from the Unicode Character Database (such as the
297      <literal>General_Category</literal> (gc) and
298      <literal>Script</literal> (sc) properties) that is useful
299      for shaping, as well as some useful operations like composing and
300      decomposing code points.
301    </para>
302    <para>
303      HarfBuzz includes its own internal, lightweight set of Unicode
304      functions. At build time, it is also possible to compile support
305      for some other options, such as the Unicode functions provided
306      by GLib or the International Components for Unicode (ICU)
307      library. Generally, this option is only of interest for client
308      programs that have specific integration requirements or that do
309      a significant amount of customization.
310    </para>
311    <para>
312      If your program has access to other Unicode functions, however,
313      such as through a system library or application framework, you
314      might prefer to use those instead of the built-in
315      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
318    </para>
319    <para>
320      The Unicode functions are specified in a structure called
321      <literal>unicode_funcs</literal> which is attached to each
322      buffer. But even though <literal>unicode_funcs</literal> is
323      associated with a <type>hb_buffer_t</type>, the functions
324      themselves are called by other HarfBuzz APIs that access
325      buffers, so it would be unwise for you to hook different
326      functions into different buffers.
327    </para>
328    <para>
329      In addition, you can mark your <literal>unicode_funcs</literal>
330      as immutable by calling
331      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
332      This is especially useful if your code is a
333      library or framework that will have its own client programs. By
334      marking your Unicode function choices as immutable, you prevent
335      your own client programs from changing the
336      <literal>unicode_funcs</literal> configuration and introducing
337      inconsistencies and errors downstream.
338    </para>
339    <para>
340      You can retrieve the Unicode-functions configuration for
341      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
342    </para>
343    <programlisting language="C">
344      hb_unicode_funcs_t *ufunctions;
345      ufunctions = hb_buffer_get_unicode_funcs(buf);
346    </programlisting>
347    <para>
348      The current version of <literal>unicode_funcs</literal> uses six functions:
349    </para>
350    <itemizedlist>
351      <listitem>
352	<para>
353	  <function>hb_unicode_combining_class_func_t</function>:
354	  returns the Canonical Combining Class of a code point.
355      	</para>
356      </listitem>
357      <listitem>
358	<para>
359	  <function>hb_unicode_general_category_func_t</function>:
360	  returns the General Category (gc) of a code point.
361      	</para>
362      </listitem>
363      <listitem>
364	<para>
365	  <function>hb_unicode_mirroring_func_t</function>: returns
366	  the Mirroring Glyph code point (for bi-directional
367	  replacement) of a code point.
368      	</para>
369      </listitem>
370      <listitem>
371	<para>
372	  <function>hb_unicode_script_func_t</function>: returns the
373	  Script (sc) property of a code point.
374      	</para>
375      </listitem>
376      <listitem>
377	<para>
378	  <function>hb_unicode_compose_func_t</function>: returns the
379	  canonical composition of a sequence of two code points.
380	</para>
381      </listitem>
382      <listitem>
383	<para>
384	  <function>hb_unicode_decompose_func_t</function>: returns
385	  the canonical decomposition of a code point.
386	</para>
387      </listitem>
388    </itemizedlist>
389    <para>
390      Note, however, that future HarfBuzz releases may alter this set.
391    </para>
392    <para>
      Each Unicode function has a corresponding setter, with which you
      can install your replacement function as a callback. For example,
395      to replace
396      <function>hb_unicode_general_category_func_t</function>, you can call
397    </para>
398    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
400    </programlisting>
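    <para>
      The general shape of a replaceable callback with
      <parameter>user_data</parameter> and a
      <parameter>destroy</parameter> notifier, together with the
      effect of marking the set immutable, can be modelled in a few
      lines. This is a simplified sketch: all
      <literal>toy_</literal> names are hypothetical, only a
      mirroring callback is modelled, and the callback signature is
      shorter than the one HarfBuzz uses, which also receives the
      <literal>unicode_funcs</literal> object itself:
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*toy_mirroring_func_t) (uint32_t cp, void *user_data);
typedef void (*toy_destroy_func_t) (void *user_data);

/* Hypothetical function table holding one replaceable callback. */
typedef struct {
  toy_mirroring_func_t mirroring;
  void *user_data;
  toy_destroy_func_t destroy;
  int immutable;
} toy_unicode_funcs_t;

/* Default callback: report that nothing is mirrored. */
uint32_t
toy_default_mirroring (uint32_t cp, void *user_data)
{
  (void) user_data;
  return cp;
}

void
toy_funcs_init (toy_unicode_funcs_t *funcs)
{
  assert (funcs != NULL);
  funcs->mirroring = toy_default_mirroring;
  funcs->user_data = NULL;
  funcs->destroy = NULL;
  funcs->immutable = 0;
}

/* Install a new callback. Returns 1 on success, or 0 if the table
 * is immutable, in which case the rejected user_data is released. */
int
toy_funcs_set_mirroring (toy_unicode_funcs_t *funcs,
                         toy_mirroring_func_t func,
                         void *user_data, toy_destroy_func_t destroy)
{
  assert (funcs != NULL && func != NULL);
  if (funcs->immutable)
  {
    if (destroy)
      destroy (user_data);
    return 0;
  }
  if (funcs->destroy)
    funcs->destroy (funcs->user_data); /* release the old callback's data */
  funcs->mirroring = func;
  funcs->user_data = user_data;
  funcs->destroy = destroy;
  return 1;
}

/* Example replacement: mirror a handful of ASCII brackets. */
uint32_t
bracket_mirroring (uint32_t cp, void *user_data)
{
  (void) user_data;
  switch (cp)
  {
    case '(': return ')';
    case ')': return '(';
    case '[': return ']';
    case ']': return '[';
    case '<': return '>';
    case '>': return '<';
    default:  return cp;
  }
}
```
]]></programlisting>
    <para>
      Once the <literal>immutable</literal> field is set in this
      model, later attempts to install a different callback are
      refused and the existing one stays in effect.
    </para>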
401    <para>
402      Virtualizing this set of Unicode functions is primarily intended
403      to improve portability. There is no need for every client
404      program to make the effort to replace the default options, so if
405      you are unsure, do not feel any pressure to customize
406      <literal>unicode_funcs</literal>.
407    </para>
408  </section>
409
410</chapter>
411