<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
  <title>Buffers, language, script and direction</title>
  <para>
    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
    buffer. In this chapter, we'll look at how to set up a buffer with
    the text that we want and how to customize the properties of the
    buffer. We'll also look at a piece of lower-level machinery that
    you will need to understand before proceeding: the functions that
    HarfBuzz uses to retrieve Unicode information.
  </para>
  <para>
    After shaping is complete, HarfBuzz puts its output back
    into the buffer. But getting that output requires setting up a
    face and a font first, so we will look at that in the next chapter
    instead of here.
  </para>
  <section id="creating-and-destroying-buffers">
    <title>Creating and destroying buffers</title>
    <para>
      As we saw in our <emphasis>Getting Started</emphasis> example, a
      buffer is created and
      initialized with <function>hb_buffer_create()</function>. This
      produces a new, empty buffer object, instantiated with some
      default values and ready to accept your Unicode strings.
    </para>
    <para>
      HarfBuzz manages the memory of objects (such as buffers) that it
      creates, so you don't have to. When you have finished working on
      a buffer, you can call <function>hb_buffer_destroy()</function>:
    </para>
    <programlisting language="C">
      hb_buffer_t *buf = hb_buffer_create();
      ...
      hb_buffer_destroy(buf);
    </programlisting>
    <para>
      This will destroy the object and free its associated memory -
      unless some other part of the program holds a reference to this
      buffer.
      If you acquire a HarfBuzz buffer from another subsystem
      and want to ensure that it is not freed when someone
      else destroys it, you should increase its reference count:
    </para>
    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
    </programlisting>
    <para>
      And then decrease it once you're done with it:
    </para>
    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
    </programlisting>
    <para>
      While we are on the subject of reference-counting buffers, it is
      worth noting that an individual buffer can only meaningfully be
      used by one thread at a time.
    </para>
    <para>
      To throw away all the data in your buffer and start from scratch,
      call <function>hb_buffer_reset(buf)</function>. If you want to
      throw away the string in the buffer but keep the options, you can
      instead call <function>hb_buffer_clear_contents(buf)</function>.
    </para>
  </section>

  <section id="adding-text-to-the-buffer">
    <title>Adding text to the buffer</title>
    <para>
      Now we have a brand new HarfBuzz buffer. Let's start filling it
      with text! From HarfBuzz's perspective, a buffer is just a stream
      of Unicode code points, but your input string is probably in one of
      the standard Unicode character encodings (UTF-8, UTF-16, or
      UTF-32). HarfBuzz provides convenience functions that accept
      each of these encodings:
      <function>hb_buffer_add_utf8()</function>,
      <function>hb_buffer_add_utf16()</function>, and
      <function>hb_buffer_add_utf32()</function>. Other than the
      character encoding they accept, they function identically.
    </para>
    <para>
      You can add UTF-8 text to a buffer by passing in the text array,
      the array's length, an offset into the array for the first
      character to add, and the length of the segment to add:
    </para>
    <programlisting language="C">
      void
      hb_buffer_add_utf8 (hb_buffer_t *buf,
                          const char *text,
                          int text_length,
                          unsigned int item_offset,
                          int item_length);
    </programlisting>
    <para>
      So, in practice, you can say:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    </programlisting>
    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is NUL-terminated. Similarly, you can also use
      <literal>-1</literal> as the final argument if you want to add
      the text's full contents.
    </para>
    <para>
      Whatever <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> you provide, HarfBuzz will also
      attempt to grab the five characters <emphasis>before</emphasis>
      the offset point and the five characters
      <emphasis>after</emphasis> the designated end. These are the
      before and after "context" segments, which HarfBuzz uses
      internally to make shaping decisions. They will not be part of
      the final output, but they ensure that HarfBuzz's
      script-specific shaping operations are correct. If there are
      fewer than five characters available for the before or after
      contexts, HarfBuzz will just grab what is there.
    </para>
    <para>
      For longer text runs, such as full paragraphs, it might be
      tempting to only add smaller sub-segments to a buffer and
      shape them in piecemeal fashion.
      Generally, however, this is not a good
      idea, because many shaping decisions
      depend on this context information. For example, in Arabic
      and other connected scripts, HarfBuzz needs to know the code
      points before and after each character in order to correctly
      determine which glyph to return.
    </para>
    <para>
      The safest approach is to add all of the text available, then
      use <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> to indicate which characters you
      want shaped, so that HarfBuzz has access to any context.
    </para>
    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      functions. Note, however, that HarfBuzz does only minimal
      validity checking on the text that is added
      to a buffer: invalid code points will be replaced, but any
      deeper sanity checking is up to you.
    </para>

  </section>

  <section id="setting-buffer-properties">
    <title>Setting buffer properties</title>
    <para>
      Buffers containing input characters still need several
      properties set before HarfBuzz can shape their text correctly.
    </para>
    <para>
      Initially, all buffers are set to the
      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
      type. After adding text, the buffer should be set to
      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
      indicates that it contains un-shaped input
      characters. After shaping, the buffer will have the
      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
    </para>
    <para>
      <function>hb_buffer_add_utf8()</function> and the
      other UTF functions set the content type of their buffer
      automatically.
      But if you are reusing a buffer you may want to
      check its state with
      <function>hb_buffer_get_content_type(buf)</function>. If
      necessary you can set the content type with
    </para>
    <programlisting language="C">
      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
    </programlisting>
    <para>
      to prepare for shaping.
    </para>
    <para>
      Buffers also need to carry information about the script,
      language, and text direction of their contents. You can set
      these properties individually:
    </para>
    <programlisting language="C">
      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
    </programlisting>
    <para>
      However, since these properties are often repeated across
      multiple text runs, you can also save them in a
      <literal>hb_segment_properties_t</literal> for reuse:
    </para>
    <programlisting language="C">
      /* The struct is owned by the caller; pass its address to both calls. */
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
    </programlisting>
    <para>
      HarfBuzz also provides getter functions to retrieve a buffer's
      direction, script, and language properties individually.
    </para>
    <para>
      HarfBuzz recognizes four text directions in
      <type>hb_direction_t</type>: left-to-right
      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
      script property, HarfBuzz uses identifiers based on the
      <ulink
      url="https://unicode.org/iso15924/">ISO 15924
      standard</ulink>. For languages, HarfBuzz uses tags based on the
      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
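      As a minimal sketch of how these identifiers fit together, the
      following configures a buffer for right-to-left Arabic text. The
      helper name <function>set_arabic_properties()</function> is
      hypothetical and used only for illustration; the strings
      <literal>"Arab"</literal> and <literal>"ar"</literal> are the
      ISO 15924 and BCP 47 codes for Arabic, and
      <function>hb_script_from_string()</function> is the script
      counterpart of <function>hb_language_from_string()</function>:
      <programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: configures a buffer for right-to-left Arabic text. */
static void
set_arabic_properties (hb_buffer_t *buf)
{
  hb_buffer_set_direction (buf, HB_DIRECTION_RTL);
  /* "Arab" is the ISO 15924 code for the Arabic script. */
  hb_buffer_set_script (buf, hb_script_from_string ("Arab", -1));
  /* "ar" is the BCP 47 tag for the Arabic language. */
  hb_buffer_set_language (buf, hb_language_from_string ("ar", -1));
}
]]></programlisting>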
    </para>
    <para>
      Helper functions are provided to convert character strings into
      the necessary script and language tag types.
    </para>
    <para>
      Two additional buffer properties to be aware of are the
      "invisible glyph" and the replacement code point. The
      replacement code point is inserted into buffer output in place of
      any invalid code points encountered in the input. By default, it
      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
      point, <literal>U+FFFD</literal> "�". You can change this with
    </para>
    <programlisting language="C">
      hb_buffer_set_replacement_codepoint(buf, replacement);
    </programlisting>
    <para>
      passing in the replacement Unicode code point as the
      <parameter>replacement</parameter> parameter.
    </para>
    <para>
      The invisible glyph is used to replace all output glyphs that
      are invisible. By default, the standard space character
      <literal>U+0020</literal> is used; you can replace this (for
      example, when using a font that provides script-specific
      spaces) with
    </para>
    <programlisting language="C">
      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
    </programlisting>
    <para>
      Do note that in the <parameter>replacement_glyph</parameter>
      parameter, you must provide the glyph ID of the replacement you
      wish to use, not the Unicode code point.
    </para>
    <para>
      HarfBuzz supports a few additional flags you might want to set
      on your buffer under certain circumstances. The
      <literal>HB_BUFFER_FLAG_BOT</literal> and
      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
      that the buffer represents the beginning or end (respectively)
      of a text element (such as a paragraph or other block). Knowing
      this allows HarfBuzz to apply certain contextual font features
      when shaping, such as initial or final variants in connected
      scripts.
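      Flags are combined as a bit mask and applied with
      <function>hb_buffer_set_flags()</function>. A minimal sketch,
      assuming the buffer holds one complete paragraph (the helper
      name <function>mark_whole_paragraph()</function> is
      hypothetical):
      <programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: marks a buffer as holding a complete paragraph,
 * that is, both the beginning and the end of a text element. */
static void
mark_whole_paragraph (hb_buffer_t *buf)
{
  /* OR the individual flags together and set them in a single call;
   * hb_buffer_set_flags() replaces any flags set previously. */
  hb_buffer_set_flags (buf,
                       (hb_buffer_flags_t) (HB_BUFFER_FLAG_BOT |
                                            HB_BUFFER_FLAG_EOT));
}
]]></programlisting>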
    </para>
    <para>
      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
      tells HarfBuzz not to hide glyphs with the
      <literal>Default_Ignorable</literal> property in Unicode. This
      property designates control characters and other non-printing
      code points, such as joiners and variation selectors. Normally
      HarfBuzz replaces them in the output buffer with zero-width
      space glyphs (using the "invisible glyph" property discussed
      above); setting this flag causes them to be printed, which can
      be helpful for troubleshooting.
    </para>
    <para>
      Conversely, setting the
      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
      glyphs from the output buffer entirely. Finally, setting the
      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
      flag tells HarfBuzz not to insert the dotted-circle glyph
      (<literal>U+25CC</literal>, "◌"), which is normally
      inserted into buffer output when broken character sequences are
      encountered (such as combining marks that are not attached to a
      base character).
    </para>
  </section>

  <section id="customizing-unicode-functions">
    <title>Customizing Unicode functions</title>
    <para>
      HarfBuzz requires some simple functions for accessing
      information from the Unicode Character Database (such as the
      <literal>General_Category</literal> (gc) and
      <literal>Script</literal> (sc) properties) that is useful
      for shaping, as well as some useful operations like composing and
      decomposing code points.
    </para>
    <para>
      HarfBuzz includes its own internal, lightweight set of Unicode
      functions. At build time, it is also possible to compile support
      for some other options, such as the Unicode functions provided
      by GLib or the International Components for Unicode (ICU)
      library.
      Generally, this option is only of interest for client
      programs that have specific integration requirements or that do
      a significant amount of customization.
    </para>
    <para>
      If your program has access to other Unicode functions, however,
      such as through a system library or application framework, you
      might prefer to use those instead of the built-in
      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
    </para>
    <para>
      The Unicode functions are specified in a structure called
      <literal>unicode_funcs</literal> which is attached to each
      buffer. But even though <literal>unicode_funcs</literal> is
      associated with a <type>hb_buffer_t</type>, the functions
      themselves are called by other HarfBuzz APIs that access
      buffers, so it would be unwise for you to hook different
      functions into different buffers.
    </para>
    <para>
      In addition, you can mark your <literal>unicode_funcs</literal>
      as immutable by calling
      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
      This is especially useful if your code is a
      library or framework that will have its own client programs. By
      marking your Unicode function choices as immutable, you prevent
      your own client programs from changing the
      <literal>unicode_funcs</literal> configuration and introducing
      inconsistencies and errors downstream.
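      A minimal sketch of that pattern follows, assuming a library
      that wants one private-use code point treated as a letter. The
      names <function>my_general_category()</function> and
      <function>make_locked_ufuncs()</function> and the choice of
      <literal>U+E000</literal> are hypothetical;
      <function>hb_unicode_funcs_create()</function>,
      <function>hb_unicode_funcs_get_default()</function>, and
      <function>hb_unicode_funcs_get_parent()</function> are HarfBuzz
      calls that this chapter does not otherwise cover:
      <programlisting language="C"><![CDATA[
#include <stddef.h>
#include <hb.h>

/* Hypothetical callback: reports U+E000 as a letter and defers to the
 * parent (default) Unicode functions for every other code point. */
static hb_unicode_general_category_t
my_general_category (hb_unicode_funcs_t *ufuncs,
                     hb_codepoint_t      unicode,
                     void               *user_data)
{
  (void) user_data; /* unused in this sketch */
  if (unicode == 0xE000u)
    return HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER;
  return hb_unicode_general_category (hb_unicode_funcs_get_parent (ufuncs),
                                      unicode);
}

/* Hypothetical helper: builds a customized, locked set of Unicode
 * functions. The caller owns the result and releases it with
 * hb_unicode_funcs_destroy(). */
static hb_unicode_funcs_t *
make_locked_ufuncs (void)
{
  /* Inherit from the defaults so that the functions we do not
   * override keep working. */
  hb_unicode_funcs_t *ufuncs =
    hb_unicode_funcs_create (hb_unicode_funcs_get_default ());

  hb_unicode_funcs_set_general_category_func (ufuncs, my_general_category,
                                              NULL, NULL);

  /* After this call, further setter calls on ufuncs have no effect. */
  hb_unicode_funcs_make_immutable (ufuncs);
  return ufuncs;
}
]]></programlisting>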
    </para>
    <para>
      You can retrieve the Unicode-functions configuration for
      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
    </para>
    <programlisting language="C">
      hb_unicode_funcs_t *ufunctions;
      ufunctions = hb_buffer_get_unicode_funcs(buf);
    </programlisting>
    <para>
      The current version of <literal>unicode_funcs</literal> uses six functions:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <function>hb_unicode_combining_class_func_t</function>:
          returns the Canonical Combining Class of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_general_category_func_t</function>:
          returns the General Category (gc) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_mirroring_func_t</function>: returns
          the Mirroring Glyph code point (for bi-directional
          replacement) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_script_func_t</function>: returns the
          Script (sc) property of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_compose_func_t</function>: returns the
          canonical composition of a sequence of two code points.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_decompose_func_t</function>: returns
          the canonical decomposition of a code point.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Note, however, that future HarfBuzz releases may alter this set.
    </para>
    <para>
      Each Unicode function has a corresponding setter, with which you
      can assign a callback to your replacement function.
      For example,
      to replace
      <function>hb_unicode_general_category_func_t</function>, you can call
    </para>
    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
    </programlisting>
    <para>
      Virtualizing this set of Unicode functions is primarily intended
      to improve portability. There is no need for every client
      program to make the effort to replace the default options, so if
      you are unsure, do not feel any pressure to customize
      <literal>unicode_funcs</literal>.
    </para>
  </section>

</chapter>