1# Hyb (hyphenation pattern binary) file format
2
3The hyb file format is how hyphenation patterns are stored in the system image.
4
5Goals include:
6
7* Concise (system image space is at a premium)
8* Usable when mmap'ed, so it doesn't take significant physical RAM
9* Fast to compute
10* Simple
11
12It is _not_ intended as an interchange format, so there is no attempt to make the format
13extensible or facilitate backward and forward compatibility.
14
15Further, at some point we will probably pack patterns for multiple languages into a single
16physical file, to reduce number of open mmap'ed files. This document doesn't cover that.
17
18## Theoretical basis
19
20At heart, the file contains packed tries with suffix compression, actually quite similar
21to the implementation in TeX.
22
23The file contains three sections. The first section represents the "alphabet," including
24case folding. It is effectively a map from Unicode code point to a small integer.
25
26The second section contains the trie in packed form. It is an array of 3-tuples, packed
27into a 32 bit integer. Each (suffix-compressed) trie node has a unique index within this
28array, and the pattern field in the tuple is the pattern for that node. Further, each edge
29in the trie has an entry in the array, and the character and link fields in the tuple
30represent the label and destination of the edge. The packing strategy is as in
31[Word Hy-phen-a-tion by Com-put-er](http://www.tug.org/docs/liang/liang-thesis.pdf) by
32Franklin Mark Liang.
33
34The trie representation is similar but not identical to the "double-array trie".
35The fundamental operation of lookup of the edge from `s` to `t` with label `c` is
36to compare `c == character[s + c]`, and if so, `t = link[s + c]`.
37
38The third section contains the pattern strings. This section is in two parts: first,
39an array with a 3-tuple for each pattern (length, number of trailing 0's, and offset
40into the string pool); and second, the string pool. Each pattern is encoded as a byte
41(packing 2 per byte would be possible but the space savings would not be signficant).
42
43As much as possible of the file is represented as 32 bit integers, as that is especially
44efficent to access. All are little-endian (this could be revised if the code ever needs
45to be ported to big-endian systems).
46
47## Header
48
49```
50uint32_t magic == 0x62ad7968
51uint32_t version = 0
52uint32_t alphabet_offset (in bytes)
53uint32_t trie_offset (in bytes)
54uint32_t pattern_offset (in bytes)
55uint32_t file_size (in bytes)
56```
57
58Offsets are from the front of the file, and in bytes.
59
60## Alphabet
61
62The alphabet table comes in two versions. The first is well suited to dense Unicode
63ranges and is limited to 256. The second is more general, but lookups will be slower.
64
65### Alphabet, direct version
66
67```
68uint32_t version = 0
69uint32_t min_codepoint
70uint32_t max_codepoint (exclusive)
71uint8_t[] data
72```
73
74The size of the data array is max_codepoint - min_codepoint. 0 represents an unmapped
75character. Note that, in the current implementation, automatic hyphenation is disabled
76for any word containing an unmapped character.
77
78In general, pad bytes follow this table, aligning the next table to a 4-byte boundary.
79
80### Alphabet, general version
81
82```
83uint32_t version = 1
84uint32_t n_entries
85uint32_t[n_entries] data
86```
87
88Each element in the data table is `(codepoint << 11) | value`. Note that this is
89restricted to 11 bits (2048 possible values). The largest known current value is 483
90(for Sanskrit).
91
92The entries are sorted by codepoint, to facilitate binary search. Another reasonable
93implementation for consumers of the data would be to build a hash table at load time.
94
95## Trie
96
97```
98uint32_t version = 0
99uint32_t char_mask
100uint32_t link_shift
101uint32_t link_mask
102uint32_t pattern_shift
103uint32_t n_entries
104uint32_t[n_entries] data
105```
106
107Each element in the data table is `(pattern << pattern_shift) | (link << link_shift) | char`.
108
109All known pattern tables fit in 32 bits total. If this is exceeded, there is a fairly
110straightforward tweak, where each node occupies a slot by itself (as opposed to sharing
111it with edge slots), which would require very minimal changes to the implementation (TODO
112present in more detail).
113
114## Pattern
115
116```
117uint32_t version = 0
118uint32_t n_entries
119uint32_t pattern_offset (in bytes)
120uint32_t pattern_size (in bytes)
121uint32_t[n_entries] data
122uint8_t[] pattern_buf
123```
124
125Each element in data table is `(len << 26) | (shift << 20) | offset`, where an offset of 0
126points to the first byte of pattern_buf.
127
128Generally pattern_offset is `16 + 4 * n_entries`.
129
130For example, 'a4m5ato' would be represented as `[4, 5, 0, 0, 0]`, then len = 2, shift = 3, and
131offset points to [4, 5] in the pattern buffer.
132
133Future extension: additional data representing nonstandard hyphenation. See
134[Automatic non-standard hyphenation in OpenOffice.org](https://www.tug.org/TUGboat/tb27-1/tb86nemeth.pdf)
135for more information about that issue.
136