1LZ4 Block Format Description
2============================
3Last revised: 2018-04-25.
4Author : Yann Collet
5
6
7This specification is intended for developers
8willing to produce LZ4-compatible compressed data blocks
9using any programming language.
10
11LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
12There is no entropy encoder back-end nor framing layer.
13The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
14This design is assumed to favor simplicity and speed.
15It helps later on for optimizations, compactness, and features.
16
17This document describes only the block format,
18not how the compressor nor decompressor actually work.
19The correctness of the decompressor should not depend
20on implementation details of the compressor, and vice versa.
21
22[LZ4 Frame format]: lz4_Frame_format.md
23
24
25
26Compressed block format
27-----------------------
28An LZ4 compressed block is composed of sequences.
29A sequence is a suite of literals (not-compressed bytes),
30followed by a match copy.
31
32Each sequence starts with a `token`.
33The `token` is a one byte value, separated into two 4-bits fields.
34Therefore each field ranges from 0 to 15.
35
36
37The first field uses the 4 high-bits of the token.
38It provides the length of literals to follow.
39
40If the field value is 0, then there is no literal.
41If it is 15, then we need to add some more bytes to indicate the full length.
42Each additional byte then represent a value from 0 to 255,
43which is added to the previous value to produce a total length.
44When the byte value is 255, another byte is output.
45There can be any number of bytes following `token`. There is no "size limit".
46(Side note : this is why a not-compressible input block is expanded by 0.4%).
47
48Example 1 : A literal length of 48 will be represented as :
49
50  - 15 : value for the 4-bits High field
51  - 33 : (=48-15) remaining length to reach 48
52
53Example 2 : A literal length of 280 will be represented as :
54
55  - 15  : value for the 4-bits High field
56  - 255 : following byte is maxed, since 280-15 >= 255
57  - 10  : (=280 - 15 - 255) ) remaining length to reach 280
58
59Example 3 : A literal length of 15 will be represented as :
60
61  - 15 : value for the 4-bits High field
62  - 0  : (=15-15) yes, the zero must be output
63
64Following `token` and optional length bytes, are the literals themselves.
65They are exactly as numerous as previously decoded (length of literals).
66It's possible that there are zero literal.
67
68
69Following the literals is the match copy operation.
70
71It starts by the `offset`.
72This is a 2 bytes value, in little endian format
73(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
74
75The `offset` represents the position of the match to be copied from.
761 means "current position - 1 byte".
77The maximum `offset` value is 65535, 65536 cannot be coded.
78Note that 0 is an invalid value, not used.
79
80Then we need to extract the `matchlength`.
81For this, we use the second token field, the low 4-bits.
82Value, obviously, ranges from 0 to 15.
83However here, 0 means that the copy operation will be minimal.
84The minimum length of a match, called `minmatch`, is 4.
85As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
86Similar to literal length, on reaching the highest possible value (15),
87we output additional bytes, one at a time, with values ranging from 0 to 255.
88They are added to total to provide the final match length.
89A 255 value means there is another byte to read and add.
90There is no limit to the number of optional bytes that can be output this way.
91(This points towards a maximum achievable compression ratio of about 250).
92
93Decoding the `matchlength` reaches the end of current sequence.
94Next byte will be the start of another sequence.
95But before moving to next sequence,
96it's time to use the decoded match position and length.
97The decoder copies `matchlength` bytes from match position to current position.
98
99In some cases, `matchlength` is larger than `offset`.
100Therefore, `match_pos + matchlength > current_pos`,
101which means that later bytes to copy are not yet decoded.
102This is called an "overlap match", and must be handled with special care.
103A common case is an offset of 1,
104meaning the last byte is repeated `matchlength` times.
105
106
107Parsing restrictions
108-----------------------
109There are specific parsing rules to respect in order to remain compatible
110with assumptions made by the decoder :
111
1121. The last 5 bytes are always literals.  In other words, the last five bytes
113   from the uncompressed input (or all bytes, if the input has less than five
114   bytes) must be encoded as literals on behalf of the last sequence.
115   The last sequence is incomplete, and stops right after the literals.
1162. The last match must start at least 12 bytes before end of block.
117   The last match is part of the penultimate sequence,
118   since the last sequence stops right after literals.
119   Note that, as a consequence, blocks < 13 bytes cannot be compressed.
120
121These rules are in place to ensure that the decoder
122can speculatively execute copy instructions
123without ever reading nor writing beyond provided I/O buffers.
124
1251. To copy literals from a non-last sequence, an 8-byte copy instruction
126   can always be safely issued (without reading past the input),
127   because literals are followed by a 2-byte offset,
128   and last sequence is at least 1+5 bytes long.
1292. Similarly, a match operation can speculatively copy up to 12 bytes
130   while remaining within output buffer boundaries.
131
132Empty inputs can be represented with a zero byte,
133interpreted as a token without literals and without a match.
134
135
136Additional notes
137-----------------------
138There is no assumption nor limits to the way the compressor
139searches and selects matches within the source data block.
140It could be a fast scan, a multi-probe, a full search using BST,
141standard hash chains or MMC, well whatever.
142
143Advanced parsing strategies can also be implemented, such as lazy match,
144or full optimal parsing.
145
146All these trade-off offer distinctive speed/memory/compression advantages.
147Whatever the method used by the compressor, its result will be decodable
148by any LZ4 decoder if it follows the format specification described above.
149