1LZ4 Format Description
2Last revised: 2012-02-27
3Author : Y. Collet
4
5
6
7This small specification intents to provide enough information
8to anyone willing to produce LZ4-compatible compressed data blocks
9using any programming language.
10
11LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
12The most important design principle behind LZ4 is simplicity.
13It helps to create an easy to read and maintain source code.
14It also helps later on for optimisations, compactness, and speed.
15There is no entropy encoder backend nor framing layer.
16The latter is assumed to be handled by other parts of the system.
17
18This document only describes the format,
19not how the LZ4 compressor nor decompressor actually work.
20The correctness of the decompressor should not depend
21on implementation details of the compressor, and vice versa.
22
23
24
25-- Compressed block format --
26
27An LZ4 compressed block is composed of sequences.
28Schematically, a sequence is a suite of literals, followed by a match copy.
29
30Each sequence starts with a token.
31The token is a one byte value, separated into two 4-bits fields.
32Therefore each field ranges from 0 to 15.
33
34
35The first field uses the 4 high-bits of the token.
36It provides the length of literals to follow.
37(Note : a literal is a not-compressed byte).
38If the field value is 0, then there is no literal.
39If it is 15, then we need to add some more bytes to indicate the full length.
40Each additionnal byte then represent a value from 0 to 255,
41which is added to the previous value to produce a total length.
42When the byte value is 255, another byte is output.
43There can be any number of bytes following the token. There is no "size limit".
44(Sidenote this is why a not-compressible input block is expanded by 0.4%).
45
46Example 1 : A length of 48 will be represented as :
47- 15 : value for the 4-bits High field
48- 33 : (=48-15) remaining length to reach 48
49
50Example 2 : A length of 280 will be represented as :
51- 15  : value for the 4-bits High field
52- 255 : following byte is maxed, since 280-15 >= 255
53- 10  : (=280 - 15 - 255) ) remaining length to reach 280
54
55Example 3 : A length of 15 will be represented as :
56- 15 : value for the 4-bits High field
57- 0  : (=15-15) yes, the zero must be output
58
59Following the token and optional length bytes, are the literals themselves.
60They are exactly as numerous as previously decoded (length of literals).
61It's possible that there are zero literal.
62
63
64Following the literals is the match copy operation.
65
66It starts by the offset.
67This is a 2 bytes value, in little endian format.
68
69The offset represents the position of the match to be copied from.
701 means "current position - 1 byte".
71The maximum offset value is 65535, 65536 cannot be coded.
72Note that 0 is an invalid value, not used.
73
74Then we need to extract the match length.
75For this, we use the second token field, the low 4-bits.
76Value, obviously, ranges from 0 to 15.
77However here, 0 means that the copy operation will be minimal.
78The minimum length of a match, called minmatch, is 4.
79As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
80Similar to literal length, on reaching the highest possible value (15),
81we output additional bytes, one at a time, with values ranging from 0 to 255.
82They are added to total to provide the final match length.
83A 255 value means there is another byte to read and add.
84There is no limit to the number of optional bytes that can be output this way.
85(This points towards a maximum achievable compression ratio of ~250).
86
87With the offset and the matchlength,
88the decoder can now proceed to copy the data from the already decoded buffer.
89On decoding the matchlength, we reach the end of the compressed sequence,
90and therefore start another one.
91
92
93-- Parsing restrictions --
94
95There are specific parsing rules to respect in order to remain compatible
96with assumptions made by the decoder :
971) The last 5 bytes are always literals
982) The last match must start at least 12 bytes before end of block
99Consequently, a block with less than 13 bytes cannot be compressed.
100These rules are in place to ensure that the decoder
101will never read beyond the input buffer, nor write beyond the output buffer.
102
103Note that the last sequence is also incomplete,
104and stops right after literals.
105
106
107-- Additional notes --
108
109There is no assumption nor limits to the way the compressor
110searches and selects matches within the source data block.
111It could be a fast scan, a multi-probe, a full search using BST,
112standard hash chains or MMC, well whatever.
113
114Advanced parsing strategies can also be implemented, such as lazy match,
115or full optimal parsing.
116
117All these trade-off offer distinctive speed/memory/compression advantages.
118Whatever the method used by the compressor, its result will be decodable
119by any LZ4 decoder if it follows the format specification described above.
120
121