LZ4 Frame Format Description
============================

###Notices

Copyright (c) 2013-2015 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

###Version

1.5.1 (31/03/2015)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, and suitable for
file compression, pipe compression and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and an optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash)
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It does not need to support all options, though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and an associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
| 4 bytes |  3-11 bytes   |       |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204
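
As an illustration of the little endian convention, the magic number check could be sketched as follows in Python. The function name `has_lz4_frame_magic` is hypothetical and not part of this specification; only the constant comes from the text above.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # value given in this specification


def has_lz4_frame_magic(data: bytes) -> bool:
    """Return True if `data` starts with the LZ4 frame magic number."""
    if len(data) < 4:
        return False
    # "<I" reads an unsigned 32-bit integer stored in little endian order.
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == LZ4_FRAME_MAGIC
```

Because the value is stored in little endian order, a frame begins with the byte sequence `04 22 4D 18`.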

__Frame Descriptor__

3 to 11 Bytes, detailed in the Frame Descriptor section below.
This is the most important part of the specification.

__Data Blocks__

Detailed in the Data Blocks section below.
This is where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bits value.

__Content Checksum__

Content Checksum verifies that the full content has been decoded correctly.
The content checksum is the result
of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
Content Checksum validates the result :
it confirms that all blocks were fully transmitted
in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | HC      |
| ------- | ------- |:--------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 11 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |   7-6   |    5    |     4     |   3     |     2     |    1-0   |
| ------- | ------- | ------- | --------- | ------- | --------- | -------- |
|FieldName| Version | B.Indep | B.Checksum| C.Size  | C.Checksum|*Reserved*|


__BD byte__

|  BitNb  |     7    |     6-5-4    |  3-2-1-0 |
| ------- | -------- | ------------ | -------- |
|FieldName|*Reserved*| Block MaxSize|*Reserved*|

In the tables, bit 7 is the highest bit, while bit 0 is the lowest.
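
The two bit layouts above can be unpacked with shifts and masks. The sketch below is one possible way to do it in Python; the `Flags` type and its field names are hypothetical, chosen only to mirror the table headers.

```python
from typing import NamedTuple


class Flags(NamedTuple):
    version: int              # FLG bits 7-6
    block_independence: bool  # FLG bit 5
    block_checksum: bool      # FLG bit 4
    content_size: bool        # FLG bit 3
    content_checksum: bool    # FLG bit 2
    flg_reserved: int         # FLG bits 1-0
    block_max_size_id: int    # BD bits 6-5-4
    bd_reserved: int          # BD bit 7 and bits 3-0, kept in place


def parse_flg_bd(flg: int, bd: int) -> Flags:
    """Split the FLG and BD bytes into their individual fields."""
    return Flags(
        version=(flg >> 6) & 0x03,
        block_independence=bool((flg >> 5) & 1),
        block_checksum=bool((flg >> 4) & 1),
        content_size=bool((flg >> 3) & 1),
        content_checksum=bool((flg >> 2) & 1),
        flg_reserved=flg & 0x03,
        block_max_size_id=(bd >> 4) & 0x07,
        bd_reserved=bd & 0x8F,
    )
```

For example, `parse_flg_bd(0x64, 0x40)` describes a frame with version 1, independent blocks, a content checksum, no block checksum, no content size, and a Block MaxSize value of 4.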

__Version Number__

2-bits field, must be set to “01”.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such a case, it is necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes direct jumps or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-bytes checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8 bytes unsigned little endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a content checksum will be appended after the EndMark.

Recommended value : “1” (content checksum is present)

__Block Maximum Size__

This information is intended to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one value from the following table :

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above a (system-specific) size.
Unused values may be used in a future revision of the spec.
A decoder conformant to the current version of the spec
is only able to decode block sizes defined in this spec.
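
The table can be turned into a small lookup, sketched below in Python. The names `BLOCK_MAX_SIZES` and `block_max_size` are hypothetical, and the sketch assumes that 1 KB means 1024 bytes and 1 MB means 1024 KB.

```python
# Block Maximum Size table; values 0 to 3 are not defined by this version.
# Assumption: 1 KB = 1024 bytes, 1 MB = 1024 KB.
BLOCK_MAX_SIZES = {
    4: 64 * 1024,
    5: 256 * 1024,
    6: 1024 * 1024,
    7: 4 * 1024 * 1024,
}


def block_max_size(bd: int) -> int:
    """Return the maximum uncompressed block size, in bytes, for a BD byte."""
    size_id = (bd >> 4) & 0x07  # Block MaxSize lives in bits 6-5-4 of BD
    if size_id not in BLOCK_MAX_SIZES:
        raise ValueError(f"unsupported Block Maximum Size value: {size_id}")
    return BLOCK_MAX_SIZES[size_id]
```

Raising an explicit error for values 0 to 3 follows the requirement that a decoder reports which parameter it does not support.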

__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
If this happens, a decoder respecting the current version of the specification
shall not be able to decode such a frame.

__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using unsigned 8 Bytes, for a maximum of 16 Exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.
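
A minimal sketch of reading this optional field, and of deriving the descriptor length from it, might look like the following. The function name `read_content_size` is hypothetical; `descriptor` is assumed to start at the FLG byte.

```python
import struct
from typing import Optional, Tuple


def read_content_size(descriptor: bytes) -> Tuple[Optional[int], int]:
    """Return (content size or None, total descriptor length in bytes)."""
    if len(descriptor) < 3:
        raise ValueError("descriptor needs at least FLG, BD and HC")
    flg = descriptor[0]
    if (flg >> 3) & 1:  # C.Size flag, bit 3 of FLG
        if len(descriptor) < 11:
            raise ValueError("truncated descriptor: Content Size missing")
        # "<Q": unsigned 64-bit little endian, located right after FLG and BD.
        (size,) = struct.unpack_from("<Q", descriptor, 2)
        return size, 11  # FLG + BD + 8 bytes + HC
    return None, 3       # FLG + BD + HC
```

The second element of the result gives the offset of the first data block relative to the FLG byte.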

__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF `
using zero as a seed,
and the full Frame Descriptor as an input
(including optional fields when they are present,
excluding the checksum byte itself).
A wrong checksum indicates an error in the descriptor.
Header checksum is informational and can be skipped.
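
To make the formula concrete, here is a sketch that computes the header checksum with a small pure-Python xxHash-32. The `xxh32` function below is an illustrative reimplementation based on the public xxHash algorithm description, written for readability; a real decoder would more likely call an existing xxHash library. The names `header_checksum` and `descriptor_fields` are hypothetical.

```python
import struct

# xxHash-32 prime constants, taken from the public xxHash description.
P1, P2, P3, P4, P5 = 0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F, 0x165667B1
MASK = 0xFFFFFFFF


def _rotl(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK


def xxh32(data: bytes, seed: int = 0) -> int:
    """Pure-Python xxHash-32, written for clarity rather than speed."""
    n, i = len(data), 0
    if n >= 16:
        # Four accumulators consume the input in 16-byte stripes.
        acc = [(seed + P1 + P2) & MASK, (seed + P2) & MASK,
               seed & MASK, (seed - P1) & MASK]
        while i + 16 <= n:
            for k, lane in enumerate(struct.unpack_from("<4I", data, i)):
                acc[k] = (_rotl((acc[k] + lane * P2) & MASK, 13) * P1) & MASK
            i += 16
        h = (_rotl(acc[0], 1) + _rotl(acc[1], 7)
             + _rotl(acc[2], 12) + _rotl(acc[3], 18)) & MASK
    else:
        h = (seed + P5) & MASK
    h = (h + n) & MASK
    while i + 4 <= n:  # remaining 4-byte words
        (word,) = struct.unpack_from("<I", data, i)
        h = (_rotl((h + word * P3) & MASK, 17) * P4) & MASK
        i += 4
    while i < n:  # remaining single bytes
        h = (_rotl((h + data[i] * P5) & MASK, 11) * P1) & MASK
        i += 1
    # Final avalanche.
    h ^= h >> 15
    h = (h * P2) & MASK
    h ^= h >> 13
    h = (h * P3) & MASK
    h ^= h >> 16
    return h


def header_checksum(descriptor_fields: bytes) -> int:
    """HC byte for FLG + BD (+ Content Size), i.e. the descriptor minus HC."""
    return (xxh32(descriptor_fields, 0) >> 8) & 0xFF
```

For instance, with FLG = 0x64 and BD = 0x40 and no Content Size, `header_checksum(bytes([0x64, 0x40]))` evaluates to 0xA7, so the complete descriptor would be `64 40 A7`.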


Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4-bytes, format is little-endian.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block
(the size does not include the block checksum if present).

Block Size shall never be larger than Block Maximum Size.
Such a thing could happen when compressing incompressible source data.
In such a case, the data block shall be passed in uncompressed format.
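
Splitting the Block Size field into its flag and its size can be sketched as below. The function name `decode_block_size` is hypothetical.

```python
import struct
from typing import Tuple


def decode_block_size(field: bytes) -> Tuple[int, bool]:
    """Return (size in bytes, is_uncompressed) for a 4-byte Block Size field."""
    (raw,) = struct.unpack("<I", field)  # little endian unsigned 32-bit
    is_uncompressed = bool(raw >> 31)    # highest bit
    size = raw & 0x7FFFFFFF              # all other bits
    return size, is_uncompressed
```

As an example, the bytes `0D 00 00 80` describe an uncompressed block of 13 bytes, while `0D 00 00 00` describes 13 bytes of LZ4-compressed data.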

__Data__

This is where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.
Uncompressed size of Data can be any size, up to “block maximum size”.
Note that a data block is not necessarily full :
an arbitrary “flush” may happen anytime. Any block can be “partially filled”.

__Block checksum__

Only present if the associated flag is set.
This is a 4-bytes checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

Block checksum is cumulative with Content checksum :
both can be present in the same frame.
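
Putting the block layout together, a decoder's outer loop might be sketched as follows. This is only an illustration under stated assumptions: `iter_blocks` is a hypothetical name, the EndMark is taken to be a Block Size field whose 32-bit value is zero, and block checksums are stepped over rather than verified. LZ4 decompression of the yielded data is out of scope here.

```python
import struct
from typing import Iterator, Tuple


def iter_blocks(buf: bytes, pos: int,
                has_block_checksum: bool) -> Iterator[Tuple[bytes, bool]]:
    """Yield (raw block data, is_uncompressed) until the EndMark is reached.

    `pos` is the offset of the first Block Size field in `buf`.
    """
    trailer = 4 if has_block_checksum else 0
    while True:
        if pos + 4 > len(buf):
            raise ValueError("truncated frame: missing Block Size")
        (raw,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if raw == 0:  # EndMark: a block size of zero ends the flow of blocks
            return
        size = raw & 0x7FFFFFFF
        end = pos + size
        if end + trailer > len(buf):
            raise ValueError("truncated frame: incomplete block")
        yield buf[pos:end], bool(raw >> 31)
        pos = end + trailer  # step over the optional block checksum
```

A caller would typically check each block's size against Block Maximum Size before allocating memory for decoding.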


Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
Their design is straightforward,
with the sole objective of allowing the decoder to quickly skip
over user-defined data and continue decoding.

To facilitate identification,
starting a flow of concatenated frames with a skippable frame is discouraged.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it is recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This makes the flow easier for file type identifiers to recognize.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(without including the magic number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.
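
Skipping such a frame only requires reading its 8-byte header. A minimal sketch, with the hypothetical function name `skip_skippable_frame`:

```python
import struct


def skip_skippable_frame(buf: bytes, pos: int) -> int:
    """Return the offset just past the skippable frame starting at `pos`."""
    if pos + 8 > len(buf):
        raise ValueError("truncated skippable frame header")
    magic, size = struct.unpack_from("<II", buf, pos)
    # Any of the 16 values 0x184D2A50..0x184D2A5F identifies a skippable frame.
    if (magic & 0xFFFFFFF0) != 0x184D2A50:
        raise ValueError("not a skippable frame")
    end = pos + 8 + size  # Frame Size counts User Data only
    if end > len(buf):
        raise ValueError("truncated skippable frame")
    return end
```

The returned offset is where the decoder would look for the next magic number.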


Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

This is where the actual compressed data stands.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This makes legacy frames compatible with frame concatenation.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within the acceptable range.
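
The implicit end of a legacy frame can be sketched as a loop that stops at EOF or at a known magic number. The names below are hypothetical, and the "acceptable range" is an assumption: this document does not define it, so the sketch borrows the worst-case bound commonly used by the LZ4 library for an 8 MB input.

```python
import struct
from typing import List, Tuple

LEGACY_MAGIC = 0x184C2102
LZ4_FRAME_MAGIC = 0x184D2204
LEGACY_BLOCK_SIZE = 8 * 1024 * 1024
# Assumption: upper bound on the compressed size of one 8 MB block.
LEGACY_MAX_CSIZE = LEGACY_BLOCK_SIZE + LEGACY_BLOCK_SIZE // 255 + 16


def is_frame_magic(value: int) -> bool:
    """True for the LZ4 frame, legacy frame and skippable frame magic numbers."""
    return (value in (LZ4_FRAME_MAGIC, LEGACY_MAGIC)
            or (value & 0xFFFFFFF0) == 0x184D2A50)


def read_legacy_frame(buf: bytes, pos: int = 0) -> Tuple[List[bytes], int]:
    """Return (compressed blocks, offset just past the legacy frame)."""
    if pos + 4 > len(buf) or struct.unpack_from("<I", buf, pos)[0] != LEGACY_MAGIC:
        raise ValueError("not a legacy frame")
    pos += 4
    blocks = []
    while pos < len(buf):  # reaching EOF ends the frame
        if pos + 4 > len(buf):
            raise ValueError("truncated Block Compressed Size")
        (value,) = struct.unpack_from("<I", buf, pos)
        if is_frame_magic(value):  # the next frame starts here
            break
        if value > LEGACY_MAX_CSIZE:
            raise ValueError("block size outside acceptable range")
        pos += 4
        if pos + value > len(buf):
            raise ValueError("truncated block")
        blocks.append(buf[pos:pos + value])
        pos += value
    return blocks, pos
```

Each returned block would then be handed to an LZ4 block decoder, which is outside the scope of this sketch.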


Version changes
---------------

1.5.1 : changed format to MarkDown compatible

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument