1LZMA specification (DRAFT version)
2----------------------------------
3
4Author: Igor Pavlov
5Date: 2015-06-14
6
7This specification defines the format of LZMA compressed data and lzma file format.
8
9Notation
10--------
11
12We use the syntax of C++ programming language.
13We use the following types in C++ code:
14  unsigned - unsigned integer, at least 16 bits in size
15  int      - signed integer, at least 16 bits in size
16  UInt64   - 64-bit unsigned integer
17  UInt32   - 32-bit unsigned integer
18  UInt16   - 16-bit unsigned integer
19  Byte     - 8-bit unsigned integer
20  bool     - boolean type with two possible values: false, true
21
22
23lzma file format
24================
25
26The lzma file contains the raw LZMA stream and the header with related properties.
27
28The files in that format use ".lzma" extension.
29
30The lzma file format layout:
31
32Offset Size Description
33
34  0     1   LZMA model properties (lc, lp, pb) in encoded form
35  1     4   Dictionary size (32-bit unsigned integer, little-endian)
36  5     8   Uncompressed size (64-bit unsigned integer, little-endian)
37 13         Compressed data (LZMA stream)
38
39LZMA properties:
40
41    name  Range          Description
42
43      lc  [0, 8]         the number of "literal context" bits
44      lp  [0, 4]         the number of "literal pos" bits
45      pb  [0, 4]         the number of "pos" bits
46dictSize  [0, 2^32 - 1]  the dictionary size
47
48The following code encodes LZMA properties:
49
50void EncodeProperties(Byte *properties)
51{
52  properties[0] = (Byte)((pb * 5 + lp) * 9 + lc);
53  Set_UInt32_LittleEndian(properties + 1, dictSize);
54}
55
56If the value of dictionary size in properties is smaller than (1 << 12),
57the LZMA decoder must set the dictionary size variable to (1 << 12).
58
59#define LZMA_DIC_MIN (1 << 12)
60
61  unsigned lc, pb, lp;
62  UInt32 dictSize;
63  UInt32 dictSizeInProperties;
64
65  void DecodeProperties(const Byte *properties)
66  {
67    unsigned d = properties[0];
68    if (d >= (9 * 5 * 5))
69      throw "Incorrect LZMA properties";
70    lc = d % 9;
71    d /= 9;
72    pb = d / 5;
73    lp = d % 5;
74    dictSizeInProperties = 0;
75    for (int i = 0; i < 4; i++)
76      dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i);
77    dictSize = dictSizeInProperties;
78    if (dictSize < LZMA_DIC_MIN)
79      dictSize = LZMA_DIC_MIN;
80  }
81
82If "Uncompressed size" field contains ones in all 64 bits, it means that
83uncompressed size is unknown and there is the "end marker" in stream,
84that indicates the end of decoding point.
85In opposite case, if the value from "Uncompressed size" field is not
86equal to ((2^64) - 1), the LZMA stream decoding must be finished after
87specified number of bytes (Uncompressed size) is decoded. And if there
88is the "end marker", the LZMA decoder must read that marker also.
89
90
91The new scheme to encode LZMA properties
92----------------------------------------
93
94If LZMA compression is used for some another format, it's recommended to
95use a new improved scheme to encode LZMA properties. That new scheme was
96used in xz format that uses the LZMA2 compression algorithm.
97The LZMA2 is a new compression algorithm that is based on the LZMA algorithm.
98
99The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports
100only reduced set of dictionary sizes:
101  (2 << 11), (3 << 11),
102  (2 << 12), (3 << 12),
103  ...
104  (2 << 30), (3 << 30),
105  (2 << 31) - 1
106
107The dictionary size can be extracted from encoded value with the following code:
108
109  dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));
110
111Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of
112"lc" and "lp" properties:
113
114  if (lc + lp > 4)
115    throw "Unsupported properties: (lc + lp) > 4";
116
117There are some advantages for LZMA decoder with such (lc + lp) value
118limitation. It reduces the maximum size of tables allocated by decoder.
119And it reduces the complexity of initialization procedure, that can be
120important to keep high speed of decoding of big number of small LZMA streams.
121
122It's recommended to use that limitation (lc + lp <= 4) for any new format
123that uses LZMA compression. Note that the combinations of "lc" and "lp"
124parameters, where (lc + lp > 4), can provide significant improvement in
125compression ratio only in some rare cases.
126
127The LZMA properties can be encoded into two bytes in new scheme:
128
129Offset Size Description
130
131  0     1   The dictionary size encoded with LZMA2 scheme
132  1     1   LZMA model properties (lc, lp, pb) in encoded form
133
134
135The RAM usage
136=============
137
138The RAM usage for LZMA decoder is determined by the following parts:
139
1401) The Sliding Window (from 4 KiB to 4 GiB).
1412) The probability model counter arrays (arrays of 16-bit variables).
1423) Some additional state variables (about 10 variables of 32-bit integers).
143
144
145The RAM usage for Sliding Window
146--------------------------------
147
148There are two main scenarios of decoding:
149
1501) The decoding of full stream to one RAM buffer.
151
152  If we decode full LZMA stream to one output buffer in RAM, the decoder
153  can use that output buffer as sliding window. So the decoder doesn't
154  need additional buffer allocated for sliding window.
155
1562) The decoding to some external storage.
157
158  If we decode LZMA stream to external storage, the decoder must allocate
159  the buffer for sliding window. The size of that buffer must be equal
160  or larger than the value of dictionary size from properties of LZMA stream.
161
162In this specification we describe the code for decoding to some external
163storage. The optimized version of code for decoding of full stream to one
164output RAM buffer can require some minor changes in code.
165
166
167The RAM usage for the probability model counters
168------------------------------------------------
169
170The size of the probability model counter arrays is calculated with the
171following formula:
172
173size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc))
174
175Each probability model counter is 11-bit unsigned integer.
176If we use 16-bit integer variables (2-byte integers) for these probability
177model counters, the RAM usage required by probability model counter arrays
178can be estimated with the following formula:
179
180  RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc))
181
182For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is
183
184  RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB
185
186The maximum RAM state usage is required for decoding the stream with lp = 4
187and lc = 8:
188
189  RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB
190
191If the decoder uses LZMA2's limited property condition
192(lc + lp <= 4), the RAM usage will be not larger than
193
194  RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB
195
196
197The RAM usage for encoder
198-------------------------
199
200There are many variants for LZMA encoding code.
201These variants have different values for memory consumption.
202Note that memory consumption for LZMA Encoder can not be
203smaller than memory consumption of LZMA Decoder for same stream.
204
205The RAM usage required by modern effective implementation of
206LZMA Encoder can be estimated with the following formula:
207
208  Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize.
209
210But there are some modes of the encoder that require less memory.
211
212
213LZMA Decoding
214=============
215
216The LZMA compression algorithm uses LZ-based compression with Sliding Window
217and Range Encoding as entropy coding method.
218
219
220Sliding Window
221--------------
222
223LZMA uses Sliding Window compression similar to LZ77 algorithm.
224
225LZMA stream must be decoded to the sequence that consists
226of MATCHES and LITERALS:
227
228  - a LITERAL is a 8-bit character (one byte).
229    The decoder just puts that LITERAL to the uncompressed stream.
230
231  - a MATCH is a pair of two numbers (DISTANCE-LENGTH pair).
232    The decoder takes one byte exactly "DISTANCE" characters behind
233    current position in the uncompressed stream and puts it to
234    uncompressed stream. The decoder must repeat it "LENGTH" times.
235
236The "DISTANCE" can not be larger than dictionary size.
237And the "DISTANCE" can not be larger than the number of bytes in
238the uncompressed stream that were decoded before that match.
239
240In this specification we use cyclic buffer to implement Sliding Window
241for LZMA decoder:
242
243class COutWindow
244{
245  Byte *Buf;
246  UInt32 Pos;
247  UInt32 Size;
248  bool IsFull;
249
250public:
251  unsigned TotalPos;
252  COutStream OutStream;
253
254  COutWindow(): Buf(NULL) {}
255  ~COutWindow() { delete []Buf; }
256
257  void Create(UInt32 dictSize)
258  {
259    Buf = new Byte[dictSize];
260    Pos = 0;
261    Size = dictSize;
262    IsFull = false;
263    TotalPos = 0;
264  }
265
266  void PutByte(Byte b)
267  {
268    TotalPos++;
269    Buf[Pos++] = b;
270    if (Pos == Size)
271    {
272      Pos = 0;
273      IsFull = true;
274    }
275    OutStream.WriteByte(b);
276  }
277
278  Byte GetByte(UInt32 dist) const
279  {
280    return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos];
281  }
282
283  void CopyMatch(UInt32 dist, unsigned len)
284  {
285    for (; len > 0; len--)
286      PutByte(GetByte(dist));
287  }
288
289  bool CheckDistance(UInt32 dist) const
290  {
291    return dist <= Pos || IsFull;
292  }
293
294  bool IsEmpty() const
295  {
296    return Pos == 0 && !IsFull;
297  }
298};
299
300
301In another implementation it's possible to use one buffer that contains
302Sliding Window and the whole data stream after uncompressing.
303
304
305Range Decoder
306-------------
307
308LZMA algorithm uses Range Encoding (1) as entropy coding method.
309
310LZMA stream contains just one very big number in big-endian encoding.
311LZMA decoder uses the Range Decoder to extract a sequence of binary
312symbols from that big number.
313
314The state of the Range Decoder:
315
316struct CRangeDecoder
317{
318  UInt32 Range;
319  UInt32 Code;
320  InputStream *InStream;
321
322  bool Corrupted;
323}
324
325The notes about UInt32 type for the "Range" and "Code" variables:
326
327  It's possible to use 64-bit (unsigned or signed) integer type
328  for the "Range" and the "Code" variables instead of 32-bit unsigned,
329  but some additional code must be used to truncate the values to
330  low 32-bits after some operations.
331
332  If the programming language does not support 32-bit unsigned integer type
333  (like in case of JAVA language), it's possible to use 32-bit signed integer,
334  but some code must be changed. For example, it's required to change the code
335  that uses comparison operations for UInt32 variables in this specification.
336
337The Range Decoder can be in some states that can be treated as
338"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted":
339
340  (Corrupted == false), if the Range Decoder has not detected any corruption.
341  (Corrupted == true), if the Range Decoder has detected some corruption.
342
343The reference LZMA Decoder ignores the value of the "Corrupted" variable.
344So it continues to decode the stream, even if the corruption can be detected
345in the Range Decoder. To provide the full compatibility with output of the
346reference LZMA Decoder, another LZMA Decoder implementations must also
347ignore the value of the "Corrupted" variable.
348
349The LZMA Encoder is required to create only such LZMA streams, that will not
350lead the Range Decoder to states, where the "Corrupted" variable is set to true.
351
352The Range Decoder reads first 5 bytes from input stream to initialize
353the state:
354
355bool CRangeDecoder::Init()
356{
357  Corrupted = false;
358  Range = 0xFFFFFFFF;
359  Code = 0;
360
361  Byte b = InStream->ReadByte();
362
363  for (int i = 0; i < 4; i++)
364    Code = (Code << 8) | InStream->ReadByte();
365
366  if (b != 0 || Code == Range)
367    Corrupted = true;
368  return b == 0;
369}
370
371The LZMA Encoder always writes ZERO in initial byte of compressed stream.
372That scheme allows to simplify the code of the Range Encoder in the
373LZMA Encoder. If initial byte is not equal to ZERO, the LZMA Decoder must
374stop decoding and report error.
375
376After the last bit of data was decoded by Range Decoder, the value of the
377"Code" variable must be equal to 0. The LZMA Decoder must check it by
378calling the IsFinishedOK() function:
379
380  bool IsFinishedOK() const { return Code == 0; }
381
382If there is corruption in data stream, there is big probability that
383the "Code" value will be not equal to 0 in the Finish() function. So that
384check in the IsFinishedOK() function provides very good feature for
385corruption detection.
386
387The value of the "Range" variable before each bit decoding can not be smaller
388than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in
389described range.
390
391#define kTopValue ((UInt32)1 << 24)
392
393void CRangeDecoder::Normalize()
394{
395  if (Range < kTopValue)
396  {
397    Range <<= 8;
398    Code = (Code << 8) | InStream->ReadByte();
399  }
400}
401
402Notes: if the size of the "Code" variable is larger than 32 bits, it's
403required to keep only low 32 bits of the "Code" variable after the change
404in Normalize() function.
405
406If the LZMA Stream is not corrupted, the value of the "Code" variable is
407always smaller than value of the "Range" variable.
408But the Range Decoder ignores some types of corruptions, so the value of
409the "Code" variable can be equal or larger than value of the "Range" variable
410for some "Corrupted" archives.
411
412
413LZMA uses Range Encoding only with binary symbols of two types:
414  1) binary symbols with fixed and equal probabilities (direct bits)
415  2) binary symbols with predicted probabilities
416
417The DecodeDirectBits() function decodes the sequence of direct bits:
418
419UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits)
420{
421  UInt32 res = 0;
422  do
423  {
424    Range >>= 1;
425    Code -= Range;
426    UInt32 t = 0 - ((UInt32)Code >> 31);
427    Code += Range & t;
428
429    if (Code == Range)
430      Corrupted = true;
431
432    Normalize();
433    res <<= 1;
434    res += t + 1;
435  }
436  while (--numBits);
437  return res;
438}
439
440
441The Bit Decoding with Probability Model
442---------------------------------------
443
444The task of Bit Probability Model is to estimate probabilities of binary
445symbols. And then it provides the Range Decoder with that information.
446The better prediction provides better compression ratio.
447The Bit Probability Model uses statistical data of previous decoded
448symbols.
449
450That estimated probability is presented as 11-bit unsigned integer value
451that represents the probability of symbol "0".
452
453#define kNumBitModelTotalBits 11
454
455Mathematical probabilities can be presented with the following formulas:
456     probability(symbol_0) = prob / 2048.
457     probability(symbol_1) =  1 - Probability(symbol_0) =
458                           =  1 - prob / 2048 =
459                           =  (2048 - prob) / 2048
460where the "prob" variable contains 11-bit integer probability counter.
461
462It's recommended to use 16-bit unsigned integer type, to store these 11-bit
463probability values:
464
465typedef UInt16 CProb;
466
467Each probability value must be initialized with value ((1 << 11) / 2),
468that represents the state, where probabilities of symbols 0 and 1
469are equal to 0.5:
470
471#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2)
472
473The INIT_PROBS macro is used to initialize the array of CProb variables:
474
475#define INIT_PROBS(p) \
476 { for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }
477
478
479The DecodeBit() function decodes one bit.
480The LZMA decoder provides the pointer to CProb variable that contains
481information about estimated probability for symbol 0 and the Range Decoder
482updates that CProb variable after decoding. The Range Decoder increases
483estimated probability of the symbol that was decoded:
484
485#define kNumMoveBits 5
486
487unsigned CRangeDecoder::DecodeBit(CProb *prob)
488{
489  unsigned v = *prob;
490  UInt32 bound = (Range >> kNumBitModelTotalBits) * v;
491  unsigned symbol;
492  if (Code < bound)
493  {
494    v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits;
495    Range = bound;
496    symbol = 0;
497  }
498  else
499  {
500    v -= v >> kNumMoveBits;
501    Code -= bound;
502    Range -= bound;
503    symbol = 1;
504  }
505  *prob = (CProb)v;
506  Normalize();
507  return symbol;
508}
509
510
511The Binary Tree of bit model counters
512-------------------------------------
513
514LZMA uses a tree of Bit model variables to decode symbol that needs
515several bits for storing. There are two versions of such trees in LZMA:
516  1) the tree that decodes bits from high bit to low bit (the normal scheme).
517  2) the tree that decodes bits from low bit to high bit (the reverse scheme).
518
519Each binary tree structure supports different size of decoded symbol
520(the size of binary sequence that contains value of symbol).
521If that size of decoded symbol is "NumBits" bits, the tree structure
522uses the array of (2 << NumBits) counters of CProb type.
523But only ((2 << NumBits) - 1) items are used by encoder and decoder.
524The first item (the item with index equal to 0) in array is unused.
525That scheme with unused array's item allows to simplify the code.
526
527unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc)
528{
529  unsigned m = 1;
530  unsigned symbol = 0;
531  for (unsigned i = 0; i < numBits; i++)
532  {
533    unsigned bit = rc->DecodeBit(&probs[m]);
534    m <<= 1;
535    m += bit;
536    symbol |= (bit << i);
537  }
538  return symbol;
539}
540
541template <unsigned NumBits>
542class CBitTreeDecoder
543{
544  CProb Probs[(unsigned)1 << NumBits];
545
546public:
547
548  void Init()
549  {
550    INIT_PROBS(Probs);
551  }
552
553  unsigned Decode(CRangeDecoder *rc)
554  {
555    unsigned m = 1;
556    for (unsigned i = 0; i < NumBits; i++)
557      m = (m << 1) + rc->DecodeBit(&Probs[m]);
558    return m - ((unsigned)1 << NumBits);
559  }
560
561  unsigned ReverseDecode(CRangeDecoder *rc)
562  {
563    return BitTreeReverseDecode(Probs, NumBits, rc);
564  }
565};
566
567
568LZ part of LZMA
569---------------
570
571LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.
572
573
574The Literal Decoding
575--------------------
576
577The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where
578each table contains 0x300 CProb values:
579
580  CProb *LitProbs;
581
582  void CreateLiterals()
583  {
584    LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];
585  }
586
587  void InitLiterals()
588  {
589    UInt32 num = (UInt32)0x300 << (lc + lp);
590    for (UInt32 i = 0; i < num; i++)
591      LitProbs[i] = PROB_INIT_VAL;
592  }
593
594To select the table for decoding it uses the context that consists of
595(lc) high bits from previous literal and (lp) low bits from value that
596represents current position in outputStream.
597
598If (State > 7), the Literal Decoder also uses "matchByte" that represents
599the byte in OutputStream at position the is the DISTANCE bytes before
600current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair
601of latest decoded match.
602
603The following code decodes one literal and puts it to Sliding Window buffer:
604
605  void DecodeLiteral(unsigned state, UInt32 rep0)
606  {
607    unsigned prevByte = 0;
608    if (!OutWindow.IsEmpty())
609      prevByte = OutWindow.GetByte(1);
610
611    unsigned symbol = 1;
612    unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc));
613    CProb *probs = &LitProbs[(UInt32)0x300 * litState];
614
615    if (state >= 7)
616    {
617      unsigned matchByte = OutWindow.GetByte(rep0 + 1);
618      do
619      {
620        unsigned matchBit = (matchByte >> 7) & 1;
621        matchByte <<= 1;
622        unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);
623        symbol = (symbol << 1) | bit;
624        if (matchBit != bit)
625          break;
626      }
627      while (symbol < 0x100);
628    }
629    while (symbol < 0x100)
630      symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);
631    OutWindow.PutByte((Byte)(symbol - 0x100));
632  }
633
634
635The match length decoding
636-------------------------
637
638The match length decoder returns normalized (zero-based value)
639length of match. That value can be converted to real length of the match
640with the following code:
641
642#define kMatchMinLen 2
643
644    matchLen = len + kMatchMinLen;
645
646The match length decoder can return the values from 0 to 271.
647And the corresponded real match length values can be in the range
648from 2 to 273.
649
650The following scheme is used for the match length encoding:
651
652  Binary encoding    Binary Tree structure    Zero-based match length
653  sequence                                    (binary + decimal):
654
655  0 xxx              LowCoder[posState]       xxx
656  1 0 yyy            MidCoder[posState]       yyy + 8
657  1 1 zzzzzzzz       HighCoder                zzzzzzzz + 16
658
659LZMA uses bit model variable "Choice" to decode the first selection bit.
660
661If the first selection bit is equal to 0, the decoder uses binary tree
662  LowCoder[posState] to decode 3-bit zero-based match length (xxx).
663
664If the first selection bit is equal to 1, the decoder uses bit model
665  variable "Choice2" to decode the second selection bit.
666
667  If the second selection bit is equal to 0, the decoder uses binary tree
668    MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match
669    length is equal to (yyy + 8).
670
671  If the second selection bit is equal to 1, the decoder uses binary tree
672    HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based
673    match length is equal to (zzzzzzzz + 16).
674
675LZMA uses "posState" value as context to select the binary tree
676from LowCoder and MidCoder binary tree arrays:
677
678    unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
679
680The full code of the length decoder:
681
682class CLenDecoder
683{
684  CProb Choice;
685  CProb Choice2;
686  CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];
687  CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];
688  CBitTreeDecoder<8> HighCoder;
689
690public:
691
692  void Init()
693  {
694    Choice = PROB_INIT_VAL;
695    Choice2 = PROB_INIT_VAL;
696    HighCoder.Init();
697    for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)
698    {
699      LowCoder[i].Init();
700      MidCoder[i].Init();
701    }
702  }
703
704  unsigned Decode(CRangeDecoder *rc, unsigned posState)
705  {
706    if (rc->DecodeBit(&Choice) == 0)
707      return LowCoder[posState].Decode(rc);
708    if (rc->DecodeBit(&Choice2) == 0)
709      return 8 + MidCoder[posState].Decode(rc);
710    return 16 + HighCoder.Decode(rc);
711  }
712};
713
714The LZMA decoder uses two instances of CLenDecoder class.
715The first instance is for the matches of "Simple Match" type,
716and the second instance is for the matches of "Rep Match" type:
717
718  CLenDecoder LenDecoder;
719  CLenDecoder RepLenDecoder;
720
721
722The match distance decoding
723---------------------------
724
725LZMA supports dictionary sizes up to 4 GiB minus 1.
726The value of match distance (decoded by distance decoder) can be
727from 1 to 2^32. But the distance value that is equal to 2^32 is used to
728indicate the "End of stream" marker. So real largest match distance
729that is used for LZ-window match is (2^32 - 1).
730
731LZMA uses normalized match length (zero-based length)
732to calculate the context state "lenState" do decode the distance value:
733
734#define kNumLenToPosStates 4
735
736    unsigned lenState = len;
737    if (lenState > kNumLenToPosStates - 1)
738      lenState = kNumLenToPosStates - 1;
739
740The distance decoder returns the "dist" value that is zero-based value
741of match distance. The real match distance can be calculated with the
742following code:
743
744  matchDistance = dist + 1;
745
746The state of the distance decoder and the initialization code:
747
748  #define kEndPosModelIndex 14
749  #define kNumFullDistances (1 << (kEndPosModelIndex >> 1))
750  #define kNumAlignBits 4
751
752  CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];
753  CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];
754  CBitTreeDecoder<kNumAlignBits> AlignDecoder;
755
756  void InitDist()
757  {
758    for (unsigned i = 0; i < kNumLenToPosStates; i++)
759      PosSlotDecoder[i].Init();
760    AlignDecoder.Init();
761    INIT_PROBS(PosDecoders);
762  }
763
764At first stage the distance decoder decodes 6-bit "posSlot" value with bit
765tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different
766"posSlot" values.
767
768    unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
769
770The encoding scheme for distance value is shown in the following table:
771
772posSlot (decimal) /
773      zero-based distance (binary)
774 0    0
775 1    1
776 2    10
777 3    11
778
779 4    10 x
780 5    11 x
781 6    10 xx
782 7    11 xx
783 8    10 xxx
784 9    11 xxx
78510    10 xxxx
78611    11 xxxx
78712    10 xxxxx
78813    11 xxxxx
789
79014    10 yy zzzz
79115    11 yy zzzz
79216    10 yyy zzzz
79317    11 yyy zzzz
794...
79562    10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
79663    11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
797
798where
799  "x ... x" means the sequence of binary symbols encoded with binary tree and
800      "Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13.
801  "y" means direct bit encoded with range coder.
802  "zzzz" means the sequence of four binary symbols encoded with binary
803      tree with "Reverse" scheme, where one common binary tree "AlignDecoder"
804      is used for all posSlot values.
805
806If (posSlot < 4), the "dist" value is equal to posSlot value.
807
808If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of
809  the high bits of "dist" value and the number of the low bits.
810
811  If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders.
812    (one separated bit tree decoder per one posSlot value) and "Reverse" scheme.
813    In this implementation we use one CProb array "PosDecoders" that contains
814    all CProb variables for all these bit decoders.
815
816  if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct
817    bits from RangeDecoder and the low 4 bits are decoded with a bit tree
818    decoder "AlignDecoder" with "Reverse" scheme.
819
820The code to decode zero-based match distance:
821
822  unsigned DecodeDistance(unsigned len)
823  {
824    unsigned lenState = len;
825    if (lenState > kNumLenToPosStates - 1)
826      lenState = kNumLenToPosStates - 1;
827
828    unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
829    if (posSlot < 4)
830      return posSlot;
831
832    unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1);
833    UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits);
834    if (posSlot < kEndPosModelIndex)
835      dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec);
836    else
837    {
838      dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits;
839      dist += AlignDecoder.ReverseDecode(&RangeDec);
840    }
841    return dist;
842  }
843
844
845
846LZMA Decoding modes
847-------------------
848
849There are 2 types of LZMA streams:
850
8511) The stream with "End of stream" marker.
8522) The stream without "End of stream" marker.
853
854And the LZMA Decoder supports 3 modes of decoding:
855
8561) The unpack size is undefined. The LZMA decoder stops decoding after
857   getting "End of stream" marker.
858   The input variables for that case:
859
860      markerIsMandatory = true
861      unpackSizeDefined = false
862      unpackSize contains any value
863
8642) The unpack size is defined and LZMA decoder supports both variants,
865   where the stream can contain "End of stream" marker or the stream is
866   finished without "End of stream" marker. The LZMA decoder must detect
867   any of these situations.
868   The input variables for that case:
869
870      markerIsMandatory = false
871      unpackSizeDefined = true
872      unpackSize contains unpack size
873
8743) The unpack size is defined and the LZMA stream must contain
875   "End of stream" marker
876   The input variables for that case:
877
878      markerIsMandatory = true
879      unpackSizeDefined = true
880      unpackSize contains unpack size
881
882
883The main loop of decoder
884------------------------
885
886The main loop of LZMA decoder:
887
888Initialize the LZMA state.
889loop
890{
891  // begin of loop
892  Check "end of stream" conditions.
893  Decode Type of MATCH / LITERAL.
894    If it's LITERAL, decode LITERAL value and put the LITERAL to Window.
895    If it's MATCH, decode the length of match and the match distance.
896        Check error conditions, check end of stream conditions and copy
897        the sequence of match bytes from sliding window to current position
898        in window.
899  Go to begin of loop
900}
901
902The reference implementation of LZMA decoder uses "unpackSize" variable
903to keep the number of remaining bytes in output stream. So it reduces
904"unpackSize" value after each decoded LITERAL or MATCH.
905
906The following code contains the "end of stream" condition check at the start
907of the loop:
908
909    if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory)
910      if (RangeDec.IsFinishedOK())
911        return LZMA_RES_FINISHED_WITHOUT_MARKER;
912
913LZMA uses three types of matches:
914
9151) "Simple Match" -     the match with distance value encoded with bit models.
916
9172) "Rep Match" -        the match that uses the distance from distance
918                        history table.
919
9203) "Short Rep Match" -  the match of single byte length, that uses the latest
921                        distance from distance history table.
922
923The LZMA decoder keeps the history of latest 4 match distances that were used
924by decoder. That set of 4 variables contains zero-based match distances and
925these variables are initialized with zero values:
926
927  UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0;
928
929The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:
930
931#define kNumStates 12
932#define kNumPosBitsMax 4
933
934  CProb IsMatch[kNumStates << kNumPosBitsMax];
935  CProb IsRep[kNumStates];
936  CProb IsRepG0[kNumStates];
937  CProb IsRepG1[kNumStates];
938  CProb IsRepG2[kNumStates];
939  CProb IsRep0Long[kNumStates << kNumPosBitsMax];
940
941The decoder uses "state" variable value to select exact variable
942from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays.
943The "state" variable can get the value from 0 to 11.
944Initial value for "state" variable is zero:
945
946  unsigned state = 0;
947
948The "state" variable is updated after each LITERAL or MATCH with one of the
949following functions:
950
951unsigned UpdateState_Literal(unsigned state)
952{
953  if (state < 4) return 0;
954  else if (state < 10) return state - 3;
955  else return state - 6;
956}
957unsigned UpdateState_Match   (unsigned state) { return state < 7 ? 7 : 10; }
958unsigned UpdateState_Rep     (unsigned state) { return state < 7 ? 8 : 11; }
959unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }
960
961The decoder calculates "state2" variable value to select exact variable from
962"IsMatch" and "IsRep0Long" arrays:
963
964unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
965unsigned state2 = (state << kNumPosBitsMax) + posState;
966
967The decoder uses the following code flow scheme to select exact
968type of LITERAL or MATCH:
969
970IsMatch[state2] decode
971  0 - the Literal
972  1 - the Match
973    IsRep[state] decode
974      0 - Simple Match
975      1 - Rep Match
976        IsRepG0[state] decode
977          0 - the distance is rep0
978            IsRep0Long[state2] decode
979              0 - Short Rep Match
980              1 - Rep Match 0
981          1 -
982            IsRepG1[state] decode
983              0 - Rep Match 1
984              1 -
985                IsRepG2[state] decode
986                  0 - Rep Match 2
987                  1 - Rep Match 3
988
989
990LITERAL symbol
991--------------
992If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type.
993
994At first the LZMA decoder must check that it doesn't exceed
995specified uncompressed size:
996
997      if (unpackSizeDefined && unpackSize == 0)
998        return LZMA_RES_ERROR;
999
1000Then it decodes literal value and puts it to sliding window:
1001
1002      DecodeLiteral(state, rep0);
1003
1004Then the decoder must update the "state" value and "unpackSize" value;
1005
1006      state = UpdateState_Literal(state);
1007      unpackSize--;
1008
1009Then the decoder must go to the begin of main loop to decode next Match or Literal.
1010
1011
1012Simple Match
1013------------
1014
1015If the value "1" was decoded with IsMatch[state2] decoding,
1016we have the "Simple Match" type.
1017
1018The distance history table is updated with the following scheme:
1019
1020      rep3 = rep2;
1021      rep2 = rep1;
1022      rep1 = rep0;
1023
1024The zero-based length is decoded with "LenDecoder":
1025
1026      len = LenDecoder.Decode(&RangeDec, posState);
1027
1028The state is update with UpdateState_Match function:
1029
1030      state = UpdateState_Match(state);
1031
1032and the new "rep0" value is decoded with DecodeDistance:
1033
1034      rep0 = DecodeDistance(len);
1035
1036That "rep0" will be used as zero-based distance for current match.
1037
1038If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have
1039"End of stream" marker, so we can stop decoding and check finishing
1040condition in Range Decoder:
1041
1042      if (rep0 == 0xFFFFFFFF)
1043        return RangeDec.IsFinishedOK() ?
1044            LZMA_RES_FINISHED_WITH_MARKER :
1045            LZMA_RES_ERROR;
1046
1047If uncompressed size is defined, LZMA decoder must check that it doesn't
1048exceed that specified uncompressed size:
1049
1050      if (unpackSizeDefined && unpackSize == 0)
1051        return LZMA_RES_ERROR;
1052
1053Also the decoder must check that "rep0" value is not larger than dictionary size
1054and is not larger than the number of already decoded bytes:
1055
1056      if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0))
1057        return LZMA_RES_ERROR;
1058
1059Then the decoder must copy match bytes as described in
1060"The match symbols copying" section.
1061
1062
1063Rep Match
1064---------
1065
1066If the LZMA decoder has decoded the value "1" with IsRep[state] variable,
1067we have "Rep Match" type.
1068
1069At first the LZMA decoder must check that it doesn't exceed
1070specified uncompressed size:
1071
1072      if (unpackSizeDefined && unpackSize == 0)
1073        return LZMA_RES_ERROR;
1074
1075Also the decoder must return error, if the LZ window is empty:
1076
1077      if (OutWindow.IsEmpty())
1078        return LZMA_RES_ERROR;
1079
1080If the match type is "Rep Match", the decoder uses one of the 4 variables of
1081distance history table to get the value of distance for current match.
1082And there are 4 corresponding ways of decoding flow.
1083
1084The decoder updates the distance history with the following scheme
1085depending from type of match:
1086
1087- "Rep Match 0" or "Short Rep Match":
1088      ; LZMA doesn't update the distance history
1089
1090- "Rep Match 1":
1091      UInt32 dist = rep1;
1092      rep1 = rep0;
1093      rep0 = dist;
1094
1095- "Rep Match 2":
1096      UInt32 dist = rep2;
1097      rep2 = rep1;
1098      rep1 = rep0;
1099      rep0 = dist;
1100
1101- "Rep Match 3":
1102      UInt32 dist = rep3;
1103      rep3 = rep2;
1104      rep2 = rep1;
1105      rep1 = rep0;
1106      rep0 = dist;
1107
1108Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long",
1109"IsRepG1", "IsRepG2".
1110
1111If the subtype is "Short Rep Match", the decoder updates the state, puts
1112the one byte from window to current position in window and goes to next
1113MATCH/LITERAL symbol (the begin of main loop):
1114
1115          state = UpdateState_ShortRep(state);
1116          OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));
1117          unpackSize--;
1118          continue;
1119
1120In other cases (Rep Match 0/1/2/3), it decodes the zero-based
1121length of match with "RepLenDecoder" decoder:
1122
1123      len = RepLenDecoder.Decode(&RangeDec, posState);
1124
1125Then it updates the state:
1126
1127      state = UpdateState_Rep(state);
1128
1129Then the decoder must copy match bytes as described in
1130"The Match symbols copying" section.
1131
1132
1133The match symbols copying
1134-------------------------
1135
1136If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must
1137copy the sequence of bytes with calculated match distance and match length.
1138If uncompressed size is defined, LZMA decoder must check that it doesn't
1139exceed that specified uncompressed size:
1140
1141    len += kMatchMinLen;
1142    bool isError = false;
1143    if (unpackSizeDefined && unpackSize < len)
1144    {
1145      len = (unsigned)unpackSize;
1146      isError = true;
1147    }
1148    OutWindow.CopyMatch(rep0 + 1, len);
1149    unpackSize -= len;
1150    if (isError)
1151      return LZMA_RES_ERROR;
1152
1153Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL.
1154
1155
1156
1157NOTES
1158-----
1159
1160This specification doesn't describe the variant of decoder implementation
1161that supports partial decoding. Such partial decoding case can require some
1162changes in "end of stream" condition checks code. Also such code
1163can use additional status codes, returned by decoder.
1164
1165This specification uses C++ code with templates to simplify describing.
1166The optimized version of LZMA decoder doesn't need templates.
1167Such optimized version can use just two arrays of CProb variables:
1168  1) The dynamic array of CProb variables allocated for the Literal Decoder.
1169  2) The one common array that contains all other CProb variables.
1170
1171
1172References:
1173
11741. G. N. N. Martin, Range encoding: an algorithm for removing redundancy
1175   from a digitized message, Video & Data Recording Conference,
1176   Southampton, UK, July 24-27, 1979.
1177