1LZMA specification (DRAFT version) 2---------------------------------- 3 4Author: Igor Pavlov 5Date: 2013-07-28 6 7This specification defines the format of LZMA compressed data and lzma file format. 8 9Notation 10-------- 11 12We use the syntax of C++ programming language. 13We use the following types in C++ code: 14 unsigned - unsigned integer, at least 16 bits in size 15 int - signed integer, at least 16 bits in size 16 UInt64 - 64-bit unsigned integer 17 UInt32 - 32-bit unsigned integer 18 UInt16 - 16-bit unsigned integer 19 Byte - 8-bit unsigned integer 20 bool - boolean type with two possible values: false, true 21 22 23lzma file format 24================ 25 26The lzma file contains the raw LZMA stream and the header with related properties. 27 28The files in that format use ".lzma" extension. 29 30The lzma file format layout: 31 32Offset Size Description 33 34 0 1 LZMA model properties (lc, lp, pb) in encoded form 35 1 4 Dictionary size (32-bit unsigned integer, little-endian) 36 5 8 Uncompressed size (64-bit unsigned integer, little-endian) 37 13 Compressed data (LZMA stream) 38 39LZMA properties: 40 41 name Range Description 42 43 lc [0, 8] the number of "literal context" bits 44 lp [0, 4] the number of "literal pos" bits 45 pb [0, 4] the number of "pos" bits 46dictSize [0, 2^32 - 1] the dictionary size 47 48The following code encodes LZMA properties: 49 50void EncodeProperties(Byte *properties) 51{ 52 properties[0] = (Byte)((pb * 5 + lp) * 9 + lc); 53 Set_UInt32_LittleEndian(properties + 1, dictSize); 54} 55 56If the value of dictionary size in properties is smaller than (1 << 12), 57the LZMA decoder must set the dictionary size variable to (1 << 12). 58 59#define LZMA_DIC_MIN (1 << 12) 60 61 unsigned lc, pb, lp; 62 UInt32 dictSize; 63 UInt32 dictSizeInProperties; 64 65 void DecodeProperties(const Byte *properties) 66 { 67 unsigned d = properties[0]; 68 if (d >= (9 * 5 * 5)) 69 throw "Incorrect LZMA properties"; 70 lc = d % 9; 71 d /= 9; 72 pb = d / 5; 73 lp = d % 5; 74 dictSizeInProperties = 0; 75 for (int i = 0; i < 4; i++) 76 dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i); 77 dictSize = dictSizeInProperties; 78 if (dictSize < LZMA_DIC_MIN) 79 dictSize = LZMA_DIC_MIN; 80 } 81 82If "Uncompressed size" field contains ones in all 64 bits, it means that 83uncompressed size is unknown and there is the "end marker" in stream, 84that indicates the end of decoding point. 85In opposite case, if the value from "Uncompressed size" field is not 86equal to ((2^64) - 1), the LZMA stream decoding must be finished after 87specified number of bytes (Uncompressed size) is decoded. And if there 88is the "end marker", the LZMA decoder must read that marker also. 89 90 91The new scheme to encode LZMA properties 92---------------------------------------- 93 94If LZMA compression is used for some another format, it's recommended to 95use a new improved scheme to encode LZMA properties. That new scheme was 96used in xz format that uses the LZMA2 compression algorithm. 97The LZMA2 is a new compression algorithm that is based on the LZMA algorithm. 98 99The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports 100only reduced set of dictionary sizes: 101 (2 << 11), (3 << 11), 102 (2 << 12), (3 << 12), 103 ... 104 (2 << 30), (3 << 30), 105 (2 << 31) - 1 106 107The dictionary size can be extracted from encoded value with the following code: 108 109 dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11)); 110 111Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of 112"lc" and "lp" properties: 113 114 if (lc + lp > 4) 115 throw "Unsupported properties: (lc + lp) > 4"; 116 117There are some advantages for LZMA decoder with such (lc + lp) value 118limitation. It reduces the maximum size of tables allocated by decoder. 119And it reduces the complexity of initialization procedure, that can be 120important to keep high speed of decoding of big number of small LZMA streams. 121 122It's recommended to use that limitation (lc + lp <= 4) for any new format 123that uses LZMA compression. Note that the combinations of "lc" and "lp" 124parameters, where (lc + lp > 4), can provide significant improvement in 125compression ratio only in some rare cases. 126 127The LZMA properties can be encoded into two bytes in new scheme: 128 129Offset Size Description 130 131 0 1 The dictionary size encoded with LZMA2 scheme 132 1 1 LZMA model properties (lc, lp, pb) in encoded form 133 134 135The RAM usage 136============= 137 138The RAM usage for LZMA decoder is determined by the following parts: 139 1401) The Sliding Window (from 4 KiB to 4 GiB). 1412) The probability model counter arrays (arrays of 16-bit variables). 1423) Some additional state variables (about 10 variables of 32-bit integers). 143 144 145The RAM usage for Sliding Window 146-------------------------------- 147 148There are two main scenarios of decoding: 149 1501) The decoding of full stream to one RAM buffer. 151 152 If we decode full LZMA stream to one output buffer in RAM, the decoder 153 can use that output buffer as sliding window. So the decoder doesn't 154 need additional buffer allocated for sliding window. 155 1562) The decoding to some external storage. 157 158 If we decode LZMA stream to external storage, the decoder must allocate 159 the buffer for sliding window. The size of that buffer must be equal 160 or larger than the value of dictionary size from properties of LZMA stream. 161 162In this specification we describe the code for decoding to some external 163storage. The optimized version of code for decoding of full stream to one 164output RAM buffer can require some minor changes in code. 165 166 167The RAM usage for the probability model counters 168------------------------------------------------ 169 170The size of the probability model counter arrays is calculated with the 171following formula: 172 173size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc)) 174 175Each probability model counter is 11-bit unsigned integer. 176If we use 16-bit integer variables (2-byte integers) for these probability 177model counters, the RAM usage required by probability model counter arrays 178can be estimated with the following formula: 179 180 RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc)) 181 182For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is 183 184 RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB 185 186The maximum RAM state usage is required for decoding the stream with lp = 4 187and lc = 8: 188 189 RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB 190 191If the decoder uses LZMA2's limited property condition 192(lc + lp <= 4), the RAM usage will be not larger than 193 194 RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB 195 196 197The RAM usage for encoder 198------------------------- 199 200There are many variants for LZMA encoding code. 201These variants have different values for memory consumption. 202Note that memory consumption for LZMA Encoder can not be 203smaller than memory consumption of LZMA Decoder for same stream. 204 205The RAM usage required by modern effective implementation of 206LZMA Encoder can be estimated with the following formula: 207 208 Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize. 209 210But there are some modes of the encoder that require less memory. 211 212 213LZMA Decoding 214============= 215 216The LZMA compression algorithm uses LZ-based compression with Sliding Window 217and Range Encoding as entropy coding method. 218 219 220Sliding Window 221-------------- 222 223LZMA uses Sliding Window compression similar to LZ77 algorithm. 224 225LZMA stream must be decoded to the sequence that consists 226of MATCHES and LITERALS: 227 228 - a LITERAL is a 8-bit character (one byte). 229 The decoder just puts that LITERAL to the uncompressed stream. 230 231 - a MATCH is a pair of two numbers (DISTANCE-LENGTH pair). 232 The decoder takes one byte exactly "DISTANCE" characters behind 233 current position in the uncompressed stream and puts it to 234 uncompressed stream. The decoder must repeat it "LENGTH" times. 235 236The "DISTANCE" can not be larger than dictionary size. 237And the "DISTANCE" can not be larger than the number of bytes in 238the uncompressed stream that were decoded before that match. 239 240In this specification we use cyclic buffer to implement Sliding Window 241for LZMA decoder: 242 243class COutWindow 244{ 245 Byte *Buf; 246 UInt32 Pos; 247 UInt32 Size; 248 bool IsFull; 249 250public: 251 unsigned TotalPos; 252 COutStream OutStream; 253 254 COutWindow(): Buf(NULL) {} 255 ~COutWindow() { delete []Buf; } 256 257 void Create(UInt32 dictSize) 258 { 259 Buf = new Byte[dictSize]; 260 Pos = 0; 261 Size = dictSize; 262 IsFull = false; 263 TotalPos = 0; 264 } 265 266 void PutByte(Byte b) 267 { 268 TotalPos++; 269 Buf[Pos++] = b; 270 if (Pos == Size) 271 { 272 Pos = 0; 273 IsFull = true; 274 } 275 OutStream.WriteByte(b); 276 } 277 278 Byte GetByte(UInt32 dist) const 279 { 280 return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos]; 281 } 282 283 void CopyMatch(UInt32 dist, unsigned len) 284 { 285 for (; len > 0; len--) 286 PutByte(GetByte(dist)); 287 } 288 289 bool CheckDistance(UInt32 dist) const 290 { 291 return dist <= Pos || IsFull; 292 } 293 294 bool IsEmpty() const 295 { 296 return Pos == 0 && !IsFull; 297 } 298}; 299 300 301In another implementation it's possible to use one buffer that contains 302Sliding Window and the whole data stream after uncompressing. 303 304 305Range Decoder 306------------- 307 308LZMA algorithm uses Range Encoding (1) as entropy coding method. 309 310LZMA stream contains just one very big number in big-endian encoding. 311LZMA decoder uses the Range Decoder to extract a sequence of binary 312symbols from that big number. 313 314The state of the Range Decoder: 315 316struct CRangeDecoder 317{ 318 UInt32 Range; 319 UInt32 Code; 320 InputStream *InStream; 321 322 bool Corrupted; 323} 324 325The notes about UInt32 type for the "Range" and "Code" variables: 326 327 It's possible to use 64-bit (unsigned or signed) integer type 328 for the "Range" and the "Code" variables instead of 32-bit unsigned, 329 but some additional code must be used to truncate the values to 330 low 32-bits after some operations. 331 332 If the programming language does not support 32-bit unsigned integer type 333 (like in case of JAVA language), it's possible to use 32-bit signed integer, 334 but some code must be changed. For example, it's required to change the code 335 that uses comparison operations for UInt32 variables in this specification. 336 337The Range Decoder can be in some states that can be treated as 338"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted": 339 340 (Corrupted == false), if the Range Decoder has not detected any corruption. 341 (Corrupted == true), if the Range Decoder has detected some corruption. 342 343The reference LZMA Decoder ignores the value of the "Corrupted" variable. 344So it continues to decode the stream, even if the corruption can be detected 345in the Range Decoder. To provide the full compatibility with output of the 346reference LZMA Decoder, another LZMA Decoder implementations must also 347ignore the value of the "Corrupted" variable. 348 349The LZMA Encoder is required to create only such LZMA streams, that will not 350lead the Range Decoder to states, where the "Corrupted" variable is set to true. 351 352The Range Decoder reads first 5 bytes from input stream to initialize 353the state: 354 355void CRangeDecoder::Init() 356{ 357 Corrupted = false; 358 359 if (InStream->ReadByte() != 0) 360 Corrupted = true; 361 362 Range = 0xFFFFFFFF; 363 Code = 0; 364 for (int i = 0; i < 4; i++) 365 Code = (Code << 8) | InStream->ReadByte(); 366 367 if (Code == Range) 368 Corrupted = true; 369} 370 371The LZMA Encoder always writes ZERO in initial byte of compressed stream. 372That scheme allows to simplify the code of the Range Encoder in the 373LZMA Encoder. 374 375After the last bit of data was decoded by Range Decoder, the value of the 376"Code" variable must be equal to 0. The LZMA Decoder must check it by 377calling the IsFinishedOK() function: 378 379 bool IsFinishedOK() const { return Code == 0; } 380 381If there is corruption in data stream, there is big probability that 382the "Code" value will be not equal to 0 in the Finish() function. So that 383check in the IsFinishedOK() function provides very good feature for 384corruption detection. 385 386The value of the "Range" variable before each bit decoding can not be smaller 387than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in 388described range. 389 390#define kTopValue ((UInt32)1 << 24) 391 392void CRangeDecoder::Normalize() 393{ 394 if (Range < kTopValue) 395 { 396 Range <<= 8; 397 Code = (Code << 8) | InStream->ReadByte(); 398 } 399} 400 401Notes: if the size of the "Code" variable is larger than 32 bits, it's 402required to keep only low 32 bits of the "Code" variable after the change 403in Normalize() function. 404 405If the LZMA Stream is not corrupted, the value of the "Code" variable is 406always smaller than value of the "Range" variable. 407But the Range Decoder ignores some types of corruptions, so the value of 408the "Code" variable can be equal or larger than value of the "Range" variable 409for some "Corrupted" archives. 410 411 412LZMA uses Range Encoding only with binary symbols of two types: 413 1) binary symbols with fixed and equal probabilities (direct bits) 414 2) binary symbols with predicted probabilities 415 416The DecodeDirectBits() function decodes the sequence of direct bits: 417 418UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits) 419{ 420 UInt32 res = 0; 421 do 422 { 423 Range >>= 1; 424 Code -= Range; 425 UInt32 t = 0 - ((UInt32)Code >> 31); 426 Code += Range & t; 427 428 if (Code == Range) 429 Corrupted = true; 430 431 Normalize(); 432 res <<= 1; 433 res += t + 1; 434 } 435 while (--numBits); 436 return res; 437} 438 439 440The Bit Decoding with Probability Model 441--------------------------------------- 442 443The task of Bit Probability Model is to estimate probabilities of binary 444symbols. And then it provides the Range Decoder with that information. 445The better prediction provides better compression ratio. 446The Bit Probability Model uses statistical data of previous decoded 447symbols. 448 449That estimated probability is presented as 11-bit unsigned integer value 450that represents the probability of symbol "0". 451 452#define kNumBitModelTotalBits 11 453 454Mathematical probabilities can be presented with the following formulas: 455 probability(symbol_0) = prob / 2048. 456 probability(symbol_1) = 1 - Probability(symbol_0) = 457 = 1 - prob / 2048 = 458 = (2048 - prob) / 2048 459where the "prob" variable contains 11-bit integer probability counter. 460 461It's recommended to use 16-bit unsigned integer type, to store these 11-bit 462probability values: 463 464typedef UInt16 CProb; 465 466Each probability value must be initialized with value ((1 << 11) / 2), 467that represents the state, where probabilities of symbols 0 and 1 468are equal to 0.5: 469 470#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2) 471 472The INIT_PROBS macro is used to initialize the array of CProb variables: 473 474#define INIT_PROBS(p) \ 475 { for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; } 476 477 478The DecodeBit() function decodes one bit. 479The LZMA decoder provides the pointer to CProb variable that contains 480information about estimated probability for symbol 0 and the Range Decoder 481updates that CProb variable after decoding. The Range Decoder increases 482estimated probability of the symbol that was decoded: 483 484#define kNumMoveBits 5 485 486unsigned CRangeDecoder::DecodeBit(CProb *prob) 487{ 488 unsigned v = *prob; 489 UInt32 bound = (Range >> kNumBitModelTotalBits) * v; 490 unsigned symbol; 491 if (Code < bound) 492 { 493 v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits; 494 Range = bound; 495 symbol = 0; 496 } 497 else 498 { 499 v -= v >> kNumMoveBits; 500 Code -= bound; 501 Range -= bound; 502 symbol = 1; 503 } 504 *prob = (CProb)v; 505 Normalize(); 506 return symbol; 507} 508 509 510The Binary Tree of bit model counters 511------------------------------------- 512 513LZMA uses a tree of Bit model variables to decode symbol that needs 514several bits for storing. There are two versions of such trees in LZMA: 515 1) the tree that decodes bits from high bit to low bit (the normal scheme). 516 2) the tree that decodes bits from low bit to high bit (the reverse scheme). 517 518Each binary tree structure supports different size of decoded symbol 519(the size of binary sequence that contains value of symbol). 520If that size of decoded symbol is "NumBits" bits, the tree structure 521uses the array of (2 << NumBits) counters of CProb type. 522But only ((2 << NumBits) - 1) items are used by encoder and decoder. 523The first item (the item with index equal to 0) in array is unused. 524That scheme with unused array's item allows to simplify the code. 525 526unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc) 527{ 528 unsigned m = 1; 529 unsigned symbol = 0; 530 for (unsigned i = 0; i < numBits; i++) 531 { 532 unsigned bit = rc->DecodeBit(&probs[m]); 533 m <<= 1; 534 m += bit; 535 symbol |= (bit << i); 536 } 537 return symbol; 538} 539 540template <unsigned NumBits> 541class CBitTreeDecoder 542{ 543 CProb Probs[(unsigned)1 << NumBits]; 544 545public: 546 547 void Init() 548 { 549 INIT_PROBS(Probs); 550 } 551 552 unsigned Decode(CRangeDecoder *rc) 553 { 554 unsigned m = 1; 555 for (unsigned i = 0; i < NumBits; i++) 556 m = (m << 1) + rc->DecodeBit(&Probs[m]); 557 return m - ((unsigned)1 << NumBits); 558 } 559 560 unsigned ReverseDecode(CRangeDecoder *rc) 561 { 562 return BitTreeReverseDecode(Probs, NumBits, rc); 563 } 564}; 565 566 567LZ part of LZMA 568--------------- 569 570LZ part of LZMA describes details about the decoding of MATCHES and LITERALS. 571 572 573The Literal Decoding 574-------------------- 575 576The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where 577each table contains 0x300 CProb values: 578 579 CProb *LitProbs; 580 581 void CreateLiterals() 582 { 583 LitProbs = new CProb[(UInt32)0x300 << (lc + lp)]; 584 } 585 586 void InitLiterals() 587 { 588 UInt32 num = (UInt32)0x300 << (lc + lp); 589 for (UInt32 i = 0; i < num; i++) 590 LitProbs[i] = PROB_INIT_VAL; 591 } 592 593To select the table for decoding it uses the context that consists of 594(lc) high bits from previous literal and (lp) low bits from value that 595represents current position in outputStream. 596 597If (State > 7), the Literal Decoder also uses "matchByte" that represents 598the byte in OutputStream at position the is the DISTANCE bytes before 599current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair 600of latest decoded match. 601 602The following code decodes one literal and puts it to Sliding Window buffer: 603 604 void DecodeLiteral(unsigned state, UInt32 rep0) 605 { 606 unsigned prevByte = 0; 607 if (!OutWindow.IsEmpty()) 608 prevByte = OutWindow.GetByte(1); 609 610 unsigned symbol = 1; 611 unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc)); 612 CProb *probs = &LitProbs[(UInt32)0x300 * litState]; 613 614 if (state >= 7) 615 { 616 unsigned matchByte = OutWindow.GetByte(rep0 + 1); 617 do 618 { 619 unsigned matchBit = (matchByte >> 7) & 1; 620 matchByte <<= 1; 621 unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]); 622 symbol = (symbol << 1) | bit; 623 if (matchBit != bit) 624 break; 625 } 626 while (symbol < 0x100); 627 } 628 while (symbol < 0x100) 629 symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]); 630 OutWindow.PutByte((Byte)(symbol - 0x100)); 631 } 632 633 634The match length decoding 635------------------------- 636 637The match length decoder returns normalized (zero-based value) 638length of match. That value can be converted to real length of the match 639with the following code: 640 641#define kMatchMinLen 2 642 643 matchLen = len + kMatchMinLen; 644 645The match length decoder can return the values from 0 to 271. 646And the corresponded real match length values can be in the range 647from 2 to 273. 648 649The following scheme is used for the match length encoding: 650 651 Binary encoding Binary Tree structure Zero-based match length 652 sequence (binary + decimal): 653 654 0 xxx LowCoder[posState] xxx 655 1 0 yyy MidCoder[posState] yyy + 8 656 1 1 zzzzzzzz HighCoder zzzzzzzz + 16 657 658LZMA uses bit model variable "Choice" to decode the first selection bit. 659 660If the first selection bit is equal to 0, the decoder uses binary tree 661 LowCoder[posState] to decode 3-bit zero-based match length (xxx). 662 663If the first selection bit is equal to 1, the decoder uses bit model 664 variable "Choice2" to decode the second selection bit. 665 666 If the second selection bit is equal to 0, the decoder uses binary tree 667 MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match 668 length is equal to (yyy + 8). 669 670 If the second selection bit is equal to 1, the decoder uses binary tree 671 HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based 672 match length is equal to (zzzzzzzz + 16). 673 674LZMA uses "posState" value as context to select the binary tree 675from LowCoder and MidCoder binary tree arrays: 676 677 unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1); 678 679The full code of the length decoder: 680 681class CLenDecoder 682{ 683 CProb Choice; 684 CProb Choice2; 685 CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax]; 686 CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax]; 687 CBitTreeDecoder<8> HighCoder; 688 689public: 690 691 void Init() 692 { 693 Choice = PROB_INIT_VAL; 694 Choice2 = PROB_INIT_VAL; 695 HighCoder.Init(); 696 for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++) 697 { 698 LowCoder[i].Init(); 699 MidCoder[i].Init(); 700 } 701 } 702 703 unsigned Decode(CRangeDecoder *rc, unsigned posState) 704 { 705 if (rc->DecodeBit(&Choice) == 0) 706 return LowCoder[posState].Decode(rc); 707 if (rc->DecodeBit(&Choice2) == 0) 708 return 8 + MidCoder[posState].Decode(rc); 709 return 16 + HighCoder.Decode(rc); 710 } 711}; 712 713The LZMA decoder uses two instances of CLenDecoder class. 714The first instance is for the matches of "Simple Match" type, 715and the second instance is for the matches of "Rep Match" type: 716 717 CLenDecoder LenDecoder; 718 CLenDecoder RepLenDecoder; 719 720 721The match distance decoding 722--------------------------- 723 724LZMA supports dictionary sizes up to 4 GiB minus 1. 725The value of match distance (decoded by distance decoder) can be 726from 1 to 2^32. But the distance value that is equal to 2^32 is used to 727indicate the "End of stream" marker. So real largest match distance 728that is used for LZ-window match is (2^32 - 1). 729 730LZMA uses normalized match length (zero-based length) 731to calculate the context state "lenState" do decode the distance value: 732 733#define kNumLenToPosStates 4 734 735 unsigned lenState = len; 736 if (lenState > kNumLenToPosStates - 1) 737 lenState = kNumLenToPosStates - 1; 738 739The distance decoder returns the "dist" value that is zero-based value 740of match distance. The real match distance can be calculated with the 741following code: 742 743 matchDistance = dist + 1; 744 745The state of the distance decoder and the initialization code: 746 747 #define kEndPosModelIndex 14 748 #define kNumFullDistances (1 << (kEndPosModelIndex >> 1)) 749 #define kNumAlignBits 4 750 751 CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates]; 752 CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex]; 753 CBitTreeDecoder<kNumAlignBits> AlignDecoder; 754 755 void InitDist() 756 { 757 for (unsigned i = 0; i < kNumLenToPosStates; i++) 758 PosSlotDecoder[i].Init(); 759 AlignDecoder.Init(); 760 INIT_PROBS(PosDecoders); 761 } 762 763At first stage the distance decoder decodes 6-bit "posSlot" value with bit 764tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different 765"posSlot" values. 766 767 unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec); 768 769The encoding scheme for distance value is shown in the following table: 770 771posSlot (decimal) / 772 zero-based distance (binary) 773 0 0 774 1 1 775 2 10 776 3 11 777 778 4 10 x 779 5 11 x 780 6 10 xx 781 7 11 xx 782 8 10 xxx 783 9 11 xxx 78410 10 xxxx 78511 11 xxxx 78612 10 xxxxx 78713 11 xxxxx 788 78914 10 yy zzzz 79015 11 yy zzzz 79116 10 yyy zzzz 79217 11 yyy zzzz 793... 79462 10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz 79563 11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz 796 797where 798 "x ... x" means the sequence of binary symbols encoded with binary tree and 799 "Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13. 800 "y" means direct bit encoded with range coder. 801 "zzzz" means the sequence of four binary symbols encoded with binary 802 tree with "Reverse" scheme, where one common binary tree "AlignDecoder" 803 is used for all posSlot values. 804 805If (posSlot < 4), the "dist" value is equal to posSlot value. 806 807If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of 808 the high bits of "dist" value and the number of the low bits. 809 810 If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders. 811 (one separated bit tree decoder per one posSlot value) and "Reverse" scheme. 812 In this implementation we use one CProb array "PosDecoders" that contains 813 all CProb variables for all these bit decoders. 814 815 if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct 816 bits from RangeDecoder and the low 4 bits are decoded with a bit tree 817 decoder "AlignDecoder" with "Reverse" scheme. 818 819The code to decode zero-based match distance: 820 821 unsigned DecodeDistance(unsigned len) 822 { 823 unsigned lenState = len; 824 if (lenState > kNumLenToPosStates - 1) 825 lenState = kNumLenToPosStates - 1; 826 827 unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec); 828 if (posSlot < 4) 829 return posSlot; 830 831 unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1); 832 UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits); 833 if (posSlot < kEndPosModelIndex) 834 dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec); 835 else 836 { 837 dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits; 838 dist += AlignDecoder.ReverseDecode(&RangeDec); 839 } 840 return dist; 841 } 842 843 844 845LZMA Decoding modes 846------------------- 847 848There are 2 types of LZMA streams: 849 8501) The stream with "End of stream" marker. 8512) The stream without "End of stream" marker. 852 853And the LZMA Decoder supports 3 modes of decoding: 854 8551) The unpack size is undefined. The LZMA decoder stops decoding after 856 getting "End of stream" marker. 857 The input variables for that case: 858 859 markerIsMandatory = true 860 unpackSizeDefined = false 861 unpackSize contains any value 862 8632) The unpack size is defined and LZMA decoder supports both variants, 864 where the stream can contain "End of stream" marker or the stream is 865 finished without "End of stream" marker. The LZMA decoder must detect 866 any of these situations. 867 The input variables for that case: 868 869 markerIsMandatory = false 870 unpackSizeDefined = true 871 unpackSize contains unpack size 872 8733) The unpack size is defined and the LZMA stream must contain 874 "End of stream" marker 875 The input variables for that case: 876 877 markerIsMandatory = true 878 unpackSizeDefined = true 879 unpackSize contains unpack size 880 881 882The main loop of decoder 883------------------------ 884 885The main loop of LZMA decoder: 886 887Initialize the LZMA state. 888loop 889{ 890 // begin of loop 891 Check "end of stream" conditions. 892 Decode Type of MATCH / LITERAL. 893 If it's LITERAL, decode LITERAL value and put the LITERAL to Window. 894 If it's MATCH, decode the length of match and the match distance. 895 Check error conditions, check end of stream conditions and copy 896 the sequence of match bytes from sliding window to current position 897 in window. 898 Go to begin of loop 899} 900 901The reference implementation of LZMA decoder uses "unpackSize" variable 902to keep the number of remaining bytes in output stream. So it reduces 903"unpackSize" value after each decoded LITERAL or MATCH. 904 905The following code contains the "end of stream" condition check at the start 906of the loop: 907 908 if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory) 909 if (RangeDec.IsFinishedOK()) 910 return LZMA_RES_FINISHED_WITHOUT_MARKER; 911 912LZMA uses three types of matches: 913 9141) "Simple Match" - the match with distance value encoded with bit models. 915 9162) "Rep Match" - the match that uses the distance from distance 917 history table. 918 9193) "Short Rep Match" - the match of single byte length, that uses the latest 920 distance from distance history table. 921 922The LZMA decoder keeps the history of latest 4 match distances that were used 923by decoder. That set of 4 variables contains zero-based match distances and 924these variables are initialized with zero values: 925 926 UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0; 927 928The LZMA decoder uses binary model variables to select type of MATCH or LITERAL: 929 930#define kNumStates 12 931#define kNumPosBitsMax 4 932 933 CProb IsMatch[kNumStates << kNumPosBitsMax]; 934 CProb IsRep[kNumStates]; 935 CProb IsRepG0[kNumStates]; 936 CProb IsRepG1[kNumStates]; 937 CProb IsRepG2[kNumStates]; 938 CProb IsRep0Long[kNumStates << kNumPosBitsMax]; 939 940The decoder uses "state" variable value to select exact variable 941from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays. 942The "state" variable can get the value from 0 to 11. 943Initial value for "state" variable is zero: 944 945 unsigned state = 0; 946 947The "state" variable is updated after each LITERAL or MATCH with one of the 948following functions: 949 950unsigned UpdateState_Literal(unsigned state) 951{ 952 if (state < 4) return 0; 953 else if (state < 10) return state - 3; 954 else return state - 6; 955} 956unsigned UpdateState_Match (unsigned state) { return state < 7 ? 7 : 10; } 957unsigned UpdateState_Rep (unsigned state) { return state < 7 ? 8 : 11; } 958unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; } 959 960The decoder calculates "state2" variable value to select exact variable from 961"IsMatch" and "IsRep0Long" arrays: 962 963unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1); 964unsigned state2 = (state << kNumPosBitsMax) + posState; 965 966The decoder uses the following code flow scheme to select exact 967type of LITERAL or MATCH: 968 969IsMatch[state2] decode 970 0 - the Literal 971 1 - the Match 972 IsRep[state] decode 973 0 - Simple Match 974 1 - Rep Match 975 IsRepG0[state] decode 976 0 - the distance is rep0 977 IsRep0Long[state2] decode 978 0 - Short Rep Match 979 1 - Rep Match 0 980 1 - 981 IsRepG1[state] decode 982 0 - Rep Match 1 983 1 - 984 IsRepG2[state] decode 985 0 - Rep Match 2 986 1 - Rep Match 3 987 988 989LITERAL symbol 990-------------- 991If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type. 992 993At first the LZMA decoder must check that it doesn't exceed 994specified uncompressed size: 995 996 if (unpackSizeDefined && unpackSize == 0) 997 return LZMA_RES_ERROR; 998 999Then it decodes literal value and puts it to sliding window: 1000 1001 DecodeLiteral(state, rep0); 1002 1003Then the decoder must update the "state" value and "unpackSize" value; 1004 1005 state = UpdateState_Literal(state); 1006 unpackSize--; 1007 1008Then the decoder must go to the begin of main loop to decode next Match or Literal. 1009 1010 1011Simple Match 1012------------ 1013 1014If the value "1" was decoded with IsMatch[state2] decoding, 1015we have the "Simple Match" type. 1016 1017The distance history table is updated with the following scheme: 1018 1019 rep3 = rep2; 1020 rep2 = rep1; 1021 rep1 = rep0; 1022 1023The zero-based length is decoded with "LenDecoder": 1024 1025 len = LenDecoder.Decode(&RangeDec, posState); 1026 1027The state is update with UpdateState_Match function: 1028 1029 state = UpdateState_Match(state); 1030 1031and the new "rep0" value is decoded with DecodeDistance: 1032 1033 rep0 = DecodeDistance(len); 1034 1035That "rep0" will be used as zero-based distance for current match. 1036 1037If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have 1038"End of stream" marker, so we can stop decoding and check finishing 1039condition in Range Decoder: 1040 1041 if (rep0 == 0xFFFFFFFF) 1042 return RangeDec.IsFinishedOK() ? 1043 LZMA_RES_FINISHED_WITH_MARKER : 1044 LZMA_RES_ERROR; 1045 1046If uncompressed size is defined, LZMA decoder must check that it doesn't 1047exceed that specified uncompressed size: 1048 1049 if (unpackSizeDefined && unpackSize == 0) 1050 return LZMA_RES_ERROR; 1051 1052Also the decoder must check that "rep0" value is not larger than dictionary size 1053and is not larger than the number of already decoded bytes: 1054 1055 if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0)) 1056 return LZMA_RES_ERROR; 1057 1058Then the decoder must copy match bytes as described in 1059"The match symbols copying" section. 1060 1061 1062Rep Match 1063--------- 1064 1065If the LZMA decoder has decoded the value "1" with IsRep[state] variable, 1066we have "Rep Match" type. 1067 1068At first the LZMA decoder must check that it doesn't exceed 1069specified uncompressed size: 1070 1071 if (unpackSizeDefined && unpackSize == 0) 1072 return LZMA_RES_ERROR; 1073 1074Also the decoder must return error, if the LZ window is empty: 1075 1076 if (OutWindow.IsEmpty()) 1077 return LZMA_RES_ERROR; 1078 1079If the match type is "Rep Match", the decoder uses one of the 4 variables of 1080distance history table to get the value of distance for current match. 1081And there are 4 corresponding ways of decoding flow. 1082 1083The decoder updates the distance history with the following scheme 1084depending from type of match: 1085 1086- "Rep Match 0" or "Short Rep Match": 1087 ; LZMA doesn't update the distance history 1088 1089- "Rep Match 1": 1090 UInt32 dist = rep1; 1091 rep1 = rep0; 1092 rep0 = dist; 1093 1094- "Rep Match 2": 1095 UInt32 dist = rep2; 1096 rep2 = rep1; 1097 rep1 = rep0; 1098 rep0 = dist; 1099 1100- "Rep Match 3": 1101 UInt32 dist = rep3; 1102 rep3 = rep2; 1103 rep2 = rep1; 1104 rep1 = rep0; 1105 rep0 = dist; 1106 1107Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long", 1108"IsRepG1", "IsRepG2". 1109 1110If the subtype is "Short Rep Match", the decoder updates the state, puts 1111the one byte from window to current position in window and goes to next 1112MATCH/LITERAL symbol (the begin of main loop): 1113 1114 state = UpdateState_ShortRep(state); 1115 OutWindow.PutByte(OutWindow.GetByte(rep0 + 1)); 1116 unpackSize--; 1117 continue; 1118 1119In other cases (Rep Match 0/1/2/3), it decodes the zero-based 1120length of match with "RepLenDecoder" decoder: 1121 1122 len = RepLenDecoder.Decode(&RangeDec, posState); 1123 1124Then it updates the state: 1125 1126 state = UpdateState_Rep(state); 1127 1128Then the decoder must copy match bytes as described in 1129"The Match symbols copying" section. 1130 1131 1132The match symbols copying 1133------------------------- 1134 1135If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must 1136copy the sequence of bytes with calculated match distance and match length. 1137If uncompressed size is defined, LZMA decoder must check that it doesn't 1138exceed that specified uncompressed size: 1139 1140 len += kMatchMinLen; 1141 bool isError = false; 1142 if (unpackSizeDefined && unpackSize < len) 1143 { 1144 len = (unsigned)unpackSize; 1145 isError = true; 1146 } 1147 OutWindow.CopyMatch(rep0 + 1, len); 1148 unpackSize -= len; 1149 if (isError) 1150 return LZMA_RES_ERROR; 1151 1152Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL. 1153 1154 1155 1156NOTES 1157----- 1158 1159This specification doesn't describe the variant of decoder implementation 1160that supports partial decoding. Such partial decoding case can require some 1161changes in "end of stream" condition checks code. Also such code 1162can use additional status codes, returned by decoder. 1163 1164This specification uses C++ code with templates to simplify describing. 1165The optimized version of LZMA decoder doesn't need templates. 1166Such optimized version can use just two arrays of CProb variables: 1167 1) The dynamic array of CProb variables allocated for the Literal Decoder. 1168 2) The one common array that contains all other CProb variables. 1169 1170 1171References: 1172 11731. G. N. N. Martin, Range encoding: an algorithm for removing redundancy 1174 from a digitized message, Video & Data Recording Conference, 1175 Southampton, UK, July 24-27, 1979. 1176