1LZMA specification (DRAFT version) 2---------------------------------- 3 4Author: Igor Pavlov 5Date: 2015-06-14 6 7This specification defines the format of LZMA compressed data and lzma file format. 8 9Notation 10-------- 11 12We use the syntax of C++ programming language. 13We use the following types in C++ code: 14 unsigned - unsigned integer, at least 16 bits in size 15 int - signed integer, at least 16 bits in size 16 UInt64 - 64-bit unsigned integer 17 UInt32 - 32-bit unsigned integer 18 UInt16 - 16-bit unsigned integer 19 Byte - 8-bit unsigned integer 20 bool - boolean type with two possible values: false, true 21 22 23lzma file format 24================ 25 26The lzma file contains the raw LZMA stream and the header with related properties. 27 28The files in that format use ".lzma" extension. 29 30The lzma file format layout: 31 32Offset Size Description 33 34 0 1 LZMA model properties (lc, lp, pb) in encoded form 35 1 4 Dictionary size (32-bit unsigned integer, little-endian) 36 5 8 Uncompressed size (64-bit unsigned integer, little-endian) 37 13 Compressed data (LZMA stream) 38 39LZMA properties: 40 41 name Range Description 42 43 lc [0, 8] the number of "literal context" bits 44 lp [0, 4] the number of "literal pos" bits 45 pb [0, 4] the number of "pos" bits 46dictSize [0, 2^32 - 1] the dictionary size 47 48The following code encodes LZMA properties: 49 50void EncodeProperties(Byte *properties) 51{ 52 properties[0] = (Byte)((pb * 5 + lp) * 9 + lc); 53 Set_UInt32_LittleEndian(properties + 1, dictSize); 54} 55 56If the value of dictionary size in properties is smaller than (1 << 12), 57the LZMA decoder must set the dictionary size variable to (1 << 12). 58 59#define LZMA_DIC_MIN (1 << 12) 60 61 unsigned lc, pb, lp; 62 UInt32 dictSize; 63 UInt32 dictSizeInProperties; 64 65 void DecodeProperties(const Byte *properties) 66 { 67 unsigned d = properties[0]; 68 if (d >= (9 * 5 * 5)) 69 throw "Incorrect LZMA properties"; 70 lc = d % 9; 71 d /= 9; 72 pb = d / 5; 73 lp = d % 5; 74 dictSizeInProperties = 0; 75 for (int i = 0; i < 4; i++) 76 dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i); 77 dictSize = dictSizeInProperties; 78 if (dictSize < LZMA_DIC_MIN) 79 dictSize = LZMA_DIC_MIN; 80 } 81 82If "Uncompressed size" field contains ones in all 64 bits, it means that 83uncompressed size is unknown and there is the "end marker" in stream, 84that indicates the end of decoding point. 85In opposite case, if the value from "Uncompressed size" field is not 86equal to ((2^64) - 1), the LZMA stream decoding must be finished after 87specified number of bytes (Uncompressed size) is decoded. And if there 88is the "end marker", the LZMA decoder must read that marker also. 89 90 91The new scheme to encode LZMA properties 92---------------------------------------- 93 94If LZMA compression is used for some another format, it's recommended to 95use a new improved scheme to encode LZMA properties. That new scheme was 96used in xz format that uses the LZMA2 compression algorithm. 97The LZMA2 is a new compression algorithm that is based on the LZMA algorithm. 98 99The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports 100only reduced set of dictionary sizes: 101 (2 << 11), (3 << 11), 102 (2 << 12), (3 << 12), 103 ... 104 (2 << 30), (3 << 30), 105 (2 << 31) - 1 106 107The dictionary size can be extracted from encoded value with the following code: 108 109 dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11)); 110 111Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of 112"lc" and "lp" properties: 113 114 if (lc + lp > 4) 115 throw "Unsupported properties: (lc + lp) > 4"; 116 117There are some advantages for LZMA decoder with such (lc + lp) value 118limitation. It reduces the maximum size of tables allocated by decoder. 119And it reduces the complexity of initialization procedure, that can be 120important to keep high speed of decoding of big number of small LZMA streams. 121 122It's recommended to use that limitation (lc + lp <= 4) for any new format 123that uses LZMA compression. Note that the combinations of "lc" and "lp" 124parameters, where (lc + lp > 4), can provide significant improvement in 125compression ratio only in some rare cases. 126 127The LZMA properties can be encoded into two bytes in new scheme: 128 129Offset Size Description 130 131 0 1 The dictionary size encoded with LZMA2 scheme 132 1 1 LZMA model properties (lc, lp, pb) in encoded form 133 134 135The RAM usage 136============= 137 138The RAM usage for LZMA decoder is determined by the following parts: 139 1401) The Sliding Window (from 4 KiB to 4 GiB). 1412) The probability model counter arrays (arrays of 16-bit variables). 1423) Some additional state variables (about 10 variables of 32-bit integers). 143 144 145The RAM usage for Sliding Window 146-------------------------------- 147 148There are two main scenarios of decoding: 149 1501) The decoding of full stream to one RAM buffer. 151 152 If we decode full LZMA stream to one output buffer in RAM, the decoder 153 can use that output buffer as sliding window. So the decoder doesn't 154 need additional buffer allocated for sliding window. 155 1562) The decoding to some external storage. 157 158 If we decode LZMA stream to external storage, the decoder must allocate 159 the buffer for sliding window. The size of that buffer must be equal 160 or larger than the value of dictionary size from properties of LZMA stream. 161 162In this specification we describe the code for decoding to some external 163storage. The optimized version of code for decoding of full stream to one 164output RAM buffer can require some minor changes in code. 165 166 167The RAM usage for the probability model counters 168------------------------------------------------ 169 170The size of the probability model counter arrays is calculated with the 171following formula: 172 173size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc)) 174 175Each probability model counter is 11-bit unsigned integer. 176If we use 16-bit integer variables (2-byte integers) for these probability 177model counters, the RAM usage required by probability model counter arrays 178can be estimated with the following formula: 179 180 RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc)) 181 182For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is 183 184 RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB 185 186The maximum RAM state usage is required for decoding the stream with lp = 4 187and lc = 8: 188 189 RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB 190 191If the decoder uses LZMA2's limited property condition 192(lc + lp <= 4), the RAM usage will be not larger than 193 194 RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB 195 196 197The RAM usage for encoder 198------------------------- 199 200There are many variants for LZMA encoding code. 201These variants have different values for memory consumption. 202Note that memory consumption for LZMA Encoder can not be 203smaller than memory consumption of LZMA Decoder for same stream. 204 205The RAM usage required by modern effective implementation of 206LZMA Encoder can be estimated with the following formula: 207 208 Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize. 209 210But there are some modes of the encoder that require less memory. 211 212 213LZMA Decoding 214============= 215 216The LZMA compression algorithm uses LZ-based compression with Sliding Window 217and Range Encoding as entropy coding method. 218 219 220Sliding Window 221-------------- 222 223LZMA uses Sliding Window compression similar to LZ77 algorithm. 224 225LZMA stream must be decoded to the sequence that consists 226of MATCHES and LITERALS: 227 228 - a LITERAL is a 8-bit character (one byte). 229 The decoder just puts that LITERAL to the uncompressed stream. 230 231 - a MATCH is a pair of two numbers (DISTANCE-LENGTH pair). 232 The decoder takes one byte exactly "DISTANCE" characters behind 233 current position in the uncompressed stream and puts it to 234 uncompressed stream. The decoder must repeat it "LENGTH" times. 235 236The "DISTANCE" can not be larger than dictionary size. 237And the "DISTANCE" can not be larger than the number of bytes in 238the uncompressed stream that were decoded before that match. 239 240In this specification we use cyclic buffer to implement Sliding Window 241for LZMA decoder: 242 243class COutWindow 244{ 245 Byte *Buf; 246 UInt32 Pos; 247 UInt32 Size; 248 bool IsFull; 249 250public: 251 unsigned TotalPos; 252 COutStream OutStream; 253 254 COutWindow(): Buf(NULL) {} 255 ~COutWindow() { delete []Buf; } 256 257 void Create(UInt32 dictSize) 258 { 259 Buf = new Byte[dictSize]; 260 Pos = 0; 261 Size = dictSize; 262 IsFull = false; 263 TotalPos = 0; 264 } 265 266 void PutByte(Byte b) 267 { 268 TotalPos++; 269 Buf[Pos++] = b; 270 if (Pos == Size) 271 { 272 Pos = 0; 273 IsFull = true; 274 } 275 OutStream.WriteByte(b); 276 } 277 278 Byte GetByte(UInt32 dist) const 279 { 280 return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos]; 281 } 282 283 void CopyMatch(UInt32 dist, unsigned len) 284 { 285 for (; len > 0; len--) 286 PutByte(GetByte(dist)); 287 } 288 289 bool CheckDistance(UInt32 dist) const 290 { 291 return dist <= Pos || IsFull; 292 } 293 294 bool IsEmpty() const 295 { 296 return Pos == 0 && !IsFull; 297 } 298}; 299 300 301In another implementation it's possible to use one buffer that contains 302Sliding Window and the whole data stream after uncompressing. 303 304 305Range Decoder 306------------- 307 308LZMA algorithm uses Range Encoding (1) as entropy coding method. 309 310LZMA stream contains just one very big number in big-endian encoding. 311LZMA decoder uses the Range Decoder to extract a sequence of binary 312symbols from that big number. 313 314The state of the Range Decoder: 315 316struct CRangeDecoder 317{ 318 UInt32 Range; 319 UInt32 Code; 320 InputStream *InStream; 321 322 bool Corrupted; 323} 324 325The notes about UInt32 type for the "Range" and "Code" variables: 326 327 It's possible to use 64-bit (unsigned or signed) integer type 328 for the "Range" and the "Code" variables instead of 32-bit unsigned, 329 but some additional code must be used to truncate the values to 330 low 32-bits after some operations. 331 332 If the programming language does not support 32-bit unsigned integer type 333 (like in case of JAVA language), it's possible to use 32-bit signed integer, 334 but some code must be changed. For example, it's required to change the code 335 that uses comparison operations for UInt32 variables in this specification. 336 337The Range Decoder can be in some states that can be treated as 338"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted": 339 340 (Corrupted == false), if the Range Decoder has not detected any corruption. 341 (Corrupted == true), if the Range Decoder has detected some corruption. 342 343The reference LZMA Decoder ignores the value of the "Corrupted" variable. 344So it continues to decode the stream, even if the corruption can be detected 345in the Range Decoder. To provide the full compatibility with output of the 346reference LZMA Decoder, another LZMA Decoder implementations must also 347ignore the value of the "Corrupted" variable. 348 349The LZMA Encoder is required to create only such LZMA streams, that will not 350lead the Range Decoder to states, where the "Corrupted" variable is set to true. 351 352The Range Decoder reads first 5 bytes from input stream to initialize 353the state: 354 355bool CRangeDecoder::Init() 356{ 357 Corrupted = false; 358 Range = 0xFFFFFFFF; 359 Code = 0; 360 361 Byte b = InStream->ReadByte(); 362 363 for (int i = 0; i < 4; i++) 364 Code = (Code << 8) | InStream->ReadByte(); 365 366 if (b != 0 || Code == Range) 367 Corrupted = true; 368 return b == 0; 369} 370 371The LZMA Encoder always writes ZERO in initial byte of compressed stream. 372That scheme allows to simplify the code of the Range Encoder in the 373LZMA Encoder. If initial byte is not equal to ZERO, the LZMA Decoder must 374stop decoding and report error. 375 376After the last bit of data was decoded by Range Decoder, the value of the 377"Code" variable must be equal to 0. The LZMA Decoder must check it by 378calling the IsFinishedOK() function: 379 380 bool IsFinishedOK() const { return Code == 0; } 381 382If there is corruption in data stream, there is big probability that 383the "Code" value will be not equal to 0 in the Finish() function. So that 384check in the IsFinishedOK() function provides very good feature for 385corruption detection. 386 387The value of the "Range" variable before each bit decoding can not be smaller 388than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in 389described range. 390 391#define kTopValue ((UInt32)1 << 24) 392 393void CRangeDecoder::Normalize() 394{ 395 if (Range < kTopValue) 396 { 397 Range <<= 8; 398 Code = (Code << 8) | InStream->ReadByte(); 399 } 400} 401 402Notes: if the size of the "Code" variable is larger than 32 bits, it's 403required to keep only low 32 bits of the "Code" variable after the change 404in Normalize() function. 405 406If the LZMA Stream is not corrupted, the value of the "Code" variable is 407always smaller than value of the "Range" variable. 408But the Range Decoder ignores some types of corruptions, so the value of 409the "Code" variable can be equal or larger than value of the "Range" variable 410for some "Corrupted" archives. 411 412 413LZMA uses Range Encoding only with binary symbols of two types: 414 1) binary symbols with fixed and equal probabilities (direct bits) 415 2) binary symbols with predicted probabilities 416 417The DecodeDirectBits() function decodes the sequence of direct bits: 418 419UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits) 420{ 421 UInt32 res = 0; 422 do 423 { 424 Range >>= 1; 425 Code -= Range; 426 UInt32 t = 0 - ((UInt32)Code >> 31); 427 Code += Range & t; 428 429 if (Code == Range) 430 Corrupted = true; 431 432 Normalize(); 433 res <<= 1; 434 res += t + 1; 435 } 436 while (--numBits); 437 return res; 438} 439 440 441The Bit Decoding with Probability Model 442--------------------------------------- 443 444The task of Bit Probability Model is to estimate probabilities of binary 445symbols. And then it provides the Range Decoder with that information. 446The better prediction provides better compression ratio. 447The Bit Probability Model uses statistical data of previous decoded 448symbols. 449 450That estimated probability is presented as 11-bit unsigned integer value 451that represents the probability of symbol "0". 452 453#define kNumBitModelTotalBits 11 454 455Mathematical probabilities can be presented with the following formulas: 456 probability(symbol_0) = prob / 2048. 457 probability(symbol_1) = 1 - Probability(symbol_0) = 458 = 1 - prob / 2048 = 459 = (2048 - prob) / 2048 460where the "prob" variable contains 11-bit integer probability counter. 461 462It's recommended to use 16-bit unsigned integer type, to store these 11-bit 463probability values: 464 465typedef UInt16 CProb; 466 467Each probability value must be initialized with value ((1 << 11) / 2), 468that represents the state, where probabilities of symbols 0 and 1 469are equal to 0.5: 470 471#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2) 472 473The INIT_PROBS macro is used to initialize the array of CProb variables: 474 475#define INIT_PROBS(p) \ 476 { for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; } 477 478 479The DecodeBit() function decodes one bit. 480The LZMA decoder provides the pointer to CProb variable that contains 481information about estimated probability for symbol 0 and the Range Decoder 482updates that CProb variable after decoding. The Range Decoder increases 483estimated probability of the symbol that was decoded: 484 485#define kNumMoveBits 5 486 487unsigned CRangeDecoder::DecodeBit(CProb *prob) 488{ 489 unsigned v = *prob; 490 UInt32 bound = (Range >> kNumBitModelTotalBits) * v; 491 unsigned symbol; 492 if (Code < bound) 493 { 494 v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits; 495 Range = bound; 496 symbol = 0; 497 } 498 else 499 { 500 v -= v >> kNumMoveBits; 501 Code -= bound; 502 Range -= bound; 503 symbol = 1; 504 } 505 *prob = (CProb)v; 506 Normalize(); 507 return symbol; 508} 509 510 511The Binary Tree of bit model counters 512------------------------------------- 513 514LZMA uses a tree of Bit model variables to decode symbol that needs 515several bits for storing. There are two versions of such trees in LZMA: 516 1) the tree that decodes bits from high bit to low bit (the normal scheme). 517 2) the tree that decodes bits from low bit to high bit (the reverse scheme). 518 519Each binary tree structure supports different size of decoded symbol 520(the size of binary sequence that contains value of symbol). 521If that size of decoded symbol is "NumBits" bits, the tree structure 522uses the array of (2 << NumBits) counters of CProb type. 523But only ((2 << NumBits) - 1) items are used by encoder and decoder. 524The first item (the item with index equal to 0) in array is unused. 525That scheme with unused array's item allows to simplify the code. 526 527unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc) 528{ 529 unsigned m = 1; 530 unsigned symbol = 0; 531 for (unsigned i = 0; i < numBits; i++) 532 { 533 unsigned bit = rc->DecodeBit(&probs[m]); 534 m <<= 1; 535 m += bit; 536 symbol |= (bit << i); 537 } 538 return symbol; 539} 540 541template <unsigned NumBits> 542class CBitTreeDecoder 543{ 544 CProb Probs[(unsigned)1 << NumBits]; 545 546public: 547 548 void Init() 549 { 550 INIT_PROBS(Probs); 551 } 552 553 unsigned Decode(CRangeDecoder *rc) 554 { 555 unsigned m = 1; 556 for (unsigned i = 0; i < NumBits; i++) 557 m = (m << 1) + rc->DecodeBit(&Probs[m]); 558 return m - ((unsigned)1 << NumBits); 559 } 560 561 unsigned ReverseDecode(CRangeDecoder *rc) 562 { 563 return BitTreeReverseDecode(Probs, NumBits, rc); 564 } 565}; 566 567 568LZ part of LZMA 569--------------- 570 571LZ part of LZMA describes details about the decoding of MATCHES and LITERALS. 572 573 574The Literal Decoding 575-------------------- 576 577The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where 578each table contains 0x300 CProb values: 579 580 CProb *LitProbs; 581 582 void CreateLiterals() 583 { 584 LitProbs = new CProb[(UInt32)0x300 << (lc + lp)]; 585 } 586 587 void InitLiterals() 588 { 589 UInt32 num = (UInt32)0x300 << (lc + lp); 590 for (UInt32 i = 0; i < num; i++) 591 LitProbs[i] = PROB_INIT_VAL; 592 } 593 594To select the table for decoding it uses the context that consists of 595(lc) high bits from previous literal and (lp) low bits from value that 596represents current position in outputStream. 597 598If (State > 7), the Literal Decoder also uses "matchByte" that represents 599the byte in OutputStream at position the is the DISTANCE bytes before 600current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair 601of latest decoded match. 602 603The following code decodes one literal and puts it to Sliding Window buffer: 604 605 void DecodeLiteral(unsigned state, UInt32 rep0) 606 { 607 unsigned prevByte = 0; 608 if (!OutWindow.IsEmpty()) 609 prevByte = OutWindow.GetByte(1); 610 611 unsigned symbol = 1; 612 unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc)); 613 CProb *probs = &LitProbs[(UInt32)0x300 * litState]; 614 615 if (state >= 7) 616 { 617 unsigned matchByte = OutWindow.GetByte(rep0 + 1); 618 do 619 { 620 unsigned matchBit = (matchByte >> 7) & 1; 621 matchByte <<= 1; 622 unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]); 623 symbol = (symbol << 1) | bit; 624 if (matchBit != bit) 625 break; 626 } 627 while (symbol < 0x100); 628 } 629 while (symbol < 0x100) 630 symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]); 631 OutWindow.PutByte((Byte)(symbol - 0x100)); 632 } 633 634 635The match length decoding 636------------------------- 637 638The match length decoder returns normalized (zero-based value) 639length of match. That value can be converted to real length of the match 640with the following code: 641 642#define kMatchMinLen 2 643 644 matchLen = len + kMatchMinLen; 645 646The match length decoder can return the values from 0 to 271. 647And the corresponded real match length values can be in the range 648from 2 to 273. 649 650The following scheme is used for the match length encoding: 651 652 Binary encoding Binary Tree structure Zero-based match length 653 sequence (binary + decimal): 654 655 0 xxx LowCoder[posState] xxx 656 1 0 yyy MidCoder[posState] yyy + 8 657 1 1 zzzzzzzz HighCoder zzzzzzzz + 16 658 659LZMA uses bit model variable "Choice" to decode the first selection bit. 660 661If the first selection bit is equal to 0, the decoder uses binary tree 662 LowCoder[posState] to decode 3-bit zero-based match length (xxx). 663 664If the first selection bit is equal to 1, the decoder uses bit model 665 variable "Choice2" to decode the second selection bit. 666 667 If the second selection bit is equal to 0, the decoder uses binary tree 668 MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match 669 length is equal to (yyy + 8). 670 671 If the second selection bit is equal to 1, the decoder uses binary tree 672 HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based 673 match length is equal to (zzzzzzzz + 16). 674 675LZMA uses "posState" value as context to select the binary tree 676from LowCoder and MidCoder binary tree arrays: 677 678 unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1); 679 680The full code of the length decoder: 681 682class CLenDecoder 683{ 684 CProb Choice; 685 CProb Choice2; 686 CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax]; 687 CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax]; 688 CBitTreeDecoder<8> HighCoder; 689 690public: 691 692 void Init() 693 { 694 Choice = PROB_INIT_VAL; 695 Choice2 = PROB_INIT_VAL; 696 HighCoder.Init(); 697 for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++) 698 { 699 LowCoder[i].Init(); 700 MidCoder[i].Init(); 701 } 702 } 703 704 unsigned Decode(CRangeDecoder *rc, unsigned posState) 705 { 706 if (rc->DecodeBit(&Choice) == 0) 707 return LowCoder[posState].Decode(rc); 708 if (rc->DecodeBit(&Choice2) == 0) 709 return 8 + MidCoder[posState].Decode(rc); 710 return 16 + HighCoder.Decode(rc); 711 } 712}; 713 714The LZMA decoder uses two instances of CLenDecoder class. 715The first instance is for the matches of "Simple Match" type, 716and the second instance is for the matches of "Rep Match" type: 717 718 CLenDecoder LenDecoder; 719 CLenDecoder RepLenDecoder; 720 721 722The match distance decoding 723--------------------------- 724 725LZMA supports dictionary sizes up to 4 GiB minus 1. 726The value of match distance (decoded by distance decoder) can be 727from 1 to 2^32. But the distance value that is equal to 2^32 is used to 728indicate the "End of stream" marker. So real largest match distance 729that is used for LZ-window match is (2^32 - 1). 730 731LZMA uses normalized match length (zero-based length) 732to calculate the context state "lenState" do decode the distance value: 733 734#define kNumLenToPosStates 4 735 736 unsigned lenState = len; 737 if (lenState > kNumLenToPosStates - 1) 738 lenState = kNumLenToPosStates - 1; 739 740The distance decoder returns the "dist" value that is zero-based value 741of match distance. The real match distance can be calculated with the 742following code: 743 744 matchDistance = dist + 1; 745 746The state of the distance decoder and the initialization code: 747 748 #define kEndPosModelIndex 14 749 #define kNumFullDistances (1 << (kEndPosModelIndex >> 1)) 750 #define kNumAlignBits 4 751 752 CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates]; 753 CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex]; 754 CBitTreeDecoder<kNumAlignBits> AlignDecoder; 755 756 void InitDist() 757 { 758 for (unsigned i = 0; i < kNumLenToPosStates; i++) 759 PosSlotDecoder[i].Init(); 760 AlignDecoder.Init(); 761 INIT_PROBS(PosDecoders); 762 } 763 764At first stage the distance decoder decodes 6-bit "posSlot" value with bit 765tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different 766"posSlot" values. 767 768 unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec); 769 770The encoding scheme for distance value is shown in the following table: 771 772posSlot (decimal) / 773 zero-based distance (binary) 774 0 0 775 1 1 776 2 10 777 3 11 778 779 4 10 x 780 5 11 x 781 6 10 xx 782 7 11 xx 783 8 10 xxx 784 9 11 xxx 78510 10 xxxx 78611 11 xxxx 78712 10 xxxxx 78813 11 xxxxx 789 79014 10 yy zzzz 79115 11 yy zzzz 79216 10 yyy zzzz 79317 11 yyy zzzz 794... 79562 10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz 79663 11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz 797 798where 799 "x ... x" means the sequence of binary symbols encoded with binary tree and 800 "Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13. 801 "y" means direct bit encoded with range coder. 802 "zzzz" means the sequence of four binary symbols encoded with binary 803 tree with "Reverse" scheme, where one common binary tree "AlignDecoder" 804 is used for all posSlot values. 805 806If (posSlot < 4), the "dist" value is equal to posSlot value. 807 808If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of 809 the high bits of "dist" value and the number of the low bits. 810 811 If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders. 812 (one separated bit tree decoder per one posSlot value) and "Reverse" scheme. 813 In this implementation we use one CProb array "PosDecoders" that contains 814 all CProb variables for all these bit decoders. 815 816 if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct 817 bits from RangeDecoder and the low 4 bits are decoded with a bit tree 818 decoder "AlignDecoder" with "Reverse" scheme. 819 820The code to decode zero-based match distance: 821 822 unsigned DecodeDistance(unsigned len) 823 { 824 unsigned lenState = len; 825 if (lenState > kNumLenToPosStates - 1) 826 lenState = kNumLenToPosStates - 1; 827 828 unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec); 829 if (posSlot < 4) 830 return posSlot; 831 832 unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1); 833 UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits); 834 if (posSlot < kEndPosModelIndex) 835 dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec); 836 else 837 { 838 dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits; 839 dist += AlignDecoder.ReverseDecode(&RangeDec); 840 } 841 return dist; 842 } 843 844 845 846LZMA Decoding modes 847------------------- 848 849There are 2 types of LZMA streams: 850 8511) The stream with "End of stream" marker. 8522) The stream without "End of stream" marker. 853 854And the LZMA Decoder supports 3 modes of decoding: 855 8561) The unpack size is undefined. The LZMA decoder stops decoding after 857 getting "End of stream" marker. 858 The input variables for that case: 859 860 markerIsMandatory = true 861 unpackSizeDefined = false 862 unpackSize contains any value 863 8642) The unpack size is defined and LZMA decoder supports both variants, 865 where the stream can contain "End of stream" marker or the stream is 866 finished without "End of stream" marker. The LZMA decoder must detect 867 any of these situations. 868 The input variables for that case: 869 870 markerIsMandatory = false 871 unpackSizeDefined = true 872 unpackSize contains unpack size 873 8743) The unpack size is defined and the LZMA stream must contain 875 "End of stream" marker 876 The input variables for that case: 877 878 markerIsMandatory = true 879 unpackSizeDefined = true 880 unpackSize contains unpack size 881 882 883The main loop of decoder 884------------------------ 885 886The main loop of LZMA decoder: 887 888Initialize the LZMA state. 889loop 890{ 891 // begin of loop 892 Check "end of stream" conditions. 893 Decode Type of MATCH / LITERAL. 894 If it's LITERAL, decode LITERAL value and put the LITERAL to Window. 895 If it's MATCH, decode the length of match and the match distance. 896 Check error conditions, check end of stream conditions and copy 897 the sequence of match bytes from sliding window to current position 898 in window. 899 Go to begin of loop 900} 901 902The reference implementation of LZMA decoder uses "unpackSize" variable 903to keep the number of remaining bytes in output stream. So it reduces 904"unpackSize" value after each decoded LITERAL or MATCH. 905 906The following code contains the "end of stream" condition check at the start 907of the loop: 908 909 if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory) 910 if (RangeDec.IsFinishedOK()) 911 return LZMA_RES_FINISHED_WITHOUT_MARKER; 912 913LZMA uses three types of matches: 914 9151) "Simple Match" - the match with distance value encoded with bit models. 916 9172) "Rep Match" - the match that uses the distance from distance 918 history table. 919 9203) "Short Rep Match" - the match of single byte length, that uses the latest 921 distance from distance history table. 922 923The LZMA decoder keeps the history of latest 4 match distances that were used 924by decoder. That set of 4 variables contains zero-based match distances and 925these variables are initialized with zero values: 926 927 UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0; 928 929The LZMA decoder uses binary model variables to select type of MATCH or LITERAL: 930 931#define kNumStates 12 932#define kNumPosBitsMax 4 933 934 CProb IsMatch[kNumStates << kNumPosBitsMax]; 935 CProb IsRep[kNumStates]; 936 CProb IsRepG0[kNumStates]; 937 CProb IsRepG1[kNumStates]; 938 CProb IsRepG2[kNumStates]; 939 CProb IsRep0Long[kNumStates << kNumPosBitsMax]; 940 941The decoder uses "state" variable value to select exact variable 942from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays. 943The "state" variable can get the value from 0 to 11. 944Initial value for "state" variable is zero: 945 946 unsigned state = 0; 947 948The "state" variable is updated after each LITERAL or MATCH with one of the 949following functions: 950 951unsigned UpdateState_Literal(unsigned state) 952{ 953 if (state < 4) return 0; 954 else if (state < 10) return state - 3; 955 else return state - 6; 956} 957unsigned UpdateState_Match (unsigned state) { return state < 7 ? 7 : 10; } 958unsigned UpdateState_Rep (unsigned state) { return state < 7 ? 8 : 11; } 959unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; } 960 961The decoder calculates "state2" variable value to select exact variable from 962"IsMatch" and "IsRep0Long" arrays: 963 964unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1); 965unsigned state2 = (state << kNumPosBitsMax) + posState; 966 967The decoder uses the following code flow scheme to select exact 968type of LITERAL or MATCH: 969 970IsMatch[state2] decode 971 0 - the Literal 972 1 - the Match 973 IsRep[state] decode 974 0 - Simple Match 975 1 - Rep Match 976 IsRepG0[state] decode 977 0 - the distance is rep0 978 IsRep0Long[state2] decode 979 0 - Short Rep Match 980 1 - Rep Match 0 981 1 - 982 IsRepG1[state] decode 983 0 - Rep Match 1 984 1 - 985 IsRepG2[state] decode 986 0 - Rep Match 2 987 1 - Rep Match 3 988 989 990LITERAL symbol 991-------------- 992If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type. 993 994At first the LZMA decoder must check that it doesn't exceed 995specified uncompressed size: 996 997 if (unpackSizeDefined && unpackSize == 0) 998 return LZMA_RES_ERROR; 999 1000Then it decodes literal value and puts it to sliding window: 1001 1002 DecodeLiteral(state, rep0); 1003 1004Then the decoder must update the "state" value and "unpackSize" value; 1005 1006 state = UpdateState_Literal(state); 1007 unpackSize--; 1008 1009Then the decoder must go to the begin of main loop to decode next Match or Literal. 1010 1011 1012Simple Match 1013------------ 1014 1015If the value "1" was decoded with IsMatch[state2] decoding, 1016we have the "Simple Match" type. 1017 1018The distance history table is updated with the following scheme: 1019 1020 rep3 = rep2; 1021 rep2 = rep1; 1022 rep1 = rep0; 1023 1024The zero-based length is decoded with "LenDecoder": 1025 1026 len = LenDecoder.Decode(&RangeDec, posState); 1027 1028The state is update with UpdateState_Match function: 1029 1030 state = UpdateState_Match(state); 1031 1032and the new "rep0" value is decoded with DecodeDistance: 1033 1034 rep0 = DecodeDistance(len); 1035 1036That "rep0" will be used as zero-based distance for current match. 1037 1038If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have 1039"End of stream" marker, so we can stop decoding and check finishing 1040condition in Range Decoder: 1041 1042 if (rep0 == 0xFFFFFFFF) 1043 return RangeDec.IsFinishedOK() ? 1044 LZMA_RES_FINISHED_WITH_MARKER : 1045 LZMA_RES_ERROR; 1046 1047If uncompressed size is defined, LZMA decoder must check that it doesn't 1048exceed that specified uncompressed size: 1049 1050 if (unpackSizeDefined && unpackSize == 0) 1051 return LZMA_RES_ERROR; 1052 1053Also the decoder must check that "rep0" value is not larger than dictionary size 1054and is not larger than the number of already decoded bytes: 1055 1056 if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0)) 1057 return LZMA_RES_ERROR; 1058 1059Then the decoder must copy match bytes as described in 1060"The match symbols copying" section. 1061 1062 1063Rep Match 1064--------- 1065 1066If the LZMA decoder has decoded the value "1" with IsRep[state] variable, 1067we have "Rep Match" type. 1068 1069At first the LZMA decoder must check that it doesn't exceed 1070specified uncompressed size: 1071 1072 if (unpackSizeDefined && unpackSize == 0) 1073 return LZMA_RES_ERROR; 1074 1075Also the decoder must return error, if the LZ window is empty: 1076 1077 if (OutWindow.IsEmpty()) 1078 return LZMA_RES_ERROR; 1079 1080If the match type is "Rep Match", the decoder uses one of the 4 variables of 1081distance history table to get the value of distance for current match. 1082And there are 4 corresponding ways of decoding flow. 1083 1084The decoder updates the distance history with the following scheme 1085depending from type of match: 1086 1087- "Rep Match 0" or "Short Rep Match": 1088 ; LZMA doesn't update the distance history 1089 1090- "Rep Match 1": 1091 UInt32 dist = rep1; 1092 rep1 = rep0; 1093 rep0 = dist; 1094 1095- "Rep Match 2": 1096 UInt32 dist = rep2; 1097 rep2 = rep1; 1098 rep1 = rep0; 1099 rep0 = dist; 1100 1101- "Rep Match 3": 1102 UInt32 dist = rep3; 1103 rep3 = rep2; 1104 rep2 = rep1; 1105 rep1 = rep0; 1106 rep0 = dist; 1107 1108Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long", 1109"IsRepG1", "IsRepG2". 1110 1111If the subtype is "Short Rep Match", the decoder updates the state, puts 1112the one byte from window to current position in window and goes to next 1113MATCH/LITERAL symbol (the begin of main loop): 1114 1115 state = UpdateState_ShortRep(state); 1116 OutWindow.PutByte(OutWindow.GetByte(rep0 + 1)); 1117 unpackSize--; 1118 continue; 1119 1120In other cases (Rep Match 0/1/2/3), it decodes the zero-based 1121length of match with "RepLenDecoder" decoder: 1122 1123 len = RepLenDecoder.Decode(&RangeDec, posState); 1124 1125Then it updates the state: 1126 1127 state = UpdateState_Rep(state); 1128 1129Then the decoder must copy match bytes as described in 1130"The Match symbols copying" section. 1131 1132 1133The match symbols copying 1134------------------------- 1135 1136If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must 1137copy the sequence of bytes with calculated match distance and match length. 1138If uncompressed size is defined, LZMA decoder must check that it doesn't 1139exceed that specified uncompressed size: 1140 1141 len += kMatchMinLen; 1142 bool isError = false; 1143 if (unpackSizeDefined && unpackSize < len) 1144 { 1145 len = (unsigned)unpackSize; 1146 isError = true; 1147 } 1148 OutWindow.CopyMatch(rep0 + 1, len); 1149 unpackSize -= len; 1150 if (isError) 1151 return LZMA_RES_ERROR; 1152 1153Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL. 1154 1155 1156 1157NOTES 1158----- 1159 1160This specification doesn't describe the variant of decoder implementation 1161that supports partial decoding. Such partial decoding case can require some 1162changes in "end of stream" condition checks code. Also such code 1163can use additional status codes, returned by decoder. 1164 1165This specification uses C++ code with templates to simplify describing. 1166The optimized version of LZMA decoder doesn't need templates. 1167Such optimized version can use just two arrays of CProb variables: 1168 1) The dynamic array of CProb variables allocated for the Literal Decoder. 1169 2) The one common array that contains all other CProb variables. 1170 1171 1172References: 1173 11741. G. N. N. Martin, Range encoding: an algorithm for removing redundancy 1175 from a digitized message, Video & Data Recording Conference, 1176 Southampton, UK, July 24-27, 1979. 1177