Lempel–Ziv–Markov chain algorithm

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. The algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977. It features a high compression ratio (generally higher than bzip2; the LZMA Unix port was eventually replaced by xz, which offers better and faster compression) and a variable compression-dictionary size (up to 4 GB), while still maintaining decompression speed similar to other commonly used compression algorithms.

LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters. LZMA2 supports arbitrarily scalable multithreaded compression and decompression and efficient compression of data which is partially incompressible.


Overview

LZMA uses a dictionary compression algorithm (a variant of LZ77 with huge dictionary sizes and special support for repeatedly used match distances), whose output is then encoded with a range encoder, using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations.

Prior to LZMA, most encoder models were purely byte-based (i.e. they coded each bit using only a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context. Furthermore, compared to classic dictionary compression (such as the one used in the zip and gzip formats), the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems.


Compressed format overview

In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type. Both the lzip and the LZMA SDK documentation describe this stream format.

There are 7 types of packets:
* LIT: a single literal byte
* MATCH: an LZ77 sequence with an explicitly encoded length and distance
* SHORTREP: a one-byte LZ77 sequence whose distance is the last used distance
* LONGREP[0]: an LZ77 sequence with an encoded length, whose distance is the last used distance
* LONGREP[1], LONGREP[2], LONGREP[3]: like LONGREP[0], but reusing the second, third and fourth last used distance respectively

LONGREP[*] refers to LONGREP[0–3] packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP. LONGREP[n] packets remove the distance used from the list of the most recent distances and reinsert it at the front, to avoid useless repeated entries, while MATCH just adds the distance to the front even if it is already present in the list, and SHORTREP and LONGREP[0] do not alter the list.

The length is encoded as follows:
* A 0 bit followed by 3 bits encodes lengths 2 to 9
* The bits 1, 0 followed by 3 bits encode lengths 10 to 17
* The bits 1, 1 followed by 8 bits encode lengths 18 to 273

As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if the copy was performed byte by byte, keeping the distance constant.

Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary. The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed. Distances are decoded as a binary concatenation of, from most to least significant, two bits depending on the distance slot, some bits encoded with fixed 0.5 probability, and some context encoded bits, according to the following table (distance slots 0−3 directly encode distances 0−3).
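The byte-by-byte copy rule and the handling of the recent-distance list can be illustrated with a small sketch. The function names `copy_match` and `update_reps` and the string packet labels are hypothetical, chosen only for this illustration; the sketch assumes the list always holds exactly four distances.

```python
def copy_match(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes copied from the dictionary, one byte at a time.

    Distance 0 refers to the most recently written byte, so a length larger
    than the distance simply repeats the bytes that were just copied.
    """
    if distance >= len(out):
        raise ValueError("distance points before the start of the dictionary")
    for _ in range(length):
        out.append(out[-1 - distance])


def update_reps(reps: list, packet: str, distance: int = 0, index: int = 0) -> int:
    """Update the four most recent distances and return the distance used."""
    if packet == "MATCH":
        reps.insert(0, distance)   # pushed to the front even if already present
        reps.pop()                 # keep exactly four entries
        return distance
    if packet == "LONGREP":
        used = reps.pop(index)     # remove, then reinsert at the front
        reps.insert(0, used)       # for index 0 the list is unchanged
        return used
    if packet == "SHORTREP":
        return reps[0]             # list is not altered
    raise ValueError(f"packet type without a distance: {packet}")
```

For instance, copying 5 bytes at distance 1 from a dictionary holding `ab` yields `abababa`, although only two bytes existed when the copy started.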


Decompression algorithm details

No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text. The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference is not ideal, any programmer should be able to check the claims below with a few hours of work.


Range coding of bits

LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder.

Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable ''prob'' (typically implemented using a 16-bit data type) representing the predicted probability of the bit being 0, which is read and updated by the range decoder (and should be initialized to $2^{10}$, representing 0.5 probability).

Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding.

The range decoder state consists of two unsigned 32-bit variables, ''range'' (representing the range size), and ''code'' (representing the encoded point within the range).

Initialization of the range decoder consists of setting ''range'' to $2^{32} - 1$, and ''code'' to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored.

Normalization proceeds in this way:
# Shift both ''range'' and ''code'' left by 8 bits
# Read a byte from the compressed stream
# Set the least significant 8 bits of ''code'' to the byte value read

Context-based range decoding of a bit using the ''prob'' probability variable proceeds in this way:
# If ''range'' is less than $2^{24}$, perform normalization
# Set ''bound'' to $\lfloor \mathit{range} / 2^{11} \rfloor \times \mathit{prob}$
# If ''code'' is less than ''bound'':
## Set ''range'' to ''bound''
## Set ''prob'' to $\mathit{prob} + \lfloor (2^{11} - \mathit{prob}) / 2^{5} \rfloor$
## Return bit 0
# Otherwise (if ''code'' is greater than or equal to ''bound''):
## Set ''range'' to ''range'' − ''bound''
## Set ''code'' to ''code'' − ''bound''
## Set ''prob'' to $\mathit{prob} - \lfloor \mathit{prob} / 2^{5} \rfloor$
## Return bit 1

Fixed-probability range decoding of a bit proceeds in this way:
# If ''range'' is less than $2^{24}$, perform normalization
# Set ''range'' to $\lfloor \mathit{range} / 2 \rfloor$
# If ''code'' is less than ''range'':
## Return bit 0
# Otherwise (if ''code'' is greater than or equal to ''range''):
## Set ''code'' to ''code'' − ''range''
## Return bit 1

The Linux kernel implementation of fixed-probability decoding in rc_direct(), for performance reasons, does not include a conditional branch, but instead subtracts ''range'' from ''code'' unconditionally. The resulting sign bit is used both to decide the bit to return and to generate a mask that is combined with ''code'' and added to ''range''.

Note that:
# The division by $2^{11}$ when computing ''bound'', together with the floor operation, is done before the multiplication, not after (apparently to avoid requiring fast hardware support for 32-bit multiplication with a 64-bit result)
# Fixed probability decoding is not strictly equivalent to context-based range decoding with any ''prob'' value, because context-based range decoding discards the lower 11 bits of ''range'' before multiplying by ''prob'' as just described, while fixed probability decoding only discards the last bit
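The decoding steps above can be sketched as follows. This is a minimal illustration rather than the kernel code: the class name `RangeDecoder` and its method names are hypothetical, the probability variables are held in a plain list, and the constants $2^{10}$, $2^{11}$, $2^{5}$, $2^{24}$ and $2^{32} - 1$ are the standard LZMA values assumed throughout this section.

```python
class RangeDecoder:
    """Bit-level range decoder following the steps listed in the text."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 5                                   # next unread byte
        self.range = 0xFFFFFFFF                        # 2**32 - 1
        self.code = int.from_bytes(data[1:5], "big")   # byte 0 is ignored

    def _normalize(self) -> None:
        if self.range < (1 << 24):
            self.range <<= 8
            self.code = ((self.code << 8) | self.data[self.pos]) & 0xFFFFFFFF
            self.pos += 1

    def decode_bit(self, probs: list, i: int) -> int:
        """Context-based decoding using and updating probs[i] (11-bit value)."""
        self._normalize()
        bound = (self.range >> 11) * probs[i]          # divide first, then multiply
        if self.code < bound:
            self.range = bound
            probs[i] += (2048 - probs[i]) >> 5
            return 0
        self.range -= bound
        self.code -= bound
        probs[i] -= probs[i] >> 5
        return 1

    def decode_direct(self) -> int:
        """Fixed-probability decoding of one bit."""
        self._normalize()
        self.range >>= 1
        if self.code < self.range:
            return 0
        self.code -= self.range
        return 1
```

A caller would typically create one list of probabilities per context group, each entry initialized to 1024, and pass the list together with an index for every bit it decodes.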


Range coding of integers

The range decoder also provides the bit-tree, reverse bit-tree and fixed probability integer decoding facilities, which are used to decode integers, and generalize the single-bit decoding described above. To decode unsigned integers less than ''limit'', an array of $\mathit{limit} - 1$ 11-bit probability variables is provided, which are conceptually arranged as the internal nodes of a complete binary tree with ''limit'' leaves.

Non-reverse bit-tree decoding works by keeping a pointer to the tree of variables, which starts at the root. As long as the pointer does not point to a leaf, a bit is decoded using the variable indicated by the pointer, and the pointer is moved to either the left or right child depending on whether the bit is 0 or 1; when the pointer points to a leaf, the number associated with the leaf is returned.

Non-reverse bit-tree decoding thus happens from most significant to least significant bit, stopping when only one value in the valid range is possible (this conceptually allows range sizes that are not powers of two, even though LZMA does not make use of this).

Reverse bit-tree decoding instead decodes from least significant bit to most significant bit, and thus only supports ranges that are powers of two, and always decodes the same number of bits. It is equivalent to performing non-reverse bit-tree decoding with a power of two ''limit'', and reversing the last $\log_2(\mathit{limit})$ bits of the result.

In the rc_bittree function in the Linux kernel, integers are actually returned in the $[\mathit{limit}, 2 \times \mathit{limit})$ range (with ''limit'' added to the conceptual value), and the variable at index 0 in the array is unused, while the one at index 1 is the root, and the left and right children indices are computed as 2''i'' and 2''i'' + 1. The rc_bittree_reverse function instead adds integers in the $[0, \mathit{limit})$ range to a caller-provided variable, where ''limit'' is implicitly represented by its logarithm, and has its own independent implementation for efficiency reasons.

Fixed probability integer decoding simply performs fixed probability bit decoding repeatedly, reading bits from the most to the least significant.
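The two bit-tree walks can be sketched independently of the range decoder by passing in any function that decodes one bit for a given probability index. The names `bittree_decode`, `bittree_reverse_decode` and `make_stub` are hypothetical; the stub replays a fixed bit list so that the traversal order can be inspected. The sketch assumes a power-of-two ''limit'' expressed as `num_bits`, with index 0 of the array unused and index 1 as the root.

```python
def bittree_decode(decode_bit, probs, num_bits):
    """Decode num_bits bits, most significant first; returns 0 .. 2**num_bits - 1."""
    i = 1                                   # root of the tree
    for _ in range(num_bits):
        i = 2 * i + decode_bit(probs, i)    # left child 2i, right child 2i + 1
    return i - (1 << num_bits)              # strip the implicit leading 1


def bittree_reverse_decode(decode_bit, probs, num_bits):
    """Decode num_bits bits, least significant first."""
    i = 1
    value = 0
    for k in range(num_bits):
        bit = decode_bit(probs, i)
        i = 2 * i + bit
        value |= bit << k                   # k-th decoded bit lands at bit k
    return value


def make_stub(bits):
    """Return a fake decode_bit that replays `bits` and records visited indices."""
    source = iter(bits)
    visited = []

    def decode_bit(probs, i):
        visited.append(i)
        return next(source)

    return decode_bit, visited
```

With the bit sequence 1, 1, 0 the forward walk produces 6 (binary 110) while the reverse walk produces 3 (binary 011); both visit the same tree nodes 1, 3 and 7.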


LZMA configuration

The LZMA decoder is configured by a "properties" byte (also called lclppb) and a dictionary size. The value of the byte is $\mathit{lc} + \mathit{lp} \times 9 + \mathit{pb} \times 9 \times 5$, where:
*''lc'' is the number of high bits of the previous byte to use as a context for literal encoding (the default value used by the LZMA SDK is 3)
*''lp'' is the number of low bits of the dictionary position to include in ''literal_pos_state'' (the default value used by the LZMA SDK is 0)
*''pb'' is the number of low bits of the dictionary position to include in ''pos_state'' (the default value used by the LZMA SDK is 2)

In non-LZMA2 streams, ''lc'' must not be greater than 8, and ''lp'' and ''pb'' must not be greater than 4. In LZMA2 streams, (''lc'' + ''lp'') and ''pb'' must not be greater than 4.

In the 7-zip LZMA file format, configuration is performed by a header containing the "properties" byte followed by the 32-bit little-endian dictionary size in bytes. In LZMA2, the properties byte can optionally be changed at the start of LZMA2 LZMA packets, while the dictionary size is specified in the LZMA2 header as later described.
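As an illustration of the packing, the sketch below splits a properties byte into its three fields and reads the dictionary size that follows it. The function names are hypothetical, and only the five header bytes mentioned above are examined; the limits checked are the non-LZMA2 ones.

```python
def encode_props(lc: int, lp: int, pb: int) -> int:
    """Pack lc, lp and pb into one properties byte."""
    if not (0 <= lc <= 8 and 0 <= lp <= 4 and 0 <= pb <= 4):
        raise ValueError("lc must be 0..8, lp and pb must be 0..4")
    return lc + lp * 9 + pb * 9 * 5


def decode_props(props: int):
    """Split a properties byte into (lc, lp, pb)."""
    if not 0 <= props < 9 * 5 * 5:
        raise ValueError("invalid properties byte")
    lc = props % 9
    lp = (props // 9) % 5
    pb = props // 45
    return lc, lp, pb


def parse_header(header: bytes):
    """Read the properties byte and the little-endian 32-bit dictionary size."""
    lc, lp, pb = decode_props(header[0])
    dict_size = int.from_bytes(header[1:5], "little")
    return lc, lp, pb, dict_size
```

With the LZMA SDK defaults (''lc'' = 3, ''lp'' = 0, ''pb'' = 2) the byte evaluates to 3 + 0 + 90 = 93, that is 0x5D.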


LZMA coding contexts

The LZMA packet format has already been described, and this section specifies how LZMA statistically models the LZ-encoded streams, or in other words which probability variables are passed to the range decoder to decode each bit. Those probability variables are implemented as multi-dimensional arrays; before introducing them, a few values that are used as indices in these multidimensional arrays are defined.

The ''state'' value is conceptually based on which of the patterns in the following table match the latest 2-4 packet types seen, and is implemented as a state machine state updated according to the transition table listed in the table every time a packet is output. The initial state is 0, and thus packets before the beginning are assumed to be LIT packets.

The ''pos_state'' and ''literal_pos_state'' values consist of respectively the ''pb'' and ''lp'' (up to 4, from the LZMA header or LZMA2 properties packet) least significant bits of the dictionary position (the number of bytes coded since the last dictionary reset modulo the dictionary size). Note that the dictionary size is normally a multiple of a large power of 2, so these values are equivalently described as the least significant bits of the number of uncompressed bytes seen since the last dictionary reset.

The ''prev_byte_lc_msbs'' value is set to the ''lc'' (up to 4, from the LZMA header or LZMA2 properties packet) most significant bits of the previous uncompressed byte.

The ''is_REP'' value denotes whether a packet that includes a length is a LONGREP rather than a MATCH.

The ''match_byte'' value is the byte that would have been decoded if a SHORTREP packet had been used (in other words, the byte found in the dictionary at the last used distance); it is only used just after a *MATCH packet.

''literal_bit_mode'' is an array of 8 values in the 0-2 range, one for each bit position in a byte. A value is 1 or 2 if the previous packet was a *MATCH and either it is the most significant bit position or all the more significant bits in the literal to encode/decode are equal to the bits in the corresponding positions in ''match_byte''; otherwise it is 0. The choice between the values 1 and 2 depends on the value of the bit at the same position in ''match_byte''.

The literal/Literal set of variables can be seen as a "pseudo-bit-tree" similar to a bit-tree but with 3 variables instead of 1 in every node, chosen depending on the ''literal_bit_mode'' value at the bit position of the next bit to decode after the bit-tree context denoted by the node.

The claim, found in some sources, that literals after a *MATCH are coded as the XOR of the byte value with ''match_byte'' is incorrect; they are instead coded simply as their byte value, but using the pseudo-bit-tree just described and the additional context listed in the table below.

The probability variable groups used in LZMA are the following:
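The three position and byte based index values defined above reduce to a few bit operations. The following sketch uses a hypothetical helper name, `context_indices`, and takes the dictionary position and previous byte as plain integers.

```python
def context_indices(position: int, prev_byte: int, lc: int, lp: int, pb: int):
    """Return (pos_state, literal_pos_state, prev_byte_lc_msbs).

    position  -- number of bytes coded since the last dictionary reset
    prev_byte -- previous uncompressed byte (0..255)
    """
    pos_state = position & ((1 << pb) - 1)          # pb low bits of the position
    literal_pos_state = position & ((1 << lp) - 1)  # lp low bits of the position
    prev_byte_lc_msbs = prev_byte >> (8 - lc)       # lc high bits of the byte
    return pos_state, literal_pos_state, prev_byte_lc_msbs
```

For position 13 (binary 1101) and previous byte 0xE5 (binary 11100101) with ''lc'' = 3, ''lp'' = 0 and ''pb'' = 2, the result is (1, 0, 7).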


LZMA2 format

The LZMA2 container supports multiple runs of compressed LZMA data and uncompressed data. Each LZMA compressed run can have a different LZMA configuration and dictionary. This improves the compression of partially or completely incompressible files and allows multithreaded compression and multithreaded decompression by breaking the file into runs that can be compressed or decompressed independently in parallel. Criticisms of LZMA2's changes over LZMA include header fields not being covered by CRCs, and parallel decompression not being possible in practice.

The LZMA2 header consists of a byte ''v'' indicating the dictionary size:
* 40 indicates a 4 GB − 1 dictionary size
* Even values less than 40 indicate a $2^{v/2 + 12}$ bytes dictionary size
* Odd values less than 40 indicate a $3 \times 2^{(v - 1)/2 + 11}$ bytes dictionary size
* Values higher than 40 are invalid

LZMA2 data consists of packets starting with a control byte, with the following values:
* 0 denotes the end of the file
* 1 denotes a dictionary reset followed by an uncompressed chunk
* 2 denotes an uncompressed chunk without a dictionary reset
* 3-0x7f are invalid values
* 0x80-0xff denotes an LZMA chunk, where the lowest 5 bits are used as bits 16-20 of the uncompressed size minus one, and bits 5-6 indicate what should be reset

Bits 5-6 for LZMA chunks can be:
* 0: nothing reset
* 1: state reset
* 2: state reset, properties reset using properties byte
* 3: state reset, properties reset using properties byte, dictionary reset

LZMA state resets cause a reset of all LZMA state except the dictionary, and specifically:
* The range coder
* The ''state'' value
* The last distances for repeated matches
* All LZMA probabilities

Uncompressed chunks consist of:
* A 16-bit big-endian value encoding the data size minus one
* The data to be copied verbatim into the dictionary and the output

LZMA chunks consist of:
* A 16-bit big-endian value encoding the low 16 bits of the uncompressed size minus one
* A 16-bit big-endian value encoding the compressed size minus one
* A properties/lclppb byte if bit 6 in the control byte is set
* The LZMA compressed data, starting with the 5 bytes (of which the first is ignored) used to initialize the range coder (which are included in the compressed size)
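The header byte and control byte rules above can be written down directly. The sketch below uses hypothetical function names and returns the control byte fields as a dictionary purely for readability; it does not parse the chunk payloads.

```python
def lzma2_dict_size(v: int) -> int:
    """Translate the LZMA2 header byte into a dictionary size in bytes."""
    if not 0 <= v <= 40:
        raise ValueError("invalid LZMA2 dictionary size byte")
    if v == 40:
        return 0xFFFFFFFF                       # 4 GB - 1
    if v % 2 == 0:
        return 1 << (v // 2 + 12)               # 2 ** (v/2 + 12)
    return 3 << ((v - 1) // 2 + 11)             # 3 * 2 ** ((v - 1)/2 + 11)


def parse_control(control: int) -> dict:
    """Classify an LZMA2 control byte."""
    if control == 0:
        return {"kind": "end"}
    if control in (1, 2):
        return {"kind": "uncompressed", "dict_reset": control == 1}
    if control < 0x80:
        raise ValueError("invalid LZMA2 control byte")
    reset = (control >> 5) & 3                  # bits 5-6
    return {
        "kind": "lzma",
        "size_bits_16_20": control & 0x1F,      # high bits of (size - 1)
        "state_reset": reset >= 1,
        "new_props": reset >= 2,                # properties byte follows the sizes
        "dict_reset": reset == 3,
    }
```

The smallest header values 0, 1 and 2 give dictionary sizes of 4096, 6144 and 8192 bytes, which shows how even and odd values interleave powers of two with 1.5 times a power of two.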


xz and 7z formats

The .xz format, which can contain LZMA2 data, is documented at ''tukaani.org'', while the .7z file format, which can contain either LZMA or LZMA2 data, is documented in the 7zformat.txt file contained in the LZMA SDK.


Compression algorithm details

Similar to the decompression format situation, no complete natural language specification of the encoding techniques in 7-zip or xz seems to exist, other than the one attempted in the following text. The description below is based on the XZ for Java encoder by Lasse Collin, which appears to be the most readable among several rewrites of the original 7-zip using the same algorithms: again, while citing source code as reference is not ideal, any programmer should be able to check the claims below with a few hours of work.


Range encoder

The range encoder cannot make any interesting choices, and can be readily constructed based on the decoder description. Initialization and termination are not fully determined; the xz encoder outputs 0 as the first byte which is ignored by the decompressor, and encodes the lower bound of the range (which matters for the final bytes).

The xz encoder uses an unsigned 33-bit variable called ''low'' (typically implemented as a 64-bit integer, initialized to 0), an unsigned 32-bit variable called ''range'' (initialized to $2^{32} - 1$), an unsigned 8-bit variable called ''cache'' (initialized to 0), and an unsigned variable called ''cache_size'' which needs to be large enough to store the uncompressed size (initialized to 1, typically implemented as a 64-bit integer).

The ''cache''/''cache_size'' variables are used to properly handle carries, and represent a number defined by a big-endian sequence starting with the ''cache'' value, and followed by ''cache_size'' 0xff bytes, which has been shifted out of the ''low'' register, but has not been written yet, because it could be incremented by one due to a carry.

Note that the first byte output will always be 0, because ''cache'' and ''low'' are initialized to 0 and because of the way the encoder is implemented; the xz decoder ignores this byte.

Normalization proceeds in this way:
# If ''low'' is less than $2^{32} - 2^{24}$:
## Output the byte stored in ''cache'' to the compressed stream
## Output ''cache_size'' − 1 bytes with 0xff value
## Set ''cache'' to bits 24-31 of ''low''
## Set ''cache_size'' to 0
# If ''low'' is greater than or equal to $2^{32}$:
## Output the byte stored in ''cache'' plus one to the compressed stream
## Output ''cache_size'' − 1 bytes with 0 value
## Set ''cache'' to bits 24-31 of ''low''
## Set ''cache_size'' to 0
# Increment ''cache_size''
# Set ''low'' to the lowest 24 bits of ''low'' shifted left by 8 bits
# Set ''range'' to ''range'' shifted left by 8 bits

Context-based range encoding of a bit using the ''prob'' probability variable proceeds in this way:
# If ''range'' is less than $2^{24}$, perform normalization
# Set ''bound'' to $\lfloor \mathit{range} / 2^{11} \rfloor \times \mathit{prob}$
# If encoding a 0 bit:
## Set ''range'' to ''bound''
## Set ''prob'' to $\mathit{prob} + \lfloor (2^{11} - \mathit{prob}) / 2^{5} \rfloor$
# Otherwise (if encoding a 1 bit):
## Set ''range'' to ''range'' − ''bound''
## Set ''low'' to ''low'' + ''bound''
## Set ''prob'' to $\mathit{prob} - \lfloor \mathit{prob} / 2^{5} \rfloor$

Fixed-probability range encoding of a bit proceeds in this way:
# If ''range'' is less than $2^{24}$, perform normalization
# Set ''range'' to $\lfloor \mathit{range} / 2 \rfloor$
# If encoding a 1 bit:
## Set ''low'' to ''low'' + ''range''

Termination proceeds this way:
# Perform normalization 5 times

Bit-tree encoding is performed like decoding, except that bit values are taken from the input integer to be encoded rather than from the result of the bit decoding functions.

For algorithms that try to compute the encoding with the shortest post-range-encoding size, the encoder also needs to provide an estimate of that.
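The encoder steps can be sketched as below. The class name `RangeEncoder`, its methods and the small `decode_bits` checker are hypothetical names introduced for this illustration; the checker repeats the context-based decoding rule only so that the round trip can be verified, and fixed-probability encoding is omitted for brevity.

```python
class RangeEncoder:
    """Carry-propagating range encoder with the cache/cache_size mechanism."""

    def __init__(self):
        self.low = 0                 # conceptually an unsigned 33-bit value
        self.range = 0xFFFFFFFF
        self.cache = 0
        self.cache_size = 1
        self.out = bytearray()

    def _shift_low(self):
        if self.low < 0xFF000000 or self.low >= (1 << 32):
            carry = self.low >> 32   # 0 or 1
            self.out.append((self.cache + carry) & 0xFF)
            # Pending 0xff bytes turn into 0x00 when a carry ripples through.
            self.out.extend([(0xFF + carry) & 0xFF] * (self.cache_size - 1))
            self.cache = (self.low >> 24) & 0xFF
            self.cache_size = 0
        self.cache_size += 1
        self.low = (self.low & 0x00FFFFFF) << 8

    def encode_bit(self, probs, i, bit):
        if self.range < (1 << 24):   # normalization
            self.range <<= 8
            self._shift_low()
        bound = (self.range >> 11) * probs[i]
        if bit == 0:
            self.range = bound
            probs[i] += (2048 - probs[i]) >> 5
        else:
            self.low += bound
            self.range -= bound
            probs[i] -= probs[i] >> 5

    def finish(self) -> bytes:
        for _ in range(5):           # flush the remaining bytes of low
            self._shift_low()
        return bytes(self.out)


def decode_bits(data, count, probs, i=0):
    """Minimal context-based decoder, used only to check the round trip."""
    rng, code, pos = 0xFFFFFFFF, int.from_bytes(data[1:5], "big"), 5
    bits = []
    for _ in range(count):
        if rng < (1 << 24):
            rng <<= 8
            code = ((code << 8) | data[pos]) & 0xFFFFFFFF
            pos += 1
        bound = (rng >> 11) * probs[i]
        if code < bound:
            rng = bound
            probs[i] += (2048 - probs[i]) >> 5
            bits.append(0)
        else:
            rng -= bound
            code -= bound
            probs[i] -= probs[i] >> 5
            bits.append(1)
    return bits
```

Encoding a single 1 bit with a fresh probability of 1024 and then terminating yields the five bytes 00 7F FF FC 00; the 0xFF in the middle is a pending byte that was held back in case a carry arrived.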


Dictionary search data structures

The encoder needs to be able to quickly locate matches in the dictionary. Since LZMA uses very large dictionaries (potentially on the order of gigabytes) to improve compression, simply scanning the whole dictionary would result in an encoder too slow to be practically usable, so sophisticated data structures are needed to support fast match searches.


Hash chains

The simplest approach, called "hash chains", is parameterized by a constant ''N'' which can be either 2, 3 or 4, which is typically chosen so that $2^{8 \times N}$ is greater than or equal to the dictionary size.

It consists of creating, for each ''k'' less than or equal to ''N'', a hash table indexed by tuples of ''k'' bytes, where each of the buckets contains the last position where the first ''k'' bytes hashed to the hash value associated with that hash table bucket.

Chaining is achieved by an additional array which stores, for every dictionary position, the last seen previous position whose first ''N'' bytes hash to the same value as the first ''N'' bytes of the position in question.

To find matches of length ''N'' or higher, a search is started using the ''N''-sized hash table, and continued using the hash chain array; the search stops after a pre-defined number of hash chain nodes has been traversed, or when the hash chain "wraps around", indicating that the portion of the input that has been overwritten in the dictionary has been reached.

Matches of size less than ''N'' are instead found by simply looking at the corresponding hash table, which either contains the latest such match, if any, or a string that hashes to the same value; in the latter case, the encoder will not be able to find the match. This issue is mitigated by the fact that for distant short matches using multiple literals might require fewer bits, and having hash conflicts in nearby strings is relatively unlikely; using larger hash tables or even direct lookup tables can reduce the problem at the cost of a higher cache miss rate and thus lower performance.

Note that all matches need to be validated to check that the bytes currently at that specific dictionary position actually match, since the hashing mechanism only guarantees that at some past time there were characters hashing to the hash table bucket index (some implementations may not even guarantee that, because they do not initialize the data structures).
LZMA uses Markov chains, as implied by "M" in its name.
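The hash chain search described in this section can be illustrated with a deliberately simplified sketch. It assumes a Python dict keyed by the ''N'' bytes themselves in place of a real hash table (so hash conflicts cannot occur), keeps the whole input instead of a sliding window, and only handles matches of length at least ''N''. The names `insert_position` and `find_longest_match` and the parameter `max_depth` are hypothetical. Distances follow the LZMA convention in which 0 denotes the most recent byte.

```python
def insert_position(data: bytes, pos: int, head: dict, prev: list, n: int = 3) -> None:
    """Register position `pos` in the chain for its first n bytes."""
    if pos + n <= len(data):
        key = data[pos:pos + n]
        prev[pos] = head.get(key, -1)   # link to the previous occurrence
        head[key] = pos                 # bucket holds the latest position


def find_longest_match(data, pos, head, prev, n=3, max_depth=32, max_len=273):
    """Return (length, distance) of the longest match found, or (0, 0)."""
    if pos + n > len(data):
        return 0, 0
    best_len, best_dist = 0, 0
    cand = head.get(data[pos:pos + n], -1)
    depth = 0
    while cand >= 0 and depth < max_depth:      # bounded chain traversal
        length = 0
        # Validate the candidate byte by byte; it may overlap `pos`.
        while (pos + length < len(data) and length < max_len
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand - 1
        cand = prev[cand]
        depth += 1
    return best_len, best_dist
```

A caller would search at a position first and insert it afterwards, so that a position never matches itself. Because more recent positions are visited first and only strictly longer matches replace the best one, ties are resolved in favour of the smaller distance.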


Binary trees

The binary tree approach follows the hash chain approach, except that it logically uses a binary tree instead of a linked list for chaining. The binary tree is maintained so that it is always both a search tree relative to the suffix lexicographic ordering, and a max-heap for the dictionary position (in other words, the root is always the most recent string, and a child cannot have been added more recently than its parent): assuming all strings are lexicographically ordered, these conditions clearly uniquely determine the binary tree (this is trivially provable by induction on the size of the tree). Since the string to search for and the string to insert are the same, it is possible to perform both dictionary search and insertion (which requires rotating the tree) in a single tree traversal.


Patricia tries

Some old LZMA encoders also supported a data structure based on Patricia tries, but such support has since been dropped since it was deemed inferior to the other options.


LZMA encoder

LZMA encoders can freely decide which match to output, or whether to ignore the presence of matches and output literals anyway. The ability to recall the 4 most recently used distances means that, in principle, using a match with a distance that will be needed again later may be globally optimal even if it is not locally optimal, and as a result of this, optimal LZMA compression probably requires knowledge of the whole input and might require algorithms too slow to be usable in practice. Due to this, practical implementations tend to employ non-global heuristics. The xz encoders use a value called ''nice_len'' (the default is 64): when any match of length at least ''nice_len'' is found, the encoder stops the search and outputs it, with the maximum matching length.


Fast encoder

The XZ fast encoder (derived from the 7-zip fast encoder) is the shortest LZMA encoder in the xz source tree. It works like this:
# Perform combined search and insertion in the dictionary data structure
# If any repeated distance matches with length at least ''nice_len'':
#* Output the most frequently used such distance with a REP packet
# If a match was found of length at least ''nice_len'':
#* Output it with a MATCH packet
# Set the main match to the longest match
# Look at the nearest match of every length in decreasing length order, and until no replacement can be made:
#* Replace the main match with a match which is one character shorter, but whose distance is less than 1/128 the current main match distance
# Set the main match length to 1 if the current main match is of length 2 and distance at least 128
# If a repeated match was found, and it is shorter by at most 1 character than the main match:
#* Output the repeated match with a REP packet
# If a repeated match was found, and it is shorter by at most 2 characters than the main match, and the main match distance is at least 512:
#* Output the repeated match with a REP packet
# If a repeated match was found, and it is shorter by at most 3 characters than the main match, and the main match distance is at least 32768:
#* Output the repeated match with a REP packet
# If the main match size is less than 2 (or there is not any match):
#* Output a LIT packet
# Perform a dictionary search for the next byte
# If the next byte has a match shorter by at most 1 character than the main match, with distance less than 1/128 times the main match distance, and if the main match length is at least 3:
#* Output a LIT packet
# If the next byte has a match at least as long as the main match, and with less distance than the main match:
#* Output a LIT packet
# If the next byte has a match at least one character longer than the main match, and such that 1/128 of its distance is less than or equal to the main match distance:
#* Output a LIT packet
# If the next byte has a match more than one character longer than the main match:
#* Output a LIT packet
# If any repeated match is shorter by at most 1 character than the main match:
#* Output the most frequently used such distance with a REP packet
# Output the main match with a MATCH packet


Normal encoder

The XZ normal encoder (derived from the 7-zip normal encoder) is the other LZMA encoder in the xz source tree, which adopts a more sophisticated approach that tries to minimize the post-range-encoding size of the generated packets.

Specifically, it encodes portions of the input using the result of a dynamic programming algorithm, where the subproblems are finding the approximately optimal encoding (the one with minimal post-range-encoding size) of the substring of length L starting at the byte being compressed.

The size of the portion of the input processed in the dynamic programming algorithm is determined to be the maximum between the longest dictionary match and the longest repeated match found at the start position (which is capped by the maximum LZMA match length, 273); furthermore, if a match longer than ''nice_len'' is found at any point in the range just defined, the dynamic programming algorithm stops, the solution for the subproblem up to that point is output, the ''nice_len''-sized match is output, and a new dynamic programming problem instance is started at the byte after the match is output.

Subproblem candidate solutions are incrementally updated with candidate encodings, constructed taking the solution for a shorter substring of length L', extended with all possible "tails", or sets of 1-3 packets with certain constraints that encode the input at the L' position. Once the final solution of a subproblem is found, the LZMA state and last used distances for it are computed, and are then used to appropriately compute post-range-encoding sizes of its extensions.

At the end of the dynamic programming optimization, the whole optimal encoding of the longest substring considered is output, and encoding continues at the first uncompressed byte not already encoded, after updating the LZMA state and last used distances.

Each subproblem is extended by a packet sequence which we call "tail", which must match one of the following patterns:

The reason for not only extending with single packets is that subproblems only have the substring length as the parameter for performance and algorithmic complexity reasons, while an optimal dynamic programming approach would also require the last used distances and LZMA ''state'' as parameters; thus, extending with multiple packets allows a better approximation of the optimal solution, and specifically better use of LONGREP[0] packets.

The following data is stored for each subproblem (of course, the values stored are for the candidate solution with minimum ''price''), where by "tail" we refer to the packets extending the solution of the smaller subproblem, which are described directly in the following structure:

Note that in the XZ for Java implementation, the ''optPrev'' and ''backPrev'' members are reused to store a forward single-linked list of packets as part of outputting the final solution.
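The core idea of indexing subproblems by substring length can be shown with a heavily simplified sketch. It is not the xz algorithm: it extends each subproblem by a single packet only, ignores the LZMA ''state'' and the repeated distances, and takes the packet prices from caller-supplied functions. All names (`cheapest_parse`, `matches_at`, `literal_price`, `match_price`) are hypothetical.

```python
def cheapest_parse(n, matches_at, literal_price, match_price):
    """Find a minimum-price sequence of LIT/MATCH packets covering n bytes.

    matches_at(i)       -> iterable of (max_length, distance) usable at offset i
    literal_price(i)    -> price of coding the byte at offset i as a literal
    match_price(l, d)   -> price of a match of length l at distance d
    """
    INF = float("inf")
    price = [0] + [INF] * n          # price[L]: best known price for length L
    back = [None] * (n + 1)          # back[L]: (previous length, last packet)

    for i in range(n):
        # Extend the solution for length i with a literal.
        p = price[i] + literal_price(i)
        if p < price[i + 1]:
            price[i + 1], back[i + 1] = p, (i, ("LIT",))
        # Extend it with every usable length of every candidate match.
        for max_length, dist in matches_at(i):
            for length in range(2, min(max_length, n - i) + 1):
                p = price[i] + match_price(length, dist)
                if p < price[i + length]:
                    price[i + length] = p
                    back[i + length] = (i, ("MATCH", length, dist))

    # Walk the back pointers to recover the packet sequence.
    packets = []
    i = n
    while i > 0:
        i, packet = back[i]
        packets.append(packet)
    packets.reverse()
    return price[n], packets
```

With six bytes, a literal price of 9, a match price of 20 and a single length-3 match available at offset 3, the cheapest parse is three literals followed by the match, for a total price of 47 instead of 54 for six literals.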


LZMA2 encoder

The XZ LZMA2 encoder processes the input in chunks of up to 2 MB uncompressed size or 64 KB compressed size, whichever limit is reached first. It hands each chunk to the LZMA encoder and then outputs either an LZMA2 LZMA chunk containing the encoded data or an LZMA2 uncompressed chunk, whichever is shorter (LZMA, like any other compressor, necessarily expands some kinds of data rather than compressing them). The LZMA state is reset only in the first block, if the caller requests a change of properties, and every time a compressed chunk is output. The LZMA properties are changed only in the first block, or if the caller requests a change of properties. The dictionary is reset only in the first block.
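The per-chunk choice between compressed and uncompressed output can be sketched as follows. This is only an illustration of the decision rule, not the LZMA2 chunk format: `encode_chunks` and `lzma_stand_in` are hypothetical names, the 64 KB chunk size is an arbitrary choice for the sketch, and Python's `lzma.compress` with `FORMAT_ALONE` merely stands in for the inner LZMA encoder.

```python
import lzma
from typing import List, Tuple


def lzma_stand_in(chunk: bytes) -> bytes:
    """Stand-in for the inner LZMA encoder (assumption for this sketch)."""
    return lzma.compress(chunk, format=lzma.FORMAT_ALONE)


def encode_chunks(data: bytes, chunk_size: int = 1 << 16) -> List[Tuple[str, bytes]]:
    """Split data into chunks and keep whichever form of each chunk is shorter."""
    out: List[Tuple[str, bytes]] = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        packed = lzma_stand_in(chunk)
        if len(packed) < len(chunk):
            out.append(("lzma", packed))          # compression helped
        else:
            out.append(("uncompressed", chunk))   # store the raw bytes
    return out
```

Feeding it a highly repetitive chunk followed by a chunk of random bytes (for example from `os.urandom`) should yield one chunk of each kind, since random data cannot be shrunk by the compressor.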


Upper encoding layers

Before LZMA2 encoding, depending on the options provided, xz can apply the BCJ filter, which filters executable code to replace relative offsets with absolute ones that are more repetitive, or the delta filter, which replaces each byte with the difference between it and the byte N bytes before it. Parallel encoding is performed by dividing the file into chunks that are distributed to threads and each encoded separately (using, for instance, xz block encoding), resulting in a dictionary reset between chunks in the output file.
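As a small illustration of the delta filter described above, the following sketch replaces each byte with its difference from the byte `dist` positions earlier, modulo 256. The function names and the choice to treat bytes before the start of the input as zero are assumptions of this sketch.

```python
def delta_encode(data: bytes, dist: int = 1) -> bytes:
    """Replace each byte with its difference from the byte `dist` positions back."""
    out = bytearray(len(data))
    for i, value in enumerate(data):
        previous = data[i - dist] if i >= dist else 0  # assume zero history
        out[i] = (value - previous) & 0xFF
    return bytes(out)


def delta_decode(data: bytes, dist: int = 1) -> bytes:
    """Invert delta_encode by adding back the already decoded earlier byte."""
    out = bytearray(len(data))
    for i, diff in enumerate(data):
        previous = out[i - dist] if i >= dist else 0
        out[i] = (diff + previous) & 0xFF
    return bytes(out)
```

A steadily increasing sequence such as the bytes 10, 20, 30, 40 becomes 10, 10, 10, 10 with `dist=1`, which is more repetitive and therefore easier for the following LZMA stage to compress.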


7-Zip reference implementation

The LZMA implementation extracted from 7-Zip is available as LZMA SDK. It was originally dual-licensed under both the GNU LGPL and Common Public License, with an additional special exception for linked binaries, but was placed by Igor Pavlov in the public domain on December 2, 2008, with the release of version 4.62. LZMA2 compression, which is an improved version of LZMA, is now the default compression method for the .7z format, starting with version 9.30 on October 26, 2012.
The reference open source LZMA compression library was originally written in C++ but has been ported to ANSI C, C#, and Java. There are also third-party Python bindings for the C++ library, as well as ports of LZMA to Pascal, Go and Ada.
The 7-Zip implementation uses several variants of hash chains, binary trees and Patricia trees as the basis for its dictionary search algorithm.
In addition to LZMA, the SDK and 7-Zip also implement multiple preprocessing filters intended to improve compression, ranging from simple delta encoding (for images) to BCJ for executable code. The SDK also provides some other compression algorithms used in 7z.
Decompression-only code for LZMA generally compiles to around 5 KB, and the amount of RAM required during decompression is principally determined by the size of the sliding window used during compression. Small code size, relatively low memory overhead (particularly with smaller dictionary lengths) and free source code make the LZMA decompression algorithm well-suited to embedded applications.
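The hash-chain style of dictionary search mentioned above can be sketched in a few lines. This is a generic illustration, not the 7-Zip match finder: the class name `HashChainMatcher`, the 3-byte key, the use of a Python dictionary in place of a hash table, and the `max_chain` limit are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple


class HashChainMatcher:
    """Toy hash-chain match finder over a fixed input buffer."""

    def __init__(self, data: bytes, key_len: int = 3, max_chain: int = 16):
        self.data = data
        self.key_len = key_len
        self.max_chain = max_chain            # how many candidates to try
        self.head: Dict[bytes, int] = {}      # key -> most recent position
        self.prev: List[int] = [-1] * len(data)  # position -> older position, same key

    def insert(self, pos: int) -> None:
        """Register position pos so that later searches can find it."""
        if pos + self.key_len > len(self.data):
            return
        key = self.data[pos:pos + self.key_len]
        self.prev[pos] = self.head.get(key, -1)
        self.head[key] = pos

    def longest_match(self, pos: int) -> Tuple[int, int]:
        """Return (length, distance) of the best match found for pos."""
        data = self.data
        best_len, best_dist = 0, 0
        candidate = self.head.get(data[pos:pos + self.key_len], -1)
        steps = 0
        while candidate >= 0 and steps < self.max_chain:
            length = 0
            while (pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - candidate
            candidate = self.prev[candidate]  # follow the chain to older positions
            steps += 1
        return best_len, best_dist
```

After inserting positions 0 to 2 of `b"abcabcabcd"`, a search at position 3 finds a match of length 6 at distance 3. Limiting the number of chain steps trades match quality for speed, which is the usual motivation for offering several match-finder variants.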


Other implementations

In addition to the 7-Zip reference implementation, the following support the LZMA format.
* xz: a streaming implementation that contains a gzip-like command line tool, supporting both LZMA and LZMA2 in its xz file format. It made its way into several software packages of the Unix-like world thanks to its high performance (compared to bzip2) and small size (compared to gzip). The Linux kernel, dpkg and RPM systems contain xz code, and many software distributors such as kernel.org, Debian and Fedora now use xz for compressing their releases.
* lzip: another LZMA implementation, mostly for Unix-like systems, that competes directly with xz. It mainly features a simpler file format and therefore easier error recovery.
* ZIPX: an extension to the ZIP compression format that was created by WinZip starting with version 12.1. It can also use various other compression methods such as BZip and PPMd.
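As one concrete way to try these formats, Python's standard `lzma` module, which wraps the liblzma library from XZ Utils, can write both the .xz container and the legacy .lzma container. The sketch below is only a usage illustration; the sample data and the helper name `roundtrip_both_formats` are made up for the example.

```python
import lzma


def roundtrip_both_formats(data: bytes) -> dict:
    """Compress data as .xz and as legacy .lzma, then verify both decode."""
    xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ)        # .xz container
    lzma_blob = lzma.compress(data, format=lzma.FORMAT_ALONE)   # legacy .lzma container
    # lzma.decompress auto-detects either container by default.
    assert lzma.decompress(xz_blob) == data
    assert lzma.decompress(lzma_blob) == data
    return {"xz": xz_blob, "lzma": lzma_blob}


blobs = roundtrip_both_formats(b"example payload " * 100)
```

The .xz output starts with the six magic bytes `FD 37 7A 58 5A 00`, whereas the legacy .lzma container has no magic number and begins directly with the encoder properties.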


LZHAM

LZHAM (LZ, Huffman, Arithmetic, Markov) is an LZMA-like implementation that trades compression throughput for very high ratios and higher decompression throughput. It was placed by its author in the public domain on 15 September 2020.


External links


* Official home page
* LZMA Utils = XZ Utils
* Windows Binaries for XZ Utils
* Data compression, Compressors & Archivers