The Lempel–Ziv complexity is a measure that was first presented in the article ''On the Complexity of Finite Sequences'' (IEEE Trans. On IT-22,1 1976), by two Israeli computer scientists,
Abraham Lempel and
Jacob Ziv
Jacob Ziv (; 27 November 1931 – 25 March 2023) was an Israeli electrical engineer and information theorist who developed the LZ family of lossless data compression algorithms alongside Abraham Lempel. He is also a namesake of the Ziv–Zakai ...
. This complexity measure is related to
Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that prod ...
, but the only function it uses is the recursive copy (i.e., the shallow copy).
The underlying mechanism in this complexity measure is the starting point for some algorithms for
lossless data compression
Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits Redundanc ...
, like
LZ77, LZ78 and
LZW. Even though it is based on an elementary principle of words copying, this complexity measure is not too restrictive in the sense that it satisfies the main qualities expected by such a measure: sequences with a certain regularity do not have a too large complexity, and the complexity grows as the sequence grows in length and irregularity.
The Lempel–Ziv complexity can be used to measure the repetitiveness of binary sequences and text, like song lyrics or prose.
Fractal dimension
In mathematics, a fractal dimension is a term invoked in the science of geometry to provide a rational statistical index of complexity detail in a pattern. A fractal pattern changes with the Scaling (geometry), scale at which it is measured.
It ...
estimates of real-world data have also been shown to correlate with Lempel–Ziv complexity.
Principle
Let S be a binary sequence, of length n, for which we have to compute the Lempel–Ziv complexity, denoted C(S). The sequence is read from the left.
Imagine you have a delimiting line, which can be moved in the sequence during the calculation. At first, this line is set just after the first symbol, at the beginning of the sequence. This initial position is called position 1, from where we have to move it to a position 2, which is considered the initial position for the next step (and so on). We have to move the delimiter (starting in position 1) the furthest possible to the right, so that the sub-word between position 1 and the delimiter position be a word of the sequence that starts before the position 1 of the delimiter.
As soon as the delimiter is set on a position where this condition is not met, we stop, move the delimiter to this position, and start again by marking this position as a new initial position (i.e., position 1). Keep iterating until the end of the sequence. The Lempel–Ziv complexity corresponds to the number of iterations needed to finish this procedure.
Said differently, the Lempel–Ziv complexity is the number of different sub-strings (or sub-words) encountered as the binary sequence is viewed as a stream (from left to right).
Formal explanations
The method proposed by Lempel and Ziv uses three notions: reproducibility, producibility and exhaustive history of a sequence, that we defined here.
Notations
Let S be a binary sequence of length n (i.e.,
symbols taking value 0 or 1). Let
, with
, be the sub-word of
from index i to index j (if