In coding theory, a variable-length code is a code which maps source symbols to a ''variable'' number of bits. The equivalent concept in computer science is the ''bit string''.
Variable-length codes can allow sources to be compressed and decompressed with ''zero'' error (lossless data compression) and still be read back symbol by symbol. With the right coding strategy, an independent and identically distributed source may be compressed almost arbitrarily close to its entropy. This is in contrast to fixed-length coding methods, for which data compression is only possible for large blocks of data, and any compression beyond the logarithm of the total number of possibilities comes with a finite (though perhaps arbitrarily small) probability of failure.
Some examples of well-known variable-length coding strategies are Huffman coding, Lempel–Ziv coding, arithmetic coding, and context-adaptive variable-length coding.
Codes and their extensions
The extension of a code is the mapping of finite-length source sequences to finite-length bit strings obtained by concatenating, for each symbol of the source sequence, the corresponding codeword produced by the original code.
Using terms from formal language theory, the precise mathematical definition is as follows: Let <math>S</math> and <math>T</math> be two finite sets, called the source and target alphabets, respectively. A code <math>C: S \to T^*</math> is a total function mapping each symbol from <math>S</math> to a sequence of symbols over <math>T</math>, and the extension of <math>C</math> to a homomorphism of <math>S^*</math> into <math>T^*</math>, which naturally maps each sequence of source symbols to a sequence of target symbols, is referred to as its extension.
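As an illustrative sketch, the extension by concatenation can be written in a few lines of Python; the dictionary representation and the function name are illustrative choices, and the mapping used here is the prefix code that appears as an example later in this article:

<syntaxhighlight lang="python">
# A code maps each source symbol to a string over the target alphabet {'0', '1'}.
# This mapping is the prefix code used as an example later in this article.
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def extension(source_sequence, code):
    """Extension of a code: concatenate the codeword of every source symbol."""
    return ''.join(code[symbol] for symbol in source_sequence)

print(extension('aabacdab', code))  # -> 00100110111010
</syntaxhighlight>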
Classes of variable-length codes
Variable-length codes can be strictly nested in order of decreasing generality as non-singular codes, uniquely decodable codes, and prefix codes. Prefix codes are always uniquely decodable, and these in turn are always non-singular:
Non-singular codes
A code is non-singular if each source symbol is mapped to a different non-empty bit string; that is, the mapping from source symbols to bit strings is injective.
* For example, the mapping <math>M_1 = \{\, a \mapsto 0,\ b \mapsto 0,\ c \mapsto 1 \,\}</math> is ''not'' non-singular because both "a" and "b" map to the same bit string "0"; any extension of this mapping will generate a lossy (non-lossless) coding. Such singular coding may still be useful when some loss of information is acceptable (for example, when such a code is used in audio or video compression, where a lossy coding becomes equivalent to source quantization).
* However, the mapping <math>M_2 = \{\, a \mapsto 1,\ b \mapsto 011,\ c \mapsto 01110,\ d \mapsto 1110,\ e \mapsto 10011 \,\}</math> ''is'' non-singular; its extension will generate a lossless coding, which is useful for general data transmission (but this feature is not always required). It is not necessary for a non-singular code to be more compact than the source; in many applications a larger code is useful, for example as a way to detect or recover from encoding or transmission errors, or in security applications to protect a source from undetectable tampering.
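Non-singularity is a plain injectivity test, so it can be checked mechanically. A minimal sketch in Python, assuming the code is given as a dict from symbols to codeword strings (an illustrative representation, not part of any standard library):

<syntaxhighlight lang="python">
def is_nonsingular(code):
    """True when the symbol-to-codeword mapping is injective and no codeword is empty."""
    codewords = list(code.values())
    return all(codewords) and len(set(codewords)) == len(codewords)

print(is_nonsingular({'a': '0', 'b': '0', 'c': '1'}))  # False: 'a' and 'b' collide (M1)
print(is_nonsingular({'a': '1', 'b': '011', 'c': '01110',
                      'd': '1110', 'e': '10011'}))      # True (M2)
</syntaxhighlight>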
Uniquely decodable codes
A code is uniquely decodable if its extension is non-singular. Whether a given code is uniquely decodable can be decided with the Sardinas–Patterson algorithm, sketched after the examples below.
* The mapping <math>M_3 = \{\, a \mapsto 0,\ b \mapsto 01,\ c \mapsto 011 \,\}</math> is uniquely decodable (this can be demonstrated by looking at the ''follow-set'' after each target bit string in the map: each codeword ends as soon as the next 0 bit is seen, because a 0 cannot extend any existing codeword in the map and therefore unambiguously starts a new codeword).
* Consider again the code <math>M_2</math> from the previous section. [This code is based on an example found in Berstel et al. (2009), Example 2.3.1, p. 63.] This code is ''not'' uniquely decodable, since the string ''011101110011'' can be interpreted as the sequence of codewords ''01110 – 1110 – 011'', but also as the sequence of codewords ''011 – 1 – 011 – 10011''. Two possible decodings of this encoded string are thus given by ''cdb'' and ''babe''. However, such a code is useful when the set of all possible source symbols is completely known and finite, or when there are restrictions (such as a formal syntax) that determine whether source elements of this extension are acceptable. Such restrictions permit the decoding of the original message by checking which of the possible source sequences producing the same encoded string is valid under those restrictions.
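The Sardinas–Patterson test mentioned above repeatedly computes sets of ''dangling suffixes''; the code is uniquely decodable exactly when no such set contains a codeword. A compact sketch in Python (the function names are illustrative):

<syntaxhighlight lang="python">
def dangling_suffixes(A, B):
    """Nonempty suffixes w such that some u in A extends to some v in B, i.e. v = u + w."""
    return {v[len(u):] for u in A for v in B if v != u and v.startswith(u)}

def is_uniquely_decodable(codewords):
    """Sardinas–Patterson test for unique decodability."""
    C = set(codewords)
    S = dangling_suffixes(C, C)   # dangling suffixes among the codewords themselves
    seen = set()
    while S:
        if S & C:                 # a codeword is also a dangling suffix: ambiguity
            return False
        if S <= seen:             # no new suffixes can appear: the search stabilized
            return True
        seen |= S
        S = dangling_suffixes(C, S) | dangling_suffixes(S, C)
    return True

print(is_uniquely_decodable(['0', '01', '011']))                      # True  (M3)
print(is_uniquely_decodable(['1', '011', '01110', '1110', '10011']))  # False (M2)
</syntaxhighlight>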
Prefix codes
A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Other commonly used names for this concept are prefix-free code, instantaneous code, or context-free code.
* The example mapping <math>M_3</math> above is ''not'' a prefix code because we do not know, after reading the bit string "0", whether it encodes an "a" source symbol or whether it is the prefix of the encoding of the "b" or "c" symbols.
* An example of a prefix code is the mapping <math>\{\, a \mapsto 0,\ b \mapsto 10,\ c \mapsto 110,\ d \mapsto 111 \,\}</math>, shown in use below.
:: Example of encoding and decoding:
::: ''aabacdab'' → 00100110111010 → |0|0|10|0|110|111|0|10| → ''aabacdab''
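Because no codeword is a prefix of another, a decoder can emit a symbol the moment its last bit arrives; the first match is always the right one. A minimal sketch in Python of such an instantaneous decoder (the names are illustrative):

<syntaxhighlight lang="python">
def decode_prefix(bits, code):
    """Instantaneous decoding of a prefix code, one bit at a time."""
    inverse = {w: s for s, w in code.items()}
    decoded, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in inverse:     # a complete codeword: safe to emit immediately,
            decoded.append(inverse[buffer])  # since no longer codeword starts with it
            buffer = ''
    if buffer:
        raise ValueError('input ends in the middle of a codeword')
    return ''.join(decoded)

code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(decode_prefix('00100110111010', code))  # -> aabacdab
</syntaxhighlight>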
A special case of prefix codes are block codes. Here, all codewords must have the same length. Block codes are not very useful in the context of source coding, but often serve as forward error correction in the context of channel coding.
Another special case of prefix codes are LEB128 and variable-length quantity (VLQ) codes, which encode arbitrarily large integers as a sequence of octets; that is, every codeword is a multiple of 8 bits.
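As a sketch of the octet-based idea, unsigned LEB128 stores 7 payload bits per octet, least-significant group first, and uses each octet's high bit as a continuation flag. A minimal Python version (illustrative function names; VLQ as used in e.g. MIDI differs in byte order):

<syntaxhighlight lang="python">
def encode_uleb128(n):
    """Unsigned LEB128: 7 data bits per octet; the high bit marks 'more octets follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)  # set continuation bit unless last octet
        if not n:
            return bytes(out)

def decode_uleb128(data):
    n = 0
    for i, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:        # high bit clear: this was the last octet
            return n
    raise ValueError('truncated LEB128 value')

print(encode_uleb128(624485).hex())             # -> e58e26
print(decode_uleb128(bytes.fromhex('e58e26')))  # -> 624485
</syntaxhighlight>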
Advantages
The advantage of a variable-length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low ''expected'' codeword length. For the above example, if the probabilities of (a, b, c, d) were <math>\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\right)</math>, the expected number of bits used to represent a source symbol using the code above would be:
:: <math>1 \times \tfrac{1}{2} + 2 \times \tfrac{1}{4} + 3 \times \tfrac{1}{8} + 3 \times \tfrac{1}{8} = 1.75</math> bits.
As the entropy of this source is 1.75 bits per symbol, this code compresses the source as much as possible so that the source can be recovered with ''zero'' error.
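The calculation above is easy to reproduce; a short Python check, using the probabilities and prefix code from this section (variable names are illustrative):

<syntaxhighlight lang="python">
from math import log2

probabilities = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
code          = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

# Expected codeword length: sum of probability times codeword length.
expected_length = sum(p * len(code[s]) for s, p in probabilities.items())
# Shannon entropy of the source, in bits per symbol.
entropy = -sum(p * log2(p) for p in probabilities.values())

print(expected_length)  # 1.75 bits per symbol
print(entropy)          # 1.75 bits per symbol: the code meets the entropy bound
</syntaxhighlight>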
See also
* Golomb code
* Kruskal count
* Variable-length instruction sets in computing
Further reading
* Salomon, David (2007). ''Variable-length Codes for Data Compression''. Springer. (xii+191 pages) Errata: https://web.archive.org/web/20230920175457/https://www.davidsalomon.name/VLCadvertis/phasedin.pdf
* Berstel, Jean; Perrin, Dominique; Reutenauer, Christophe (2009). ''Codes and Automata''. Cambridge University Press. Draft available online.