In coding theory, a variable-length code is a code which maps source symbols to a ''variable'' number of bits. Variable-length codes can allow sources to be compressed and decompressed with ''zero'' error (lossless data compression) and still be read back symbol by symbol. With the right coding strategy, an independent and identically distributed source may be compressed almost arbitrarily close to its entropy. This is in contrast to fixed-length coding methods, for which data compression is only possible for large blocks of data, and any compression beyond the logarithm of the total number of possibilities comes with a finite (though perhaps arbitrarily small) probability of failure. Some examples of well-known variable-length coding strategies are Huffman coding, Lempel–Ziv coding, arithmetic coding, and context-adaptive variable-length coding.

Codes and their extensions

The extension of a code is the mapping of finite-length source sequences to finite-length bit strings that is obtained by concatenating, for each symbol of the source sequence, the corresponding codeword produced by the original code. Using terms from formal language theory, the precise mathematical definition is as follows: Let $S$ and $T$ be two finite sets, called the source and target alphabets, respectively. A code $C: S \to T^*$ is a total function mapping each symbol from $S$ to a sequence of symbols over $T$. The unique homomorphism of $S^*$ into $T^*$ that agrees with $C$ on single symbols, and which naturally maps each sequence of source symbols to a sequence of target symbols, is referred to as the extension of $C$.
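The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extend` and the code table `example_code` are hypothetical, and the target alphabet is taken to be {0, 1}.

```python
def extend(code: dict[str, str], message: str) -> str:
    """Apply the extension of `code` to a sequence of source symbols.

    `code` maps each source symbol to a codeword (a string over the
    target alphabet); the extension concatenates the codewords in order.
    """
    return "".join(code[symbol] for symbol in message)


# Hypothetical code table over the target alphabet {0, 1}.
example_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

encoded = extend(example_code, "abca")  # "0" + "10" + "110" + "0"
```

Because the extension is a homomorphism, encoding a concatenation gives the concatenation of the encodings, and the empty sequence maps to the empty string.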

Classes of variable-length codes

Variable-length codes can be strictly nested in order of decreasing generality as non-singular codes, uniquely decodable codes, and prefix codes. Prefix codes are always uniquely decodable, and uniquely decodable codes in turn are always non-singular.

Non-singular codes

A code is non-singular if each source symbol is mapped to a different non-empty bit string, i.e. the mapping from source symbols to bit strings is injective.
* For example, the mapping $M_1 = \{\, a \mapsto 0,\ b \mapsto 0,\ c \mapsto 1 \,\}$ is not non-singular because both "a" and "b" map to the same bit string "0"; the extension of this mapping will generate a lossy (non-lossless) coding. Such singular coding may still be useful when some loss of information is acceptable (for example, when such a code is used in audio or video compression, where a lossy coding becomes equivalent to source quantization).
* However, the mapping $M_2 = \{\, a \mapsto 1,\ b \mapsto 011,\ c \mapsto 01110,\ d \mapsto 1110,\ e \mapsto 10011 \,\}$ is non-singular; every individual symbol can be recovered from its codeword, which is a prerequisite for the lossless coding needed in general data transmission (though losslessness is not always required). Non-singularity alone does not guarantee that the extension is lossless, because concatenated codewords may still be ambiguous.

Note that it is not necessary for a non-singular code to be more compact than the source. In many applications a larger code is useful, for example as a way to detect or recover from encoding or transmission errors, or in security applications to protect a source from undetectable tampering.
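Non-singularity is an injectivity check on the code table, which can be sketched as follows. The function name `is_non_singular` is hypothetical. In `m1` only the codewords for "a" and "b" are stated in the text, so the entry for "c" is an assumption; the codewords in `m2` are the ones that appear in the decodings discussed for $M_2$ under uniquely decodable codes.

```python
def is_non_singular(code: dict[str, str]) -> bool:
    """Return True if every symbol has a distinct, non-empty codeword."""
    codewords = list(code.values())
    return all(codewords) and len(set(codewords)) == len(codewords)


# "a" and "b" share the codeword "0"; the entry for "c" is assumed.
m1 = {"a": "0", "b": "0", "c": "1"}

# Codewords taken from the decodings discussed for M_2.
m2 = {"a": "1", "b": "011", "c": "01110", "d": "1110", "e": "10011"}

m1_ok = is_non_singular(m1)  # False: two symbols collide
m2_ok = is_non_singular(m2)  # True: all codewords differ
```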

Uniquely decodable codes

A code is uniquely decodable if its extension is non-singular. Whether a given code is uniquely decodable can be decided with the Sardinas–Patterson algorithm.
* The mapping $M_3 = \{\, a \mapsto 0,\ b \mapsto 01,\ c \mapsto 011 \,\}$ is uniquely decodable. This can be demonstrated by looking at the ''follow-set'' after each target bit string in the map: every codeword begins with a 0 bit and contains no other 0, so a 0 bit can never extend the current codeword into a longer valid one and instead unambiguously starts a new codeword.
* Consider again the code $M_2$ from the previous section (based on an example found in Berstel et al. (2009), Example 2.3.1, p. 63). This code is not uniquely decodable, since the string ''011101110011'' can be interpreted as the sequence of codewords ''01110 – 1110 – 011'', but also as the sequence of codewords ''011 – 1 – 011 – 10011''. Two possible decodings of this encoded string are thus given by ''cdb'' and ''babe''. However, such a code can still be useful when the set of all possible source symbols is completely known and finite, or when there are restrictions (for example a formal syntax) that determine whether source elements of this extension are acceptable. Such restrictions permit the decoding of the original message by checking which of the possible source sequences mapped to the same bit string are valid under those restrictions.
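The test mentioned above can be sketched as follows. This is a minimal, unoptimized rendering of the Sardinas–Patterson idea (repeatedly collect "dangling suffixes" and stop if one of them is itself a codeword), not a reference implementation. All function names are hypothetical. The table `m2` uses the codewords from the two decodings given in the text, and `m3` assumes the codewords 0, 01, and 011.

```python
def left_quotient(prefixes: set[str], words: set[str]) -> set[str]:
    """Return every w such that p + w is in `words` for some p in `prefixes`."""
    return {
        word[len(p):]
        for p in prefixes
        for word in words
        if word.startswith(p)
    }


def is_uniquely_decodable(code: dict[str, str]) -> bool:
    """Sardinas-Patterson style test on the codewords of `code`."""
    codewords = list(code.values())
    words = set(codewords)
    if "" in words or len(words) != len(codewords):
        return False  # a singular code is never uniquely decodable
    # First set: non-empty suffixes left over when one codeword prefixes another.
    current = left_quotient(words, words) - {""}
    seen: set[frozenset[str]] = set()
    while current and frozenset(current) not in seen:
        if current & words:
            return False  # a dangling suffix is itself a codeword
        seen.add(frozenset(current))
        current = left_quotient(words, current) | left_quotient(current, words)
    return True


def decodings(code: dict[str, str], bits: str) -> list[str]:
    """Return every source string whose encoding equals `bits` (brute force)."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if word and bits.startswith(word):
            results.extend(symbol + rest for rest in decodings(code, bits[len(word):]))
    return results


m2 = {"a": "1", "b": "011", "c": "01110", "d": "1110", "e": "10011"}
m3 = {"a": "0", "b": "01", "c": "011"}  # assumed codewords for M_3

m2_unique = is_uniquely_decodable(m2)       # False
m3_unique = is_uniquely_decodable(m3)       # True
ambiguous = decodings(m2, "011101110011")   # includes "cdb" and "babe"
```

The brute-force `decodings` helper is exponential in the worst case and is only meant to make the ambiguity of a short string visible.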

Prefix codes

A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Other commonly used names for this concept are prefix-free code, instantaneous code, or context-free code.
* The example mapping $M_3$ in the previous section is not a prefix code because we don't know after reading the bit string "0" whether it encodes an "a" source symbol, or whether it is the prefix of the encodings of the "b" or "c" symbols.
* An example of a prefix code is the mapping $\{\, a \mapsto 0,\ b \mapsto 10,\ c \mapsto 110,\ d \mapsto 111 \,\}$.
:: Example of encoding and decoding:
::: aabacdab → 00100110111010 → 0, 0, 10, 0, 110, 111, 0, 10 → aabacdab

A special case of prefix codes are block codes. Here all codewords must have the same length. Block codes are not very useful in the context of source coding, but often serve as error-correcting codes in the context of channel coding.

Another special case of prefix codes are variable-length quantity codes, which encode arbitrarily large integers as a sequence of octets, i.e., the length of every codeword is a multiple of 8 bits.
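The prefix property and the instantaneous decoding it enables can be sketched as follows. The function names are hypothetical, and the code table assumes the assignment a → 0, b → 10, c → 110, d → 111, which matches the codeword split 0, 0, 10, 0, 110, 111, 0, 10 of the example bit string.

```python
def is_prefix_code(code: dict[str, str]) -> bool:
    """Return True if no codeword is a prefix of another symbol's codeword."""
    words = list(code.values())
    if not all(words):
        return False  # the empty string is a prefix of everything
    return not any(
        i != j and other.startswith(word)
        for i, word in enumerate(words)
        for j, other in enumerate(words)
    )


def decode_prefix(code: dict[str, str], bits: str) -> str:
    """Decode `bits`, emitting a symbol as soon as its codeword is complete."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded = []
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # safe only because the code is prefix-free
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(decoded)


prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}  # assumed table
message = decode_prefix(prefix_code, "00100110111010")
```

The decoder never needs to look ahead: once the buffered bits match a codeword, no longer codeword can start with them, so the symbol can be emitted immediately.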

Advantages

The advantage of a variable-length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low ''expected'' codeword length. For the above example, if the probabilities of (a, b, c, d) were $\textstyle\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\right)$, the expected number of bits used to represent a source symbol using the code above would be:
:: $1\times\frac{1}{2}+2\times\frac{1}{4}+3\times\frac{1}{8}+3\times\frac{1}{8}=\frac{7}{4}$.
As the entropy of this source is 1.75 bits per symbol, this code compresses the source as much as possible so that the source can be recovered with ''zero'' error.
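The comparison between expected codeword length and entropy can be checked numerically. The sketch below assumes the probabilities 1/2, 1/4, 1/8, 1/8 for (a, b, c, d), which are the values consistent with codeword lengths 1, 2, 3, 3 and an entropy of 1.75 bits, and it assumes the code table a → 0, b → 10, c → 110, d → 111; the function names are hypothetical.

```python
from math import log2


def expected_length(code: dict[str, str], probabilities: dict[str, float]) -> float:
    """Expected number of code bits per source symbol."""
    return sum(p * len(code[symbol]) for symbol, p in probabilities.items())


def entropy(probabilities: dict[str, float]) -> float:
    """Shannon entropy of the source in bits per symbol."""
    return -sum(p * log2(p) for p in probabilities.values() if p > 0)


code = {"a": "0", "b": "10", "c": "110", "d": "111"}                # assumed table
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}       # assumed values

avg_bits = expected_length(code, probabilities)
source_entropy = entropy(probabilities)
```

Under these assumptions both quantities come out to 1.75 bits per symbol, because every probability is a power of two and each codeword length equals the negative base-2 logarithm of its symbol's probability.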

