Golomb coding is a
lossless data compression
Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistic ...
method using a family of
data compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressio ...
codes invented by
Solomon W. Golomb
Solomon Wolf Golomb (; May 30, 1932 – May 1, 2016) was an American mathematician, engineer, and professor of electrical engineering at the University of Southern California, best known for his works on mathematical games. Most notably, he inven ...
in the 1960s. Alphabets following a
geometric distribution
In probability theory and statistics, the geometric distribution is either one of two discrete probability distributions:
* The probability distribution of the number ''X'' of Bernoulli trials needed to get one success, supported on the set \;
* ...
will have a Golomb code as an optimal
prefix code, making Golomb coding highly suitable for situations in which the occurrence of small values in the input stream is significantly more likely than large values.
Rice coding
Rice coding (invented by
Robert F. Rice) denotes using a subset of the family of Golomb codes to produce a simpler (but possibly suboptimal) prefix code. Rice used this set of codes in an
adaptive coding
Adaptive coding refers to variants of entropy encoding methods of lossless data compression. They are particularly suited to streaming data, as they adapt to localized changes in the characteristics of the data, and don't require a first pass over ...
scheme; "Rice coding" can refer either to that adaptive scheme or to using that subset of Golomb codes. Whereas a Golomb code has a tunable parameter that can be any positive integer value, Rice codes are those in which the tunable parameter is a power of two. This makes Rice codes convenient for use on a computer since multiplication and division by 2 can be implemented more efficiently in
binary arithmetic
A binary number is a number expressed in the base-2 numeral system or binary numeral system, a method of mathematical expression which uses only two symbols: typically "0" (zero) and "1" (one).
The base-2 numeral system is a positional notation ...
.
Rice was motivated to propose this simpler subset due to the fact that geometric distributions are often varying with time, not precisely known, or both, so selecting the seemingly optimal code might not be very advantageous.
Rice coding is used as the
entropy encoding
In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...
stage in a number of lossless
image compression
Image compression is a type of data compression applied to digital images, to reduce their cost for storage or transmission. Algorithms may take advantage of visual perception and the statistical properties of image data to provide superior re ...
and
audio data compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressi ...
methods.
Overview
Construction of codes
Golomb coding uses a tunable parameter to divide an input value into two parts: , the result of a division by , and , the remainder. The quotient is sent in
unary coding
Unary coding, or the unary numeral system and also sometimes called thermometer code, is an entropy encoding that represents a natural number, ''n'', with a code of length ''n'' + 1 ( or ''n'' ), usually ''n'' ones followed by a zero (if ...
, followed by the remainder in
truncated binary encoding
Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number ''n''. It is a slightly more general form of binary encodin ...
. When
, Golomb coding is equivalent to unary coding.
Golomb–Rice codes can be thought of as codes that indicate a number by the position of the ''bin'' (), and the ''offset'' within the bin (). The example figure shows the position and offset for the encoding of integer using Golomb–Rice parameter , with source probabilities following a geometric distribution with .
Formally, the two parts are given by the following expression, where is the nonnegative integer being encoded:
:
and
:
.
Both and will be encoded using variable numbers of bits: by a unary code, and by bits for Rice code, or a choice between and bits for Golomb code (i.e. is not a power of 2), with
. If
, then use bits to encode ; otherwise, use +1 bits to encode . Clearly,
if is a power of 2 and we can encode all values of with bits.
The integer treated by Golomb was the run length of a
Bernoulli process
In probability and statistics, a Bernoulli process (named after Jacob Bernoulli) is a finite or infinite sequence of binary random variables, so it is a discrete-time stochastic process that takes only two values, canonically 0 and 1. T ...
, which has a
geometric distribution
In probability theory and statistics, the geometric distribution is either one of two discrete probability distributions:
* The probability distribution of the number ''X'' of Bernoulli trials needed to get one success, supported on the set \;
* ...
starting at 0. The best choice of parameter is a function of the corresponding Bernoulli process, which is parameterized by
the probability of success in a given
Bernoulli trial
In the theory of probability and statistics, a Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is c ...
. is either the median of the distribution or the median ±1. It can be determined by these inequalities:
:
which are solved by
:
.
For the example with :
:
.
The Golomb code for this distribution is equivalent to the
Huffman code for the same probabilities, if it were possible to compute the Huffman code for the infinite set of source values.
Use with signed integers
Golomb's scheme was designed to encode sequences of non-negative numbers. However, it is easily extended to accept sequences containing negative numbers using an ''overlap and interleave'' scheme, in which all values are reassigned to some positive number in a unique and reversible way. The sequence begins: 0, −1, 1, −2, 2, −3, 3, −4, 4, ... The ''n''-th negative value (i.e., ) is mapped to the ''n''
th odd number (), and the ''m''
th positive value is mapped to the ''m''-th even number (). This may be expressed mathematically as follows: a positive value is mapped to (
), and a negative value is mapped to (
). Such a code may be used for simplicity, even if suboptimal. Truly optimal codes for two-sided geometric distributions include multiple variants of the Golomb code, depending on the distribution parameters, including this one.
Simple algorithm
Below is the Rice–Golomb encoding, where the remainder code uses simple truncated binary encoding, also named "Rice coding" (other varying-length binary encodings, like arithmetic or Huffman encodings, are possible for the remainder codes, if the statistic distribution of remainder codes is not flat, and notably when not all possible remainders after the division are used). In this algorithm, if the ''M'' parameter is a power of 2, it becomes equivalent to the simpler Rice encoding:
# Fix the parameter ''M'' to an integer value.
# For ''N'', the number to be encoded, find
## quotient = ''q'' = floor(''N''/''M'')
## remainder = ''r'' = ''N'' modulo ''M''
# Generate codeword
## The code format :
, where
## Quotient code (in unary coding
Unary coding, or the unary numeral system and also sometimes called thermometer code, is an entropy encoding that represents a natural number, ''n'', with a code of length ''n'' + 1 ( or ''n'' ), usually ''n'' ones followed by a zero (if ...
)
### Write a ''q''-length string of 1 bits (alternatively, of 0 bits)
### Write a 0 bit (respectively, a 1 bit)
## Remainder code (in truncated binary encoding
Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number ''n''. It is a slightly more general form of binary encodin ...
)
### Let
#### If code ''r'' in binary representation using ''b'' bits.
#### If code the number in binary representation using ''b'' + 1 bits.
Decoding:
# Decode the unary representation of ''q'' (count the number of 1 in the beginning of the code)
# Skip the 0 delimiter
# Let
## Interpret next ''b'' bits as a binary number ''r. If holds, then the reminder
## Otherwise interpret ''b + 1'' bits as a binary number ''r, the reminder is given by
# Compute
Example
Set . Thus . The cutoff is .
For example, with a Rice–Golomb encoding using parameter , the decimal number 42 would first be split into = 4 and = 2, and would be encoded as qcode(),rcode() = qcode(4),rcode(2) = 11110,010 (you don't need to encode the separating comma in the output stream, because the 0 at the end of the code is enough to say when ends and begins ; both the qcode and rcode are self-delimited).
Use for run-length encoding
:''Note that and are reversed in this section compared to the use in earlier sections.''
Given an alphabet of two symbols, or a set of two events, ''P'' and ''Q'', with probabilities ''p'' and () respectively, where , Golomb coding can be used to encode runs of zero or more ''P''′s separated by single ''Q''′s. In this application, the best setting of the parameter ''M'' is the nearest integer to . When ''p'' = 1/2, ''M'' = 1, and the Golomb code corresponds to unary ( ''P''′s followed by a ''Q'' is encoded as ''n'' ones followed by a zero). If a simpler code is desired, one can assign Golomb–Rice parameter (i.e., Golomb parameter ) to the integer nearest to ; although not always the best parameter, it is usually the best Rice parameter and its compression performance is quite close to the optimal Golomb code. (Rice himself proposed using various codes for the same data to figure out which was best. A later JPL researcher proposed various methods of optimizing or estimating the code parameter.)
Consider using a Rice code with a binary portion having bits to run-length encode sequences where ''P'' has a probability . If