Run-length encoding (RLE) is a form of lossless data compression in which runs of data (consecutive occurrences of the same data value) are stored as a single occurrence of that data value and a count of its consecutive occurrences, rather than as the original run. As a hypothetical example, when encoding an image built up from colored dots, the sequence "green green green green green green green green green" is shortened to "green x 9". RLE is most efficient on data that contains many such runs, for example simple graphic images such as icons, line drawings, games, and animations. For files that do not have many runs, encoding them with RLE can increase the file size.
RLE may also refer in particular to an early graphics file format supported by CompuServe for compressing black and white images, which was widely supplanted by their later Graphics Interchange Format (GIF).
RLE also refers to a little-used image format in Windows 3.x that is saved with the file extension rle; it is a run-length encoded bitmap, and was used as the format for the Windows 3.x startup screen.
History and applications
Run-length encoding (RLE) schemes were employed in the transmission of analog television signals as far back as 1967.
In 1983, run-length encoding was patented by Hitachi.
RLE is particularly well suited to palette-based bitmap images (which use relatively few colours) such as computer icons, and was a popular image compression method on early online services such as CompuServe before the advent of more sophisticated formats such as GIF.
It does not work well on continuous-tone images (which use very many colours) such as photographs, although JPEG uses it on the coefficients that remain after transforming and quantizing image blocks.
Common formats for run-length encoded data include Truevision TGA, PackBits (by Apple, used in MacPaint), PCX and ILBM. The International Telecommunication Union also describes a standard to encode run-length colour for fax machines, known as T.45. That fax colour coding standard, which along with other techniques is incorporated into Modified Huffman coding, is relatively efficient because most faxed documents are primarily white space, with occasional interruptions of black.
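As a concrete illustration of one of the formats listed above, the following is a minimal sketch of a PackBits-style decoder. It assumes the commonly documented layout, in which a signed header byte n is followed either by n + 1 literal bytes (n from 0 to 127) or by a single byte to be repeated 1 - n times (n from -1 to -127), with -128 skipped. The function name `unpackbits` is hypothetical and error handling is omitted.

```python
def unpackbits(data: bytes) -> bytes:
    """Decode a PackBits-style byte stream (sketch, no error handling)."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Interpret the header byte as a signed 8-bit integer.
        n = data[i] - 256 if data[i] > 127 else data[i]
        i += 1
        if n >= 0:
            # Literal run: copy the next n + 1 bytes unchanged.
            out += data[i:i + n + 1]
            i += n + 1
        elif n != -128:
            # Repeated run: the next byte occurs 1 - n times.
            out += bytes([data[i]]) * (1 - n)
            i += 1
        # n == -128 is treated as a no-op and skipped.
    return bytes(out)


packed = bytes.fromhex("FE AA 02 80 00 2A FD AA 03 80 00 2A 22 F7 AA")
print(unpackbits(packed).hex(" "))
```

Mixing literal runs with repeated runs is what keeps this kind of scheme from doubling the size of data that has few repeats.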
Algorithm
RLE has a space complexity of O(n), where n is the size of the input data.
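The output of RLE grows at most linearly with the input, because every run produces exactly one count and one symbol. The sketch below contrasts the best and worst cases for a simple count-then-symbol text form; the helper name `rle_text` is an assumption made for this example.

```python
from itertools import groupby


def rle_text(s: str) -> str:
    """Count-then-symbol RLE of a string (illustrative helper)."""
    return "".join(f"{len(list(g))}{k}" for k, g in groupby(s))


best = "A" * 6     # one long run
worst = "ABCDEF"   # no two adjacent symbols are equal

print(rle_text(best), len(rle_text(best)))    # 6A 2
print(rle_text(worst), len(rle_text(worst)))  # 1A1B1C1D1E1F 12
```

In this textual form an input with no runs doubles in size, which is why practical formats usually add a way to store literal data without a count.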
Encoding algorithm
Run-length encoding compresses data by reducing the physical size of a repeating string of characters. This process involves converting the input data into a compressed format by identifying and counting consecutive occurrences of each character. The steps are as follows:
# Traverse the input data.
# Count the number of consecutive repeating characters (run length).
# Store the character and its run length.
Python implementation
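from itertools import groupby

def rle_encode(iterable, *, length_first=True):
    """
    >>> "".join(rle_encode("AAAABBBCCDAA"))
    '4A3B2C1D2A'
    >>> "".join(rle_encode("AAAABBBCCDAA", length_first=False))
    'A4B3C2D1A2'
    """
    def ilen(g):
        # Length of iterable g (consumes the iterable).
        return sum(1 for _ in g)

    # groupby yields one (value, run) pair per run of equal adjacent values.
    return (
        f"{ilen(g)}{k}" if length_first else f"{k}{ilen(g)}"
        for k, g in groupby(iterable)
    )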
Decoding algorithm
The decoding process involves reconstructing the original data from the encoded format by repeating characters according to their counts. The steps are as follows:
# Traverse the encoded data.
# For each count-character pair, repeat the character as many times as the count specifies.
# Append these characters to the result string.
Python implementation
from itertools import batched, chain, repeat  # batched requires Python 3.12+

def rle_decode(iterable, *, length_first=True):
    """
    >>> "".join(rle_decode("4A3B2C1D2A"))
    'AAAABBBCCDAA'
    >>> "".join(rle_decode("A4B3C2D1A2", length_first=False))
    'AAAABBBCCDAA'
    """
    # Splitting into pairs assumes every run length is a single digit.
    return chain.from_iterable(
        repeat(b, int(a)) if length_first else repeat(a, int(b))
        for a, b in batched(iterable, 2)
    )
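Decoding by fixed two-character pairs only works while every count is a single digit. A minimal sketch of a decoder that accepts longer counts is given below. It assumes count-then-symbol text in which the symbols themselves are never digits, and the name `rle_decode_text` is hypothetical.

```python
import re


def rle_decode_text(encoded: str) -> str:
    """Decode count-then-symbol text where counts may span several digits."""
    # Each match is a decimal count followed by exactly one non-digit symbol.
    return "".join(
        symbol * int(count)
        for count, symbol in re.findall(r"(\d+)(\D)", encoded)
    )


print(rle_decode_text("4A3B2C1D2A"))  # AAAABBBCCDAA
print(rle_decode_text("12W1B3W"))     # WWWWWWWWWWWWBWWW
```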
Example
Consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
With a run-length encoding (RLE) data compression algorithm applied to the above hypothetical scan line, it can be rendered as follows:
12W1B12W3B24W1B14W
This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc., and represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).
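To illustrate why back-references generalize runs, the sketch below compares the two approaches on the repeating string BWWBWWBWWBWW. It is a simplified illustration and not DEFLATE itself: the `(distance, length)` copy operation and the helper names `rle_text` and `copy_back` are assumptions made for this example.

```python
from itertools import groupby


def rle_text(s: str) -> str:
    """Count-then-symbol RLE of a string."""
    return "".join(f"{len(list(g))}{k}" for k, g in groupby(s))


def copy_back(prefix: str, distance: int, length: int) -> str:
    """Append `length` symbols, each copied from `distance` positions back."""
    out = list(prefix)
    for _ in range(length):
        # The source may overlap the symbols just written, which is what
        # lets a short back-reference reproduce a long periodic sequence.
        out.append(out[-distance])
    return "".join(out)


data = "BWWBWWBWWBWW"
print(rle_text(data))          # 1B2W1B2W1B2W1B2W (longer than the input)
print(copy_back("BWW", 3, 9))  # BWWBWWBWWBWW
print(copy_back("W", 1, 11))   # WWWWWWWWWWWW
```

The last call shows the connection: an ordinary run is the special case of a back-reference with distance 1.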
Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths for runs of two or more characters only, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. Applied to the previous example, this would give the following:
WW12BWW12BB3WW24BWW14
This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.
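The doubled-character variant can be sketched as follows. This is one possible reading of the scheme described above, assuming that symbols are never digits; the name `rle_doubled` is hypothetical.

```python
from itertools import groupby


def rle_doubled(s: str) -> str:
    """Encode runs of two or more as the symbol twice followed by the count."""
    parts = []
    for symbol, group in groupby(s):
        n = len(list(group))
        # A single symbol is emitted as-is; a doubled symbol signals a run.
        parts.append(symbol if n == 1 else f"{symbol}{symbol}{n}")
    return "".join(parts)


line = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
print(rle_doubled(line))  # WW12BWW12BB3WW24BWW14
```

Isolated symbols cost nothing extra in this form, while a run of exactly two grows by one character ("WW" becomes "WW2").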
One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may be large, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12,12,3,24,14).
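A minimal sketch of this separation, assuming the doubled-character escape convention and a hypothetical function name `rle_split`, might look like this:

```python
from itertools import groupby


def rle_split(s: str) -> tuple[str, list[int]]:
    """Return the symbol/escape stream and the run lengths separately."""
    symbols, lengths = [], []
    for symbol, group in groupby(s):
        n = len(list(group))
        if n == 1:
            symbols.append(symbol)
        else:
            symbols.append(symbol * 2)  # doubled symbol marks a run
            lengths.append(n)           # its length goes to the second stream
    return "".join(symbols), lengths


line = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
print(rle_split(line))  # ('WWBWWBBWWBWW', [12, 12, 3, 24, 14])
```

Each stream can then be passed to a different second-stage coder, since the symbol stream and the list of integers tend to have different statistics.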
Variants
* Sequential RLE: This method processes data one line at a time, scanning from left to right. It is commonly employed in image compression. Other variations of this technique include scanning the data vertically, diagonally, or in blocks.
* Lossy RLE: In this variation, some bits are intentionally discarded during compression (often by setting the one or two least significant bits of each pixel to 0). This lengthens runs and so leads to higher compression rates while minimally impacting the visual quality of the image.
* Adaptive RLE: Uses different encoding schemes depending on the length of runs to optimize compression ratios. For example, short runs might use a different encoding format than long runs.
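The lossy variant can be illustrated with a short sketch. It assumes 8-bit pixel values and that the discarded bits are the lowest ones; the name `lossy_rle` and the sample values are hypothetical.

```python
from itertools import groupby


def lossy_rle(pixels: list[int], drop_bits: int = 2) -> list[tuple[int, int]]:
    """Clear the lowest `drop_bits` bits of each 8-bit value, then run-length encode."""
    mask = 0xFF & ~((1 << drop_bits) - 1)  # drop_bits=2 gives 0b11111100
    quantized = (p & mask for p in pixels)
    return [(value, len(list(run))) for value, run in groupby(quantized)]


pixels = [200, 201, 203, 202, 204, 205]
print(lossy_rle(pixels, drop_bits=0))  # six runs of length 1: nothing merges
print(lossy_rle(pixels))               # [(200, 4), (204, 2)]
```

Nearly equal neighbouring values collapse into the same quantized value, so six distinct pixels become two runs at the cost of a small error in each pixel.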
See also
* Kolakoski sequence
* Look-and-say sequence
* Comparison of graphics file formats
* Golomb coding
* Burrows–Wheeler transform
* Recursive indexing
* Run-length limited
* Bitmap index
* Forsyth–Edwards Notation, which uses run-length encoding for empty spaces in chess positions
* DEFLATE
* Convolution
* Huffman coding
* Arithmetic coding
External links
* Run-length encoding implemented in different programming languages (on Rosetta Code)
* Single Header Run-Length Encoding Library: smallest possible implementation (about 20 SLoC) in ANSI C. FOSS, compatible with Truevision TGA, supports 8, 16, 24 and 32 bit elements too.