Range coding (or range encoding) is an
entropy coding
In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...
method defined by G. Nigel N. Martin in a 1979 paper,
[G. Nigel N. Martin, ''Range encoding: An algorithm for removing redundancy from a digitized message'']
Video & Data Recording Conference, Southampton
Southampton () is a port city in the ceremonial county of Hampshire in southern England. It is located approximately south-west of London and west of Portsmouth. The city forms part of the South Hampshire built-up area, which also covers Po ...
, UK, July 24–27, 1979. which effectively rediscovered the FIFO arithmetic code first introduced by Richard Clark Pasco in 1976. Given a stream of symbols and their probabilities, a range coder produces a space-efficient stream of bits to represent these symbols and, given the stream and the probabilities, a range decoder reverses the process.
Range coding is very similar to
arithmetic coding
Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic ...
, except that coding is done with digits in any base, instead of with bits, and so it is faster when using larger bases (e.g. a
byte
The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...
) at small cost in compression efficiency. After the expiration of the first (1978) arithmetic coding patent,
[ — (IBM) Filed March 4, 1977, Granted 24 October 1978 (Now expired)] range coding appeared to clearly be free of patent encumbrances. This particularly drove interest in the technique in the
open source community. Since that time, patents on various well-known arithmetic coding techniques have also expired.
How range coding works
Range coding conceptually encodes all the symbols of the message into one number, unlike
Huffman coding
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algo ...
which assigns each symbol a bit-pattern and concatenates all the bit-patterns together. Thus range coding can achieve greater compression ratios than the one-bit-per-symbol lower bound on Huffman coding and it does not suffer the inefficiencies that Huffman does when dealing with probabilities that are not an exact
power of two
A power of two is a number of the form where is an integer, that is, the result of exponentiation with number two as the base and integer as the exponent.
In a context where only integers are considered, is restricted to non-negativ ...
.
The central concept behind range coding is this: given a large-enough range of
integer
An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign ( −1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the languag ...
s, and a probability estimation for the symbols, the initial range can easily be divided into sub-ranges whose sizes are proportional to the probability of the symbol they represent. Each symbol of the message can then be encoded in turn, by reducing the current range down to just that sub-range which corresponds to the next symbol to be encoded. The decoder must have the same probability estimation the encoder used, which can either be sent in advance, derived from already transferred data or be part of the compressor and decompressor.
When all symbols have been encoded, merely identifying the sub-range is enough to communicate the entire message (presuming of course that the decoder is somehow notified when it has extracted the entire message). A single integer is actually sufficient to identify the sub-range, and it may not even be necessary to transmit the entire integer; if there is a sequence of digits such that every integer beginning with that prefix falls within the sub-range, then the prefix alone is all that's needed to identify the sub-range and thus transmit the message.
Example
Suppose we want to encode the message "AABA
", where is the end-of-message symbol. For this example it is assumed that the decoder knows that we intend to encode exactly five symbols in the base 10 number system
The decimal numeral system (also called the base-ten positional numeral system and denary or decanary) is the standard system for denoting integer and non-integer numbers. It is the extension to non-integer numbers of the Hindu–Arabic numeral ...
(allowing for 105 different combinations of symbols with the range )_using_the_probability_distribution">,_100000))_using_the_probability_distribution_._The_encoder_breaks_down_the_range_[0,_100000)_into_three_subranges:
_A:_____[_____0,__60000)
_B:_____[_60000,__80000)
_:_[_80000,_100000)
Since_our_first_symbol_is_an_A,_it_reduces_our_initial_range_down_to_[0,_60000)._The_second_symbol_choice_leaves_us_with_three_sub-ranges_of_this_range._We_show_them_following_the_already-encoded_'A':
_AA:_____[_____0,__36000)
_AB:_____[_36000,__48000)
_A:_[_48000,__60000)
With_two_symbols_encoded,_our_range_is_now_[0,_36000)_and_our_third_symbol_leads_to_the_following_choices:
_AAA:_____[_____0,__21600)
_AAB:_____[_21600,__28800)
_AA:_[_28800,__36000)
This_time_it_is_the_second_of_our_three_choices_that_represent_the_message_we_want_to_encode,_and_our_range_becomes_[21600,_28800)._It_may_look_harder_to_determine_our_sub-ranges_in_this_case,_but_it_is_actually_not:_we_can_merely_subtract_the_lower_bound_from_the_upper_bound_to_determine_that_there_are_7200_numbers_in_our_range;_that_the_first_4320_of_them_represent_0.60_of_the_total,_the_next_1440_represent_the_next_0.20,_and_the_remaining_1440_represent_the_remaining_0.20_of_the_total._Adding_back_the_lower_bound_gives_us_our_ranges:
_AABA:_____[21600,_25920)
_AABB:_____[25920,_27360)
_AAB:_[27360,_28800)
Finally,_with_our_range_narrowed_down_to_[21600,_25920),_we_have_just_one_more_symbol_to_encode._Using_the_same_technique_as_before_for_dividing_up_the_range_between_the_lower_and_upper_bound,_we_find_the_three_sub-ranges_are:
_AABAA:_____[21600,_24192)
_AABAB:_____[24192,_25056)
_AABA:_[25056,_25920)
And_since__is_our_final_symbol,_our_final_range_is_[25056,_25920)._Because_all_five-digit_integers_starting_with_"251"_fall_within_our_final_range,_it_is_one_of_the_three-digit_prefixes_we_could_transmit_that_would_unambiguously_convey_our_original_message._(The_fact_that_there_are_actually_eight_such_prefixes_in_all_implies_we_still_have_inefficiencies._They_have_been_introduced_by_our_use_of_ )_using_the_probability_distribution">,_100000))_using_the_probability_distribution_._The_encoder_breaks_down_the_range_[0,_100000)_into_three_subranges:
_A:_____[_____0,__60000)
_B:_____[_60000,__80000)
_:_[_80000,_100000)
Since_our_first_symbol_is_an_A,_it_reduces_our_initial_range_down_to_[0,_60000)._The_second_symbol_choice_leaves_us_with_three_sub-ranges_of_this_range._We_show_them_following_the_already-encoded_'A':
_AA:_____[_____0,__36000)
_AB:_____[_36000,__48000)
_A:_[_48000,__60000)
With_two_symbols_encoded,_our_range_is_now_[0,_36000)_and_our_third_symbol_leads_to_the_following_choices:
_AAA:_____[_____0,__21600)
_AAB:_____[_21600,__28800)
_AA:_[_28800,__36000)
This_time_it_is_the_second_of_our_three_choices_that_represent_the_message_we_want_to_encode,_and_our_range_becomes_[21600,_28800)._It_may_look_harder_to_determine_our_sub-ranges_in_this_case,_but_it_is_actually_not:_we_can_merely_subtract_the_lower_bound_from_the_upper_bound_to_determine_that_there_are_7200_numbers_in_our_range;_that_the_first_4320_of_them_represent_0.60_of_the_total,_the_next_1440_represent_the_next_0.20,_and_the_remaining_1440_represent_the_remaining_0.20_of_the_total._Adding_back_the_lower_bound_gives_us_our_ranges:
_AABA:_____[21600,_25920)
_AABB:_____[25920,_27360)
_AAB:_[27360,_28800)
Finally,_with_our_range_narrowed_down_to_[21600,_25920),_we_have_just_one_more_symbol_to_encode._Using_the_same_technique_as_before_for_dividing_up_the_range_between_the_lower_and_upper_bound,_we_find_the_three_sub-ranges_are:
_AABAA:_____[21600,_24192)
_AABAB:_____[24192,_25056)
_AABA:_[25056,_25920)
And_since__is_our_final_symbol,_our_final_range_is_[25056,_25920)._Because_all_five-digit_integers_starting_with_"251"_fall_within_our_final_range,_it_is_one_of_the_three-digit_prefixes_we_could_transmit_that_would_unambiguously_convey_our_original_message._(The_fact_that_there_are_actually_eight_such_prefixes_in_all_implies_we_still_have_inefficiencies._They_have_been_introduced_by_our_use_of_decimal">base_10_
The_decimal_numeral_system_(also_called_the_base-ten__positional_numeral_system_and_denary__or_decanary)_is_the_standard_system_for_denoting_integer_and_non-integer_numbers._It_is_the_extension_to_non-integer_numbers_of_the__Hindu–Arabic_numer_...
_rather_than_Binary_numeral_system.html" ;"title="decimal.html" "title="probability_distribution.html" ;"title=", 100000)) using the probability distribution">, 100000)) using the probability distribution . The encoder breaks down the range
. The second symbol choice leaves us with three sub-ranges of this range. We show them following the already-encoded 'A':
This time it is the second of our three choices that represent the message we want to encode, and our range becomes
. It may look harder to determine our sub-ranges in this case, but it is actually not: we can merely subtract the lower bound from the upper bound to determine that there are 7200 numbers in our range; that the first 4320 of them represent 0.60 of the total, the next 1440 represent the next 0.20, and the remaining 1440 represent the remaining 0.20 of the total. Adding back the lower bound gives us our ranges:
, we have just one more symbol to encode. Using the same technique as before for dividing up the range between the lower and upper bound, we find the three sub-ranges are: