A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer, with the eighth bit of each octet used to mark continuation. VLQ is identical to LEB128 except in endianness. See the example below.
Applications and history
Base-128 compression is known by many names: VB (Variable Byte), VByte, Varint, VInt, EncInt, etc. [Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression". 2017.]
A variable-length quantity (VLQ) was defined for use in the standard MIDI file format [MIDI File Format: Variable Quantities] to save additional space for a resource-constrained system, and is also used in the later Extensible Music Format (XMF).
Base-128 is also used in ASN.1 BER encoding to encode tag numbers and object identifiers. It is also used in the WAP environment, where it is called variable length unsigned integer or uintvar. The DWARF debugging format defines a variant called LEB128 (or ULEB128 for unsigned numbers), where the least significant group of 7 bits is encoded in the first byte, and the most significant bits are in the last byte (so effectively it is the little-endian analog of a VLQ).
Google Protocol Buffers use a similar format for a compact representation of integer values, as do Oracle Portable Object Format (POF) and the Microsoft .NET Framework's "7-bit encoded int" in the BinaryReader and BinaryWriter classes.
It is also used extensively in web browsers for source maps, which contain many integer line and column number mappings, to keep the size of the map to a minimum. [Introduction to javascript source maps]
Variable-width integers in LLVM use a similar principle. The encoding chunks are little-endian and need not be 8 bits in size. The LLVM documentation describes a field that uses 4-bit chunks, with each chunk consisting of 1 continuation bit and 3 payload bits.
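The chunked scheme with a configurable chunk width can be sketched in Python as follows. This is a minimal illustration, not LLVM's own code: the function names `encode_vbr` and `decode_vbr` are hypothetical, and the sketch assumes that the top bit of each chunk is the continuation flag and that chunks are emitted least significant first.

```python
def encode_vbr(value: int, width: int = 4) -> list[int]:
    """Split a non-negative integer into little-endian chunks of `width` bits.

    Each chunk carries width - 1 payload bits; its top bit is set when
    another chunk follows. (Assumed layout, for illustration only.)
    """
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    if width < 2:
        raise ValueError("width must be at least 2 bits")
    payload_bits = width - 1
    payload_mask = (1 << payload_bits) - 1
    continuation = 1 << payload_bits
    chunks = []
    while True:
        chunk = value & payload_mask
        value >>= payload_bits
        if value:
            chunks.append(chunk | continuation)  # more chunks follow
        else:
            chunks.append(chunk)                 # final chunk
            return chunks


def decode_vbr(chunks: list[int], width: int = 4) -> int:
    """Reassemble the payload bits of little-endian chunks."""
    payload_bits = width - 1
    payload_mask = (1 << payload_bits) - 1
    value = 0
    for position, chunk in enumerate(chunks):
        value |= (chunk & payload_mask) << (position * payload_bits)
    return value
```

With 4-bit chunks, the value 27 (binary 11011) splits into the chunks 1011 and 0011: the low three payload bits 011 with the continuation bit set, followed by the remaining bits 011 with the continuation bit clear.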
General structure
The encoding assumes an octet (an 8-bit byte) where the most significant bit (MSB), also commonly known as the sign bit, is reserved to indicate whether another VLQ octet follows.
In each octet, ''A'' denotes the MSB and ''B'' the remaining 7 bits.
If ''A'' is 0, then this is the last VLQ octet of the integer. If ''A'' is 1, then another VLQ octet follows.
''B'' is a 7-bit number in the range [0x00, 0x7F], and ''n'' is the position of the VLQ octet, where $B_0$ is the least significant. The VLQ octets are arranged most significant first in a stream.
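The structure described above can be sketched in Python. The function names `encode_vlq` and `decode_vlq` are hypothetical and chosen for illustration; the sketch assumes unsigned input and most-significant-first octet order, as in the description.

```python
def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as a big-endian VLQ."""
    if value < 0:
        raise ValueError("plain VLQ is only defined for unsigned integers")
    octets = [value & 0x7F]            # last octet: continuation bit clear
    value >>= 7
    while value:
        octets.append((value & 0x7F) | 0x80)  # earlier octets: bit set
        value >>= 7
    return bytes(reversed(octets))     # most significant group first


def decode_vlq(data: bytes) -> tuple[int, int]:
    """Decode one VLQ from the start of `data`.

    Returns (value, number of octets consumed).
    """
    value = 0
    for consumed, octet in enumerate(data, start=1):
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:           # continuation bit clear: last octet
            return value, consumed
    raise ValueError("truncated VLQ: final octet has its continuation bit set")
```

For instance, `encode_vlq(137)` yields the two octets 0x81 0x09, and `decode_vlq(b"\x81\x09")` returns `(137, 2)`.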
Variants
The general VLQ encoding is simple, but in basic form is only defined for unsigned integers (nonnegative: positive or zero), and is somewhat redundant, since prepending 0x80 octets corresponds to zero padding. There are various signed number representations to handle negative numbers, and techniques to remove the redundancy.
Group Varint Encoding
Google developed Group Varint Encoding (GVE) after observing that traditional VLQ encoding incurs many CPU branches during decompression. GVE uses a single byte as a header for four variable-length uint32 values. The header byte holds four 2-bit numbers representing the storage length of each of the following four uint32s. Such a layout eliminates the need to check and remove VLQ continuation bits. Data bytes can be copied directly to their destination. This layout reduces CPU branches, making GVE faster than VLQ on modern pipelined CPUs.
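A minimal Python sketch of the Group Varint layout follows. Several details are assumptions made for illustration and are not specified in the text above: each 2-bit field stores the byte length minus one, the first value's field occupies the lowest two bits of the header, and values are stored little-endian. The names `encode_group` and `decode_group` are hypothetical.

```python
def encode_group(values: list[int]) -> bytes:
    """Encode exactly four uint32 values behind one header byte."""
    if len(values) != 4:
        raise ValueError("a group holds exactly four values")
    header = 0
    body = bytearray()
    for index, value in enumerate(values):
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError("values must fit in uint32")
        length = max(1, (value.bit_length() + 7) // 8)  # 1 to 4 bytes
        header |= (length - 1) << (2 * index)           # assumed field order
        body += value.to_bytes(length, "little")        # assumed byte order
    return bytes([header]) + bytes(body)


def decode_group(data: bytes) -> tuple[list[int], int]:
    """Decode one group; returns (four values, bytes consumed)."""
    header = data[0]
    offset = 1
    values = []
    for index in range(4):
        length = ((header >> (2 * index)) & 0b11) + 1
        if offset + length > len(data):
            raise ValueError("truncated group")
        # No per-byte continuation checks: the length is known up front.
        values.append(int.from_bytes(data[offset:offset + length], "little"))
        offset += length
    return values, offset
```

Because every length is read from the header before any data byte is touched, the decoder never inspects individual data bytes for a continuation flag.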
PrefixVarint is a similar design but with a uint64 maximum. It is said to have "been invented multiple times independently". It can be changed into a chained version that allows arbitrarily many continuations.
Signed numbers
Sign bit
Negative numbers can be handled using a sign bit, which only needs to be present in the first octet.
In the data format for Unreal Packages used by the Unreal Engine, a variable-length quantity scheme called Compact Indices is used. The only difference in this encoding is that the first VLQ octet has the sixth bit reserved to indicate whether the encoded integer is positive or negative. Any subsequent VLQ octet follows the general structure.
If ''A'' is 0, then this is the last VLQ octet of the integer. If ''A'' is 1, then another VLQ octet follows.
If ''B'' is 0, then the VLQ represents a positive integer. If ''B'' is 1, then the VLQ represents a negative number.
''C'' is the number chunk being encoded, and ''n'' is the position of the VLQ octet, where $C_0$ is the least significant. The VLQ octets are arranged least significant first in a stream.
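The sign-bit variant can be sketched in Python as below. This follows only the description given here and does not claim byte compatibility with actual Unreal packages. It assumes that in the first octet bit 7 is the continuation flag ''A'', bit 6 is the sign flag ''B'', and bits 5 to 0 hold the lowest 6 bits of the magnitude, while later octets carry 7 magnitude bits each. The function names are hypothetical.

```python
def encode_signed_vlq(value: int) -> bytes:
    """Sign-and-magnitude VLQ, least significant group first (assumed layout)."""
    sign = 0x40 if value < 0 else 0x00
    magnitude = abs(value)
    first = sign | (magnitude & 0x3F)   # 6 payload bits in the first octet
    magnitude >>= 6
    if magnitude:
        first |= 0x80                   # continuation flag
    out = bytearray([first])
    while magnitude:
        octet = magnitude & 0x7F        # 7 payload bits in later octets
        magnitude >>= 7
        if magnitude:
            octet |= 0x80
        out.append(octet)
    return bytes(out)


def decode_signed_vlq(data: bytes) -> int:
    """Inverse of encode_signed_vlq; raises IndexError on truncated input."""
    first = data[0]
    negative = bool(first & 0x40)
    magnitude = first & 0x3F
    shift = 6
    more = bool(first & 0x80)
    position = 1
    while more:
        octet = data[position]
        magnitude |= (octet & 0x7F) << shift
        shift += 7
        more = bool(octet & 0x80)
        position += 1
    return -magnitude if negative else magnitude
```

One consequence of a separate sign bit is that zero has two possible encodings (0x00 and 0x40); this sketch always emits 0x00.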
Zigzag encoding
An alternative way to encode negative numbers is to use the least significant bit for sign. This is notably done for Google Protocol Buffers, and is known as a zigzag encoding for signed integers. One can encode the numbers so that encoded 0 corresponds to 0, 1 to −1, 10 to 1, 11 to −2, 100 to 2, etc. (encoded values written in binary): counting up alternates between nonnegative (starting at 0) and negative (since each step changes the least-significant bit, hence the sign), whence the name "zigzag encoding". Concretely, transform the integer as
(n << 1) ^ (n >> (k - 1))
for fixed ''k''-bit integers, where >> is an arithmetic (sign-propagating) right shift.
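The zigzag transform and its inverse can be sketched in Python. Python integers are unbounded and `>>` on them is an arithmetic shift, so the sketch takes the width ''k'' as a parameter and masks the result to ''k'' bits; the function names are hypothetical.

```python
def zigzag_encode(n: int, k: int = 32) -> int:
    """Map a signed k-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    if not -(1 << (k - 1)) <= n < (1 << (k - 1)):
        raise ValueError(f"value does not fit in a signed {k}-bit integer")
    # n >> (k - 1) is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> (k - 1))) & ((1 << k) - 1)


def zigzag_decode(z: int) -> int:
    """Inverse transform: the lowest bit selects the sign."""
    return (z >> 1) ^ -(z & 1)
```

The unsigned result can then be passed to any unsigned variable-length encoder, so values of small magnitude, whether positive or negative, occupy few octets.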
Two's complement
LEB128 uses two's complement to represent signed numbers. In this scheme of representation, ''n'' bits encode a range from $-2^{n-1}$ to $2^{n-1} - 1$, and all negative numbers start with a 1 in the most significant bit. In Signed LEB128, the input is sign-extended so that its length is a multiple of 7 bits. From there the encoding proceeds as usual.
In LEB128, the stream is arranged least significant first.
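A minimal Python sketch of Signed LEB128 follows. Since Python integers behave like infinitely sign-extended two's complement numbers, the encoder simply emits 7-bit groups until the remaining bits are pure sign extension. The function names are hypothetical.

```python
def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as Signed LEB128 (least significant group first)."""
    out = bytearray()
    while True:
        octet = value & 0x7F
        value >>= 7                      # arithmetic shift keeps the sign
        sign_bit_set = bool(octet & 0x40)
        # Stop once the rest is only sign extension of this group's top bit.
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(octet)
            return bytes(out)
        out.append(octet | 0x80)         # continuation bit


def decode_sleb128(data: bytes) -> int:
    """Decode one Signed LEB128 value from the start of `data`."""
    result = 0
    shift = 0
    for octet in data:
        result |= (octet & 0x7F) << shift
        shift += 7
        if not octet & 0x80:
            if octet & 0x40:             # top payload bit set: negative
                result -= 1 << shift
            return result
    raise ValueError("truncated LEB128: final octet has its continuation bit set")
```

For example, −123456 encodes to the three octets 0xC0 0xBB 0x78.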
Removing redundancy
With the VLQ encoding described above, any number that can be encoded with N octets can also be encoded with more than N octets simply by prepending additional 0x80 octets as zero-padding. For example, the decimal number 358 can be encoded as the 2-octet VLQ 0x8266, or the number 0358 can be encoded as 3-octet VLQ 0x808266, or 00358 as the 4-octet VLQ 0x80808266 and so forth.
However, the VLQ format used in Git removes this prepending redundancy and extends the representable range of shorter VLQs by adding an offset to VLQs of 2 or more octets in such a way that the lowest possible value for such an (''N'' + 1)-octet VLQ becomes exactly one more than the maximum possible value for an ''N''-octet VLQ. In particular, since a 1-octet VLQ can store a maximum value of 127, the minimum 2-octet VLQ (0x8000) is assigned the value 128 instead of 0. Conversely, the maximum value of such a 2-octet VLQ (0xFF7F) is 16511 instead of just 16383. Similarly, the minimum 3-octet VLQ (0x808000) has a value of 16512 instead of zero, which means that the maximum 3-octet VLQ (0xFFFF7F) is 2113663 instead of just 2097151.
In this way, there is one and only one encoding of each integer, making this a base-128 bijective numeration.
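The offset scheme can be sketched in Python as below. This is an illustration of the arithmetic described above rather than Git's own source code, and the function names are hypothetical. Adding 1 to the accumulated value before each 7-bit shift is what makes an (''N'' + 1)-octet encoding start right after the largest ''N''-octet value.

```python
def encode_offset_vlq(value: int) -> bytes:
    """Encode a non-negative integer so that every value has exactly one encoding."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    octets = [value & 0x7F]              # last octet: continuation bit clear
    value >>= 7
    while value:
        value -= 1                       # remove the offset for this extra octet
        octets.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(octets))       # most significant group first


def decode_offset_vlq(data: bytes) -> int:
    """Inverse of encode_offset_vlq; raises IndexError on truncated input."""
    value = data[0] & 0x7F
    position = 0
    while data[position] & 0x80:
        position += 1
        value = ((value + 1) << 7) | (data[position] & 0x7F)
    return value
```

With this sketch, 0x8000 decodes to 128, 0xFF7F to 16511, 0x808000 to 16512, and 0xFFFF7F to 2113663.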
Examples
Here is a worked-out example for the decimal number 137:
* Represent the value in binary notation (e.g. 137 as 10001001)
* Break it up into groups of 7 bits starting from the least significant bit (e.g. 137 as 0000001 0001001). This is equivalent to representing the number in base 128.
* Take the lowest 7 bits, and that gives you the least significant byte (0000 1001). This byte comes last.
* For all the other groups of 7 bits (in the example, this is 000 0001), set the MSB to 1 (which gives 1000 0001 in our example). Thus 137 becomes 1000 0001 0000 1001, where the leading bit of each octet is the one we added. These added bits denote whether there is another byte to follow or not. Thus, by definition, the very last byte of a variable-length integer will have 0 as its MSB.
Another way to look at this is to represent the value in base-128 and then set the MSB of all but the last base-128 digit to 1.
The Standard MIDI File format specification gives more examples. [Standard MIDI-File Format Spec. 1.1]
External links
MIDI Manufacturers Association (MMA) - Source for English-language MIDI specs
(AMEI) - Source for Japanese-language MIDI specs