Delta encoding is a way of storing or transmitting
data
In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
in the form of ''
differences'' (deltas) between sequential data rather than complete files; more generally this is known as
data differencing
In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as in ...
. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in
revision control software
In software engineering, version control (also known as revision control, source control, or source code management) is a class of systems responsible for managing changes to computer programs, documents, large web sites, or other collections o ...
).
The differences are recorded in discrete files called "deltas" or "diffs". In situations where differences are small – for example, the change of a few words in a large document or the change of a few records in a large table – delta encoding greatly reduces data redundancy. Collections of unique deltas are substantially more space-efficient than their non-encoded equivalents.
From a logical point of view the difference between two data values is the information required to obtain one value from the other – see
relative entropy
Relative may refer to:
General use
*Kinship and family, the principle binding the most basic social units society. If two people are connected by circumstances of birth, they are said to be ''relatives''
Philosophy
*Relativism, the concept that ...
. The difference between identical values (under some
equivalence) is often called ''0'' or the neutral element.
Simple example
Perhaps the simplest example is storing values of bytes as differences (deltas) between sequential values, rather than the values themselves. So, instead of 2, 4, 6, 9, 7, we would store 2, 2, 2, 3, −2. This reduces the
variance
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...
(range) of the values when neighbor samples are correlated, enabling a lower bit usage for the same data.
IFF
In logic and related fields such as mathematics and philosophy, "if and only if" (shortened as "iff") is a biconditional logical connective between statements, where either both statements are true or both are false.
The connective is bicon ...
8SVX
8-Bit Sampled Voice (8SVX) is an audio file format standard developed by Electronic Arts for the Commodore-Amiga computer series. It is a data subtype of the IFF file container format. It typically contains linear pulse-code modulation (LPCM) ...
sound format applies this encoding to raw sound data before applying compression to it. Not even all 8-bit sound
samples compress better when delta encoded, and the usability of delta encoding is even smaller for 16-bit and better samples. Therefore, compression algorithms often choose to delta encode only when the compression is better than without. However, in
video compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression ...
, delta frames can considerably reduce frame size and are used in virtually every video compression
codec
A codec is a device or computer program that encodes or decodes a data stream or signal. ''Codec'' is a portmanteau of coder/decoder.
In electronic communications, an endec is a device that acts as both an encoder and a decoder on a signal or da ...
.
Definition
A delta can be defined in 2 ways, ''symmetric delta'' and ''directed delta''. A ''symmetric delta'' can be expressed as
:
where
and
represent two versions.
A ''directed delta'', also called a change, is a sequence of (elementary) change operations which, when applied to one version
, yields another version
(note the correspondence to
transaction log
In the field of databases in computer science, a transaction log (also transaction journal, database log, binary log or audit trail) is a history of actions executed by a database management system used to guarantee ACID properties over crashes ...
s in databases). In computer implementations, they typically take the form of a language with two commands: ''copy data from v1'' and ''write literal data''.
Variants
A variation of delta encoding which encodes differences between the
prefixes
A prefix is an affix which is placed before the Word stem, stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy'' ...
or
suffixes
In linguistics, a suffix is an affix which is placed after the Stem (linguistics), stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the Grammatical conjugation ...
of
strings
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
is called
incremental encoding Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This ...
. It is particularly effective for sorted lists with small differences between strings, such as a list of
word
A word is a basic element of language that carries an semantics, objective or pragmatics, practical semantics, meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of w ...
s from a
dictionary
A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by radical and stroke for ideographic languages), which may include information on definitions, usage, etymologies ...
.
Implementation issues
The nature of the data to be encoded influences the effectiveness of a particular compression algorithm.
Delta encoding performs best when data has small or constant variation; for an unsorted data set, there may be little to no compression possible with this method.
In delta encoded transmission over a network where only a single copy of the file is available at each end of the communication channel, special
error control codes are used to detect which parts of the file have changed since its previous version.
For example,
rsync
rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like operat ...
uses a rolling
checksum
A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data ...
algorithm based on Mark Adler's
adler-32 Adler-32 is a checksum algorithm written by Mark Adler in 1995, modifying Fletcher's checksum. Compared to a cyclic redundancy check of the same length, it trades reliability for speed (preferring the latter). Adler-32 is more reliable than Fletcher ...
checksum.
Sample C code
The following
C code performs a simple form of delta encoding and decoding on a sequence of characters:
void delta_encode(unsigned char *buffer, int length)
void delta_decode(unsigned char *buffer, int length)
Examples
Delta encoding in HTTP
Another instance of use of delta encoding i
RFC 3229 "Delta encoding in HTTP", which proposes that
HTTP
The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, ...
servers should be able to send updated Web pages in the form of differences between versions (deltas), which should decrease Internet traffic, as most pages change slowly over time, rather than being completely rewritten repeatedly:
The suggested rsync-based framework was implemented in the ''rproxy'' system as a pair of HTTP proxies. Like the basic vcdiff-based implementation, both systems are rarely used.
Delta copying
''Delta copying'' is a fast way of copying a file that is partially changed, when a previous version is present on the destination location. With delta copying, only the changed part of a file is copied. It is usually used in
backup
In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", w ...
or
file copying
In digital file management, copying is a file operation that creates a new file which has the same content as an existing file. Computer operating systems include file copying methods to users, with operating systems with graphical user interface ...
software, often to save
bandwidth
Bandwidth commonly refers to:
* Bandwidth (signal processing) or ''analog bandwidth'', ''frequency bandwidth'', or ''radio bandwidth'', a measure of the width of a frequency range
* Bandwidth (computing), the rate of data transfer, bit rate or thr ...
when copying between computers over a private network or the internet. One notable open-source example is
rsync
rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like operat ...
.
Online backup
Many of the
online backup services
A remote, online, or managed backup service, sometimes marketed as cloud backup or backup-as-a-service, is a service that provides users with a system for the backup, storage, and recovery of computer files. Online backup providers are companies ...
adopt this methodology, often known simply as ''deltas'', in order to give their users previous versions of the same file from previous backups. This reduces associated costs, not only in the amount of data that has to be stored as differing versions (as the whole of each changed version of a file has to be offered for users to access), but also those costs in the uploading (and sometimes the downloading) of each file that has been updated (by just the smaller delta having to be used, rather than the whole file).
Delta updates
For large software packages, there is usually little data changed between versions. Many vendors choose to use delta transfers to save time and bandwidth.
Diff
Diff is a file comparison program, which is mainly used for text files. By default, it generates symmetric deltas that are reversible. Two formats used for software
patches, ''context'' and ''unified'', provides additional context lines that allow for tolerating shifts in line number.
Git
The Git source code control system employs delta compression in an auxiliary
git repack operation. Objects in the repository that have not yet been delta-compressed ("loose objects") are compared against a heuristically chosen subset of all other objects, and the common data and differences are concatenated into a "pack file" which is then compressed using conventional methods. In common use cases, where source or data files are changed incrementally between commits, this can result in significant space savings. The repack operation is typically performed as part of the
git gc process, which is triggered automatically when the numbers of loose objects or pack files exceed configured thresholds.
The format is documented in the pack-format page of the Git documentation. It implements a directed delta.
VCDIFF
One general format for directed delta encoding is VCDIFF, described i
RFC 3284 Free software
Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, no ...
implementations include
Xdelta
xdelta is a command line program for delta encoding, which generates the difference between two files. This is similar to diff and patch, but it is targeted for binary files and does not generate human readable output.
It was first released in 1 ...
and open-vcdiff.
GDIFF
Generic Diff Format (GDIFF) is another directed delta encoding format. It was submitted to
W3C
The World Wide Web Consortium (W3C) is the main international standards organization for the World Wide Web. Founded in 1994 and led by Tim Berners-Lee, the consortium is made up of member organizations that maintain full-time staff working to ...
in 1997. In many cases, VCDIFF has better compression rate than GDIFF.
bsdiff
Bsdiff is a binary diff program using
suffix sorting. For executables that contain many changes in pointer addresses, it performs better than VCDIFF-type "copy and literal" encodings. The intent is to find a way to generate a small diff without needing to parse assembly code (as in Google's Courgette). Bsdiff achieve this by allowing "copy" matches with errors, which are then corrected using an extra "add" array of bytewise differences. Since this array is mostly either zero or repeated values for offset changes, it takes up little space after compression.
Bsdiff is useful for delta updates. Google uses bsdiff in Chromium and Android. The ''deltarpm'' feature of the
RPM Package Manager
RPM Package Manager (RPM) (originally Red Hat Package Manager, now a recursive acronym) is a free and open-source package management system. The name RPM refers to the file format and the package manager program itself. RPM was intended primaril ...
is based on a heavily-modified bsdiff that can use a hash table for matching.
FreeBSD
FreeBSD is a free and open-source Unix-like operating system descended from the Berkeley Software Distribution (BSD), which was based on Research Unix. The first version of FreeBSD was released in 1993. In 2005, FreeBSD was the most popular ...
also uses bsdiff for updates.
Since the 4.3 release of bsdiff in 2005, various improvements or fixes have been produced for it. Google maintains multiple versions of the code for each of its products. FreeBSD takes many of Google's compatible changes, mainly a vulnerability fix and a switch to the faster suffix-sorting routine.
Debian
Debian (), also known as Debian GNU/Linux, is a Linux distribution composed of free and open-source software, developed by the community-supported Debian Project, which was established by Ian Murdock on August 16, 1993. The first version of D ...
has a series of performance tweaks to the program.
''ddelta'' is a rewrite of bsdiff proposed for use in Debian's delta updates. Among other efficiency improvements, it uses a sliding window to reduce memory and CPU cost.
See also
*
Data differencing
In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as in ...
*
Interleaved deltas
Interleaved deltas, or SCCS weave is a method used by the Source Code Control System to store all revisions of a file. All lines from all revisions are "woven" together in a single block of data, with interspersed control instructions indicating w ...
*
Source Code Control System
Source Code Control System (SCCS) is a version control system designed to track changes in source code and other text files during the development of a piece of software. This allows the user to retrieve any of the previous versions of the origina ...
*
String-to-string correction problem In computer science, the string-to-string correction problem refers to determining the minimum cost sequence of edit operations necessary to change one string into another (i.e., computing the shortest edit distance). Each type of edit operation ha ...
*
Xdelta
xdelta is a command line program for delta encoding, which generates the difference between two files. This is similar to diff and patch, but it is targeted for binary files and does not generate human readable output.
It was first released in 1 ...
: open-source delta encoder
References
External links
* RFC 3229 – Delta Encoding in HTTP
{{DEFAULTSORT:Delta Encoding
Lossless compression algorithms
Data differencing
Articles with example C code