HOME

TheInfoList



OR:

Delta encoding is a way of storing or transmitting
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
in the form of '' differences'' (deltas) between sequential data rather than complete files; more generally this is known as
data differencing In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as in ...
. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in
revision control software In software engineering, version control (also known as revision control, source control, or source code management) is a class of systems responsible for managing changes to computer programs, documents, large web sites, or other collections ...
). The differences are recorded in discrete files called "deltas" or "diffs". In situations where differences are small – for example, the change of a few words in a large document or the change of a few records in a large table – delta encoding greatly reduces data redundancy. Collections of unique deltas are substantially more space-efficient than their non-encoded equivalents. From a logical point of view the difference between two data values is the information required to obtain one value from the other – see
relative entropy Relative may refer to: General use *Kinship and family, the principle binding the most basic social units society. If two people are connected by circumstances of birth, they are said to be ''relatives'' Philosophy *Relativism, the concept that ...
. The difference between identical values (under some equivalence) is often called ''0'' or the neutral element.


Simple example

Perhaps the simplest example is storing values of bytes as differences (deltas) between sequential values, rather than the values themselves. So, instead of 2, 4, 6, 9, 7, we would store 2, 2, 2, 3, −2. This reduces the
variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbe ...
(range) of the values when neighbor samples are correlated, enabling a lower bit usage for the same data. IFF
8SVX 8-Bit Sampled Voice (8SVX) is an audio file format standard developed by Electronic Arts for the Commodore-Amiga computer series. It is a data subtype of the IFF file container format. It typically contains linear pulse-code modulation (LPCM) ...
sound format applies this encoding to raw sound data before applying compression to it. Not even all 8-bit sound samples compress better when delta encoded, and the usability of delta encoding is even smaller for 16-bit and better samples. Therefore, compression algorithms often choose to delta encode only when the compression is better than without. However, in
video compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressio ...
, delta frames can considerably reduce frame size and are used in virtually every video compression
codec A codec is a device or computer program that encodes or decodes a data stream or signal. ''Codec'' is a portmanteau of coder/decoder. In electronic communications, an endec is a device that acts as both an encoder and a decoder on a signal or ...
.


Definition

A delta can be defined in 2 ways, ''symmetric delta'' and ''directed delta''. A ''symmetric delta'' can be expressed as : \Delta(v_1, v_2) = (v_1 \setminus v_2) \cup (v_2 \setminus v_1), where v_1 and v_2 represent two versions. A ''directed delta'', also called a change, is a sequence of (elementary) change operations which, when applied to one version v_1, yields another version v_2 (note the correspondence to
transaction log In the field of databases in computer science, a transaction log (also transaction journal, database log, binary log or audit trail) is a history of actions executed by a database management system used to guarantee ACID properties over crashes ...
s in databases). In computer implementations, they typically take the form of a language with two commands: ''copy data from v1'' and ''write literal data''.


Variants

A variation of delta encoding which encodes differences between the
prefixes A prefix is an affix which is placed before the stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy''. Particula ...
or
suffixes In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the conjugation of verbs. Suffixes can carry g ...
of strings is called
incremental encoding Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This ...
. It is particularly effective for sorted lists with small differences between strings, such as a list of
word A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no conse ...
s from a
dictionary A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by radical and stroke for ideographic languages), which may include information on definitions, usage, etymologie ...
.


Implementation issues

The nature of the data to be encoded influences the effectiveness of a particular compression algorithm. Delta encoding performs best when data has small or constant variation; for an unsorted data set, there may be little to no compression possible with this method. In delta encoded transmission over a network where only a single copy of the file is available at each end of the communication channel, special error control codes are used to detect which parts of the file have changed since its previous version. For example,
rsync rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like opera ...
uses a rolling
checksum A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data ...
algorithm based on Mark Adler's
adler-32 Adler-32 is a checksum algorithm written by Mark Adler in 1995, modifying Fletcher's checksum. Compared to a cyclic redundancy check of the same length, it trades reliability for speed (preferring the latter). Adler-32 is more reliable than Fletcher ...
checksum.


Sample C code

The following C code performs a simple form of delta encoding and decoding on a sequence of characters: void delta_encode(unsigned char *buffer, int length) void delta_decode(unsigned char *buffer, int length)


Examples


Delta encoding in HTTP

Another instance of use of delta encoding i
RFC 3229
"Delta encoding in HTTP", which proposes that
HTTP The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide We ...
servers should be able to send updated Web pages in the form of differences between versions (deltas), which should decrease Internet traffic, as most pages change slowly over time, rather than being completely rewritten repeatedly: The suggested rsync-based framework was implemented in the ''rproxy'' system as a pair of HTTP proxies. Like the basic vcdiff-based implementation, both systems are rarely used.


Delta copying

''Delta copying'' is a fast way of copying a file that is partially changed, when a previous version is present on the destination location. With delta copying, only the changed part of a file is copied. It is usually used in
backup In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", ...
or
file copying In digital file management, copying is a file operation that creates a new file which has the same content as an existing file. Computer operating systems include file copying methods to users, with operating systems with graphical user interfa ...
software, often to save
bandwidth Bandwidth commonly refers to: * Bandwidth (signal processing) or ''analog bandwidth'', ''frequency bandwidth'', or ''radio bandwidth'', a measure of the width of a frequency range * Bandwidth (computing), the rate of data transfer, bit rate or thr ...
when copying between computers over a private network or the internet. One notable open-source example is
rsync rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like opera ...
.


Online backup

Many of the online backup services adopt this methodology, often known simply as ''deltas'', in order to give their users previous versions of the same file from previous backups. This reduces associated costs, not only in the amount of data that has to be stored as differing versions (as the whole of each changed version of a file has to be offered for users to access), but also those costs in the uploading (and sometimes the downloading) of each file that has been updated (by just the smaller delta having to be used, rather than the whole file).


Delta updates

For large software packages, there is usually little data changed between versions. Many vendors choose to use delta transfers to save time and bandwidth.


Diff

Diff is a file comparison program, which is mainly used for text files. By default, it generates symmetric deltas that are reversible. Two formats used for software patches, ''context'' and ''unified'', provides additional context lines that allow for tolerating shifts in line number.


Git

The Git source code control system employs delta compression in an auxiliary
git repack
operation. Objects in the repository that have not yet been delta-compressed ("loose objects") are compared against a heuristically chosen subset of all other objects, and the common data and differences are concatenated into a "pack file" which is then compressed using conventional methods. In common use cases, where source or data files are changed incrementally between commits, this can result in significant space savings. The repack operation is typically performed as part of the
git gc
process, which is triggered automatically when the numbers of loose objects or pack files exceed configured thresholds. The format is documented in the pack-format page of the Git documentation. It implements a directed delta.


VCDIFF

One general format for directed delta encoding is VCDIFF, described i
RFC 3284
Free software Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, n ...
implementations include Xdelta and open-vcdiff.


GDIFF

Generic Diff Format (GDIFF) is another directed delta encoding format. It was submitted to W3C in 1997. In many cases, VCDIFF has better compression rate than GDIFF.


bsdiff

Bsdiff is a binary diff program using suffix sorting. For executables that contain many changes in pointer addresses, it performs better than VCDIFF-type "copy and literal" encodings. The intent is to find a way to generate a small diff without needing to parse assembly code (as in Google's Courgette). Bsdiff achieve this by allowing "copy" matches with errors, which are then corrected using an extra "add" array of bytewise differences. Since this array is mostly either zero or repeated values for offset changes, it takes up little space after compression. Bsdiff is useful for delta updates. Google uses bsdiff in Chromium and Android. The ''deltarpm'' feature of the
RPM Package Manager RPM Package Manager (RPM) (originally Red Hat Package Manager, now a recursive acronym) is a free and open-source package management system. The name RPM refers to the file format and the package manager program itself. RPM was intended primari ...
is based on a heavily-modified bsdiff that can use a hash table for matching.
FreeBSD FreeBSD is a free and open-source Unix-like operating system descended from the Berkeley Software Distribution (BSD), which was based on Research Unix. The first version of FreeBSD was released in 1993. In 2005, FreeBSD was the most popular ...
also uses bsdiff for updates. Since the 4.3 release of bsdiff in 2005, various improvements or fixes have been produced for it. Google maintains multiple versions of the code for each of its products. FreeBSD takes many of Google's compatible changes, mainly a vulnerability fix and a switch to the faster suffix-sorting routine.
Debian Debian (), also known as Debian GNU/Linux, is a Linux distribution composed of free and open-source software, developed by the community-supported Debian Project, which was established by Ian Murdock on August 16, 1993. The first version of De ...
has a series of performance tweaks to the program. ''ddelta'' is a rewrite of bsdiff proposed for use in Debian's delta updates. Among other efficiency improvements, it uses a sliding window to reduce memory and CPU cost.


See also

*
Data differencing In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as in ...
*
Interleaved deltas Interleaved deltas, or SCCS weave is a method used by the Source Code Control System to store all revisions of a file. All lines from all revisions are "woven" together in a single block of data, with interspersed control instructions indicating ...
*
Source Code Control System Source Code Control System (SCCS) is a version control system designed to track changes in source code and other text files during the development of a piece of software. This allows the user to retrieve any of the previous versions of the origin ...
* String-to-string correction problem * Xdelta: open-source delta encoder


References


External links

* RFC 3229 – Delta Encoding in HTTP {{DEFAULTSORT:Delta Encoding Lossless compression algorithms Data differencing Articles with example C code