Data Differencing
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
and
information theory Information theory is the scientific study of the quantification (science), quantification, computer data storage, storage, and telecommunication, communication of information. The field was originally established by the works of Harry Nyquist a ...
, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data ("
patching Patching is a small village and civil parish that lies amid the fields and woods of the southern slopes of the South Downs in the National Park in the Arun District of West Sussex, England. It has a visible hill-workings history going back t ...
" the source with the difference to produce the target).


Examples

One of the best-known examples of data differencing is the
diff In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it ...
utility, which produces line-by-line differences of
text file A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating ...
s (and in some implementations,
binary file A binary file is a computer file that is not a text file. The term "binary file" is often used as a term meaning "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document fil ...
s, thus being a general-purpose differencing tool). Differencing of general binary files goes under the rubric of
delta encoding Delta encoding is a way of storing or transmitting data in the form of '' differences'' (deltas) between sequential data rather than complete files; more generally this is known as data differencing. Delta encoding is sometimes called delta compre ...
, with a widely used example being the algorithm used in
rsync rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like operat ...
. A standardized generic differencing format is
VCDIFF VCDIFF is a format and an algorithm for delta encoding, described in IETF'RFC 3284 The algorithm is based on Jon Bentley and Douglas McIlroy's paper "Data Compression Using Long Common Strings" written in 1999. VCDIFF is used as one of the delta e ...
, implemented in such utilities as
Xdelta xdelta is a command line program for delta encoding, which generates the difference between two files. This is similar to diff and patch, but it is targeted for binary files and does not generate human readable output. It was first released in 1 ...
version 3. A high-efficiency (small patch files) differencing program is bsdiff, which uses bzip2 as a final compression step on the generated delta.


Concerns

Main concerns for data differencing are ''usability'' and ''space efficiency'' (patch size). If one simply wishes to reconstruct the target given the source and patch, one may simply include the entire target in the patch and "apply" the patch by discarding the source and outputting the target that has been included in the patch; similarly, if the source and target have the same size one may create a simple patch by
XOR Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false). It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...
ing source and target. In both these cases, the patch will be as large as the target. As these examples show, if the only concern is reconstruction of target, this is easily done, at the expense of a large patch, and the main concern for general-purpose binary differencing is reducing the patch size. For structured data especially, one has other concerns, which largely fall under "usability" – for example, if one is comparing two documents, one generally wishes to know ''which'' sections have changed, or if some sections have been moved around – one wishes to understand ''how'' the documents differ. For instance "here 'cat' was changed to 'dog', and paragraph 13 was moved to paragraph 14". One may also wish to have ''robust'' differences – for example, if two documents A and B differ in paragraph 13, one may wish to be able to apply this patch even if one has changed paragraph 7 of A. An example of this is in diff, which shows which lines changed, and where the context format allows robustness and improves human readability. Other concerns include computational efficiency, as for data compression – finding a small patch can be very time and memory intensive. Best results occur when one has knowledge of the data being compared and other constraints:
diff In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it ...
is designed for line-oriented text files, particularly source code, and works best for these; the
rsync rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like operat ...
algorithm is used based on source and target being across a network from each other and communication being slow, so it minimizes data that must be transmitted; and the updates for
Google Chrome Google Chrome is a cross-platform web browser developed by Google. It was first released in 2008 for Microsoft Windows, built with free software components from Apple WebKit and Mozilla Firefox. Versions were later released for Linux, macOS ...
use an algorithm customized to the archive and executable format of the program's data.


Connection with data compression

Data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression ...
can be seen as a special case of data differencing – data differencing consists of producing a ''difference'' given a ''source'' and a ''target'', with patching producing a ''target'' given a ''source'' and a ''difference,'' while data compression consists of producing a compressed file given a target, and decompression consists of producing a target given only a compressed file. Thus, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
(corresponding to data compression) as a special case of
relative entropy Relative may refer to: General use *Kinship and family, the principle binding the most basic social units society. If two people are connected by circumstances of birth, they are said to be ''relatives'' Philosophy *Relativism, the concept that ...
(corresponding to data differencing) with no initial data. When one wishes to emphasize the connection, one may use the term differential compression to refer to data differencing. A dictionary translating between the terminology of the two fields is given as:


References

{{reflist