Normalized compression distance

Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be they two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, or two genomes, to name a few. Such a measurement should not be application-dependent or arbitrary. A reasonable definition of the similarity between two objects is how difficult it is to transform one into the other. The NCD can be used in information retrieval and in data mining for cluster analysis.


Information distance

We assume that the objects under discussion are finite strings of 0s and 1s; thus we mean string similarity. Every computer file is of this form. One can define the information distance between strings x and y as the length of the shortest program p that computes x from y and vice versa. This shortest program is in a fixed programming language; for technical reasons one uses the theoretical notion of Turing machines, and to express the length of p one uses the notion of Kolmogorov complexity. It has been shown (C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitányi, and W. Zurek, Information Distance, IEEE Trans. Inform. Theory, IT-44:4(1998) 1407–1423) that

: |p| = \max \{ K(x\mid y), K(y\mid x) \}

up to logarithmic additive terms, which can be ignored. This information distance is a metric (it satisfies the metric inequalities up to a logarithmic additive term) and is universal (it minorizes every computable distance, for example one computed from features, up to a constant additive term).


Normalized information distance (similarity metric)

The information distance is absolute, but to express similarity we are more interested in relative distances. For example, if two strings of length 1,000,000 differ by 1000 bits, we consider them relatively more similar than two strings of 1000 bits that differ by 1000 bits. Hence we need to normalize to obtain a similarity metric. This yields the normalized information distance (NID),

: NID(x,y) = \frac{\max \{ K(x\mid y), K(y\mid x) \}}{\max \{ K(x), K(y) \}},

where K(x\mid y) is the algorithmic information of x given y as input. The NID has been called "the similarity metric" since the function NID(x,y) has been shown to satisfy the basic requirements of a metric distance measure. However, it is neither computable nor even semicomputable.


Normalized compression distance

While the NID metric is not computable, it has an abundance of applications through approximation: one simply approximates K by a real-world compressor, with Z(x) denoting the binary length of the file x compressed with compressor Z (for example gzip, bzip2, or PPMZ), making the NID easy to apply. Vitanyi and Cilibrasi rewrote the NID to obtain the normalized compression distance (NCD),

: NCD_Z(x,y) = \frac{Z(xy) - \min \{ Z(x), Z(y) \}}{\max \{ Z(x), Z(y) \}}.

The NCD is actually a family of distances parametrized by the compressor Z. The better Z is, the closer the NCD approaches the NID, and the better the results are.
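As a concrete illustration, the NCD is only a few lines of code once a compressor is chosen. The sketch below uses Python's gzip module as Z; this is an illustrative choice, not a prescribed one, and results depend on the compressor (gzip, for instance, only finds repetitions within a 32 KB window).

```python
import gzip

def Z(data: bytes) -> int:
    # Length in bytes of the gzip-compressed data (the compressor Z).
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # NCD_Z(x, y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))
    zx, zy = Z(x), Z(y)
    return (Z(x + y) - min(zx, zy)) / max(zx, zy)

s1 = b"the quick brown fox jumps over the lazy dog. " * 20
s2 = b"the quick brown fox jumps over the lazy dog. " * 20 + b"an extra, different ending"
s3 = bytes((i * 37 + 11) % 256 for i in range(2000))  # unrelated byte pattern

print(ncd(s1, s2))  # small: s1 and s2 share almost all their content
print(ncd(s1, s3))  # near 1: little shared information
```

In practice the NCD can slightly exceed 1 because real compressors are imperfect; values are compared relative to each other rather than read as exact probabilities.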


Applications

The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees. It can also be used for new applications of general clustering and classification of natural data in arbitrary domains, for clustering of heterogeneous data, and for anomaly detection across domains. The NID and NCD have been applied to numerous subjects, including music classification, analysis of network traffic and clustering of computer worms and viruses, authorship attribution, gene expression dynamics, predicting useful versus useless stem cells, critical networks, image registration, and question-answer systems.
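A minimal sketch of how such clustering and classification applications work: compute pairwise NCDs over a corpus and group objects by distance. The toy corpus and the zlib compressor below are illustrative assumptions, not the data or tools of the cited studies; real applications feed the full NCD matrix to a hierarchical clustering algorithm.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance with zlib as the compressor Z.
    z = lambda b: len(zlib.compress(b, 9))
    zx, zy = z(x), z(y)
    return (z(x + y) - min(zx, zy)) / max(zx, zy)

# Toy corpus: two English-like samples and one DNA-like sample.
docs = {
    "english_a": b"it was the best of times, it was the worst of times, " * 10,
    "english_b": b"it was the age of wisdom, it was the age of foolishness, " * 10,
    "dna":       b"ACGTTGCAACGTAGCTAGGA" * 30,
}

def nearest(name: str) -> str:
    # Nearest neighbour under NCD.
    return min((k for k in docs if k != name),
               key=lambda k: ncd(docs[name], docs[k]))

print(nearest("english_a"))  # the two English samples group together
```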


Performance

Researchers from the data-mining community use NCD and its variants as "parameter-free, feature-free" data-mining tools. One group has experimentally tested a closely related metric on a large variety of sequence benchmarks. Comparing their compression method with 51 major methods found in 7 major data-mining conferences over the past decade, they established the superiority of the compression method for clustering heterogeneous data and for anomaly detection, and its competitiveness in clustering domain data. NCD has the advantage of being robust to noise. However, although NCD appears "parameter-free", practical questions remain, such as which compressor to use in computing it.


Comparison with the Normalized Relative Compression (NRC)

In order to measure the information of a string relative to another, one relies on relative semi-distances such as the normalized relative compression (NRC). These are measures that need not respect the symmetry and triangle-inequality properties of a distance. Although the NCD and the NRC seem very similar, they address different questions. The NCD measures how similar both strings are, mostly using the information content, while the NRC indicates the fraction of a target string that cannot be constructed using information from another string. A comparison of the two, with an application to the evolution of primate genomes, has been published.


Normalized Google distance

Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse" or "the text of War and Peace by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in the background common knowledge of humankind, like "home" or "red." In such cases we are interested in semantic similarity. Using code-word lengths obtained from the page-hit counts returned by Google from the web, we obtain a semantic distance by the NCD formula, viewing Google as a compressor useful for data mining, text comprehension, classification, and translation. The associated NCD, called the normalized Google distance (NGD), can be written as

: NGD(x,y) = \frac{\max \{ \log f(x), \log f(y) \} - \log f(x,y)}{\log N - \min \{ \log f(x), \log f(y) \}},

where f(x) denotes the number of pages containing the search term x, and f(x,y) denotes the number of pages containing both x and y, as returned by Google or any search engine capable of returning an aggregate page count. The number N can be set to the number of pages indexed, although it is more proper to count each page according to the number of search terms or phrases it contains; as a rule of thumb one can multiply the number of pages by, say, a thousand.
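The formula is straightforward to evaluate once the counts are in hand. The sketch below uses assumed page counts and an assumed index size purely for illustration; real use would query a search engine for f(x), f(y), and f(x,y). The ratio is independent of the logarithm base.

```python
import math

# Assumed page counts and index size (illustrative stand-ins,
# not live search-engine results).
f = {"horse": 46_700_000, "rider": 12_200_000}
f_pair = {frozenset({"horse", "rider"}): 2_630_000}
N = 8_058_044_651

def ngd(x: str, y: str) -> float:
    # NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y))
    #          / (log N - min(log f(x), log f(y)))
    fx, fy = math.log(f[x]), math.log(f[y])
    fxy = math.log(f_pair[frozenset({x, y})])
    return (max(fx, fy) - fxy) / (math.log(N) - min(fx, fy))

print(round(ngd("horse", "rider"), 3))  # about 0.443 with these counts
```

A small NGD indicates that the two terms tend to occur on the same pages, which is taken as evidence of related meaning.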


See also

* Efficient Estimation of Word Representations in Vector Space
* Word2vec




External links

* M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, New York, 4th edition, 2019.