In
pattern recognition
Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphi ...
, the iDistance is an indexing and query processing technique for
k-nearest neighbor queries on point data in
multi-dimensional metric space
In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general settin ...
s. The kNN query is one of the hardest problems on multi-dimensional data, especially when
the dimensionality of the data is high. The iDistance is designed to process kNN queries in high-dimensional spaces efficiently and it is especially good for
skewed data distributions, which usually occur in real-life data sets. The iDistance can be augmented with machine learning models to learn the data distributions for searching and storing the multi-dimensional data.
Indexing
Building the iDistance index has two steps:
#A number of reference points in the data space are chosen. There are various ways of choosing reference points. Using
cluster centers as reference points is the most efficient way. Succinctly, the data points are partitioned into
Voronoi cells based on well-chosen reference points.
#The distance between a data point and its closest reference point is calculated. This distance plus a scaling value is called the point's ''iDistance''. By this means, points in a multi-dimensional space are mapped to one-dimensional values, and then a
B+-tree can be adopted to index the points using the iDistance as the
key
Key or The Key may refer to:
Common meanings
* Key (cryptography), a piece of information that controls the operation of a cryptography algorithm
* Key (lock), device used to control access to places or facilities restricted by a lock
* Key (map ...
.
The figure on the right shows an example where three reference points (O
1, O
2, O
3) are chosen. The data points are then mapped to a one-dimensional space and indexed in a B
+-tree. Various extensions have been proposed to make the selection of reference points for effective query performance, including employing machine learning to learn the identification of reference points.
Query processing
To process a kNN query, the query is mapped to a number of one-dimensional range queries, which can be processed efficiently on a B
+-tree. In the above figure, the query ''Q'' is mapped to a value in the B
+-tree while the kNN search ``sphere" is mapped to a range in the B
+-tree. The search sphere expands gradually until the k NNs are found. This corresponds to gradually expanding range searches in the B
+-tree.
The iDistance technique can be viewed as a way of accelerating the sequential scan. Instead of scanning records from the beginning to the end of the data file, the iDistance starts the scan from spots where the nearest neighbors can be obtained early with a very high probability.
Applications
The iDistance has been used in many applications including
*
Image retrieval
An image retrieval system is a computer system used for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval utilize some method of adding metadata such as caption ...
* Video indexing
* Similarity search in P2P systems
* Mobile computing
*
Recommender system
A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information filtering system that provide suggestions for items that are most pertinent to a particular u ...
Historical background
The iDistance was first proposed by Cui Yu, Beng Chin Ooi, Kian-Lee Tan and
H. V. Jagadish in 2001. Later, together with Rui Zhang, they improved the technique and performed a more comprehensive study on it in 2005.
[H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu and Rui Zhan]
iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search
ACM Transactions on Data Base Systems (ACM TODS), 30, 2, 364-397, June 2005.
References
External links
Google's iDistance implementation in C++
{{DEFAULTSORT:Idistance
Classification algorithms
Machine learning algorithms