Similarity search is the most general term used for a range of mechanisms which share the principle of searching (typically very large) spaces of objects where the only available comparator is the similarity between any pair of objects. This is becoming increasingly important in an age of large information repositories where the objects contained do not possess any natural order, for example large collections of images, sounds and other sophisticated digital objects.

Nearest neighbor search and range queries are important subclasses of similarity search, and a number of solutions exist. Research in similarity search is dominated by the inherent problems of searching over complex objects. Such objects cause most known techniques to lose traction over large collections, due to a manifestation of the so-called curse of dimensionality, and there are still many unsolved problems. Unfortunately, in many cases where similarity search is necessary, the objects are inherently complex. The most general approach to similarity search relies upon the mathematical notion of metric space, which allows the construction of efficient index structures in order to achieve scalability in the search domain.

Similarity search evolved independently in a number of different scientific and computing contexts, according to various needs. In 2008 a few leading researchers in the field felt strongly that the subject should be a research topic in its own right, to allow focus on the general issues applicable across the many diverse domains of its use. This resulted in the formation of the SISAP foundation, whose main activity is a series of annual international conferences on the generic topic.
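
The two query types named above, nearest neighbor search and range queries, can be illustrated with a linear scan over the whole collection. This is only a baseline sketch: the function names, the Hamming distance and the word list are assumptions chosen for illustration, and a practical system would use an index to avoid comparing the query with every object.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def nearest_neighbor(query: T, data: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Return the object in `data` with the smallest distance to `query`."""
    return min(data, key=lambda obj: dist(query, obj))


def range_query(query: T, data: Sequence[T], dist: Callable[[T, T], float],
                radius: float) -> List[T]:
    """Return every object in `data` whose distance to `query` is at most `radius`."""
    return [obj for obj in data if dist(query, obj) <= radius]


def hamming(a: str, b: str) -> int:
    """Illustrative distance: number of differing positions in equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy collection; the only comparator used is the distance function.
words = ["karolin", "kathrin", "kerstin", "karolyn"]
closest = nearest_neighbor("carolin", words, hamming)
within_two = range_query("carolin", words, hamming, 2)
```

Both functions compute one distance per stored object, which is exactly the cost that index structures try to reduce.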

Metric search

Metric search is similarity search which takes place within metric spaces. While the semimetric properties are more or less necessary for any kind of search to be meaningful, the further property of triangle inequality is useful for engineering, rather than conceptual, purposes.

A simple corollary of triangle inequality is that, if any two objects within the space are far apart, then no third object can be close to both. This observation allows data structures to be built, based on distances measured within the data collection, which allow subsets of the data to be excluded when a query is executed. As a simple example, a reference object can be chosen from the data set, and the remainder of the set divided into two parts based on distance to this object: those close to the reference object in set A, and those far from the object in set B. If, when the set is later queried, the distance from the query to the reference object is large, then none of the objects within set A can be very close to the query; if it is very small, then no object within set B can be close to the query.

Once such situations are quantified and studied, many different metric indexing structures can be designed, variously suitable for different types of collections. The research domain of metric search can thus be characterised as the study of pre-processing algorithms over large and relatively static collections of data which, using the properties of metric spaces, allow efficient similarity search to be performed.
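
The reference-object example can be made concrete with a short sketch. It assumes a fixed distance threshold separating set A from set B, a range query with a given radius, and one-dimensional points compared by absolute difference; all names and values are hypothetical and chosen only for illustration. For simplicity the reference object itself is kept in set A.

```python
from typing import Callable, List, Sequence, Tuple


def build_partition(data: Sequence[float], reference: float,
                    dist: Callable[[float, float], float],
                    threshold: float) -> Tuple[List[float], List[float]]:
    """Split data into set A (within `threshold` of the reference) and set B (the rest)."""
    near = [x for x in data if dist(reference, x) <= threshold]
    far = [x for x in data if dist(reference, x) > threshold]
    return near, far


def range_search(query: float, radius: float, reference: float, threshold: float,
                 near: List[float], far: List[float],
                 dist: Callable[[float, float], float]) -> Tuple[List[float], int]:
    """Return (objects within `radius` of `query`, number of objects actually compared)."""
    d_qr = dist(query, reference)
    candidates: List[float] = []
    # Every x in A has d(x, ref) <= threshold, so d(q, x) >= d_qr - threshold.
    # If that lower bound exceeds the radius, A cannot contain a result.
    if d_qr - threshold <= radius:
        candidates.extend(near)
    # Every x in B has d(x, ref) > threshold, so d(q, x) > threshold - d_qr.
    # If that bound is at least the radius, B cannot contain a result.
    if threshold - d_qr < radius:
        candidates.extend(far)
    hits = [x for x in candidates if dist(query, x) <= radius]
    return hits, len(candidates)


def abs_diff(a: float, b: float) -> float:
    """Illustrative metric on the real line."""
    return abs(a - b)


data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
reference, threshold = 2.0, 5.0
near, far = build_partition(data, reference, abs_diff, threshold)

# The query is far from the reference, so set A is skipped entirely.
hits_far, checked_far = range_search(11.5, 1.0, reference, threshold, near, far, abs_diff)
# The query is close to the reference, so set B is skipped entirely.
hits_near, checked_near = range_search(2.5, 1.0, reference, threshold, near, far, abs_diff)
```

In both example queries only three of the six objects are compared with the query. A query at a middling distance from the reference would still have to examine both sets, which is one reason practical structures apply the same idea recursively or with several reference objects.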

Types

Locality-sensitive hashing

A popular approach for similarity search is locality-sensitive hashing (LSH). It hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large-scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases.
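
As a rough illustration of the bucketing idea, the sketch below uses random hyperplanes, one commonly used LSH family for cosine similarity. The choice of this family, the function names, the number of hash bits and the sample vectors are all assumptions made for the example and are not taken from the text above.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Signature = Tuple[int, ...]


def make_hyperplanes(n_bits: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw `n_bits` random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec: Sequence[float], planes: Sequence[Sequence[float]]) -> Signature:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(vectors: Sequence[Sequence[float]],
                planes: Sequence[Sequence[float]]) -> Dict[Signature, List[int]]:
    """Group vector indices into buckets keyed by their bit signature."""
    buckets: Dict[Signature, List[int]] = defaultdict(list)
    for i, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(i)
    return buckets


def candidates(vec: Sequence[float], planes: Sequence[Sequence[float]],
               buckets: Dict[Signature, List[int]]) -> List[int]:
    """Return indices of stored vectors that share the query's bucket."""
    return buckets.get(signature(vec, planes), [])


# Hypothetical toy data: the second vector points in the same direction as the
# first, the third points in the opposite direction.
base = [1.0, 0.9, -0.3]
vectors = [base, [2.0 * x for x in base], [-x for x in base]]
planes = make_hyperplanes(n_bits=8, dim=3)
index = build_index(vectors, planes)
same_direction = candidates([3.0 * x for x in base], planes, index)
```

Vectors separated by a small angle agree on most bits and so tend to share a bucket, while a query only needs to be compared against the contents of its own bucket. Real deployments typically use several hash tables to reduce the chance of missing a true neighbor.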

Bibliography

* Pei Lee, Laks V. S. Lakshmanan, Jeffrey Xu Yu: On Top-k Structural Similarity Search. ICDE 2012: 774-785.
* Zezula, P., Amato, G., Dohnal, V., and Batko, M.: Similarity Search - The Metric Space Approach. Springer, 2006.
* Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.
* E. Chavez, G. Navarro, R.A. Baeza-Yates, J.L. Marroquin: Searching in metric spaces. ACM Computing Surveys, 2001.
* M.L. Hetland: The Basic Principles of Metric Indexing. Swarm Intelligence for Multi-objective Problems in Data Mining, Studies in Computational Intelligence Volume 242, 2009, pp 199–232.

Resources

The Multi-Feature Indexing Network (MUFIN) Project

MI-File (Metric Inverted File)

Content-based Photo Image Retrieval Test-Collection (CoPhIR)

Benchmarks

ANN-Benchmarks, for approximate nearest neighbor search algorithms; by Spotify
