Random indexing is a dimensionality reduction method and computational framework for distributional semantics. It is based on three insights: that very-high-dimensional vector space model implementations are impractical, that models need not grow in dimensionality when new items (e.g. new terminology) are encountered, and that a high-dimensional model can be projected into a space of lower dimensionality without compromising L2 distance metrics if the resulting dimensions are chosen appropriately.
This is the original point of the random projection approach to dimension reduction, first formulated as the Johnson–Lindenstrauss lemma, and locality-sensitive hashing has some of the same starting points. Random indexing, as used in representation of language, originates from the work of Pentti Kanerva on sparse distributed memory, and can be described as an incremental formulation of a random projection.
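
The incremental formulation can be illustrated with a minimal sketch. It assumes, beyond what the text above states, that each item receives a sparse ternary index vector (a few +1 and -1 entries, zeros elsewhere) and that an item's context vector is the running sum of the index vectors of its neighbours within a sliding window. The class name `RandomIndexer`, the helper `cosine`, and the values of `DIM`, `NONZERO` and `window` are hypothetical choices made for illustration only.

```python
import math
import random
from collections import defaultdict

DIM = 512      # fixed reduced dimensionality (assumed value)
NONZERO = 8    # nonzero entries per index vector (assumed value)


class RandomIndexer:
    """Hypothetical helper that accumulates context vectors incrementally."""

    def __init__(self, dim=DIM, nonzero=NONZERO, window=2, seed=0):
        self.dim = dim
        self.nonzero = nonzero
        self.window = window
        self.rng = random.Random(seed)
        self.index = {}                                 # word -> sparse index vector
        self.context = defaultdict(lambda: [0] * dim)   # word -> dense context vector

    def index_vector(self, word):
        """Create (once) a sparse ternary vector stored as {position: +1 or -1}."""
        if word not in self.index:
            positions = self.rng.sample(range(self.dim), self.nonzero)
            self.index[word] = {
                p: (1 if i % 2 == 0 else -1) for i, p in enumerate(positions)
            }
        return self.index[word]

    def update(self, tokens):
        """Add each neighbour's index vector to the focus word's context vector."""
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for pos, sign in self.index_vector(tokens[j]).items():
                    self.context[word][pos] += sign


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In this sketch, a previously unseen word only adds one new index vector and one new context vector of length `DIM`; the dimensionality of the existing vectors does not change, which is the property the paragraph above describes.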
It can also be verified that random indexing is a random projection technique for the construction of Euclidean spaces, i.e. L2 normed vector spaces. In Euclidean spaces, the behaviour of random projections is explained by the Johnson–Lindenstrauss lemma.
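
For reference, the guarantee given by the lemma can be written as follows. This is the standard textbook statement rather than a formulation specific to random indexing; the symbols $X$, $n$, $N$, $k$, $\varepsilon$ and $f$ are introduced here for illustration.

```latex
% Johnson–Lindenstrauss lemma (standard form).
% Let 0 < \varepsilon < 1 and let X be a set of n points in \mathbb{R}^N.
% For some k = O(\varepsilon^{-2} \log n) there is a linear map
% f : \mathbb{R}^N \to \mathbb{R}^k such that, for all u, v \in X,
\[
  (1 - \varepsilon)\,\lVert u - v \rVert_2^2
  \;\le\; \lVert f(u) - f(v) \rVert_2^2
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert_2^2 .
\]
```

The target dimension $k$ depends on the number of points and the tolerated distortion, not on the original dimension $N$.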
The TopSig technique extends the random indexing model to produce bit vectors that are compared using the Hamming distance. It is used for improving the performance of information retrieval
and document clustering. In a similar line of research, Random Manhattan Integer Indexing (RMII) has been proposed for improving the performance of methods that employ the Manhattan distance between text units. Many random indexing methods primarily generate similarity from co-occurrence of items in a corpus. Reflective Random Indexing (RRI) generates similarity from co-occurrence and from shared occurrence with other items [Cohen T., Schvaneveldt Roger & Widdows Dominic (2009). Reflective Random Indexing and indirect inference: a scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240-56].
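
Comparing bit vectors with the Hamming distance can be sketched as below. This is not the TopSig implementation: it assumes, purely for illustration, that a real-valued vector is binarized by keeping the sign of each component. The function names `to_signature` and `hamming` and the sample vectors are hypothetical.

```python
def to_signature(vector):
    """Pack signs into an int: bit i is 1 if vector[i] > 0, else 0."""
    sig = 0
    for i, x in enumerate(vector):
        if x > 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


# Made-up example vectors (not taken from any corpus).
doc_a = [3, -1, 0, 2, -4, 1, -2, 5]
doc_b = [2, -3, 1, 1, -1, -1, -2, 4]    # mostly the same signs as doc_a
doc_c = [-3, 1, 0, -2, 4, -1, 2, -5]    # doc_a with every sign flipped

sig_a, sig_b, sig_c = (to_signature(d) for d in (doc_a, doc_b, doc_c))
print(hamming(sig_a, sig_b))  # 2
print(hamming(sig_a, sig_c))  # 7
```

Because the comparison reduces to an XOR and a bit count, signatures of this kind are cheap to compare, which is consistent with their use in retrieval and clustering mentioned above.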
External links
* Zadeh Behrang Qasemi, Handschuh Siegfried. (2015). Random indexing explained with high probability. TSD.