In mathematics, the Johnson–Lindenstrauss lemma is a result named after
William B. Johnson and
Joram Lindenstrauss concerning low-distortion
embeddings of points from high-dimensional into low-dimensional Euclidean space. The
lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are
nearly preserved. In the classical proof of the lemma, the embedding is a
random orthogonal projection.
The lemma has applications in
compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing
. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see
vector space model
Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...
for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases. It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure.
Statement
Given 0 < \varepsilon < 1, a set X of N points in \mathbb{R}^n, and an integer k > 8(\ln N)/\varepsilon^2, there is a linear map f : \mathbb{R}^n \to \mathbb{R}^k such that
: (1 - \varepsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\|u - v\|^2
for all u, v \in X.
The formula can be rearranged:
: (1 + \varepsilon)^{-1}\|f(u) - f(v)\|^2 \le \|u - v\|^2 \le (1 - \varepsilon)^{-1}\|f(u) - f(v)\|^2
Alternatively, for any \varepsilon \in (0, 1) and any integer k of order \varepsilon^{-2}\ln N, there exists a linear function f : \mathbb{R}^n \to \mathbb{R}^k such that the restriction f|_X is (1 + \varepsilon)-bi-Lipschitz.
[This result follows from the above result. Sketch of proof: do casework for N = 1 and N > 1, applying the above result to the difference vectors of points in X in the latter case.]
Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size N that needs dimension
: \Omega\!\left(\frac{\ln N}{\varepsilon^2}\right)
in order to preserve the distances between all pairs of points within a factor of (1 \pm \varepsilon).
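For a sense of scale (an illustrative calculation using the k > 8(\ln N)/\varepsilon^2 bound quoted in the statement above): embedding N = 10^6 points with distortion \varepsilon = 0.1 requires only
: k \approx 8\ln(10^6)/0.1^2 \approx 8 \cdot 13.8 / 0.01 \approx 1.1 \times 10^4
dimensions, regardless of the original dimension n.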
The classical proof of the lemma takes f to be a scalar multiple of an orthogonal projection P onto a random subspace of dimension k in \mathbb{R}^n. An orthogonal projection collapses some dimensions of the space it is applied to, which reduces the length of all vectors, as well as the distances between vectors in the space. Under the conditions of the lemma, concentration of measure ensures there is a nonzero chance that a random orthogonal projection reduces pairwise distances between all points in X by roughly a constant factor c. Since the chance is nonzero, such projections must exist, so we can choose one P and set f(v) = Pv/c.
To obtain the projection algorithmically, it suffices to repeatedly sample orthogonal projection matrices at random: each sample succeeds with nonzero (and, for slightly larger k, high) probability, so a suitable projection is found in polynomial expected time.
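A minimal sketch of this resampling procedure (assuming NumPy; the function name and the acceptance test are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def random_projection_embedding(points, k, eps, max_tries=1000, seed=0):
    """Resample random k-dimensional orthogonal projections until one
    preserves every pairwise squared distance within a (1 +/- eps) factor."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)          # shape (N, n)
    n = X.shape[1]
    scale = np.sqrt(n / k)                       # undoes the expected shrinkage of the projection
    for _ in range(max_tries):
        # Orthonormal basis of a uniformly random k-dimensional subspace of R^n.
        G = rng.standard_normal((n, k))
        Q, _ = np.linalg.qr(G)
        Y = scale * (X @ Q)                      # candidate embedding f(v)
        diffs_in = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        diffs_out = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        mask = diffs_in > 0
        ratio = diffs_out[mask] / diffs_in[mask]
        if ratio.size == 0 or (ratio.min() >= 1 - eps and ratio.max() <= 1 + eps):
            return Y                             # success: all distances preserved
    raise RuntimeError("no suitable projection found within max_tries")
</syntaxhighlight>
With k chosen as in the statement above, a single draw already succeeds with nonzero probability, so the expected number of iterations is small.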
Proof
Construct a random k \times n matrix A by sampling each entry independently from the standard normal distribution. Then define f(x) = \frac{1}{\sqrt{k}} A x. Then, for any nonzero vector x \in \mathbb{R}^n, let the projected vector be y = f(x). A standard geometric argument shows that k\|y\|^2/\|x\|^2 is chi-square distributed, that is, k\|y\|^2/\|x\|^2 \sim \chi^2_k. Thus it satisfies the concentration inequality
: \Pr\left(\left|\frac{\|y\|^2}{\|x\|^2} - 1\right| > \varepsilon\right) \le 2\exp\!\left(-\Big(\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3}\Big)\frac{k}{2}\right).
By the union bound, the probability that this relation holds simultaneously for all \binom{N}{2} difference vectors u - v with u, v \in X is greater than
: 1 - N(N-1)\exp\!\left(-\Big(\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3}\Big)\frac{k}{2}\right).
When k \ge \frac{4\ln N}{\varepsilon^2/2 - \varepsilon^3/3} (of order 8\ln N/\varepsilon^2 for small \varepsilon), this probability is nonzero.
More generally, when k \ge \frac{(4 + 2a)\ln N}{\varepsilon^2/2 - \varepsilon^3/3}, the probability of failure is at most N^{-a}, allowing arbitrarily high probability of success per sample, and ''a fortiori'' polynomial random time.
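The concentration step can be checked empirically (a sketch assuming NumPy; the parameters are arbitrary and the printed bound is the inequality quoted above):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, k, eps, trials = 300, 100, 0.3, 1000

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                    # fixed unit vector, so ||f(x)||^2 should concentrate near 1

deviations = 0
for _ in range(trials):
    A = rng.standard_normal((k, n))       # entries ~ N(0, 1)
    y = A @ x / np.sqrt(k)                # f(x) = Ax / sqrt(k)
    if abs(y @ y - 1.0) > eps:            # k*||y||^2 is chi-square with k degrees of freedom
        deviations += 1

print("empirical deviation probability:", deviations / trials)
print("bound from the proof:", 2 * np.exp(-(eps**2 / 2 - eps**3 / 3) * k / 2))
</syntaxhighlight>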
Alternate statement
A related lemma is the distributional JL lemma. This lemma states that for any 0 < \varepsilon, \delta < 1/2 and positive integer d, there exists a distribution over \mathbb{R}^{k \times d} from which the matrix A is drawn such that for k = O(\varepsilon^{-2}\log(1/\delta)) and for any unit-length vector x \in \mathbb{R}^d, the claim below holds.
: \Pr\big(\big|\,\|Ax\|_2^2 - 1\,\big| > \varepsilon\big) < \delta
One can obtain the JL lemma from the distributional version by setting x = (u - v)/\|u - v\|_2 and \delta = 1/N^2 for some pair ''u'',''v'' both in ''X''. Then the JL lemma follows by a union bound over all such pairs.
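Spelling out that union bound: each of the \binom{N}{2} normalized difference vectors fails the bound with probability less than 1/N^2, so the probability that any pair fails is less than
: \binom{N}{2}\frac{1}{N^2} = \frac{N(N-1)}{2N^2} < \frac{1}{2},
and hence a single draw of A satisfies the JL guarantee for all pairs with probability greater than 1/2.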
Sparse JL transform
Database-friendly JL transform
(Achlioptas, 2003) proposed the "database-friendly" JL transform, using matrices whose entries are drawn only from {-1, 0, +1}.
Fix some unit vector x \in \mathbb{R}^n. Let A_i denote the i-th row of the random matrix A (scaled so that its entries have mean zero and unit variance), define y_i = \langle A_i, x\rangle, and set f(x) = \frac{1}{\sqrt{k}}(y_1, \dots, y_k). We have
: \|f(x)\|^2 = \frac{1}{k}\sum_{i=1}^{k} y_i^2, \qquad \mathbb{E}\,\|f(x)\|^2 = \|x\|^2 = 1.
Now, since the y_i^2 are IID, we want to apply a Chernoff concentration bound for \frac{1}{k}\sum_i y_i^2 around 1. This requires upper-bounding the cumulant generating function (CGF).
Since y_i is stochastically dominated by the standard gaussian and \mathbb{E}\,y_i^2 = 1, it remains to perform a Chernoff bound for \frac{1}{k}\sum_i y_i^2, which requires bounding the cumulant generating function on both ends.
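A minimal sketch of such a projection (assuming NumPy; this uses the sparse {-1, 0, +1} distribution with probabilities 1/6, 2/3, 1/6 proposed in (Achlioptas, 2003), with the scaling pulled out front):
<syntaxhighlight lang="python">
import numpy as np

def database_friendly_matrix(k, n, seed=0):
    """k x n matrix with entries +1 (prob 1/6), 0 (prob 2/3), -1 (prob 1/6)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 0.0, 1.0], size=(k, n), p=[1/6, 2/3, 1/6])

def project(A, x):
    """Apply f(x) = sqrt(3/k) * A x; this scaling gives E||f(x)||^2 = ||x||^2."""
    k = A.shape[0]
    return np.sqrt(3.0 / k) * (A @ x)
</syntaxhighlight>
Because two thirds of the entries are zero and the rest are signs, computing A x needs only additions and subtractions on about a third of the coordinates, which is what makes the construction "database-friendly".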
Sparser JL transform on well-spread vectors
(Matoušek, 2008) proposed a variant of the above JL transform that is even sparser, though it only works on "well-spread" vectors.
The above constructions are generalized to matrices with independent, mean-zero, unit-variance, subgaussian entries in (Dirksen, 2016).
Speeding up the JL transform
Given ''A'', computing the matrix–vector product takes O(kd) time. There has been some work in deriving distributions for which the matrix–vector product can be computed in less than O(kd) time.
There are two major lines of work. The first, ''Fast Johnson Lindenstrauss Transform'' (FJLT), was introduced by Ailon and
Chazelle in 2006.
This method allows the computation of the matrix–vector product in just O(d\log d + k^{2+\gamma}) time for any constant \gamma > 0.
Another approach is to build a distribution supported over matrices that are sparse.
This method allows keeping only an \varepsilon fraction of the entries in the matrix, which means the computation can be done in just O(kd\varepsilon) time.
Furthermore, if the vector has only b non-zero entries, the Sparse JL takes time O(kb\varepsilon), which may be much less than the O(d\log d) time used by Fast JL.
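A sketch of such a sparse construction (assuming SciPy; the choice of exactly s random signed entries per column, scaled by 1/\sqrt{s}, follows the general sparse-JL style and is illustrative rather than the exact distribution of the cited works):
<syntaxhighlight lang="python">
import numpy as np
from scipy import sparse

def sparse_jl_matrix(k, d, s, seed=0):
    """k x d matrix with exactly s nonzeros per column: random signs scaled by 1/sqrt(s)."""
    rng = np.random.default_rng(seed)
    rows, cols, vals = [], [], []
    for j in range(d):
        picked = rng.choice(k, size=s, replace=False)   # s distinct rows for column j
        signs = rng.choice([-1.0, 1.0], size=s)
        rows.extend(picked)
        cols.extend([j] * s)
        vals.extend(signs / np.sqrt(s))
    return sparse.csr_matrix((vals, (rows, cols)), shape=(k, d))

A = sparse_jl_matrix(k=128, d=10_000, s=8)
x = np.random.default_rng(1).standard_normal(10_000)
y = A @ x   # costs work proportional to the s*d stored nonzeros, instead of k*d
</syntaxhighlight>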
Tensorized random projections
It is possible to combine two JL matrices by taking the so-called face-splitting product, which is defined as the tensor product of their corresponding rows (proposed by V. Slyusar in 1996 for radar and digital antenna array applications).
More directly, let A \in \mathbb{R}^{m \times p} and B \in \mathbb{R}^{m \times q} be two matrices with rows A_1, \dots, A_m and B_1, \dots, B_m. Then the face-splitting product is
: A \bullet B = \begin{bmatrix} A_1 \otimes B_1 \\ A_2 \otimes B_2 \\ \vdots \\ A_m \otimes B_m \end{bmatrix}.
This idea of tensorization was used by Kasiviswanathan et al. for
differential privacy.
JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:
: (A \bullet B)(x \otimes y) = (Ax) \circ (By),
where \circ is the element-wise (Hadamard) product.
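A quick numerical check of this identity (a sketch assuming NumPy; the dimensions are arbitrary):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
m, p, q = 4, 3, 5
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, q))

# Face-splitting product: row i of the result is kron(A[i], B[i]).
C = np.einsum('ij,ik->ijk', A, B).reshape(m, p * q)

x = rng.standard_normal(p)
y = rng.standard_normal(q)

lhs = C @ np.kron(x, y)          # (A . B)(x tensor y)
rhs = (A @ x) * (B @ y)          # (Ax) Hadamard (By)
assert np.allclose(lhs, rhs)
</syntaxhighlight>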
Such computations have been used to efficiently compute
polynomial kernels and many other linear algebra operations.
In 2020 it was shown that if the matrices C_1, C_2, \dots, C_c are independent \pm 1 or Gaussian matrices, the combined matrix C_1 \bullet C_2 \bullet \dots \bullet C_c satisfies the distributional JL lemma if the number of rows is at least
: O\!\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}\big(\log(1/\delta)\big)^{c}\right).
For large \varepsilon this is as good as the completely random Johnson–Lindenstrauss transform, but a matching lower bound in the same paper shows that this exponential dependence on the number of factors c is necessary.
Alternative JL constructions are suggested to circumvent this.
See also
* Random projection
* Restricted isometry property
* Word embeddings
Notes
References