Johnson–Lindenstrauss Lemma

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.

The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases. It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure.


Statement

Given 0 < \varepsilon < 1, a set X of N points in \mathbb{R}^n, and an integer k > 8(\ln N)/\varepsilon^2, there is a linear map f: \mathbb{R}^n \rightarrow \mathbb{R}^k such that

: (1-\varepsilon)\|u-v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1+\varepsilon)\|u-v\|^2

for all u,v \in X. The formula can be rearranged:

: (1+\varepsilon)^{-1}\|f(u)-f(v)\|^2 \leq \|u-v\|^2 \leq (1-\varepsilon)^{-1}\|f(u)-f(v)\|^2

Alternatively, for any \varepsilon\in(0,1) and any integer k\ge15(\ln N)/\varepsilon^2 (or any integer k>128(\ln N)/(9\varepsilon^2)) there exists a linear function f: \mathbb{R}^n \rightarrow \mathbb{R}^k such that the restriction f|_X is (1+\varepsilon)-bi-Lipschitz. (This follows from the above result. Sketch of proof: note that 1/(1+\varepsilon)<\sqrt{1-3\varepsilon/4} and \sqrt{1+3\varepsilon/4}<\sqrt{1+\varepsilon}<1+\varepsilon for all \varepsilon\in(0,1); do casework for ''N'' = 1 and ''N'' > 1, applying the above result with 3\varepsilon/4 in place of \varepsilon in the latter case, noting 128/9<15.)

Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size ''N'' that needs dimension

: \Omega\left(\frac{\log N}{\varepsilon^2}\right)

in order to preserve the distances between all pairs of points within a factor of (1 \pm \varepsilon).

The classical proof of the lemma takes f to be a scalar multiple of an orthogonal projection P onto a random subspace of dimension k in \mathbb{R}^n. An orthogonal projection collapses some dimensions of the space it is applied to, which reduces the length of all vectors, as well as the distances between vectors in the space. Under the conditions of the lemma, concentration of measure ensures there is a nonzero chance that a random orthogonal projection reduces pairwise distances between all points in X by roughly a constant factor c. Since the chance is nonzero, such projections must exist, so we can choose one P and set f(v) = Pv/c.

To obtain the projection algorithmically, it suffices, with high probability, to repeatedly sample orthogonal projection matrices at random: if one keeps rolling the dice, a suitable projection is obtained in expected polynomial time.
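
As an illustrative sketch only (using a Gaussian random matrix in place of a random orthogonal projection, a standard substitute), the following Python code draws one random linear map and measures the worst pairwise squared-distance distortion over a synthetic point set; the values of N, n and eps are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
n, N, eps = 1000, 50, 0.5
k = int(np.ceil(8 * np.log(N) / eps**2)) + 1   # k > 8 (ln N) / eps^2, as in the statement

X = rng.standard_normal((N, n))                # N arbitrary points in R^n
A = rng.standard_normal((k, n))                # random Gaussian matrix
Y = X @ A.T / np.sqrt(k)                       # images f(x) = A x / sqrt(k) of all points

worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        orig = np.sum((X[i] - X[j]) ** 2)
        proj = np.sum((Y[i] - Y[j]) ** 2)
        worst = max(worst, abs(proj / orig - 1.0))

# The lemma guarantees that a map with worst <= eps exists; a random draw achieves
# this with positive probability, so one can simply re-sample A if the check fails.
print(k, worst)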


Proof

Based on. Construct a random matrix A \sim \mathcal N(0,1)^{k\times n}, obtained by sampling each entry independently from the standard normal distribution, and define P := A/\sqrt k. For any nonzero vector x \in \R^n, let the projected vector be \hat x := P x. A standard geometric argument shows that

: r := \frac{k\,\|\hat x\|^2}{\|x\|^2}

is chi-square distributed, that is, r \sim \chi^2(k). Thus it satisfies the concentration inequality

: \Pr\left(r \in (1\pm \varepsilon) k \right) \geq 1 - 2e^{-(\varepsilon^2-\varepsilon^3)k/4}.

By the union bound, the probability that this relation holds simultaneously for all of x_1, \dots, x_N is greater than 1 - 2N e^{-(\varepsilon^2-\varepsilon^3)k/4}. (Applying the same bound to the pairwise difference vectors u - v with u, v \in X, of which there are fewer than N^2/2, yields the distance-preservation statement of the lemma, with N replaced by N^2/2 in the union bound.) When k > \frac{4\ln(2N)}{\varepsilon^2-\varepsilon^3}, the probability is nonzero. More generally, when k \geq \frac{4(d+1)\ln(2N)}{\varepsilon^2-\varepsilon^3}, the probability is \geq 1 - 1/(2N)^d, allowing an arbitrarily high probability of success per sample, and ''a fortiori'' polynomial random time.
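
As a sanity check on the two ingredients of this proof, the following sketch (with arbitrary illustrative values of n, k, eps and the number of trials) samples the statistic r = k\|Px\|^2/\|x\|^2 many times, compares its mean and variance with those of a \chi^2(k) variable, and compares the empirical frequency of r \in (1\pm\varepsilon)k with the stated concentration bound:

import numpy as np

rng = np.random.default_rng(1)
n, k, eps, trials = 200, 100, 0.2, 1000

x = rng.standard_normal(n)                        # an arbitrary nonzero vector
r = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((k, n))               # A ~ N(0,1)^(k x n)
    P = A / np.sqrt(k)
    r[t] = k * np.sum((P @ x) ** 2) / np.sum(x ** 2)

print(r.mean(), r.var())                          # approx. k and 2k, as for chi^2(k)

inside = np.mean(np.abs(r / k - 1.0) <= eps)      # empirical P(r in (1 +/- eps) k)
bound = 1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4)
print(inside, bound)                              # the empirical frequency exceeds the bound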


Alternate statement

A related lemma is the distributional JL lemma. This lemma states that for any 0 < \varepsilon, \delta < 1/2 and positive integer d, there exists a distribution over \mathbb{R}^{k \times d} from which the matrix A is drawn such that for k = O(\varepsilon^{-2} \log(1/\delta)) and for any unit-length vector x \in \mathbb{R}^{d}, the claim below holds.

: P(|\Vert Ax\Vert_2^2-1| >\varepsilon)<\delta

One can obtain the JL lemma from the distributional version by setting x = (u-v)/\|u-v\|_2 and \delta < 1/N^2 for some pair ''u'', ''v'' both in ''X''. Then the JL lemma follows by a union bound over all such pairs.
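
A minimal numerical sketch of the distributional statement, assuming a Gaussian matrix with i.i.d. N(0, 1/k) entries as the distribution (one standard choice satisfying the lemma) and taking k from the chi-square concentration bound used in the proof above rather than from any optimized constant; d, eps and delta are illustrative:

import numpy as np

rng = np.random.default_rng(2)
d, eps, delta = 200, 0.4, 0.1
k = int(np.ceil(4 * np.log(2 / delta) / (eps**2 - eps**3)))   # k = O(eps^-2 log(1/delta))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                                        # unit-length test vector

trials, failures = 2000, 0
for _ in range(trials):
    A = rng.standard_normal((k, d)) / np.sqrt(k)              # one draw of A; E||Ax||^2 = 1
    if abs(np.sum((A @ x) ** 2) - 1.0) > eps:
        failures += 1

print(k, failures / trials, delta)   # the empirical failure rate stays below delta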


Sparse JL transform


Database-friendly JL transform

(Achlioptas, 2003) proposed a "database-friendly" JL transform, using matrices whose entries are drawn only from \{-1, 0, +1\}. (In the simplest variant the entries R_{ij} are independent and uniform on \{-1,+1\}; Achlioptas also proposed a sparser variant in which R_{ij} is 0 with probability 2/3 and \pm 1 with probability 1/6 each, with the overall map rescaled by \sqrt 3 to compensate.) Fix some unit vector v \in \R^n and set f(v) := Rv/\sqrt k. Define Q_i := \sum_j R_{ij} v_j, so that \|f(v)\|_2^2 = \frac{1}{k} \sum_i Q_i^2. Now, since the Q_1, \dots, Q_k are IID, we want to apply a Chernoff concentration bound for \frac 1k \sum_i Q_i^2 around 1. This requires upper-bounding the cumulant generating function (CGF). Since Q_i is stochastically dominated by the standard Gaussian and \operatorname E[Q_i^2] = 1, it remains to perform a Chernoff bound for Q_i^2, which requires bounding the cumulant generating function on both ends.
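
A minimal sketch of the sparser variant described above, assuming entries in \{-1, 0, +1\} with probabilities 1/6, 2/3, 1/6 and an overall \sqrt 3 rescaling so that squared norms are preserved in expectation; the sizes n and k are arbitrary illustrative values:

import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 128

# Sparse sign matrix: +1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6; the sqrt(3) factor
# restores unit variance per entry so that E||f(v)||^2 = ||v||^2.
R = rng.choice([-1.0, 0.0, 1.0], size=(k, n), p=[1/6, 2/3, 1/6])
f = lambda v: np.sqrt(3.0 / k) * (R @ v)

v = rng.standard_normal(n)
v /= np.linalg.norm(v)               # unit vector, as in the analysis above
print(np.linalg.norm(f(v))**2)       # concentrates around 1

About two thirds of the entries of R are zero, which is the sense in which the transform is database-friendly: it can be applied with additions and subtractions on roughly a third of the coordinates.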


Sparser JL transform on well-spread vectors

(Matoušek, 2008) proposed a variant of the above JL transform that is even more sparsified, though it only works on "well-spread" vectors. These constructions were generalized to matrices with independent, mean-zero, unit-variance, subgaussian entries by (Dirksen, 2016).


Speeding up the JL transform

Given ''A'', computing the matrix–vector product takes O(kd) time. There has been some work on deriving distributions for which the matrix–vector product can be computed in less than O(kd) time. There are two major lines of work. The first, the ''Fast Johnson–Lindenstrauss Transform'' (FJLT), was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix–vector product in just d\log d + k^{2+\gamma} time for any constant \gamma>0. Another approach is to build a distribution supported over matrices that are sparse. This method allows keeping only an \varepsilon fraction of the entries in the matrix, which means the computation can be done in just kd\varepsilon time. Furthermore, if the vector has only b non-zero entries, the sparse JL takes time kb\varepsilon, which may be much less than the d\log d time used by Fast JL.
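
The following is a simplified sketch in the spirit of the FJLT, not Ailon and Chazelle's exact construction: it applies a random sign diagonal, a normalized fast Walsh–Hadamard transform, and uniform row subsampling (a subsampled randomized Hadamard transform), which illustrates how the d\log d running time arises; the sizes d and k are illustrative, and d must be a power of two:

import numpy as np

def fwht(x):
    # Unnormalised fast Walsh-Hadamard transform of a 1-D array whose length is a power of two.
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(4)
d, k = 1024, 64
signs = rng.choice([-1.0, 1.0], size=d)   # random diagonal of signs
rows = rng.integers(0, d, size=k)         # uniform row sampling

def sketch(x):
    # x -> sqrt(d/k) * (sampled rows of) (H/sqrt(d)) * D * x, in O(d log d + k) arithmetic
    y = fwht(signs * x) / np.sqrt(d)
    return np.sqrt(d / k) * y[rows]

x = rng.standard_normal(d)
print(np.linalg.norm(x)**2, np.linalg.norm(sketch(x))**2)   # equal in expectation

The sign diagonal followed by the Hadamard transform spreads the mass of x across all coordinates, so that sampling only k of them already captures the norm well.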


Tensorized random projections

It is possible to combine two JL matrices by taking the so-called face-splitting product, defined as the tensor product of their rows (proposed by V. Slyusar in 1996 for radar and digital antenna array applications). More directly, let \mathbf{C} and \mathbf{D} be two matrices, each with three rows. Then the face-splitting product \mathbf{C}\bullet\mathbf{D} is

: \mathbf{C}\bullet\mathbf{D} = \left[\begin{array}{c} \mathbf{C}_1 \otimes \mathbf{D}_1\\\hline \mathbf{C}_2 \otimes \mathbf{D}_2\\\hline \mathbf{C}_3 \otimes \mathbf{D}_3 \end{array}\right].

This idea of tensorization was used by Kasiviswanathan et al. for differential privacy. JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:

: (\mathbf{C} \bullet \mathbf{D})(x\otimes y) = \mathbf{C}x \circ \mathbf{D}y = \left[\begin{array}{c} (\mathbf{C}x)_1 (\mathbf{D}y)_1 \\ (\mathbf{C}x)_2 (\mathbf{D}y)_2 \\ \vdots \end{array}\right],

where \circ is the element-wise (Hadamard) product. Such computations have been used to efficiently compute polynomial kernels and many other algorithms. In 2020 it was shown that if the matrices C_1, C_2, \dots, C_c are independent \pm1 or Gaussian matrices, the combined matrix C_1 \bullet \dots \bullet C_c satisfies the distributional JL lemma if the number of rows is at least

: O\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}\left(\tfrac1c\log(1/\delta)\right)^c\right).

For large \varepsilon this is as good as the completely random Johnson–Lindenstrauss transform, but a matching lower bound in the same paper shows that this exponential dependence on (\log(1/\delta))^c is necessary. Alternative JL constructions are suggested to circumvent this.


See also

* Random projection
* Restricted isometry property
* Word embeddings

