Johnson–Lindenstrauss Lemma

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.

The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases. It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure.


Statement

Given 0 < \varepsilon < 1, a set X of N points in \mathbb{R}^n, and an integer k > 8(\ln N)/\varepsilon^2, there is a linear map f: \mathbb{R}^n \rightarrow \mathbb{R}^k such that

: (1-\varepsilon)\|u-v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1+\varepsilon)\|u-v\|^2

for all u,v \in X. The formula can be rearranged:

: (1+\varepsilon)^{-1}\|f(u)-f(v)\|^2 \leq \|u-v\|^2 \leq (1-\varepsilon)^{-1}\|f(u)-f(v)\|^2

Alternatively, for any \varepsilon\in(0,1) and any integer k\ge15(\ln N)/\varepsilon^2 (or any integer k>128(\ln N)/(9\varepsilon^2)) there exists a linear function f: \mathbb{R}^n \rightarrow \mathbb{R}^k such that the restriction f|_X is (1+\varepsilon)-bi-Lipschitz. (This follows from the above result. Sketch of proof: note that 1/(1+\varepsilon)<\sqrt{1-3\varepsilon/4} and \sqrt{1+3\varepsilon/4}<\sqrt{1+\varepsilon}<1+\varepsilon for all \varepsilon\in(0,1); do casework for ''N'' = 1 and ''N'' > 1, applying the above result with 3\varepsilon/4 in place of \varepsilon in the latter case, noting 128/9<15.)

Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size ''N'' that needs dimension

: \Omega\left(\frac{\log N}{\varepsilon^2}\right)

in order to preserve the distances between all pairs of points within a factor of (1 \pm \varepsilon).

The classical proof of the lemma takes f to be a scalar multiple of an orthogonal projection P onto a random subspace of dimension k in \mathbb{R}^n. An orthogonal projection collapses some dimensions of the space it is applied to, which reduces the length of all vectors, as well as the distances between vectors in the space. Under the conditions of the lemma, concentration of measure ensures there is a nonzero chance that a random orthogonal projection reduces pairwise distances between all points in X by roughly a constant factor c. Since the chance is nonzero, such projections must exist, so we can choose one P and set f(v) = Pv/c.

To obtain the projection algorithmically, it suffices, with high probability, to repeatedly sample orthogonal projection matrices at random: if one keeps rolling the dice, a suitable projection is obtained in expected polynomial time.
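
As an illustrative sketch only (using a Gaussian random matrix in place of a random orthogonal projection, a standard substitute), the following Python code draws one random linear map and measures the worst pairwise squared-distance distortion over a synthetic point set; the values of N, n and eps are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
n, N, eps = 1000, 50, 0.5
k = int(np.ceil(8 * np.log(N) / eps**2)) + 1   # k > 8 (ln N) / eps^2, as in the statement

X = rng.standard_normal((N, n))                # N arbitrary points in R^n
A = rng.standard_normal((k, n))                # random Gaussian matrix
Y = X @ A.T / np.sqrt(k)                       # images f(x) = A x / sqrt(k) of all points

worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        orig = np.sum((X[i] - X[j]) ** 2)
        proj = np.sum((Y[i] - Y[j]) ** 2)
        worst = max(worst, abs(proj / orig - 1.0))

# The lemma guarantees that a map with worst <= eps exists; a random draw achieves
# this with positive probability, so one can simply re-sample A if the check fails.
print(k, worst)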


Proof

Based on. Construct a random matrix A \sim \mathcal N(0,1)^{k\times n}, obtained by sampling each entry independently from the standard normal distribution, and define P := A/\sqrt k. For any nonzero vector x \in \R^n, let the projected vector be \hat x := P x. A standard geometric argument shows that

: r := \frac{k\,\|\hat x\|^2}{\|x\|^2}

is chi-square distributed, that is, r \sim \chi^2(k). Thus it satisfies the concentration inequality

: \Pr\left(r \in (1\pm \varepsilon) k \right) \geq 1 - 2e^{-(\varepsilon^2-\varepsilon^3)k/4}.

By the union bound, the probability that this relation holds simultaneously for all of x_1, \dots, x_N is greater than 1 - 2N e^{-(\varepsilon^2-\varepsilon^3)k/4}. (Applying the same bound to the pairwise difference vectors u - v with u, v \in X, of which there are fewer than N^2/2, yields the distance-preservation statement of the lemma, with N replaced by N^2/2 in the union bound.) When k > \frac{4\ln(2N)}{\varepsilon^2-\varepsilon^3}, the probability is nonzero. More generally, when k \geq \frac{4(d+1)\ln(2N)}{\varepsilon^2-\varepsilon^3}, the probability is \geq 1 - 1/(2N)^d, allowing an arbitrarily high probability of success per sample, and ''a fortiori'' polynomial random time.
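
As a sanity check on the two ingredients of this proof, the following sketch (with arbitrary illustrative values of n, k, eps and the number of trials) samples the statistic r = k\|Px\|^2/\|x\|^2 many times, compares its mean and variance with those of a \chi^2(k) variable, and compares the empirical frequency of r \in (1\pm\varepsilon)k with the stated concentration bound:

import numpy as np

rng = np.random.default_rng(1)
n, k, eps, trials = 200, 100, 0.2, 1000

x = rng.standard_normal(n)                        # an arbitrary nonzero vector
r = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((k, n))               # A ~ N(0,1)^(k x n)
    P = A / np.sqrt(k)
    r[t] = k * np.sum((P @ x) ** 2) / np.sum(x ** 2)

print(r.mean(), r.var())                          # approx. k and 2k, as for chi^2(k)

inside = np.mean(np.abs(r / k - 1.0) <= eps)      # empirical P(r in (1 +/- eps) k)
bound = 1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4)
print(inside, bound)                              # the empirical frequency exceeds the bound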


Alternate statement

A related lemma is the distributional JL lemma. This lemma states that for any 0 < \varepsilon, \delta < 1/2 and positive integer d, there exists a distribution over \mathbb{R}^{k \times d} from which the matrix A is drawn such that for k = O(\varepsilon^{-2} \log(1/\delta)) and for any unit-length vector x \in \mathbb{R}^{d}, the claim below holds.

: P(|\Vert Ax\Vert_2^2-1| >\varepsilon)<\delta

One can obtain the JL lemma from the distributional version by setting x = (u-v)/\|u-v\|_2 and \delta < 1/N^2 for some pair ''u'', ''v'' both in ''X''. Then the JL lemma follows by a union bound over all such pairs.
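
A minimal numerical sketch of the distributional statement, assuming a Gaussian matrix with i.i.d. N(0, 1/k) entries as the distribution (one standard choice satisfying the lemma) and taking k from the chi-square concentration bound used in the proof above rather than from any optimized constant; d, eps and delta are illustrative:

import numpy as np

rng = np.random.default_rng(2)
d, eps, delta = 200, 0.4, 0.1
k = int(np.ceil(4 * np.log(2 / delta) / (eps**2 - eps**3)))   # k = O(eps^-2 log(1/delta))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                                        # unit-length test vector

trials, failures = 2000, 0
for _ in range(trials):
    A = rng.standard_normal((k, d)) / np.sqrt(k)              # one draw of A; E||Ax||^2 = 1
    if abs(np.sum((A @ x) ** 2) - 1.0) > eps:
        failures += 1

print(k, failures / trials, delta)   # the empirical failure rate stays below delta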


Sparse JL transform


Database-friendly JL transform

(Achlioptas, 2003) proposed a "database-friendly" JL transform, using matrices whose entries are drawn only from \{-1, 0, +1\}. (In the simplest variant the entries R_{ij} are independent and uniform on \{-1,+1\}; Achlioptas also proposed a sparser variant in which R_{ij} is 0 with probability 2/3 and \pm 1 with probability 1/6 each, with the overall map rescaled by \sqrt 3 to compensate.) Fix some unit vector v \in \R^n and set f(v) := Rv/\sqrt k. Define Q_i := \sum_j R_{ij} v_j, so that \|f(v)\|_2^2 = \frac{1}{k} \sum_i Q_i^2. Now, since the Q_1, \dots, Q_k are IID, we want to apply a Chernoff concentration bound for \frac 1k \sum_i Q_i^2 around 1. This requires upper-bounding the cumulant generating function (CGF). Since Q_i is stochastically dominated by the standard Gaussian and \operatorname E[Q_i^2] = 1, it remains to perform a Chernoff bound for Q_i^2, which requires bounding the cumulant generating function on both ends.
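
A minimal sketch of the sparser variant described above, assuming entries in \{-1, 0, +1\} with probabilities 1/6, 2/3, 1/6 and an overall \sqrt 3 rescaling so that squared norms are preserved in expectation; the sizes n and k are arbitrary illustrative values:

import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 128

# Sparse sign matrix: +1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6; the sqrt(3) factor
# restores unit variance per entry so that E||f(v)||^2 = ||v||^2.
R = rng.choice([-1.0, 0.0, 1.0], size=(k, n), p=[1/6, 2/3, 1/6])
f = lambda v: np.sqrt(3.0 / k) * (R @ v)

v = rng.standard_normal(n)
v /= np.linalg.norm(v)               # unit vector, as in the analysis above
print(np.linalg.norm(f(v))**2)       # concentrates around 1

About two thirds of the entries of R are zero, which is the sense in which the transform is database-friendly: it can be applied with additions and subtractions on roughly a third of the coordinates.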


Sparser JL transform on well-spread vectors

(Matoušek, 2008) proposed a variant of the above JL transform that is even more sparsified, though it only works on "well-spread" vectors. These constructions were generalized to matrices with independent, mean-zero, unit-variance, subgaussian entries by (Dirksen, 2016).


Speeding up the JL transform

Given ''A'', computing the matrix–vector product takes O(kd) time. There has been some work on deriving distributions for which the matrix–vector product can be computed in less than O(kd) time. There are two major lines of work. The first, the ''Fast Johnson–Lindenstrauss Transform'' (FJLT), was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix–vector product in just d\log d + k^{2+\gamma} time for any constant \gamma>0. Another approach is to build a distribution supported over matrices that are sparse. This method allows keeping only an \varepsilon fraction of the entries in the matrix, which means the computation can be done in just kd\varepsilon time. Furthermore, if the vector has only b non-zero entries, the sparse JL takes time kb\varepsilon, which may be much less than the d\log d time used by Fast JL.
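
The following is a simplified sketch in the spirit of the FJLT, not Ailon and Chazelle's exact construction: it applies a random sign diagonal, a normalized fast Walsh–Hadamard transform, and uniform row subsampling (a subsampled randomized Hadamard transform), which illustrates how the d\log d running time arises; the sizes d and k are illustrative, and d must be a power of two:

import numpy as np

def fwht(x):
    # Unnormalised fast Walsh-Hadamard transform of a 1-D array whose length is a power of two.
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(4)
d, k = 1024, 64
signs = rng.choice([-1.0, 1.0], size=d)   # random diagonal of signs
rows = rng.integers(0, d, size=k)         # uniform row sampling

def sketch(x):
    # x -> sqrt(d/k) * (sampled rows of) (H/sqrt(d)) * D * x, in O(d log d + k) arithmetic
    y = fwht(signs * x) / np.sqrt(d)
    return np.sqrt(d / k) * y[rows]

x = rng.standard_normal(d)
print(np.linalg.norm(x)**2, np.linalg.norm(sketch(x))**2)   # equal in expectation

The sign diagonal followed by the Hadamard transform spreads the mass of x across all coordinates, so that sampling only k of them already captures the norm well.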


Tensorized random projections

It is possible to combine two JL matrices by taking the so-called face-splitting product, defined as the tensor product of their rows (proposed by V. Slyusar in 1996 for radar and digital antenna array applications). More directly, let \mathbf{C} and \mathbf{D} be two matrices, each with three rows. Then the face-splitting product \mathbf{C}\bullet\mathbf{D} is

: \mathbf{C}\bullet\mathbf{D} = \left[\begin{array}{c} \mathbf{C}_1 \otimes \mathbf{D}_1\\\hline \mathbf{C}_2 \otimes \mathbf{D}_2\\\hline \mathbf{C}_3 \otimes \mathbf{D}_3 \end{array}\right].

This idea of tensorization was used by Kasiviswanathan et al. for differential privacy. JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:

: (\mathbf{C} \bullet \mathbf{D})(x\otimes y) = \mathbf{C}x \circ \mathbf{D}y = \left[\begin{array}{c} (\mathbf{C}x)_1 (\mathbf{D}y)_1 \\ (\mathbf{C}x)_2 (\mathbf{D}y)_2 \\ \vdots \end{array}\right],

where \circ is the element-wise (Hadamard) product. Such computations have been used to efficiently compute polynomial kernels and many other algorithms. In 2020 it was shown that if the matrices C_1, C_2, \dots, C_c are independent \pm1 or Gaussian matrices, the combined matrix C_1 \bullet \dots \bullet C_c satisfies the distributional JL lemma if the number of rows is at least

: O\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}\left(\tfrac1c\log(1/\delta)\right)^c\right).

For large \varepsilon this is as good as the completely random Johnson–Lindenstrauss transform, but a matching lower bound in the same paper shows that this exponential dependence on (\log(1/\delta))^c is necessary. Alternative JL constructions are suggested to circumvent this.


See also

* Random projection
* Restricted isometry property
* Word embeddings

