Tensor Sketch

In statistics, machine learning and algorithms, a tensor sketch is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure. Such a sketch can be used to speed up explicit kernel methods, bilinear pooling in neural networks, and is a cornerstone in many numerical linear algebra algorithms (Woodruff, David P., "Sketching as a Tool for Numerical Linear Algebra", Theoretical Computer Science 10.1–2 (2014): 1–157).


Mathematical definition

Mathematically, a dimensionality reduction or sketching matrix is a matrix M\in\mathbb R^{k\times d}, where k<d, such that for any vector x\in\mathbb R^d

:|\,\|Mx\|_2 - \|x\|_2\,| < \varepsilon\|x\|_2

with high probability. In other words, M preserves the norm of vectors up to a small error. A tensor sketch has the extra property that if x = y \otimes z for some vectors y\in\mathbb R^{d_1}, z\in\mathbb R^{d_2} such that d_1d_2=d, the transformation M(y\otimes z) can be computed more efficiently. Here \otimes denotes the Kronecker product, rather than the outer product, though the two are related by a flattening.

The speedup is achieved by first rewriting M(y\otimes z) = M' y \circ M'' z, where \circ denotes the elementwise (Hadamard) product. Each of M' y and M'' z can be computed in time O(k d_1) and O(k d_2) respectively, and their Hadamard product in time O(k), giving overall time O(k(d_1 + d_2)). In most use cases this method is significantly faster than computing the full M(y\otimes z), which requires O(kd)=O(k d_1 d_2) time. For higher-order tensors, such as x = y\otimes z\otimes t, the savings are even more impressive.
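The identity behind the speedup can be checked numerically. The following NumPy sketch is illustrative only: it assumes M is built as a face-splitting (row-wise Kronecker) product of two scaled Gaussian matrices, a construction detailed in the sections below, and all dimensions and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 50, 40, 1000

# Assumed structure: row i of M is the Kronecker product of row i of M1 with
# row i of M2 (the "face-splitting" product), scaled so norms are preserved
# in expectation.
M1 = rng.standard_normal((k, d1))
M2 = rng.standard_normal((k, d2))
M = np.vstack([np.kron(M1[i], M2[i]) for i in range(k)]) / np.sqrt(k)

y, z = rng.standard_normal(d1), rng.standard_normal(d2)
x = np.kron(y, z)                        # tensor-structured input, dimension d1*d2

direct = M @ x                           # O(k d1 d2) time
fast = (M1 @ y) * (M2 @ z) / np.sqrt(k)  # O(k(d1 + d2)) time

print(np.allclose(direct, fast))         # True: the Hadamard rewrite is exact
rel_err = abs(np.linalg.norm(fast) - np.linalg.norm(x)) / np.linalg.norm(x)
print(rel_err)                           # small, typically a few percent for k = 1000
```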


History

The term tensor sketch was coined in 2013, describing a technique by Rasmus Pagh from the same year. Originally it was understood using the fast Fourier transform to do fast convolution of count sketches. Later research works generalized it to a much larger class of dimensionality reductions via tensor random embeddings. Tensor random embeddings were introduced in 2010 in a paper on differential privacy and were first analyzed by Rudelson et al. in 2012 in the context of sparse recovery. Avron et al. were the first to study the subspace embedding properties of tensor sketches, particularly focused on applications to polynomial kernels. In this context, the sketch is required not only to preserve the norm of each individual vector with a certain probability, but to preserve the norm of all vectors in each individual linear subspace. This is a much stronger property, and it requires larger sketch sizes, but it allows the kernel methods to be used very broadly, as explored in the book by David Woodruff.


Tensor random projections

The face-splitting product, defined as the tensor product of the rows of two matrices, was proposed by V. Slyusar in 1996 for radar and digital antenna array applications (see also Anna Esteve, Eva Boj & Josep Fortiana (2009), "Interaction Terms in Distance-Based Regression", Communications in Statistics – Theory and Methods, 38:19, p. 350). More directly, let \mathbf A\in\mathbb R^{3\times 3} and \mathbf B\in\mathbb R^{3\times 3} be two matrices. Then the face-splitting product \mathbf A \bullet \mathbf B is

:\mathbf A \bullet \mathbf B = \begin{bmatrix} \mathbf A_1 \otimes \mathbf B_1 \\ \mathbf A_2 \otimes \mathbf B_2 \\ \mathbf A_3 \otimes \mathbf B_3 \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} & a_{11}b_{12} & a_{11}b_{13} & a_{12}b_{11} & a_{12}b_{12} & a_{12}b_{13} & a_{13}b_{11} & a_{13}b_{12} & a_{13}b_{13} \\ a_{21}b_{21} & a_{21}b_{22} & a_{21}b_{23} & a_{22}b_{21} & a_{22}b_{22} & a_{22}b_{23} & a_{23}b_{21} & a_{23}b_{22} & a_{23}b_{23} \\ a_{31}b_{31} & a_{31}b_{32} & a_{31}b_{33} & a_{32}b_{31} & a_{32}b_{32} & a_{32}b_{33} & a_{33}b_{31} & a_{33}b_{32} & a_{33}b_{33} \end{bmatrix}.

The reason this product is useful is the following identity:

:(\mathbf A \bullet \mathbf B)(x\otimes y) = \mathbf A x \circ \mathbf B y = \begin{bmatrix} (\mathbf A x)_1 (\mathbf B y)_1 \\ (\mathbf A x)_2 (\mathbf B y)_2 \\ \vdots \end{bmatrix},

where \circ is the element-wise (Hadamard) product. Since this operation can be computed in linear time, \mathbf A \bullet \mathbf B can be multiplied on vectors with tensor structure much faster than normal matrices.
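As a small illustration (a NumPy sketch with arbitrary dimensions, not taken from the cited references), the face-splitting product can be formed explicitly and the identity above verified directly:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(-3, 4, size=(3, 3)).astype(float)
B = rng.integers(-3, 4, size=(3, 3)).astype(float)

# Face-splitting product: row i of A • B is kron(row i of A, row i of B),
# giving a 3 × 9 matrix here.
face_split = np.vstack([np.kron(A[i], B[i]) for i in range(A.shape[0])])
# Equivalent one-liner: np.einsum('ij,ik->ijk', A, B).reshape(A.shape[0], -1)

x, y = rng.standard_normal(3), rng.standard_normal(3)

lhs = face_split @ np.kron(x, y)   # (A • B)(x ⊗ y), via the explicit matrix
rhs = (A @ x) * (B @ y)            # Ax ∘ By, computed in linear time
print(np.allclose(lhs, rhs))       # True
```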


Construction with fast Fourier transform

The tensor sketch of Pham and Pagh computes C^{(1)}x \ast C^{(2)}y, where C^{(1)} and C^{(2)} are independent count sketch matrices and \ast is vector convolution. They show that, remarkably, this equals C(x \otimes y) – a count sketch of the tensor product, where C is the count sketch whose hash and sign functions are derived from those of C^{(1)} and C^{(2)}.

It turns out that this relation can be seen in terms of the face-splitting product as

:C^{(1)}x \ast C^{(2)}y = \mathcal F^{-1}(\mathcal F C^{(1)}x \circ \mathcal F C^{(2)}y),

where \mathcal F is the discrete Fourier transform matrix. Since \mathcal F is an orthonormal matrix, \mathcal F^{-1} does not affect the norm of the sketch and may be ignored. What is left is that C \sim \mathcal F C^{(1)} \bullet \mathcal F C^{(2)}. On the other hand,

:\mathcal F(C^{(1)}x \ast C^{(2)}y) = \mathcal F C^{(1)}x \circ \mathcal F C^{(2)}y = (\mathcal F C^{(1)} \bullet \mathcal F C^{(2)})(x \otimes y).
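The construction can be reproduced in a few lines of NumPy. The sketch below is illustrative: the helper name `count_sketch`, the hash and sign arrays, and the dimensions are ad hoc choices. It checks that convolving two independent count sketches (done here through the FFT) equals the count sketch of the Kronecker product taken with the combined hash and sign functions.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, k = 5, 7, 16

# Two independent count sketches: a hash bucket and a random sign per coordinate.
h1, s1 = rng.integers(0, k, d1), rng.choice([-1, 1], d1)
h2, s2 = rng.integers(0, k, d2), rng.choice([-1, 1], d2)

def count_sketch(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)       # add s[i]*v[i] into bucket h[i]
    return out

x, y = rng.standard_normal(d1), rng.standard_normal(d2)

# Circular convolution of the two sketches, computed via the FFT.
conv = np.fft.irfft(np.fft.rfft(count_sketch(x, h1, s1)) *
                    np.fft.rfft(count_sketch(y, h2, s2)), k)

# Count sketch of x ⊗ y using the derived hash (h1 + h2 mod k) and sign (s1 * s2).
H = (h1[:, None] + h2[None, :]) % k
S = s1[:, None] * s2[None, :]
direct = count_sketch(np.kron(x, y), H.ravel(), S.ravel())

print(np.allclose(conv, direct))   # True, up to floating-point error
```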


Application to general matrices

The problem with the original tensor sketch algorithm was that it used count sketch matrices, which are not always very good dimensionality reductions. In 2020 it was shown that any matrices with sufficiently random independent rows suffice to create a tensor sketch. This allows using matrices with stronger guarantees, such as real Gaussian Johnson–Lindenstrauss matrices. In particular, we get the following theorem:

:Consider a matrix T with i.i.d. rows T_1, \dots, T_m \in \mathbb R^d, such that E[(T_1x)^2] = \|x\|_2^2 and E[(T_1x)^p]^{1/p} \le \sqrt{ap}\,\|x\|_2 for all p. Let T^{(1)}, \dots, T^{(c)} be independent matrices with such rows, and let M = T^{(1)} \bullet \dots \bullet T^{(c)}.
:Then |\,\|Mx\|_2 - \|x\|_2\,| < \varepsilon\|x\|_2 with probability 1-\delta for any vector x if
:m = (4a)^{2c} \varepsilon^{-2} \log 1/\delta + (2ae)\varepsilon^{-1}(\log 1/\delta)^c.

In particular, if the entries of T are \pm1 we get m = O(\varepsilon^{-2}\log 1/\delta + \varepsilon^{-1}(\tfrac1c\log 1/\delta)^c), which matches the normal Johnson–Lindenstrauss bound of m = O(\varepsilon^{-2}\log 1/\delta) when \varepsilon is small. The paper also shows that the dependency on \varepsilon^{-1}(\tfrac1c\log 1/\delta)^c is necessary for constructions using tensor randomized projections with Gaussian entries.
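The guarantee can be probed empirically. The sketch below is only an illustration (the dimensions, the number of trials and the Rademacher construction are arbitrary choices): it builds M = T^{(1)} \bullet T^{(2)} with \pm1 entries and records the worst observed norm distortion over random inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 30                                 # per-factor dimension, degree c = 2

def rademacher_face_split(m):
    """M = T1 • T2 with i.i.d. ±1 rows, scaled so E||Mx||^2 = ||x||^2."""
    T1 = rng.choice([-1.0, 1.0], (m, d))
    T2 = rng.choice([-1.0, 1.0], (m, d))
    return np.vstack([np.kron(T1[i], T2[i]) for i in range(m)]) / np.sqrt(m)

for m in (100, 1000, 10000):
    errs = []
    for _ in range(10):
        M = rademacher_face_split(m)
        x = rng.standard_normal(d * d)          # arbitrary input in R^(d^2)
        errs.append(abs(np.linalg.norm(M @ x) - np.linalg.norm(x))
                    / np.linalg.norm(x))
    print(m, max(errs))    # the observed distortion shrinks roughly like 1/sqrt(m)
```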


Variations


Recursive construction

Because of the exponential dependency on c in tensor sketches based on the face-splitting product, a different approach was developed in 2020 which applies

:M(x\otimes y\otimes\cdots) = M^{(1)}(x \otimes M^{(2)}(y \otimes \cdots)).

We can achieve such an M by letting

:M = M^{(1)}(I_d \otimes M^{(2)})(I_{d^2}\otimes M^{(3)})\cdots(I_{d^{c-1}}\otimes M^{(c)}).

With this method, we only apply the general tensor sketch method to order-2 tensors, which avoids the exponential dependency in the number of rows. It can be proved that combining c dimensionality reductions like this only increases \varepsilon by a factor \sqrt{c}.
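The recursion can be checked on a small example. In the NumPy sketch below (illustrative; plain Gaussian matrices stand in for the order-2 sketches M^{(i)}, and the dimensions are arbitrary), the recursive evaluation of a degree-3 tensor agrees exactly with applying the assembled matrix M, under the Kronecker-product convention used above.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 16

# Order-2 sketches: M3 maps R^d -> R^m, while M1 and M2 map R^(d*m) -> R^m.
M1 = rng.standard_normal((m, d * m)) / np.sqrt(m)
M2 = rng.standard_normal((m, d * m)) / np.sqrt(m)
M3 = rng.standard_normal((m, d)) / np.sqrt(m)

x, y, z = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

# Recursive evaluation: only order-2 sketches are ever applied.
recursive = M1 @ np.kron(x, M2 @ np.kron(y, M3 @ z))

# Explicit matrix M = M1 (I_d ⊗ M2)(I_{d^2} ⊗ M3), applied to x ⊗ y ⊗ z directly.
M = M1 @ np.kron(np.eye(d), M2) @ np.kron(np.eye(d * d), M3)
direct = M @ np.kron(np.kron(x, y), z)

print(np.allclose(recursive, direct))   # True
```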


Fast constructions

The fast Johnson–Lindenstrauss transform is a dimensionality reduction matrix that can be applied to vectors faster than a general dense matrix. Given a matrix M\in\mathbb R^{k\times d}, computing the matrix–vector product Mx takes O(kd) time. The ''Fast Johnson–Lindenstrauss Transform'' (FJLT) was introduced by Ailon and Chazelle in 2006. A version of this method takes M = \operatorname{SHD}, where

# D is a diagonal matrix where each diagonal entry D_{i,i} is \pm1 independently. The matrix–vector multiplication Dx can be computed in O(d) time.
# H is a Hadamard matrix, which allows matrix–vector multiplication in time O(d\log d).
# S is a k\times d sampling matrix which is all zeros, except a single 1 in each row.

If the diagonal matrix is replaced by one which has a tensor product of \pm1 values on the diagonal, instead of being fully independent, it is possible to compute \operatorname{SHD}(x\otimes y) fast. For an example of this, let \rho,\sigma\in\{-1,1\}^2 be two independent \pm1 vectors and let D be a diagonal matrix with \sigma\otimes\rho on the diagonal. We can then split up \operatorname{SHD}(x\otimes y) as follows:

:\begin{aligned}
&\operatorname{SHD}(x\otimes y) \\
&\quad= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \begin{bmatrix} \sigma_1 \rho_1 & 0 & 0 & 0 \\ 0 & \sigma_1 \rho_2 & 0 & 0 \\ 0 & 0 & \sigma_2 \rho_1 & 0 \\ 0 & 0 & 0 & \sigma_2 \rho_2 \end{bmatrix} \begin{bmatrix} x_1y_1 \\ x_1y_2 \\ x_2y_1 \\ x_2y_2 \end{bmatrix} \\
&\quad= \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \bullet \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \left( \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \otimes \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \right) \left( \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \otimes \begin{bmatrix} \rho_1 & 0 \\ 0 & \rho_2 \end{bmatrix} \right) \left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \otimes \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right) \\
&\quad= \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \bullet \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \left( \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\otimes\, \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \rho_1 & 0 \\ 0 & \rho_2 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right) \\
&\quad= \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\circ\, \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \rho_1 & 0 \\ 0 & \rho_2 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}.
\end{aligned}

In other words, \operatorname{SHD} = S^{(1)}HD^{(1)} \bullet S^{(2)}HD^{(2)} splits up into two fast Johnson–Lindenstrauss transformations, and the total reduction takes time O(d_1\log d_1+d_2\log d_2) rather than O(d_1 d_2\log(d_1 d_2)) as with the direct approach. The same approach can be extended to compute higher degree products, such as \operatorname{SHD}(x\otimes y\otimes z).

Ahle et al. show that if \operatorname{SHD} has \varepsilon^{-2}(\log 1/\delta)^{c+1} rows, then |\,\|\operatorname{SHD}x\|_2-\|x\|_2\,| \le \varepsilon\|x\|_2 for any vector x\in\mathbb R^{d^c} with probability 1-\delta, while allowing fast multiplication with degree-c tensors. Jin et al. (Jin, Ruhui, Tamara G. Kolda, and Rachel Ward, "Faster Johnson–Lindenstrauss Transforms via Kronecker Products", arXiv preprint arXiv:1909.04801 (2019)), the same year, showed a similar result for the more general class of matrices called RIP (matrices satisfying the restricted isometry property), which includes the subsampled Hadamard matrices. They showed that these matrices allow splitting into tensors provided the number of rows is \varepsilon^{-2}(\log 1/\delta)^{2c-1}\log d. In the case c=2 this matches the previous result. These fast constructions can again be combined with the recursion approach mentioned above, giving the fastest overall tensor sketch.
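A small NumPy illustration of the splitting is given below. It is not a reference implementation: the Hadamard matrix is built explicitly with scipy.linalg.hadamard instead of being applied by a fast transform, and the sampling pattern, signs and dimensions are arbitrary. It checks that applying one small FJLT per factor and multiplying elementwise matches the direct sketch of x \otimes y.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(6)
d, k = 8, 5                        # per-factor dimension (power of 2) and sketch size

H = hadamard(d)                    # Walsh–Hadamard matrix; fast transforms exist
sigma, rho = rng.choice([-1, 1], d), rng.choice([-1, 1], d)
rows1, rows2 = rng.integers(0, d, k), rng.integers(0, d, k)   # samplers S^(1), S^(2)

x, y = rng.standard_normal(d), rng.standard_normal(d)

# Apply a small FJLT to each factor, then combine with an elementwise product.
u = (H @ (sigma * x))[rows1]
v = (H @ (rho * y))[rows2]
fast = u * v

# The equivalent "direct" sketch of x ⊗ y with diagonal sign pattern sigma ⊗ rho.
D = np.diag(np.kron(sigma, rho).astype(float))
S = np.zeros((k, d * d))
S[np.arange(k), rows1 * d + rows2] = 1   # sample entry (rows1[i], rows2[i]) of the grid
direct = S @ np.kron(H, H) @ D @ np.kron(x, y)

print(np.allclose(fast, direct))   # True
```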


Data aware sketching

It is also possible to do so-called "data aware" tensor sketching. Instead of multiplying the data by a random matrix, the data points are sampled independently with a certain probability depending on the norm of the point.


Applications


Explicit polynomial kernels

Kernel methods are popular in machine learning as they give the algorithm designer the freedom to design a "feature space" in which to measure the similarity of their data points. A simple kernel-based binary classifier is based on the following computation:

:\hat{y}(\mathbf{x}') = \sgn \sum_{i=1}^n y_i k(\mathbf{x}_i, \mathbf{x}'),

where \mathbf{x}_i\in\mathbb{R}^d are the data points, y_i is the label of the ith point (either −1 or +1), and \hat{y}(\mathbf{x}') is the predicted class of the new point \mathbf{x}'. The function k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} is the kernel. Typical examples are the radial basis function kernel, k(x,x') = \exp(-\|x-x'\|_2^2), and polynomial kernels such as k(x,x') = (1+\langle x, x'\rangle)^2.

When used this way, the kernel method is called "implicit". Sometimes it is faster to do an "explicit" kernel method, in which a pair of functions f, g : \mathbb{R}^d \to \mathbb{R}^D is found, such that k(x,x') = \langle f(x), g(x')\rangle. This allows the above computation to be expressed as

:\hat{y}(\mathbf{x}') = \sgn \sum_{i=1}^n y_i \langle f(\mathbf{x}_i), g(\mathbf{x}')\rangle = \sgn \left\langle\left(\sum_{i=1}^n y_i f(\mathbf{x}_i)\right), g(\mathbf{x}')\right\rangle,

where the value \sum_{i=1}^n y_i f(\mathbf{x}_i) can be computed in advance.

The problem with this method is that the feature space can be very large, that is, D \gg d. For example, for the polynomial kernel k(x,x') = \langle x,x'\rangle^3 we get f(x) = x\otimes x\otimes x and g(x') = x'\otimes x'\otimes x', where \otimes is the tensor product and f(x),g(x')\in\mathbb{R}^D with D=d^3. If d is already large, D can be much larger than the number of data points (n), so the explicit method is inefficient. The idea of tensor sketch is that we can compute approximate functions f', g' : \mathbb{R}^d \to \mathbb{R}^t, where t can even be ''smaller'' than d, which still have the property that \langle f'(x), g'(x')\rangle \approx k(x,x'). This method was shown in 2020 to work even for high degree polynomials and radial basis function kernels.
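As a concrete illustration, the count-sketch based tensor sketch described above yields an approximate feature map for the degree-2 polynomial kernel k(x,x') = \langle x,x'\rangle^2. The NumPy sketch below is ad hoc (the sketch size k, the dimension d and the helper names are arbitrary assumptions), but the inner product of the sketched points is an unbiased estimate of the kernel value, with variance decreasing in k.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 30, 2000

# One degree-2 tensor sketch (count sketch + FFT construction), shared by all points.
h1, s1 = rng.integers(0, k, d), rng.choice([-1, 1], d)
h2, s2 = rng.integers(0, k, d), rng.choice([-1, 1], d)

def count_sketch(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)
    return out

def degree2_features(v):
    """Approximate feature map f'(v): <f'(x), f'(x')> ≈ <x, x'>^2."""
    return np.fft.irfft(np.fft.rfft(count_sketch(v, h1, s1)) *
                        np.fft.rfft(count_sketch(v, h2, s2)), k)

x, xp = rng.standard_normal(d), rng.standard_normal(d)
x, xp = x / np.linalg.norm(x), xp / np.linalg.norm(xp)

exact = np.dot(x, xp) ** 2
approx = np.dot(degree2_features(x), degree2_features(xp))
print(exact, approx)               # approx fluctuates around exact
```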


Compressed matrix multiplication

Assume we have two large datasets, represented as matrices X, Y\in\mathbb R^{n\times d}, and we want to find the pairs of rows i,j with the largest inner products \langle X_i, Y_j\rangle. We could compute Z = X Y^{\mathsf T} \in \mathbb R^{n\times n} and simply look at all n^2 possibilities. However, this would take at least n^2 time, and probably closer to n^2d using standard matrix multiplication techniques. The idea of compressed matrix multiplication is the general identity

:X Y^{\mathsf T} = \sum_{i=1}^d X_i \otimes Y_i,

where here X_i and Y_i denote the ith columns of X and Y and \otimes is the tensor product (the outer product X_i Y_i^{\mathsf T}). Since we can compute a (linear) approximation to X_i \otimes Y_i efficiently, we can sum those up to get an approximation for the complete product.
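The NumPy sketch below illustrates the idea (the count-sketch compression, the planted correlated pair of rows and all dimensions are arbitrary choices): each rank-one term is sketched in the Fourier domain, the sketches are summed, and a single large entry of Z = XY^T is then estimated from the compressed representation.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 50, 20, 4096

X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
Y[7] = 5 * X[3]                    # plant one unusually large inner product Z[3, 7]

# One count sketch per side; Z = X Y^T is sketched one rank-one term at a time.
h1, s1 = rng.integers(0, k, n), rng.choice([-1, 1], n)
h2, s2 = rng.integers(0, k, n), rng.choice([-1, 1], n)

def count_sketch_fft(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)
    return np.fft.rfft(out)

# Sum the sketches of the outer products of the i-th columns in the Fourier domain.
sketch_hat = sum(count_sketch_fft(X[:, i], h1, s1) *
                 count_sketch_fft(Y[:, i], h2, s2) for i in range(d))
sketch = np.fft.irfft(sketch_hat, k)        # a count sketch of the flattened matrix Z

def estimate(a, b):
    """Unbiased estimate of Z[a, b] decoded from the compressed representation."""
    return s1[a] * s2[b] * sketch[(h1[a] + h2[b]) % k]

print(X[3] @ Y[7], estimate(3, 7))          # the planted large entry stands out
```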


Compact multilinear pooling

Bilinear pooling is the technique of taking two input vectors, x, y, from different sources, and using the tensor product x\otimes y as the input layer to a neural network. Tensor sketch has been used to reduce the number of variables needed in such layers. In 2017 another paper (Algashaam, Faisal M., et al., "Multispectral periocular classification with multimodal compact multi-linear pooling", IEEE Access 5 (2017): 14572–14578) takes the FFT of the input features before they are combined using the element-wise product. This again corresponds to the original tensor sketch.


References


Further reading

* Slyusar, V. I. (1998). "A Family of Face Products of Matrices and its Properties". Cybernetics and Systems Analysis 35 (3): 379–384. doi:10.1007/BF02733426. http://slyusar.kiev.ua/FACE.pdf