statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

and

algorithms In mathematics and computer science, an algorithm () is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for per ...

, a tensor sketch is a type of

dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

that is particularly efficient when applied to vectors that have

tensor In mathematics, a tensor is an algebraic object that describes a multilinear relationship between sets of algebraic objects associated with a vector space. Tensors may map between different objects such as vectors, scalars, and even other ...

structure. Such a sketch can be used to speed up explicit

kernel methods In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pa ...

, bilinear pooling in

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

s and is a cornerstone in many

numerical linear algebra Numerical linear algebra, sometimes called applied linear algebra, is the study of how matrix operations can be used to create computer algorithms which efficiently and accurately provide approximate answers to questions in continuous mathemati ...

algorithms.Woodruff, David P.
Sketching as a Tool for Numerical Linear Algebra
." Theoretical Computer Science 10.1-2 (2014): 1–157.

Mathematical definition

Mathematically, a dimensionality reduction or sketching matrix is a matrix

M\in\mathbb R^

, where

k, such that for any vector x\in\mathbb R^d :, \, Mx\, _2 - \, x\, _2,  < \varepsilon\, x\, _2 with high probability.
In other words, M preserves the norm of vectors up to a small error.

A tensor sketch has the extra property that if x = y \otimes z for some vectors y\in\mathbb R^, z\in\mathbb R^such that d_1d_2=d, the transformation M(y\otimes z) can be computed more efficiently. Here \otimes denotes the

Kronecker product In mathematics, the Kronecker product, sometimes denoted by ⊗, is an operation on two matrices of arbitrary size resulting in a block matrix. It is a specialization of the tensor product (which is denoted by the same symbol) from vector ...

, rather than the

outer product In linear algebra, the outer product of two coordinate vectors is the matrix whose entries are all products of an element in the first vector with an element in the second vector. If the two coordinate vectors have dimensions ''n'' and ''m'', the ...

, though the two are related by a

flattening Flattening is a measure of the compression of a circle or sphere along a diameter to form an ellipse or an ellipsoid of revolution (spheroid) respectively. Other terms used are ellipticity, or oblateness. The usual notation for flattening is f ...

. The speedup is achieved by first rewriting

M(y\otimes z) = M' y \circ M'' z

, where

\circ

denotes the elementwise (

Hadamard Jacques Salomon Hadamard (; 8 December 1865 – 17 October 1963) was a French mathematician who made major contributions in number theory, complex analysis, differential geometry, and partial differential equations. Biography The son of a tea ...

) product. Each of

M' y

and

M'' z

can be computed in time

O(k d_1)

and

O(k d_2)

, respectively; including the Hadamard product gives overall time

O(d_1 d_2 + k d_1 + k d_2)

. In most use cases this method is significantly faster than the full

M(y\otimes z)

requiring

O(kd)=O(k d_1 d_2)

time. For higher-order tensors, such as

x = y\otimes z\otimes t

, the savings are even more impressive.

History

The term tensor sketch was coined in 2013 describing a technique by Rasmus Pagh from the same year. Originally it was understood using the

fast Fourier transform A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT). A Fourier transform converts a signal from its original domain (often time or space) to a representation in ...

to do fast

convolution In mathematics (in particular, functional analysis), convolution is a operation (mathematics), mathematical operation on two function (mathematics), functions f and g that produces a third function f*g, as the integral of the product of the two ...

of count sketches. Later research works generalized it to a much larger class of dimensionality reductions via Tensor random embeddings. Tensor random embeddings were introduced in 2010 in a paper on differential privacy and were first analyzed by Rudelson et al. in 2012 in the context of sparse recovery. Avron et al. were the first to study the subspace embedding properties of tensor sketches, particularly focused on applications to

polynomial kernel In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other Kernel trick, kernelized models, that represents the similarity of vectors (training samples) in a feature space over pol ...

s. In this context, the sketch is required not only to preserve the norm of each individual vector with a certain probability but to preserve the norm of all vectors in each individual

linear subspace In mathematics, the term ''linear'' is used in two distinct senses for two different properties: * linearity of a ''function (mathematics), function'' (or ''mapping (mathematics), mapping''); * linearity of a ''polynomial''. An example of a li ...

. This is a much stronger property, and it requires larger sketch sizes, but it allows the kernel methods to be used very broadly as explored in the book by David Woodruff.

Tensor random projections

The face-splitting product is defined as the tensor products of the rows (was proposed by V. SlyusarAnna Esteve, Eva Boj & Josep Fortiana (2009): Interaction Terms in Distance-Based Regression, Communications in Statistics – Theory and Methods, 38:19, P. 350

in 1996 for

radar Radar is a system that uses radio waves to determine the distance ('' ranging''), direction ( azimuth and elevation angles), and radial velocity of objects relative to the site. It is a radiodetermination method used to detect and track ...

and

digital antenna array Digital antenna array (DAA) is a smart antenna with multi channels digital beamforming, usually by using fast Fourier transform (FFT). The development and practical realization of digital antenna arrays theory started in 1962 under the guidanc ...

applications). More directly, let

\mathbf\in\mathbb R^

and

\mathbf\in\mathbb R^

be two matrices. Then the face-splitting product

\mathbf\bullet \mathbf

= \left[ \begin \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ \\\hline \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ \\\hline \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ & \mathbf_\mathbf_ \end \right].

The reason this product is useful is the following identity: :

(\mathbf \bull \mathbf)(x\otimes y) = \mathbfx \circ \mathbf y
= \left[
\begin 
(\mathbfx)_1 (\mathbf y)_1 \\
(\mathbfx)_2 (\mathbf y)_2 \\
\vdots
\end\right],

where

\circ

is the element-wise (

) product. Since this operation can be computed in linear time,

\mathbf \bull \mathbf

can be multiplied on vectors with tensor structure much faster than normal matrices.

Construction with fast Fourier transform

The tensor sketch of Pham and Pagh computes

C^x \ast C^y

, where

C^

and

C^

are independent count sketch matrices and

\ast

is vector

. They show that, amazingly, this equals

C(x \otimes y)

– a count sketch of the tensor product! It turns out that this relation can be seen in terms of the face-splitting product as :

C^x \ast C^y = \mathcal F^(\mathcal F C^x \circ \mathcal F C^y)

, where

\mathcal F

is the DFT matrix, Fourier transform matrix. Since

\mathcal F

is an

orthonormal In linear algebra, two vectors in an inner product space are orthonormal if they are orthogonal unit vectors. A unit vector means that the vector has a length of 1, which is also known as normalized. Orthogonal means that the vectors are all perpe ...

matrix,

\mathcal F^

doesn't impact the norm of

Cx

and may be ignored. What's left is that

C \sim \mathcal C^ \bullet \mathcal C^

. On the other hand, :

\mathcal F(C^x \ast C^y) = \mathcal F C^x \circ \mathcal F C^y=  (\mathcal F C^ \bull \mathcal F C^)(x \otimes y)

Application to general matrices

The problem with the original tensor sketch algorithm was that it used count sketch matrices, which aren't always very good dimensionality reductions. In 2020 it was shown that any matrices with random enough independent rows suffice to create a tensor sketch. This allows using matrices with stronger guarantees, such as real Gaussian Johnson Lindenstrauss matrices. In particular, we get the following theorem :Consider a matrix

T

with i.i.d. rows

T_1, \dots, T_m\in \mathbb R^d

, such that

\, x\, _2^2

and

\le \sqrt\, x\, _2

. Let

T^, \dots, T^

be independent consisting of

T

and

M = T^ \bullet \dots \bullet T^

. : Then

, \, Mx\, _2 - \, x\, _2,  < \varepsilon\, x\, _2

with probability

1-\delta

for any vector

x

if :

m = (4a)^ \varepsilon^ \log1/\delta + (2ae)\varepsilon^(\log1/\delta)^c

. In particular, if the entries of

T

are

\pm1

we get

m = O(\varepsilon^\log1/\delta + \varepsilon^(\tfrac1c\log1/\delta)^c)

which matches the normal Johnson Lindenstrauss theorem of

m = O(\varepsilon^\log1/\delta)

when

\varepsilon

is small. The paper also shows that the dependency on

\varepsilon^(\tfrac1c\log1/\delta)^c

is necessary for constructions using tensor randomized projections with

Gaussian Carl Friedrich Gauss (1777–1855) is the eponym of all of the topics listed below. There are over 100 topics all named after this German mathematician and scientist, all in the fields of mathematics, physics, and astronomy. The English eponymo ...

entries.

Variations

Recursive construction

Because of the exponential dependency on

c

in tensor sketches based on the face-splitting product, a different approach was developed in 2020 which applies :

M(x\otimes y\otimes\cdots)
= M^(x \otimes (M^y \otimes \cdots))

We can achieve such an

M

by letting :

M = M^(M^\otimes I_d)(M^\otimes I_)\cdots(M^\otimes I_)

. With this method, we only apply the general tensor sketch method to order 2 tensors, which avoids the exponential dependency in the number of rows. It can be proved that combining

c

dimensionality reductions like this only increases

\varepsilon

by a factor

\sqrt

Fast constructions

The fast Johnson–Lindenstrauss transform is a dimensionality reduction matrix Given a matrix

M\in\mathbb R^

, computing the matrix vector product

Mx

takes

kd

time. The ''Fast Johnson Lindenstrauss Transform'' (FJLT), was introduced by Ailon and Chazelle in 2006. A version of this method takes

M = \operatorname

where #

D

is a

diagonal matrix In linear algebra, a diagonal matrix is a matrix in which the entries outside the main diagonal are all zero; the term usually refers to square matrices. Elements of the main diagonal can either be zero or nonzero. An example of a 2×2 diagon ...

where each diagonal entry

D_

\pm1

independently. The matrix-vector multiplication

Dx

can be computed in

O(d)

time. #

H

is a

Hadamard matrix In mathematics, an Hadamard matrix, named after the French mathematician Jacques Hadamard, is a square matrix whose entries are either +1 or −1 and whose rows are mutually orthogonal. In geometry, geometric terms, this means that each pair of r ...

, which allows matrix-vector multiplication in time

O(d\log d)

S

is a

k\times d

sampling matrix Sampling may refer to: *Sampling (signal processing), converting a continuous signal into a discrete signal * Sampling (graphics), converting continuous colors into discrete color components *Sampling (music), the reuse of a sound recording in ano ...

which is all zeros, except a single 1 in each row. If the diagonal matrix is replaced by one which has a tensor product of

\pm1

values on the diagonal, instead of being fully independent, it is possible to compute

\operatorname(x\otimes y)

fast. For an example of this, let

\rho,\sigma\in\^2

be two independent

\pm1

vectors and let

D

be a diagonal matrix with

\rho\otimes\sigma

on the diagonal. We can then split up

\operatorname(x\otimes y)

as follows: :

&\quad= \begin 1 & 0 \\ 0 & 1 \\ 1 & 0 \end \begin 1 & 1 \\ 1 & -1 \end \begin \sigma_1 & 0 \\ 0 & \sigma_2 \\ \end \begin x_1 \\ x_2 \end \,\circ\, \begin 1 & 0 \\ 1 & 0 \\ 0 & 1 \end \begin 1 & 1 \\ 1 & -1 \end \begin \rho_1 & 0 \\ 0 & \rho_2 \\ \end \begin y_1 \\ y_2 \end . \end

In other words,

\operatorname=S^HD^ \bullet S^HD^

, splits up into two Fast Johnson–Lindenstrauss transformations, and the total reduction takes time

O(d_1\log d_1+d_2\log d_2)

rather than

d_1 d_2\log(d_1 d_2)

as with the direct approach. The same approach can be extended to compute higher degree products, such as

\operatorname(x\otimes y\otimes z)

Ahle et al. shows that if

\operatorname

has

\varepsilon^(\log1/\delta)^

rows, then

, \, \operatornamex\, _2-\, x\, ,  \le \varepsilon\, x\, _2

for any vector

x\in\mathbb R^

with probability

1-\delta

, while allowing fast multiplication with degree

c

tensors. Jin et al.,Jin, Ruhui, Tamara G. Kolda, and Rachel Ward. "Faster Johnson–Lindenstrauss Transforms via Kronecker Products." arXiv preprint arXiv:1909.04801 (2019). the same year, showed a similar result for the more general class of matrices call RIP, which includes the subsampled Hadamard matrices. They showed that these matrices allow splitting into tensors provided the number of rows is

\varepsilon^(\log1/\delta)^\log d

. In the case

c=2

this matches the previous result. These fast constructions can again be combined with the recursion approach mentioned above, giving the fastest overall tensor sketch.

Data aware sketching

It is also possible to do so-called "data aware" tensor sketching. Instead of multiplying a random matrix on the data, the data points are sampled independently with a certain probability depending on the norm of the point.

Applications

Explicit polynomial kernels

Kernel methods In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pa ...

are popular in

as they give the algorithm designed the freedom to design a "feature space" in which to measure the similarity of their data points. A simple kernel-based binary classifier is based on the following computation: :

\hat(\mathbf) = \sgn \sum_^n y_i k(\mathbf_i, \mathbf),

where

\mathbf_i\in\mathbb^d

are the data points,

y_i

is the label of the

i

th point (either −1 or +1), and

\hat(\mathbf)

is the prediction of the class of

\mathbf

. The function

k : \mathbb^d \times \mathbb R^d \to \mathbb R

is the kernel. Typical examples are the radial basis function kernel,

k(x,x') = \exp(-\, x-x'\, _2^2)

, and

s such as

k(x,x') = (1+\langle x, x'\rangle)^2

. When used this way, the kernel method is called "implicit". Sometimes it is faster to do an "explicit" kernel method, in which a pair of functions

f, g : \mathbb^d \to \mathbb^D

are found, such that

k(x,x') = \langle f(x), g(x')\rangle

. This allows the above computation to be expressed as :

\hat(\mathbf)
= \sgn \sum_^n y_i \langle f(\mathbf_i), g(\mathbf)\rangle
= \sgn \left\langle\left(\sum_^n y_i f(\mathbf_i)\right), g(\mathbf)\right\rangle,

where the value

\sum_^n y_i f(\mathbf_i)

can be computed in advance. The problem with this method is that the feature space can be very large. That is

D >> d

. For example, for the polynomial kernel

k(x,x') = \langle x,x'\rangle^3

we get

f(x) = x\otimes x\otimes x

and

g(x') = x'\otimes x'\otimes x'

, where

\otimes

is the

tensor product In mathematics, the tensor product V \otimes W of two vector spaces V and W (over the same field) is a vector space to which is associated a bilinear map V\times W \rightarrow V\otimes W that maps a pair (v,w),\ v\in V, w\in W to an element of ...

and

f(x),g(x')\in\mathbb^D

where

D=d^3

. If

d

is already large,

D

can be much larger than the number of data points (

n

) and so the explicit method is inefficient. The idea of tensor sketch is that we can compute approximate functions

f', g' : \mathbb R^d \to \mathbb R^t

where

t

can even be ''smaller'' than

d

, and which still have the property that

\langle f'(x), g'(x')\rangle \approx k(x,x')

. This method was shown in 2020 to work even for high degree polynomials and radial basis function kernels.

Compressed matrix multiplication

Assume we have two large datasets, represented as matrices

X, Y\in\mathbb R^

, and we want to find the rows

i,j

with the largest inner products

\langle X_i, Y_j\rangle

. We could compute

Z = X Y^T \in \mathbb R^

and simply look at all

n^2

possibilities. However, this would take at least

n^2

time, and probably closer to

n^2d

using standard matrix multiplication techniques. The idea of Compressed Matrix Multiplication is the general identity :

X Y^T = \sum_^d X_i \otimes Y_i

where

\otimes

is the

. Since we can compute a (

linear In mathematics, the term ''linear'' is used in two distinct senses for two different properties: * linearity of a '' function'' (or '' mapping''); * linearity of a '' polynomial''. An example of a linear function is the function defined by f(x) ...

) approximation to

X_i \otimes Y_i

efficiently, we can sum those up to get an approximation for the complete product.

Compact multilinear pooling

Bilinear pooling is the technique of taking two input vectors,

x, y

from different sources, and using the tensor product

x\otimes y

as the input layer to a neural network. In the authors considered using tensor sketch to reduce the number of variables needed. In 2017 another paperAlgashaam, Faisal M., et al.
Multispectral periocular classification with multimodal compact multi-linear pooling
." IEEE Access 5 (2017): 14572–14578. takes the FFT of the input features, before they are combined using the element-wise product. This again corresponds to the original tensor sketch.

Mathematical definition

History

Tensor random projections

Construction with fast Fourier transform

Application to general matrices

Variations

Recursive construction

Fast constructions

Data aware sketching

Applications

Explicit polynomial kernels

Compressed matrix multiplication

Compact multilinear pooling

References

Further reading