Low-rank Matrix Approximations
   HOME

TheInfoList



OR:

Low-rank matrix approximations are essential tools in the application of ''kernel methods to large-scale learning'' problems.Francis R. Bach and Michael I. Jordan (2005)
"Predictive low-rank decomposition for kernel methods"
''ICML''.
Kernel methods (for instance,
support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
s or
Gaussian process In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. e ...
es) project data points into a high-dimensional or infinite-dimensional feature space and find the optimal splitting hyperplane. In the kernel method the data is represented in a ''kernel matrix'' (or, Gram matrix). Many algorithms can solve machine learning problems using the ''kernel matrix''. The main problem of kernel method is its high computational cost associated with ''kernel matrices''. The cost is at least quadratic in the number of training data points, but most kernel methods include computation of matrix inversion or eigenvalue decomposition and the cost becomes cubic in the number of training data. Large training sets cause large storage and computational costs. While low rank decomposition methods ( Cholesky decomposition) reduce this cost, they still require computing the ''kernel matrix''. One of the approaches to deal with this problem is low-rank matrix approximations. The most popular examples of them are Nyström method and the ''random features.'' Both of them have been successfully applied to efficient kernel learning.


Nyström approximation

Kernel methods become unfeasible when the number of points n is so large such that the kernel matrix \hat cannot be stored in memory. If n is the number of training examples, the storage and computational cost required to find the solution of the problem using general kernel method is O(n^2) and O(n^3) respectively. The Nyström approximation can allow a significant speed-up of the computations.Petros Drineas and Michael W. Mahoney (2005)
"On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning".
''Journal of Machine Learning Research 6'', pp. 2153–2175.
This speed-up is achieved by using, instead of the kernel matrix, its ''approximation'' \tilde of rank q. An advantage of the method is that it is not necessary to compute or store the whole kernel matrix, but only a submatrix of size q \times n. It reduces the storage and complexity requirements to O(nq) and O(nq^2) respectively. The method is named "Nyström approximation" because it can be interpreted as a case of the Nyström method from integral equation theory.


Theorem for kernel approximation

\hat is a ''kernel matrix'' for some kernel method. Consider the first q < n points in the training set. Then there exists a matrix \tilde of rank q: \tilde=\hat_\hat_^\hat_^\text , where (\hat_q)_ = K(x_i,x_j), i, j = 1,\dots,q , \hat_q is invertible matrix and (\hat_)_ = K(x_i,x_j), i = 1,\dots, n \text j = 1,\dots,q.


Proof


= Singular-value decomposition application

= Applying
singular-value decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is relate ...
(SVD) to matrix A with dimensions p \times m produces a ''singular system'' consisting of singular values \_^k,\text (\sigma_j > 0 \text \forall j=1,\dots,k), vectors \_^m \in \Complex^m and \_^p \in \Complex^p such that they form orthonormal bases of \Complex^m and \Complex^p respectively: \begin A^\textAv_j = \sigma_j v_j, \text j=1,\dots,k,\\ A^\textAv_j =0, \text j=k+1,\dots,m,\\ AA^\textu_j=\sigma_j u_j, \text j=1,\dots,k,\\ AA^\textu_j=0, \text j=k+1,\dots,p. \end If U and V are matrices with u's and v's in the columns and \Sigma is a diagonal p \times m matrix having singular values \sigma_i on the first k-entries on the diagonal (all the other elements of the matrix are zeros): \begin Av_j = \sqrtu_j, \text j=1,\dots,k, \\ Av_j=0, \text j=k+1,\dots,m, \\ A^\textu_j= \sqrtv_j, \text j=1,\dots,k,\\ A^\textu_j=0, \text j=k+1,\dots,p, \end then the matrix A can be rewritten as:C. Eckart, G. Young, The approximation of one matrix by another of lower rank. Psychometrika, Volume 1, 1936, Pages 211–8. A=U\Sigma^V^\text.


= Further proof

= * \hat is n \times D data matrix * \hat=\hat\hat^\text * \hat=\hat^\text\hat Applying singular-value decomposition to these matrices: \hat=\hat\hat^\hat^\text,\text\hat=\hat\hat\hat^, \text \hat=\hat\hat\hat^\text. * \hat_q is the q\times D-dimensional matrix consisting of the first q rows of matrix \hat * \hat_q=\hat_q\hat_q^\text * \hat_q=\hat^\text_q \hat_q Applying singular-value decomposition to these matrices: \hat_q=\hat_q\hat_q^\hat_q^\text,\text\hat_q=\hat_q\hat_q\hat_q^, \text \hat_q=\hat_q\hat_q\hat_q^\text. Since \hat,\text \hat, \hat_q\text \hat_q are
orthogonal matrices In linear algebra, an orthogonal matrix, or orthonormal matrix, is a real square matrix whose columns and rows are orthonormal vectors. One way to express this is Q^\mathrm Q = Q Q^\mathrm = I, where is the transpose of and is the identity ma ...
, \hat=\hat\hat\hat^, \text\hat_q=\hat_q^\text\hat_q\hat_q^. Replacing \hat, \hat by \hat_q and \hat_q , an approximation for \hat can be obtained: \tilde=\hat\hat_q^\text\hat_q\hat_q^ ( \tilde is not necessarily an orthogonal matrix). However, defining \tilde=\tilde\hat_q \tilde^\text , it can be computed the next way: \begin \tilde=\tilde\hat_q \tilde^\text = \hat\hat_q^\text\hat_q\hat_q^ \hat_q (\hat\hat_q^\text\hat_q \hat_q^ )^\text \\ = \hat\hat_q^\text\big\(\hat\hat_q^\text)^\text \\ \end By the characterization for orthogonal matrix \hat_q : equality (\hat_q)^\text=(\hat_q)^ holds. Then, using the formula for the inverse of matrix multiplication (AB)^=B^A^ for invertible matrices A and B , the expression in braces can be rewritten as: \hat_q (\hat_q^)^\text\hat_q^\text = (\hat_q \hat_q^\text\hat_q^\text )^= (\hat_q)^. Then the expression for \tilde : \tilde=(\hat\hat_q^\text)\hat_q^(\hat\hat_q^\text)^\text . Defining \hat_=\hat\hat_q^\text , the proof is finished.


General theorem for kernel approximation for a feature map

For a feature map \Phi: \mathcal \to \mathcal with associated kernel K(x,x') = \langle \Phi(x),\Phi(x')\rangle_\mathcal : equality \hat=\hat_\hat_^\hat_^\text also follows by replacing \hat by the operator \hat: \mathcal \to \Reals^n such that \langle \hatw\rangle_i = \langle \Phi(x_i), w\rangle_\mathcal, \text i=1,\dots,n, w \in \mathcal, and \hat_q by the operator \hat_q: \mathcal \to \Reals^q such that \langle \hat_q w\rangle_i = \langle \Phi(x_i), w\rangle_\mathcal, \text i=1,\dots,q, w \in \mathcal . Once again, a simple inspection shows that the feature map is only needed in the proof while the end result only depends on computing the kernel function.


Application for regularized least squares

In a vector and kernel notation, the problem of Regularized least squares can be rewritten as: \min_\frac\, \hat-\hatc\, ^_ + \lambda\langle c,\hatc\rangle_ . Computing the gradient and setting in to 0, the minimum can be obtained: \begin & -\frac\hat(\hat-\hatc) + \lambda \hatc = 0 \\ \Rightarrow & \hat(\hat+\lambda n I)c = \hat \hat \\ \Rightarrow & c = (\hat+\lambda n I)^\hat, \text c \in \Reals^ \end The inverse matrix (\hat+\lambda n I)^ can be computed using Woodbury matrix identity: \begin (\hat+\lambda n I)^ &= \frac\left(\frac\hat + I\right)^ \\ &= \frac\left(I + \hat_(\hat_)^\hat_^\text\right)^ \\ &= \frac\left(I-\hat_(\lambda n\hat_+\hat_^\text \hat_)^\hat_^\text\right) \end It has the desired storage and complexity requirements.


Randomized feature maps approximation

Let \mathbf, \mathbf \in \Reals^d – samples of data, z: \Reals^d \to \Reals^D – a randomized feature map (maps a single vector to a vector of higher dimensionality) so that the inner product between a pair of transformed points approximates their kernel evaluation: K(\mathbf, \mathbf) = \langle\Phi(\mathbf),\Phi(\mathbf)\rangle \approx z(\mathbf)^\textz(\mathbf), where \Phi is the mapping embedded in the RBF kernel. Since z is low-dimensional, the input can be easily transformed with z, after that different linear learning methods to approximate the answer of the corresponding nonlinear kernel can be applied. There are different randomized feature maps to compute the approximations to the RBF kernels. For instance, Random Fourier features and random binning features.


Random Fourier features

Random Fourier features map produces a Monte Carlo approximation to the feature map. The Monte Carlo method is considered to be randomized. These ''random features'' consists of sinusoids \cos(w^\text\mathbf+b) randomly drawn from
Fourier transform A Fourier transform (FT) is a mathematical transform that decomposes functions into frequency components, which are represented by the output of the transform as a function of frequency. Most commonly functions of time or space are transformed, ...
of the kernel to be approximated, where w \in \Reals^d and b \in \Reals are
random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...
s. The line is randomly chosen, then the data points are projected on it by the mappings. The resulting scalar is passed through a sinusoid. The product of the transformed points will approximate a shift-invariant kernel. Since the map is smooth, random Fourier features work well on interpolation tasks.


Random binning features

A random binning features map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a binary bit string that corresponds to the bins in which it falls. The grids are constructed so that the probability that two points \mathbf, \mathbf \in \Reals^d are assigned to the same bin is proportional to K(\mathbf, \mathbf). The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of K(\mathbf, \mathbf). Since this mapping is not smooth and uses the proximity between input points, Random Binning Features works well for approximating kernels that depend only on the L_1 distance between datapoints.


Comparison of approximation methods

The approaches for large-scale kernel learning ( Nyström method and random features) differs in the fact that the Nyström method uses data dependent basis functions while in random features approach the basis functions are sampled from a distribution independent from the training data. This difference leads to an improved analysis for kernel learning approaches based on the Nyström method. When there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can achieve better results than Random Features based approach.Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin and Zhi-Hua Zhou (2012)
"Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison".
''Advances in Neural Information Processing Systems 25 (NIPS)''.


See also

* Nyström method *
Support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
* Radial basis function kernel * Regularized least squares


External links

* Andreas Müller (2012)
Kernel Approximations for Efficient SVMs (and other feature extraction methods)


References

{{reflist, 30em Kernel methods for machine learning