In mathematics, low-rank approximation is a minimization problem, in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure.

Low-rank approximation is closely related to:
* principal component analysis,
* factor analysis,
* total least squares,
* latent semantic analysis,
* orthogonal regression, and
* dynamic mode decomposition.


Definition

Given
* structure specification \mathcal{S} : \mathbb{R}^{n_p} \to \mathbb{R}^{m\times n},
* vector of structure parameters p \in \mathbb{R}^{n_p},
* norm \| \cdot \|, and
* desired rank r,

: \text{minimize} \quad \text{over } \widehat p \quad \| p - \widehat p \| \quad\text{subject to}\quad \operatorname{rank}\big(\mathcal{S}(\widehat p)\big) \leq r.
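
For illustration, here is a minimal Octave/MATLAB sketch of one possible structure specification, a Hankel matrix built from the parameter vector; the particular structure, dimensions, and values below are only illustrative choices, not part of the problem statement.

p = [1 2 3 4 5 6];               % structure parameters, n_p = 6
m = 3;                           % chosen number of rows of S(p)
S = hankel(p(1:m), p(m:end));    % S(p) is the m x (n_p - m + 1) Hankel matrix of p
disp(S)                          % the constraint in the problem reads rank(S(phat)) <= r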


Applications

* Linear system identification, in which case the approximating matrix is Hankel structured. (I. Markovsky, Structured low-rank approximation and its applications, Automatica, Volume 44, Issue 4, April 2008, Pages 891–909; I. Markovsky, J. C. Willems, S. Van Huffel, B. De Moor, and R. Pintelon, Application of structured total least squares for system identification and model reduction, IEEE Transactions on Automatic Control, Volume 50, Number 10, 2005, Pages 1490–1500.)
* Machine learning, in which case the approximating matrix is nonlinearly structured. (I. Markovsky, Low-Rank Approximation: Algorithms, Implementation, Applications, Springer, 2012.)
* Recommender systems, in which case the data matrix has missing values and the approximation is categorical.
* Distance matrix completion, in which case there is a positive definiteness constraint.
* Natural language processing, in which case the approximation is nonnegative.
* Computer algebra, in which case the approximation is Sylvester structured.


Basic low-rank approximation problem

The unstructured problem with fit measured by the Frobenius norm, i.e.,

: \text{minimize} \quad \text{over } \widehat D \quad \| D - \widehat D \|_{F} \quad\text{subject to}\quad \operatorname{rank}\big(\widehat D\big) \leq r,

has an analytic solution in terms of the singular value decomposition of the data matrix. The result is referred to as the matrix approximation lemma or Eckart–Young–Mirsky theorem. This problem was originally solved by Erhard Schmidt (E. Schmidt, Zur Theorie der linearen und nichtlinearen Integralgleichungen, Math. Annalen 63 (1907), 433–476) in the infinite dimensional context of integral operators (although his methods easily generalize to arbitrary compact operators on Hilbert spaces) and later rediscovered by C. Eckart and G. Young (C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika, Volume 1, 1936, Pages 211–218). L. Mirsky generalized the result to arbitrary unitarily invariant norms (L. Mirsky, Symmetric gauge functions and unitarily invariant norms, Q.J. Math. 11 (1960), 50–59). Let

: D = U\Sigma V^{\top} \in \mathbb{R}^{m\times n}, \quad m \geq n,

be the singular value decomposition of D and partition U, \Sigma =: \operatorname{diag}(\sigma_1,\ldots,\sigma_n), and V as follows:

: U =: \begin{bmatrix} U_1 & U_2 \end{bmatrix}, \quad \Sigma =: \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \quad\text{and}\quad V =: \begin{bmatrix} V_1 & V_2 \end{bmatrix},

where U_1 is m\times r, \Sigma_1 is r\times r, and V_1 is n\times r. Then the rank-r matrix obtained from the truncated singular value decomposition

: \widehat D^* = U_1 \Sigma_1 V_1^{\top}

is such that

: \| D-\widehat D^* \|_{F} = \min_{\operatorname{rank}(\widehat D) \leq r} \| D-\widehat D \|_{F} = \sqrt{\sigma_{r+1}^2 + \cdots + \sigma_{n}^2}.

The minimizer \widehat D^* is unique if and only if \sigma_{r+1}\neq\sigma_{r}.
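
A minimal Octave/MATLAB sketch of the truncated singular value decomposition solution; the data matrix D and the rank r below are arbitrary illustrative choices.

D = randn(8, 5);                               % data matrix, m >= n
r = 2;                                         % desired rank
[U, S, V] = svd(D);
Dhat = U(:, 1:r) * S(1:r, 1:r) * V(:, 1:r)';   % truncated SVD: best rank-r approximation
sigma = diag(S);
err = norm(D - Dhat, 'fro');
err_formula = sqrt(sum(sigma(r+1:end).^2));    % sqrt(sigma_{r+1}^2 + ... + sigma_n^2)
disp([err, err_formula])                       % the two values coincide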


Proof of Eckart–Young–Mirsky theorem (for spectral norm)

Let A\in\mathbb{R}^{m\times n} be a real (possibly rectangular) matrix with m\geq n. Suppose that

: A = U\Sigma V^\top

is the singular value decomposition of A. Recall that U and V are orthogonal matrices, and \Sigma is an m\times n diagonal matrix with entries (\sigma_{1}, \sigma_{2}, \cdots, \sigma_{n}) such that \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0.

We claim that the best rank-k approximation to A in the spectral norm, denoted by \|\cdot\|_2, is given by

: A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top,

where u_i and v_i denote the ith column of U and V, respectively.

First, note that we have

: \|A-A_k\|_2 = \left\|\sum_{i=1}^{n} \sigma_i u_i v_i^\top - \sum_{i=1}^{k} \sigma_i u_i v_i^\top\right\|_2 = \left\|\sum_{i=k+1}^n \sigma_i u_i v_i^\top \right\|_2 = \sigma_{k+1}.

Therefore, we need to show that if B_k = XY^\top, where X and Y have k columns, then \|A-A_k\|_2 = \sigma_{k+1} \leq \|A-B_k\|_2.

Since Y has k columns, there must be a nontrivial linear combination of the first k+1 columns of V, i.e.,

: w = \gamma_1 v_1 + \cdots + \gamma_{k+1} v_{k+1},

such that Y^\top w = 0. Without loss of generality, we can scale w so that \|w\|_2 = 1 or (equivalently) \gamma_1^2 + \cdots + \gamma_{k+1}^2 = 1. Therefore,

: \|A-B_k\|_2^2 \geq \|(A-B_k)w\|_2^2 = \|Aw\|_2^2 = \gamma_1^2\sigma_1^2 + \cdots + \gamma_{k+1}^2\sigma_{k+1}^2 \geq \sigma_{k+1}^2.

The result follows by taking the square root of both sides of the above inequality.
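
The claim can also be checked numerically. The following sketch (with arbitrary dimensions and a randomly chosen rank-k competitor B_k) compares the spectral-norm error of A_k with \sigma_{k+1} and with the error of the competitor.

A = randn(7, 5); k = 2;                        % illustrative matrix and target rank
[U, S, V] = svd(A);
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';     % best rank-k approximation
sigma = diag(S);
disp([norm(A - Ak, 2), sigma(k+1)])            % equal, as claimed
Bk = randn(7, k) * randn(k, 5);                % some other rank-k matrix
disp(norm(A - Bk, 2) >= norm(A - Ak, 2))       % 1: A_k is at least as good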


Proof of Eckart–Young–Mirsky theorem (for Frobenius norm)

Let A\in\mathbb{R}^{m\times n} be a real (possibly rectangular) matrix with m\geq n. Suppose that

: A = U\Sigma V^\top

is the singular value decomposition of A. We claim that the best rank-k approximation to A in the Frobenius norm, denoted by \|\cdot\|_F, is given by

: A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top,

where u_i and v_i denote the ith column of U and V, respectively.

First, note that we have

: \|A-A_k\|_F^2 = \left\|\sum_{i=k+1}^n \sigma_i u_i v_i^\top \right\|_F^2 = \sum_{i=k+1}^n \sigma_i^2.

Therefore, we need to show that if B_k = XY^\top, where X and Y have k columns, then

: \|A-A_k\|_F^2 = \sum_{i=k+1}^n \sigma_{i}^2 \leq \|A-B_k\|_F^2.

By the triangle inequality with the spectral norm, if A = A' + A'' then \sigma_1(A)\leq \sigma_1(A') + \sigma_1(A''). Suppose A'_k and A''_k respectively denote the rank-k approximations to A' and A'' by the SVD method described above. Then, for any i, j \geq 1,

: \begin{align} \sigma_i(A') + \sigma_j(A'') &= \sigma_1(A'-A'_{i-1}) + \sigma_1(A''-A''_{j-1})\\ &\geq \sigma_1(A-A'_{i-1}-A''_{j-1})\\ &\geq \sigma_1(A-A_{i+j-2})\qquad (\text{since } \operatorname{rank}(A'_{i-1}+A''_{j-1})\leq \operatorname{rank}(A_{i+j-2}))\\ &=\sigma_{i+j-1}(A). \end{align}

Since \sigma_{k+1}(B_k) = 0, when A' = A-B_k and A'' = B_k we conclude that for i\geq 1, j=k+1,

: \sigma_i(A-B_k) \geq \sigma_{i+k}(A).

Therefore,

: \|A-B_k\|_F^2 = \sum_{i=1}^n \sigma_i(A-B_k)^2 \geq \sum_{i=k+1}^n\sigma_i(A)^2 = \|A-A_k\|_F^2,

as required.


Weighted low-rank approximation problems

The Frobenius norm weights uniformly all elements of the approximation error D - \widehat D. Prior knowledge about the distribution of the errors can be taken into account by considering the weighted low-rank approximation problem

: \text{minimize} \quad \text{over } \widehat D \quad \operatorname{vec}(D - \widehat D)^{\top} W \operatorname{vec}(D - \widehat D) \quad\text{subject to}\quad \operatorname{rank}(\widehat D) \leq r,

where \operatorname{vec}(A) vectorizes the matrix A column wise and W is a given positive (semi)definite weight matrix.

The general weighted low-rank approximation problem does not admit an analytic solution in terms of the singular value decomposition and is solved by local optimization methods, which provide no guarantee that a globally optimal solution is found.

In the case of uncorrelated weights, the weighted low-rank approximation problem can also be formulated as follows: for a non-negative matrix W and a matrix A, minimize \sum_{i,j} \big( W_{ij} ( A_{ij} - B_{ij} )\big)^2 over matrices B of rank at most r.
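
To illustrate the notation, a short sketch checking that, for a diagonal weight matrix built from element-wise weights (the uncorrelated case), the vec form of the cost coincides with the element-wise sum; all matrices below are arbitrary illustrative data.

D  = randn(4, 3); Dh = randn(4, 3);             % data and a candidate approximation
Wel = rand(4, 3);                               % element-wise weights W_{ij}
W  = diag(Wel(:).^2);                           % weight matrix acting on vec(D - Dh)
e  = D(:) - Dh(:);
cost_vec  = e' * W * e;                         % vec(D - Dh)' * W * vec(D - Dh)
cost_elem = sum(sum((Wel .* (D - Dh)).^2));     % sum_{ij} (W_{ij} (D_{ij} - Dh_{ij}))^2
disp([cost_vec, cost_elem])                     % the two agree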


Entry-wise Lp low-rank approximation problems

Let \|A\|_p = \left(\sum_{i,j} |A_{ij}|^p \right)^{1/p}. For p = 2, the fastest algorithm runs in \mathrm{nnz}(A) + n \cdot \mathrm{poly}(k/\epsilon) time. One of the important ideas used here is the oblivious subspace embedding (OSE), first proposed by Sarlós. For p = 1, it is known that this entry-wise L1 norm is more robust than the Frobenius norm in the presence of outliers, and it is indicated in models where Gaussian assumptions on the noise may not apply. It is natural to seek to minimize \|B - A\|_1. For p = 0 and p \geq 1, there are some algorithms with provable guarantees.
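
A short sketch of the entry-wise norm in the definition above, compared with the Frobenius norm (p = 2); the matrices are arbitrary illustrative data.

A = randn(5, 4); B = randn(5, 4);               % data and a candidate approximation
p = 1;
lp_err  = sum(abs(A(:) - B(:)).^p)^(1/p);       % entry-wise L_p error, here p = 1
fro_err = norm(A - B, 'fro');                   % p = 2 recovers the Frobenius norm
disp([lp_err, fro_err])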


Distance low-rank approximation problem

Let P = \{p_1, \ldots, p_m\} and Q = \{q_1, \ldots, q_n\} be two point sets in an arbitrary metric space. Let A represent the m \times n matrix where A_{ij} = \operatorname{dist}(p_i, q_j). Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding. In an attempt to reduce their description size, one can study low-rank approximation of such matrices.
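
A small sketch (with arbitrary planar point sets and Euclidean distance) of forming such a distance matrix and truncating it with the SVD:

P = randn(10, 2); Q = randn(12, 2);             % two point sets in the plane (illustrative)
A = zeros(10, 12);
for i = 1:10
    for j = 1:12
        A(i, j) = norm(P(i, :) - Q(j, :));      % A_{ij} = dist(p_i, q_j)
    end
end
[U, S, V] = svd(A);
A2 = U(:, 1:2) * S(1:2, 1:2) * V(:, 1:2)';      % rank-2 approximation of the distance matrix
disp(norm(A - A2, 'fro') / norm(A, 'fro'))      % relative approximation error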


Distributed/Streaming low-rank approximation problem

The low-rank approximation problem in the distributed and streaming settings has also been considered in the literature.


Image and kernel representations of the rank constraints

Using the equivalences

: \operatorname{rank}(\widehat D) \leq r \quad\iff\quad \text{there exist } P\in\mathbb{R}^{m\times r} \text{ and } L\in\mathbb{R}^{r\times n} \text{ such that } \widehat D = PL

and

: \operatorname{rank}(\widehat D) \leq r \quad\iff\quad \text{there exists full row rank } R\in\mathbb{R}^{(m-r)\times m} \text{ such that } R \widehat D = 0,

the weighted low-rank approximation problem becomes equivalent to the parameter optimization problems

: \text{minimize} \quad \text{over } \widehat D, P \text{ and } L \quad \operatorname{vec}^{\top}(D - \widehat D)\, W \operatorname{vec}(D - \widehat D) \quad\text{subject to}\quad \widehat D = PL

and

: \text{minimize} \quad \text{over } \widehat D \text{ and } R \quad \operatorname{vec}^{\top}(D - \widehat D)\, W \operatorname{vec}(D - \widehat D) \quad\text{subject to}\quad R \widehat D = 0 \quad\text{and}\quad RR^{\top} = I_{m-r},

where I_{m-r} is the identity matrix of size m-r.
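
A short sketch illustrating the two representations; the dimensions and random factors are arbitrary choices. An image representation \widehat D = PL automatically has rank at most r, and an orthonormal basis of the left null space gives a kernel representation R with R \widehat D = 0 and R R^{\top} = I.

m = 6; n = 5; r = 2;
P = randn(m, r); L = randn(r, n);
Dhat = P * L;                            % image representation: rank(Dhat) <= r
[U, ~, ~] = svd(Dhat);
R = U(:, r+1:end)';                      % kernel representation, size (m - r) x m
disp(rank(Dhat))                         % 2
disp(norm(R * Dhat, 'fro'))              % ~ 0
disp(norm(R * R' - eye(m - r), 'fro'))   % ~ 0, i.e. R R' = I_{m-r}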


Alternating projections algorithm

The image representation of the rank constraint suggests a parameter optimization method in which the cost function is minimized alternately over one of the variables (P or L) with the other one fixed. Although simultaneous minimization over both P and L is a difficult biconvex optimization problem, minimization over one of the variables alone is a linear least squares problem and can be solved globally and efficiently. The resulting optimization algorithm (called alternating projections) is globally convergent with a linear convergence rate to a locally optimal solution of the weighted low-rank approximation problem. A starting value for the P (or L) parameter should be given. The iteration is stopped when a user-defined convergence condition is satisfied.

Matlab implementation of the alternating projections algorithm for weighted low-rank approximation:

function [dh, f] = wlra_ap(d, w, p, tol, maxiter)
    [m, n] = size(d);
    r = size(p, 2);
    f = inf;
    for i = 2:maxiter
        % minimization over L
        bp = kron(eye(n), p);
        vl = (bp' * w * bp) \ (bp' * w * d(:));
        l  = reshape(vl, r, n);
        % minimization over P
        bl = kron(l', eye(m));
        vp = (bl' * w * bl) \ (bl' * w * d(:));
        p  = reshape(vp, m, r);
        % check exit condition
        dh = p * l;
        dd = d - dh;
        f(i) = dd(:)' * w * dd(:);
        if abs(f(i - 1) - f(i)) < tol, break, end
    end
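
An example call (a sketch; the data, weights, rank, starting value, and tolerances below are arbitrary):

d  = randn(10, 8);                % data matrix
w  = diag(rand(80, 1) + 0.1);     % positive definite weight matrix for vec(d)
p0 = randn(10, 2);                % starting value for P (rank 2)
[dh, f] = wlra_ap(d, w, p0, 1e-6, 100);
disp(rank(dh))                    % 2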


Variable projections algorithm

The alternating projections algorithm exploits the fact that the low rank approximation problem, parameterized in the image form, is bilinear in the variables P and L. The bilinear nature of the problem is effectively used in an alternative approach, called variable projections. Consider again the weighted low rank approximation problem, parameterized in the image form. Minimization with respect to the L variable (a linear least squares problem) leads to the closed form expression of the approximation error as a function of P:

: f(P) = \sqrt{\operatorname{vec}^{\top}(D)\Big(W - W (I_n \otimes P)\big((I_n \otimes P)^{\top} W (I_n \otimes P)\big)^{-1}(I_n \otimes P)^{\top} W\Big)\operatorname{vec}(D)}.

The original problem is therefore equivalent to the nonlinear least squares problem of minimizing f(P) with respect to P. For this purpose standard optimization methods, e.g. the Levenberg–Marquardt algorithm, can be used.
Matlab implementation of the variable projections algorithm for weighted low-rank approximation:

function [dh, f] = wlra_varpro(d, w, p, tol, maxiter)
    prob = optimset();
    prob.solver = 'lsqnonlin';
    prob.options = optimset('MaxIter', maxiter, 'TolFun', tol);
    prob.x0 = p;
    prob.objective = @(p) cost_fun(p, d, w);
    [p, f] = lsqnonlin(prob);
    [f, vl] = cost_fun(p, d, w);
    dh = p * reshape(vl, size(p, 2), size(d, 2));

function [f, vl] = cost_fun(p, d, w)
    bp = kron(eye(size(d, 2)), p);
    vl = (bp' * w * bp) \ (bp' * w * d(:));
    f = d(:)' * w * (d(:) - bp * vl);
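
An example call, analogous to the one for wlra_ap (a sketch; lsqnonlin requires the Optimization Toolbox in MATLAB or the optim package in Octave, and the data below are arbitrary):

d  = randn(10, 8);
w  = diag(rand(80, 1) + 0.1);
p0 = randn(10, 2);
[dh, f] = wlra_varpro(d, w, p0, 1e-6, 100);
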
The variable projections approach can be applied also to low rank approximation problems parameterized in the kernel form. The method is effective when the number of eliminated variables is much larger than the number of optimization variables left at the stage of the nonlinear least squares minimization. Such problems occur in system identification, parameterized in the kernel form, where the eliminated variables are the approximating trajectory and the remaining variables are the model parameters. In the context of linear time-invariant systems, the elimination step is equivalent to Kalman smoothing.


A variant: convex-restricted low rank approximation

Usually, we want the approximating solution not only to be of low rank, but also to satisfy other convex constraints due to application requirements. The problem of interest is then

: \text{minimize} \quad \text{over } \widehat p \quad \| p - \widehat p \| \quad\text{subject to}\quad \operatorname{rank}\big(\mathcal{S}(\widehat p)\big) \leq r \text{ and } g(\widehat p) \leq 0.

This problem has many real-world applications, including recovering a good solution from an inexact (semidefinite programming) relaxation. If the additional constraint g(\widehat p) \leq 0 is linear, as when all elements are required to be nonnegative, the problem is called structured low rank approximation. The more general form is named convex-restricted low rank approximation.

The problem is challenging due to the combination of the convex and nonconvex (low-rank) constraints. Different techniques have been developed for different realizations of g(\widehat p) \leq 0. The Alternating Direction Method of Multipliers (ADMM) can be applied to solve the nonconvex problem with a convex objective function, rank constraints, and other convex constraints, and is thus suitable for solving the problem above. Moreover, unlike general nonconvex problems, ADMM guarantees convergence to a feasible solution as long as its dual variable converges in the iterations.
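
As an illustration only, here is a minimal ADMM sketch for the unstructured case with element-wise nonnegativity as the convex constraint g; the penalty parameter rho, the iteration count, and the problem data are arbitrary choices, and the scheme is a heuristic in the sense discussed above, not a definitive implementation.

D = rand(8, 6); r = 2;               % data and desired rank (illustrative)
rho = 1;                             % ADMM penalty parameter (illustrative)
X = D; Z = zeros(size(D)); U = Z;    % split: X carries the convex constraint, Z the rank constraint
for iter = 1:200
    % X-update: minimize 0.5*||D - X||_F^2 + (rho/2)*||X - Z + U||_F^2 subject to X >= 0
    X = max((D + rho * (Z - U)) / (1 + rho), 0);
    % Z-update: project X + U onto the set of matrices of rank at most r (truncated SVD)
    [Us, Ss, Vs] = svd(X + U, 'econ');
    Z = Us(:, 1:r) * Ss(1:r, 1:r) * Vs(:, 1:r)';
    U = U + X - Z;                   % scaled dual update
end
disp([rank(Z), norm(D - Z, 'fro')])  % Z has rank at most r; at convergence X ~ Z is also nonnegative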


See also

* CUR matrix approximation – a low-rank approximation built from the rows and columns of the original matrix


References

* M. T. Chu, R. E. Funderlic, R. J. Plemmons, Structured low-rank approximation, Linear Algebra and its Applications, Volume 366, 1 June 2003, Pages 157–172, doi:10.1016/S0024-3795(02)00505-0.


External links


C++ package for structured low-rank approximation