Sparse dictionary learning (also known as sparse coding) is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements, as well as those basic elements themselves. These elements are called ''atoms'' and they compose a ''dictionary''. Atoms in the dictionary are not required to be orthogonal, and they may form an over-complete spanning set. This problem setup also allows the dimensionality of the representation to be higher than that of the observed signals. These two properties lead to seemingly redundant atoms that allow multiple representations of the same signal, but they also provide an improvement in the sparsity and flexibility of the representation.

One of the most important applications of sparse dictionary learning is in the field of compressed sensing or signal recovery. In compressed sensing, a high-dimensional signal can be recovered with only a few linear measurements, provided that the signal is sparse or nearly sparse. Since not all signals satisfy this sparsity condition, it is of great importance to find a sparse representation of that signal, such as the wavelet transform or the directional gradient of a rasterized matrix. Once a matrix or a high-dimensional vector is transformed into a sparse space, different recovery algorithms like basis pursuit, CoSaMP, or fast non-iterative algorithms can be used to recover the signal.

One of the key principles of dictionary learning is that the dictionary has to be inferred from the input data. The emergence of sparse dictionary learning methods was stimulated by the fact that in signal processing one typically wants to represent the input data using as few components as possible. Before this approach, the general practice was to use predefined dictionaries (such as Fourier or wavelet transforms). However, in certain cases a dictionary that is trained to fit the input data can significantly improve the sparsity, which has applications in data decomposition, compression, and analysis, and has been used in the fields of image denoising and classification, and in video and audio processing. Sparsity and overcomplete dictionaries have immense applications in image compression, image fusion, and inpainting.


Problem statement

Given the input dataset X = [x_1, ..., x_K], x_i \in \mathbb{R}^d, we wish to find a dictionary \mathbf{D} \in \mathbb{R}^{d \times n}: D = [d_1, ..., d_n] and a representation R = [r_1,...,r_K], r_i \in \mathbb{R}^n, such that \|X-\mathbf{D}R\|^2_F is minimized and the representations r_i are sparse enough. This can be formulated as the following optimization problem:

\underset{\mathbf{D} \in \mathcal{C},\, r_i \in \mathbb{R}^n}{\text{argmin}} \sum_{i=1}^K \|x_i-\mathbf{D}r_i\|_2^2 + \lambda\|r_i\|_0, \quad \text{where } \mathcal{C} \equiv \{\mathbf{D} \in \mathbb{R}^{d \times n} : \|d_i\|_2 \leq 1 \,\,\forall i = 1,...,n\}, \quad \lambda > 0.

\mathcal{C} is required to constrain \mathbf{D} so that its atoms do not reach arbitrarily high values, which would allow arbitrarily low (but non-zero) values of r_i. \lambda controls the trade-off between the sparsity and the minimization error.

The minimization problem above is not convex because of the \ell_0-"norm", and solving this problem is NP-hard. In some cases the \ell_1-norm is known to ensure sparsity, and so the above becomes a convex optimization problem with respect to each of the variables \mathbf{D} and \mathbf{R} when the other one is fixed, but it is not jointly convex in (\mathbf{D}, \mathbf{R}).
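A minimal numerical sketch of this objective (Python with NumPy, substituting the convex \ell_1 relaxation for the \ell_0 penalty; all dimensions, names, and values below are illustrative choices, not part of any standard API):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, n, K, lam = 8, 16, 100, 0.1        # signal dim, atoms, samples, trade-off

X = rng.normal(size=(d, K))           # input data, one signal per column
D = rng.normal(size=(d, n))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms: the constraint set C
R = rng.normal(size=(n, K))           # candidate representations

# Objective: sum_i ||x_i - D r_i||_2^2 + lam * ||r_i||_1  (l1 relaxation of l0)
objective = np.sum((X - D @ R) ** 2) + lam * np.abs(R).sum()
print(objective)
</syntaxhighlight>

In an actual learning algorithm this objective is decreased by alternating between the two blocks of variables, as described in the Algorithms section below.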


Properties of the dictionary

The dictionary \mathbf{D} defined above can be "undercomplete" if n < d or "overcomplete" if n > d, with the latter being a typical assumption for a sparse dictionary learning problem. The case of a complete dictionary does not provide any improvement from a representational point of view and thus is not considered.

Undercomplete dictionaries represent the setup in which the actual input data lies in a lower-dimensional space. This case is strongly related to dimensionality reduction and techniques like principal component analysis, which require atoms d_1,...,d_n to be orthogonal. The choice of these subspaces is crucial for efficient dimensionality reduction, but it is not trivial. Dimensionality reduction based on dictionary representation can be extended to address specific tasks such as data analysis or classification. However, the main downside is the limited choice of atoms.

Overcomplete dictionaries, by contrast, do not require the atoms to be orthogonal (they will never form a basis anyway), thus allowing for more flexible dictionaries and richer data representations. An overcomplete dictionary which allows for a sparse representation of a signal can be a well-known transform matrix (wavelet transform, Fourier transform), or it can be formulated so that its elements are adapted to sparsely represent the given signal in the best way. Learned dictionaries are capable of giving sparser solutions than predefined transform matrices.


Algorithms

As the optimization problem described above can be solved as a convex problem with respect to either the dictionary or the sparse coding while the other of the two is fixed, most algorithms are based on the idea of iteratively updating one and then the other. The problem of finding an optimal sparse coding R with a given dictionary \mathbf{D} is known as sparse approximation (or sometimes just the sparse coding problem). A number of algorithms have been developed to solve it (such as matching pursuit and LASSO), and they are incorporated in the algorithms described below.


Method of optimal directions (MOD)

The method of optimal directions (or MOD) was one of the first methods introduced to tackle the sparse dictionary learning problem. Its core idea is to solve the minimization problem subject to a limited number of non-zero components of the representation vectors:

\min_{\mathbf{D}, R}\{\|X-\mathbf{D}R\|^2_F\} \quad \text{subject to} \quad \|r_i\|_0 \leq T \,\,\forall i

Here, F denotes the Frobenius norm. MOD alternates between obtaining the sparse coding using a method such as matching pursuit and updating the dictionary by computing the analytical solution of the problem, given by \mathbf{D} = XR^+, where R^+ is the Moore–Penrose pseudoinverse. After this update, \mathbf{D} is renormalized to fit the constraints and the new sparse coding is obtained again. The process is repeated until convergence (or until a sufficiently small residue).

MOD has proved to be a very efficient method for low-dimensional input data X, requiring just a few iterations to converge. However, due to the high complexity of the matrix-inversion operation, computing the pseudoinverse in high-dimensional cases is often intractable. This shortcoming has inspired the development of other dictionary learning methods.
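A sketch of one MOD iteration in Python/NumPy follows; the greedy omp helper is a simplified stand-in for a production sparse coder, and all function and variable names here are ours:

<syntaxhighlight lang="python">
import numpy as np

def omp(D, x, T):
    """Greedy orthogonal matching pursuit keeping at most T atoms."""
    residual, idx = x.copy(), []
    for _ in range(T):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))   # most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)  # refit on the support
        residual = x - D[:, idx] @ coef
    out = np.zeros(D.shape[1])
    out[idx] = coef
    return out

def mod_step(X, D, T):
    """One MOD iteration: sparse-code every signal, then update the
    dictionary with the analytical solution D = X R^+ and renormalize."""
    R = np.column_stack([omp(D, x, T) for x in X.T])
    D = X @ np.linalg.pinv(R)                          # D = X R^+
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # back to unit-norm atoms
    return D, R
</syntaxhighlight>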


K-SVD

K-SVD is an algorithm that performs SVD at its core to update the atoms of the dictionary one by one, and it is essentially a generalization of ''k''-means. It enforces that each element of the input data x_i is encoded by a linear combination of not more than T_0 elements, in a way identical to the MOD approach:

\min_{\mathbf{D}, R}\{\|X-\mathbf{D}R\|^2_F\} \quad \text{subject to} \quad \|r_i\|_0 \leq T_0 \,\,\forall i

The essence of this algorithm is to first fix the dictionary, find the best possible R under the above constraint (using orthogonal matching pursuit), and then iteratively update the atoms of dictionary \mathbf{D} in the following manner:

\|X - \mathbf{D}R\|^2_F = \left\|X - \sum_{i=1}^n d_i x^i_T\right\|^2_F = \|E_k - d_k x^k_T\|^2_F,

where x^i_T denotes the i-th row of R and E_k = X - \sum_{i \neq k} d_i x^i_T is the approximation error with the k-th atom removed. The next steps of the algorithm include a rank-1 approximation of the residual matrix E_k (restricted to the signals that actually use the k-th atom), updating d_k, and enforcing the sparsity of x^k_T after the update. This algorithm is considered to be standard for dictionary learning and is used in a variety of applications. However, it shares weaknesses with MOD: it is efficient only for signals with relatively low dimensionality, and it can get stuck at local minima.
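A hedged Python/NumPy sketch of the atom-update step just described (names are ours, and the surrounding loop over atoms and the OMP coding stage are omitted):

<syntaxhighlight lang="python">
import numpy as np

def ksvd_atom_update(X, D, R, k):
    """K-SVD update of atom k: form the residual E_k restricted to the
    signals whose codes use the atom, take its best rank-1 approximation
    via the SVD, and write the factors back. Restricting to the active
    signals is what keeps the updated row of R sparse."""
    users = np.nonzero(R[k, :])[0]          # signals that use atom k
    if users.size == 0:
        return D, R                         # unused atom: nothing to update
    E_k = X[:, users] - D @ R[:, users] + np.outer(D[:, k], R[k, users])
    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]                       # new atom: leading left singular vector
    R[k, users] = s[0] * Vt[0, :]           # new coefficients on the same support
    return D, R
</syntaxhighlight>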


Stochastic gradient descent

One can also apply a widespread stochastic gradient descent method with iterative projection to solve this problem. The idea of this method is to update the dictionary using the first-order stochastic gradient and project it onto the constraint set \mathcal{C}. The step that occurs at the i-th iteration is described by the expression

\mathbf{D}_i = \text{proj}_{\mathcal{C}}\left\{\mathbf{D}_{i-1} - \delta_i \nabla_{\mathbf{D}} \sum_{i \in S}\left(\|x_i-\mathbf{D}r_i\|_2^2+\lambda\|r_i\|_1\right)\right\},

where S is a random subset of \{1,...,K\} and \delta_i is a gradient step.
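A minimal sketch of this projected-gradient step (Python/NumPy; the sparse codes R are held fixed during the dictionary step, and all names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def sgd_dictionary_step(X, D, R, batch, step):
    """One projected stochastic-gradient update of the dictionary: take a
    gradient step on the reconstruction error over a random mini-batch of
    signals, then project back onto the constraint set C of atoms with
    at most unit l2 norm."""
    Xb, Rb = X[:, batch], R[:, batch]
    grad = -2.0 * (Xb - D @ Rb) @ Rb.T      # gradient of sum ||x_i - D r_i||_2^2
    D = D - step * grad
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms                         # shrink only atoms that exceed norm 1
</syntaxhighlight>

Here batch would be drawn freshly at each iteration, e.g. as rng.choice(K, size=32, replace=False).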


Lagrange dual method

An algorithm based on solving a dual Lagrangian problem provides an efficient way to solve for the dictionary without the complications induced by the sparsity function. Consider the following Lagrangian:

\mathcal{L}(\mathbf{D}, \Lambda) = \text{tr}\left((X-\mathbf{D}R)^T(X-\mathbf{D}R)\right) + \sum_{j=1}^n\lambda_j \left(\sum_{i=1}^d \mathbf{D}_{ij}^2 - c \right),

where c is a constraint on the norm of the atoms and \lambda_j are the so-called dual variables forming the diagonal matrix \Lambda. We can then provide an analytical expression for the Lagrange dual after minimization over \mathbf{D}:

\mathcal{D}(\Lambda) = \min_{\mathbf{D}}\mathcal{L}(\mathbf{D}, \Lambda) = \text{tr}\left(X^TX-XR^T(RR^T+\Lambda)^{-1}(XR^T)^T-c\Lambda\right).

After applying one of the optimization methods to the value of the dual (such as Newton's method or conjugate gradient) we get the value of \mathbf{D}:

\mathbf{D}^T=(RR^T+\Lambda)^{-1}(XR^T)^T

Solving this problem is less computationally hard because the number of dual variables n is often much smaller than the number of variables in the primal problem.
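A sketch of the final closed-form step (Python/NumPy); maximizing the dual to obtain the variables \lambda_j is assumed done elsewhere and the function name is ours:

<syntaxhighlight lang="python">
import numpy as np

def dual_dictionary_update(X, R, lam_diag):
    """Closed-form dictionary given already-optimized dual variables
    lam_diag (the diagonal of Lambda). Implements
    D^T = (R R^T + Lambda)^{-1} (X R^T)^T via a linear solve rather
    than an explicit matrix inverse."""
    Lam = np.diag(lam_diag)
    Dt = np.linalg.solve(R @ R.T + Lam, (X @ R.T).T)
    return Dt.T
</syntaxhighlight>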


LASSO

In this approach, the optimization problem is formulated as:

\min_{r \in \mathbb{R}^n}\{\|r\|_1\} \quad \text{subject to} \quad \|X-\mathbf{D}R\|^2_F < \epsilon,

where \epsilon is the permitted error in the reconstruction. LASSO finds an estimate of r_i by minimizing the least-squares error subject to an \ell_1-norm constraint on the solution vector, formulated as:

\min_{r \in \mathbb{R}^n} \,\, \dfrac{1}{2}\,\|X-\mathbf{D}r\|^2_F + \lambda\|r\|_1,

where \lambda > 0 controls the trade-off between sparsity and the reconstruction error. This gives the globally optimal solution. (See also the online dictionary learning approach described below.)
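The unconstrained form above can be solved, for a single signal x, with iterative soft thresholding (ISTA); ISTA is our choice of solver for this sketch, and LARS or coordinate descent would work equally well:

<syntaxhighlight lang="python">
import numpy as np

def lasso_code(D, x, lam, n_iter=200):
    """Solve min_r 0.5*||x - D r||_2^2 + lam*||r||_1 with ISTA (iterative
    soft thresholding), one standard solver for this convex problem."""
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    r = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = r - D.T @ (D @ r - x) / L       # gradient step on the quadratic part
        r = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return r
</syntaxhighlight>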


Parametric training methods

Parametric training methods aim to incorporate the best of both worlds: the realm of analytically constructed dictionaries and that of learned ones. This allows the construction of more powerful generalized dictionaries that can potentially be applied to signals of arbitrary size. Notable approaches include:
* Translation-invariant dictionaries. These dictionaries are composed of the translations of the atoms originating from the dictionary constructed for a finite-size signal patch. This allows the resulting dictionary to provide a representation for a signal of arbitrary size.
* Multiscale dictionaries. This method focuses on constructing a dictionary that is composed of differently scaled dictionaries to improve sparsity.
* Sparse dictionaries. This method focuses on not only providing a sparse representation but also on constructing a sparse dictionary, enforced by the expression \mathbf{D} = \mathbf{B}\mathbf{A}, where \mathbf{B} is some pre-defined analytical dictionary with desirable properties such as fast computation and \mathbf{A} is a sparse matrix (see the sketch after this list). Such a formulation allows the fast implementation of analytical dictionaries to be combined directly with the flexibility of sparse approaches.
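A minimal sketch of the sparse-dictionary model \mathbf{D} = \mathbf{B}\mathbf{A} (Python/NumPy): \mathbf{B} is here a DCT-II basis built by hand as the analytical base dictionary, and \mathbf{A} is a random sparse matrix; all sizes and the sparsity level (four base atoms per learned atom) are illustrative:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 128
k = np.arange(d)
B = np.cos(np.pi * (k[:, None] + 0.5) * k[None, :] / d)   # DCT-II basis columns
B /= np.linalg.norm(B, axis=0)                            # fast analytical base

A = np.zeros((d, n))
for j in range(n):
    idx = rng.choice(d, size=4, replace=False)            # few nonzeros per atom
    A[idx, j] = rng.normal(size=4)

D = B @ A                                                 # effective dictionary
D /= np.linalg.norm(D, axis=0)                            # unit-norm atoms
</syntaxhighlight>

In practice \mathbf{A} would be learned rather than drawn at random; the point of the model is that applying \mathbf{D} reduces to a fast transform by \mathbf{B} followed by a sparse multiplication.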


Online dictionary learning (LASSO approach)

Many common approaches to sparse dictionary learning rely on the fact that the whole input data X (or at least a large enough training dataset) is available for the algorithm. However, this might not be the case in a real-world scenario, as the input data might be too big to fit into memory. The other case where this assumption cannot be made is when the input data arrives in the form of a stream. Such cases lie in the field of study of online learning, which essentially suggests iteratively updating the model as new data points x become available.

A dictionary can be learned in an online manner the following way:
# For t = 1...T:
# Draw a new sample x_t
# Find a sparse coding using LARS: r_t = \underset{r \in \mathbb{R}^n}{\text{argmin}}\left(\frac{1}{2}\|x_t-\mathbf{D}_{t-1}r\|^2_2+\lambda\|r\|_1\right)
# Update the dictionary using a block-coordinate approach: \mathbf{D}_t = \underset{\mathbf{D} \in \mathcal{C}}{\text{argmin}}\frac{1}{t}\sum_{i=1}^t\left(\frac{1}{2}\|x_i-\mathbf{D}r_i\|^2_2+\lambda\|r_i\|_1\right)

This method allows us to gradually update the dictionary as new data become available for sparse representation learning, and it helps drastically reduce the amount of memory needed to store the dataset (which often has a huge size).
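A hedged Python/NumPy sketch of this loop, substituting ISTA for LARS in the sparse-coding step (both target the same \ell_1 problem); the dictionary refresh follows a block-coordinate scheme driven by accumulated sufficient statistics, and all names are ours:

<syntaxhighlight lang="python">
import numpy as np

def online_dictionary_learning(stream, D, lam):
    """Online dictionary learning sketch: for each incoming sample, compute
    a sparse code, accumulate the sufficient statistics A = sum r r^T and
    B = sum x r^T, then refresh the dictionary atom by atom."""
    m, n = D.shape
    A, B = np.zeros((n, n)), np.zeros((m, n))
    for x in stream:
        # Sparse coding: min_r 0.5*||x - D r||_2^2 + lam*||r||_1 via ISTA.
        L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
        r = np.zeros(n)
        for _ in range(100):
            z = r - D.T @ (D @ r - x) / L
            r = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        # Accumulate statistics and update each atom in turn.
        A += np.outer(r, r)
        B += np.outer(x, r)
        for j in range(n):                   # block-coordinate atom updates
            if A[j, j] < 1e-12:              # atom unused so far: skip it
                continue
            u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project onto C
    return D
</syntaxhighlight>

Because only A, B, and D are stored, the memory footprint is independent of the number of samples seen so far.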


Applications

The dictionary learning framework, namely the linear decomposition of an input signal using a few basis elements learned from the data itself, has led to state-of-the-art results in various image and video processing tasks. This technique can be applied to classification problems: if specific dictionaries have been built for each class, the input signal can be classified by finding the dictionary corresponding to the sparsest representation (a minimal sketch follows below). Dictionary learning also has properties useful for signal denoising, since usually one can learn a dictionary to represent the meaningful part of the input signal sparsely, whereas the noise in the input will have a much less sparse representation.

Sparse dictionary learning has been successfully applied to various image, video, and audio processing tasks as well as to texture synthesis and unsupervised clustering. In evaluations with the bag-of-words model, sparse coding was empirically found to outperform other coding approaches on object category recognition tasks. Dictionary learning is also used to analyse medical signals in detail, including those from electroencephalography (EEG), electrocardiography (ECG), magnetic resonance imaging (MRI), functional MRI (fMRI), continuous glucose monitors (AlMatouq et al., "Sparse Reconstruction of Glucose Fluxes Using Continuous Glucose Monitors", IEEE/ACM Transactions on Computational Biology and Bioinformatics 17(5), 2019, pp. 1797–1809, doi:10.1109/TCBB.2019.2905198), and ultrasound computer tomography (USCT), where different assumptions are used to analyze each signal.
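As an illustration of the classification scheme just described, a minimal sketch (Python/NumPy; code_fn stands for any sparse coder, e.g. OMP or a LASSO solver, and the function names are ours):

<syntaxhighlight lang="python">
import numpy as np

def classify(x, dictionaries, code_fn):
    """Pick the class whose dictionary represents x best: sparse-code x
    against each per-class dictionary with code_fn(D, x) and compare the
    resulting reconstruction errors."""
    errors = [np.linalg.norm(x - D @ code_fn(D, x)) for D in dictionaries]
    return int(np.argmin(errors))
</syntaxhighlight>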


See also

* Sparse approximation
* Sparse PCA
* K-SVD
* Matrix factorization
* Neural sparse coding

