Coordinate descent is an
optimization algorithm
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a
coordinate
In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of the points or other geometric elements on a manifold such as Euclidean space. The order of the coordinates is sign ...
or coordinate block via a coordinate selection rule, then exactly or inexactly minimizes over the corresponding coordinate hyperplane while fixing all other coordinates or coordinate blocks. A
line search
In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum \mathbf^* of an objective function f:\mathbb R^n\to\mathbb R. The other approach is trust region.
The line search approach first finds a ...
along the
coordinate
In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of the points or other geometric elements on a manifold such as Euclidean space. The order of the coordinates is sign ...
direction can be performed at the current iterate to determine the appropriate step size. Coordinate descent is applicable in both differentiable and derivative-free contexts.
Description
Coordinate descent is based on the idea that the minimization of a multivariable function
can be achieved by minimizing it along one direction at a time, i.e., solving univariate (or at least much simpler) optimization problems in a loop.
In the simplest case of ''cyclic coordinate descent'', one cyclically iterates through the directions, one at a time, minimizing the objective function with respect to each coordinate direction at a time. That is, starting with initial variable values
:
,
round
defines
from
by iteratively solving the single variable optimization problems
:
for each variable
of
, for
from 1 to
.
Thus, one begins with an initial guess
for a local minimum of
, and gets a sequence
iteratively.
By doing
line search
In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum \mathbf^* of an objective function f:\mathbb R^n\to\mathbb R. The other approach is trust region.
The line search approach first finds a ...
in each iteration, one automatically has
:
It can be shown that this sequence has similar convergence properties as steepest descent. No improvement after one cycle of
line search
In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum \mathbf^* of an objective function f:\mathbb R^n\to\mathbb R. The other approach is trust region.
The line search approach first finds a ...
along coordinate directions implies a stationary point is reached.
This process is illustrated below.
Differentiable case
In the case of a
continuously differentiable function , a coordinate descent algorithm can be
sketched as:
* Choose an initial parameter vector .
* Until convergence is reached, or for some fixed number of iterations:
** Choose an index from to .
** Choose a step size .
** Update to .
The step size can be chosen in various ways, e.g., by solving for the exact minimizer of (i.e., with all variables but fixed), or by traditional line search criteria.
Limitations
Coordinate descent has two problems. One of them is having a non-
smooth
Smooth may refer to:
Mathematics
* Smooth function, a function that is infinitely differentiable; used in calculus and topology
* Smooth manifold, a differentiable manifold for which all the transition maps are smooth functions
* Smooth algebrai ...
multivariable function. The following picture shows that coordinate descent iteration may get stuck at a non-
stationary point
In mathematics, particularly in calculus, a stationary point of a differentiable function of one variable is a point on the graph of the function where the function's derivative is zero. Informally, it is a point where the function "stops" in ...
if the level curves of a function are not smooth. Suppose that the algorithm is at the point ; then there are two axis-aligned directions it can consider for taking a step, indicated by the red arrows. However, every step along these two directions will increase the objective function's value (assuming a minimization problem), so the algorithm will not take any step, even though both steps together would bring the algorithm closer to the optimum. While this example shows that coordinate descent is not necessarily convergent to the optimum, it is possible to show formal convergence under reasonable conditions.
![Nonsmooth coordinate descent](https://upload.wikimedia.org/wikipedia/commons/6/6a/Nonsmooth_coordinate_descent.svg)
The other problem is difficulty in parallelism. Since the nature of coordinate descent is to cycle through the directions and minimize the objective function with respect to each coordinate direction, coordinate descent is not an obvious candidate for massive parallelism. Recent research works have shown that massive parallelism is applicable to coordinate descent by relaxing the change of the objective function with respect to each coordinate direction.
Applications
Coordinate descent algorithms are popular with practitioners owing to their simplicity, but the same property has led optimization researchers to largely ignore them in favor of more interesting (complicated) methods. An early application of coordinate descent optimization was in the area of computed tomography where it has been found to have rapid convergence and was subsequently used for clinical multi-slice helical scan CT reconstruction. A cyclic coordinate descent algorithm (CCD) has been applied in protein structure prediction. Moreover, there has been increased interest in the use of coordinate descent with the advent of large-scale problems in
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
, where coordinate descent has been shown competitive to other methods when applied to such problems as training linear
support vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
s (see
LIBLINEAR
LIBSVM and LIBLINEAR are two popular open source machine learning libraries, both developed at the National Taiwan University and both written in C++ though with a C API. LIBSVM implements the Sequential minimal optimization (SMO) algorithm for ...
) and
non-negative matrix factorization
Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...
. They are attractive for problems where computing gradients is infeasible, perhaps because the data required to do so are distributed across computer networks.
See also
*
Adaptive coordinate descent Adaptive coordinate descent is an improvement of the coordinate descent algorithm to non-separable optimization by the use of adaptive encoding. The adaptive coordinate descent approach gradually builds a transformation of the coordinate system suc ...
*
Conjugate gradient
In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-definite. The conjugate gradient method is often implemented as an iterat ...
*
Gradient descent
In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the ...
*
Line search
In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum \mathbf^* of an objective function f:\mathbb R^n\to\mathbb R. The other approach is trust region.
The line search approach first finds a ...
*
Mathematical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
*
Newton's method
In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valu ...
*
Stochastic gradient descent – uses one example at a time, rather than one coordinate
References
*
* Bertsekas, Dimitri P. (1999). ''Nonlinear Programming, Second Edition'' Athena Scientific, Belmont, Massachusetts. .
*.
*.
*.
*.
{{Optimization algorithms
Gradient methods