Proximal gradient (forward backward splitting) methods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is \ell_1 regularization (also known as Lasso) of the form

:\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{where } x_i\in \mathbb{R}^d \text{ and } y_i\in\mathbb{R}.

Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application. Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of lasso) or ''group structure'' (in the case of group lasso).


Relevant background

Proximal gradient methods are applicable in a wide variety of scenarios for solving convex optimization problems of the form

:\min_{x\in\mathcal{H}} F(x)+R(x),

where F is convex and differentiable with Lipschitz continuous gradient, R is a convex, lower semicontinuous function which is possibly nondifferentiable, and \mathcal{H} is some set, typically a Hilbert space. The usual criterion that x minimizes F(x)+R(x) if and only if \nabla (F+R)(x) = 0 in the convex, differentiable setting is now replaced by

:0\in \partial (F+R)(x),

where \partial \varphi denotes the subdifferential of a real-valued, convex function \varphi.

Given a convex function \varphi:\mathcal{H} \to \mathbb{R}, an important operator to consider is its proximal operator \operatorname{prox}_{\varphi}:\mathcal{H}\to\mathcal{H} defined by

:\operatorname{prox}_{\varphi}(u) = \operatorname{arg\,min}_{x\in\mathcal{H}} \varphi(x)+\frac{1}{2}\|u-x\|_2^2,

which is well-defined because of the strict convexity of the squared \ell_2 norm. The proximal operator can be seen as a generalization of a projection. The proximity operator is important because x^* is a minimizer of the problem \min_{x\in\mathcal{H}} F(x)+R(x) if and only if

:x^* = \operatorname{prox}_{\gamma R}\left(x^*-\gamma\nabla F(x^*)\right),

where \gamma>0 is any positive real number.
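To make the proximity operator concrete, the following is a minimal numerical sketch in Python with NumPy (the function names and test values are illustrative assumptions, not from the source). It uses the simple choice \varphi(x) = \frac{\lambda}{2}\|x\|_2^2, whose proximity operator has the closed form u/(1+\lambda), and it illustrates the remark that the proximity operator of the indicator function of a convex set is just the projection onto that set.

 import numpy as np

 # Illustrative sketch: prox of phi(x) = (lam/2) * ||x||_2^2.
 # The optimality condition lam*x + (x - u) = 0 gives x = u / (1 + lam).
 def prox_sq_l2(u, lam):
     return u / (1.0 + lam)

 # The prox of the indicator function of a convex set C is the projection onto C;
 # for C = the nonnegative orthant this is a componentwise clip at zero.
 def proj_nonneg(u):
     return np.maximum(u, 0.0)

 u = np.array([1.5, -0.3, 0.7])
 lam = 2.0
 x = prox_sq_l2(u, lam)
 # verify the optimality condition 0 = lam*x + (x - u) at the prox point
 assert np.allclose(lam * x + (x - u), 0.0)
 print(x, proj_nonneg(u))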


Moreau decomposition

One important technique related to proximal gradient methods is the Moreau decomposition, which decomposes the identity operator as the sum of two proximity operators. Namely, let \varphi:\mathcal{X}\to\mathbb{R} be a lower semicontinuous, convex function on a vector space \mathcal{X}. We define its Fenchel conjugate \varphi^*:\mathcal{X}\to\mathbb{R} to be the function

:\varphi^*(u) := \sup_{x\in\mathcal{X}} \langle x,u\rangle - \varphi(x).

The general form of Moreau's decomposition states that for any x\in\mathcal{X} and any \gamma>0,

:x = \operatorname{prox}_{\gamma\varphi}(x) + \gamma\operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),

which for \gamma=1 implies that x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x). The Moreau decomposition can be seen to be a generalization of the usual orthogonal decomposition of a vector space, analogous with the fact that proximity operators are generalizations of projections. In certain situations it may be easier to compute the proximity operator for the conjugate \varphi^* instead of the function \varphi, and therefore the Moreau decomposition can be applied. This is the case for group lasso.
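As a sanity check, the decomposition can be verified numerically for \varphi = \|\cdot\|_1, whose Fenchel conjugate is the indicator function of the \ell_\infty unit ball: \operatorname{prox}_{\gamma\varphi} is the soft thresholding operator at level \gamma (derived in the lasso section below), and \operatorname{prox}_{\varphi^*/\gamma} is the projection onto the \ell_\infty ball, a componentwise clip to [-1,1]. The sketch below is in Python with NumPy; the helper names and random test data are illustrative assumptions.

 import numpy as np

 def soft_threshold(x, gamma):
     # prox of gamma * ||.||_1
     return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

 def proj_linf_ball(x):
     # prox of the conjugate of ||.||_1, i.e. projection onto the l_inf unit ball
     return np.clip(x, -1.0, 1.0)

 rng = np.random.default_rng(0)
 x = rng.normal(size=5)
 gamma = 0.7
 # Moreau: x = prox_{gamma*phi}(x) + gamma * prox_{phi*/gamma}(x / gamma)
 assert np.allclose(x, soft_threshold(x, gamma) + gamma * proj_linf_ball(x / gamma))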


Lasso regularization

Consider the regularized empirical risk minimization problem with square loss and with the \ell_1 norm as the regularization penalty:

:\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1,

where x_i\in \mathbb{R}^d and y_i\in\mathbb{R}. The \ell_1 regularization problem is sometimes referred to as ''lasso'' (least absolute shrinkage and selection operator). Such \ell_1 regularization problems are interesting because they induce ''sparse'' solutions, that is, solutions w to the minimization problem have relatively few nonzero components. Lasso can be seen to be a convex relaxation of the non-convex problem

:\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0,

where \|w\|_0 denotes the \ell_0 "norm", which is the number of nonzero entries of the vector w. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.
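For concreteness, a short sketch of the two objectives in Python with NumPy (the data layout, with the rows of X playing the role of the x_i, and all names are illustrative assumptions): the convex lasso objective and its non-convex \ell_0 counterpart differ only in how the penalty measures the entries of w.

 import numpy as np

 def lasso_objective(w, X, y, lam):
     # (1/n) * sum_i (y_i - <w, x_i>)^2 + lam * ||w||_1
     n = X.shape[0]
     r = y - X @ w
     return r @ r / n + lam * np.sum(np.abs(w))

 def l0_objective(w, X, y, lam):
     # same empirical risk, but the penalty counts the nonzero entries of w
     n = X.shape[0]
     r = y - X @ w
     return r @ r / n + lam * np.count_nonzero(w)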


Solving for L1 proximity operator

For simplicity we restrict our attention to the problem where \lambda=1. To solve the problem

:\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \|w\|_1,

we consider our objective function in two parts: a convex, differentiable term F(w) = \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2 and a convex function R(w) = \|w\|_1. Note that R is not strictly convex. Let us compute the proximity operator for R(w). First we find an alternative characterization of the proximity operator \operatorname{prox}_R(x) as follows:

:\begin{align} u = \operatorname{prox}_R(x) \iff & 0\in \partial \left(R(u)+\frac{1}{2}\|u-x\|_2^2\right)\\ \iff & 0\in \partial R(u) + u-x\\ \iff & x-u\in \partial R(u). \end{align}

For R(w) = \|w\|_1 it is easy to compute \partial R(w): the ith entry of \partial R(w) is precisely

:\partial |w_i| = \begin{cases} 1, & w_i>0\\ -1, & w_i<0\\ \left[-1,1\right], & w_i = 0. \end{cases}

Using the recharacterization of the proximity operator given above, for the choice of R(w) = \|w\|_1 and \gamma>0 we have that \operatorname{prox}_{\gamma R}(x) is defined entrywise by

:\left(\operatorname{prox}_{\gamma R}(x)\right)_i = \begin{cases} x_i-\gamma, & x_i>\gamma\\ 0, & |x_i|\leq\gamma\\ x_i+\gamma, & x_i<-\gamma, \end{cases}

which is known as the soft thresholding operator S_{\gamma}(x)=\operatorname{prox}_{\gamma \|\cdot\|_1}(x).
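The entrywise formula translates directly into code; a minimal sketch in Python with NumPy (the function name and test values are illustrative):

 import numpy as np

 def soft_threshold(x, gamma):
     # entrywise: x_i - gamma if x_i > gamma, 0 if |x_i| <= gamma, x_i + gamma if x_i < -gamma
     return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

 print(soft_threshold(np.array([2.0, 0.3, -1.5]), 1.0))  # [ 1.   0.  -0.5]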


Fixed point iterative schemes

To finally solve the lasso problem we consider the fixed point equation shown earlier:

:x^* = \operatorname{prox}_{\gamma R}\left(x^*-\gamma\nabla F(x^*)\right).

Given that we have computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure. Namely, fix some initial w^0\in\mathbb{R}^d, and for k=0,1,2,\ldots define

:w^{k+1} = S_{\gamma}\left(w^k - \gamma \nabla F\left(w^k\right)\right).

Note here the effective trade-off between the empirical error term F(w) and the regularization penalty R(w). This fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step (w^k - \gamma \nabla F\left(w^k\right)) and a soft thresholding step (via S_\gamma). Convergence of this fixed point scheme is well-studied in the literature and is guaranteed under appropriate choice of step size \gamma and loss function (such as the square loss taken here). Accelerated methods were introduced by Nesterov in 1983 which improve the rate of convergence under certain regularity assumptions on F. Such methods have been studied extensively in the years since. For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term R, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.
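Putting the pieces together gives the iterative soft-thresholding scheme sketched below in Python with NumPy. The synthetic data, the step size choice \gamma = 1/L with L the Lipschitz constant of \nabla F, and all names are illustrative assumptions rather than a reference implementation; the regularization parameter \lambda is kept general here, so the threshold level in the prox step is \gamma\lambda.

 import numpy as np

 def soft_threshold(x, gamma):
     return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

 def ista_lasso(X, y, lam, n_iter=500):
     # F(w) = (1/n) * ||y - X w||^2 has gradient (2/n) * X^T (X w - y),
     # which is Lipschitz with constant L = (2/n) * sigma_max(X)^2.
     n, d = X.shape
     L = 2.0 / n * np.linalg.norm(X, 2) ** 2
     gamma = 1.0 / L
     w = np.zeros(d)
     for _ in range(n_iter):
         grad = 2.0 / n * X.T @ (X @ w - y)
         w = soft_threshold(w - gamma * grad, gamma * lam)  # prox of gamma*lam*||.||_1
     return w

 rng = np.random.default_rng(0)
 X = rng.normal(size=(50, 20))
 w_true = np.zeros(20)
 w_true[:3] = [1.0, -2.0, 0.5]
 y = X @ w_true + 0.01 * rng.normal(size=50)
 print(ista_lasso(X, y, lam=0.1))  # sparse estimate: many entries are exactly zero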


Practical considerations

There have been numerous developments within the past decade in convex optimization techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.


Adaptive step size

In the fixed point iteration scheme

:w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),

one can allow a variable step size \gamma_k instead of a constant \gamma. Numerous adaptive step size schemes have been proposed throughout the literature. Applications of these schemes suggest that they can offer substantial improvement in the number of iterations required for fixed point convergence.
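One widely used rule of this kind is a backtracking line search, which shrinks a trial step size until a quadratic upper bound on F holds at the candidate point. The sketch below (Python with NumPy, specialized to the lasso penalty; the function names and the choice of this particular rule are illustrative assumptions, not necessarily one of the specific schemes referred to above) performs a single proximal gradient step with backtracking.

 import numpy as np

 def soft_threshold(x, gamma):
     return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

 def backtracking_prox_step(w, F, grad_F, lam, gamma0=1.0, beta=0.5):
     # One proximal gradient step with a backtracked step size: shrink gamma until
     # F(w_new) <= F(w) + <grad, w_new - w> + ||w_new - w||^2 / (2*gamma).
     g = grad_F(w)
     gamma = gamma0
     while True:
         w_new = soft_threshold(w - gamma * g, gamma * lam)
         d = w_new - w
         if F(w_new) <= F(w) + g @ d + d @ d / (2.0 * gamma):
             return w_new, gamma
         gamma *= beta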


Elastic net (mixed norm regularization)

Elastic net regularization offers an alternative to pure \ell_1 regularization. The problem of lasso (\ell_1) regularization involves the penalty term R(w) = \|w\|_1, which is not strictly convex. Hence, solutions to \min_w F(w) + R(w), where F is some empirical loss function, need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as an \ell_2 norm regularization penalty. For example, one can consider the problem

:\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right),

where x_i\in \mathbb{R}^d and y_i\in\mathbb{R}. For 0<\mu\leq 1 the penalty term \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right) is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small \mu > 0, the additional penalty term \mu \|w\|_2^2 acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.
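A convenient consequence for proximal gradient methods is that the combined penalty still has a simple proximity operator. The sketch below (Python with NumPy; the function name is illustrative, and the closed form is obtained from the same entrywise optimality condition as in the lasso case, which the source does not spell out for this penalty) applies a soft threshold at level \gamma\lambda(1-\mu) followed by a shrinkage by the factor 1 + 2\gamma\lambda\mu.

 import numpy as np

 def prox_elastic_net(x, gamma, lam, mu):
     # prox of gamma * lam * ((1 - mu) * ||w||_1 + mu * ||w||_2^2):
     # soft-threshold at gamma*lam*(1 - mu), then shrink by 1 + 2*gamma*lam*mu
     st = np.sign(x) * np.maximum(np.abs(x) - gamma * lam * (1.0 - mu), 0.0)
     return st / (1.0 + 2.0 * gamma * lam * mu)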


Exploiting group structure

Proximal gradient methods provide a general framework which is applicable to a wide variety of problems in statistical learning theory. Certain problems in learning can often involve data which has additional structure that is known ''a priori''. In the past several years there have been new developments which incorporate information about group structure to provide methods which are tailored to different applications. Here we survey a few such methods.


Group lasso

Group lasso is a generalization of the lasso method when features are grouped into disjoint blocks. Suppose the features are grouped into blocks \{w_1,\ldots,w_G\}. Here we take as a regularization penalty

:R(w) =\sum_{g=1}^G \|w_g\|_2,

which is the sum of the \ell_2 norm on corresponding feature vectors for the different groups. A similar proximity operator analysis as above can be used to compute the proximity operator for this penalty. Where the lasso penalty has a proximity operator which is soft thresholding on each individual component, the proximity operator for the group lasso is soft thresholding on each group. For the group w_g, the proximity operator of \lambda\gamma\left(\sum_{g=1}^G \|w_g\|_2\right) is given by

:\widetilde{S}_{\lambda\gamma}(w_g) = \begin{cases} w_g-\lambda\gamma \dfrac{w_g}{\|w_g\|_2}, & \|w_g\|_2>\lambda\gamma \\ 0, & \|w_g\|_2\leq \lambda\gamma, \end{cases}

where w_g is the gth group. In contrast to lasso, the derivation of the proximity operator for group lasso relies on the Moreau decomposition. Here the proximity operator of the conjugate of the group lasso penalty becomes a projection onto the ball of a dual norm.
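The group-wise soft thresholding above can be sketched as follows in Python with NumPy (the index-list representation of the groups and all names are illustrative assumptions; the threshold argument plays the role of \lambda\gamma):

 import numpy as np

 def prox_group_lasso(w, groups, threshold):
     # Block soft thresholding: each group w_g is scaled by 1 - threshold/||w_g||_2
     # when ||w_g||_2 > threshold, and set exactly to zero otherwise.
     out = np.zeros_like(w)
     for g in groups:                      # each g is a list of feature indices
         wg = w[g]
         norm = np.linalg.norm(wg)
         if norm > threshold:
             out[g] = (1.0 - threshold / norm) * wg
     return out

 w = np.array([3.0, 4.0, 0.1, -0.1])
 groups = [[0, 1], [2, 3]]                 # disjoint blocks of features
 print(prox_group_lasso(w, groups, threshold=1.0))
 # the first group shrinks (its norm 5 > 1); the second group is set exactly to zero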


Other group structures

In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts. For overlapping groups one common approach is known as ''latent group lasso'', which introduces latent variables to account for overlap. Nested group structures are studied in ''hierarchical structure prediction'' and with directed acyclic graphs.


See also

* Convex analysis
* Proximal gradient method
* Regularization
* Statistical learning theory


References

{{reflist