HOME

TheInfoList



OR:

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical
Stochastic approximation Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving l ...
setting. Variance reduction approaches are widely used for training machine learning models such as
logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...
and
support vector machines In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
as these problems have finite-sum structure and uniform
conditioning Conditioning may refer to: Science, computing, and technology * Air conditioning, the removal of heat from indoor air for thermal comfort ** Automobile air conditioning, air conditioning in a vehicle ** Ice storage air conditioning, air condition ...
that make them ideal candidates for variance reduction.


Finite sum objectives

A function f is considered to have finite sum structure if it can be decomposed into a summation or average: :f(x) = \frac\sum_^n f_i(x), where the function value and derivative of each f_i can be queried independently. Although variance reduction methods can be applied for any positive n and any f_i structure, their favorable theoretical and practical properties arise when n is large compared to the
condition number In numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the input ...
of each f_i, and when the f_i have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants. The finite sum structure should be contrasted with the stochastic approximation setting which deals with functions of the form f(\theta) = \operatorname E_ (\theta,\xi) which is the expected value of a function depending on a
random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...
\xi . Any finite sum problem can be optimized using a stochastic approximation algorithm by using F(\cdot,\xi)=f_\xi.


Rapid Convergence

Stochastic variance reduced methods without acceleration are able to find a minima of f within accuracy \epsilon>, i.e. f(x)-f(x_*)\leq \epsilon in a number of steps of the order: : O \left( \left( \frac + n \right)\log \left( \frac \right)\right). The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps O\bigl( L/(\mu \epsilon)\bigr) required grows proportionally to the accuracy required. Stochastic variance reduction methods converge almost as fast as the gradient descent method's O\bigl( (L/\mu)\log(1/\epsilon) \bigr) rate, despite using only a stochastic gradient, at a 1/n lower cost than gradient descent. Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only : O \left( \left( \sqrt + n \right)\log \left( \frac \right)\right) steps to reach \epsilon accuracy, potentially \sqrt faster than non-accelerated methods. Lower complexity bounds. for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems.


Approaches

Variance reduction approaches fall within 3 main categories: table averaging methods, full-gradient snapshot methods and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details.


SAGA

In the SAGA method, the prototypical table averaging approach, a table of size n is maintained that contains the last gradient witnessed for each f_i term, which we denote g_i. At each step, an index i is sampled, and a new gradient \nabla f_i(x_k) is computed. The iterate x_k is updated with: : x_ = x_k - \gamma \left \nabla f_i(x_k) - g_i + \frac\sum_^n g_i \right and afterwards table entry i is updated with g_i=\nabla f_i(x_k). SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method, improving on its flexibility and performance.


SVRG

The stochastic variance reduced gradient method (SVRG), the prototypical snapshot method, uses a similar update except instead of using the average of a table it instead uses a full-gradient that is reevaluated at a snapshot point \tilde at regular intervals of m\geq n iterations. The update becomes: : x_ = x_k - \gamma nabla f_i(x_k) - \nabla f_i(\tilde) + \nabla f(\tilde) This approach requires two stochastic gradient evaluations per step, one to compute \nabla f_i(x_k) and one to compute \nabla f_i(\tilde)], where-as table averaging approaches need only one. Despite the high computational cost, SVRG is popular as its simple convergence theory is highly adaptable to new optimization settings. It also has lower storage requirements than tabular averaging approaches, which make it applicable in many settings where tabular methods can not be used.


SDCA

Exploiting the Duality (optimization), dual representation of the objective leads to another variance reduction approach that is particularly suited to finite-sums where each term has a structure that makes computing the
convex conjugate In mathematics and mathematical optimization, the convex conjugate of a function is a generalization of the Legendre transformation which applies to non-convex functions. It is also known as Legendre–Fenchel transformation, Fenchel transformation ...
f_^, or its
proximal operator In mathematical optimization, the proximal operator is an operator associated with a proper,An (extended) real-valued function ''f'' on a Hilbert space is said to be ''proper'' if it is not identically equal to +\infty, and -\infty is not in its ...
tractable. The standard SDCA method considers finite sums that have additional structure compared to generic finite sum setting: :f(x) = \frac\sum_^n f_i(x^Tv_i) + \frac\, x\, ^2, where each f_i is 1 dimensional and each v_i is a data point associated with f_i. SDCA solves the dual problem: : \max_ -\frac\sum_^n f_^(-\alpha_i)-\frac \left \, \frac\sum_^n \alpha_i v_i \right \, ^2, by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate \alpha_i, leaving all other coordinates the same. An approximate primal solution x can be recovered from the \alpha values: : x = \frac\sum_^n \alpha_i v_i. This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when \lambda is large, but significantly slower than the other approaches when \lambda is small.


Accelerated approaches

Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators to accelerate convergence, either approximately or exactly. Direct acceleration approaches have also been developed


Catalyst acceleration

The catalyst framework uses any of the standard methods above as an inner optimizer to approximately solve a
proximal operator In mathematical optimization, the proximal operator is an operator associated with a proper,An (extended) real-valued function ''f'' on a Hilbert space is said to be ''proper'' if it is not identically equal to +\infty, and -\infty is not in its ...
: : x_k \approx \text_x \left \ after which it uses an extrapolation step to determine the next y: : y_k = x_k +\beta_k (x_k - x_) The catalyst method's flexibility and simplicity make it a popular baseline approach. It doesn't achieve the optimal rate of convergence among accelerated methods, it is potentially slower by up to a log factor in the hyper-parameters.


Point-SAGA

Proximal operations may also be applied directly to the f_i terms to yield an accelerated method. The Point-SAGA method replaces the gradient operations in SAGA with proximal operator evaluations, result in a simple, direct acceleration method: : x_ = \text^\gamma_j\left(z_k \triangleq x_k +\gamma \left g_ - \frac \sum_^n g_i \right\right), with the table update g_j = \frac(z_k - x_) performed after each step. Here \text^\gamma_j is defined as the proximal operator for the jth term: : \text^\gamma_j(y) = \text_x \left \. Unlike other known accelerated methods, Point-SAGA requires only a single iterate sequence x to be maintained between steps, and it has the advantage of only having a single tunable parameter \gamma. It obtains the optimal accelerated rate of convergence for strongly convex finite-sum minimization without additional log factors.


See also

*
Stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of ...
*
Coordinate descent Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a coordinate or coordinate block via a coordinate selection rule, ...
*
Online machine learning In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques whic ...
*
Proximal operator In mathematical optimization, the proximal operator is an operator associated with a proper,An (extended) real-valued function ''f'' on a Hilbert space is said to be ''proper'' if it is not identically equal to +\infty, and -\infty is not in its ...
*
Stochastic optimization Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective funct ...
*
Stochastic approximation Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving l ...


References

{{Reflist Stochastic optimization Gradient methods Machine learning algorithms Convex optimization