Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. While the basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s, stochastic gradient descent has become an important optimization method in machine learning.


Background

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

: Q(w) = \frac{1}{n} \sum_{i=1}^n Q_i(w),

where the parameter w that minimizes Q(w) is to be estimated. Each summand function Q_i is typically associated with the i-th observation in the data set (used for training).

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. In this case, Q_i(w) is the value of the loss function at the i-th example, and Q(w) is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

: w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^n \nabla Q_i(w),

where \eta is a step size (sometimes called the ''learning rate'' in machine learning).

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.
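For concreteness, here is a minimal sketch of the batch update above; grad_Qi(w, i), standing in for \nabla Q_i(w), is an assumed caller-supplied function, and all names are illustrative:

```python
import numpy as np

def batch_gradient_descent(grad_Qi, w0, n, eta=0.1, steps=100):
    """Full-batch update w := w - (eta/n) * sum_i grad Q_i(w).

    grad_Qi(w, i) returns the gradient of the i-th summand at w;
    every summand is evaluated at every step, which is exactly the
    cost that stochastic gradient descent avoids.
    """
    w = w0.copy()
    for _ in range(steps):
        total = sum(grad_Qi(w, i) for i in range(n))
        w = w - (eta / n) * total
    return w
```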


Iterative method

In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single sample: : w := w - \eta \nabla Q_i(w). As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges. In pseudocode, stochastic gradient descent can be presented as :
* Choose an initial vector of parameters w and learning rate \eta. * Repeat until an approximate minimum is obtained: ** Randomly shuffle samples in the training set. ** For i=1, 2, ..., n, do: *** w := w - \eta \nabla Q_i(w).
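A direct Python transcription of this pseudocode might look as follows; the per-sample gradient grad_Qi(w, i) is an assumed caller-supplied function, not part of any particular library:

```python
import numpy as np

def sgd(grad_Qi, w0, n, eta=0.01, epochs=10, rng=None):
    """Plain SGD following the pseudocode above: shuffle, then step per sample.

    grad_Qi(w, i) returns the gradient of the i-th summand Q_i at w.
    """
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for _ in range(epochs):              # repeat until an approximate minimum
        for i in rng.permutation(n):     # randomly shuffle samples each pass
            w = w - eta * grad_Qi(w, i)  # per-sample update
    return w
```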
A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent as described, because the code can make use of vectorization libraries rather than computing each step separately; this was first shown in the work where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.
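A minimal sketch of the mini-batch variant, assuming a vectorized caller-supplied gradient grad_Q_batch(w, idx) that averages \nabla Q_i over the index set idx (all names are illustrative):

```python
import numpy as np

def minibatch_sgd(grad_Q_batch, w0, n, batch_size=32, eta=0.01, epochs=10, rng=None):
    """Mini-batch SGD: one averaged-gradient step per small batch of samples.

    grad_Q_batch(w, idx) returns the gradient of (1/|idx|) * sum_{i in idx} Q_i(w),
    which vectorized libraries can evaluate in a single call.
    """
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            w = w - eta * grad_Q_batch(w, idx)
    return w
```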
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates \eta decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem.


Example

Let's suppose we want to fit a straight line \hat{y} = w_1 + w_2 x to a training set with observations (x_1, x_2, \ldots, x_n) and corresponding estimated responses (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n) using least squares. The objective function to be minimized is:

: Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(\hat{y}_i - y_i\right)^2 = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.

The last line in the above pseudocode for this specific problem becomes:

: \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1} (w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2} (w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.

Note that in each iteration (also called an update), the gradient is only evaluated at a single point x_i instead of at the set of all samples. The key difference compared to standard (batch) gradient descent is that only one piece of data from the dataset is used to calculate the step, and the piece of data is picked randomly at each step.
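As an illustration, a minimal runnable sketch of this update rule on synthetic data generated around the line y = 1 + 2x (the data and hyperparameters are illustrative choices):

```python
import numpy as np

# Toy data roughly on the line y = 1 + 2x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, 100)

w1, w2, eta = 0.0, 0.0, 0.05
for _ in range(50):                      # passes over the training set
    for i in rng.permutation(len(x)):    # shuffle each pass
        r = w1 + w2 * x[i] - y[i]        # residual at the single sample i
        w1 -= eta * 2 * r                # gradient w.r.t. w1
        w2 -= eta * 2 * r * x[i]         # gradient w.r.t. w2
print(w1, w2)                            # should approach 1 and 2
```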


Notable applications

Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models. When combined with the backpropagation algorithm, it is the ''de facto'' standard algorithm for training artificial neural networks. Its use has also been reported in the geophysics community, specifically for applications of Full Waveform Inversion (FWI). Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used. Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE. Another stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.


Extensions and variants

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function \eta_t of the iteration number t, giving a ''learning rate schedule'', so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on k-means clustering. Practical guidance on choosing the step size in several variants of SGD is given by Spall.
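As an illustration, one conventional 1/t-style schedule is sketched below; the specific functional form and the name decayed_eta are assumptions for illustration, not the schedule from MacQueen's work:

```python
def decayed_eta(eta0, t, decay=0.01):
    """Learning rate that shrinks with the iteration number t:
    large steps early, fine-tuning later."""
    return eta0 / (1 + decay * t)

# Usage inside an SGD loop:
#   w = w - decayed_eta(0.5, t) * grad_Qi(w, i)
```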


Implicit updates (ISGD)

As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate \eta. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved by considering ''implicit updates'', whereby the stochastic gradient is evaluated at the next iterate rather than the current one:

: w^{new} := w^{old} - \eta \nabla Q_i(w^{new}).

This equation is implicit since w^{new} appears on both sides. It is a stochastic form of the proximal gradient method, since the update can also be written as:

: w^{new} := \arg\min_w \left\{ Q_i(w) + \frac{1}{2\eta} \|w - w^{old}\|^2 \right\}.

As an example, consider least squares with features x_1, \ldots, x_n \in \mathbb{R}^p and observations y_1, \ldots, y_n \in \mathbb{R}. We wish to solve:

: \min_w \sum_{j=1}^n (y_j - x_j'w)^2,

where x_j'w = x_{j1} w_1 + x_{j2} w_2 + \cdots + x_{jp} w_p indicates the inner product. Note that x could have "1" as the first element in order to include an intercept. Classical stochastic gradient descent proceeds as follows:

: w^{new} = w^{old} + \eta (y_i - x_i'w^{old}) x_i,

where i is uniformly sampled between 1 and n. Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when \eta is misspecified so that I - \eta x_i x_i' has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, ''implicit stochastic gradient descent'' (shortened as ISGD) can be solved in closed form as:

: w^{new} = w^{old} + \frac{\eta}{1 + \eta \|x_i\|^2} (y_i - x_i'w^{old}) x_i.

This procedure remains numerically stable for virtually all \eta, as the learning rate is now normalized. This comparison between classical and implicit stochastic gradient descent in the least squares problem is very similar to the comparison between the least mean squares (LMS) and normalized least mean squares (NLMS) filters.

Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that Q_i(w) depends on w only through a linear combination with features x_i, so that we can write \nabla_w Q_i(w) = -q(x_i'w) x_i, where q(\cdot) \in \mathbb{R} may depend on x_i, y_i as well, but not on w except through x_i'w. Least squares obeys this rule, and so do logistic regression and most generalized linear models. For instance, in least squares, q(x_i'w) = y_i - x_i'w, and in logistic regression q(x_i'w) = y_i - S(x_i'w), where S(u) = e^u/(1+e^u) is the logistic function. In Poisson regression, q(x_i'w) = y_i - e^{x_i'w}, and so on.

In such settings, ISGD is simply implemented as follows. Let f(\xi) = \eta\, q(x_i'w^{old} + \xi \|x_i\|^2), where \xi is a scalar. Then, ISGD is equivalent to:

: w^{new} = w^{old} + \xi^\ast x_i, \quad \text{where} \quad \xi^\ast = f(\xi^\ast).

The scaling factor \xi^\ast \in \mathbb{R} can be found through the bisection method, since in most regular models, such as the aforementioned generalized linear models, the function q(\cdot) is decreasing, and thus the search bounds for \xi^\ast are [\min(0, f(0)), \max(0, f(0))].
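A minimal sketch of one ISGD step under these assumptions; the residual function q is caller-supplied, and the name isgd_step and the bisection tolerances are illustrative:

```python
def isgd_step(w, x_i, eta, q, tol=1e-10, max_iter=100):
    """One implicit SGD step: solve xi* = f(xi*) by bisection, then w += xi* x_i.

    q(u) is the model-specific residual, e.g. for least squares
    q = lambda u: y_i - u, assuming q is decreasing in u.
    """
    norm_sq = x_i @ x_i
    u0 = x_i @ w
    f = lambda xi: eta * q(u0 + xi * norm_sq)
    # g(xi) = f(xi) - xi is strictly decreasing, with a sign change on
    # the interval [min(0, f(0)), max(0, f(0))].
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) - mid > 0:   # root lies to the right
            lo = mid
        else:                  # root lies to the left
            hi = mid
        if hi - lo < tol:
            break
    xi_star = 0.5 * (lo + hi)
    return w + xi_star * x_i
```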


Momentum

Further proposals include the ''momentum method'', which appeared in Rumelhart, Hinton and Williams' paper on backpropagation learning. Stochastic gradient descent with momentum remembers the update \Delta w at each iteration, and determines the next update as a linear combination of the gradient and the previous update:

: \Delta w := \alpha \Delta w - \eta \nabla Q_i(w)
: w := w + \Delta w

which leads to:

: w := w - \eta \nabla Q_i(w) + \alpha \Delta w,

where the parameter w which minimizes Q(w) is to be estimated, \eta is a step size (sometimes called the ''learning rate'' in machine learning) and \alpha is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

The name momentum stems from an analogy to momentum in physics: the weight vector w, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades. The momentum method is closely related to underdamped Langevin dynamics, and may be combined with simulated annealing.
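A minimal sketch of the momentum update, assuming a caller-supplied per-sample gradient grad_Qi(w, i); all names are illustrative:

```python
import numpy as np

def sgd_momentum(grad_Qi, w0, n, eta=0.01, alpha=0.9, epochs=10, rng=None):
    """SGD with momentum: dw := alpha*dw - eta*grad; w := w + dw."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    dw = np.zeros_like(w)
    for _ in range(epochs):
        for i in rng.permutation(n):
            dw = alpha * dw - eta * grad_Qi(w, i)
            w = w + dw
    return w
```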


Averaging

''Averaged stochastic gradient descent'', invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of

: \bar{w} = \frac{1}{t} \sum_{i=0}^{t-1} w_i.

When optimization is done, this averaged parameter vector takes the place of w.
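A minimal sketch of this averaging, assuming a per-sample gradient grad_Qi(w, i); the running mean of the iterates is maintained incrementally so no history needs to be stored:

```python
import numpy as np

def averaged_sgd(grad_Qi, w0, n, eta=0.01, epochs=10, rng=None):
    """Polyak–Ruppert averaging: run plain SGD, report the mean of all iterates."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    w_bar = w0.copy()        # running mean, starting from w_0
    t = 1                    # number of iterates averaged so far
    for _ in range(epochs):
        for i in rng.permutation(n):
            w = w - eta * grad_Qi(w, i)
            w_bar += (w - w_bar) / (t + 1)   # incremental mean update
            t += 1
    return w_bar
```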


AdaGrad

''AdaGrad'' (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate, first published in 2011. Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate \eta, but this is multiplied with the elements of a vector \{G_{j,j}\} which is the diagonal of the outer product matrix

: G = \sum_{\tau=1}^t g_\tau g_\tau^\mathsf{T},

where g_\tau = \nabla Q_i(w) is the gradient at iteration \tau. The diagonal is given by

: G_{j,j} = \sum_{\tau=1}^t g_{\tau,j}^2.

This vector is updated after every iteration. The formula for an update is now

: w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \odot g,

or, written as per-parameter updates,

: w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.

Each \{G_{j,j}\} gives rise to a scaling factor for the learning rate that applies to a single parameter w_j. Since the denominator in this factor, \sqrt{G_{j,j}} = \sqrt{\sum_{\tau=1}^t g_{\tau,j}^2}, is the ''ℓ''2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates. While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.
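A minimal sketch of AdaGrad's per-parameter update; the small constant eps is a common implementation safeguard against division by zero that does not appear in the formula above:

```python
import numpy as np

def adagrad(grad_Qi, w0, n, eta=0.1, eps=1e-8, epochs=10, rng=None):
    """AdaGrad: per-parameter step eta / sqrt(sum of squared past gradients)."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    G_diag = np.zeros_like(w)            # running sum of squared gradients
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = grad_Qi(w, i)
            G_diag += g * g
            w = w - eta * g / (np.sqrt(G_diag) + eps)  # eps guards early steps
    return w
```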


RMSProp

''RMSProp'' (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. So, first the running average is calculated in terms of mean square:

: v(w,t) := \gamma v(w,t-1) + (1-\gamma)(\nabla Q_i(w))^2,

where \gamma is the forgetting factor. The parameters are then updated as:

: w := w - \frac{\eta}{\sqrt{v(w,t)}} \nabla Q_i(w).

RMSProp has shown good adaptation of learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is capable of working with mini-batches as well, as opposed to only full batches.
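A minimal sketch along the lines of the formulas above; as with AdaGrad, the constant eps is a conventional numerical safeguard, not part of the stated update:

```python
import numpy as np

def rmsprop(grad_Qi, w0, n, eta=0.001, gamma=0.9, eps=1e-8, epochs=10, rng=None):
    """RMSProp: divide the step by a running RMS of recent gradients."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    v = np.zeros_like(w)                 # running average of squared gradients
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = grad_Qi(w, i)
            v = gamma * v + (1 - gamma) * g * g
            w = w - eta * g / (np.sqrt(v) + eps)
    return w
```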


Adam

''Adam'' (short for Adaptive Moment Estimation) is an update to the ''RMSProp'' optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters w^{(t)} and a loss function L^{(t)}, where t indexes the current training iteration (indexed at 0), Adam's parameter update is given by:

: m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}
: v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) (\nabla_w L^{(t)})^2
: \hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{t+1}}
: \hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{t+1}}
: w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}

where \epsilon is a small scalar (e.g. 10^{-8}) used to prevent division by 0, and \beta_1 (e.g. 0.9) and \beta_2 (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting is done element-wise. The profound influence of this algorithm inspired multiple newer, less well-known momentum-based optimization schemes using Nesterov-enhanced gradients (e.g. ''NAdam'' and ''FASFA'') and varying interpretations of second-order information (e.g. ''Powerpropagation'' and ''AdaSqrt''). However, the most commonly used variants are ''AdaMax'', which generalizes ''Adam'' using the infinity norm, and ''AMSGrad'', which addresses convergence problems with ''Adam''.
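A direct transcription of these update equations into Python; the stochastic gradient grad_L(w, t) is an assumed caller-supplied function, and the defaults follow the values quoted above:

```python
import numpy as np

def adam(grad_L, w0, steps=1000, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected running averages of the gradient and its square."""
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):
        g = grad_L(w, t)                       # stochastic gradient at step t
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** (t + 1))     # bias correction
        v_hat = v / (1 - beta2 ** (t + 1))
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
```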


Backtracking line search

Backtracking line search is another variant of gradient descent. It is based on a condition known as the Armijo–Goldstein condition. Both methods allow learning rates to change at each iteration; however, the manner of the change is different. Backtracking line search uses function evaluations to check Armijo's condition, and in principle the loop in the algorithm for determining the learning rates can be long and unknown in advance. Adaptive SGD does not need a loop in determining learning rates. On the other hand, adaptive SGD does not guarantee the "descent property" – which backtracking line search enjoys – which is that f(x_{n+1}) \leq f(x_n) for all n. If the gradient of the cost function is globally Lipschitz continuous, with Lipschitz constant L, and the learning rate is chosen of the order 1/L, then the standard version of SGD is a special case of backtracking line search.
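A minimal sketch of one backtracking gradient step under the Armijo condition; the constants c and tau are conventional illustrative choices, not prescribed by the source:

```python
import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, c=1e-4, tau=0.5, max_shrinks=50):
    """Shrink the step until Armijo holds: f(x - eta*g) <= f(x) - c*eta*||g||^2."""
    g = grad_f(x)
    fx = f(x)
    eta = eta0
    for _ in range(max_shrinks):
        if f(x - eta * g) <= fx - c * eta * (g @ g):
            break
        eta *= tau            # backtrack: shrink the step
    return x - eta * g
```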


Second-order methods

A stochastic analogue of the standard (deterministic) Newton–Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation. A method that uses direct measurements of the Hessian matrices of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer. However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others. (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.) Another approach to approximating the Hessian matrix is to replace it with the Fisher information matrix, which transforms the usual gradient into the natural gradient. These methods not requiring direct Hessian information are based on either values of the summands in the above empirical risk function or values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function.




See also

* Backtracking line search
* Coordinate descent – changes one coordinate at a time, rather than one example
* Linear classifier
* Online machine learning
* Stochastic hill climbing
* Stochastic variance reduction






External links


* Using stochastic gradient descent in C++, Boost, Ublas for linear regression
* Machine Learning Algorithms
* Interactive paper explaining momentum