Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning.


Background

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:
Q(w) = \frac{1}{n}\sum_{i=1}^n Q_i(w),
where the parameter w that minimizes Q(w) is to be estimated. Each summand function Q_i is typically associated with the i-th observation in the data set (used for training).

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. There, Q_i(w) is the value of the loss function at the i-th example, and Q(w) is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:
w := w - \eta\,\nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^n \nabla Q_i(w).
The step size is denoted by \eta (sometimes called the ''learning rate'' in machine learning) and here ":=" denotes the update of a variable in the algorithm.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function evaluations and gradient evaluations. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.


Iterative method

In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single sample: w := w - \eta\, \nabla Q_i(w). As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges. In pseudocode, stochastic gradient descent can be presented as :
* Choose an initial vector of parameters w and learning rate \eta. * Repeat until an approximate minimum is obtained: ** Randomly shuffle samples in the training set. ** For i=1, 2, ..., n, do: *** w := w - \eta\, \nabla Q_i(w).
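A minimal sketch of this loop in Python, assuming a user-supplied per-sample gradient function grad_Qi (a hypothetical name; its form depends on the model) and a fixed learning rate:

```python
import numpy as np

def sgd(grad_Qi, w0, X, y, eta=0.01, n_epochs=10, seed=0):
    """Plain SGD: one parameter update per training sample.

    grad_Qi(w, x_i, y_i) must return the gradient of the i-th
    summand Q_i at w; its exact form depends on the model.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    n = len(X)
    for _ in range(n_epochs):
        for i in rng.permutation(n):   # shuffle each pass to prevent cycles
            w -= eta * grad_Qi(w, X[i], y[i])
    return w
```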
A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent as described, because the code can make use of vectorization libraries rather than computing each step separately, as first shown in an early paper where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates \eta decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem.
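A sketch of the mini-batch variant, assuming a vectorized gradient function grad_Q (hypothetical) that averages the gradient over the rows of the batch:

```python
import numpy as np

def minibatch_sgd(grad_Q, w0, X, y, eta=0.01, batch_size=32,
                  n_epochs=10, seed=0):
    """Mini-batch SGD: each update averages the gradient over a small batch."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    n = len(X)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= eta * grad_Q(w, X[batch], y[batch])
    return w
```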


Linear regression

Suppose we want to fit a straight line \hat y = w_1 + w_2 x to a training set with observations ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)) and corresponding estimated responses (\hat y_1, \hat y_2, \ldots, \hat y_n) using least squares. The objective function to be minimized is
Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(\hat y_i - y_i\right)^2 = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.
The last line in the above pseudocode for this specific problem becomes:
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1} (w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2} (w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.
Note that in each iteration or update step, the gradient is only evaluated at a single x_i. This is the key difference between stochastic gradient descent and batched gradient descent.

In general, given a linear regression problem \hat y = \sum_{k=1}^m w_k x_k, stochastic gradient descent behaves differently when m < n (underparameterized) and m \geq n (overparameterized). In the overparameterized case, stochastic gradient descent converges to \arg\min_{w:\, \hat y_i = y_i \ \forall i} \|w - w_0\|. That is, SGD converges to the interpolation solution with minimum distance from the starting point w_0. This is true even when the learning rate remains constant. In the underparameterized case, SGD does not converge if the learning rate remains constant.
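As a concrete illustration, a minimal NumPy sketch of this update (the data and step size here are made up for the example; the factor of 2 from the derivative is kept explicit):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.5 + 3.0 * x + rng.normal(scale=0.1, size=100)  # noisy line

w1, w2, eta = 0.0, 0.0, 0.05
for _ in range(50):                     # passes over the data
    for i in rng.permutation(len(x)):   # one sample per update
        r = w1 + w2 * x[i] - y[i]       # residual at the i-th sample
        w1 -= eta * 2 * r
        w2 -= eta * 2 * r * x[i]

print(w1, w2)  # should approach (1.5, 3.0)
```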


History

In 1951, Herbert Robbins and Sutton Monro introduced the earliest stochastic approximation methods, preceding stochastic gradient descent. Building on this work one year later, Jack Kiefer and Jacob Wolfowitz published an optimization algorithm very close to stochastic gradient descent, using central differences as an approximation of the gradient. Later in the 1950s, Frank Rosenblatt used SGD to optimize his perceptron model, demonstrating the first applicability of stochastic gradient descent to neural networks.

Backpropagation was first described in 1986, with stochastic gradient descent being used to efficiently optimize parameters across neural networks with multiple hidden layers. Soon after, another improvement was developed: mini-batch gradient descent, where small batches of data are substituted for single samples. In 1997, the practical performance benefits from vectorization achievable with such small batches were first explored, paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent with those of full-batch gradient descent.

By the 1980s, momentum had already been introduced, and it was added to SGD optimization techniques in 1986. However, these optimization techniques assumed constant hyperparameters, i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive approaches to applying SGD with a per-parameter learning rate were introduced with AdaGrad (for "Adaptive Gradient") in 2011 and RMSprop (for "Root Mean Square Propagation") in 2012. In 2014, Adam (for "Adaptive Moment Estimation") was published, applying the adaptive approaches of RMSprop to momentum; many improvements and branches of Adam were then developed, such as Adadelta, Adagrad, AdamW, and Adamax. Within machine learning, approaches to optimization in 2023 are dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries, as of 2023 largely only include Adam-derived optimizers, as well as predecessors to Adam such as RMSprop and classic SGD. PyTorch also partially supports limited-memory BFGS (L-BFGS), a line-search method, but only for single-device setups without parameter groups.


Notable applications

Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models. When combined with the backpropagation algorithm, it is the ''de facto'' standard algorithm for training artificial neural networks. Its use has also been reported in the geophysics community, specifically for applications of full waveform inversion (FWI). Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used. Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE. Another stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.


Extensions and variants

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function \eta_t of the iteration number t, giving a ''learning rate schedule'', so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on k-means clustering. Practical guidance on choosing the step size in several variants of SGD is given by Spall.
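As an illustration, a minimal sketch of one common decreasing schedule; the particular 1/t form and the constants are assumptions for the example, not prescribed by the text:

```python
def lr_schedule(eta0, t, decay=0.01):
    """Decreasing learning rate: eta_t = eta0 / (1 + decay * t)."""
    return eta0 / (1.0 + decay * t)

# inside an SGD loop:
#   w -= lr_schedule(0.1, t) * grad_Qi(w, X[i], y[i])
```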


Implicit updates (ISGD)

As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate \eta. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved by considering ''implicit updates'', whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
w^\text{new} := w^\text{old} - \eta\, \nabla Q_i(w^\text{new}).
This equation is implicit since w^\text{new} appears on both sides. It is a stochastic form of the proximal gradient method, since the update can also be written as:
w^\text{new} := \arg\min_w \left\{ Q_i(w) + \frac{1}{2\eta} \left\|w - w^\text{old}\right\|^2 \right\}.

As an example, consider least squares with features x_1, \ldots, x_n \in \mathbb{R}^p and observations y_1, \ldots, y_n \in \mathbb{R}. We wish to solve:
\min_w \sum_{j=1}^n \left(y_j - x_j'w\right)^2,
where x_j'w = x_{j1} w_1 + x_{j2} w_2 + \cdots + x_{jp} w_p denotes the inner product. Note that x could have "1" as the first element in order to include an intercept. Classical stochastic gradient descent proceeds as follows:
w^\text{new} = w^\text{old} + \eta \left(y_i - x_i'w^\text{old}\right) x_i
where i is uniformly sampled between 1 and n. Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when \eta is misspecified so that I - \eta x_i x_i' has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, ''implicit stochastic gradient descent'' (shortened as ISGD) can be solved in closed form as:
w^\text{new} = w^\text{old} + \frac{\eta}{1 + \eta \left\|x_i\right\|^2} \left(y_i - x_i'w^\text{old}\right) x_i.
This procedure remains numerically stable for virtually all \eta, as the learning rate is now normalized. Such a comparison between classical and implicit stochastic gradient descent in the least squares problem is very similar to the comparison between least mean squares (LMS) and the normalized least mean squares (NLMS) filter.

Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that Q_i(w) depends on w only through a linear combination with features x_i, so that we can write \nabla_w Q_i(w) = -q(x_i'w) x_i, where q() \in \mathbb{R} may depend on x_i, y_i as well, but not on w except through x_i'w. Least squares obeys this rule, and so do logistic regression and most generalized linear models. For instance, in least squares, q(x_i'w) = y_i - x_i'w, and in logistic regression q(x_i'w) = y_i - S(x_i'w), where S(u) = e^u/(1+e^u) is the logistic function. In Poisson regression, q(x_i'w) = y_i - e^{x_i'w}, and so on.

In such settings, ISGD is simply implemented as follows. Let f(\xi) = \eta\, q(x_i'w^\text{old} + \xi \|x_i\|^2), where \xi is a scalar. Then, ISGD is equivalent to:
w^\text{new} = w^\text{old} + \xi^\ast x_i, \quad \text{where } \xi^\ast = f(\xi^\ast).
The scaling factor \xi^\ast \in \mathbb{R} can be found through the bisection method since, in most regular models such as the aforementioned generalized linear models, the function q() is decreasing, and thus the search bounds for \xi^\ast are [\min(0, f(0)), \max(0, f(0))].
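A sketch of this fixed-point computation, using logistic regression as the example model; the tolerance and iteration cap are implementation assumptions:

```python
import numpy as np

def isgd_step(w, x_i, y_i, eta, tol=1e-10, max_iter=100):
    """One implicit SGD step for logistic regression.

    Solves xi = f(xi) = eta * q(x_i'w + xi * ||x_i||^2) by bisection,
    where q(u) = y_i - sigmoid(u) is decreasing in u.
    """
    norm2 = x_i @ x_i

    def f(xi):
        u = x_i @ w + xi * norm2
        return eta * (y_i - 1.0 / (1.0 + np.exp(-u)))

    # Search bounds from the text: [min(0, f(0)), max(0, f(0))].
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > mid:    # f(xi) - xi is decreasing; root lies to the right
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    xi_star = 0.5 * (lo + hi)
    return w + xi_star * x_i
```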


Momentum

Further proposals include the ''momentum method'' or the ''heavy ball method'', which in the ML context appeared in Rumelhart, Hinton and Williams' paper on backpropagation learning and borrowed the idea from Soviet mathematician Boris Polyak's 1964 article on solving functional equations. Stochastic gradient descent with momentum remembers the update \Delta w at each iteration, and determines the next update as a linear combination of the gradient and the previous update:
\Delta w := \alpha \Delta w - \eta\, \nabla Q_i(w)
w := w + \Delta w
which leads to:
w := w - \eta\, \nabla Q_i(w) + \alpha \Delta w
where the parameter w which minimizes Q(w) is to be estimated, \eta is a step size (sometimes called the ''learning rate'' in machine learning) and \alpha is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

The name momentum stems from an analogy to momentum in physics: the weight vector w, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades. The ''momentum method'' is closely related to underdamped Langevin dynamics, and may be combined with simulated annealing.

In the mid-1980s the method was modified by Yurii Nesterov to use the gradient predicted at the next point, and the resulting so-called ''Nesterov Accelerated Gradient'' was sometimes used in ML in the 2010s.


Averaging

''Averaged stochastic gradient descent'', invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of
\bar{w} = \frac{1}{t} \sum_{i=0}^{t-1} w_i.
When optimization is done, this averaged parameter vector takes the place of w.
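A sketch of the running average, maintained incrementally alongside an otherwise unchanged SGD loop (grad_Qi is again a hypothetical per-sample gradient):

```python
import numpy as np

def averaged_sgd(grad_Qi, w0, X, y, eta=0.01, n_epochs=10, seed=0):
    """SGD that also maintains the running average of its iterates."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    w_bar, t = w.copy(), 1
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            w = w - eta * grad_Qi(w, X[i], y[i])
            t += 1
            w_bar += (w - w_bar) / t   # incremental mean of all iterates
    return w_bar                       # the average replaces the last iterate
```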


AdaGrad

''AdaGrad'' (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with a per-parameter learning rate, first published in 2011. Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate \eta, but this is multiplied with the elements of a vector \{G_{j,j}\}, which is the diagonal of the outer product matrix
G = \sum_{\tau=1}^t g_\tau g_\tau^\mathsf{T}
where g_\tau = \nabla Q_i(w) is the gradient at iteration \tau. The diagonal is given by
G_{j,j} = \sum_{\tau=1}^t g_{\tau,j}^2.
This vector essentially stores a historical sum of gradient squares by dimension and is updated after every iteration. The formula for an update is now
w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \odot g
or, written as per-parameter updates,
w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.
Each G_{j,j} gives rise to a scaling factor for the learning rate that applies to a single parameter w_j. Since the denominator in this factor, \sqrt{G_{j,j}} = \sqrt{\sum_{\tau=1}^t g_{\tau,j}^2}, is the ''ℓ''2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates. While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.
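A minimal per-coordinate sketch; the small constant eps guarding against division by zero is a common implementation detail, not part of the formula above:

```python
import numpy as np

def adagrad(grad_Qi, w0, X, y, eta=0.1, eps=1e-8, n_epochs=10, seed=0):
    """AdaGrad: scale each coordinate by the root of its summed squared grads."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    G_diag = np.zeros_like(w)           # running sum of squared gradients
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            g = grad_Qi(w, X[i], y[i])
            G_diag += g * g
            w -= eta * g / (np.sqrt(G_diag) + eps)
    return w
```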


RMSProp

''RMSProp'' (for Root Mean Square Propagation) is a method invented in 2012 by James Martens and Ilya Sutskever, at the time both PhD students in Geoffrey Hinton's group, in which the learning rate is, as in AdaGrad, adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. Unusually, it was not published in an article but merely described in a Coursera lecture.

First, the running average is calculated in terms of the mean square,
v(w,t) := \gamma v(w,t-1) + \left(1-\gamma\right) \left(\nabla Q_i(w)\right)^2
where \gamma is the forgetting factor. The concept of storing the historical gradient as a sum of squares is borrowed from AdaGrad, but "forgetting" is introduced to solve AdaGrad's diminishing learning rates in non-convex problems by gradually decreasing the influence of old data. The parameters are then updated as
w := w - \frac{\eta}{\sqrt{v(w,t)}} \nabla Q_i(w).
RMSProp has shown good adaptation of learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is capable of working with mini-batches as well, as opposed to only full batches.
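A sketch of the update; eps and the default constants here are conventional choices, assumed for the example:

```python
import numpy as np

def rmsprop(grad_Qi, w0, X, y, eta=0.001, gamma=0.9, eps=1e-8,
            n_epochs=10, seed=0):
    """RMSProp: divide the step by a decaying RMS of recent gradients."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)                # running mean of squared gradients
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            g = grad_Qi(w, X[i], y[i])
            v = gamma * v + (1.0 - gamma) * g * g
            w -= eta * g / (np.sqrt(v) + eps)
    return w
```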


Adam

''Adam'' (short for Adaptive Moment Estimation) is a 2014 update to the ''RMSProp'' optimizer combining it with the main feature of the ''momentum method''. In this optimization algorithm, running averages with exponential forgetting of both the gradients and the second moments of the gradients are used. Given parameters w^{(t)} and a loss function L^{(t)}, where t indexes the current training iteration (indexed at 1), Adam's parameter update is given by:
m_w^{(t+1)} := \beta_1 m_w^{(t)} + \left(1 - \beta_1\right) \nabla_w L^{(t)}
v_w^{(t+1)} := \beta_2 v_w^{(t)} + \left(1 - \beta_2\right) \left(\nabla_w L^{(t)}\right)^2
\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{t+1}}
\hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{t+1}}
w^{(t+1)} := w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \varepsilon}
where \varepsilon is a small scalar (e.g. 10^{-8}) used to prevent division by 0, and \beta_1 (e.g. 0.9) and \beta_2 (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting are done element-wise.

As the exponential moving averages of the gradient m_w^{(t)} and the squared gradient v_w^{(t)} are initialized with a vector of 0's, there would be a bias towards zero in the first training iterations. Factors of the form \tfrac{1}{1 - \beta_i^t} are introduced to compensate for this bias and obtain the better estimates \hat{m}_w and \hat{v}_w. The initial proof establishing the convergence of Adam was incomplete, and subsequent analysis has revealed that Adam does not converge for all convex objectives. Despite this, ''Adam'' continues to be used due to its strong performance in practice.
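A sketch of the full update loop under the same hypothetical grad_Qi interface as above; the defaults follow the commonly cited values from the text:

```python
import numpy as np

def adam(grad_Qi, w0, X, y, eta=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, n_epochs=10, seed=0):
    """Adam: momentum-style first moment plus RMSProp-style second moment,
    both bias-corrected for their zero initialization."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)    # first moment (exp. moving average of gradients)
    v = np.zeros_like(w)    # second moment (exp. moving average of g^2)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            t += 1
            g = grad_Qi(w, X[i], y[i])
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)   # bias correction
            v_hat = v / (1.0 - beta2 ** t)
            w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
```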


Variants

The popularity of ''Adam'' inspired many variants and enhancements. Some examples include:
* Nesterov-enhanced gradients: ''NAdam'', ''FASFA''
* varying interpretations of second-order information: ''Powerpropagation'' and ''AdaSqrt''
* using the infinity norm: ''AdaMax''
* ''AMSGrad'', which improves convergence over ''Adam'' by using the maximum of past squared gradients instead of the exponential average; ''AdamX'' further improves convergence over ''AMSGrad''
* ''AdamW'', which improves the weight decay.


Sign-based stochastic gradient descent

Even though sign-based optimization goes back to the aforementioned ''Rprop'', in 2018 researchers tried to simplify Adam by ignoring the magnitude of the stochastic gradient and taking into account only its sign.
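A minimal sketch of the plain sign-based update under the same hypothetical grad_Qi interface:

```python
import numpy as np

def sign_sgd(grad_Qi, w0, X, y, eta=0.01, n_epochs=10, seed=0):
    """Sign-based SGD: step using only the sign of each gradient coordinate."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            w -= eta * np.sign(grad_Qi(w, X[i], y[i]))
    return w
```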


Backtracking line search

Backtracking line search is another variant of gradient descent. It is based on a condition known as the Armijo–Goldstein condition. Both methods allow learning rates to change at each iteration; however, the manner of the change is different. Backtracking line search uses function evaluations to check Armijo's condition, and in principle the loop in the algorithm for determining the learning rates can be long and unknown in advance. Adaptive SGD does not need a loop in determining learning rates. On the other hand, adaptive SGD does not guarantee the "descent property" – which backtracking line search enjoys – which is that f(x_{n+1}) \leq f(x_n) for all n. If the gradient of the cost function is globally Lipschitz continuous, with Lipschitz constant L, and the learning rate is chosen of the order 1/L, then the standard version of SGD is a special case of backtracking line search.
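A sketch of one backtracking step on the full objective Q; the shrink factor beta and the Armijo constant c are conventional, assumed values:

```python
import numpy as np

def backtracking_step(Q, grad_Q, w, eta0=1.0, beta=0.5, c=1e-4):
    """Shrink the trial step until the Armijo condition
    Q(w - eta*g) <= Q(w) - c*eta*||g||^2 holds, then take the step.
    Terminates for smooth Q and a descent direction -g."""
    g = grad_Q(w)
    eta = eta0
    while Q(w - eta * g) > Q(w) - c * eta * (g @ g):
        eta *= beta          # backtrack: shrink the trial step size
    return w - eta * g
```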


Second-order methods

A stochastic analogue of the standard (deterministic) Newton–Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation. A method that uses direct measurements of the Hessian matrices of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer. However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others. (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.) Another approach to approximating the Hessian matrix is to replace it with the Fisher information matrix, which transforms the usual gradient into the natural gradient. These methods not requiring direct Hessian information are based on either values of the summands in the above empirical risk function or values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function. When the objective is a nonlinear least-squares loss
Q(w) = \frac{1}{n} \sum_{i=1}^n Q_i(w) = \frac{1}{n} \sum_{i=1}^n \left(m(w; x_i) - y_i\right)^2,
where m(w; x_i) is the predictive model (e.g., a deep neural network), the objective's structure can be exploited to estimate second-order information using gradients only. The resulting methods are simple and often effective.


Approximations in continuous time

For a small learning rate \eta, stochastic gradient descent (w_n)_{n \in \mathbb{N}_0} can be viewed as a discretization of the gradient flow ODE
\frac{d}{dt} W_t = -\nabla Q(W_t)
subject to additional stochastic noise. This approximation is only valid on a finite time-horizon in the following sense: assume that all the coefficients Q_i are sufficiently smooth. Let T > 0 and g: \R^d \to \R be a sufficiently smooth test function. Then, there exists a constant C > 0 such that for all \eta > 0
\max_{k=0,\ldots,\lfloor T/\eta \rfloor} \left| \mathbb{E}\left[g(w_k)\right] - g(W_{k\eta}) \right| \le C\eta,
where \mathbb{E} denotes taking the expectation with respect to the random choice of indices in the stochastic gradient descent scheme.

Since this approximation does not capture the random fluctuations around the mean behavior of stochastic gradient descent, solutions to stochastic differential equations (SDEs) have been proposed as limiting objects. More precisely, the solution to the SDE
dW_t = -\nabla\left(Q(W_t) + \tfrac{1}{4}\eta \left|\nabla Q(W_t)\right|^2\right)dt + \sqrt{\eta}\, \Sigma(W_t)^{1/2}\, dB_t,
for
\Sigma(w) = \frac{1}{n}\sum_{i=1}^n \left(\nabla Q_i(w) - \nabla Q(w)\right)\left(\nabla Q_i(w) - \nabla Q(w)\right)^T,
where dB_t denotes the Itô integral with respect to a Brownian motion, is a more precise approximation in the sense that there exists a constant C > 0 such that
\max_{k=0,\ldots,\lfloor T/\eta \rfloor} \left| \mathbb{E}\left[g(w_k)\right] - \mathbb{E}\left[g(W_{k\eta})\right] \right| \le C\eta^2.
However, this SDE only approximates the one-point motion of stochastic gradient descent. For an approximation of the stochastic flow one has to consider SDEs with infinite-dimensional noise.


See also

* Backtracking line search
* Broken Neural Scaling Law
* Coordinate descent – changes one coordinate at a time, rather than one example
* Linear classifier
* Online machine learning
* Stochastic hill climbing
* Stochastic variance reduction



External links

* Interactive paper explaining momentum.