Stochastic approximation methods are a family of
iterative methods
In computational mathematics, an iterative method is a mathematical procedure that uses an initial value to generate a sequence of improving approximate solutions for a class of problems, in which the ''n''-th approximation is derived from the pre ...
typically used for
root-finding problems or for
optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating
extreme values of functions which cannot be computed directly, but only estimated via noisy observations.
In a nutshell, stochastic approximation algorithms deal with a function of the form
which is the expected value of a function depending on a
random variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...
. The goal is to recover properties of such a function
without evaluating it directly. Instead, stochastic approximation algorithms use random samples of
to efficiently approximate properties of
such as zeros or extrema.
Recently, stochastic approximations have found extensive applications in the fields of statistics and machine learning, especially in settings with
big data
Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
. These applications range from
stochastic optimization
Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective funct ...
methods and algorithms, to online forms of the
EM algorithm, reinforcement learning via
temporal differences, and
deep learning
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
De ...
, and others.
Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.
The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms introduced respectively in 1951 and 1952.
Robbins–Monro algorithm
The Robbins–Monro algorithm, introduced in 1951 by
Herbert Robbins
Herbert Ellis Robbins (January 12, 1915 – February 12, 2001) was an American mathematician and statistician. He did research in topology, measure theory, statistics, and a variety of other fields.
He was the co-author, with Richard Courant ...
and
Sutton Monro,
presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function
, and a constant
, such that the equation
has a unique root at
. It is assumed that while we cannot directly observe the function
, we can instead obtain measurements of the random variable
where
. The structure of the algorithm is to then generate iterates of the form:
::
Here,
is a sequence of positive step sizes.
Robbins and Monro proved
, Theorem 2 that
converges in
(and hence also in probability) to
, and Blum
later proved the convergence is actually with probability one, provided that:
*
is uniformly bounded,
*
is nondecreasing,
*
exists and is positive, and
* The sequence
satisfies the following requirements:
::
A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form:
, for
. Other series are possible but in order to average out the noise in
, the above condition must be met.
Complexity results
#If
is twice continuously differentiable, and strongly convex, and the minimizer of
belongs to the interior of
, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, being
, where
is the minimal value of
over
.
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin
[Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley -Intersci. Ser. Discrete Math'' 15 ''John Wiley'' ''New York'' (1983) .] have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is
. They have also proven that this rate cannot be improved.
Subsequent developments and Polyak–Ruppert averaging
While the Robbins–Monro algorithm is theoretically able to achieve
under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily due to the fact that the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning.
[Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control](_blank)
J.C. Spall, ''John Wiley'' ''Hoboken, NJ'', (2003).
Chung (1954) and Fabian (1968) showed that we would achieve optimal convergence rate
with
(or
). Lai and Robbins designed adaptive procedures to estimate
such that
has minimal asymptotic variance. However the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak (1991) and Ruppert (1988) independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky
also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps, and averaging of the iterates. The algorithm would have the following structure:
The convergence of
to the unique root
relies on the condition that the step sequence
decreases sufficiently slowly. That is
''A1)'' ''
Therefore, the sequence
with
satisfies this restriction, but
does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate
yet with a more robust step size policy.
Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin
[On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, ''Dokl. Akad. Nauk SSR'' 2939, (1978 (Russian)), Soviet Math. Dokl. 19 (1978 (English)).] for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate
.
A more general result is given in Chapter 11 of Kushner and Yin by defining interpolated time
, interpolated process
and interpolated normalized process
as
Let the iterate average be
and the associate normalized error to be
.
With assumption A1) and the following A2)
''A2)'' ''There is a Hurwitz matrix
and a symmetric and positive-definite matrix
such that
converges weakly to
, where
is the statisolution to''
where
is a standard Wiener process.''
satisfied, and define ''
''. Then for each ''
'',
''
''
The success of the averaging idea is because of the time scale separation of the original sequence ''
'' and the averaged sequence ''
'', with the time scale of the former one being faster.
Application in stochastic optimization
Suppose we want to solve the following stochastic optimization problem
where
is differentiable and convex, then this problem is equivalent to find the root
of
. Here
can be interpreted as some "observed" cost as a function of the chosen
and random effects
. In practice, it might be hard to get an analytical form of
, Robbins–Monro method manages to generate a sequence
to approximate
if one can generate
, in which the conditional expectation of
given
is exactly
, i.e.
is simulated from a conditional distribution defined by
Here
is an unbiased estimator of
. If
depends on
, there is in general no natural way of generating a random outcome
that is an unbiased estimator of the gradient. In some special cases when either IPA or likelihood ratio methods are applicable, then one is able to obtain an unbiased gradient estimator
. If
is viewed as some "fundamental" underlying random process that is generated ''independently'' of
, and under some regularization conditions for derivative-integral interchange operations so that
, then
gives the fundamental gradient unbiased estimate. However, for some applications we have to use finite-difference methods in which
has a conditional expectation close to
but not exactly equal to it.
We then define a recursion analogously to Newton's Method in the deterministic algorithm:
:
Convergence of the algorithm
The following result gives sufficient conditions on
for the algorithm to converge:
C1)
C2)
C3)
C4)
C5)
:
Then
converges to
almost surely.
Here are some intuitive explanations about these conditions. Suppose
is a uniformly bounded random variables. If C2) is not satisfied, i.e.
, then
is a bounded sequence, so the iteration cannot converge to
if the initial guess
is too far away from
. As for C3) note that if
converges to
then
so we must have
,and the condition C3) ensures it. A natural choice would be
. Condition C5) is a fairly stringent condition on the shape of
; it gives the search direction of the algorithm.
Example (where the stochastic gradient method is appropriate)
Suppose
, where
is differentiable and
is a random variable independent of
. Then
depends on the mean of
, and the stochastic gradient method would be appropriate in this problem. We can choose
Kiefer–Wolfowitz algorithm
The Kiefer–Wolfowitz algorithm was introduced in 1952 by
Jacob Wolfowitz
Jacob Wolfowitz (March 19, 1910 – July 16, 1981) was a Polish-born American Jewish statistician and Shannon Award-winning information theorist. He was the father of former United States Deputy Secretary of Defense and World Bank Group President ...
and
Jack Kiefer,
and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let
be a function which has a maximum at the point
. It is assumed that
is unknown; however, certain observations
, where
, can be made at any point
. The structure of the algorithm follows a gradient-like method, with the iterates being generated as follows:
::
where
and
are independent, and the gradient of
is approximated using finite differences. The sequence
specifies the sequence of finite difference widths used for the gradient approximation, while the sequence
specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if
satisfied certain regularity conditions, then
will converge to
in probability as
, and later Blum
in 1954 showed
converges to
almost surely, provided that:
*
for all
.
* The function
has a unique point of maximum (minimum) and is strong concave (convex)
** The algorithm was first presented with the requirement that the function
maintains strong global convexity (concavity) over the entire feasible space. Given this condition is too restrictive to impose over the entire domain, Kiefer and Wolfowitz proposed that it is sufficient to impose the condition over a compact set
which is known to include the optimal solution.
*The function
satisfies the regularity conditions as follows:
** There exists
and
such that