In calculus, Newton's method (also called Newton–Raphson) is an iterative method for finding the roots of a differentiable function $f$, which are solutions to the equation $f(x) = 0$. To optimize a twice-differentiable $f$, however, the goal is to find the roots of $f'$. We can therefore apply Newton's method to the derivative $f'$ to find solutions to $f'(x) = 0$, also known as the critical points of $f$. These solutions may be minima, maxima, or saddle points; see the section "Several variables" in Critical point (mathematics) and the section "Geometric interpretation" in this article. This is relevant in optimization, which aims to find (global) minima of the function $f$.
Newton's method
The central problem of optimization is minimization of functions. Let us first consider the case of univariate functions, i.e., functions of a single real variable. We will later consider the more general and more practically useful multivariate case.
Given a twice differentiable function $f : \mathbb{R} \to \mathbb{R}$, we seek to solve the optimization problem
: $\min_{x \in \mathbb{R}} f(x).$
Newton's method attempts to solve this problem by constructing a sequence $\{x_k\}$ from an initial guess (starting point) $x_0 \in \mathbb{R}$ that converges towards a minimizer $x_*$ of $f$ by using a sequence of second-order Taylor approximations of $f$ around the iterates. The second-order Taylor expansion of $f$ around $x_k$ is
: $f(x_k + t) \approx f(x_k) + f'(x_k)\,t + \frac{1}{2} f''(x_k)\,t^2.$
The next iterate $x_{k+1}$ is defined by minimizing this quadratic approximation in $t$ and setting $x_{k+1} = x_k + t$. If the second derivative is positive, the quadratic approximation is a convex function of $t$, and its minimum can be found by setting the derivative to zero. Since
: $0 = \frac{d}{dt}\left( f(x_k) + f'(x_k)\,t + \frac{1}{2} f''(x_k)\,t^2 \right) = f'(x_k) + f''(x_k)\,t,$
the minimum is achieved for
: $t = -\frac{f'(x_k)}{f''(x_k)}.$
Putting everything together, Newton's method performs the iteration
: $x_{k+1} = x_k + t = x_k - \frac{f'(x_k)}{f''(x_k)}.$
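The univariate iteration can be sketched in a few lines of Python. This is a minimal illustration, not a robust implementation: the function name `newton_minimize_1d`, the stopping tolerance, and the test function $f(x) = x^4 - 3x^2 + 2$ are all assumptions made for the example, and the caller is expected to supply the first and second derivatives.

```python
def newton_minimize_1d(df, d2f, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k) until the step is tiny.

    df and d2f are the first and second derivatives of the objective.
    """
    x = x0
    for _ in range(max_iter):
        curvature = d2f(x)
        if curvature == 0:
            raise ZeroDivisionError("f''(x) vanished; the Newton step is undefined")
        step = -df(x) / curvature
        x += step
        if abs(step) < tol:
            break
    return x


# Illustrative objective (an assumption): f(x) = x**4 - 3*x**2 + 2,
# so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6.
x_min = newton_minimize_1d(lambda x: 4 * x**3 - 6 * x,
                           lambda x: 12 * x**2 - 6,
                           x0=2.0)
```

Started from $x_0 = 2$, where the second derivative is positive, the iterates approach the local minimizer $\sqrt{3/2}$. A start near $x = 0$, where the second derivative is negative, would instead be drawn to the local maximum at $0$.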
Geometric interpretation
The geometric interpretation of Newton's method is that at each iteration, it amounts to fitting a parabola to the graph of $f(x)$ at the trial value $x_k$, having the same slope and curvature as the graph at that point, and then proceeding to the maximum or minimum of that parabola (in higher dimensions, this may also be a saddle point); see below. Note that if $f$ happens to be a quadratic function, then the exact extremum is found in one step.
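The one-step property for quadratics is easy to check numerically. The snippet below is a small sketch; the helper name `newton_step` and the particular quadratic are assumptions chosen for illustration.

```python
def newton_step(df, d2f, x):
    """Jump to the vertex of the parabola fitted at x (same slope and curvature)."""
    return x - df(x) / d2f(x)


# Illustrative quadratic (an assumption): f(x) = 3x^2 - 12x + 5,
# so f'(x) = 6x - 12 and f''(x) = 6. Its exact minimizer is x = 2.
x1 = newton_step(lambda x: 6 * x - 12, lambda x: 6.0, 10.0)
```

Because the fitted parabola coincides with the quadratic itself, a single step from any starting point lands on the vertex.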
Higher dimensions
The above iterative scheme can be generalized to $d > 1$ dimensions by replacing the derivative with the gradient (different authors use different notation for the gradient, including $f'(x) = \nabla f(x) = g_f(x) \in \mathbb{R}^d$), and the reciprocal of the second derivative with the inverse of the Hessian matrix (different authors use different notation for the Hessian, including $f''(x) = \nabla^2 f(x) = H_f(x) \in \mathbb{R}^{d \times d}$). One thus obtains the iterative scheme
: $x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k), \qquad k \ge 0.$
Often Newton's method is modified to include a small step size $0 < \gamma \le 1$ instead of $\gamma = 1$:
: $x_{k+1} = x_k - \gamma\,[f''(x_k)]^{-1} f'(x_k).$
This is often done to ensure that the Wolfe conditions, or the much simpler and more efficient Armijo condition, are satisfied at each step of the method. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.
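A damped Newton iteration with Armijo backtracking might look as follows in two dimensions. This is a sketch under stated assumptions: the names `solve_2x2` and `damped_newton`, the constants `c` and `shrink`, the fallback to the negative gradient when the Newton direction is not a descent direction, and the test function $f(x, y) = \sqrt{1 + x^2} + y^2$ are all choices made for this example rather than part of the method's definition.

```python
import math


def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ h = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ZeroDivisionError("Hessian is singular")
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def damped_newton(f, grad, hess, x, c=1e-4, shrink=0.5, tol=1e-10, max_iter=100):
    """Newton's method with step size gamma chosen by Armijo backtracking."""
    for _ in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) < tol:
            break
        h = solve_2x2(hess(x), [-g[0], -g[1]])
        slope = g[0] * h[0] + g[1] * h[1]      # directional derivative along h
        if slope >= 0:                          # assumption: fall back to steepest descent
            h = [-g[0], -g[1]]
            slope = -(g[0] ** 2 + g[1] ** 2)
        gamma = 1.0
        # Shrink gamma until the Armijo sufficient-decrease condition holds.
        while f([x[0] + gamma * h[0], x[1] + gamma * h[1]]) > f(x) + c * gamma * slope:
            gamma *= shrink
        x = [x[0] + gamma * h[0], x[1] + gamma * h[1]]
    return x


# Illustrative convex objective (an assumption): f(x, y) = sqrt(1 + x^2) + y^2.
f = lambda p: math.sqrt(1 + p[0] ** 2) + p[1] ** 2
grad = lambda p: [p[0] / math.sqrt(1 + p[0] ** 2), 2 * p[1]]
hess = lambda p: [[(1 + p[0] ** 2) ** -1.5, 0.0], [0.0, 2.0]]
x_opt = damped_newton(f, grad, hess, [2.0, 1.0])
```

For this objective the undamped step in the first coordinate is $x \mapsto -x^3$, which diverges whenever $|x_0| > 1$. The backtracking loop shortens the first step enough to bring the iterate into the region where full steps converge.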
Convergence
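If $f$ is a strongly convex function with Lipschitz Hessian, then provided that $x_0$ is close enough to $x_* = \arg\min f(x)$, the sequence $x_0, x_1, x_2, \dots$ generated by Newton's method will converge to the (necessarily unique) minimizer $x_*$ of $f$ quadratically fast. That is, there is a constant $C > 0$, depending on $f$, such that
: $\|x_{k+1} - x_*\| \le C\,\|x_k - x_*\|^2, \qquad \forall k \ge 0.$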
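Quadratic convergence can be observed numerically by tracking the ratio of each error to the square of the previous one. The sketch below uses the illustrative objective $f(x) = e^x - 2x$ (an assumption chosen because its minimizer $\ln 2$ is known exactly); the function name `newton_errors` is likewise hypothetical.

```python
import math


def newton_errors(steps=3, x0=1.0):
    """Return |x_k - x_*| for k = 0..steps on f(x) = exp(x) - 2x."""
    x_star = math.log(2.0)                 # exact minimizer, since f'(x) = exp(x) - 2
    x = x0
    errors = [abs(x - x_star)]
    for _ in range(steps):
        x -= (math.exp(x) - 2.0) / math.exp(x)   # Newton step: f'(x) / f''(x)
        errors.append(abs(x - x_star))
    return errors


errors = newton_errors()
# For a quadratically convergent sequence these ratios stay bounded.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(len(errors) - 1)]
```

For this particular function the error obeys $e_{k+1} = e_k - 1 + \exp(-e_k) \approx e_k^2 / 2$, so the ratios remain close to one half. Only a few steps are taken because the error soon falls below floating-point resolution, after which the ratio is no longer meaningful.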
Computing the Newton direction
Finding the inverse of the Hessian in high dimensions to compute the Newton direction $h = -(f''(x_k))^{-1} f'(x_k)$ can be an expensive operation. In such cases, instead of directly inverting the Hessian, it is better to calculate the vector $h$ as the solution to the system of linear equations
: $[f''(x_k)]\,h = -f'(x_k),$
which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of equations; for example, the Cholesky factorization and conjugate gradient will only work if $f''(x_k)$ is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator of something gone wrong; for example, if a minimization problem is being approached and $f''(x_k)$ is not positive definite, then the iterations may be converging to a saddle point and not a minimum.
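Solving for the Newton direction through a Cholesky factorization, and using a failed factorization as the warning sign described above, can be sketched as follows. The function names `cholesky` and `newton_direction` and the small test matrices are assumptions for illustration; a production code would normally call a linear algebra library instead of this hand-written routine.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T.

    Raises ValueError if the symmetric matrix a is not positive definite.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def newton_direction(hessian, gradient):
    """Solve hessian @ h = -gradient without forming the inverse."""
    n = len(gradient)
    L = cholesky(hessian)
    y = [0.0] * n
    for i in range(n):                      # forward substitution: L y = -g
        y[i] = (-gradient[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    h = [0.0] * n
    for i in reversed(range(n)):            # back substitution: L^T h = y
        h[i] = (y[i] - sum(L[k][i] * h[k] for k in range(i + 1, n))) / L[i][i]
    return h


# Illustrative data (assumptions): a positive definite Hessian and a gradient.
h = newton_direction([[4.0, 2.0], [2.0, 3.0]], [2.0, 1.0])
```

Passing an indefinite matrix such as `[[1.0, 2.0], [2.0, 1.0]]` makes `cholesky` raise `ValueError`, which a caller can treat as a signal to modify the Hessian or switch to a different solver.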
On the other hand, if a constrained optimization is done (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and the linear system for $x_{k+1}$ will need to be solved with a method that works for such matrices, such as the $LDL^\top$ variant of Cholesky factorization or the conjugate residual method.
There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.
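In one dimension the quasi-Newton idea reduces to the secant method applied to $f'$: the second derivative is replaced by a difference quotient of successive gradient values. The sketch below shows only this one-dimensional special case, not a full multivariate quasi-Newton method such as BFGS; the name `secant_minimize_1d` and the test objective $f(x) = e^x - 2x$ are assumptions.

```python
import math


def secant_minimize_1d(df, x_prev, x, tol=1e-12, max_iter=100):
    """Quasi-Newton iteration in 1D: approximate f'' from changes in f'."""
    for _ in range(max_iter):
        g_prev, g = df(x_prev), df(x)
        if x == x_prev or g == g_prev:
            break                            # no new curvature information
        b = (g - g_prev) / (x - x_prev)      # secant estimate of f''(x)
        x_prev, x = x, x - g / b             # Newton-like step with b in place of f''
        if abs(x - x_prev) < tol:
            break
    return x


# Illustrative objective (an assumption): f(x) = exp(x) - 2x, f'(x) = exp(x) - 2.
x_min = secant_minimize_1d(lambda x: math.exp(x) - 2.0, 0.0, 1.0)
```

Only first-derivative evaluations are needed, at the cost of a convergence rate that is superlinear rather than quadratic.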
If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the solution may diverge. In this case, certain workarounds have been tried in the past, with varied success on particular problems. One can, for example, modify the Hessian by adding a correction matrix $B_k$ so as to make $f''(x_k) + B_k$ positive definite. One approach is to diagonalize the Hessian and choose $B_k$ so that $f''(x_k) + B_k$ has the same eigenvectors as the Hessian, but with each negative eigenvalue replaced by $\epsilon > 0$.
An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix to the Hessian, $\mu I$, with the scale adjusted at every iteration as needed. For large $\mu$ and small Hessian, the iterations will behave like gradient descent with step size $1/\mu$. This results in slower but more reliable convergence where the Hessian doesn't provide useful information.
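The effect of adding a scaled identity can be seen on a small example. The sketch below solves the shifted system in two dimensions by Cramer's rule; the function name `shifted_newton_step`, the indefinite test Hessian, and the chosen shift values are assumptions for illustration and do not reproduce the full Levenberg–Marquardt update rule.

```python
def shifted_newton_step(hessian, gradient, mu):
    """Solve (hessian + mu * I) h = -gradient for a 2x2 Hessian."""
    a = hessian[0][0] + mu
    b = hessian[0][1]
    c = hessian[1][0]
    d = hessian[1][1] + mu
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("shifted Hessian is singular")
    return [(-gradient[0] * d + b * gradient[1]) / det,
            (-a * gradient[1] + c * gradient[0]) / det]


# Illustrative data (assumptions): an indefinite Hessian and a gradient.
H = [[1.0, 0.0], [0.0, -2.0]]
g = [1.0, 1.0]
plain = shifted_newton_step(H, g, mu=0.0)     # pure Newton step
damped = shifted_newton_step(H, g, mu=1e6)    # heavily shifted step
```

With no shift, the second component of the step points along the gradient because of the negative eigenvalue. With a large shift, the step is nearly $-g/\mu$, that is, a short gradient descent step.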
Some caveats
Newton's method, in its original version, has several caveats:
# It does not work if the Hessian is not invertible. This is clear from the very definition of Newton's method, which requires taking the inverse of the Hessian.
# It may not converge at all, but can enter a cycle having more than one point.
# It can converge to a saddle point instead of to a local minimum; see the section "Geometric interpretation" in this article.
The popular modifications of Newton's method, such as quasi-Newton methods or the Levenberg–Marquardt algorithm mentioned above, also have caveats:
For example, it is usually required that the cost function is (strongly) convex and the Hessian is globally bounded or Lipschitz continuous; this is mentioned, for example, in the section "Convergence" in this article. If one looks at the papers by Levenberg and Marquardt in the references for Levenberg–Marquardt algorithm, which are the original sources for the method, one can see that there is basically no theoretical analysis in the paper by Levenberg, while the paper by Marquardt only analyses a local situation and does not prove a global convergence result. One can compare with the backtracking line search method for gradient descent, which has good theoretical guarantees under more general assumptions and can be implemented and works well in practical large-scale problems such as deep neural networks.
See also
* Quasi-Newton method
* Gradient descent
* Gauss–Newton algorithm
* Levenberg–Marquardt algorithm
* Trust region
* Optimization
* Nelder–Mead method
* Self-concordant function: a function for which Newton's method has a very good global convergence rate.