Newton's method in optimization

In calculus, Newton's method (also known as the Newton–Raphson method) is an iterative method for finding the roots of a differentiable function f, which are solutions to the equation f(x) = 0. As such, Newton's method can be applied to the derivative f' of a twice-differentiable function f to find the roots of the derivative (solutions to f'(x) = 0), also known as the critical points of f. These solutions may be minima, maxima, or saddle points; see the section "Several variables" in Critical point (mathematics) and also the section "Geometric interpretation" in this article. This is relevant in mathematical optimization, which aims to find (global) minima of the function f.


Newton's method

The central problem of optimization is minimization of functions. Let us first consider the case of univariate functions, i.e., functions of a single real variable. We will later consider the more general and more practically useful multivariate case. Given a twice differentiable function f:\mathbb{R}\to\mathbb{R}, we seek to solve the optimization problem

: \min_{x\in\mathbb{R}} f(x).

Newton's method attempts to solve this problem by constructing a sequence \{x_k\} from an initial guess (starting point) x_0\in\mathbb{R} that converges towards a minimizer x_* of f by using a sequence of second-order Taylor approximations of f around the iterates. The second-order Taylor expansion of f around x_k is

: f(x_k + t) \approx f(x_k) + f'(x_k) t + \frac{1}{2} f''(x_k) t^2.

The next iterate x_{k+1} is defined so as to minimize this quadratic approximation in t, setting x_{k+1} = x_k + t. If the second derivative is positive, the quadratic approximation is a convex function of t, and its minimum can be found by setting the derivative to zero. Since

: 0 = \frac{d}{dt} \left(f(x_k) + f'(x_k) t + \frac{1}{2} f''(x_k) t^2\right) = f'(x_k) + f''(x_k) t,

the minimum is achieved for

: t = -\frac{f'(x_k)}{f''(x_k)}.

Putting everything together, Newton's method performs the iteration

: x_{k+1} = x_k + t = x_k - \frac{f'(x_k)}{f''(x_k)}.
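
As a minimal sketch (not part of the original article), the univariate iteration can be written in a few lines of Python; the names newton_1d, f_prime, and f_second are illustrative placeholders for a user-supplied first and second derivative of f.

```python
def newton_1d(f_prime, f_second, x0, tol=1e-10, max_iter=50):
    """Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_second(x)   # assumes f''(x) != 0
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^4 - 3x^2 + 2 has f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6;
# starting near x = 1, the iteration converges to the local minimizer sqrt(3/2).
x_min = newton_1d(lambda x: 4*x**3 - 6*x, lambda x: 12*x**2 - 6, x0=1.0)
```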


Geometric interpretation

The geometric interpretation of Newton's method is that at each iteration, it amounts to fitting a parabola to the graph of f(x) at the trial value x_k, having the same slope and curvature as the graph at that point, and then proceeding to the maximum or minimum of that parabola (in higher dimensions, this may also be a saddle point); see below. Note that if f happens to be a quadratic function, then the exact extremum is found in one step.
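
For example, for a quadratic f(x) = a x^2 + b x + c with a > 0, a single Newton step from any starting point x_0 lands exactly on the minimizer:

: x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = x_0 - \frac{2 a x_0 + b}{2a} = -\frac{b}{2a}.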


Higher dimensions

The above iterative scheme can be generalized to d > 1 dimensions by replacing the derivative with the gradient (different authors use different notation for the gradient, including f'(x) = \nabla f(x) = g_f(x) \in \mathbb{R}^d), and the reciprocal of the second derivative with the inverse of the Hessian matrix (different authors use different notation for the Hessian, including f''(x) = \nabla^2 f(x) = H_f(x) \in \mathbb{R}^{d \times d}). One thus obtains the iterative scheme

: x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k), \qquad k \ge 0.

Often Newton's method is modified to include a small step size 0 < \gamma \le 1 instead of \gamma = 1:

: x_{k+1} = x_k - \gamma [f''(x_k)]^{-1} f'(x_k).

This is often done to ensure that the Wolfe conditions, or the much simpler and more efficient Armijo condition, are satisfied at each step of the method. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.
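
A minimal Python sketch of the damped multivariate iteration (the callables grad and hess and the test function are illustrative assumptions, not part of the article); the Newton direction is obtained by solving a linear system rather than by inverting the Hessian, as discussed further below.

```python
import numpy as np

def damped_newton(grad, hess, x0, gamma=1.0, tol=1e-8, max_iter=100):
    """Damped Newton: x_{k+1} = x_k - gamma * [f''(x_k)]^{-1} f'(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stop once the gradient is small
            break
        h = np.linalg.solve(hess(x), -g)   # Newton direction, no explicit inverse
        x = x + gamma * h
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2, whose minimizer is (1, -2).
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(damped_newton(grad, hess, x0=[0.0, 0.0]))   # approximately [1, -2]
```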


Convergence

If f is a strongly convex function with Lipschitz continuous Hessian, then provided that x_0 is close enough to x_* = \arg\min f(x), the sequence x_0, x_1, x_2, \dots generated by Newton's method will converge to the (necessarily unique) minimizer x_* of f quadratically fast. That is,

: \|x_{k+1} - x_*\| \leq \frac{L}{2m} \|x_k - x_*\|^2, \qquad \forall k \geq 0,

where m > 0 is the strong convexity constant of f and L is the Lipschitz constant of its Hessian.
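
The quadratic error decay can be checked numerically on a simple strongly convex example; the function f(x) = e^x + e^{-x} below (with f''(x) \ge 2 and minimizer x_* = 0) is an illustrative choice, not taken from the article.

```python
import math

# f(x) = exp(x) + exp(-x): f'(x) = exp(x) - exp(-x), f''(x) = exp(x) + exp(-x),
# so the Newton update simplifies to x_{k+1} = x_k - tanh(x_k); minimizer x_* = 0.
x = 1.0
for k in range(5):
    print(k, abs(x))        # error |x_k - x_*|
    x = x - math.tanh(x)
# The errors shrink roughly quadratically: 1.0, 0.24, 4.5e-3, 3.0e-8, ...
```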


Computing the Newton direction

Finding the inverse of the Hessian in high dimensions to compute the Newton direction h = -(f''(x_k))^{-1} f'(x_k) can be an expensive operation. In such cases, instead of directly inverting the Hessian, it is better to calculate the vector h as the solution to the system of linear equations

: f''(x_k)\, h = -f'(x_k),

which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of equations; for example, the Cholesky factorization and the conjugate gradient method will only work if f''(x_k) is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator of something gone wrong; for example, if a minimization problem is being approached and f''(x_k) is not positive definite, then the iterations are converging to a saddle point and not a minimum.

On the other hand, if a constrained optimization is done (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and the solution for x_{k+1} will need to be computed with a method that works for such systems, such as the LDL^\top variant of Cholesky factorization or the conjugate residual method.

There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.

If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the solution may diverge. In this case, certain workarounds have been tried in the past, with varied success on certain problems. One can, for example, modify the Hessian by adding a correction matrix B_k so as to make f''(x_k) + B_k positive definite. One approach is to diagonalize the Hessian and choose B_k so that f''(x_k) + B_k has the same eigenvectors as the Hessian, but with each negative eigenvalue replaced by \epsilon > 0.

An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix \mu I to the Hessian, with the scale adjusted at every iteration as needed. For large \mu and small Hessian, the iterations will behave like gradient descent with step size 1/\mu. This results in slower but more reliable convergence where the Hessian does not provide useful information.
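
The following Python sketch combines the ideas above: a Cholesky-based solve of the Newton system, with a Levenberg–Marquardt-style fallback that adds \mu I when the Hessian is not positive definite. The function name and the increase schedule for \mu are illustrative choices, not prescribed by the article.

```python
import numpy as np

def newton_direction(H, g, mu0=1e-3):
    """Solve f''(x_k) h = -f'(x_k) for the Newton direction h.

    Uses a Cholesky factorization; if H is not positive definite, retries
    with the regularized matrix H + mu * I, increasing mu until the
    factorization succeeds.
    """
    mu = 0.0
    I = np.eye(H.shape[0])
    while True:
        try:
            L = np.linalg.cholesky(H + mu * I)  # raises LinAlgError if not PD
            y = np.linalg.solve(L, -g)          # forward solve:  L y = -g
            return np.linalg.solve(L.T, y)      # backward solve: L^T h = y
        except np.linalg.LinAlgError:
            mu = mu0 if mu == 0.0 else 10.0 * mu

# Example with an indefinite Hessian, so the regularized fallback is used.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
print(newton_direction(H, g))
```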


Some caveats

Newton's method, in its original version, has several caveats:

1. It does not work if the Hessian is not invertible. This is clear from the very definition of Newton's method, which requires taking the inverse of the Hessian.
2. It may not converge at all, but can enter a cycle having more than one point. See the section "Failure analysis" in Newton's method.
3. It can converge to a saddle point instead of to a local minimum; see the section "Geometric interpretation" in this article and the sketch at the end of this section.

The popular modifications of Newton's method, such as the quasi-Newton methods or the Levenberg–Marquardt algorithm mentioned above, also have caveats. For example, it is usually required that the cost function is (strongly) convex and that the Hessian is globally bounded or Lipschitz continuous; this is, for instance, the assumption made in the section "Convergence" in this article. If one looks at the papers by Levenberg and Marquardt, the original sources for the Levenberg–Marquardt algorithm, one sees that there is essentially no theoretical analysis in the paper by Levenberg, while the paper by Marquardt only analyses a local situation and does not prove a global convergence result. One can compare with the backtracking line search method for gradient descent, which has good theoretical guarantees under more general assumptions, and which can be implemented and works well in practical large-scale problems such as deep neural networks.
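
As a minimal sketch of caveat 3, consider the textbook example f(x, y) = x^2 - y^2 (chosen here purely for illustration): its Hessian is the constant indefinite matrix diag(2, -2), and a single undamped Newton step from any starting point lands exactly on the saddle point at the origin.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point, not a minimum, at the origin.
grad = lambda v: np.array([2 * v[0], -2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, -2.0]])

x = np.array([3.0, 4.0])
x = x - np.linalg.solve(hess(x), grad(x))   # one undamped Newton step
print(x)   # [0. 0.] -- Newton's method "converges" to the saddle point
```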


See also

* Quasi-Newton method
* Gradient descent
* Gauss–Newton algorithm
* Levenberg–Marquardt algorithm
* Trust region
* Optimization
* Nelder–Mead method

