In mathematics and computing, the Levenberg–Marquardt algorithm (LMA or just LM), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least-squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. The LMA can also be viewed as Gauss–Newton using a trust region approach.

The algorithm was first published in 1944 by Kenneth Levenberg, while working at the Frankford Army Arsenal. It was rediscovered in 1963 by Donald Marquardt, who worked as a statistician at DuPont, and independently by Girard, Wynne and Morrison.

The LMA is used in many software applications for solving generic curve-fitting problems. By using the Gauss–Newton algorithm it often converges faster than first-order methods. However, like other iterative optimization algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum.


The problem

The primary application of the Levenberg–Marquardt algorithm is in the least-squares curve-fitting problem: given a set of m empirical pairs \left(x_i, y_i\right) of independent and dependent variables, find the parameters \boldsymbol\beta of the model curve f\left(x, \boldsymbol\beta\right) so that the sum of the squares of the deviations S\left(\boldsymbol\beta\right) is minimized:
:\hat{\boldsymbol\beta} \in \operatorname{argmin}\limits_{\boldsymbol\beta} S\left(\boldsymbol\beta\right) \equiv \operatorname{argmin}\limits_{\boldsymbol\beta} \sum_{i=1}^m \left[y_i - f\left(x_i, \boldsymbol\beta\right)\right]^2,
which is assumed to be non-empty.
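To make the notation concrete, the objective can be written down directly in code. The following is a minimal NumPy sketch, assuming a hypothetical exponential-decay model f\left(x, \boldsymbol\beta\right) = \beta_0 e^{-\beta_1 x} chosen purely for illustration; the helper names (model, residuals, objective) are ours and are reused by the later sketches in this article.

    import numpy as np

    def model(x, beta):
        # Hypothetical model f(x, beta) = beta[0] * exp(-beta[1] * x); any
        # differentiable model curve could be substituted here.
        return beta[0] * np.exp(-beta[1] * x)

    def residuals(beta, x, y):
        # Deviations y_i - f(x_i, beta).
        return y - model(x, beta)

    def objective(beta, x, y):
        # S(beta): the sum of squared deviations to be minimized.
        r = residuals(beta, x, y)
        return float(r @ r)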


The solution

Like other numeric minimization algorithms, the Levenberg–Marquardt algorithm is an iterative procedure. To start a minimization, the user has to provide an initial guess for the parameter vector \boldsymbol\beta. In cases with only one minimum, an uninformed standard guess like \boldsymbol\beta^\text{T} = \begin{pmatrix}1, 1, \dots, 1\end{pmatrix} will work fine; in cases with multiple minima, the algorithm converges to the global minimum only if the initial guess is already somewhat close to the final solution.

In each iteration step, the parameter vector \boldsymbol\beta is replaced by a new estimate \boldsymbol\beta + \boldsymbol\delta. To determine \boldsymbol\delta, the function f\left(x_i, \boldsymbol\beta + \boldsymbol\delta\right) is approximated by its linearization:
:f\left(x_i, \boldsymbol\beta + \boldsymbol\delta\right) \approx f\left(x_i, \boldsymbol\beta\right) + \mathbf J_i \boldsymbol\delta,
where
:\mathbf J_i = \frac{\partial f\left(x_i, \boldsymbol\beta\right)}{\partial \boldsymbol\beta}
is the gradient (row vector in this case) of f with respect to \boldsymbol\beta.

The sum S\left(\boldsymbol\beta\right) of square deviations has its minimum at a zero gradient with respect to \boldsymbol\beta. The above first-order approximation of f\left(x_i, \boldsymbol\beta + \boldsymbol\delta\right) gives
:S\left(\boldsymbol\beta + \boldsymbol\delta\right) \approx \sum_{i=1}^m \left[y_i - f\left(x_i, \boldsymbol\beta\right) - \mathbf J_i \boldsymbol\delta\right]^2,
or in vector notation,
:\begin{align} S\left(\boldsymbol\beta + \boldsymbol\delta\right) &\approx \left\|\mathbf y - \mathbf f\left(\boldsymbol\beta\right) - \mathbf J\boldsymbol\delta\right\|^2\\ &= \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right) - \mathbf J\boldsymbol\delta\right]^{\mathrm T} \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right) - \mathbf J\boldsymbol\delta\right]\\ &= \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right]^{\mathrm T} \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right] - \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right]^{\mathrm T} \mathbf J\boldsymbol\delta - \left(\mathbf J\boldsymbol\delta\right)^{\mathrm T} \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right] + \boldsymbol\delta^{\mathrm T} \mathbf J^{\mathrm T} \mathbf J\boldsymbol\delta\\ &= \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right]^{\mathrm T} \left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right] - 2\left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right]^{\mathrm T} \mathbf J\boldsymbol\delta + \boldsymbol\delta^{\mathrm T} \mathbf J^{\mathrm T} \mathbf J\boldsymbol\delta. \end{align}

Taking the derivative of S\left(\boldsymbol\beta + \boldsymbol\delta\right) with respect to \boldsymbol\delta and setting the result to zero gives
:\left(\mathbf J^{\mathrm T} \mathbf J\right)\boldsymbol\delta = \mathbf J^{\mathrm T}\left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right],
where \mathbf J is the Jacobian matrix, whose i-th row equals \mathbf J_i, and where \mathbf f\left(\boldsymbol\beta\right) and \mathbf y are vectors with i-th components f\left(x_i, \boldsymbol\beta\right) and y_i, respectively. The above expression obtained for \boldsymbol\delta is the Gauss–Newton step. The Jacobian matrix as defined above is not (in general) a square matrix, but a rectangular matrix of size m \times n, where n is the number of parameters (size of the vector \boldsymbol\beta). The matrix multiplication \mathbf J^{\mathrm T} \mathbf J yields the required n \times n square matrix, and the matrix-vector product on the right-hand side yields a vector of size n. The result is a set of n linear equations, which can be solved for \boldsymbol\delta.
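A direct NumPy transcription of the Gauss–Newton step might look as follows; it reuses the hypothetical model from the sketch in the previous section, and the forward-difference numerical_jacobian helper (our own naming) stands in for an analytic Jacobian.

    def numerical_jacobian(beta, x, eps=1e-6):
        # J[i, j] = d f(x_i, beta) / d beta_j, approximated by forward differences.
        beta = np.asarray(beta, dtype=float)
        f0 = model(x, beta)
        J = np.zeros((x.size, beta.size))
        for j in range(beta.size):
            step = np.zeros_like(beta)
            step[j] = eps
            J[:, j] = (model(x, beta + step) - f0) / eps
        return J

    def gauss_newton_step(beta, x, y):
        # Solve the normal equations (J^T J) delta = J^T [y - f(beta)] for delta.
        J = numerical_jacobian(beta, x)
        r = y - model(x, beta)
        return np.linalg.solve(J.T @ J, J.T @ r)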
Levenberg's contribution is to replace the Gauss–Newton normal equations by a "damped version":
:\left(\mathbf J^{\mathrm T} \mathbf J + \lambda\mathbf I\right) \boldsymbol\delta = \mathbf J^{\mathrm T}\left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right],
where \mathbf I is the identity matrix, giving \boldsymbol\delta as the increment to the estimated parameter vector \boldsymbol\beta.

The (non-negative) damping factor \lambda is adjusted at each iteration. If the reduction of S is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, \lambda can be increased, giving a step closer to the gradient-descent direction.
Note that the gradient of S with respect to \boldsymbol\beta equals -2\left(\mathbf J^{\mathrm T}\left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right]\right)^{\mathrm T}. Therefore, for large values of \lambda, the step will be taken approximately in the direction opposite to the gradient. If either the length of the calculated step \boldsymbol\delta or the reduction of the sum of squares from the latest parameter vector \boldsymbol\beta + \boldsymbol\delta falls below predefined limits, iteration stops, and the last parameter vector \boldsymbol\beta is considered to be the solution. When the damping factor \lambda is large relative to \left\|\mathbf J^{\mathrm T} \mathbf J\right\|, inverting \mathbf J^{\mathrm T} \mathbf J + \lambda\mathbf I is not necessary, as the update is well approximated by the small gradient step \lambda^{-1} \mathbf J^{\mathrm T}\left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right].

To make the solution scale invariant, Marquardt's algorithm solved a modified problem with each component of the gradient scaled according to the curvature. This provides larger movement along the directions where the gradient is smaller, which avoids slow convergence in directions of small gradient. Fletcher, in his 1971 paper ''A modified Marquardt subroutine for non-linear least squares'', simplified the form, replacing the identity matrix \mathbf I with the diagonal matrix consisting of the diagonal elements of \mathbf J^{\mathrm T} \mathbf J:
:\left[\mathbf J^{\mathrm T} \mathbf J + \lambda \operatorname{diag}\left(\mathbf J^{\mathrm T} \mathbf J\right)\right] \boldsymbol\delta = \mathbf J^{\mathrm T}\left[\mathbf y - \mathbf f\left(\boldsymbol\beta\right)\right].
A similar damping factor appears in Tikhonov regularization, which is used to solve linear ill-posed problems, as well as in ridge regression, an estimation technique in statistics.
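Putting the damped update together with a simple accept/reject rule for \lambda gives a compact sketch of the whole iteration. It reuses the model, objective and numerical_jacobian helpers from the earlier sketches, keeps the plain \lambda\mathbf I damping rather than Fletcher's diagonal scaling, and the stopping tolerance and \lambda-update factors are illustrative choices rather than part of any particular published variant.

    def levenberg_marquardt(beta0, x, y, lam=1e-2, up=2.0, down=3.0,
                            max_iter=200, step_tol=1e-10):
        # Minimal damped least-squares iteration with lambda * I damping.
        beta = np.asarray(beta0, dtype=float)
        S = objective(beta, x, y)
        for _ in range(max_iter):
            J = numerical_jacobian(beta, x)
            r = y - model(x, beta)
            # Damped normal equations: (J^T J + lambda * I) delta = J^T r.
            delta = np.linalg.solve(J.T @ J + lam * np.eye(beta.size), J.T @ r)
            S_new = objective(beta + delta, x, y)
            if S_new < S:
                # Downhill step: accept it and relax the damping (closer to Gauss-Newton).
                beta, S = beta + delta, S_new
                lam /= down
                if np.linalg.norm(delta) < step_tol:
                    break
            else:
                # Uphill step: reject it and increase the damping (closer to gradient descent).
                lam *= up
        return beta

Called with the uninformed starting guess mentioned above, e.g. levenberg_marquardt(np.ones(2), x, y), this sketch returns the fitted parameter vector once the step length falls below the tolerance or the iteration budget is exhausted.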


Choice of damping parameter

Various more or less heuristic arguments have been put forward for the best choice for the damping parameter \lambda. Theoretical arguments exist showing why some of these choices guarantee local convergence of the algorithm; however, these choices can make the global convergence of the algorithm suffer from the undesirable properties of steepest descent, in particular, very slow convergence close to the optimum.

The absolute values of any choice depend on how well-scaled the initial problem is. Marquardt recommended starting with a value \lambda_0 and a factor \nu > 1. Initially set \lambda = \lambda_0 and compute the residual sum of squares S\left(\boldsymbol\beta\right) after one step from the starting point, first with the damping factor \lambda = \lambda_0 and secondly with \lambda_0/\nu. If both of these are worse than the initial point, then the damping is increased by successive multiplication by \nu until a better point is found with a new damping factor of \lambda_0\nu^k for some k. If use of the damping factor \lambda/\nu results in a reduction in the squared residual, then this is taken as the new value of \lambda (and the new optimum location is taken as that obtained with this damping factor) and the process continues; if using \lambda/\nu resulted in a worse residual, but using \lambda resulted in a better residual, then \lambda is left unchanged and the new optimum is taken as the value obtained with \lambda as damping factor.

An effective strategy for the control of the damping parameter, called ''delayed gratification'', consists of increasing the parameter by a small amount for each uphill step, and decreasing it by a large amount for each downhill step. The idea behind this strategy is to avoid moving downhill too fast in the beginning of the optimization, which would restrict the steps available in later iterations and therefore slow down convergence. An increase by a factor of 2 and a decrease by a factor of 3 have been shown to be effective in most cases, while for large problems more extreme values can work better, with an increase by a factor of 1.5 and a decrease by a factor of 5.
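For comparison with the delayed-gratification rule, Marquardt's original recommendation can be sketched as a single iteration of a \lambda-selection routine. The helper below reuses the functions from the earlier sketches; the concrete values of \lambda_0 and \nu are left to the caller, and the bounded search loop is a practical safeguard of ours, not part of the original prescription.

    def marquardt_lambda_iteration(beta, lam, nu, x, y):
        # One iteration of Marquardt's strategy: try lambda/nu, then lambda, and
        # otherwise raise lambda by successive factors of nu until S improves.
        beta = np.asarray(beta, dtype=float)
        S0 = objective(beta, x, y)
        J = numerical_jacobian(beta, x)
        r = y - model(x, beta)

        def try_step(damping):
            delta = np.linalg.solve(J.T @ J + damping * np.eye(beta.size), J.T @ r)
            candidate = beta + delta
            return candidate, objective(candidate, x, y)

        candidate, S_new = try_step(lam / nu)
        if S_new < S0:
            return candidate, lam / nu      # weaker damping already improves S
        candidate, S_new = try_step(lam)
        if S_new < S0:
            return candidate, lam           # keep lambda unchanged
        for _ in range(50):                 # bounded search; a sketch, not production code
            lam *= nu
            candidate, S_new = try_step(lam)
            if S_new < S0:
                return candidate, lam
        return beta, lam                    # no improvement found (e.g. already near a minimum)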


Geodesic acceleration

When interpreting the Levenberg–Marquardt step as the velocity \boldsymbol v_k along a geodesic path in the parameter space, it is possible to improve the method by adding a second-order term that accounts for the acceleration \boldsymbol a_k along the geodesic:
:\boldsymbol v_k + \frac{1}{2} \boldsymbol a_k,
where \boldsymbol a_k is the solution of
:\mathbf J_k \boldsymbol a_k = -f_{vv}.
Since this geodesic acceleration term depends only on the directional derivative f_{vv} = \sum_{\mu\nu} v_\mu v_\nu \partial_\mu \partial_\nu f\left(\boldsymbol x\right) along the direction of the velocity \boldsymbol v, it does not require computing the full second-order derivative matrix, requiring only a small overhead in terms of computing cost. Since the second-order derivative can be a fairly complex expression, it can be convenient to replace it with a finite difference approximation:
:\begin{align} f_{vv}^i &\approx \frac{f_i\left(\boldsymbol x + h \boldsymbol\delta\right) - 2 f_i\left(\boldsymbol x\right) + f_i\left(\boldsymbol x - h \boldsymbol\delta\right)}{h^2} \\ &= \frac{2}{h} \left( \frac{f_i\left(\boldsymbol x + h \boldsymbol\delta\right) - f_i\left(\boldsymbol x\right)}{h} - \mathbf J_i \boldsymbol\delta \right), \end{align}
where f\left(\boldsymbol x\right) and \mathbf J have already been computed by the algorithm, therefore requiring only one additional function evaluation to compute f\left(\boldsymbol x + h \boldsymbol\delta\right). The choice of the finite difference step h can affect the stability of the algorithm, and a value of around 0.1 is usually reasonable in general.

Since the acceleration may point in the opposite direction to the velocity, an additional criterion on the acceleration is added before accepting a step, to prevent it from stalling the method in case the damping is too small. The step is accepted only if
:\frac{2 \left\|\boldsymbol a_k\right\|}{\left\|\boldsymbol v_k\right\|} \le \alpha,
where \alpha is usually fixed to a value less than 1, with smaller values for harder problems.

The addition of a geodesic acceleration term can allow a significant increase in convergence speed, and it is especially useful when the algorithm is moving through narrow canyons in the landscape of the objective function, where the allowed steps are smaller and the higher accuracy due to the second-order term gives significant improvements.
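A sketch of the acceleration computation, following the one-extra-evaluation finite-difference formula above, is given below. It reuses the hypothetical model and numerical_jacobian helpers from the earlier sketches; solving for the acceleration through the same damped normal equations as the main step is one practical choice on our part, not something prescribed by the formulas above.

    def geodesic_acceleration(beta, delta, x, lam, h=0.1):
        # Second directional derivative of the model along the step delta:
        # f_vv ~= (2/h) * ((f(beta + h*delta) - f(beta)) / h - J @ delta).
        beta = np.asarray(beta, dtype=float)
        J = numerical_jacobian(beta, x)
        fvv = (2.0 / h) * ((model(x, beta + h * delta) - model(x, beta)) / h - J @ delta)
        # Acceleration from J a = -f_vv, solved here via the damped normal equations.
        A = J.T @ J + lam * np.eye(beta.size)
        return np.linalg.solve(A, -(J.T @ fvv))

    def acceleration_acceptable(delta, accel, alpha=0.75):
        # Accept the corrected step only if the acceleration stays small relative
        # to the velocity: 2 * ||a|| / ||v|| <= alpha, with alpha < 1.
        return 2.0 * np.linalg.norm(accel) / np.linalg.norm(delta) <= alpha

When the test passes, the corrected step delta + 0.5 * accel would be used in place of delta.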


Example

In this example we try to fit the function y = a \cos\left(bX\right) + b \sin\left(aX\right) using the Levenberg–Marquardt algorithm implemented in GNU Octave as the ''leasqr'' function. The graphs show progressively better fitting for the parameters a = 100, b = 102 used in the initial curve. Only when the parameters in the last graph are chosen closest to the original do the curves fit exactly. This equation is an example of very sensitive initial conditions for the Levenberg–Marquardt algorithm. One reason for this sensitivity is the existence of multiple minima: the function \cos\left(\beta x\right) has minima at parameter values \hat\beta and \hat\beta + 2n\pi.
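For comparison, a similar fit can be set up in Python with SciPy, whose curve_fit routine uses a MINPACK Levenberg–Marquardt solver for unconstrained problems; the synthetic data and the starting guess below are illustrative assumptions on our part, not the exact setup of the Octave ''leasqr'' example.

    import numpy as np
    from scipy.optimize import curve_fit

    def f(x, a, b):
        # Model from the example: y = a*cos(b*x) + b*sin(a*x).
        return a * np.cos(b * x) + b * np.sin(a * x)

    # Synthetic data with "true" parameters a = 100, b = 102 (illustrative).
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 0.1, 200)
    y = f(x, 100.0, 102.0) + 0.1 * rng.standard_normal(x.size)

    # The fit converges only when the initial guess is already close to the true
    # values, reflecting the sensitivity to initial conditions described above.
    popt, pcov = curve_fit(f, x, y, p0=[100.5, 102.5], method="lm")
    print(popt)  # expected to be close to [100, 102]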


See also

* Trust region
* Nelder–Mead method
* Variants of the Levenberg–Marquardt algorithm have also been used for solving nonlinear systems of equations.


External links

* Detailed description of the algorithm can be found in ''Numerical Recipes in C'', Chapter 15.5: Nonlinear models
* C. T. Kelley, ''Iterative Methods for Optimization'', SIAM Frontiers in Applied Mathematics, no. 18, 1999 (online copy)
* A tutorial by Ananth Ranganathan
* K. Madsen, H. B. Nielsen, O. Tingleff, ''Methods for Non-Linear Least Squares Problems'' (nonlinear least-squares tutorial; L-M code: analytic Jacobian, secant)
* T. Strutz: ''Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond).'' 2nd edition, Springer Vieweg, 2016.
* H. P. Gavin, ''The Levenberg-Marquardt method for nonlinear least-squares curve-fitting problems'' (MATLAB implementation included)