Limited-memory BFGS (L-BFGS or LM-BFGS) is an
optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
in the family of
quasi-Newton methods that approximates the
Broyden–Fletcher–Goldfarb–Shanno algorithm
In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the ...
(BFGS) using a limited amount of
computer memory
Computer memory stores information, such as data and programs, for immediate use in the computer. The term ''memory'' is often synonymous with the terms ''RAM,'' ''main memory,'' or ''primary storage.'' Archaic synonyms for main memory include ...
. It is a popular algorithm for parameter estimation in
machine learning
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
.
The algorithm's target problem is to minimize
over unconstrained values of the real-vector
where
is a differentiable scalar function.
Like the original BFGS, L-BFGS uses an estimate of the inverse
Hessian matrix
In mathematics, the Hessian matrix, Hessian or (less commonly) Hesse matrix is a square matrix of second-order partial derivatives of a scalar-valued Function (mathematics), function, or scalar field. It describes the local curvature of a functio ...
to steer its search through variable space, but where BFGS stores a dense
approximation to the inverse Hessian (''n'' being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with many variables. Instead of the inverse Hessian H''
k'', L-BFGS maintains a history of the past ''m'' updates of the position x and gradient ∇''f''(x), where generally the history size ''m'' can be small (often
). These updates are used to implicitly do operations requiring the H''
k''-vector product.
Algorithm
The algorithm starts with an initial estimate of the optimal value,
, and proceeds iteratively to refine that estimate with a sequence of better estimates
. The derivatives of the function
are used as a key driver of the algorithm to identify the direction of steepest descent, and also to form an estimate of the Hessian matrix (second derivative) of
.
L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication
is carried out, where
is the approximate Newton's direction,
is the current gradient, and
is the inverse of the Hessian matrix. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."
We take as given
, the position at the -th iteration, and
where
is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last updates of the form
:
:
.
We define
, and
will be the 'initial' approximate of the inverse Hessian that our estimate at iteration begins with.
The algorithm is based on the BFGS recursion for the inverse Hessian as
:
For a fixed we define a sequence of vectors
as
and
. Then a recursive algorithm for calculating
from
is to define
and
. We also define another sequence of vectors
as
. There is another recursive algorithm for calculating these vectors which is to define
and then recursively define
and
. The value of
is then our ascent direction.
Thus we can compute the
descent direction as follows:
:
This formulation gives the search direction for the minimization problem, i.e.,
. For maximization problems, one should thus take instead. Note that the initial approximate inverse Hessian
is chosen as a
diagonal matrix
In linear algebra, a diagonal matrix is a matrix in which the entries outside the main diagonal are all zero; the term usually refers to square matrices. Elements of the main diagonal can either be zero or nonzero. An example of a 2×2 diagon ...
or even a multiple of the identity matrix since this is numerically efficient.
The scaling of the initial matrix
ensures that the search direction is well scaled and therefore the unit step length is accepted in most iterations. A
Wolfe line search is used to ensure that the curvature condition is satisfied and the BFGS updating is stable. Note that some software implementations use an Armijo
backtracking line search, but cannot guarantee that the curvature condition
will be satisfied by the chosen step since a step length greater than
may be needed to satisfy this condition. Some implementations address this by skipping the BFGS update when
is negative or too close to zero, but this approach is not generally recommended since the updates may be skipped too often to allow the Hessian approximation
to capture important curvature information. Some solvers employ so called damped (L)BFGS update which modifies quantities
and
in order to satisfy the curvature condition.
The two-loop recursion formula is widely used by unconstrained optimizers due to its efficiency in multiplying by the inverse Hessian. However, it does not allow for the explicit formation of either the direct or inverse Hessian and is incompatible with non-box constraints. An alternative approach is the
compact representation, which involves a low-rank representation for the direct and/or inverse Hessian. This represents the Hessian as a sum of a diagonal matrix and a low-rank update. Such a representation enables the use of L-BFGS in constrained settings, for example, as part of the SQP method.
Applications
L-BFGS has been called "the algorithm of choice" for fitting
log-linear (MaxEnt) models and
conditional random field
Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without consi ...
s with
-regularization.
Variants
Since BFGS (and hence L-BFGS) is designed to minimize
smooth functions without
constraints, the L-BFGS algorithm must be modified to handle functions that include non-
differentiable
In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non- vertical tangent line at each interior point in ...
components or constraints. A popular class of modifications are called active-set methods, based on the concept of the
active set
In mathematical optimization, the active-set method is an algorithm used to identify the active constraints in a set of inequality constraints. The active constraints are then expressed as equality constraints, thereby transforming an inequalit ...
. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.
L-BFGS-B
The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form where and are per-variable constant lower and upper bounds, respectively (for each , either or both bounds may be omitted).
The method works by identifying fixed and free variables at every step (using a simple gradient method), and then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process.
OWL-QN
Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting
-
regularized models, exploiting the inherent
sparsity
In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. There is no strict definition regarding the proportion of zero-value elements for a matrix to qualify as sparse ...
of such models.
It minimizes functions of the form
:
where
is a
differentiable
In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non- vertical tangent line at each interior point in ...
convex
Convex or convexity may refer to:
Science and technology
* Convex lens, in optics
Mathematics
* Convex set, containing the whole line segment that joins points
** Convex polygon, a polygon which encloses a convex set of points
** Convex polytop ...
loss function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
. The method is an active-set type method: at each iterate, it estimates the
sign
A sign is an object, quality, event, or entity whose presence or occurrence indicates the probable presence or occurrence of something else. A natural sign bears a causal relation to its object—for instance, thunder is a sign of storm, or me ...
of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable
term becomes a smooth linear term which can be handled by L-BFGS. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.
O-LBFGS
Schraudolph ''et al.'' present an
online
In computer technology and telecommunications, online indicates a state of connectivity, and offline indicates a disconnected state. In modern terminology, this usually refers to an Internet connection, but (especially when expressed as "on lin ...
approximation to both BFGS and L-BFGS. Similar to
stochastic gradient descent
Stochastic gradient descent (often abbreviated SGD) is an Iterative method, iterative method for optimizing an objective function with suitable smoothness properties (e.g. Differentiable function, differentiable or Subderivative, subdifferentiable ...
, this can be used to reduce the computational complexity by evaluating the
error function
In mathematics, the error function (also called the Gauss error function), often denoted by , is a function \mathrm: \mathbb \to \mathbb defined as:
\operatorname z = \frac\int_0^z e^\,\mathrm dt.
The integral here is a complex Contour integrat ...
and gradient on a randomly drawn subset of the overall dataset in each iteration. It has been shown that O-LBFGS has a global almost sure convergence while the online approximation of BFGS (O-BFGS) is not necessarily convergent.
Implementation of variants
Notable open source implementations include:
*
ALGLIB
ALGLIB is a cross-platform open source numerical analysis and data processing library. It can be used from several programming languages ( C++, C#, VB.NET, Python, Delphi, Java).
ALGLIB started in 1999 and has a long history of steady developm ...
implements L-BFGS in C++ and C# as well as a separate box/linearly constrained version, BLEIC.
*
R's
optim
general-purpose optimizer routine uses the L-BFGS-B method.
*
SciPy
SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing.
SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier ...
's optimization module's minimize method also includes an option to use L-BFGS-B.
*
Julia's
Optim.jl
also implements the L-BFGS and L-BFGS-B algorithm.
Notable non open source implementations include:
* The L-BFGS-B variant also exists as ACM TOMS algorithm 778.
In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).
* A reference implementation in
Fortran 77 (and with a
Fortran 90 interface).
This version, as well as older versions, has been converted to many other languages.
* An OWL-QN C++ implementation by its designers.
Works cited
Further reading
*
*
*
{{Optimization algorithms, unconstrained
Optimization algorithms and methods