The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, the reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function (taken over the same domain).


Problem statement

Suppose \mathcal{D} is a compact convex set in a vector space and f \colon \mathcal{D} \to \mathbb{R} is a convex, differentiable real-valued function. The Frank–Wolfe algorithm solves the optimization problem

:Minimize f(\mathbf{x})
:subject to \mathbf{x} \in \mathcal{D}.


Algorithm

:''Initialization:'' Let k \leftarrow 0, and let \mathbf{x}_0 be any point in \mathcal{D}.
:Step 1. ''Direction-finding subproblem:'' Find \mathbf{s}_k solving
::Minimize \mathbf{s}^T \nabla f(\mathbf{x}_k)
::Subject to \mathbf{s} \in \mathcal{D}
:''(Interpretation: Minimize the linear approximation of the problem given by the first-order Taylor approximation of f around \mathbf{x}_k, constrained to stay within \mathcal{D}.)''
:Step 2. ''Step size determination:'' Set \alpha \leftarrow \tfrac{2}{k+2}, or alternatively find \alpha that minimizes f(\mathbf{x}_k+\alpha(\mathbf{s}_k-\mathbf{x}_k)) subject to 0 \le \alpha \le 1.
:Step 3. ''Update:'' Let \mathbf{x}_{k+1} \leftarrow \mathbf{x}_k+\alpha(\mathbf{s}_k-\mathbf{x}_k), let k \leftarrow k+1, and go to Step 1.
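
To make the three steps concrete, the following minimal Python sketch treats the special case where \mathcal{D} is the probability simplex, so the direction-finding subproblem is solved by selecting a single vertex. The function names (frank_wolfe_simplex, grad_f) and the least-squares example are illustrative assumptions, not part of the original description.

 import numpy as np
 
 def frank_wolfe_simplex(grad_f, x0, max_iter=1000, tol=1e-6):
     """Minimal Frank-Wolfe sketch for minimizing f over the probability simplex.
 
     grad_f : callable returning the gradient of f at a point x
     x0     : feasible starting point (nonnegative entries summing to 1)
     """
     x = np.asarray(x0, dtype=float)
     for k in range(max_iter):
         g = grad_f(x)
         # Step 1 (direction-finding): minimize s^T g over the simplex.
         # The minimum is attained at the vertex e_i with i = argmin_i g_i.
         s = np.zeros_like(x)
         s[np.argmin(g)] = 1.0
         # Frank-Wolfe gap g^T (x - s) >= f(x) - f(x*): used as stopping criterion.
         if g @ (x - s) <= tol:
             break
         # Step 2 (step size): the default schedule 2 / (k + 2).
         alpha = 2.0 / (k + 2.0)
         # Step 3 (update): move toward the vertex; the iterate stays feasible.
         x = x + alpha * (s - x)
     return x
 
 # Example use: least squares over the simplex, f(x) = 0.5 * ||A x - b||^2.
 rng = np.random.default_rng(0)
 A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
 grad_f = lambda x: A.T @ (A @ x - b)
 x_opt = frank_wolfe_simplex(grad_f, x0=np.full(5, 0.2))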


Properties

While competing methods such as gradient descent for constrained optimization require a projection step back to the feasible set in each iteration, the Frank–Wolfe algorithm only needs the solution of a linear problem over the same set in each iteration, and automatically stays in the feasible set.

The convergence of the Frank–Wolfe algorithm is sublinear in general: the error in the objective function to the optimum is O(1/k) after ''k'' iterations, so long as the gradient is Lipschitz continuous with respect to some norm. The same convergence rate can also be shown if the sub-problems are only solved approximately.

The iterates of the algorithm can always be represented as a sparse convex combination of the extreme points of the feasible set, which has contributed to the popularity of the algorithm for sparse greedy optimization in machine learning and signal processing problems, as well as, for example, the optimization of minimum-cost flows in transportation networks. If the feasible set is given by a set of linear constraints, then the subproblem to be solved in each iteration becomes a linear program.

While the worst-case convergence rate of O(1/k) cannot be improved in general, faster convergence can be obtained for special problem classes, such as some strongly convex problems.
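
As an illustration of such a subproblem with a closed-form solution, consider the \ell_1 ball of a given radius (a polytope, so the subproblem is a linear program): the minimizer of the linearized objective is always a signed, scaled coordinate vector, which is one reason the iterates stay sparse. The small Python helper below is a sketch under that assumption; the name lmo_l1_ball is made up for this example.

 import numpy as np
 
 def lmo_l1_ball(grad, radius=1.0):
     """Linear minimization oracle for the l1 ball {s : ||s||_1 <= radius}.
 
     Minimizing s^T grad over the l1 ball is attained at a signed, scaled
     coordinate vector, so the returned point is 1-sparse.
     """
     s = np.zeros_like(grad, dtype=float)
     i = np.argmax(np.abs(grad))
     s[i] = -radius * np.sign(grad[i])
     return s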


Lower bounds on the solution value, and primal-dual analysis

Since f is convex, for any two points \mathbf{x}, \mathbf{y} \in \mathcal{D} we have:

: f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x})

This also holds for the (unknown) optimal solution \mathbf{x}^*. That is, f(\mathbf{x}^*) \ge f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x}). The best lower bound with respect to a given point \mathbf{x} is given by

: \begin{align} f(\mathbf{x}^*) & \ge f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x}) \\ &\geq \min_{\mathbf{y} \in \mathcal{D}} \left\{ f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) \right\} \\ &= f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) + \min_{\mathbf{y} \in \mathcal{D}} \mathbf{y}^T \nabla f(\mathbf{x}) \end{align}

The latter optimization problem is solved in every iteration of the Frank–Wolfe algorithm, therefore the solution \mathbf{s}_k of the direction-finding subproblem of the k-th iteration can be used to determine increasing lower bounds l_k during each iteration by setting l_0 = -\infty and

: l_k := \max (l_{k-1}, f(\mathbf{x}_k) + (\mathbf{s}_k - \mathbf{x}_k)^T \nabla f(\mathbf{x}_k))

Such lower bounds on the unknown optimal value are important in practice because they can be used as a stopping criterion, and give an efficient certificate of the approximation quality in every iteration, since always l_k \leq f(\mathbf{x}^*) \leq f(\mathbf{x}_k).

It has been shown that this corresponding duality gap, that is the difference between f(\mathbf{x}_k) and the lower bound l_k, decreases with the same convergence rate, i.e. f(\mathbf{x}_k) - l_k = O(1/k).
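
In practice the running lower bound l_k and the resulting certified gap f(\mathbf{x}_k) - l_k take only a few lines to track inside the main loop. The sketch below assumes NumPy arrays for the iterate, the subproblem solution, and the gradient; the helper name fw_lower_bound is hypothetical.

 import numpy as np
 
 def fw_lower_bound(l_prev, x_k, s_k, grad_k, f_xk):
     """Update the running lower bound l_k from the direction-finding solution.
 
     l_prev : previous bound l_{k-1} (initialize with -np.inf)
     x_k    : current iterate
     s_k    : solution of the linear subproblem at x_k
     grad_k : gradient of f at x_k
     f_xk   : objective value f(x_k)
     """
     return max(l_prev, f_xk + (s_k - x_k) @ grad_k)
 
 # Inside a Frank-Wolfe loop one would track, for example:
 #   l = fw_lower_bound(l, x, s, g, f(x))
 #   gap = f(x) - l          # certified gap, since l <= f(x*) <= f(x)
 #   if gap <= tol:
 #       break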




External links

* https://conditional-gradients.org/: a survey of Frank–Wolfe algorithms.
* Marguerite Frank giving a personal account of the history of the algorithm


See also

* Proximal gradient methods