Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for
numerical optimization.
Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation (via recombination and mutation) and selection: in each generation (iteration) new individuals (candidate solutions, denoted as $x$) are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $f(x)$. Like this, over the generation sequence, individuals with better and better $f$-values are generated.
In an evolution strategy, new candidate solutions are sampled according to a multivariate normal distribution in $\mathbb{R}^n$. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a covariance matrix. The covariance matrix adaptation (CMA) is a method to update the covariance matrix of this distribution. This is particularly useful if the function $f$ is ill-conditioned.
Adaptation of the covariance matrix amounts to learning a second-order model of the underlying objective function, similar to the approximation of the inverse Hessian matrix in the quasi-Newton method of classical optimization. In contrast to most classical methods, fewer assumptions on the underlying objective function are made. Because only a ranking (or, equivalently, sorting) of candidate solutions is exploited, neither derivatives nor even an (explicit) objective function is required by the method. For example, the ranking could come about from pairwise competitions between the candidate solutions in a Swiss-system tournament.
Principles
Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.
First, a maximum-likelihood principle, based on the idea of increasing the probability of successful candidate solutions and search steps. The mean of the distribution is updated such that the likelihood of previously successful candidate solutions is maximized. The covariance matrix of the distribution is updated (incrementally) such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as a natural gradient descent. Also, in consequence, the CMA conducts an iterated principal components analysis of successful search steps while retaining ''all'' principal axes. Estimation of distribution algorithms and the Cross-Entropy Method are based on very similar ideas, but estimate (non-incrementally) the covariance matrix by maximizing the likelihood of successful solution ''points'' instead of successful search ''steps''.
Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase of favorable directions. The other path is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively prevents
premature convergence while still allowing fast convergence to an optimum.
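To make the evolution-path idea concrete, the following illustrative Python fragment (not part of the algorithm's specification; the learning rate c = 0.3 is an arbitrary choice for the example) shows how an exponentially fading accumulation of consecutive steps grows long when the steps point in a similar direction and stays short when they cancel out:

 import numpy as np

 def accumulate_path(steps, c=0.3):
     """Exponentially fading accumulation of consecutive steps,
     analogous to an evolution path with learning rate c."""
     p = np.zeros(steps.shape[1])
     for s in steps:
         # the normalization keeps the path comparable to a single step
         p = (1 - c) * p + np.sqrt(c * (2 - c)) * s
     return p

 aligned = np.tile([1.0, 0.0], (20, 1))                       # same direction
 opposing = np.array([[(-1.0) ** k, 0.0] for k in range(20)])  # alternating
 print(np.linalg.norm(accumulate_path(aligned)))    # long path, ~ sqrt((2-c)/c)
 print(np.linalg.norm(accumulate_path(opposing)))   # short path, < 1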
Algorithm
In the following the most commonly used $(\mu/\mu_w,\, \lambda)$-CMA-ES is outlined, where in each iteration step a weighted combination of the $\mu$ best out of $\lambda$ new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. A pseudocode of the algorithm looks as follows.
 set λ   // number of samples per iteration, at least two, generally > 4
 initialize m, σ, C = I, p_σ = 0, p_c = 0   // initialize state variables
 while ''not terminate'' do   // iterate
     for i in {1...λ} do   // sample λ new solutions and evaluate them
         x_i ← sample_multivariate_normal(mean = m, covariance_matrix = σ² C)
         f_i ← fitness(x_i)
     x_1...λ ← x_s(1)...s(λ) with s(i) = argsort(f_1...λ, i)   // sort solutions
     m′ ← m   // we need later m − m′ and x_i − m′
     m ← update_m(x_1, ..., x_λ)   // move mean to better solutions
     p_σ ← update_ps(p_σ, σ⁻¹ C^(−1/2) (m − m′))   // update isotropic evolution path
     p_c ← update_pc(p_c, σ⁻¹ (m − m′), ‖p_σ‖)   // update anisotropic evolution path
     C ← update_C(C, p_c, (x_1 − m′)/σ, ..., (x_λ − m′)/σ)   // update covariance matrix
     σ ← update_sigma(σ, ‖p_σ‖)   // update step-size using isotropic path length
 return m or x_1
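For concreteness, the pseudocode can be turned into a short, self-contained NumPy sketch. The parameter settings below (population size, weights, learning rates, damping) follow commonly published default formulas, but this is an illustrative sketch rather than a reference implementation; in particular, it omits the small variance-loss correction applied when the anisotropic path update is stalled.

 import numpy as np

 def cma_es(f, m, sigma, max_iter=500, seed=0):
     """Minimal (mu/mu_w, lambda)-CMA-ES sketch following the pseudocode above."""
     n = len(m)
     rng = np.random.default_rng(seed)
     lam = 4 + int(3 * np.log(n))              # population size lambda
     mu = lam // 2                             # number of selected parents
     w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
     w /= w.sum()                              # positive weights summing to one
     mu_w = 1.0 / np.sum(w ** 2)               # variance effective selection mass
     # learning rates and damping (common default choices)
     c_sigma = (mu_w + 2) / (n + mu_w + 5)
     d_sigma = 1 + 2 * max(0, np.sqrt((mu_w - 1) / (n + 1)) - 1) + c_sigma
     c_c = (4 + mu_w / n) / (n + 4 + 2 * mu_w / n)
     c_1 = 2 / ((n + 1.3) ** 2 + mu_w)
     c_mu = min(1 - c_1, 2 * (mu_w - 2 + 1 / mu_w) / ((n + 2) ** 2 + mu_w))
     p_sigma, p_c = np.zeros(n), np.zeros(n)   # evolution paths
     C = np.eye(n)                             # covariance matrix
     chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # E||N(0,I)||
     for _ in range(max_iter):
         # C^(1/2) and C^(-1/2) via eigendecomposition
         eigvals, B = np.linalg.eigh(C)
         D = np.sqrt(np.maximum(eigvals, 1e-20))
         C_half = B @ np.diag(D) @ B.T
         C_inv_half = B @ np.diag(1 / D) @ B.T
         # 1) sample lambda new solutions and evaluate them
         z = rng.standard_normal((lam, n))
         x = m + sigma * z @ C_half
         fx = np.array([f(xi) for xi in x])
         # 2) sort by objective value (minimization)
         x_sel = x[np.argsort(fx)][:mu]
         # 3) update the five state variables, m first, sigma last
         m_old = m
         m = w @ x_sel                         # move mean to better solutions
         y = (m - m_old) / sigma
         p_sigma = (1 - c_sigma) * p_sigma \
             + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * (C_inv_half @ y)
         h = 1.0 if np.linalg.norm(p_sigma) <= 1.5 * np.sqrt(n) else 0.0
         p_c = (1 - c_c) * p_c + h * np.sqrt(1 - (1 - c_c) ** 2) * np.sqrt(mu_w) * y
         Y = (x_sel - m_old) / sigma
         rank_mu = sum(wi * np.outer(yi, yi) for wi, yi in zip(w, Y))
         C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * rank_mu
         sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_n - 1))
     return m

 # usage: minimize a 10-dimensional convex quadratic
 print(cma_es(lambda x: np.sum(x ** 2), m=np.ones(10), sigma=0.5))

On a smooth unimodal function such as this sphere function, the sketch should approach the optimum well within the given iteration budget.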
The order of the five update assignments is relevant: $m$ must be updated first, $p_\sigma$ and $p_c$ must be updated before $C$, and $\sigma$ must be updated last. The update equations for the five state variables are specified in the following.
Given are the search space dimension $n$ and the iteration step $k$. The five state variables are
: $m_k \in \mathbb{R}^n$, the distribution mean and current favorite solution to the optimization problem,
: $\sigma_k > 0$, the step-size,
: $C_k \in \mathbb{R}^{n \times n}$, a symmetric and positive-definite covariance matrix with $C_0 = I$, and
: $p_\sigma \in \mathbb{R}^n$ and $p_c \in \mathbb{R}^n$, two evolution paths, initially set to the zero vector.
The iteration starts with sampling $\lambda > 1$ candidate solutions $x_i \in \mathbb{R}^n$ from a multivariate normal distribution $\mathcal{N}(m_k, \sigma_k^2 C_k)$, i.e. for $i = 1, \ldots, \lambda$
:: $x_i \ \sim\ \mathcal{N}(m_k, \sigma_k^2 C_k)$
:: $x_i \ \sim\ m_k + \sigma_k \times \mathcal{N}(0, C_k)$
The second line suggests the interpretation as unbiased perturbation (mutation) of the current favorite solution vector $m_k$ (the distribution mean vector). The candidate solutions $x_i$ are evaluated on the objective function $f : \mathbb{R}^n \to \mathbb{R}$ to be minimized. Denoting the $f$-sorted candidate solutions as
: $x_{1:\lambda}, \ldots, x_{\lambda:\lambda}$ with $f(x_{1:\lambda}) \le f(x_{2:\lambda}) \le \cdots \le f(x_{\lambda:\lambda}),$
the new mean value is computed as
:: $m_{k+1} = \sum_{i=1}^{\mu} w_i\, x_{i:\lambda} = m_k + \sum_{i=1}^{\mu} w_i\, (x_{i:\lambda} - m_k)$
where the positive (recombination) weights $w_1 \ge w_2 \ge \cdots \ge w_\mu > 0$ sum to one. Typically, $\mu \le \lambda/2$ and the weights are chosen such that $\mu_w := \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1} \approx \lambda/4$. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices $i : \lambda$.
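As an illustration of one common (but not the only) choice for these weights, the following lines compute log-linearly decreasing weights for $\mu = \lambda/2$ and the resulting $\mu_w$; the population size lam = 12 is assumed for the example:

 import numpy as np

 lam = 12                                    # population size (assumed)
 mu = lam // 2                               # number of parents selected
 # log-linearly decreasing raw weights, a common default choice
 w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
 w /= w.sum()                                # positive weights summing to one
 mu_w = 1.0 / np.sum(w ** 2)                 # variance effective selection mass
 print(w, mu_w)                              # mu_w ≈ 3.7 here, on the order of lam/4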
The step-size $\sigma_k$ is updated using ''cumulative step-size adaptation'' (CSA), sometimes also denoted as ''path length control''. The evolution path (or search path) $p_\sigma$ is updated first.
:: $p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\; C_k^{-1/2}\, \frac{m_{k+1} - m_k}{\sigma_k}$
:: $\sigma_{k+1} = \sigma_k \times \exp\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right)$
where
: $c_\sigma^{-1} \approx n/3$ is the backward time horizon for the evolution path $p_\sigma$ and larger than one ($c_\sigma \ll 1$ is reminiscent of an exponential decay constant as $(1 - c_\sigma)^k \approx \exp(-c_\sigma k)$, where $c_\sigma^{-1}$ is the associated lifetime and $c_\sigma^{-1} \ln 2 \approx 0.7\, c_\sigma^{-1}$ the half-life),
: $\mu_w = \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1}$ is the variance effective selection mass and $1 \le \mu_w \le \mu$ by definition of $w_i$,
: $C_k^{-1/2} = \sqrt{C_k}^{\,-1} = \sqrt{C_k^{-1}}$ is the unique symmetric square root of the inverse of $C_k$, and
: $d_\sigma$ is the damping parameter, usually close to one. For $d_\sigma = \infty$ or $c_\sigma = 0$ the step-size remains unchanged.
The step-size $\sigma_k$ is increased if and only if $\|p_\sigma\|$ is larger than the expected value
: $E\|\mathcal{N}(0, I)\| = \sqrt{2}\, \Gamma\!\left(\tfrac{n+1}{2}\right) / \Gamma\!\left(\tfrac{n}{2}\right) \approx \sqrt{n}\, \left(1 - \tfrac{1}{4n} + \tfrac{1}{21 n^2}\right)$
and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps $C_k^{-1}$-conjugate, in that after the adaptation has been successful $\left(\frac{m_{k+2} - m_{k+1}}{\sigma_{k+1}}\right)^{\!T} C_k^{-1}\, \frac{m_{k+1} - m_k}{\sigma_k} \approx 0$.
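A direct transcription of these two update equations into NumPy might look as follows; the variable names mirror the symbols above, and the learning rate, damping and matrix square root are assumed to be supplied by the caller:

 import numpy as np

 def csa_update(p_sigma, sigma, m_new, m_old, C_inv_half, mu_w, c_sigma, d_sigma, n):
     """One cumulative step-size adaptation (CSA) step; a sketch of the
     two equations above, not a reference implementation."""
     # update the isotropic evolution path
     y = C_inv_half @ (m_new - m_old) / sigma
     p_sigma = (1 - c_sigma) * p_sigma \
         + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * y
     # E||N(0,I)|| via the usual approximation
     chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
     # lengthen sigma if the path is longer than expected, shorten otherwise
     sigma = sigma * np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_n - 1))
     return p_sigma, sigma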
Finally, the covariance matrix is updated, where again the respective evolution path is updated first.
:: $p_c \leftarrow (1 - c_c)\, p_c + \mathbf{1}_{[0,\, \alpha\sqrt{n}]}(\|p_\sigma\|)\, \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\; \frac{m_{k+1} - m_k}{\sigma_k}$
:: $C_{k+1} = (1 - c_1 - c_\mu + c_s)\, C_k + c_1\, p_c p_c^T + c_\mu \sum_{i=1}^{\mu} w_i\, \frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^{\!T}$
where $^T$ denotes the transpose and
: $c_c^{-1} \approx n/4$ is the backward time horizon for the evolution path $p_c$ and larger than one,
: $\alpha \approx 1.5$ and the indicator function $\mathbf{1}_{[0,\, \alpha\sqrt{n}]}(\|p_\sigma\|)$ evaluates to one iff $\|p_\sigma\| \in [0, \alpha\sqrt{n}]$ or, in other words, $\|p_\sigma\| \le \alpha\sqrt{n}$, which is usually the case.
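Transcribing these two updates into NumPy, under the same naming assumptions as the step-size sketch above and omitting the small variance-loss correction $c_s$ for brevity, might look like this:

 import numpy as np

 def covariance_update(C, p_c, p_sigma, m_new, m_old, sigma, x_sel, w,
                       mu_w, c_c, c_1, c_mu, n, alpha=1.5):
     """One covariance matrix adaptation step: rank-one update from the
     evolution path p_c plus rank-mu update from the selected steps.
     A sketch; the c_s variance-loss correction is omitted."""
     # anisotropic evolution path, stalled when p_sigma is too long
     indicator = 1.0 if np.linalg.norm(p_sigma) <= alpha * np.sqrt(n) else 0.0
     p_c = (1 - c_c) * p_c + indicator * np.sqrt(1 - (1 - c_c) ** 2) \
           * np.sqrt(mu_w) * (m_new - m_old) / sigma
     # rank-mu term: weighted outer products of the mu best steps
     Y = (x_sel - m_old) / sigma                  # shape (mu, n)
     rank_mu = sum(wi * np.outer(yi, yi) for wi, yi in zip(w, Y))
     # combine old matrix, rank-one term and rank-mu term
     C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * rank_mu
     return C, p_c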