Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for
numerical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
.
Evolution strategies
In computer science, an evolution strategy (ES) is an optimization technique based on ideas of evolution. It belongs to the general class of evolutionary computation or artificial evolution methodologies.
History
The 'evolution strategy' optimiza ...
(ES) are
stochastic
Stochastic (, ) refers to the property of being well described by a random probability distribution. Although stochasticity and randomness are distinct in that the former refers to a modeling approach and the latter refers to phenomena themselv ...
,
derivative-free methods for
numerical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
of non-
linear
Linearity is the property of a mathematical relationship (''function'') that can be graphically represented as a straight line. Linearity is closely related to '' proportionality''. Examples in physics include rectilinear motion, the linear r ...
or non-
convex
Convex or convexity may refer to:
Science and technology
* Convex lens, in optics
Mathematics
* Convex set, containing the whole line segment that joins points
** Convex polygon, a polygon which encloses a convex set of points
** Convex polytope ...
continuous optimization
Continuous optimization is a branch of optimization in applied mathematics.
As opposed to discrete optimization, the variables used in the objective function are required to be continuous variables—that is, to be chosen from a set of rea ...
problems. They belong to the class of
evolutionary algorithms
In computational intelligence (CI), an evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduc ...
and
evolutionary computation
In computer science, evolutionary computation is a family of algorithms for global optimization inspired by biological evolution, and the subfield of artificial intelligence and soft computing studying these algorithms. In technical terms, they ...
. An
evolutionary algorithm
In computational intelligence (CI), an evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduc ...
is broadly based on the principle of
biological evolution
Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation t ...
, namely the repeated interplay of variation (via recombination and mutation) and selection: in each generation (iteration) new individuals (candidate solutions, denoted as
) are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or
objective function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...
value
. Like this, over the generation sequence, individuals with better and better
-values are generated.
In an
evolution strategy
In computer science, an evolution strategy (ES) is an optimization technique based on ideas of evolution. It belongs to the general class of evolutionary computation or artificial evolution methodologies.
History
The 'evolution strategy' optimizat ...
, new candidate solutions are sampled according to a
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One d ...
in
. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
. The covariance matrix adaptation (CMA) is a method to update the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
of this distribution. This is particularly useful if the function
is
ill-conditioned
In numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the input ...
.
Adaptation of the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
amounts to learning a second order model of the underlying
objective function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...
similar to the approximation of the inverse
Hessian matrix
In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed ...
in the
quasi-Newton method Quasi-Newton methods are methods used to either find zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. ...
in classical
optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
. In contrast to most classical methods, fewer assumptions on the underlying objective function are made. Because only a ranking (or, equivalently, sorting) of candidate solutions is exploited, neither derivatives nor even an (explicit) objective function is required by the method. For example, the ranking could come about from pairwise competitions between the candidate solutions in a
Swiss-system tournament
A Swiss-system tournament is a non-eliminating tournament format that features a fixed number of rounds of competition, but considerably fewer than for a round-robin tournament; thus each competitor (team or individual) does not play all the other ...
.
Principles
Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.
First, a
maximum-likelihood
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed stati ...
principle, based on the idea to increase the probability of successful candidate solutions and search steps. The mean of the distribution is updated such that the
likelihood
The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
of previously successful candidate solutions is maximized. The
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
of the distribution is updated (incrementally) such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as a
natural gradient descent. Also, in consequence, the CMA conducts an iterated
principal components analysis
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
of successful search steps while retaining ''all'' principal axes.
Estimation of distribution algorithms and the
Cross-Entropy Method The cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective.
The method approximates the optimal importance ...
are based on very similar ideas, but estimate (non-incrementally) the covariance matrix by maximizing the likelihood of successful solution ''points'' instead of successful search ''steps''.
Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase of favorable directions. The other path is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively prevents
premature convergence yet allowing fast convergence to an optimum.
Algorithm
In the following the most commonly used (''μ''/''μ''
''w'', ''λ'')-CMA-ES is outlined, where in each iteration step a weighted combination of the ''μ'' best out of ''λ'' new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. A
pseudocode
In computer science, pseudocode is a plain language description of the steps in an algorithm or another system. Pseudocode often uses structural conventions of a normal programming language, but is intended for human reading rather than machine re ...
of the algorithm looks as follows.
set
// number of samples per iteration, at least two, generally > 4
initialize
,
,
,
,
// initialize state variables
while ''not terminate'' do // iterate
for
in
do // sample
new solutions and evaluate them
sample_multivariate_normal(mean
, covariance_matrix
)
←
with
// sort solutions
// we need later
and
← update_m
// move mean to better solutions
← update_ps
// update isotropic evolution path
← update_pc
// update anisotropic evolution path
← update_C
// update covariance matrix
← update_sigma
// update step-size using isotropic path length
return
or
The order of the five update assignments is relevant:
must be updated first,
and
must be updated before
, and
must be updated last. The update equations for the five state variables are specified in the following.
Given are the search space dimension
and the iteration step
. The five state variables are
:
, the distribution mean and current favorite solution to the optimization problem,
:
, the step-size,
:
, a symmetric and
positive-definite In mathematics, positive definiteness is a property of any object to which a bilinear form or a sesquilinear form may be naturally associated, which is positive-definite. See, in particular:
* Positive-definite bilinear form
* Positive-definite fu ...
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
with
and
:
, two evolution paths, initially set to the zero vector.
The iteration starts with sampling
candidate solutions
from a
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One d ...
, i.e.
for
::
The second line suggests the interpretation as unbiased perturbation (mutation) of the current favorite solution vector
(the distribution mean vector). The candidate solutions
are evaluated on the objective function
to be minimized. Denoting the
-sorted candidate solutions as
:
the new mean value is computed as
::
where the positive (recombination) weights
sum to one. Typically,
and the weights are chosen such that
. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices
.
The step-size
is updated using ''cumulative step-size adaptation'' (CSA), sometimes also denoted as ''path length control''. The evolution path (or search path)
is updated first.
::
::
where
:
is the backward time horizon for the evolution path
and larger than one (
is reminiscent of an
exponential decay
A quantity is subject to exponential decay if it decreases at a rate proportional to its current value. Symbolically, this process can be expressed by the following differential equation, where is the quantity and (lambda) is a positive rate ...
constant as
where
is the associated lifetime and
the half-life),
:
is the variance effective selection mass and
by definition of
,
:
is the unique symmetric
square root
In mathematics, a square root of a number is a number such that ; in other words, a number whose ''square'' (the result of multiplying the number by itself, or ⋅ ) is . For example, 4 and −4 are square roots of 16, because .
E ...
of the
inverse of
, and
:
is the damping parameter usually close to one. For
or
the step-size remains unchanged.
The step-size
is increased if and only if
is larger than the
expected value
In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a l ...
:
and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps
-conjugate, in that after the adaptation has been successful
.
Finally, the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
is updated, where again the respective evolution path is updated first.
::
::
where
denotes the transpose and
:
is the backward time horizon for the evolution path
and larger than one,
:
and the
indicator function
In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
evaluates to one
iff
In logic and related fields such as mathematics and philosophy, "if and only if" (shortened as "iff") is a biconditional logical connective between statements, where either both statements are true or both are false.
The connective is bicon ...