CMA-ES

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation (via recombination and mutation) and selection: in each generation (iteration), new individuals (candidate solutions, denoted as x) are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value f(x). In this way, over the generation sequence, individuals with better and better f-values are generated.

In an evolution strategy, new candidate solutions are sampled according to a multivariate normal distribution in \mathbb{R}^n. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a covariance matrix. The covariance matrix adaptation (CMA) is a method to update the covariance matrix of this distribution. This is particularly useful if the function f is ill-conditioned.

Adaptation of the covariance matrix amounts to learning a second-order model of the underlying objective function, similar to the approximation of the inverse Hessian matrix in the quasi-Newton method in classical optimization. In contrast to most classical methods, fewer assumptions on the underlying objective function are made. Because only a ranking (or, equivalently, sorting) of candidate solutions is exploited, neither derivatives nor even an (explicit) objective function is required by the method. For example, the ranking could come about from pairwise competitions between the candidate solutions in a Swiss-system tournament.
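As a minimal illustration of this loop, the following Octave/MATLAB fragment performs a single sample, rank, and recombine generation on a toy sphere function; dimension, population size and weights are arbitrary example choices, and no covariance or step-size adaptation is performed.

n = 5; lambda = 12; mu = 6;              % example dimension, offspring and parent numbers
m = randn(n,1); sigma = 0.5; C = eye(n); % mean, step-size and covariance of the search distribution
f = @(x) sum(x.^2);                      % toy objective to be minimized
X = repmat(m,1,lambda) + sigma * chol(C,'lower') * randn(n,lambda); % lambda candidates ~ N(m, sigma^2 C)
fX = arrayfun(@(k) f(X(:,k)), 1:lambda); % evaluate all candidates
[~, idx] = sort(fX);                     % rank them, best first
w = ones(mu,1)/mu;                       % equal recombination weights (CMA-ES uses decreasing ones)
m = X(:, idx(1:mu)) * w;                 % recombination: the new mean favors the mu best samples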


Principles

Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.

First, a maximum-likelihood principle, based on the idea of increasing the probability of successful candidate solutions and search steps. The mean of the distribution is updated such that the likelihood of previously successful candidate solutions is maximized. The covariance matrix of the distribution is updated (incrementally) such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as a natural gradient descent. In consequence, the CMA conducts an iterated principal components analysis of successful search steps while retaining ''all'' principal axes. Estimation of distribution algorithms and the cross-entropy method are based on very similar ideas, but estimate the covariance matrix (non-incrementally) by maximizing the likelihood of successful solution ''points'' instead of successful search ''steps''.

Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase in favorable directions. The other path is used for an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively prevents premature convergence while allowing fast convergence to an optimum.


Algorithm

In the following the most commonly used (''μ''/''μ''''w'', ''λ'')-CMA-ES is outlined, where in each iteration step a weighted combination of the ''μ'' best out of ''λ'' new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. A pseudocode of the algorithm looks as follows.

set \lambda   // number of samples per iteration, at least two, generally > 4
initialize m, \sigma, C = I, p_\sigma = 0, p_c = 0   // initialize state variables
while ''not terminate'' do   // iterate
    for i in \{1 \ldots \lambda\} do   // sample \lambda new solutions and evaluate them
        x_i = sample_multivariate_normal(mean = m, covariance_matrix = \sigma^2 C)
        f_i = fitness(x_i)
    x_{1 \ldots \lambda} ← x_{s(1) \ldots s(\lambda)} with s(i) = argsort(f_{1 \ldots \lambda}, i)   // sort solutions
    m' = m   // we need later m - m' and x_i - m'
    m ← update_m(x_1, \ldots, x_\lambda)   // move mean to better solutions
    p_\sigma ← update_ps(p_\sigma, \sigma^{-1} C^{-1/2} (m - m'))   // update isotropic evolution path
    p_c ← update_pc(p_c, \sigma^{-1}(m - m'), \|p_\sigma\|)   // update anisotropic evolution path
    C ← update_C(C, p_c, (x_1 - m')/\sigma, \ldots, (x_\lambda - m')/\sigma)   // update covariance matrix
    \sigma ← update_sigma(\sigma, \|p_\sigma\|)   // update step-size using isotropic path length
return m or x_1

The order of the five update assignments is relevant: m must be updated first, p_\sigma and p_c must be updated before C, and \sigma must be updated last. The update equations for the five state variables are specified in the following.

Given are the search space dimension n and the iteration step k. The five state variables are

: m_k\in\mathbb{R}^n, the distribution mean and current favorite solution to the optimization problem,
: \sigma_k>0, the step-size,
: C_k, a symmetric and positive-definite n\times n covariance matrix with C_0 = I and
: p_\sigma\in\mathbb{R}^n, p_c\in\mathbb{R}^n, two evolution paths, initially set to the zero vector.

The iteration starts with sampling \lambda>1 candidate solutions x_i\in\mathbb{R}^n from a multivariate normal distribution \textstyle \mathcal{N}(m_k,\sigma_k^2 C_k), i.e. for i=1,\ldots,\lambda

:: \begin{align} x_i \ &\sim\ \mathcal{N}(m_k,\sigma_k^2 C_k) \\ &\sim\ m_k + \sigma_k\times\mathcal{N}(0,C_k) \end{align}

The second line suggests the interpretation as unbiased perturbation (mutation) of the current favorite solution vector m_k (the distribution mean vector). The candidate solutions x_i are evaluated on the objective function f:\mathbb{R}^n\to\mathbb{R} to be minimized. Denoting the f-sorted candidate solutions as

: \{x_{i:\lambda} \mid i=1,\ldots,\lambda\} = \{x_i \mid i=1,\ldots,\lambda\} \quad\text{and}\quad f(x_{1:\lambda})\le\dots\le f(x_{\mu:\lambda})\le f(x_{\mu+1:\lambda}) \le \cdots,

the new mean value is computed as

:: \begin{align} m_{k+1} &= \sum_{i=1}^\mu w_i\, x_{i:\lambda} \\ &= m_k + \sum_{i=1}^\mu w_i\, (x_{i:\lambda} - m_k) \end{align}

where the positive (recombination) weights w_1 \ge w_2 \ge \dots \ge w_\mu > 0 sum to one. Typically, \mu \le \lambda/2 and the weights are chosen such that \textstyle \mu_w := 1 / \sum_{i=1}^\mu w_i^2 \approx \lambda/4. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices i:\lambda.

The step-size \sigma_k is updated using ''cumulative step-size adaptation'' (CSA), sometimes also denoted as ''path length control''. The evolution path (or search path) p_\sigma is updated first.

:: p_\sigma \gets \underbrace{(1-c_\sigma)}_{\text{discount factor}}\, p_\sigma + \overbrace{\sqrt{1 - (1-c_\sigma)^2}}^{\text{complements for discounted variance}} \underbrace{\sqrt{\mu_w}\, C_k^{-1/2}\, \frac{m_{k+1} - m_k}{\sigma_k}}_{\text{distributed as } \mathcal{N}(0,I) \text{ under neutral selection}}

:: \sigma_{k+1} = \sigma_k \times \exp\bigg(\frac{c_\sigma}{d_\sigma} \underbrace{\bigg(\frac{\|p_\sigma\|}{\operatorname E\|\mathcal{N}(0,I)\|} - 1\bigg)}_{\text{unbiased about 0 under neutral selection}}\bigg)

where

: c_\sigma^{-1}\approx n/3 is the backward time horizon for the evolution path p_\sigma and larger than one (c_\sigma \ll 1 is reminiscent of an exponential decay constant as (1-c_\sigma)^k\approx\exp(-c_\sigma k), where c_\sigma^{-1} is the associated lifetime and c_\sigma^{-1}\ln(2)\approx0.7c_\sigma^{-1} the half-life),

: \mu_w=\left(\sum_{i=1}^\mu w_i^2\right)^{-1} is the variance effective selection mass and 1 \le \mu_w \le \mu by definition of w_i,

: C_k^{-1/2} = \sqrt{C_k}^{\,-1} = \sqrt{C_k^{-1}} is the unique symmetric square root of the inverse of C_k, and

: d_\sigma is the damping parameter, usually close to one. For d_\sigma=\infty or c_\sigma=0 the step-size remains unchanged.

The step-size \sigma_k is increased if and only if \|p_\sigma\| is larger than the expected value

:: \begin{align} \operatorname E\|\mathcal{N}(0,I)\| &= \sqrt{2}\,\Gamma((n+1)/2)/\Gamma(n/2) \\&\approx \sqrt{n}\,(1-1/(4\,n)+1/(21\,n^2)) \end{align}

and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps C_k^{-1}-conjugate, in that after the adaptation has been successful \textstyle\left(\frac{m_{k+2}-m_{k+1}}{\sigma_{k+1}}\right)^T\! C_k^{-1} \frac{m_{k+1}-m_k}{\sigma_k} \approx 0.

Finally, the covariance matrix is updated, where again the respective evolution path is updated first.

:: p_c \gets \underbrace{(1-c_c)}_{\text{discount factor}}\, p_c + \underbrace{\mathbf{1}_{[0,\alpha\sqrt{n}]}(\|p_\sigma\|)}_{\text{indicator function}} \overbrace{\sqrt{1 - (1-c_c)^2}}^{\text{complements for discounted variance}} \underbrace{\sqrt{\mu_w}\, \frac{m_{k+1} - m_k}{\sigma_k}}_{\text{distributed as } \mathcal{N}(0,C_k) \text{ under neutral selection}}

:: C_{k+1} = \underbrace{(1 - c_1 - c_\mu + c_s)}_{\text{discount factor}} \, C_k + c_1 \underbrace{p_c p_c^T}_{\text{rank-one matrix}} + \,c_\mu \underbrace{\sum_{i=1}^\mu w_i \frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^T}_{\text{rank-}\min(\mu,n)\text{ matrix}}

where T denotes the transpose and

: c_c^{-1}\approx n/4 is the backward time horizon for the evolution path p_c and larger than one,

: \alpha\approx 1.5 and the indicator function \mathbf{1}_{[0,\alpha\sqrt{n}]}(\|p_\sigma\|) evaluates to one iff \|p_\sigma\|\in[0,\alpha\sqrt{n}] or, in other words, \|p_\sigma\|\le\alpha\sqrt{n}, which is usually the case,

: c_s = (1 - \mathbf{1}_{[0,\alpha\sqrt{n}]}(\|p_\sigma\|)^2) \,c_1 c_c (2-c_c) makes partly up for the small variance loss in case the indicator is zero,

: c_1 \approx 2 / n^2 is the learning rate for the rank-one update of the covariance matrix and

: c_\mu \approx \mu_w / n^2 is the learning rate for the rank-\mu update of the covariance matrix and must not exceed 1 - c_1.

The covariance matrix update tends to increase the likelihood for p_c and for (x_{i:\lambda} - m_k)/\sigma_k to be sampled from \mathcal{N}(0,C_{k+1}). This completes the iteration step.

The number of candidate samples per iteration, \lambda, is not determined a priori and can vary in a wide range. Smaller values, for example \lambda=10, lead to more local search behavior. Larger values, for example \lambda=10n with default value \mu_w \approx \lambda/4, render the search more global. Sometimes the algorithm is repeatedly restarted with increasing \lambda by a factor of two for each restart. Besides setting \lambda (or possibly \mu instead, if for example \lambda is predetermined by the number of available processors), the above introduced parameters are not specific to the given objective function and therefore not meant to be modified by the user.
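The expected norm \operatorname E\|\mathcal{N}(0,I)\| appearing in the step-size rule above can be checked numerically; the following Octave/MATLAB fragment, with an arbitrarily chosen dimension, compares the exact Gamma-function expression, the approximation, and a Monte Carlo estimate.

n = 20;                                          % example dimension
exact  = sqrt(2) * gamma((n+1)/2) / gamma(n/2);  % E||N(0,I)|| via the Gamma function
approx = sqrt(n) * (1 - 1/(4*n) + 1/(21*n^2));   % approximation used in practice
mc     = mean(sqrt(sum(randn(n,1e5).^2, 1)));    % Monte Carlo estimate over 1e5 samples
disp([exact, approx, mc])                        % all three agree to about three digits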


Example code in MATLAB/Octave

function xmin=purecmaes   % (mu/mu_w, lambda)-CMA-ES
  % --------------------  Initialization --------------------------------
  % User defined input parameters (need to be edited)
  strfitnessfct = 'frosenbrock';  % name of objective/fitness function
  N = 20;               % number of objective variables/problem dimension
  xmean = rand(N,1);    % objective variables initial point
  sigma = 0.3;          % coordinate wise standard deviation (step size)
  stopfitness = 1e-10;  % stop if fitness < stopfitness (minimization)
  stopeval = 1e3*N^2;   % stop after stopeval number of function evaluations

  % Strategy parameter setting: Selection
  lambda = 4+floor(3*log(N));  % population size, offspring number
  mu = lambda/2;               % number of parents/points for recombination
  weights = log(mu+1/2)-log(1:mu)'; % muXone array for weighted recombination
  mu = floor(mu);
  weights = weights/sum(weights);     % normalize recombination weights array
  mueff=sum(weights)^2/sum(weights.^2); % variance-effectiveness of sum w_i x_i

  % Strategy parameter setting: Adaptation
  cc = (4+mueff/N) / (N+4 + 2*mueff/N);  % time constant for cumulation for C
  cs = (mueff+2) / (N+mueff+5);  % t-const for cumulation for sigma control
  c1 = 2 / ((N+1.3)^2+mueff);    % learning rate for rank-one update of C
  cmu = min(1-c1, 2 * (mueff-2+1/mueff) / ((N+2)^2+mueff));  % and for rank-mu update
  damps = 1 + 2*max(0, sqrt((mueff-1)/(N+1))-1) + cs; % damping for sigma
                                                      % usually close to 1
  % Initialize dynamic (internal) strategy parameters and constants
  pc = zeros(N,1); ps = zeros(N,1);   % evolution paths for C and sigma
  B = eye(N,N);                       % B defines the coordinate system
  D = ones(N,1);                      % diagonal D defines the scaling
  C = B * diag(D.^2) * B';            % covariance matrix C
  invsqrtC = B * diag(D.^-1) * B';    % C^-1/2
  eigeneval = 0;                      % track update of B and D
  chiN=N^0.5*(1-1/(4*N)+1/(21*N^2));  % expectation of
                                      %   ||N(0,I)|| == norm(randn(N,1))
  % -------------------- Generation Loop --------------------------------
  counteval = 0;  % the next 40 lines contain the 20 lines of interesting code
  while counteval < stopeval

    % Generate and evaluate lambda offspring
    for k=1:lambda
      arx(:,k) = xmean + sigma * B * (D .* randn(N,1)); % m + sig * Normal(0,C)
      arfitness(k) = feval(strfitnessfct, arx(:,k)); % objective function call
      counteval = counteval+1;
    end

    % Sort by fitness and compute weighted mean into xmean
    [arfitness, arindex] = sort(arfitness);  % minimization
    xold = xmean;
    xmean = arx(:,arindex(1:mu)) * weights;  % recombination, new mean value

    % Cumulation: Update evolution paths
    ps = (1-cs)*ps ...
          + sqrt(cs*(2-cs)*mueff) * invsqrtC * (xmean-xold) / sigma;
    hsig = norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN < 1.4 + 2/(N+1);
    pc = (1-cc)*pc ...
          + hsig * sqrt(cc*(2-cc)*mueff) * (xmean-xold) / sigma;

    % Adapt covariance matrix C
    artmp = (1/sigma) * (arx(:,arindex(1:mu))-repmat(xold,1,mu));
    C = (1-c1-cmu) * C ...                   % regard old matrix
         + c1 * (pc*pc' ...                  % plus rank one update
                 + (1-hsig) * cc*(2-cc) * C) ... % minor correction if hsig==0
         + cmu * artmp * diag(weights) * artmp'; % plus rank mu update

    % Adapt step size sigma
    sigma = sigma * exp((cs/damps)*(norm(ps)/chiN - 1));

    % Decomposition of C into B*diag(D.^2)*B' (diagonalization)
    if counteval - eigeneval > lambda/(c1+cmu)/N/10  % to achieve O(N^2)
      eigeneval = counteval;
      C = triu(C) + triu(C,1)'; % enforce symmetry
      [B,D] = eig(C);           % eigen decomposition, B == normalized eigenvectors
      D = sqrt(diag(D));        % D is a vector of standard deviations now
      invsqrtC = B * diag(D.^-1) * B';
    end

    % Break, if fitness is good enough or condition exceeds 1e14, better termination methods are advisable
    if arfitness(1) <= stopfitness || max(D) > 1e7 * min(D)
      break;
    end

  end % while, end generation loop

  xmin = arx(:, arindex(1)); % Return best point of last iteration.
                             % Notice that xmean is expected to be even
                             % better.
end
% ---------------------------------------------------------------
function f=frosenbrock(x)
  if size(x,1) < 2 error('dimension must be greater one'); end
  f = 100*sum((x(1:end-1).^2 - x(2:end)).^2) + sum((x(1:end-1)-1).^2);
end
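Assuming the listing above is saved as purecmaes.m on the Octave/MATLAB path, a typical call looks as follows; on the 20-dimensional Rosenbrock function the returned point is usually close to the optimum at (1, ..., 1), although convergence to the global optimum is not guaranteed on every run.

xmin = purecmaes();             % run the strategy on frosenbrock as configured above
disp(norm(xmin - ones(20,1)))   % distance to the known optimum, typically very small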


Theoretical foundations

Given the distribution parameters (mean, variances and covariances), the normal probability distribution for sampling new candidate solutions is the maximum entropy probability distribution over \mathbb{R}^n, that is, the sample distribution with the minimal amount of prior information built into the distribution. More considerations on the update equations of CMA-ES are made in the following.


Variable metric

The CMA-ES implements a stochastic variable-metric method. In the very particular case of a convex-quadratic objective function

:: f(x) = (x-x^*)^T H (x-x^*)

the covariance matrix C_k adapts to the inverse of the Hessian matrix H, up to a scalar factor and small random fluctuations. More generally, also on the function g \circ f, where g is strictly increasing and therefore order preserving, the covariance matrix C_k adapts to H^{-1}, up to a scalar factor and small random fluctuations. For selection ratio \lambda/\mu\to\infty, the \mu selected solutions yield an empirical covariance matrix reflective of the inverse Hessian even in evolution strategies without adaptation of the covariance matrix C_k. This result has been proven for \mu=1 on a static model, relying on the quadratic approximation.
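For illustration, consider the deliberately simple example n = 2 and H = \operatorname{diag}(1,100): the level sets of f are ellipses ten times longer along the first coordinate axis than along the second, and the adapted covariance matrix approaches

:: C_k \;\propto\; H^{-1} = \begin{bmatrix}1 & 0\\ 0 & 1/100\end{bmatrix}

so that the sampling standard deviations along the two axes differ by a factor of about ten, matching the shape of the level sets; after this implicit change of metric the problem becomes, for the strategy, essentially as easy as the well-conditioned sphere function.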


Maximum-likelihood updates

The update equations for mean and covariance matrix maximize a likelihood while resembling an expectation-maximization algorithm. The update of the mean vector m maximizes a log-likelihood, such that

:: m_{k+1} = \arg\max_m \sum_{i=1}^\mu w_i \log p_{\mathcal{N}}(x_{i:\lambda} \mid m)

where

:: \log p_{\mathcal{N}}(x) = -\frac{1}{2} \log\det(2\pi C) - \frac{1}{2} (x-m)^T C^{-1} (x-m)

denotes the log-likelihood of x from a multivariate normal distribution with mean m and any positive definite covariance matrix C. To see that m_{k+1} is independent of C, remark first that this is the case for any diagonal matrix C, because the coordinate-wise maximizer is independent of a scaling factor. Then, rotation of the data points or choosing C non-diagonal are equivalent. The rank-\mu update of the covariance matrix, that is, the right-most summand in the update equation of C_k, maximizes a log-likelihood in that

:: \sum_{i=1}^\mu w_i \frac{x_{i:\lambda} - m_k}{\sigma_k} \left( \frac{x_{i:\lambda} - m_k}{\sigma_k} \right)^T = \arg\max_{C} \sum_{i=1}^\mu w_i \log p_{\mathcal{N}}\left(\left.\frac{x_{i:\lambda} - m_k}{\sigma_k} \,\right|\, C\right)

for \mu\ge n (otherwise C is singular, but substantially the same result holds for \mu < n). Here, p_{\mathcal{N}}(x \mid C) denotes the likelihood of x from a multivariate normal distribution with zero mean and covariance matrix C. Therefore, for c_1=0 and c_\mu=1, C_{k+1} is the above maximum-likelihood estimator. See estimation of covariance matrices for details on the derivation.
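The rank-\mu maximum-likelihood property can be illustrated numerically. The following Octave/MATLAB fragment, with arbitrary example sizes and standard normal vectors standing in for the normalized steps (x_{i:\lambda}-m_k)/\sigma_k, shows that the weighted empirical second-moment matrix attains a larger weighted Gaussian log-likelihood than nearby alternative matrices.

n = 3; mu = 50;                                  % example dimension and number of selected steps
w = log(mu+1/2) - log(1:mu)'; w = w/sum(w);      % CMA-ES-like positive weights summing to one
Y = randn(n, mu);                                % stand-ins for the normalized steps
Chat = Y * diag(w) * Y';                         % weighted second-moment matrix (rank-mu estimator)
loglik = @(C) sum(w' .* (-0.5*log(det(2*pi*C)) - 0.5*sum(Y .* (C\Y), 1)));
disp(loglik(Chat))                               % the weighted log-likelihood of the estimator itself ...
disp(loglik(Chat + 0.1*eye(n)))                  % ... exceeds that of perturbed alternatives
disp(loglik(0.8*Chat))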


Natural gradient descent in the space of sample distributions

Akimoto ''et al.'' and Glasmachers ''et al.'' discovered independently that the update of the distribution parameters resembles the descent in direction of a sampled natural gradient of the expected objective function value \operatorname E f(x) (to be minimized), where the expectation is taken under the sample distribution. With the parameter setting of c_\sigma=0 and c_1=0, i.e. without step-size control and rank-one update, CMA-ES can thus be viewed as an instantiation of Natural Evolution Strategies (NES). The ''natural'' gradient is independent of the parameterization of the distribution. Taken with respect to the parameters \theta of the sample distribution p, the gradient of \operatorname E f(x) can be expressed as

:: \begin{align} \nabla_{\theta} \operatorname E(f(x) \mid \theta) &= \nabla_{\theta} \int_{\mathbb{R}^n} f(x) p(x) \, \mathrm{d}x \\ &= \int_{\mathbb{R}^n} f(x) \nabla_{\theta} p(x) \, \mathrm{d}x \\ &= \int_{\mathbb{R}^n} f(x) p(x) \nabla_{\theta} \ln p(x) \, \mathrm{d}x \\ &= \operatorname E(f(x) \nabla_{\theta} \ln p(x\mid\theta)) \end{align}

where p(x)=p(x\mid\theta) depends on the parameter vector \theta. The so-called score function, \nabla_{\theta} \ln p(x\mid\theta) = \frac{\nabla_{\theta} p(x\mid\theta)}{p(x\mid\theta)}, indicates the relative sensitivity of p w.r.t. \theta, and the expectation is taken with respect to the distribution p. The ''natural'' gradient of \operatorname E f(x), complying with the Fisher information metric (an informational distance measure between probability distributions and the curvature of the relative entropy), now reads

:: \tilde\nabla \operatorname E(f(x) \mid \theta) = F^{-1}_\theta \nabla_{\theta} \operatorname E(f(x) \mid \theta)

where the Fisher information matrix F_\theta is the expectation of the Hessian of -\ln p and renders the expression independent of the chosen parameterization. Combining the previous equalities we get

:: \begin{align} \tilde\nabla \operatorname E(f(x) \mid \theta) &= F^{-1}_\theta \operatorname E(f(x) \nabla_{\theta} \ln p(x\mid\theta)) \\ &= \operatorname E(f(x) F^{-1}_\theta \nabla_{\theta} \ln p(x\mid\theta)) \end{align}

A Monte Carlo approximation of the latter expectation takes the average over \lambda samples from p

:: \tilde\nabla \widehat{E}_\theta(f) := -\sum_{i=1}^\lambda w_i \, F^{-1}_\theta \nabla_{\theta} \ln p(x_{i:\lambda}\mid\theta) \quad\text{with } w_i = -f(x_{i:\lambda})/\lambda

where the notation i:\lambda from above is used and therefore the w_i are monotonically decreasing in i. Ollivier ''et al.'' finally found a rigorous derivation for the weights, w_i, as they are defined in the CMA-ES. The weights are the consistent estimator for the CDF of f(X), where X\sim p(\cdot\mid \theta), at the point f(x_{i:\lambda}), composed with a fixed monotonically decreasing transformation w, that is,

:: w_i = w\!\left(\frac{i-1/2}{\lambda}\right).

These weights make the algorithm insensitive to the specific f-values. More concisely, using the CDF estimator of f instead of f itself lets the algorithm depend only on the ranking of f-values and not on their underlying distribution. This renders the algorithm invariant under strictly increasing f-transformations. Now let

:: \theta = \begin{bmatrix} m_k^T & \operatorname{vec}(C_k)^T & \sigma_k \end{bmatrix}^T \in \mathbb{R}^{n+n^2+1}

such that p(\cdot\mid\theta) is the density of the multivariate normal distribution \mathcal N(m_k,\sigma_k^2 C_k). Then, we have an explicit expression for the inverse of the Fisher information matrix where \sigma_k is fixed

:: F^{-1}_{\theta} = \begin{bmatrix}\sigma_k^2 C_k & 0\\ 0 & 2 C_k\otimes C_k\end{bmatrix}

and for

:: \ln p(x\mid\theta) = \ln p(x\mid m_k,\sigma_k^2 C_k) = -\frac{1}{2}(x-m_k)^T \sigma_k^{-2} C_k^{-1} (x-m_k) - \frac{1}{2}\ln\det(2\pi\sigma_k^2 C_k)

and, after some calculations, the updates in the CMA-ES turn out as

:: \begin{align} m_{k+1} &= m_k - \tilde\nabla_m \widehat{E}_\theta(f) \\ &= m_k + \sum_{i=1}^\lambda w_i (x_{i:\lambda} - m_k) \end{align}

and

:: \begin{align} C_{k+1} &= C_k + c_1(p_c p_c^T - C_k) - c_\mu\operatorname{mat}\big(\tilde\nabla_{C} \widehat{E}_\theta(f)\big)\\ &= C_k + c_1(p_c p_c^T - C_k) + c_\mu \sum_{i=1}^\lambda w_i \left(\frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^T - C_k\right) \end{align}

where mat forms the proper matrix from the respective natural gradient sub-vector. That means, setting c_1=c_\sigma=0, the CMA-ES updates descend in direction of the approximation \tilde\nabla \widehat{E}_\theta(f) of the natural gradient while using different step-sizes (learning rates 1 and c_\mu) for the orthogonal parameters m and C respectively. The most recent versions of CMA-ES also use a different function w for m and C, with negative values only for the latter (so-called active CMA).
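The resulting mean update can be mimicked in a few lines. The following Octave/MATLAB sketch, with a toy objective and arbitrary example mean and covariance, computes the weighted mean shift \sum_i w_i (x_{i:\lambda}-m), i.e. the sampled natural-gradient step for m derived above.

n = 2; lambda = 100; mu = 50;                     % example sizes
m = [1; -1]; C = [4 1; 1 2];                      % example mean and covariance of the sample distribution
f = @(X) sum(X.^2, 1);                            % toy objective, one column per candidate
X = repmat(m,1,lambda) + chol(C,'lower') * randn(n,lambda);  % sample x_i ~ N(m, C)
[~, idx] = sort(f(X));                            % rank the candidates, best first
w = zeros(lambda,1);
w(1:mu) = log(mu+1/2) - log(1:mu)'; w = w/sum(w); % rank-based weights, zero for the worse half
shift = (X(:,idx) - repmat(m,1,lambda)) * w;      % weighted mean shift = sampled natural-gradient step for m
m_new = m + shift                                 % corresponds to m_{k+1} in the update above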


Stationarity or unbiasedness

It is comparatively easy to see that the update equations of CMA-ES satisfy some stationarity conditions, in that they are essentially unbiased. Under neutral selection, where x_{i:\lambda} \sim \mathcal N(m_k,\sigma_k^2 C_k), we find that

:: \operatorname E(m_{k+1}\mid m_k) = m_k

and under some mild additional assumptions on the initial conditions

:: \operatorname E(\log \sigma_{k+1} \mid \sigma_k) = \log \sigma_k

and with an additional minor correction in the covariance matrix update for the case where the indicator function evaluates to zero, we find

:: \operatorname E(C_{k+1} \mid C_k) = C_k
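The first of these statements is easy to verify by simulation. The following Octave/MATLAB fragment, with arbitrary example parameters, emulates neutral selection by ranking the samples at random and averages the resulting recombination over many trials.

n = 2; lambda = 10; mu = 5; trials = 1e4;        % example sizes
m = [3; -2]; sigma = 0.7; C = [2 0.5; 0.5 1];    % example distribution parameters
w = log(mu+1/2) - log(1:mu)'; w = w/sum(w);      % recombination weights as in CMA-ES
acc = zeros(n,1);
for t = 1:trials
  X = repmat(m,1,lambda) + sigma * chol(C,'lower') * randn(n,lambda); % x_i ~ N(m, sigma^2 C)
  idx = randperm(lambda);                        % neutral selection: the ranking ignores f entirely
  acc = acc + X(:, idx(1:mu)) * w;               % weighted recombination of the "selected" samples
end
disp(acc/trials)                                 % averages to approximately m = [3; -2]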


Invariance

Invariance properties imply uniform performance on a class of objective functions. They have been argued to be an advantage, because they allow one to generalize and predict the behavior of the algorithm and therefore strengthen the meaning of empirical results obtained on single functions. The following invariance properties have been established for CMA-ES.

* Invariance under order-preserving transformations of the objective function value f, in that for any h:\mathbb{R}^n\to\mathbb{R} the behavior is identical on f:x\mapsto g(h(x)) for all strictly increasing g:\mathbb{R}\to\mathbb{R}. This invariance is easy to verify, because only the f-ranking is used in the algorithm, which is invariant under the choice of g (see the short numerical check at the end of this section).
* Scale-invariance, in that for any h:\mathbb{R}^n\to \mathbb{R} the behavior is independent of \alpha>0 for the objective function f:x\mapsto h(\alpha x), given \sigma_0\propto1/\alpha and m_0\propto1/\alpha.
* Invariance under rotation of the search space, in that for any h:\mathbb{R}^n\to \mathbb{R} and any z\in\mathbb{R}^n the behavior on f:x\mapsto h(R x) is independent of the orthogonal matrix R, given m_0=R^{-1} z. More generally, the algorithm is also invariant under general linear transformations R when additionally the initial covariance matrix is chosen as R^{-1} (R^{-1})^T.

Any serious parameter optimization method should be translation invariant, but most methods do not exhibit all the above described invariance properties. A prominent example with the same invariance properties is the Nelder–Mead method, where the initial simplex must be chosen respectively.
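The first property can be made tangible in a few lines of Octave/MATLAB, with a toy function h and an arbitrary strictly increasing transformation g: the ranking that the algorithm consumes is identical for h and g \circ h.

h = @(x) sum(x.^2);                               % some underlying function
g = @(y) exp(y) + 5;                              % an arbitrary strictly increasing transformation
X = randn(3, 8);                                  % eight candidate points in dimension 3
[~, i1] = sort(arrayfun(@(k) h(X(:,k)),    1:8)); % ranking under h
[~, i2] = sort(arrayfun(@(k) g(h(X(:,k))), 1:8)); % ranking under g(h(.))
disp(isequal(i1, i2))                             % prints 1: both rankings coincide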


Convergence

Conceptual considerations like the scale-invariance property of the algorithm, the analysis of simpler evolution strategies, and overwhelming empirical evidence suggest that the algorithm converges fast to the global optimum, denoted as x^*, on a large class of functions. On some functions, convergence occurs independently of the initial conditions with probability one. On some functions the probability is smaller than one and typically depends on the initial m_0 and \sigma_0. Empirically, the fastest possible convergence rate in k for rank-based direct search methods can often be observed (depending on the context denoted as ''linear'', ''log-linear'' or ''exponential'' convergence). Informally, we can write

:: \|m_k - x^*\| \;\approx\; \|m_0 - x^*\| \times e^{-ck}

for some c>0, and more rigorously

:: \frac{1}{k}\sum_{j=1}^k\log\frac{\|m_j - x^*\|}{\|m_{j-1} - x^*\|} \;=\; \frac{1}{k}\log\frac{\|m_k - x^*\|}{\|m_0 - x^*\|} \;\to\; -c < 0 \quad\text{for } k\to\infty\;,

or similarly,

:: \operatorname E\log\frac{\|m_{k+1} - x^*\|}{\|m_k - x^*\|} \;\to\; -c < 0 \quad\text{for } k\to\infty\;.

This means that on average the distance to the optimum decreases in each iteration by a "constant" factor, namely by \exp(-c). The convergence rate c is roughly 0.1\lambda/n, given \lambda is not much larger than the dimension n. Even with optimal \sigma and C, the convergence rate c cannot largely exceed 0.25\lambda/n, given the above recombination weights w_i are all non-negative. The actual linear dependencies in \lambda and n are remarkable and they are in both cases the best one can hope for in this kind of algorithm. Yet, a rigorous proof of convergence is missing.
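For a rough numerical illustration of these rates, assuming the default population size: with n = 20 and \lambda = 4+\lfloor 3\ln n\rfloor = 12, the rule of thumb c \approx 0.1\lambda/n gives c \approx 0.06, so the distance to the optimum shrinks by a factor of about \exp(-0.06)\approx 0.94 per iteration and halves roughly every \ln 2 / c \approx 12 iterations, that is, about every 140 function evaluations.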


Interpretation as coordinate-system transformation

Using a non-identity covariance matrix for the multivariate normal distribution in evolution strategies is equivalent to a coordinate system transformation of the solution vectors, mainly because the sampling equation

:: \begin{align} x_i &\sim\ m_k + \sigma_k\times\mathcal{N}(0,C_k) \\ &\sim\ m_k + \sigma_k \times C_k^{1/2}\mathcal{N}(0,I) \end{align}

can be equivalently expressed in an "encoded space" as

:: C_k^{-1/2} x_i \;\sim\; C_k^{-1/2} m_k + \sigma_k \times\mathcal{N}(0,I)

The covariance matrix defines a bijective transformation (encoding) for all solution vectors into a space where the sampling takes place with identity covariance matrix. Because the update equations in the CMA-ES are invariant under linear coordinate system transformations, the CMA-ES can be re-written as an adaptive encoding procedure applied to a simple evolution strategy with identity covariance matrix. This adaptive encoding procedure is not confined to algorithms that sample from a multivariate normal distribution (like evolution strategies), but can in principle be applied to any iterative search method.
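A minimal Octave/MATLAB sketch of this encoding view, with an arbitrary positive-definite example matrix, samples in the encoded space with identity covariance and decodes with C^{1/2}.

n = 3; m = randn(n,1); sigma = 0.4;   % example mean and step-size
A = randn(n); C = A*A' + eye(n);      % some positive-definite covariance matrix
Chalf = sqrtm(C);                     % symmetric square root C^{1/2}
z = randn(n,1);                       % encoded-space sample, distributed as N(0, I)
x = m + sigma * Chalf * z;            % decoded sample, distributed as N(m, sigma^2 C)
x_enc = Chalf \ x;                    % C^{-1/2} x = C^{-1/2} m + sigma * z, the encoded representation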


Performance in practice

In contrast to most other evolutionary algorithms, the CMA-ES is, from the user's perspective, quasi-parameter-free. The user has to choose an initial solution point, m_0\in\mathbb{R}^n, and the initial step-size, \sigma_0>0. Optionally, the number of candidate samples \lambda (population size) can be modified by the user in order to change the characteristic search behavior (see above), and termination conditions can or should be adjusted to the problem at hand.

The CMA-ES has been empirically successful in hundreds of applications and is considered to be useful in particular on non-convex, non-separable, ill-conditioned, multi-modal or noisy objective functions. One survey of black-box optimizations found it outranked 31 other optimization algorithms, performing especially strongly on "difficult functions" or larger-dimensional search spaces. The search space dimension ranges typically between two and a few hundred. Assuming a black-box optimization scenario, where gradients are not available (or not useful) and function evaluations are the only considered cost of search, the CMA-ES method is likely to be outperformed by other methods in the following conditions:

* on low-dimensional functions, say n < 5, for example by the downhill simplex method or surrogate-based methods (like kriging with expected improvement);
* on separable functions without or with only negligible dependencies between the design variables, in particular in the case of multi-modality or large dimension, for example by differential evolution;
* on (nearly) convex-quadratic functions with low or moderate condition number of the Hessian matrix, where BFGS or NEWUOA or SLSQP are typically at least ten times faster;
* on functions that can already be solved with a comparatively small number of function evaluations, say no more than 10n, where CMA-ES is often slower than, for example, NEWUOA or Multilevel Coordinate Search (MCS).

On separable functions, the performance disadvantage is likely to be most significant in that CMA-ES might not be able to find at all comparable solutions. On the other hand, on non-separable functions that are ill-conditioned or rugged or can only be solved with more than 100n function evaluations, the CMA-ES shows most often superior performance.


Variations and extensions

The (1+1)-CMA-ES generates only one candidate solution per iteration step, which becomes the new distribution mean if it is better than the current mean. For c_c=1 the (1+1)-CMA-ES is a close variant of Gaussian adaptation. Some Natural Evolution Strategies are close variants of the CMA-ES with specific parameter settings. Natural Evolution Strategies do not utilize evolution paths (that means, in CMA-ES terms, setting c_c=c_\sigma=1) and they formalize the update of variances and covariances on a Cholesky factor instead of a covariance matrix. The CMA-ES has also been extended to multiobjective optimization as MO-CMA-ES. Another remarkable extension has been the addition of a negative update of the covariance matrix with the so-called active CMA. Using the additional active CMA update is considered the default variant nowadays.


See also

* Global optimization
* Stochastic optimization
* Derivative-free optimization
* Estimation of distribution algorithm


Bibliography

*Hansen N, Ostermeier A (2001). Completely derandomized self-adaptation in evolution strategies. ''Evolutionary Computation'', 9(2), pp. 159–195.
*Hansen N, Müller SD, Koumoutsakos P (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). ''Evolutionary Computation'', 11(1), pp. 1–18.
*Hansen N, Kern S (2004). Evaluating the CMA evolution strategy on multimodal test functions. In Xin Yao et al., editors, ''Parallel Problem Solving from Nature – PPSN VIII'', pp. 282–291, Springer.
*Igel C, Hansen N, Roth S (2007). Covariance Matrix Adaptation for Multi-objective Optimization. ''Evolutionary Computation'', 15(1), pp. 1–28.


External links


* A short introduction to CMA-ES by N. Hansen
* The CMA Evolution Strategy: A Tutorial

