CMA-ES

Last updated

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation (via recombination and mutation) and selection: in each generation (iteration) new individuals (candidate solutions, denoted as ) are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, over the generation sequence, individuals with better and better -values are generated.

Contents

In an evolution strategy, new candidate solutions are sampled according to a multivariate normal distribution in . Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a covariance matrix. The covariance matrix adaptation (CMA) is a method to update the covariance matrix of this distribution. This is particularly useful if the function is ill-conditioned.

Adaptation of the covariance matrix amounts to learning a second order model of the underlying objective function similar to the approximation of the inverse Hessian matrix in the quasi-Newton method in classical optimization. In contrast to most classical methods, fewer assumptions on the underlying objective function are made. Because only a ranking (or, equivalently, sorting) of candidate solutions is exploited, neither derivatives nor even an (explicit) objective function is required by the method. For example, the ranking could come about from pairwise competitions between the candidate solutions in a Swiss-system tournament.

Principles

Illustration of an actual optimization run with covariance matrix adaptation on a simple two-dimensional problem. The spherical optimization landscape is depicted with solid lines of equal
f
{\displaystyle f}
-values. The population (dots) is much larger than necessary, but clearly shows how the distribution of the population (dotted line) changes during the optimization. On this simple problem, the population concentrates over the global optimum within a few generations. Concept of directional optimization in CMA-ES algorithm.png
Illustration of an actual optimization run with covariance matrix adaptation on a simple two-dimensional problem. The spherical optimization landscape is depicted with solid lines of equal -values. The population (dots) is much larger than necessary, but clearly shows how the distribution of the population (dotted line) changes during the optimization. On this simple problem, the population concentrates over the global optimum within a few generations.

Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.

First, a maximum-likelihood principle, based on the idea to increase the probability of successful candidate solutions and search steps. The mean of the distribution is updated such that the likelihood of previously successful candidate solutions is maximized. The covariance matrix of the distribution is updated (incrementally) such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as a natural gradient descent. Also, in consequence, the CMA conducts an iterated principal components analysis of successful search steps while retaining all principal axes. Estimation of distribution algorithms and the Cross-Entropy Method are based on very similar ideas, but estimate (non-incrementally) the covariance matrix by maximizing the likelihood of successful solution points instead of successful search steps.

Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase of favorable directions. The other path is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively prevents premature convergence yet allowing fast convergence to an optimum.

Algorithm

In the following the most commonly used (μ/μw, λ)-CMA-ES is outlined, where in each iteration step a weighted combination of the μ best out of λ new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. A pseudocode of the algorithm looks as follows.

set  // number of samples per iteration, at least two, generally > 4 initialize, , , ,   // initialize state variables whilenot terminatedo  // iterate     forindo  // sample  new solutions and evaluate them         sample_multivariate_normal(mean, covariance_matrix)          with  // sort solutions       // we need later  and  update_m  // move mean to better solutions       update_ps  // update isotropic evolution path      update_pc  // update anisotropic evolution path      update_C  // update covariance matrix      update_sigma  // update step-size using isotropic path length return or 

The order of the five update assignments is relevant: must be updated first, and must be updated before , and must be updated last. The update equations for the five state variables are specified in the following.

Given are the search space dimension and the iteration step . The five state variables are

, the distribution mean and current favorite solution to the optimization problem,
, the step-size,
, a symmetric and positive-definite covariance matrix with and
, two evolution paths, initially set to the zero vector.

The iteration starts with sampling candidate solutions from a multivariate normal distribution , i.e. for

The second line suggests the interpretation as unbiased perturbation (mutation) of the current favorite solution vector (the distribution mean vector). The candidate solutions are evaluated on the objective function to be minimized. Denoting the -sorted candidate solutions as

the new mean value is computed as

where the positive (recombination) weights sum to one. Typically, and the weights are chosen such that . The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices .

The step-size is updated using cumulative step-size adaptation (CSA), sometimes also denoted as path length control. The evolution path (or search path) is updated first.

where

is the backward time horizon for the evolution path and larger than one ( is reminiscent of an exponential decay constant as where is the associated lifetime and the half-life),
is the variance effective selection mass and by definition of ,
is the unique symmetric square root of the inverse of , and
is the damping parameter usually close to one. For or the step-size remains unchanged.

The step-size is increased if and only if is larger than the expected value

and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps -conjugate, in that after the adaptation has been successful . [1]

Finally, the covariance matrix is updated, where again the respective evolution path is updated first.

where denotes the transpose and

is the backward time horizon for the evolution path and larger than one,
and the indicator function evaluates to one iff or, in other words, , which is usually the case,
makes partly up for the small variance loss in case the indicator is zero,
is the learning rate for the rank-one update of the covariance matrix and
is the learning rate for the rank- update of the covariance matrix and must not exceed .

The covariance matrix update tends to increase the likelihood for and for to be sampled from . This completes the iteration step.

The number of candidate samples per iteration, , is not determined a priori and can vary in a wide range. Smaller values, for example , lead to more local search behavior. Larger values, for example with default value , render the search more global. Sometimes the algorithm is repeatedly restarted with increasing by a factor of two for each restart. [2] Besides of setting (or possibly instead, if for example is predetermined by the number of available processors), the above introduced parameters are not specific to the given objective function and therefore not meant to be modified by the user.

Example code in MATLAB/Octave

functionxmin=purecmaes   % (mu/mu_w, lambda)-CMA-ES% --------------------  Initialization --------------------------------  % User defined input parameters (need to be edited)strfitnessfct='frosenbrock';% name of objective/fitness functionN=20;% number of objective variables/problem dimensionxmean=rand(N,1);% objective variables initial pointsigma=0.3;% coordinate wise standard deviation (step size)stopfitness=1e-10;% stop if fitness < stopfitness (minimization)stopeval=1e3*N^2;% stop after stopeval number of function evaluations% Strategy parameter setting: Selection  lambda=4+floor(3*log(N));% population size, offspring numbermu=lambda/2;% number of parents/points for recombinationweights=log(mu+1/2)-log(1:mu)';% muXone array for weighted recombinationmu=floor(mu);weights=weights/sum(weights);% normalize recombination weights arraymueff=sum(weights)^2/sum(weights.^2);% variance-effectiveness of sum w_i x_i% Strategy parameter setting: Adaptationcc=(4+mueff/N)/(N+4+2*mueff/N);% time constant for cumulation for Ccs=(mueff+2)/(N+mueff+5);% t-const for cumulation for sigma controlc1=2/((N+1.3)^2+mueff);% learning rate for rank-one update of Ccmu=min(1-c1,2*(mueff-2+1/mueff)/((N+2)^2+mueff));% and for rank-mu updatedamps=1+2*max(0,sqrt((mueff-1)/(N+1))-1)+cs;% damping for sigma % usually close to 1% Initialize dynamic (internal) strategy parameters and constantspc=zeros(N,1);ps=zeros(N,1);% evolution paths for C and sigmaB=eye(N,N);% B defines the coordinate systemD=ones(N,1);% diagonal D defines the scalingC=B*diag(D.^2)*B';% covariance matrix CinvsqrtC=B*diag(D.^-1)*B';% C^-1/2 eigeneval=0;% track update of B and DchiN=N^0.5*(1-1/(4*N)+1/(21*N^2));% expectation of %   ||N(0,I)|| == norm(randn(N,1)) % -------------------- Generation Loop --------------------------------counteval=0;% the next 40 lines contain the 20 lines of interesting code whilecounteval<stopeval% Generate and evaluate lambda offspringfork=1:lambdaarx(:,k)=xmean+sigma*B*(D.*randn(N,1));% m + sig * Normal(0,C) arfitness(k)=feval(strfitnessfct,arx(:,k));% objective function callcounteval=counteval+1;end% Sort by fitness and compute weighted mean into xmean[arfitness,arindex]=sort(arfitness);% minimizationxold=xmean;xmean=arx(:,arindex(1:mu))*weights;% recombination, new mean value% Cumulation: Update evolution pathsps=(1-cs)*ps...+sqrt(cs*(2-cs)*mueff)*invsqrtC*(xmean-xold)/sigma;hsig=norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN<1.4+2/(N+1);pc=(1-cc)*pc...+hsig*sqrt(cc*(2-cc)*mueff)*(xmean-xold)/sigma;% Adapt covariance matrix Cartmp=(1/sigma)*(arx(:,arindex(1:mu))-repmat(xold,1,mu));C=(1-c1-cmu)*C...                  % regard old matrix  +c1*(pc*pc'...                 % plus rank one update+(1-hsig)*cc*(2-cc)*C)... % minor correction if hsig==0+cmu*artmp*diag(weights)*artmp';% plus rank mu update% Adapt step size sigmasigma=sigma*exp((cs/damps)*(norm(ps)/chiN-1));% Decomposition of C into B*diag(D.^2)*B' (diagonalization)ifcounteval-eigeneval>lambda/(c1+cmu)/N/10% to achieve O(N^2)eigeneval=counteval;C=triu(C)+triu(C,1)';% enforce symmetry[B,D]=eig(C);% eigen decomposition, B==normalized eigenvectorsD=sqrt(diag(D));% D is a vector of standard deviations nowinvsqrtC=B*diag(D.^-1)*B';end% Break, if fitness is good enough or condition exceeds 1e14, better termination methods are advisable ifarfitness(1)<=stopfitness||max(D)>1e7*min(D)break;endend% while, end generation loopxmin=arx(:,arindex(1));% Return best point of last iteration.% Notice that xmean is expected to be even% better.end% ---------------------------------------------------------------  functionf=frosenbrock(x)ifsize(x,1)<2error('dimension must be greater one');endf=100*sum((x(1:end-1).^2-x(2:end)).^2)+sum((x(1:end-1)-1).^2);end

Theoretical foundations

Given the distribution parametersmean, variances and covariancesthe normal probability distribution for sampling new candidate solutions is the maximum entropy probability distribution over , that is, the sample distribution with the minimal amount of prior information built into the distribution. More considerations on the update equations of CMA-ES are made in the following.

Variable metric

The CMA-ES implements a stochastic variable-metric method. In the very particular case of a convex-quadratic objective function

the covariance matrix adapts to the inverse of the Hessian matrix , up to a scalar factor and small random fluctuations. More general, also on the function , where is strictly increasing and therefore order preserving, the covariance matrix adapts to , up to a scalar factor and small random fluctuations. For selection ratio (and hence population size ), the selected solutions yield an empirical covariance matrix reflective of the inverse-Hessian even in evolution strategies without adaptation of the covariance matrix. This result has been proven for on a static model, relying on the quadratic approximation. [3]

Maximum-likelihood updates

The update equations for mean and covariance matrix maximize a likelihood while resembling an expectation-maximization algorithm. The update of the mean vector maximizes a log-likelihood, such that

where

denotes the log-likelihood of from a multivariate normal distribution with mean and any positive definite covariance matrix . To see that is independent of remark first that this is the case for any diagonal matrix , because the coordinate-wise maximizer is independent of a scaling factor. Then, rotation of the data points or choosing non-diagonal are equivalent.

The rank- update of the covariance matrix, that is, the right most summand in the update equation of , maximizes a log-likelihood in that

for (otherwise is singular, but substantially the same result holds for ). Here, denotes the likelihood of from a multivariate normal distribution with zero mean and covariance matrix . Therefore, for and , is the above maximum-likelihood estimator. See estimation of covariance matrices for details on the derivation.

Natural gradient descent in the space of sample distributions

Akimoto et al. [4] and Glasmachers et al. [5] discovered independently that the update of the distribution parameters resembles the descent in direction of a sampled natural gradient of the expected objective function value (to be minimized), where the expectation is taken under the sample distribution. With the parameter setting of and , i.e. without step-size control and rank-one update, CMA-ES can thus be viewed as an instantiation of Natural Evolution Strategies (NES). [4] [5] The natural gradient is independent of the parameterization of the distribution. Taken with respect to the parameters θ of the sample distribution p, the gradient of can be expressed as

where depends on the parameter vector . The so-called score function, , indicates the relative sensitivity of p w.r.t. θ, and the expectation is taken with respect to the distribution p. The natural gradient of , complying with the Fisher information metric (an informational distance measure between probability distributions and the curvature of the relative entropy), now reads

where the Fisher information matrix is the expectation of the Hessian of −lnp and renders the expression independent of the chosen parameterization. Combining the previous equalities we get

A Monte Carlo approximation of the latter expectation takes the average over λ samples from p

where the notation from above is used and therefore are monotonically decreasing in .

Ollivier et al. [6] finally found a rigorous derivation for the weights, , as they are defined in the CMA-ES. The weights are an asymptotically consistent estimator of the CDF of at the points of the th order statistic , as defined above, where , composed with a fixed monotonically decreasing transformation , that is,

.

These weights make the algorithm insensitive to the specific -values. More concisely, using the CDF estimator of instead of itself let the algorithm only depend on the ranking of -values but not on their underlying distribution. This renders the algorithm invariant to strictly increasing -transformations. Now we define

such that is the density of the multivariate normal distribution . Then, we have an explicit expression for the inverse of the Fisher information matrix where is fixed

and for

and, after some calculations, the updates in the CMA-ES turn out as [4]

and

where mat forms the proper matrix from the respective natural gradient sub-vector. That means, setting , the CMA-ES updates descend in direction of the approximation of the natural gradient while using different step-sizes (learning rates 1 and ) for the orthogonal parameters and respectively. More recent versions allow a different learning rate for the mean as well. [7] The most recent version of CMA-ES also use a different function for and with negative values only for the latter (so-called active CMA).

Stationarity or unbiasedness

It is comparatively easy to see that the update equations of CMA-ES satisfy some stationarity conditions, in that they are essentially unbiased. Under neutral selection, where , we find that

and under some mild additional assumptions on the initial conditions

and with an additional minor correction in the covariance matrix update for the case where the indicator function evaluates to zero, we find

Invariance

Invariance properties imply uniform performance on a class of objective functions. They have been argued to be an advantage, because they allow to generalize and predict the behavior of the algorithm and therefore strengthen the meaning of empirical results obtained on single functions. The following invariance properties have been established for CMA-ES.

Any serious parameter optimization method should be translation invariant, but most methods do not exhibit all the above described invariance properties. A prominent example with the same invariance properties is the Nelder–Mead method, where the initial simplex must be chosen respectively.

Convergence

Conceptual considerations like the scale-invariance property of the algorithm, the analysis of simpler evolution strategies, and overwhelming empirical evidence suggest that the algorithm converges on a large class of functions fast to the global optimum, denoted as . On some functions, convergence occurs independently of the initial conditions with probability one. On some functions the probability is smaller than one and typically depends on the initial and . Empirically, the fastest possible convergence rate in for rank-based direct search methods can often be observed (depending on the context denoted as linear or log-linear or exponential convergence). Informally, we can write

for some , and more rigorously

or similarly,

This means that on average the distance to the optimum decreases in each iteration by a "constant" factor, namely by . The convergence rate is roughly , given is not much larger than the dimension . Even with optimal and , the convergence rate cannot largely exceed , given the above recombination weights are all non-negative. The actual linear dependencies in and are remarkable and they are in both cases the best one can hope for in this kind of algorithm. Yet, a rigorous proof of convergence is missing.

Interpretation as coordinate-system transformation

Using a non-identity covariance matrix for the multivariate normal distribution in evolution strategies is equivalent to a coordinate system transformation of the solution vectors, [8] mainly because the sampling equation

can be equivalently expressed in an "encoded space" as

The covariance matrix defines a bijective transformation (encoding) for all solution vectors into a space, where the sampling takes place with identity covariance matrix. Because the update equations in the CMA-ES are invariant under linear coordinate system transformations, the CMA-ES can be re-written as an adaptive encoding procedure applied to a simple evolution strategy with identity covariance matrix. [8] This adaptive encoding procedure is not confined to algorithms that sample from a multivariate normal distribution (like evolution strategies), but can in principle be applied to any iterative search method.

Performance in practice

In contrast to most other evolutionary algorithms, the CMA-ES is, from the user's perspective, quasi-parameter-free. The user has to choose an initial solution point, , and the initial step-size, . Optionally, the number of candidate samples λ (population size) can be modified by the user in order to change the characteristic search behavior (see above) and termination conditions can or should be adjusted to the problem at hand.

The CMA-ES has been empirically successful in hundreds of applications and is considered to be useful in particular on non-convex, non-separable, ill-conditioned, multi-modal or noisy objective functions. [9] One survey of Black-Box optimizations found it outranked 31 other optimization algorithms, performing especially strongly on "difficult functions" or larger-dimensional search spaces. [10]

The search space dimension ranges typically between two and a few hundred. Assuming a black-box optimization scenario, where gradients are not available (or not useful) and function evaluations are the only considered cost of search, the CMA-ES method is likely to be outperformed by other methods in the following conditions:

On separable functions, the performance disadvantage is likely to be most significant in that CMA-ES might not be able to find at all comparable solutions. On the other hand, on non-separable functions that are ill-conditioned or rugged or can only be solved with more than function evaluations, the CMA-ES shows most often superior performance.

Variations and extensions

The (1+1)-CMA-ES [11] generates only one candidate solution per iteration step which becomes the new distribution mean if it is better than the current mean. For the (1+1)-CMA-ES is a close variant of Gaussian adaptation. Some Natural Evolution Strategies are close variants of the CMA-ES with specific parameter settings. Natural Evolution Strategies do not utilize evolution paths (that means in CMA-ES setting ) and they formalize the update of variances and covariances on a Cholesky factor instead of a covariance matrix. The CMA-ES has also been extended to multiobjective optimization as MO-CMA-ES. [12] Another remarkable extension has been the addition of a negative update of the covariance matrix with the so-called active CMA. [13] Using the additional active CMA update is considered as the default variant nowadays. [7]

See also

Related Research Articles

<span class="mw-page-title-main">Lorentz transformation</span> Family of linear transformations

In physics, the Lorentz transformations are a six-parameter family of linear transformations from a coordinate frame in spacetime to another frame that moves at a constant velocity relative to the former. The respective inverse transformation is then parameterized by the negative of this velocity. The transformations are named after the Dutch physicist Hendrik Lorentz.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

<span class="mw-page-title-main">Lorentz group</span> Lie group of Lorentz transformations

In physics and mathematics, the Lorentz group is the group of all Lorentz transformations of Minkowski spacetime, the classical and quantum setting for all (non-gravitational) physical phenomena. The Lorentz group is named for the Dutch physicist Hendrik Lorentz.

Linear elasticity is a mathematical model of how solid objects deform and become internally stressed due to prescribed loading conditions. It is a simplification of the more general nonlinear theory of elasticity and a branch of continuum mechanics.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In theoretical physics, a supermultiplet is a representation of a supersymmetry algebra, possibly with extended supersymmetry.

In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher information matrix:

In statistics, a parametric model or parametric family or finite-dimensional model is a particular class of statistical models. Specifically, a parametric model is a family of probability distributions that has a finite number of parameters.

<span class="mw-page-title-main">Inverse Gaussian distribution</span> Family of continuous probability distributions

In probability theory, the inverse Gaussian distribution is a two-parameter family of continuous probability distributions with support on (0,∞).

In statistics, the multivariate t-distribution is a multivariate probability distribution. It is a generalization to random vectors of the Student's t-distribution, which is a distribution applicable to univariate random variables. While the case of a random matrix could be treated within this structure, the matrix t-distribution is distinct and makes particular use of the matrix structure.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

In probability and statistics, a natural exponential family (NEF) is a class of probability distributions that is a special case of an exponential family (EF).

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.

In probability and statistics, the Tweedie distributions are a family of probability distributions which include the purely continuous normal, gamma and inverse Gaussian distributions, the purely discrete scaled Poisson distribution, and the class of compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. Tweedie distributions are a special case of exponential dispersion models and are often used as distributions for generalized linear models.

In mathematics, the spectral theory of ordinary differential equations is the part of spectral theory concerned with the determination of the spectrum and eigenfunction expansion associated with a linear ordinary differential equation. In his dissertation, Hermann Weyl generalized the classical Sturm–Liouville theory on a finite closed interval to second order differential operators with singularities at the endpoints of the interval, possibly semi-infinite or infinite. Unlike the classical case, the spectrum may no longer consist of just a countable set of eigenvalues, but may also contain a continuous part. In this case the eigenfunction expansion involves an integral over the continuous part with respect to a spectral measure, given by the Titchmarsh–Kodaira formula. The theory was put in its final simplified form for singular differential equations of even degree by Kodaira and others, using von Neumann's spectral theorem. It has had important applications in quantum mechanics, operator theory and harmonic analysis on semisimple Lie groups.

<span class="mw-page-title-main">Normal-inverse-gamma distribution</span>

In probability theory and statistics, the normal-inverse-gamma distribution is a four-parameter family of multivariate continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and variance.

In probability theory and statistics, the normal-inverse-Wishart distribution is a multivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a multivariate normal distribution with unknown mean and covariance matrix.

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

References

  1. Hansen, N. (2006), "The CMA evolution strategy: a comparing review", Towards a new evolutionary computation. Advances on estimation of distribution algorithms, Springer, pp. 1769–1776, CiteSeerX   10.1.1.139.7369
  2. Auger, A.; N. Hansen (2005). "A Restart CMA Evolution Strategy With Increasing Population Size" (PDF). 2005 IEEE Congress on Evolutionary Computation, Proceedings. IEEE. pp. 1769–1776. Archived from the original (PDF) on 2016-03-04. Retrieved 2012-07-13.
  3. Shir, O.M.; A. Yehudayoff (2020). "On the covariance-Hessian relation in evolution strategies". Theoretical Computer Science. 801. Elsevier: 157–174. arXiv: 1806.03674 . doi: 10.1016/j.tcs.2019.09.002 .
  4. 1 2 3 Akimoto, Y.; Y. Nagata; I. Ono; S. Kobayashi (2010). "Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies". Parallel Problem Solving from Nature, PPSN XI. Springer. pp. 154–163.
  5. 1 2 Glasmachers, T.; T. Schaul; Y. Sun; D. Wierstra; J. Schmidhuber (2010). "Exponential Natural Evolution Strategies" (PDF). Genetic and Evolutionary Computation Conference GECCO. Portland, OR.
  6. Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. (2017). "Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles" (PDF). Journal of Machine Learning Research. 18 (18): 1−65.
  7. 1 2 Hansen, N. (2016). "The CMA Evolution Strategy: A Tutorial". arXiv: 1604.00772 [cs.LG].
  8. 1 2 Hansen, N. (2008). "Adaptive Encoding: How to Render Search Coordinate System Invariant". Parallel Problem Solving from Nature, PPSN X. Springer. pp. 205–214.
  9. "References to CMA-ES Applications" (PDF).
  10. Hansen, Nikolaus (2010). "Comparing Results of 31 Algorithms from the Black-Box Optimization Benchmarking BBOB-2009" (PDF).
  11. Igel, C.; T. Suttorp; N. Hansen (2006). "A Computational Efficient Covariance Matrix Update and a (1+1)-CMA for Evolution Strategies" (PDF). Proceedings of the Genetic and Evolutionary Computation Conference (GECCO). ACM Press. pp. 453–460.
  12. Igel, C.; N. Hansen; S. Roth (2007). "Covariance Matrix Adaptation for Multi-objective Optimization". Evolutionary Computation. 15 (1): 1–28. doi:10.1162/evco.2007.15.1.1. PMID   17388777. S2CID   7479494.
  13. Jastrebski, G.A.; D.V. Arnold (2006). "Improving Evolution Strategies through Active Covariance Matrix Adaptation". 2006 IEEE World Congress on Computational Intelligence, Proceedings. IEEE. pp. 9719–9726. doi:10.1109/CEC.2006.1688662.

Bibliography