Natural evolution strategy

Natural evolution strategies (NES) are a family of numerical optimization algorithms for black-box problems. Similar in spirit to evolution strategies, they iteratively update the (continuous) parameters of a search distribution by following the natural gradient towards higher expected fitness.

Method

The general procedure is as follows: the parameterized search distribution is used to produce a batch of search points, and the fitness function is evaluated at each such point. The distribution’s parameters (which include strategy parameters) allow the algorithm to adaptively capture the (local) structure of the fitness function. For example, in the case of a Gaussian distribution, this comprises the mean and the covariance matrix. From the samples, NES estimates a search gradient on the parameters towards higher expected fitness. NES then performs a gradient ascent step along the natural gradient, a second-order method which, unlike the plain gradient, renormalizes the update with respect to uncertainty. This step is crucial, since it prevents oscillations, premature convergence, and undesired effects stemming from a given parameterization. The entire process repeats until a stopping criterion is met.

All members of the NES family operate based on the same principles. They differ in the type of probability distribution and the gradient approximation method used. Different search spaces require different search distributions; for example, in low dimensions it can be highly beneficial to model the full covariance matrix, while in high dimensions a more scalable alternative is to limit the covariance to the diagonal. In addition, highly multi-modal search spaces may benefit from more heavy-tailed distributions (such as the Cauchy, as opposed to the Gaussian). A final distinction arises between distributions for which the natural gradient can be computed analytically and more general distributions for which it must be estimated from samples.
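As a concrete illustration of the diagonal case, the log-derivatives of a separable Gaussian are available in closed form. The following sketch (in NumPy, with illustrative names, not taken from any particular NES library) computes them per dimension:

    import numpy as np

    def gaussian_log_derivatives(x, mu, sigma):
        """Closed-form log-derivatives of a diagonal Gaussian search
        distribution pi(x | mu, sigma); these are the quantities NES
        reuses for both the gradient and the Fisher estimates."""
        d_mu = (x - mu) / sigma**2                     # derivative w.r.t. the mean
        d_sigma = ((x - mu)**2 - sigma**2) / sigma**3  # derivative w.r.t. the std. deviations
        return d_mu, d_sigma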

Search gradients

Let $\theta$ denote the parameters of the search distribution $\pi(x \mid \theta)$ and $f(x)$ the fitness function evaluated at $x$. NES then pursues the objective of maximizing the expected fitness under the search distribution

$$J(\theta) = \mathbb{E}_\theta[f(x)] = \int f(x)\,\pi(x \mid \theta)\,dx$$

through gradient ascent. Using the log-likelihood trick, the gradient can be rewritten as

$$\nabla_\theta J(\theta) = \mathbb{E}_\theta\!\left[f(x)\,\nabla_\theta \log \pi(x \mid \theta)\right],$$

that is, the expected value of $f(x)$ times the log-derivatives at $x$. In practice, it is possible to use the Monte Carlo approximation based on a finite number $\lambda$ of samples:

$$\nabla_\theta J(\theta) \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} f(x_k)\,\nabla_\theta \log \pi(x_k \mid \theta).$$

Finally, the parameters of the search distribution can be updated iteratively through gradient ascent:

$$\theta \leftarrow \theta + \eta\,\nabla_\theta J(\theta),$$

where $\eta$ is the learning rate.
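For instance, for a Gaussian search distribution with diagonal covariance, the Monte Carlo estimate above can be written in a few lines. This is a minimal sketch (NumPy, illustrative names), not a complete optimizer:

    import numpy as np

    def search_gradient(f, mu, sigma, lam=50, rng=None):
        """Monte Carlo estimate of the search gradient: draw lam samples
        x_k ~ N(mu, diag(sigma^2)) and average f(x_k) times the
        log-derivatives, as in the formula above."""
        rng = rng if rng is not None else np.random.default_rng(0)
        x = mu + sigma * rng.standard_normal((lam, mu.size))  # sample the population
        fitness = np.apply_along_axis(f, 1, x)                # evaluate f at each point
        d_mu = (x - mu) / sigma**2                            # log-derivatives w.r.t. mu
        d_sigma = ((x - mu)**2 - sigma**2) / sigma**3         # ... and w.r.t. sigma
        grad_mu = (fitness[:, None] * d_mu).mean(axis=0)
        grad_sigma = (fitness[:, None] * d_sigma).mean(axis=0)
        return grad_mu, grad_sigma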

Natural gradient ascent

Instead of using the plain stochastic gradient for updates, NES follows the natural gradient, which has been shown to possess numerous advantages over the plain (vanilla) gradient: for example, the gradient direction is independent of the parameterization of the search distribution, and the update magnitudes are automatically adjusted based on uncertainty, which in turn speeds up convergence on plateaus and ridges.

The NES update is therefore

$$\theta \leftarrow \theta + \eta\,\mathbf{F}^{-1} \nabla_\theta J(\theta),$$

where

$$\mathbf{F} = \mathbb{E}_\theta\!\left[\nabla_\theta \log \pi(x \mid \theta)\,\nabla_\theta \log \pi(x \mid \theta)^{\top}\right]$$

is the Fisher information matrix. The Fisher matrix can sometimes be computed exactly; otherwise it is estimated from samples, reusing the log-derivatives $\nabla_\theta \log \pi(x_k \mid \theta)$.
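As a sketch of the sample-based route (the function name and the regularization constant are illustrative assumptions), the Fisher matrix can be estimated as the second moment of the stacked log-derivatives and then inverted for the update:

    import numpy as np

    def natural_gradient_step(theta, log_derivs, fitness, eta=0.1):
        """One NES update with a Monte Carlo Fisher estimate.
        log_derivs is a (lam, d) matrix whose k-th row is
        grad_theta log pi(x_k | theta)."""
        lam = log_derivs.shape[0]
        grad = log_derivs.T @ fitness / lam                 # plain search gradient
        fisher = log_derivs.T @ log_derivs / lam            # estimated Fisher matrix
        fisher += 1e-8 * np.eye(fisher.shape[0])            # small ridge for invertibility
        return theta + eta * np.linalg.solve(fisher, grad)  # theta + eta * F^{-1} grad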

Fitness shaping

NES utilizes rank-based fitness shaping in order to render the algorithm more robust and invariant under monotonically increasing transformations of the fitness function. For this purpose, the fitness of the population is transformed into a set of utility values $u_1 \geq u_2 \geq \dots \geq u_\lambda$. Let $x_k$ denote the $k$th best individual. Replacing fitness with utility, the gradient estimate becomes

$$\nabla_\theta J(\theta) = \frac{1}{\lambda} \sum_{k=1}^{\lambda} u_k\,\nabla_\theta \log \pi(x_k \mid \theta).$$

The choice of utility function is a free parameter of the algorithm.
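One common choice in the NES literature is a log-based utility that is zero-mean over the population; the sketch below (illustrative, by no means the only option) implements it:

    import numpy as np

    def rank_utilities(fitness):
        """Rank-based utilities: each value depends only on the rank of the
        individual, so any monotonically increasing transformation of the
        fitness leaves the resulting update unchanged."""
        lam = len(fitness)
        ranks = np.empty(lam, dtype=int)
        ranks[np.argsort(fitness)[::-1]] = np.arange(1, lam + 1)  # rank 1 = best
        u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
        return u / u.sum() - 1.0 / lam                            # sums to zero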

Pseudocode

    input: f, θ_init

    repeat
        for k = 1 … λ do                                   // λ is the population size
            draw sample x_k ∼ π(·|θ)
            evaluate fitness f(x_k)
            calculate log-derivatives ∇_θ log π(x_k|θ)
        end
        assign the utilities u_k                           // based on rank
        estimate the gradient ∇_θ J ← (1/λ) Σ_k u_k ∇_θ log π(x_k|θ)
        estimate F ← (1/λ) Σ_k ∇_θ log π(x_k|θ) ∇_θ log π(x_k|θ)ᵀ   // or compute it exactly
        update parameters θ ← θ + η F⁻¹ ∇_θ J              // η is the learning rate
    until stopping criterion is met
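Putting the pieces together, the following runnable sketch implements the loop for a diagonal (separable) Gaussian, for which the natural-gradient update is known in closed form, so no explicit Fisher matrix is needed (as in the separable NES variant). The function name, learning-rate defaults, and fixed iteration budget are illustrative assumptions:

    import numpy as np

    def snes_minimize(f, mu, sigma, lam=20, eta_mu=1.0, eta_sigma=None,
                      iterations=200, rng=None):
        """Separable NES sketch; minimizes f by maximizing -f."""
        rng = rng if rng is not None else np.random.default_rng(0)
        d = mu.size
        if eta_sigma is None:
            eta_sigma = (3 + np.log(d)) / (5 * np.sqrt(d))  # common default from the literature
        for _ in range(iterations):                         # fixed budget as stopping criterion
            s = rng.standard_normal((lam, d))               # standard-normal samples
            x = mu + sigma * s                              # population in problem space
            fitness = -np.apply_along_axis(f, 1, x)         # maximize -f to minimize f
            # rank-based utilities (best individual gets the largest utility)
            ranks = np.empty(lam, dtype=int)
            ranks[np.argsort(fitness)[::-1]] = np.arange(1, lam + 1)
            u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
            u = u / u.sum() - 1.0 / lam
            # closed-form natural-gradient updates in local coordinates
            mu = mu + eta_mu * sigma * (u @ s)
            sigma = sigma * np.exp(0.5 * eta_sigma * (u @ (s**2 - 1)))
        return mu

    # usage: minimize a 10-dimensional sphere function
    best = snes_minimize(lambda x: float(np.sum(x**2)), np.ones(10), np.ones(10))

Note that the multiplicative exponential update keeps the standard deviations positive by construction, which is one reason this parameterization is widely used.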
