# Stochastic approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In a nutshell, stochastic approximation algorithms deal with a function of the form ${\textstyle f(\theta )=\operatorname {E} _{\xi }[F(\theta ,\xi )]}$ which is the expected value of a function depending on a random variable ${\textstyle \xi }$. The goal is to recover properties of such a function ${\textstyle f}$ without evaluating it directly. Instead, stochastic approximation algorithms use random samples of ${\textstyle F(\theta ,\xi )}$ to efficiently approximate properties of ${\textstyle f}$ such as zeros or extrema.

More recently, stochastic approximation has found extensive applications in statistics and machine learning, especially in settings with big data. These applications range from stochastic optimization methods and algorithms to online forms of the EM algorithm, reinforcement learning via temporal differences, and deep learning, among others. [1] Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory. [2]

The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms introduced respectively in 1951 and 1952.

## Robbins–Monro algorithm

The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro, [3] presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function ${\textstyle M(\theta )}$, and a constant ${\textstyle \alpha }$, such that the equation ${\textstyle M(\theta )=\alpha }$ has a unique root at ${\textstyle \theta ^{*}}$. It is assumed that while we cannot directly observe the function ${\textstyle M(\theta )}$, we can instead obtain measurements of the random variable ${\textstyle N(\theta )}$ where ${\textstyle \operatorname {E} [N(\theta )]=M(\theta )}$. The structure of the algorithm is to then generate iterates of the form:

${\displaystyle \theta _{n+1}=\theta _{n}-a_{n}(N(\theta _{n})-\alpha )}$

Here, ${\displaystyle a_{1},a_{2},\dots }$ is a sequence of positive step sizes. Robbins and Monro proved ([3], Theorem 2) that ${\displaystyle \theta _{n}}$ converges in ${\displaystyle L^{2}}$ (and hence also in probability) to ${\displaystyle \theta ^{*}}$, and Blum [4] later proved that the convergence actually holds with probability one, provided that:

• ${\textstyle N(\theta )}$ is uniformly bounded,
• ${\textstyle M(\theta )}$ is nondecreasing,
• ${\textstyle M'(\theta ^{*})}$ exists and is positive, and
• The sequence ${\textstyle a_{n}}$ satisfies the following requirements:
${\displaystyle \qquad \sum _{n=0}^{\infty }a_{n}=\infty \quad {\mbox{ and }}\quad \sum _{n=0}^{\infty }a_{n}^{2}<\infty \quad }$

A particular sequence of step sizes that satisfies these conditions, suggested by Robbins and Monro, has the form ${\textstyle a_{n}=a/n}$, for ${\textstyle a>0}$. Other sequences are possible, but in order to average out the noise in ${\textstyle N(\theta )}$, the above condition must be met.
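The iteration above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `robbins_monro`, the test function ${\textstyle M(\theta )=2\theta +1}$ and the bounded uniform noise are assumptions chosen for the example, and the step sizes follow ${\textstyle a_{n}=a/n}$.

```python
import random


def robbins_monro(noisy_measurement, alpha, theta0, a=1.0, n_steps=10_000):
    """Iterate theta_{n+1} = theta_n - a_n * (N(theta_n) - alpha) with a_n = a / n."""
    theta = theta0
    for n in range(1, n_steps + 1):
        a_n = a / n  # step sizes: sum a_n diverges, sum a_n^2 converges
        theta -= a_n * (noisy_measurement(theta) - alpha)
    return theta


# Hypothetical test problem (not from the article): M(theta) = 2*theta + 1,
# observed through additive noise that is uniform on [-1, 1].
rng = random.Random(0)


def noisy_measurement(theta):
    return 2.0 * theta + 1.0 + rng.uniform(-1.0, 1.0)


# The root of M(theta) = 5 is theta* = 2.
theta_hat = robbins_monro(noisy_measurement, alpha=5.0, theta0=0.0)
print(theta_hat)
```

Only noisy evaluations of ${\textstyle N(\theta )}$ are used; the function ${\textstyle M}$ itself is never evaluated by the routine.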

### Complexity results

1. If ${\textstyle f(\theta )}$ is twice continuously differentiable, and strongly convex, and the minimizer of ${\textstyle f(\theta )}$ belongs to the interior of ${\textstyle \Theta }$, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, being ${\textstyle \operatorname {E} [f(\theta _{n})-f^{*}]=O(1/n)}$, where ${\textstyle f^{*}}$ is the minimal value of ${\textstyle f(\theta )}$ over ${\textstyle \theta \in \Theta }$. [5] [6]
2. Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin [7] have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is ${\textstyle O(1/{\sqrt {n}})}$. They have also proven that this rate cannot be improved.

### Subsequent developments and Polyak–Ruppert averaging

While the Robbins–Monro algorithm is theoretically able to achieve ${\textstyle O(1/n)}$ under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily due to the fact that the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning. [6] [8]

Chung [9] (1954) and Fabian [10] (1968) showed that we would achieve the optimal convergence rate ${\textstyle O(1/{\sqrt {n}})}$ with ${\textstyle a_{n}=\nabla ^{2}f(\theta ^{*})^{-1}/n}$ (or ${\textstyle a_{n}={\frac {1}{(nM'(\theta ^{*}))}}}$). Lai and Robbins [11] [12] designed adaptive procedures to estimate ${\textstyle M'(\theta ^{*})}$ such that ${\textstyle \theta _{n}}$ has minimal asymptotic variance. However, the application of such optimal methods requires much a priori information, which is hard to obtain in most situations. To overcome this shortfall, Polyak [13] (1991) and Ruppert [14] (1988) independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky [15] also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps and averaging of the iterates. The algorithm would have the following structure:

${\displaystyle \theta _{n+1}-\theta _{n}=a_{n}(\alpha -N(\theta _{n})),\qquad {\bar {\theta }}_{n}={\frac {1}{n}}\sum _{i=0}^{n-1}\theta _{i}}$

The convergence of ${\displaystyle {\bar {\theta }}_{n}}$ to the unique root ${\displaystyle \theta ^{*}}$ relies on the condition that the step sequence ${\displaystyle \{a_{n}\}}$ decreases sufficiently slowly. That is, the sequence must satisfy condition A1):

${\displaystyle a_{n}\rightarrow 0,\qquad {\frac {a_{n}-a_{n+1}}{a_{n}}}=o(a_{n})}$

Therefore, the sequence ${\textstyle a_{n}=n^{-\alpha }}$ with ${\textstyle 0<\alpha <1}$ satisfies this restriction, but ${\textstyle \alpha =1}$ does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate ${\textstyle O(1/{\sqrt {n}})}$ yet with a more robust step size policy. [15] Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin [16] for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate ${\textstyle O(1/{\sqrt {n}})}$.
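The averaged scheme can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the name `averaged_robbins_monro`, the linear test function and the noise model are hypothetical, and the step exponent is called `gamma` in the code (the article writes it as ${\textstyle \alpha }$, which would clash with the target level ${\textstyle \alpha }$ of the root-finding problem).

```python
import random


def averaged_robbins_monro(noisy_measurement, alpha, theta0, gamma=0.7, n_steps=20_000):
    """Robbins-Monro with longer steps a_n = n**(-gamma), 0 < gamma < 1, plus averaging.

    Returns the last iterate and the average (1/n) * sum_{i=0}^{n-1} theta_i.
    """
    theta = theta0
    running_sum = 0.0
    for n in range(1, n_steps + 1):
        running_sum += theta  # accumulates theta_0, ..., theta_{n-1}
        a_n = n ** (-gamma)
        theta += a_n * (alpha - noisy_measurement(theta))
    return theta, running_sum / n_steps


# Hypothetical test problem: M(theta) = 2*theta + 1 with uniform noise on [-1, 1].
rng = random.Random(0)


def noisy_measurement(theta):
    return 2.0 * theta + 1.0 + rng.uniform(-1.0, 1.0)


# The root of M(theta) = 5 is theta* = 2.
last_iterate, averaged = averaged_robbins_monro(noisy_measurement, alpha=5.0, theta0=0.0)
print(last_iterate, averaged)
```

The last iterate fluctuates on the scale set by the current (slowly decaying) step size, while the running average smooths these fluctuations out, which is the point of combining longer steps with averaging.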

A more general result is given in Chapter 11 of Kushner and Yin [17] by defining interpolated time ${\textstyle t_{n}=\sum _{i=0}^{n-1}a_{i}}$, interpolated process ${\textstyle \theta ^{n}(\cdot )}$ and interpolated normalized process ${\textstyle U^{n}(\cdot )}$ as

${\displaystyle \theta ^{n}(t)=\theta _{n+i},\quad U^{n}(t)=(\theta _{n+i}-\theta ^{*})/{\sqrt {a_{n+i}}}\quad {\mbox{for}}\quad t\in [t_{n+i}-t_{n},t_{n+i+1}-t_{n}),i\geq 0}$

Let the iterate average be ${\displaystyle \Theta _{n}={\frac {a_{n}}{t}}\sum _{i=n}^{n+t/a_{n}-1}\theta _{i}}$ and the associated normalized error be ${\displaystyle {\hat {U}}^{n}(t)={\frac {\sqrt {a_{n}}}{t}}\sum _{i=n}^{n+t/a_{n}-1}(\theta _{i}-\theta ^{*})}$.

Assume that A1) and the following condition A2) are satisfied:

A2) There is a Hurwitz matrix ${\textstyle A}$ and a symmetric and positive-definite matrix ${\textstyle \Sigma }$ such that ${\textstyle \{U^{n}(\cdot )\}}$ converges weakly to ${\textstyle U(\cdot )}$, where ${\textstyle U(\cdot )}$ is the stationary solution to

${\displaystyle dU=AU\,dt+\Sigma ^{1/2}\,dw}$

where ${\textstyle w(\cdot )}$ is a standard Wiener process.

Define ${\textstyle {\bar {V}}=(A^{-1})'\Sigma (A')^{-1}}$. Then for each ${\textstyle t}$,

${\displaystyle {\hat {U}}^{n}(t){\stackrel {\mathcal {D}}{\longrightarrow }}{\mathcal {N}}(0,V_{t}),\quad {\text{where}}\quad V_{t}={\bar {V}}/t+O(1/t^{2}).}$

The averaging idea succeeds because of the time-scale separation between the original sequence ${\textstyle \{\theta _{n}\}}$ and the averaged sequence ${\textstyle \{\Theta _{n}\}}$, with the former evolving on the faster time scale.

### Application in stochastic optimization

Suppose we want to solve the following stochastic optimization problem

${\displaystyle g(\theta ^{*})=\min _{\theta \in \Theta }\operatorname {E} [Q(\theta ,X)],}$

where ${\textstyle g(\theta )=\operatorname {E} [Q(\theta ,X)]}$ is differentiable and convex. This problem is then equivalent to finding the root ${\displaystyle \theta ^{*}}$ of ${\displaystyle \nabla g(\theta )=0}$. Here ${\displaystyle Q(\theta ,X)}$ can be interpreted as some "observed" cost as a function of the chosen ${\displaystyle \theta }$ and random effects ${\displaystyle X}$. In practice, it might be hard to get an analytical form of ${\displaystyle \nabla g(\theta )}$. The Robbins–Monro method manages to generate a sequence ${\displaystyle (\theta _{n})_{n\geq 0}}$ to approximate ${\displaystyle \theta ^{*}}$ if one can generate ${\displaystyle (X_{n})_{n\geq 0}}$, in which the conditional expectation of ${\displaystyle X_{n}}$ given ${\displaystyle \theta _{n}}$ is exactly ${\displaystyle \nabla g(\theta _{n})}$, i.e. ${\displaystyle X_{n}}$ is simulated from a conditional distribution defined by

${\displaystyle \operatorname {E} [H(\theta ,X)|\theta =\theta _{n}]=\nabla g(\theta _{n}).}$

Here ${\displaystyle H(\theta ,X)}$ is an unbiased estimator of ${\displaystyle \nabla g(\theta )}$. If ${\displaystyle X}$ depends on ${\displaystyle \theta }$, there is in general no natural way of generating a random outcome ${\displaystyle H(\theta ,X)}$ that is an unbiased estimator of the gradient. In some special cases when either IPA or likelihood ratio methods are applicable, then one is able to obtain an unbiased gradient estimator ${\displaystyle H(\theta ,X)}$. If ${\displaystyle X}$ is viewed as some "fundamental" underlying random process that is generated independently of ${\displaystyle \theta }$, and under some regularization conditions for derivative-integral interchange operations so that ${\displaystyle \operatorname {E} {\Big [}{\frac {\partial }{\partial \theta }}Q(\theta ,X){\Big ]}=\nabla g(\theta )}$, then ${\displaystyle H(\theta ,X)={\frac {\partial }{\partial \theta }}Q(\theta ,X)}$ gives the fundamental gradient unbiased estimate. However, for some applications we have to use finite-difference methods in which ${\displaystyle H(\theta ,X)}$ has a conditional expectation close to ${\displaystyle \nabla g(\theta )}$ but not exactly equal to it.

We then define a recursion analogously to Newton's method in the deterministic algorithm:

${\displaystyle \theta _{n+1}=\theta _{n}-\varepsilon _{n}H(\theta _{n},X_{n+1}).}$

#### Convergence of the algorithm

The following result gives sufficient conditions on ${\displaystyle \theta _{n}}$ for the algorithm to converge: [18]

C1) ${\displaystyle \varepsilon _{n}\geq 0,\forall \;n\geq 0.}$

C2) ${\displaystyle \sum _{n=0}^{\infty }\varepsilon _{n}=\infty }$

C3) ${\displaystyle \sum _{n=0}^{\infty }\varepsilon _{n}^{2}<\infty }$

C4) ${\displaystyle |X_{n}|\leq B,{\text{ for a fixed bound }}B.}$

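C5) ${\displaystyle g(\theta ){\text{ is strictly convex, i.e.}}}$ ${\displaystyle \inf _{\delta \leq |\theta -\theta ^{*}|\leq 1/\delta }\langle \theta -\theta ^{*},\nabla g(\theta )\rangle >0,{\text{ for every }}0<\delta <1.}$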

Then ${\displaystyle \theta _{n}}$ converges to ${\displaystyle \theta ^{*}}$ almost surely.

Here are some intuitive explanations of these conditions. Suppose ${\displaystyle H(\theta _{n},X_{n+1})}$ is a uniformly bounded random variable. If C2) is not satisfied, i.e. ${\displaystyle \sum _{n=0}^{\infty }\varepsilon _{n}<\infty }$, then

${\displaystyle \theta _{n}-\theta _{0}=-\sum _{i=0}^{n-1}\varepsilon _{i}H(\theta _{i},X_{i+1})}$

is a bounded sequence, so the iteration cannot converge to ${\displaystyle \theta ^{*}}$ if the initial guess ${\displaystyle \theta _{0}}$ is too far away from ${\displaystyle \theta ^{*}}$. As for C3), note that if ${\displaystyle \theta _{n}}$ converges to ${\displaystyle \theta ^{*}}$ then

${\displaystyle \theta _{n+1}-\theta _{n}=-\varepsilon _{n}H(\theta _{n},X_{n+1})\rightarrow 0,\quad {\text{as }}n\to \infty ,}$

so we must have ${\displaystyle \varepsilon _{n}\downarrow 0}$, and the condition C3) ensures it. A natural choice would be ${\displaystyle \varepsilon _{n}=1/n}$. Condition C5) is a fairly stringent condition on the shape of ${\displaystyle g(\theta )}$; it gives the search direction of the algorithm.

#### Example (where the stochastic gradient method is appropriate) [8]

Suppose ${\displaystyle Q(\theta ,X)=f(\theta )+\theta ^{T}X}$, where ${\displaystyle f}$ is differentiable and ${\displaystyle X\in \mathbb {R} ^{p}}$ is a random variable independent of ${\displaystyle \theta }$. Then ${\displaystyle g(\theta )=\operatorname {E} [Q(\theta ,X)]=f(\theta )+\theta ^{T}\operatorname {E} X}$ depends on the mean of ${\displaystyle X}$, and the stochastic gradient method would be appropriate in this problem. We can choose ${\displaystyle H(\theta ,X)={\frac {\partial }{\partial \theta }}Q(\theta ,X)={\frac {\partial }{\partial \theta }}f(\theta )+X.}$
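A minimal Python sketch of this example follows, using the update ${\displaystyle \theta _{n+1}=\theta _{n}-\varepsilon _{n}H(\theta _{n},X_{n+1})}$ with ${\displaystyle \varepsilon _{n}=1/n}$ and ${\displaystyle H(\theta ,X)=\nabla f(\theta )+X}$. The choices ${\displaystyle f(\theta )={\tfrac {1}{2}}\|\theta \|^{2}}$, ${\displaystyle p=2}$, the uniform noise and all function names are assumptions made for illustration; under them the minimizer is ${\displaystyle \theta ^{*}=-\operatorname {E} X}$.

```python
import random


def stochastic_gradient(grad_f, sample_x, theta0, n_steps=5_000):
    """Iterate theta_{n+1} = theta_n - eps_n * H(theta_n, X_{n+1}), eps_n = 1 / n,
    with the unbiased gradient estimate H(theta, X) = grad_f(theta) + X."""
    theta = list(theta0)
    for n in range(1, n_steps + 1):
        eps_n = 1.0 / n
        x = sample_x()  # X_{n+1}, drawn independently of theta
        h = [df + xi for df, xi in zip(grad_f(theta), x)]
        theta = [t - eps_n * hi for t, hi in zip(theta, h)]
    return theta


# Hypothetical instance: f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta,
# and X = mean_x + bounded uniform noise, so E[X] = mean_x.
rng = random.Random(1)
mean_x = [1.0, -2.0]


def sample_x():
    return [m + rng.uniform(-1.0, 1.0) for m in mean_x]


def grad_f(theta):
    return list(theta)


# g(theta) = 0.5 * ||theta||^2 + theta^T E[X] is minimized at theta* = -E[X] = [-1, 2].
theta_hat = stochastic_gradient(grad_f, sample_x, theta0=[0.0, 0.0])
print(theta_hat)
```

For this particular choice of ${\displaystyle f}$ and step sizes, the recursion reduces to the negative running sample mean of the draws of ${\displaystyle X}$, which makes the averaging effect of the ${\displaystyle 1/n}$ steps easy to see.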

## Kiefer–Wolfowitz algorithm

The Kiefer–Wolfowitz algorithm was introduced in 1952 by Jacob Wolfowitz and Jack Kiefer, [19] and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let ${\displaystyle M(x)}$ be a function which has a maximum at the point ${\displaystyle \theta }$. It is assumed that ${\displaystyle M(x)}$ is unknown; however, certain observations ${\displaystyle N(x)}$, where ${\displaystyle \operatorname {E} [N(x)]=M(x)}$, can be made at any point ${\displaystyle x}$. The structure of the algorithm follows a gradient-like method, with the iterates being generated as follows:

${\displaystyle x_{n+1}=x_{n}+a_{n}{\bigg (}{\frac {N(x_{n}+c_{n})-N(x_{n}-c_{n})}{2c_{n}}}{\bigg )}}$

where ${\displaystyle N(x_{n}+c_{n})}$ and ${\displaystyle N(x_{n}-c_{n})}$ are independent, and the gradient of ${\displaystyle M(x)}$ is approximated using finite differences. The sequence ${\displaystyle \{c_{n}\}}$ specifies the sequence of finite difference widths used for the gradient approximation, while the sequence ${\displaystyle \{a_{n}\}}$ specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if ${\displaystyle M(x)}$ satisfied certain regularity conditions, then ${\displaystyle x_{n}}$ will converge to ${\displaystyle \theta }$ in probability as ${\displaystyle n\to \infty }$, and later Blum [4] in 1954 showed ${\displaystyle x_{n}}$ converges to ${\displaystyle \theta }$ almost surely, provided that:

• ${\displaystyle \operatorname {Var} (N(x))\leq S<\infty }$ for all ${\displaystyle x}$.
• The function ${\displaystyle M(x)}$ has a unique point of maximum (minimum) and is strongly concave (convex)
• The algorithm was first presented with the requirement that the function ${\displaystyle M(\cdot )}$ maintains strong global convexity (concavity) over the entire feasible space. Given this condition is too restrictive to impose over the entire domain, Kiefer and Wolfowitz proposed that it is sufficient to impose the condition over a compact set ${\displaystyle C_{0}\subset \mathbb {R} ^{d}}$ which is known to include the optimal solution.
• The function ${\displaystyle M(x)}$ satisfies the regularity conditions as follows:
• There exists ${\displaystyle \beta >0}$ and ${\displaystyle B>0}$ such that
${\displaystyle |x'-\theta |+|x''-\theta |<\beta \quad \Longrightarrow \quad |M(x')-M(x'')|<B|x'-x''|}$
• There exists ${\displaystyle \rho >0}$ and ${\displaystyle R>0}$ such that
${\displaystyle |x'-x''|<\rho \quad \Longrightarrow \quad |M(x')-M(x'')|<R}$
• For every ${\displaystyle \delta >0}$, there exists some ${\displaystyle \pi (\delta )>0}$ such that
${\displaystyle |z-\theta |>\delta \quad \Longrightarrow \quad \inf _{\delta /2>\varepsilon >0}{\frac {|M(z+\varepsilon )-M(z-\varepsilon )|}{\varepsilon }}>\pi (\delta )}$
• The selected sequences ${\displaystyle \{a_{n}\}}$ and ${\displaystyle \{c_{n}\}}$ must be infinite sequences of positive numbers such that
• ${\displaystyle \quad c_{n}\rightarrow 0\quad {\text{as}}\quad n\to \infty }$
• ${\displaystyle \sum _{n=0}^{\infty }a_{n}=\infty }$
• ${\displaystyle \sum _{n=0}^{\infty }a_{n}c_{n}<\infty }$
• ${\displaystyle \sum _{n=0}^{\infty }a_{n}^{2}c_{n}^{-2}<\infty }$

A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be ${\displaystyle a_{n}=1/n}$ and ${\displaystyle c_{n}=n^{-1/3}}$.
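The one-dimensional iteration with these sequences can be sketched as below. The sketch is illustrative only: the function name `kiefer_wolfowitz`, the concave quadratic test function and the Gaussian noise level are assumptions, not part of the original presentation.

```python
import random


def kiefer_wolfowitz(noisy_value, x0, n_steps=20_000):
    """Iterate x_{n+1} = x_n + a_n * (N(x_n + c_n) - N(x_n - c_n)) / (2 c_n)
    with a_n = 1 / n and c_n = n**(-1/3)."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        # Two independent noisy observations give a finite-difference slope estimate.
        slope = (noisy_value(x + c_n) - noisy_value(x - c_n)) / (2.0 * c_n)
        x += a_n * slope  # ascent step, since we seek a maximum
    return x


# Hypothetical regression function: M(x) = -(x - 3)^2, maximized at theta = 3,
# observed with additive Gaussian noise of bounded variance.
rng = random.Random(2)


def noisy_value(x):
    return -((x - 3.0) ** 2) + rng.gauss(0.0, 0.1)


x_hat = kiefer_wolfowitz(noisy_value, x0=0.0)
print(x_hat)
```

Shrinking ${\displaystyle c_{n}}$ reduces the bias of the finite difference but inflates its noise by a factor ${\displaystyle 1/c_{n}}$, which is why the conditions couple ${\displaystyle a_{n}}$ and ${\displaystyle c_{n}}$.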

### Subsequent developments and important issues

1. The Kiefer–Wolfowitz algorithm requires that for each gradient computation, at least ${\displaystyle d+1}$ different parameter values must be simulated for every iteration of the algorithm, where ${\displaystyle d}$ is the dimension of the search space. This means that when ${\displaystyle d}$ is large, the Kiefer–Wolfowitz algorithm will require substantial computational effort per iteration, leading to slow convergence.
   • To address this problem, Spall proposed the use of simultaneous perturbations to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension ${\displaystyle d}$. [20]
2. In the conditions required for convergence, a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult to specify. With respect to real-world applications, if the domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
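The simultaneous-perturbation idea can be sketched as follows. The article only states that two simulations per iteration suffice; the remaining details here are assumptions taken from the usual description of the method and chosen for illustration: random ±1 perturbation directions, the gain sequences ${\displaystyle a_{n}=1/(n+10)}$ and ${\displaystyle c_{n}=n^{-1/3}}$, the name `spsa_maximize`, and the quadratic test function.

```python
import random


def spsa_maximize(noisy_value, x0, n_steps=20_000, seed=0):
    """Gradient-like ascent using a simultaneous-perturbation gradient estimate.

    Each iteration calls noisy_value exactly twice, whatever the dimension of x.
    """
    rng = random.Random(seed)
    x = list(x0)
    for n in range(1, n_steps + 1):
        a_n = 1.0 / (n + 10)     # assumed step sizes (offset damps the early steps)
        c_n = n ** (-1.0 / 3.0)  # assumed perturbation widths
        # Perturb all coordinates at once along a random +/-1 direction.
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
        y_plus = noisy_value([xi + c_n * di for xi, di in zip(x, delta)])
        y_minus = noisy_value([xi - c_n * di for xi, di in zip(x, delta)])
        diff = (y_plus - y_minus) / (2.0 * c_n)
        # Component i of the gradient estimate is diff / delta_i.
        x = [xi + a_n * diff / di for xi, di in zip(x, delta)]
    return x


# Hypothetical objective in d = 5 dimensions, maximized at `target`.
noise_rng = random.Random(3)
target = [1.0, -2.0, 3.0, -4.0, 5.0]


def noisy_value(x):
    exact = -sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    return exact + noise_rng.gauss(0.0, 0.1)


x_hat = spsa_maximize(noisy_value, x0=[0.0] * 5)
print(x_hat)
```

A coordinate-wise finite-difference scheme would need on the order of ${\displaystyle d}$ noisy evaluations per iteration for the same update; here the cost per iteration stays at two.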

## Further developments

An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on. [21] [22] These methods are also applied in control theory, in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step size ${\displaystyle a_{n}}$ should not converge to zero but should be chosen so as to track the function. [21] , 2nd ed., chapter 3
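The effect of the step size in the time-varying case can be illustrated with a small experiment. This is a toy sketch under assumed conditions: the root drifts linearly in time, the observation is ${\displaystyle \theta -r_{n}}$ plus Gaussian noise, and `track_root` and its parameters are hypothetical names.

```python
import random


def track_root(step_size, n_steps=5_000, drift=0.001, seed=4):
    """Follow the zero of a time-varying function M_n(theta) = theta - r_n,
    where the root r_n = drift * n moves slowly and observations are noisy."""
    rng = random.Random(seed)
    theta = 0.0
    root = 0.0
    for n in range(1, n_steps + 1):
        root = drift * n
        observation = theta - root + rng.gauss(0.0, 0.1)  # noisy value of M_n(theta)
        theta -= step_size(n) * observation
    return theta, root


theta_const, final_root = track_root(lambda n: 0.1)      # constant step size
theta_decr, _ = track_root(lambda n: 1.0 / n)            # step size decaying to zero
print(final_root, theta_const, theta_decr)
```

With the decaying steps the iterate behaves like a running average of all past roots and lags far behind the current one, while the constant step keeps the iterate near the moving root at the price of a noise floor that does not vanish.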

C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation. [23]

The main tool for analyzing stochastic approximation algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in the proceedings of the third Berkeley symposium on mathematical statistics and probability, 1956. [24]

## References

1. Toulis, Panos; Airoldi, Edoardo (2015). "Scalable estimation strategies based on stochastic approximations: classical results and new insights". Statistics and Computing. 25 (4): 781–795. doi:10.1007/s11222-015-9560-y. PMID 26139959.
2. Le Ny, Jerome. "Introduction to Stochastic Approximation Algorithms" (PDF). Polytechnique Montreal. Teaching Notes. Retrieved 16 November 2016.
3. Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400.
4. Blum, Julius R. (1954-06-01). "Approximation Methods which Converge with Probability one". The Annals of Mathematical Statistics. 25 (2): 382–386. ISSN 0003-4851.
5. Sacks, J. (1958). "Asymptotic Distribution of Stochastic Approximation Procedures". The Annals of Mathematical Statistics. 29 (2): 373–405. JSTOR 2237335.
6. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. (2009). "Robust Stochastic Approximation Approach to Stochastic Programming". SIAM Journal on Optimization. 19 (4): 1574. doi:10.1137/070704277.
7. Nemirovski, A.; Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Intersci. Ser. Discrete Math. 15. New York: John Wiley.
8. Spall, J. C. (2003). Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Hoboken, NJ: John Wiley.
9. Chung, K. L. (1954-09-01). "On a Stochastic Approximation Method". The Annals of Mathematical Statistics. 25 (3): 463–483. ISSN 0003-4851.
10. Fabian, Vaclav (1968-08-01). "On Asymptotic Normality in Stochastic Approximation". The Annals of Mathematical Statistics. 39 (4): 1327–1332. ISSN 0003-4851.
11. Lai, T. L.; Robbins, Herbert (1979-11-01). "Adaptive Design and Stochastic Approximation". The Annals of Statistics. 7 (6): 1196–1221. ISSN 0090-5364.
12. Lai, Tze Leung; Robbins, Herbert (1981-09-01). "Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes". Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 56 (3): 329–360. doi:10.1007/BF00536178. ISSN 0044-3719. S2CID 122109044.
13. Polyak, B. T. (1990-01-01). "New stochastic approximation type procedures. (In Russian.)". 7 (7).
14. Ruppert, D. "Efficient estimators from a slowly converging Robbins–Monro process".
15. Polyak, B. T.; Juditsky, A. B. (1992). "Acceleration of Stochastic Approximation by Averaging". SIAM Journal on Control and Optimization. 30 (4): 838. doi:10.1137/0330046.
16. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, Dokl. Akad. Nauk SSR2939, (1978 (Russian)), Soviet Math. Dokl. 19 (1978 (English)).
17. Kushner, Harold; Yin, G. George (2003-07-17). Stochastic Approximation and Recursive Algorithms and Applications. Springer. ISBN 9780387008943. Retrieved 2016-05-16.
18. Bouleau, N.; Lepingle, D. (1994). Numerical Methods for Stochastic Processes. New York: John Wiley. ISBN 9780471546412.
19. Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462.
20. Spall, J. C. (2000). "Adaptive stochastic approximation by the simultaneous perturbation method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853. doi:10.1109/TAC.2000.880982.
21. Kushner, H. J.; Yin, G. G. (1997). Stochastic Approximation Algorithms and Applications. doi:10.1007/978-1-4899-2696-8. ISBN   978-1-4899-2698-2.
22. Stochastic Approximation and Recursive Estimation, Mikhail Borisovich Nevel'son and Rafail Zalmanovich Has'minskiĭ, translated by Israel Program for Scientific Translations and B. Silver, Providence, RI: American Mathematical Society, 1973, 1976. ISBN   0-8218-1597-0.
23. Martin, R.; Masreliez, C. (1975). "Robust estimation via stochastic approximation". IEEE Transactions on Information Theory. 21 (3): 263. doi:10.1109/TIT.1975.1055386.
24. Dvoretzky, Aryeh (1956-01-01). "On Stochastic Approximation". The Regents of the University of California.