Simultaneous perturbation stochastic approximation

Simultaneous perturbation stochastic approximation (SPSA) is an algorithmic method for optimizing systems with multiple unknown parameters. It is a type of stochastic approximation algorithm. As an optimization method, it is well suited to large-scale population models, adaptive modeling, simulation optimization, and atmospheric modeling. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987), and the foundational paper providing the key theory and justification is Spall (1992).

SPSA is a descent method capable of finding global minima, sharing this property with other methods such as simulated annealing. Its main feature is the gradient approximation that requires only two measurements of the objective function, regardless of the dimension of the optimization problem. Recall that we want to find the optimal control $u^*$ with loss function $J(u)$:

$$u^* = \arg\min_{u \in U} J(u).$$

Both Finite Differences Stochastic Approximation (FDSA) and SPSA use the same iterative process:

$$u_{n+1} = u_n - a_n \hat{g}_n(u_n),$$

where $u_n$ represents the $n$th iterate, $\hat{g}_n(u_n)$ is the estimate of the gradient of the objective function $g(u) = \nabla_u J(u)$ evaluated at $u_n$, and $\{a_n\}$ is a positive number sequence converging to 0. If $u_n$ is a $p$-dimensional vector, the $i$th component of the symmetric finite difference gradient estimator is:

FD:

$$(\hat{g}_n(u_n))_i = \frac{J(u_n + c_n e_i) - J(u_n - c_n e_i)}{2 c_n}, \qquad 1 \le i \le p,$$

where $e_i$ is the unit vector with a 1 in the $i$th place, and $c_n$ is a small positive number that decreases with $n$. With this method, $2p$ evaluations of $J$ are needed for each $\hat{g}_n$. When $p$ is large, this estimator loses efficiency.
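For concreteness, a minimal NumPy sketch of the FD estimator (the function name and interface are illustrative; `J` is any scalar loss):

```python
import numpy as np

def fdsa_gradient(J, u, c_n):
    """Symmetric finite-difference gradient estimate of J at u (costs 2p calls to J)."""
    p = u.size
    g_hat = np.empty(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0  # unit vector with a 1 in the i-th place
        g_hat[i] = (J(u + c_n * e_i) - J(u - c_n * e_i)) / (2.0 * c_n)
    return g_hat
```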

Let $\Delta_n$ now be a random perturbation vector. The $i$th component of the stochastic perturbation gradient estimator is:

SP:

$$(\hat{g}_n(u_n))_i = \frac{J(u_n + c_n \Delta_n) - J(u_n - c_n \Delta_n)}{2 c_n (\Delta_n)_i}.$$

Remark that FD perturbs only one direction at a time, while the SP estimator perturbs all directions at the same time (the numerator is identical in all $p$ components). The number of loss function measurements needed in the SPSA method for each $\hat{g}_n$ is always 2, independent of the dimension $p$. Thus, SPSA uses $p$ times fewer function evaluations than FDSA, which makes it far more efficient.
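By contrast, a sketch of the SP estimator under the same conventions; it calls `J` exactly twice whatever the dimension, here using Rademacher perturbations (the standard choice, discussed below):

```python
import numpy as np

def spsa_gradient(J, u, c_n, rng=None):
    """Simultaneous perturbation gradient estimate of J at u (always 2 calls to J)."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=u.size)  # random perturbation vector
    # One common numerator for all p components; only the denominator varies.
    diff = J(u + c_n * delta) - J(u - c_n * delta)
    return diff / (2.0 * c_n * delta)
```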

Simple experiments with $p=2$ showed that SPSA converges in the same number of iterations as FDSA. The latter follows approximately the steepest descent direction, behaving like the gradient method. On the other hand, SPSA, with its random search direction, does not follow the gradient path exactly. On average, though, it tracks it closely because the gradient approximation is a nearly unbiased estimator of the gradient, as shown in the following lemma.

Convergence lemma

Denote by

$$b_n = E[\hat{g}_n \mid u_n] - \nabla J(u_n)$$

the bias in the estimator $\hat{g}_n$. Assume that the $(\Delta_n)_i$ are all mutually independent with zero mean and bounded second moments, and that $E(|(\Delta_n)_i|^{-1})$ is uniformly bounded. Then $b_n \to 0$ w.p. 1.

Sketch of the proof

The main idea is to use conditioning on $\Delta_n$ to express $E[(\hat{g}_n)_i \mid u_n]$ and then to use a second-order Taylor expansion of $J(u_n + c_n \Delta_n)$ and $J(u_n - c_n \Delta_n)$. After algebraic manipulations using the zero mean and the independence of the $(\Delta_n)_i$, we get

$$E[(\hat{g}_n)_i \mid u_n] = (g(u_n))_i + O(c_n^2).$$

The result follows from the hypothesis that $c_n \to 0$.
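To spell out the expansion step (a sketch; $H$ denotes the Hessian of $J$): the second-order Taylor expansions

$$J(u_n \pm c_n \Delta_n) = J(u_n) \pm c_n \Delta_n^T g(u_n) + \frac{c_n^2}{2} \Delta_n^T H(u_n) \Delta_n + O(c_n^3)$$

make the even-order terms cancel in the SP quotient, leaving

$$(\hat{g}_n(u_n))_i = (g(u_n))_i + \sum_{j \neq i} \frac{(\Delta_n)_j}{(\Delta_n)_i} (g(u_n))_j + O(c_n^2).$$

Conditioning on $u_n$, each cross term has zero expectation because $(\Delta_n)_j$ and $(\Delta_n)_i^{-1}$ are independent with $E[(\Delta_n)_j] = 0$, which gives the displayed bias bound.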

Next we summarize some of the hypotheses under which $u_n$ converges in probability to the set of global minima of $J(u)$. The efficiency of the method depends on the shape of $J(u)$, the values of the parameters $a_n$ and $c_n$, and the distribution of the perturbation terms $(\Delta_n)_i$. First, the algorithm parameters must satisfy the following conditions:

  1. $a_n > 0$, $a_n \to 0$ as $n \to \infty$, and $\sum_{n=1}^{\infty} a_n = \infty$; a good choice is $a_n = a/n$ with $a > 0$;
  2. $c_n = c/n^{\gamma}$ with $c > 0$ and $\gamma \in [1/6, 1/2]$, so that $c_n \to 0$;
  3. $\sum_{n=1}^{\infty} (a_n/c_n)^2 < \infty$;
  4. the $(\Delta_n)_i$ must be mutually independent zero-mean random variables, symmetrically distributed about zero, with $E(|(\Delta_n)_i|^{-1})$ uniformly bounded.

A good choice for $(\Delta_n)_i$ is the Rademacher distribution, i.e. Bernoulli $\pm 1$ with probability 0.5. Other choices are possible too, but note that the uniform and normal distributions cannot be used because they do not satisfy the finite-inverse-moment condition.
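Putting the pieces together, a minimal sketch of the full SPSA recursion with gains satisfying the conditions above ($a_n = a/n$, $c_n = c/n^{\gamma}$); the constants and the noisy quadratic test loss are illustrative choices only:

```python
import numpy as np

def spsa_minimize(J, u0, n_iter=2000, a=0.2, c=0.1, gamma=1/6, seed=0):
    """Minimize J by SPSA with gains a_n = a/n and c_n = c/n**gamma."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float).copy()
    for n in range(1, n_iter + 1):
        a_n = a / n           # positive, -> 0, not summable: sum a_n = infinity
        c_n = c / n**gamma    # perturbation size, -> 0
        delta = rng.choice([-1.0, 1.0], size=u.size)  # Rademacher perturbation
        g_hat = (J(u + c_n * delta) - J(u - c_n * delta)) / (2.0 * c_n * delta)
        u -= a_n * g_hat      # descent step: u_{n+1} = u_n - a_n * g_hat
    return u

# Illustrative run on a noisy p = 2 quadratic; the iterate ends up near (1, 1).
J = lambda u: np.sum((u - 1.0) ** 2) + 0.01 * np.random.randn()
print(spsa_minimize(J, u0=[5.0, -3.0]))
```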

The loss function $J(u)$ must be thrice continuously differentiable and the individual elements of the third derivative must be bounded: $\left| J^{(3)}(u) \right| \le a_3 < \infty$. Also, $|J(u)| \to \infty$ as $\|u\| \to \infty$.

In addition, $\nabla J$ must be Lipschitz continuous and bounded, and the ODE $\dot{u} = g(u)$ must have a unique solution for each initial condition. Under these conditions and a few others, $u_n$ converges in probability to the set of global minima of $J(u)$ (see Maryak and Chin, 2008).

It has been shown that differentiability is not required: continuity and convexity are sufficient for convergence.[1]

Extension to second-order (Newton) methods

It is known that a stochastic version of the standard (deterministic) Newton-Raphson algorithm (a “second-order” method) provides an asymptotically optimal or near-optimal form of stochastic approximation. SPSA can also be used to efficiently estimate the Hessian matrix of the loss function based on either noisy loss measurements or noisy gradient measurements (stochastic gradients). As with the basic SPSA method, only a small fixed number of loss measurements or gradient measurements are needed at each iteration, regardless of the problem dimension p. See the brief discussion in Stochastic gradient descent.
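As a rough illustration of how the simultaneous perturbation idea extends to Hessian estimation, here is a sketch of a per-iteration estimate in the spirit of Spall's adaptive (2SPSA) scheme, assuming two independent Rademacher perturbation vectors and four noisy loss measurements per iteration; the interface and constants are hypothetical:

```python
import numpy as np

def spsa_hessian_estimate(J, u, c_n, c_tilde, rng):
    """One simultaneous-perturbation Hessian estimate of J at u (4 calls to J)."""
    p = u.size
    delta = rng.choice([-1.0, 1.0], size=p)    # primary perturbation
    delta_t = rng.choice([-1.0, 1.0], size=p)  # secondary perturbation
    # One-sided SP gradient approximations at the two primary-perturbed points.
    g_plus = (J(u + c_n * delta + c_tilde * delta_t)
              - J(u + c_n * delta)) / (c_tilde * delta_t)
    g_minus = (J(u - c_n * delta + c_tilde * delta_t)
               - J(u - c_n * delta)) / (c_tilde * delta_t)
    # Difference quotient in the primary direction, then symmetrize.
    h = np.outer((g_plus - g_minus) / (2.0 * c_n), 1.0 / delta)
    return 0.5 * (h + h.T)
```

In the full scheme, these per-iteration estimates are averaged across iterations and mapped to a positive-definite matrix before being used in the Newton-type step.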

Related Research Articles

<span class="mw-page-title-main">Dirac delta function</span> Generalized function whose value is zero everywhere except at zero

In mathematical analysis, the Dirac delta distribution, also known as the unit impulse, is a generalized function or distribution over the real numbers, whose value is zero everywhere except at zero, and whose integral over the entire real line is equal to one.

<span class="mw-page-title-main">Least squares</span> Approximation method in statistics

The method of least squares is a parameters estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

<span class="mw-page-title-main">Heaviside step function</span> Indicator function of positive numbers

The Heaviside step function, or the unit step function, usually denoted by H or θ, is a step function named after Oliver Heaviside, the value of which is zero for negative arguments and one for positive arguments. It is an example of the general class of step functions, all of which can be represented as linear combinations of translations of this one.

<span class="mw-page-title-main">Gamma distribution</span> Probability distribution

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

  1. With a shape parameter and a scale parameter .
  2. With a shape parameter and an inverse scale parameter , called a rate parameter.

In the field of mathematical optimization, stochastic programming is a framework for modeling optimization problems that involve uncertainty. A stochastic program is an optimization problem in which some or all problem parameters are uncertain, but follow known probability distributions. This framework contrasts with deterministic optimization, in which all problem parameters are assumed to be known exactly. The goal of stochastic programming is to find a decision which both optimizes some criteria chosen by the decision maker, and appropriately accounts for the uncertainty of the problem parameters. Because many real-world decisions involve uncertainty, stochastic programming has found applications in a broad range of areas ranging from finance to transportation to energy optimization.

<span class="mw-page-title-main">Gauss–Newton algorithm</span> Mathematical algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

<span class="mw-page-title-main">Stochastic gradient descent</span> Optimization algorithm

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

<span class="mw-page-title-main">Kernel density estimation</span> Estimator

In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.

Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective functions or random constraints. Stochastic optimization methods also include methods with random iterates. Some stochastic optimization methods use random iterates to solve stochastic problems, combining both meanings of stochastic optimization. Stochastic optimization methods generalize deterministic methods for deterministic problems.

Quasi-Newton methods are methods used to either find zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. The "full" Newton's method requires the Jacobian in order to search for zeros, or the Hessian for finding extrema. Some iterative methods that reduce to Newton's method, such as SLSQP, may be considered quasi-Newtonian.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ().

In geometry, the Chebyshev center of a bounded set having non-empty interior is the center of the minimal-radius ball enclosing the entire set , or alternatively the center of largest inscribed ball of .

Orbit modeling is the process of creating mathematical models to simulate motion of a massive body as it moves in orbit around another massive body due to gravity. Other forces such as gravitational attraction from tertiary bodies, air resistance, solar pressure, or thrust from a propulsion system are typically modeled as secondary effects. Directly modeling an orbit can push the limits of machine precision due to the need to model small perturbations to very large orbits. Because of this, perturbation methods are often used to model the orbit in order to achieve better accuracy.

In queueing theory, a discipline within the mathematical theory of probability, a heavy traffic approximation is the matching of a queueing model with a diffusion process under some limiting conditions on the model's parameters. The first such result was published by John Kingman who showed that when the utilisation parameter of an M/M/1 queue is near 1 a scaled version of the queue length process can be accurately approximated by a reflected Brownian motion.

Data-driven control systems are a broad family of control systems, in which the identification of the process model and/or the design of the controller are based entirely on experimental data collected from the plant.

<span class="mw-page-title-main">Stochastic gradient Langevin dynamics</span>

Stochastic gradient Langevin dynamics (SGLD) is an optimization and sampling technique composed of characteristics from Stochastic gradient descent, a Robbins–Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent, SGLD is an iterative optimization algorithm which uses minibatching to create a stochastic gradient estimator, as used in SGD to optimize a differentiable objective function. Unlike traditional SGD, SGLD can be used for Bayesian learning as a sampling method. SGLD may be viewed as Langevin dynamics applied to posterior distributions, but the key difference is that the likelihood gradient terms are minibatched, like in SGD. SGLD, like Langevin dynamics, produces samples from a posterior distribution of parameters based on available data. First described by Welling and Teh in 2011, the method has applications in many contexts which require optimization, and is most notably applied in machine learning problems.

The Barzilai-Borwein method is an iterative gradient descent method for unconstrained optimization using either of two step sizes derived from the linear trend of the most recent two iterates. This method, and modifications, are globally convergent under mild conditions, and perform competitively with conjugate gradient methods for many problems. Not depending on the objective itself, it can also solve some systems of linear and non-linear equations.

In statistics, the Innovation Method provides an estimator for the parameters of stochastic differential equations given a time series of observations of the state variables. In the framework of continuous-discrete state space models, the innovation estimator is obtained by maximizing the log-likelihood of the corresponding discrete-time innovation process with respect to the parameters. The innovation estimator can be classified as a M-estimator, a quasi-maximum likelihood estimator or a prediction error estimator depending on the inferential considerations that want to be emphasized. The innovation method is a system identification technique for developing mathematical models of dynamical systems from measured data and for the optimal design of experiments.

References

  1. He, Ying; Fu, Michael C.; Marcus, Steven I. (August 2003). "Convergence of simultaneous perturbation stochastic approximation for nondifferentiable optimization". IEEE Transactions on Automatic Control. 48 (8): 1459–1463. doi:10.1109/TAC.2003.815008. Retrieved March 6, 2022.