Stochastic variance reduction

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques can achieve convergence rates that are unattainable by methods that treat the objective as an infinite sum, as in the classical stochastic approximation setting.

Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines [1] as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.

Finite sum objectives

A function f is considered to have finite sum structure if it can be decomposed into a summation or average:

f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)

where the function value and derivative of each f_i can be queried independently. Although variance reduction methods can be applied for any positive n and any f_i structure, their favorable theoretical and practical properties arise when n is large compared to the condition number of each f_i, and when the f_i have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants.

The finite sum structure should be contrasted with the stochastic approximation setting, which deals with functions of the form f(\theta) = \mathbb{E}_{\xi}[F(\theta, \xi)], the expected value of a function F depending on a random variable \xi. Any finite sum problem can be optimized using a stochastic approximation algorithm by using F(\cdot, \xi) = f_{\xi} with \xi drawn uniformly from \{1, \ldots, n\}.
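For concreteness, the following sketch (illustrative only; the names and interface are hypothetical) sets up a finite-sum objective of regularized logistic regression form, exposing the per-term value and gradient queries that variance reduction methods assume:

```python
import numpy as np

# Illustrative finite-sum objective (hypothetical interface):
# f(x) = (1/n) * sum_i f_i(x), with
# f_i(x) = log(1 + exp(-y_i * <v_i, x>)) + (lam/2) * ||x||^2
def f_i(x, v_i, y_i, lam):
    """Value of a single term, queryable independently of the others."""
    return np.log1p(np.exp(-y_i * (v_i @ x))) + 0.5 * lam * (x @ x)

def grad_f_i(x, v_i, y_i, lam):
    """Gradient of a single term."""
    s = -y_i * (v_i @ x)
    sigma = 1.0 / (1.0 + np.exp(-s))      # d/ds of log(1 + exp(s))
    return -y_i * sigma * v_i + lam * x
```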

Rapid convergence

Stochastic variance reduced methods without acceleration are able to find a minimum of f to within accuracy \epsilon > 0, i.e. f(x) - f(x^*) \leq \epsilon, in a number of steps of the order:

O\left(\left(\frac{L}{\mu} + n\right)\log\frac{1}{\epsilon}\right)

where L is a Lipschitz constant of the gradients \nabla f_i and \mu is the strong convexity constant of f. The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps required grows proportionally to 1/\epsilon. Stochastic variance reduction methods converge almost as fast as the O\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right) rate of gradient descent, despite each step using only a single stochastic gradient, at a factor of n lower cost per step.

Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only

O\left(\left(\sqrt{\frac{nL}{\mu}} + n\right)\log\frac{1}{\epsilon}\right)

steps to reach \epsilon accuracy, potentially faster than non-accelerated methods. Lower complexity bounds [2] for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems.
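As an illustration with hypothetical values (not drawn from the references), take n = 10^6 terms and condition number L/\mu = 10^8. Gradient descent then needs on the order of n \cdot \frac{L}{\mu}\log\frac{1}{\epsilon} = 10^{14}\log\frac{1}{\epsilon} individual gradient evaluations, a non-accelerated variance reduced method about \left(\frac{L}{\mu} + n\right)\log\frac{1}{\epsilon} \approx 10^{8}\log\frac{1}{\epsilon}, and an accelerated method about \left(\sqrt{nL/\mu} + n\right)\log\frac{1}{\epsilon} \approx 1.1 \times 10^{7}\log\frac{1}{\epsilon}.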

Approaches

Variance reduction approaches fall within three main categories: table averaging methods, full-gradient snapshot methods, and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details.

SAGA

In the SAGA method, [3] the prototypical table averaging approach, a table of size n is maintained that contains the last gradient witnessed for each term f_i, which we denote g_i. At each step, an index i is sampled, and a new gradient \nabla f_i(x_k) is computed. The iterate x_k is updated with:

x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - g_i + \frac{1}{n}\sum_{j=1}^{n} g_j\right]

and afterwards table entry g_i is updated with \nabla f_i(x_k).
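A minimal sketch of this loop, assuming a user-supplied per-term gradient oracle grad_i(i, x) (a hypothetical interface, not from the reference):

```python
import numpy as np

def saga(grad_i, n, d, gamma, steps, seed=0):
    """Minimal SAGA sketch; grad_i(i, x) returns the gradient of f_i at x."""
    rng = np.random.default_rng(seed)
    g = np.zeros((n, d))       # table: last gradient witnessed for each term
    g_avg = np.zeros(d)        # running average (1/n) * sum_j g_j
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        grad_new = grad_i(i, x)
        # variance-reduced step: fresh gradient minus stale entry plus table average
        x = x - gamma * (grad_new - g[i] + g_avg)
        g_avg += (grad_new - g[i]) / n   # keep the average consistent with the table
        g[i] = grad_new                  # table update
    return x
```

Maintaining the running average g_avg incrementally keeps the per-step cost at O(d) rather than O(nd).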

SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method, [4] improving on its flexibility and performance.

SVRG

The stochastic variance reduced gradient method (SVRG), [5] the prototypical snapshot method, uses a similar update, except that instead of the average of a table it uses a full gradient reevaluated at a snapshot point \tilde{x} at regular intervals of m iterations. The update becomes:

x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})\right]

This approach requires two stochastic gradient evaluations per step, one to compute \nabla f_i(x_k) and one to compute \nabla f_i(\tilde{x}), whereas table averaging approaches need only one.
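A minimal sketch of the SVRG loop under the same assumed oracle interface, where full_grad(x) returns the exact gradient of f (hypothetical names):

```python
import numpy as np

def svrg(grad_i, full_grad, n, d, gamma, m, outer_steps, seed=0):
    """Minimal SVRG sketch; grad_i(i, x) is the gradient of term f_i,
    full_grad(x) the exact gradient of f."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(outer_steps):
        x_snap = x.copy()         # snapshot point
        mu = full_grad(x_snap)    # full gradient, recomputed once per snapshot
        for _ in range(m):        # m inner steps between snapshots
            i = rng.integers(n)
            x = x - gamma * (grad_i(i, x) - grad_i(i, x_snap) + mu)
    return x
```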

Despite the higher computational cost per step, SVRG is popular as its simple convergence theory is highly adaptable to new optimization settings. It also has lower storage requirements than tabular averaging approaches, which makes it applicable in many settings where tabular methods cannot be used.

SDCA

Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite sums where each term has a structure that makes computing the convex conjugate or its proximal operator tractable. The standard SDCA method [6] considers finite sums that have additional structure compared to the generic finite sum setting:

f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(\langle x, v_i \rangle) + \frac{\lambda}{2}\|x\|^2

where each f_i is 1-dimensional and each v_i is a data point associated with f_i. SDCA solves the dual problem:

\max_{\alpha \in \mathbb{R}^n} \left[ -\frac{1}{n}\sum_{i=1}^{n} f_i^*(-\alpha_i) - \frac{\lambda}{2}\left\| \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i v_i \right\|^2 \right]

by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate \alpha_i, leaving all other coordinates the same. An approximate primal solution x can be recovered from the \alpha values:

x = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i v_i

This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when \lambda is large, but significantly slower than the other approaches when \lambda is small.
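As a concrete illustration, the sketch below applies SDCA to ridge regression, where f_i(a) = \frac{1}{2}(a - y_i)^2 and the coordinate-wise dual maximization has a closed form; the function name and interface are hypothetical:

```python
import numpy as np

def sdca_ridge(V, y, lam, epochs=50, seed=0):
    """SDCA sketch for min_x (1/n) * sum_i 0.5*(<x, v_i> - y_i)^2 + (lam/2)*||x||^2."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    alpha = np.zeros(n)   # dual variables, one per data point
    x = np.zeros(d)       # primal iterate, kept equal to (1/(lam*n)) * sum_i alpha_i * v_i
    for _ in range(epochs * n):
        i = rng.integers(n)
        # Closed-form maximizer of the dual over coordinate alpha_i for the
        # squared loss 0.5*(a - y_i)^2, whose conjugate is 0.5*u^2 + u*y_i.
        delta = (y[i] - V[i] @ x - alpha[i]) / (1.0 + (V[i] @ V[i]) / (lam * n))
        alpha[i] += delta
        x += delta * V[i] / (lam * n)   # maintain the primal-dual relation
    return x
```

Note that no step size appears: each coordinate update is an exact maximization of the dual.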

Accelerated approaches

Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators to accelerate convergence, either approximately or exactly. Direct acceleration approaches have also been developed. [7]

Catalyst acceleration

The catalyst framework [8] uses any of the standard methods above as an inner optimizer to approximately solve a proximal operator:

x_k \approx \operatorname{arg\,min}_x \left\{ f(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\}

after which it uses an extrapolation step to determine the next y:

y_k = x_k + \beta_k (x_k - x_{k-1})

The catalyst method's flexibility and simplicity make it a popular baseline approach, although it does not achieve the optimal rate of convergence among accelerated methods: it is potentially slower by up to a log factor in the hyper-parameters.
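A minimal sketch of the outer loop for a \mu-strongly convex objective, using the constant momentum \beta = (1 - \sqrt{q})/(1 + \sqrt{q}) with q = \mu/(\mu + \kappa) (one standard choice); the inner_solver interface is hypothetical and would typically wrap SAGA or SVRG run on the regularized subproblem:

```python
import numpy as np

def catalyst(inner_solver, x0, kappa, mu, outer_steps):
    """Catalyst outer-loop sketch; inner_solver(y) approximately minimizes
    f(x) + (kappa/2)*||x - y||^2."""
    q = mu / (mu + kappa)
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))  # constant momentum coefficient
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for _ in range(outer_steps):
        x = inner_solver(y)            # approximate proximal point step
        y = x + beta * (x - x_prev)    # extrapolation
        x_prev = x
    return x_prev
```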

Point-SAGA

Proximal operations may also be applied directly to the terms f_j to yield an accelerated method. The Point-SAGA method [9] replaces the gradient operations in SAGA with proximal operator evaluations, resulting in a simple, direct acceleration method:

x_{k+1} = \operatorname{prox}_j^{\gamma}(z_k), \quad z_k = x_k + \gamma g_j^k - \frac{\gamma}{n}\sum_{i=1}^{n} g_i^k

with the table update g_j^{k+1} = \frac{1}{\gamma}(z_k - x_{k+1}) performed after each step. Here \operatorname{prox}_j^{\gamma} is defined as the proximal operator for the jth term:

\operatorname{prox}_j^{\gamma}(y) = \operatorname{arg\,min}_x \left\{ \frac{1}{2\gamma}\|x - y\|^2 + f_j(x) \right\}

Unlike other known accelerated methods, Point-SAGA requires only a single iterate sequence to be maintained between steps, and it has the advantage of only having a single tunable parameter \gamma. It obtains the optimal accelerated rate of convergence for strongly convex finite-sum minimization without additional log factors.
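As an illustration, the sketch below runs Point-SAGA on a least-squares finite sum, where the proximal operator of each term f_j(x) = \frac{1}{2}(\langle v_j, x \rangle - y_j)^2 has a closed form; names and interface are hypothetical:

```python
import numpy as np

def point_saga_least_squares(V, y, gamma, steps, seed=0):
    """Point-SAGA sketch for f(x) = (1/n) * sum_j 0.5*(<v_j, x> - y_j)^2."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    g = np.zeros((n, d))    # table of gradients g_j
    g_avg = np.zeros(d)     # running average (1/n) * sum_j g_j
    x = np.zeros(d)
    for _ in range(steps):
        j = rng.integers(n)
        z = x + gamma * g[j] - gamma * g_avg
        # Closed-form prox of 0.5*(<v_j, x> - y_j)^2 with parameter gamma
        x_new = z - gamma * V[j] * (V[j] @ z - y[j]) / (1.0 + gamma * (V[j] @ V[j]))
        g_new = (z - x_new) / gamma     # equals the gradient of f_j at x_new
        g_avg += (g_new - g[j]) / n
        g[j] = g_new                    # table update
        x = x_new
    return x
```

The single step size gamma is the method's only tunable parameter.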


References

  1. "sklearn.linear_model.LogisticRegression". Scikit Learn. Retrieved Feb 26, 2022.
  2. Lan, Guanghui; Zhou, Yi (2018). "An optimal randomized incremental gradient method". Mathematical Programming: Series a and B. 171 (1–2): 167–215. arXiv: 1507.02000 . doi:10.1007/s10107-017-1173-0. S2CID   9143586.
  3. Defazio, Aaron; Bach, Francis; Lacoste-Julien, Simon (2014). "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". Neural Information Processing Systems. arXiv: 1407.0202 .
  4. Schmidt, Mark; Le Roux, Nicolas; Bach, Francis (2017). "Minimizing finite sums with the stochastic average gradient". Mathematical Programming. 162. arXiv: 1309.2388 .
  5. Johnson, Rie; Zhang, Tong (2013). "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction" (PDF). Neural Information Processing Systems.
  6. Shalev-Shwartz, Shai; Zhang, Tong (2013). "Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization" (PDF). Journal of Machine Learning Research. 14.
  7. Lan, Guanghui; Zhou, Yi (2018). "An optimal randomized incremental gradient method". Mathematical Programming: Series a and B. 171 (1–2): 167–215. arXiv: 1507.02000 . doi:10.1007/s10107-017-1173-0. S2CID   9143586.
  8. Lin, Hongzhou; Mairal, Julien; Harchaoui, Zaid (2016). "Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice". Journal of Machine Learning Research. 18. arXiv: 1712.05654 .
  9. Defazio, Aaron (2016). "A Simple Practical Accelerated Method for Finite Sums". Neural Information Processing Systems. arXiv: 1602.02442 .