Stochastic variance reduction

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques can achieve convergence rates that are unattainable by methods that treat the objective as an infinite sum, as in the classical stochastic approximation setting.

Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines [1] as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.

Finite sum objectives

A function f is considered to have finite sum structure if it can be decomposed into a summation or average:

f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)

where the function value and derivative of each f_i can be queried independently. Although variance reduction methods can be applied for any positive n and any f_i structure, their favorable theoretical and practical properties arise when n is large compared to the condition number of each f_i, and when the f_i have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants.

The finite sum structure should be contrasted with the stochastic approximation setting, which deals with functions of the form f(\theta) = \mathbb{E}_{\xi}[F(\theta, \xi)], the expected value of a function F depending on a random variable \xi. Any finite sum problem can be optimized using a stochastic approximation algorithm by using F(\cdot, \xi) = f_{\xi} with \xi drawn uniformly from \{1, \ldots, n\}.
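For concreteness, the following sketch (illustrative only; the names and interface are hypothetical) sets up a finite-sum objective of regularized logistic regression form, exposing the per-term value and gradient queries that variance reduction methods assume:

```python
import numpy as np

# Illustrative finite-sum objective (hypothetical interface):
# f(x) = (1/n) * sum_i f_i(x), with
# f_i(x) = log(1 + exp(-y_i * <v_i, x>)) + (lam/2) * ||x||^2
def f_i(x, v_i, y_i, lam):
    """Value of a single term, queryable independently of the others."""
    return np.log1p(np.exp(-y_i * (v_i @ x))) + 0.5 * lam * (x @ x)

def grad_f_i(x, v_i, y_i, lam):
    """Gradient of a single term."""
    s = -y_i * (v_i @ x)
    sigma = 1.0 / (1.0 + np.exp(-s))      # d/ds of log(1 + exp(s))
    return -y_i * sigma * v_i + lam * x
```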

Rapid convergence

Stochastic variance reduced methods without acceleration are able to find a minimum of f to within accuracy \epsilon > 0, i.e. f(x) - f(x^*) \leq \epsilon, in a number of steps of the order:

O\left(\left(\frac{L}{\mu} + n\right)\log\frac{1}{\epsilon}\right)

where L is a Lipschitz constant of the gradients \nabla f_i and \mu is the strong convexity constant of f. The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps required grows proportionally to 1/\epsilon. Stochastic variance reduction methods converge almost as fast as the O\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right) rate of gradient descent, despite each step using only a single stochastic gradient, at a factor of n lower cost per step.

Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only

O\left(\left(\sqrt{\frac{nL}{\mu}} + n\right)\log\frac{1}{\epsilon}\right)

steps to reach \epsilon accuracy, potentially faster than non-accelerated methods. Lower complexity bounds [2] for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems.
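As an illustration with hypothetical values (not drawn from the references), take n = 10^6 terms and condition number L/\mu = 10^8. Gradient descent then needs on the order of n \cdot \frac{L}{\mu}\log\frac{1}{\epsilon} = 10^{14}\log\frac{1}{\epsilon} individual gradient evaluations, a non-accelerated variance reduced method about \left(\frac{L}{\mu} + n\right)\log\frac{1}{\epsilon} \approx 10^{8}\log\frac{1}{\epsilon}, and an accelerated method about \left(\sqrt{nL/\mu} + n\right)\log\frac{1}{\epsilon} \approx 1.1 \times 10^{7}\log\frac{1}{\epsilon}.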

Approaches

Variance reduction approaches fall within three main categories: table averaging methods, full-gradient snapshot methods, and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details.

SAGA

In the SAGA method, [3] the prototypical table averaging approach, a table of size n is maintained that contains the last gradient witnessed for each term f_i, which we denote g_i. At each step, an index i is sampled, and a new gradient \nabla f_i(x_k) is computed. The iterate x_k is updated with:

x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - g_i + \frac{1}{n}\sum_{j=1}^{n} g_j\right]

and afterwards table entry g_i is updated with \nabla f_i(x_k).
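A minimal sketch of this loop, assuming a user-supplied per-term gradient oracle grad_i(i, x) (a hypothetical interface, not from the reference):

```python
import numpy as np

def saga(grad_i, n, d, gamma, steps, seed=0):
    """Minimal SAGA sketch; grad_i(i, x) returns the gradient of f_i at x."""
    rng = np.random.default_rng(seed)
    g = np.zeros((n, d))       # table: last gradient witnessed for each term
    g_avg = np.zeros(d)        # running average (1/n) * sum_j g_j
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        grad_new = grad_i(i, x)
        # variance-reduced step: fresh gradient minus stale entry plus table average
        x = x - gamma * (grad_new - g[i] + g_avg)
        g_avg += (grad_new - g[i]) / n   # keep the average consistent with the table
        g[i] = grad_new                  # table update
    return x
```

Maintaining the running average g_avg incrementally keeps the per-step cost at O(d) rather than O(nd).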

SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method, [4] improving on its flexibility and performance.

SVRG

The stochastic variance reduced gradient method (SVRG), [5] the prototypical snapshot method, uses a similar update, except that instead of the average of a table it uses a full gradient reevaluated at a snapshot point \tilde{x} at regular intervals of m iterations. The update becomes:

x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})\right]

This approach requires two stochastic gradient evaluations per step, one to compute \nabla f_i(x_k) and one to compute \nabla f_i(\tilde{x}), whereas table averaging approaches need only one.
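A minimal sketch of the SVRG loop under the same assumed oracle interface, where full_grad(x) returns the exact gradient of f (hypothetical names):

```python
import numpy as np

def svrg(grad_i, full_grad, n, d, gamma, m, outer_steps, seed=0):
    """Minimal SVRG sketch; grad_i(i, x) is the gradient of term f_i,
    full_grad(x) the exact gradient of f."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(outer_steps):
        x_snap = x.copy()         # snapshot point
        mu = full_grad(x_snap)    # full gradient, recomputed once per snapshot
        for _ in range(m):        # m inner steps between snapshots
            i = rng.integers(n)
            x = x - gamma * (grad_i(i, x) - grad_i(i, x_snap) + mu)
    return x
```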

Despite the higher computational cost per step, SVRG is popular as its simple convergence theory is highly adaptable to new optimization settings. It also has lower storage requirements than tabular averaging approaches, which makes it applicable in many settings where tabular methods cannot be used.

SDCA

Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite sums where each term has a structure that makes computing the convex conjugate or its proximal operator tractable. The standard SDCA method [6] considers finite sums that have additional structure compared to the generic finite sum setting:

f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(\langle x, v_i \rangle) + \frac{\lambda}{2}\|x\|^2

where each f_i is 1-dimensional and each v_i is a data point associated with f_i. SDCA solves the dual problem:

\max_{\alpha \in \mathbb{R}^n} \left[ -\frac{1}{n}\sum_{i=1}^{n} f_i^*(-\alpha_i) - \frac{\lambda}{2}\left\| \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i v_i \right\|^2 \right]

by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate \alpha_i, leaving all other coordinates the same. An approximate primal solution x can be recovered from the \alpha values:

x = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i v_i

This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when \lambda is large, but significantly slower than the other approaches when \lambda is small.
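As a concrete illustration, the sketch below applies SDCA to ridge regression, where f_i(a) = \frac{1}{2}(a - y_i)^2 and the coordinate-wise dual maximization has a closed form; the function name and interface are hypothetical:

```python
import numpy as np

def sdca_ridge(V, y, lam, epochs=50, seed=0):
    """SDCA sketch for min_x (1/n) * sum_i 0.5*(<x, v_i> - y_i)^2 + (lam/2)*||x||^2."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    alpha = np.zeros(n)   # dual variables, one per data point
    x = np.zeros(d)       # primal iterate, kept equal to (1/(lam*n)) * sum_i alpha_i * v_i
    for _ in range(epochs * n):
        i = rng.integers(n)
        # Closed-form maximizer of the dual over coordinate alpha_i for the
        # squared loss 0.5*(a - y_i)^2, whose conjugate is 0.5*u^2 + u*y_i.
        delta = (y[i] - V[i] @ x - alpha[i]) / (1.0 + (V[i] @ V[i]) / (lam * n))
        alpha[i] += delta
        x += delta * V[i] / (lam * n)   # maintain the primal-dual relation
    return x
```

Note that no step size appears: each coordinate update is an exact maximization of the dual.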

Accelerated approaches

Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators to accelerate convergence, either approximately or exactly. Direct acceleration approaches have also been developed. [7]

Catalyst acceleration

The catalyst framework [8] uses any of the standard methods above as an inner optimizer to approximately solve a proximal operator:

x_k \approx \operatorname{arg\,min}_x \left\{ f(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\}

after which it uses an extrapolation step to determine the next y:

y_k = x_k + \beta_k (x_k - x_{k-1})

The catalyst method's flexibility and simplicity make it a popular baseline approach, although it does not achieve the optimal rate of convergence among accelerated methods: it is potentially slower by up to a log factor in the hyper-parameters.
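A minimal sketch of the outer loop for a \mu-strongly convex objective, using the constant momentum \beta = (1 - \sqrt{q})/(1 + \sqrt{q}) with q = \mu/(\mu + \kappa) (one standard choice); the inner_solver interface is hypothetical and would typically wrap SAGA or SVRG run on the regularized subproblem:

```python
import numpy as np

def catalyst(inner_solver, x0, kappa, mu, outer_steps):
    """Catalyst outer-loop sketch; inner_solver(y) approximately minimizes
    f(x) + (kappa/2)*||x - y||^2."""
    q = mu / (mu + kappa)
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))  # constant momentum coefficient
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for _ in range(outer_steps):
        x = inner_solver(y)            # approximate proximal point step
        y = x + beta * (x - x_prev)    # extrapolation
        x_prev = x
    return x_prev
```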

Point-SAGA

Proximal operations may also be applied directly to the terms f_j to yield an accelerated method. The Point-SAGA method [9] replaces the gradient operations in SAGA with proximal operator evaluations, resulting in a simple, direct acceleration method:

x_{k+1} = \operatorname{prox}_j^{\gamma}(z_k), \quad z_k = x_k + \gamma g_j^k - \frac{\gamma}{n}\sum_{i=1}^{n} g_i^k

with the table update g_j^{k+1} = \frac{1}{\gamma}(z_k - x_{k+1}) performed after each step. Here \operatorname{prox}_j^{\gamma} is defined as the proximal operator for the jth term:

\operatorname{prox}_j^{\gamma}(y) = \operatorname{arg\,min}_x \left\{ \frac{1}{2\gamma}\|x - y\|^2 + f_j(x) \right\}

Unlike other known accelerated methods, Point-SAGA requires only a single iterate sequence to be maintained between steps, and it has the advantage of only having a single tunable parameter \gamma. It obtains the optimal accelerated rate of convergence for strongly convex finite-sum minimization without additional log factors.
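As an illustration, the sketch below runs Point-SAGA on a least-squares finite sum, where the proximal operator of each term f_j(x) = \frac{1}{2}(\langle v_j, x \rangle - y_j)^2 has a closed form; names and interface are hypothetical:

```python
import numpy as np

def point_saga_least_squares(V, y, gamma, steps, seed=0):
    """Point-SAGA sketch for f(x) = (1/n) * sum_j 0.5*(<v_j, x> - y_j)^2."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    g = np.zeros((n, d))    # table of gradients g_j
    g_avg = np.zeros(d)     # running average (1/n) * sum_j g_j
    x = np.zeros(d)
    for _ in range(steps):
        j = rng.integers(n)
        z = x + gamma * g[j] - gamma * g_avg
        # Closed-form prox of 0.5*(<v_j, x> - y_j)^2 with parameter gamma
        x_new = z - gamma * V[j] * (V[j] @ z - y[j]) / (1.0 + gamma * (V[j] @ V[j]))
        g_new = (z - x_new) / gamma     # equals the gradient of f_j at x_new
        g_avg += (g_new - g[j]) / n
        g[j] = g_new                    # table update
        x = x_new
    return x
```

The single step size gamma is the method's only tunable parameter.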


References

  1. "sklearn.linear_model.LogisticRegression". Scikit Learn. Retrieved Feb 26, 2022.
  2. Lan, Guanghui; Zhou, Yi (2018). "An optimal randomized incremental gradient method". Mathematical Programming: Series a and B. 171 (1–2): 167–215. arXiv: 1507.02000 . doi:10.1007/s10107-017-1173-0. S2CID   9143586.
  3. Defazio, Aaron; Bach, Francis; Lacoste-Julien, Simon (2014). "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". Neural Information Processing Systems. arXiv: 1407.0202 .
  4. Schmidt, Mark; Le Roux, Nicolas; Bach, Francis (2017). "Minimizing finite sums with the stochastic average gradient". Mathematical Programming. 162. arXiv: 1309.2388 .
  5. Johnson, Rie; Zhang, Tong (2013). "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction" (PDF). Neural Information Processing Systems.
  6. Shalev-Shwartz, Shai; Zhang, Tong (2013). "Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization" (PDF). Journal of Machine Learning Research. 14.
  7. Lan, Guanghui; Zhou, Yi (2018). "An optimal randomized incremental gradient method". Mathematical Programming: Series a and B. 171 (1–2): 167–215. arXiv: 1507.02000 . doi:10.1007/s10107-017-1173-0. S2CID   9143586.
  8. Lin, Hongzhou; Mairal, Julien; Harchaoui, Zaid (2016). "Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice". Journal of Machine Learning Research. 18. arXiv: 1712.05654 .
  9. Defazio, Aaron (2016). "A Simple Practical Accelerated Method for Finite Sums". Neural Information Processing Systems. arXiv: 1602.02442 .