Backtracking line search

Last updated

In (unconstrained) mathematical optimization, a backtracking line search is a line search method to determine the amount to move along a given search direction. Its use requires that the objective function is differentiable and that its gradient is known.

Contents

The method involves starting with a relatively large estimate of the step size for movement along the line search direction, and iteratively shrinking the step size (i.e., "backtracking") until a decrease of the objective function is observed that adequately corresponds to the amount of decrease that is expected, based on the step size and the local gradient of the objective function. A common stopping criterion is the Armijo–Goldstein condition.

Backtracking line search is typically used for gradient descent (GD), but it can also be used in other contexts. For example, it can be used with Newton's method if the Hessian matrix is positive definite.

Motivation

Given a starting position and a search direction , the task of a line search is to determine a step size that adequately reduces the objective function (assumed i.e. continuously differentiable), i.e., to find a value of that reduces relative to . However, it is usually undesirable to devote substantial resources to finding a value of to precisely minimize . This is because the computing resources needed to find a more precise minimum along one particular direction could instead be employed to identify a better search direction. Once an improved starting point has been identified by the line search, another subsequent line search will ordinarily be performed in a new direction. The goal, then, is just to identify a value of that provides a reasonable amount of improvement in the objective function, rather than to find the actual minimizing value of .

The backtracking line search starts with a large estimate of and iteratively shrinks it. The shrinking continues until a value is found that is small enough to provide a decrease in the objective function that adequately matches the decrease that is expected to be achieved, based on the local function gradient

Define the local slope of the function of along the search direction as (where denotes the dot product). It is assumed that is a vector for which some local decrease is possible, i.e., it is assumed that .

Based on a selected control parameter , the Armijo–Goldstein condition tests whether a step-wise movement from a current position to a modified position achieves an adequately corresponding decrease in the objective function. The condition is fulfilled, see Armijo (1966), if

This condition, when used appropriately as part of a line search, can ensure that the step size is not excessively large. However, this condition is not sufficient on its own to ensure that the step size is nearly optimal, since any value of that is sufficiently small will satisfy the condition.

Thus, the backtracking line search strategy starts with a relatively large step size, and repeatedly shrinks it by a factor until the Armijo–Goldstein condition is fulfilled.

The search will terminate after a finite number of steps for any positive values of and that are less than 1. For example, Armijo used 12 for both and in Armijo (1966).

Algorithm

This condition is from Armijo (1966). Starting with a maximum candidate step size value , using search control parameters and , the backtracking line search algorithm can be expressed as follows:

  1. Set and iteration counter .
  2. Until the condition is satisfied that repeatedly increment and set
  3. Return as the solution.

In other words, reduce by a factor of in each iteration until the Armijo–Goldstein condition is fulfilled.

Function minimization using backtracking line search in practice

In practice, the above algorithm is typically iterated to produce a sequence , , to converge to a minimum, provided such a minimum exists and is selected appropriately in each step. For gradient descent, is selected as .

The value of for the that fulfills the Armijo–Goldstein condition depends on and , and is thus denoted below by . It also depends on , , and of course, although these dependencies can be left implicit if they are assumed to be fixed with respect to the optimization problem.

The detailed steps are thus, see Armijo (1966), Bertsekas (2016):

  1. Choose an initial starting point and set the iteration counter .
  2. Until some stopping condition is satisfied, choose a descent direction , update the position to , and increment .
  3. Return as the minimizing position and as the function minimum.

To assure good behavior, it is necessary that some conditions must be satisfied by . Roughly speaking should not be too far away from . A precise version is as follows (see e.g. Bertsekas (2016)). There are constants so that the following two conditions are satisfied:

  1. For all n, . Here, is the Euclidean norm of . (This assures that if , then also . More generally, if , then also .) A more strict version requires also the reverse inequality: for a positive constant .
  2. For all n, . (This condition ensures that the directions of and are similar.)

Lower bound for learning rates

This addresses the question whether there is a systematic way to find a positive number - depending on the function f, the point and the descent direction - so that all learning rates satisfy Armijo's condition. When , we can choose in the order of , where is a local Lipschitz constant for the gradient near the point (see Lipschitz continuity). If the function is , then is close to the Hessian of the function at the point . See Armijo (1966) for more detail.

Upper bound for learning rates

In the same situation where , an interesting question is how large learning rates can be chosen in Armijo's condition (that is, when one has no limit on as defined in the section "Function minimization using backtracking line search in practice"), since larger learning rates when is closer to the limit point (if exists) can make convergence faster. For example, in Wolfe conditions, there is no mention of but another condition called curvature condition is introduced.

An upper bound for learning rates is shown to exist if one wants the constructed sequence converges to a non-degenerate critical point, see Truong & Nguyen (2020): The learning rates must be bounded from above roughly by . Here H is the Hessian of the function at the limit point, is its inverse, and is the norm of a linear operator. Thus, this result applies for example when one uses Backtracking line search for Morse functions. Note that in dimension 1, is a number and hence this upper bound is of the same size as the lower bound in the section "Lower bound for learning rates".

On the other hand, if the limit point is degenerate, then learning rates can be unbounded. For example, a modification of backtracking line search known as unbounded backtracking gradient descent (see Truong & Nguyen (2020)) allows the learning rate to be half the size , where is a constant. Experiments with simple functions such as show that unbounded backtracking gradient descent converges much faster than the basic version described in the section "Function minimization using backtracking line search in practice".

Time efficiency

An argument against the use of Backtracking line search, in particular in Large scale optimisation, is that satisfying Armijo's condition is expensive. There is a way (so-called Two-way Backtracking) to go around, with good theoretical guarantees and has been tested with good results on deep neural networks, see Truong & Nguyen (2020). (There, one can find also good/stable implementations of Armijo's condition and its combination with some popular algorithms such as Momentum and NAG, on datasets such as Cifar10 and Cifar100.) One observes that if the sequence converges (as wished when one makes use of an iterative optimisation method), then the sequence of learning rates should vary little when n is large enough. Therefore, in the search for , if one always starts from , one would waste a lot of time if it turns out that the sequence stays far away from . Instead, one should search for by starting from . The second observation is that could be larger than , and hence one should allow to increase learning rate (and not just decrease as in the section Algorithm). Here is the detailed algorithm for Two-way Backtracking: At step n

  1. Set . Set and iteration counter .
  2. (Increase learning rate if Armijo's condition is satisfied.) If , then while this condition and the condition that are satisfied, repeatedly set and increase j.
  3. (Otherwise, reduce the learning rate if Armijo's condition is not satisfied.) If in contrast , then until the condition is satisfied that repeatedly increment and set
  4. Return for the learning rate .

(In Nocedal & Wright (2000) one can find a description of an algorithm with 1), 3) and 4) above, which was not tested in deep neural networks before the cited paper.)

One can save time further by a hybrid mixture between two-way backtracking and the basic standard gradient descent algorithm. This procedure also has good theoretical guarantee and good test performance. Roughly speaking, we run two-way backtracking a few times, then use the learning rate we get from then unchanged, except if the function value increases. Here is precisely how it is done. One choose in advance a number , and a number .

  1. Set iteration counter j=0.
  2. At the steps , use Two-way Backtracking.
  3. At each step k in the set : Set . If , then choose and . (So, in this case, use the learning rate unchanged.) Otherwise, if , use Two-way Backtracking. Increase k by 1 and repeat.
  4. Increase j by 1.

Theoretical guarantee (for gradient descent)

Compared with Wolfe's conditions, which is more complicated, Armijo's condition has a better theoretical guarantee. Indeed, so far backtracking line search and its modifications are the most theoretically guaranteed methods among all numerical optimization algorithms concerning convergence to critical points and avoidance of saddle points, see below.

Critical points are points where the gradient of the objective function is 0. Local minima are critical points, but there are critical points which are not local minima. An example is saddle points. Saddle points are critical points, at which there are at least one direction where the function is (local) maximum. Therefore, these points are far from being local minima. For example, if a function has at least one saddle point, then it cannot be convex. The relevance of saddle points to optimisation algorithms is that in large scale (i.e. high-dimensional) optimisation, one likely sees more saddle points than minima, see Bray & Dean (2007). Hence, a good optimisation algorithm should be able to avoid saddle points. In the setting of deep learning, saddle points are also prevalent, see Dauphin et al. (2014). Thus, to apply in deep learning, one needs results for non-convex functions.

For convergence to critical points: For example, if the cost function is a real analytic function, then it is shown in Absil, Mahony & Andrews (2005) that convergence is guaranteed. The main idea is to use Łojasiewicz inequality which is enjoyed by a real analytic function. For non-smooth functions satisfying Łojasiewicz inequality, the above convergence guarantee is extended, see Attouch, Bolte & Svaiter (2011). In Bertsekas (2016), there is a proof that for every sequence constructed by backtracking line search, a cluster point (i.e. the limit of one subsequence, if the subsequence converges) is a critical point. For the case of a function with at most countably many critical points (such as a Morse function) and compact sublevels, as well as with Lipschitz continuous gradient where one uses standard GD with learning rate <1/L (see the section "Stochastic gradient descent"), then convergence is guaranteed, see for example Chapter 12 in Lange (2013). Here the assumption about compact sublevels is to make sure that one deals with compact sets of the Euclidean space only. In the general case, where is only assumed to be and have at most countably many critical points, convergence is guaranteed, see Truong & Nguyen (2020). In the same reference, similarly convergence is guaranteed for other modifications of Backtracking line search (such as Unbounded backtracking gradient descent mentioned in the section "Upper bound for learning rates"), and even if the function has uncountably many critical points still one can deduce some non-trivial facts about convergence behaviour. In the stochastic setting, under the same assumption that the gradient is Lipschitz continuous and one uses a more restrictive version (requiring in addition that the sum of learning rates is infinite and the sum of squares of learning rates is finite) of diminishing learning rate scheme (see section "Stochastic gradient descent") and moreover the function is strictly convex, then the convergence is established in the well-known result Robbins & Monro (1951), see Bertsekas & Tsitsiklis (2006) for generalisations to less restrictive versions of a diminishing learning rate scheme. None of these results (for non-convex functions) have been proven for any other optimization algorithm so far.[ citation needed ]

For avoidance of saddle points: For example, if the gradient of the cost function is Lipschitz continuous and one chooses standard GD with learning rate <1/L, then with a random choice of initial point (more precisely, outside a set of Lebesgue measure zero), the sequence constructed will not converge to a non-degenerate saddle point (proven in Lee et al. (2016)), and more generally it is also true that the sequence constructed will not converge to a degenerate saddle point (proven in Panageas & Piliouras (2017)). Under the same assumption that the gradient is Lipschitz continuous and one uses a diminishing learning rate scheme (see the section "Stochastic gradient descent"), then avoidance of saddle points is established in Panageas, Piliouras & Wang (2019).

A special case: (standard) stochastic gradient descent (SGD)

While it is trivial to mention, if the gradient of a cost function is Lipschitz continuous, with Lipschitz constant L, then with choosing learning rate to be constant and of the size , one has a special case of backtracking line search (for gradient descent). This has been used at least in Armijo (1966). This scheme however requires that one needs to have a good estimate for L, otherwise if learning rate is too big (relative to 1/L) then the scheme has no convergence guarantee. One can see what will go wrong if the cost function is a smoothing (near the point 0) of the function f(t)=|t|. Such a good estimate is, however, difficult and laborious in large dimensions. Also, if the gradient of the function is not globally Lipschitz continuous, then this scheme has no convergence guarantee. For example, this is similar to an exercise in Bertsekas (2016), for the cost function and for whatever constant learning rate one chooses, with a random initial point the sequence constructed by this special scheme does not converge to the global minimum 0.

If one does not care about the condition that learning rate must be bounded by 1/L, then this special scheme has been used much older, at least since 1847 by Cauchy, which can be called standard GD (not to be confused with stochastic gradient descent, which is abbreviated herein as SGD). In the stochastic setting (such as in the mini-batch setting in deep learning), standard GD is called stochastic gradient descent, or SGD.

Even if the cost function has globally continuous gradient, good estimate of the Lipschitz constant for the cost functions in deep learning may not be feasible or desirable, given the very high dimensions of deep neural networks. Hence, there is a technique of fine-tuning of learning rates in applying standard GD or SGD. One way is to choose many learning rates from a grid search, with the hope that some of the learning rates can give good results. (However, if the loss function does not have global Lipschitz continuous gradient, then the example with above shows that grid search cannot help.) Another way is the so-called adaptive standard GD or SGD, some representatives are Adam, Adadelta, RMSProp and so on, see the article on Stochastic gradient descent. In adaptive standard GD or SGD, learning rates are allowed to vary at each iterate step n, but in a different manner from Backtracking line search for gradient descent. Apparently, it would be more expensive to use Backtracking line search for gradient descent, since one needs to do a loop search until Armijo's condition is satisfied, while for adaptive standard GD or SGD no loop search is needed. Most of these adaptive standard GD or SGD do not have the descent property , for all n, as Backtracking line search for gradient descent. Only a few has this property, and which have good theoretical properties, but they turn out to be special cases of Backtracking line search or more generally Armijo's condition Armijo (1966). The first one is when one chooses learning rate to be a constant <1/L, as mentioned above, if one can have a good estimate of L. The second is the so called diminishing learning rate, used in the well-known paper by Robbins & Monro (1951), if again the function has globally Lipschitz continuous gradient (but the Lipschitz constant may be unknown) and the learning rates converge to 0.

Summary

In summary, backtracking line search (and its modifications) is a method which is easy to implement, is applicable for very general functions, has very good theoretical guarantee (for both convergence to critical points and avoidance of saddle points) and works well in practice. Several other methods which have good theoretical guarantee, such as diminishing learning rates or standard GD with learning rate <1/L – both require the gradient of the objective function to be Lipschitz continuous, turn out to be a special case of Backtracking line search or satisfy Armijo's condition. Even though a priori one needs the cost function to be continuously differentiable to apply this method, in practice one can apply this method successfully also for functions which are continuously differentiable on a dense open subset such as or .

See also

Related Research Articles

<span class="mw-page-title-main">Gradient descent</span> Optimization algorithm

Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for finding a local minimum of a differentiable multivariate function.

<span class="mw-page-title-main">Onsager reciprocal relations</span> Relations between flows and forces, or gradients, in thermodynamic systems

In thermodynamics, the Onsager reciprocal relations express the equality of certain ratios between flows and forces in thermodynamic systems out of equilibrium, but where a notion of local equilibrium exists.

<span class="mw-page-title-main">Gauss–Newton algorithm</span> Mathematical algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

<span class="mw-page-title-main">Newton's method in optimization</span> Method for finding stationary points of a function

In calculus, Newton's method (also called Newton–Raphson) is an iterative method for finding the roots of a differentiable function F, which are solutions to the equation F (x) = 0. As such, Newton's method can be applied to the derivative f of a twice-differentiable function f to find the roots of the derivative (solutions to f ′(x) = 0), also known as the critical points of f. These solutions may be minima, maxima, or saddle points; see section "Several variables" in Critical point (mathematics) and also section "Geometric interpretation" in this article. This is relevant in optimization, which aims to find (global) minima of the function f.

<span class="mw-page-title-main">Conjugate gradient method</span> Mathematical optimization algorithm

In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-semidefinite. The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition. Large sparse systems often arise when numerically solving partial differential equations or optimization problems.

In optimization, line search is a basic iterative approach to find a local minimum of an objective function . It first finds a descent direction along which the objective function will be reduced, and then computes a step size that determines how far should move along that direction. The descent direction can be computed by various methods, such as gradient descent or quasi-Newton method. The step size can be determined either exactly or inexactly.

In the unconstrained minimization problem, the Wolfe conditions are a set of inequalities for performing inexact line search, especially in quasi-Newton methods, first published by Philip Wolfe in 1969.

In differential geometry, the four-gradient is the four-vector analogue of the gradient from vector calculus.

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations via a generalized secant method.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

In mathematics, a real or complex-valued function f on d-dimensional Euclidean space satisfies a Hölder condition, or is Hölder continuous, when there are real constants C ≥ 0, α > 0, such that for all x and y in the domain of f. More generally, the condition can be formulated for functions between any two metric spaces. The number is called the exponent of the Hölder condition. A function on an interval satisfying the condition with α > 1 is constant. If α = 1, then the function satisfies a Lipschitz condition. For any α > 0, the condition implies the function is uniformly continuous. The condition is named after Otto Hölder.

In mathematics, preconditioning is the application of a transformation, called the preconditioner, that conditions a given problem into a form that is more suitable for numerical solving methods. Preconditioning is typically related to reducing a condition number of the problem. The preconditioned problem is then usually solved by an iterative method.

The gradient theorem, also known as the fundamental theorem of calculus for line integrals, says that a line integral through a gradient field can be evaluated by evaluating the original scalar field at the endpoints of the curve. The theorem is a generalization of the second fundamental theorem of calculus to any curve in a plane or space rather than just the real line.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize over unconstrained values of the real-vector where is a differentiable scalar function.

In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function

Subgradient methods are convex optimization methods which use subderivatives. Originally developed by Naum Z. Shor and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, sub-gradient methods for unconstrained problems use the same search direction as the method of steepest descent.

In continuum mechanics, a compatible deformation tensor field in a body is that unique tensor field that is obtained when the body is subjected to a continuous, single-valued, displacement field. Compatibility is the study of the conditions under which such a displacement field can be guaranteed. Compatibility conditions are particular cases of integrability conditions and were first derived for linear elasticity by Barré de Saint-Venant in 1864 and proved rigorously by Beltrami in 1886.

In mathematics, mirror descent is an iterative optimization algorithm for finding a local minimum of a differentiable function.

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting.

References