Quasi-Newton method


In numerical analysis, a quasi-Newton method is an iterative numerical method used either to find zeroes or to find local maxima and minima of functions. It proceeds via a recurrence formula much like the one for Newton's method, except that it uses approximations of the derivatives of the functions in place of exact derivatives. Newton's method requires the Jacobian matrix of all partial derivatives of a multivariate function when used to search for zeros, or the Hessian matrix when used for finding extrema. Quasi-Newton methods, on the other hand, can be used when the Jacobian or Hessian matrices are unavailable or are impractical to compute at every iteration.


Some iterative methods that reduce to Newton's method, such as sequential quadratic programming, may also be considered quasi-Newton methods.

Search for zeros: root finding

Newton's method to find zeroes of a function $g$ of multiple variables is given by $x_{n+1} = x_n - [J_g(x_n)]^{-1} g(x_n)$, where $[J_g(x_n)]^{-1}$ is the left inverse of the Jacobian matrix $J_g(x_n)$ of $g$ evaluated for $x_n$.

Strictly speaking, any method that replaces the exact Jacobian $J_g(x_n)$ with an approximation is a quasi-Newton method. [1] For instance, the chord method (where $J_g(x_n)$ is replaced by $J_g(x_0)$ for all iterations) is a simple example. The methods given below for optimization refer to an important subclass of quasi-Newton methods, secant methods. [2]
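
A minimal sketch of the chord method in Python with NumPy (the function names and the example system are illustrative, not taken from any particular library):

```python
import numpy as np

def chord_method(g, jacobian, x0, tol=1e-10, max_iter=100):
    """Solve g(x) = 0 with the chord method: the Jacobian is evaluated
    once at the starting point and reused in every iteration."""
    x = np.asarray(x0, dtype=float)
    J0 = jacobian(x)                       # Jacobian frozen at the initial guess
    for _ in range(max_iter):
        step = np.linalg.solve(J0, g(x))   # reuse the same matrix every time
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Example system: x0^2 + x1^2 = 2 and x0 = x1, with root (1, 1)
g = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
print(chord_method(g, J, x0=[2.0, 0.5]))   # converges to [1., 1.]
```

Because the Jacobian is never re-evaluated, each iteration is cheap, but the convergence rate degrades from quadratic (Newton) to linear.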

Using methods developed to find extrema in order to find zeroes is not always a good idea, as the majority of the methods used to find extrema require the matrix that is used to be symmetric. While this holds in the context of the search for extrema, it rarely holds when searching for zeroes. Broyden's "good" and "bad" methods are two quasi-Newton methods commonly used to find zeroes that can also be applied to find extrema. Other methods that can be used are the column-updating method, the inverse column-updating method, the quasi-Newton least squares method and the quasi-Newton inverse least squares method.

More recently, quasi-Newton methods have been applied to find the solution of multiple coupled systems of equations (e.g. fluid–structure interaction problems or interaction problems in physics). They allow the solution to be found by solving each constituent system separately (which is simpler than the global system) in a cyclic, iterative fashion until the solution of the global system is found. [2] [3]

Search for extrema: optimization

The search for a minimum or maximum of a scalar-valued function is closely related to the search for the zeroes of the gradient of that function. Therefore, quasi-Newton methods can be readily applied to find extrema of a function. In other words, if $g$ is the gradient of $f$, then searching for the zeroes of the vector-valued function $g$ corresponds to the search for the extrema of the scalar-valued function $f$; the Jacobian of $g$ now becomes the Hessian of $f$. The main difference is that the Hessian matrix is a symmetric matrix, unlike the Jacobian when searching for zeroes. Most quasi-Newton methods used in optimization exploit this symmetry.

In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods for optimization are based on Newton's method to find the stationary points of a function, points where the gradient is 0. Newton's method assumes that the function can be locally approximated as a quadratic in the region around the optimum, and uses the first and second derivatives to find the stationary point. In higher dimensions, Newton's method uses the gradient and the Hessian matrix of second derivatives of the function to be minimized.

In quasi-Newton methods the Hessian matrix does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian.
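
In one dimension the secant equation determines the Hessian approximation uniquely, and applying it to the first derivative recovers the classical secant method. A short illustrative sketch in Python (the names are not from any library):

```python
def secant_on_derivative(fprime, x0, x1, tol=1e-10, max_iter=100):
    """Find a stationary point of f by running the secant method on f'.
    The divided difference of f' plays the role of the scalar Hessian
    approximation B in the secant equation."""
    g0, g1 = fprime(x0), fprime(x1)
    for _ in range(max_iter):
        B = (g1 - g0) / (x1 - x0)     # scalar secant approximation of f''
        x0, x1 = x1, x1 - g1 / B      # quasi-Newton (secant) step
        g0, g1 = g1, fprime(x1)
        if abs(x1 - x0) < tol:
            break
    return x1

# Example: f(x) = x^4 - 3x^2 + 2 has a local minimum at sqrt(3/2) ~ 1.2247
print(secant_on_derivative(lambda x: 4 * x**3 - 6 * x, x0=1.0, x1=1.5))
```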

The first quasi-Newton algorithm was proposed by William C. Davidon, a physicist working at Argonne National Laboratory. He developed the first quasi-Newton algorithm in 1959: the DFP updating formula, which was later popularized by Fletcher and Powell in 1963, but is rarely used today. The most common quasi-Newton algorithms are currently the SR1 formula (for "symmetric rank-one"), the BHHH method, the widespread BFGS method (suggested independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970), and its low-memory extension L-BFGS. The Broyden class is a linear combination of the DFP and BFGS methods.

The SR1 formula does not guarantee that the update matrix maintains positive-definiteness and can be used for indefinite problems. Broyden's method does not require the update matrix to be symmetric and is used to find the root of a general system of equations (rather than the gradient) by updating the Jacobian (rather than the Hessian).

One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation $B$) does not need to be inverted. Newton's method, and its derivatives such as interior point methods, require the Hessian to be inverted, which is typically implemented by solving a system of linear equations and is often quite costly. In contrast, quasi-Newton methods usually generate an estimate of $B^{-1}$ directly.

As in Newton's method, one uses a second-order approximation to find the minimum of a function $f(x)$. The Taylor series of $f(x)$ around an iterate $x_k$ is

$f(x_k + \Delta x) \approx f(x_k) + \nabla f(x_k)^{\mathsf T} \Delta x + \tfrac{1}{2} \Delta x^{\mathsf T} B \,\Delta x,$

where $\nabla f(x_k)$ is the gradient, and $B$ an approximation to the Hessian matrix. [4] The gradient of this approximation (with respect to $\Delta x$) is

$\nabla f(x_k + \Delta x) \approx \nabla f(x_k) + B \,\Delta x,$

and setting this gradient to zero (which is the goal of optimization) provides the Newton step:

$\Delta x = -B^{-1} \nabla f(x_k).$

The Hessian approximation $B$ is chosen to satisfy

$\nabla f(x_k + \Delta x) = \nabla f(x_k) + B \,\Delta x,$

which is called the secant equation (the Taylor series of the gradient itself). In more than one dimension $B$ is underdetermined. In one dimension, solving for $B$ and applying the Newton step with the updated value is equivalent to the secant method. The various quasi-Newton methods differ in their choice of the solution to the secant equation (in one dimension, all the variants are equivalent). Most methods (but with exceptions, such as Broyden's method) seek a symmetric solution ($B^{\mathsf T} = B$); furthermore, the variants listed below can be motivated by finding an update $B_{k+1}$ that is as close as possible to $B_k$ in some norm; that is, $B_{k+1} = \operatorname{argmin}_B \lVert B - B_k \rVert_V$ subject to the secant equation, where $V$ is some positive-definite matrix that defines the norm. An approximate initial value $B_0 = \beta I$ is often sufficient to achieve rapid convergence, although there is no general strategy to choose $\beta$. [5] Note that $B_0$ should be positive-definite. The unknown $x_k$ is updated by applying the Newton step calculated using the current approximate Hessian matrix $B_k$:

1. $\Delta x_k = -\alpha_k B_k^{-1} \nabla f(x_k)$, with $\alpha_k$ chosen to satisfy the Wolfe conditions;
2. $x_{k+1} = x_k + \Delta x_k$;
3. the gradient is computed at the new point, $\nabla f(x_{k+1})$, and

$y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$

is used to update the approximate Hessian $B_{k+1}$, or directly its inverse $H_{k+1} = B_{k+1}^{-1}$ using the Sherman–Morrison formula.
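
A compact sketch of this iteration in Python with NumPy, maintaining the inverse approximation $H_k \approx B_k^{-1}$ and using the BFGS inverse update; a simple backtracking (Armijo) rule stands in for a full Wolfe line search, so the code is illustrative rather than industrial-strength:

```python
import numpy as np

def quasi_newton_bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimize f with a BFGS quasi-Newton iteration.  H approximates the
    inverse Hessian; a backtracking (Armijo) line search stands in for a
    full Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                           # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                          # quasi-Newton search direction
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * g @ p:
            alpha *= 0.5                    # backtrack until sufficient decrease
        s = alpha * p                       # step actually taken
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                       # change in gradient
        sy = y @ s
        if sy > 1e-12:                      # update only if curvature information is usable
            rho = 1.0 / sy
            I = np.eye(n)
            # BFGS update of the inverse Hessian (secant equation: H_new y = s)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize the Rosenbrock function; the minimizer is (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(quasi_newton_bfgs(f, grad, x0=[-1.2, 1.0]))
```

With these choices the iterate should reach the minimizer (1, 1) of the Rosenbrock test function; a production implementation would add a Wolfe line search and further safeguards.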

The most popular update formulas are:

Each formula below gives the updated Hessian approximation $B_{k+1}$, with $\Delta x_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$; the corresponding updates of the inverse $H_{k+1} = B_{k+1}^{-1}$ follow by applying the Sherman–Morrison formula to each rank-one term.

BFGS: $B_{k+1} = B_k + \dfrac{y_k y_k^{\mathsf T}}{y_k^{\mathsf T} \Delta x_k} - \dfrac{B_k \Delta x_k \Delta x_k^{\mathsf T} B_k}{\Delta x_k^{\mathsf T} B_k \Delta x_k}$

Broyden: $B_{k+1} = B_k + \dfrac{(y_k - B_k \Delta x_k)\, \Delta x_k^{\mathsf T}}{\Delta x_k^{\mathsf T} \Delta x_k}$

Broyden family: $B_{k+1} = (1 - \varphi_k)\, B_{k+1}^{\mathrm{BFGS}} + \varphi_k\, B_{k+1}^{\mathrm{DFP}}$, with $\varphi_k \in [0, 1]$

DFP: $B_{k+1} = \left(I - \dfrac{y_k \Delta x_k^{\mathsf T}}{y_k^{\mathsf T} \Delta x_k}\right) B_k \left(I - \dfrac{\Delta x_k y_k^{\mathsf T}}{y_k^{\mathsf T} \Delta x_k}\right) + \dfrac{y_k y_k^{\mathsf T}}{y_k^{\mathsf T} \Delta x_k}$

SR1: $B_{k+1} = B_k + \dfrac{(y_k - B_k \Delta x_k)(y_k - B_k \Delta x_k)^{\mathsf T}}{(y_k - B_k \Delta x_k)^{\mathsf T} \Delta x_k}$
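
As a concrete example of one of these formulas, here is a sketch of the SR1 update applied directly to the inverse approximation $H_k$, including the standard safeguard that skips the update when its denominator is nearly zero (the function name is illustrative):

```python
import numpy as np

def sr1_inverse_update(H, s, y, r=1e-8):
    """One SR1 update of the inverse-Hessian approximation H, where
    s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k)."""
    v = s - H @ y                          # residual of the secant equation H y = s
    denom = v @ y
    if abs(denom) <= r * np.linalg.norm(v) * np.linalg.norm(y):
        return H                           # skip: update would be numerically unstable
    return H + np.outer(v, v) / denom      # symmetric rank-one correction
```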

Other methods are Pearson's method, McCormick's method, the Powell symmetric Broyden (PSB) method and Greenstadt's method. [2] These recursive low-rank matrix updates can also be represented as an initial matrix plus a low-rank correction. This is the compact quasi-Newton representation, which is particularly effective for constrained and/or large problems.

Relationship to matrix inversion

When $f$ is a convex quadratic function with positive-definite Hessian $B$, one would expect the matrices $H_k$ generated by a quasi-Newton method to converge to the inverse Hessian $B^{-1}$. This is indeed the case for the class of quasi-Newton methods based on least-change updates. [6]

Notable implementations

Implementations of quasi-Newton methods are available in many programming languages.

Notable open source implementations include:

R's optim general-purpose optimizer, which uses the BFGS method via method="BFGS". [7]
SciPy's scipy.optimize.minimize function, which includes, among other methods, a BFGS implementation (see the sketch after the lists below). [8]

Notable proprietary implementations include:

Mathematica's quasi-Newton solvers. [9]
The NAG Library, which contains several routines [10] for minimizing or maximizing a function [11] using quasi-Newton algorithms.
MATLAB's Optimization Toolbox, in which the fminunc function uses (among other methods) the BFGS quasi-Newton method, [12] and many of the constrained solvers use BFGS and the variant L-BFGS. [13]
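
For instance, SciPy's scipy.optimize.minimize selects BFGS or the limited-memory variant through its method argument; a brief usage sketch on the built-in Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# BFGS maintains a dense inverse-Hessian approximation
res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
print(res.x)          # approximately [1., 1.]

# L-BFGS-B stores only a few recent (s, y) pairs instead of a full matrix
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)
```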

See also

Related Research Articles

<span class="mw-page-title-main">Gradient descent</span> Optimization algorithm

Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function.

In mathematics, the Hessian matrix, Hessian or Hesse matrix is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed in the 19th century by the German mathematician Ludwig Otto Hesse and later named after him. Hesse originally used the term "functional determinants". The Hessian is sometimes denoted by H or, ambiguously, by ∇².

<span class="mw-page-title-main">Inverse kinematics</span> Computing joint values of a kinematic chain from a known end position

In computer animation and robotics, inverse kinematics is the mathematical process of calculating the variable joint parameters needed to place the end of a kinematic chain, such as a robot manipulator or animation character's skeleton, in a given position and orientation relative to the start of the chain. Given joint parameters, the position and orientation of the chain's end, e.g. the hand of the character or robot, can typically be calculated directly using multiple applications of trigonometric formulas, a process known as forward kinematics. However, the reverse operation is, in general, much more challenging.

In mathematics and computing, the Levenberg–Marquardt algorithm, also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.

<span class="mw-page-title-main">Gauss–Newton algorithm</span> Mathematical algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

<span class="mw-page-title-main">Newton's method in optimization</span> Method for finding stationary points of a function

In calculus, Newton's method is an iterative method for finding the roots of a differentiable function $f$, which are solutions to the equation $f(x) = 0$. However, to optimize a twice-differentiable $f$, our goal is to find the roots of $f'$. We can therefore use Newton's method on its derivative $f'$ to find solutions to $f'(x) = 0$, also known as the critical points of $f$. These solutions may be minima, maxima, or saddle points; see section "Several variables" in Critical point (mathematics) and also section "Geometric interpretation" in this article. This is relevant in optimization, which aims to find (global) minima of the function $f$.

<span class="mw-page-title-main">Interior-point method</span> Algorithms for solving convex optimization problems

Interior-point methods are algorithms for solving linear and non-linear convex optimization problems. IPMs combine two advantages of previously-known algorithms.

In the unconstrained minimization problem, the Wolfe conditions are a set of inequalities for performing inexact line search, especially in quasi-Newton methods, first published by Philip Wolfe in 1969.
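
A minimal check of the two (weak) Wolfe inequalities for a candidate step length, using the constants commonly recommended for quasi-Newton methods ($c_1 = 10^{-4}$, $c_2 = 0.9$); the function name is illustrative:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return True if step length alpha along descent direction p
    satisfies both Wolfe conditions at the point x."""
    g0 = grad(x) @ p                       # directional derivative at x (should be < 0)
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * g0
    curvature = grad(x_new) @ p >= c2 * g0
    return sufficient_decrease and curvature
```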

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations via a generalized secant method.

The Berndt–Hall–Hall–Hausman (BHHH) algorithm is a numerical optimization algorithm similar to the Newton–Raphson algorithm, but it replaces the observed negative Hessian matrix with the outer product of the gradient. This approximation is based on the information matrix equality and therefore only valid while maximizing a likelihood function. The BHHH algorithm is named after the four originators: Ernst R. Berndt, Bronwyn Hall, Robert Hall, and Jerry Hausman.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize $f(\mathbf{x})$ over unconstrained values of the real vector $\mathbf{x}$, where $f$ is a differentiable scalar function.
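
The heart of L-BFGS is the two-loop recursion, which applies the inverse-Hessian approximation implied by the m most recently stored pairs $(s_i, y_i)$ to the current gradient without ever forming a matrix. A sketch under those assumptions (NumPy arrays, illustrative names, not any library's internals):

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Two-loop recursion: return -H_k @ grad_k, where H_k is the L-BFGS
    inverse-Hessian approximation defined by the stored pairs
    s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i (most recent last)."""
    q = np.array(grad_k, dtype=float)
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    if s_list:                                             # initial scaling H_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest to newest
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return -q
```

In the full algorithm, the oldest pair is discarded once m pairs are stored, keeping memory use linear in the problem dimension.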

Sequential quadratic programming (SQP) is an iterative method for constrained nonlinear optimization which may be considered a quasi-Newton method. SQP methods are used on mathematical problems for which the objective function and the constraints are twice continuously differentiable, but not necessarily convex.

In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function $f(x) = \lVert Ax - b \rVert^2$, the minimum of $f$ is obtained when the gradient is 0: $\nabla_x f = 2 A^{\mathsf T}(Ax - b) = 0$.

In numerical analysis, Broyden's method is a quasi-Newton method for finding roots in k variables. It was originally described by C. G. Broyden in 1965.
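
A sketch of Broyden's ("good") method for root finding in Python with NumPy: each iteration takes a quasi-Newton step and then applies a rank-one correction to the Jacobian approximation so that the secant equation holds (names and example are illustrative):

```python
import numpy as np

def broyden_solve(g, x0, J0, tol=1e-10, max_iter=100):
    """Find a root of g with Broyden's 'good' method, starting from the
    Jacobian approximation J0."""
    x = np.asarray(x0, dtype=float)
    J = np.asarray(J0, dtype=float)
    gx = g(x)
    for _ in range(max_iter):
        if np.linalg.norm(gx) < tol:
            break
        dx = np.linalg.solve(J, -gx)              # quasi-Newton step
        x = x + dx
        g_new = g(x)
        dg = g_new - gx
        # rank-one update so that the new J satisfies J @ dx = dg
        J = J + np.outer(dg - J @ dx, dx) / (dx @ dx)
        gx = g_new
    return x

# Example: x0^2 + x1^2 = 2 and x0 = x1; J0 is the exact Jacobian at the start
g = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])
J0 = np.array([[4.0, 1.0], [1.0, -1.0]])          # Jacobian of g at (2.0, 0.5)
print(broyden_solve(g, x0=[2.0, 0.5], J0=J0))     # approaches [1., 1.]
```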

The Symmetric Rank 1 (SR1) method is a quasi-Newton method to update the second derivative (Hessian) based on the derivatives (gradients) calculated at two points. It is a generalization of the secant method to a multidimensional problem. This update maintains the symmetry of the matrix but does not guarantee that the update be positive definite.

The Davidon–Fletcher–Powell formula finds the solution to the secant equation that is closest to the current estimate and satisfies the curvature condition. It was the first quasi-Newton method to generalize the secant method to a multidimensional problem. This update maintains the symmetry and positive definiteness of the Hessian matrix.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors.

The Barzilai–Borwein method is an iterative gradient descent method for unconstrained optimization using either of two step sizes derived from the linear trend of the most recent two iterates. This method, and modifications, are globally convergent under mild conditions and perform competitively with conjugate gradient methods for many problems. Not depending on the objective itself, it can also solve some systems of linear and non-linear equations.

The compact representation for quasi-Newton methods is a matrix decomposition, which is typically used in gradient based optimization algorithms or for solving nonlinear systems. The decomposition uses a low-rank representation for the direct and/or inverse Hessian or the Jacobian of a nonlinear system. Because of this, the compact representation is often used for large problems and constrained optimization.

References

  1. Broyden, C. G. (1972). "Quasi-Newton Methods". In Murray, W. (ed.). Numerical Methods for Unconstrained Optimization. London: Academic Press. pp. 87–106. ISBN 0-12-512250-0.
  2. Haelterman, Rob (2009). "Analytical study of the Least Squares Quasi-Newton method for interaction problems". PhD thesis, Ghent University. Retrieved 2014-08-14.
  3. Rob Haelterman; Dirk Van Eester; Daan Verleyen (2015). "Accelerating the solution of a physics model inside a tokamak using the (Inverse) Column Updating Method". Journal of Computational and Applied Mathematics. 279: 133–144. doi:10.1016/j.cam.2014.11.005.
  4. "Introduction to Taylor's theorem for multivariable functions - Math Insight". mathinsight.org. Retrieved November 11, 2021.
  5. Nocedal, Jorge; Wright, Stephen J. (2006). Numerical Optimization. New York: Springer. p. 142. ISBN 0-387-98793-2.
  6. Robert Mansel Gower; Peter Richtarik (2015). "Randomized Quasi-Newton Updates are Linearly Convergent Matrix Inversion Algorithms". arXiv:1602.01768 [math.NA].
  7. "optim function - RDocumentation". www.rdocumentation.org. Retrieved 2022-02-21.
  8. "Scipy.optimize.minimize — SciPy v1.7.1 Manual".
  9. "Unconstrained Optimization: Methods for Local Minimization—Wolfram Language Documentation". reference.wolfram.com. Retrieved 2022-02-21.
  10. The Numerical Algorithms Group. "Keyword Index: Quasi-Newton". NAG Library Manual, Mark 23. Retrieved 2012-02-09.
  11. The Numerical Algorithms Group. "E04 – Minimizing or Maximizing a Function" (PDF). NAG Library Manual, Mark 23. Retrieved 2012-02-09.
  12. "Find minimum of unconstrained multivariable function - MATLAB fminunc".
  13. "Constrained Nonlinear Optimization Algorithms - MATLAB & Simulink". www.mathworks.com. Retrieved 2022-02-21.

Further reading