Nonlinear conjugate gradient method

Last updated November 01, 2024

In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function $\displaystyle f(x)$

\displaystyle f(x)=\|Ax-b\|^{2},

the minimum of $f$ is obtained when the gradient is 0:

\nabla _{x}f=2A^{T}(Ax-b)=0

.

Whereas linear conjugate gradient seeks a solution to the linear equation $\displaystyle A^{T}Ax=A^{T}b$ , the nonlinear conjugate gradient method is generally used to find the local minimum of a nonlinear function using its gradient $\nabla _{x}f$ alone. It works when the function is approximately quadratic near the minimum, which is the case when the function is twice differentiable at the minimum and the second derivative is non-singular there.

Given a function $\displaystyle f(x)$ of $N$ variables to minimize, its gradient $\nabla _{x}f$ indicates the direction of maximum increase. One simply starts in the opposite (steepest descent) direction:

\Delta x_{0}=-\nabla _{x}f(x_{0})

with an adjustable step length $\displaystyle \alpha$ and performs a line search in this direction until it reaches the minimum of $\displaystyle f$ :

\displaystyle \alpha _{0}:=\arg \min _{\alpha }f(x_{0}+\alpha \Delta x_{0})

,

\displaystyle x_{1}=x_{0}+\alpha _{0}\Delta x_{0}

After this first iteration in the steepest direction $\displaystyle \Delta x_{0}$ , the following steps constitute one iteration of moving along a subsequent conjugate direction $\displaystyle s_{n}$ , where $\displaystyle s_{0}=\Delta x_{0}$ :

Calculate the steepest direction: $\Delta x_{n}=-\nabla _{x}f(x_{n})$ ,
Compute $\displaystyle \beta _{n}$ according to one of the formulas below,
Update the conjugate direction: $\displaystyle s_{n}=\Delta x_{n}+\beta _{n}s_{n-1}$
Perform a line search: optimize $\displaystyle \alpha _{n}=\arg \min _{\alpha }f(x_{n}+\alpha s_{n})$ ,
Update the position: $\displaystyle x_{n+1}=x_{n}+\alpha _{n}s_{n}$ ,

With a pure quadratic function the minimum is reached within N iterations (excepting roundoff error), but a non-quadratic function will make slower progress. Subsequent search directions lose conjugacy requiring the search direction to be reset to the steepest descent direction at least every N iterations, or sooner if progress stops. However, resetting every iteration turns the method into steepest descent. The algorithm stops when it finds the minimum, determined when no progress is made after a direction reset (i.e. in the steepest descent direction), or when some tolerance criterion is reached.

Within a linear approximation, the parameters $\displaystyle \alpha$ and $\displaystyle \beta$ are the same as in the linear conjugate gradient method but have been obtained with line searches. The conjugate gradient method can follow narrow (ill-conditioned) valleys, where the steepest descent method slows down and follows a criss-cross pattern.

Four of the best known formulas for $\displaystyle \beta _{n}$ are named after their developers:

Fletcher–Reeves:^[1]

\beta _{n}^{FR}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}.

Polak–Ribière:^[2]

\beta _{n}^{PR}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}.

Hestenes–Stiefel:^[3]

\beta _{n}^{HS}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{-s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}.

Dai–Yuan:^[4]

\beta _{n}^{DY}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{-s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}.

.

These formulas are equivalent for a quadratic function, but for nonlinear optimization the preferred formula is a matter of heuristics or taste. A popular choice is $\displaystyle \beta =\max\{0,\beta ^{PR}\}$ , which provides a direction reset automatically.^[5]

Algorithms based on Newton's method potentially converge much faster. There, both step direction and length are computed from the gradient as the solution of a linear system of equations, with the coefficient matrix being the exact Hessian matrix (for Newton's method proper) or an estimate thereof (in the quasi-Newton methods, where the observed change in the gradient during the iterations is used to update the Hessian estimate). For high-dimensional problems, the exact computation of the Hessian is usually prohibitively expensive, and even its storage can be problematic, requiring $O(N^{2})$ memory (but see the limited-memory L-BFGS quasi-Newton method).

The conjugate gradient method can also be derived using optimal control theory.^[6] In this accelerated optimization theory, the conjugate gradient method falls out as a nonlinear optimal feedback controller,

$u=k(x,{\dot {x}}):=-\gamma _{a}\nabla _{x}f(x)-\gamma _{b}{\dot {x}}$

for the double integrator system,

${\ddot {x}}=u$

The quantities $\gamma _{a}>0$ and $\gamma _{b}>0$ are variable feedback gains.^[6]

Related Research Articles

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function.

In the calculus of variations, a field of mathematical analysis, the functional derivative relates a change in a functional to a change in a function on which the functional depends.

In mathematics and computing, the Levenberg–Marquardt algorithm, also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-semidefinite. The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition. Large sparse systems often arise when numerically solving partial differential equations or optimization problems.

In optimization, line search is a basic iterative approach to find a local minimum $of an objective function . It first finds a descent direction along which the objective function will be reduced, and then computes a step size that determines how far should move along that direction. The descent direction can be computed by various methods, such as gradient descent or quasi-Newton method. The step size can be determined either exactly or inexactly.$

Mehrotra's predictor–corrector method in optimization is a specific interior point method for linear programming. It was proposed in 1989 by Sanjay Mehrotra.

In (unconstrained) mathematical optimization, a backtracking line search is a line search method to determine the amount to move along a given search direction. Its use requires that the objective function is differentiable and that its gradient is known.

In numerical analysis, the Crank–Nicolson method is a finite difference method used for numerically solving the heat equation and similar partial differential equations. It is a second-order method in time. It is implicit in time, can be written as an implicit Runge–Kutta method, and it is numerically stable. The method was developed by John Crank and Phyllis Nicolson in the 1940s.

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations via a generalized secant method.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize $over unconstrained values of the real-vector where is a differentiable scalar function.$

The Newman–Penrose (NP) formalism is a set of notation developed by Ezra T. Newman and Roger Penrose for general relativity (GR). Their notation is an effort to treat general relativity in terms of spinor notation, which introduces complex forms of the usual variables used in GR. The NP formalism is itself a special case of the tetrad formalism, where the tensors of the theory are projected onto a complete vector basis at each point in spacetime. Usually this vector basis is chosen to reflect some symmetry of the spacetime, leading to simplified expressions for physical observables. In the case of the NP formalism, the vector basis chosen is a null tetrad: a set of four null vectors—two real, and a complex-conjugate pair. The two real members often asymptotically point radially inward and radially outward, and the formalism is well adapted to treatment of the propagation of radiation in curved spacetime. The Weyl scalars, derived from the Weyl tensor, are often used. In particular, it can be shown that one of these scalars— $in the appropriate frame—encodes the outgoing gravitational radiation of an asymptotically flat system.$

In numerical analysis, a quasi-Newton method is an iterative numerical method used either to find zeroes or to find local maxima and minima of functions via an iterative recurrence formula much like the one for Newton's method, except using approximations of the derivatives of the functions in place of exact derivatives. Newton's method requires the Jacobian matrix of all partial derivatives of a multivariate function when used to search for zeros or the Hessian matrix when used for finding extrema. Quasi-Newton methods, on the other hand, can be used when the Jacobian matrices or Hessian matrices are unavailable or are impractical to compute at every iteration.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

Subgradient methods are convex optimization methods which use subderivatives. Originally developed by Naum Z. Shor and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, sub-gradient methods for unconstrained problems use the same search direction as the method of steepest descent.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ( $).$

Powell's dog leg method, also called Powell's hybrid method, is an iterative optimisation algorithm for the solution of non-linear least squares problems, introduced in 1970 by Michael J. D. Powell. Similarly to the Levenberg–Marquardt algorithm, it combines the Gauss–Newton algorithm with gradient descent, but it uses an explicit trust region. At each iteration, if the step from the Gauss–Newton algorithm is within the trust region, it is used to update the current solution. If not, the algorithm searches for the minimum of the objective function along the steepest descent direction, known as Cauchy point. If the Cauchy point is outside of the trust region, it is truncated to the boundary of the latter and it is taken as the new solution. If the Cauchy point is inside the trust region, the new solution is taken at the intersection between the trust region boundary and the line joining the Cauchy point and the Gauss-Newton step.

The Barzilai-Borwein method is an iterative gradient descent method for unconstrained optimization using either of two step sizes derived from the linear trend of the most recent two iterates. This method, and modifications, are globally convergent under mild conditions, and perform competitively with conjugate gradient methods for many problems. Not depending on the objective itself, it can also solve some systems of linear and non-linear equations.

References

↑ Fletcher, R.; Reeves, C. M. (1964). "Function minimization by conjugate gradients". The Computer Journal. 7 (2): 149–154. doi: 10.1093/comjnl/7.2.149 .
↑ Polak, E.; Ribière, G. (1969). "Note sur la convergence de méthodes de directions conjuguées". Revue Française d'Automatique, Informatique, Recherche Opérationnelle. 3 (1): 35–43.
↑ Hestenes, M. R.; Stiefel, E. (1952). "Methods of Conjugate Gradients for Solving Linear Systems". Journal of Research of the National Bureau of Standards. 49 (6): 409–436. doi: 10.6028/jres.049.044 .
↑ Dai, Y.-H.; Yuan, Y. (1999). "A nonlinear conjugate gradient method with a strong global convergence property". SIAM Journal on Optimization. 10 (1): 177–182. doi:10.1137/S1052623497318992.
↑ Shewchuk, J. R. (August 1994). "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" (PDF).
1 2 Ross, I. M. (2019). "An Optimal Control Theory for Accelerated Optimization". arXiv: 1902.09004 [math.OC].

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Fletcher, R.; Reeves, C. M. (1964). "Function minimization by conjugate gradients". The Computer Journal. 7 (2): 149–154. doi: 10.1093/comjnl/7.2.149 .

[2] Polak, E.; Ribière, G. (1969). "Note sur la convergence de méthodes de directions conjuguées". Revue Française d'Automatique, Informatique, Recherche Opérationnelle. 3 (1): 35–43.

[3] Hestenes, M. R.; Stiefel, E. (1952). "Methods of Conjugate Gradients for Solving Linear Systems". Journal of Research of the National Bureau of Standards. 49 (6): 409–436. doi: 10.6028/jres.049.044 .

[4] Dai, Y.-H.; Yuan, Y. (1999). "A nonlinear conjugate gradient method with a strong global convergence property". SIAM Journal on Optimization. 10 (1): 177–182. doi:10.1137/S1052623497318992.

[5] Shewchuk, J. R. (August 1994). "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" (PDF).

[:0-6] 1 2 Ross, I. M. (2019). "An Optimal Control Theory for Accelerated Optimization". arXiv: 1902.09004 [math.OC].

[1]

[2]

[3]

[4]

[5]

[6]

Nonlinear conjugate gradient method

See also

Related Research Articles

References