# Newton's method in optimization

Last updated

In calculus, Newton's method is an iterative method for finding the roots of a differentiable function F, which are solutions to the equation F (x) = 0. As such, Newton's method can be applied to the derivative f of a twice-differentiable function f to find the roots of the derivative (solutions to f ′(x) = 0), also known as the critical points of f. These solutions may be minima, maxima, or saddle points; see section "Several variables" in Critical point (mathematics) and also section "Geometric interpretation" in this article. This is relevant in optimization, which aims to find (global) minima of the function f.

## Newton's Method

The central problem of optimization is minimization of functions. Let us first consider the case of univariate functions, i.e., functions of a single real variable. We will later consider the more general and more practically useful multivariate case.

Given a twice differentiable function ${\displaystyle f:\mathbb {R} \to \mathbb {R} }$, we seek to solve the optimization problem

${\displaystyle \min _{x\in \mathbb {R} }f(x).}$

Newton's method attempts to solve this problem by constructing a sequence ${\displaystyle \{x_{k}\}}$ from an initial guess (starting point) ${\displaystyle x_{0}\in \mathbb {R} }$ that converges towards a minimizer ${\displaystyle x_{*}}$ of ${\displaystyle f}$ by using a sequence of second-order Taylor approximations of ${\displaystyle f}$ around the iterates. The second-order Taylor expansion of f around ${\displaystyle x_{k}}$ is

${\displaystyle f(x_{k}+t)\approx f(x_{k})+f'(x_{k})t+{\frac {1}{2}}f''(x_{k})t^{2}.}$

The next iterate ${\displaystyle x_{k+1}}$ is defined so as to minimize this quadratic approximation in ${\displaystyle t}$, and setting ${\displaystyle x_{k+1}=x_{k}+t}$. If the second derivative is positive, the quadratic approximation is a convex function of ${\displaystyle t}$, and its minimum can be found by setting the derivative to zero. Since

${\displaystyle \displaystyle 0={\frac {\rm {d}}{{\rm {d}}t}}\left(f(x_{k})+f'(x_{k})t+{\frac {1}{2}}f''(x_{k})t^{2}\right)=f'(x_{k})+f''(x_{k})t,}$

the minimum is achieved for

${\displaystyle t=-{\frac {f'(x_{k})}{f''(x_{k})}}.}$

Putting everything together, Newton's method performs the iteration

${\displaystyle x_{k+1}=x_{k}+t=x_{k}-{\frac {f'(x_{k})}{f''(x_{k})}}.}$

## Geometric interpretation

The geometric interpretation of Newton's method is that at each iteration, it amounts to the fitting of a parabola to the graph of ${\displaystyle f(x)}$ at the trial value ${\displaystyle x_{k}}$, having the same slope and curvature as the graph at that point, and then proceeding to the maximum or minimum of that parabola (in higher dimensions, this may also be a saddle point), see below. Note that if ${\displaystyle f}$ happens to be a quadratic function, then the exact extremum is found in one step. Applying this simple observation to simple quadratic cost functions, we obtain different behaviour of Newton's method:

- For the function ${\displaystyle f(x)=x^{2}}$, which has a global minimum at 0, Newton's method will converge to 0 after 1 step.

- For the function ${\displaystyle f(x)=-x^{2}}$, which has a global maximum at 0, Newton's method will converge to 0 after 1 step.

## Higher dimensions

The above iterative scheme can be generalized to ${\displaystyle d>1}$ dimensions by replacing the derivative with the gradient (different authors use different notation for the gradient, including ${\displaystyle f'(x)=\nabla f(x)=g_{f}(x)\in \mathbb {R} ^{d}}$), and the reciprocal of the second derivative with the inverse of the Hessian matrix (different authors use different notation for the Hessian, including ${\displaystyle f''(x)=\nabla ^{2}f(x)=H_{f}(x)\in \mathbb {R} ^{d\times d}}$). One thus obtains the iterative scheme

${\displaystyle x_{k+1}=x_{k}-[f''(x_{k})]^{-1}f'(x_{k}),\qquad k\geq 0.}$

Often Newton's method is modified to include a small step size ${\displaystyle 0<\gamma \leq 1}$ instead of ${\displaystyle \gamma =1}$:

${\displaystyle x_{k+1}=x_{k}-\gamma [f''(x_{k})]^{-1}f'(x_{k}).}$

This is often done to ensure that the Wolfe conditions, or much simpler and efficient Armijo's condition, are satisfied at each step of the method. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.

## Convergence

If f is a strongly convex function with Lipschitz Hessian, then provided that ${\displaystyle x_{0}}$ is close enough to ${\displaystyle x_{*}=\arg \min f(x)}$, the sequence ${\displaystyle x_{0},x_{1},x_{2},\dots }$ generated by Newton's method will converge to the (necessarily unique) minimizer ${\displaystyle x_{*}}$ of ${\displaystyle f}$ quadratically fast. [1] That is,

${\displaystyle \|x_{k+1}-x_{*}\|\leq {\frac {1}{2}}\|x_{k}-x_{*}\|^{2},\qquad \forall k\geq 0.}$

## Computing the Newton direction

Finding the inverse of the Hessian in high dimensions to compute the Newton direction ${\displaystyle h=-(f''(x_{k}))^{-1}f'(x_{k})}$ can be an expensive operation. In such cases, instead of directly inverting the Hessian, it is better to calculate the vector ${\displaystyle h}$ as the solution to the system of linear equations

${\displaystyle [f''(x_{k})]h=-f'(x_{k})}$

which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of equations, for example the Cholesky factorization and conjugate gradient will only work if ${\displaystyle f''(x_{k})}$ is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator of something gone wrong; for example if a minimization problem is being approached and ${\displaystyle f''(x_{k})}$ is not positive definite, then the iterations are converging to a saddle point and not a minimum.

On the other hand, if a constrained optimization is done (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and the solution of ${\displaystyle x_{k+1}}$ will need to be done with a method that will work for such, such as the ${\displaystyle LDL^{\top }}$ variant of Cholesky factorization or the conjugate residual method.

There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.

If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the solution may diverge. In this case, certain workarounds have been tried in the past, which have varied success with certain problems. One can, for example, modify the Hessian by adding a correction matrix ${\displaystyle B_{k}}$ so as to make ${\displaystyle f''(x_{k})+B_{k}}$ positive definite. One approach is to diagonalize the Hessian and choose ${\displaystyle B_{k}}$ so that ${\displaystyle f''(x_{k})+B_{k}}$ has the same eigenvectors as the Hessian, but with each negative eigenvalue replaced by ${\displaystyle \epsilon >0}$.

An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix to the Hessian, ${\displaystyle \mu I}$, with the scale adjusted at every iteration as needed. For large ${\displaystyle \mu }$ and small Hessian, the iterations will behave like gradient descent with step size ${\displaystyle 1/\mu }$. This results in slower but more reliable convergence where the Hessian doesn't provide useful information.

## Some caveats

Newton's method, in its original version, has several caveats:

First: It does not work if the Hessian is not invertible. This is clear from the very definition of Newton's method, which requires taking the inverse of the Hessian.

Second: It may not converge at all, but can enter a cycle having more than 1 point. See the section "Failure analysis" in Newton's method.

Third: It can converge to a saddle point instead of to a local minimum, see the section "Geometric interpretation" in this article.

The popular modifications of Newton's method, such as quasi-Newton methods or Levenberg-Marquardt algorithm mentioned above, also have caveats:

For example, it is usually required that the cost function is (strongly) convex and the Hessian is globally bounded or Lipschitz continuous, for example this is mentioned in the section "Convergence" in this article. If one looks at the papers by Levenberg and Marquardt in the reference for Levenberg–Marquardt algorithm, which are the original sources for the mentioned method, one can see that there is basically no theoretical analysis in the paper by Levenberg, while the paper by Marquardt only analyses a local situation and does not prove a global convergence result. One can compare with Backtracking line search method for Gradient descent, which has good theoretical guarantee under more general assumptions, and can be implemented and works well in practical large scale problems such as Deep Neural Networks.

## Notes

1. Nocedal, Jorge; Wright, Stephen J. (2006). Numerical optimization (2nd ed.). New York: Springer. p. 44. ISBN   0387303030.

## Related Research Articles

In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots of a real-valued function. The most basic version starts with a single-variable function f defined for a real variable x, the function's derivative f ′, and an initial guess x0 for a root of f. If the function satisfies sufficient assumptions and the initial guess is close, then

Mathematical optimization or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. Optimization problems of sorts arise in all quantitative disciplines from computer science and engineering to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries.

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints. It is named after the mathematician Joseph-Louis Lagrange. The basic idea is to convert a constrained problem into a form such that the derivative test of an unconstrained problem can still be applied. The relationship between the gradient of the function and gradients of the constraints rather naturally leads to a reformulation of the original problem, known as the Lagrangian function.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.

In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed in the 19th century by the German mathematician Ludwig Otto Hesse and later named after him. Hesse originally used the term "functional determinants".

In mathematics and computing, the Levenberg–Marquardt algorithm, also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting.

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function. Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate.

Interior-point methods are a certain class of algorithms that solve linear and nonlinear convex optimization problems.

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations via a generalized secant method.

In mathematical optimization, a trust region is the subset of the region of the objective function that is approximated using a model function. If an adequate model of the objective function is found within the trust region, then the region is expanded; conversely, if the approximation is poor, then the region is contracted.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize over unconstrained values of the real-vector where is a differentiable scalar function.

Quasi-Newton methods are methods used to either find zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. The "full" Newton's method requires the Jacobian in order to search for zeros, or the Hessian for finding extrema.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, over the generation sequence, individuals with better and better -values are generated.

In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function

The Symmetric Rank 1 (SR1) method is a quasi-Newton method to update the second derivative (Hessian) based on the derivatives (gradients) calculated at two points. It is a generalization to the secant method for a multidimensional problem. This update maintains the symmetry of the matrix but does not guarantee that the update be positive definite.

In optimization, a self-concordant function is a function for which

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box-Cox transformed regressors.

Powell's dog leg method is an iterative optimisation algorithm for the solution of non-linear least squares problems, introduced in 1970 by Michael J. D. Powell. Similarly to the Levenberg–Marquardt algorithm, it combines the Gauss–Newton algorithm with gradient descent, but it uses an explicit trust region. At each iteration, if the step from the Gauss–Newton algorithm is within the trust region, it is used to update the current solution. If not, the algorithm searches for the minimum of the objective function along the steepest descent direction, known as Cauchy point. If the Cauchy point is outside of the trust region, it is truncated to the boundary of the latter and it is taken as the new solution. If the Cauchy point is inside the trust region, the new solution is taken at the intersection between the trust region boundary and the line joining the Cauchy point and the Gauss-Newton step.