Anderson acceleration

Last updated

In mathematics, Anderson acceleration, also called Anderson mixing, is a method for the acceleration of the convergence rate of fixed-point iterations. Introduced by Donald G. Anderson, [1] this technique can be used to find the solution to fixed point equations often arising in the field of computational science.

Contents

Definition

Given a function , consider the problem of finding a fixed point of , which is a solution to the equation . A classical approach to the problem is to employ a fixed-point iteration scheme; [2] that is, given an initial guess for the solution, to compute the sequence until some convergence criterion is met. However, the convergence of such a scheme is not guaranteed in general; moreover, the rate of convergence is usually linear, which can become too slow if the evaluation of the function is computationally expensive. [2] Anderson acceleration is a method to accelerate the convergence of the fixed-point sequence. [2]

Define the residual , and denote and (where corresponds to the sequence of iterates from the previous paragraph). Given an initial guess and an integer parameter , the method can be formulated as follows: [3] [note 1]

where the matrix–vector multiplication , and is the th element of . Conventional stopping criteria can be used to end the iterations of the method. For example, iterations can be stopped when falls under a prescribed tolerance, or when the residual falls under a prescribed tolerance. [2]

With respect to the standard fixed-point iteration, the method has been found to converge faster and be more robust, and in some cases avoid the divergence of the fixed-point sequence. [3] [4]

Derivation

For the solution , we know that , which is equivalent to saying that . We can therefore rephrase the problem as an optimization problem where we want to minimize .

Instead of going directly from to by choosing as in fixed-point iteration, let's consider an intermediate point that we choose to be the linear combination , where the coefficient vector , and is the matrix containing the last points, and choose such that it minimizes . Since the elements in sum to one, we can make the first order approximation , and our problem becomes to find the that minimizes . After having found , we could in principle calculate .

However, since is designed to bring a point closer to , is probably closer to than what is, so it makes sense to choose rather than . Furthermore, since the elements in sum to one, we can make the first order approximation . We therefore choose

.

Solution of the minimization problem

At each iteration of the algorithm, the constrained optimization problem , subject to needs to be solved. The problem can be recast in several equivalent formulations, [3] yielding different solution methods which may result in a more convenient implementation:

For both choices, the optimization problem is in the form of an unconstrained linear least-squares problem, which can be solved by standard methods including QR decomposition [3] and singular value decomposition, [4] possibly including regularization techniques to deal with rank deficiencies and conditioning issues in the optimization problem. Solving the least-squares problem by solving the normal equations is generally not advisable due to potential numerical instabilities and generally high computational cost. [4]

Stagnation in the method (i.e. subsequent iterations with the same value, ) causes the method to break down, due to the singularity of the least-squares problem. Similarly, near-stagnation () results in bad conditioning of the least squares problem. Moreover, the choice of the parameter might be relevant in determining the conditioning of the least-squares problem, as discussed below. [3]

Relaxation

The algorithm can be modified introducing a variable relaxation parameter (or mixing parameter) . [1] [3] [4] At each step, compute the new iterate as

The choice of is crucial to the convergence properties of the method; in principle, might vary at each iteration, although it is often chosen to be constant. [4]

Choice of m

The parameter determines how much information from previous iterations is used to compute the new iteration . On the one hand, if is chosen to be too small, too little information is used and convergence may be undesirably slow. On the other hand, if is too large, information from old iterations may be retained for too many subsequent iterations, so that again convergence may be slow. [3] Moreover, the choice of affects the size of the optimization problem. A too large value of may worsen the conditioning of the least squares problem and the cost of its solution. [3] In general, the particular problem to be solved determines the best choice of the parameter. [3]

Choice of mk

With respect to the algorithm described above, the choice of at each iteration can be modified. One possibility is to choose for each iteration (sometimes referred to as Anderson acceleration without truncation). [3] This way, every new iteration is computed using all the previously computed iterations. A more sophisticated technique is based on choosing so as to maintain a small enough conditioning for the least-squares problem. [3]

Relations to other classes of methods

Newton's method can be applied to the solution of to compute a fixed point of with quadratic convergence. However, such method requires the evaluation of the exact derivative of , which can be very costly. [4] Approximating the derivative by means of finite differences is a possible alternative, but it requires multiple evaluations of at each iteration, which again can become very costly. Anderson acceleration requires only one evaluation of the function per iteration, and no evaluation of its derivative. On the other hand, the convergence of an Anderson-accelerated fixed point sequence is still linear in general. [5]

Several authors have pointed out similarities between the Anderson acceleration scheme and other methods for the solution of non-linear equations. In particular:

Moreover, several equivalent or nearly equivalent methods have been independently developed by other authors, [9] [10] [11] [12] [13] although most often in the context of some specific application of interest rather than as a general method for fixed point equations.

Example MATLAB implementation

The following is an example implementation in MATLAB language of the Anderson acceleration scheme for finding the fixed-point of the function . Notice that:

f=@(x)sin(x)+atan(x);% Function whose fixed point is to be computed.x0=1;% Initial guess.k_max=100;% Maximum number of iterations.tol_res=1e-6;% Tolerance on the residual.m=3;% Parameter m.x=[x0,f(x0)];% Vector of iterates x.g=f(x)-x;% Vector of residuals.G_k=g(2)-g(1);% Matrix of increments in residuals.X_k=x(2)-x(1);% Matrix of increments in x.k=2;whilek<k_max&&abs(g(k))>tol_resm_k=min(k,m);% Solve the optimization problem by QR decomposition.[Q,R]=qr(G_k);gamma_k=R\(Q'*g(k));% Compute new iterate and new residual.x(k+1)=x(k)+g(k)-(X_k+G_k)*gamma_k;g(k+1)=f(x(k+1))-x(k+1);% Update increment matrices with new elements.X_k=[X_k,x(k+1)-x(k)];G_k=[G_k,g(k+1)-g(k)];n=size(X_k,2);ifn>m_kX_k=X_k(:,n-m_k+1:end);G_k=G_k(:,n-m_k+1:end);endk=k+1;end% Prints result: Computed fixed point 2.013444 after 9 iterationsfprintf("Computed fixed point %f after %d iterations\n",x(end),k);

See also

Notes

  1. This formulation is not the same as given by the original author; [1] it is an equivalent, more explicit formulation given by Walker and Ni. [3]

Related Research Articles

In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots of a real-valued function. The most basic version starts with a single-variable function f defined for a real variable x, the function's derivative f, and an initial guess x0 for a root of f. If the function satisfies sufficient assumptions and the initial guess is close, then

<span class="mw-page-title-main">Gradient descent</span> Optimization algorithm

In mathematics, gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.

In numerical analysis, one of the most important problems is designing efficient and stable algorithms for finding the eigenvalues of a matrix. These eigenvalue algorithms may also find eigenvectors.

In mathematics and computing, the Levenberg–Marquardt algorithm, also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.

<span class="mw-page-title-main">Gauss–Newton algorithm</span>

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

<span class="mw-page-title-main">Stochastic gradient descent</span> Optimization algorithm

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

<span class="mw-page-title-main">Conjugate gradient method</span> Mathematical optimization algorithm

In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-definite. The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition. Large sparse systems often arise when numerically solving partial differential equations or optimization problems.

Mehrotra's predictor–corrector method in optimization is a specific interior point method for linear programming. It was proposed in 1989 by Sanjay Mehrotra.

In (unconstrained) mathematical optimization, a backtracking line search is a line search method to determine the amount to move along a given search direction. Its use requires that the objective function is differentiable and that its gradient is known.

<span class="mw-page-title-main">Crank–Nicolson method</span> Finite difference method for numerically solving parabolic differential equations

In numerical analysis, the Crank–Nicolson method is a finite difference method used for numerically solving the heat equation and similar partial differential equations. It is a second-order method in time. It is implicit in time, can be written as an implicit Runge–Kutta method, and it is numerically stable. The method was developed by John Crank and Phyllis Nicolson in the mid 20th century.

In numerical analysis, the Weierstrass method or Durand–Kerner method, discovered by Karl Weierstrass in 1891 and rediscovered independently by Durand in 1960 and Kerner in 1966, is a root-finding algorithm for solving polynomial equations. In other words, the method can be used to solve numerically the equation

In numerical analysis, Aitken's delta-squared process or Aitken extrapolation is a series acceleration method, used for accelerating the rate of convergence of a sequence. It is named after Alexander Aitken, who introduced this method in 1926. Its early form was known to Seki Kōwa and was found for rectification of the circle, i.e. the calculation of π. It is most useful for accelerating the convergence of a sequence that is converging linearly.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

Subgradient methods are iterative methods for solving convex minimization problems. Originally developed by Naum Z. Shor and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, sub-gradient methods for unconstrained problems use the same search direction as the method of steepest descent.

<span class="mw-page-title-main">Vortex lattice method</span>

The Vortex lattice method, (VLM), is a numerical method used in computational fluid dynamics, mainly in the early stages of aircraft design and in aerodynamic education at university level. The VLM models the lifting surfaces, such as a wing, of an aircraft as an infinitely thin sheet of discrete vortices to compute lift and induced drag. The influence of the thickness and viscosity is neglected.

Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM). It was invented by John Platt in 1998 at Microsoft Research. SMO is widely used for training support vector machines and is implemented by the popular LIBSVM tool. The publication of the SMO algorithm in 1998 has generated a lot of excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party QP solvers.

In machine learning, automatic basis function construction is the mathematical method of looking for a set of task-independent basis functions that map the state space to a lower-dimensional embedding, while still representing the value function accurately. Automatic basis construction is independent of prior knowledge of the domain, which allows it to perform well where expert-constructed basis functions are difficult or impossible to create.

Perspective-n-Point is the problem of estimating the pose of a calibrated camera given a set of n 3D points in the world and their corresponding 2D projections in the image. The camera pose consists of 6 degrees-of-freedom (DOF) which are made up of the rotation and 3D translation of the camera with respect to the world. This problem originates from camera calibration and has many applications in computer vision and other areas, including 3D pose estimation, robotics and augmented reality. A commonly used solution to the problem exists for n = 3 called P3P, and many solutions are available for the general case of n ≥ 3. A solution for n = 2 exists if feature orientations are available at the two points. Implementations of these solutions are also available in open source software.

Numerical certification is the process of verifying the correctness of a candidate solution to a system of equations. In (numerical) computational mathematics, such as numerical algebraic geometry, candidate solutions are computed algorithmically, but there is the possibility that errors have corrupted the candidates. For instance, in addition to the inexactness of input data and candidate solutions, numerical errors or errors in the discretization of the problem may result in corrupted candidate solutions. The goal of numerical certification is to provide a certificate which proves which of these candidates are, indeed, approximate solutions.

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting.

References

  1. 1 2 3 4 Anderson, Donald G. (October 1965). "Iterative Procedures for Nonlinear Integral Equations". Journal of the ACM. 12 (4): 547–560. doi: 10.1145/321296.321305 .
  2. 1 2 3 4 Quarteroni, Alfio; Sacco, Riccardo; Saleri, Fausto. Numerical mathematics (2nd ed.). Springer. ISBN   978-3-540-49809-4.
  3. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Walker, Homer F.; Ni, Peng (January 2011). "Anderson Acceleration for Fixed-Point Iterations". SIAM Journal on Numerical Analysis. 49 (4): 1715–1735. doi:10.1137/10078356X.
  4. 1 2 3 4 5 6 7 8 Fang, Haw-ren; Saad, Yousef (March 2009). "Two classes of multisecant methods for nonlinear acceleration". Numerical Linear Algebra with Applications. 16 (3): 197–221. doi:10.1002/nla.617.
  5. Evans, Claire; Pollock, Sara; Rebholz, Leo G.; Xiao, Mengying (20 February 2020). "A Proof That Anderson Acceleration Improves the Convergence Rate in Linearly Converging Fixed-Point Methods (But Not in Those Converging Quadratically)". SIAM Journal on Numerical Analysis. 58 (1): 788–810. arXiv: 1810.08455 . doi:10.1137/19M1245384.
  6. Eyert, V. (March 1996). "A Comparative Study on Methods for Convergence Acceleration of Iterative Vector Sequences". Journal of Computational Physics. 124 (2): 271–285. doi:10.1006/jcph.1996.0059.
  7. Broyden, C. G. (1965). "A class of methods for solving nonlinear simultaneous equations". Mathematics of Computation. 19 (92): 577–577. doi: 10.1090/S0025-5718-1965-0198670-6 .
  8. Ni, Peng (November 2009). Anderson Acceleration of Fixed-point Iteration with Applications to Electronic Structure Computations (PhD).
  9. 1 2 Oosterlee, C. W.; Washio, T. (January 2000). "Krylov Subspace Acceleration of Nonlinear Multigrid with Application to Recirculating Flows". SIAM Journal on Scientific Computing. 21 (5): 1670–1690. doi:10.1137/S1064827598338093.
  10. Pulay, Péter (July 1980). "Convergence acceleration of iterative sequences. the case of scf iteration". Chemical Physics Letters. 73 (2): 393–398. doi:10.1016/0009-2614(80)80396-4.
  11. Pulay, P. (1982). "ImprovedSCF convergence acceleration". Journal of Computational Chemistry. 3 (4): 556–560. doi:10.1002/jcc.540030413.
  12. Carlson, Neil N.; Miller, Keith (May 1998). "Design and Application of a Gradient-Weighted Moving Finite Element Code I: in One Dimension". SIAM Journal on Scientific Computing. 19 (3): 728–765. doi:10.1137/S106482759426955X.
  13. Miller, Keith (November 2005). "Nonlinear Krylov and moving nodes in the method of lines". Journal of Computational and Applied Mathematics. 183 (2): 275–287. doi: 10.1016/j.cam.2004.12.032 .
  14. Daniel, J. W.; Gragg, W. B.; Kaufman, L.; Stewart, G. W. (October 1976). "Reorthogonalization and stable algorithms for updating the Gram-Schmidt $QR$ factorization". Mathematics of Computation. 30 (136): 772–772. doi: 10.1090/S0025-5718-1976-0431641-8 .