Anderson acceleration

In mathematics, Anderson acceleration, also called Anderson mixing, is a method for the acceleration of the convergence rate of fixed-point iterations. Introduced by Donald G. Anderson, [1] this technique can be used to find the solution to fixed point equations often arising in the field of computational science.

Definition

Given a function f : R^n → R^n, consider the problem of finding a fixed point of f, which is a solution to the equation f(x) = x. A classical approach to the problem is to employ a fixed-point iteration scheme; [2] that is, given an initial guess x_0 for the solution, to compute the sequence x_{i+1} = f(x_i) until some convergence criterion is met. However, the convergence of such a scheme is not guaranteed in general; moreover, the rate of convergence is usually linear, which can become too slow if the evaluation of the function f is computationally expensive. [2] Anderson acceleration is a method to accelerate the convergence of the fixed-point sequence. [2]
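As a minimal illustration (a Python sketch, not part of the original article), plain fixed-point iteration can be written as follows, applied to the same function used in the MATLAB implementation later in this article:

```python
import math

def fixed_point_iteration(f, x0, tol=1e-9, k_max=1000):
    """Plain fixed-point iteration: x_{k+1} = f(x_k)."""
    x = x0
    for k in range(1, k_max + 1):
        x_new = f(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# Fixed point of f(x) = sin(x) + atan(x); here |f'(x*)| < 1 near the
# fixed point, so the plain iteration converges, but only linearly.
f = lambda t: math.sin(t) + math.atan(t)
x_star, iters = fixed_point_iteration(f, 1.0)
```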

Define the residual g(x) = f(x) − x, and denote f_k = f(x_k) and g_k = g(x_k) (where x_k corresponds to the sequence of iterates from the previous paragraph). Given an initial guess x_0 and an integer parameter m ≥ 1, the method can be formulated as follows: [3] [note 1]

    x_1 = f(x_0)
    for k = 1, 2, …
        m_k = min(m, k)
        G_k = [g_{k−m_k}, …, g_k]
        α_k = argmin ‖G_k α‖_2 over α = (α_0, …, α_{m_k}) such that Σ_{i=0}^{m_k} α_i = 1
        x_{k+1} = Σ_{i=0}^{m_k} (α_k)_i f_{k−m_k+i}

where the matrix–vector multiplication G_k α = Σ_{i=0}^{m_k} α_i g_{k−m_k+i}, and (α_k)_i is the i-th element of α_k. Conventional stopping criteria can be used to end the iterations of the method. For example, iterations can be stopped when ‖x_{k+1} − x_k‖ falls under a prescribed tolerance, or when the residual ‖g(x_k)‖ falls under a prescribed tolerance. [2]

With respect to the standard fixed-point iteration, the method has been found to converge faster and be more robust, and in some cases avoid the divergence of the fixed-point sequence. [3] [4]
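For a concrete (and purely illustrative) sketch, the scheme can be written in Python with NumPy for a scalar problem, using the unconstrained least-squares reformulation in terms of increment matrices that also underlies the MATLAB example later in this article:

```python
import numpy as np

def anderson(f, x0, m=3, tol=1e-9, k_max=100):
    """Anderson acceleration for a scalar fixed-point problem f(x) = x.

    Uses the increment (unconstrained least-squares) reformulation:
    solve min_gamma ||g_k - dG_k gamma||_2, then set
    x_{k+1} = x_k + g_k - (dX_k + dG_k) gamma.
    """
    x = [x0, f(x0)]
    g = [f(x[0]) - x[0], f(x[1]) - x[1]]
    dX = [x[1] - x[0]]            # increments of the iterates
    dG = [g[1] - g[0]]            # increments of the residuals
    k = 1
    while k < k_max and abs(g[k]) > tol:
        m_k = min(k, m)
        G_k = np.array(dG[-m_k:], dtype=float).reshape(1, -1)
        X_k = np.array(dX[-m_k:], dtype=float).reshape(1, -1)
        # Least-squares solution (minimum-norm, since G_k is a single row).
        gamma, *_ = np.linalg.lstsq(G_k, np.array([g[k]]), rcond=None)
        x.append(x[k] + g[k] - float((X_k + G_k) @ gamma))
        g.append(f(x[k + 1]) - x[k + 1])
        dX.append(x[k + 1] - x[k])
        dG.append(g[k + 1] - g[k])
        k += 1
    return x[-1], k

f = lambda t: np.sin(t) + np.arctan(t)
x_star, iters = anderson(f, 1.0)
```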

Derivation

For the solution x*, we know that f(x*) = x*, which is equivalent to saying that g(x*) = 0. We can therefore rephrase the problem as an optimization problem where we want to minimize ‖g(x)‖.

Instead of going directly from x_k to x_{k+1} by choosing x_{k+1} = f(x_k) as in fixed-point iteration, let's consider an intermediate point x′ that we choose to be the linear combination x′ = X_k α, where the coefficient vector α = (α_0, …, α_{m_k}) satisfies Σ_{i=0}^{m_k} α_i = 1, and X_k = [x_{k−m_k}, …, x_k] is the matrix containing the last m_k + 1 points, and choose x′ such that it minimizes ‖g(x′)‖. Since the elements in α sum to one, we can make the first-order approximation g(X_k α) = g(Σ_{i=0}^{m_k} α_i x_{k−m_k+i}) ≈ Σ_{i=0}^{m_k} α_i g(x_{k−m_k+i}) = G_k α, and our problem becomes to find the α that minimizes ‖G_k α‖_2. After having found α_k, we could in principle calculate x_{k+1} = X_k α_k.

However, since f is designed to bring a point closer to x*, f(x′) is probably closer to x* than x′ is, so it makes sense to choose x_{k+1} = f(X_k α_k) rather than x_{k+1} = X_k α_k. Furthermore, since the elements in α_k sum to one, we can make the first-order approximation f(X_k α_k) = f(Σ_{i=0}^{m_k} (α_k)_i x_{k−m_k+i}) ≈ Σ_{i=0}^{m_k} (α_k)_i f(x_{k−m_k+i}). We therefore choose

    x_{k+1} = Σ_{i=0}^{m_k} (α_k)_i f_{k−m_k+i}.

Solution of the minimization problem

At each iteration of the algorithm, the constrained optimization problem min ‖G_k α‖_2, subject to Σ_{i=0}^{m_k} α_i = 1, needs to be solved. The problem can be recast in several equivalent formulations, [3] yielding different solution methods which may result in a more convenient implementation. For example, defining the matrices of increments ΔX_k = [x_{k−m_k+1} − x_{k−m_k}, …, x_k − x_{k−1}] and ΔG_k = [g_{k−m_k+1} − g_{k−m_k}, …, g_k − g_{k−1}], the constraint can be eliminated by a change of variables, leading to the unconstrained least-squares problem min ‖g_k − ΔG_k γ‖_2 over γ = (γ_0, …, γ_{m_k−1}), after which the new iterate is computed as x_{k+1} = x_k + g_k − (ΔX_k + ΔG_k) γ_k. [3]

Once recast, the optimization problem is in the form of an unconstrained linear least-squares problem, which can be solved by standard methods including QR decomposition [3] and singular value decomposition, [4] possibly including regularization techniques to deal with rank deficiencies and conditioning issues in the optimization problem. Solving the least-squares problem by solving the normal equations is generally not advisable due to potential numerical instabilities and generally high computational cost. [4]
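As an illustration (not from the article), the QR-based solve of a generic unconstrained linear least-squares problem can be sketched in Python with NumPy; solving the triangular system from the QR factorization is numerically preferable to forming the normal equations, which square the condition number of the matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))   # tall matrix, full column rank
b = rng.standard_normal(20)

# QR approach: A = QR (reduced), then solve the triangular system R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Reference solution from NumPy's least-squares driver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

# The normal equations A^T A x = A^T b give the same answer on this
# well-conditioned example, but they square the condition number of A,
# which is why this route is generally discouraged.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
```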

Stagnation in the method (i.e. two subsequent iterations with the same value, x_{k+1} = x_k) causes the method to break down, due to the singularity of the least-squares problem. Similarly, near-stagnation (x_{k+1} ≈ x_k) results in bad conditioning of the least-squares problem. Moreover, the choice of the parameter m might be relevant in determining the conditioning of the least-squares problem, as discussed below. [3]

Relaxation

The algorithm can be modified by introducing a variable relaxation parameter (or mixing parameter) β_k > 0. [1] [3] [4] At each step, compute the new iterate as

    x_{k+1} = (1 − β_k) Σ_{i=0}^{m_k} (α_k)_i x_{k−m_k+i} + β_k Σ_{i=0}^{m_k} (α_k)_i f(x_{k−m_k+i}).

The choice of β_k is crucial to the convergence properties of the method; in principle, β_k might vary at each iteration, although it is often chosen to be constant. [4]
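The role of the mixing parameter can already be seen on plain fixed-point iteration, which the relaxed scheme reduces to when no previous iterates are combined. The following Python sketch (illustrative only, with a hypothetical linear f) shows a case where the unrelaxed iteration diverges but a relaxed one converges:

```python
def relaxed_iteration(f, x0, beta, tol=1e-12, k_max=10000):
    """Relaxed fixed-point iteration: x_{k+1} = (1 - beta) x_k + beta f(x_k).

    beta = 1 recovers the plain iteration x_{k+1} = f(x_k).
    """
    x = x0
    for k in range(1, k_max + 1):
        x_new = (1 - beta) * x + beta * f(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("no convergence")

# Hypothetical linear example: f has fixed point x* = 2, but since
# f'(x*) = -1.5 the plain iteration (beta = 1) diverges; mixing with
# beta = 0.5 gives the contraction factor |1 - 2.5 * 0.5| = 0.25.
f = lambda x: -1.5 * x + 5.0
x_star, iters = relaxed_iteration(f, 0.0, beta=0.5)
```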

Choice of m

The parameter m determines how much information from previous iterations is used to compute the new iteration x_{k+1}. On the one hand, if m is chosen to be too small, too little information is used and convergence may be undesirably slow. On the other hand, if m is too large, information from old iterations may be retained for too many subsequent iterations, so that again convergence may be slow. [3] Moreover, the choice of m affects the size of the optimization problem: too large a value of m may worsen the conditioning of the least-squares problem and increase the cost of its solution. [3] In general, the particular problem to be solved determines the best choice of the parameter m. [3]

Choice of m_k

With respect to the algorithm described above, the choice of m_k at each iteration can be modified. One possibility is to choose m_k = k for each iteration (sometimes referred to as Anderson acceleration without truncation). [3] This way, every new iteration x_{k+1} is computed using all the previously computed iterations. A more sophisticated technique is based on choosing m_k so as to maintain a small enough conditioning for the least-squares problem. [3]

Relations to other classes of methods

Newton's method can be applied to the solution of g(x) = f(x) − x = 0 to compute a fixed point of f with quadratic convergence. However, such a method requires the evaluation of the exact derivative of g, which can be very costly. [4] Approximating the derivative by means of finite differences is a possible alternative, but it requires multiple evaluations of f at each iteration, which again can become very costly. Anderson acceleration requires only one evaluation of the function f per iteration, and no evaluation of its derivative. On the other hand, the convergence of an Anderson-accelerated fixed-point sequence is still linear in general. [5]
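The finite-difference alternative mentioned above can be sketched in Python (an illustrative sketch, not from the article), applied to the same fixed-point problem as the MATLAB example below, recast as a root-finding problem:

```python
import math

def newton_fd(g, x0, h=1e-7, tol=1e-12, k_max=100):
    """Newton's method on g(x) = 0 with a forward-difference derivative.

    Each step costs two evaluations of g (hence two of f when
    g(x) = f(x) - x), versus one evaluation of f per Anderson step.
    """
    x = x0
    for k in range(k_max):
        gx = g(x)
        if abs(gx) < tol:
            return x, k
        dg = (g(x + h) - gx) / h   # finite-difference approximation of g'(x)
        x -= gx / dg
    raise RuntimeError("no convergence")

# Root of g(x) = sin(x) + atan(x) - x; the initial guess must be
# reasonably close to the root, since g'(x) is nearly zero around x = 1.
g = lambda x: math.sin(x) + math.atan(x) - x
x_star, iters = newton_fd(g, 2.5)
```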

Several authors have pointed out similarities between the Anderson acceleration scheme and other methods for the solution of non-linear equations. In particular:

  - Eyert [6] and Fang and Saad [4] interpreted the scheme within the class of quasi-Newton and multisecant methods, generalizing Broyden's method; [7]
  - Walker and Ni [3] [8] showed that Anderson acceleration without truncation is essentially equivalent to the GMRES method when applied to linear problems, so that it can be seen as a generalization of GMRES to the nonlinear case.

Moreover, several equivalent or nearly equivalent methods have been independently developed by other authors, [9] [10] [11] [12] [13] although most often in the context of some specific application of interest rather than as a general method for fixed point equations.

Example MATLAB implementation

The following is an example implementation in the MATLAB language of the Anderson acceleration scheme for finding the fixed point of the function f(x) = sin(x) + atan(x). Notice that:

  - the implementation solves a scalar problem, so that the iterates, residuals and increments are scalars collected into row vectors, but the structure of the algorithm is the same in the vector case;
  - the QR decomposition of the increment matrix is recomputed from scratch at each iteration, while a more efficient implementation would update the existing factorization as columns are added and removed; [14]
  - all the computed iterates and residuals are stored, although the algorithm only needs the most recent ones.

f = @(x) sin(x) + atan(x); % Function whose fixed point is to be computed.
x0 = 1;                    % Initial guess.

k_max = 100;     % Maximum number of iterations.
tol_res = 1e-6;  % Tolerance on the residual.
m = 3;           % Parameter m.

x = [x0, f(x0)]; % Vector of iterates x.
g = f(x) - x;    % Vector of residuals.

G_k = g(2) - g(1); % Matrix of increments in residuals.
X_k = x(2) - x(1); % Matrix of increments in x.

k = 2;
while k < k_max && abs(g(k)) > tol_res
    m_k = min(k, m);

    % Solve the optimization problem by QR decomposition.
    [Q, R] = qr(G_k);
    gamma_k = R \ (Q' * g(k));

    % Compute new iterate and new residual.
    x(k + 1) = x(k) + g(k) - (X_k + G_k) * gamma_k;
    g(k + 1) = f(x(k + 1)) - x(k + 1);

    % Update increment matrices with new elements.
    X_k = [X_k, x(k + 1) - x(k)];
    G_k = [G_k, g(k + 1) - g(k)];

    n = size(X_k, 2);
    if n > m_k
        X_k = X_k(:, n - m_k + 1:end);
        G_k = G_k(:, n - m_k + 1:end);
    end

    k = k + 1;
end

% Prints result: Computed fixed point 2.013444 after 9 iterations
fprintf("Computed fixed point %f after %d iterations\n", x(end), k);

Notes

  1. This formulation is not the same as given by the original author; [1] it is an equivalent, more explicit formulation given by Walker and Ni. [3]


References

  1. Anderson, Donald G. (October 1965). "Iterative Procedures for Nonlinear Integral Equations". Journal of the ACM. 12 (4): 547–560. doi:10.1145/321296.321305.
  2. Quarteroni, Alfio; Sacco, Riccardo; Saleri, Fausto (2010). Numerical Mathematics (2nd ed.). Springer. ISBN 978-3-540-49809-4.
  3. Walker, Homer F.; Ni, Peng (January 2011). "Anderson Acceleration for Fixed-Point Iterations". SIAM Journal on Numerical Analysis. 49 (4): 1715–1735. CiteSeerX 10.1.1.722.2636. doi:10.1137/10078356X.
  4. Fang, Haw-ren; Saad, Yousef (March 2009). "Two classes of multisecant methods for nonlinear acceleration". Numerical Linear Algebra with Applications. 16 (3): 197–221. doi:10.1002/nla.617.
  5. Evans, Claire; Pollock, Sara; Rebholz, Leo G.; Xiao, Mengying (2020). "A Proof That Anderson Acceleration Improves the Convergence Rate in Linearly Converging Fixed-Point Methods (But Not in Those Converging Quadratically)". SIAM Journal on Numerical Analysis. 58 (1): 788–810. arXiv:1810.08455. doi:10.1137/19M1245384.
  6. Eyert, V. (March 1996). "A Comparative Study on Methods for Convergence Acceleration of Iterative Vector Sequences". Journal of Computational Physics. 124 (2): 271–285. doi:10.1006/jcph.1996.0059.
  7. Broyden, C. G. (1965). "A class of methods for solving nonlinear simultaneous equations". Mathematics of Computation. 19 (92): 577–593. doi:10.1090/S0025-5718-1965-0198670-6.
  8. Ni, Peng (November 2009). Anderson Acceleration of Fixed-point Iteration with Applications to Electronic Structure Computations (PhD thesis).
  9. Oosterlee, C. W.; Washio, T. (January 2000). "Krylov Subspace Acceleration of Nonlinear Multigrid with Application to Recirculating Flows". SIAM Journal on Scientific Computing. 21 (5): 1670–1690. doi:10.1137/S1064827598338093.
  10. Pulay, Péter (July 1980). "Convergence acceleration of iterative sequences. The case of SCF iteration". Chemical Physics Letters. 73 (2): 393–398. doi:10.1016/0009-2614(80)80396-4.
  11. Pulay, P. (1982). "Improved SCF convergence acceleration". Journal of Computational Chemistry. 3 (4): 556–560. doi:10.1002/jcc.540030413.
  12. Carlson, Neil N.; Miller, Keith (May 1998). "Design and Application of a Gradient-Weighted Moving Finite Element Code I: in One Dimension". SIAM Journal on Scientific Computing. 19 (3): 728–765. doi:10.1137/S106482759426955X.
  13. Miller, Keith (November 2005). "Nonlinear Krylov and moving nodes in the method of lines". Journal of Computational and Applied Mathematics. 183 (2): 275–287. doi:10.1016/j.cam.2004.12.032.
  14. Daniel, J. W.; Gragg, W. B.; Kaufman, L.; Stewart, G. W. (October 1976). "Reorthogonalization and stable algorithms for updating the Gram–Schmidt QR factorization". Mathematics of Computation. 30 (136): 772. doi:10.1090/S0025-5718-1976-0431641-8.