In mathematics, Anderson acceleration, also called Anderson mixing, is a method for the acceleration of the convergence rate of fixed-point iterations. Introduced by Donald G. Anderson, [1] this technique can be used to find the solution to fixed-point equations $f(x) = x$ often arising in the field of computational science.
Given a function $f: \mathbb{R}^n \to \mathbb{R}^n$, consider the problem of finding a fixed point of $f$, which is a solution to the equation $f(x) = x$. A classical approach to the problem is to employ a fixed-point iteration scheme; [2] that is, given an initial guess $x_0$ for the solution, to compute the sequence $x_{i+1} = f(x_i)$ until some convergence criterion is met. However, the convergence of such a scheme is not guaranteed in general; moreover, the rate of convergence is usually linear, which can become too slow if the evaluation of the function $f$ is computationally expensive. [2] Anderson acceleration is a method to accelerate the convergence of the fixed-point sequence. [2]
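As a point of reference, a minimal MATLAB sketch of the plain (unaccelerated) fixed-point iteration might look as follows; the map f, the initial guess, and the tolerance are illustrative assumptions (the same map is used in the Anderson-accelerated example later in this article):

f = @(x) sin(x) + atan(x); % Example map (illustrative).
x = 1;                     % Initial guess x0 (illustrative).
tol = 1e-6;                % Tolerance on the residual (illustrative).
k_max = 1000;              % Safety cap on the number of iterations.

g = f(x) - x;              % Residual g(x) = f(x) - x.
k = 0;
while abs(g) > tol && k < k_max
    x = x + g;             % Fixed-point update: x_{i+1} = f(x_i).
    g = f(x) - x;          % One function evaluation per iteration.
    k = k + 1;
end
fprintf("Fixed point %f reached after %d iterations\n", x, k);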
Define the residual $g(x) = f(x) - x$, and denote $f_k = f(x_k)$ and $g_k = g(x_k)$ (where $x_k$ corresponds to the sequence of iterates from the previous paragraph). Given an initial guess $x_0$ and an integer parameter $m \geq 1$, the method can be formulated as follows: [3] [note 1]

$x_1 = f(x_0)$
for $k = 1, 2, \dots$:
  $m_k = \min\{m, k\}$
  $G_k = [\, g_{k-m_k} \ \cdots \ g_k \,]$
  $\alpha_k = \operatorname{argmin}_{\alpha \in A_k} \|G_k \alpha\|_2$, where $A_k = \{\alpha = (\alpha_0, \dots, \alpha_{m_k}) \in \mathbb{R}^{m_k+1} : \sum_{i=0}^{m_k} \alpha_i = 1\}$
  $x_{k+1} = \sum_{i=0}^{m_k} (\alpha_k)_i f_{k-m_k+i}$
where the matrix–vector multiplication $G_k \alpha = \sum_{i=0}^{m_k} (\alpha)_i g_{k-m_k+i}$, and $(\alpha)_i$ is the $i$th element of $\alpha$. Conventional stopping criteria can be used to end the iterations of the method. For example, iterations can be stopped when $\|x_{k+1} - x_k\|$ falls under a prescribed tolerance, or when the residual $\|g(x_k)\|$ falls under a prescribed tolerance. [2]
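To make the scheme concrete, consider a worked special case (an illustration, not part of the original formulation): the scalar case $f: \mathbb{R} \to \mathbb{R}$ with $m = 1$ at step $k = 1$, so that $m_1 = 1$ and $G_1 = [\, g_0 \ g_1 \,]$. The constraint $\alpha_0 + \alpha_1 = 1$ gives $\|G_1 \alpha\| = |g_0 + \alpha_1 (g_1 - g_0)|$, which is minimized (to zero) by

$\alpha_1 = \dfrac{g_0}{g_0 - g_1}, \qquad x_2 = \alpha_0 f(x_0) + \alpha_1 f(x_1) = x_1 - g_1 \, \dfrac{x_1 - x_0}{g_1 - g_0},$

which is exactly one step of the secant method applied to $g(x) = 0$; in this sense, Anderson acceleration can be viewed as a generalization of the secant method.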
Compared with the standard fixed-point iteration, the method has been found to converge faster and to be more robust, and in some cases it avoids the divergence of the fixed-point sequence. [3] [4]
For the solution $x^*$, we know that $f(x^*) = x^*$, which is equivalent to saying that $g(x^*) = 0$. We can therefore rephrase the problem as an optimization problem where we want to minimize $\|g(x)\|$.
Instead of going directly from $x_k$ to $x_{k+1}$ by choosing $x_{k+1} = f(x_k)$ as in fixed-point iteration, let's consider an intermediate point $x'_{k+1}$ that we choose to be the linear combination $x'_{k+1} = X_k \alpha$, where the coefficient vector $\alpha \in A_k$, and $X_k = [\, x_{k-m_k} \ \cdots \ x_k \,]$ is the matrix containing the last $m_k + 1$ points, and choose $x'_{k+1}$ such that it minimizes $\|g(x'_{k+1})\|$. Since the elements in $\alpha$ sum to one, we can make the first order approximation $g(x'_{k+1}) = g\big(\sum_{i=0}^{m_k} \alpha_i x_{k-m_k+i}\big) \approx \sum_{i=0}^{m_k} \alpha_i g(x_{k-m_k+i}) = G_k \alpha$, and our problem becomes to find the $\alpha$ that minimizes $\|G_k \alpha\|$. After having found $\alpha_k$, we could in principle calculate $x'_{k+1} = X_k \alpha_k$.
However, since $f$ is designed to bring a point closer to $x^*$, $f(x'_{k+1})$ is probably closer to $x^*$ than $x'_{k+1}$ is, so it makes sense to choose $x_{k+1} = f(x'_{k+1})$ rather than $x_{k+1} = x'_{k+1}$. Furthermore, since the elements in $\alpha$ sum to one, we can make the first order approximation $f(x'_{k+1}) = f\big(\sum_{i=0}^{m_k} (\alpha_k)_i x_{k-m_k+i}\big) \approx \sum_{i=0}^{m_k} (\alpha_k)_i f(x_{k-m_k+i}) = \sum_{i=0}^{m_k} (\alpha_k)_i f_{k-m_k+i}$. We therefore choose

$x_{k+1} = \sum_{i=0}^{m_k} (\alpha_k)_i f_{k-m_k+i}$.
At each iteration of the algorithm, the constrained optimization problem $\operatorname{argmin}_{\alpha} \|G_k \alpha\|_2$, subject to $\sum_{i=0}^{m_k} \alpha_i = 1$, needs to be solved. The problem can be recast in several equivalent formulations, [3] yielding different solution methods which may result in a more convenient implementation: for instance, the constraint can be eliminated by substituting $\alpha_{m_k} = 1 - \sum_{i=0}^{m_k-1} \alpha_i$, or the problem can be rewritten in terms of the increments $\Delta g_i = g_{i+1} - g_i$ and $\Delta x_i = x_{i+1} - x_i$: defining the matrices $\mathcal{G}_k = [\, \Delta g_{k-m_k} \ \cdots \ \Delta g_{k-1} \,]$ and $\mathcal{X}_k = [\, \Delta x_{k-m_k} \ \cdots \ \Delta x_{k-1} \,]$, one solves $\gamma_k = \operatorname{argmin}_{\gamma \in \mathbb{R}^{m_k}} \|g_k - \mathcal{G}_k \gamma\|_2$ and then computes the new iterate as $x_{k+1} = x_k + g_k - (\mathcal{X}_k + \mathcal{G}_k) \gamma_k$ (this is the formulation used in the MATLAB example below). For both choices, the optimization problem is in the form of an unconstrained linear least-squares problem, which can be solved by standard methods including QR decomposition [3] and singular value decomposition, [4] possibly including regularization techniques to deal with rank deficiencies and conditioning issues in the optimization problem. Solving the least-squares problem by solving the normal equations is generally not advisable, due to potential numerical instabilities and the high computational cost. [4]
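As an illustration of this last point, the following MATLAB sketch (on made-up data, not taken from the referenced implementations) contrasts a backward-stable QR-based solve with the normal-equations approach, which squares the condition number of the matrix:

% Illustrative least-squares solve: min over gamma of ||gk - Gc*gamma||_2.
Gc = [1.0, 1.0; 1.0, 1.0 + 1e-4; 0.0, 1e-4];  % Ill-conditioned increment matrix (made up).
gk = [1; 2; 3];                               % Current residual vector (made up).

[Q, R] = qr(Gc, 0);                 % Thin QR decomposition of Gc.
gamma_qr = R \ (Q' * gk);           % Backward-stable least-squares solution.

% Normal equations square the condition number of Gc and are best avoided:
gamma_ne = (Gc' * Gc) \ (Gc' * gk);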
Stagnation in the method (i.e. subsequent iterations with the same value, $x_{k+1} = x_k$) causes the method to break down, due to the singularity of the least-squares problem. Similarly, near-stagnation ($x_{k+1} \approx x_k$) results in bad conditioning of the least-squares problem. Moreover, the choice of the parameter $m$ might be relevant in determining the conditioning of the least-squares problem, as discussed below. [3]
The algorithm can be modified by introducing a variable relaxation parameter (or mixing parameter) $\beta_k > 0$. [1] [3] [4] At each step, compute the new iterate as

$x_{k+1} = (1 - \beta_k) \sum_{i=0}^{m_k} (\alpha_k)_i x_{k-m_k+i} + \beta_k \sum_{i=0}^{m_k} (\alpha_k)_i f_{k-m_k+i}$.

The choice of $\beta_k$ is crucial to the convergence properties of the method; in principle, $\beta_k$ might vary at each iteration, although it is often chosen to be constant. [4]
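In the increment formulation used in the MATLAB example below, this relaxed update can be written compactly (a sketch reusing that example's variable names; the constant value of beta_k is an illustrative assumption):

beta_k = 0.5;  % Illustrative constant relaxation parameter.
% Relaxed Anderson update in the unconstrained (gamma) formulation:
x(k + 1) = x(k) + beta_k * g(k) - (X_k + beta_k * G_k) * gamma_k;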
The parameter $m$ determines how much information from previous iterations is used to compute the new iteration $x_{k+1}$. On the one hand, if $m$ is chosen to be too small, too little information is used and convergence may be undesirably slow. On the other hand, if $m$ is too large, information from old iterations may be retained for too many subsequent iterations, so that again convergence may be slow. [3] Moreover, the choice of $m$ affects the size of the optimization problem. Too large a value of $m$ may worsen the conditioning of the least-squares problem and the cost of its solution. [3] In general, the particular problem to be solved determines the best choice of the parameter $m$. [3]
With respect to the algorithm described above, the choice of $m_k$ at each iteration can be modified. One possibility is to choose $m_k = k$ for each iteration $k$ (sometimes referred to as Anderson acceleration without truncation). [3] This way, every new iterate $x_{k+1}$ is computed using all the previously computed iterations. A more sophisticated technique is based on choosing $m_k$ so as to maintain a small enough conditioning for the least-squares problem. [3]
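One such conditioning-based strategy might be sketched in MATLAB as follows (an illustrative sketch, not a prescribed scheme; cond_max is an assumed threshold, and X_k, G_k are the increment matrices of the example below):

cond_max = 1e8;  % Illustrative bound on the acceptable condition number (an assumption).
% Drop the oldest increment columns until the least-squares problem is acceptably conditioned.
while size(G_k, 2) > 1 && cond(G_k) > cond_max
    G_k = G_k(:, 2:end);
    X_k = X_k(:, 2:end);
end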
Newton's method can be applied to the solution of $g(x) = 0$ to compute a fixed point of $f(x)$ with quadratic convergence. However, such a method requires the evaluation of the exact derivative of $g(x)$, which can be very costly. [4] Approximating the derivative by means of finite differences is a possible alternative, but it requires multiple evaluations of $f(x)$ at each iteration, which again can become very costly. Anderson acceleration requires only one evaluation of the function $f(x)$ per iteration, and no evaluation of its derivative. On the other hand, the convergence of an Anderson-accelerated fixed-point sequence is still linear in general. [5]
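For a concrete point of comparison, a minimal Newton iteration for the same scalar problem used in the example below might look as follows (an illustrative sketch; the hand-coded derivative dg and the initial guess are assumptions, not taken from the referenced works):

f  = @(x) sin(x) + atan(x);              % Same example map as in the code below.
g  = @(x) f(x) - x;                      % Residual whose root is the fixed point.
dg = @(x) cos(x) + 1 ./ (1 + x.^2) - 1;  % Exact derivative g'(x), which Newton requires.

x = 2;  % Initial guess; note that from x = 1, where g'(1) is close to zero,
        % the Newton step would be thrown far from the fixed point.
for k = 1:20
    x = x - g(x) / dg(x);                % Newton step for g(x) = 0.
    if abs(g(x)) < 1e-12
        break;
    end
end
fprintf("Newton: fixed point %f after %d iterations\n", x, k);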
Several authors have pointed out similarities between the Anderson acceleration scheme and other methods for the solution of non-linear equations. In particular, it has been interpreted as a multisecant quasi-Newton method, and, for linear problems, Anderson acceleration without truncation has been shown to be essentially equivalent to the GMRES method.
Moreover, several equivalent or nearly equivalent methods have been independently developed by other authors, [9] [10] [11] [12] [13] although most often in the context of some specific application of interest rather than as a general method for fixed point equations.
The following is an example implementation in MATLAB language of the Anderson acceleration scheme for finding the fixed point of the function $f(x) = \sin(x) + \arctan(x)$. Notice that:
- the scheme is implemented in the increment formulation with the unconstrained variable $\gamma_k$, so that the new iterate is computed as $x_{k+1} = x_k + g_k - (\mathcal{X}_k + \mathcal{G}_k)\gamma_k$;
- the least-squares problem is solved by QR decomposition;
- at each iteration, only the last $m_k = \min\{m, k\}$ columns of the increment matrices $\mathcal{X}_k$ and $\mathcal{G}_k$ are retained.
f = @(x) sin(x) + atan(x);  % Function whose fixed point is to be computed.
x0 = 1;                     % Initial guess.

k_max = 100;                % Maximum number of iterations.
tol_res = 1e-6;             % Tolerance on the residual.
m = 3;                      % Parameter m.

x = [x0, f(x0)];            % Vector of iterates x.
g = f(x) - x;               % Vector of residuals.

G_k = g(2) - g(1);          % Matrix of increments in residuals.
X_k = x(2) - x(1);          % Matrix of increments in x.

k = 2;
while k < k_max && abs(g(k)) > tol_res
    m_k = min(k, m);

    % Solve the optimization problem by QR decomposition.
    [Q, R] = qr(G_k);
    gamma_k = R \ (Q' * g(k));

    % Compute new iterate and new residual.
    x(k + 1) = x(k) + g(k) - (X_k + G_k) * gamma_k;
    g(k + 1) = f(x(k + 1)) - x(k + 1);

    % Update increment matrices with new elements.
    X_k = [X_k, x(k + 1) - x(k)];
    G_k = [G_k, g(k + 1) - g(k)];

    n = size(X_k, 2);
    if n > m_k
        X_k = X_k(:, n - m_k + 1:end);
        G_k = G_k(:, n - m_k + 1:end);
    end

    k = k + 1;
end

% Prints result: Computed fixed point 2.013444 after 9 iterations
fprintf("Computed fixed point %f after %d iterations\n", x(end), k);