Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective, but the augmented Lagrangian method adds yet another term designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with, the method of Lagrange multipliers.
Viewed differently, the unconstrained objective is the Lagrangian of the constrained problem, with an additional penalty term (the augmentation).
The method was originally known as the method of multipliers and was studied in the 1970s and 1980s as a potential alternative to penalty methods. It was first discussed by Magnus Hestenes [1] and then by Michael Powell in 1969. [2] The method was studied by R. Tyrrell Rockafellar in relation to Fenchel duality, particularly in relation to proximal-point methods, Moreau–Yosida regularization, and maximal monotone operators; these methods were used in structural optimization. The method was also studied by Dimitri Bertsekas, notably in his 1982 book, [3] together with extensions involving non-quadratic regularization functions (e.g., entropic regularization). This combined study gives rise to the "exponential method of multipliers" which handles inequality constraints with a twice-differentiable augmented Lagrangian function.
Since the 1970s, sequential quadratic programming (SQP) and interior point methods (IPM) have been given more attention, in part because they more easily use sparse matrix subroutines from numerical software libraries, and in part because IPMs possess proven complexity results via the theory of self-concordant functions. The augmented Lagrangian method was rejuvenated by the optimization systems LANCELOT, ALGENCAN [4] [5] and AMPL, which allowed sparse matrix techniques to be used on seemingly dense but "partially-separable" problems. The method is still useful for some problems. [6]
Around 2007, there was a resurgence of augmented Lagrangian methods in fields such as total variation denoising and compressed sensing. In particular, a variant of the standard augmented Lagrangian method that uses partial updates (similar to the Gauss–Seidel method for solving linear equations) known as the alternating direction method of multipliers or ADMM gained some attention.
Consider solving the following constrained optimization problem:

$$\min_x \; f(x)$$

subject to

$$c_i(x) = 0 \quad \forall i \in \mathcal{E},$$

where $\mathcal{E}$ denotes the indices for equality constraints. This problem can be solved as a series of unconstrained minimization problems. For reference, we first list the $k$th step of the penalty method approach:

$$\min_x \; \Phi_k(x) = f(x) + \mu_k \sum_{i \in \mathcal{E}} c_i(x)^2.$$

The penalty method solves this problem, then at the next iteration it re-solves the problem using a larger value of $\mu_k$ and using the old solution as the initial guess or "warm start".
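For concreteness, the following is a minimal Python sketch of this penalty iteration; the toy problem, the growth schedule for $\mu_k$, and the use of SciPy's BFGS solver for the inner minimizations are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the penalty iteration on an assumed toy problem:
#   min  x1^2 + x2^2   subject to   x1 + x2 - 1 = 0.
f = lambda x: x[0] ** 2 + x[1] ** 2
c = lambda x: np.array([x[0] + x[1] - 1.0])  # equality constraint residuals c_i(x)

x, mu = np.zeros(2), 1.0
for k in range(8):
    # k-th penalty subproblem: Phi_k(x) = f(x) + mu_k * sum_i c_i(x)^2
    phi = lambda x: f(x) + mu * np.sum(c(x) ** 2)
    x = minimize(phi, x, method="BFGS").x  # warm start from the previous solution
    mu *= 10.0                             # re-solve with a larger penalty parameter
print(x, c(x))  # x approaches [0.5, 0.5] only as mu grows without bound
```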
The augmented Lagrangian method uses the following unconstrained objective:

$$\min_x \; \Phi_k(x) = f(x) + \frac{\mu_k}{2} \sum_{i \in \mathcal{E}} c_i(x)^2 - \sum_{i \in \mathcal{E}} \lambda_i \, c_i(x),$$

and after each iteration, in addition to updating $\mu_k$, the variable $\lambda$ is also updated according to the rule

$$\lambda_i \leftarrow \lambda_i - \mu_k \, c_i(x_k),$$

where $x_k$ is the solution to the unconstrained problem at the $k$th step (i.e. $x_k = \operatorname{arg\,min}_x \Phi_k(x)$).
The variable $\lambda$ is an estimate of the Lagrange multiplier, and the accuracy of this estimate improves at every step. The major advantage of the method is that, unlike the penalty method, it is not necessary to take $\mu_k \to \infty$ in order to solve the original constrained problem. Because of the presence of the Lagrange multiplier term, $\mu_k$ can stay much smaller, thus avoiding ill-conditioning. [6] Nevertheless, it is common in practical implementations to project the multiplier estimates onto a large bounded set (safeguarding), which avoids numerical instabilities and leads to strong theoretical convergence. [5]
The method can be extended to handle inequality constraints. For a discussion of practical improvements, see refs. [6] [5]
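The following is a minimal Python sketch of the full augmented Lagrangian loop (equality constraints only) on the same assumed toy problem; the stopping tolerance and penalty growth factor are illustrative choices, not part of the method's specification.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the augmented Lagrangian loop on the assumed toy problem:
#   min  x1^2 + x2^2   subject to   x1 + x2 - 1 = 0.
f = lambda x: x[0] ** 2 + x[1] ** 2
c = lambda x: np.array([x[0] + x[1] - 1.0])

def augmented_lagrangian(f, c, x0, mu=1.0, growth=10.0, tol=1e-8, max_outer=25):
    x = np.asarray(x0, dtype=float)
    lam = np.zeros_like(c(x))                  # Lagrange multiplier estimates
    for _ in range(max_outer):
        # Inner problem: Phi_k(x) = f(x) + (mu/2) * sum_i c_i(x)^2 - sum_i lam_i * c_i(x)
        phi = lambda x: f(x) + 0.5 * mu * np.sum(c(x) ** 2) - lam @ c(x)
        x = minimize(phi, x, method="BFGS").x  # warm start from the previous iterate
        if np.linalg.norm(c(x)) < tol:         # stop once (nearly) feasible
            break
        lam = lam - mu * c(x)                  # multiplier update: lam_i <- lam_i - mu * c_i(x_k)
        mu *= growth                           # mu may grow, but need not tend to infinity
    return x, lam

x_star, lam_star = augmented_lagrangian(f, c, x0=np.zeros(2))
print(x_star, lam_star)  # approx [0.5, 0.5] with multiplier approx 1.0
```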
The alternating direction method of multipliers (ADMM) is a variant of the augmented Lagrangian scheme that uses partial updates for the dual variables. This method is often applied to solve problems such as

$$\min_x \; f(x) + g(x).$$

This is equivalent to the constrained problem

$$\min_{x,y} \; f(x) + g(y) \quad \text{subject to} \quad x = y.$$
Though this change may seem trivial, the problem can now be attacked using methods of constrained optimization (in particular, the augmented Lagrangian method), and the objective function is separable in x and y. The dual update requires solving a proximity function in x and y at the same time; the ADMM technique allows this problem to be solved approximately by first solving for x with y fixed and then solving for y with x fixed. Rather than iterate until convergence (like the Jacobi method), the algorithm proceeds directly to updating the dual variable and then repeating the process. This is not equivalent to the exact minimization, but the method still converges to the correct solution under some assumptions. Because of this approximation, the algorithm is distinct from the pure augmented Lagrangian method.
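As a concrete illustration of the alternating $x$- and $y$-steps followed by the dual update, here is a minimal Python sketch of ADMM applied to a lasso problem via the split $x = y$, written with the scaled dual variable; the problem data and parameter values are assumptions for the example.

```python
import numpy as np

# Minimal sketch of ADMM for the lasso split  min f(x) + g(y)  s.t.  x = y,
# with f(x) = 0.5*||Ax - b||^2 and g(y) = alpha*||y||_1, in scaled dual form.
def admm_lasso(A, b, alpha=0.1, rho=1.0, n_iter=200):
    n = A.shape[1]
    x, y, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # The x-step solves (A^T A + rho I) x = A^T b + rho (y - u); factor once.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    soft = lambda v, k: np.sign(v) * np.maximum(np.abs(v) - k, 0.0)
    for _ in range(n_iter):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (y - u)))  # x-step, y fixed
        y = soft(x + u, alpha / rho)                                       # y-step, x fixed
        u = u + x - y                                                      # dual update
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = A @ np.r_[np.ones(3), np.zeros(7)] + 0.01 * rng.standard_normal(40)
print(admm_lasso(A, b))  # approximately sparse, supported on the first 3 coordinates
```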
The ADMM can be viewed as an application of the Douglas–Rachford splitting algorithm, and the Douglas–Rachford algorithm is in turn an instance of the proximal point algorithm; details can be found in ref. [7] There are several modern software packages, including YALL1 [8] (2009), SpaRSA [9] (2009) and SALSA [10] (2009), which solve basis pursuit and variants and use the ADMM. There are also packages that use the ADMM to solve more general problems, some of which can exploit multiple computing cores (e.g., SNAPVX [11] (2015), parADMM [12] (2016)).
Stochastic optimization considers the problem of minimizing a loss function with access to noisy samples of the (gradient of the) function. The goal is to have an estimate of the optimal parameter (minimizer) per new sample. With some modifications, ADMM can be used for stochastic optimization. In a stochastic setting, only noisy samples of the gradient are accessible, so an inexact approximation of the Lagrangian is used. For the general ADMM problem $\min_{x,y} f(x) + g(y)$ subject to $Ax + By = c$, the approximation at step $k$ is

$$\hat{\mathcal{L}}_{\rho,k} = f(x_k) + \langle \nabla f(x_k, \zeta_{k+1}),\, x \rangle + g(y) - z^\top (Ax + By - c) + \frac{\rho}{2}\,\lVert Ax + By - c \rVert^2 + \frac{\lVert x - x_k \rVert^2}{2\eta_{k+1}},$$

where $\eta_{k+1}$ is a time-varying step size. [13]
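Under the simple split $x = y$ (i.e. $A = I$, $B = -I$, $c = 0$), the linearized proximal $x$-update above admits a closed form, and a stochastic ADMM iteration can be sketched as follows; the sampling model, the step-size schedule $\eta_k = \eta_0/\sqrt{k}$, and all parameter values are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of stochastic ADMM for  min E[f(x; zeta)] + alpha*||y||_1
# under the split x = y, in scaled dual form.  The linearized proximal x-step
# below is the closed-form minimizer of
#   <g_k, x> + (rho/2)*||x - y + u||^2 + ||x - x_k||^2 / (2*eta_k).
def stochastic_admm(sample_grad, n, alpha=0.1, rho=1.0, eta0=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, y, u = np.zeros(n), np.zeros(n), np.zeros(n)
    soft = lambda v, k: np.sign(v) * np.maximum(np.abs(v) - k, 0.0)
    for k in range(1, n_iter + 1):
        eta = eta0 / np.sqrt(k)          # time-varying step size eta_k
        g = sample_grad(x, rng)          # noisy gradient sample at the current x_k
        x = (rho * (y - u) + x / eta - g) / (rho + 1.0 / eta)  # linearized x-step
        y = soft(x + u, alpha / rho)     # exact y-step (soft-thresholding)
        u = u + x - y                    # dual update
    return y

# Example: noisy gradients of 0.5 * E[(a^T x - b)^2] from random rows of (A, b).
rng0 = np.random.default_rng(1)
A = rng0.standard_normal((200, 10))
b = A @ np.r_[np.ones(3), np.zeros(7)]
def sample_grad(x, rng):
    i = rng.integers(len(b))
    return A[i] * (A[i] @ x - b[i])
print(stochastic_admm(sample_grad, n=10))
```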
ADMM has been applied to solve regularized problems, where the function optimization and regularization can be carried out locally and then coordinated globally via constraints. [14] [15] [16] [17]
Regularized optimization problems are especially relevant in the high-dimensional regime, since regularization is a natural mechanism for overcoming ill-posedness and encouraging parsimony in the optimal solution (e.g., sparsity and low rank). ADMM's effectiveness on regularized problems suggests it may also be useful for high-dimensional stochastic optimization problems. [18]
Open source and non-free/commercial implementations of the augmented Lagrangian method include LANCELOT and ALGENCAN, among others.