Wolfe conditions

Last updated January 19, 2025

In the unconstrained minimization problem, the Wolfe conditions are a set of inequalities for performing inexact line search, especially in quasi-Newton methods, first published by Philip Wolfe in 1969.^[1]^[2]

In these methods the idea is to find $\min _{\mathbf {x} }f(\mathbf {x} )$ for some smooth $f\colon \mathbb {R} ^{n}\to \mathbb {R}$. Each step often involves approximately solving the subproblem $\min _{\alpha }f(\mathbf {x} _{k}+\alpha \mathbf {p} _{k})$, where $\mathbf {x} _{k}$ is the current best guess, $\mathbf {p} _{k}\in \mathbb {R} ^{n}$ is a search direction, and $\alpha \in \mathbb {R}$ is the step length.

Inexact line searches provide an efficient way of computing an acceptable step length $\alpha$ that reduces the objective function 'sufficiently', rather than minimizing the objective function over $\alpha \in \mathbb {R} ^{+}$ exactly. A line search algorithm can use the Wolfe conditions as an acceptance test for each trial value of $\alpha$, before a new search direction $\mathbf {p} _{k}$ is computed.

Armijo rule and curvature

A step length $\alpha _{k}$ is said to satisfy the Wolfe conditions, restricted to the direction $\mathbf {p} _{k}$ , if the following two inequalities hold:

i) $f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})\leq f(\mathbf {x} _{k})+c_{1}\alpha _{k}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}),$
ii) $-\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})\leq -c_{2}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}),$

with $0<c_{1}<c_{2}<1$. (In examining condition ii), recall that $\mathbf {p} _{k}$ is a descent direction only if $\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k})<0$, as in the case of gradient descent, where $\mathbf {p} _{k}=-\nabla f(\mathbf {x} _{k})$, or Newton–Raphson, where $\mathbf {p} _{k}=-\mathbf {H} ^{-1}\nabla f(\mathbf {x} _{k})$ with $\mathbf {H}$ positive definite.)

$c_{1}$ is usually chosen to be quite small while $c_{2}$ is much larger; Nocedal and Wright give example values of $c_{1}=10^{-4}$ and $c_{2}=0.9$ for Newton or quasi-Newton methods and $c_{2}=0.1$ for the nonlinear conjugate gradient method.^[3] Inequality i) is known as the Armijo rule^[4] and ii) as the curvature condition; i) ensures that the step length $\alpha _{k}$ decreases $f$ 'sufficiently', and ii) ensures that the slope has been reduced sufficiently. Conditions i) and ii) can be interpreted as respectively providing an upper and lower bound on the admissible step length values.
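The two inequalities i) and ii) can be checked numerically for a trial step. The following is a minimal sketch, not a reference implementation: the function names `satisfies_wolfe` and `wolfe_bisection`, the test function $f(\mathbf {x} )={\tfrac {1}{2}}\|\mathbf {x} \|^{2}$, and the bracket-and-bisect strategy are all assumptions chosen for illustration. The search shrinks the step when i) fails (the step is too long) and grows it when ii) fails (the step is too short), which mirrors the upper-bound and lower-bound interpretation above.

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (armijo_ok, curvature_ok) for step length alpha along p."""
    slope0 = p @ grad(x)              # p_k^T grad f(x_k); negative for a descent direction
    x_new = x + alpha * p
    armijo_ok = f(x_new) <= f(x) + c1 * alpha * slope0    # condition i)
    curvature_ok = -(p @ grad(x_new)) <= -c2 * slope0     # condition ii)
    return bool(armijo_ok), bool(curvature_ok)

def wolfe_bisection(f, grad, x, p, alpha=1.0, c1=1e-4, c2=0.9, max_iter=50):
    """Hypothetical helper: bracket-and-bisect search for a step satisfying i) and ii)."""
    lo, hi = 0.0, np.inf
    for _ in range(max_iter):
        armijo_ok, curvature_ok = satisfies_wolfe(f, grad, x, p, alpha, c1, c2)
        if not armijo_ok:             # step too long: tighten the upper bound
            hi = alpha
            alpha = 0.5 * (lo + hi)
        elif not curvature_ok:        # step too short: raise the lower bound
            lo = alpha
            alpha = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha
    raise RuntimeError("no step satisfying both conditions found within max_iter")

# Illustrative problem (an assumption): f(x) = 0.5 * ||x||^2 with steepest descent.
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
x0 = np.array([1.0, 1.0])
p0 = -grad(x0)

print(satisfies_wolfe(f, grad, x0, p0, 1.0))    # (True, True)
print(satisfies_wolfe(f, grad, x0, p0, 0.05))   # (True, False): too short, ii) fails
print(satisfies_wolfe(f, grad, x0, p0, 2.5))    # (False, True): too long, i) fails
```

For this quadratic the admissible steps form the interval $0.1\leq \alpha \leq 2-2c_{1}$, so starting `wolfe_bisection` from a very small trial step only needs a few doublings.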

Strong Wolfe condition on curvature

Denote a univariate function $\varphi$ restricted to the direction $\mathbf {p} _{k}$ as $\varphi (\alpha )=f(\mathbf {x} _{k}+\alpha \mathbf {p} _{k})$ . The Wolfe conditions can result in a value for the step length that is not close to a minimizer of $\varphi$ . If we modify the curvature condition to the following,

iii) ${\big |}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k}){\big |}\leq c_{2}{\big |}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}){\big |}$

then i) and iii) together form the so-called strong Wolfe conditions, and force $\alpha _{k}$ to lie close to a critical point of $\varphi$ .
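The difference between ii) and iii) shows up when a step overshoots the minimizer of $\varphi$ and lands where the slope is large and positive. The sketch below illustrates this on the one-dimensional function $f(x)=x^{2}$ starting from $x=-1$ with $p=1$; the function name `satisfies_strong_wolfe` and the example problem are assumptions made for illustration only.

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (armijo_ok, weak_curvature_ok, strong_curvature_ok) for step alpha."""
    slope0 = p @ grad(x)                  # phi'(0)
    x_new = x + alpha * p
    slope_new = p @ grad(x_new)           # phi'(alpha)
    armijo_ok = f(x_new) <= f(x) + c1 * alpha * slope0    # condition i)
    weak_ok = slope_new >= c2 * slope0                    # condition ii), rearranged
    strong_ok = abs(slope_new) <= c2 * abs(slope0)        # condition iii)
    return bool(armijo_ok), bool(weak_ok), bool(strong_ok)

# Illustrative problem (an assumption): phi(alpha) = (alpha - 1)^2, minimized at alpha = 1.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x = np.array([-1.0])
p = np.array([1.0])

print(satisfies_strong_wolfe(f, grad, x, p, 1.0))    # (True, True, True)
print(satisfies_strong_wolfe(f, grad, x, p, 1.99))   # (True, True, False)
```

At $\alpha =1.99$ the step still decreases $f$ and passes ii), because ii) places no upper limit on a positive slope, but it fails iii) since $|\varphi '(1.99)|=1.98>c_{2}|\varphi '(0)|=1.8$.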

Rationale

The principal reason for imposing the Wolfe conditions in an optimization algorithm where $\mathbf {x} _{k+1}=\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k}$ is to ensure convergence of the gradient to zero. In particular, if the cosine of the angle between $\mathbf {p} _{k}$ and the gradient, $\cos \theta _{k}={\frac {\nabla f(\mathbf {x} _{k})^{\mathrm {T} }\mathbf {p} _{k}}{\|\nabla f(\mathbf {x} _{k})\|\|\mathbf {p} _{k}\|}}$, is bounded away from zero and conditions i) and ii) hold, then $\nabla f(\mathbf {x} _{k})\rightarrow 0$.
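The argument behind this claim can be sketched as follows. It is the standard result usually attributed to Zoutendijk, and it relies on assumptions not stated above: $f$ is bounded below and $\nabla f$ is Lipschitz continuous on a neighborhood of the level set of the starting point. Under those assumptions, conditions i) and ii) imply

$\sum _{k\geq 0}\cos ^{2}\theta _{k}\,\|\nabla f(\mathbf {x} _{k})\|^{2}<\infty ,$

so the terms of the series tend to zero. If $|\cos \theta _{k}|\geq \delta >0$ for all $k$ (the constant $\delta$ is a name introduced only for this sketch), then

$\delta ^{2}\|\nabla f(\mathbf {x} _{k})\|^{2}\leq \cos ^{2}\theta _{k}\,\|\nabla f(\mathbf {x} _{k})\|^{2}\rightarrow 0,$

which gives $\nabla f(\mathbf {x} _{k})\rightarrow 0$.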

An additional motivation, in the case of a quasi-Newton method, is that if $\mathbf {p} _{k}=-B_{k}^{-1}\nabla f(\mathbf {x} _{k})$, where the matrix $B_{k}$ is updated by the BFGS or DFP formula, and $B_{k}$ is positive definite, then ii) implies that $B_{k+1}$ is also positive definite.
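The reason can be sketched in one line. Write $\mathbf {s} _{k}=\alpha _{k}\mathbf {p} _{k}$ and $\mathbf {y} _{k}=\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})-\nabla f(\mathbf {x} _{k})$; this notation is an assumption here, following the usual quasi-Newton convention. Rearranging ii) as $\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})\geq c_{2}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k})$ gives

$\mathbf {y} _{k}^{\mathrm {T} }\mathbf {s} _{k}=\alpha _{k}\mathbf {p} _{k}^{\mathrm {T} }{\big (}\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})-\nabla f(\mathbf {x} _{k}){\big )}\geq \alpha _{k}(c_{2}-1)\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k})>0,$

where the last inequality holds because $\alpha _{k}>0$, $c_{2}<1$ and $\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k})<0$. The BFGS and DFP updates preserve positive definiteness exactly when $B_{k}$ is positive definite and $\mathbf {y} _{k}^{\mathrm {T} }\mathbf {s} _{k}>0$.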

Comments

Wolfe's conditions are more complicated than Armijo's condition, and a gradient descent algorithm based on Armijo's condition has a better theoretical guarantee than one based on Wolfe conditions (see the sections on "Upper bound for learning rates" and "Theoretical guarantee" in the Backtracking line search article).

References

1. Wolfe, P. (1969). "Convergence Conditions for Ascent Methods". SIAM Review. 11 (2): 226–235. doi:10.1137/1011036. JSTOR 2028111.
2. Wolfe, P. (1971). "Convergence Conditions for Ascent Methods. II: Some Corrections". SIAM Review. 13 (2): 185–188. doi:10.1137/1013035. JSTOR 2028821.
3. Nocedal, Jorge; Wright, Stephen (1999). Numerical Optimization. Springer. p. 38. ISBN 978-0-387-98793-4.
4. Armijo, Larry (1966). "Minimization of functions having Lipschitz continuous first partial derivatives". Pacific J. Math. 16 (1): 1–3. doi:10.2140/pjm.1966.16.1.