Wolfe conditions

Last updated October 27, 2023

In the unconstrained minimization problem, the Wolfe conditions are a set of inequalities for performing inexact line search, especially in quasi-Newton methods, first published by Philip Wolfe in 1969.^[1]^[2]

The inexact line searches provide an efficient way of computing an acceptable step length $\alpha$ that reduces the objective function 'sufficiently', rather than minimizing the objective function over $\alpha \in \mathbb {R} ^{+}$ exactly. A line search algorithm can use Wolfe conditions as a requirement for any guessed $\alpha$ , before finding a new search direction $\mathbf {p} _{k}$ .

Armijo rule and curvature

A step length $\alpha _{k}$ is said to satisfy the Wolfe conditions, restricted to the direction $\mathbf {p} _{k}$ , if the following two inequalities hold:

$f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})\leq f(\mathbf {x} _{k})+c_{1}\alpha _{k}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}),$
${-\mathbf {p} }_{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k})\leq -c_{2}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}),$

with $0<c_{1}<c_{2}<1$ . (In examining condition (ii), recall that to ensure that $\mathbf {p} _{k}$ is a descent direction, we have $\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k})<0$ , as in the case of gradient descent, where $\mathbf {p} _{k}=-\nabla f(\mathbf {x} _{k})$ , or Newton–Raphson, where $\mathbf {p} _{k}=-\mathbf {H} ^{-1}\nabla f(\mathbf {x} _{k})$ with $\mathbf {H}$ positive definite.)

$c_{1}$ is usually chosen to be quite small while $c_{2}$ is much larger; Nocedal and Wright give example values of $c_{1}=10^{-4}$ and $c_{2}=0.9$ for Newton or quasi-Newton methods and $c_{2}=0.1$ for the nonlinear conjugate gradient method.^[3] Inequality i) is known as the Armijo rule^[4] and ii) as the curvature condition; i) ensures that the step length $\alpha _{k}$ decreases $f$ 'sufficiently', and ii) ensures that the slope has been reduced sufficiently. Conditions i) and ii) can be interpreted as respectively providing an upper and lower bound on the admissible step length values.

Strong Wolfe condition on curvature

Denote a univariate function $\varphi$ restricted to the direction $\mathbf {p} _{k}$ as $\varphi (\alpha )=f(\mathbf {x} _{k}+\alpha \mathbf {p} _{k})$ . The Wolfe conditions can result in a value for the step length that is not close to a minimizer of $\varphi$ . If we modify the curvature condition to the following,

${\big |}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k}){\big |}\leq c_{2}{\big |}\mathbf {p} _{k}^{\mathrm {T} }\nabla f(\mathbf {x} _{k}){\big |}$

then i) and iii) together form the so-called strong Wolfe conditions, and force $\alpha _{k}$ to lie close to a critical point of $\varphi$ .

Rationale

The principal reason for imposing the Wolfe conditions in an optimization algorithm where $\mathbf {x} _{k+1}=\mathbf {x} _{k}+\alpha \mathbf {p} _{k}$ is to ensure convergence of the gradient to zero. In particular, if the cosine of the angle between $\mathbf {p} _{k}$ and the gradient,

\cos \theta _{k}={\frac {\nabla f(\mathbf {x} _{k})^{\mathrm {T} }\mathbf {p} _{k}}{\|\nabla f(\mathbf {x} _{k})\|\|\mathbf {p} _{k}\|}}

is bounded away from zero and the i) and ii) conditions hold, then $\nabla f(\mathbf {x} _{k})\rightarrow 0$ .

An additional motivation, in the case of a quasi-Newton method, is that if $\mathbf {p} _{k}=-B_{k}^{-1}\nabla f(\mathbf {x} _{k})$ , where the matrix $B_{k}$ is updated by the BFGS or DFP formula, then if $B_{k}$ is positive definite ii) implies $B_{k+1}$ is also positive definite.

Comments

Wolfe's conditions are more complicated than Armijo's condition, and a gradient descent algorithm based on Armijo's condition has a better theoretical guarantee than one based on Wolfe conditions (see the sections on "Upper bound for learning rates" and "Theoretical guarantee" in the Backtracking line search article).

Related Research Articles

<span class="mw-page-title-main">Gradient</span> Multivariate derivative (mathematics)

In vector calculus, the gradient of a scalar-valued differentiable function $of several variables is the vector field whose value at a point is the "direction and rate of fastest increase". The gradient transforms like a vector under change of basis of the space of variables of . If the gradient of a function is non-zero at a point, the direction of the gradient is the direction in which the function increases most quickly from, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative. Further, a point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent. In coordinate-free terms, the gradient of a function may be defined by:$

<span class="mw-page-title-main">Navier–Stokes equations</span> Equations describing the motion of viscous fluid substances

The Navier–Stokes equations are partial differential equations which describe the motion of viscous fluid substances, named after French engineer and physicist Claude-Louis Navier and Irish physicist and mathematician George Gabriel Stokes. They were developed over several decades of progressively building the theories, from 1822 (Navier) to 1842–1850 (Stokes).

In the calculus of variations, a field of mathematical analysis, the functional derivative relates a change in a functional to a change in a function on which the functional depends.

In mathematics, the covariant derivative is a way of specifying a derivative along tangent vectors of a manifold. Alternatively, the covariant derivative is a way of introducing and working with a connection on a manifold by means of a differential operator, to be contrasted with the approach given by a principal connection on the frame bundle – see affine connection. In the special case of a manifold isometrically embedded into a higher-dimensional Euclidean space, the covariant derivative can be viewed as the orthogonal projection of the Euclidean directional derivative onto the manifold's tangent space. In this case the Euclidean derivative is broken into two parts, the extrinsic normal component and the intrinsic covariant derivative component.

In vector calculus, a conservative vector field is a vector field that is the gradient of some function. A conservative vector field has the property that its line integral is path independent; the choice of any path between two points does not change the value of the line integral. Path independence of the line integral is equivalent to the vector field under the line integral being conservative. A conservative vector field is also irrotational; in three dimensions, this means that it has vanishing curl. An irrotational vector field is necessarily conservative provided that the domain is simply connected.

A Halbach array is a special arrangement of permanent magnets that augments the magnetic field on one side of the array while cancelling the field to near zero on the other side. This is achieved by having a spatially rotating pattern of magnetisation.

In mathematics, the total variation identifies several slightly different concepts, related to the (local or global) structure of the codomain of a function or a measure. For a real-valued continuous function f, defined on an interval [a, b] ⊂ R, its total variation on the interval of definition is a measure of the one-dimensional arclength of the curve with parametric equation x ↦ f(x), for x ∈ [a, b]. Functions whose total variation is finite are called functions of bounded variation.

In physics and mathematics, in the area of vector calculus, Helmholtz's theorem, also known as the fundamental theorem of vector calculus, states that any sufficiently smooth, rapidly decaying vector field in three dimensions can be resolved into the sum of an irrotational (curl-free) vector field and a solenoidal (divergence-free) vector field; this is known as the Helmholtz decomposition or Helmholtz representation. It is named after Hermann von Helmholtz.

In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum $of an objective function . The other approach is trust region.$

In (unconstrained) mathematical optimization, a backtracking line search is a line search method to determine the amount to move along a given search direction. Its use requires that the objective function is differentiable and that its gradient is known.

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations via a generalized secant method.

One of the guiding principles in modern chemical dynamics and spectroscopy is that the motion of the nuclei in a molecule is slow compared to that of its electrons. This is justified by the large disparity between the mass of an electron, and the typical mass of a nucleus and leads to the Born–Oppenheimer approximation and the idea that the structure and dynamics of a chemical species are largely determined by nuclear motion on potential energy surfaces.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

In electromagnetism and applications, an inhomogeneous electromagnetic wave equation, or nonhomogeneous electromagnetic wave equation, is one of a set of wave equations describing the propagation of electromagnetic waves generated by nonzero source charges and currents. The source terms in the wave equations make the partial differential equations inhomogeneous, if the source terms are zero the equations reduce to the homogeneous electromagnetic wave equations. The equations follow from Maxwell's equations.

The gradient theorem, also known as the fundamental theorem of calculus for line integrals, says that a line integral through a gradient field can be evaluated by evaluating the original scalar field at the endpoints of the curve. The theorem is a generalization of the second fundamental theorem of calculus to any curve in a plane or space rather than just the real line.

The Navier–Stokes existence and smoothness problem concerns the mathematical properties of solutions to the Navier–Stokes equations, a system of partial differential equations that describe the motion of a fluid in space. Solutions to the Navier–Stokes equations are used in many practical applications. However, theoretical understanding of the solutions to these equations is incomplete. In particular, solutions of the Navier–Stokes equations often include turbulence, which remains one of the greatest unsolved problems in physics, despite its immense importance in science and engineering.

In applied mathematics, polyharmonic splines are used for function approximation and data interpolation. They are very useful for interpolating and fitting scattered data in many dimensions. Special cases include thin plate splines and natural cubic splines in one dimension.

There are various mathematical descriptions of the electromagnetic field that are used in the study of electromagnetism, one of the four fundamental interactions of nature. In this article, several approaches are discussed, although the equations are in terms of electric and magnetic fields, potentials, and charges with currents, generally speaking.

In differential calculus, there is no single uniform notation for differentiation. Instead, various notations for the derivative of a function or variable have been proposed by various mathematicians. The usefulness of each notation varies with the context, and it is sometimes advantageous to use more than one notation in a given context. The most common notations for differentiation are listed below.

Lagrangian field theory is a formalism in classical field theory. It is the field-theoretic analogue of Lagrangian mechanics. Lagrangian mechanics is used to analyze the motion of a system of discrete particles each with a finite number of degrees of freedom. Lagrangian field theory applies to continua and fields, which have an infinite number of degrees of freedom.

References

↑ Wolfe, P. (1969). "Convergence Conditions for Ascent Methods". SIAM Review . 11 (2): 226–235. doi:10.1137/1011036. JSTOR 2028111.
↑ Wolfe, P. (1971). "Convergence Conditions for Ascent Methods. II: Some Corrections". SIAM Review . 13 (2): 185–188. doi:10.1137/1013035. JSTOR 2028821.
↑ Nocedal, Jorge; Wright, Stephen (1999). Numerical Optimization. p. 38.
↑ Armijo, Larry (1966). "Minimization of functions having Lipschitz continuous first partial derivatives". Pacific J. Math. 16 (1): 1–3. doi: 10.2140/pjm.1966.16.1 .