LPBoost

Last updated October 29, 2024

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes, and thus also belongs to the class of margin classifier algorithms.

Consider a classification function $f:{\mathcal {X}}\to \{-1,1\},$ which classifies samples from a space ${\mathcal {X}}$ into one of two classes, labelled 1 and -1, respectively. LPBoost is an algorithm for learning such a classification function, given a set of training examples with known class labels. LPBoost is a machine learning technique especially suited for joint classification and feature selection in structured domains.

LPBoost overview

As in all boosting classifiers, the final classification function is of the form

f({\boldsymbol {x}})=\sum _{j=1}^{J}\alpha _{j}h_{j}({\boldsymbol {x}}),

where $\alpha _{j}$ are non-negative weightings for weak classifiers $h_{j}:{\mathcal {X}}\to \{-1,1\}$ . Each individual weak classifier $h_{j}$ may be just a little bit better than random, but the resulting linear combination of many weak classifiers can perform very well.
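The weighted vote above can be sketched in a few lines of Python. This is an illustrative sketch only: the three weak classifiers, their weights, and the tie-breaking rule (a score of exactly zero maps to class 1) are assumptions made for the example, not part of LPBoost itself.

```python
from typing import Callable, Sequence

# A weak classifier maps a sample to -1 or +1.
WeakClassifier = Callable[[Sequence[float]], int]


def ensemble_score(x: Sequence[float],
                   alphas: Sequence[float],
                   hs: Sequence[WeakClassifier]) -> float:
    """Weighted sum of weak classifier outputs, sum_j alpha_j * h_j(x)."""
    return sum(a * h(x) for a, h in zip(alphas, hs))


def classify(x: Sequence[float],
             alphas: Sequence[float],
             hs: Sequence[WeakClassifier]) -> int:
    """Sign of the weighted sum; a score of exactly 0 is mapped to +1 (assumption)."""
    return 1 if ensemble_score(x, alphas, hs) >= 0 else -1


# Hypothetical weak classifiers on one-dimensional inputs.
hs = [
    lambda x: 1 if x[0] <= 2.0 else -1,
    lambda x: 1 if x[0] <= 4.0 else -1,
    lambda x: -1,  # a constant classifier, no better than guessing
]
alphas = [0.5, 0.3, 0.2]  # non-negative weights that sum to one

print(classify([1.0], alphas, hs), classify([3.0], alphas, hs))
```

For the sample `[3.0]` the first and third classifiers vote -1 with a combined weight of 0.7, which outweighs the second classifier's vote of +1.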

LPBoost constructs $f$ by starting with an empty set of weak classifiers. In each iteration, a single weak classifier is selected and added to the set of considered weak classifiers, and all the weights ${\boldsymbol {\alpha }}$ for the current set are adjusted. This is repeated until no weak classifiers remain to be added.

The property that all classifier weights are adjusted in each iteration is known as the totally-corrective property. Early boosting methods, such as AdaBoost, do not have this property and converge more slowly.

Linear program

More generally, let ${\mathcal {H}}=\{h(\cdot ;\omega )\mid \omega \in \Omega \}$ be the possibly infinite set of weak classifiers, also termed hypotheses. One way to write down the problem LPBoost solves is as a linear program with infinitely many variables.

The primal linear program of LPBoost, optimizing over the non-negative weight vector ${\boldsymbol {\alpha }}$ , the non-negative vector ${\boldsymbol {\xi }}$ of slack variables and the margin $\rho$ is the following.

{\begin{array}{cl}{\underset {{\boldsymbol {\alpha }},{\boldsymbol {\xi }},\rho }{\min }}&-\rho +D\sum _{n=1}^{\ell }\xi _{n}\\{\textrm {s.t.}}&\sum _{\omega \in \Omega }y_{n}\alpha _{\omega }h({\boldsymbol {x}}_{n};\omega )+\xi _{n}\geq \rho ,\qquad n=1,\dots ,\ell ,\\&\sum _{\omega \in \Omega }\alpha _{\omega }=1,\\&\xi _{n}\geq 0,\qquad n=1,\dots ,\ell ,\\&\alpha _{\omega }\geq 0,\qquad \omega \in \Omega ,\\&\rho \in {\mathbb {R} }.\end{array}}

Note the effect of the slack variables ${\boldsymbol {\xi }}\geq 0$ : their one-norm is penalized in the objective function by a constant factor $D$ . Because the slack variables can absorb any margin violation, the primal linear program is always feasible.

Here we adopted the notation of a parameter space $\Omega$ , such that for a choice $\omega \in \Omega$ the weak classifier $h(\cdot ;\omega ):{\mathcal {X}}\to \{-1,1\}$ is uniquely defined.

When the above linear program was first written down in early publications about boosting methods, it was disregarded as intractable due to the large number of variables ${\boldsymbol {\alpha }}$ . Only later was it discovered that such linear programs can indeed be solved efficiently using the classic technique of column generation.

Column generation for LPBoost

In a linear program, a column corresponds to a primal variable. Column generation is a technique for solving large linear programs. It typically works on a restricted problem that contains only a subset of the variables. Primal variables are generated iteratively and on demand, so that the restricted problem grows toward the original problem with all variables. If the columns to generate are chosen cleverly, only a small fraction of the columns has to be created, while the solution obtained is still guaranteed to be optimal for the original full problem.

LPBoost dual problem

Columns in the primal linear program correspond to rows in the dual linear program. The equivalent dual linear program of LPBoost is the following.

{\begin{array}{cl}{\underset {{\boldsymbol {\lambda }},\gamma }{\max }}&\gamma \\{\textrm {s.t.}}&\sum _{n=1}^{\ell }y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}+\gamma \leq 0,\qquad \omega \in \Omega ,\\&0\leq \lambda _{n}\leq D,\qquad n=1,\dots ,\ell ,\\&\sum _{n=1}^{\ell }\lambda _{n}=1,\\&\gamma \in \mathbb {R} .\end{array}}

For linear programs, the optimal values of the primal and dual problems are equal. For the above primal and dual problems, the optimal value is equal to the negative 'soft margin'. The soft margin is the size of the margin separating positive from negative training instances minus positive slack variables that carry penalties for margin-violating samples. Thus, the soft margin may be positive even though not all samples are linearly separated by the classification function. The margin actually achieved on the training samples is called the 'hard margin' or 'realized margin'.
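The dual can be recovered from the primal through the Lagrangian. The following is a sketch for a finite parameter set $\Omega$ , with multipliers $\lambda _{n}\geq 0$ attached to the margin constraints and a free multiplier $\gamma$ attached to the normalization constraint; the sign convention for $\gamma$ is chosen here so that the result matches the dual as written above.

```latex
\begin{aligned}
L &= -\rho + D\sum_{n=1}^{\ell}\xi_n
     - \sum_{n=1}^{\ell}\lambda_n\Big(\sum_{\omega\in\Omega} y_n\alpha_\omega h(\boldsymbol{x}_n;\omega) + \xi_n - \rho\Big)
     - \gamma\Big(\sum_{\omega\in\Omega}\alpha_\omega - 1\Big) \\
  &= \gamma
     + \rho\Big(\sum_{n=1}^{\ell}\lambda_n - 1\Big)
     + \sum_{n=1}^{\ell}\xi_n\,(D-\lambda_n)
     - \sum_{\omega\in\Omega}\alpha_\omega\Big(\sum_{n=1}^{\ell} y_n h(\boldsymbol{x}_n;\omega)\lambda_n + \gamma\Big).
\end{aligned}
```

Minimizing $L$ over the free variable $\rho$ is bounded only if $\sum _{n}\lambda _{n}=1$ , over $\xi _{n}\geq 0$ only if $\lambda _{n}\leq D$ , and over $\alpha _{\omega }\geq 0$ only if $\sum _{n}y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}+\gamma \leq 0$ . Under these conditions the minimum equals $\gamma$ , and maximizing it gives the dual program.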

Convergence criterion

Consider a subset of the satisfied constraints in the dual problem. For any finite subset we can solve the linear program and thus satisfy all constraints. If we could prove that of all the constraints which we did not add to the dual problem no single constraint is violated, we would have proven that solving our restricted problem is equivalent to solving the original problem. More formally, let $\gamma ^{*}$ be the optimal objective function value for any restricted instance. Then, we can formulate a search problem for the 'most violated constraint' in the original problem space, namely finding $\omega ^{*}\in \Omega$ as

\omega ^{*}={\underset {\omega \in \Omega }{\textrm {argmax}}}\sum _{n=1}^{\ell }y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}.

That is, we search the space ${\mathcal {H}}$ for a single decision stump $h(\cdot ;\omega ^{*})$ maximizing the left hand side of the dual constraint. If the constraint cannot be violated by any choice of decision stump, none of the corresponding constraints can be active in the original problem and the restricted problem is equivalent.
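For a finite pool of hypotheses, the search for the most violated constraint reduces to an argmax over weighted sums. The sketch below assumes each hypothesis is represented by its list of outputs on the training samples; the names `edge`, `most_violated`, and the example pool are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def edge(outputs: Sequence[int], y: Sequence[int], lam: Sequence[float]) -> float:
    """Left hand side of the dual constraint without gamma: sum_n y_n h(x_n) lam_n."""
    return sum(l * yn * hn for hn, yn, l in zip(outputs, y, lam))


def most_violated(pool: Dict[str, List[int]],
                  y: Sequence[int],
                  lam: Sequence[float],
                  gamma: float,
                  theta: float = 0.0) -> Tuple[str, float, bool]:
    """Return the hypothesis with the largest edge and whether it violates
    the dual constraint  edge + gamma <= theta."""
    best = max(pool, key=lambda name: edge(pool[name], y, lam))
    best_edge = edge(pool[best], y, lam)
    return best, best_edge, best_edge + gamma > theta


y = [1, 1, -1, -1]
lam = [0.25, 0.25, 0.25, 0.25]
pool = {
    "a": [1, 1, -1, -1],   # agrees with every label
    "b": [1, -1, -1, -1],  # one mistake
    "c": [1, 1, 1, 1],     # constant
}
print(most_violated(pool, y, lam, gamma=-0.5))
```

With `gamma=-0.5` hypothesis `"a"` violates its constraint and would be added as a new column; with `gamma=-1.0` no constraint is violated and the restricted problem is already optimal for this pool.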

Penalization constant $D$

The positive value of penalization constant $D$ has to be found using model selection techniques. However, if we choose $D={\frac {1}{\ell \nu }}$ , where $\ell$ is the number of training samples and $0<\nu <1$ , then the new parameter $\nu$ has the following properties.

- $\nu$ is an upper bound on the fraction of training errors; that is, if $k$ denotes the number of misclassified training samples, then ${\frac {k}{\ell }}\leq \nu$ .
- $\nu$ is a lower bound on the fraction of training samples outside or on the margin.
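As a worked example with assumed values $\ell =200$ and $\nu =0.05$ (chosen only for illustration):

```latex
D = \frac{1}{\ell\nu} = \frac{1}{200 \cdot 0.05} = 0.1,
\qquad
k \leq \nu\ell = 10 .
```

The same number appears in the dual: since $0\leq \lambda _{n}\leq D$ and $\sum _{n}\lambda _{n}=1$ , at least $1/D=\nu \ell =10$ training samples must carry non-zero weight.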

Algorithm

Input:
- Training set $X=\{{\boldsymbol {x}}_{1},\dots ,{\boldsymbol {x}}_{\ell }\}$ , ${\boldsymbol {x}}_{i}\in {\mathcal {X}}$
- Training labels $Y=\{y_{1},\dots ,y_{\ell }\}$ , $y_{i}\in \{-1,1\}$
- Convergence threshold $\theta \geq 0$
Output:
- Classification function $f:{\mathcal {X}}\to \{-1,1\}$

Initialization
1. Weights, uniform $\lambda _{n}\leftarrow {\frac {1}{\ell }},\quad n=1,\dots ,\ell$
2. Edge $\gamma \leftarrow 0$
3. Hypothesis count $J\leftarrow 1$
Iterate
1. ${\hat {h}}\leftarrow {\underset {\omega \in \Omega }{\textrm {argmax}}}\sum _{n=1}^{\ell }y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}$
2. if $\sum _{n=1}^{\ell }y_{n}{\hat {h}}({\boldsymbol {x}}_{n})\lambda _{n}+\gamma \leq \theta$ then
  1. break
3. $h_{J}\leftarrow {\hat {h}}$
4. $J\leftarrow J+1$
5. $({\boldsymbol {\lambda }},\gamma )\leftarrow$ solution of the LPBoost dual
6. ${\boldsymbol {\alpha }}\leftarrow$ Lagrange multipliers of the solution to the LPBoost dual problem
$f({\boldsymbol {x}}):={\textrm {sign}}\left(\sum _{j=1}^{J}\alpha _{j}h_{j}({\boldsymbol {x}})\right)$

Note that if the convergence threshold is set to $\theta =0$ , the solution obtained is the globally optimal solution of the above linear program. In practice, $\theta$ is set to a small positive value in order to obtain a good solution quickly.
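The loop above can be sketched with NumPy and SciPy's `linprog`. Several simplifications are assumptions of this sketch rather than part of the algorithm as stated: the hypothesis space is a finite pool given as a matrix of outputs on the training samples, the weights ${\boldsymbol {\alpha }}$ are obtained by solving the restricted primal once at the end instead of reading Lagrange multipliers from the dual solution, and $D\geq 1/\ell$ is required so that the dual constraints $0\leq \lambda _{n}\leq D$ , $\sum _{n}\lambda _{n}=1$ can be met. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def solve_restricted_dual(H, y, D):
    """Restricted LPBoost dual; H[j, n] = h_j(x_n). Returns (lam, gamma)."""
    J, ell = H.shape
    c = np.zeros(ell + 1)
    c[-1] = -1.0  # linprog minimizes, so minimize -gamma
    A_ub = np.hstack([H * y, np.ones((J, 1))])  # sum_n y_n h_j(x_n) lam_n + gamma <= 0
    A_eq = np.hstack([np.ones((1, ell)), np.zeros((1, 1))])  # sum_n lam_n = 1
    bounds = [(0.0, D)] * ell + [(None, None)]  # 0 <= lam_n <= D, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(J), A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    return res.x[:ell], res.x[-1]


def solve_restricted_primal(H, y, D):
    """Restricted LPBoost primal over (alpha, xi, rho). Returns alpha."""
    J, ell = H.shape
    c = np.concatenate([np.zeros(J), np.full(ell, D), [-1.0]])  # -rho + D * sum xi
    # -(sum_j y_n alpha_j h_j(x_n)) - xi_n + rho <= 0 for every sample n
    A_ub = np.hstack([-(H * y).T, -np.eye(ell), np.ones((ell, 1))])
    A_eq = np.hstack([np.ones((1, J)), np.zeros((1, ell + 1))])  # sum_j alpha_j = 1
    bounds = [(0.0, None)] * (J + ell) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(ell), A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    return res.x[:J]


def lpboost(pool, y, D, theta=1e-6, max_iter=100):
    """Column generation over a finite pool; pool[m, n] = output of hypothesis m on x_n."""
    ell = len(y)
    lam = np.full(ell, 1.0 / ell)  # uniform initial sample weights
    gamma = 0.0
    chosen = []
    for _ in range(max_iter):
        edges = pool @ (y * lam)  # sum_n y_n h(x_n) lam_n for every hypothesis
        best = int(np.argmax(edges))
        if edges[best] + gamma <= theta or best in chosen:
            break  # no sufficiently violated dual constraint remains
        chosen.append(best)
        lam, gamma = solve_restricted_dual(pool[chosen], y, D)
    if not chosen:
        return chosen, np.zeros(0)
    return chosen, solve_restricted_primal(pool[chosen], y, D)


y = np.array([1, 1, 1, -1, -1, -1])
pool = np.array([
    [1, 1, 1, 1, 1, 1],
    [1, -1, -1, -1, -1, -1],
    [1, 1, 1, -1, -1, -1],  # separates the two classes perfectly
])
chosen, alpha = lpboost(pool, y, D=0.5)
print(chosen, alpha)
```

On this toy pool the perfectly separating hypothesis is selected in the first iteration, after which no remaining hypothesis violates its dual constraint and the loop stops.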

Realized margin

The actual margin separating the training samples is termed the realized margin and is defined as

\rho ({\boldsymbol {\alpha }}):=\min _{n=1,\dots ,\ell }y_{n}\sum _{\omega \in \Omega }\alpha _{\omega }h({\boldsymbol {x}}_{n};\omega ).

The realized margin can and will usually be negative in the first iterations. For a hypothesis space that permits singling out of any single sample, as is commonly the case, the realized margin will eventually converge to some positive value.
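The definition translates directly into code. The sketch below assumes a finite set of selected hypotheses, each stored as its outputs on the training samples; the function name and the example data are hypothetical.

```python
from typing import Sequence


def realized_margin(alphas: Sequence[float],
                    outputs: Sequence[Sequence[int]],
                    y: Sequence[int]) -> float:
    """min over samples n of y_n * sum_j alpha_j * h_j(x_n),
    where outputs[j][n] = h_j(x_n)."""
    return min(
        yn * sum(a * h[n] for a, h in zip(alphas, outputs))
        for n, yn in enumerate(y)
    )


y = [1, 1, -1, -1]
outputs = [
    [1, 1, -1, -1],   # correct on every sample
    [1, -1, -1, -1],  # wrong on the second sample
]
print(realized_margin([0.0, 1.0], outputs, y))  # negative: one sample is misclassified
print(realized_margin([0.5, 0.5], outputs, y))  # zero: the second sample sits on the boundary
print(realized_margin([1.0, 0.0], outputs, y))  # positive: all samples are separated
```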

Convergence guarantee

The above algorithm is proven to converge. However, in contrast to other boosting formulations, such as AdaBoost and TotalBoost, there are no known convergence bounds for LPBoost. In practice, LPBoost is known to converge quickly, often faster than other formulations.

Base learners

LPBoost is an ensemble learning method and thus does not dictate the choice of base learners, that is, the space of hypotheses ${\mathcal {H}}$ . Demiriz et al. showed that under mild assumptions, any base learner can be used. If the base learners are particularly simple, they are often referred to as decision stumps.

The number of base learners commonly used with boosting in the literature is large. For example, if ${\mathcal {X}}\subseteq {\mathbb {R} }^{n}$ , a base learner could be a linear soft margin support vector machine. Or, even simpler, a stump of the form

h({\boldsymbol {x}};\omega \in \{1,-1\},p\in \{1,\dots ,n\},t\in {\mathbb {R} }):=\left\{{\begin{array}{cl}\omega &{\textrm {if~}}{\boldsymbol {x}}_{p}\leq t\\-\omega &{\textrm {otherwise}}\end{array}}\right..

The above decision stump looks only along a single dimension $p$ of the input space and simply thresholds the respective column of the sample using a constant threshold $t$ . It can then decide in either direction, depending on $\omega$ , for the positive or the negative class.

Given weights for the training samples, constructing the optimal decision stump of the above form simply involves searching along all sample columns and determining $p$ , $t$ and $\omega$ in order to optimize the gain function.
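An exhaustive version of this search can be sketched as follows. The gain function is assumed here to be the weighted edge $\sum _{n}y_{n}h({\boldsymbol {x}}_{n})\lambda _{n}$ from the dual constraint, and candidate thresholds are taken from the observed feature values plus one value below the minimum; both choices, and the function names, are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def stump_predict(x: Sequence[float], omega: int, p: int, t: float) -> int:
    """Decision stump: omega if x[p] <= t, otherwise -omega."""
    return omega if x[p] <= t else -omega


def best_stump(X: List[Sequence[float]],
               y: Sequence[int],
               lam: Sequence[float]) -> Tuple[float, int, int, float]:
    """Return (edge, omega, p, t) maximizing sum_n y_n h(x_n) lam_n."""
    best = None
    for p in range(len(X[0])):
        values = sorted({x[p] for x in X})
        # One threshold below all values, then one at each observed value.
        for t in [values[0] - 1.0] + values:
            edge_pos = sum(l * yn * stump_predict(x, 1, p, t)
                           for x, yn, l in zip(X, y, lam))
            # Flipping omega flips every output, so the edge changes sign.
            for omega, e in ((1, edge_pos), (-1, -edge_pos)):
                if best is None or e > best[0]:
                    best = (e, omega, p, t)
    return best


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
y = [1, 1, -1, -1]
lam = [0.25, 0.25, 0.25, 0.25]
print(best_stump(X, y, lam))
```

This brute-force search recomputes the edge for every threshold; sorting each column once and updating the edge incrementally would be faster, but the simple form is easier to check against the definition.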

References

Linear Programming Boosting via Column Generation, A. Demiriz and K.P. Bennett and J. Shawe-Taylor. Published 2002 in Kluwer Machine Learning 46, pages 225–254.
