LPBoost

Last updated

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

Boosting (machine learning) Method in machine learning

Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant : "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification. In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

Contents

which classifies samples from a space into one of two classes, labelled 1 and -1, respectively. LPBoost is an algorithm to learn such a classification function given a set of training examples with known class labels. LPBoost is a machine learning technique and especially suited for applications of joint classification and feature selection in structured domains.

Machine learning branch of statistics and computer science, which studies algorithms and architectures that learn from observed facts

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

LPBoost overview

As in all boosting classifiers, the final classification function is of the form

where are non-negative weightings for weak classifiers . Each individual weak classifier may be just a little bit better than random, but the resulting linear combination of many weak classifiers can perform very well.

LPBoost constructs by starting with an empty set of weak classifiers. Iteratively, a single weak classifier to add to the set of considered weak classifiers is selected, added and all the weights for the current set of weak classifiers are adjusted. This is repeated until no weak classifiers to add remain.

The property that all classifier weights are adjusted in each iteration is known as totally-corrective property. Early boosting methods, such as AdaBoost do not have this property and converge slower.

AdaBoost boosting algorithm

AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.

Linear program

More generally, let be the possibly infinite set of weak classifiers, also termed hypotheses. One way to write down the problem LPBoost solves is as a linear program with infinitely many variables.

The primal linear program of LPBoost, optimizing over the non-negative weight vector , the non-negative vector of slack variables and the margin is the following.

Note the effects of slack variables : their one-norm is penalized in the objective function by a constant factor , which—if small enough—always leads to a primal feasible linear program.

Here we adopted the notation of a parameter space , such that for a choice the weak classifier is uniquely defined.

When the above linear program was first written down in early publications about boosting methods it was disregarded as intractable due to the large number of variables . Only later it was discovered that such linear programs can indeed be solved efficiently using the classic technique of column generation.

Column Generation for LPBoost

In a linear program a column corresponds to a primal variable. Column generation is a technique to solve large linear programs. It typically works in a restricted problem, dealing only with a subset of variables. By generating primal variables iteratively and on-demand, eventually the original unrestricted problem with all variables is recovered. By cleverly choosing the columns to generate the problem can be solved such that while still guaranteeing the obtained solution to be optimal for the original full problem, only a small fraction of columns has to be created.

LPBoost dual problem

Columns in the primal linear program corresponds to rows in the dual linear program. The equivalent dual linear program of LPBoost is the following linear program.

For linear programs the optimal value of the primal and dual problem are equal. For the above primal and dual problems, the optimal value is equal to the negative 'soft margin'. The soft margin is the size of the margin separating positive from negative training instances minus positive slack variables that carry penalties for margin-violating samples. Thus, the soft margin may be positive although not all samples are linearly separated by the classification function. The latter is called the 'hard margin' or 'realized margin'.

Convergence criterion

Consider a subset of the satisfied constraints in the dual problem. For any finite subset we can solve the linear program and thus satisfy all constraints. If we could prove that of all the constraints which we did not add to the dual problem no single constraint is violated, we would have proven that solving our restricted problem is equivalent to solving the original problem. More formally, let be the optimal objective function value for any restricted instance. Then, we can formulate a search problem for the 'most violated constraint' in the original problem space, namely finding as

That is, we search the space for a single decision stump maximizing the left hand side of the dual constraint. If the constraint cannot be violated by any choice of decision stump, none of the corresponding constraint can be active in the original problem and the restricted problem is equivalent.

Penalization constant

The positive value of penalization constant has to be found using model selection techniques. However, if we choose , where is the number of training samples and , then the new parameter has the following properties.

  • is an upper bound on the fraction of training errors; that is, if denotes the number of misclassified training samples, then .
  • is a lower bound on the fraction of training samples outside or on the margin.

Algorithm

  1. Initialization
    1. Weights, uniform
    2. Edge
    3. Hypothesis count
  2. Iterate
    1. if then
      1. break
    2. solution of the LPBoost dual
    3. Lagrangian multipliers of solution to LPBoost dual problem

Note that if the convergence threshold is set to the solution obtained is the global optimal solution of the above linear program. In practice, is set to a small positive value in order obtain a good solution quickly.

Realized margin

The actual margin separating the training samples is termed the realized margin and is defined as

The realized margin can and will usually be negative in the first iterations. For a hypothesis space that permits singling out of any single sample, as is commonly the case, the realized margin will eventually converge to some positive value.

Convergence guarantee

While the above algorithm is proven to converge, in contrast to other boosting formulations, such as AdaBoost and TotalBoost, there are no known convergence bounds for LPBoost. In practise however, LPBoost is known to converge quickly, often faster than other formulations.

Base learners

LPBoost is an ensemble learning method and thus does not dictate the choice of base learners, the space of hypotheses . Demiriz et al. showed that under mild assumptions, any base learner can be used. If the base learners are particularly simple, they are often referred to as decision stumps .

The number of base learners commonly used with Boosting in the literature is large. For example, if , a base learner could be a linear soft margin support vector machine. Or even more simple, a simple stump of the form

The above decision stumps looks only along a single dimension of the input space and simply thresholds the respective column of the sample using a constant threshold . Then, it can decide in either direction, depending on for a positive or negative class.

Given weights for the training samples, constructing the optimal decision stump of the above form simply involves searching along all sample columns and determining , and in order to optimize the gain function.

Related Research Articles

Support-vector machine set of methods for supervised statistical learning

In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. A SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In physics, angular velocity refers to how fast an object rotates or revolves relative to another point, i.e. how fast the angular position or orientation of an object changes with time. There are two types of angular velocity: orbital angular velocity and spin angular velocity. Spin angular velocity refers to how fast a rigid body rotates with respect to its centre of rotation. Orbital angular velocity refers to how fast a rigid body's centre of rotation revolves about a fixed origin, i.e. the time rate of change of its angular position relative to the origin. In general, angular velocity is measured in angle per unit time, radians per second in SI units, and is usually represented by the symbol omega. By convention, positive angular velocity indicates counter-clockwise rotation, while negative is clockwise.

Moment of inertia moment of inertia

The moment of inertia, otherwise known as the angular mass or rotational inertia, of a rigid body is a quantity that determines the torque needed for a desired angular acceleration about a rotational axis; similar to how mass determines the force needed for a desired acceleration. It depends on the body's mass distribution and the axis chosen, with larger moments requiring more torque to change the body's rotation rate. It is an extensive (additive) property: for a point mass the moment of inertia is just the mass times the square of the perpendicular distance to the rotation axis. The moment of inertia of a rigid composite system is the sum of the moments of inertia of its component subsystems. Its simplest definition is the second moment of mass with respect to distance from an axis. For bodies constrained to rotate in a plane, only their moment of inertia about an axis perpendicular to the plane, a scalar value, matters. For bodies free to rotate in three dimensions, their moments can be described by a symmetric 3 × 3 matrix, with a set of mutually perpendicular principal axes for which this matrix is diagonal and torques around the axes act independently of each other.

In differential geometry, the Lie derivative, named after Sophus Lie by Władysław Ślebodziński, evaluates the change of a tensor field, along the flow defined by another vector field. This change is coordinate invariant and therefore the Lie derivative is defined on any differentiable manifold.

In mathematics, the classical orthogonal polynomials are the most widely used orthogonal polynomials: the Hermite polynomials, Laguerre polynomials, Jacobi polynomials.

The rigid rotor is a mechanical model of rotating systems. An arbitrary rigid rotor is a 3-dimensional rigid object, such as a top. To orient such an object in space requires three angles, known as Euler angles. A special rigid rotor is the linear rotor requiring only two angles to describe, for example of a diatomic molecule. More general molecules are 3-dimensional, such as water, ammonia, or methane.

In probability theory the hypoexponential distribution or the generalized Erlang distribution is a continuous distribution, that has found use in the same fields as the Erlang distribution, such as queueing theory, teletraffic engineering and more generally in stochastic processes. It is called the hypoexponetial distribution as it has a coefficient of variation less than one, compared to the hyper-exponential distribution which has coefficient of variation greater than one and the exponential distribution which has coefficient of variation of one.

In set theory, Silver machines are devices used for bypassing the use of fine structure in proofs of statements holding in L. They were invented by set theorist Jack Silver as a means of proving global square holds in the constructible universe.

In mathematical physics, the almost Mathieu operator arises in the study of the quantum Hall effect. It is given by

The structured support vector machine is a machine learning algorithm that generalizes the Support Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

In mathematics, the Schauder estimates are a collection of results due to Juliusz Schauder concerning the regularity of solutions to linear, uniformly elliptic partial differential equations. The estimates say that when the equation has appropriately smooth terms and appropriately smooth solutions, then the Hölder norm of the solution can be controlled in terms of the Hölder norms for the coefficient and source terms. Since these estimates assume by hypothesis the existence of a solution, they are called a priori estimates.

The Stokes operator, named after George Gabriel Stokes, is an unbounded linear operator used in the theory of partial differential equations, specifically in the fields of fluid dynamics and electromagnetics.

In continuum mechanics, a compatible deformation tensor field in a body is that unique tensor field that is obtained when the body is subjected to a continuous, single-valued, displacement field. Compatibility is the study of the conditions under which such a displacement field can be guaranteed. Compatibility conditions are particular cases of integrability conditions and were first derived for linear elasticity by Barré de Saint-Venant in 1864 and proved rigorously by Beltrami in 1886.

Multivariate stable distribution

The multivariate stable distribution is a multivariate probability distribution that is a multivariate generalisation of the univariate stable distribution. The multivariate stable distribution defines linear relations between stable distribution marginals. In the same way as for the univariate case, the distribution is defined in terms of its characteristic function.

Relativistic angular momentum

In physics, relativistic angular momentum refers to the mathematical formalisms and physical concepts that define angular momentum in special relativity (SR) and general relativity (GR). The relativistic quantity is subtly different from the three-dimensional quantity in classical mechanics.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over

Trochoidal wave An exact solution of the Euler equations for periodic surface gravity waves

In fluid dynamics, a trochoidal wave or Gerstner wave is an exact solution of the Euler equations for periodic surface gravity waves. It describes a progressive wave of permanent form on the surface of an incompressible fluid of infinite depth. The free surface of this wave solution is an inverted (upside-down) trochoid – with sharper crests and flat troughs. This wave solution was discovered by Gerstner in 1802, and rediscovered independently by Rankine in 1863.

Regularized least squares

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

This article summarizes important identities in exterior calculus.

References