CoBoosting

Last updated August 17, 2020

CoBoost is a semi-supervised training algorithm proposed by Collins and Singer in 1999. The original application for the algorithm was the task of Named Entity Classification using very weak learners.^[1] It can be used for performing semi-supervised learning in cases in which there exist redundancy in features.

It may be seen as a combination of co-training and boosting. Each example is available in two views (subsections of the feature set), and boosting is applied iteratively in alternation with each view using predicted labels produced in the alternate view on the previous iteration. CoBoosting is not a valid boosting algorithm in the PAC learning sense.

Motivation

CoBoosting was an attempt by Collins and Singer to improve on previous attempts to leverage redundancy in features for training classifiers in a semi-supervised fashion. CoTraining, a seminal work by Blum and Mitchell, was shown to be a powerful framework for learning classifiers given a small number of seed examples by iteratively inducing rules in a decision list. The advantage of CoBoosting to CoTraining is that it generalizes the CoTraining pattern so that it could be used with any classifier. CoBoosting accomplishes this feat by borrowing concepts from AdaBoost.

In both CoTrain and CoBoost the training and testing example sets must follow two properties. The first is that the feature space of the examples can separated into two feature spaces (or views) such that each view is sufficiently expressive for classification. Formally, there exist two functions $f_{1}(x_{1})$ and $f_{2}(x_{2})$ such that for all examples $x=(x_{1},x_{2})$ , $f_{1}(x_{1})=f_{2}(x_{2})=f(x)$ . While ideal, this constraint is in fact too strong due to noise and other factors, and both algorithms instead seek to maximize the agreement between the two functions. The second property is that the two views must not be highly correlated.

Algorithm

Input: $\{(x_{1,i},x_{2,i})\}_{i=1}^{n}$ , $\{y_{i}\}_{i=1}^{m}$

Initialize: $\forall i,j:g_{j}^{0}({\boldsymbol {x_{i}}})=0$ .

For $t=1,...,T$ and for $j=1,2$ :

Set pseudo-labels:

${\hat {y_{i}}}=\left\{{\begin{array}{ll}y_{i},1\leq i\leq m\\sign(g_{3-j}^{t-1}({\boldsymbol {x_{3-j,i}}})),m<i\leq n\end{array}}\right.$

Set virtual distribution: $D_{t}^{j}(i)={\frac {1}{Z_{t}^{j}}}e^{-{\hat {y_{i}}}g_{j}^{t-1}({\boldsymbol {x_{j,i}}})}$

where $Z_{t}^{j}=\sum _{i=1}^{n}e^{-{\hat {y_{i}}}g_{j}^{t-1}({\boldsymbol {x_{j,i}}})}$

Find the weak hypothesis $h_{t}^{j}$ that minimizes expanded training error.

Choose value for $\alpha _{t}$ that minimizes expanded training error.

Update the value for current strong non-thresholded classifier:

$\forall i:g_{j}^{t}({\boldsymbol {x_{j,i}}})=g_{j}^{t-1}({\boldsymbol {x_{j,i}}})+\alpha _{t}h_{t}^{j}({\boldsymbol {x_{j,i}}})$

The final strong classifier output is

$f({\boldsymbol {x}})=sign\left(\sum _{j=1}^{2}g_{j}^{T}({\boldsymbol {x_{j}}})\right)$

Setting up AdaBoost

CoBoosting builds on the AdaBoost algorithm, which gives CoBoosting its generalization ability since AdaBoost can be used in conjunction with many other learning algorithms. This build up assumes a two class classification task, although it can be adapted to multiple class classification. In the AdaBoost framework, weak classifiers are generated in series as well as a distribution over examples in the training set. Each weak classifier is given a weight and the final strong classifier is defined as the sign of the sum of the weak classifiers weighted by their assigned weight. (See AdaBoost Wikipedia page for notation). In the AdaBoost framework Schapire and Singer have shown that the training error is bounded by the following equation:

${\frac {1}{m}}\sum _{i=1}^{m}e^{\left(-y_{i}\left(\sum _{t=1}^{T}\alpha _{t}h_{t}({\boldsymbol {x_{i}}})\right)\right)}=\prod _{t}Z_{t}$

Where $Z_{t}$ is the normalizing factor for the distribution $D_{t+1}$ . Solving for $Z_{t}$ in the equation for $D_{t}(i)$ we get:

$Z_{t}=\sum _{i:x_{t}\notin x_{i}}D_{t}(i)+\sum _{i:x_{t}\in x_{i}}D_{t}(i)e^{-y_{i}\alpha _{i}h_{t}({\boldsymbol {x_{i}}})}$

Where $x_{t}$ is the feature selected in the current weak hypothesis. Three equations are defined describing the sum of the distributions for in which the current hypothesis has selected either correct or incorrect label. Note that it is possible for the classifier to abstain from selecting a label for an example, in which the label provided is 0. The two labels are selected to be either -1 or 1.

$W_{0}=\sum _{i:h_{t}(x_{i})=0}D_{t}(i)$

$W_{+}=\sum _{i:h_{t}(x_{i})=y_{i}}D_{t}(i)$

$W_{-}=\sum _{i:h_{t}(x_{i})=-y_{i}}D_{t}(i)$

Schapire and Singer have shown that the value $Z_{t}$ can be minimized (and thus the training error) by selecting $\alpha _{t}$ to be as follows:

$\alpha _{t}={\frac {1}{2}}\ln \left({\frac {W_{+}}{W_{-}}}\right)$

Providing confidence values for the current hypothesized classifier based on the number of correctly classified vs. the number of incorrectly classified examples weighted by the distribution over examples. This equation can be smoothed to compensate for cases in which $W_{-}$ is too small. Deriving $Z_{t}$ from this equation we get:

$Z_{t}=W_{0}+2{\sqrt {W_{+}W_{-}}}$

The training error thus is minimized by selecting the weak hypothesis at every iteration that minimizes the previous equation.

AdaBoost with two views

CoBoosting extends this framework in the case where one has a labeled training set (examples from $1...m$ ) and an unlabeled training set (from $m_{1}...n$ ), as well as satisfy the conditions of redundancy in features in the form of $x_{i}=(x_{1,i},x_{2,i})$ . The algorithm trains two classifiers in the same fashion as AdaBoost that agree on the labeled training sets correct labels and maximizes the agreement between the two classifiers on the unlabeled training set. The final classifier is the sign of the sum of the two strong classifiers. The bounded training error on CoBoost is extended as follows, where $Z_{CO}$ is the extension of $Z_{t}$ :

$Z_{CO}=\sum _{i=1}^{m}e^{-y_{i}g_{1}({\boldsymbol {x_{1,i}}})}+\sum _{i=1}^{m}e^{-y_{i}g_{2}({\boldsymbol {x_{2,i}}})}+\sum _{i=m+1}^{n}e^{-f_{2}({\boldsymbol {x_{2,i}}})g_{1}({\boldsymbol {x_{1,i}}})}+\sum _{i=m+1}^{n}e^{-f_{1}({\boldsymbol {x_{1,i}}})g_{2}({\boldsymbol {x_{2,i}}})}$

Where $g_{j}$ is the summation of hypotheses weight by their confidence values for the $j^{th}$ view (j = 1 or 2). $f_{j}$ is the sign of $g_{j}$ . At each iteration of CoBoost both classifiers are updated iteratively. If $g_{j}^{t-1}$ is the strong classifier output for the $j^{th}$ view up to the $t-1$ iteration we can set the pseudo-labels for the jth update to be:

${\hat {y_{i}}}=\left\{{\begin{array}{ll}y_{i}1\leq i\leq m\\sign(g_{3-j}^{t-1}({\boldsymbol {x_{3-j,i}}}))m<i\leq n\end{array}}\right.$

In which $3-j$ selects the other view to the one currently being updated. $Z_{CO}$ is split into two such that $Z_{CO}=Z_{CO}^{1}+Z_{CO}^{2}$ . Where

$Z_{CO}^{j}=\sum _{i=1}^{n}e^{-{\hat {y_{i}}}(g_{j}^{t-1}({\boldsymbol {x_{i}}})+\alpha _{t}^{j}g_{t}^{j}({\boldsymbol {x_{j,i}}}))}$

The distribution over examples for each view $j$ at iteration $t$ is defined as follows:

$D_{t}^{j}(i)={\frac {1}{Z_{t}^{j}}}e^{-{\hat {y_{i}}}g_{j}^{t-1}({\boldsymbol {x_{j,i}}})}$

At which point $Z_{CO}^{j}$ can be rewritten as

$Z_{CO}^{j}=\sum _{i=1}^{n}D_{t}^{j}e^{-{\hat {y_{i}}}\alpha _{t}^{j}g_{t}^{j}({\boldsymbol {x_{j,i}}})}$

Which is identical to the equation in AdaBoost. Thus the same process can be used to update the values of $\alpha _{t}^{j}$ as in AdaBoost using ${\hat {y_{i}}}$ and $D_{t}^{j}$ . By alternating this, the minimization of $Z_{CO}^{1}$ and $Z_{CO}^{2}$ in this fashion $Z_{CO}$ is minimized in a greedy fashion.

Related Research Articles

Taylors theorem Approximation of a function by a truncated power series

In calculus, Taylor's theorem gives an approximation of a k-times differentiable function around a given point by a polynomial of degree k, called the kth-order Taylor polynomial. For a smooth function, the Taylor polynomial is the truncation at the order k of the Taylor series of the function. The first-order Taylor polynomial is the linear approximation of the function, and the second-order Taylor polynomial is often referred to as the quadratic approximation. There are several versions of Taylor's theorem, some giving explicit estimates of the approximation error of the function by its Taylor polynomial.

Moment of inertia Scalar measure of the rotational inertia with respect to a fixed axis of rotation

The moment of inertia, otherwise known as the mass moment of inertia, angular mass or rotational inertia, of a rigid body is a quantity that determines the torque needed for a desired angular acceleration about a rotational axis; similar to how mass determines the force needed for a desired acceleration. It depends on the body's mass distribution and the axis chosen, with larger moments requiring more torque to change the body's rate of rotation.

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

In mechanics and geometry, the 3D rotation group, often denoted SO(3), is the group of all rotations about the origin of three-dimensional Euclidean space $under the operation of composition. By definition, a rotation about the origin is a transformation that preserves the origin, Euclidean distance, and orientation. Every non-trivial rotation is determined by its axis of rotation and its angle of rotation. Composing two rotations results in another rotation; every rotation has a unique inverse rotation; and the identity map satisfies the definition of a rotation. Owing to the above properties, the set of all rotations is a group under composition. Rotations are not commutative, making it a nonabelian group. Moreover, the rotation group has a natural structure as a manifold for which the group operations are smoothly differentiable; so it is in fact a Lie group. It is compact and has dimension 3.$

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

In the calculus of variations, a field of mathematical analysis, the functional derivative relates a change in a functional to a change in a function on which the functional depends.

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

Dirichlet distribution Probability distribution

In probability and statistics, the Dirichlet distribution, often denoted $, is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD) . Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.$

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function. Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate.

AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector $, and an observation drawn from a multinomial distribution with probability vector p and number of trials n . The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.$

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

BrownBoost is a boosting algorithm that may be robust to noisy datasets. BrownBoost is an adaptive version of the boost by majority algorithm. As is true for all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. BrownBoost was introduced by Yoav Freund in 2001.

The Viola–Jones object detection framework is the first object detection framework to provide competitive object detection rates in real-time proposed in 2001 by Paul Viola and Michael Jones. Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box-Cox transformed regressors.

In machine learning, a margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example. For instance, if a linear classifier is used, the distance of an example from the separating hyperplane is the margin of that example.

In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space $on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.$

References

Footnotes

↑ Michael Collins and Yoram Singer, Unsupervised Models for Named Entity Classification. Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 100-110, 1999.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Collins99-1] Michael Collins and Yoram Singer, Unsupervised Models for Named Entity Classification. Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 100-110, 1999.