CoBoost is a semi-supervised training algorithm proposed by Collins and Singer in 1999. [1] The original application for the algorithm was the task of named-entity recognition using very weak learners, but it can be used for performing semi-supervised learning in cases where data features may be redundant. [1]
It may be seen as a combination of co-training and boosting. Each example is available in two views (subsections of the feature set), and boosting is applied iteratively in alternation with each view using predicted labels produced in the alternate view on the previous iteration. CoBoosting is not a valid boosting algorithm in the PAC learning sense.
CoBoosting was an attempt by Collins and Singer to improve on previous attempts to leverage redundancy in features for training classifiers in a semi-supervised fashion. CoTraining, a seminal work by Blum and Mitchell, was shown to be a powerful framework for learning classifiers given a small number of seed examples by iteratively inducing rules in a decision list. The advantage of CoBoosting over CoTraining is that it generalizes the CoTraining pattern so that it can be used with any classifier. CoBoosting accomplishes this by borrowing concepts from AdaBoost.
In both CoTraining and CoBoosting the training and testing example sets must satisfy two properties. The first is that the feature space of the examples can be separated into two feature spaces (or views) such that each view is sufficiently expressive for classification. Formally, there exist two functions $f_1(x_1)$ and $f_2(x_2)$ such that for all examples $x = (x_1, x_2)$, $f_1(x_1) = f_2(x_2) = f(x)$. While ideal, this constraint is in fact too strong due to noise and other factors, and both algorithms instead seek to maximize the agreement between the two functions. The second property is that the two views must not be highly correlated.
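As a concrete illustration of the two-view requirement (not taken from Collins and Singer), the following Python sketch splits a named-entity example's features into a spelling view and a context view; the feature names and the `split_views` helper are purely hypothetical.

```python
# Hypothetical two-view split: each view alone is assumed to be
# expressive enough to classify the mention on its own.
def split_views(features):
    """Partition a feature dict into the two redundant views."""
    view1 = {k: v for k, v in features.items() if k.startswith("spelling_")}
    view2 = {k: v for k, v in features.items() if k.startswith("context_")}
    return view1, view2

example = {
    "spelling_contains_Mr": 1,    # spelling view: the mention string itself
    "spelling_all_caps": 0,
    "context_prev_word=said": 1,  # context view: the surrounding words
}
x1, x2 = split_views(example)
```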
Input: $\{(\boldsymbol{x}_{1,i}, \boldsymbol{x}_{2,i})\}_{i=1}^{n}$, $\{y_i\}_{i=1}^{m}$ (labels are available only for the first $m$ examples)
Initialize: $\forall i, j : g_0^j(\boldsymbol{x}_{j,i}) = 0$.
For $t = 1, \ldots, T$ and for $j = 1, 2$:
Set pseudo-labels:
$$\hat{y}_{i,t} = \begin{cases} y_i, & 1 \le i \le m \\ \operatorname{sign}\left(g_{t-1}^{3-j}(\boldsymbol{x}_{3-j,i})\right), & m < i \le n \end{cases}$$
Set virtual distribution: $D_t^j(i) = \frac{1}{Z_t^j} e^{-\hat{y}_{i,t}\, g_{t-1}^j(\boldsymbol{x}_{j,i})}$
where $Z_t^j = \sum_{i=1}^{n} e^{-\hat{y}_{i,t}\, g_{t-1}^j(\boldsymbol{x}_{j,i})}$
Find the weak hypothesis $h_t^j$ that minimizes the expanded training error.
Choose the value of $\alpha_t$ that minimizes the expanded training error.
Update the current strong non-thresholded classifier:
$$\forall i : g_t^j(\boldsymbol{x}_{j,i}) = g_{t-1}^j(\boldsymbol{x}_{j,i}) + \alpha_t h_t^j(\boldsymbol{x}_{j,i})$$
The final strong classifier output is
$$f(\boldsymbol{x}) = \operatorname{sign}\left(\sum_{j=1}^{2} g_T^j(\boldsymbol{x}_j)\right)$$
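The loop above can be condensed into a short sketch. The Python below is an illustrative implementation, not the authors' code: it assumes binary indicator features, uses single-feature abstaining stumps as the weak hypotheses (as in the named-entity setting), and the function name `coboost`, the smoothing constant `eps`, the number of rounds `T`, and the data layout are all assumptions made for this sketch.

```python
import numpy as np

def coboost(X1, X2, y, n_labeled, T=50, eps=0.1):
    """CoBoost sketch.  X1, X2: (n, d_j) binary feature matrices for the
    two views.  y: length-n array with labels in {-1, +1}; entries past
    n_labeled are placeholders and are never read as true labels."""
    n = X1.shape[0]
    views = [X1, X2]
    g = [np.zeros(n), np.zeros(n)]   # strong non-thresholded scores g^1, g^2
    rules = [[], []]                 # learned (feature, sign, alpha) per view

    for t in range(T):
        for j in (0, 1):
            other = 1 - j
            # pseudo-labels: true labels where available, otherwise the sign
            # of the *other* view's current strong classifier
            y_hat = np.where(np.arange(n) < n_labeled, y, np.sign(g[other]))
            y_hat[y_hat == 0] = 1.0  # arbitrary tie-break for sign(0), a simplification
            # virtual distribution D_t^j(i) proportional to exp(-y_hat_i * g^j)
            D = np.exp(-y_hat * g[j])
            D /= D.sum()
            # choose the (feature, predicted sign) whose abstaining stump
            # minimizes Z_t = W_0 + 2 * sqrt(W_+ * W_-)
            best = None
            for f in range(views[j].shape[1]):
                active = views[j][:, f] == 1
                for s in (-1.0, 1.0):    # stump predicts s when the feature fires
                    w_plus = D[active & (y_hat == s)].sum()
                    w_minus = D[active & (y_hat == -s)].sum()
                    z = (1.0 - w_plus - w_minus) + 2.0 * np.sqrt(w_plus * w_minus)
                    if best is None or z < best[0]:
                        best = (z, f, s, w_plus, w_minus)
            _, f, s, w_plus, w_minus = best
            # smoothed confidence: alpha = 0.5 * ln((W_+ + eps) / (W_- + eps))
            alpha = 0.5 * np.log((w_plus + eps) / (w_minus + eps))
            h = np.where(views[j][:, f] == 1, s, 0.0)   # abstains with 0
            g[j] = g[j] + alpha * h
            rules[j].append((f, s, alpha))
    # final prediction on the training pool: sign(g^1 + g^2)
    return np.sign(g[0] + g[1]), rules
```

The per-feature search here mirrors the $W_+$ / $W_-$ bookkeeping described in the next section; any other weak learner could be substituted, since that is precisely the generality CoBoosting inherits from AdaBoost.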
CoBoosting builds on the AdaBoost algorithm, which gives CoBoosting its generalization ability since AdaBoost can be used in conjunction with many other learning algorithms. This buildup assumes a two-class classification task, although it can be adapted to multi-class classification. In the AdaBoost framework, weak classifiers are generated in series, along with a distribution over examples in the training set. Each weak classifier is given a weight, and the final strong classifier is defined as the sign of the sum of the weak classifiers weighted by their assigned weights. (See the AdaBoost article for notation.) In the AdaBoost framework Schapire and Singer have shown that the training error is bounded by the following quantity:
$$\frac{1}{m}\sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T}\alpha_t h_t(x_i)} = \prod_{t=1}^{T} Z_t$$
Where $Z_t$ is the normalizing factor for the distribution $D_{t+1}$. Solving for $Z_t$ in the equation for $D_{t+1}$ we get:
$$Z_t = \sum_{i : x_t \notin x_i} D_t(i) + \sum_{i : x_t \in x_i} D_t(i)\, e^{-y_i \alpha_t h_t(x_i)}$$
Where $x_t$ is the feature selected in the current weak hypothesis. Three quantities are defined describing the sum of the distribution mass over examples on which the current hypothesis has selected either the correct or the incorrect label. Note that it is possible for the classifier to abstain from selecting a label for an example, in which case the label provided is 0. The two labels are selected to be either -1 or 1.
$$W_0 = \sum_{i : h_t(x_i) = 0} D_t(i), \qquad W_+ = \sum_{i : h_t(x_i) = y_i} D_t(i), \qquad W_- = \sum_{i : h_t(x_i) = -y_i} D_t(i)$$
In terms of these, $Z_t = W_0 + W_+ e^{-\alpha_t} + W_- e^{\alpha_t}$. Schapire and Singer have shown that the value $Z_t$ can be minimized (and thus the training error) by selecting $\alpha_t$ as follows:
$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{W_+}{W_-}\right)$$
This provides a confidence value for the current hypothesized classifier based on the number of correctly versus incorrectly classified examples, weighted by the distribution over examples. This equation can be smoothed to compensate for cases in which $W_-$ is too small. Substituting this $\alpha_t$ back into $Z_t$ we get:
$$Z_t = W_0 + 2\sqrt{W_+ W_-}$$
The training error is thus minimized by selecting, at every iteration, the weak hypothesis that minimizes this last equation.
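As a quick numerical check of the last two identities (the numbers below are illustrative only), suppose the weak hypothesis abstains on 20% of the distribution's mass, is correct on 60%, and is wrong on 20%:

```python
import math

W0, Wp, Wm = 0.2, 0.6, 0.2                     # illustrative W_0, W_+, W_-
alpha = 0.5 * math.log(Wp / Wm)                # ≈ 0.549
Z = W0 + Wp * math.exp(-alpha) + Wm * math.exp(alpha)
# with the minimizing alpha, Z_t collapses to W_0 + 2*sqrt(W_+ * W_-) ≈ 0.893
assert abs(Z - (W0 + 2 * math.sqrt(Wp * Wm))) < 1e-12
```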
CoBoosting extends this framework to the case where one has a labeled training set (examples $1, \ldots, m$) and an unlabeled training set (examples $m+1, \ldots, n$), which also satisfy the feature-redundancy condition in the form $\boldsymbol{x}_i = (\boldsymbol{x}_{1,i}, \boldsymbol{x}_{2,i})$. The algorithm trains two classifiers in the same fashion as AdaBoost that agree with the correct labels on the labeled training set and maximize the agreement between the two classifiers on the unlabeled training set. The final classifier is the sign of the sum of the two strong classifiers. The bounded training error of CoBoost is extended as follows, where $Z_{CO}$ is the extension of $Z_t$:
$$Z_{CO} = \sum_{i=m+1}^{n} e^{-f_2(\boldsymbol{x}_{2,i})\, g_1(\boldsymbol{x}_{1,i})} + \sum_{i=m+1}^{n} e^{-f_1(\boldsymbol{x}_{1,i})\, g_2(\boldsymbol{x}_{2,i})} + \sum_{i=1}^{m} e^{-y_i g_1(\boldsymbol{x}_{1,i})} + \sum_{i=1}^{m} e^{-y_i g_2(\boldsymbol{x}_{2,i})}$$
Where $g_j$ is the sum of the hypotheses for view $j$ weighted by their confidence values ($j = 1$ or $2$), and $f_j$ is the sign of $g_j$. At each iteration of CoBoost both classifiers are updated in turn. If $g_{t-1}^j$ is the strong classifier output for view $j$ up to the $(t-1)$th iteration, we can set the pseudo-labels for the $j$th update to be:
$$\hat{y}_{i,t} = \begin{cases} y_i, & 1 \le i \le m \\ \operatorname{sign}\left(g_{t-1}^{3-j}(\boldsymbol{x}_{3-j,i})\right), & m < i \le n \end{cases}$$
In which $3 - j$ selects the view other than the one currently being updated. $Z_{CO}$ is split into two terms such that $Z_{CO} = Z_{CO}^1 + Z_{CO}^2$, where
$$Z_{CO}^j = \sum_{i=1}^{n} e^{-\hat{y}_{i,t}\left(g_{t-1}^j(\boldsymbol{x}_{j,i}) + \alpha_t h_t^j(\boldsymbol{x}_{j,i})\right)}$$
The distribution over examples for each view $j$ at iteration $t$ is defined as follows:
$$D_t^j(i) = \frac{1}{Z_t^j}\, e^{-\hat{y}_{i,t}\, g_{t-1}^j(\boldsymbol{x}_{j,i})}$$
At which point $Z_{CO}^j$ can be rewritten as
$$Z_{CO}^j = Z_t^j \sum_{i=1}^{n} D_t^j(i)\, e^{-\hat{y}_{i,t}\, \alpha_t h_t^j(\boldsymbol{x}_{j,i})}$$
Which is identical in form to the equation minimized in AdaBoost. Thus the same process can be used to update the value of $\alpha_t$ as in AdaBoost, using $\hat{y}_{i,t}$ in place of $y_i$ and $D_t^j$ in place of $D_t$. By alternating between the two views and minimizing $Z_{CO}^1$ and $Z_{CO}^2$ in this fashion, $Z_{CO}$ is minimized in a greedy fashion.
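For completeness, the rewrite in the last step follows by factoring the exponential and inserting the definition of $D_t^j$ (a one-line verification, not an additional result from the paper):
$$Z_{CO}^j = \sum_{i=1}^{n} e^{-\hat{y}_{i,t} g_{t-1}^j(\boldsymbol{x}_{j,i})}\, e^{-\hat{y}_{i,t} \alpha_t h_t^j(\boldsymbol{x}_{j,i})} = Z_t^j \sum_{i=1}^{n} \frac{e^{-\hat{y}_{i,t} g_{t-1}^j(\boldsymbol{x}_{j,i})}}{Z_t^j}\, e^{-\hat{y}_{i,t} \alpha_t h_t^j(\boldsymbol{x}_{j,i})} = Z_t^j \sum_{i=1}^{n} D_t^j(i)\, e^{-\hat{y}_{i,t} \alpha_t h_t^j(\boldsymbol{x}_{j,i})}.$$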