BrownBoost

BrownBoost is a boosting algorithm that may be robust to noisy datasets. BrownBoost is an adaptive version of the boost by majority algorithm. As is true for all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. BrownBoost was introduced by Yoav Freund in 2001. [1]

Motivation

AdaBoost performs well on a variety of datasets; however, it can be shown that AdaBoost does not perform well on noisy data sets. [2] This is a result of AdaBoost's focus on examples that are repeatedly misclassified. In contrast, BrownBoost effectively "gives up" on examples that are repeatedly misclassified. The core assumption of BrownBoost is that noisy examples will be repeatedly mislabeled by the weak hypotheses and non-noisy examples will be correctly labeled frequently enough to not be "given up on." Thus only noisy examples will be "given up on," whereas non-noisy examples will contribute to the final classifier. In turn, if the final classifier is learned from the non-noisy examples, the generalization error of the final classifier may be much better than if learned from noisy and non-noisy examples.

The user of the algorithm can set the amount of error to be tolerated in the training set. Thus, if the training set is noisy (say 10% of all examples are assumed to be mislabeled), the booster can be told to accept a 10% error rate. Since the noisy examples may be ignored, only the true examples will contribute to the learning process.

Algorithm description

BrownBoost uses a non-convex potential loss function, thus it does not fit into the AdaBoost framework. The non-convex optimization provides a method to avoid overfitting noisy data sets. However, in contrast to boosting algorithms that analytically minimize a convex loss function (e.g. AdaBoost and LogitBoost), BrownBoost solves a system of two equations and two unknowns using standard numerical methods.

The only parameter of BrownBoost (c in the algorithm) is the "time" the algorithm runs. The theory of BrownBoost states that each hypothesis takes a variable amount of time (t in the algorithm) which is directly related to the weight given to the hypothesis, α. The time parameter in BrownBoost is analogous to the number of iterations, T, in AdaBoost.

A larger value of c means that BrownBoost will treat the data as if it were less noisy and therefore will give up on fewer examples. Conversely, a smaller value of c means that BrownBoost will treat the data as more noisy and give up on more examples.

During each iteration of the algorithm, a hypothesis is selected with some advantage over random guessing. The weight of this hypothesis, α, and the "amount of time passed" during the iteration, t, are simultaneously solved in a system of two non-linear equations (1. the hypothesis must be uncorrelated with the example weights, and 2. the potential must be held constant) with two unknowns (the weight of the hypothesis α and the time passed t). This system can be solved by bisection (as implemented in the JBoost software package) or by Newton's method (as described in the original paper by Freund). Once these equations are solved, the margins of each example (r_i(x_j) in the algorithm) and the amount of time remaining, s, are updated appropriately. This process is repeated until there is no time remaining.
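For illustration, the per-iteration solve can be written down directly from the two conditions above. The sketch below is not taken from Freund's paper or from JBoost: the function name is illustrative, and it hands the system to SciPy's general-purpose root finder scipy.optimize.fsolve where the original descriptions use bisection or Newton's method. The weight and potential forms follow the algorithm definition given later in this article.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import erf

def solve_iteration(r, a, s, c):
    """Jointly solve for (alpha, t) in one BrownBoost iteration.

    r : current margins r_i(x_j), one per example
    a : h_i(x_j) * y_j, i.e. +1 where the weak hypothesis is right, -1 where it is wrong
    s : time remaining
    c : BrownBoost's total "time" parameter
    """
    phi = lambda z: 1.0 - erf(z / np.sqrt(c))           # potential of a single point

    def equations(v):
        alpha, t = v
        z = r + alpha * a + s - t                       # margin plus remaining time after the step
        eq1 = np.sum(a * np.exp(-z ** 2 / c))           # 1. hypothesis uncorrelated with new weights
        eq2 = np.sum(phi(z) - phi(r + s))               # 2. total potential held constant
        return [eq1, eq2]

    alpha, t = fsolve(equations, x0=[0.1, 0.1])
    return alpha, t

# Toy call: four examples with zero margins; the hypothesis gets three of them right.
r = np.zeros(4)
a = np.array([1.0, 1.0, 1.0, -1.0])
print(solve_iteration(r, a, s=1.0, c=1.0))
```

Here a_j = h_i(x_j) y_j is +1 where the weak hypothesis is correct and −1 where it errs, so the first equation is exactly the decorrelation condition and the second keeps the total potential fixed across the iteration.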

The initial potential is defined to be (1/m) Σ_j (1 − erf(√c)) = 1 − erf(√c). Since a constraint of each iteration is that the potential be held constant, the final potential is (1/m) Σ_j (1 − erf(r(x_j)/√c)) = 1 − erf(√c), where r(x_j) is the final margin of example x_j. Thus the final error is likely to be near 1 − erf(√c). However, the final potential function is not the 0-1 loss error function. For the final error to be exactly 1 − erf(√c), the variance of the loss function would have to decrease linearly with time so as to form the 0-1 loss at the end of the boosting iterations. This is not yet discussed in the literature and is not part of the definition of the algorithm below.
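Because the final error should land near 1 − erf(√c), a tolerated error rate can be converted into a value of c by inverting that relation. A minimal sketch, assuming scipy.special.erfinv (the helper name is illustrative):

```python
from scipy.special import erfinv

def c_for_target_error(eps):
    """Invert eps = 1 - erf(sqrt(c)) to get the time parameter c
    for a tolerated training error rate eps (illustrative helper)."""
    return float(erfinv(1.0 - eps) ** 2)

print(c_for_target_error(0.10))   # ~1.35: accept roughly 10% of the examples as noise
print(c_for_target_error(0.01))   # ~3.32: treat the data as nearly noise-free
```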

The final classifier is a linear combination of weak hypotheses and is evaluated in the same manner as most other boosting algorithms.
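A minimal evaluation sketch (the names and the stump hypotheses are illustrative) showing the usual weighted vote H(x) = sign(Σ_i α_i h_i(x)):

```python
import numpy as np

def predict(x, hypotheses, alphas):
    """Evaluate H(x) = sign(sum_i alpha_i * h_i(x)), the usual boosted vote."""
    return np.sign(sum(alpha * h(x) for h, alpha in zip(hypotheses, alphas)))

# e.g. two threshold stumps on a single feature, with illustrative weights
hs = [lambda x: np.sign(x - 0.5), lambda x: np.sign(x - 1.5)]
print(predict(np.array([0.0, 1.0, 2.0]), hs, alphas=[0.7, 0.4]))  # -> [-1.  1.  1.]
```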

BrownBoost learning algorithm definition

Input: m training examples (x_1, y_1), …, (x_m, y_m) where x_j ∈ X, y_j ∈ Y = {−1, +1}, and the time parameter c.

Initialise:

  - s = c. (The value of s is the amount of time remaining in the game.)
  - r_i(x_j) = 0 for all j. (The value of r_i(x_j) is the margin at iteration i for example x_j.)

While s > 0:

  - Set the weight of each example: W_i(x_j) = exp( −(r_i(x_j) + s)² / c ).
  - Find a classifier h_i : X → {−1, +1} such that Σ_j W_i(x_j) h_i(x_j) y_j > 0.
  - Find values α and t that satisfy the equation Σ_j h_i(x_j) y_j exp( −(r_i(x_j) + α h_i(x_j) y_j + s − t)² / c ) = 0, i.e. the weak hypothesis is uncorrelated with the updated example weights, similar to the condition set forth by Schapire and Singer. [3] This update is subject to the constraint Σ_j ( Φ(r_i(x_j) + α h_i(x_j) y_j + s − t) − Φ(r_i(x_j) + s) ) = 0, where Φ(z) = 1 − erf(z/√c) is the potential of a point.
  - Update the margins for each example: r_{i+1}(x_j) = r_i(x_j) + α h_i(x_j) y_j.
  - Update the time remaining: s = s − t.

Output: H(x) = sign( Σ_i α_i h_i(x) ).
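The definition above can be turned into a compact end-to-end sketch. The code below is an illustration rather than a reference implementation: the decision-stump weak learner, the fsolve-based solve for (α, t), the round cap, and the toy dataset are all choices made here, and the numerical solve can fail when a hypothesis has a very large edge, a case a careful implementation (such as JBoost or Freund's Newton's-method variant) treats specially.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import erf

def best_stump(X, y, w):
    """Weighted decision stump; this weak learner is an assumption of the sketch."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] > thr, pol, -pol)
                corr = np.sum(w * pred * y)             # weighted correlation with the labels
                if best is None or corr > best[0]:
                    best = (corr, f, thr, pol)
    _, f, thr, pol = best
    return lambda X: np.where(X[:, f] > thr, pol, -pol)

def brownboost(X, y, c, max_rounds=30):
    """Compact BrownBoost loop following the definition above (a sketch, not JBoost)."""
    r = np.zeros(len(y))                                # margins r_i(x_j)
    s = c                                               # time remaining
    phi = lambda z: 1.0 - erf(z / np.sqrt(c))
    hypotheses, alphas = [], []
    for _ in range(max_rounds):
        if s <= 0:
            break
        w = np.exp(-((r + s) ** 2) / c)                 # example weights W_i(x_j)
        h = best_stump(X, y, w)
        a = h(X) * y                                    # +1 where h is right, -1 where wrong

        def equations(v):
            alpha, t = v
            z = r + alpha * a + s - t
            return [np.sum(a * np.exp(-z ** 2 / c)),    # hypothesis uncorrelated with new weights
                    np.sum(phi(z) - phi(r + s))]        # total potential held constant

        alpha, t = fsolve(equations, x0=[0.2, 0.2])
        r = r + alpha * a                               # update margins
        s = s - t                                       # spend the time used this round
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

# Toy run: a 2-d problem where single-feature stumps are genuinely weak,
# plus a few deliberately flipped labels playing the role of noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
y[:3] = -y[:3]                                          # mislabel three examples on purpose
hs, als = brownboost(X, y, c=1.3)
H = np.sign(sum(al * h(X) for h, al in zip(hs, als)))
print("training error:", np.mean(H != y))
```

On the toy data, the flipped labels are the examples the booster is allowed to give up on; with c chosen as in the earlier sketch, a training error near 1 − erf(√c) is the expected outcome rather than a failure.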

Empirical results

In preliminary experimental results on noisy datasets, BrownBoost achieved lower generalization error than AdaBoost, while LogitBoost performed about as well as BrownBoost. [4] An implementation of BrownBoost can be found in the open source software JBoost.

References

  1. Yoav Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, June 2001.
  2. Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–158.
  3. Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297–336, 1999.
  4. Ross A. McDonald, David J. Hand, Idris A. Eckley. An Empirical Comparison of Three Boosting Algorithms on Real Data Sets with Artificial Class Noise. Multiple Classifier Systems, Lecture Notes in Computer Science, pages 35–44, 2003.