Boosting (machine learning)

Last updated

In machine learning, boosting is an ensemble meta-algorithm for primarily reducing bias, and also variance [1] in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. [2] Boosting is based on the question posed by Kearns and Valiant (1988, 1989): [3] [4] "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

Contents

Robert Schapire's affirmative answer in a 1990 paper [5] to the question of Kearns and Valiant has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting. [6]

When first introduced, the hypothesis boosting problem simply referred to the process of turning a weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm […] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]." [3] Algorithms that achieve hypothesis boosting quickly became simply known as "boosting". Freund and Schapire's arcing (Adapt[at]ive Resampling and Combining), [7] as a general technique, is more or less synonymous with boosting. [8]

Boosting algorithms

While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy. After a weak learner is added, the data weights are readjusted, known as "re-weighting". Misclassified input data gain a higher weight and examples that are classified correctly lose weight. [note 1] Thus, future weak learners focus more on the examples that previous weak learners misclassified.

An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and weighted dataset Ensemble Boosting.svg
An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and weighted dataset

There are many boosting algorithms. The original ones, proposed by Robert Schapire (a recursive majority gate formulation), [5] and Yoav Freund (boost by majority), [9] were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize.

Only algorithms that are provable boosting algorithms in the probably approximately correct learning formulation can accurately be called boosting algorithms. Other algorithms that are similar in spirit[ clarification needed ] to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms. [9]

The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses. [10] There are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, and others. Many boosting algorithms fit into the AnyBoost framework, [9] which shows that boosting performs gradient descent in a function space using a convex cost function.

Object categorization in computer vision

Given images containing various known objects in the world, a classifier can be learned from them to automatically classify the objects in future images. Simple classifiers built based on some image feature of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization.[ citation needed ]

Problem of object categorization

Object categorization is a typical task of computer vision that involves determining whether or not an image contains some specific category of object. The idea is closely related with recognition, identification, and detection. Appearance based object categorization typically contains feature extraction, learning a classifier, and applying the classifier to new examples. There are many ways to represent a category of objects, e.g. from shape analysis, bag of words models, or local descriptors such as SIFT, etc. Examples of supervised classifiers are Naive Bayes classifiers, support vector machines, mixtures of Gaussians, and neural networks. However, research[ which? ] has shown that object categories and their locations in images can be discovered in an unsupervised manner as well. [11]

Status quo for object categorization

The recognition of object categories in images is a challenging problem in computer vision, especially when the number of categories is large. This is due to high intra class variability and the need for generalization across variations of objects within the same category. Objects within one category may look quite different. Even the same object may appear unalike under different viewpoint, scale, and illumination. Background clutter and partial occlusion add difficulties to recognition as well. [12] Humans are able to recognize thousands of object types, whereas most of the existing object recognition systems are trained to recognize only a few,[ quantify ] e.g. human faces, cars, simple objects, etc. [13] [ needs update? ] Research has been very active on dealing with more categories and enabling incremental additions of new categories, and although the general problem remains unsolved, several multi-category objects detectors (for up to hundreds or thousands of categories [14] ) have been developed. One means is by feature sharing and boosting.

Boosting for binary categorization

AdaBoost can be used for face detection as an example of binary categorization. The two categories are faces versus background. The general algorithm is as follows:

  1. Form a large set of simple features
  2. Initialize weights for training images
  3. For T rounds
    1. Normalize the weights
    2. For available features from the set, train a classifier using a single feature and evaluate the training error
    3. Choose the classifier with the lowest error
    4. Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly
  4. Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small)

After boosting, a classifier constructed from 200 features could yield a 95% detection rate under a false positive rate. [15]

Another application of boosting for binary categorization is a system that detects pedestrians using patterns of motion and appearance. [16] This work is the first to combine both motion information and appearance information as features to detect a walking person. It takes a similar approach to the Viola-Jones object detection framework.

Boosting for multi-class categorization

Compared with binary categorization, multi-class categorization looks for common features that can be shared across the categories at the same time. They turn to be more generic edge like features. During learning, the detectors for each category can be trained jointly. Compared with training separately, it generalizes better, needs less training data, and requires fewer features to achieve the same performance.

The main flow of the algorithm is similar to the binary case. What is different is that a measure of the joint training error shall be defined in advance. During each iteration the algorithm chooses a classifier of a single feature (features that can be shared by more categories shall be encouraged). This can be done via converting multi-class classification into a binary one (a set of categories versus the rest), [17] or by introducing a penalty error from the categories that do not have the feature of the classifier. [18]

In the paper "Sharing visual features for multiclass and multiview object detection", A. Torralba et al. used GentleBoost for boosting and showed that when training data is limited, learning via sharing features does a much better job than no sharing, given same boosting rounds. Also, for a given performance level, the total number of features required (and therefore the run time cost of the classifier) for the feature sharing detectors, is observed to scale approximately logarithmically with the number of class, i.e., slower than linear growth in the non-sharing case. Similar results are shown in the paper "Incremental learning of object detectors using a visual shape alphabet", yet the authors used AdaBoost for boosting.

Convex vs. non-convex boosting algorithms

Boosting algorithms can be based on convex or non-convex optimization algorithms. Convex algorithms, such as AdaBoost and LogitBoost, can be "defeated" by random noise such that they can't learn basic and learnable combinations of weak hypotheses. [19] [20] This limitation was pointed out by Long & Servedio in 2008. However, by 2009, multiple authors demonstrated that boosting algorithms based on non-convex optimization, such as BrownBoost, can learn from noisy datasets and can specifically learn the underlying classifier of the Long–Servedio dataset.

See also

Implementations

Notes

  1. Some boosting-based classification algorithms actually decrease the weight of repeatedly misclassified examples; for example boost by majority and BrownBoost.

Related Research Articles

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals on the real line.

Meta learning is a subfield of machine learning where automatic learning algorithms are applied to metadata about machine learning experiments. As of 2017, the term had not found a standard interpretation, however the main goal is to use such metadata to understand how automatic learning can become flexible in solving learning problems, hence to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself, hence the alternative term learning to learn.

<span class="mw-page-title-main">Yoav Freund</span> American computer scientist

Yoav Freund is an Israeli professor of computer science at the University of California San Diego who mainly works on machine learning, probability theory and related fields and applications.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

<span class="mw-page-title-main">Decision stump</span> Boolean classifier from one decision

A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node which is immediately connected to the terminal nodes. A decision stump makes a prediction based on the value of just a single input feature. Sometimes they are also called 1-rules.

BrownBoost is a boosting algorithm that may be robust to noisy datasets. BrownBoost is an adaptive version of the boost by majority algorithm. As is true for all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. BrownBoost was introduced by Yoav Freund in 2001.

Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector.

Robert Elias Schapire is an American computer scientist, former David M. Siegel '83 Professor in the computer science department at Princeton University, and has recently moved to Microsoft Research. His primary specialty is theoretical and applied machine learning.

The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones. It was motivated primarily by the problem of face detection, although it can be adapted to the detection of other object classes.

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

In machine learning, a margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example. For instance, if a linear classifier is used, the distance of an example from the separating hyperplane is the margin of that example.

In machine learning and computational learning theory, LogitBoost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani.

Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output from a given classifier as additional information for the next classifier in the cascade. Unlike voting or stacking ensembles, which are multiexpert systems, cascading is a multistage one.

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

CoBoost is a semi-supervised training algorithm proposed by Collins and Singer in 1999. The original application for the algorithm was the task of Named Entity Classification using very weak learners. It can be used for performing semi-supervised learning in cases in which there exist redundancy in features.

In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The algorithm was invented in 1964, making it the first kernel classification learner.

Michael Justin Kearns is an American computer scientist, professor and National Center Chair at the University of Pennsylvania, the founding director of Penn's Singh Program in Networked & Social Systems Engineering (NETS), the founding director of Warren Center for Network and Data Sciences, and also holds secondary appointments in Penn's Wharton School and department of Economics. He is a leading researcher in computational learning theory and algorithmic game theory, and interested in machine learning, artificial intelligence, computational finance, algorithmic trading, computational social science and social networks. He previously led the Advisory and Research function in Morgan Stanley's Artificial Intelligence Center of Excellence team, and is currently an Amazon Scholar within Amazon Web Services.

The multiplicative weights update method is an algorithmic technique most commonly used for decision making and prediction, and also widely deployed in game theory and algorithm design. The simplest use case is the problem of prediction from expert advice, in which a decision maker needs to iteratively decide on an expert whose advice to follow. The method assigns initial weights to the experts, and updates these weights multiplicatively and iteratively according to the feedback of how well an expert performed: reducing it in case of poor performance, and increasing it otherwise. It was discovered repeatedly in very diverse fields such as machine learning, optimization, theoretical computer science, and game theory.

References

  1. Leo Breiman (1996). "BIAS, VARIANCE, AND ARCING CLASSIFIERS" (PDF). TECHNICAL REPORT. Archived from the original (PDF) on 2015-01-19. Retrieved 19 January 2015. Arcing [Boosting] is more successful than bagging in variance reduction
  2. Zhou Zhi-Hua (2012). Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC. p. 23. ISBN   978-1439830031. The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners
  3. 1 2 Michael Kearns(1988); Thoughts on Hypothesis Boosting, Unpublished manuscript (Machine Learning class project, December 1988)
  4. Michael Kearns; Leslie Valiant (1989). "Crytographic limitations on learning Boolean formulae and finite automata". Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89. Vol. 21. ACM. pp. 433–444. doi:10.1145/73007.73049. ISBN   978-0897913072. S2CID   536357.
  5. 1 2 Schapire, Robert E. (1990). "The Strength of Weak Learnability" (PDF). Machine Learning. 5 (2): 197–227. CiteSeerX   10.1.1.20.723 . doi:10.1007/bf00116037. S2CID   53304535. Archived from the original (PDF) on 2012-10-10. Retrieved 2012-08-23.
  6. Leo Breiman (1998). "Arcing classifier (with discussion and a rejoinder by the author)". Ann. Stat. 26 (3): 801–849. doi: 10.1214/aos/1024691079 . Schapire (1990) proved that boosting is possible. (Page 823)
  7. Yoav Freund and Robert E. Schapire (1997); A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences, 55(1):119-139
  8. Leo Breiman (1998); Arcing Classifier (with Discussion and a Rejoinder by the Author), Annals of Statistics, vol. 26, no. 3, pp. 801-849: "The concept of weak learning was introduced by Kearns and Valiant (1988, 1989), who left open the question of whether weak and strong learnability are equivalent. The question was termed the boosting problem since [a solution must] boost the low accuracy of a weak learner to the high accuracy of a strong learner. Schapire (1990) proved that boosting is possible. A boosting algorithm is a method that takes a weak learner and converts it into a strong learner. Freund and Schapire (1997) proved that an algorithm similar to arc-fs is boosting.
  9. 1 2 3 Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000); Boosting Algorithms as Gradient Descent, in S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12, pp. 512-518, MIT Press
  10. Emer, Eric. "Boosting (AdaBoost algorithm)" (PDF). MIT. Archived (PDF) from the original on 2022-10-09. Retrieved 2018-10-10.
  11. Sivic, Russell, Efros, Freeman & Zisserman, "Discovering objects and their location in images", ICCV 2005
  12. A. Opelt, A. Pinz, et al., "Generic Object Recognition with Boosting", IEEE Transactions on PAMI 2006
  13. M. Marszalek, "Semantic Hierarchies for Visual Object Recognition", 2007
  14. "Large Scale Visual Recognition Challenge". December 2017.
  15. P. Viola, M. Jones, "Robust Real-time Object Detection", 2001
  16. Viola, P.; Jones, M.; Snow, D. (2003). Detecting Pedestrians Using Patterns of Motion and Appearance (PDF). ICCV. Archived (PDF) from the original on 2022-10-09.
  17. A. Torralba, K. P. Murphy, et al., "Sharing visual features for multiclass and multiview object detection", IEEE Transactions on PAMI 2006
  18. A. Opelt, et al., "Incremental learning of object detectors using a visual shape alphabet", CVPR 2006
  19. P. Long and R. Servedio. 25th International Conference on Machine Learning (ICML), 2008, pp. 608--615.
  20. Long, Philip M.; Servedio, Rocco A. (March 2010). "Random classification noise defeats all convex potential boosters" (PDF). Machine Learning. 78 (3): 287–304. doi: 10.1007/s10994-009-5165-z . S2CID   53861. Archived (PDF) from the original on 2022-10-09. Retrieved 2015-11-17.

Further reading