Cascading classifiers

Last updated December 09, 2022

Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output from a given classifier as additional information for the next classifier in the cascade. Unlike voting or stacking ensembles, which are multiexpert systems, cascading is a multistage one.

Cascading classifiers are trained with several hundred "positive" sample views of a particular object and arbitrary "negative" images of the same size. After the classifier is trained it can be applied to a region of an image and detect the object in question. To search for the object in the entire frame, the search window can be moved across the image and check every location with the classifier. This process is most commonly used in image processing for object detection and tracking, primarily facial detection and recognition.

The first cascading classifier was the face detector of Viola and Jones (2001). The requirement for this classifier was to be fast in order to be implemented on low-power CPUs, such as cameras and phones.

Algorithm properties

Scaling and rotations

It can be seen from this description that the classifier will not accept faces that are upside down (the eyebrows are not in a correct position) or the side of the face (the nose is no longer in the center, and shadows on the side of the nose might be missing). Separate cascade classifiers have to be trained for every rotation that is not in the image plane (side of face) and will have to be retrained or run on rotated features for every rotation that is in the image plane (face upside down or tilted to the side). Scaling is not a problem, since the features can be scaled (centerpixel, leftpixels and rightpixels have a dimension only relative to the rectangle examined). In recent cascades, pixel value from some part of a rectangle compared to another have been replaced with Haar wavelets.

Stage properties

To have good overall performance, the following criteria must be met:

Each stage must validate all faces, and can produce many false positives. For example, if stage 1 were to mark as 'does not contain a face' 20% of rectangles containing a face (false negative rate=20%), then the total performance of the chain cannot be higher than 80% true positive, whatever the next stages are, since 20% of faces have already been rejected.
This suggests that a good stage must have 100% true positive and for example 40% false positive, that is accept all rectangles containing faces and erroneously mark many rectangles as potentially containing a face, to be eliminated by later stages. For a first stage, 100% true positive and 40% false positive still gives a lot of false negative, if only 1 in a 1000 rectangles in an image contain a face, there will still be 400 to 1 false possible faces after the first stage.
If the first stage is very fast (a few operations), we have eliminated 60% of rectangles not containing a face very quickly.

The training procedure for one stage is therefore to have many weak learners (simple pixel difference operators), train them as a group (raise their weight if they give correct result), but be mindful of having only a few active weak learners so the computation time remains low.

The first detector of Viola & Jones had 38 stages, with 1 feature in the first stage, then 10, 25, 25, 50 in the next five stages, for a total of 6000 features. The first stages remove unwanted rectangles rapidly to avoid paying the computational costs of the next stages, so that computational time is spent analyzing deeply the part of the image that have a high probability of containing the object.

Cascade training

Cascades are usually done through cost-aware ADAboost. The sensitivity threshold (0.8 in our example) can be adjusted so that there is close to 100% true positives and some false positives. The procedure can then be started again for stage 2, until the desired accuracy/computation time is reached.

After the initial algorithm, it was understood that training the cascade as a whole can be optimized, to achieve a desired true detection rate with minimal complexity. Examples of such algorithms are RCBoost, ECBoost or RCECBoost. In their most basic versions, they can be understood as choosing, at each step, between adding a stage or adding a weak learner to a previous stage, whichever is less costly, until the desired accuracy has been reached. Every stage of the classifier cannot have a detection rate (sensitivity) below the desired rate, so this is a constrained optimization problem. To be precise, the total sensitivity will be the product of stage sensitivities.

Cascade classifiers are available in OpenCV, with pre-trained cascades for frontal faces and upper body. Training a new cascade in OpenCV is also possible with either haar_training or train_cascades methods. This can be used for rapid object detection of more specific targets, including non-human objects with Haar-like features. The process requires two sets of samples: negative and positive, where the negative samples correspond to arbitrary non-object images. The time constraint in training a cascade classifier can be circumvented using cloud-computing methods.

Cascading classifiers in statistics

The term is also used in statistics to describe a model that is staged. For example, a classifier (for example k-means), takes a vector of features (decision variables) and outputs for each possible classification result the probability that the vector belongs to the class. This is usually used to take a decision (classify into the class with highest probability), but cascading classifiers use this output as the input to another model (another stage). This is particularly useful for models that have highly combinatorial or counting rules (for example, class1 if exactly two features are negative, class2 otherwise), which cannot be fitted without looking at all the interaction terms. Having cascading classifiers enables the successive stage to gradually approximate the combinatorial nature of the classification, or to add interaction terms in classification algorithms that cannot express them in one stage.

As a simple example, if we try to match the rule (class1 if exactly 2 features out of 3 are negative, class2 otherwise), a decision tree would be:

feature 1 negative
- feature 2 negative
  - feature 3 negative -> class2
  - feature 3 positive -> class1
- feature 2 positive
  - feature 3 negative -> class1
  - feature 3 positive -> class2
feature 1 positive
- feature 2 negative
  - feature 3 negative -> class1
  - feature 3 positive -> class2
- feature 2 positive
  - feature 3 negative -> class2
  - feature 3 positive -> class2

The tree has all the combinations of possible leaves to express the full ruleset, whereas (feature1 positive, feature2 negative) and (feature1 negative, feature2 positive) should actually join to the same rule. This leads to a tree with too few samples on the leaves. A two-stage algorithm can effectively merge these two cases by giving a medium-high probability to class1 if feature1 or (exclusive) feature2 is negative. The second classifier can pick up this higher probability and make a decision on the sign of feature3.

In a bias-variance decomposition, cascaded models are usually seen as lowering bias while raising variance.

Related Research Articles

Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning algorithms is learning a function that maps feature vectors (inputs) to labels (output), based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

In machine learning, boosting is an ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant : "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification. In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. Some implementations of machine learning use data and neural networks in a way that mimics the working of a biological brain. In its application across business problems, machine learning is also referred to as predictive analytics.

<span class="mw-page-title-main">Receiver operating characteristic</span> Diagnostic plot

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of military radar receivers starting in 1941, which led to its name.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

Given a population whose members each belong to one of a number of different sets or classes, a classification rule or classifier is a procedure by which the elements of the population set are each predicted to belong to one of the classes. A perfect classification is one for which every element in the population is assigned to the class it really belongs to. An imperfect classification is one in which some errors appear, and then statistical analysis must be applied to analyse the classification.

Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. Individuals for which the condition is satisfied are considered "positive" and those for which it is not are considered "negative".

Computer-aided detection (CADe), also called computer-aided diagnosis (CADx), are systems that assist doctors in the interpretation of medical images. Imaging techniques in X-ray, MRI, Endoscopy, and ultrasound diagnostics yield a great deal of information that the radiologist or other medical professional has to analyze and evaluate comprehensively in a short time. CAD systems process digital images or videos for typical appearances and to highlight conspicuous sections, such as possible diseases, in order to offer input to support a decision taken by the professional.

Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector.

In computer vision, speeded up robust features (SURF) is a patented local feature detector and descriptor. It can be used for tasks such as object recognition, image registration, classification, or 3D reconstruction. It is partly inspired by the scale-invariant feature transform (SIFT) descriptor. The standard version of SURF is several times faster than SIFT and claimed by its authors to be more robust against different image transformations than SIFT.

<span class="mw-page-title-main">Precision and recall</span> Pattern recognition performance metrics

In pattern recognition, information retrieval, object detection and classification, precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones. It was motivated primarily by the problem of face detection, although it can be adapted to the detection of other object classes.

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.

<span class="mw-page-title-main">Object detection</span>

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Geometric feature learning is a technique combining machine learning and computer vision to solve visual tasks. The main goal of this method is to find a set of representative features of geometric form to represent an object by collecting geometric features from images and learning them using efficient machine learning methods. Humans solve visual tasks and can give fast response to the environment by extracting perceptual information from what they see. Researchers simulate humans' ability of recognizing objects to solve computer vision problems. For example, M. Mata et al.(2002) applied feature learning techniques to the mobile robot navigation tasks in order to avoid obstacles. They used genetic algorithms for learning features and recognizing objects (figures). Geometric feature learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually some extracting features of images. Through learning, some hypothesis of the next action are given and according to the probability of each hypothesis give a most probable action. This technique is widely used in the area of artificial intelligence.

Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine learning models. Decisions made by computers after a machine-learning process may be considered unfair if they were based on variables considered sensitive. Examples of these kinds of variable include gender, ethnicity, sexual orientation, disability and more. As it is the case with many ethical concepts, definitions of fairness and bias are always controversial. In general, fairness and bias are considered relevant when the decision process impacts people's lives. In machine learning, the problem of algorithmic bias is well known and well studied. Outcomes may be skewed by a range of factors and thus might be considered unfair with respect to certain groups or individuals. An example would be the way social media sites deliver personalized news to consumers.

References

Sources

Gama, J.; Brazdil, P. (2000). "Cascade Generalization". Machine Learning. 41 (3): 315–343. CiteSeerX 10.1.1.46.635 . doi:10.1023/a:1007652114878. S2CID 36907021.
Minguillón, J. (2002). On Cascading Small Decision Trees (PhD thesis). Universitat Autònoma de Barcelona.
Zhao, H.; Ram, S. (2004). "Constrained Cascade Generalization of Decision Trees". IEEE Transactions on Knowledge and Data Engineering. 16 (6): 727–739. CiteSeerX 10.1.1.199.2077 . doi:10.1109/tkde.2004.3. S2CID 8937272.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.