Multiclass classification

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification).

While many classification algorithms (notably multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.

Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance.

General strategies

The existing multi-class classification techniques can be categorised into transformation to binary, extension from binary, and hierarchical classification.

Transformation to binary

This section discusses strategies for reducing the problem of multiclass classification to multiple binary classification problems. These strategies fall into two groups: one-vs.-rest and one-vs.-one. Techniques that reduce the multi-class problem to multiple binary problems are also called problem transformation techniques.

One-vs.-rest

The one-vs.-rest [2] :182,338 (OvR; also one-vs.-all, OvA, or one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued score for their decisions (see also scoring rule), rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample. [2] :182 [note 1]

In pseudocode, the training algorithm for an OvR learner constructed from a binary classification learner L is as follows:

Inputs:
  • L, a learner (training algorithm for binary classifiers)
  • samples X
  • labels y where yi ∈ {1, …, K} is the label for the sample Xi
Output:
  • a list of classifiers fk for k ∈ {1, …, K}
Procedure:
  • For each k in {1, …, K}
    • Construct a new label vector z where zi = 1 if yi = k and zi = 0 otherwise
    • Apply L to X, z to obtain fk

Making decisions means applying all classifiers to an unseen sample x and predicting the label k for which the corresponding classifier reports the highest confidence score:

$$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; f_k(x)$$
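The training procedure and decision rule above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: `centroid_learner` is a hypothetical stand-in for the binary learner L (any learner that returns a real-valued score would do), positives are encoded as 1 and negatives as 0, and the names `train_ovr` and `predict_ovr` are chosen here for illustration only.

```python
def centroid_learner(X, z):
    """Hypothetical binary learner L: returns a scoring function that is
    larger the closer a point lies to the positive-class centroid relative
    to the negative-class centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    pos = centroid([x for x, label in zip(X, z) if label == 1])
    neg = centroid([x for x, label in zip(X, z) if label == 0])
    return lambda x: sq_dist(x, neg) - sq_dist(x, pos)


def train_ovr(learner, X, y, classes):
    """Train one binary classifier f_k per class k."""
    classifiers = {}
    for k in classes:
        # Relabel: samples of class k are positives, everything else negatives.
        z = [1 if yi == k else 0 for yi in y]
        classifiers[k] = learner(X, z)
    return classifiers


def predict_ovr(classifiers, x):
    """Predict the class whose classifier reports the highest score."""
    return max(classifiers, key=lambda k: classifiers[k](x))


# Toy data: three well-separated classes in the plane.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = [1, 1, 2, 2, 3, 3]
classifiers = train_ovr(centroid_learner, X, y, classes=[1, 2, 3])
print(predict_ovr(classifiers, (9, 0)))  # -> 3
```

Note that each binary problem in this toy data has two positives and four negatives, which illustrates the class imbalance that OvR introduces even when the original training set is balanced.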

Although this strategy is popular, it is a heuristic that suffers from several problems. First, the scale of the confidence values may differ between the binary classifiers. Second, even if the class distribution is balanced in the training set, the binary classification learners see unbalanced distributions because typically the set of negatives they see is much larger than the set of positives. [2] :338

One-vs.-one

In the one-vs.-one (OvO) reduction, one trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K (K − 1) / 2 classifiers are applied to an unseen sample, and the class that receives the highest number of "+1" predictions is predicted by the combined classifier. [2] :339

Like OvR, OvO suffers from ambiguities in that some regions of its input space may receive the same number of votes. [2] :183
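The pairwise training and voting scheme can be sketched as follows. This is an illustrative sketch under stated assumptions: `train_pair` is a hypothetical nearest-centroid binary learner that returns +1 for the first class of the pair and −1 for the second, and the function names are not taken from any library.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def train_pair(X, y, a, b):
    """Hypothetical binary learner trained only on samples of classes a and b.
    The returned classifier outputs +1 for class a and -1 for class b."""
    ca = centroid([x for x, yi in zip(X, y) if yi == a])
    cb = centroid([x for x, yi in zip(X, y) if yi == b])
    return lambda x: 1 if sq_dist(x, ca) <= sq_dist(x, cb) else -1


def train_ovo(X, y):
    """Train one classifier per unordered pair of classes: K(K-1)/2 in total."""
    classes = sorted(set(y))
    return {(a, b): train_pair(X, y, a, b) for a, b in combinations(classes, 2)}


def predict_ovo(classifiers, x):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter()
    for (a, b), clf in classifiers.items():
        votes[a if clf(x) == 1 else b] += 1
    # Equal vote counts are resolved by Counter's ordering, which is
    # arbitrary for this purpose: this is the ambiguity noted in the text.
    return votes.most_common(1)[0][0]


X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = [1, 1, 2, 2, 3, 3]
classifiers = train_ovo(X, y)
print(len(classifiers))                  # -> 3, i.e. K(K-1)/2 for K = 3
print(predict_ovo(classifiers, (9, 0)))  # -> 3
```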

Extension from binary

This section discusses strategies of extending the existing binary classifiers to solve multi-class classification problems. Several algorithms have been developed based on neural networks, decision trees, k-nearest neighbors, naive Bayes, support vector machines and extreme learning machines to address multi-class classification problems. These types of techniques can also be called algorithm adaptation techniques.

Neural networks

Multiclass perceptrons provide a natural extension to the multi-class problem. Instead of just having one neuron in the output layer, with binary output, one could have N binary neurons leading to multi-class classification. In practice, the last layer of a neural network is usually a softmax function layer, which is the algebraic simplification of N logistic classifiers, normalized per class by the sum of the N-1 other logistic classifiers.
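As a minimal sketch of the softmax output layer mentioned above, the following function turns N raw output scores into a probability distribution over N classes. The input `scores` is assumed to hold the pre-activation outputs of the N output neurons; the function names are illustrative.

```python
import math


def softmax(scores):
    """Map N real-valued scores to N probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids
    # overflow in exp() for large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


print(softmax([0.0, 0.0]))              # -> [0.5, 0.5]
print(predict_class([0.1, 2.5, -1.0]))  # -> 1
```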

Extreme learning machines

Extreme learning machines (ELMs) are a special case of single hidden layer feed-forward neural networks (SLFNs) in which the input weights and the hidden node biases can be chosen at random. Many variants of the ELM have been developed for multiclass classification.

k-nearest neighbours

k-nearest neighbors (kNN) is considered among the oldest non-parametric classification algorithms. To classify an unknown example, the distance from that example to every training example is measured. The k smallest distances are identified, and the class most represented among these k nearest neighbours is taken as the output class label.
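The procedure can be sketched in a few lines of Python. This sketch assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the default k = 3 are illustrative choices, not part of the source.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    # Sort training pairs by Euclidean distance to x and keep the k closest.
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = [1, 1, 2, 2, 3, 3]
print(knn_predict(X, y, (0.1, 0.2), k=3))  # -> 1
```

Because the vote is taken over class labels directly, nothing in the procedure depends on the number of classes, which is why kNN handles the multiclass case without modification.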

Naive Bayes

Naive Bayes is a successful classifier based upon the principle of maximum a posteriori (MAP). This approach is naturally extensible to the case of having more than two classes, and was shown to perform well in spite of the underlying simplifying assumption of conditional independence.
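Written out, the MAP decision rule under the conditional independence assumption takes the following form. The notation is an assumption made here for illustration: x_1, …, x_n denote the feature values of an instance and C_1, …, C_K the classes.

```latex
\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)
```

The maximisation runs over all K classes at once, so the rule is the same whether K = 2 or K > 2.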

Decision trees

Decision tree learning is a powerful classification technique. The tree tries to infer a split of the training data based on the values of the available features to produce a good generalization. The algorithm can naturally handle binary or multiclass classification problems. The leaf nodes can refer to any of the K classes concerned.

Support vector machines

Support vector machines are based upon the idea of maximizing the margin, i.e. maximizing the minimum distance from the separating hyperplane to the nearest example. The basic SVM supports only binary classification, but extensions have been proposed to handle the multiclass classification case as well. In these extensions, additional parameters and constraints are added to the optimization problem to handle the separation of the different classes.
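As an illustration of what such an extension can look like, one commonly used single-optimization formulation (often attributed to Crammer and Singer) keeps one weight vector per class and adds one margin constraint per sample and competing class. It is shown here only as an example and is not necessarily the formulation the sources cited in this article have in mind; the symbols w_k, ξ_i, C and m (the number of training samples) are notation assumed for this sketch.

```latex
\begin{aligned}
\min_{w_1, \ldots, w_K,\; \xi} \quad & \frac{1}{2} \sum_{k=1}^{K} \lVert w_k \rVert^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & w_{y_i}^{\top} x_i - w_k^{\top} x_i \ge 1 - \xi_i \quad \text{for all } i \text{ and all } k \ne y_i, \\
& \xi_i \ge 0 \quad \text{for all } i.
\end{aligned}
```

A new sample x is then assigned to the class k that maximises w_k^⊤ x.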

Multi expression programming

Multi expression programming (MEP) is an evolutionary algorithm for generating computer programs (that can be used for classification tasks too). MEP has a unique feature: it encodes multiple programs into a single chromosome. Each of these programs can be used to generate the output for a class, thus making MEP naturally suitable for solving multi-class classification problems.

Hierarchical classification

Hierarchical classification tackles the multi-class classification problem by dividing the output space into a tree. Each parent node is divided into multiple child nodes and the process is continued until each child node represents only one class. Several methods have been proposed based on hierarchical classification.
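The tree structure can be sketched as follows. Everything in this example is hypothetical: the `Node` type, the class names, and the hand-written threshold rules, which merely stand in for classifiers that would be trained at each internal node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Node:
    label: Optional[str] = None  # set only on leaves (a single class)
    router: Optional[Callable[[Sequence[float]], str]] = None  # picks a child
    children: Dict[str, "Node"] = field(default_factory=dict)


def classify(node, x):
    """Descend from the root until a leaf (a single class) is reached."""
    while node.label is None:
        node = node.children[node.router(x)]
    return node.label


# Hypothetical two-level hierarchy over four classes.
tree = Node(
    router=lambda x: "animal" if x[0] < 0.5 else "vehicle",
    children={
        "animal": Node(
            router=lambda x: "cat" if x[1] < 0.5 else "dog",
            children={"cat": Node(label="cat"), "dog": Node(label="dog")},
        ),
        "vehicle": Node(
            router=lambda x: "car" if x[1] < 0.5 else "truck",
            children={"car": Node(label="car"), "truck": Node(label="truck")},
        ),
    },
)

print(classify(tree, (0.1, 0.9)))  # -> dog
```

Each prediction evaluates only the classifiers along one root-to-leaf path rather than one classifier per class.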

Learning paradigms

Based on learning paradigms, the existing multi-class classification techniques can be classified into batch learning and online learning. Batch learning algorithms require all the data samples to be available beforehand; they train the model using the entire training data and then predict the test samples using the learned relationship. Online learning algorithms, on the other hand, build their models incrementally in sequential iterations. In iteration t, an online algorithm receives a sample xt and predicts its label ŷt using the current model; the algorithm then receives yt, the true label of xt, and updates its model based on the sample-label pair (xt, yt). More recently, a learning paradigm called the progressive learning technique has been developed. [3] It is capable not only of learning from new samples but also of learning new classes of data while retaining the knowledge learnt thus far. [4]
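The online protocol (receive a sample, predict, receive the true label, update) can be sketched with a multiclass perceptron. The choice of the perceptron update is an assumption made for illustration, since the text does not prescribe a particular online algorithm; classes are numbered from 0 here for indexing convenience, and the class name `OnlinePerceptron` is hypothetical.

```python
class OnlinePerceptron:
    """Multiclass perceptron with one weight vector per class."""

    def __init__(self, n_classes, n_features):
        self.w = [[0.0] * n_features for _ in range(n_classes)]

    def predict(self, x):
        """Return the class whose weight vector scores x highest."""
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in self.w]
        return max(range(len(scores)), key=scores.__getitem__)

    def step(self, x, y):
        """One online iteration: predict with the current model, then
        update using the revealed true label y. Returns the prediction."""
        y_hat = self.predict(x)
        if y_hat != y:
            # Move the true class towards x and the wrongly predicted
            # class away from it.
            self.w[y] = [wi + xi for wi, xi in zip(self.w[y], x)]
            self.w[y_hat] = [wi - xi for wi, xi in zip(self.w[y_hat], x)]
        return y_hat


# Samples arrive one at a time as (x_t, y_t) pairs.
stream = [((1.0, 0.0, 0.0), 0), ((0.0, 1.0, 0.0), 1), ((0.0, 0.0, 1.0), 2)]
model = OnlinePerceptron(n_classes=3, n_features=3)
mistakes = sum(model.step(x, y) != y for x, y in stream)
```

Unlike a batch learner, the model never sees the whole stream at once; after each sample it is immediately usable for prediction.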

Notes

  1. In multi-label classification, OvR is known as binary relevance and the prediction of multiple classes is considered a feature, not a problem.

References

  1. Mohamed, Aly (2005). "Survey on multiclass classification methods". Technical Report, Caltech.
  2. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Venkatesan, Rajasekar; Meng Joo, Er (2016). "A novel progressive learning technique for multi-class classification". Neurocomputing. 207: 310–321. arXiv:1609.00085. doi:10.1016/j.neucom.2016.05.006. S2CID 12510650.
  4. Venkatesan, Rajasekar. "Progressive Learning Technique".