Multi-label classification


In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of two or more classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many classes an instance can be assigned to.


Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y; that is, it assigns a value of 0 or 1 for each element (label) in y.

Problem transformation methods

Several problem transformation methods exist for multi-label classification; they can be roughly broken down into the following groups.

Transformation into binary classification problems

The baseline approach, called the binary relevance method, [1] amounts to independently training one binary classifier for each label. Given an unseen sample, the combined model then predicts all labels for which the respective classifiers return a positive result. Although this way of dividing the task into multiple binary problems may superficially resemble the one-vs.-all (OvA) and one-vs.-rest (OvR) strategies for multiclass classification, it is essentially different, because each classifier under binary relevance deals with a single label without any regard to the other labels.

A classifier chain is an alternative method for transforming a multi-label classification problem into several binary classification problems. It differs from binary relevance in that the labels are predicted sequentially, and the outputs of all previous classifiers (i.e. positive or negative for a particular label) are fed as additional features to the subsequent classifiers. [1] Classifier chains have been applied, for instance, in HIV drug resistance prediction. [2] [3] Bayesian networks have also been used to optimally order the classifiers in a classifier chain. [4]
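Both transformations are readily available in common libraries. The following is a minimal sketch using scikit-learn's MultiOutputClassifier (binary relevance) and ClassifierChain; the synthetic dataset and the logistic-regression base learner are illustrative choices, not part of the methods themselves.

```python
# Minimal sketch of binary relevance and a classifier chain with scikit-learn.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain

X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=5, n_labels=2, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Binary relevance: one independent binary classifier per label.
binary_relevance = MultiOutputClassifier(LogisticRegression(max_iter=1000))
binary_relevance.fit(X_train, Y_train)

# Classifier chain: each classifier also receives the previously predicted labels as features.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
chain.fit(X_train, Y_train)

print(binary_relevance.predict(X_test[:3]))
print(chain.predict(X_test[:3]))
```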

In the case of transforming the problem into multiple binary classifications, the likelihood function reads

$$L = \prod_{i} \prod_{j} \left[ p_j(x_i) \right]^{\delta_{1,\,y_{ij}}} \left[ 1 - p_j(x_i) \right]^{\delta_{0,\,y_{ij}}},$$

where the index $i$ runs over the samples, the index $j$ runs over the labels, $p_j(x_i)$ is the predicted probability that label $j$ is present for sample $x_i$, $y_{ij}$ indicates the binary outcome 0 or 1 for label $j$ of sample $i$, $\delta$ indicates the Kronecker delta, and $\mathbf{y}_i = (y_{i1}, \dots, y_{iK})$ indicates the multi-hot encoded labels of sample $i$.

Transformation into multi-class classification problem

The label powerset (LP) transformation maps every distinct label combination present in the training set to a single class and trains one multi-class classifier over these combination classes. For example, if the possible labels for an example were A, B, and C, the label powerset representation of this problem is a multi-class classification problem with the classes [0 0 0], [1 0 0], [0 1 0], [0 0 1], [1 1 0], [1 0 1], [0 1 1], and [1 1 1], where for example [1 0 1] denotes an example in which labels A and C are present and label B is absent. [5]
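A minimal sketch of the transformation itself, assuming the labels are given as multi-hot rows of a NumPy array (the encoding function and the example data are illustrative); the resulting class ids can then be fed to any multi-class learner:

```python
import numpy as np

def label_powerset_encode(Y):
    """Map each distinct label combination (a multi-hot row) to a single class id."""
    combos = {}            # label tuple -> class id
    classes = []
    for row in Y:
        key = tuple(row)
        if key not in combos:
            combos[key] = len(combos)
        classes.append(combos[key])
    return np.array(classes), combos

# Example: labels A, B, C for four samples.
Y = np.array([[1, 0, 1],   # A and C
              [0, 1, 0],   # B only
              [1, 0, 1],   # A and C again -> same class id
              [0, 0, 0]])  # no labels
y_lp, mapping = label_powerset_encode(Y)
print(y_lp)      # [0 1 0 2]
print(mapping)   # {(1, 0, 1): 0, (0, 1, 0): 1, (0, 0, 0): 2}
```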

Ensemble methods

A set of multi-class classifiers can be used to create a multi-label ensemble classifier. For a given example, each classifier outputs a single class (corresponding to a single label in the multi-label problem). These predictions are then combined by an ensemble method, usually a voting scheme where every class that receives a requisite percentage of votes from individual classifiers (often referred to as the discrimination threshold [6] ) is predicted as a present label in the multi-label output. However, more complex ensemble methods exist, such as committee machines. Another variation is the random k-labelsets (RAKEL) algorithm, which uses multiple LP classifiers, each trained on a random subset of the actual labels; label prediction is then carried out by a voting scheme. [7] A set of multi-label classifiers can be used in a similar way to create a multi-label ensemble classifier. In this case, each classifier votes once for each label it predicts rather than for a single label.
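The RAKEL scheme, for instance, can be sketched as follows, assuming label subsets are drawn uniformly at random, per-label votes are averaged and thresholded at 0.5, and logistic regression serves as the underlying multi-class (label powerset) learner; all of these choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_multilabel_classification

def train_rakel(X, Y, n_models=10, k=3, seed=0):
    """RAKEL-style sketch: each model is a label-powerset classifier trained on a
    random k-subset of the labels."""
    rng = np.random.default_rng(seed)
    n_labels = Y.shape[1]
    models = []
    for _ in range(n_models):
        subset = rng.choice(n_labels, size=k, replace=False)
        # Encode the label combinations restricted to this subset as single classes.
        combos, y_lp = np.unique(Y[:, subset], axis=0, return_inverse=True)
        clf = LogisticRegression(max_iter=1000).fit(X, y_lp)
        models.append((subset, combos, clf))
    return models

def predict_rakel(models, X, n_labels, threshold=0.5):
    """Average the per-label votes of the subset models and threshold them."""
    votes = np.zeros((X.shape[0], n_labels))
    counts = np.zeros(n_labels)
    for subset, combos, clf in models:
        pred = combos[clf.predict(X)]     # decode class ids back to multi-hot rows
        votes[:, subset] += pred
        counts[subset] += 1
    return (votes / np.maximum(counts, 1) >= threshold).astype(int)

# Illustrative use with synthetic data:
X, Y = make_multilabel_classification(n_samples=300, n_classes=6, random_state=0)
models = train_rakel(X, Y, n_models=10, k=3)
print(predict_rakel(models, X[:5], n_labels=Y.shape[1]))
```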

Adapted algorithms

Some classification algorithms/models have been adapted to the multi-label task without requiring a problem transformation. Examples of such methods for multi-label data include the multi-label k-nearest neighbours method ML-kNN, [8] multi-valued and multi-labeled decision trees, [10] [11] [12] and multi-label neural networks; [13] an extensive experimental comparison of multi-label learning methods is also available. [9]

Learning paradigms

Based on learning paradigms, the existing multi-label classification techniques can be divided into batch learning and online machine learning. Batch learning algorithms require all the data samples to be available beforehand: the model is trained on the entire training set and then used to predict test samples. Online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample xt and predicts its label(s) ŷt using the current model; the algorithm then receives yt, the true label(s) of xt, and updates its model based on the sample-label pair (xt, yt).
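A minimal sketch of this online protocol, assuming one perceptron-style binary learner per label and a synthetic stream (the update rule, feature dimension, and data source are illustrative choices):

```python
import numpy as np

class OnlineBinaryRelevance:
    """One online (perceptron-style) binary learner per label, updated sample by sample."""
    def __init__(self, n_features, n_labels, lr=0.1):
        self.W = np.zeros((n_labels, n_features))
        self.b = np.zeros(n_labels)
        self.lr = lr

    def predict(self, x):
        return (self.W @ x + self.b > 0).astype(int)

    def update(self, x, y):
        y_hat = self.predict(x)           # predict with the current model ...
        err = y - y_hat                   # ... then learn from the revealed labels
        self.W += self.lr * np.outer(err, x)
        self.b += self.lr * err

# Streaming loop: at step t, predict y_hat_t, receive the true labels y_t, then update.
rng = np.random.default_rng(0)
model = OnlineBinaryRelevance(n_features=20, n_labels=5)
for t in range(1000):
    x_t = rng.normal(size=20)             # illustrative synthetic stream
    y_t = (x_t[:5] > 0).astype(int)       # labels loosely tied to the features
    y_pred = model.predict(x_t)
    model.update(x_t, y_t)
```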

Multi-label stream classification

Data streams are possibly infinite sequences of data that continuously and rapidly grow over time. [14] Multi-label stream classification (MLSC) is the version of the multi-label classification task that takes place in data streams; it is sometimes also called online multi-label classification. The difficulties of multi-label classification (an exponential number of possible label sets, capturing dependencies between labels) are combined with the difficulties of data streams (time and memory constraints, addressing an infinite stream with finite means, concept drift).

Many MLSC methods resort to ensemble methods in order to increase their predictive performance and to deal with concept drift. Widely used examples from the literature include online bagging and boosting, [15] ensembles of pruned sets, [16] ensembles of binary-relevance, classifier-chain and pruned-set models adapted to evolving streams, [19] [20] and online stacked ensembles such as GOOWE-ML. [21]

Statistics and evaluation metrics

Considering $Y_i$ to be the set of labels for the $i$-th data sample (not to be confused with a one-hot vector; it is simply the collection of all labels that belong to this sample), the extent to which a dataset is multi-label can be captured in two statistics: the label cardinality, the average number of labels per example, $\frac{1}{N}\sum_{i=1}^{N}|Y_i|$; and the label density, the label cardinality divided by the total number of distinct labels, $\frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i|}{|L|}$ with $L = \bigcup_{i=1}^{N} Y_i$.
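A minimal sketch of computing these two statistics from a multi-hot label matrix (the example matrix is illustrative):

```python
import numpy as np

def label_cardinality(Y):
    """Average number of labels per sample (Y is a multi-hot array, samples x labels)."""
    return Y.sum(axis=1).mean()

def label_density(Y):
    """Label cardinality divided by the total number of labels."""
    return label_cardinality(Y) / Y.shape[1]

Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])
print(label_cardinality(Y))   # (2 + 1 + 3) / 3 = 2.0
print(label_density(Y))       # 2.0 / 3 ≈ 0.667
```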

Evaluation metrics for multi-label classification performance are inherently different from those used in multi-class (or binary) classification, due to the inherent differences of the classification problem. If T denotes the true set of labels for a given sample and P the predicted set of labels, then example-based metrics such as precision |T ∩ P| / |P|, recall |T ∩ P| / |T|, and the Jaccard index |T ∩ P| / |T ∪ P| can be defined on that sample and then averaged over all samples.
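A minimal sketch of these example-based metrics, computed for a single sample from its true and predicted label sets (the sample sets are illustrative):

```python
def example_metrics(T, P):
    """Example-based precision, recall and Jaccard index for one sample (T, P are label sets)."""
    inter = len(T & P)
    precision = inter / len(P) if P else 0.0
    recall    = inter / len(T) if T else 0.0
    jaccard   = inter / len(T | P) if (T | P) else 1.0   # both empty -> perfect match
    return precision, recall, jaccard

print(example_metrics({"A", "C"}, {"A", "B"}))   # (0.5, 0.5, 0.333...)
```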

Cross-validation in multi-label settings is complicated by the fact that the ordinary (binary/multiclass) way of stratified sampling will not work; alternative ways of approximate stratified sampling have been suggested. [24]
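One implementation of such approximate stratification is provided by the scikit-multilearn package; the following is a brief sketch, assuming its IterativeStratification cross-validator follows the usual scikit-learn splitter interface (the synthetic data and number of folds are illustrative):

```python
# Sketch of approximately stratified multi-label cross-validation, assuming
# scikit-multilearn's IterativeStratification behaves like a scikit-learn splitter.
from skmultilearn.model_selection import IterativeStratification
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(n_samples=200, n_classes=5, random_state=0)
k_fold = IterativeStratification(n_splits=3, order=1)
for train_idx, test_idx in k_fold.split(X, Y):
    print(len(train_idx), len(test_idx))
```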

Implementations and datasets

Java implementations of multi-label algorithms are available in the Mulan and Meka software packages, both based on Weka.

The scikit-learn Python package implements some multi-label algorithms and metrics.

The scikit-multilearn Python package specifically caters to multi-label classification. It provides multi-label implementations of several well-known techniques, including SVM and kNN, and is built on top of the scikit-learn ecosystem.
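For instance, the adapted k-nearest neighbours method ML-kNN [8] can be used roughly as follows; the synthetic data and parameter values are illustrative, and the exact API may differ between package versions.

```python
from skmultilearn.adapt import MLkNN
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=300, n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

clf = MLkNN(k=10)                  # number of nearest neighbours per query point
clf.fit(X_train, Y_train)
predictions = clf.predict(X_test)  # returns a sparse multi-hot label matrix
```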

The binary relevance method, classifier chains, and other multi-label algorithms with many different base learners are implemented in the R package mlr. [25]

A list of commonly used multi-label data-sets is available at the Mulan website.

See also

Related Research Articles

Supervised learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues, SVMs are one of the most studied models, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974).

Naive Bayes classifier

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.


Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the Representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.

F-score

In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to identify objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or the operational status of a nuclear plant as 'normal': In this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known.

Within statistics, oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set. These terms are used both in statistical sampling, survey design methodology and in machine learning.

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

Classifier chains is a machine learning method for problem transformation in multi-label classification. It combines the computational efficiency of the binary relevance method while still being able to take the label dependencies into account for classification.

In machine learning, Platt scaling or Platt calibration is a way of transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machines, replacing an earlier method by Vapnik, but can be applied to other classification models. Platt scaling works by fitting a logistic regression model to a classifier's scores.

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into ensembles.

In machine learning, multiple-instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept.

In network theory, collective classification is the simultaneous prediction of the labels for multiple objects, where each label is predicted using information about the object's observed features, the observed features and labels of its neighbors, and the unobserved labels of its neighbors. Collective classification problems are defined in terms of networks of random variables, where the network structure determines the relationship between the random variables. Inference is performed on multiple random variables simultaneously, typically by propagating information between nodes in the network to perform approximate inference. Approaches that use collective classification can make use of relational information when performing inference. Examples of collective classification include predicting attributes of individuals in a social network, classifying webpages in the World Wide Web, and inferring the research area of a paper in a scientific publication dataset.

References

  1. Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. Machine Learning Journal. Springer. Vol. 85(3), (2011).
  2. Heider, D; Senge, R; Cheng, W; Hüllermeier, E (2013). "Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction". Bioinformatics. 29 (16): 1946–52. doi: 10.1093/bioinformatics/btt331 . PMID   23793752.
  3. Riemenschneider, M; Senge, R; Neumann, U; Hüllermeier, E; Heider, D (2016). "Exploiting HIV-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification". BioData Mining. 9: 10. doi: 10.1186/s13040-016-0089-1 . PMC   4772363 . PMID   26933450.
  4. Soufan, Othman; Ba-Alawi, Wail; Afeef, Moataz; Essack, Magbubah; Kalnis, Panos; Bajic, Vladimir B. (2016-11-10). "DRABAL: novel method to mine large high-throughput screening assays using Bayesian active learning". Journal of Cheminformatics. 8: 64. doi: 10.1186/s13321-016-0177-8 . ISSN   1758-2946. PMC   5105261 . PMID   27895719.
  5. Spolaôr, Newton; Cherman, Everton Alvares; Monard, Maria Carolina; Lee, Huei Diana (March 2013). "A Comparison of Multi-label Feature Selection Methods using the Problem Transformation Approach". Electronic Notes in Theoretical Computer Science. 292: 135–151. doi: 10.1016/j.entcs.2013.02.010 . ISSN   1571-0661.
  6. "Discrimination Threshold — yellowbrick 0.9 documentation". www.scikit-yb.org. Retrieved 2018-11-29.
  7. Tsoumakas, Grigorios; Vlahavas, Ioannis (2007). Random k-labelsets: An ensemble method for multilabel classification (PDF). ECML. Archived from the original (PDF) on 2014-07-29. Retrieved 2014-07-26.
  8. Zhang, M.L.; Zhou, Z.H. (2007). "ML-KNN: A lazy learning approach to multi-label learning". Pattern Recognition. 40 (7): 2038–2048. Bibcode:2007PatRe..40.2038Z. CiteSeerX   10.1.1.538.9597 . doi:10.1016/j.patcog.2006.12.019. S2CID   14886376.
  9. Madjarov, Gjorgji; Kocev, Dragi; Gjorgjevikj, Dejan; Džeroski, Sašo (2012). "An extensive experimental comparison of methods for multi-label learning". Pattern Recognition. 45 (9): 3084–3104. Bibcode:2012PatRe..45.3084M. doi:10.1016/j.patcog.2012.03.004. S2CID   14064264.
  10. Chen, Yen-Liang; Hsu, Chang-Ling; Chou, Shih-chieh (2003). "Constructing a multi-valued and multi-labeled decision tree". Expert Systems with Applications. 25 (2): 199–209. doi:10.1016/S0957-4174(03)00047-2.
  11. Chou, Shihchieh; Hsu, Chang-Ling (2005-05-01). "MMDT: a multi-valued and multi-labeled decision tree classifier for data mining". Expert Systems with Applications. 28 (4): 799–812. doi:10.1016/j.eswa.2004.12.035.
  12. Li, Hong; Guo, Yue-jian; Wu, Min; Li, Ping; Xiang, Yao (2010-12-01). "Combine multi-valued attribute decomposition with multi-label learning". Expert Systems with Applications. 37 (12): 8721–8728. doi:10.1016/j.eswa.2010.06.044.
  13. Zhang, M.L.; Zhou, Z.H. (2006). Multi-label neural networks with applications to functional genomics and text categorization (PDF). IEEE Transactions on Knowledge and Data Engineering. Vol. 18. pp. 1338–1351.
  14. Aggarwal, Charu C., ed. (2007). Data Streams. Advances in Database Systems. Vol. 31. doi:10.1007/978-0-387-47534-9. ISBN   978-0-387-28759-1.
  15. Oza, Nikunj (2005). "Online Bagging and Boosting". IEEE International Conference on Systems, Man and Cybernetics. hdl: 2060/20050239012 .
  16. Read, Jesse; Pfahringer, Bernhard; Holmes, Geoff (2008-12-15). "Multi-label Classification Using Ensembles of Pruned Sets". 2008 Eighth IEEE International Conference on Data Mining. IEEE Computer Society. pp. 995–1000. doi:10.1109/ICDM.2008.74. hdl:10289/8077. ISBN   9780769535029. S2CID   16059274.
  17. Osojnik, Aljaž; Panov, Panče; Džeroski, Sašo (2017-06-01). "Multi-label classification via multi-target regression on data streams". Machine Learning. 106 (6): 745–770. doi: 10.1007/s10994-016-5613-5 . ISSN   0885-6125.
  18. Sousa, Ricardo; Gama, João (2018-01-24). "Multi-label classification from high-speed data streams with adaptive model rules and random rules". Progress in Artificial Intelligence. 7 (3): 177–187. doi:10.1007/s13748-018-0142-z. ISSN   2192-6352. S2CID   32376722.
  19. Read, Jesse; Bifet, Albert; Holmes, Geoff; Pfahringer, Bernhard (2012-02-21). "Scalable and efficient multi-label classification for evolving data streams". Machine Learning. 88 (1–2): 243–272. doi: 10.1007/s10994-012-5279-6 . ISSN   0885-6125.
  20. Bifet, Albert; Gavaldà, Ricard (2007-04-26), "Learning from Time-Changing Data with Adaptive Windowing", Proceedings of the 2007 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 443–448, CiteSeerX   10.1.1.215.8387 , doi:10.1137/1.9781611972771.42, ISBN   9780898716306, S2CID   2279539
  21. Büyükçakir, Alican; Bonab, Hamed; Can, Fazli (2018-10-17). "A Novel Online Stacked Ensemble for Multi-Label Stream Classification". Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM. pp. 1063–1072. arXiv: 1809.09994 . doi:10.1145/3269206.3271774. ISBN   9781450360142. S2CID   52843253.
  22. Xioufis, Eleftherios Spyromitros; Spiliopoulou, Myra; Tsoumakas, Grigorios; Vlahavas, Ioannis (2011-07-16). Dealing with concept drift and class imbalance in multi-label stream classification. AAAI Press. pp. 1583–1588. doi:10.5591/978-1-57735-516-8/IJCAI11-266. ISBN   9781577355144.
  23. Godbole, Shantanu; Sarawagi, Sunita (2004). Discriminative methods for multi-labeled classification (PDF). Advances in Knowledge Discovery and Data Mining. pp. 22–30.
  24. Sechidis, Konstantinos; Tsoumakas, Grigorios; Vlahavas, Ioannis (2011). On the stratification of multi-label data (PDF). ECML PKDD. pp. 145–158.
  25. Philipp Probst, Quay Au, Giuseppe Casalicchio, Clemens Stachl, Bernd Bischl. Multilabel Classification with R Package mlr. The R Journal (2017) 9:1, pages 352-369.

Further reading