Classifier chains

Classifier chains is a machine learning method for problem transformation in multi-label classification. It combines the computational efficiency of the binary relevance method with the ability to take label dependencies into account for classification. [1]

Problem transformation

Several problem transformation methods exist. One of them is the Binary Relevance method (BR). Given a set of labels $L$ and a data set with instances of the form $(x, Y)$, where $x$ is a feature vector and $Y \subseteq L$ is the set of labels assigned to the instance, BR transforms the data set into $|L|$ data sets and learns $|L|$ binary classifiers $H_l : X \rightarrow \{l, \neg l\}$, one for each label $l \in L$. During this process the information about dependencies between labels is not preserved. This can lead to a situation where a set of labels is assigned to an instance although these labels never co-occur in the data set. Thus, information about label co-occurrence can help to assign correct label combinations. Loss of this information can in some cases lead to a decrease in classification performance. [2]
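A minimal sketch of the BR transformation, assuming scikit-learn-style estimators (the helper names and the choice of logistic regression as the base classifier are illustrative, not prescribed by the method):

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    def train_binary_relevance(X, Y, base=LogisticRegression(max_iter=1000)):
        """Fit one independent binary classifier per label (column of Y)."""
        return [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]

    def predict_binary_relevance(classifiers, X):
        """Predict each label independently; label dependencies are ignored."""
        return np.column_stack([clf.predict(X) for clf in classifiers])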

Another approach, which takes label correlations into account, is the Label Powerset method (LP). Each distinct combination of labels in a data set is considered to be a single class, and after the transformation a single-label classifier $H : X \rightarrow \mathcal{P}(L)$ is trained, where $\mathcal{P}(L)$ is the power set of all labels in $L$. The main drawback of this approach is that the number of label combinations grows exponentially with the number of labels. For example, a multi-label data set with 10 labels can have up to $2^{10} = 1024$ label combinations. This increases the run-time of classification.
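The LP transformation can be sketched under the same assumptions; here np.unique over the rows of the label matrix plays the role of the power-set encoding (the helper names are again illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_label_powerset(X, Y):
        """Map each distinct row of Y (a label combination) to one class id,
        then fit a single multiclass classifier on those ids."""
        combos, y = np.unique(Y, axis=0, return_inverse=True)
        return LogisticRegression(max_iter=1000).fit(X, y), combos

    def predict_label_powerset(model, combos, X):
        """Predict a class id and map it back to the full label combination."""
        return combos[model.predict(X)]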

The Classifier Chains method is based on the BR method and remains efficient even for a large number of labels. Furthermore, it takes dependencies between labels into account.

Method description

For a given set of labels $L$, the Classifier Chain model (CC) learns $|L|$ classifiers $C_1, C_2, \dots, C_{|L|}$ as in the Binary Relevance method. All classifiers are linked in a chain through the feature space.

Given a data set where the $i$-th instance has the form $(x_i, Y_i)$, where $Y_i \subseteq L$ is a subset of labels and $x_i$ is a set of features, the data set is transformed into $|L|$ data sets, where instances of the $j$-th data set have the form $((x_i, l_{i,1}, \dots, l_{i,j-1}), l_{i,j})$ with $l_{i,j} \in \{0, 1\}$. If the $j$-th label was assigned to the instance, then $l_{i,j}$ is $1$, otherwise it is $0$. Thus, the $|L|$ classifiers build a chain in which each of them learns the binary classification of a single label. The features given to each classifier are extended with binary values indicating which of the previous labels were assigned to the instance.
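A minimal training sketch under the same scikit-learn-style assumptions; the $j$-th classifier sees the feature vector extended with the true values of the $j-1$ earlier labels:

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    def train_classifier_chain(X, Y, base=LogisticRegression(max_iter=1000)):
        """Classifier j is trained on X augmented with the true values of
        labels 1..j-1, the earlier links of the chain."""
        chain = []
        for j in range(Y.shape[1]):
            X_aug = np.hstack([X, Y[:, :j]])  # width grows along the chain
            chain.append(clone(base).fit(X_aug, Y[:, j]))
        return chain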

When classifying new instances, the labels are again predicted by building a chain of classifiers. Classification begins with the first classifier and proceeds to the last one, passing label information between classifiers through the feature space. Hence, the inter-label dependencies are preserved. However, the result can vary for different chain orders. For example, if a label often co-occurs with some other label, then only instances of the label that comes later in the chain will have information about the other one in their feature vectors. In order to solve this problem and increase accuracy, it is possible to use an ensemble of classifiers. [3]
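Prediction follows the same pattern; note that, unlike during training, each classifier now receives the predicted rather than the true values of the earlier labels:

    import numpy as np

    def predict_classifier_chain(chain, X):
        """Predict labels in chain order, feeding each prediction forward
        as an extra feature for the next classifier."""
        preds = np.zeros((X.shape[0], len(chain)))
        for j, clf in enumerate(chain):
            X_aug = np.hstack([X, preds[:, :j]])
            preds[:, j] = clf.predict(X_aug)
        return preds.astype(int)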

In the Ensemble of Classifier Chains (ECC), several CC classifiers are trained with random chain orders (i.e. random orders of labels) on random subsets of the data set. The labels of a new instance are predicted by each classifier separately. After that, the total number of predictions, or "votes", is counted for each label. A label is accepted if it was predicted by a fraction of the classifiers that exceeds some threshold value.
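A sketch of ECC built on scikit-learn's ClassifierChain; the bootstrap sampling, the number of chains and the acceptance threshold are illustrative choices:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import ClassifierChain

    def train_ecc(X, Y, n_chains=10, seed=0):
        """Train several chains, each with a random label order, on a
        bootstrap sample of the data."""
        rng = np.random.default_rng(seed)
        chains = []
        for _ in range(n_chains):
            idx = rng.integers(0, X.shape[0], size=X.shape[0])  # bootstrap
            cc = ClassifierChain(LogisticRegression(max_iter=1000),
                                 order="random",
                                 random_state=int(rng.integers(2**31)))
            chains.append(cc.fit(X[idx], Y[idx]))
        return chains

    def predict_ecc(chains, X, threshold=0.5):
        """Accept a label when the fraction of chains voting for it
        exceeds the threshold."""
        votes = np.mean([c.predict(X) for c in chains], axis=0)
        return (votes > threshold).astype(int)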

Adaptations

There are also regressor chains, which can resemble vector autoregression models if the order of the chain ensures that temporal order is respected.
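For illustration, scikit-learn also provides a RegressorChain; fixing the chain order to follow the temporal order of the targets is one way to obtain the autoregressive flavour described above (the order shown assumes three targets ordered in time):

    from sklearn.linear_model import LinearRegression
    from sklearn.multioutput import RegressorChain

    # With targets ordered in time, a fixed chain order [0, 1, 2] lets each
    # regressor condition on the predictions for earlier time steps, which
    # resembles a vector autoregression.
    model = RegressorChain(LinearRegression(), order=[0, 1, 2])
    # model.fit(X, Y)  # Y has one column per temporally ordered target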

Related Research Articles

Supervised learning

Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labeled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning algorithms is learning a function that maps feature vectors (inputs) to labels (output), based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

Support vector machine

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues, SVMs are among the most robust prediction methods, being based on the statistical learning framework, or VC theory, proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

Naive Bayes classifier

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels.

Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power. These activities can be viewed as two facets of the same field of application, and they have undergone substantial development over the past few decades.

Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set, and the output depends on whether k-NN is used for classification or regression.

Conditional random field

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

Decision tree pruning

Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant for classifying instances. Pruning reduces the complexity of the final classifier and hence improves predictive accuracy by reducing overfitting.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

Discriminative models, also referred to as conditional models, are a class of models frequently used for classification or regression. They distinguish decision boundaries through observed data, such as pass/fail, win/lose, alive/dead or healthy/sick.

In statistics, the phi coefficient is a measure of association for two binary variables. In machine learning, it is known as the Matthews correlation coefficient (MCC) and used as a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975. Introduced by Karl Pearson, and also known as the Yule phi coefficient from its introduction by Udny Yule in 1912, this measure is similar to the Pearson correlation coefficient in its interpretation. In fact, a Pearson correlation coefficient estimated for two binary variables will return the phi coefficient. Two binary variables are considered positively associated if most of the data falls along the diagonal cells of their 2×2 contingency table, and negatively associated if most of the data falls off the diagonal.

Multiclass classification

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

In machine learning, a Ranking SVM is a variant of the support vector machine algorithm which is used to solve certain ranking problems. The Ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that Ranking SVM can also be used to solve other problems, such as Rank SIFT.

Preference learning is a subfield of machine learning concerned with classification based on observed preference information. In the view of supervised learning, preference learning trains on a set of items which have preferences toward labels or other items, and predicts the preferences for all items.

Contextual image classification, a topic of pattern recognition in computer vision, is an approach to classification based on contextual information in images. "Contextual" means that this approach focuses on the relationship between nearby pixels, also called the neighbourhood. The goal of this approach is to classify images by using their contextual information.

Probabilistic classification

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into ensembles.

Multiple instance learning

In machine learning, multiple-instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept.

In network theory, collective classification is the simultaneous prediction of the labels for multiple objects, where each label is predicted using information about the object's observed features, the observed features and labels of its neighbors, and the unobserved labels of its neighbors. Collective classification problems are defined in terms of networks of random variables, where the network structure determines the relationship between the random variables. Inference is performed on multiple random variables simultaneously, typically by propagating information between nodes in the network to perform approximate inference. Approaches that use collective classification can make use of relational information when performing inference. Examples of collective classification include predicting attributes of individuals in a social network, classifying webpages in the World Wide Web, and inferring the research area of a paper in a scientific publication dataset.

The P4 metric enables performance evaluation of a binary classifier. It is calculated from precision, recall, specificity and NPV (negative predictive value). P4 is designed in a similar way to the F1 metric, while addressing the criticisms leveled against F1, and may be perceived as its extension.

References

  1. Read, Jesse; Pfahringer, Bernhard; Holmes, Geoff; Frank, Eibe (2009). "Classifier Chains for Multi-label Classification" (PDF). Proceedings of the 13th European Conference on Principles and Practice of Knowledge Discovery in Databases and the 20th European Conference on Machine Learning.
  2. Dembczynski, Krzysztof; Waegeman, Willem; Cheng, Weiwei; Hüllermeier, Eyke (2010). "On Label Dependence in Multi-label Classification" (PDF). Workshop Proceedings of Learning from Multi-Label Data: 5–12.
  3. Rokach, Lior (2010). "Ensemble-based Classifiers" (PDF). Artificial Intelligence Review. 33 (1–2): 1–39. doi:10.1007/s10462-009-9124-7.