Classification

Classification is usually understood to mean the allocation of objects to certain pre-existing classes or categories. This distinguishes it from the earlier step in which the classes themselves are established, often through clustering, in which similar objects are grouped together. [1] Examples of classification include a pregnancy test, identifying spam emails and deciding whether to give someone a driving licence.

Classification is part of many different kinds of activities and is studied from many different points of view, including medicine, philosophy, law, anthropology, biology, taxonomy, cognition, communications, knowledge organization, psychology, statistics, machine learning, librarianship and mathematics.

As well as 'category', synonyms or near-synonyms for 'class' include 'type', 'species', 'order', 'concept', 'taxon', 'group' and 'division'.

The word 'classification' (and its synonyms) may take on one of several related meanings. It may encompass both classification and the creation of classes, as for example in 'the task of categorizing pages in Wikipedia'; this overall activity is listed under Taxonomy. It may refer exclusively to the underlying scheme of classes (which otherwise may be called a taxonomy). Or it may refer to the label given to an object by the classifier.

Binary vs multi-class classification

Methodological work aimed at improving the accuracy of a classifier is commonly divided between cases where there are exactly two classes (binary classification) and cases where there are three or more classes (multiclass classification).

Evaluation of accuracy

Unlike in decision theory, it is assumed that a classifier repeats the classification task over and over. And unlike a lottery, it is assumed that each classification can be either right or wrong; in the theory of measurement, classification is understood as measurement against a nominal scale. Thus it is possible to try to measure the accuracy of a classifier.

Measuring the accuracy of a classifier allows a choice to be made between two alternative classifiers. This is important both when developing a classifier and when choosing which classifier to deploy. There are, however, many different methods for evaluating the accuracy of a classifier, and no general rule for determining which method should be used in which circumstances. Different fields have taken different approaches, even in binary classification. In pattern recognition, error rate is popular. The Gini coefficient and KS statistic are widely used in the credit scoring industry. Sensitivity and specificity are widely used in epidemiology and medicine. Precision and recall are widely used in information retrieval. [2]
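Several of the measures named above can be computed from the same four counts of a binary classifier's outcomes. The following is a minimal sketch, not a reference implementation: the function names and the example labels are hypothetical, labels are assumed to be 1 (positive) or 0 (negative), and each denominator is assumed to be non-zero.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def confusion_counts(y_true, y_pred):
    """Count the four outcomes of a binary classifier (labels are 1 or 0)."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth and predicted:
            tp += 1
        elif not truth and predicted:
            fp += 1
        elif not truth and not predicted:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def error_rate(c):
    """Fraction of all cases that were classified wrongly."""
    return (c.fp + c.fn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c):
    """Fraction of actual positives that were found (also called recall)."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Fraction of actual negatives that were correctly rejected."""
    return c.tn / (c.tn + c.fp)


def precision(c):
    """Fraction of predicted positives that really are positive."""
    return c.tp / (c.tp + c.fp)


# Hypothetical example data: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
```

On this example the error rate is 0.2, while sensitivity and precision are both 0.75 and specificity is 5/6, which illustrates how the different measures can rank the same classifier differently.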

Classifier accuracy depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the no-free-lunch theorem).

Related Research Articles

Accuracy and precision: Characterization of measurement error

Accuracy and precision are two measures of observational error.

Categorization is a type of cognition involving conceptual differentiation between characteristics of conscious experience, such as objects, events, or ideas. It involves the abstraction and differentiation of aspects of experience by sorting and distinguishing between groupings, through classification or typification on the basis of traits, features, similarities or other criteria that are universal to the group. Categorization is considered one of the most fundamental cognitive abilities, and it is studied particularly by psychology and cognitive linguistics.

In machine learning, boosting is an ensemble meta-algorithm used in supervised learning, primarily to reduce bias and also variance. The term also covers a family of machine learning algorithms that convert weak learners into strong ones.

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.
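The decision rule of a linear classifier can be sketched in a few lines. This is an illustrative example only: the weights, bias and feature vectors are hypothetical values chosen by hand rather than learned from data, and the threshold at zero is an assumed convention.

```python
def linear_classify(weights, bias, features):
    """Return class 1 if the linear combination of features is >= 0, else class 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0


# Hypothetical parameters for a two-feature problem.
weights = [2.0, -1.0]
bias = -0.5

label_a = linear_classify(weights, bias, [1.0, 0.5])  # score = 2.0 - 0.5 - 0.5 = 1.0
label_b = linear_classify(weights, bias, [0.0, 1.0])  # score = -1.0 - 0.5 = -1.5
```

Here `label_a` is 1 and `label_b` is 0. Because each decision needs only one multiplication and addition per feature, the cost grows linearly with the number of features, which is consistent with the short training and usage times mentioned above.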

Binary classification is the task of classifying the elements of a set into one of two groups.

There are two main uses of the term calibration in statistics that denote special types of statistical inference problems.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

When classification is performed by a computer, statistical methods are normally used to develop the algorithm.

Given a population whose members each belong to one of a number of different sets or classes, a classification rule or classifier is a procedure by which each element of the population is predicted to belong to one of the classes. A perfect classification is one for which every element in the population is assigned to the class it really belongs to. The Bayes classifier is the classifier which assigns classes optimally based on the known attributes of the elements to be classified.
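When the class probabilities and the distribution of attributes within each class are known, the Bayes classifier picks the class with the highest posterior probability, which is proportional to the prior times the likelihood. The sketch below assumes a single discrete attribute; the class names and all probabilities are made-up illustrative values, not data from any source.

```python
# Hypothetical prior probabilities P(class).
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical likelihoods P(attribute value | class).
likelihoods = {
    "spam": {"offer": 0.7, "meeting": 0.3},
    "ham": {"offer": 0.2, "meeting": 0.8},
}


def bayes_classify(value):
    """Return the class maximizing P(class) * P(value | class)."""
    return max(priors, key=lambda cls: priors[cls] * likelihoods[cls][value])


label_offer = bayes_classify("offer")      # spam: 0.28 vs ham: 0.12
label_meeting = bayes_classify("meeting")  # spam: 0.12 vs ham: 0.48
```

In practice these probabilities are rarely known exactly and have to be estimated from data, so the Bayes classifier is best read as a theoretical benchmark rather than a ready-made procedure.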

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.
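The difference between the two problems shows up in the shape of the output. The sketch below assumes that some model has already produced a score per label; choosing the top score for multiclass and thresholding each score for multi-label is one simple convention among several, and the labels, scores and threshold are hypothetical.

```python
def multiclass_label(scores):
    """Multiclass: exactly one label, here the one with the highest score."""
    return max(scores, key=scores.get)


def multilabel_labels(scores, threshold=0.5):
    """Multi-label: any number of labels, here all whose score meets the threshold."""
    return {label for label, score in scores.items() if score >= threshold}


# Hypothetical per-label scores for one news article.
scores = {"sports": 0.8, "politics": 0.6, "tech": 0.1}

single = multiclass_label(scores)     # 'sports'
several = multilabel_labels(scores)   # {'sports', 'politics'}
```

Note that the multi-label result may also be empty or contain every label, since nothing constrains how many classes an instance is assigned to.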

Discriminative models, also referred to as conditional models, are a class of models frequently used for classification. They are typically used to assign labels, such as pass/fail, win/lose, alive/dead or healthy/sick, to existing datapoints.

Web query topic classification/categorization is a problem in information science. The task is to assign a Web search query to one or more predefined categories, based on its topics. The importance of query classification is underscored by many services provided by Web search. A direct application is to provide better search result pages for users with interests in different categories. For example, users issuing the Web query "apple" might expect to see Web pages related to the fruit, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on the query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Unlike documents in document classification tasks, queries submitted by Web search users are usually short and ambiguous, and the meanings of queries evolve over time. Therefore, query topic classification is much more difficult than traditional document classification tasks.

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to identify objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or the operational status of a nuclear plant as 'normal': In this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known.
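A very simple one-class approach is to summarize the statistics of normal operation and flag anything that deviates too far from them. The sketch below is only one possible illustration, assuming a single numeric sensor reading; the readings, the function names and the cut-off of three standard deviations are hypothetical choices.

```python
import statistics


def fit_normal_profile(samples):
    """Learn mean and sample standard deviation from 'normal' examples only."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_normal(value, profile, k=3.0):
    """Accept a value if it lies within k standard deviations of the learned mean."""
    mean, stdev = profile
    return abs(value - mean) <= k * stdev


# Hypothetical readings recorded during normal operation; no failure examples exist.
normal_readings = [10.0, 10.2, 9.8, 10.1, 9.9]
profile = fit_normal_profile(normal_readings)

ok = is_normal(10.3, profile)       # within about 0.47 of the mean
suspect = is_normal(12.0, profile)  # far outside the learned range
```

Here `ok` is True and `suspect` is False, even though the training set contained no example of an abnormal state.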

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
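One of the simplest ways to combine a finite set of models is a majority vote. The sketch below assumes three hand-written threshold rules standing in for trained classifiers; the rules and inputs are hypothetical, and real ensemble methods typically train and weight their members in more elaborate ways.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Ask every classifier for a label and return the most common answer."""
    votes = [classify(x) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical weak rules that disagree about where the boundary lies.
classifiers = [
    lambda x: x > 3,
    lambda x: x > 5,
    lambda x: x > 7,
]

vote_for_6 = majority_vote(classifiers, 6)  # votes: True, True, False
vote_for_4 = majority_vote(classifiers, 4)  # votes: True, False, False
```

With an odd number of binary voters there can be no tie; with an even number or more than two classes, a tie-breaking rule would need to be chosen explicitly.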

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

Classification of percussion instruments

There are several overlapping schemes for the classification of percussion instruments.

Taxonomy: Development of classes and classifications

Taxonomy is a practice and science concerned with classification or categorization. Typically, there are two parts to it: the development of an underlying scheme of classes and the allocation of things to the classes (classification).

Evaluation of binary classifiers: Quantitative measurement of accuracy

Evaluation of a binary classifier typically assigns a numerical value, or values, to a classifier that represent its accuracy. An example is error rate, which measures how frequently the classifier makes a mistake.

A fast-and-frugal tree or matching heuristic (in the study of decision-making) is a simple graphical structure that categorizes objects by asking one question at a time. These decision trees are used in a range of fields: psychology, artificial intelligence, and management science. Unlike other decision or classification trees, such as Leo Breiman's CART, fast-and-frugal trees are intentionally simple, both in their construction and in their execution, and operate speedily with little information. For this reason, fast-and-frugal trees are potentially attractive when designing resource-constrained tasks.
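The one-question-at-a-time structure can be sketched as an ordered list of cues, each of which may end the decision immediately. The example below is a hypothetical illustration: the cue names, their order and the spam-filtering setting are invented for the sketch and are not taken from any published tree.

```python
def fft_classify(cues, obj, default):
    """Walk the cues in order; exit with a label as soon as one cue's exit condition is met."""
    for question, exit_answer, label in cues:
        if question(obj) == exit_answer:
            return label
    # Reached only if the final question did not trigger its exit.
    return default


# Each cue: (question, answer that triggers an exit, label assigned on exit).
cues = [
    (lambda email: email["sender_known"], True, "not spam"),
    (lambda email: email["has_link"], False, "not spam"),
    (lambda email: email["mentions_prize"], True, "spam"),
]

email_a = {"sender_known": False, "has_link": True, "mentions_prize": True}
email_b = {"sender_known": True, "has_link": True, "mentions_prize": True}

result_a = fft_classify(cues, email_a, default="not spam")  # exits at the third cue
result_b = fft_classify(cues, email_b, default="not spam")  # exits at the first cue
```

For `email_b` only the first question is evaluated, which shows why such trees can operate with little information.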

In machine learning and data mining, quantification is the task of using supervised learning in order to train models (quantifiers) that estimate the relative frequencies of the classes of interest in a sample of unlabelled data items. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these 100,000 tweets which belong to class 'Positive', and to do the same for classes 'Neutral' and 'Negative'.
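The simplest baseline for this task is to classify every item and report the share of each predicted class, an approach usually called "classify and count". The sketch below assumes a toy keyword rule in place of a trained classifier; the rule, the function names and the four example tweets are hypothetical.

```python
from collections import Counter


def toy_classifier(text):
    """Hypothetical stand-in for a trained sentiment classifier."""
    if "great" in text:
        return "Positive"
    if "terrible" in text:
        return "Negative"
    return "Neutral"


def classify_and_count(items, classifier):
    """Estimate class frequencies as the fraction of items assigned to each class."""
    counts = Counter(classifier(item) for item in items)
    total = len(items)
    return {label: n / total for label, n in counts.items()}


tweets = ["great speech", "terrible plan", "great rally", "he spoke today"]
estimate = classify_and_count(tweets, toy_classifier)
```

On this sample the estimate is 50% 'Positive', 25% 'Negative' and 25% 'Neutral'. Because the estimate inherits any systematic errors of the underlying classifier, dedicated quantification methods often adjust these raw proportions.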

References

  1. https://www.theclassificationsociety.org/about/
  2. David Hand (2012). "Assessing the Performance of Classification Methods". International Statistical Review. 80 (3): 400–414.