Classification

Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). [1] Examples include diagnostic tests, identifying spam emails and deciding whether to give someone a driving license.

As well as 'category', synonyms or near-synonyms for 'class' include 'type', 'species', 'order', 'concept', 'taxon', 'group', 'identification' and 'division'.

The word 'classification' (and its synonyms) may take on one of several related meanings. It may encompass both classification and the creation of classes, as for example in 'the task of categorizing pages in Wikipedia'; this overall activity is listed under taxonomy. It may refer exclusively to the underlying scheme of classes (which otherwise may be called a taxonomy). Or it may refer to the label given to an object by the classifier.

Classification is a part of many different kinds of activities and is studied from many different points of view including medicine, philosophy [2], law, anthropology, biology, taxonomy, cognition, communications, knowledge organization, psychology, statistics, machine learning, economics and mathematics.

Binary vs multi-class classification

Methodological work aimed at improving the accuracy of a classifier is commonly divided between cases where there are exactly two classes (binary classification) and cases where there are three or more classes (multiclass classification). Another distinction is between categorical classification, which disregards the inherent ordering of classes, and ordinal classification, which considers and preserves the natural order of the classes. [3]

Evaluation of accuracy

Unlike in decision theory, it is assumed that a classifier repeats the classification task over and over. And unlike a lottery, it is assumed that each classification can be either right or wrong; in the theory of measurement, classification is understood as measurement against a nominal scale. Thus it is possible to try to measure the accuracy of a classifier.

Measuring the accuracy of a classifier allows a choice to be made between two alternative classifiers. This is important both when developing a classifier and in choosing which classifier to deploy. There are however many different methods for evaluating the accuracy of a classifier and no general method for determining which method should be used in which circumstances. Different fields have taken different approaches, even in binary classification. In pattern recognition, error rate is popular. The Gini coefficient and KS statistic are widely used in the credit scoring industry. Sensitivity and specificity are widely used in epidemiology and medicine. Precision and recall are widely used in information retrieval. [4]
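As a minimal sketch of how several of the measures named above relate to one another, the following Python computes error rate, sensitivity, specificity, precision and recall from the four outcome counts of a binary classifier. The names `BinaryCounts` and `count_outcomes` are hypothetical helpers introduced only for illustration, and the sketch assumes that every denominator is non-zero.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    """Outcome counts for a binary classifier (hypothetical helper type)."""
    tp: int  # positives correctly labelled positive
    fp: int  # negatives wrongly labelled positive
    tn: int  # negatives correctly labelled negative
    fn: int  # positives wrongly labelled negative


def count_outcomes(y_true, y_pred, positive=1):
    """Tally the four outcomes from true and predicted labels."""
    c = BinaryCounts(0, 0, 0, 0)
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                c.tp += 1
            else:
                c.fp += 1
        else:
            if truth == positive:
                c.fn += 1
            else:
                c.tn += 1
    return c


def error_rate(c):
    """Fraction of all cases that were misclassified."""
    return (c.fp + c.fn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c):
    """Fraction of actual positives labelled positive (also called recall)."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Fraction of actual negatives labelled negative."""
    return c.tn / (c.tn + c.fp)


def precision(c):
    """Fraction of predicted positives that are actually positive."""
    return c.tp / (c.tp + c.fp)


# Illustrative labels only; not data from any real classifier.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = count_outcomes(y_true, y_pred)
```

Because each measure weighs the four counts differently, two classifiers can rank in opposite orders depending on which measure is chosen, which is one reason the choice of method matters.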

Classifier accuracy depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the no-free-lunch theorem).

Related Research Articles

Accuracy and precision: Characterization of measurement error

Accuracy and precision are two measures of observational error. Accuracy is how close a given set of measurements is to the true value. Precision is how close the measurements are to each other.

Categorization is a type of cognition involving conceptual differentiation between characteristics of conscious experience, such as objects, events, or ideas. It involves the abstraction and differentiation of aspects of experience by sorting and distinguishing between groupings, through classification or typification on the basis of traits, features, similarities or other criteria that are universal to the group. Categorization is considered one of the most fundamental cognitive abilities, and it is studied particularly by psychology and cognitive linguistics.

In machine learning (ML), boosting is an ensemble metaheuristic used primarily for reducing bias. It can also improve the stability and accuracy of ML classification and regression algorithms. Hence, it is prevalent in supervised learning for converting weak learners to strong learners.

In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.
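The decision rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any particular library's implementation: the function name `linear_classify`, the `bias` term, the zero threshold and the example weights are all hypothetical, and in practice the weights would be learned from training data.

```python
def linear_classify(features, weights, bias=0.0, threshold=0.0):
    """Return 1 if the weighted sum of the features exceeds the threshold, else 0.

    All parameter names are illustrative assumptions; the weights are taken
    as given here, whereas a real classifier would learn them from data.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    # The linear combination of the features, plus a constant offset.
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > threshold else 0


# Hypothetical weights for a two-feature problem.
weights = [2.0, -1.0]
bias = -0.5
label_a = linear_classify([1.0, 0.5], weights, bias)  # score = 1.0
label_b = linear_classify([0.0, 1.0], weights, bias)  # score = -1.5
```

Classifying one object costs a single pass over its features, which is consistent with the observation that such classifiers are quick to use even when there are many variables.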

Binary classification: Dividing things between two categories

Binary classification is the task of classifying the elements of a set into one of two groups.

Ground truth is information that is known to be real or true, provided by direct observation and measurement as opposed to information provided by inference.

Dichotomy: Splitting of a whole into exactly two non-overlapping parts; dyadic relations and processes

A dichotomy is a partition of a whole into two parts (subsets).

There are two main uses of the term calibration in statistics that denote special types of statistical inference problems.

Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and has since had a complex history, being adopted and extended in some disciplines and by some scholars, and criticized or rejected by others. Other classifications include those by Mosteller and Tukey, and by Chrisman.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

When classification is performed by a computer, statistical methods are normally used to develop the algorithm.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method. It was first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. Most often, it is used for classification, as a k-NN classifier, the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
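The plurality vote described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `knn_classify` is hypothetical, Euclidean distance is an assumption (the text does not fix a distance measure), and the brute-force search over all stored points is chosen for clarity rather than speed.

```python
import math
from collections import Counter


def knn_classify(query, points, labels, k=3):
    """Assign `query` to the class most common among its k nearest points.

    Assumes Euclidean distance (an illustrative choice). Ties in the vote
    go to the tied class whose member is encountered first when the
    neighbors are listed from nearest to farthest.
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Indices of the stored points, ordered from nearest to farthest.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    nearest_labels = [labels[i] for i in order[:k]]
    # Plurality vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Illustrative training data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_classify((0.2, 0.2), points, labels, k=3)
single_neighbor = knn_classify((5.1, 5.1), points, labels, k=1)
```

With k = 1 the vote has a single participant, so the function reduces to returning the class of the nearest stored point.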

Given a population whose members each belong to one of a number of different sets or classes, a classification rule or classifier is a procedure by which the elements of the population set are each predicted to belong to one of the classes. A perfect classification is one for which every element in the population is assigned to the class it really belongs to. The Bayes classifier is the classifier which assigns classes optimally based on the known attributes of the elements to be classified.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to. The formulation of multi-label learning was first introduced by Shen et al. in the context of Semantic Scene Classification, and later gained popularity across various areas of machine learning.

Web query topic classification (or categorization) is a problem in information science. The task is to assign a web search query to one or more predefined categories, based on its topics. The importance of query classification is underscored by many services provided by Web search. A direct application is to provide better search result pages for users with interests in different categories. For example, users issuing a Web query such as "apple" might expect to see Web pages related to the fruit apple, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on the query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Unlike documents in document classification tasks, queries submitted by Web search users are usually short and ambiguous; moreover, the meanings of queries evolve over time. Therefore, query topic classification is much more difficult than traditional document classification tasks.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes. For example, deciding on whether an image is showing a banana, peach, orange, or an apple is a multiclass classification problem, with four possible classes, while deciding on whether an image contains an apple or not is a binary classification problem.

Taxonomy: Development of classes and classifications

Taxonomy is a practice and science concerned with classification or categorization. Typically, there are two parts to it: the development of an underlying scheme of classes and the allocation of things to the classes (classification).

Evaluation of binary classifiers: Quantitative measurement of accuracy

Evaluation of a binary classifier typically assigns a numerical value, or values, to a classifier that represent its accuracy. An example is error rate, which measures how frequently the classifier makes a mistake.

A fast-and-frugal tree or matching heuristic (in the study of decision-making) is a simple graphical structure that categorizes objects by asking one question at a time. These decision trees are used in a range of fields: psychology, artificial intelligence, and management science. Unlike other decision or classification trees, such as Leo Breiman's CART, fast-and-frugal trees are intentionally simple, both in their construction and in their execution, and operate speedily with little information. For this reason, fast-and-frugal trees are potentially attractive when designing resource-constrained tasks.

In machine learning and data mining, quantification is the task of using supervised learning in order to train models (quantifiers) that estimate the relative frequencies of the classes of interest in a sample of unlabelled data items. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets which belong to class 'Positive', and to do the same for classes 'Neutral' and 'Negative'.
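To make the task concrete, the sketch below estimates class frequencies by the naive "classify and count" approach: label every item with an existing classifier and report the share of each label. This is a swapped-in baseline chosen for illustration, not the method of any particular quantifier; it is generally regarded as biased when the underlying classifier makes errors. The function name `classify_and_count` and the keyword-based `toy_classifier` are hypothetical.

```python
from collections import Counter


def classify_and_count(items, classifier, classes):
    """Estimate the relative frequency of each class in `items`.

    Naive baseline: classify each item, then count the predicted labels.
    `classifier` is any callable mapping an item to one of `classes`.
    """
    if not items:
        raise ValueError("items must not be empty")
    counts = Counter(classifier(item) for item in items)
    return {c: counts[c] / len(items) for c in classes}


def toy_classifier(text):
    """Hypothetical stand-in for a trained sentiment classifier."""
    if "good" in text:
        return "Positive"
    if "bad" in text:
        return "Negative"
    return "Neutral"


# Illustrative sample; not real tweets.
sample = ["good speech", "bad policy", "good debate", "no comment"]
estimate = classify_and_count(
    sample, toy_classifier, ["Positive", "Neutral", "Negative"]
)
```

The output is one estimated proportion per class rather than a label per item, which is the distinction between quantification and classification drawn above.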

References

  1. "The Classification Society | Scientific Classification Organization".
  2. "Classification". Internet Encyclopedia of Philosophy. Retrieved 10 January 2025.
  3. Marudi, M., Ben-Gal, I., and Singer, G. (2022). "A decision tree-based method for ordinal classification problems" (PDF). IISE Transactions: 1–15.
  4. David Hand (2012). "Assessing the Performance of Classification Methods". International Statistical Review. 80 (3): 400–414. doi:10.1111/j.1751-5823.2012.00183.x.