Classification rule

Given a population whose members each belong to one of a number of different sets or classes, a classification rule or classifier is a procedure by which each element of the population is predicted to belong to one of the classes. [1] A perfect classification is one for which every element in the population is assigned to the class it really belongs to. The Bayes classifier is the classifier which assigns classes optimally based on the known attributes (i.e. features or regressors) of the elements to be classified.

A special kind of classification rule is binary classification, for problems in which there are only two classes.

Testing classification rules

Given a data set consisting of pairs x and y, where x denotes an element of the population and y the class it belongs to, a classification rule h(x) is a function that assigns each element x to a predicted class ŷ = h(x). A binary classification is such that the label y can take only one of two values.

The true labels yi can be known but will not necessarily match their approximations ŷi = h(xi). In a binary classification, the elements that are not correctly classified are named false positives and false negatives.

Some classification rules are static functions. Others can be computer programs. A computer classifier may be able to learn, or may implement static classification rules. For a training data-set, the true labels yj are unknown, but it is a prime target for the classification procedure that the approximation ŷj = h(xj) match yj as well as possible, where the quality of this approximation needs to be judged on the basis of the statistical or probabilistic properties of the overall population from which future observations will be drawn.

Given a classification rule, a classification test is the result of applying the rule to a finite sample of the initial data set.
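As a minimal sketch of these definitions, the following Python snippet treats a classification rule as a plain function and a classification test as the result of applying it to a finite sample. The threshold rule `h`, the helper `classification_test`, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def h(x: float) -> int:
    """Hypothetical static rule: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0


def classification_test(
    rule: Callable[[float], int],
    sample: Sequence[Tuple[float, int]],
) -> List[Tuple[int, int]]:
    """Apply the rule to each (x, y) pair and return (true, predicted) labels."""
    return [(y, rule(x)) for x, y in sample]


# Illustrative sample of (x, y) pairs; the values are made up.
sample = [(0.9, 1), (0.2, 0), (0.6, 0), (0.4, 1), (0.7, 1)]

results = classification_test(h, sample)
misclassified = sum(1 for y, y_hat in results if y != y_hat)

print(results)
print("misclassified:", misclassified)
```

Here the pairs whose two entries differ are the elements for which the predicted class does not match the true label.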

Binary and multiclass classification

Classification can be thought of as two separate problems – binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. [2] Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers. An important point is that in many practical binary classification problems, the two groups are not symmetric – rather than overall accuracy, the relative proportion of different types of errors is of interest. For example, in medical testing, a false positive (detecting a disease when it is not present) is considered differently from a false negative (not detecting a disease when it is present). In multiclass classifications, the classes may be considered symmetrically (all errors are equivalent), or asymmetrically, which is considerably more complicated.

Binary classification methods include probit regression and logistic regression. Multiclass classification methods include multinomial probit and multinomial logit.
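One common way of combining several binary classifiers into a multiclass rule is the one-vs-rest scheme, sketched below. This scheme is used here as an illustration and is not prescribed by the text; the scoring functions, class names, and class centres are hypothetical stand-ins for trained binary classifiers.

```python
from typing import Callable, Dict


def make_binary_scorer(center: float) -> Callable[[float], float]:
    """Return a hypothetical 'this class vs. the rest' scorer.

    The score is higher the closer x lies to the given class centre.
    A real system would use a trained binary classifier instead.
    """
    def score(x: float) -> float:
        return 1.0 / (1.0 + abs(x - center))
    return score


# One binary scorer per class (names and centres are illustrative).
scorers: Dict[str, Callable[[float], float]] = {
    "low": make_binary_scorer(0.0),
    "medium": make_binary_scorer(5.0),
    "high": make_binary_scorer(10.0),
}


def one_vs_rest(x: float) -> str:
    """Assign x to the class whose binary scorer is most confident."""
    return max(scorers, key=lambda label: scorers[label](x))


for x in (1.0, 6.0, 9.5):
    print(x, one_vs_rest(x))
```

Each binary scorer only answers "this class or not"; the multiclass decision comes from comparing their confidences.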

Confusion Matrix and Classifiers

The left, and right, halves respectively contain instances that in fact have, and do not have, the condition. The oval contains instances that are classified (predicted) as positive (having the condition). Green and red respectively contain instances that are correctly (true), and wrongly (false), classified.
TP=True Positive; TN=True Negative; FP=False Positive (type I error); FN=False Negative (type II error); TPR=True Positive Rate; FPR=False Positive Rate; PPV=Positive Predictive Value; NPV=Negative Predictive Value.

When the classification function is not perfect, false results will appear. In the example image, 20 dots lie on the left side of the line (the side classified as true), but only 8 of those 20 are actually true. Similarly, 16 dots lie on the right side of the line (the side classified as false), and 4 of those 16 are actually true and were wrongly classified as false. Using the dot locations, we can build a confusion matrix to express the values. Four counts describe the four possible outcomes: true positive (TP), false positive (FP), false negative (FN), and true negative (TN).

Example confusion matrix

                Predicted True   Predicted False
Actual True     8                4
Actual False    12               12
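The counts in the example matrix (8 true positives, 4 false negatives, 12 false positives, 12 true negatives) can be tallied from paired label lists. The sketch below is one way to do this in Python; the helper name `confusion_counts` and the label lists are assumptions built to reproduce those counts, and the derived rates follow the standard definitions of TPR, FPR, PPV and NPV.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, int]:
    """Count the four outcomes from paired actual and predicted labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(True, True)],
        "FN": counts[(True, False)],
        "FP": counts[(False, True)],
        "TN": counts[(False, False)],
    }


# Labels constructed to match the example: 12 actual positives, 24 actual negatives.
actual = [True] * 12 + [False] * 24
predicted = [True] * 8 + [False] * 4 + [True] * 12 + [False] * 12

cm = confusion_counts(actual, predicted)

tpr = cm["TP"] / (cm["TP"] + cm["FN"])  # true positive rate
fpr = cm["FP"] / (cm["FP"] + cm["TN"])  # false positive rate
ppv = cm["TP"] / (cm["TP"] + cm["FP"])  # positive predictive value
npv = cm["TN"] / (cm["TN"] + cm["FN"])  # negative predictive value

print(cm)
print(tpr, fpr, ppv, npv)
```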

False positives

False positives result when a test falsely (incorrectly) reports a positive result. For example, a medical test for a disease may return a positive result indicating that the patient has the disease even if the patient does not have the disease. In a confusion matrix laid out with test outcomes as rows and true conditions as columns, the false positive is commonly the top right cell (condition negative × test outcome positive).

False negatives

On the other hand, false negatives result when a test falsely or incorrectly reports a negative result. For example, a medical test for a disease may return a negative result indicating that the patient does not have the disease even though the patient actually has the disease. In a confusion matrix laid out with test outcomes as rows and true conditions as columns, the false negative is commonly the bottom left cell (condition positive × test outcome negative).

True positives

True positives result when a test correctly reports a positive result. As an example, a medical test for a disease may return a positive result indicating that the patient has the disease. The result is a true positive when the patient does in fact have the disease. The true positive is commonly the top left cell (condition positive × test outcome positive) of a confusion matrix.

True negatives

True negatives result when a test correctly reports a negative result. As an example, a medical test for a disease may return a negative result indicating that the patient does not have the disease. The result is a true negative when the patient does in fact not have the disease. The true negative is commonly the bottom right cell (condition negative × test outcome negative) of a confusion matrix.

Application with Bayes’ Theorem

We can also calculate the probabilities of true positives, false positives, true negatives, and false negatives using Bayes' theorem. Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. The four cases (true positive, false positive, false negative, and true negative) are worked through in turn using the example below.

False positives

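We can use Bayes' theorem to determine the probability that a positive result is in fact a false positive. We find that if a disease is rare, then the majority of positive results may be false positives, even if the test is relatively accurate.

Suppose that a test for a disease returns a positive result with probability 0.99 when the tested patient has the disease, and with probability 0.05 when the tested patient does not have the disease.

Naively, one might think that only 5% of positive test results are false, but that is quite wrong, as we shall see.

Suppose that only 0.1% of the population has that disease, so that a randomly selected patient has a 0.001 prior probability of having the disease.

We can use Bayes' theorem to calculate the probability that a positive test result is a false positive. The probability that a patient has the disease given a positive result is

$$P(\text{disease} \mid \text{positive}) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \approx 0.019,$$

and hence the probability that a positive result is a false positive is about 1 − 0.019 = 0.98, or 98%.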

Despite the apparent high accuracy of the test, the incidence of the disease is so low that the vast majority of patients who test positive do not have the disease. Nonetheless, the fraction of patients who test positive who do have the disease (0.019) is 19 times the fraction of people who have not yet taken the test who have the disease (0.001). Thus the test is not useless, and re-testing may improve the reliability of the result.
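The calculation above, and the effect of re-testing, can be sketched in Python. The sketch assumes the test returns a positive result with probability 0.99 for patients with the disease and 0.05 for patients without it (values consistent with the figures quoted above), and it further assumes that a second test is conditionally independent of the first given the patient's true status, which may not hold in practice. The function name is hypothetical.

```python
def prob_disease_given_positive(
    prior: float,
    p_pos_given_disease: float,
    p_pos_given_healthy: float,
) -> float:
    """Bayes' theorem: probability of disease given a positive test result."""
    numerator = p_pos_given_disease * prior
    evidence = numerator + p_pos_given_healthy * (1.0 - prior)
    return numerator / evidence


# First positive test, starting from the 0.001 prior (disease prevalence).
after_first = prob_disease_given_positive(0.001, 0.99, 0.05)

# Second positive test: the previous posterior becomes the new prior.
# Assumption: the two test results are conditionally independent.
after_second = prob_disease_given_positive(after_first, 0.99, 0.05)

print("after one positive test: ", after_first)
print("after two positive tests:", after_second)
```

Under these assumptions the probability of disease is about 0.019 after one positive result and about 0.28 after two, which illustrates why re-testing can help.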

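In order to reduce the problem of false positives, a test should be very accurate in reporting a negative result when the patient does not have the disease. If the test reported a negative result in patients without the disease with probability 0.999, and still reported a positive result in patients with the disease with probability 0.99, then

$$P(\text{disease} \mid \text{positive}) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.001 \times 0.999} \approx 0.5,$$

so that 1 − 0.5 = 0.5 is now the probability of a false positive.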

False negatives

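We can use Bayes' theorem to determine the probability that a negative result is in fact a false negative, using the example from above. A patient with the disease tests negative with probability 1 − 0.99 = 0.01, and a patient without the disease tests negative with probability 1 − 0.05 = 0.95, so

$$P(\text{disease} \mid \text{negative}) = \frac{0.01 \times 0.001}{0.01 \times 0.001 + 0.95 \times 0.999} \approx 0.0000105.$$

The probability that a negative result is a false negative is about 0.0000105 or 0.00105%. When a disease is rare, false negatives will not be a major problem with the test.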

But if 60% of the population had the disease, then the probability of a false negative would be greater. With the above test, the probability of a false negative would be

$$P(\text{disease} \mid \text{negative}) = \frac{0.01 \times 0.6}{0.01 \times 0.6 + 0.95 \times 0.4} \approx 0.0155.$$

The probability that a negative result is a false negative rises to 0.0155 or 1.55%.
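The dependence of the false negative probability on prevalence can be sketched as follows. The snippet assumes that a patient with the disease tests negative with probability 0.01 and a patient without it tests negative with probability 0.95 (values consistent with the figures quoted above); the function name and the intermediate prevalence of 0.1 are illustrative choices.

```python
def prob_disease_given_negative(
    prior: float,
    p_neg_given_disease: float,
    p_neg_given_healthy: float,
) -> float:
    """Bayes' theorem: probability of disease given a negative test result."""
    numerator = p_neg_given_disease * prior
    evidence = numerator + p_neg_given_healthy * (1.0 - prior)
    return numerator / evidence


# Share of negative results that are false negatives, by disease prevalence.
false_negative_share = {
    prevalence: prob_disease_given_negative(prevalence, 0.01, 0.95)
    for prevalence in (0.001, 0.1, 0.6)
}

for prevalence, share in false_negative_share.items():
    print(f"prevalence {prevalence}: {share:.7f}")
```

The same test yields a larger share of false negatives as the disease becomes more common.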

True positives

We can use Bayes' theorem to determine the probability that a positive result is in fact a true positive, using the example from above:

Let A represent the condition in which the patient has the disease, and B represent the evidence of a positive test result. Then, the probability that the patient actually has the disease given a positive test result is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \approx 0.019.$$

The probability that a positive result is a true positive is about 0.019, or 1.9%.

True negatives

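We can also use Bayes' theorem to calculate the probability of a true negative. Using the example above, in which a patient without the disease tests negative with probability 0.95 and a patient with the disease tests negative with probability 0.01:

$$P(\text{no disease} \mid \text{negative}) = \frac{0.95 \times 0.999}{0.95 \times 0.999 + 0.01 \times 0.001} \approx 0.9999895.$$

The probability that a negative result is a true negative is about 0.9999895, or 99.999%, which is 1 minus the false negative probability of 0.0000105 found above. Because the disease is rare and the test has both a high true positive rate and a high true negative rate, nearly all negative results are true negatives.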

Measuring a classifier with sensitivity and specificity

In training a classifier, one may wish to measure its performance using the well-accepted metrics of sensitivity and specificity. It may be instructive to compare the classifier to a random classifier that flips a coin based on the prevalence of a disease. Suppose that the probability a person has the disease is p and the probability that they do not is q = 1 − p. Suppose then that we have a random classifier that guesses that the patient has the disease with that same probability p and guesses that they do not with the same probability q.

The probability of a true positive is the probability that the patient has the disease times the probability that the random classifier guesses this correctly, or p². With similar reasoning, the probability of a false negative is pq. From the definitions above, the sensitivity of this classifier is p² / (p² + pq) = p. With similar reasoning, we can calculate the specificity as q² / (q² + pq) = q.
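The reasoning above can be checked numerically. In this sketch the symbols p (prevalence) and q = 1 − p are assumed names for the two probabilities, and the function name is hypothetical; it computes the joint probabilities of the four outcomes for the prevalence-weighted random classifier and derives sensitivity and specificity from them.

```python
from typing import Tuple


def random_classifier_rates(p: float) -> Tuple[float, float]:
    """Sensitivity and specificity of a classifier that guesses 'disease'
    with probability p, independently of the patient's true status."""
    q = 1.0 - p
    tp = p * p  # has disease, guessed disease
    fn = p * q  # has disease, guessed no disease
    fp = q * p  # no disease, guessed disease
    tn = q * q  # no disease, guessed no disease
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


for p in (0.001, 0.1, 0.6):
    sens, spec = random_classifier_rates(p)
    print(f"p={p}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

For each prevalence the sensitivity comes out equal to p and the specificity equal to 1 − p.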

So, while sensitivity and specificity are themselves defined independently of disease prevalence, the values achieved by this random classifier depend on disease prevalence. A classifier may perform like this random classifier, but with a better-weighted coin (higher sensitivity and specificity), so these measures may be influenced by disease prevalence. An alternative measure of performance is the Matthews correlation coefficient, for which any random classifier will get an average score of 0.
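A minimal sketch of the Matthews correlation coefficient, computed from the four confusion-matrix counts with its standard formula, is given below. Returning 0 when the denominator vanishes is a common convention adopted here as an assumption. The first call uses illustrative counts (8 true positives, 12 false positives, 4 false negatives, 12 true negatives); the second uses the expected counts of a prevalence-weighted random classifier in a hypothetical population of 10,000 with prevalence 0.1.

```python
import math


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumption): report 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative confusion-matrix counts.
mcc_example = matthews_corrcoef(tp=8, fp=12, fn=4, tn=12)

# Expected counts for a random classifier with prevalence 0.1 among 10,000 people:
# TP = 0.1 * 0.1 * 10000, FN = FP = 0.1 * 0.9 * 10000, TN = 0.9 * 0.9 * 10000.
mcc_random = matthews_corrcoef(tp=100, fp=900, fn=900, tn=8100)

print(mcc_example)
print(mcc_random)
```

The random classifier's expected counts give a coefficient of exactly 0 because the products TP × TN and FP × FN coincide.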

The extension of this concept to non-binary classifications yields the confusion matrix.

References

1. Mathworld article for statistical test
2. Har-Peled, S., Roth, D., Zimak, D. (2003) "Constraint Classification for Multiclass Classification and Ranking." In: Becker, B., Thrun, S., Obermayer, K. (Eds) Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference, MIT Press. ISBN 0-262-02550-7