Evaluation of binary classifiers

Last updated
From the confusion matrix you can derive four basic measures. Binary-classification-labeled.svg
From the confusion matrix you can derive four basic measures.

Evaluation of a binary classifier typically assigns a numerical value, or values, to a classifier that represent its accuracy. An example is error rate, which measures how frequently the classifier makes a mistake.

Contents

There are many metrics that can be used; different fields have different preferences. For example, in medicine sensitivity and specificity are often used, while in computer science precision and recall are preferred.

An important distinction is between metrics that are independent of the prevalence or skew (how often each class occurs in the population), and metrics that depend on the prevalence – both types are useful, but they have very different properties.

Often, evaluation is used to compare two methods of classification, one of which is usually a standard method and the other is being investigated.

Contingency table

Given a data set, a classification (the output of a classifier on that set) gives two numbers: the number of positives and the number of negatives, which add up to the total size of the set. To evaluate a classifier, one compares its output to another reference classification – ideally a perfect classification, but in practice the output of another gold standard test – and cross tabulates the data into a 2×2 contingency table, comparing the two classifications. One then evaluates the classifier relative to the gold standard by computing summary statistics of these 4 numbers. Generally these statistics will be scale invariant (scaling all the numbers by the same factor does not change the output), to make them independent of population size, which is achieved by using ratios of homogeneous functions, most simply homogeneous linear or homogeneous quadratic functions.

Say we test some people for the presence of a disease. Some of these people have the disease, and our test correctly says they are positive. They are called true positives (TP). Some have the disease, but the test incorrectly claims they don't. They are called false negatives (FN). Some don't have the disease, and the test says they don't – true negatives (TN). Finally, there might be healthy people who have a positive test result – false positives (FP). These can be arranged into a 2×2 contingency table (confusion matrix), conventionally with the test result on the vertical axis and the actual condition on the horizontal axis.

These numbers can then be totaled, yielding both a grand total and marginal totals. Totaling the entire table, the number of true positives, false negatives, true negatives, and false positives add up to 100% of the set. Totaling the columns (adding vertically) the number of true positives and false positives add up to 100% of the test positives, and likewise for negatives. Totaling the rows (adding horizontally), the number of true positives and false negatives add up to 100% of the condition positives (conversely for negatives). The basic marginal ratio statistics are obtained by dividing the 2×2=4 values in the table by the marginal totals (either rows or columns), yielding 2 auxiliary 2×2 tables, for a total of 8 ratios. These ratios come in 4 complementary pairs, each pair summing to 1, and so each of these derived 2×2 tables can be summarized as a pair of 2 numbers, together with their complements. Further statistics can be obtained by taking ratios of these ratios, ratios of ratios, or more complicated functions.

The contingency table and the most common derived ratios are summarized below; see sequel for details.

Predicted conditionSources: [1] [2] [3] [4] [5] [6] [7] [8]
Total population
= P + N
Predicted Positive (PP)Predicted Negative (PN) Informedness, bookmaker informedness (BM)
= TPR + TNR − 1
Prevalence threshold (PT)
= TPR × FPR - FPR/TPR - FPR
Actual condition
Positive (P) [lower-alpha 1] True positive (TP),
hit [lower-alpha 2]
False negative (FN),
miss, underestimation
True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power
= TP/P= 1 − FNR
False negative rate (FNR),
miss rate
type II error [lower-alpha 3]
= FN/P= 1 − TPR
Negative (N) [lower-alpha 4] False positive (FP),
false alarm, overestimation
True negative (TN),
correct rejection [lower-alpha 5]
False positive rate (FPR),
probability of false alarm, fall-out
type I error [lower-alpha 6]
= FP/N= 1 − TNR
True negative rate (TNR),
specificity (SPC), selectivity
= TN/N= 1 − FPR
Prevalence
= P/P + N
Positive predictive value (PPV), precision
= TP/PP= 1 − FDR
False omission rate (FOR)
= FN/PN= 1 − NPV
Positive likelihood ratio (LR+)
= TPR/FPR
Negative likelihood ratio (LR−)
= FNR/TNR
Accuracy (ACC)
= TP + TN/P + N
False discovery rate (FDR)
= FP/PP= 1 − PPV
Negative predictive value (NPV)
= TN/PN= 1 − FOR
Markedness (MK), deltaP (Δp)
= PPV + NPV − 1
Diagnostic odds ratio (DOR)
= LR+/LR−
Balanced accuracy (BA)
= TPR + TNR/2
F1 score
= 2 PPV × TPR/PPV + TPR= 2 TP/2 TP + FP + FN
Fowlkes–Mallows index (FM)
= PPV × TPR
Matthews correlation coefficient (MCC)
= TPR × TNR × PPV × NPV- FNR × FPR × FOR × FDR
Threat score (TS), critical success index (CSI), Jaccard index
= TP/TP + FN + FP
  1. the number of real positive cases in the data
  2. A test result that correctly indicates the presence of a condition or characteristic
  3. Type II error: A test result which wrongly indicates that a particular condition or attribute is absent
  4. the number of real negative cases in the data
  5. A test result that correctly indicates the absence of a condition or characteristic
  6. Type I error: A test result which wrongly indicates that a particular condition or attribute is present

Note that the rows correspond to the condition actually being positive or negative (or classified as such by the gold standard), as indicated by the color-coding, and the associated statistics are prevalence-independent, while the columns correspond to the test being positive or negative, and the associated statistics are prevalence-dependent. There are analogous likelihood ratios for prediction values, but these are less commonly used, and not depicted above.

Sensitivity and specificity

The fundamental prevalence-independent statistics are sensitivity and specificity.

Sensitivity or True Positive Rate (TPR), also known as recall, is the proportion of people that tested positive and are positive (True Positive, TP) of all the people that actually are positive (Condition Positive, CP = TP + FN). It can be seen as the probability that the test is positive given that the patient is sick. With higher sensitivity, fewer actual cases of disease go undetected (or, in the case of the factory quality control, fewer faulty products go to the market).

Specificity (SPC) or True Negative Rate (TNR) is the proportion of people that tested negative and are negative (True Negative, TN) of all the people that actually are negative (Condition Negative, CN = TN + FP). As with sensitivity, it can be looked at as the probability that the test result is negative given that the patient is not sick. With higher specificity, fewer healthy people are labeled as sick (or, in the factory case, fewer good products are discarded).

The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the Receiver Operating Characteristic (ROC) curve.

In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100% in both (such as in the red/blue ball example given above). In more practical, less contrived instances, however, there is usually a trade-off, such that they are inversely proportional to one another to some extent. This is because we rarely measure the actual thing we would like to classify; rather, we generally measure an indicator of the thing we would like to classify, referred to as a surrogate marker. The reason why 100% is achievable in the ball example is because redness and blueness is determined by directly detecting redness and blueness. However, indicators are sometimes compromised, such as when non-indicators mimic indicators or when indicators are time-dependent, only becoming evident after a certain lag time. The following example of a pregnancy test will make use of such an indicator.

Modern pregnancy tests do not use the pregnancy itself to determine pregnancy status; rather, human chorionic gonadotropin is used, or hCG, present in the urine of gravid females, as a surrogate marker to indicate that a woman is pregnant. Because hCG can also be produced by a tumor, the specificity of modern pregnancy tests cannot be 100% (because false positives are possible). Also, because hCG is present in the urine in such small concentrations after fertilization and early embryogenesis, the sensitivity of modern pregnancy tests cannot be 100% (because false negatives are possible).

Likelihood ratios

Positive and negative predictive values

In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive predictive value (PPV), also known as precision, and negative predictive value (NPV). The positive prediction value answers the question "If the test result is positive, how well does that predict an actual presence of disease?". It is calculated as TP/(TP + FP); that is, it is the proportion of true positives out of all positive results. The negative prediction value is the same, but for negatives, naturally.

Impact of prevalence on prediction values

Prevalence has a significant impact on prediction values. As an example, suppose there is a test for a disease with 99% sensitivity and 99% specificity. If 2000 people are tested and the prevalence (in the sample) is 50%, 1000 of them are sick and 1000 of them are healthy. Thus about 990 true positives and 990 true negatives are likely, with 10 false positives and 10 false negatives. The positive and negative prediction values would be 99%, so there can be high confidence in the result.

However, if the prevalence is only 5%, so of the 2000 people only 100 are really sick, then the prediction values change significantly. The likely result is 99 true positives, 1 false negative, 1881 true negatives and 19 false positives. Of the 19+99 people tested positive, only 99 really have the disease – that means, intuitively, that given that a patient's test result is positive, there is only 84% chance that they really have the disease. On the other hand, given that the patient's test result is negative, there is only 1 chance in 1882, or 0.05% probability, that the patient has the disease despite the test result.

Likelihood ratios

Precision and recall

Precision and recall can be interpreted as (estimated) conditional probabilities: Precision is given by while recall is given by , [9] where is the predicted class and is the actual class. Both quantities are therefore connected by Bayes' theorem.

Relationships

There are various relationships between these ratios.

If the prevalence, sensitivity, and specificity are known, the positive predictive value can be obtained from the following identity:

If the prevalence, sensitivity, and specificity are known, the negative predictive value can be obtained from the following identity:

Single metrics

In addition to the paired metrics, there are also single metrics that give a single number to evaluate the test.

Perhaps the simplest statistic is accuracy or fraction correct (FC), which measures the fraction of all instances that are correctly categorized; it is the ratio of the number of correct classifications to the total number of correct or incorrect classifications: (TP + TN)/total population = (TP + TN)/(TP + TN + FP + FN). As such, it compares estimates of pre- and post-test probability. In total ignorance, one can compare a rule to flipping a coin (p0=0.5). This measure is prevalence-dependent. If 90% of people with COVID symptoms don't have COVID, the prior probability P(-) is 0.9, and the simple rule "Classify all such patients as COVID-free." would be 90% accurate. Diagnosis should be better than that. One can construct a "One-proportion z-test" with p0 as max(priors) = max(P(-),P(+)) for a diagnostic method hoping to beat a simple rule using the most likely outcome. Here, the hypotheses are "Ho: p ≤ 0.9 vs. Ha: p > 0.9", rejecting Ho for large values of z. One diagnostic rule could be compared to another if the other's accuracy is known and substituted for p0 in calculating the z statistic. If not known and calculated from data, an accuracy comparison test could be made using "Two-proportion z-test, pooled for Ho: p1 = p2".

Not used very much is the complementary statistic, the fraction incorrect (FiC): FC + FiC = 1, or (FP + FN)/(TP + TN + FP + FN) – this is the sum of the antidiagonal, divided by the total population. Cost-weighted fractions incorrect could compare expected costs of misclassification for different methods.

The diagnostic odds ratio (DOR) can be a more useful overall metric, which can be defined directly as (TP×TN)/(FP×FN) = (TP/FN)/(FP/TN), or indirectly as a ratio of ratio of ratios (ratio of likelihood ratios, which are themselves ratios of true rates or prediction values). This has a useful interpretation – as an odds ratio – and is prevalence-independent. Likelihood ratio is generally considered to be prevalence-independent and is easily interpreted as the multiplier to turn prior probabilities into posterior probabilities. Another useful single measure is "area under the ROC curve", AUC.

Alternative metrics

An F-score is a combination of the precision and the recall, providing a single score. There is a one-parameter family of statistics, with parameter β, which determines the relative weights of precision and recall. The traditional or balanced F-score (F1 score) is the harmonic mean of precision and recall:

.

F-scores do not take the true negative rate into account and, therefore, are more suited to information retrieval and information extraction evaluation where the true negatives are innumerable. Instead, measures such as the phi coefficient, Matthews correlation coefficient, informedness or Cohen's kappa may be preferable to assess the performance of a binary classifier. [10] [11] As a correlation coefficient, the Matthews correlation coefficient is the geometric mean of the regression coefficients of the problem and its dual. The component regression coefficients of the Matthews correlation coefficient are markedness (deltap) and informedness (Youden's J statistic or deltap'). [12]

Choosing the appropriate form of evaluation

Hand has highlighted the importance of choosing an appropriate method of evaluation. However, of the many different methods for evaluating the accuracy of a classifier, there is no general method for determining which method should be used in which circumstances. Different fields have taken different approaches. [13]

Cullerne Bown has distinguished three basic approaches to evaluation:

° Mathematical - such as the Matthews Correlation Coefficient, in which both kinds of error are axiomatically treated as equally problematic;

° Cost-benefit - in which a currency is adopted (e.g. money or Quality Adjusted Life Years) and values assigned to errors and successes on the basis of empirical measurement;

° Judgemental - in which a human judgement is made about the relative importance of the two kinds of error; typically this starts by adopting a pair of indicators such as sensitivity and specificity, precision and recall or positive predictive value and negative predictive value.

In the judgemental case, he has provided a flow chart for determining which pair of indicators should be used when, and consequently how to choose between the Receiver Operating Characteristic and the Precision-Recall Curve. [14]

Accuracy aside

Apart from accuracy, binary classifiers can be assessed in many other ways, for example in terms of their speed or cost.

Evaluation of probabilistic classifiers

Probabilistic classification models go beyond providing binary outputs and instead produce probability scores for each class. These models are designed to assess the likelihood or probability of an instance belonging to different classes. In the context of evaluating probabilistic classifiers, alternative evaluation metrics have been developed to properly assess the performance of these models. These metrics take into account the probabilistic nature of the classifier's output and provide a more comprehensive assessment of its effectiveness in assigning accurate probabilities to different classes. These evaluation metrics aim to capture the degree of calibration, discrimination, and overall accuracy of the probabilistic classifier's predictions.

In information systems

Information retrieval systems, such as databases and web search engines, are evaluated by many different metrics, some of which are derived from the confusion matrix, which divides results into true positives (documents correctly retrieved), true negatives (documents correctly not retrieved), false positives (documents incorrectly retrieved), and false negatives (documents incorrectly not retrieved). Commonly used metrics include the notions of precision and recall. In this context, precision is defined as the fraction of documents correctly retrieved compared to the documents retrieved (true positives divided by true positives plus false positives), using a set of ground truth relevant results selected by humans. Recall is defined as the fraction of documents correctly retrieved compared to the relevant documents (true positives divided by true positives plus false negatives). Less commonly, the metric of accuracy is used, is defined as the fraction of documents correctly classified compared to the documents (true positives plus true negatives divided by true positives plus true negatives plus false positives plus false negatives).

None of these metrics take into account the ranking of results. Ranking is very important for web search engines because readers seldom go past the first page of results, and there are too many documents on the web to manually classify all of them as to whether they should be included or excluded from a given search. Adding a cutoff at a particular number of results takes ranking into account to some degree. The measure precision at k, for example, is a measure of precision looking only at the top ten (k=10) search results. More sophisticated metrics, such as discounted cumulative gain, take into account each individual ranking, and are more commonly used where this is important.

See also

Related Research Articles

<span class="mw-page-title-main">Binary classification</span> Dividing things between two categories

Binary classification is the task of classifying the elements of a set into one of two groups. Typical binary classification problems include:

<span class="mw-page-title-main">Decision tree</span> Decision support tool

A decision tree is a decision support hierarchical model that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one; in unsupervised learning it is usually called a matching matrix.

<span class="mw-page-title-main">Receiver operating characteristic</span> Diagnostic plot of binary classifier ability

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier model at varying threshold values.

In evidence-based medicine, likelihood ratios are used for assessing the value of performing a diagnostic test. They use the sensitivity and specificity of the test to determine whether a test result usefully changes the probability that a condition exists. The first description of the use of likelihood ratios for decision rules was made at a symposium on information theory in 1954. In medicine, likelihood ratios were introduced between 1975 and 1980.

<span class="mw-page-title-main">Positive and negative predictive values</span> Statistical measures of whether a finding is likely to be true

The positive and negative predictive values are the proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively. The PPV and NPV describe the performance of a diagnostic test or other statistical measure. A high result can be interpreted as indicating the accuracy of such a statistic. The PPV and NPV are not intrinsic to the test ; they depend also on the prevalence. Both PPV and NPV can be derived using Bayes' theorem.

Given a population whose members each belong to one of a number of different sets or classes, a classification rule or classifier is a procedure by which the elements of the population set are each predicted to belong to one of the classes. A perfect classification is one for which every element in the population is assigned to the class it really belongs to. The bayes classifier is the classifier which assigns classes optimally based on the known attributes of the elements to be classified.

<span class="mw-page-title-main">F-score</span> Statistical measure of a tests accuracy

In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.

<span class="mw-page-title-main">Sensitivity and specificity</span> Statistical measure of a binary classification

In medicine and statistics, sensitivity and specificity mathematically describe the accuracy of a test that reports the presence or absence of a medical condition. If individuals who have the condition are considered "positive" and those who do not are considered "negative", then sensitivity is a measure of how well a test can identify true positives and specificity is a measure of how well a test can identify true negatives:

Youden's J statistic is a single statistic that captures the performance of a dichotomous diagnostic test. (Bookmaker) Informedness is its generalization to the multiclass case and estimates the probability of an informed decision.

<span class="mw-page-title-main">Precision and recall</span> Pattern-recognition performance metrics

In pattern recognition, information retrieval, object detection and classification, precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.

In statistics, the phi coefficient is a measure of association for two binary variables.

Pre-test probability and post-test probability are the probabilities of the presence of a condition before and after a diagnostic test, respectively. Post-test probability, in turn, can be positive or negative, depending on whether the test falls out as a positive test or a negative test, respectively. In some cases, it is used for the probability of developing the condition of interest in the future.

<span class="mw-page-title-main">Diagnostic odds ratio</span>

In medical testing with binary classification, the diagnostic odds ratio (DOR) is a measure of the effectiveness of a diagnostic test. It is defined as the ratio of the odds of the test being positive if the subject has a disease relative to the odds of the test being positive if the subject does not have the disease.

The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings, and also a metric to measure confusion matrices. This measure of similarity could be either between two hierarchical clusterings or a clustering and a benchmark classification. A higher value for the Fowlkes–Mallows index indicates a greater similarity between the clusters and the benchmark classifications. It was invented by Bell Labs statisticians Edward Fowlkes and Collin Mallows in 1983.

A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition, while a false negative is the opposite error, where the test result incorrectly indicates the absence of a condition when it is actually present. These are the two kinds of errors in a binary test, in contrast to the two kinds of correct result. They are also known in medicine as a false positivediagnosis, and in statistical classification as a false positiveerror.

<span class="mw-page-title-main">Forensic epidemiology</span>

The discipline of forensic epidemiology (FE) is a hybrid of principles and practices common to both forensic medicine and epidemiology. FE is directed at filling the gap between clinical judgment and epidemiologic data for determinations of causality in civil lawsuits and criminal prosecution and defense.

Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine learning models. Decisions made by computers after a machine-learning process may be considered unfair if they were based on variables considered sensitive. For example gender, ethnicity, sexual orientation or disability. As it is the case with many ethical concepts, definitions of fairness and bias are always controversial. In general, fairness and bias are considered relevant when the decision process impacts people's lives. In machine learning, the problem of algorithmic bias is well known and well studied. Outcomes may be skewed by a range of factors and thus might be considered unfair with respect to certain groups or individuals. An example would be the way social media sites deliver personalized news to consumer.

<span class="mw-page-title-main">Partial Area Under the ROC Curve</span> Dev gurjar actor

The Partial Area Under the ROC Curve (pAUC) is a metric for the performance of binary classifier.

P4 metric enables performance evaluation of the binary classifier. It is calculated from precision, recall, specificity and NPV (negative predictive value). P4 is designed in similar way to F1 metric, however addressing the criticisms leveled against F1. It may be perceived as its extension.

References

  1. Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID   2027090.
  2. Provost, Foster; Tom Fawcett (2013-08-01). "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking". O'Reilly Media, Inc.
  3. Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
  4. Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN   978-0-387-30164-8.
  5. Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
  6. Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1): 6-1–6-13. doi: 10.1186/s12864-019-6413-7 . PMC   6941312 . PMID   31898477.
  7. Chicco D, Toetsch N, Jurman G (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13. doi: 10.1186/s13040-021-00244-z . PMC   7863449 . PMID   33541410.
  8. Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi: 10.1016/j.aci.2018.08.003 .
  9. Roelleke, Thomas (2022-05-31). Information Retrieval Models: Foundations & Relationships. Springer Nature. ISBN   978-3-031-02328-6.
  10. Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Score to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63. hdl:2328/27165.
  11. Powers, David M. W. (2012). "The Problem with Kappa" (PDF). Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop. Archived from the original (PDF) on 2016-05-18. Retrieved 2012-07-20.
  12. Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing". J. Neurolinguistics. 17 (2–3): 97–119. doi:10.1016/S0911-6044(03)00059-9. S2CID   17104364.
  13. David Hand (2012). "Assessing the Performance of Classification Methods". International Statistical Review . 80 (3): 400–414.
  14. William Cullerne Bown (2024). "Sensitivity and Specificity versus Precision and Recall, and Related Dilemmas". Journal of Classification .