Positive and negative predictive values

The positive and negative predictive values (PPV and NPV, respectively) are the proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively. [1] The PPV and NPV describe the performance of a diagnostic test or other statistical measure. A high value indicates that a result of the corresponding kind (positive for the PPV, negative for the NPV) is likely to be correct. The PPV and NPV are not intrinsic to the test (as the true positive rate and true negative rate are); they also depend on the prevalence. [2] Both PPV and NPV can be derived using Bayes' theorem.

Although sometimes used synonymously, a positive predictive value generally refers to what is established by control groups, while a post-test probability refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the positive predictive value, the two are numerically equal.

In information retrieval, the PPV statistic is often called the precision.

Definition

Positive predictive value (PPV)

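The positive predictive value (PPV), or precision, is defined as

$$\mathrm{PPV} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}} = \frac{TP}{TP + FP}$$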

where a "true positive" is the event that the test makes a positive prediction, and the subject has a positive result under the gold standard, and a "false positive" is the event that the test makes a positive prediction, and the subject has a negative result under the gold standard. The ideal value of the PPV, with a perfect test, is 1 (100%), and the worst possible value would be zero.

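The PPV can also be computed from sensitivity, specificity, and the prevalence of the condition (cf. Bayes' theorem):

$$\mathrm{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}$$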

The complement of the PPV is the false discovery rate (FDR):

$$\mathrm{FDR} = 1 - \mathrm{PPV} = \frac{FP}{TP + FP}$$
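As a minimal sketch of the two routes to the PPV described above, the Python below computes it once from raw counts and once from sensitivity, specificity, and prevalence. The function names and the input figures are illustrative assumptions, not part of any library or study.

```python
def ppv_from_counts(tp: int, fp: int) -> float:
    """PPV = TP / (TP + FP); undefined if the test never returns a positive."""
    if tp + fp == 0:
        raise ValueError("PPV is undefined when there are no positive results")
    return tp / (tp + fp)


def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' theorem from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence                    # P(test+ and condition+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(test+ and condition-)
    if true_pos + false_pos == 0:
        raise ValueError("PPV is undefined when no positive result is possible")
    return true_pos / (true_pos + false_pos)


def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR is the complement of the PPV."""
    return 1.0 - ppv_from_counts(tp, fp)


# Hypothetical inputs, chosen only to exercise the functions.
print(ppv_from_counts(tp=90, fp=10))
print(ppv_from_rates(sensitivity=0.90, specificity=0.95, prevalence=0.10))
```

Both routes agree whenever the counts come from a sample whose prevalence matches the one passed to `ppv_from_rates`.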

Negative predictive value (NPV)

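The negative predictive value is defined as:

$$\mathrm{NPV} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false negatives}} = \frac{TN}{TN + FN}$$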

where a "true negative" is the event that the test makes a negative prediction, and the subject has a negative result under the gold standard, and a "false negative" is the event that the test makes a negative prediction, and the subject has a positive result under the gold standard. With a perfect test, one which returns no false negatives, the value of the NPV is 1 (100%), and with a test which returns no true negatives the NPV value is zero.

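The NPV can also be computed from sensitivity, specificity, and prevalence:

$$\mathrm{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{\text{specificity} \times (1 - \text{prevalence}) + (1 - \text{sensitivity}) \times \text{prevalence}}$$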

The complement of the NPV is the false omission rate (FOR):

$$\mathrm{FOR} = 1 - \mathrm{NPV} = \frac{FN}{TN + FN}$$
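The NPV calculations mirror those for the PPV. The sketch below is illustrative only: the function names and the input figures are assumptions made for the example.

```python
def npv_from_counts(tn: int, fn: int) -> float:
    """NPV = TN / (TN + FN); undefined if the test never returns a negative."""
    if tn + fn == 0:
        raise ValueError("NPV is undefined when there are no negative results")
    return tn / (tn + fn)


def npv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """NPV via Bayes' theorem from sensitivity, specificity, and prevalence."""
    true_neg = specificity * (1.0 - prevalence)    # P(test- and condition-)
    false_neg = (1.0 - sensitivity) * prevalence   # P(test- and condition+)
    if true_neg + false_neg == 0:
        raise ValueError("NPV is undefined when no negative result is possible")
    return true_neg / (true_neg + false_neg)


def false_omission_rate(tn: int, fn: int) -> float:
    """FOR is the complement of the NPV."""
    return 1.0 - npv_from_counts(tn, fn)


# Hypothetical inputs, chosen only to exercise the functions.
print(npv_from_rates(sensitivity=0.90, specificity=0.95, prevalence=0.10))
```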

Although sometimes used synonymously, a negative predictive value generally refers to what is established by control groups, while a negative post-test probability refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the negative predictive value, then the two are numerically equal.

Relationship

The following diagram illustrates how the positive predictive value, negative predictive value, sensitivity, and specificity are related.

Sources: [3] [4] [5] [6] [7] [8] [9] [10] [11]

Total population = P + N

| Actual condition | Predicted positive (PP) | Predicted negative (PN) | Row-wise rates |
|---|---|---|---|
| Positive (P) [note 1] | True positive (TP), hit [note 2] | False negative (FN), type II error, miss, underestimation [note 3] | True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power = TP/P = 1 − FNR; False negative rate (FNR), miss rate = FN/P = 1 − TPR |
| Negative (N) [note 4] | False positive (FP), type I error, false alarm, overestimation [note 5] | True negative (TN), correct rejection [note 6] | False positive rate (FPR), probability of false alarm, fall-out = FP/N = 1 − TNR; True negative rate (TNR), specificity (SPC), selectivity = TN/N = 1 − FPR |
| Column-wise rates | Positive predictive value (PPV), precision = TP/PP = 1 − FDR; False discovery rate (FDR) = FP/PP = 1 − PPV | False omission rate (FOR) = FN/PN = 1 − NPV; Negative predictive value (NPV) = TN/PN = 1 − FOR | |

| Derived measure | Formula |
|---|---|
| Prevalence | P / (P + N) |
| Accuracy (ACC) | (TP + TN) / (P + N) |
| Balanced accuracy (BA) | (TPR + TNR) / 2 |
| Informedness, bookmaker informedness (BM) | TPR + TNR − 1 |
| Markedness (MK), deltaP (Δp) | PPV + NPV − 1 |
| Prevalence threshold (PT) | (√(TPR × FPR) − FPR) / (TPR − FPR) |
| Positive likelihood ratio (LR+) | TPR / FPR |
| Negative likelihood ratio (LR−) | FNR / TNR |
| Diagnostic odds ratio (DOR) | LR+ / LR− |
| F1 score | 2 × PPV × TPR / (PPV + TPR) = 2 TP / (2 TP + FP + FN) |
| Fowlkes–Mallows index (FM) | √(PPV × TPR) |
| Matthews correlation coefficient (MCC) | √(TPR × TNR × PPV × NPV) − √(FNR × FPR × FOR × FDR) |
| Threat score (TS), critical success index (CSI), Jaccard index | TP / (TP + FN + FP) |

Notes:

  1. the number of real positive cases in the data
  2. A test result that correctly indicates the presence of a condition or characteristic
  3. Type II error: A test result which wrongly indicates that a particular condition or attribute is absent
  4. the number of real negative cases in the data
  5. Type I error: A test result which wrongly indicates that a particular condition or attribute is present
  6. A test result that correctly indicates the absence of a condition or characteristic
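The relationships among these measures can be sketched in Python as a single function of the four cell counts. This is an illustrative sketch, not a reference implementation: the function name and dictionary keys are assumptions, and the code presumes that every row and column total is non-zero and that TPR differs from FPR.

```python
from math import sqrt


def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive summary measures from the four cells of a 2x2 confusion matrix."""
    p, n = tp + fn, fp + tn        # actual positives / negatives
    pp, pn = tp + fp, fn + tn      # predicted positives / negatives
    tpr, tnr = tp / p, tn / n      # sensitivity, specificity
    fnr, fpr = 1 - tpr, 1 - tnr
    ppv, npv = tp / pp, tn / pn
    fdr, fomr = 1 - ppv, 1 - npv   # false discovery / false omission rates
    return {
        "prevalence": p / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "balanced_accuracy": (tpr + tnr) / 2,
        "informedness": tpr + tnr - 1,
        "markedness": ppv + npv - 1,
        "prevalence_threshold": (sqrt(tpr * fpr) - fpr) / (tpr - fpr),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fowlkes_mallows": sqrt(ppv * tpr),
        "mcc": sqrt(tpr * tnr * ppv * npv) - sqrt(fnr * fpr * fomr * fdr),
        "threat_score": tp / (tp + fn + fp),
    }


# Hypothetical counts, chosen only to exercise the function.
for name, value in confusion_metrics(tp=40, fn=10, fp=20, tn=30).items():
    print(f"{name}: {value:.4f}")
```

For these counts, the rate-based MCC expression gives the same value as the count-based form (TP × TN − FP × FN) / √(PP × PN × P × N).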

Note that the positive and negative predictive values can only be estimated using data from a cross-sectional study or other population-based study in which valid prevalence estimates may be obtained. In contrast, the sensitivity and specificity can be estimated from case-control studies.
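A small numerical sketch may make the point concrete. It assumes a hypothetical test (sensitivity 80%, specificity 90%), a hypothetical 1:1 case-control sample, and a hypothetical population prevalence of 2%; none of these figures come from a real study.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV from sensitivity, specificity, and prevalence (Bayes' theorem)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)


sens, spec = 0.80, 0.90        # hypothetical test characteristics
cases, controls = 500, 500     # hypothetical case-control sample (50% "prevalence")

# Expected cell counts in the case-control sample.
tp = sens * cases
fp = (1 - spec) * controls
naive_ppv = tp / (tp + fp)                            # computed straight from the sample
population_ppv = ppv(sens, spec, prevalence=0.02)     # assumed true prevalence of 2%

print(f"PPV read off the case-control sample: {naive_ppv:.1%}")
print(f"PPV at the assumed population prevalence: {population_ppv:.1%}")
```

Sensitivity and specificity are the same in both calculations; only the case mix differs, and that alone moves the PPV from roughly 89% to roughly 14%.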

Worked example

Suppose the fecal occult blood (FOB) screen test is used in 2030 people to look for bowel cancer:

Fecal occult blood screen test outcome

Total population (pop.) = 2030

| Actual condition | Test outcome positive | Test outcome negative | Row-wise rates |
|---|---|---|---|
| Patients with bowel cancer (as confirmed on endoscopy): actual condition positive (AP) = 30 (2030 × 1.48%) | True positive (TP) = 20 (2030 × 1.48% × 67%) | False negative (FN) = 10 (2030 × 1.48% × (100% − 67%)) | True positive rate (TPR), recall, sensitivity = TP / AP = 20 / 30 ≈ 66.7%; False negative rate (FNR), miss rate = FN / AP = 10 / 30 ≈ 33.3% |
| Actual condition negative (AN) = 2000 (2030 × (100% − 1.48%)) | False positive (FP) = 180 (2030 × (100% − 1.48%) × (100% − 91%)) | True negative (TN) = 1820 (2030 × (100% − 1.48%) × 91%) | False positive rate (FPR), fall-out, probability of false alarm = FP / AN = 180 / 2000 = 9.0%; Specificity, selectivity, true negative rate (TNR) = TN / AN = 1820 / 2000 = 91% |
| Column-wise rates | Positive predictive value (PPV), precision = TP / (TP + FP) = 20 / (20 + 180) = 10%; False discovery rate (FDR) = FP / (TP + FP) = 180 / (20 + 180) = 90.0% | False omission rate (FOR) = FN / (FN + TN) = 10 / (10 + 1820) ≈ 0.55%; Negative predictive value (NPV) = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.45% | |

| Measure | Calculation | Value |
|---|---|---|
| Prevalence | AP / pop. = 30 / 2030 | ≈ 1.48% |
| Accuracy (ACC) | (TP + TN) / pop. = (20 + 1820) / 2030 | ≈ 90.64% |
| F1 score | 2 × precision × recall / (precision + recall) | ≈ 0.174 |
| Positive likelihood ratio (LR+) | TPR / FPR = (20 / 30) / (180 / 2000) | ≈ 7.41 |
| Negative likelihood ratio (LR−) | FNR / TNR = (10 / 30) / (1820 / 2000) | ≈ 0.366 |
| Diagnostic odds ratio (DOR) | LR+ / LR− | ≈ 20.2 |
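The figures in the table above can be recomputed from the four cell counts alone. The short Python sketch below does so; the variable names are illustrative.

```python
# Cell counts from the fecal occult blood example.
tp, fn, fp, tn = 20, 10, 180, 1820

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
lr_pos = sensitivity / (1 - specificity)    # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity    # negative likelihood ratio
dor = lr_pos / lr_neg                       # diagnostic odds ratio

print(f"PPV = {ppv:.2%}, NPV = {npv:.2%}")
# PPV = 10.00%, NPV = 99.45%
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
# LR+ = 7.41, LR- = 0.366, DOR = 20.2
```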

The small positive predictive value (PPV = 10%) indicates that many of the positive results from this testing procedure are false positives. Thus it will be necessary to follow up any positive result with a more reliable test to obtain a more accurate assessment as to whether cancer is present. Nevertheless, such a test may be useful if it is inexpensive and convenient. The strength of the FOB screen test lies instead in its negative predictive value: a negative result for an individual gives high confidence that the negative result is true.

Problems

Other individual factors

Note that the PPV is not intrinsic to the test; it also depends on the prevalence. [2] Because prevalence has such a large effect on predictive values, a standardized approach has been proposed in which the PPV is normalized to a prevalence of 50%. [12] For a fixed sensitivity and specificity, the PPV rises with the prevalence of the disease or condition, although not in direct proportion. In the above example, if the group of people tested had included a higher proportion of people with bowel cancer, then the PPV would have come out higher and the NPV lower. If everybody in the group had bowel cancer, the PPV would be 100% and the NPV 0%.
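The dependence on prevalence can be sketched by holding the example's sensitivity (20/30) and specificity (91%) fixed and varying only the prevalence. The function names and the chosen prevalence values are illustrative assumptions.

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value from sensitivity, specificity, prevalence."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))


def npv(sens: float, spec: float, prev: float) -> float:
    """Negative predictive value from sensitivity, specificity, prevalence."""
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)


SENS, SPEC = 20 / 30, 0.91   # test characteristics from the FOB example

# Example prevalence, two arbitrary higher values, and the limiting case.
for prev in (30 / 2030, 0.10, 0.50, 1.00):
    print(f"prevalence {prev:7.2%}: "
          f"PPV {ppv(SENS, SPEC, prev):7.2%}, NPV {npv(SENS, SPEC, prev):7.2%}")
```

The row at 50% prevalence corresponds to the normalization mentioned above. Doubling the prevalence from 30/2030 to 60/2030 raises the PPV from 10% to about 18%, not 20%, which shows that the relationship is monotonic but not proportional.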

To overcome this problem, NPV and PPV should only be used if the ratio of the number of patients in the disease group to the number of patients in the healthy control group used to establish them is equivalent to the prevalence of the disease in the studied population. Where two disease groups are compared, the ratio of the number of patients in disease group 1 to the number of patients in disease group 2 should be equivalent to the ratio of the prevalences of the two diseases studied. Otherwise, positive and negative likelihood ratios are more accurate than NPV and PPV, because likelihood ratios do not depend on prevalence.

When an individual being tested has a different pre-test probability of having a condition than the control groups used to establish the PPV and NPV, the PPV and NPV are generally distinguished from the positive and negative post-test probabilities: the PPV and NPV refer to the values established by the control groups, and the post-test probabilities refer to the values for the tested individual (as estimated, for example, by likelihood ratios). In such cases it is preferable to study a large group of equivalent individuals in order to establish separate positive and negative predictive values for use of the test in such individuals.
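One common way to obtain a post-test probability for an individual is to convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. The sketch below assumes this odds form of Bayes' theorem; the function name and the 20% pre-test probability are illustrative assumptions.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a pre-test probability with a likelihood ratio (odds form of Bayes)."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


LR_POS = (20 / 30) / 0.09   # positive likelihood ratio of the FOB example

# Individual whose pre-test probability equals the example's prevalence:
# the positive post-test probability coincides with the PPV (10%).
print(post_test_probability(30 / 2030, LR_POS))

# Hypothetical higher-risk individual with a 20% pre-test probability.
print(post_test_probability(0.20, LR_POS))
```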

Bayesian updating

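Bayes' theorem confers inherent limitations on the accuracy of screening tests as a function of disease prevalence or pre-test probability. It has been shown that a testing system can tolerate significant drops in prevalence up to a certain well-defined point known as the prevalence threshold, below which the reliability of a positive screening test drops precipitously. That said, Balayla et al. [13] showed that sequential testing overcomes the aforementioned Bayesian limitations and thus improves the reliability of screening tests. For a desired positive predictive value $\rho$ that approaches some constant $k$, the number of positive test iterations $n_i$ needed is:

$$n_i = \lim_{\rho \to k} \left\lceil \frac{\ln\left[\dfrac{\rho(\phi - 1)}{\phi(\rho - 1)}\right]}{\ln\left[\dfrac{a}{1 - b}\right]} \right\rceil$$

where $n_i$ is the number of testing iterations necessary to achieve $\rho$, $a$ is the sensitivity, $b$ is the specificity, $\phi$ is the disease prevalence, and $k$ is a constant.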

Of note, the denominator of the above equation is the natural logarithm of the positive likelihood ratio (LR+).
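The iteration count can be sketched in Python from the odds form of Bayes' theorem: each positive result multiplies the odds by LR+, so the required number of positives is the log of the odds ratio between target and prior, divided by ln(LR+), rounded up. This sketch assumes that repeated tests are conditionally independent with unchanged sensitivity and specificity; the function name is illustrative and the code is not taken from the cited paper.

```python
from math import ceil, log


def positive_iterations_needed(sensitivity: float, specificity: float,
                               prevalence: float, target_ppv: float) -> int:
    """Smallest number of consecutive positive results reaching the target PPV."""
    lr_pos = sensitivity / (1.0 - specificity)
    if lr_pos <= 1.0:
        raise ValueError("a positive result must raise the odds (LR+ > 1)")
    prior_odds = prevalence / (1.0 - prevalence)
    target_odds = target_ppv / (1.0 - target_ppv)
    # Each positive result multiplies the odds by LR+.
    n = log(target_odds / prior_odds) / log(lr_pos)
    return max(0, ceil(n))


# FOB example: sensitivity 20/30, specificity 91%, prevalence 30/2030.
print(positive_iterations_needed(20 / 30, 0.91, 30 / 2030, target_ppv=0.95))  # 4
```

With the example's figures, one positive result gives a PPV of 10%, and four consecutive positives are needed before the PPV exceeds 95%.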

Different target conditions

PPV is used to indicate the probability that, given a positive test, the patient really has the specified disease. However, there may be more than one cause for a disease, and any single potential cause may not always result in the overt disease seen in a patient. It is possible to mix up related target conditions of PPV and NPV, for example by interpreting the PPV or NPV of a test as the probability of having a disease when that value actually refers only to a predisposition to that disease.

An example is the microbiological throat swab used in patients with a sore throat. Publications stating the PPV of a throat swab usually report the probability that a given bacterium is present in the throat, rather than the probability that the patient is ill from the bacteria found. If the presence of this bacterium always resulted in a sore throat, then the PPV would be very useful. However, the bacteria may colonise individuals in a harmless way and never result in infection or disease. Sore throats occurring in these individuals are caused by other agents such as a virus. In this situation the gold standard used in the evaluation study represents only the presence of bacteria (which might be harmless), not a causal bacterial sore throat illness. It can be proven that this problem will affect the positive predictive value far more than the negative predictive value. [14] To evaluate diagnostic tests where the gold standard looks only at potential causes of disease, one may use an extension of the predictive value termed the Etiologic Predictive Value. [15] [16]

References

  1. Fletcher, Robert H.; Fletcher, Suzanne W. (2005). Clinical epidemiology: the essentials (4th ed.). Baltimore, Md.: Lippincott Williams & Wilkins. p. 45. ISBN 0-7817-5215-9.
  2. Altman, DG; Bland, JM (1994). "Diagnostic tests 2: Predictive values". BMJ. 309 (6947): 102. doi:10.1136/bmj.309.6947.102. PMC 2540558. PMID 8038641.
  3. Balayla, Jacques (2020). "Prevalence threshold (ϕe) and the geometry of screening curves". PLOS ONE. 15 (10): e0240215. doi:10.1371/journal.pone.0240215. PMID 33027310.
  4. Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.
  5. Piryonesi, S. Madeh; El-Diraby, Tamer E. (2020-03-01). "Data Analytics in Asset Management: Cost-Effective Prediction of the Pavement Condition Index". Journal of Infrastructure Systems. 26 (1): 04019036. doi:10.1061/(ASCE)IS.1943-555X.0000512. S2CID 213782055.
  6. Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
  7. Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN 978-0-387-30164-8.
  8. Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
  9. Chicco, D.; Jurman, G. (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1): 6-1–6-13. doi:10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477.
  10. Chicco, D.; Toetsch, N.; Jurman, G. (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13. doi:10.1186/s13040-021-00244-z. PMC 7863449. PMID 33541410.
  11. Tharwat, A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi:10.1016/j.aci.2018.08.003.
  12. Heston, Thomas F. (2011). "Standardizing predictive values in diagnostic imaging research". Journal of Magnetic Resonance Imaging. 33 (2): 505, author reply 506–7. doi:10.1002/jmri.22466. PMID 21274995.
  13. Balayla, Jacques (2020). "Bayesian Updating and Sequential Testing: Overcoming Inferential Limitations of Screening Tests". arXiv. https://arxiv.org/abs/2006.11641.
  14. Orda, Ulrich; Gunnarsson, Ronny K; Orda, Sabine; Fitzgerald, Mark; Rofe, Geoffry; Dargan, Anna (2016). "Etiologic predictive value of a rapid immunoassay for the detection of group A Streptococcus antigen from throat swabs in patients presenting with a sore throat" (PDF). International Journal of Infectious Diseases. 45 (April): 32–5. doi:10.1016/j.ijid.2016.02.002. PMID 26873279.
  15. Gunnarsson, Ronny K.; Lanke, Jan (2002). "The predictive value of microbiologic diagnostic tests if asymptomatic carriers are present". Statistics in Medicine. 21 (12): 1773–85. doi:10.1002/sim.1119. PMID 12111911. S2CID 26163122.
  16. Gunnarsson, Ronny K. "EPV Calculator". Science Network TV.