Positive and negative predictive values

The positive and negative predictive values (PPV and NPV respectively) are the proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively. [1] The PPV and NPV describe the performance of a diagnostic test or other statistical measure. A high value indicates that a result of that kind (positive for the PPV, negative for the NPV) is likely to be correct. The PPV and NPV are not intrinsic to the test (as the true positive rate and true negative rate are); they also depend on the prevalence. [2] Both PPV and NPV can be derived using Bayes' theorem.

Although sometimes used synonymously, a positive predictive value generally refers to what is established by control groups, while a post-test probability refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the positive predictive value, the two are numerically equal.

In information retrieval, the PPV statistic is often called the precision.

Definition

Positive predictive value (PPV)

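The positive predictive value (PPV), or precision, is defined as

$$\mathrm{PPV} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}}$$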

where a "true positive" is the event that the test makes a positive prediction, and the subject has a positive result under the gold standard, and a "false positive" is the event that the test makes a positive prediction, and the subject has a negative result under the gold standard. The ideal value of the PPV, with a perfect test, is 1 (100%), and the worst possible value would be zero.

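The PPV can also be computed from sensitivity, specificity, and the prevalence of the condition (cf. Bayes' theorem):

$$\mathrm{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}$$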
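The link to Bayes' theorem can be spelled out in one step. In this sketch, $D$ denotes "the subject has the condition" and $T^{+}$ "the test is positive"; both symbols are introduced here for illustration and do not appear in the original text. Sensitivity is $P(T^{+} \mid D)$, specificity is $P(T^{-} \mid \bar{D})$, and prevalence is $P(D)$.

```latex
\begin{aligned}
\mathrm{PPV} = P(D \mid T^{+})
  &= \frac{P(T^{+} \mid D)\,P(D)}
          {P(T^{+} \mid D)\,P(D) + P(T^{+} \mid \bar{D})\,P(\bar{D})} \\
  &= \frac{\text{sensitivity} \times \text{prevalence}}
          {\text{sensitivity} \times \text{prevalence}
           + (1 - \text{specificity}) \times (1 - \text{prevalence})}
\end{aligned}
```

The second line only substitutes $P(T^{+} \mid \bar{D}) = 1 - \text{specificity}$ and $P(\bar{D}) = 1 - \text{prevalence}$.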

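The complement of the PPV is the false discovery rate (FDR):

$$\mathrm{FDR} = 1 - \mathrm{PPV} = \frac{\text{number of false positives}}{\text{number of true positives} + \text{number of false positives}}$$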

Negative predictive value (NPV)

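The negative predictive value is defined as:

$$\mathrm{NPV} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false negatives}}$$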

where a "true negative" is the event that the test makes a negative prediction, and the subject has a negative result under the gold standard, and a "false negative" is the event that the test makes a negative prediction, and the subject has a positive result under the gold standard. With a perfect test, one which returns no false negatives, the value of the NPV is 1 (100%), and with a test which returns no true negatives the NPV value is zero.

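The NPV can also be computed from sensitivity, specificity, and prevalence:

$$\mathrm{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{\text{specificity} \times (1 - \text{prevalence}) + (1 - \text{sensitivity}) \times \text{prevalence}}$$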

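The complement of the NPV is the false omission rate (FOR):

$$\mathrm{FOR} = 1 - \mathrm{NPV} = \frac{\text{number of false negatives}}{\text{number of true negatives} + \text{number of false negatives}}$$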
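As a minimal sketch, the rate-based calculations of PPV and NPV and their complements can be written as two small Python functions. The function names and the input values are illustrative assumptions, not taken from any study; the expressions follow from Bayes' theorem applied to sensitivity, specificity, and prevalence.

```python
def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(condition present | positive test), via Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(T+ and D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(T+ and not D)
    # Undefined (division by zero) if the test can never be positive.
    return true_pos / (true_pos + false_pos)


def npv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(condition absent | negative test), via Bayes' theorem."""
    true_neg = specificity * (1.0 - prevalence)            # P(T- and not D)
    false_neg = (1.0 - sensitivity) * prevalence           # P(T- and D)
    # Undefined (division by zero) if the test can never be negative.
    return true_neg / (true_neg + false_neg)


# Hypothetical inputs chosen only for illustration.
sens, spec, prev = 0.90, 0.95, 0.02

ppv = ppv_from_rates(sens, spec, prev)
npv = npv_from_rates(sens, spec, prev)
fdr = 1.0 - ppv                    # false discovery rate
false_omission_rate = 1.0 - npv    # false omission rate (FOR)

print(f"PPV={ppv:.4f}  FDR={fdr:.4f}  NPV={npv:.4f}  FOR={false_omission_rate:.4f}")
```

Even with a fairly accurate hypothetical test, the low prevalence keeps the PPV well below the sensitivity, while the NPV stays close to 1.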

Although sometimes used synonymously, a negative predictive value generally refers to what is established by control groups, while a negative post-test probability rather refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the negative predictive value, then the two are numerically equal.

Relationship

The following diagram illustrates how the positive predictive value, negative predictive value, sensitivity, and specificity are related.

Sources: [3] [4] [5] [6] [7] [8] [9] [10]

Confusion matrix (total population = P + N):

| Actual condition | Predicted positive (PP) | Predicted negative (PN) |
|---|---|---|
| Positive (P) [note 1] | True positive (TP), hit [note 2] | False negative (FN), miss, underestimation |
| Negative (N) [note 4] | False positive (FP), false alarm, overestimation | True negative (TN), correct rejection [note 5] |

Measures derived from the confusion matrix:

| Measure | Definition |
|---|---|
| True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power | TP / P = 1 − FNR |
| False negative rate (FNR), miss rate, type II error [note 3] | FN / P = 1 − TPR |
| False positive rate (FPR), probability of false alarm, fall-out, type I error [note 6] | FP / N = 1 − TNR |
| True negative rate (TNR), specificity (SPC), selectivity | TN / N = 1 − FPR |
| Prevalence | P / (P + N) |
| Positive predictive value (PPV), precision | TP / PP = 1 − FDR |
| False discovery rate (FDR) | FP / PP = 1 − PPV |
| False omission rate (FOR) | FN / PN = 1 − NPV |
| Negative predictive value (NPV) | TN / PN = 1 − FOR |
| Positive likelihood ratio (LR+) | TPR / FPR |
| Negative likelihood ratio (LR−) | FNR / TNR |
| Diagnostic odds ratio (DOR) | LR+ / LR− |
| Accuracy (ACC) | (TP + TN) / (P + N) |
| Balanced accuracy (BA) | (TPR + TNR) / 2 |
| Informedness, bookmaker informedness (BM) | TPR + TNR − 1 |
| Markedness (MK), deltaP (Δp) | PPV + NPV − 1 |
| Prevalence threshold (PT) | (√(TPR × FPR) − FPR) / (TPR − FPR) |
| F1 score | 2 × PPV × TPR / (PPV + TPR) = 2 TP / (2 TP + FP + FN) |
| Fowlkes–Mallows index (FM) | √(PPV × TPR) |
| Matthews correlation coefficient (MCC) | √(TPR × TNR × PPV × NPV) − √(FNR × FPR × FOR × FDR) |
| Threat score (TS), critical success index (CSI), Jaccard index | TP / (TP + FN + FP) |

Notes:

  1. The number of real positive cases in the data.
  2. A test result that correctly indicates the presence of a condition or characteristic.
  3. Type II error: a test result which wrongly indicates that a particular condition or attribute is absent.
  4. The number of real negative cases in the data.
  5. A test result that correctly indicates the absence of a condition or characteristic.
  6. Type I error: a test result which wrongly indicates that a particular condition or attribute is present.

Note that the positive and negative predictive values can only be estimated using data from a cross-sectional study or other population-based study in which valid prevalence estimates may be obtained. In contrast, the sensitivity and specificity can be estimated from case-control studies.

Worked example

Suppose the fecal occult blood (FOB) screen test is used in 2030 people to look for bowel cancer:

Fecal occult blood screen test outcome (total population, pop. = 2030):

| Actual condition | Test outcome positive | Test outcome negative |
|---|---|---|
| Patients with bowel cancer (as confirmed on endoscopy): actual condition positive (AP) = 30 (2030 × 1.48%) | True positive (TP) = 20 (2030 × 1.48% × 67%) | False negative (FN) = 10 (2030 × 1.48% × (100% − 67%)) |
| Actual condition negative (AN) = 2000 (2030 × (100% − 1.48%)) | False positive (FP) = 180 (2030 × (100% − 1.48%) × (100% − 91%)) | True negative (TN) = 1820 (2030 × (100% − 1.48%) × 91%) |

Measures derived from these counts:

| Measure | Calculation | Value |
|---|---|---|
| Prevalence | AP / pop. = 30 / 2030 | 1.48% |
| True positive rate (TPR), recall, sensitivity | TP / AP = 20 / 30 | 66.7% |
| False negative rate (FNR), miss rate | FN / AP = 10 / 30 | 33.3% |
| False positive rate (FPR), fall-out, probability of false alarm | FP / AN = 180 / 2000 | 9.0% |
| Specificity, selectivity, true negative rate (TNR) | TN / AN = 1820 / 2000 | 91% |
| Positive predictive value (PPV), precision | TP / (TP + FP) = 20 / (20 + 180) | 10% |
| False discovery rate (FDR) | FP / (TP + FP) = 180 / (20 + 180) | 90.0% |
| Negative predictive value (NPV) | TN / (FN + TN) = 1820 / (10 + 1820) | 99.45% |
| False omission rate (FOR) | FN / (FN + TN) = 10 / (10 + 1820) | 0.55% |
| Positive likelihood ratio (LR+) | TPR / FPR = (20 / 30) / (180 / 2000) | 7.41 |
| Negative likelihood ratio (LR−) | FNR / TNR = (10 / 30) / (1820 / 2000) | 0.366 |
| Diagnostic odds ratio (DOR) | LR+ / LR− | 20.2 |
| Accuracy (ACC) | (TP + TN) / pop. = (20 + 1820) / 2030 | 90.64% |
| F1 score | 2 × precision × recall / (precision + recall) | 0.174 |

The small positive predictive value (PPV = 10%) indicates that many of the positive results from this testing procedure are false positives. Thus it will be necessary to follow up any positive result with a more reliable test to obtain a more accurate assessment as to whether cancer is present. Nevertheless, such a test may be useful if it is inexpensive and convenient. The strength of the FOB screen test lies instead in its negative predictive value (NPV = 99.45%): a negative result for an individual gives high confidence that the individual does not have bowel cancer.
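The figures in this worked example can be reproduced from the four cell counts alone. The following is a minimal Python sketch; the variable names are chosen here for illustration, and the counts are the ones stated in the example (TP = 20, FN = 10, FP = 180, TN = 1820).

```python
# Cell counts from the FOB screening example.
tp, fn, fp, tn = 20, 10, 180, 1820

actual_pos = tp + fn                  # people with bowel cancer
actual_neg = fp + tn                  # people without bowel cancer
population = actual_pos + actual_neg  # everyone screened

sensitivity = tp / actual_pos
specificity = tn / actual_neg
prevalence = actual_pos / population

# Predictive values computed directly from counts.
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
fdr = 1 - ppv
false_omission_rate = 1 - npv

# Likelihood ratios and summary measures.
lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity
dor = lr_pos / lr_neg
accuracy = (tp + tn) / population
f1 = 2 * tp / (2 * tp + fp + fn)

# Cross-check: PPV recomputed from sensitivity, specificity, and prevalence.
ppv_bayes = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)

for name, value in [
    ("prevalence", prevalence), ("sensitivity", sensitivity),
    ("specificity", specificity), ("PPV", ppv), ("NPV", npv),
    ("FDR", fdr), ("FOR", false_omission_rate), ("LR+", lr_pos),
    ("LR-", lr_neg), ("DOR", dor), ("accuracy", accuracy), ("F1", f1),
]:
    print(f"{name:<12}{value:.4f}")
```

The count-based PPV and the Bayes-based `ppv_bayes` agree because the prevalence in the screened group is, by construction, the prevalence used in the formula.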

Problems

Other individual factors

Note that the PPV is not intrinsic to the test; it depends also on the prevalence. [2] Due to the large effect of prevalence upon predictive values, a standardized approach has been proposed, where the PPV is normalized to a prevalence of 50%. [11] For fixed sensitivity and specificity, the PPV increases with the prevalence of the disease or condition (although not in direct proportion), while the NPV decreases. In the above example, if the group of people tested had included a higher proportion of people with bowel cancer, then the PPV would have been higher and the NPV lower. If everybody in the group had bowel cancer, the PPV would be 100% and the NPV 0%.
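The effect of prevalence can be illustrated numerically. The sketch below holds the sensitivity and specificity of the FOB example fixed (20/30 and 1820/2000) and recomputes both predictive values for several prevalences; the prevalence values other than 30/2030 are hypothetical and chosen only for illustration.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for the given test characteristics and prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)


SENSITIVITY = 20 / 30       # from the FOB example
SPECIFICITY = 1820 / 2000   # from the FOB example

# 30/2030 is the example's prevalence; the other values are hypothetical.
prevalences = [30 / 2030, 0.05, 0.25, 0.50, 1.00]
rows = [(p, *predictive_values(SENSITIVITY, SPECIFICITY, p)) for p in prevalences]

print(f"{'prevalence':>10} {'PPV':>8} {'NPV':>8}")
for p, ppv, npv in rows:
    print(f"{p:>10.4f} {ppv:>8.4f} {npv:>8.4f}")
```

In this sweep the PPV rises and the NPV falls as prevalence grows, reaching 1 and 0 when everyone has the condition. Doubling the prevalence from 0.25 to 0.50 does not double the PPV, so the relationship is monotonic rather than proportional.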

To overcome this problem, NPV and PPV should only be used if the ratio of the number of patients in the disease group to the number of patients in the healthy control group used to establish them is equivalent to the prevalence of the disease in the studied population or, where two disease groups are compared, if the ratio of the number of patients in disease group 1 to the number of patients in disease group 2 is equivalent to the ratio of the prevalences of the two diseases studied. Otherwise, positive and negative likelihood ratios are more accurate than NPV and PPV, because likelihood ratios do not depend on prevalence.

When an individual being tested has a different pre-test probability of having a condition than the control groups used to establish the PPV and NPV, the PPV and NPV are generally distinguished from the positive and negative post-test probabilities: the PPV and NPV refer to the values established by the control groups, and the post-test probabilities refer to the values for the tested individual (as estimated, for example, by likelihood ratios). Preferably, in such cases, a large group of equivalent individuals should be studied, in order to establish separate positive and negative predictive values for use of the test in such individuals.
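A post-test probability for an individual can be estimated from a likelihood ratio by converting the pre-test probability to odds, multiplying by the ratio, and converting back. The following Python sketch assumes that standard odds form of Bayes' theorem; the function name and the 10% pre-test probability are illustrative assumptions, while the likelihood ratios are those of the FOB example.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Probability of the condition after a test result with the given likelihood ratio.

    Requires 0 <= pre_test_probability < 1 (odds are undefined at exactly 1).
    """
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Likelihood ratios from the FOB example.
LR_POS = (20 / 30) / (180 / 2000)
LR_NEG = (10 / 30) / (1820 / 2000)

# With the study prevalence as pre-test probability, the positive post-test
# probability matches the PPV and the negative one matches the FOR (1 - NPV).
study_prevalence = 30 / 2030
positive_at_prevalence = post_test_probability(study_prevalence, LR_POS)
negative_at_prevalence = post_test_probability(study_prevalence, LR_NEG)

# Hypothetical individual with a higher pre-test probability of 10%.
positive_for_individual = post_test_probability(0.10, LR_POS)

print(f"positive result, pre-test = prevalence: {positive_at_prevalence:.4f}")
print(f"negative result, pre-test = prevalence: {negative_at_prevalence:.4f}")
print(f"positive result, pre-test = 10%:        {positive_for_individual:.4f}")
```

The first two values show the numerical equality between predictive values and post-test probabilities when the individual's pre-test probability equals the study prevalence; the third shows how they diverge when it does not.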

Bayesian updating

Bayes' theorem confers inherent limitations on the accuracy of screening tests as a function of disease prevalence or pre-test probability. It has been shown that a testing system can tolerate significant drops in prevalence, up to a certain well-defined point known as the prevalence threshold, below which the reliability of a positive screening test drops precipitously. That said, Balayla et al. [12] showed that sequential testing overcomes the aforementioned Bayesian limitations and thus improves the reliability of screening tests. For a desired positive predictive value $\rho$, where $\rho < 1$, that approaches some constant $k$, the number of positive test iterations $n_i$ needed is:

$$n_i = \lim_{\rho \to k} \left\lceil \frac{\ln\left[\dfrac{\rho(\phi - 1)}{\phi(\rho - 1)}\right]}{\ln\left[\dfrac{a}{1 - b}\right]} \right\rceil$$

where $a$ is the sensitivity, $b$ is the specificity, $\phi$ is the disease prevalence, and $k$ is a constant.

Of note, the denominator of the above equation is the natural logarithm of the positive likelihood ratio (LR+). Also, note that a critical assumption is that the tests must be independent. As described by Balayla et al., [12] repeating the same test may violate this independence assumption; in fact, "A more natural and reliable method to enhance the positive predictive value would be, when available, to use a different test with different parameters altogether after an initial positive result is obtained." [12]
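The sequential-testing idea can be sketched in Python using the odds form of Bayes' theorem: each independent positive result multiplies the odds of disease by LR+. The function names below are illustrative, the prevalence threshold is taken to be (√(TPR × FPR) − FPR) / (TPR − FPR), and the 95% target PPV is a hypothetical choice. The test parameters are those of the FOB example, and the independence of repeated tests is assumed, which may not hold in practice.

```python
import math


def prevalence_threshold(sensitivity: float, specificity: float) -> float:
    """Prevalence below which the PPV of a single test falls off sharply."""
    fpr = 1 - specificity
    return (math.sqrt(sensitivity * fpr) - fpr) / (sensitivity - fpr)


def ppv_after_n_positives(n: int, prevalence: float,
                          sensitivity: float, specificity: float) -> float:
    """PPV after n consecutive, independent positive results."""
    lr_pos = sensitivity / (1 - specificity)
    odds = prevalence / (1 - prevalence) * lr_pos ** n
    return odds / (1 + odds)


def positive_iterations_needed(target_ppv: float, prevalence: float,
                               sensitivity: float, specificity: float) -> int:
    """Smallest number of independent positive results reaching target_ppv.

    Assumes 0 < prevalence < 1, 0 < target_ppv < 1, and LR+ > 1.
    """
    lr_pos = sensitivity / (1 - specificity)
    target_odds = target_ppv / (1 - target_ppv)
    pre_odds = prevalence / (1 - prevalence)
    n = math.log(target_odds / pre_odds) / math.log(lr_pos)
    return max(0, math.ceil(n))


# FOB example parameters; the 95% target is a hypothetical choice.
SENS, SPEC, PREV = 20 / 30, 1820 / 2000, 30 / 2030

needed = positive_iterations_needed(0.95, PREV, SENS, SPEC)
print("prevalence threshold:", round(prevalence_threshold(SENS, SPEC), 4))
print("positive results needed for PPV >= 0.95:", needed)
for n in range(1, needed + 1):
    print(n, round(ppv_after_n_positives(n, PREV, SENS, SPEC), 4))
```

Because the example's prevalence lies far below the computed threshold, a single positive result yields a low PPV, and several independent positive results are needed to reach the target.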

Different target conditions

PPV is used to indicate the probability that, in case of a positive test, the patient really has the specified disease. However, there may be more than one cause for a disease, and any single potential cause may not always result in the overt disease seen in a patient. There is potential to mix up related target conditions of PPV and NPV, such as interpreting the PPV or NPV of a test as referring to having a disease when that value actually refers only to a predisposition to that disease. [13]

An example is the microbiological throat swab used in patients with a sore throat. Usually, publications stating the PPV of a throat swab are reporting on the probability that the target bacterium is present in the throat, rather than that the patient is ill from the bacteria found. If the presence of this bacterium always resulted in a sore throat, then the PPV would be very useful. However, the bacteria may colonise individuals in a harmless way and never result in infection or disease. Sore throats occurring in these individuals are caused by other agents such as a virus. In this situation the gold standard used in the evaluation study represents only the presence of bacteria (that might be harmless) but not a causal bacterial sore throat illness. It can be proven that this problem will affect positive predictive value far more than negative predictive value. [14] To evaluate diagnostic tests where the gold standard looks only at potential causes of disease, one may use an extension of the predictive value termed the Etiologic Predictive Value. [13] [15]

References

  1. Fletcher, Robert H.; Fletcher, Suzanne W. (2005). Clinical epidemiology: the essentials (4th ed.). Baltimore, Md.: Lippincott Williams & Wilkins. p. 45. ISBN 0-7817-5215-9.
  2. Altman, DG; Bland, JM (1994). "Diagnostic tests 2: Predictive values". BMJ. 309 (6947): 102. doi:10.1136/bmj.309.6947.102. PMC 2540558. PMID 8038641.
  3. Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.
  4. Provost, Foster; Fawcett, Tom (2013-08-01). "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking". O'Reilly Media, Inc.
  5. Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
  6. Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN 978-0-387-30164-8.
  7. Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
  8. Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1): 6-1–6-13. doi:10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477.
  9. Chicco D, Toetsch N, Jurman G (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13. doi:10.1186/s13040-021-00244-z. PMC 7863449. PMID 33541410.
  10. Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi:10.1016/j.aci.2018.08.003.
  11. Heston, Thomas F. (2011). "Standardizing predictive values in diagnostic imaging research". Journal of Magnetic Resonance Imaging. 33 (2): 505, author reply 506–7. doi:10.1002/jmri.22466. PMID 21274995.
  12. Balayla, Jacques (2022). "Bayesian Updating and Sequential Testing: Overcoming Inferential Limitations of Screening Tests". BMC Med Inform Decis Mak. 22: 6. doi:10.1186/s12911-021-01738-w.
  13. Gunnarsson, Ronny K.; Lanke, Jan (2002). "The predictive value of microbiologic diagnostic tests if asymptomatic carriers are present". Statistics in Medicine. 21 (12): 1773–85. doi:10.1002/sim.1119. PMID 12111911. S2CID 26163122.
  14. Orda, Ulrich; Gunnarsson, Ronny K; Orda, Sabine; Fitzgerald, Mark; Rofe, Geoffry; Dargan, Anna (2016). "Etiologic predictive value of a rapid immunoassay for the detection of group A Streptococcus antigen from throat swabs in patients presenting with a sore throat" (PDF). International Journal of Infectious Diseases. 45 (April): 32–5. doi:10.1016/j.ijid.2016.02.002. PMID 26873279.
  15. Gunnarsson, Ronny K. "EPV Calculator". Science Network TV.