The positive and negative predictive values (PPV and NPV respectively) are the proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively. [1] The PPV and NPV describe the performance of a diagnostic test or other statistical measure. A high value indicates that a result of the corresponding kind (positive for PPV, negative for NPV) is likely to be correct. The PPV and NPV are not intrinsic to the test (as the true positive rate and true negative rate are); they also depend on the prevalence. [2] Both PPV and NPV can be derived using Bayes' theorem.
Although sometimes used synonymously, a positive predictive value generally refers to what is established by control groups, while a post-test probability refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the positive predictive value, the two are numerically equal.
In information retrieval, the PPV statistic is often called the precision.
The positive predictive value (PPV), or precision, is defined as

$$\mathrm{PPV} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}} = \frac{TP}{TP + FP}$$

where a "true positive" is the event that the test makes a positive prediction and the subject has a positive result under the gold standard, and a "false positive" is the event that the test makes a positive prediction and the subject has a negative result under the gold standard. The ideal value of the PPV, with a perfect test, is 1 (100%), and the worst possible value is zero.
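As a minimal sketch of the count-based definition, TP / (TP + FP), the helper below computes the PPV from the two counts. The function name `ppv_from_counts` and the choice to return `None` when there are no positive predictions are illustrative assumptions, not part of any standard library.

```python
from typing import Optional


def ppv_from_counts(true_pos: int, false_pos: int) -> Optional[float]:
    """Return TP / (TP + FP), or None when the test produced no positives."""
    if true_pos < 0 or false_pos < 0:
        raise ValueError("counts must be non-negative")
    predicted_pos = true_pos + false_pos
    if predicted_pos == 0:
        # Assumption of this sketch: with no positive predictions the PPV
        # is treated as undefined rather than as 0 or 1.
        return None
    return true_pos / predicted_pos


# 20 true positives among 200 positive calls
print(ppv_from_counts(20, 180))
```

A perfect test (no false positives) yields 1.0, and a test whose positive calls are all wrong yields 0.0, matching the ideal and worst values described above.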
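The PPV can also be computed from sensitivity, specificity, and the prevalence of the condition (cf. Bayes' theorem):

$$\mathrm{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}$$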
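The Bayes form can be sketched in a few lines. The numerator is the probability mass of true positives (sensitivity × prevalence) and the denominator adds the mass of false positives ((1 − specificity) × (1 − prevalence)). The function name `ppv_from_rates` is hypothetical; the input figures are those of the fecal occult blood example later in this article.

```python
def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' theorem from sensitivity, specificity and prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_pos_mass = sensitivity * prevalence
    false_pos_mass = (1.0 - specificity) * (1.0 - prevalence)
    positive_mass = true_pos_mass + false_pos_mass
    if positive_mass == 0.0:
        raise ValueError("PPV is undefined: the test never returns a positive")
    return true_pos_mass / positive_mass


# Sensitivity 20/30, specificity 91%, prevalence 30/2030
print(round(ppv_from_rates(20 / 30, 0.91, 30 / 2030), 4))
```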
The complement of the PPV is the false discovery rate (FDR):

$$\mathrm{FDR} = 1 - \mathrm{PPV} = \frac{FP}{TP + FP}$$
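The negative predictive value is defined as:

$$\mathrm{NPV} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false negatives}} = \frac{TN}{TN + FN}$$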
where a "true negative" is the event that the test makes a negative prediction, and the subject has a negative result under the gold standard, and a "false negative" is the event that the test makes a negative prediction, and the subject has a positive result under the gold standard. With a perfect test, one which returns no false negatives, the value of the NPV is 1 (100%), and with a test which returns no true negatives the NPV value is zero.
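The NPV can also be computed from sensitivity, specificity, and prevalence:

$$\mathrm{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{\text{specificity} \times (1 - \text{prevalence}) + (1 - \text{sensitivity}) \times \text{prevalence}}$$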
The complement of the NPV is the false omission rate (FOR):

$$\mathrm{FOR} = 1 - \mathrm{NPV} = \frac{FN}{TN + FN}$$
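The NPV mirrors the PPV with the roles of the two classes exchanged, and the false omission rate is simply its complement. The sketch below assumes hypothetical function names (`npv_from_rates`, `false_omission_rate`) and again borrows the figures of the fecal occult blood example given later in this article.

```python
def npv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """NPV from sensitivity, specificity and prevalence."""
    true_neg_mass = specificity * (1.0 - prevalence)
    false_neg_mass = (1.0 - sensitivity) * prevalence
    negative_mass = true_neg_mass + false_neg_mass
    if negative_mass == 0.0:
        raise ValueError("NPV is undefined: the test never returns a negative")
    return true_neg_mass / negative_mass


def false_omission_rate(sensitivity: float, specificity: float, prevalence: float) -> float:
    """FOR as the complement of the NPV."""
    return 1.0 - npv_from_rates(sensitivity, specificity, prevalence)


npv = npv_from_rates(20 / 30, 0.91, 30 / 2030)
f_or = false_omission_rate(20 / 30, 0.91, 30 / 2030)
print(f"NPV = {npv:.4%}, FOR = {f_or:.4%}")
```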
Although sometimes used synonymously, a negative predictive value generally refers to what is established by control groups, while a negative post-test probability rather refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the negative predictive value, then the two are numerically equal.
The following diagram illustrates how the positive predictive value, negative predictive value, sensitivity, and specificity are related.
Predicted condition (columns 2 and 3) against actual condition (rows 2 and 3). Sources: [3] [4] [5] [6] [7] [8] [9] [10] [11]
| Total population = P + N | Predicted positive (PP) | Predicted negative (PN) | Informedness, bookmaker informedness (BM) = TPR + TNR − 1 | Prevalence threshold (PT) = (√(TPR × FPR) − FPR) / (TPR − FPR) |
| Actual condition positive (P) | True positive (TP), hit | False negative (FN), type II error, miss, underestimation | True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power = TP / P = 1 − FNR | False negative rate (FNR), miss rate = FN / P = 1 − TPR |
| Actual condition negative (N) | False positive (FP), type I error, false alarm, overestimation | True negative (TN), correct rejection | False positive rate (FPR), probability of false alarm, fall-out = FP / N = 1 − TNR | True negative rate (TNR), specificity (SPC), selectivity = TN / N = 1 − FPR |
| Prevalence = P / (P + N) | Positive predictive value (PPV), precision = TP / PP = 1 − FDR | False omission rate (FOR) = FN / PN = 1 − NPV | Positive likelihood ratio (LR+) = TPR / FPR | Negative likelihood ratio (LR−) = FNR / TNR |
| Accuracy (ACC) = (TP + TN) / (P + N) | False discovery rate (FDR) = FP / PP = 1 − PPV | Negative predictive value (NPV) = TN / PN = 1 − FOR | Markedness (MK), deltaP (Δp) = PPV + NPV − 1 | Diagnostic odds ratio (DOR) = LR+ / LR− |
| Balanced accuracy (BA) = (TPR + TNR) / 2 | F1 score = 2 × PPV × TPR / (PPV + TPR) = 2 TP / (2 TP + FP + FN) | Fowlkes–Mallows index (FM) = √(PPV × TPR) | Matthews correlation coefficient (MCC) = √(TPR × TNR × PPV × NPV) − √(FNR × FPR × FOR × FDR) | Threat score (TS), critical success index (CSI), Jaccard index = TP / (TP + FN + FP) |
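Several cells of the table state the same quantity in two forms. The sketch below checks a few of these numerically for one 2×2 table: the F1 score in its rate and count forms, the prevalence threshold, and the rate-based MCC expression against the usual count-based MCC formula. The count-based formula is standard but does not appear in the table, and the class name `ConfusionMatrix` is a hypothetical helper. The counts are taken from the fecal occult blood example that follows.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of a 2x2 table; all four margins are assumed to be non-zero."""
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives

    @property
    def tpr(self) -> float:  # sensitivity, recall
        return self.tp / (self.tp + self.fn)

    @property
    def tnr(self) -> float:  # specificity
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:  # precision
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def f1_from_rates(self) -> float:
        return 2 * self.ppv * self.tpr / (self.ppv + self.tpr)

    def f1_from_counts(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

    def mcc_from_rates(self) -> float:
        fnr, fpr = 1 - self.tpr, 1 - self.tnr
        fdr, f_or = 1 - self.ppv, 1 - self.npv
        return (math.sqrt(self.tpr * self.tnr * self.ppv * self.npv)
                - math.sqrt(fnr * fpr * f_or * fdr))

    def mcc_from_counts(self) -> float:
        numerator = self.tp * self.tn - self.fp * self.fn
        denominator = math.sqrt((self.tp + self.fp) * (self.tp + self.fn)
                                * (self.tn + self.fp) * (self.tn + self.fn))
        return numerator / denominator

    def prevalence_threshold(self) -> float:
        fpr = 1 - self.tnr
        return (math.sqrt(self.tpr * fpr) - fpr) / (self.tpr - fpr)


cm = ConfusionMatrix(tp=20, fn=10, fp=180, tn=1820)
print(f"F1  {cm.f1_from_rates():.4f} vs {cm.f1_from_counts():.4f}")
print(f"MCC {cm.mcc_from_rates():.4f} vs {cm.mcc_from_counts():.4f}")
print(f"PT  {cm.prevalence_threshold():.4f}")
```

The properties divide by the row and column totals without guarding against zero, so the sketch only applies to tables in which every margin is populated.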
Note that the positive and negative predictive values can only be estimated using data from a cross-sectional study or other population-based study in which valid prevalence estimates may be obtained. In contrast, the sensitivity and specificity can be estimated from case-control studies.
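The effect of study design can be illustrated with expected cell counts. In the sketch below, the same test (sensitivity 20/30, specificity 91%, the figures used in the example that follows) is applied once to a cross-sectional sample of 30 diseased and 2000 healthy subjects and once to a hypothetical 1:1 case-control sample. The helper names and the 1:1 design are assumptions made for illustration only.

```python
def expected_table(n_diseased: float, n_healthy: float,
                   sensitivity: float, specificity: float) -> dict:
    """Expected 2x2 cell counts for a study with the given group sizes."""
    tp = n_diseased * sensitivity
    fn = n_diseased - tp
    tn = n_healthy * specificity
    fp = n_healthy - tn
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}


def sensitivity_of(table: dict) -> float:
    return table["tp"] / (table["tp"] + table["fn"])


def ppv_of(table: dict) -> float:
    return table["tp"] / (table["tp"] + table["fp"])


SENS, SPEC = 20 / 30, 0.91

cross_sectional = expected_table(30, 2000, SENS, SPEC)
case_control = expected_table(30, 30, SENS, SPEC)  # hypothetical 1:1 design

for label, table in (("cross-sectional", cross_sectional),
                     ("case-control", case_control)):
    print(f"{label}: sensitivity = {sensitivity_of(table):.3f}, "
          f"naive PPV = {ppv_of(table):.3f}")
```

Sensitivity is the same in both designs, whereas the PPV computed naively from the case-control counts is far higher, because the sample contains a much larger share of diseased subjects than the population does.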
Suppose the fecal occult blood (FOB) screen test is used in 2030 people to look for bowel cancer:
Fecal occult blood screen test outcome (columns) against bowel cancer status as confirmed on endoscopy (rows)
| Total population (pop.) = 2030 | Test outcome positive | Test outcome negative | Accuracy (ACC) = (TP + TN) / pop. = (20 + 1820) / 2030 ≈ 90.64% | F1 score = 2 × precision × recall / (precision + recall) ≈ 0.174 |
| Actual condition positive (AP) = 30 (2030 × 1.48%) | True positive (TP) = 20 (2030 × 1.48% × 67%) | False negative (FN) = 10 (2030 × 1.48% × (100% − 67%)) | True positive rate (TPR), recall, sensitivity = TP / AP = 20 / 30 ≈ 66.7% | False negative rate (FNR), miss rate = FN / AP = 10 / 30 ≈ 33.3% |
| Actual condition negative (AN) = 2000 (2030 × (100% − 1.48%)) | False positive (FP) = 180 (2030 × (100% − 1.48%) × (100% − 91%)) | True negative (TN) = 1820 (2030 × (100% − 1.48%) × 91%) | False positive rate (FPR), fall-out, probability of false alarm = FP / AN = 180 / 2000 = 9.0% | Specificity, selectivity, true negative rate (TNR) = TN / AN = 1820 / 2000 = 91% |
| Prevalence = AP / pop. = 30 / 2030 ≈ 1.48% | Positive predictive value (PPV), precision = TP / (TP + FP) = 20 / (20 + 180) = 10% | False omission rate (FOR) = FN / (FN + TN) = 10 / (10 + 1820) ≈ 0.55% | Positive likelihood ratio (LR+) = TPR / FPR = (20 / 30) / (180 / 2000) ≈ 7.41 | Negative likelihood ratio (LR−) = FNR / TNR = (10 / 30) / (1820 / 2000) ≈ 0.366 |
| | False discovery rate (FDR) = FP / (TP + FP) = 180 / (20 + 180) = 90.0% | Negative predictive value (NPV) = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.45% | Diagnostic odds ratio (DOR) = LR+ / LR− ≈ 20.2 | |
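The figures in the table can be rebuilt from three inputs: the population size, the prevalence, and the test's sensitivity and specificity. The sketch below does so with exact fractions so that no rounding enters before the final printout; the variable names are illustrative.

```python
from fractions import Fraction

population = 2030
prevalence = Fraction(30, 2030)    # about 1.48%
sensitivity = Fraction(20, 30)     # about 66.7%
specificity = Fraction(91, 100)

# Split the population by actual condition, then by test outcome.
actual_pos = population * prevalence
actual_neg = population - actual_pos
tp = actual_pos * sensitivity
fn = actual_pos - tp
tn = actual_neg * specificity
fp = actual_neg - tn

ppv = tp / (tp + fp)
npv = tn / (tn + fn)
lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity
dor = lr_pos / lr_neg

print("cells (TP, FN, FP, TN):", int(tp), int(fn), int(fp), int(tn))
for name, value in (("PPV", ppv), ("NPV", npv), ("LR+", lr_pos),
                    ("LR-", lr_neg), ("DOR", dor)):
    print(f"{name} = {value} = {float(value):.4f}")
```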
The small positive predictive value (PPV = 10%) indicates that many of the positive results from this testing procedure are false positives. Thus it will be necessary to follow up any positive result with a more reliable test to obtain a more accurate assessment as to whether cancer is present. Nevertheless, such a test may be useful if it is inexpensive and convenient. The strength of the FOB screen test lies instead in its negative predictive value (NPV ≈ 99.45%): a negative result for an individual gives high confidence that the result is a true negative.
Note that the PPV is not intrinsic to the test; it depends also on the prevalence. [2] Due to the large effect of prevalence upon predictive values, a standardized approach has been proposed, where the PPV is normalized to a prevalence of 50%. [12] For a test with fixed sensitivity and specificity, the PPV rises with the prevalence of the disease or condition, although not in direct proportion. In the above example, if the group of people tested had included a higher proportion of people with bowel cancer, then the PPV would probably come out higher and the NPV lower. If everybody in the group had bowel cancer, the PPV would be 100% and the NPV 0%.
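The dependence on prevalence can be made concrete by holding the test's sensitivity and specificity fixed (here the values of the example above) and sweeping the prevalence. The function name `predictive_values` and the chosen prevalence grid are assumptions of this sketch.

```python
def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> tuple:
    """Return (PPV, NPV) for a test applied at the given prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    # Assumes the test yields both positive and negative results
    # (tp + fp > 0 and tn + fn > 0); otherwise the ratios are undefined.
    return tp / (tp + fp), tn / (tn + fn)


SENS, SPEC = 20 / 30, 0.91

for prev in (0.01, 0.02, 0.1, 0.5, 1.0):
    ppv, npv = predictive_values(SENS, SPEC, prev)
    print(f"prevalence {prev:>5.0%}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

In this sweep the PPV rises and the NPV falls as prevalence grows. Doubling the prevalence from 1% to 2% does not quite double the PPV, and at a prevalence of 100% the PPV reaches 1 while the NPV drops to 0.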
To overcome this problem, NPV and PPV should only be used if the ratio of the number of patients in the disease group to the number of patients in the healthy control group used to establish the NPV and PPV is equivalent to the prevalence of the disease in the studied population. If two disease groups are compared, the ratio of the number of patients in disease group 1 to the number of patients in disease group 2 should be equivalent to the ratio of the prevalences of the two diseases studied. Otherwise, positive and negative likelihood ratios are more accurate than NPV and PPV, because likelihood ratios do not depend on prevalence.
When an individual being tested has a different pre-test probability of having a condition than the control groups used to establish the PPV and NPV, the PPV and NPV are generally distinguished from the positive and negative post-test probabilities, with the PPV and NPV referring to the ones established by the control groups, and the post-test probabilities referring to the ones for the tested individual (as estimated, for example, by likelihood ratios). Preferably, in such cases, a large group of equivalent individuals should be studied, in order to establish separate positive and negative predictive values for use of the test in such individuals.
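A common way to obtain an individual's post-test probability from a likelihood ratio is to convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. The sketch below follows that route; the function name `post_test_probability` is hypothetical, and the likelihood ratios are those of the fecal occult blood example above.

```python
def post_test_probability(pre_test_probability: float,
                          likelihood_ratio: float) -> float:
    """Probability of the condition after a test result with the given LR."""
    if not 0.0 <= pre_test_probability < 1.0:
        raise ValueError("pre-test probability must lie in [0, 1)")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio must be non-negative")
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


LR_POS = (20 / 30) / (180 / 2000)    # about 7.41
LR_NEG = (10 / 30) / (1820 / 2000)   # about 0.366

# Pre-test probability equal to the study prevalence
print(post_test_probability(30 / 2030, LR_POS))
# A hypothetical higher-risk individual with a 10% pre-test probability
print(post_test_probability(0.10, LR_POS))
print(post_test_probability(0.10, LR_NEG))
```

When the pre-test probability equals the study prevalence, a positive result gives back the PPV of 10%, and the probability of disease after a negative result equals 1 − NPV. For the hypothetical individual with a 10% pre-test probability, the same positive result yields a probability of about 45%.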
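Bayes' theorem confers inherent limitations on the accuracy of screening tests as a function of disease prevalence or pre-test probability. It has been shown that a testing system can tolerate significant drops in prevalence, up to a certain well-defined point known as the prevalence threshold, below which the reliability of a positive screening test drops precipitously. That said, Balayla et al. [13] showed that sequential testing overcomes the aforementioned Bayesian limitations and thus improves the reliability of screening tests. For a desired positive predictive value $\rho$ that approaches some constant $k$, the number of positive test iterations $n_i$ needed is:

$$n_i = \lim_{\rho \to k} \left\lceil \frac{\ln\left[\dfrac{\rho(\phi - 1)}{\phi(\rho - 1)}\right]}{\ln\left[\dfrac{a}{1 - b}\right]} \right\rceil$$

where $n_i$ is the number of testing iterations necessary to achieve $\rho$, the desired positive predictive value; $a$ is the sensitivity; $b$ is the specificity; $\phi$ is the disease prevalence; and $k$ is a constant.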
Of note, the denominator of the above equation is the natural logarithm of the positive likelihood ratio (LR+).
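The iteration count can be sketched from the odds form of Bayes' theorem: each positive result multiplies the odds of disease by LR+, so the number of positives needed is the logarithm of the ratio of target odds to prior odds, divided by ln(LR+) and rounded up. The sketch assumes that repeated tests are conditionally independent given the true disease status, which the text does not state explicitly; the function names are hypothetical.

```python
import math


def positive_tests_needed(target_ppv: float, prevalence: float,
                          sensitivity: float, specificity: float) -> int:
    """Consecutive positive results needed to reach at least target_ppv."""
    if not (0.0 < prevalence < 1.0 and 0.0 < target_ppv < 1.0):
        raise ValueError("prevalence and target PPV must lie strictly in (0, 1)")
    if not (0.0 < sensitivity <= 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("need 0 < sensitivity <= 1 and 0 < specificity < 1")
    lr_pos = sensitivity / (1.0 - specificity)
    if lr_pos <= 1.0:
        raise ValueError("a positive result must raise the odds (LR+ > 1)")
    if target_ppv <= prevalence:
        return 0  # the prior already meets the target
    prior_odds = prevalence / (1.0 - prevalence)
    target_odds = target_ppv / (1.0 - target_ppv)
    return math.ceil(math.log(target_odds / prior_odds) / math.log(lr_pos))


def ppv_after(n_positive: int, prevalence: float,
              sensitivity: float, specificity: float) -> float:
    """PPV after n_positive consecutive positives (independence assumed)."""
    odds = prevalence / (1.0 - prevalence)
    odds *= (sensitivity / (1.0 - specificity)) ** n_positive
    return odds / (1.0 + odds)


n = positive_tests_needed(0.95, 30 / 2030, 20 / 30, 0.91)
print(n, round(ppv_after(n, 30 / 2030, 20 / 30, 0.91), 4))
```

With the figures of the fecal occult blood example, four consecutive positive results would be needed to push the PPV above 95% under the independence assumption; three would leave it below that target.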
PPV is used to indicate the probability that, in case of a positive test, the patient really has the specified disease. However, there may be more than one cause for a disease, and any single potential cause may not always result in the overt disease seen in a patient. There is potential to mix up related target conditions of PPV and NPV, such as interpreting the PPV or NPV of a test as referring to having a disease, when that PPV or NPV value actually refers only to a predisposition to that disease.
An example is the microbiological throat swab used in patients with a sore throat. Usually, publications stating the PPV of a throat swab are reporting on the probability that a given bacterium is present in the throat, rather than that the patient is ill from the bacteria found. If the presence of this bacterium always resulted in a sore throat, then the PPV would be very useful. However, the bacteria may colonise individuals in a harmless way and never result in infection or disease. Sore throats occurring in these individuals are caused by other agents such as a virus. In this situation the gold standard used in the evaluation study represents only the presence of bacteria (that might be harmless) but not a causal bacterial sore throat illness. It can be proven that this problem will affect positive predictive value far more than negative predictive value. [14] To evaluate diagnostic tests where the gold standard looks only at potential causes of disease, one may use an extension of the predictive value termed the Etiologic Predictive Value. [15] [16]