Youden's J statistic

Youden's J statistic (also called Youden's index) is a single statistic that captures the performance of a dichotomous diagnostic test. (Bookmaker) Informedness is its generalization to the multiclass case and estimates the probability of an informed decision.

Definition

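Youden's J statistic is

$$J = \text{sensitivity} + \text{specificity} - 1$$

with the two right-hand quantities being sensitivity and specificity. Thus the expanded formula is:

$$J = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} + \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}} - 1$$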

The index was suggested by W. J. Youden in 1950 [1] as a way of summarising the performance of a diagnostic test; however, the formula was published earlier in Science by C. S. Peirce in 1884. [2] Its value ranges from -1 through 1 (inclusive), [1] and it is zero when a diagnostic test gives the same proportion of positive results for groups with and without the disease, i.e. the test is useless. A value of 1 indicates that there are no false positives or false negatives, i.e. the test is perfect. The index gives equal weight to false positive and false negative values, so all tests with the same value of the index give the same proportion of total misclassified results. A value of less than zero is possible (for example, -1 when the classification yields only false positives and false negatives), but it just indicates that the positive and negative labels have been switched. After correcting the labels, the result will be in the 0 through 1 range.
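As a minimal sketch of the properties described above, the index can be computed directly from the four cells of a 2×2 table. The function name `youden_j` and the example counts are illustrative assumptions, not taken from the cited sources.

```python
def youden_j(tp, fn, fp, tn):
    """Youden's J = sensitivity + specificity - 1, from 2x2 table counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1


# Hypothetical counts chosen to illustrate each case.
perfect = youden_j(tp=50, fn=0, fp=0, tn=50)      # no false positives or negatives
useless = youden_j(tp=20, fn=30, fp=20, tn=30)    # 40% test positive in both groups
switched = youden_j(tp=10, fn=40, fp=35, tn=15)   # worse than chance, J < 0
# Swapping the predicted labels exchanges tp with fn and fp with tn.
corrected = youden_j(tp=40, fn=10, fp=15, tn=35)

print(perfect, useless, switched, corrected)
```

Swapping the predicted labels flips the sign of the index, which is why a negative value can always be brought back into the 0 through 1 range.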

Figure: Example of a receiver operating characteristic curve. Solid red: ROC curve; dashed line: chance level; vertical line (J): maximum value of Youden's index for the ROC curve.

Youden's index is often used in conjunction with receiver operating characteristic (ROC) analysis. [3] The index is defined for all points of an ROC curve, and the maximum value of the index may be used as a criterion for selecting the optimum cut-off point when a diagnostic test gives a numeric rather than a dichotomous result. The index is represented graphically as the height above the chance line, and it is also equivalent to the area under the curve subtended by a single operating point. [4]
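The cut-off selection described above can be sketched as a scan over candidate thresholds, keeping the one with the largest index. This is an illustrative sketch only: the helper `best_cutoff`, the convention that a score at or above the threshold counts as positive, and the toy data are all assumptions.

```python
def best_cutoff(scores, labels):
    """Return (threshold, J) maximising Youden's J.

    A case is predicted positive when its score >= threshold
    (an assumed convention; labels are 1 for disease, 0 otherwise).
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        tpr = tp / positives   # sensitivity
        fpr = fp / negatives   # 1 - specificity
        j = tpr - fpr          # height of the ROC point above the chance line
        if j > best_j:         # ties keep the lowest threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy numeric test results and true disease status (hypothetical data).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
threshold, j = best_cutoff(scores, labels)
print(threshold, j)
```

Because specificity equals one minus the false positive rate, `tpr - fpr` is the same quantity as sensitivity + specificity - 1.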

Youden's index is also known as deltaP' [5] and generalizes from the dichotomous to the multiclass case as informedness. [4]

The use of a single index is "not generally to be recommended", [6] but informedness or Youden's index is the probability of an informed decision (as opposed to a random guess) and takes into account all predictions. [4]

An unrelated but commonly used combination of basic statistics from information retrieval is the F-score, a (possibly weighted) harmonic mean of recall and precision, where recall = sensitivity = true positive rate. Specificity and precision, however, are entirely different measures. F-score, like recall and precision, only considers the so-called positive predictions, with recall being the probability of predicting just the positive class, precision being the probability of a positive prediction being correct, and F-score equating these probabilities under the effective assumption that the positive labels and the positive predictions should have the same distribution and prevalence, [4] similar to the assumption underlying Fleiss' kappa.

Youden's J, Informedness, Recall, Precision and F-score are intrinsically unidirectional, aiming to assess the deductive effectiveness of predictions in the direction proposed by a rule, theory or classifier. DeltaP is Youden's J used to assess the reverse or abductive direction [4] [7] (and generalizes to the multiclass case as Markedness); it matches well human learning of associations, rules and superstitions as we model possible causation, [5] while correlation and kappa evaluate bidirectionally.

Matthews correlation coefficient is the geometric mean of the regression coefficient of the dichotomous problem and its dual, where the component regression coefficients of the Matthews correlation coefficient are deltaP and deltaP' (that is, Youden's J or Peirce's I). [5] The main article on Matthews correlation coefficient discusses two different generalizations to the multiclass case, one being the analogous geometric mean of Informedness and Markedness. [4] Kappa statistics such as Fleiss' kappa and Cohen's kappa are methods for calculating inter-rater reliability based on different assumptions about the marginal or prior distributions, and are increasingly used as chance-corrected alternatives to accuracy in other contexts (including the multiclass case). Fleiss' kappa, like F-score, assumes that both variables are drawn from the same distribution and thus have the same expected prevalence, while Cohen's kappa assumes that the variables are drawn from distinct distributions and referenced to a model of expectation that assumes prevalences are independent. [7]
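The geometric-mean relationship in the dichotomous case can be checked numerically. The sketch below assumes the standard 2×2 definitions: informedness (deltaP') as sensitivity + specificity - 1 and markedness (deltaP) as precision + negative predictive value - 1. The function names and counts are illustrative.

```python
import math


def informedness(tp, fn, fp, tn):
    """deltaP', i.e. Youden's J: sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


def markedness(tp, fn, fp, tn):
    """deltaP: precision + negative predictive value - 1."""
    return tp / (tp + fp) + tn / (tn + fn) - 1


def matthews_cc(tp, fn, fp, tn):
    """Matthews correlation coefficient from the usual closed form."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


# Hypothetical 2x2 table.
table = dict(tp=40, fn=10, fp=20, tn=30)
inf = informedness(**table)
mark = markedness(**table)
# Both components share the sign of (tp*tn - fp*fn), so the sign is restored
# after taking the square root of their product.
geometric_mean = math.copysign(math.sqrt(inf * mark), inf)
mcc = matthews_cc(**table)
print(inf, mark, geometric_mean, mcc)
```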

When the true prevalences for the two positive variables are equal, as assumed in Fleiss' kappa and F-score, that is, when the number of positive predictions matches the number of positive classes in the dichotomous (two-class) case, the different kappa and correlation measures collapse to identity with Youden's J, and recall, precision and F-score are similarly identical with accuracy. [4] [7]
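A small numeric check of part of this statement, as a sketch: the table below is a hypothetical example in which the number of positive predictions equals the number of actual positives (4 each). Only Youden's J, the Matthews correlation coefficient, Cohen's kappa, recall, precision and the balanced F-score are computed, all from their standard 2×2 formulas.

```python
import math

# Hypothetical table with tp + fp == tp + fn (4 predicted, 4 actual positives).
tp, fn, fp, tn = 3, 1, 1, 5
n = tp + fn + fp + tn

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

youden_j = recall + tn / (tn + fp) - 1

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Cohen's kappa: observed agreement corrected by agreement expected by chance.
observed = (tp + tn) / n
expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
cohen_kappa = (observed - expected) / (1 - expected)

print(youden_j, mcc, cohen_kappa)   # all three coincide
print(recall, precision, f1)        # all three coincide
```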

References

  1. Youden, W.J. (1950). "Index for rating diagnostic tests". Cancer. 3: 32–35. doi:10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3. PMID 15405679.
  2. Peirce, C.S. (1884). "The numerical measure of the success of predictions". Science. 4 (93): 453–454. doi:10.1126/science.ns-4.93.453.b.
  3. Schisterman, E.F.; Perkins, N.J.; Liu, A.; Bondell, H. (2005). "Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples". Epidemiology. 16 (1): 73–81. doi:10.1097/01.ede.0000147512.81966.ba. PMID 15613948.
  4. Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Score to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63. hdl:2328/27165.
  5. Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing". J. Neurolinguistics. 17 (2–3): 97–119. doi:10.1016/s0911-6044(03)00059-9.
  6. Everitt, B.S. (2002). The Cambridge Dictionary of Statistics. CUP. ISBN 0-521-81099-X.
  7. Powers, David M W (2012). The Problem with Kappa. Conference of the European Chapter of the Association for Computational Linguistics. pp. 345–355. hdl:2328/27160.