The total operating characteristic (TOC) is a statistical method to compare a Boolean variable versus a rank variable. TOC can measure the ability of an index variable to diagnose either presence or absence of a characteristic. The diagnosis of presence or absence depends on whether the value of the index is above a threshold. TOC considers multiple possible thresholds. Each threshold generates a two-by-two contingency table, which contains four entries: hits, misses, false alarms, and correct rejections. [1]
The receiver operating characteristic (ROC) also characterizes diagnostic ability, although ROC reveals less information than the TOC. For each threshold, ROC reveals two ratios, hits/(hits + misses) and false alarms/(false alarms + correct rejections), while TOC shows the total information in the contingency table for each threshold. [2] The TOC method reveals all of the information that the ROC method provides, plus additional important information that ROC does not reveal, i.e. the size of every entry in the contingency table for each threshold. TOC also provides the popular area under the curve (AUC) of the ROC.
TOC is applicable to measure diagnostic ability in many fields including but not limited to: land change science, medical imaging, weather forecasting, remote sensing, and materials testing.
The procedure to construct the TOC curve compares the Boolean variable to the index variable by diagnosing each observation as either presence or absence, depending on how the index relates to various thresholds. If an observation's index is greater than or equal to a threshold, then the observation is diagnosed as presence, otherwise the observation is diagnosed as absence. The contingency table that results from the comparison between the Boolean variable and the diagnosis for a single threshold has four central entries. The four central entries are hits (H), misses (M), false alarms (F), and correct rejections (C). The total number of observations is P + Q. The terms “true positives”, “false negatives”, “false positives” and “true negatives” are equivalent to hits, misses, false alarms and correct rejections, respectively. The entries can be formulated in a two-by-two contingency table or confusion matrix, as follows:
Boolean (rows) / Diagnosis (columns) | Presence | Absence | Boolean total |
---|---|---|---|
Presence | Hits (H) | Misses (M) | H + M = P |
Absence | False alarms (F) | Correct rejections (C) | F + C = Q |
Diagnosis total | H + F | M + C | P + Q |
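The tally for a single threshold can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `contingency_table`, the `Contingency` tuple, and the six-observation toy data are all hypothetical.

```python
from typing import NamedTuple, Sequence


class Contingency(NamedTuple):
    hits: int
    misses: int
    false_alarms: int
    correct_rejections: int


def contingency_table(boolean: Sequence[bool],
                      index: Sequence[float],
                      threshold: float) -> Contingency:
    """Diagnose presence where index >= threshold and tally the four entries."""
    h = m = f = c = 0
    for present, value in zip(boolean, index):
        diagnosed_presence = value >= threshold
        if present and diagnosed_presence:
            h += 1          # hit
        elif present:
            m += 1          # miss
        elif diagnosed_presence:
            f += 1          # false alarm
        else:
            c += 1          # correct rejection
    return Contingency(h, m, f, c)


# Hypothetical toy data (not the 30-observation dataset discussed in the article).
boolean = [True, True, False, True, False, False]
index = [90, 80, 70, 60, 50, 40]

table = contingency_table(boolean, index, threshold=70)
P = table.hits + table.misses                        # Boolean presence total
Q = table.false_alarms + table.correct_rejections    # Boolean absence total
```

The marginal totals follow directly from the four entries, which is why the table needs no more than those four numbers.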
Four bits of information determine all the entries in the contingency table, including its marginal totals. For example, if we know H, M, F, and C, then we can compute all the marginal totals for any threshold. Alternatively, if we know H/P, F/Q, P, and Q, then we can compute all the entries in the table. [1] Two bits of information are not sufficient to complete the contingency table. For example, if we know only H/P and F/Q, which is what ROC shows, then it is impossible to know all the entries in the table. [1]
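To make the point concrete, the sketch below rebuilds the table from H/P, F/Q, P, and Q, and shows that the same two ratios give different tables when P and Q differ. The function name `table_from_rates` and the numbers are illustrative assumptions.

```python
def table_from_rates(hit_rate, false_alarm_rate, p, q):
    """Rebuild the four central entries from H/P, F/Q, P and Q."""
    hits = round(hit_rate * p)
    false_alarms = round(false_alarm_rate * q)
    return {
        "H": hits,
        "M": p - hits,
        "F": false_alarms,
        "C": q - false_alarms,
    }


# The same pair of ratios (what ROC shows) with two different pairs of totals.
table_a = table_from_rates(0.3, 0.2, p=10, q=20)
table_b = table_from_rates(0.3, 0.2, p=20, q=10)
```

Both tables map to the same ROC point, so the ratios alone cannot tell them apart.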
Robert Gilmore Pontius Jr, professor of Geography at Clark University, and Kangping Si in 2014 first developed the TOC for application in land change science.
The TOC curve with four boxes indicates how a point on the TOC curve reveals the hits, misses, false alarms, and correct rejections. The TOC curve is an effective way to show the total information in the contingency table for all thresholds. The dataset behind this TOC curve has 30 observations, each of which consists of a value for a Boolean variable and a value for an index variable. The observations are ranked from the greatest to the least value of the index. There are 31 thresholds: the 30 values of the index plus one additional threshold that is greater than all the index values, which creates the point at the origin (0, 0). Each point is labeled with the value of its threshold. The horizontal axis ranges from 0 to 30, which is the number of observations in the dataset (P + Q). The vertical axis ranges from 0 to 10, which is the Boolean variable's number of presence observations P (i.e. hits + misses). TOC curves also show the threshold at which the diagnosed amount of presence matches the Boolean amount of presence: it is the threshold point that lies directly under the point where the maximum line meets the hits + misses line. For a more detailed explanation of the construction of the TOC curve, see Pontius Jr, Robert Gilmore; Si, Kangping (2014). "The total operating characteristic to measure diagnostic ability for multiple thresholds". International Journal of Geographical Information Science 28 (3): 570–583. [1]
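The construction can be sketched as below, assuming the usual TOC axes: hits + false alarms on the horizontal axis and hits on the vertical axis. The function name `toc_points` and the toy data are hypothetical, and the data are not the 30-observation dataset described above.

```python
def toc_points(boolean, index):
    """Return (hits + false alarms, hits) for each threshold, highest first.

    The first point is the origin, produced by a threshold greater than
    every index value, so that nothing is diagnosed as presence.
    """
    points = [(0, 0)]
    for threshold in sorted(set(index), reverse=True):
        hits = sum(1 for b, v in zip(boolean, index) if b and v >= threshold)
        false_alarms = sum(
            1 for b, v in zip(boolean, index) if not b and v >= threshold
        )
        points.append((hits + false_alarms, hits))
    return points


# Hypothetical toy data: 3 presence and 3 absence observations.
boolean = [True, True, False, True, False, False]
index = [90, 80, 70, 60, 50, 40]

points = toc_points(boolean, index)
```

With distinct index values there is one point per observation plus the origin, and the last point lands at (P + Q, P).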
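For each threshold, the point on the TOC curve gives the four central entries in the contingency table. With hits + false alarms on the horizontal axis and hits on the vertical axis, the number of hits is the height of the point above the horizontal axis, the number of misses is the vertical distance from the point up to the hits + misses line at P, the number of false alarms is the horizontal distance from the maximum line to the point, and the number of correct rejections is the horizontal distance from the point to the minimum line.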
These figures are the TOC and ROC curves using the same data and thresholds. Consider the point that corresponds to a threshold of 74. The TOC curve shows the number of hits, which is 3, and hence the number of misses, which is 7. Additionally, the TOC curve shows that the number of false alarms is 4 and the number of correct rejections is 16. At any given point in the ROC curve, it is possible to glean values for the ratios of false alarms/(false alarms+correct rejections) and hits/(hits+misses). For example, at threshold 74, it is evident that the x coordinate is 0.2 and the y coordinate is 0.3. However, these two values are insufficient to construct all entries of the underlying two-by-two contingency table.
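The threshold-74 example can be checked with a short sketch. It assumes the TOC point is (hits + false alarms, hits) = (7, 3) with P = 10 and Q = 20, which follows from the counts quoted above. The function names are hypothetical.

```python
def entries_from_toc_point(x, y, p, q):
    """Recover the four central entries from a TOC point and the totals."""
    hits = y
    false_alarms = x - y
    return {"H": hits, "M": p - hits, "F": false_alarms, "C": q - false_alarms}


def roc_point(entries):
    """Return the ROC coordinates (F / (F + C), H / (H + M))."""
    fpr = entries["F"] / (entries["F"] + entries["C"])
    tpr = entries["H"] / (entries["H"] + entries["M"])
    return fpr, tpr


entries = entries_from_toc_point(x=7, y=3, p=10, q=20)
fpr, tpr = roc_point(entries)
```

Going from the TOC point to the ROC point discards P and Q, which is why the reverse step is not possible from the ROC coordinates alone.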
It is common to report the area under the curve (AUC) to summarize a TOC or ROC curve. However, condensing diagnostic ability into a single number fails to appreciate the shape of the curve. The following three TOC curves all have an AUC of 0.75 but have different shapes.
This TOC curve exemplifies an instance in which the index variable has a high diagnostic ability at high thresholds near the origin, but random diagnostic ability at low thresholds near the upper right of the curve. The curve shows accurate diagnosis of presence until the curve reaches a threshold of 86. The curve then levels off, and its diagnostic ability is about random.
This TOC curve exemplifies an instance in which the index variable has a medium diagnostic ability at all thresholds. The curve is consistently above the random line.
This TOC curve exemplifies an instance in which the index variable has random diagnostic ability at high thresholds and high diagnostic ability at low thresholds. The curve follows the random line at the highest thresholds near the origin, then the index variable diagnoses absence correctly as thresholds decrease near the upper right corner.
When measuring diagnostic ability, a commonly reported measure is the area under the curve (AUC). The AUC is calculable from the TOC and the ROC, and its value is the same for the same data whether it is calculated from a TOC curve or a ROC curve. The AUC indicates the probability that the diagnosis ranks a randomly chosen observation of Boolean presence higher than a randomly chosen observation of Boolean absence. [3] The AUC is appealing to many researchers because it summarizes diagnostic ability in a single number. However, the AUC has come under critique as a potentially misleading measure, especially for spatially explicit analyses. [3] [4] Features of the AUC that draw criticism include: 1) AUC ignores the thresholds; 2) AUC summarizes the test performance over regions of the TOC or ROC space in which one would rarely operate; 3) AUC weights omission and commission errors equally; 4) AUC does not give information about the spatial distribution of model errors; and 5) the selection of spatial extent highly influences the rate of accurately diagnosed absences and the AUC scores. [5] However, most of those criticisms apply to many other metrics.
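When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). [6] This can be seen as follows. Writing TPR(T) and FPR(T) for the true positive rate and false positive rate at threshold T, the area under the curve is given by (the integral boundaries are reversed as large T has a lower value on the x-axis)

$$A = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\,\mathrm{FPR}'(T)\,dT = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I(T' \ge T)\, f_1(T')\, f_0(T)\, dT'\, dT = P(X_1 \ge X_0)$$

where $X_1$ is the score for a positive instance, $X_0$ is the score for a negative instance, $f_1$ and $f_0$ are the probability densities of the scores for positive and negative instances, respectively, and $I$ is the indicator function.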
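The probability interpretation can be illustrated by counting pairs directly. In this sketch, a tie between a positive and a negative score counts as half a win, which is a common convention for discrete data and an assumption here. The function name `auc_pairwise` and the toy data are hypothetical.

```python
def auc_pairwise(boolean, index):
    """Estimate the AUC as the share of (positive, negative) pairs in which
    the positive observation has the higher index; ties count one half."""
    positives = [v for b, v in zip(boolean, index) if b]
    negatives = [v for b, v in zip(boolean, index) if not b]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 3 positive and 3 negative observations.
boolean = [True, True, False, True, False, False]
index = [90, 80, 70, 60, 50, 40]

auc = auc_pairwise(boolean, index)  # 8 of the 9 pairs are ranked correctly
```

The double loop is quadratic in the number of observations, so it serves as a check on small data rather than as an efficient method.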
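It can further be shown that the AUC is closely related to the Mann–Whitney U, [7] [8] which tests whether positives are ranked higher than negatives. It is also equivalent to the Wilcoxon test of ranks. [8] The AUC is related to the Gini coefficient ($G_1$) by the formula $G_1 = 2\,\mathrm{AUC} - 1$, where the AUC is approximated from consecutive points $(X_{k-1}, Y_{k-1})$ and $(X_k, Y_k)$ on the curve in normalized units:

$$\mathrm{AUC} \approx \frac{1}{2}\sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1})$$

In this way, it is possible to calculate the AUC by summing trapezoidal approximations, each of which averages the heights of two consecutive points.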
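A minimal sketch of the trapezoidal calculation in normalized (ROC) units follows, with the Gini coefficient taken as 2·AUC − 1. The function names and the toy data are hypothetical.

```python
def roc_points(boolean, index):
    """Return (false alarm rate, hit rate) per threshold, highest first."""
    p = sum(1 for b in boolean if b)
    q = len(boolean) - p
    points = [(0.0, 0.0)]
    for threshold in sorted(set(index), reverse=True):
        hits = sum(1 for b, v in zip(boolean, index) if b and v >= threshold)
        false_alarms = sum(
            1 for b, v in zip(boolean, index) if not b and v >= threshold
        )
        points.append((false_alarms / q, hits / p))
    return points


def auc_trapezoid(points):
    """Sum the trapezoids between consecutive points on the curve."""
    return sum(
        (x1 - x0) * (y1 + y0) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy data: 3 positive and 3 negative observations.
boolean = [True, True, False, True, False, False]
index = [90, 80, 70, 60, 50, 40]

auc = auc_trapezoid(roc_points(boolean, index))
gini = 2 * auc - 1
```

For an empirical curve made of straight segments, the trapezoid sum is the exact area under the plotted curve.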
It is also common to calculate the area under the convex hull of the curve (ROC AUCH = ROCH AUC), because any point on the line segment between two prediction results can be achieved by randomly using one or the other system, with probabilities proportional to the relative length of the opposite component of the segment. [10] It is also possible to invert concavities: a worse solution can be reflected to become a better solution. Concavities can be reflected in any line segment, but this more extreme form of fusion is much more likely to overfit the data. [11]
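The convex hull idea can be sketched in normalized ROC coordinates with a monotone-chain upper hull. This is an illustrative choice of algorithm, not one named by the cited sources. The function names and the example points are hypothetical.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull of ROC points, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the new chord.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def auc_trapezoid(points):
    return sum(
        (x1 - x0) * (y1 + y0) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical curve with a concavity at (0.4, 0.5).
curve = [(0.0, 0.0), (0.2, 0.6), (0.4, 0.5), (0.6, 0.9), (1.0, 1.0)]
hull = upper_hull(curve)

auc = auc_trapezoid(curve)    # area under the original curve
auch = auc_trapezoid(hull)    # area under the convex hull
```

The hull removes the concave point, so the area under the hull is never smaller than the area under the original curve.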
Another problem with TOC AUC is that reducing the TOC curve to a single number ignores the fact that the curve is about the tradeoffs between the different systems or performance points plotted, not the performance of an individual system. It also ignores the possibility of concavity repair. For these reasons, related alternative measures such as informedness or DeltaP are recommended. [12] [13] These measures are essentially equivalent to the Gini for a single prediction point, with DeltaP' = informedness = 2AUC − 1, whilst DeltaP = markedness represents the dual (viz. predicting the prediction from the real class), and their geometric mean is the Matthews correlation coefficient.
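These relationships can be checked numerically for a single contingency table. The sketch below uses H = 3, M = 7, F = 4, C = 16 as illustrative values, and the function names are hypothetical.

```python
import math


def informedness(h, m, f, c):
    """Hit rate minus false alarm rate (DeltaP')."""
    return h / (h + m) - f / (f + c)


def markedness(h, m, f, c):
    """Precision of presence plus precision of absence, minus one (DeltaP)."""
    return h / (h + f) + c / (c + m) - 1


def matthews(h, m, f, c):
    """Matthews correlation coefficient."""
    return (h * c - f * m) / math.sqrt((h + f) * (h + m) * (c + f) * (c + m))


def single_point_auc(h, m, f, c):
    """Area under the curve (0, 0) -> (F/Q, H/P) -> (1, 1)."""
    fpr, tpr = f / (f + c), h / (h + m)
    return fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2


table = (3, 7, 4, 16)  # H, M, F, C
inf_value = informedness(*table)
mark_value = markedness(*table)
mcc_value = matthews(*table)
auc_value = single_point_auc(*table)
```

In this example informedness and markedness are both positive, so the Matthews correlation coefficient equals the square root of their product. When both are negative, the coefficient is the negative of that square root.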
Whereas TOC AUC varies between 0 and 1, with an uninformative classifier yielding 0.5, the alternative measures known as informedness, Certainty [12] and Gini coefficient (in the single parameterization or single system case) all have the advantage that 0 represents chance performance whilst 1 represents perfect performance, and −1 represents the "perverse" case of full informedness always giving the wrong response. [14] Bringing chance performance to 0 allows these alternative scales to be interpreted as Kappa statistics. Informedness has been shown to have desirable characteristics for machine learning versus other common definitions of Kappa such as Cohen kappa and Fleiss kappa. [15]
Sometimes it can be more useful to look at a specific region of the TOC curve rather than at the whole curve. It is possible to compute partial AUC. [16] For example, one could focus on the region of the curve with low false positive rate, which is often of prime interest for population screening tests. [17] Another common approach for classification problems in which positives are far fewer than negatives (P ≪ Q in the notation above, common in bioinformatics applications) is to use a logarithmic scale for the x-axis. [18]
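A partial AUC over the low false positive region can be sketched as a trapezoid sum that stops at a chosen cutoff on the horizontal axis, with linear interpolation where the cutoff falls inside a segment. The function name `partial_auc`, the cutoff, and the example points are hypothetical. The result is the raw area, without any of the rescalings that some authors apply.

```python
def partial_auc(points, x_max):
    """Area under a piecewise-linear curve from x = 0 up to x = x_max.

    `points` must be ordered by non-decreasing x, in normalized units.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= x_max:
            break
        if x1 > x_max:
            # Interpolate the curve at x_max and truncate the segment there.
            y1 = y0 + (y1 - y0) * (x_max - x0) / (x1 - x0)
            x1 = x_max
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical curve in normalized units.
curve = [(0.0, 0.0), (0.2, 0.6), (0.6, 0.9), (1.0, 1.0)]

low_fpr_area = partial_auc(curve, x_max=0.4)
full_area = partial_auc(curve, x_max=1.0)
```

Setting the cutoff to 1 recovers the full area, which gives a simple consistency check.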