P4-metric

Last updated October 11, 2024

P₄ metric ^[1]^[2] (also known as FS or Symmetric F^[3]) enables performance evaluation of the binary classifier. It is calculated from precision, recall, specificity and NPV (negative predictive value). P₄ is designed in similar way to F₁ metric, however addressing the criticisms leveled against F₁. It may be perceived as its extension.

Justification

The key concept of P₄ is to leverage the four key conditional probabilities:

P(+\mid C{+})

- the probability that the sample is positive, provided the classifier result was positive.

P(C{+}\mid +)

- the probability that the classifier result will be positive, provided the sample is positive.

P(C{-}\mid -)

- the probability that the classifier result will be negative, provided the sample is negative.

P(-\mid C{-})

- the probability the sample is negative, provided the classifier result was negative.

The main assumption behind this metric is, that a properly designed binary classifier should give the results for which all the probabilities mentioned above are close to 1. P₄ is designed the way that $\mathrm {P} _{4}=1$ requires all the probabilities being equal 1. It also goes to zero when any of these probabilities go to zero.

Definition

P₄ is defined as a harmonic mean of four key conditional probabilities:

\mathrm {P} _{4}={\frac {4}{{\frac {1}{P(+\mid C{+})}}+{\frac {1}{P(C{+}\mid +)}}+{\frac {1}{P(C{-}\mid -)}}+{\frac {1}{P(-\mid C{-})}}}}={\frac {4}{{\frac {1}{\mathit {precision}}}+{\frac {1}{\mathit {recall}}}+{\frac {1}{\mathit {specificity}}}+{\frac {1}{\mathit {NPV}}}}}

In terms of TP,TN,FP,FN it can be calculated as follows:

\mathrm {P} _{4}={\frac {4\cdot \mathrm {TP} \cdot \mathrm {TN} }{4\cdot \mathrm {TP} \cdot \mathrm {TN} +(\mathrm {TP} +\mathrm {TN} )\cdot (\mathrm {FP} +\mathrm {FN} )}}

Evaluation of the binary classifier performance

Evaluating the performance of binary classifier is a multidisciplinary concept. It spans from the evaluation of medical tests, psychiatric tests to machine learning classifiers from a variety of fields. Thus, many metrics in use exist under several names. Some of them being defined independently.

		Predicted condition		^Sources:^[4]^[5]^[6]^[7]^[8]^[9]^[10]^[11]^{view talk edit}
	Total population $= P + N$	Predicted positive (PP)	Predicted negative (PN)	Informedness, bookmaker informedness (BM) $= TPR + TNR - 1$	Prevalence threshold (PT) $= .mw-parser-output .sfrac{white-space:nowrap}.mw-parser-output .sfrac.tion,.mw-parser-output .sfrac .tion{display:inline-block;vertical-align:-0.5em;font-size:85%;text-align:center}.mw-parser-output .sfrac .num{display:block;line-height:1em;margin:0.0em 0.1em;border-bottom:1px solid}.mw-parser-output .sfrac .den{display:block;line-height:1em;margin:0.1em 0.1em}.mw-parser-output .sr-only{border:0;clip:rect(0,0,0,0);clip-path:polygon(0px 0px,0px 0px,0px 0px);height:1px;margin:-1px;overflow:hidden;padding:0;position:absolute;width:1px}⁠√TPR × FPR - FPR/TPR - FPR⁠$
Actual condition	Positive (P)^{[lower-alpha 1]}	True positive (TP), hit^{[lower-alpha 2]}	False negative (FN), miss, underestimation	True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power $= ⁠ TP / P ⁠$ $= 1 - FNR$	False negative rate (FNR), miss rate type II error ^{[lower-alpha 3]} $= ⁠ FN / P ⁠$ $= 1 - TPR$
Actual condition	Negative (N)^{[lower-alpha 4]}	False positive (FP), false alarm, overestimation	True negative (TN), correct rejection^{[lower-alpha 5]}	False positive rate (FPR), probability of false alarm, fall-out type I error ^{[lower-alpha 6]} $= ⁠ FP / N ⁠$ $= 1 - TNR$	True negative rate (TNR), specificity (SPC), selectivity $= ⁠ TN / N ⁠$ $= 1 - FPR$
	Prevalence $= ⁠ P / P + N ⁠$	Positive predictive value (PPV), precision $= ⁠ TP / PP ⁠$ $= 1 - FDR$	False omission rate (FOR) $= ⁠ FN / PN ⁠$ $= 1 - NPV$	Positive likelihood ratio (LR+) $= ⁠ TPR / FPR ⁠$	Negative likelihood ratio (LR−) $= ⁠ FNR / TNR ⁠$
	Accuracy (ACC) $= ⁠ TP + TN / P + N ⁠$	False discovery rate (FDR) $= ⁠ FP / PP ⁠$ $= 1 - PPV$	Negative predictive value (NPV) $= ⁠ TN / PN ⁠$ $= 1 - FOR$	Markedness (MK), deltaP (Δp) $= PPV + NPV - 1$	Diagnostic odds ratio (DOR) $= ⁠ LR+ / LR- ⁠$
	Balanced accuracy (BA) $= ⁠ TPR + TNR / 2 ⁠$	F₁ score $= ⁠ 2 PPV \times TPR / PPV + TPR ⁠$ $= ⁠ 2 TP / 2 TP + FP + FN ⁠$	Fowlkes–Mallows index (FM) $= \sqrt PPV \times TPR$	Matthews correlation coefficient (MCC) $= \sqrt TPR \times TNR \times PPV \times NPV$ $- \sqrt FNR \times FPR \times FOR \times FDR$	Threat score (TS), critical success index (CSI), Jaccard index $= ⁠ TP / TP + FN + FP ⁠$

↑ the number of real positive cases in the data
↑ A test result that correctly indicates the presence of a condition or characteristic
↑ Type II error: A test result which wrongly indicates that a particular condition or attribute is absent
↑ the number of real negative cases in the data
↑ A test result that correctly indicates the absence of a condition or characteristic
↑ Type I error: A test result which wrongly indicates that a particular condition or attribute is present

Properties of P₄ metric

Symmetry - contrasting to the F₁ metric, P₄ is symmetrical. It means - it does not change its value when dataset labeling is changed - positives named negatives and negatives named positives.
Range: $\mathrm {P} _{4}\in [0,1]$
Achieving $\mathrm {P} _{4}\approx 1$ requires all the key four conditional probabilities being close to 1.
For $\mathrm {P} _{4}\approx 0$ it is sufficient that one of the key four conditional probabilities is close to 0.

Examples, comparing with the other metrics

Dependency table for selected metrics ("true" means depends, "false" - does not depend):

	$P(+\mid C{+})$	$P(C{+}\mid +)$	$P(C{-}\mid -)$	$P(-\mid C{-})$
P₄	true	true	true	true
F₁	true	true	false	false
Informedness	false	true	true	false
Markedness	true	false	false	true

Metrics that do not depend on a given probability are prone to misrepresentation when it approaches 0.

Example 1: Rare disease detection test

Let us consider the medical test aimed to detect kind of rare disease. Population size is 100 000, while 0.05% population is infected. Test performance: 95% of all positive individuals are classified correctly (TPR=0.95) and 95% of all negative individuals are classified correctly (TNR=0.95). In such a case, due to high population imbalance, in spite of having high test accuracy (0.95), the probability that an individual who has been classified as positive is in fact positive is very low:

P(+\mid C{+})=0.0095

And now we can observe how this low probability is reflected in some of the metrics:

$\mathrm {P} _{4}=0.0370$
$\mathrm {F} _{1}=0.0188$
$\mathrm {J} =\mathbf {0.9100}$ (Informedness / Youden index)
$\mathrm {MK} =0.0095$ (Markedness)

Example 2: Image recognition - cats vs dogs

We are training neural network based image classifier. We are considering only two types of images: containing dogs (labeled as 0) and containing cats (labeled as 1). Thus, our goal is to distinguish between the cats and dogs. The classifier overpredicts in favor of cats ("positive" samples): 99.99% of cats are classified correctly and only 1% of dogs are classified correctly. The image dataset consists of 100000 images, 90% of which are pictures of cats and 10% are pictures of dogs. In such a situation, the probability that the picture containing dog will be classified correctly is pretty low:

P(C-|-)=0.01

Not all the metrics are noticing this low probability:

$\mathrm {P} _{4}=0.0388$
$\mathrm {F} _{1}=\mathbf {0.9478}$
$\mathrm {J} =0.0099$ (Informedness / Youden index)
$\mathrm {MK} =\mathbf {0.8183}$ (Markedness)

Related Research Articles

<span class="mw-page-title-main">Binary classification</span> Dividing things between two categories

Binary classification is the task of classifying the elements of a set into one of two groups. Typical binary classification problems include:

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one; in unsupervised learning it is usually called a matching matrix.

Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered for any given number of random variables. The joint distribution encodes the marginal distributions, i.e. the distributions of each of the individual random variables and the conditional probability distributions, which deal with how the outputs of one random variable are distributed when given information on the outputs of the other random variable(s).

<span class="mw-page-title-main">Receiver operating characteristic</span> Diagnostic plot of binary classifier ability

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier model at varying threshold values.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

Given a population whose members each belong to one of a number of different sets or classes, a classification rule or classifier is a procedure by which the elements of the population set are each predicted to belong to one of the classes. A perfect classification is one for which every element in the population is assigned to the class it really belongs to. The bayes classifier is the classifier which assigns classes optimally based on the known attributes of the elements to be classified.

<span class="mw-page-title-main">F-score</span> Statistical measure of a tests accuracy

In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.

In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expected proportion of "discoveries" that are false. Equivalently, the FDR is the expected ratio of the number of false positive classifications to the total number of positive classifications. The total number of rejections of the null include both the number of false positives (FP) and true positives (TP). Simply put, FDR = FP /. FDR-controlling procedures provide less stringent control of Type I errors compared to family-wise error rate (FWER) controlling procedures, which control the probability of at least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.

The Rand index or Rand measure in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. The Rand index is the accuracy of determining if a link belongs within a cluster or not.

In medicine and statistics, sensitivity and specificity mathematically describe the accuracy of a test that reports the presence or absence of a medical condition. If individuals who have the condition are considered "positive" and those who do not are considered "negative", then sensitivity is a measure of how well a test can identify true positives and specificity is a measure of how well a test can identify true negatives:

The Dice-Sørensen coefficient is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Lee Raymond Dice and Thorvald Sørensen, who published in 1945 and 1948 respectively.

<span class="mw-page-title-main">Precision and recall</span> Pattern-recognition performance metrics

In pattern recognition, information retrieval, object detection and classification, precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.

In statistics, the phi coefficient is a measure of association for two binary variables.

<span class="mw-page-title-main">Diagnostic odds ratio</span>

In medical testing with binary classification, the diagnostic odds ratio (DOR) is a measure of the effectiveness of a diagnostic test. It is defined as the ratio of the odds of the test being positive if the subject has a disease relative to the odds of the test being positive if the subject does not have the disease.

The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings, and also a metric to measure confusion matrices. This measure of similarity could be either between two hierarchical clusterings or a clustering and a benchmark classification. A higher value for the Fowlkes–Mallows index indicates a greater similarity between the clusters and the benchmark classifications. It was invented by Bell Labs statisticians Edward Fowlkes and Collin Mallows in 1983.

<span class="mw-page-title-main">Evaluation of binary classifiers</span> Quantitative measurement of accuracy

Evaluation of a binary classifier typically assigns a numerical value, or values, to a classifier that represent its accuracy. An example is error rate, which measures how frequently the classifier makes a mistake.

Evaluation measures for an information retrieval (IR) system assess how well an index, search engine, or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms.

Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine-learning models. Decisions made by computers after a machine-learning process may be considered unfair if they were based on variables considered sensitive. For example gender, ethnicity, sexual orientation or disability. As it is the case with many ethical concepts, definitions of fairness and bias are always controversial. In general, fairness and bias are considered relevant when the decision process impacts people's lives. In machine learning the problem of algorithmic bias is well known and well studied. Outcomes may be skewed by a range of factors and thus might be considered unfair with respect to certain groups or individuals. An example would be the way social media sites deliver personalized news to consumer.

The Partial Area Under the ROC Curve (pAUC) is a metric for the performance of binary classifier.

References

↑ Sitarz, Mikolaj (2023). "Extending F1 Metric, Probabilistic Approach". Advances in Artificial Intelligence and Machine Learning. 03 (2): 1025–1038. arXiv: 2210.11997 . doi:10.54364/AAIML.2023.1161.
↑ "P4 metric, a new way to evaluate binary classifiers".
↑ Hand, David J.; Christen, Peter; Ziyad, Sumayya (2024). "Selecting a classification performance measure: Matching the measure to the problem". arXiv: 2409.12391 [cs.LG].
↑ Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.
↑ Provost, Foster; Tom Fawcett (2013-08-01). "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking". O'Reilly Media, Inc.
↑ Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
↑ Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN 978-0-387-30164-8.
↑ Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
↑ Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1): 6-1–6-13. doi: 10.1186/s12864-019-6413-7 . PMC 6941312 . PMID 31898477.
↑ Chicco D, Toetsch N, Jurman G (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13. doi: 10.1186/s13040-021-00244-z . PMC 7863449 . PMID 33541410.
↑ Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi: 10.1016/j.aci.2018.08.003 .

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[12] the number of real positive cases in the data

[13] A test result that correctly indicates the presence of a condition or characteristic

[14] Type II error: A test result which wrongly indicates that a particular condition or attribute is absent

[15] the number of real negative cases in the data

[16] A test result that correctly indicates the absence of a condition or characteristic

[17] Type I error: A test result which wrongly indicates that a particular condition or attribute is present

[1] Sitarz, Mikolaj (2023). "Extending F1 Metric, Probabilistic Approach". Advances in Artificial Intelligence and Machine Learning. 03 (2): 1025–1038. arXiv: 2210.11997 . doi:10.54364/AAIML.2023.1161.

[2] "P4 metric, a new way to evaluate binary classifiers".

[3] Hand, David J.; Christen, Peter; Ziyad, Sumayya (2024). "Selecting a classification performance measure: Matching the measure to the problem". arXiv: 2409.12391 [cs.LG].

[4] Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.

[5] Provost, Foster; Tom Fawcett (2013-08-01). "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking". O'Reilly Media, Inc.

[6] Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.

[7] Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN 978-0-387-30164-8.

[8] Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.

[9] Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1): 6-1–6-13. doi: 10.1186/s12864-019-6413-7 . PMC 6941312 . PMID 31898477.

[10] Chicco D, Toetsch N, Jurman G (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13. doi: 10.1186/s13040-021-00244-z . PMC 7863449 . PMID 33541410.

[11] Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi: 10.1016/j.aci.2018.08.003 .

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[lower-alpha 1]

[lower-alpha 2]

[lower-alpha 3]

[lower-alpha 4]

[lower-alpha 5]

[lower-alpha 6]

v t e Machine learning evaluation metrics
Regression	MSE MAE sMAPE MAPE MASE MSPE RMS RMSE/RMSD R² MDA MAD
Classification	F-score P4 Accuracy Precision Recall Kappa MCC AUC ROC Sensitivity and specificity Logarithmic Loss
Clustering	Silhouette Calinski-Harabasz index Davies-Bouldin Dunn index Hopkins statistic Jaccard index Rand index Similarity measure SMC SimHash
Ranking	MRR NDCG AP
Computer Vision	PSNR SSIM IoU
NLP	Perplexity BLEU
Deep Learning Related Metrics	Inception score FID
Recommender system	Coverage Intra-list Similarity
Similarity	Cosine similarity Euclidean distance Pearson correlation coefficient
Confusion matrix

P4-metric

Contents

Justification

Definition

Evaluation of the binary classifier performance

Properties of P₄ metric

Examples, comparing with the other metrics

Example 1: Rare disease detection test

Example 2: Image recognition - cats vs dogs

See also

Related Research Articles

References

P4-metric

Contents

Justification

Definition

Evaluation of the binary classifier performance

Properties of P4 metric

Examples, comparing with the other metrics

Example 1: Rare disease detection test

Example 2: Image recognition - cats vs dogs

See also

Related Research Articles

References

Properties of P₄ metric