Conformal prediction

Last updated

Conformal prediction (CP) is a machine learning framework for uncertainty quantification that produces statistically valid prediction regions (prediction intervals) for any underlying point predictor (whether statistical, machine, or deep learning) only assuming exchangeability of the data. CP works by computing nonconformity scores on previously labeled data, and using these to create prediction sets on a new (unlabeled) test data point. A transductive version of CP was first proposed in 1998 by Gammerman, Vovk, and Vapnik, [1] and since, several variants of conformal prediction have been developed with different computational complexities, formal guarantees, and practical applications. [2]

Contents

Conformal prediction requires a user-specified significance level for which the algorithm should produce its predictions. This significance level restricts the frequency of errors that the algorithm is allowed to make. For example, a significance level of 0.1 means that the algorithm can make at most 10% erroneous predictions. To meet this requirement, the output is a set prediction, instead of a point prediction produced by standard supervised machine learning models. For classification tasks, this means that predictions are not a single class, for example 'cat', but instead a set like {'cat', 'dog'}. Depending on how good the underlying model is (how well it can discern between cats, dogs and other animals) and the specified significance level, these sets can be smaller or larger. For regression tasks, the output is prediction intervals, where a smaller significance level (fewer allowed errors) produces wider intervals which are less specific, and vice versa – more allowed errors produce tighter prediction intervals. [3] [4] [5] [6]

History

The conformal prediction first arose in a collaboration between Gammerman, Vovk, and Vapnik in 1998; [1] this initial version of conformal prediction used what are now called E-values though the version of conformal prediction best known today uses p-values and was proposed a year later by Saunders et al. [7] Vovk, Gammerman, and their students and collaborators, particularly Craig Saunders, Harris Papadopoulos, and Kostas Proedrou, continued to develop the ideas of conformal prediction; major developments include the proposal of inductive conformal prediction (a.k.a. split conformal prediction), in 2002. [8] A book on the topic was written by Vovk and Shafer in 2005, [3] and a tutorial was published in 2008. [9]

Theory

The data has to conform to some standards, such as data being exchangeable (a slightly weaker assumption than the standard IID imposed in standard machine learning). For conformal prediction, a n% prediction region is said to be valid if the truth is in the output n% of the time. [3] The efficiency is the size of the output. For classification, this size is the number of classes; for regression, it is interval width. [9]

In the purest form, conformal prediction is made for an online (transductive) section. That is, after a label is predicted, its true label is known before the next prediction. Thus, the underlying model can be re-trained using this new data point and the next prediction will be made on a calibration set containing n + 1 data points, where the previous model had n data points. [9]

Classification algorithms

The goal of standard classification algorithms is to classify a test object into one of several discrete classes. Conformal classifiers instead compute and output the p-value for each available class by performing a ranking of the nonconformity measure (α-value) of the test object against examples from the training data set. Similar to standard hypothesis testing, the p-value together with a threshold (referred to as significance level in the CP field) is used to determine whether the label should be in the prediction set. For example, for a significance level of 0.1, all classes with a p-value of 0.1 or greater are added to the prediction set. Transductive algorithms compute the nonconformity score using all available training data, while inductive algorithms compute it on a subset of the training set.

Inductive conformal prediction (ICP)

Inductive Conformal Prediction was first known as inductive confidence machines, [10] but was later re-introduced as ICP. It has gained popularity in practical settings because the underlying model does not need to be retrained for every new test example. This makes it interesting for any model that is heavy to train, such as neural networks. [11]

Mondrian inductive conformal prediction (MICP)

In MICP, the alpha values are class-dependent (Mondrian) and the underlying model does not follow the original online setting introduced in 2005. [4]

Training algorithm:

  1. Train a machine learning model (MLM)
  2. Run a calibration set through the MLM, save output from the chosen stage
    • In deep learning, the softmax values are often used
  3. Use a non-conformity function to compute α-values
    • A data point in the calibration set will result in an α-value for its true class

Prediction algorithm:

  1. For a test data point, generate a new α-value
  2. Find a p-value for each class of the data point
  3. If the p-value is greater than the significance level, include the class in the output [4]

Regression algorithms

Conformal prediction was initially formulated for the task of classification, but was later modified for regression. Unlike classification, which outputs p-values without a given significance level, regression requires a fixed significance level at prediction time in order to produce prediction intervals for a new test object. For classic conformal regression, there is no transductive algorithm. This is because it is impossible to postulate all possible labels for a new test object, because the label space is continuous. The available algorithms are all formulated in the inductive setting, which computes a prediction rule once and applies it to all future predictions.

Inductive conformal prediction (ICP)

All inductive algorithms require splitting the available training examples into two disjoint sets: one set used for training the underlying model (the proper training set) and one set for calibrating the prediction (the calibration set). In ICP, this split is done once, thus training a single ML model. If the split is performed randomly and that data is exchangeable, the ICP model is proven to be automatically valid (i.e. the error rate corresponds to the required significance level).

Training algorithm:

  1. Split the training data into proper trainingset and calibration set
  2. Train the underlying ML model using the proper training set
  3. Predict the examples from the calibration set using the derived ML model → ŷ-values
  4. Optional: if using a normalized nonconformity function
    1. Train the normalization ML model
    2. Predict normalization scores → 𝜺 -values
  5. Compute the nonconformity measures (α-values) for all calibration examples, using ŷ- and 𝜺-values
  6. Sort the nonconformity measure and generate nonconformity scores
  7. Save underlying ML model, normalization ML model (if any) and nonconformity scores

Prediction algorithm:

Required input: significance level (s)

  1. Predict the test object using the ML model → ŷt
  2. Optional: if using a normalized nonconformity function
    1. Predict the test object using normalization model → 𝜺t
  3. Pick the nonconformity score from the list of scores produced by the calibration set in training, corresponding to the significance level sαs
  4. Compute the prediction interval half width (d) from rearranging the nonconformity function and input αs (and optionally 𝜺) → d
  5. Output prediction interval (ŷd, ŷ + d) for the given significance level s

Split conformal prediction (SCP)

The SCP, often called aggregated conformal predictor (ACP), can be considered an ensemble of ICPs. SCP usually improves the efficiency of predictions (that is, it creates smaller prediction intervals) compared to a single ICP, but loses the automatic validity in the generated predictions.

A common type of SCPs is the cross-conformal predictor (CCP), which splits the training data into proper training and calibration sets multiple times in a strategy similar to k-fold cross-validation. Regardless of the splitting technique, the algorithm performs n splits and trains an ICP for each split. When predicting a new test object, it uses the median ŷ and d from the n ICPs to create the final prediction interval as (ŷmediandmedian, ŷmedian + dmedian).

Applications

Types of learning models

Several machine learning models can be used in conjunction with conformal prediction. Studies have shown that it can be applied to for example convolutional neural networks, [12] support-vector machines and others.

Use case

Conformal prediction is used in a variety of fields and is an active area of research. For example, in biotechnology it has been used to predict uncertainties in breast cancer, [13] stroke risks, [14] data storage, [15] and disk drive scrubbing. [16] In the domain of hardware security it has been used to detect the evolving hardware trojans. [17] Within language technology, conformal prediction papers are routinely presented at the Symposium on Conformal and Probabilistic Prediction with Applications (COPA). [18]

Conferences

Conformal prediction is one of the main subjects discussed during the COPA conference each year. Both theory and applications of conformal predictions are presented by leaders of the field. The conference has been held since 2012. [18] It has been hosted in several different European countries including Greece, Great Britain, Italy and Sweden.

Books

Published books on Conformal Prediction includes Algorithmic Learning in a Random World, [19] Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications, [20] Practical Guide to Applied Conformal Prediction in Python: Learn and Apply the Best Uncertainty Frameworks to Your Industry Applications, [21] Conformal Prediction: A Gentle Introduction (Foundations and Trends(r) in Machine Learning), [22] and Conformal Prediction: An Inventor's Approach. [23]

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> A paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

<span class="mw-page-title-main">Overfitting</span> Flaw in mathematical modelling

In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitted model is a mathematical model that contains more parameters than can be justified by the data. In a mathematical sense, these parameters represent the degree of a polynomial. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.

The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs of given inputs that it has not encountered. Inductive bias is anything which makes the algorithm learn one pattern instead of another pattern. Learning is the process of apprehending useful knowledge by observing and interacting with the world. It involves searching a space of solutions for one expected to provide a better explanation of the data or to achieve higher rewards. But in many cases, there are multiple solutions which are equally good. An inductive bias allows a learning algorithm to prioritize one solution over another, independent of the observed data.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.

Chemometrics is the science of extracting information from chemical systems by data-driven means. Chemometrics is inherently interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors other interdisciplinary fields, such as psychometrics and econometrics.

There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean

In logic, statistical inference, and supervised learning, transduction or transductive inference is reasoning from observed, specific (training) cases to specific (test) cases. In contrast, induction is reasoning from observed training cases to general rules, which are then applied to the test cases. The distinction is most interesting in cases where the predictions of the transductive model are not achievable by any inductive model. Note that this is caused by transductive inference on different test sets producing mutually inconsistent predictions.

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern recognition, classification and regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. The concept of "feature" is related to that of explanatory variable used in statistical techniques such as linear regression.

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.

In data mining, cluster-weighted modeling (CWM) is an algorithm-based approach to non-linear prediction of outputs from inputs based on density estimation using a set of models (clusters) that are each notionally appropriate in a sub-region of the input space. The overall approach works in jointly input-output space and an initial version was proposed by Neil Gershenfeld.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involves predicting structured objects, rather than scalar discrete or real values.

In machine learning, Platt scaling or Platt calibration is a way of transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machines, replacing an earlier method by Vapnik, but can be applied to other classification models. Platt scaling works by fitting a logistic regression model to a classifier's scores.

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into ensembles.

The following outline is provided as an overview of and topical guide to machine learning:

Weak supervision is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, it can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.

References

  1. 1 2 Gammerman, Alexander; Vovk, Vladimir; Vapnik, Vladimir (1998). "Learning by transduction". Uncertainty in Artificial Intelligence. 14: 148–155.
  2. Angelopoulos, Anastasios; Bates, Stephen (2021). "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification". arXiv: 2107.07511 [cs.LG].
  3. 1 2 3 Vovk, Vladimir (2022). Algorithmic learning in a random world. A. Gammerman, Glenn Shafer. New York: Springer. doi:10.1007/978-3-031-06649-8. ISBN   978-3-031-06648-1. S2CID   118783209.
  4. 1 2 3 Toccaceli, Paolo; Gammerman, Alexander (2019-03-01). "Combination of inductive mondrian conformal predictors". Machine Learning. 108 (3): 489–510. doi: 10.1007/s10994-018-5754-9 . ISSN   1573-0565.
  5. Norinder, Ulf; Carlsson, Lars; Boyer, Scott; Eklund, Martin (2014-06-23). "Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination". Journal of Chemical Information and Modeling. 54 (6): 1596–1603. doi:10.1021/ci5001168. ISSN   1549-9596. PMID   24797111.
  6. Alvarsson, Jonathan; McShane, Staffan Arvidsson; Norinder, Ulf; Spjuth, Ola (2021-01-01). "Predicting With Confidence: Using Conformal Prediction in Drug Discovery". Journal of Pharmaceutical Sciences. 110 (1): 42–49. doi: 10.1016/j.xphs.2020.09.055 . ISSN   0022-3549. PMID   33075380. S2CID   224809705.
  7. Saunders, Craig; Gammerman, Alexander; Vovk, Vladimir (1999). "Transduction with Confidence and Credibility". International Joint Conference on Artificial Intelligence. 16: 722–726.
  8. Papadopoulos, Harris; Proedrou, Kostas; Vovk, Volodya; Gammerman, Alexander (2002). "Inductive Confidence Machines for Regression". Machine Learning: ECML 2002. European Conference on Machine Learning. Lecture Notes in Computer Science. Vol. 13. pp. 345–356. doi: 10.1007/3-540-36755-1_29 . ISBN   978-3-540-44036-9.
  9. 1 2 3 Vovk, Vladimir; Shafer, Glenn (2008-08-03). "A Tutorial on Conformal Prediction" (PDF). Journal of Machine Learning Research. 9: 371–421.
  10. Papadopoulos, Harris; Proedrou, Kostas; Vovk, Volodya; Gammerman, Alex (2002). "Inductive Confidence Machines for Regression". In Elomaa, Tapio; Mannila, Heikki; Toivonen, Hannu (eds.). Machine Learning: ECML 2002. Lecture Notes in Computer Science. Vol. 2430. Berlin, Heidelberg: Springer. pp. 345–356. doi: 10.1007/3-540-36755-1_29 . ISBN   978-3-540-36755-0.
  11. Papadopoulos, Harris; Haralambous, Haris (2010). "Neural Networks Regression Inductive Conformal Predictor and Its Application to Total Electron Content Prediction". In Diamantaras, Konstantinos; Duch, Wlodek; Iliadis, Lazaros S. (eds.). Artificial Neural Networks – ICANN 2010. Lecture Notes in Computer Science. Vol. 6352. Berlin, Heidelberg: Springer. pp. 32–41. doi:10.1007/978-3-642-15819-3_4. ISBN   978-3-642-15819-3.
  12. Papadopoulos, Harris; Vovk, Volodya; Gammerman, Alex (October 2007). "Conformal Prediction with Neural Networks". 19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007). Vol. 2. pp. 388–395. doi:10.1109/ICTAI.2007.47. ISBN   978-0-7695-3015-4. S2CID   10164217.
  13. Lambrou, A.; Papadopoulos, H.; Gammerman, A. (November 2009). "Evolutionary Conformal Prediction for Breast Cancer Diagnosis". 2009 9th International Conference on Information Technology and Applications in Biomedicine. pp. 1–4. doi:10.1109/ITAB.2009.5394447. ISBN   978-1-4244-5379-5. S2CID   15703490.
  14. Lambrou, Antonis; Papadopoulos, Harris; Kyriacou, Efthyvoulos; Pattichis, Constantinos S.; Pattichis, Marios S.; Gammerman, Alexander; Nicolaides, Andrew (2010), Papadopoulos, Harris; Andreou, Andreas S.; Bramer, Max (eds.), "Assessment of Stroke Risk Based on Morphological Ultrasound Image Analysis with Conformal Prediction", Artificial Intelligence Applications and Innovations, IFIP Advances in Information and Communication Technology, vol. 339, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 146–153, doi: 10.1007/978-3-642-16239-8_21 , ISBN   978-3-642-16238-1, S2CID   17515625
  15. Vishwakarma, Rahul (2019). New Perspective on Machine Learning Predictions Under Uncertainty (SDC 2019). SNIA SDC.
  16. Vishwakarma, Rahul; Hedayatipour, Ava; Messoudi, Soundouss; Hwang, Jinha (2021). "Enterprise Disk Drive Scrubbing Based on Mondrian Conformal Predictors". Proceedings of Machine Learning Research. 204.
  17. Vishwakarma, Rahul; Rezaei, Amin (October 2023). "Risk-Aware and Explainable Framework for Ensuring Guaranteed Coverage in Evolving Hardware Trojan Detection". 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). pp. 01–09. ISBN   979-8-3503-2225-5.
  18. 1 2 "10th Symposium on Conformal and Probabilistic Prediction with Applications (COPA 2021)". cml.rhul.ac.uk. Retrieved 2021-09-15.
  19. Vovk, Vladimir; Gammerman, Alexander; Shafer, Glenn (2005). Algorithmic learning in a random world. Vol. 29. Springer.
  20. Balasubramanian, Vineeth (2014). Ho, Shen-Shyang; Vovk, Vladimir (eds.). Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes.
  21. Manokhin, Valery (2023). Practical Guide to Applied Conformal Prediction in Python: Learn and Apply the Best Uncertainty Frameworks to Your Industry Applications. United Kingdom: Packt Publishing. ISBN   9781805120919.
  22. Angelopoulos, Anastasios N.; Bates, Stephen (2023). "Conformal prediction: A gentle introduction". Foundations and Trends® in Machine Learning. 16 (4): 494–591.
  23. Vishwakarma, Rahul Deo; Pandey, Rahul; Han, Shangdian (King); Modi, Shrey (March 12, 2024). Conformal Prediction: An Inventor's Approach. Independently published. ASIN   B0CY43HQRX. ISBN   979-8884663619.