Leakage (machine learning)

Last updated October 19, 2024

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.^[1]

Leakage modes

Leakage can occur in many steps in the machine learning process. The leakage causes can be sub-classified into two possible sources of leakage for a model: features and training examples.^[1]

Feature leakage

Feature or column-wise leakage is caused by the inclusion of columns which are one of the following: a duplicate label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for predictions, and result in leakage if included when the model is trained.^[2]

For example, including a "MonthlySalary" column when predicting "YearlySalary"; or "MinutesLate" when predicting "IsLate".

Training example leakage

Row-wise leakage is caused by improper sharing of information between rows of data. Types of row-wise leakage include:

Premature featurization; leaking from premature featurization before Cross-validation/Train/Test split (must fit MinMax/ngrams/etc on only the train split, then transform the test set)
Duplicate rows between train/validation/test (e.g. oversampling a dataset to pad its size before splitting; e.g. different rotations/augmentations of a single image; bootstrap sampling before splitting; or duplicating rows to up sample the minority class)
Non-i.i.d. data
- Time leakage (e.g. splitting a time-series dataset randomly instead of newer data in test set using a TrainTest split or rolling-origin cross validation)
- Group leakage—not including a grouping split column (e.g. Andrew Ng's group had 100k x-rays of 30k patients, meaning ~3 images per patient. The paper used random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays.^[3]^[4])

A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.^[5]

Detection

Data leakage in machine learning can be detected through various methods, focusing on performance analysis, feature examination, data auditing, and model behavior analysis. Performance-wise, unusually high accuracy or significant discrepancies between training and test results often indicate leakage.^[6] Inconsistent cross-validation outcomes may also signal issues.

Feature examination involves scrutinizing feature importance rankings and ensuring temporal integrity in time series data. A thorough audit of the data pipeline is crucial, reviewing pre-processing steps, feature engineering, and data splitting processes.^[7] Detecting duplicate entries across dataset splits is also important.

Analyzing model behavior can reveal leakage. Models relying heavily on counter-intuitive features or showing unexpected prediction patterns warrant investigation. Performance degradation over time when tested on new data may suggest earlier inflated metrics due to leakage.

Advanced techniques include backward feature elimination, where suspicious features are temporarily removed to observe performance changes. Using a separate hold-out dataset for final validation before deployment is advisable.^[7]

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data to expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

<span class="mw-page-title-main">Overfitting</span> Flaw in mathematical modelling

In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitted model is a mathematical model that contains more parameters than can be justified by the data. In a mathematical sense, these parameters represent the degree of a polynomial. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Quick progress in the field of deep learning, beginning in 2010s, allowed neural networks to surpass many previous approaches in performance.

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. Cross-validation includes resampling and sample splitting methods that use different portions of the data to test and train a model on different iterations. It is often used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. It can also be used to assess the quality of a fitted model and the stability of its parameters.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and test sets.

<span class="mw-page-title-main">Orange (software)</span> Open-source data analysis software

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for exploratory qualitative data analysis and interactive data visualization.

Within statistics, oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set. These terms are used both in statistical sampling, survey design methodology and in machine learning.

Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging). Bagging uses subsampling with replacement to create training samples for the model to learn from. OOB error is the mean prediction error on each training sample $x i$ , using only the trees that did not have $x i$ in their bootstrap sample.

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. It is the combination of automation and ML.

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANN), a widely used model in the field of machine learning. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy and performance estimation strategy used:

<span class="mw-page-title-main">ML.NET</span> Machine learning library

ML.NET is a free software machine learning library for the C# and F# programming languages. It also supports Python models when used together with NimbusML. The preview release of ML.NET included transforms for feature engineering like n-gram creation, and learners to handle binary classification, multi-class classification, and regression tasks. Additional ML tasks like anomaly detection and recommendation systems have since been added, and other approaches like deep learning will be included in future versions.

Federated learning is a sub-field of machine learning focusing on settings in which multiple entities collaboratively train a model while ensuring that their data remains decentralized. This stands in contrast to machine learning settings in which data is centrally stored. One of the primary defining characteristics of federated learning is data heterogeneity. Due to the decentralized nature of the clients' data, there is no guarantee that data samples held by each client are independently and identically distributed.

LightGBM, short for Light Gradient-Boosting Machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

Conformal prediction (CP) is a machine learning framework for uncertainty quantification that produces statistically valid prediction regions for any underlying point predictor only assuming exchangeability of the data. CP works by computing nonconformity scores on previously labeled data, and using these to create prediction sets on a new (unlabeled) test data point. A transductive version of CP was first proposed in 1998 by Gammerman, Vovk, and Vapnik, and since, several variants of conformal prediction have been developed with different computational complexities, formal guarantees, and practical applications.

Applications of machine learning in earth sciences include geological mapping, gas leakage detection and geological features identification. Machine learning (ML) is a type of artificial intelligence (AI) that enables computer systems to classify, cluster, identify and analyze vast and complex sets of data while eliminating the need for explicit instructions and programming. Earth science is the study of the origin, evolution, and future of the planet Earth. The Earth system can be subdivided into four major components including the solid earth, atmosphere, hydrosphere and biosphere.

Y.3181 is an ITU-T Recommendation specifying an Architectural framework for Machine Learning Sandbox in future networks. The standard describes the requirements and architecture for a machine learning sandbox a in future networks including IMT-2020.

References

1 2 3 Shachar Kaufman; Saharon Rosset; Claudia Perlich (January 2011). "Leakage in data mining: Formulation, detection, and avoidance". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 6. pp. 556–563. doi:10.1145/2020408.2020496. ISBN 9781450308137. S2CID 9168804 . Retrieved 13 January 2020.
↑ Soumen Chakrabarti (2008). "9". Data Mining: Know it All. Morgan Kaufmann Publishers. p. 383. ISBN 978-0-12-374629-0. Anachronistic variables are a pernicious mining problem. However, they aren't any problem at all at deployment time—unless someone expects the model to work! Anachronistic variables are out of place in time. Specifically, at data modeling time, they carry information back from the future to the past.
↑
Guts, Yuriy (30 October 2018). Yuriy Guts. TARGET LEAKAGE IN MACHINE LEARNING (Talk). AI Ukraine Conference. Ukraine – via YouTube.
- Yuriy Guts. "Target Leakage in ML" (PDF). AI Ukraine Online Conference.
↑ Nick, Roberts (16 November 2017). "Replying to @AndrewYNg @pranavrajpurkar and 2 others". Brooklyn, NY, USA: Twitter. Archived from the original on 10 June 2018. Retrieved 13 January 2020. Replying to @AndrewYNg @pranavrajpurkar and 2 others ... Were you concerned that the network could memorize patient anatomy since patients cross train and validation? "ChestX-ray14 dataset contains 112,120 frontal-view X-ray images of 30,805 unique patients. We randomly split the entire dataset into 80% training, and 20% validation."
↑ Kapoor, Sayash; Narayanan, Arvind (August 2023). "Leakage and the reproducibility crisis in machine-learning-based science". Patterns. 4 (9): 100804. doi: 10.1016/j.patter.2023.100804 . ISSN 2666-3899. PMC 10499856 . PMID 37720327.
↑ Batutin, Andrew (2024-06-20). "Data Leakage in Machine Learning Models". Shelf. Retrieved 2024-10-18.
1 2 "What is Data Leakage in Machine Learning? | IBM". www.ibm.com. 2024-09-30. Retrieved 2024-10-18.

This artificial intelligence-related article is a stub. You can help Wikipedia by expanding it.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[KaufmanKDD11-1] 1 2 3 Shachar Kaufman; Saharon Rosset; Claudia Perlich (January 2011). "Leakage in data mining: Formulation, detection, and avoidance". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 6. pp. 556–563. doi:10.1145/2020408.2020496. ISBN 9781450308137. S2CID 9168804 . Retrieved 13 January 2020.

[2] Soumen Chakrabarti (2008). "9". Data Mining: Know it All. Morgan Kaufmann Publishers. p. 383. ISBN 978-0-12-374629-0. Anachronistic variables are a pernicious mining problem. However, they aren't any problem at all at deployment time—unless someone expects the model to work! Anachronistic variables are out of place in time. Specifically, at data modeling time, they carry information back from the future to the past.

[GutsAIUkraineConfTalk18-3] Guts, Yuriy (30 October 2018). Yuriy Guts. TARGET LEAKAGE IN MACHINE LEARNING (Talk). AI Ukraine Conference. Ukraine – via YouTube.
Yuriy Guts. "Target Leakage in ML" (PDF). AI Ukraine Online Conference.

[mwhQ] Yuriy Guts. "Target Leakage in ML" (PDF). AI Ukraine Online Conference.

[4] Nick, Roberts (16 November 2017). "Replying to @AndrewYNg @pranavrajpurkar and 2 others". Brooklyn, NY, USA: Twitter. Archived from the original on 10 June 2018. Retrieved 13 January 2020. Replying to @AndrewYNg @pranavrajpurkar and 2 others ... Were you concerned that the network could memorize patient anatomy since patients cross train and validation? "ChestX-ray14 dataset contains 112,120 frontal-view X-ray images of 30,805 unique patients. We randomly split the entire dataset into 80% training, and 20% validation."

[5] Kapoor, Sayash; Narayanan, Arvind (August 2023). "Leakage and the reproducibility crisis in machine-learning-based science". Patterns. 4 (9): 100804. doi: 10.1016/j.patter.2023.100804 . ISSN 2666-3899. PMC 10499856 . PMID 37720327.

[6] Batutin, Andrew (2024-06-20). "Data Leakage in Machine Learning Models". Shelf. Retrieved 2024-10-18.

[:0-7] 1 2 "What is Data Leakage in Machine Learning? | IBM". www.ibm.com. 2024-09-30. Retrieved 2024-10-18.

[1]

[2]

[3]

[4]

[5]

[6]

[7]