Concept drift

In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.

Predictive model decay

In machine learning and predictive analytics, this drift phenomenon is called concept drift. In machine learning, a common element of a data model is its statistical properties, such as the probability distribution of the actual data. If these deviate from the statistical properties of the training data set, the learned predictions may become invalid if the drift is not addressed.[1][2][3][4]
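One simple way to notice such deviation in practice is to compare the distribution of an input feature in the training set against the same feature in recently collected data. The following sketch is only an illustration, not the method of any of the cited references; strictly speaking it flags a shift in an input feature's distribution, which is one of several signals that the training assumptions no longer hold. The data are synthetic and the significance level is an arbitrary choice.

```python
# A minimal drift-check sketch: compare one feature's distribution in the
# training data against recent production data with a two-sample
# Kolmogorov-Smirnov test. All names and thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature as seen at training time
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # recent data: the mean has shifted

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # reject "same distribution" at the 1% level
    print(f"distribution shift detected: KS statistic {statistic:.3f}, p-value {p_value:.2e}")
```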

Data configuration decay

Another important area is software engineering, where three types of data drift affecting data fidelity can be recognized. Changes in the software environment ("infrastructure drift") may invalidate the software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" occurs when the meaning of the data changes while its structure does not. In many cases this happens in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes on other areas of the software system.[5][6]

For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., due to changes in the business model, system updates, or switching the platform on which the system operates.[6]

In the case of cloud computing, infrastructure drift that may affect applications running in the cloud may be caused by updates to the cloud software.[5]

There are several types of detrimental effects of data drift on data fidelity. Data corrosion occurs when drifted data passes into the system undetected. Data loss happens when valid data are ignored due to non-conformance with the applied schema. Squandering is the phenomenon in which new data fields are introduced upstream in the data processing pipeline but are absent somewhere downstream.[6]
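As a concrete illustration, a pipeline can guard against these effects by validating each record against the expected schema and surfacing unknown fields rather than silently discarding them. The sketch below is minimal and hypothetical; the schema, field names, and records are invented.

```python
# A minimal, hypothetical schema guard. Missing required fields illustrate
# structural drift (and the data loss that strict rejection causes);
# unknown extra fields are reported rather than silently discarded, which
# is how squandering goes unnoticed in real pipelines.
EXPECTED_FIELDS = {"order_id", "amount", "currency"}  # invented schema

def check_record(record: dict) -> tuple[set, set]:
    missing = EXPECTED_FIELDS - record.keys()
    unexpected = set(record.keys()) - EXPECTED_FIELDS
    return missing, unexpected

records = [
    {"order_id": 1, "amount": 9.99, "currency": "EUR"},                       # conforms
    {"order_id": 2, "amount": 5.00},                                          # a field disappeared
    {"order_id": 3, "amount": 7.50, "currency": "USD", "channel": "mobile"},  # a new field appeared
]

for rec in records:
    missing, unexpected = check_record(rec)
    if missing:
        print(f"record {rec['order_id']} rejected, missing fields: {missing}")
    elif unexpected:
        print(f"record {rec['order_id']} accepted, but fields {unexpected} are unknown downstream")
```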

Inconsistent data

"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as due to errors during data input. [7]

"Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple drift detection is to run checksum regularly. However the remedy may be not so easy. [8]

Examples

The behavior of the customers in an online shop may change over time. Suppose, for example, that weekly merchandise sales are to be predicted, and a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality: shopping behavior changes with the seasons, so there may be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target variable less accurately; confounding variables may have emerged that simply cannot be accounted for, causing the model's accuracy to decrease progressively with time. Generally, it is advisable to perform health checks as part of post-production analysis and to retrain the model with new assumptions upon signs of concept drift.
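A minimal sketch of such a health check is shown below; all numbers are illustrative. A rolling error metric is tracked for the deployed model, and a flag is raised once it degrades well past the baseline measured when the model was validated.

```python
# Post-production health check: track a rolling mean absolute error (MAE)
# and flag possible concept drift once it exceeds the validation baseline
# by a chosen factor. Baseline, factor, and window are invented values.
from collections import deque

BASELINE_MAE = 10.0   # error level measured when the model was validated
ALERT_FACTOR = 1.5    # alert once rolling error exceeds 150% of baseline
WINDOW = 8            # number of recent weeks to average over

recent_errors: deque[float] = deque(maxlen=WINDOW)

def record_week(predicted_sales: float, actual_sales: float) -> None:
    recent_errors.append(abs(predicted_sales - actual_sales))
    rolling_mae = sum(recent_errors) / len(recent_errors)
    if len(recent_errors) == WINDOW and rolling_mae > ALERT_FACTOR * BASELINE_MAE:
        print(f"possible concept drift: rolling MAE {rolling_mae:.1f} "
              f"vs baseline {BASELINE_MAE:.1f}; consider retraining")

# Simulated weeks in which predictions deviate more and more from reality.
for deviation in range(5, 45, 3):
    record_week(predicted_sales=100.0, actual_sales=100.0 + deviation)
```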

Possible remedies

To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test,[9][10] to explicitly detect concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy.[11][12] A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples,[13] and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.[14]
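As an illustration of a reactive trigger, the sketch below monitors a classifier's online error rate in the spirit of the drift detection method (DDM) of Gama et al.,[11] signalling a warning when the rate rises roughly two standard deviations above its historical minimum and drift at three. The 30-example warm-up and the thresholds follow commonly cited defaults, but this is a simplification, not a reference implementation.

```python
# Error-rate change detection in the spirit of DDM: track the running error
# rate p and its standard deviation s; compare p + s against the lowest
# value observed so far, with 2-sigma warning and 3-sigma drift thresholds.
import math
import random

class DriftDetector:
    def __init__(self) -> None:
        self.n = 0                 # examples seen so far
        self.p = 1.0               # running error rate
        self.p_min = float("inf")  # lowest error rate observed
        self.s_min = float("inf")  # its standard deviation

    def update(self, error: bool) -> str:
        self.n += 1
        self.p += (error - self.p) / self.n            # incremental mean
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < 30:
            return "learning"                          # warm-up period
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + 3 * self.s_min:
            return "drift"      # replace/retrain the model on recent data
        if self.p + s > self.p_min + 2 * self.s_min:
            return "warning"    # start buffering examples for retraining
        return "stable"

# Simulated stream: the classifier's error probability jumps halfway through.
random.seed(1)
detector = DriftDetector()
for t in range(500):
    err_prob = 0.1 if t < 250 else 0.4
    if detector.update(random.random() < err_prob) == "drift":
        print(f"drift signalled at example {t}: retrain here")
        break
```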

Contextual information, when available, can be used to better explain the causes of concept drift: for instance, in the sales-prediction application, concept drift might be compensated for by adding information about the season to the model. By providing information about the time of year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model: new factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
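A minimal sketch of adding such seasonal context, assuming each observation carries a date: the month is encoded cyclically so that December and January end up close together in feature space.

```python
# Append a cyclic month encoding to the model's existing inputs.
# The feature values (ad spend, promotion flag) are invented.
import math
from datetime import date

def season_features(d: date) -> tuple[float, float]:
    angle = 2 * math.pi * (d.month - 1) / 12
    return math.sin(angle), math.cos(angle)

# Existing inputs (ad spend, promotions, ...) plus the seasonal pair:
features = [1200.0, 1.0, *season_features(date(2024, 12, 15))]
print(features)
```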

Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.

See also

  • Neural network (machine learning)
  • Data mining
  • Boosting (machine learning)
  • Machine learning
  • Learning classifier system
  • Data stream mining
  • Sentiment analysis
  • Multi-label classification
  • Anomaly detection
  • Activity recognition
  • One-class classification
  • Ensemble learning
  • Incremental decision tree
  • Synthetic data
  • Massive Online Analysis (MOA)
  • Geoffrey I. Webb
  • Bing Liu
  • Automated machine learning (AutoML)

References

  1. Koggalahewa, Darshika; Xu, Yue; Foo, Ernest (2021). "A Drift Aware Hierarchical Test Based Approach for Combating Social Spammers in Online Social Networks". Data Mining. Communications in Computer and Information Science. Vol. 1504. pp. 47–61. doi:10.1007/978-981-16-8531-6_4. ISBN 978-981-16-8530-9. S2CID 245009299.
  2. Widmer, Gerhard; Kubat, Miroslav (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning. 23: 69–101. doi:10.1007/BF00116900. S2CID 206767784.
  3. Xia, Yuan; Zhao, Yunlong (2020). "A Drift Detection Method Based on Diversity Measure and McDiarmid's Inequality in Data Streams". Green, Pervasive, and Cloud Computing. Lecture Notes in Computer Science. Vol. 12398. pp. 115–122. doi:10.1007/978-3-030-64243-3_9. ISBN 978-3-030-64242-6. S2CID 227275380.
  4. Lu, Jie; Liu, Anjin; Dong, Fan; Gu, Feng; Gama, Joao; Zhang, Guangquan (2018). "Learning under Concept Drift: A Review". IEEE Transactions on Knowledge and Data Engineering: 1. arXiv:2004.05785. doi:10.1109/TKDE.2018.2876857. S2CID 69449458.
  5. "Driftctl and Terraform, they're two of a kind!"
  6. Girish Pancha, "Big Data's Hidden Scourge: Data Drift", CMSWire, April 8, 2016.
  7. Matthew Magne, "Data Drift Happens: 7 Pesky Problems with People Data", InformationWeek, July 19, 2017.
  8. Daniel Nichter, Efficient MySQL Performance, 2021, ISBN 1098105060, p. 299.
  9. Basseville, Michele (1993). Detection of abrupt changes: theory and application. Prentice Hall. ISBN 0-13-126780-9. OCLC 876004326.
  10. Alippi, C.; Roveri, M. (2007). "Adaptive Classifiers in Stationary Conditions". 2007 International Joint Conference on Neural Networks. IEEE. pp. 1008–13. doi:10.1109/ijcnn.2007.4371096. ISBN 978-1-4244-1380-5. S2CID 16255206.
  11. Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. (2004). "Learning with Drift Detection". Advances in Artificial Intelligence – SBIA 2004. Springer. pp. 286–295. doi:10.1007/978-3-540-28645-5_29. ISBN 978-3-540-28645-5. S2CID 2606652.
  12. Alippi, C.; Boracchi, G.; Roveri, M. (2011). "A just-in-time adaptive classification system based on the intersection of confidence intervals rule". Neural Networks. 24 (8): 791–800. doi:10.1016/j.neunet.2011.05.012. PMID 21723706.
  13. Widmer, G.; Kubat, M. (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning. 23 (1): 69–101. doi:10.1007/bf00116900. S2CID 206767784.
  14. Elwell, R.; Polikar, R. (2011). "Incremental Learning of Concept Drift in Nonstationary Environments". IEEE Transactions on Neural Networks. 22 (10): 1517–31. doi:10.1109/tnn.2011.2160459. PMID 21824845. S2CID 9136731.

Further reading

Many papers have been published describing algorithms for concept drift detection; see the surveys cited above, such as Lu et al.[4]

Software

  • Massive Online Analysis (MOA), a free open-source software project specific to data stream mining with concept drift, written in Java and developed at the University of Waikato, New Zealand.

Datasets

Real

  • USP Data Stream Repository, 27 real-world stream datasets with concept drift, compiled by Souza et al. (2020). Access
  • Airline, approximately 116 million flight arrival and departure records (cleaned and sorted), compiled by E. Ikonomovska. Reference: Data Expo 2009 Competition. Access
  • Chess.com (online games) and Luxembourg (social survey) datasets, compiled by I. Zliobaite. Access
  • ECUE spam, 2 datasets, each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual. Access from S. J. Delany's webpage
  • Elec2, electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. Access from J. Gama's webpage. Comment on applicability.
  • PAKDD'09 competition data, representing a credit evaluation task, collected over a five-year period. Unfortunately, the true labels are released only for the first part of the data. Access
  • Sensor stream and Power supply stream datasets, available from X. Zhu's Stream Data Mining Repository. Access
  • SMEAR, a benchmark data stream with many missing values, consisting of environmental observation data collected over 7 years; the task is to predict cloudiness. Access
  • Text mining, a collection of text mining datasets with concept drift, maintained by I. Katakis. Access
  • Gas Sensor Array Drift Dataset, a collection of 13,910 measurements from 16 chemical sensors, used for drift compensation in a discrimination task of 6 gases at various concentration levels. Access

Other

  • KDD'99 competition data contains simulated intrusions in a military network environment. It is often used as a benchmark for evaluating the handling of concept drift. Access
