Data stream mining

Data stream mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times, using limited computing and storage capabilities. [1]
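The single-pass, limited-memory constraint can be illustrated with a running summary that reads each instance once and keeps only a constant amount of state. The sketch below uses Welford's method for the running mean and variance; the choice of method and the names `RunningStats` and `summarize` are illustrative assumptions, not something the cited sources prescribe.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's method), O(1) memory."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one instance into the summary; the instance is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of everything seen so far (0.0 until n >= 2)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream) -> RunningStats:
    """Consume an iterable exactly once and return its running summary."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because `summarize` only iterates, it works equally well on a generator that cannot be rewound, which is the situation the definition above describes.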

In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream, given some knowledge about the class membership or values of previous instances. [2] Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. Concepts from the field of incremental learning are often applied to cope with structural changes, online learning, and real-time demands.

In many applications, especially those operating in non-stationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time, so that the class or target value to be predicted may change as well. [3] This problem is referred to as concept drift. Detecting concept drift is a central issue in data stream mining. [4] [5] Other challenges [6] that arise when applying machine learning to streaming data include partially labeled and delayed-label data, [7] [8] recovery from concept drifts, [1] and temporal dependencies. [9]

Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.

Related Research Articles

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.

Transfer learning (TL) is a technique in machine learning (ML) in which knowledge learned from a task is re-used in order to boost performance on a related task. For example, for image classification, knowledge gained while learning to recognize cars could be applied when trying to recognize trucks. This topic is related to the psychological literature on transfer of learning, although practical ties between the two fields are limited. Reusing/transferring information from previously learned tasks to new tasks has the potential to significantly improve learning efficiency.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

ECML PKDD, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, is one of the leading academic conferences on machine learning and knowledge discovery, held in Europe every year.

An incremental decision tree algorithm is an online machine learning algorithm that outputs a decision tree. Many decision tree methods, such as C4.5, construct a tree using a complete dataset. Incremental decision tree methods allow an existing tree to be updated using only new individual data instances, without having to re-process past instances. This may be useful when the entire dataset is not available at the time the tree is updated, when the original dataset is too large to process, or when the characteristics of the data change over time.

Local outlier factor

In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.
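One way to make "local deviation with respect to its neighbours" concrete is the usual LOF recipe: compare a point's local reachability density with that of its k nearest neighbours. The following is a minimal batch sketch under stated assumptions: the function name `lof_scores` is hypothetical, distances are Euclidean, ties among neighbours are broken arbitrarily, all points are assumed distinct, and at least k + 1 points are required.

```python
import math


def lof_scores(points, k: int = 2):
    """Return one LOF score per point; values near 1 mean 'as dense as neighbours'."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Indices of the k nearest neighbours of each point, excluding the point itself.
    neigh = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [dist[i][neigh[i][-1]] for i in range(n)]

    def local_reachability_density(i: int) -> float:
        # Reachability distance from i to j is max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], dist[i][j]) for j in neigh[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]

    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]
```

For four points on the corners of a unit square plus one far-away point, the corner points score 1 and the isolated point scores well above 1, which is the behaviour the description above implies.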

Active learning is a special case of machine learning in which a learning algorithm can interactively query a user to label new data points with the desired outputs. In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle.

Data science: interdisciplinary field of study on deriving knowledge and insights from data

Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.

Massive Online Analysis (MOA) is a free open-source software project specifically for data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand.

Geoffrey I. Webb is Professor of Computer Science at Monash University, founder and director of Data Mining software development and consultancy company G. I. Webb and Associates, and former editor-in-chief of the journal Data Mining and Knowledge Discovery. Before joining Monash University he was on the faculty at Griffith University from 1986 to 1988 and then at Deakin University from 1988 to 2002.

Social media mining is the process of obtaining big data from user-generated content on social media sites and mobile apps in order to extract actionable patterns, form conclusions about users, and act upon the information, often for the purpose of advertising to users or conducting research. The term is an analogy to the resource extraction process of mining for rare minerals. Resource extraction mining requires mining companies to sift through vast quantities of raw ore to find the precious minerals; likewise, social media mining requires human data analysts and automated software programs to sift through massive amounts of raw social media data in order to discern patterns and trends relating to social media usage, online behaviours, sharing of content, connections between individuals, online buying behaviour, and more. These patterns and trends are of interest to companies, governments and not-for-profit organizations, which can use them to design their strategies or introduce new programs, products, processes or services.

Author name disambiguation

Author name disambiguation is a type of disambiguation and record linkage applied to the names of individual people. The process could, for example, distinguish individuals with the name "John Smith".

Longbing Cao is an AI and data science researcher at the University of Technology Sydney, Australia. His broad research interest involves artificial intelligence, data science, behavior informatics, and their enterprise applications.

Arthur Zimek is a professor in data mining, data science and machine learning at the University of Southern Denmark in Odense, Denmark.

scikit-multiflow: machine learning library for data streams in Python

scikit-multiflow is a free and open-source machine learning library for multi-output/multi-label and stream data, written in Python.

In machine learning and data mining, quantification is the task of using supervised learning in order to train models (quantifiers) that estimate the relative frequencies of the classes of interest in a sample of unlabelled data items. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these 100,000 tweets which belong to class "Positive", and to do the same for classes "Neutral" and "Negative".
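The simplest baseline for this task, often called classify and count, labels every item with a classifier and reports the share of each predicted class. The sketch below assumes that baseline; `classify_and_count` and the keyword-based `toy_sentiment` classifier are hypothetical stand-ins for a trained model, and the baseline inherits whatever bias the classifier has.

```python
from collections import Counter


def classify_and_count(classifier, items, classes):
    """Estimate class prevalences as the fraction of items predicted per class."""
    counts = Counter(classifier(item) for item in items)
    total = len(items)
    return {c: (counts[c] / total if total else 0.0) for c in classes}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a trained sentiment classifier."""
    lowered = text.lower()
    if "good" in lowered:
        return "Positive"
    if "bad" in lowered:
        return "Negative"
    return "Neutral"


tweets = ["Good speech", "bad policy", "no comment", "a good plan"]
prevalence = classify_and_count(
    toy_sentiment, tweets, ["Positive", "Neutral", "Negative"]
)
```

Here `prevalence` maps each class name to its estimated relative frequency in the unlabelled sample, which is the quantity a quantifier is meant to report.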

Nitesh Chawla: computer scientist

Nitesh V. Chawla is a computer scientist and data scientist currently serving as the Frank M. Freimann Professor of Computer Science and Engineering at the University of Notre Dame. He is the Founding Director of the Lucy Family Institute for Data & Society. Chawla's research expertise lies in machine learning, data science, and network science. He is also the co-founder of Aunalytics, a data science software and cloud computing company. Chawla is a fellow of the Association for Computing Machinery (ACM) and a Fellow of the Institute of Electrical and Electronics Engineers (IEEE). He has received multiple awards, including the 1st Source Bank Commercialization Award in 2017 and the IBM Big Data Award in 2013. One of Chawla's most recognized publications, with a citation count of over 24,000, is the research paper titled "SMOTE: Synthetic Minority Over-sampling Technique." Chawla's research has garnered a citation count of over 56,000 and an H-index of 78.

References

  1. Gomes, Heitor M.; Bifet, Albert; Read, Jesse; Barddal, Jean Paul; Enembreck, Fabrício; Pfahringer, Bernhard; Holmes, Geoff; Abdessalem, Talel (2017-10-01). "Adaptive random forests for evolving data stream classification". Machine Learning. 106 (9): 1469–1495. doi:10.1007/s10994-017-5642-8. hdl:10289/11231. ISSN 1573-0565.
  2. Medhat, Mohamed; Zaslavsky; Krishnaswamy (2005-06-01). "Mining data streams". ACM SIGMOD Record. 34 (2): 18–26. doi:10.1145/1083784.1083789. S2CID 705946.
  3. Lemaire, Vincent; Salperwyck, Christophe; Bondu, Alexis (2015), Zimányi, Esteban; Kutsche, Ralf-Detlef (eds.), "A Survey on Supervised Classification on Data Streams", Business Intelligence: 4th European Summer School, eBISS 2014, Berlin, Germany, July 6–11, 2014, Tutorial Lectures, Lecture Notes in Business Information Processing, Springer International Publishing, pp. 88–125, doi:10.1007/978-3-319-17551-5_4, ISBN 978-3-319-17551-5
  4. Webb, Geoffrey I.; Lee, Loong Kuan; Petitjean, François; Goethals, Bart (2017-04-02). "Understanding Concept Drift". arXiv:1704.00362 [cs.LG].
  5. Gama, João; Žliobaitė; Bifet; Pechenizkiy; Bouchachia (2014-03-01). "A survey on concept drift adaptation" (PDF). ACM Computing Surveys. 46 (4): 1–37. doi:10.1145/2523813. S2CID 207208264.
  6. Gomes, Heitor Murilo; Read; Bifet; Barddal; Gama (2019-11-26). "Machine learning for streaming data". ACM SIGKDD Explorations Newsletter. 21 (2): 6–22. doi:10.1145/3373464.3373470. S2CID 208607941.
  7. Gomes, Heitor Murilo; Grzenda, Maciej; Mello, Rodrigo; Read, Jesse; Le Nguyen, Minh Huong; Bifet, Albert (2022-02-28). "A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams". ACM Computing Surveys. 55 (4): 1–42. arXiv:2106.09170. doi:10.1145/3523055. ISSN 0360-0300.
  8. Grzenda, Maciej; Gomes, Heitor Murilo; Bifet, Albert (2019-11-16). "Delayed labelling evaluation for data streams". Data Mining and Knowledge Discovery. 34 (5): 1237–1266. doi:10.1007/s10618-019-00654-y. ISSN 1573-756X.
  9. Žliobaitė, Indrė; Bifet, Albert; Read, Jesse; Pfahringer, Bernhard; Holmes, Geoff (2015-03-01). "Evaluation methods and decision theory for classification of streaming data with temporal dependence". Machine Learning. 98 (3): 455–482. doi:10.1007/s10994-014-5441-4. hdl:10289/8954. ISSN 1573-0565.
  10. Montiel, Jacob; Read, Jesse; Bifet, Albert; Abdessalem, Talel (2018). "Scikit-Multiflow: A Multi-output Streaming Framework". Journal of Machine Learning Research. 19 (72): 1–5. arXiv:1807.04662. Bibcode:2018arXiv180704662M. ISSN 1533-7928.
  11. Features, scikit-multiflow, 2021-10-09, retrieved 2021-10-11
  12. Zaharia, Matei; Das, Tathagata; Li, Haoyuan; Hunter, Timothy; Shenker, Scott; Stoica, Ion (2013). "Discretized streams". Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. New York, New York, USA: ACM Press. pp. 423–438. doi:10.1145/2517349.2522737. ISBN 978-1-4503-2388-8.
  13. online-ml/river, OnlineML, 2021-10-11, retrieved 2021-10-11