Scikit-multiflow

Last updated
scikit-mutliflow
Original author(s) Jacob Montiel, Jesse Read, Albert Bifet, Talel Abdessalem
Developer(s) The scikit-mutliflow development team and the open research community
Initial releaseJanuary 2018;5 years ago (2018-01)
Stable release
0.5.3 / 17 June 2020;3 years ago (2020-06-17) [1] [2]
Repository https://github.com/scikit-multiflow/scikit-multiflow
Written in Python, Cython
Operating system Linux, macOS, Windows
Type Library for machine learning
License BSD 3-Clause license
Website scikit-multiflow.github.io

scikit-mutliflow (also known as skmultiflow) is a free and open source software machine learning library for multi-output/multi-label and stream data written in Python. [3]

Contents

Overview

scikit-multiflow allows to easily design and run experiments and to extend existing stream learning algorithms. [3] It features a collection of classification, regression, concept drift detection and anomaly detection algorithms. It also includes a set of data stream generators and evaluators. scikit-multiflow is designed to interoperate with Python's numerical and scientific libraries NumPy and SciPy and is compatible with Jupyter Notebooks.

Implementation

The scikit-multiflow library is implemented under the open research principles and is currently distributed under the BSD 3-Clause license. scikit-multiflow is mainly written in Python, and some core elements are written in Cython for performance. scikit-multiflow integrates with other Python libraries such as Matplotlib for plotting, scikit-learn for incremental learning methods [4] compatible with the stream learning setting, Pandas for data manipulation, Numpy and SciPy.

Components

The scikit-multiflow is composed of the following sub-packages:

History

scikit-multiflow started as a collaboration between researchers at Télécom Paris (Institut Polytechnique de Paris [5] ) and École Polytechnique. Development is currently carried by the University of Waikato, Télécom Paris, École Polytechnique and the open research community.

See also

Related Research Articles

<span class="mw-page-title-main">Boosting (machine learning)</span> Method in machine learning

In machine learning, boosting is an ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant : "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification. In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

<span class="mw-page-title-main">Machine learning</span> Study of algorithms that improve automatically through experience

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines "discover" their "own" algorithms, without needing to be explicitly told what to do by any human-developed algorithms. Recently, generative artificial neural networks have been able to surpass results of many previous approaches. Machine-learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture and medicine, where it is too costly to develop algorithms to perform the needed tasks.

Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities.

<span class="mw-page-title-main">Orange (software)</span>

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative qualitative data analysis and interactive data visualization.

In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.

<span class="mw-page-title-main">Weka (software)</span>

Waikato Environment for Knowledge Analysis (Weka) is a collection of machine learning and data analysis free software licensed under the GNU General Public License. It was developed at the University of Waikato, New Zealand and is the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques".

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

<span class="mw-page-title-main">Anomaly detection</span> Approach in data analysis

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

scikit-learn Python library for machine learning

scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.

mlpy is a Python, open-source, machine learning library built on top of NumPy/SciPy, the GNU Scientific Library and it makes an extensive use of the Cython language. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is distributed under GPL3.

Massive Online Analysis (MOA) is a free open-source software project specific for data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand.

<span class="mw-page-title-main">Feature engineering</span> Extracting features from raw data for machine learning

Feature engineering or feature extraction or feature discovery is the process of extracting features from raw data. Due to deep learning networks, such as convolutional neural networks, that are able to learn it by itself, domain-specific- based feature engineering has become obsolete for vision and speech processing.

David Cournapeau is a data scientist. He is the original author of the scikit-learn package, an open source machine learning library in the Python programming language.

The Collective Knowledge (CK) project is an open-source framework and repository to enable collaborative, reproducible and sustainable research and development of complex computational systems. CK is a small, portable, customizable and decentralized infrastructure helping researchers and practitioners:

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.

<span class="mw-page-title-main">Outline of machine learning</span> Overview of and topical guide to machine learning

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

<span class="mw-page-title-main">Dask (software)</span> Python library for parallel computing

Dask is an open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.

Isolation Forest is an algorithm for data anomaly detection initially developed by Fei Tony Liu in 2008. Isolation Forest detects anomalies using binary trees. The algorithm has a linear time complexity and a low memory requirement, which works well with high-volume data. In essence, the algorithm relies upon the characteristics of anomalies, i.e., being few and different, in order to detect anomalies. No density estimation is performed in the algorithm. The algorithm is different from decision tree algorithms in that only the path-length measure or approximation is being used to generate the anomaly score, no leaf node statistics on class distribution or target value is needed.

LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

<span class="mw-page-title-main">CatBoost</span> Yandex open source gradient boosting framework on decision trees

CatBoost is an open-source software library developed by Yandex. It provides a gradient boosting framework which among other features attempts to solve for Categorical features using a permutation driven alternative compared to the classical algorithm. It works on Linux, Windows, macOS, and is available in Python, R, and models built using catboost can be used for predictions in C++, Java, C#, Rust, Core ML, ONNX, and PMML. The source code is licensed under Apache License and available on GitHub.

References

  1. "scikit-mutliflow Version 0.5.3".
  2. "scikit-learn 0.5.3". Python Package Index .
  3. 1 2 Montiel, Jacob; Read, Jesse; Bifet, Albert; Abdessalem, Talel (2018). "Scikit-Multiflow: A Multi-output Streaming Framework". Journal of Machine Learning Research. 19 (72): 1–5. ISSN   1533-7928.
  4. "scikit-learn — Incremental learning". scikit-learn.org. Retrieved 2020-04-08.
  5. "Institut Polytechnique de Paris" . Retrieved 2020-04-08.
  6. Bifet, Albert; Holmes, Geoff; Kirkby, Richard; Pfahringer, Bernhard (2010). "MOA: Massive Online Analysis". Journal of Machine Learning Research. 11 (52): 1601–1604. ISSN   1533-7928.
  7. Read, Jesse; Reutemann, Peter; Pfahringer, Bernhard; Holmes, Geoff (2016). "MEKA: A Multi-label/Multi-target Extension to WEKA". Journal of Machine Learning Research. 17 (21): 1–5. ISSN   1533-7928.