Scikit-learn

Last updated
scikit-learn
Original author(s) David Cournapeau
Initial releaseJune 2007;16 years ago (2007-06)
Stable release
1.4.1 [1] / 14 February 2024;60 days ago (14 February 2024)
Repository
Written in Python, Cython, C and C++ [2]
Operating system Linux, macOS, Windows
Type Library for machine learning
License New BSD License
Website scikit-learn.org

scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. [3] It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project. [4]

Contents

Overview

The scikit-learn project started as scikits.learn, a Google Summer of Code project by French data scientist David Cournapeau. The name of the project stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. [5] The original codebase was later rewritten by other developers. In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the French Institute for Research in Computer Science and Automation in Saclay, France, took leadership of the project and released the first public version of the library on February 1, 2010. [6] In November 2012, scikit-learn as well as scikit-image were described as two of the "well-maintained and popular" scikits libraries. [7] In 2019, it was noted that scikit-learn is one of the most popular machine learning libraries on GitHub. [8]

Implementation

scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in Cython to improve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR. In such cases, extending these methods with Python may not be possible.

scikit-learn integrates well with many other Python libraries, such as Matplotlib and plotly for plotting, NumPy for array vectorization, Pandas dataframes, SciPy, and many more.

Version history

scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as a part of his thesis work. In 2010, INRIA, the French Institute for Research in Computer Science and Automation, got involved and the first public release (v0.1 beta) was published in late January 2010.

scikit-learn alternatives

Related Research Articles

<span class="mw-page-title-main">SciPy</span> Open-source Python library for scientific computing

SciPy is a free and open-source Python library used for scientific computing and technical computing.

<span class="mw-page-title-main">NumPy</span> Python library for numerical programming

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The predecessor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors. NumPy is a NumFOCUS fiscally sponsored project.

Pygame is a cross-platform set of Python modules designed for writing video games. It includes computer graphics and sound libraries designed to be used with the Python programming language.

<span class="mw-page-title-main">Orange (software)</span> Open-source data analysis software

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative qualitative data analysis and interactive data visualization.

<span class="mw-page-title-main">Matplotlib</span> Library for creating static, animated, and interactive visualizations in Python.

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine, designed to closely resemble that of MATLAB, though its use is discouraged. SciPy makes use of Matplotlib.

<span class="mw-page-title-main">IPython</span> Advanced interactive shell for Python

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers introspection, rich media, shell syntax, tab completion, and history. IPython provides the following features:

<span class="mw-page-title-main">Cython</span> Programming language

Cython is a superset of the programming language Python, which allows developers to write Python code that yields performance comparable to that of C.

Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) is a matrix-free method for finding the largest eigenvalues and the corresponding eigenvectors of a symmetric generalized eigenvalue problem

Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones. In Theano, computations are expressed using a NumPy-esque syntax and compiled to run efficiently on either CPU or GPU architectures.

<span class="mw-page-title-main">Spyder (software)</span> IDE for scientific programming in Python

Spyder is an open-source cross-platform integrated development environment (IDE) for scientific programming in the Python language. Spyder integrates with a number of prominent packages in the scientific Python stack, including NumPy, SciPy, Matplotlib, pandas, IPython, SymPy and Cython, as well as other open-source software. It is released under the MIT license.

scikit-image is an open-source image processing library for the Python programming language. It includes algorithms for segmentation, geometric transformations, color space manipulation, analysis, filtering, morphology, feature detection, and more. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

In statistics, the graphical lasso is a sparsepenalized maximum likelihood estimator for the concentration or precision matrix of a multivariate elliptical distribution. The original variant was formulated to solve Dempster's covariance selection problem for the multivariate Gaussian distribution when observations were limited. Subsequently, the optimization algorithms to solve this problem were improved and extended to other types of estimators and distributions.

David Cournapeau is a data scientist. He is the original author of the scikit-learn package, an open source machine learning library in the Python programming language.

spaCy Software library

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.

PyTorch is a machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is recognized as one of the two most popular machine learning libraries alongside TensorFlow, offering free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface.

<span class="mw-page-title-main">Dask (software)</span> Python library for parallel computing

Dask is an open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.

scikit-multiflow Machine learning library for data streams in Python

scikit-mutliflow is a free and open source software machine learning library for multi-output/multi-label and stream data written in Python.

LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

CuPy is an open source library for GPU-accelerated computing with Python programming language, providing support for multi-dimensional arrays, sparse matrices, and a variety of numerical algorithms implemented on top of them. CuPy shares the same API set as NumPy and SciPy, allowing it to be a drop-in replacement to run NumPy/SciPy code on GPU. CuPy supports Nvidia CUDA GPU platform, and AMD ROCm GPU platform starting in v9.0.

References

  1. "Release 1.4.1". 14 February 2024. Retrieved 20 February 2024.
  2. "The scikit-learn Open Source Project on Open Hub: Languages Page". Open Hub. Retrieved 14 July 2018.
  3. Fabian Pedregosa; Gaël Varoquaux; Alexandre Gramfort; Vincent Michel; Bertrand Thirion; Olivier Grisel; Mathieu Blondel; Peter Prettenhofer; Ron Weiss; Vincent Dubourg; Jake Vanderplas; Alexandre Passos; David Cournapeau; Matthieu Perrot; Édouard Duchesnay (2011). "scikit-learn: Machine Learning in Python". Journal of Machine Learning Research. 12: 2825–2830.
  4. "NumFOCUS Sponsored Projects". NumFOCUS. Retrieved 2021-10-25.
  5. Dreijer, Janto. "scikit-learn".
  6. "About us — scikit-learn 0.20.1 documentation". scikit-learn.org.
  7. Eli Bressert (2012). SciPy and NumPy: an overview for developers. O'Reilly. p. 43.
  8. "The State of the Octoverse: machine learning". The GitHub Blog. GitHub. 2019-01-24. Retrieved 2019-10-17.
  9. 1 2 3 4 "Release history — scikit-learn 0.19.dev0 documentation". scikit-learn.org. Retrieved 2017-02-27.
  10. "Release History - 0.20.0 documentation". scikit-learn. Retrieved 6 November 2018.
  11. "Release History - 0.21.0 documentation". scikit-learn. Retrieved 5 May 2019.
  12. "Release History - 0.22 documentation". scikit-learn. Retrieved 7 June 2020.
  13. "Release History - 0.23.0 documentation". scikit-learn. Retrieved 7 June 2020.
  14. "Release History - 0.24 documentation", scikit-learn, retrieved 2021-02-08
  15. "Release History - 1.0.0 documentation". scikit-learn.
  16. "Release History - 1.0.1 documentation". scikit-learn.
  17. "Release History - 1.0.2 documentation". scikit-learn.
  18. "Release History - 1.1.0 documentation". scikit-learn.
  19. "Release History - 1.1.1 documentation". scikit-learn.
  20. "Release History - 1.1.2 documentation". scikit-learn.
  21. "Release History - 1.1.3 documentation". scikit-learn.
  22. "Release History - 1.2.0 documentation". scikit-learn.
  23. "Release History - 1.2.1 documentation". scikit-learn.
  24. "Release History - 1.2.2 documentation". scikit-learn.