Mlpack

mlpack
Initial release: February 1, 2008 [1]
Stable release: 4.5.0 [2] / 18 September 2024
Written in: C++, Python, Julia, Go
Operating system: Cross-platform
Available in: English
Type: Software library, machine learning
License: Open source (BSD)
Website: mlpack.org

mlpack is a free, open-source, header-only software library for machine learning and artificial intelligence written in C++, built on top of the Armadillo library and the ensmallen numerical optimization library. [3] mlpack emphasizes scalability, speed, and ease of use. It aims to make machine learning accessible to novice users through a simple, consistent API, while exploiting C++ language features to give expert users maximum performance and flexibility. [4] mlpack also has a light deployment infrastructure with minimal dependencies, which makes it well suited to embedded systems and low-resource devices. Its intended target users are scientists and engineers.

It is open-source software distributed under the BSD license, making it useful for developing both open source and proprietary software. Releases 1.0.11 and before were released under the LGPL license. The project is supported by the Georgia Institute of Technology and contributions from around the world.

Features

Classical machine learning algorithms

mlpack contains a wide range of algorithms used to solve real problems, from classification and regression in the supervised learning paradigm to clustering and dimension reduction. The following is a non-exhaustive list of algorithms and models that mlpack supports:

Class templates for GRU and LSTM structures are available, so the library also supports recurrent neural networks.

Bindings

There are bindings to R, Go, Julia, [5] and Python, as well as a command-line interface (CLI) for use from a terminal. The binding system is extensible to other languages.

Reinforcement learning

mlpack contains several reinforcement learning (RL) algorithms implemented in C++, together with a set of examples. These algorithms can be tuned per example and combined with external simulators. Currently mlpack supports the following:

Design features

mlpack includes a range of design features that make it particularly well-suited for specialized applications, especially in the Edge AI and IoT domains. Its C++ codebase allows for seamless integration with sensors, facilitating direct data extraction and on-device preprocessing at the Edge. Below, we outline a specific set of design features that highlight mlpack's capabilities in these environments:

Low number of dependencies

mlpack has few dependencies, which simplifies software deployment. mlpack binaries can be linked statically and deployed to any system with minimal effort. Using a Docker container is not necessary and is even discouraged. This makes mlpack suitable for low-resource devices, as it requires only ensmallen and either Armadillo or Bandicoot, depending on the target hardware. mlpack uses the cereal library to serialize models. Its other dependencies are also header-only and are bundled with the library itself.

Low binary footprint

In terms of binary size, mlpack methods have a significantly smaller footprint compared to other popular libraries. Below, we present a comparison of deployable binary sizes between mlpack, PyTorch, and scikit-learn. To ensure consistency, the same application, along with all its dependencies, was packaged within a single Docker container for this comparison.

Binary size comparison

Library        MNIST digit recognizer   Language detection     Forest covertype classifier
               (CNN)                    (softmax regression)   (decision tree)
scikit-learn   N/A                      327 MB                 348 MB
PyTorch        1.04 GB                  1.03 GB                N/A
mlpack         1.23 MB                  1.03 MB                1.62 MB
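
To put the figures in the table above in proportion, the sizes can simply be divided. The short sketch below does that arithmetic. It assumes decimal units (1 GB = 1000 MB), which the source does not specify, and the names `SizeRatio` and `PrintRatios` are illustrative only.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of another library's deployable size to mlpack's, both in megabytes.
// Assumption: decimal units (1 GB = 1000 MB); the table does not say which.
double SizeRatio(double otherMB, double mlpackMB)
{
  assert(mlpackMB > 0.0);
  return otherMB / mlpackMB;
}

// Prints the ratio for each pairing that has values in both table rows.
void PrintRatios()
{
  std::printf("MNIST CNN, PyTorch / mlpack:          %.1f\n",
              SizeRatio(1040.0, 1.23));
  std::printf("Language detection, PyTorch / mlpack: %.1f\n",
              SizeRatio(1030.0, 1.03));
  std::printf("Language detection, sklearn / mlpack: %.1f\n",
              SizeRatio(327.0, 1.03));
  std::printf("Covertype tree, sklearn / mlpack:     %.1f\n",
              SizeRatio(348.0, 1.62));
}
```

Under that assumption, the mlpack deployments in the table are between roughly 215 and 1000 times smaller than the alternatives they are compared with.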

Other libraries exist, such as TensorFlow Lite. However, these libraries are usually specific to one method, such as neural network inference or training.

Example

The following simple example shows how to train a decision tree model using mlpack and use it for classification. A real dataset can be ingested with the Load function; random data is used here to show the API:

    // Train a decision tree on random numeric data and predict labels on test data:
    // All data and labels are uniform random; 10 dimensional data, 5 classes.
    // Replace with a data::Load() call or similar for a real application.
    arma::mat dataset(10, 1000, arma::fill::randu); // 1000 points.
    arma::Row<size_t> labels =
        arma::randi<arma::Row<size_t>>(1000, arma::distr_param(0, 4));
    arma::mat testDataset(10, 500, arma::fill::randu); // 500 test points.

    mlpack::DecisionTree tree;               // Step 1: create model.
    tree.Train(dataset, labels, 5);          // Step 2: train model.
    arma::Row<size_t> predictions;
    tree.Classify(testDataset, predictions); // Step 3: classify points.

    // Print some information about the test predictions.
    std::cout << arma::accu(predictions == 2) << " test points classified as class "
        << "2." << std::endl;

The above example demonstrates the simplicity of the API design, which resembles that of the popular Python-based machine learning toolkit scikit-learn. The objective is to simplify the API and the main machine learning functions, such as Classify and Predict, for the user. More complex examples, along with documentation for the methods, are located in the examples repository.

Backend

Armadillo

Armadillo is the default linear algebra library used by mlpack; it provides the matrix manipulation and operations necessary for machine learning algorithms. Armadillo is known for its efficiency and simplicity. It can also be used in header-only mode, in which case the only library that needs to be linked is OpenBLAS, Intel MKL, or LAPACK.

Bandicoot

Bandicoot [6] is a C++ linear algebra library designed for scientific computing. It has an API identical to Armadillo's, with the objective of executing computation on a graphics processing unit (GPU). The purpose of the library is to ease the transition between CPU and GPU by requiring only minor changes to the source code (e.g. changing the namespace and the linked library). mlpack currently has partial support for Bandicoot, with the objective of providing neural network training on the GPU. The following example shows two code blocks executing an identical operation. The first is Armadillo code running on the CPU, while the second can run on an OpenCL-capable GPU or an NVIDIA GPU (with the CUDA backend):

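    // Armadillo: runs on the CPU.
    using namespace arma;

    mat X, Y;
    X.randu(10, 15);
    Y.randu(10, 10);
    mat Z = 2 * norm(Y) * (X * X.t() - Y);

    // Bandicoot: runs on the GPU.
    using namespace coot;

    mat X, Y;
    X.randu(10, 15);
    Y.randu(10, 10);
    mat Z = 2 * norm(Y) * (X * X.t() - Y);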
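
The namespace swap works because both libraries expose the same names. The sketch below imitates that pattern using only the standard library. The namespaces `cpu_backend` and `gpu_backend`, the alias `la`, and the macro `USE_GPU_BACKEND` are hypothetical stand-ins and are not part of Armadillo or Bandicoot; the "GPU" version here still runs on the CPU and exists purely to show the mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a CPU linear algebra library.
namespace cpu_backend {
struct vec
{
  std::vector<double> data;
  explicit vec(std::size_t n, double fill = 0.0) : data(n, fill) {}
};

inline double dot(const vec& a, const vec& b)
{
  assert(a.data.size() == b.data.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.data.size(); ++i)
    sum += a.data[i] * b.data[i];
  return sum;
}

inline const char* name() { return "cpu"; }
} // namespace cpu_backend

// Hypothetical stand-in for a GPU library: same names, different internals.
namespace gpu_backend {
struct vec
{
  std::vector<double> data;
  explicit vec(std::size_t n, double fill = 0.0) : data(n, fill) {}
};

inline double dot(const vec& a, const vec& b)
{
  assert(a.data.size() == b.data.size());
  return std::inner_product(a.data.begin(), a.data.end(),
                            b.data.begin(), 0.0);
}

inline const char* name() { return "gpu"; }
} // namespace gpu_backend

// Selecting the backend is a one-line change (USE_GPU_BACKEND is an
// illustrative macro, not a real build flag of either library).
#ifdef USE_GPU_BACKEND
namespace la = gpu_backend;
#else
namespace la = cpu_backend;
#endif

// User code is written once against the alias and never names a backend.
inline double SquaredNorm(const la::vec& x)
{
  return la::dot(x, x);
}
```

Because `SquaredNorm` refers only to `la`, rebuilding with the other namespace selected changes where the work is done without touching the calling code.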

ensmallen

ensmallen [7] is a high-quality C++ library for nonlinear numerical optimization. It uses Armadillo or Bandicoot for linear algebra, and mlpack uses it to provide the optimizers for training machine learning algorithms. Like mlpack, ensmallen is a header-only library, and it supports custom behavior through callback functions, which allow users to extend the functionality of any optimizer. ensmallen is also published under the BSD license.
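
The callback idea can be pictured as an optimizer that reports to user-supplied functions after each step and lets them end the run early. The following is a minimal standard-library sketch of that idea and is not ensmallen's actual API: `GradientDescent`, `Callback`, and the fixed quadratic objective are all illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// A callback receives the iteration number and the current objective value,
// and returns true to request early termination.
using Callback = std::function<bool(std::size_t, double)>;

// Minimizes f(x) = (x - 3)^2 by plain gradient descent, starting from x.
// Every callback runs after every step; the final x is returned.
double GradientDescent(double x, double stepSize, std::size_t maxIterations,
                       const std::vector<Callback>& callbacks)
{
  assert(stepSize > 0.0);
  for (std::size_t i = 0; i < maxIterations; ++i)
  {
    const double gradient = 2.0 * (x - 3.0);
    x -= stepSize * gradient;
    const double objective = (x - 3.0) * (x - 3.0);

    // Call each callback in turn; any one of them may ask to stop.
    bool stop = false;
    for (const Callback& cb : callbacks)
      stop = cb(i, objective) || stop;
    if (stop)
      break;
  }
  return x;
}
```

A caller might pass one lambda that logs or counts iterations and another that returns true once the objective falls below a tolerance, which adds early stopping without changing the optimizer itself.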

ensmallen contains a diverse range of optimizers, classified by objective function type (differentiable, partially differentiable, categorical, constrained, etc.). The following is a small set of the optimizers available in ensmallen; the full list is given on the ensmallen documentation website.

Support

mlpack is fiscally sponsored and supported by NumFOCUS, and donations made to support the project's developers are tax-deductible. The mlpack team also participates each year in the Google Summer of Code program and mentors several students.

References

  1. "Initial checkin of the regression package to be released · mlpack/mlpack". February 8, 2008. Retrieved May 24, 2020.
  2. "Release 4.5.0". 18 September 2024. Retrieved 22 September 2024.
  3. Ryan Curtin; et al. (2021). "The ensmallen library for flexible numerical optimization". Journal of Machine Learning Research. 22 (166): 1–6. arXiv:2108.12981. Bibcode:2021arXiv210812981C.
  4. Ryan Curtin; et al. (2023). "mlpack 4: a fast, header-only C++ machine learning library". Journal of Open Source Software. 8 (82): 5026. arXiv:2302.00820.
  5. "Mlpack/Mlpack.jl". 10 June 2021.
  6. "C++ library for GPU accelerated linear algebra". coot.sourceforge.io. Retrieved 2024-08-12.
  7. "Home". ensmallen.org. Retrieved 2024-08-12.