Apache SystemDS

Last updated
Apache SystemDS
Developer(s) Apache Software Foundation, IBM
Initial releaseNovember 2, 2015;8 years ago (2015-11-02)
Stable release
3.0.0 / July 5, 2022;18 months ago (2022-07-05)
Repository SystemDS Repository
Written in Java, Python, DML, C
Operating system Linux, macOS, Windows
Type Machine Learning, Deep Learning, Data Science
License Apache License 2.0
Website systemds.apache.org

Apache SystemDS (Previously, Apache SystemML) is an open source ML system for the end-to-end data science lifecycle.

Contents

SystemDS's distinguishing characteristics are:

  1. Algorithm customizability via R-like and Python-like languages.
  2. Multiple execution modes, including Standalone, Spark Batch, Spark MLContext, Hadoop Batch, and JMLC.
  3. Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.

History

SystemML was created in 2010 by researchers at the IBM Almaden Research Center led by IBM Fellow Shivakumar Vaithyanathan. It was observed that data scientists would write machine learning algorithms in languages such as R and Python for small data. When it came time to scale to big data, a systems programmer would be needed to scale the algorithm in a language such as Scala. This process typically involved days or weeks per iteration, and errors would occur translating the algorithms to operate on big data. SystemML seeks to simplify this process. A primary goal of SystemML is to automatically scale an algorithm written in an R-like or Python-like language to operate on big data, generating the same answer without the error-prone, multi-iterative translation approach.

On June 15, 2015, at the Spark Summit in San Francisco, Beth Smith, General Manager of IBM Analytics, announced that IBM was open-sourcing SystemML as part of IBM's major commitment to Apache Spark and Spark-related projects. SystemML became publicly available on GitHub on August 27, 2015 and became an Apache Incubator project on November 2, 2015. On May 17, 2017, the Apache Software Foundation Board approved the graduation of Apache SystemML as an Apache Top Level Project.

Key technologies

The following are some of the technologies built into the SystemDS engine.

Examples

Principal Component Analysis

The following code snippet [1] does the Principal component analysis of input matrix , which returns the and the .

# PCA.dml# Refer: https://github.com/apache/systemds/blob/master/scripts/algorithms/PCA.dml#L61N=nrow(A);D=ncol(A);# perform z-scoring (centering and scaling)A=scale(A,center==1,scale==1);# co-variance matrix mu=colSums(A)/N;C=(t(A)%*%A)/(N-1)-(N/(N-1))*t(mu)%*%mu;# compute eigen vectors and values[evalues,evectors]=eigen(C);

Invocation script

spark-submit SystemDS.jar -f PCA.dml -nvargs INPUT=INPUT_DIR/pca-1000x1000 \  OUTPUT=OUTPUT_DIR/pca-1000x1000-model PROJDATA=1 CENTER=1 SCALE=1

Database functions

DBSCAN clustering algorithm with Euclidean distance.

X=rand(rows=1780,cols=180,min=1,max=20)[indices,model]=dbscan(X=X,eps=2.5,minPts=360)

Improvements

SystemDS 2.0.0 is the first major release under the new name. This release contains a major refactoring, a few major features, a large number of improvements and fixes, and some experimental features to better support the end-to-end data science lifecycle. In addition to that, this release also removes several features that are not up date and outdated.

[2]

Contributions

Apache SystemDS welcomes contributions in code, question and answer, community building, or spreading the word. The contributor guide is available at https://github.com/apache/systemds/blob/main/CONTRIBUTING.md

See also

Related Research Articles

<span class="mw-page-title-main">Apache Flex</span> Software development kit (SDK) for the development and deployment of rich web applications

Apache Flex, formerly Adobe Flex, is a software development kit (SDK) for the development and deployment of cross-platform rich web applications based on the Adobe Flash platform. Initially developed by Macromedia and then acquired by Adobe Systems, Adobe donated Flex to the Apache Software Foundation in 2011 and it was promoted to a top-level project in December 2012.

In computing, a solution stack or software stack is a set of software subsystems or components needed to create a complete platform such that no additional software is needed to support applications. Applications are said to "run on" or "run on top of" the resulting platform.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together, marking as outliers points that lie alone in low-density regions . DBSCAN is one of the most common, and most commonly cited, clustering algorithms.

<span class="mw-page-title-main">ELKI</span> Data mining framework

ELKI is a data mining software framework developed for use in research and teaching. It was originally at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and now continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

<span class="mw-page-title-main">Apache Spark</span> Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

<span class="mw-page-title-main">Torch (machine learning)</span> Deep learning software

Torch is an open-source machine learning library, a scientific computing framework, and a scripting language based on Lua. It provides LuaJIT interfaces to deep learning algorithms implemented in C. It was created by the Idiap Research Institute at EPFL. Torch development moved in 2017 to PyTorch, a port of the library to Python.

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

<span class="mw-page-title-main">TensorFlow</span> Machine learning software library

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.

<span class="mw-page-title-main">XGBoost</span> Gradient boosting machine learning library

XGBoost is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Microsoft Windows, and macOS. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask.

Chainer is an open source deep learning framework written purely in Python on top of NumPy and CuPy Python libraries. The development is led by Japanese venture company Preferred Networks in partnership with IBM, Intel, Microsoft, and Nvidia.

The following outline is provided as an overview of and topical guide to machine learning:

<span class="mw-page-title-main">ML.NET</span> Machine learning library

ML.NET is a free software machine learning library for the C# and F# programming languages. It also supports Python models when used together with NimbusML. The preview release of ML.NET included transforms for feature engineering like n-gram creation, and learners to handle binary classification, multi-class classification, and regression tasks. Additional ML tasks like anomaly detection and recommendation systems have since been added, and other approaches like deep learning will be included in future versions.

<span class="mw-page-title-main">Dask (software)</span> Python library for parallel computing

Dask is an open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.

Amazon SageMaker is a cloud based machine-learning platform that allows the creation, training, and deployment by developers of machine-learning (ML) models on the cloud. It can be used to deploy ML models on embedded systems and edge-devices. SageMaker was launched in November 2017.

<span class="mw-page-title-main">CatBoost</span> Yandex open source gradient boosting framework on decision trees

CatBoost is an open-source software library developed by Yandex. It provides a gradient boosting framework which among other features attempts to solve for Categorical features using a permutation driven alternative compared to the classical algorithm. It works on Linux, Windows, macOS, and is available in Python, R, and models built using catboost can be used for predictions in C++, Java, C#, Rust, Core ML, ONNX, and PMML. The source code is licensed under Apache License and available on GitHub.

References

  1. Apache SystemDS, The Apache Software Foundation, 2022-02-24, retrieved 2022-03-06
  2. SystemDS, Apache. "SystemML 1.2.0 Release Notes". systemds.apache.org. Retrieved 2021-02-26.