Apache SystemDS

Developer(s)	Apache Software Foundation, IBM
Initial release	November 2, 2015
Stable release	3.0.0 / July 5, 2022
Repository	SystemDS Repository
Written in	Java, Python, DML, C
Operating system	Linux, macOS, Windows
Type	Machine Learning, Deep Learning, Data Science
License	Apache License 2.0
Website	systemds.apache.org

Last updated July 06, 2024

Apache SystemDS (previously Apache SystemML) is an open-source machine learning (ML) system for the end-to-end data science lifecycle.

History

SystemML was created in 2010 by researchers at the IBM Almaden Research Center led by IBM Fellow Shivakumar Vaithyanathan. They observed that data scientists would write machine learning algorithms in languages such as R and Python for small data. When it came time to scale to big data, a systems programmer would be needed to rewrite the algorithm in a language such as Scala. This process typically took days or weeks per iteration, and errors could be introduced when translating the algorithms to operate on big data. SystemML seeks to simplify this process. A primary goal of SystemML is to automatically scale an algorithm written in an R-like or Python-like language to operate on big data, generating the same answer without the error-prone, multi-iteration translation approach.

On June 15, 2015, at the Spark Summit in San Francisco, Beth Smith, General Manager of IBM Analytics, announced that IBM was open-sourcing SystemML as part of IBM's major commitment to Apache Spark and Spark-related projects. SystemML became publicly available on GitHub on August 27, 2015 and became an Apache Incubator project on November 2, 2015. On May 17, 2017, the Apache Software Foundation Board approved the graduation of Apache SystemML as an Apache Top Level Project.

Key technologies

The following are some of the technologies built into the SystemDS engine.

Examples

Principal Component Analysis

The following code snippet^[1] performs principal component analysis of the input matrix $A$ and returns the eigenvalues and eigenvectors of its covariance matrix.

# PCA.dml
# Refer: https://github.com/apache/systemds/blob/master/scripts/algorithms/PCA.dml#L61
N = nrow(A);
D = ncol(A);
# perform z-scoring (centering and scaling)
A = scale(A, center==1, scale==1);
# co-variance matrix
mu = colSums(A)/N;
C = (t(A) %*% A)/(N-1) - (N/(N-1))*t(mu) %*% mu;
# compute eigen vectors and values
[evalues, evectors] = eigen(C);

Invocation script

spark-submit SystemDS.jar -f PCA.dml -nvargs INPUT=INPUT_DIR/pca-1000x1000 \
  OUTPUT=OUTPUT_DIR/pca-1000x1000-model PROJDATA=1 CENTER=1 SCALE=1
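
For readers more familiar with Python, the z-scoring and covariance steps of the DML script above can be sketched in plain Python. This is an illustration only, not SystemDS code: the helper names `zscore_columns` and `covariance` and the sample matrix are hypothetical, and the sketch assumes that scaling uses the sample standard deviation with an N-1 denominator. The eigen decomposition step is left out.

```python
# Illustrative sketch only: mirrors the z-scoring and covariance steps of the
# DML snippet in plain Python. Helper names and data are hypothetical.
from math import sqrt


def zscore_columns(A):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    n, d = len(A), len(A[0])
    means = [sum(row[j] for row in A) / n for j in range(d)]
    # Assumption: sample standard deviation with an (n - 1) denominator.
    stds = [
        sqrt(sum((row[j] - means[j]) ** 2 for row in A) / (n - 1))
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in A]


def covariance(A):
    """C = (A^T A)/(N-1) - (N/(N-1)) * mu^T mu, with mu = colSums(A)/N."""
    n, d = len(A), len(A[0])
    mu = [sum(row[j] for row in A) / n for j in range(d)]
    C = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            ata = sum(row[i] * row[j] for row in A)  # entry (i, j) of A^T A
            C[i][j] = ata / (n - 1) - (n / (n - 1)) * mu[i] * mu[j]
    return C


A = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 9.0]]
C_raw = covariance(A)                     # covariance of the raw columns
C_scaled = covariance(zscore_columns(A))  # after z-scoring: a correlation matrix
```

After z-scoring, every column has unit variance, so the diagonal of `C_scaled` is 1 up to rounding and the off-diagonal entries are correlations.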

Database functions

DBSCAN clustering algorithm with Euclidean distance.

X = rand(rows=1780, cols=180, min=1, max=20)
[indices, model] = dbscan(X=X, eps=2.5, minPts=360)
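
To show what the `eps` and `minPts` parameters control, the following is a minimal plain-Python sketch of the DBSCAN procedure with Euclidean distance. It is not the SystemDS builtin: the function name, the label convention (0 for noise, positive integers for clusters) and the sample points are assumptions made for illustration, and the brute-force neighbour search is only suitable for small inputs.

```python
# Illustrative DBSCAN sketch (not the SystemDS builtin). Requires Python 3.8+.
from math import dist

NOISE = 0  # assumed label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, 0 for noise."""
    n = len(points)
    # Brute-force eps-neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these sample points the two squares form two clusters and the isolated point at (50, 50) is labelled as noise.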

Improvements

SystemDS 2.0.0 is the first major release under the new name. This release contains a major refactoring, a few major features, a large number of improvements and fixes, and some experimental features to better support the end-to-end data science lifecycle. It also removes several outdated features. The changes include:

New mechanism for DML-bodied (script-level) builtin functions, and a wealth of new built-in functions for data preprocessing including data cleaning, augmentation and feature engineering techniques, new ML algorithms, and model debugging.
Several methods for data cleaning, including multiple imputation with multivariate imputation by chained equations (MICE) and other techniques; SMOTE, an oversampling technique for class imbalance; forward and backward NA filling; cleaning using schema and length information; outlier detection using standard deviation and inter-quartile range; and functional dependency discovery.
A complete framework for lineage tracing and reuse, including support for loop deduplication, full and partial reuse, compiler-assisted reuse, and several new rewrites to facilitate reuse.
New federated runtime backend, including support for federated matrices and frames and federated builtins (transform-encode, decode, etc.).
Refactored compression package with added functionality, including quantization for lossy compression, binary cell operations, and left matrix multiplication. [experimental]
New Python bindings with support for several builtins, matrix operations, federated tensors and lineage traces.
CUDA implementation of cumulative aggregate operators (cumsum, cumprod, etc.).
New model debugging technique with slice finder.
New tensor data model (basic tensors of different value types, data tensors with schema). [experimental]
Cloud deployment scripts for AWS and scripts to set up and start federated operations.
Performance improvements with parallel sort, GPU cumulative aggregates, append cbind, etc.
Various compiler and runtime improvements, including new and improved rewrites, reduced Spark context creation, a new eval framework, list operations, and updated native kernel libraries, to name a few.
New data reader/writer for JSON frames and support for SQL as a data source.
Miscellaneous improvements: improved documentation, better testing, run/release scripts, improved packaging, a Docker container for SystemDS, support for lambda expressions, and bug fixes.
Removed MapReduce compiler and runtime backend, pydml parser, Java-UDF framework, script-level debugger.
Deprecated ./scripts/algorithms, as those algorithms gradually will be part of SystemDS builtins.

^[2]
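
Two of the data-cleaning techniques named in the list above, forward and backward NA filling and outlier detection using the inter-quartile range, can be sketched in plain Python as follows. This only illustrates the general techniques and does not use the SystemDS builtins: the function names, the use of `None` for missing values and the conventional 1.5 multiplier are assumptions.

```python
# Illustrative sketch of two cleaning techniques; not the SystemDS builtins.
from statistics import quantiles  # Python 3.8+


def forward_fill(values):
    """Replace each None with the latest earlier non-missing value.

    Leading gaps have no earlier value and therefore stay None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each None with the next later non-missing value."""
    return forward_fill(values[::-1])[::-1]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


series = [None, 1, None, None, 4, None]
ffilled = forward_fill(series)
bfilled = backward_fill(series)
outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Forward filling cannot repair a gap at the start of a series and backward filling cannot repair one at the end, which is one reason the two are often combined.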

Contributions

Apache SystemDS welcomes contributions in the form of code, questions and answers, community building, or spreading the word. The contributor guide is available at https://github.com/apache/systemds/blob/main/CONTRIBUTING.md

References

[1] Apache SystemDS, The Apache Software Foundation, 2022-02-24, retrieved 2022-03-06.
[2] SystemDS, Apache. "SystemML 1.2.0 Release Notes". systemds.apache.org. Retrieved 2021-02-26.
