BigDL

Last updated
BigDL
Developer(s) Intel
Repository https://github.com/intel-analytics/BigDL
Type Library for deep learning
License Apache 2.0 [1]
Website BigDL: Distributed Deep Learning Library for Apache Spark

BigDL [2] is a distributed deep learning framework for Apache Spark, created by Jason Dai at Intel. BigDL has its source code hosted on GitHub. [3]

Contents

Features

Applications

See also

Related Research Articles

Apache License Free software license developed by the ASF

The Apache License is a permissive free software license written by the Apache Software Foundation (ASF). It allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties. The ASF and its projects release their software products under the Apache License. The license is also used by many non-ASF projects.

GitHub, Inc. is a provider of Internet hosting for software development and version control using Git. It offers the distributed version control and source code management (SCM) functionality of Git, plus its own features. It provides access control and several collaboration features such as bug tracking, feature requests, task management, continuous integration and wikis for every project. Headquartered in California, it has been a subsidiary of Microsoft since 2018.

Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Deeplearning4j

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

Apache Mesos Software to manage computer clusters

Apache Mesos is an open-source project to manage computer clusters. It was developed at the University of California, Berkeley.

The following table compares notable software frameworks, libraries and computer programs for deep learning.

Apache SystemDS is a flexible machine learning system that automatically scales to Spark and Hadoop clusters. SystemDS's distinguishing characteristics are:

  1. Algorithm customizability via R-like and Python-like languages.
  2. Multiple execution modes, including Standalone, Spark Batch, Spark MLContext, Hadoop Batch, and JMLC.
  3. Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
RocksDB

RocksDB is a high performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit many CPU cores, and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound workloads. It is based on a log-structured merge-tree data structure. It is written in C++ and provides official language bindings for C++, C, and Java; alongside many third-party language bindings. RocksDB is open-source software, and was originally released under a BSD 3-clause license. However, in July 2017 the project was migrated to a dual license of both Apache 2.0 and GPLv2 license, possibly in response to the Apache Software Foundation's blacklist of the previous BSD+Patents license clause.

XGBoost Scalable implementation of the gradient boosted tree machine learning algorithm

XGBoost is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Windows, and macOS. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask.

Keras Neural network library

Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.

A notebook interface is a virtual notebook environment used for literate programming, a method of writing computer programs. Some notebooks are WYSIWYG environments including executable calculations embedded in formatted documents; others separate calculations and text into separate sections.

Unlicense Public domain-like license with a focus on an anti-copyright message

The Unlicense is a public domain equivalent license with a focus on an anti-copyright message. It was first published on January 1, 2010. The Unlicense offers a public domain waiver text with a fall-back public-domain-like license, inspired by permissive licenses but without an attribution clause. In 2015, GitHub reported that approximately 102,000 of their 5.1 million licensed projects use the Unlicense.

Caffe (software) Deep learning framework

Caffe is a deep learning framework, originally developed at University of California, Berkeley. It is open source, under a BSD license. It is written in C++, with a Python interface.

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.

Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis, advised by Professor Scott Shenker & Professor Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library.

Microsoft, a technology company historically known for its opposition to the open source software paradigm, turned to embrace the approach in the 2010s. From the 1970s through 2000s under CEOs Bill Gates and Steve Ballmer, Microsoft viewed the community creation and sharing of communal code, later to be known as free and open source software, as a threat to its business, and both executives spoke negatively against it. In the 2010s, as the industry turned towards cloud, embedded, and mobile computing—technologies powered by open source advances—CEO Satya Nadella led Microsoft towards open source adoption although Microsoft's traditional Windows business continued to grow throughout this period generating revenues of 26.8 billion in the third quarter of 2018, while Microsoft's Azure cloud revenues nearly doubled.

Horovod (machine learning)

Horovod is a free and open-source software framework for distributed deep learning training using TensorFlow, Keras, PyTorch, and Apache MXNet. Horovod is hosted under the Linux Foundation AI. Horovod has the goal of improving the speed, scale, and resource allocation when training a machine learning model.

LightGBM, short for Light Gradient Boosting Machine, is a free and open source distributed gradient boosting framework for machine learning originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

References

  1. "BigDL LICENSE". GitHub. 31 March 2020.
  2. "BigDL Project". bigdl-project.github.io. Retrieved 2017-12-19.
  3. "BigDL: Distributed Deep Learning Library for Apache Spark". GitHub. 31 March 2020.