Apache SINGA

Developer(s) Apache Software Foundation
Initial release October 8, 2015
Stable release 4.2.0 / March 15, 2024
Repository
Written in C++, Python
Operating system Linux, macOS, Windows
License Apache License 2.0
Website singa.apache.org

Apache SINGA is an Apache top-level project for developing an open-source machine learning library. It provides a flexible architecture for scalable distributed training, can be extended to run on a wide range of hardware, and focuses on health-care applications.

History

The SINGA project was initiated in 2014 by the DB System Group at the National University of Singapore, in collaboration with the database group of Zhejiang University, to support complex analytics at scale and to make database systems more intelligent and autonomic. [1] It focused on distributed deep learning, partitioning the model and data across the nodes of a cluster and parallelizing the training. [2] [3] The prototype was accepted by the Apache Incubator in March 2015 and graduated as a top-level project in October 2019. The table below lists each release with its date and maintenance status.

Status | Version | Original release date | Latest version | Release date
Current stable version | 4.2.0 | 2024-03-15 | 4.2.0 | 2024-03-15
Older version, yet still maintained | 4.1.0 | 2023-11-05 | 4.1.0 | 2023-11-05
Older version, yet still maintained | 4.0.0 | 2023-04-07 | 4.0.0 | 2023-04-07
Older version, yet still maintained | 3.3.0 | 2022-06-07 | 3.3.0 | 2022-06-07
Older version, yet still maintained | 3.2.0 | 2021-08-15 | 3.2.0 | 2021-08-15
Older version, yet still maintained | 3.1.0 | 2020-10-30 | 3.1.0 | 2020-10-30
Older version, yet still maintained | 3.0.0 | 2020-04-20 | 3.0.0 | 2020-04-20
Old version, no longer maintained | 2.0.0 | 2019-04-20 | 2.0.0 | 2019-04-20
Old version, no longer maintained | 1.2.0 | 2018-06-06 | 1.2.0 | 2018-06-06
Old version, no longer maintained | 1.1.0 | 2017-02-12 | 1.1.0 | 2017-02-12
Old version, no longer maintained | 1.0.0 | 2016-09-08 | 1.0.0 | 2016-09-08
Old version, no longer maintained | 0.3.0 | 2016-04-20 | 0.3.0 | 2016-04-20
Old version, no longer maintained | 0.2.0 | 2016-01-14 | 0.2.0 | 2016-01-14
Old version, no longer maintained | 0.1.0 | 2015-10-08 | 0.1.0 | 2015-10-08

Software Stack

SINGA's software stack includes three major components: core, IO, and model. The following figure illustrates these components together with the hardware. The core component provides memory management and tensor operations. The IO component has classes for reading data from, and writing data to, disk and network. The model component provides data structures and algorithms for machine learning models, e.g., layers for neural network models, and optimizers, initializers, metrics, and loss functions for general machine learning models.

Figure: Apache SINGA software stack
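
The separation of concerns among the three components can be illustrated with a minimal, self-contained sketch. This is not SINGA's API: the names Tensor, write_records, read_records, Linear, and sgd_step are hypothetical and exist only for this illustration, and the data format (JSON lines) is an assumption. The sketch shows a "core" layer that only knows about numeric operations, an "IO" layer that only moves records to and from disk, and a "model" layer that combines the two to train a linear model.

```python
import json
import os
import tempfile

# --- "core": a tiny tensor type with basic numeric operations (hypothetical) ---
class Tensor:
    def __init__(self, values):
        self.values = [float(v) for v in values]

    def dot(self, other):
        return sum(a * b for a, b in zip(self.values, other.values))

    def axpy(self, alpha, other):
        """Return self + alpha * other as a new Tensor."""
        return Tensor(a + alpha * b for a, b in zip(self.values, other.values))

# --- "IO": write records to disk and read them back (hypothetical format) ---
def write_records(path, records):
    with open(path, "w") as f:
        for x, y in records:
            f.write(json.dumps({"x": x, "y": y}) + "\n")

def read_records(path):
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield Tensor(rec["x"]), rec["y"]

# --- "model": a linear layer, a squared loss, and an SGD update ---
class Linear:
    def __init__(self, n_inputs):
        self.weight = Tensor([0.0] * n_inputs)

    def forward(self, x):
        return self.weight.dot(x)

def sgd_step(layer, x, y, lr):
    """One SGD update for the loss 0.5 * (prediction - y)^2."""
    error = layer.forward(x) - y
    layer.weight = layer.weight.axpy(-lr * error, x)
    return 0.5 * error * error

def train(path, n_inputs, epochs=200, lr=0.1):
    layer = Linear(n_inputs)
    for _ in range(epochs):
        for x, y in read_records(path):
            sgd_step(layer, x, y, lr)
    return layer

# Toy data generated by y = 2*x0 - 1*x1.
records = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "train.jsonl")
    write_records(data_path, records)
    trained = train(data_path, n_inputs=2)
print([round(w, 3) for w in trained.weight.values])
```

In this sketch the model code never touches files directly and the tensor code knows nothing about training, which mirrors the layering described above at a very small scale.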

SINGA-Auto

SINGA-Auto (also known as Rafiki, [4] published at VLDB 2018) is a subsystem of Apache SINGA that provides training and inference services for machine learning models. It frees users from constructing the models, tuning the hyper-parameters, and optimizing prediction accuracy and speed. Users upload their datasets, configure the service to conduct training, and then deploy the model for inference. As a cloud service system, SINGA-Auto manages hardware resources, failure recovery, and similar operational concerns. For ease of use, it provides a model zoo, a set of built-in machine learning models for popular tasks such as structured data analytics (e.g., on electronic medical record (EMR) data), image recognition, and text processing.

The training service is built on a general framework for distributed hyper-parameter tuning, together with a collaborative tuning scheme designed specifically for deep learning models. The inference service uses a scheduling algorithm based on reinforcement learning to optimize overall accuracy and reduce latency; the scheduler adapts to changes in the request rate.
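
As a rough illustration of what distributed hyper-parameter tuning involves, the sketch below runs a plain random search in which several workers evaluate trials in parallel and a coordinator keeps the best one. It is not the collaborative tuning scheme of SINGA-Auto, and it does not use any SINGA-Auto API. The functions validation_score, sample_trials, and tune are hypothetical, and validation_score is a toy formula standing in for "train a model and measure its validation accuracy".

```python
import random
from concurrent.futures import ThreadPoolExecutor

def validation_score(params):
    """Toy stand-in for training a model and returning validation accuracy.
    The score peaks at learning_rate = 0.1 and hidden_units = 64."""
    lr_term = (params["learning_rate"] - 0.1) ** 2
    size_term = ((params["hidden_units"] - 64) / 64) ** 2
    return 1.0 - lr_term - size_term

def sample_trials(n_trials, seed=0):
    """Draw hyper-parameter settings at random from a fixed search space."""
    rng = random.Random(seed)
    return [
        {"learning_rate": rng.uniform(0.001, 0.5),
         "hidden_units": rng.choice([16, 32, 64, 128, 256])}
        for _ in range(n_trials)
    ]

def tune(n_trials=32, n_workers=4, seed=0):
    trials = sample_trials(n_trials, seed)
    # Workers evaluate trials independently; the coordinator keeps the best result.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(validation_score, trials))
    best_index = max(range(n_trials), key=scores.__getitem__)
    return trials[best_index], scores[best_index]

best_params, best_score = tune()
print(best_params, round(best_score, 4))
```

In a real deployment each trial would be a full training run on a separate machine or GPU, and a more advanced scheme could share information between trials instead of treating them as independent.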

SINGA-Easy

SINGA-Easy [5] (ACM Multimedia 2021) is an easy-to-use deep learning framework built as a component of Apache SINGA to facilitate the adoption of deep learning algorithms and inference services by domain-specific application users (e.g., in multimedia and medical image analysis). It provides distributed hyper-parameter tuning at the training stage, dynamic computational cost control at the inference stage, and intuitive user interactions with multimedia content facilitated by model explanation. To improve accuracy, it supports regularization methods for image and structured data (ACM SIGMOD 2023). To help domain users accept the training results, SINGA-Easy lets them evaluate model performance from a model explanation perspective, based on LIME [6] and Grad-CAM. [7]

MLCask

MLCask [8] (IEEE ICDE 2021) is a pipeline management subsystem that manages machine learning pipelines, from data cleaning to data analytics. It eases the maintenance of evolving, versioned pipelines in collaborative analytics, which reduces cost and facilitates adoption. MLCask supports Git-like, end-to-end management of the ML life cycle. By leveraging the version history of pipeline components and the workspace, it can skip unchanged preprocessing steps, which addresses the challenge of frequent retraining. Its non-linear version control semantics and merge operation facilitate effective collaborative development of the pipeline.
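
The idea of skipping unchanged steps can be sketched with a small cache keyed by each step's name, its version, and the identity of its upstream input. This is an illustrative assumption about how such reuse might work, not MLCask's implementation. The Pipeline class and the step functions are hypothetical, and the sketch covers only a linear pipeline, not MLCask's non-linear versioning or merge operation.

```python
import hashlib
import json

class Pipeline:
    """Runs versioned steps in order and reuses a cached output when neither
    the step version nor its upstream input has changed."""

    def __init__(self):
        self.cache = {}      # cache key -> step output
        self.executed = []   # names of the steps actually run in the last call

    @staticmethod
    def _key(name, version, upstream_key):
        raw = json.dumps([name, version, upstream_key])
        return hashlib.sha256(raw.encode()).hexdigest()

    def run(self, raw_data, steps):
        """steps: list of (name, version, function) applied in sequence."""
        self.executed = []
        data = raw_data
        upstream_key = hashlib.sha256(json.dumps(raw_data).encode()).hexdigest()
        for name, version, func in steps:
            key = self._key(name, version, upstream_key)
            if key not in self.cache:
                self.cache[key] = func(data)
                self.executed.append(name)
            data = self.cache[key]
            upstream_key = key  # downstream steps depend on this exact output
        return data

def clean(rows):
    return [r for r in rows if r is not None]

def scale_v1(rows):
    return [r / 10 for r in rows]

def scale_v2(rows):
    return [r / 100 for r in rows]

def mean(rows):
    return sum(rows) / len(rows)

pipeline = Pipeline()
data = [10, None, 30]
first = pipeline.run(data, [("clean", "1", clean), ("scale", "1", scale_v1), ("mean", "1", mean)])
first_executed = list(pipeline.executed)
# Only the "scale" component is updated to version 2 before the second run.
second = pipeline.run(data, [("clean", "1", clean), ("scale", "2", scale_v2), ("mean", "1", mean)])
second_executed = list(pipeline.executed)
print(first, first_executed)
print(second, second_executed)
```

On the second run the cleaning step is served from the cache, while the updated scaling step and everything downstream of it are recomputed, because each key chains in the key of the step before it.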

In-Database Model Selection

Starting from version 4.1.0, Apache SINGA provides support for in-database model selection and inference in PostgreSQL. The system implements a resource-efficient two-phase model selection algorithm that incorporates both training-free and training-based model selection techniques. This model selection algorithm is integrated non-intrusively into PostgreSQL via stored procedures with optimizations on execution latency and memory consumption. The inclusion of in-database model selection empowers users to obtain high-performing models within their specified response time requirements.
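
A two-phase selection of this kind can be sketched as a cheap filtering pass followed by a budgeted training pass. The details below are assumptions made for illustration, not the algorithm shipped in SINGA: phase 1 keeps the top candidates by a training-free proxy score, and phase 2 uses successive halving as the training-based technique. The functions training_free_score, train_and_evaluate, and select_model are hypothetical, the candidates are toy dictionaries, and the sketch runs in plain Python rather than inside PostgreSQL.

```python
def training_free_score(candidate):
    """Hypothetical cheap proxy computed without any training. Here it is a
    rough guess of quality stored on the toy candidate."""
    return candidate["proxy"]

def train_and_evaluate(candidate, epochs):
    """Toy stand-in for training: accuracy approaches the candidate's true
    quality as the epoch budget grows."""
    return candidate["quality"] * (1.0 - 0.5 ** epochs)

def select_model(candidates, keep=4, start_epochs=1):
    # Phase 1 (training-free): rank every candidate by the proxy, keep the top few.
    survivors = sorted(candidates, key=training_free_score, reverse=True)[:keep]
    # Phase 2 (training-based): successive halving over the survivors,
    # doubling the training budget each time the field is halved.
    epochs = start_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: train_and_evaluate(c, epochs), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        epochs *= 2
    return survivors[0]

candidates = [
    {"name": "mlp-small", "proxy": 0.30, "quality": 0.70},
    {"name": "mlp-wide", "proxy": 0.55, "quality": 0.82},
    {"name": "mlp-deep", "proxy": 0.60, "quality": 0.88},
    {"name": "mlp-tiny", "proxy": 0.10, "quality": 0.55},
    {"name": "mlp-huge", "proxy": 0.65, "quality": 0.80},
    {"name": "mlp-mid", "proxy": 0.50, "quality": 0.85},
]
best = select_model(candidates)
print(best["name"])
```

The point of the split is resource efficiency: the proxy is cheap enough to apply to every candidate, so the expensive training budget is spent only on the few that survive the first phase.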

Applications

Apache SINGA [9] is in use at organizations such as NetEase, [10] Carnegie Technologies, CBRE, Citigroup, JurongHealth Hospital, National University of Singapore, National University Hospital, Noblis, Shentilium Technologies, Singapore General Hospital, Tan Tock Seng Hospital, YZBigData, and others. Apache SINGA is used across applications in banking, education, finance, healthcare, real estate, software development, and other categories.

Apache SINGA and Social Good

The Ng Teng Fong General Hospital [11] collaborated with the Apache SINGA team to develop an application for people diagnosed with pre-diabetes, a condition where blood glucose levels are higher than normal, but not high enough to be classified as diabetes.

The application, called the JurongHealth Food Log (JHFoodLg) app, uses Apache SINGA to match photos of food to a database of local dishes, including nasi padang, laksa and char siew rice, and utilises nutrition data from the Health Promotion Board, JurongHealth Campus, and the Australian Food and Nutrient Database. After comprehensive data cleaning (e.g., consistent formatting, deduplication, foodness classification, human calibration), the database contains 209,861 images covering 13 food groups and 233 food categories.

The app allows users from the hospital's Lifestyle Intervention (Liven) programme to set weight loss and exercise goals. A six-month study shows that almost all 20 patients who used the app lost between 4 and 5 percent of their initial bodyweight.

References

  1. Wang, Wei; Zhang, Meihui; Chen, Gang; Jagadish, H. V.; Ooi, Beng Chin; Tan, Kian-Lee (June 2016). "Database Meets Deep Learning: Challenges and Opportunities". SIGMOD Record. 45 (2): 17–22. arXiv:1906.08986. doi:10.1145/3003665.3003669. S2CID 6526411.
  2. Ooi, Beng Chin; Tan, Kian-Lee; Wang, Sheng; Wang, Wei; Cai, Qingchao; Chen, Gang; Gao, Jinyang; Luo, Zhaojing; Tung, Anthony K. H.; Wang, Yuan; Xie, Zhongle; Zhang, Meihui; Zheng, Kaiping (2015). "SINGA: A Distributed Deep Learning Platform" (PDF). Proceedings of the 23rd ACM international conference on Multimedia. pp. 685–688. doi:10.1145/2733373.2807410. S2CID 1840240. Retrieved 8 September 2016.
  3. Wang, Wei; Chen, Gang; Dinh, Tien Tuan Anh; Gao, Jinyang; Ooi, Beng Chin; Tan, Kian-Lee; Wang, Sheng (2015). "SINGA: Putting Deep Learning in the Hands of Multimedia Users" (PDF). Proceedings of the 23rd ACM international conference on Multimedia. pp. 25–34. doi:10.1145/2733373.2806232. S2CID 7169465. Retrieved 8 September 2016.
  4. Wang, Wei; Gao, Jinyang; Zhang, Meihui; Wang, Sheng; Chen, Gang; Ng, Teck Khim; Ooi, Beng Chin; Shao, Jie; Reyad, Moaz (2018). "Rafiki: Machine Learning as an Analytics Service System" (PDF). Proceedings of the VLDB Endowment. 12 (2): 128–140. arXiv:1804.06087. Bibcode:2018arXiv180406087W. doi:10.14778/3282495.3282499. S2CID 4898729. Retrieved 9 January 2019.
  5. Xing, Naili; Yeung, Sai Ho; Cai, Chenghao; Ng, Teck Khim; Wang, Wei; Yang, Kaiyuan; Yang, Nan; Zhang, Meihui; Chen, Gang; Ooi, Beng Chin (2021). "SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis" (PDF). Proceedings of the 29th ACM international conference on Multimedia. pp. 1293–1302. doi:10.1145/3474085.3475176. ISBN 978-1-4503-8651-7. Retrieved 17 October 2021.
  6. Ribeiro, Marco Tulio; Singh, Sameer; Guestrin, Carlos (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier" (PDF). Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. pp. 1135–1144. arXiv:1602.04938. doi:10.1145/2939672.2939778. Retrieved 1 August 2016.
  7. Selvaraju, Ramprasaath R.; Cogswell, Michael; Das, Abhishek; Vedantam, Ramakrishna; Parikh, Devi; Batra, Dhruv (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization" (PDF). 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626. arXiv:1610.02391. doi:10.1109/ICCV.2017.74. ISBN 978-1-5386-1032-9.
  8. Luo, Zhaojing; Yeung, Sai Ho; Zhang, Meihui; Zheng, Kaiping; Zhu, Lei; Chen, Gang; Fan, Feiyi; Lin, Qian; Ngiam, Kee Yuan; Ooi, Beng Chin (2021). "MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines". 2021 IEEE 37th International Conference on Data Engineering (ICDE). pp. 1655–1666. arXiv:2010.10246. doi:10.1109/ICDE51399.2021.00146. ISBN 978-1-7281-9184-3. S2CID 224802796.
  9. "The Apache Software Foundation Announces Apache SINGA as a Top-Level Project". news.apache.org. 4 November 2019. Retrieved 4 November 2019.
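  10. NetEase (网易) (2 June 2017). "网易携手Apache SINGA角逐人工智能新战场_网易科技" [NetEase joins hands with Apache SINGA to compete on the new battlefield of artificial intelligence, NetEase Tech]. tech.163.com (in Chinese). Retrieved 3 June 2017.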
  11. "New app allows pre-diabetics to use photos of their meal to check if it is healthy". The Straits Times. 24 January 2019. Retrieved 6 April 2019.