Apache Ignite

Last updated
Apache Ignite
Original author(s) GridGain Systems
Developer(s) Apache Software Foundation
Initial release24 March 2015;9 years ago (2015-03-24)
Stable release
2.16.0 / 25 December 2023;3 months ago (2023-12-25) [1]
Preview release
3.0.0 (alpha 5) / 15 June 2022;21 months ago (2022-06-15) [1]
Repository Ignite Repository
Written in Java, C#, C++, SQL
Operating system Cross-platform
Platform IA-32, x86-64, PowerPC, SPARC, Java platform, .NET Framework
Type Database, computing platform
License Apache License 2.0
Website ignite.apache.org

Apache Ignite is a distributed database management system for high-performance computing.

Contents

Apache Ignite's database utilizes RAM as the default storage and processing tier, thus, belonging to the class of in-memory computing platforms. [2] The disk tier is optional but, once enabled, will hold the full data set whereas the memory tier [3] will cache full or partial data set depending on its capacity.

Data in Ignite is stored in the form of key-value pairs. The database component distributes key-value pairs across the cluster in such a way that every node owns a portion of the overall data set. Data is rebalanced automatically whenever a node is added to or removed from the cluster.

Apache Ignite cluster can be deployed on-premise on a commodity hardware, in the cloud (e.g. Microsoft Azure, AWS, Google Compute Engine) or in a containerized and provisioning environments such as Kubernetes, Docker, Apache Mesos, VMware. [4] [5]

History

Apache Ignite was developed by GridGain Systems, Inc. and made open source in 2014. GridGain continues to be the main contributor to the source code, and offers both a commercial version and professional services around Apache Ignite.

Once donated as open source, Ignite was accepted in the Apache Incubator program in October 2014. [6] [7] The Ignite project graduated from the incubator program to become a top-level Apache project on September 18, 2015. [7]

In recent years, Apache Ignite has become one of the top 5 most active projects [8] by some metrics, including user base activity and repository size.

GridGain was founded in 2010 by Nikita Ivanov and Dmitriy Setrakyan in Pleasanton, California. A funding round of $2 to $3 million was disclosed in November, 2011. [9] By 2013 it was located in Foster City, California when it disclosed funding of $10 million. [10]

Clustering

Apache Ignite clustering component uses a shared nothing architecture. Server nodes are storage and computational units of the cluster that hold both data and indexes and process incoming requests along with computations. Server nodes are also known as data nodes. [11]

Client nodes are connection points from applications and services to the distributed database on a cluster of server nodes. Client nodes are usually embedded in the application code written in Java, C# or C++ that have special libraries developed. On top of its distributed foundation, Apache Ignite supports interfaces including JCache-compliant key-value APIs, ANSI-99 SQL with joins, ACID transactions, as well as MapReduce like computations. Ignite provides ODBC, [12] JDBC [13] and REST drivers as a way to work with the database from other programming languages or tools. The drivers utilize either client nodes or low-level socket connections internally in order to communicate to the cluster.

Partitioning and replication

Ignite database organizes data in the form of key-value pairs in distributed "caches" (the cache notion is used for historical reasons because initially, the database supported the memory tier). Generally, each cache represents one entity type such as an employee or organization.

Every cache is split into a fixed set of "partitions" that are evenly distributed among cluster nodes using the rendezvous hashing algorithm. There is always one primary and zero or more backup copies of a partition. The number of copies is configured with a replication factor parameter. [14] If the full replication mode is configured, then every cluster node will store a partition's copy. The partitions are rebalanced [15] automatically if a node is added to or removed from the cluster in order to achieve an even data distribution and spread the workload.

The key-value pairs are kept in the partitions. Apache Ignite maps a pair to a partition by taking the key's value and passing it to a special hash function.

Memory architecture

The memory architecture in Apache Ignite consists of two storage tiers and is called "durable memory". Internally, it uses paging for memory space management and data reference, [16] similar to the virtual memory of systems like Unix. However, one significant difference between the durable and virtual memory architectures is that the former always keeps the whole data set with indexes on disk (assuming that the disk tier is enabled), while the virtual memory uses the disk when it runs out of RAM, for swapping purposes only.

The first tier of the memory architecture, memory tier, keeps data and indexes in RAM out of Java heap in so-called "off-heap regions". The regions are preallocated and managed by the database on its own which prevents Java heap utilization for storage needs, as a result, helping to avoid long garbage collection pauses. The regions are split into pages of fixed size that store data, indexes, and system metadata. [17]

Apache Ignite is fully operational from the memory tier but it is always possible to use the second tier, disk tier, for the sake of durability. The database comes with its own native persistence and, plus, can use RDBMS, NoSQL or Hadoop databases as its disk tier.

Native persistence

Apache Ignite native persistence is a distributed and strongly consistent disk store that always holds a superset of data and indexes on disk. The memory tier [3] will only cache as much data as it can depending on its capacity. For example, if there are 1000 entries and the memory tier can fit only 300 of them, then all 1000 will be stored on disk and only 300 will be cached in RAM.

Persistence uses the write-ahead logging (WAL) technique for keeping immediate data modifications on disk. [18] In the background, the store runs the "checkpointing process" which purpose is to copy dirty pages from the memory tier to the partition files. A dirty page is a page that is modified in memory with the modification recorded in WAL but not written to a respective partition file. The checkpointing allows removing outdated WAL segments over the time and reduces cluster restart time replaying only that part of WAL that has not been applied to the partition files. [19]

Third-party persistence

The native persistence became available starting version 2.1. [20] Before that Apache Ignite supported only third-party databases as its disk tier.

Apache Ignite can be configured as the in-memory tier on top of RDBMS, NoSQL or Hadoop databases speeding up the latter. [21] However, there are some limitations in comparison to the native persistence. For instance, SQL queries will be executed only on the data that is in RAM, thus, requiring to preload all the data set from disk to memory beforehand.

Swap space

When using pure memory storage, it is possible for the data size to exceed the physical RAM size, leading to Out-Of-Memory Errors (OOMEs). To avoid this, the ideal approach would be to enable Ignite native persistence or use third-party persistence. However, if you do not want to use native or third-party persistence, you can enable swapping, in which case, Ignite in-memory data will be moved to the swap space located on disk. Note that Ignite does not provide its own implementation of swap space. Instead, it takes advantage of the swapping functionality provided by the operating system (OS). When swap space is enabled, Ignites stores data in memory mapped files (MMF) whose content will be swapped to disk by the OS depending on the current RAM consumption

Consistency

Apache Ignite is a strongly consistent platform that implements two-phase commit protocol. [22] The consistency guarantees are met for both memory and disk tiers. Transactions in Apache Ignite are ACID-compliant and can span multiple cluster nodes and caches. The database supports pessimistic and optimistic concurrency modes, deadlock-free transactions and deadlock detection techniques.

In the scenarios where transactional guarantees are optional, Apache Ignite allows executing queries in the atomic mode that provides better performance.

Distributed SQL

Apache Ignite can be accessed using SQL APIs exposed via JDBC and ODBC drivers, and native libraries developed for Java, C#, C++ programming languages. Both data manipulation and data definition languages' syntax complies with ANSI-99 specification.

Being a distributed database, Apache Ignite supports both distributed collocated and non-collocated joins. [23] When the data is collocated, joins are executed on the local data of cluster nodes avoiding data movement across the network. Non-collocated joins might move the data sets around the network in order to prepare a consistent result set.

Machine Learning

Apache Ignite provides machine learning training and inference functionality as well as data preprocessing and model quality estimation. [24] It natively supports classical training algorithms such as Linear Regression, Decision Trees, Random Forest, Gradient Boosting, SVM, K-Means and others. In addition to that, Apache Ignite has a deep integration with TensorFlow. [25] This integrations allows to train neural networks on a data stored in Apache Ignite in a single-node or distributed manner.

The key idea of Apache Ignite Machine Learning toolkit is an ability to perform distributed training and inference instantly without massive data transmissions. It's based on MapReduce approach, resilient to node failures and data rebalances, allows to avoid data transfers and so that speed up preprocessing and model training. [26]

Related Research Articles

A shared-nothing architecture (SN) is a distributed computing architecture in which each update request is satisfied by a single node in a computer cluster. The intent is to eliminate contention among nodes. Nodes do not share the same memory or storage.

MySQL Cluster is a technology providing shared-nothing clustering and auto-sharding for the MySQL database management system. It is designed to provide high availability and high throughput with low latency, while allowing for near linear scalability. MySQL Cluster is implemented through the NDB or NDBCLUSTER storage engine for MySQL.

The IBM SAN Volume Controller (SVC) is a block storage virtualization appliance that belongs to the IBM System Storage product family. SVC implements an indirection, or "virtualization", layer in a Fibre Channel storage area network (SAN).

In database computing, Oracle Real Application Clusters (RAC) — an option for the Oracle Database software produced by Oracle Corporation and introduced in 2001 with Oracle9i — provides software for clustering and high availability in Oracle database environments. Oracle Corporation includes RAC with the Enterprise Edition, provided the nodes are clustered using Oracle Clusterware.

<span class="mw-page-title-main">Mnesia</span>

Mnesia is a distributed, soft real-time database management system written in the Erlang programming language. It is distributed as part of the Open Telecom Platform.

<span class="mw-page-title-main">Apache Cassandra</span> Free and open-source database management system

Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra was designed to implement a combination of Amazon's Dynamo distributed storage and replication techniques combined with Google's Bigtable data and storage engine model.

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.

In computer science, memory virtualization decouples volatile random access memory (RAM) resources from individual systems in the data centre, and then aggregates those resources into a virtualized memory pool available to any computer in the cluster. The memory pool is accessed by the operating system or applications running on top of the operating system. The distributed memory pool can then be utilized as a high-speed cache, a messaging layer, or a large, shared memory resource for a CPU or a GPU application.

NoSQL is an approach to database design that focuses on providing a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure. Since this non-relational database design does not require a schema, it offers rapid scalability to manage large and typically unstructured data sets. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.

<span class="mw-page-title-main">Redis</span> Source available in-memory key–value database

Redis is a formerly open-source, now "source available", in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Redis offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Redis is the most popular NoSQL database, and one of the most popular databases overall. Redis is used in companies like Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, Amazon and OpenAI.

<span class="mw-page-title-main">Couchbase Server</span> Open-source NoSQL database

Couchbase Server, originally known as Membase, is a source-available, distributed multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serve many concurrent users by creating, storing, retrieving, aggregating, manipulating and presenting data. In support of these kinds of application needs, Couchbase Server is designed to provide easy-to-scale key-value, or JSON document access, with low latency and high sustainability throughput. It is designed to be clustered from a single machine to very large-scale deployments spanning many machines.

Voldemort is a distributed data store that was designed as a key-value store used by LinkedIn for highly-scalable storage. It is named after the fictional Harry Potter villain Lord Voldemort.

eXtremeDB is a high-performance, low-latency, ACID-compliant embedded database management system using an in-memory database system (IMDS) architecture and designed to be linked into C/C++ based programs. It works on Windows, Linux, and other real-time and embedded operating systems.

<span class="mw-page-title-main">SingleStore</span> Database management system

SingleStore is a proprietary, cloud-native database designed for data-intensive applications. A distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support, it is known for speed in data ingest, transaction processing, and query processing.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Aerospike Database is a real-time, high performance NoSQL database. Designed for applications that cannot experience any downtime and require high read & write throughput. Aerospike is optimized to run on NVMe SSDs capable of efficiently storing large datasets. Aerospike can also be deployed as a fully in-memory cache database. Aerospike offers Key-Value, JSON Document, and Graph data models. Aerospike is open source distributed NoSQL database management system, marketed by the company also named Aerospike.

In computing, Hazelcast is a unified real-time data platform based on Java that combines a fast data store with stream processing. It is also the name of the company developing the product. The Hazelcast company is funded by venture capital and headquartered in Palo Alto, California.

<span class="mw-page-title-main">Apache Spark</span> Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Valkey is an open-source in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Valkey offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Valkey is the successor to Redis, the most popular NoSQL database, and one of the most popular databases overall. Valkey or its predecessor Redis are used in companies like Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, Amazon and OpenAI.

References

  1. 1 2 "Downloads - Apache Ignite". ignite.apache.org. Retrieved 2024-04-09.
  2. "Nikita Ivanov on Apache Ignite In-Memory Computing Platform". InfoQ. Retrieved 2017-10-11.
  3. 1 2 "Apache Ignite Native Persistence, a Brief Overview - DZone Big Data". dzone.com. Retrieved 2017-10-11.
  4. "Deploying Apache Ignite in Kubernetes on Microsoft Azure - DZone Cloud". dzone.com. Retrieved 2017-10-11.
  5. "Real-time in-memory OLTP and Analytics with Apache Ignite on AWS | Amazon Web Services". Amazon Web Services. 2016-05-14. Retrieved 2017-10-11.
  6. "Nikita Ivanov on Apache Ignite In-Memory Computing Platform". InfoQ. Retrieved 2017-11-02.
  7. 1 2 "Ignite Status - Apache Incubator". incubator.apache.org. Retrieved 2017-11-02.
  8. "Apache Ignite Momentum: Highlights from 2020-2021 : Apache Ignite". blogs.apache.org. Retrieved 2022-06-13.
  9. "Form D - Notice of Exempt Offering of Securities". United States Securities and Exchange Commission. November 8, 2011. Retrieved February 16, 2022.
  10. "Form D - Notice of Exempt Offering of Securities". United States Securities and Exchange Commission. May 7, 2013. Retrieved February 16, 2022.
  11. "Clients and Servers". apacheignite.readme.io. Retrieved 2017-10-11.
  12. "ODBC Driver". apacheignite.readme.io. Retrieved 2017-10-11.
  13. "JDBC Driver". apacheignite.readme.io. Retrieved 2017-10-11.
  14. "Primary & Backup Copies". apacheignite.readme.io. Retrieved 2017-10-11.
  15. "Data Rebalancing". apacheignite.readme.io. Retrieved 2017-10-11.
  16. "Apache Ignite 2.0: Redesigned Off-heap Memory, DDL and Machine Learning : Apache Ignite". blogs.apache.org. Retrieved 2017-10-11.
  17. "Memory Architecture". apacheignite.readme.io. Retrieved 2017-10-11.
  18. "Ignite Persistence". apacheignite.readme.io. Retrieved 2017-10-11.
  19. "Ignite Persistence". apacheignite.readme.io. Retrieved 2017-10-11.
  20. "Apache Ignite 2.1 - A Leap from In-Memory to Memory-Centric Architecture : Apache Ignite". blogs.apache.org. Retrieved 2017-10-11.
  21. "Apache Ignite for Database Caching - DZone Database". dzone.com. Retrieved 2017-10-11.
  22. "Distributed Thoughts" . Retrieved 2017-10-11.
  23. "Apache Ignite 1.7: Welcome Non-Collocated Distributed Joins! - DZone Database". dzone.com. Retrieved 2017-10-11.
  24. "Machine Learning". apacheignite.readme.io. Retrieved 2018-12-27.
  25. "TensorFlow: Apache Ignite Integration". github.com. Retrieved 2018-12-27.
  26. "Partition Based Dataset". apacheignite.readme.io. Retrieved 2018-12-27.