Distributed SQL

Last updated

A distributed SQL database is a single relational database which replicates data across multiple servers. Distributed SQL databases are strongly consistent and most support consistency across racks, data centers, and wide area networks including cloud availability zones and cloud geographic zones. Distributed SQL databases typically use the Paxos or Raft algorithms to achieve consensus across multiple nodes.

Contents

Sometimes distributed SQL databases are referred to as NewSQL but NewSQL is a more inclusive term that includes databases that are not distributed databases.

History

Google's Spanner popularized the modern distributed SQL database concept. Google described the database and its architecture in a 2012 whitepaper called "Spanner: Google's Globally-Distributed Database." The paper described Spanner as having evolved from a Big Table-like key value store into a temporal multi-version database where data is stored in "schematized semi-relational tables." [1]

Spanner uses atomic clocks with the Paxos algorithm to accomplish consensus with regards to state distributed between servers. In 2010, and earlier implementation, ClustrixDB (now MariaDB Xpand) moved from a hardware appliance to a Paxos-based software database [2] and was later acquired by MariaDB [3] and added to a SaaS cloud offering called SkySQL. [4] In 2015, two Google engineers left the company to create Cockroach DB which achieves similar results using the Raft algorithm without atomic clocks or custom hardware. [5]

Spanner is primarily used for transactional and time-series use cases. However, Google furthered this research with a follow on paper about Google F1 which it describes as a Hybrid transactional/analytical processing database built on Spanner. [1]

Architecture

Distributed SQL databases have the following general characteristics:

Following the CAP Theorem, distributed SQL databases are "CP" or consistent and partition-tolerant. Algorithmically they sacrifice availability in that a failure of a primary node can make the database unavailable for writes.

All distributed SQL implementations require some kind of temporal synchronization to guarantee consistency. With the exception of Spanner, most do not use custom hardware to provide atomic clocks. Spanner is able to synchronize writes with temporal guarantees. Implementations without custom hardware require servers to compare clock offsets and potentially retry reads. [8]

Distributed SQL implementations

VendorAPILicense model
Amazon Aurora PostgreSQL & MySQL Proprietary
Citus Data (Microsoft) PostgreSQL Open Source (AGPLv3) [9]
CockroachDB PostgreSQL-likeProprietary
Google Spanner Proprietary SQL-likeProprietary
MySQL Cluster MySQL Open Source (GPLv2)
NuoDB Proprietary SQLProprietary
YugabyteDB PostgreSQL & Cassandra CQL-likeOpen Source (Apache 2.0)
TiDB MySQL-likeOpen Source (Apache 2.0)
MariaDB XPand MariaDB Proprietary
Teradata Proprietary SQL-likeProprietary
YDB [10] Proprietary SQL-likeProprietary

Compared to NewSQL

CockroachDB, YugabyteDB and others have at times referred to themselves as NewSQL databases. Some of the NewSQL databases like Citus and Vitess have fundamentally different architectures, but were cited as examples of NewSQL by Matthew Aslett who coined the term. [11] In essence, distributed SQL databases are built from the ground-up and NewSQL databases include replication and sharding technologies added to existing client-server relational databases like PostgreSQL. [12] Some experts define DistributedSQL databases as a more specific subset of NewSQL databases. [13]

Related Research Articles

In database systems, durability is the ACID property that guarantees that the effects of transactions that have been committed will survive permanently, even in case of failures, including incidents and catastrophic events. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.

Multi-master replication is a method of database replication which allows data to be stored by a group of computers, and updated by any member of the group. All members are responsive to client data queries. The multi-master replication system is responsible for propagating the data modifications made by each member to the rest of the group and resolving any conflicts that might arise between concurrent changes made by different members.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communications may experience failures.

An embedded database system is a database management system (DBMS) which is tightly integrated with an application software; it is embedded in the application. It is a broad technology category that includes:

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

Clustrix, Inc. is a San Francisco-based private company founded in 2006 that developed a database management system marketed as NewSQL.

SQL:2011 or ISO/IEC 9075:2011 is the seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language. It was formally adopted in December 2011. The standard consists of 9 parts which are described in detail in SQL. The next version is SQL:2016.

<span class="mw-page-title-main">Spanner (database)</span> Cloud-based distributed SQL DBMS service

Spanner is a distributed SQL database management and storage service developed by Google. It provides features such as global transactions, strongly consistent reads, and automatic multi-site replication and failover. Spanner is used in Google F1, the database for its advertising business Google Ads, as well as Gmail and Google Photos.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.

FoundationDB is a free and open-source multi-model distributed NoSQL database developed by Apple Inc. with a shared-nothing architecture. The product was designed around a "core" database, with additional features supplied in "layers." The core database exposes an ordered key–value store with transactions. The transactions are able to read or write multiple keys stored on any machine in the cluster while fully supporting ACID properties. Transactions are used to implement a variety of data models via layers.

The following outline is provided as an overview of and topical guide to MySQL:

<span class="mw-page-title-main">CockroachDB</span> Distributed database management system

CockroachDB is a source-available distributed SQL database management system developed by Cockroach Labs. The relational functionality is built on top of a distributed, transactional, consistent key-value store that can survive a variety of different underlying infrastructure failures, and is wire-compatible with PostgreSQL which means users can take advantage of a wide range of drivers and tools from the extensive PostgreSQL ecosystem. A CockroachDB cluster consists of a number of nodes that can be spread across failure domains such as data centres or public cloud regions. A cluster can be scaled both horizontally and vertically. It can provide high levels of resilience and availability and can be run in a variety of environments such as bare metal, VMs, containers and Kubernetes, both in private data centers and in the cloud. CockroachDB gets its name from cockroaches, as they are known for being disaster-resistant.

<span class="mw-page-title-main">PACELC theorem</span> Theorem in theoretical computer science

In database theory, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).

TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. Designed to be MySQL compatible, it is developed and supported primarily by PingCAP and licensed under Apache 2.0. It is also available as a paid product. TiDB drew its initial design inspiration from Google's Spanner and F1 papers.

<span class="mw-page-title-main">YugabyteDB</span> Transactional distributed SQL database

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte.

YDB is a distributed SQL database management system (DBMS) developed by Yandex, available as open-source technology.

References

  1. 1 2 3 https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41344.pdf [ bare URL PDF ]
  2. Higginbotham, Stacey (May 3, 2010). "Clustrix Builds the Webscale Holy Grail: A Database That Scales". gigaom.com.
  3. "MariaDB acquires Clustrix". 20 September 2018.
  4. Baer (dbInsight), Tony. "For MariaDB, it's time to put the pieces together". ZDNet.
  5. Morgan, Timothy Prickett (February 22, 2017). "Google Spanner Inspires CockroachDB To Outrun It". The Next Platform.
  6. The future of databases: distributed SQL & MariaDB ® , retrieved 2022-12-21
  7. "The Architecture of a Distributed SQL Database". 23 September 2020 via www.youtube.com.
  8. "Living Without Atomic Clocks". Cockroach Labs. April 21, 2020.
  9. "citus/LICENSE at 185ac5e01e8c62757c81cdc1d92cffcf6b55f500 · citusdata/citus". GitHub. Retrieved 2024-09-26.
  10. "YDB is an open-source Distributed SQL Database that combines high availability and scalability with strong consistency and ACID transactions". ydb.tech.
  11. "What we talk about when we talk about NewSQL — Too much information".
  12. "SQL vs. NoSQL Databases: What's the Difference?". www.ibm.com. 12 June 2022.
  13. Prabagaren, Gokul (October 30, 2019). "NewSQL — The Next Evolution in Databases". Medium.