Distributed SQL

Last updated

A distributed SQL database is a single relational database which replicates data across multiple servers. Distributed SQL databases are strongly consistent and most support consistency across racks, data centers, and wide area networks including cloud availability zones and cloud geographic zones. Distributed SQL databases typically use the Paxos or Raft algorithms to achieve consensus across multiple nodes.

Contents

Sometimes distributed SQL databases are referred to as NewSQL but NewSQL is a more inclusive term that includes databases that are not distributed databases.

History

Google's Spanner popularized the modern distributed SQL database concept. Google described the database and its architecture in a 2012 whitepaper called "Spanner: Google's Globally-Distributed Database." The paper described Spanner as having evolved from a Big Table-like key value store into a temporal multi-version database where data is stored in "schematized semi-relational tables." [1]

Spanner uses atomic clocks with the Paxos algorithm to accomplish consensus with regards to state distributed between servers. In 2010, and earlier implementation, ClustrixDB (now MariaDB Xpand) moved from a hardware appliance to a Paxos-based software database [2] and was later acquired by MariaDB [3] and added to a SaaS cloud offering called SkySQL. [4] In 2015, two Google engineers left the company to create Cockroach DB which achieves similar results using the Raft algorithm without atomic clocks or custom hardware. [5]

Spanner is primarily used for transactional and time-series use cases. However, Google furthered this research with a follow on paper about Google F1 which it describes as a Hybrid transactional/analytical processing database built on Spanner. [1]

Architecture

Distributed SQL databases have the following general characteristics:

Following the CAP Theorem, distributed SQL databases are "CP" or consistent and partition-tolerant. Algorithmically they sacrifice availability in that a failure of a primary node can make the database unavailable for writes.

All distributed SQL implementations require some kind of temporal synchronization to guarantee consistency. With the exception of Spanner, most do not use custom hardware to provide atomic clocks. Spanner is able to synchronize writes with temporal guarantees. Implementations without custom hardware require servers to compare clock offsets and potentially retry reads. [8]

Distributed SQL implementations

VendorAPI
Amazon Aurora PostgreSQL & MySQL
Citus Data (Microsoft) PostgreSQL
CockroachDB PostgreSQL-like
Google Spanner Proprietary SQL-like
MySQL Cluster MySQL
NuoDB Proprietary SQL
YugabyteDB PostgreSQL & Cassandra CQL-like
TiDB MySQL-like
MariaDB XPand MariaDB
Teradata Proprietary SQL-like
YDB [9] Proprietary SQL-like

Compared to NewSQL

CockroachDB, YugabyteDB and others have at times referred to themselves as NewSQL databases. Some of the NewSQL databases like Citus and Vitess have fundamentally different architectures, but were cited as examples of NewSQL by Matthew Aslett who coined the term. [10] In essence, distributed SQL databases are built from the ground-up and NewSQL databases include replication and sharding technologies added to existing client-server relational databases like PostgreSQL. [11] Some experts define DistributedSQL databases as a more specific subset of NewSQL databases. [12]

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

A database transaction symbolizes a unit of work, performed within a database management system against a database, that is treated in a coherent and reliable way independent of other transactions. A transaction generally represents any change in a database. Transactions in a database environment have two main purposes:

  1. To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure. For example: when execution prematurely and unexpectedly stops in which case many operations upon a database remain uncompleted, with unclear status.
  2. To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the programs' outcomes are possibly erroneous.

In database systems, durability is the ACID property that guarantees that the effects of transactions that have been committed will survive permanently, even in case of failures, including incidents and catastrophic events. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.

Multi-master replication is a method of database replication which allows data to be stored by a group of computers, and updated by any member of the group. All members are responsive to client data queries. The multi-master replication system is responsible for propagating the data modifications made by each member to the rest of the group and resolving any conflicts that might arise between concurrent changes made by different members.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio.

An embedded database system is a database management system (DBMS) which is tightly integrated with an application software; it is embedded in the application. It is a broad technology category that includes:

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.

NoSQL is an approach to database design that focuses on providing a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure. Since this non-relational database design does not require a schema, it offers rapid scalability to manage large and typically unstructured data sets. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.

<span class="mw-page-title-main">Spanner (database)</span> Cloud-based distributed SQL database management service

Spanner is a distributed SQL database management and storage service developed by Google. It provides features such as global transactions, strongly consistent reads, and automatic multi-site replication and failover. Spanner is used in Google F1, the database for its advertising business Google Ads as well as Gmail and Google Photos.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.

The following outline is provided as an overview of and topical guide to MySQL:

<span class="mw-page-title-main">CockroachDB</span> Distributed database management system

CockroachDB is a commercial distributed SQL database management system developed by Cockroach Labs.

In theoretical computer science, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).

Database scalability is the ability of a database to handle changing demands by adding/removing resources. Databases use a host of techniques to cope.

TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is designed to be MySQL compatible. It is developed and supported primarily by PingCAP and licensed under Apache 2.0, though it is also available as a paid product. TiDB drew its initial design inspiration from Google's Spanner and F1 papers.

<span class="mw-page-title-main">YugabyteDB</span> Transactional distributed SQL database

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte.

YDB (Yet another DataBase) is a distributed SQL database management system (DBMS) developed by Yandex, available as open-source technology.

References

  1. 1 2 3 https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41344.pdf [ bare URL PDF ]
  2. Higginbotham, Stacey (May 3, 2010). "Clustrix Builds the Webscale Holy Grail: A Database That Scales". gigaom.com.
  3. "MariaDB acquires Clustrix". 20 September 2018.
  4. Baer (dbInsight), Tony. "For MariaDB, it's time to put the pieces together". ZDNet.
  5. Morgan, Timothy Prickett (February 22, 2017). "Google Spanner Inspires CockroachDB To Outrun It". The Next Platform.
  6. The future of databases: distributed SQL & MariaDB ® , retrieved 2022-12-21
  7. "The Architecture of a Distributed SQL Database" via www.youtube.com.
  8. "Living Without Atomic Clocks". Cockroach Labs. April 21, 2020.
  9. "YDB is an open-source Distributed SQL Database that combines high availability and scalability with strong consistency and ACID transactions". ydb.tech.
  10. "What we talk about when we talk about NewSQL — Too much information".
  11. "SQL vs. NoSQL Databases: What's the Difference?". www.ibm.com.
  12. Prabagaren, Gokul (October 30, 2019). "NewSQL — The Next Evolution in Databases". Medium.