Distributed data store

Last updated

A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. [1] It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes. [2]

Contents

Distributed databases

Distributed databases are usually non-relational databases that enable a quick access to data over a large number of nodes. Some distributed databases expose rich query abilities while others are limited to a key-value store semantics. Examples of limited distributed databases are Google's Bigtable, which is much more than a distributed file system or a peer-to-peer network, [3] Amazon's Dynamo [4] and Microsoft Azure Storage. [5]

As the ability of arbitrary querying is not as important as the availability, designers of distributed data stores have increased the latter at an expense of consistency. But the high-speed read/write access results in reduced consistency, as it is not possible to guarantee both consistency and availability on a partitioned network, as stated by the CAP theorem.

Peer network node data stores

In peer network data stores, the user can usually reciprocate and allow other users to use their computer as a storage node as well. Information may or may not be accessible to other users depending on the design of the network.

Most peer-to-peer networks do not have distributed data stores in that the user's data is only available when their node is on the network. However, this distinction is somewhat blurred in a system such as BitTorrent, where it is possible for the originating node to go offline but the content to continue to be served. Still, this is only the case for individual files requested by the redistributors, as contrasted with networks such as Freenet, Winny, Share and Perfect Dark where any node may be storing any part of the files on the network.

Distributed data stores typically use an error detection and correction technique. Some distributed data stores (such as Parchive over NNTP) use forward error correction techniques to recover the original file when parts of that file are damaged or unavailable. Others try again to download that file from a different mirror.

Examples

Distributed non-relational databases

ProductLicense High availability Notes
Apache Accumulo AL2
Aerospike AGPL
Apache Cassandra AL2 Yesformerly used by Facebook
Apache Ignite AL2
Bigtable Proprietary used by Google
Couchbase AL2 used by LinkedIn, PayPal, and eBay
CrateDB AL2 Yes
Apache Druid AL2 used by Netflix, and Yahoo
Dynamo Proprietary used by Amazon
etcd AL2 Yes
Hazelcast AL2, Proprietary
HBase AL2 Yesformerly used by Facebook
Hypertable GPL 2 Baidu
MongoDB SSPL
MySQL NDB Cluster GPL 2 YesSQL and NoSQL APIs
Riak AL2 Yes
Redis BSD License Yes
ScyllaDB AGPL
Voldemort AL2 used by LinkedIn

Peer network node data stores

See also

Related Research Articles

<span class="mw-page-title-main">Hyphanet</span> Peer-to-peer Internet platform for censorship-resistant communication

Hyphanet is a peer-to-peer platform for censorship-resistant, anonymous communication. It uses a decentralized distributed data store to keep and deliver information, and has a suite of free software for publishing and communicating on the Web without fear of censorship. Both Freenet and some of its associated tools were originally designed by Ian Clarke, who defined Freenet's goal as providing freedom of speech on the Internet with strong anonymity protection.

<span class="mw-page-title-main">Distributed hash table</span> Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

Winny is a Japanese peer-to-peer (P2P) file-sharing program developed by Isamu Kaneko, a research assistant at the University of Tokyo in 2002. Like Freenet, a user must add an encrypted node list in order to connect to other nodes on the network. Users choose three cluster words which symbolize their interests, and then Winny connects to other nodes which share these cluster words, downloading and storing encrypted data from cache of these neighbors in a distributed data store. If users want a particular file, they set up triggers (keywords), and Winny will download files marked by these triggers. The encryption was meant to provide anonymity, but Winny also included bulletin boards where users would announce uploads, and the IP address of posters could be discovered through these boards. While Freenet was implemented in Java, Winny was implemented as a Windows C++ application.

An anonymous P2P communication system is a peer-to-peer distributed application in which the nodes, which are used to share resources, or participants are anonymous or pseudonymous. Anonymity of participants is usually achieved by special routing overlay networks that hide the physical location of each node from other participants.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio.

<span class="mw-page-title-main">Perfect Dark (P2P)</span> Peer to peer software

Perfect Dark (パーフェクトダーク) is a peer-to-peer file-sharing (P2P) application from Japan designed for use with Microsoft Windows. It was launched in 2006. Its author is known by the pseudonym Kaichō. Perfect Dark was developed with the intention for it to be the successor to both Winny and Share software. While Japan's Association for Copyright of Computer Software reported that in January 2014, the number of nodes connected on Perfect Dark was less than on Share, but more than on Winny, Netagent in 2018 reported Winny being the largest with 50 000 nodes followed by Perfect Dark with 30 000 nodes followed by Share with 10 000. Netagent asserts that the number of nodes on Perfect Dark have fallen since 2015 while the numbers of Winny hold steady. Netagent reports that users of Perfect Dark are most likely to share books/manga.

<span class="mw-page-title-main">Apache Cassandra</span> Free and open-source database management system

Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra was designed to implement a combination of Amazon's Dynamo distributed storage and replication techniques combined with Google's Bigtable data and storage engine model.

NoSQL is an approach to database design that focuses on providing a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure. Since this non-relational database design does not require a schema, it offers rapid scalability to manage large and typically unstructured data sets. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.

<span class="mw-page-title-main">Couchbase Server</span> Open-source NoSQL database

Couchbase Server, originally known as Membase, is a source-available, distributed multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serve many concurrent users by creating, storing, retrieving, aggregating, manipulating and presenting data. In support of these kinds of application needs, Couchbase Server is designed to provide easy-to-scale key-value, or JSON document access, with low latency and high sustainability throughput. It is designed to be clustered from a single machine to very large-scale deployments spanning many machines.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

<span class="mw-page-title-main">Amazon DynamoDB</span> NoSQL database service

Amazon DynamoDB is a fully managed proprietary NoSQL database offered by Amazon.com as part of the Amazon Web Services portfolio. DynamoDB offers a fast persistent key–value datastore with built-in support for replication, autoscaling, encryption at rest, and on-demand backup among other features.

Apache Accumulo is a highly scalable sorted, distributed key-value store based on Google's Bigtable. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels and server-side programming mechanisms. According to DB-Engines ranking, Accumulo is the third most popular NoSQL wide column store behind Apache Cassandra and HBase and the 67th most popular database engine of any type (complete) as of 2018.

The following is provided as an overview of and topical guide to databases:

<span class="mw-page-title-main">SingleStore</span> Database management system

SingleStore is a proprietary, cloud-native database designed for data-intensive applications. A distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support, it is known for speed in data ingest, transaction processing, and query processing.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

File sharing in Japan is notable for both its size and sophistication.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

A wide-column store is a column-oriented DBMS and therefore a special type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a two-dimensional key–value store. Google's Bigtable is one of the prototypical examples of a wide-column store.

<span class="mw-page-title-main">InterPlanetary File System</span> Content-addressable, peer-to-peer hypermedia distribution protocol

The InterPlanetary File System (IPFS) is a protocol, hypermedia and file sharing peer-to-peer network for storing and sharing data in a distributed file system. IPFS uses content-addressing to uniquely identify each file in a global namespace connecting IPFS hosts.

References

  1. Yaniv Pessach, Distributed Storage (Distributed Storage: Concepts, Algorithms, and Implementations ed.), OL   25423189M
  2. "Distributed Data Storage - an overview | ScienceDirect Topics".
  3. "Bigtable: Google's Distributed Data Store". Paper Trail. Archived from the original on 2017-07-16. Retrieved 2011-04-05. Although GFS provides Google with reliable, scalable distributed file storage, it does not provide any facility for structuring the data contained in the files beyond a hierarchical directory structure and meaningful file names. It's well known that more expressive solutions are required for large data sets. Google's terabytes upon terabytes of data that they retrieve from web crawlers, amongst many other sources, need organising, so that client applications can quickly perform lookups and updates at a finer granularity than the file level. [...] The very first thing you need to know about Bigtable is that it isn't a relational database. This should come as no surprise: one persistent theme through all of these large scale distributed data store papers is that RDBMSs are hard to do with good performance. There is no hard, fixed schema in a Bigtable, no referential integrity between tables (so no foreign keys) and therefore little support for optimised joins.
  4. Sarah Pidcock (2011-01-31). "Dynamo: Amazon's Highly Available Key-value Store" (PDF). WATERLOO – CHERITON SCHOOL OF COMPUTER SCIENCE. p. 2/22. Retrieved 2011-04-05. Dynamo: a highly available and scalable distributed data store
  5. "Windows Azure Storage". Microsoft . 2011-09-16. Archived from the original on 9 November 2011. Retrieved 6 November 2011.