Distributed data store

Last updated

Contents

A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. [1] It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes. [2]

Distributed databases

Distributed databases are usually non-relational databases that enable a quick access to data over a large number of nodes. Some distributed databases expose rich query abilities while others are limited to a key-value store semantics. Examples of limited distributed databases are Google's Bigtable, which is much more than a distributed file system or a peer-to-peer network, [3] Amazon's Dynamo [4] and Microsoft Azure Storage. [5]

As the ability of arbitrary querying is not as important as the availability, designers of distributed data stores have increased the latter at an expense of consistency. But the high-speed read/write access results in reduced consistency, as it is not possible to guarantee both consistency and availability on a partitioned network, as stated by the CAP theorem.

Peer network node data stores

In peer network data stores, the user can usually reciprocate and allow other users to use their computer as a storage node as well. Information may or may not be accessible to other users depending on the design of the network.

Most peer-to-peer networks do not have distributed data stores in that the user's data is only available when their node is on the network. However, this distinction is somewhat blurred in a system such as BitTorrent, where it is possible for the originating node to go offline but the content to continue to be served. Still, this is only the case for individual files requested by the redistributors, as contrasted with networks such as Freenet, Winny, Share and Perfect Dark where any node may be storing any part of the files on the network.

Distributed data stores typically use an error detection and correction technique. Some distributed data stores (such as Parchive over NNTP) use forward error correction techniques to recover the original file when parts of that file are damaged or unavailable. Others try again to download that file from a different mirror.

Examples

Distributed non-relational databases

ProductLicense High availability Notes
Apache Accumulo AL2
Aerospike AGPL
Apache Cassandra AL2 Yesformerly used by Facebook
Apache Ignite AL2
Bigtable Proprietary used by Google
Couchbase AL2 used by LinkedIn, PayPal, and eBay
CrateDB AL2 Yes
Druid (open-source data store) AL2 used by Netflix, and Yahoo
Dynamo Proprietary used by Amazon
Hazelcast AL2, Proprietary
HBase AL2 Yesformerly used by Facebook
Hypertable GPL 2 Baidu
MongoDB SSPL
Riak AL2 Yes
Redis BSD License Yes
Scylla AGPL
Voldemort AL2 used by LinkedIn

Peer network node data stores

See also

Related Research Articles

Freenet Peer-to-peer Internet platform for censorship-resistant communication

Freenet is a peer-to-peer platform for censorship-resistant communication. It uses a decentralized distributed data store to keep and deliver information, and has a suite of free software for publishing and communicating on the Web without fear of censorship. Both Freenet and some of its associated tools were originally designed by Ian Clarke, who defined Freenet's goal as providing freedom of speech on the Internet with strong anonymity protection.

Distributed hash table Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table: key-value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

Winny is a Japanese peer-to-peer (P2P) file-sharing program developed by Isamu Kaneko, a research assistant at the University of Tokyo in 2002. Like Freenet, a user must add an encrypted node list in order to connect to other nodes on the network. Users choose three cluster words which symbolize their interests, and then Winny connects to other nodes which share these cluster words, downloading and storing encrypted data from cache of these neighbors in a distributed data store. If users want a particular file, they set up triggers (keywords), and Winny will download files marked by these triggers. The encryption was meant to provide anonymity, but Winny also included bulletin boards where users would announce uploads, and the IP address of posters could be discovered through these boards. While Freenet was implemented in Java, Winny was implemented as a Windows C++ application.

An anonymous P2P communication system is a peer-to-peer distributed application in which the nodes, which are used to share resources, or participants are anonymous or pseudonymous. Anonymity of participants is usually achieved by special routing overlay networks that hide the physical location of each node from other participants.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Bigtable is a compressed, high performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable and a few other Google technologies. On May 6, 2015, a public version of Bigtable was made available as a service. Bigtable also underlies Google Cloud Datastore, which is available as a part of the Google Cloud Platform.

Perfect Dark (P2P) Peer to peer software

Perfect Dark (パーフェクトダーク) is a peer-to-peer file-sharing (P2P) application from Japan designed for use with Microsoft Windows. It was launched in 2006. Its author is known by the pseudonym Kaichō. Perfect Dark was developed with the intention for it to be the successor to both Winny and Share software. While Japan's Association for Copyright of Computer Software reported that in January 2014, the number of nodes connected on Perfect Dark was less than on Share, but more than on Winny, Netagent in 2018 reported Winny being the largest with 50 000 nodes followed by Perfect Dark with 30 000 nodes followed by Share with 10 000. Netagent asserts that the number of nodes on Perfect Dark have fallen since 2015 while the numbers of Winny hold steady. Netagent reports that users of Perfect Dark are most likely to share books/manga.

Apache Cassandra

Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra was designed to implement a combination of Amazon's Dynamo distributed storage and replication techniques combined with Google's Bigtable data and storage engine model.

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but the name "NoSQL" was only coined in the early 21st century, triggered by the needs of Web 2.0 companies. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.

Amazon DynamoDB

Amazon DynamoDB is a fully managed proprietary NoSQL database service that supports key–value and document data structures and is offered by Amazon.com as part of the Amazon Web Services portfolio. DynamoDB exposes a similar data model to and derives its name from Dynamo, but has a different underlying implementation. Dynamo had a multi-leader design requiring the client to resolve version conflicts and DynamoDB uses synchronous replication across multiple data centers for high durability and availability. DynamoDB was announced by Amazon CTO Werner Vogels on January 18, 2012, and is presented as an evolution of Amazon SimpleDB solution.

Apache Accumulo is a highly scalable sorted, distributed key-value store based on Google's Bigtable. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels and server-side programming mechanisms. According to DB-Engines ranking, Accumulo is the third most popular NoSQL wide column store behind Apache Cassandra and HBase and the 67th most popular database engine of any type (complete) as of 2018.

The following is provided as an overview of and topical guide to databases:

SingleStore

SingleStore is a distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support and is known for speed in data ingest, transaction processing, and query processing. SingleStore was formerly known as MemSQL.

Apache Drill

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Drill is an Apache top-level project.

Oracle NoSQL Database

Oracle NoSQL Database (ONDB) is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

File sharing in Japan is notable for both its size and sophistication.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

A wide-column store is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a two-dimensional key–value store.

InterPlanetary File System Content-addressable, peer-to-peer hypermedia distribution protocol

The InterPlanetary File System (IPFS) is a protocol and peer-to-peer network for storing and sharing data in a distributed file system. IPFS uses content-addressing to uniquely identify each file in a global namespace connecting all computing devices.

Google Cloud Datastore is a highly scalable, fully managed NoSQL database service offered by Google on the Google Cloud Platform. Cloud storage is something that "allows you to save data and files in an off-site location that you access either through the public internet or a dedicated private network connection." This is very cost-effective for businesses since physical files can replaced with cloud storage records. Cloud Datastore is built upon Google's Bigtable and Megastore technology. Google Cloud Datastore allows the user to create databases either in Native or Datastore Mode. Native Mode is designed for mobile and web apps, while Datastore Mode is designed for new server projects.

References

  1. Yaniv Pessach, Distributed Storage (Distributed Storage: Concepts, Algorithms, and Implementations ed.), OL   25423189M
  2. "Distributed Data Storage - an overview | ScienceDirect Topics".CS1 maint: discouraged parameter (link)
  3. "Bigtable: Google's Distributed Data Store". http://the-paper-trail.org/: Paper Trail. Archived from the original on 2017-07-16. Retrieved 2011-04-05. Although GFS provides Google with reliable, scalable distributed file storage, it does not provide any facility for structuring the data contained in the files beyond a hierarchical directory structure and meaningful file names. It’s well known that more expressive solutions are required for large data sets. Google’s terabytes upon terabytes of data that they retrieve from web crawlers, amongst many other sources, need organising, so that client applications can quickly perform lookups and updates at a finer granularity than the file level. [...] The very first thing you need to know about Bigtable is that it isn’t a relational database. This should come as no surprise: one persistent theme through all of these large scale distributed data store papers is that RDBMSs are hard to do with good performance. There is no hard, fixed schema in a Bigtable, no referential integrity between tables (so no foreign keys) and therefore little support for optimised joins.
  4. Sarah Pidcock (2011-01-31). "Dynamo: Amazon's Highly Available Key-value Store" (PDF). http://www.cs.uwaterloo.ca/: WATERLOO – CHERITON SCHOOL OF COMPUTER SCIENCE. p. 2/22. Retrieved 2011-04-05. Dynamo: a highly available and scalable distributed data store
  5. "Windows Azure Storage". 2011-09-16. Archived from the original on 9 November 2011. Retrieved 6 November 2011.