BitVault

Last updated July 02, 2024

BitVault is a content-addressable distributed storage system, developed by Microsoft Research in China. BitVault uses peer-to-peer technology to distribute the tasks of storing and managing data. As such, there is no central authority responsible for management of the system. Rather, it is self-managing, provides high availability, reliability and scales up in a self-organizing manner, with low administrative overhead, which is almost constant irrespective of the size of the distributed overlay network.

Technology

Every participating peer node in BitVault architecture is a Smart Brick, which is a trimmed down PC with large disks. All Smart Bricks in a BitVault system are connected by a high-bandwidth, low latency network. A BitVault system can be easily scaled up – any computer can be configured to act as a Smart Brick by simply installing the BitVault software, and connecting it to the network, without any need for interrupting the already working nodes.

BitVault stores immutable data objects, i.e., objects which cannot be changed. The physical location of the objects is not fixed and can be on any of the bricks. Its location changes depending on its frequency of access; it can even be replicated at more than one brick. To get around this problem of changing locations, BitVault makes it accessible by means of a 160-bit key, which is unique for each object. The system dynamically references the location from which the object can be retrieved most efficiently, by using the key, and makes the object available. The unique key is generated from a hash of the data of the object, thus making the system content-addressable, as opposed to location-addressable. The hashes of the objects (key) are mapped to the physical addresses using hash tables, which are internally managed by the system and do not need any user intervention. Different sets of nodes maintain different sets of hash tables, which concern with only the data in that set of nodes, thereby giving rise to an overlay network in which the location of the data is tracked by a distributed hash table (DHT) architecture.

Architecture

The BitVault architecture is composed of multiple bricks which constitute a logical 160 bit address space, each associated with hash of some data. The association is maintained in a Distributed Hash Table (DHT). The DHT partitions the entire hash table into smaller hash tables. For example, if there are n peers, the hash table would be divided into n hash tables, each starting from the row next to where its immediate predecessor ended. Each DHT has its associated brick, and the extent of the logical address space a brick is responsible for is called its Zone. The bricks communicate using peer-to-peer technology, over the Membership and Routing Layer (MRL). Lookup of any data object can be done by n bricks in parallel, in its own zone, giving an efficiency of O (log N).

Multiple copies of a single object, called replica, are stored in the BitVault system, to give enough redundancy. If any index is damaged, the nearest replica can be notified to start its repair. And if the index notices that the replica is damaged, it can initiate the repair of the replica. This method of error recovery is called the Object Driven Repair model. In order for this to work, there needs to be a membership service running which will give a logical ordering to the peers. This is achieved using the MRL. The membership service guarantees that any addition or removal of a brick is eventually and reliably informed to every other live bricks. The MRL is also responsible to route messages to and from bricks and its associated DHTs.

The MRL uses a one hop DHT to perform routing, i.e., it never takes more than one hop over a peer to route messages, when the BitVault system is stable, i.e., no new bricks are added, nor is any load balancing or repair going on. The MRL is implemented using an XRing architecture, which maintains a distributed routing table which facilitates one-hop routing.

Single brick architecture

Architecture of a Brick Brick.PNG — Architecture of a Brick

A brick registers itself with the MRL with a 160 bit key that forms its identifier, and its zone in the DHT is from its id to just before the id of its next logical successor. The brick architecture is divided into two parts – the Index Module and the Data Module. The index module keeps a list of the list of all the replicas that are cached by the disc, mapped with their hashes. In addition, for each object that is stored, the IM also keeps a list of locations of all other replicas of the object. IM listens to the MRL and updates itself according to membership changes and also according to data being entered into BitVault system or being retrieved from it. The IM is also responsible to initiate repair of replicas once it is informed of a damaged one, and to ask for repair of replicas in its store. The IM is connected to a small Access Module, which serves as the gateway to external clients. Data module stores replicas of objects to a local disc. Along with the object, its metadata such as its hash key and its degree of replication in the BitVault system is also kept.

Working

Check In

Inserting data into the BitVault system is called Check In. A Check In requires the object, its key and an initial replication degree. The MRL routes the object and all its parameters to some brick. The brick then stores the data onto its Data Module and starts the job of replicating the object, by publishing it to random bricks, to achieve the specified replication degree. When the object has achieved the required replication degree, its index is said to be complete, otherwise it is partial. The brick must do further replication of an object which has partial index. Bricks also periodically verify that the index of the object is still complete.

Check Out

Check Out is the process of retrieving data from the BitVault system. The application which uses BitVault as its datastore gives the hash key of the object to be retrieved, which is sent by the MRL to any brick. If the brick doesn't have the object, it passes the request on to other bricks, in parallel. If the brick has the object, it is retrieved from its Data Module and routed to the requestor.

Fault tolerance

BitVault faults can be either transient or permanent. A transient failure will occur when a brick is experiencing temporary failure such as a software crash forcing a reboot. A permanent failure indicates errors such as hardware failure. Whenever any fault is detected, other bricks which have a replica of the affected object update the entry of the object in the index to be partial, and thus triggering further replication. All the other bricks containing replicas collaboratively send different parts of the object data, in parallel, to a new brick which will hold the replica. This parallel replication speeds up the repair of a damaged index to get it back to the complete state.

Membership changes

Whenever a new brick is added to the BitVault system, it takes up a random ID and contacts other bricks. The bricks will then include this new brick in their list of members. The newly added brick also gets a response from those bricks which added this to their membership list. The new brick adds the respondents to its membership list. Background load balancing of the system kicks in to populate the new brick with live replicas.

Load balancing

Bricks periodically query other bricks about the load condition in them. The brick then transfers some replicas onto the low-load bricks to get a more or less balanced load on each brick. It also issues messages to other bricks to update their indices to reflect the change.

Related Research Articles

Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network, forming a peer-to-peer network of nodes.

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

JXTA (Juxtapose) was an open-source peer-to-peer protocol specification begun by Sun Microsystems in 2001. The JXTA protocols were defined as a set of XML messages which allow any device connected to a network to exchange messages and collaborate independently of the underlying network topology.

BitTorrent, also referred to as simply torrent, is a communication protocol for peer-to-peer file sharing (P2P), which enables users to distribute data and electronic files over the Internet in a decentralized manner. The protocol is developed and maintained by Rainberry, Inc., and was first released in 2001.

Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.

In distributed data storage, a P-Grid is a self-organizing structured peer-to-peer system, which can accommodate arbitrary key distributions, still providing storage load-balancing and efficient search by using randomized routing.

A BitTorrent tracker is a special type of server that assists in the communication between peers using the BitTorrent protocol.

In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only $keys need to be remapped on average where is the number of keys and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.$

Peer Name Resolution Protocol (PNRP) is a peer-to-peer protocol designed by Microsoft. PNRP enables dynamic name publication and resolution, and requires IPv6.

Peer exchange or PEX is a communications protocol that augments the BitTorrent file sharing protocol. It allows a group of users that are collaborating to share a given file to do so more swiftly and efficiently.

Pastry is an overlay network and routing network for the implementation of a distributed hash table (DHT) similar to Chord. The key–value pairs are stored in a redundant peer-to-peer network of connected Internet hosts. The protocol is bootstrapped by supplying it with the IP address of a peer already in the network and from then on via the routing table which is dynamically built and repaired. It is claimed that because of its redundant and decentralized nature there is no single point of failure and any single node can leave the network at any time without warning and with little or no chance of data loss. The protocol is also capable of using a routing metric supplied by an outside program, such as ping or traceroute, to determine the best routes to store in its routing table.

Tapestry is a peer-to-peer overlay network which provides a distributed hash table, routing, and multicasting infrastructure for distributed applications. The Tapestry peer-to-peer system offers efficient, scalable, self-repairing, location-aware routing to nearby resources.

Magma is a distributed file system based on a distributed hash table, written in C, compatible with Linux and BSD kernels using FUSE.

Dynamo is a set of techniques that together can form a highly available key-value structured storage system or a distributed data store. It has properties of both databases and distributed hash tables (DHTs). It was created to help address some scalability issues that Amazon experienced during the holiday season of 2004. By 2007, it was used in Amazon Web Services, such as its Simple Storage Service (S3).

In the BitTorrent file distribution system, a torrent file or meta-info file is a computer file that contains metadata about files and folders to be distributed, and usually also a list of the network locations of trackers, which are computers that help participants in the system find each other and form efficient distribution groups called swarms. Torrent files are normally named with the extension .torrent.

Mainline DHT is the name given to the Kademlia-based distributed hash table (DHT) used by BitTorrent clients to find peers via the BitTorrent protocol. The idea of using a DHT for distributed tracking in BitTorrent was first implemented in Azureus 2.3.0.0 in May 2005, from which it gained significant popularity. Unrelated but around the same time, BitTorrent, Inc. released a similar DHT into their client called Mainline DHT, and thus popularized the use of distributed tracking in the BitTorrent protocol. Measurement showed that by 2013, the concurrent number of users of Mainline DHT is from 16 million to 28 million, with intra-day changes of at least 10 million.

Skip graphs are a kind of distributed data structure based on skip lists. They were invented in 2003 by James Aspnes and Gauri Shah. A nearly identical data structure called SkipNet was independently invented by Nicholas Harvey, Michael Jones, Stefan Saroiu, Marvin Theimer and Alec Wolman, also in 2003.

TomP2P is a distributed hash table which provides a decentralized key-value infrastructure for distributed applications. Each peer has a table that can be configured either to be disk-based or memory-based to store its values.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Elliptics is a distributed key–value data storage with open source code. By default it is a classic distributed hash table (DHT) with multiple replicas put in different groups. Elliptics was created to meet requirements of multi-datacenter and physically distributed storage locations when storing huge amount of medium and large files.

References

Zheng Zhang; Qiao Lian; Shiding Lin; Wei Chen; Yu Chen; Chao Jin (December 2005). "BitVault: a Highly Reliable Distributed Data Retention Platform" (PDF). Microsoft. doi:10.1145/1243418.1243423. MSR-TR-2005-179.

External links

Microsoft Watch

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.