Chord (peer-to-peer)

Last updated

In computing, Chord is a protocol and algorithm for a peer-to-peer distributed hash table. A distributed hash table stores key-value pairs by assigning keys to different computers (known as "nodes"); a node will store the values for all the keys for which it is responsible. Chord specifies how keys are assigned to nodes, and how a node can discover the value for a given key by first locating the node responsible for that key.

Contents

Chord is one of the four original distributed hash table protocols, along with CAN, Tapestry, and Pastry. It was introduced in 2001 by Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan, and was developed at MIT. [1] The 2001 Chord paper [1] won an ACM SIGCOMM Test of Time award in 2011. [2]

Subsequent research by Pamela Zave has shown that the original Chord algorithm (as specified in the 2001 SIGCOMM paper, [1] the 2001 Technical report, [3] the 2002 PODC paper, [4] and the 2003 TON paper [5] ) can mis-order the ring, produce several rings, and break the ring. [6]

Overview

Chord project.svg

Nodes and keys are assigned an -bit identifier using consistent hashing . The SHA-1 algorithm is the base hashing function for consistent hashing. Consistent hashing is integral to the robustness and performance of Chord because both keys and nodes (in fact, their IP addresses) are uniformly distributed in the same identifier space with a negligible possibility of collision. Thus, it also allows nodes to join and leave the network without disruption. In the protocol, the term node is used to refer to both a node itself and its identifier (ID) without ambiguity. So is the term key.

Using the Chord lookup protocol, nodes and keys are arranged in an identifier circle that has at most nodes, ranging from to . ( should be large enough to avoid collision.) Some of these nodes will map to machines or keys while others (most) will be empty.

Each node has a successor and a predecessor. The successor to a node is the next node in the identifier circle in a clockwise direction. The predecessor is counter-clockwise. If there is a node for each possible ID, the successor of node 0 is node 1, and the predecessor of node 0 is node ; however, normally there are "holes" in the sequence. For example, the successor of node 153 may be node 167 (and nodes from 154 to 166 do not exist); in this case, the predecessor of node 167 will be node 153.

The concept of successor can be used for keys as well. The successor node of a key is the first node whose ID equals to or follows in the identifier circle, denoted by . Every key is assigned to (stored at) its successor node, so looking up a key is to query .

Since the successor (or predecessor) of a node may disappear from the network (because of failure or departure), each node records an arc of nodes in the middle of which it stands, i.e., the list of nodes preceding it and nodes following it. This list results in a high probability that a node is able to correctly locate its successor or predecessor, even if the network in question suffers from a high failure rate.

Protocol details

A 16-node Chord network. The "fingers" for one of the nodes are highlighted. Chord network.png
A 16-node Chord network. The "fingers" for one of the nodes are highlighted.

Basic query

The core usage of the Chord protocol is to query a key from a client (generally a node as well), i.e. to find . The basic approach is to pass the query to a node's successor, if it cannot find the key locally. This will lead to a query time where is the number of machines in the ring.

Finger table

To avoid the linear search above, Chord implements a faster search method by requiring each node to keep a finger table containing up to entries, recall that is the number of bits in the hash key. The entry of node will contain . The first entry of finger table is actually the node's immediate successor (and therefore an extra successor field is not needed). Every time a node wants to look up a key , it will pass the query to the closest successor or predecessor (depending on the finger table) of in its finger table (the "largest" one on the circle whose ID is smaller than ), until a node finds out the key is stored in its immediate successor.

With such a finger table, the number of nodes that must be contacted to find a successor in an N-node network is . (See proof below.)

Node join

Whenever a new node joins, three invariants should be maintained (the first two ensure correctness and the last one keeps querying fast):

  1. Each node's successor points to its immediate successor correctly.
  2. Each key is stored in .
  3. Each node's finger table should be correct.

To satisfy these invariants, a predecessor field is maintained for each node. As the successor is the first entry of the finger table, we do not need to maintain this field separately any more. The following tasks should be done for a newly joined node :

  1. Initialize node (the predecessor and the finger table).
  2. Notify other nodes to update their predecessors and finger tables.
  3. The new node takes over its responsible keys from its successor.

The predecessor of can be easily obtained from the predecessor of (in the previous circle). As for its finger table, there are various initialization methods. The simplest one is to execute find successor queries for all entries, resulting in initialization time. A better method is to check whether entry in the finger table is still correct for the entry. This will lead to . The best method is to initialize the finger table from its immediate neighbours and make some updates, which is .

Stabilization

To ensure correct lookups, all successor pointers must be up to date. Therefore, a stabilization protocol is running periodically in the background which updates finger tables and successor pointers.

The stabilization protocol works as follows:

Potential uses

Proof sketches

The routing path between nodes A and B. Each hop cuts the remaining distance in half (or better). Chord route.png
The routing path between nodes A and B. Each hop cuts the remaining distance in half (or better).

With high probability, Chord contacts nodes to find a successor in an -node network.

Suppose node wishes to find the successor of key . Let be the predecessor of . We wish to find an upper bound for the number of steps it takes for a message to be routed from to . Node will examine its finger table and route the request to the closest predecessor of that it has. Call this node . If is the entry in 's finger table, then both and are at distances between and from along the identifier circle. Hence, the distance between and along this circle is at most . Thus the distance from to is less than the distance from to : the new distance to is at most half the initial distance.

This process of halving the remaining distance repeats itself, so after steps, the distance remaining to is at most ; in particular, after steps, the remaining distance is at most . Because nodes are distributed uniformly at random along the identifier circle, the expected number of nodes falling within an interval of this length is 1, and with high probability, there are fewer than such nodes. Because the message always advances by at least one node, it takes at most steps for a message to traverse this remaining distance. The total expected routing time is thus .

If Chord keeps track of predecessors/successors, then with high probability, if each node has probability of 1/4 of failing, find_successor (see below) and find_predecessor (see below) will return the correct nodes

Simply, the probability that all nodes fail is , which is a low probability; so with high probability at least one of them is alive and the node will have the correct pointer.

Pseudocode

Definitions for pseudocode
finger[k]
first node that succeeds
successor
the next node from the node in question on the identifier ring
predecessor
the previous node from the node in question on the identifier ring

The pseudocode to find the successor node of an id is given below:

// ask node n to find the successor of idn.find_successor(id)// Yes, that should be a closing square bracket to match the opening parenthesis.// It is a half closed interval.if id  (n, successor] thenreturn successor     else         // forward the query around the circle         n0 := closest_preceding_node(id)         return n0.find_successor(id)  // search the local table for the highest predecessor of idn.closest_preceding_node(id)for i = m downto 1 doif (finger[i]  (n, id)) then             return finger[i]     return n

The pseudocode to stabilize the chord ring/circle after node joins and departures is as follows:

// create a new Chord ring.n.create()     predecessor := nil     successor := n  // join a Chord ring containing node n'.n.join(n')     predecessor := nil     successor := n'.find_successor(n)  // called periodically. n asks the successor// about its predecessor, verifies if n's immediate// successor is consistent, and tells the successor about nn.stabilize()     x = successor.predecessor     if x  (n, successor) then         successor := x     successor.notify(n)  // n' thinks it might be our predecessor.n.notify(n')if predecessor is nil or n'(predecessor, n) then         predecessor := n'  // called periodically. refreshes finger table entries.// next stores the index of the finger to fixn.fix_fingers()     next := next + 1     if next > m then         next := 1     finger[next] := find_successor(n+2next-1);  // called periodically. checks whether predecessor has failed.n.check_predecessor()if predecessor has failed then         predecessor := nil

See also

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Distributed hash table</span> Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

<span class="mw-page-title-main">Treap</span>

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random determined by the random bits; thus either the running time, or the output are random variables.

Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.

In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only keys need to be remapped on average where is the number of keys and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

BitVault is a content-addressable distributed storage system, developed by Microsoft Research in China. BitVault uses peer-to-peer technology to distribute the tasks of storing and managing data. As such, there is no central authority responsible for management of the system. Rather, it is self-managing, provides high availability, reliability and scales up in a self-organizing manner, with low administrative overhead, which is almost constant irrespective of the size of the distributed overlay network.

Pastry is an overlay network and routing network for the implementation of a distributed hash table (DHT) similar to Chord. The key–value pairs are stored in a redundant peer-to-peer network of connected Internet hosts. The protocol is bootstrapped by supplying it with the IP address of a peer already in the network and from then on via the routing table which is dynamically built and repaired. It is claimed that because of its redundant and decentralized nature there is no single point of failure and any single node can leave the network at any time without warning and with little or no chance of data loss. The protocol is also capable of using a routing metric supplied by an outside program, such as ping or traceroute, to determine the best routes to store in its routing table.

Tapestry is a peer-to-peer overlay network which provides a distributed hash table, routing, and multicasting infrastructure for distributed applications. The Tapestry peer-to-peer system offers efficient, scalable, self-repairing, location-aware routing to nearby resources.

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

In peer-to-peer networks, Koorde is a distributed hash table (DHT) system based on the Chord DHT and the De Bruijn graph. Inheriting the simplicity of Chord, Koorde meets O(log n) hops per node, and hops per lookup request with O(log n) neighbors per node.

The BAlanced Tree Overlay Network (BATON) is a distributed tree structure designed for peer-to-peer (P2P) systems. Unlike other overlays that utilize a distributed hash table (DHT) like the Chord system, BATON organizes peers in a distributed tree to facilitate range search. Furthermore, BATON aims to maintain a balanced tree height, similar to the AVL tree, resulting in a bounded to for search exact and range queries as well as update operations (join/leave).

Skip graphs are a kind of distributed data structure based on skip lists. They were invented in 2003 by James Aspnes and Gauri Shah. A nearly identical data structure called SkipNet was independently invented by Nicholas Harvey, Michael Jones, Stefan Saroiu, Marvin Theimer and Alec Wolman, also in 2003.

In computer science, an x-fast trie is a data structure for storing integers from a bounded domain. It supports exact and predecessor or successor queries in time O(log log M), using O(n log M) space, where n is the number of stored values and M is the maximum value in the domain. The structure was proposed by Dan Willard in 1982, along with the more complicated y-fast trie, as a way to improve the space usage of van Emde Boas trees, while retaining the O(log log M) query time.

In computer science, a y-fast trie is a data structure for storing integers from a bounded domain. It supports exact and predecessor or successor queries in time O(log log M), using O(n) space, where n is the number of stored values and M is the maximum value in the domain. The structure was proposed by Dan Willard in 1982 to decrease the O(n log M) space used by an x-fast trie.

<span class="mw-page-title-main">Rendezvous hashing</span>

Rendezvous or highest random weight (HRW) hashing is an algorithm that allows clients to achieve distributed agreement on a set of options out of a possible set of options. A typical application is when clients need to agree on which sites objects are assigned to.

Approximate Membership Query Filter (AMQ-Filter) is a group of space-efficient probabilistic data structures that supports approximate membership queries. An approximate membership query answers if an element is in a set or not with a false positive rate of .

References

  1. 1 2 3 Stoica, I.; Morris, R.; Kaashoek, M. F.; Balakrishnan, H. (2001). "Chord: A scalable peer-to-peer lookup service for internet applications" (PDF). ACM SIGCOMM Computer Communication Review. 31 (4): 149. doi:10.1145/964723.383071.
  2. "ACM SIGCOMM Test of Time Paper Award" . Retrieved 16 January 2022.
  3. Stoica, I.; Morris, R.; Liben-Nowell, D.; Karger, D.; Kaashoek, M. F.; Dabek, F.; Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications (PDF) (Technical report). MIT LCS. MIT. 819. Archived from the original (PDF) on 22 July 2012.
  4. Liben-Nowell, David; Balakrishnan, Hari; Karger, David (July 2002). Analysis of the evolution of peer-to-peer systems (PDF). PODC '02: Proceedings of the twenty-first annual symposium on Principles of distributed computing. pp. 233–242. doi:10.1145/571825.571863.
  5. Stoica, I.; Morris, R.; Liben-Nowell, D.; Karger, D.; Kaashoek, M. F.; Dabek, F.; Balakrishnan, H. (25 February 2003). "Chord: a scalable peer-to-peer lookup protocol for Internet applications". IEEE/ACM Transactions on Networking. 11 (1): 17–32. doi:10.1109/TNET.2002.808407. S2CID   221276912.
  6. Zave, Pamela (2012). "Using lightweight modeling to understand chord" (PDF). ACM SIGCOMM Computer Communication Review. 42 (2): 49–57. doi:10.1145/2185376.2185383. S2CID   11727788.
  7. Labbai, Peer Meera (Fall 2016). "T2WSN: TITIVATED TWO-TIRED CHORD OVERLAY AIDING ROBUSTNESS AND DELIVERY RATIO FOR WIRELESS SENSOR NETWORKS" (PDF). Journal of Theoretical and Applied Information Technology. 91: 168–176.