In computing, Chord is a protocol and algorithm for a peer-to-peer distributed hash table. A distributed hash table stores key-value pairs by assigning keys to different computers (known as "nodes"); a node will store the values for all the keys for which it is responsible. Chord specifies how keys are assigned to nodes, and how a node can discover the value for a given key by first locating the node responsible for that key.
Chord is one of the four original distributed hash table protocols, along with CAN, Tapestry, and Pastry. It was introduced in 2001 by Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan, and was developed at MIT. [1] The 2001 Chord paper [1] won an ACM SIGCOMM Test of Time award in 2011. [2]
Subsequent research by Pamela Zave has shown that the original Chord algorithm (as specified in the 2001 SIGCOMM paper, [1] the 2001 Technical report, [3] the 2002 PODC paper, [4] and the 2003 TON paper [5] ) can mis-order the ring, produce several rings, and break the ring. [6]
Nodes and keys are assigned an -bit identifier using consistent hashing . The SHA-1 algorithm is the base hashing function for consistent hashing. Consistent hashing is integral to the robustness and performance of Chord because both keys and nodes (in fact, their IP addresses) are uniformly distributed in the same identifier space with a negligible possibility of collision. Thus, it also allows nodes to join and leave the network without disruption. In the protocol, the term node is used to refer to both a node itself and its identifier (ID) without ambiguity. So is the term key.
Using the Chord lookup protocol, nodes and keys are arranged in an identifier circle that has at most nodes, ranging from to . ( should be large enough to avoid collision.) Some of these nodes will map to machines or keys while others (most) will be empty.
Each node has a successor and a predecessor. The successor to a node is the next node in the identifier circle in a clockwise direction. The predecessor is counter-clockwise. If there is a node for each possible ID, the successor of node 0 is node 1, and the predecessor of node 0 is node ; however, normally there are "holes" in the sequence. For example, the successor of node 153 may be node 167 (and nodes from 154 to 166 do not exist); in this case, the predecessor of node 167 will be node 153.
The concept of successor can be used for keys as well. The successor node of a key is the first node whose ID equals to or follows in the identifier circle, denoted by . Every key is assigned to (stored at) its successor node, so looking up a key is to query .
Since the successor (or predecessor) of a node may disappear from the network (because of failure or departure), each node records an arc of nodes in the middle of which it stands, i.e., the list of nodes preceding it and nodes following it. This list results in a high probability that a node is able to correctly locate its successor or predecessor, even if the network in question suffers from a high failure rate.
The core usage of the Chord protocol is to query a key from a client (generally a node as well), i.e. to find . The basic approach is to pass the query to a node's successor, if it cannot find the key locally. This will lead to a query time where is the number of machines in the ring.
To avoid the linear search above, Chord implements a faster search method by requiring each node to keep a finger table containing up to entries, recall that is the number of bits in the hash key. The entry of node will contain . The first entry of finger table is actually the node's immediate successor (and therefore an extra successor field is not needed). Every time a node wants to look up a key , it will pass the query to the closest successor or predecessor (depending on the finger table) of in its finger table (the "largest" one on the circle whose ID is smaller than ), until a node finds out the key is stored in its immediate successor.
With such a finger table, the number of nodes that must be contacted to find a successor in an N-node network is . (See proof below.)
Whenever a new node joins, three invariants should be maintained (the first two ensure correctness and the last one keeps querying fast):
To satisfy these invariants, a predecessor field is maintained for each node. As the successor is the first entry of the finger table, we do not need to maintain this field separately any more. The following tasks should be done for a newly joined node :
The predecessor of can be easily obtained from the predecessor of (in the previous circle). As for its finger table, there are various initialization methods. The simplest one is to execute find successor queries for all entries, resulting in initialization time. A better method is to check whether entry in the finger table is still correct for the entry. This will lead to . The best method is to initialize the finger table from its immediate neighbours and make some updates, which is .
To ensure correct lookups, all successor pointers must be up to date. Therefore, a stabilization protocol is running periodically in the background which updates finger tables and successor pointers.
The stabilization protocol works as follows:
With high probability, Chord contacts nodes to find a successor in an -node network.
Suppose node wishes to find the successor of key . Let be the predecessor of . We wish to find an upper bound for the number of steps it takes for a message to be routed from to . Node will examine its finger table and route the request to the closest predecessor of that it has. Call this node . If is the entry in 's finger table, then both and are at distances between and from along the identifier circle. Hence, the distance between and along this circle is at most . Thus the distance from to is less than the distance from to : the new distance to is at most half the initial distance.
This process of halving the remaining distance repeats itself, so after steps, the distance remaining to is at most ; in particular, after steps, the remaining distance is at most . Because nodes are distributed uniformly at random along the identifier circle, the expected number of nodes falling within an interval of this length is 1, and with high probability, there are fewer than such nodes. Because the message always advances by at least one node, it takes at most steps for a message to traverse this remaining distance. The total expected routing time is thus .
If Chord keeps track of predecessors/successors, then with high probability, if each node has probability of 1/4 of failing, find_successor (see below) and find_predecessor (see below) will return the correct nodes
Simply, the probability that all nodes fail is , which is a low probability; so with high probability at least one of them is alive and the node will have the correct pointer.
The pseudocode to find the successor node of an id is given below:
// ask node n to find the successor of idn.find_successor(id)// Yes, that should be a closing square bracket to match the opening parenthesis.// It is a half closed interval.if id ∈ (n, successor] thenreturn successor else // forward the query around the circle n0 := closest_preceding_node(id) return n0.find_successor(id) // search the local table for the highest predecessor of idn.closest_preceding_node(id)for i = m downto 1 doif (finger[i] ∈ (n, id)) then return finger[i] return n
The pseudocode to stabilize the chord ring/circle after node joins and departures is as follows:
// create a new Chord ring.n.create() predecessor := nil successor := n // join a Chord ring containing node n'.n.join(n') predecessor := nil successor := n'.find_successor(n) // called periodically. n asks the successor// about its predecessor, verifies if n's immediate// successor is consistent, and tells the successor about nn.stabilize() x = successor.predecessor if x ∈ (n, successor) then successor := x successor.notify(n) // n' thinks it might be our predecessor.n.notify(n')if predecessor is nil or n'∈(predecessor, n) then predecessor := n' // called periodically. refreshes finger table entries.// next stores the index of the finger to fixn.fix_fingers() next := next + 1 if next > m then next := 1 finger[next] := find_successor(n+2next-1); // called periodically. checks whether predecessor has failed.n.check_predecessor()if predecessor has failed then predecessor := nil
In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.
A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.
A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.
In computer science, a perfect hash functionh for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.
Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.
In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjan's 1986 article.
In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.
A van Emde Boas tree, also known as a vEB tree or van Emde Boas priority queue, is a tree data structure which implements an associative array with m-bit integer keys. It was invented by a team led by Dutch computer scientist Peter van Emde Boas in 1975. It performs all operations in O(log m) time, or equivalently in time, where is the largest element that can be stored in the tree. The parameter is not to be confused with the actual number of elements stored in the tree, by which the performance of other tree data-structures is often measured.
In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only keys need to be remapped on average where is the number of keys and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.
Pastry is an overlay network and routing network for the implementation of a distributed hash table (DHT) similar to Chord. The key–value pairs are stored in a redundant peer-to-peer network of connected Internet hosts. The protocol is bootstrapped by supplying it with the IP address of a peer already in the network and from then on via the routing table which is dynamically built and repaired. It is claimed that because of its redundant and decentralized nature there is no single point of failure and any single node can leave the network at any time without warning and with little or no chance of data loss. The protocol is also capable of using a routing metric supplied by an outside program, such as ping or traceroute, to determine the best routes to store in its routing table.
Tapestry is a peer-to-peer overlay network which provides a distributed hash table, routing, and multicasting infrastructure for distributed applications. The Tapestry peer-to-peer system offers efficient, scalable, self-repairing, location-aware routing to nearby resources.
In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
In peer-to-peer networks, Koorde is a distributed hash table (DHT) system based on the Chord DHT and the De Bruijn graph. Inheriting the simplicity of Chord, Koorde meets O(log n) hops per node, and hops per lookup request with O(log n) neighbors per node.
The BAlanced Tree Overlay Network (BATON) is a distributed tree structure designed for peer-to-peer (P2P) systems. Unlike other overlays that employ a distributed hash table, BATON organises peers in a distributed tree to facilitate range search. BATON aims to maintain a balanced tree height, similar to the AVL tree, resulting in a bounded to for search exact and range queries as well as update operations (join/leave).
Skip graphs are a kind of distributed data structure based on skip lists. They were invented in 2003 by James Aspnes and Gauri Shah. A nearly identical data structure called SkipNet was independently invented by Nicholas Harvey, Michael Jones, Stefan Saroiu, Marvin Theimer and Alec Wolman, also in 2003.
In computer science, an x-fast trie is a data structure for storing integers from a bounded domain. It supports exact and predecessor or successor queries in time O(log log M), using O(n log M) space, where n is the number of stored values and M is the maximum value in the domain. The structure was proposed by Dan Willard in 1982, along with the more complicated y-fast trie, as a way to improve the space usage of van Emde Boas trees, while retaining the O(log log M) query time.
Network coding has been shown to optimally use bandwidth in a network, maximizing information flow but the scheme is very inherently vulnerable to pollution attacks by malicious nodes in the network. A node injecting garbage can quickly affect many receivers. The pollution of network packets spreads quickly since the output of honest node is corrupted if at least one of the incoming packets is corrupted.
Rendezvous or highest random weight (HRW) hashing is an algorithm that allows clients to achieve distributed agreement on a set of options out of a possible set of options. A typical application is when clients need to agree on which sites objects are assigned to.
Approximate membership query filters comprise a group of space-efficient probabilistic data structures that support approximate membership queries. An approximate membership query answers whether an element is in a set or not with a false positive rate of .