BK-tree

A BK-tree is a metric tree suggested by Walter Austin Burkhard and Robert M. Keller, specifically adapted to discrete metric spaces. For simplicity, consider an integer-valued discrete metric d(x, y). Then the BK-tree is defined in the following way. An arbitrary element a is selected as the root node. The root node may have zero or more subtrees. The k-th subtree is recursively built from all elements b such that d(a, b) = k. BK-trees can be used for approximate string matching in a dictionary.
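As an illustration of this recursive definition, the following minimal sketch (in Python; the class and attribute names are illustrative, not from any particular library) represents each node as a stored element together with one child per distance value:

```python
class BKNode:
    """One node of a BK-tree: a stored element plus one subtree per distance value."""

    def __init__(self, element):
        self.element = element
        # children[k] is the root of the k-th subtree, i.e. the subtree containing
        # all elements at distance exactly k from self.element.
        self.children = {}
```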

Example

[Figure: an example of a BK-tree]

This picture depicts the BK-tree for the set of words {"book", "books", "cake", "boo", "boon", "cook", "cape", "cart"} obtained by using the Levenshtein distance.

The BK-tree is built so that:

- each node corresponds to a single word of the set;
- the arc from a node to one of its children is weighted by the Levenshtein distance between the two words;
- all arcs leaving a given node carry distinct weights, so a node has at most one child at any given distance.
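Since the arc weights in the figure are Levenshtein distances, they can be reproduced with a standard dynamic-programming edit-distance routine such as the illustrative sketch below:

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between strings a and b, by dynamic programming."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[len(b)]

# A few arc weights from the figure:
print(levenshtein("book", "books"))  # 1
print(levenshtein("book", "cake"))   # 4
print(levenshtein("boo", "boon"))    # 1
```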

Insertion

The insertion primitive is used to populate a BK-tree t according to a discrete metric d.

Input: t, the BK-tree; w, the element to insert.

Output: t, with w added as a new node (unchanged if w was already present).

Algorithm: Starting at the root u, compute k = d(w, w_u). If k = 0, the element is already in the tree and nothing is done. Otherwise, if u has no outgoing arc of weight k, attach w as a new child of u with arc weight k; if such an arc exists, recurse into the corresponding child.
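A minimal sketch of this insertion procedure, assuming the illustrative BKNode class above and taking the metric as a plain function argument (names are illustrative):

```python
def bk_insert(root, element, distance):
    """Insert element into the BK-tree rooted at root, using the given metric.

    Returns the root (a new node if the tree was empty)."""
    if root is None:
        return BKNode(element)
    node = root
    while True:
        k = distance(element, node.element)
        if k == 0:                 # element already present: nothing to do
            return root
        child = node.children.get(k)
        if child is None:          # no k-th subtree yet: attach element as a new leaf
            node.children[k] = BKNode(element)
            return root
        node = child               # otherwise descend into the k-th subtree
```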

Lookup

Given a searched element w, the lookup primitive traverses the BK-tree to find the element closest to w. The key idea is to restrict the exploration of the tree to nodes that can only improve the best candidate found so far, by taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).

Input: t, the BK-tree; w, the searched element.

Output: w_best, the element of t closest to w, together with its distance d_best = d(w, w_best).

Algorithm: Maintain a set S of candidate nodes, initially containing only the root, and set d_best to infinity. While S is not empty, pop a node u from S and compute d_u = d(w, w_u); if d_u < d_best, record w_u as the new best candidate. Then, for each outgoing arc (u, v) of weight d_uv, insert v into S only if |d_uv - d_u| < d_best; by the triangle inequality, every other subtree cannot contain an element closer to w than the current best candidate.
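A sketch of this lookup, maintaining the candidate set S and applying the triangle-inequality cut-off when deciding which arcs to follow (again assuming the BKNode class above; the function name is illustrative):

```python
def bk_lookup(root, query, distance):
    """Return (best_element, best_distance) for the tree element closest to query."""
    if root is None:
        return None, float("inf")
    best_element, best_distance = None, float("inf")
    candidates = [root]                    # the set S of nodes still to examine
    while candidates:
        node = candidates.pop()
        d_u = distance(query, node.element)
        if d_u < best_distance:            # better candidate found
            best_element, best_distance = node.element, d_u
        for weight, child in node.children.items():
            # Cut-off: every element in this subtree is at distance `weight` from
            # node.element, so by the triangle inequality its distance to the
            # query is at least |d_u - weight|.
            if abs(d_u - weight) < best_distance:
                candidates.append(child)
    return best_element, best_distance
```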

Example of the Lookup Algorithm

Consider the example 8-node BK-tree shown above and the searched element w = "cool". S is initialized to contain the root of the tree, which is subsequently popped as the first value of u, with w_u = "book". Further, d_u = 2 since the distance from "book" to "cool" is 2, and d_best = 2 as this is the best (i.e. smallest) distance found thus far. Next, each outgoing arc from the root is considered in turn: the arc from "book" to "books" has weight 1, and since |1 - 2| = 1 is less than d_best = 2, the node containing "books" is inserted into S for further processing. The next arc, from "book" to "cake", has weight 4, and since |4 - 2| = 2 is not less than d_best = 2, the node containing "cake" is not inserted into S. Therefore, the subtree rooted at "cake" will be pruned from the search, as the word closest to "cool" cannot appear in that subtree. To see why this pruning is correct, notice that a candidate word c appearing in "cake"'s subtree and having distance less than 2 to "cool" would violate the triangle inequality: the triangle inequality requires that for this set of three numbers (as sides of a triangle), no two can sum to less than the third, but here the distance from "cool" to "book" (which is 2) plus the distance from "cool" to c (which is less than 2) cannot reach or exceed the distance from "book" to c (which is exactly 4, since by construction every word in the subtree rooted at "cake" is at distance 4 from "book"). Therefore, it is safe to disregard the entire subtree rooted at "cake".

Next, the node containing "books" is popped from S and now d_u = 3, the distance from "cool" to "books". As d_u is not less than d_best, d_best remains set at 2, and the single outgoing arc from the node containing "books" is considered: it has weight 2, and since |2 - 3| = 1 is less than d_best = 2, the node containing "boo" is inserted into S. Next, the node containing "boo" is popped from S and d_u = 2, the distance from "cool" to "boo". This again does not improve upon d_best. Each outgoing arc from "boo" is now considered; the arc from "boo" to "boon" has weight 1, and since |1 - 2| = 1 is less than d_best = 2, "boon" is added to S. Similarly, since |2 - 2| = 0 is less than d_best = 2, "cook" is also added to S.

Finally, each of the two last elements in S is considered in arbitrary order: suppose the node containing "cook" is popped first, improving d_best to the distance 1; the node containing "boon" is then popped last, and since it is at distance 2 from "cool" it does not improve the best result. Finally, "cook" is returned as the answer, with d_best = 1.
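Putting the pieces together, this walk-through can be reproduced with the sketches above (assuming the levenshtein, BKNode, bk_insert and bk_lookup definitions from the earlier sections are in scope):

```python
# The eight words of the figure; inserting them in this order, with "book" first,
# reproduces the example tree.
words = ["book", "books", "cake", "boo", "boon", "cook", "cape", "cart"]

root = None
for w in words:
    root = bk_insert(root, w, levenshtein)

print(bk_lookup(root, "cool", levenshtein))  # ('cook', 1)
```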

Related Research Articles

AVL tree Self-balancing binary search tree

In computer science, an AVL tree is a self-balancing binary search tree (BST). It was the first such data structure to be invented. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

Binary search tree Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure whose internal nodes each store a key greater than all the keys in the node's left subtree and less than those in its right subtree. The time complexity of operations on the binary search tree is directly proportional to the height of the tree.

Binary tree Limited form of tree data structure

In computer science, a binary tree is a tree data structure in which each node has at most two children, which are referred to as the left child and the right child. A recursive definition using just set theory notions is that a (non-empty) binary tree is a tuple (L, S, R), where L and R are binary trees or the empty set and S is a singleton set containing the root. Some authors allow the binary tree to be the empty set as well.

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

In computer science, a red–black tree is a kind of self-balancing binary search tree. Each node stores an extra bit representing "color", used to ensure that the tree remains balanced during insertions and deletions.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

Tree decomposition

In graph theory, a tree decomposition is a mapping of a graph into a tree that can be used to define the treewidth of the graph and speed up solving certain computational problems on the graph.

In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually used for trees based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to form the tree.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer science, tree traversal is a form of graph traversal and refers to the process of visiting each node in a tree data structure, exactly once. Such traversals are classified by the order in which the nodes are visited. The following algorithms are described for a binary tree, but they may be generalized to other trees as well.

R-tree

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

In the mathematical field of graph theory, the distance between two vertices in a graph is the number of edges in a shortest path connecting them. This is also known as the geodesic distance or shortest-path distance. Notice that there may be more than one shortest path between two vertices. If there is no path connecting the two vertices, i.e., if they belong to different connected components, then conventionally the distance is defined as infinite.

k-d tree Multidimensional search tree for points in k-dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key and creating point clouds. k-d trees are a special case of binary space partitioning trees.

In computer science, a leftist tree or leftist heap is a priority queue implemented with a variant of a binary heap. Every node x has an s-value, which is the distance to the nearest leaf in the subtree rooted at x. In contrast to a binary heap, a leftist tree attempts to be very unbalanced. In addition to the heap property, leftist trees are maintained so the right descendant of each node has the lower s-value.

A vantage-point tree is a metric tree that segregates data in a metric space by choosing a position in the space and partitioning the data points into two parts: those points that are nearer to the vantage point than a threshold, and those points that are not. By recursively applying this procedure to partition the data into smaller and smaller sets, a tree data structure is created where neighbors in the tree are likely to be neighbors in the space.

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations required to change one word into the other.

In computer science, a ball tree, balltree or metric tree, is a space partitioning data structure for organizing points in a multi-dimensional space. The ball tree gets its name from the fact that it partitions data points into a nested set of intersecting hyperspheres known as "balls". The resulting data structure has characteristics that make it useful for a number of applications, most notably nearest neighbor search.

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support is over a given threshold. It is a more general form of the maximum agreement subtree problem.

A central problem in algorithmic graph theory is the shortest path problem. One of the generalizations of the shortest path problem is known as the single-source shortest paths (SSSP) problem, which consists of finding the shortest paths from a source vertex to every other vertex in a graph. There are classical sequential algorithms which solve this problem, such as Dijkstra's algorithm, as well as parallel algorithms.
