Optimal binary search tree

Last updated

In computer science, an optimal binary search tree (Optimal BST), sometimes called a weight-balanced binary tree, [1] is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.

Contents

In the static optimality problem, the tree cannot be modified after it has been constructed. In this case, there exists some particular layout of the nodes of the tree which provides the smallest expected search time for the given access probabilities. Various algorithms exist to construct or approximate the statically optimal tree given the information on the access probabilities of the elements.

In the dynamic optimality problem, the tree can be modified at any time, typically by permitting tree rotations. The tree is considered to have a cursor starting at the root which it can move or use to perform modifications. In this case, there exists some minimal-cost sequence of these operations which causes the cursor to visit every node in the target access sequence in order. The splay tree is conjectured to have a constant competitive ratio compared to the dynamically optimal tree in all cases, though this has not yet been proven.

Static optimality

Definition

In the static optimality problem as defined by Knuth, [2] we are given a set of n ordered elements and a set of probabilities. We will denote the elements through and the probabilities through and through . is the probability of a search being done for element (or successful search). [3] For , is the probability of a search being done for an element between and (or unsuccessful search), [3] is the probability of a search being done for an element strictly less than , and is the probability of a search being done for an element strictly greater than . These probabilities cover all possible searches, and therefore add up to one.

The static optimality problem is the optimization problem of finding the binary search tree that minimizes the expected search time, given the probabilities. As the number of possible trees on a set of n elements is , [2] which is exponential in n, brute-force search is not usually a feasible solution.

Knuth's dynamic programming algorithm

In 1971, Knuth published a relatively straightforward dynamic programming algorithm capable of constructing the statically optimal tree in only O(n2) time. [2] In this work, Knuth extended and improved the dynamic programming algorithm by Edgar Gilbert and Edward F. Moore introduced in 1958. [4] Gilbert's and Moore's algorithm required time and space and was designed for a particular case of optimal binary search trees construction (known as optimal alphabetic tree problem [5] ) that considers only the probability of unsuccessful searches, that is, . Knuth's work relied upon the following insight: the static optimality problem exhibits optimal substructure; that is, if a certain tree is statically optimal for a given probability distribution, then its left and right subtrees must also be statically optimal for their appropriate subsets of the distribution (known as monotonicity property of the roots).

To see this, consider what Knuth calls the "weighted path length" of a tree. The weighted path length of a tree of n elements is the sum of the lengths of all possible search paths, weighted by their respective probabilities. The tree with the minimal weighted path length is, by definition, statically optimal.

But weighted path lengths have an interesting property. Let E be the weighted path length of a binary tree, EL be the weighted path length of its left subtree, and ER be the weighted path length of its right subtree. Also let W be the sum of all the probabilities in the tree. Observe that when either subtree is attached to the root, the depth of each of its elements (and thus each of its search paths) is increased by one. Also observe that the root itself has a depth of one. This means that the difference in weighted path length between a tree and its two subtrees is exactly the sum of every single probability in the tree, leading to the following recurrence:

This recurrence leads to a natural dynamic programming solution. Let be the weighted path length of the statically optimal search tree for all values between ai and aj, let be the total weight of that tree, and let be the index of its root. The algorithm can be built using the following formulas:

The naive implementation of this algorithm actually takes O(n3) time, but Knuth's paper includes some additional observations which can be used to produce a modified algorithm taking only O(n2) time.

In addition to its dynamic programming algorithm, Knuth proposed two heuristics (or rules) to produce nearly (approximation of) optimal binary search trees. Studying nearly optimal binary search trees was necessary since Knuth's algorithm time and space complexity can be prohibitive when is substantially large. [6]

Knuth's rules can be seen as the following:

Knuth's heuristics implements nearly optimal binary search trees in time and space. The analysis on how far from the optimum Knuth's heuristics can be was further proposed by Kurt Mehlhorn. [6]

Mehlhorn's approximation algorithm

While the O(n2) time taken by Knuth's algorithm is substantially better than the exponential time required for a brute-force search, it is still too slow to be practical when the number of elements in the tree is very large.

In 1975, Kurt Mehlhorn published a paper proving important properties regarding Knuth's rules. Mehlhorn's major results state that only one of Knuth's heuristics (Rule II) always produces nearly optimal binary search trees. On the other hand, the root-max rule could often lead to very "bad" search trees based on the following simple argument. [6]

Let

and

Considering the weighted path length of the tree constructed based on the previous definition, we have the following:

Thus, the resulting tree by the root-max rule will be a tree that grows only on the right side (except for the deepest level of the tree), and the left side will always have terminal nodes. This tree has a path length bounded by and, when compared with a balanced search tree (with path bounded by ), will perform substantially worse for the same frequency distribution. [6]

In addition, Mehlhorn improved Knuth's work and introduced a much simpler algorithm that uses Rule II and closely approximates the performance of the statically optimal tree in only time. [6] The algorithm follows the same idea of the bisection rule by choosing the tree's root to balance the total weight (by probability) of the left and right subtrees most closely. And the strategy is then applied recursively on each subtree.

That this strategy produces a good approximation can be seen intuitively by noting that the weights of the subtrees along any path form something very close to a geometrically decreasing sequence. In fact, this strategy generates a tree whose weighted path length is at most

where H is the entropy of the probability distribution. Since no optimal binary search tree can ever do better than a weighted path length of

this approximation is very close. [6]

Hu–Tucker and Garsia–Wachs algorithms

In the special case that all of the values are zero, the optimal tree can be found in time . This was first proved by T. C. Hu and Alan Tucker in a paper that they published in 1971. A later simplification by Garsia and Wachs, the Garsia–Wachs algorithm, performs the same comparisons in the same order. The algorithm works by using a greedy algorithm to build a tree that has the optimal height for each leaf, but is out of order, and then constructing another binary search tree with the same heights. [7]

Example Code Snippet

The following code snippet determines an optimal binary search tree when given a set of keys and probability values that the key is the search key:

public static float calculateOptimalSearchTree(int numNodes, float[] probabilities, int[][] roots) {        float[][] costMatrix = new float[numNodes + 2][numNodes + 1];        for (int i = 1; i <= numNodes; i++) {            costMatrix[i][i - 1] = 0;            costMatrix[i][i] = probabilities[i];            roots[i][i] = i;            roots[i][i - 1] = 0;        }        for (int diagonal = 1; diagonal <= numNodes; diagonal++) {            for (int i = 1; i <= numNodes - diagonal; i++) {                int j = i + diagonal;                costMatrix[i][j] = findMinCost(costMatrix, i, j) + sumProbabilities(probabilities, i, j);                // Note: roots[i][j] assignment is missing, this needs to be fixed if you want                // to reconstruct the tree.            }        }        return costMatrix[1][numNodes]; }


Dynamic optimality

Unsolved problem in computer science:

Do splay trees perform as well as any other binary search tree algorithm?

Definition

There are several different definitions of dynamic optimality, all of which are effectively equivalent to within a constant factor in terms of running-time. [8] The problem was first introduced implicitly by Sleator and Tarjan in their paper on splay trees, [9] but Demaine et al. give a very good formal statement of it. [8]

In the dynamic optimality problem, we are given a sequence of accesses x1, ..., xm on the keys 1, ..., n. For each access, we are given a pointer to the root of our BST and may use the pointer to perform any of the following operations:

  1. Move the pointer to the left child of the current node.
  2. Move the pointer to the right child of the current node.
  3. Move the pointer to the parent of the current node.
  4. Perform a single rotation on the current node and its parent.

(It is the presence of the fourth operation, which rearranges the tree during the accesses, which makes this the dynamic optimality problem.)

For each access, our BST algorithm may perform any sequence of the above operations as long as the pointer eventually ends up on the node containing the target value xi. The time it takes a given dynamic BST algorithm to perform a sequence of accesses is equivalent to the total number of such operations performed during that sequence. Given any sequence of accesses on any set of elements, there is some minimum total number of operations required to perform those accesses. We would like to come close to this minimum.

While it is impossible to implement this "God's algorithm" without foreknowledge of exactly what the access sequence will be, we can define OPT(X) as the number of operations it would perform for an access sequence X, and we can say that an algorithm is dynamically optimal if, for any X, it performs X in time O(OPT(X)) (that is, it has a constant competitive ratio). [8]

There are several data structures conjectured to have this property, but none proven. It is an open problem whether there exists a dynamically optimal data structure in this model.

Splay trees

The splay tree is a form of binary search tree invented in 1985 by Daniel Sleator and Robert Tarjan on which the standard search tree operations run in amortized time. [10] It is conjectured to be dynamically optimal in the required sense. That is, a splay tree is believed to perform any sufficiently long access sequence X in time O(OPT(X)). [9]

Tango trees

The tango tree is a data structure proposed in 2004 by Erik Demaine and others which has been proven to perform any sufficiently-long access sequence X in time . While this is not dynamically optimal, the competitive ratio of is still very small for reasonable values of n. [8]

Other results

In 2013, John Iacono published a paper which uses the geometry of binary search trees to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal. [11] Nodes are interpreted as points in two dimensions, and the optimal access sequence is the smallest arborally satisfied superset of those points. Unlike splay trees and tango trees, Iacono's data structure is not known to be implementable in constant time per access sequence step, so even if it is dynamically optimal, it could still be slower than other search tree data structures by a non-constant factor.

The interleave lower bound is an asymptotic lower bound on dynamic optimality.

See also

Notes

  1. Tremblay, Jean-Paul; Cheston, Grant A. (2001). Data Structures and Software Development in an object-oriented domain. Eiffel Edition/Prentice Hall. ISBN   978-0-13-787946-5.
  2. 1 2 3 Knuth, Donald E. (1971), "Optimum binary search trees", Acta Informatica, 1 (1): 14–25, doi:10.1007/BF00264289, S2CID   62777263
  3. 1 2 Nagaraj, S. V. (1997-11-30). "Optimal binary search trees". Theoretical Computer Science. 188 (1): 1–44. doi:10.1016/S0304-3975(96)00320-9. ISSN   0304-3975. S2CID   33484183.
  4. Gilbert, E. N.; Moore, E. F. (July 1959). "Variable-Length Binary Encodings". Bell System Technical Journal. 38 (4): 933–967. doi:10.1002/j.1538-7305.1959.tb01583.x.
  5. Hu, T. C.; Tucker, A. C. (December 1971). "Optimal Computer Search Trees and Variable-Length Alphabetical Codes". SIAM Journal on Applied Mathematics. 21 (4): 514–532. doi:10.1137/0121057. ISSN   0036-1399.
  6. 1 2 3 4 5 6 Mehlhorn, Kurt (1975), "Nearly optimal binary search trees", Acta Informatica, 5 (4): 287–295, doi:10.1007/BF00264563, S2CID   17188103
  7. Knuth, Donald E. (1998), "Algorithm G (Garsia–Wachs algorithm for optimum binary trees)", The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd ed.), Addison–Wesley, pp. 451–453. See also History and bibliography, pp. 453–454.
  8. 1 2 3 4 Demaine, Erik D.; Harmon, Dion; Iacono, John; Patrascu, Mihai (2004), "Dynamic optimality—almost" (PDF), Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 484–490, CiteSeerX   10.1.1.99.4964 , doi:10.1109/FOCS.2004.23, ISBN   978-0-7695-2228-9
  9. 1 2 Sleator, Daniel; Tarjan, Robert (1985), "Self-adjusting binary search trees", Journal of the ACM, 32 (3): 652–686, doi: 10.1145/3828.3835 , S2CID   1165848
  10. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald; Stein, Clifford (2009). Introduction to algorithms (PDF) (Third ed.). The MIT Press. p. 503. ISBN   978-0-262-03384-8 . Retrieved 31 October 2017.
  11. Iacono, John (2013), "In pursuit of the dynamic optimality conjecture", arXiv: 1306.0207 [cs.DS]

Related Research Articles

<span class="mw-page-title-main">AVL tree</span> Self-balancing binary search tree

In computer science, an AVL tree is a self-balancing binary search tree. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Binary search tree</span> Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree. The time complexity of operations on the binary search tree is linear with respect to the height of the tree.

<span class="mw-page-title-main">Huffman coding</span> Technique to compress data

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

In computer science, a red–black tree is a specialised binary search tree data structure noted for fast storage and retrieval of ordered information, and a guarantee that operations will complete within a known time. Compared to other self-balancing binary search trees, the nodes in a red-black tree hold an extra bit called "color" representing "red" and "black" which is used when re-organising the tree to ensure that it is always approximately balanced.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

<span class="mw-page-title-main">Treap</span>

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

<span class="mw-page-title-main">Self-balancing binary search tree</span> Any node-based binary search tree that automatically keeps its height the same

In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions. These operations when designed for a self-balancing binary search tree, contain precautionary measures against boundlessly increasing tree height, so that these abstract data structures receive the attribute "self-balancing".

In computer science, a scapegoat tree is a self-balancing binary search tree, invented by Arne Andersson in 1989 and again by Igal Galperin and Ronald L. Rivest in 1993. It provides worst-case lookup time and amortized insertion and deletion time.

<i>m</i>-ary tree Tree data structure in which each node has at most m children.

In graph theory, an m-ary tree is an arborescence in which each node has no more than m children. A binary tree is the special case where m = 2, and a ternary tree is another case with m = 3 that limits its children to three.

In computer science, the longest increasing subsequence problem aims to find a subsequence of a given sequence in which the subsequence's elements are sorted in an ascending order and in which the subsequence is as long as possible. This subsequence is not necessarily contiguous or unique. The longest increasing subsequences are studied in the context of various disciplines related to mathematics, including algorithmics, random matrix theory, representation theory, and physics. The longest increasing subsequence problem is solvable in time where denotes the length of the input sequence.

In computer science, weight-balanced binary trees (WBTs) are a type of self-balancing binary search trees that can be used to implement dynamic sets, dictionaries (maps) and sequences. These trees were introduced by Nievergelt and Reingold in the 1970s as trees of bounded balance, or BB[α] trees. Their more common name is due to Knuth.

A link/cut tree is a data structure for representing a forest, a set of rooted trees, and offers the following operations:

A tango tree is a type of binary search tree proposed by Erik D. Demaine, Dion Harmon, John Iacono, and Mihai Pătrașcu in 2004. It is named after Buenos Aires, of which the tango is emblematic.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

<span class="mw-page-title-main">Random binary tree</span> Binary tree selected at random

In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Different distributions have been used, leading to different properties for these trees.

<span class="mw-page-title-main">Fenwick tree</span> Data structure

A Fenwick tree or binary indexed tree(BIT) is a data structure that can efficiently update values and calculate prefix sums in an array of values.

<span class="mw-page-title-main">Top tree</span>

A top tree is a data structure based on a binary tree for unrooted dynamic trees that is used mainly for various path-related operations. It allows simple divide-and-conquer algorithms. It has since been augmented to maintain dynamically various properties of a tree such as diameter, center and median.

In computer science, one approach to the dynamic optimality problem on online algorithms for binary search trees involves reformulating the problem geometrically, in terms of augmenting a set of points in the plane with as few additional points as possible to avoid rectangles with only two points on their boundary.

The Garsia–Wachs algorithm is an efficient method for computers to construct optimal binary search trees and alphabetic Huffman codes, in linearithmic time. It is named after Adriano Garsia and Michelle L. Wachs.