Geometry of binary search trees

Last updated

In computer science, one approach to the dynamic optimality problem on online algorithms for binary search trees involves reformulating the problem geometrically, in terms of augmenting a set of points in the plane with as few additional points as possible to avoid rectangles with only two points on their boundary. [1]

Contents

Access sequences and competitive ratio

As typically formulated, the online binary search tree problem involves search trees defined over a fixed key set . An access sequence is a sequence ... where each access belongs to the key set.

Any particular algorithm for maintaining binary search trees (such as the splay tree algorithm or Iacono's working set structure) has a cost for each access sequence that models the amount of time it would take to use the structure to search for each of the keys in the access sequence in turn. The cost of a search is modeled by assuming that the search tree algorithm has a single pointer into a binary search tree, which at the start of each search points to the root of the tree. The algorithm may then perform any sequence of the following operations:

The search is required, at some point within this sequence of operations to move the pointer to a node containing the key, and the cost of the search is the number of operations that are performed in the sequence. The total cost costA(X) for algorithm A on access sequence X is the sum of the costs of the searches for each successive key in the sequence.

As is standard in competitive analysis, the competitive ratio of an algorithm A is defined to be the maximum, over all access sequences, of the ratio of the cost for A to the best cost that any algorithm could achieve:

The dynamic optimality conjecture states that splay trees have a constant competitive ratio, but this remains unproven. The geometric view of binary search trees provides a different way of understanding the problem that has led to the development of alternative algorithms that could also (conjecturally) have a constant competitive ratio.

Translation to a geometric point set

In the geometric view of the online binary search tree problem, an access sequence (sequence of searches performed on a binary search tree (BST) with a key set ) is mapped to the set of points , where the X-axis represents the key space and the Y-axis represents time; to which a set of touched nodes is added. By touched nodes we mean the following. Consider a BST access algorithm with a single pointer to a node in the tree. At the beginning of an access to a given key , this pointer is initialized to the root of the tree. Whenever the pointer moves to or is initialized to a node, we say that the node is touched. [2] We represent a BST algorithm for a given input sequence by drawing a point for each item that gets touched.

For example, assume the following BST on 4 nodes is given: StaticBinarySearchTree GeometricalView example.jpg The key set is {1, 2, 3, 4}.

Geometrical view of binary search trees - access sequence only.jpg
Mapping of the access sequence 3, 1, 4, 2 only.
Geometrical view of binary search trees - access sequence.jpg
A geometric view of binary search tree algorithm.

Let 3, 1, 4, 2 be the access sequence.

The touches are represented geometrically: If an item x is touched in the operations for the ith access, then a point (x,i) is plotted.

Arborally satisfied point sets

Rectangle spanned by two points. This point set is not arborally satisfied. Rectangle spanned by two points example.jpg
Rectangle spanned by two points. This point set is not arborally satisfied.
This is an example of an arborally satisfied set of points. Example of arborally satisfied set of points.jpg
This is an example of an arborally satisfied set of points.

A point set is said to be arborally satisfied if the following property holds: for any pair of points that do not lie on the same horizontal or vertical line, there exists a third point which lies in the rectangle spanned by the first two points (either inside or on the boundary).

Theorem

A point set containing the points is arborally satisfied if and only if it corresponds to a valid BST for the input sequence .

Proof

First, prove that the point set for any valid BST algorithm is arborally satisfied. Consider points and , where x is touched at time i and y is touched at time j. Assume by symmetry that and . It needs to be shown that there exists a third point in the rectangle with corners as and . Also let denote the lowest common ancestor of nodes a and b right before time t. There are a few cases:

  • If , then use the point , since must have been touched if x was.
  • If , then the point can be used.
  • If neither of the above two cases holds, then x must be an ancestor of y right before time i and y be an ancestor of x right before time j. Then at some time k, y must have been rotated above x, so the point can be used.

Next, show the other direction: given an arborally satisfied point set, a valid BST corresponding to that point set can be constructed. Organize our BST into a treap which is organized in heap-order by next-touch-time. Note that next-touch-time has ties and is thus not uniquely defined, but this isn’t a problem as long as there is a way to break ties. When time i reached, the nodes touched form a connected subtree at the top, by the heap ordering property. Now, assign new next-touch-times for this subtree, and rearrange it into a new local treap. If a pair of nodes, x and y, straddle the boundary between the touched and untouched part of the treap, then if y is to be touched sooner than x then is an unsatisfied rectangle because the leftmost such point would be the right child of x, not y.

Corollary

Finding the best BST execution for the input sequence is equivalent to finding the minimum cardinality superset of points (that contains the input in geometric representation) that is arborally satisfied. The more general problem of finding the minimum cardinality arborally satisfied superset of a general set of input points (not limited to one input point per y coordinate), is known to be NP-complete. [1]

Greedy algorithm

The following greedy algorithm constructs arborally satisfiable sets:

The algorithm has been conjectured to be optimal within an additive term. [3]

Other results

The geometry of binary search trees has been used to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal. [4]

See also

Related Research Articles

<span class="mw-page-title-main">Binary search tree</span> Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree. The time complexity of operations on the binary search tree is linear with respect to the height of the tree.

<span class="mw-page-title-main">Huffman coding</span> Technique to compress data

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

In computer science, a red–black tree is a specialised binary search tree data structure noted for fast storage and retrieval of ordered information, and a guarantee that operations will complete within a known time. Compared to other self-balancing binary search trees, the nodes in a red-black tree hold an extra bit called "color" representing "red" and "black" which is used when re-organising the tree to ensure that it is always approximately balanced.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

<span class="mw-page-title-main">Dijkstra's algorithm</span> Graph search algorithm

Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a weighted graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.

A* is a graph traversal and path search algorithm, which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency. One major practical drawback is its space complexity, as it stores all generated nodes in memory. Thus, in practical travel-routing systems, it is generally outperformed by algorithms that can pre-process the graph to attain better performance, as well as memory-bounded approaches; however, A* is still the best solution in many cases.

<span class="mw-page-title-main">Self-balancing binary search tree</span> Any node-based binary search tree that automatically keeps its height the same

In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions. These operations when designed for a self-balancing binary search tree, contain precautionary measures against boundlessly increasing tree height, so that these abstract data structures receive the attribute "self-balancing".

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets, and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

<span class="mw-page-title-main">Z-order curve</span> Mapping function that preserves data point locality

In mathematical analysis and computer science, functions which are Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It is named in France after Henri Lebesgue, who studied it in 1904, and named in the United States after Guy Macdonald Morton, who first applied the order to file sequencing in 1966. The z-value of a point in multidimensions is simply calculated by interleaving the binary representations of its coordinate values. Once the data are sorted into this ordering, any one-dimensional data structure can be used, such as simple one dimensional arrays, binary search trees, B-trees, skip lists or hash tables. The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree.

<span class="mw-page-title-main">ID3 algorithm</span> Decision tree algorithm

In decision tree learning, ID3 is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domain

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

In graph theory and computer science, the lowest common ancestor (LCA) of two nodes v and w in a tree or directed acyclic graph (DAG) T is the lowest node that has both v and w as descendants, where we define each node to be a descendant of itself.

A tango tree is a type of binary search tree proposed by Erik D. Demaine, Dion Harmon, John Iacono, and Mihai Pătrașcu in 2004. It is named after Buenos Aires, of which the tango is emblematic.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

In computer science, an optimal binary search tree (Optimal BST), sometimes called a weight-balanced binary tree, is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.

<span class="mw-page-title-main">Finger search</span>

In computer science, a finger search on a data structure is an extension of any search operation that structure supports, where a reference (finger) to an element in the data structure is given along with the query. While the search time for an element is most frequently expressed as a function of the number of elements in a data structure, finger search times are a function of the distance between the element and the finger.

Key-independent optimality is a property of some binary search tree data structures in computer science proposed by John Iacono. Suppose that key-value pairs are stored in a data structure, and that the keys have no relation to their paired values. A data structure has key-independent optimality if, when randomly assigning the keys, the expected performance of the data structure is within a constant factor of the optimal data structure. Key-independent optimality is related to dynamic optimality.

In the theory of optimal binary search trees, the interleave lower bound is a lower bound on the number of operations required by a Binary Search Tree (BST) to execute a given sequence of accesses.

References

  1. 1 2 Demaine, Erik D.; Harmon, Dion; Iacono, John; Kane, Daniel; Pătraşcu, Mihai (2009), "The geometry of binary search trees", In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), New York: 496–505, doi: 10.1137/1.9781611973068.55 , ISBN   978-0-89871-680-1
  2. Demaine, Erik D.; Harmon, Dion; Iacono, John; Pătraşcu, Mihai (2007), "Dynamic optimality—almost", SIAM Journal on Computing , 37 (1): 240–251, CiteSeerX   10.1.1.99.4964 , doi:10.1137/S0097539705447347, MR   2306291, S2CID   1480961
  3. Fox, Kyle (August 15–17, 2011). Upper bounds for maximally greedy binary search trees (PDF). Algorithms and Data Structures: 12th International Symposium, WADS 2011. Lecture Notes in Computer Science. Vol. 6844. New York: Springer. pp. 411–422. arXiv: 1102.4884 . doi:10.1007/978-3-642-22300-6_35.
  4. Iacono, John (2013). "In Pursuit of the Dynamic Optimality Conjecture". Space-Efficient Data Structures, Streams, and Algorithms. Lecture Notes in Computer Science. Vol. 8066. pp. 236–250. arXiv: 1306.0207 . Bibcode:2013arXiv1306.0207I. doi:10.1007/978-3-642-40273-9_16. ISBN   978-3-642-40272-2. S2CID   14729858.