In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Different distributions have been used, leading to different properties for these trees.
Random binary trees have been used for analyzing the average-case complexity of data structures based on binary search trees. For this application it is common to use random trees formed by inserting nodes one at a time according to a random permutation. [1] The resulting trees are very likely to have logarithmic depth and logarithmic Strahler number. The treap and related balanced binary search trees use update operations that maintain this random structure even when the update sequence is non-random.
Other distributions on random binary trees include the uniform discrete distribution in which all distinct trees are equally likely, distributions on a given number of nodes obtained by repeated splitting, binary tries and radix trees for random data, and trees of variable size generated by branching processes.
For random trees that are not necessarily binary, see random tree.
A binary tree is a rooted tree in which each node may have up to two children (the nodes directly below it in the tree), and those children are designated as being either left or right. It is sometimes convenient instead to consider extended binary trees in which each node is either an external node with zero children, or an internal node with exactly two children. A binary tree that is not in extended form may be converted into an extended binary tree by treating all its nodes as internal, and adding an external node for each missing child of an internal node. In the other direction, an extended binary tree with at least one internal node may be converted back into a non-extended binary tree by removing all its external nodes. In this way, these two forms are almost entirely equivalent for the purposes of mathematical analysis, except that the extended form allows a tree consisting of a single external node, which does not correspond to anything in the non-extended form. For the purposes of computer data structures, the two forms differ, as the external nodes of the first form may be represented explicitly as objects in a data structure. [2]
In a binary search tree the internal nodes are labeled by numbers or other ordered values, called keys, arranged so that an inorder traversal of the tree lists the keys in sorted order. The external nodes remain unlabeled. [3] Binary trees may also be studied with all nodes unlabeled, or with labels that are not given in sorted order. For instance, the Cartesian tree data structure uses labeled binary trees that are not necessarily binary search trees. [4]
A random binary tree is a random tree drawn from a certain probability distribution on binary trees. In many cases, these probability distributions are defined using a given set of keys, and describe the probabilities of binary search trees having those keys. However, other distributions are possible, not necessarily generating binary search trees, and not necessarily giving a fixed number of nodes. [5]
For any sequence of distinct ordered keys, one may form a binary search tree in which each key is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted keys. The position for each insertion can be found by a binary search in the previous tree. The random permutation model, for a given set of keys, is defined by choosing the sequence randomly from the permutations of the set, with each permutation having equal probability. [6]
For instance, if the three keys 1, 3, 2 are inserted into a binary search tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the number 3. There are six different permutations of the keys 1, 2, and 3, but only five trees may be constructed from them. That is because the permutations 2, 1, 3 and 2, 3, 1 form the same tree. Thus, this tree has probability $\tfrac{2}{6}=\tfrac{1}{3}$ of being generated, whereas the other four trees each have probability $\tfrac{1}{6}$. [5]
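The three-key example can be checked by brute force. The following is a minimal sketch, not taken from the cited sources: it stores a tree as nested tuples `(left, key, right)` (a representation chosen here for illustration) and enumerates every insertion order to obtain the exact probability of each tree.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def insert(tree, key):
    """Insert key as a new leaf of a BST stored as nested tuples (left, key, right)."""
    if tree is None:
        return (None, key, None)
    left, root, right = tree
    if key < root:
        return (insert(left, key), root, right)
    return (left, root, insert(right, key))


def build(sequence):
    """Build the BST obtained by inserting the keys of `sequence` in order."""
    tree = None
    for key in sequence:
        tree = insert(tree, key)
    return tree


def shape_distribution(keys):
    """Exact probability of each tree under the random permutation model."""
    orders = list(permutations(keys))
    counts = Counter(build(order) for order in orders)
    return {tree: Fraction(c, len(orders)) for tree, c in counts.items()}


dist = shape_distribution([1, 2, 3])
for tree, prob in sorted(dist.items(), key=lambda item: item[1], reverse=True):
    print(prob, tree)
```

The balanced tree rooted at 2 is the only one produced by two different insertion orders, which is why it is listed first.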
For any key $x$ in a given set of $n$ keys, the expected value of the length of the path from the root to $x$ in a random binary search tree is at most $2\log n+O(1)$, where "$\log$" denotes the natural logarithm function and the $O$ introduces big O notation. By linearity of expectation, the expected number of ancestors of $x$ equals the sum, over other keys $y$, of the probability that $y$ is an ancestor of $x$. A key $y$ is an ancestor of $x$ exactly when $y$ is the first key to be inserted from the interval $[x,y]$. Because each key in the interval is equally likely to be first, this happens with probability inverse to the length of the interval. Thus, the keys that are adjacent to $x$ in the sorted sequence of keys have probability $\tfrac12$ of being an ancestor of $x$, the keys one step away have probability $\tfrac13$, etc. The sum of these probabilities forms two copies of the harmonic series extending away from $x$ in both directions in the sorted sequence, giving the bound above. This bound also holds for the expected search path length for a value $x$ that is not one of the given keys. [7]
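The harmonic-series argument can be checked numerically. In the sketch below (function names are illustrative, not from the cited source), the key of rank $i$ among $n$ keys has expected depth $H_i + H_{n-i+1} - 2$, where $H_m$ is the $m$-th harmonic number; this is the sum of $1/2, 1/3, \dots$ taken in both directions. The brute-force version averages the actual depth over all insertion orders.

```python
import math
from fractions import Fraction
from itertools import permutations


def harmonic(m):
    """m-th harmonic number as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, m + 1))


def expected_depth_formula(n, i):
    """Expected number of ancestors of the i-th smallest of n keys (1 <= i <= n)."""
    return harmonic(i) + harmonic(n - i + 1) - 2


def depth_in_bst(order, target):
    """Depth of `target` when keys are inserted in `order` (the root has depth 0)."""
    lo, hi = float("-inf"), float("inf")
    depth = 0
    for key in order:
        if key == target:
            return depth
        if lo < key < hi:
            # key is the first inserted key of the interval between it and target,
            # so it becomes an ancestor and narrows the remaining interval.
            depth += 1
            if key < target:
                lo = key
            else:
                hi = key
    raise ValueError("target not in order")


def expected_depth_bruteforce(n, i):
    """Average depth of key i over all n! insertion orders of 1..n."""
    orders = list(permutations(range(1, n + 1)))
    return Fraction(sum(depth_in_bst(o, i) for o in orders), len(orders))


n = 5
for i in range(1, n + 1):
    print(i, expected_depth_formula(n, i), expected_depth_bruteforce(n, i))
print("bound 2 ln n =", 2 * math.log(n))
```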
The longest root-to-leaf path, in a random binary search tree, is longer than the expected path length, but only by a constant factor. Its length, for a tree with $n$ nodes, is with high probability approximately $\frac{1}{\beta}\log n\approx 4.311\log n$, where $\beta$ is the unique number in the range $0<\beta<1$ satisfying the equation $2\beta e^{1-\beta}=1$.
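The constant in the height bound has no closed form, but it is easy to approximate. The sketch below assumes the defining equation is $2\beta e^{1-\beta}=1$ on $0<\beta<1$ and solves it by bisection; the function names and tolerance are illustrative choices.

```python
import math


def f(beta):
    """Left side minus right side of 2 * beta * e^(1 - beta) = 1."""
    return 2 * beta * math.exp(1 - beta) - 1


def solve_beta(tol=1e-12):
    """Bisection on (0, 1): f(0) = -1 < 0 < 1 = f(1), and f is increasing there."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


beta = solve_beta()
print(beta, 1 / beta)  # the second value is the coefficient of log n in the height
```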
In the random permutation model, each key except the smallest and largest has probability $\tfrac13$ of being a leaf in the tree. This is because it is a leaf when it is inserted after its two neighbors, which happens for two out of the six permutations of it and its two neighbors, all of which are equally likely. By similar reasoning, the smallest and largest key have probability $\tfrac12$ of being a leaf. Therefore, the expected number of leaves is the sum of these probabilities, which for $n\ge 2$ is exactly $\tfrac{n+1}{3}$. [9]
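Written out, and assuming $n\ge 2$ keys so that the smallest and largest keys are distinct, the sum of the leaf probabilities is:

```latex
\mathbb{E}[\text{number of leaves}]
  = \underbrace{(n-2)\cdot\tfrac{1}{3}}_{\text{keys with two neighbors}}
  + \underbrace{2\cdot\tfrac{1}{2}}_{\text{smallest and largest key}}
  = \frac{n+1}{3}.
```

For $n=3$ this gives $\tfrac43$, which agrees with a direct count: the balanced tree (probability $\tfrac13$) has two leaves and each of the four path-shaped trees (total probability $\tfrac23$) has one, so the expectation is $2\cdot\tfrac13+1\cdot\tfrac23=\tfrac43$.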
The Strahler number of a vertex in a tree is a measure of the complexity of the subtree under that vertex. A leaf (external node) has Strahler number one. For any other node, the Strahler number is defined recursively from the Strahler numbers of its children. In a binary tree, if two children have different Strahler numbers, the Strahler number of their parent is the larger of the two child numbers. But if two children have equal Strahler numbers, their parent has a number that is greater by one. The Strahler number of the whole tree is the number at the root node. For $n$-node random binary search trees, simulations suggest that the expected Strahler number is $\log_3 n+O(1)$. A weaker upper bound, $\log_3 n+o(\log n)$, has been proven. [10]
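The recursive definition translates directly into code. This sketch assumes an extended binary tree encoded as nested tuples, with `None` for an external node and `(left, right)` for an internal node; the encoding is an illustrative choice.

```python
def strahler(tree):
    """Strahler number of an extended binary tree.

    `None` is an external node (Strahler number 1);
    `(left, right)` is an internal node with two subtrees.
    """
    if tree is None:
        return 1
    left, right = tree
    a, b = strahler(left), strahler(right)
    # Equal children raise the number by one; otherwise the larger one wins.
    return a + 1 if a == b else max(a, b)


leaf = None
cherry = (leaf, leaf)                 # one internal node
path = ((cherry, leaf), leaf)         # a left-leaning path of internal nodes
complete = (cherry, cherry)           # a complete tree with three internal nodes
print(strahler(cherry), strahler(path), strahler(complete))
```

Path-shaped trees keep Strahler number 2 however long they are, while each additional complete level adds one, which is why the measure is read as branching complexity rather than size.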
In applications of binary search tree data structures, it is rare for the keys to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow arbitrary insertions and deletions to preserve the property that the shape of the tree is random, as if the keys had been inserted randomly. [11]
If a given set of keys is assigned numeric priorities (unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, the binary search tree that would result from inserting the keys in priority order. By choosing the priorities to be independent random real numbers in the unit interval, and by maintaining the Cartesian tree structure using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree. [11]
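A minimal sketch of treap insertion follows. It is an illustration rather than the cited authors' implementation: it supports insertion only, assumes that smaller priorities sit closer to the root (matching "inserted in priority order"), and all names are hypothetical.

```python
import random


class Node:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority
        self.left = None
        self.right = None


def rotate_right(node):
    """Lift node.left above node, preserving the inorder sequence of keys."""
    pivot = node.left
    node.left, pivot.right = pivot.right, node
    return pivot


def rotate_left(node):
    """Lift node.right above node, preserving the inorder sequence of keys."""
    pivot = node.right
    node.right, pivot.left = pivot.left, node
    return pivot


def insert(node, key, rng):
    """Insert key as a leaf, then rotate it up while its priority is smaller."""
    if node is None:
        return Node(key, rng.random())  # independent uniform priority in [0, 1)
    if key < node.key:
        node.left = insert(node.left, key, rng)
        if node.left.priority < node.priority:
            node = rotate_right(node)
    elif key > node.key:
        node.right = insert(node.right, key, rng)
        if node.right.priority < node.priority:
            node = rotate_left(node)
    return node


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)


def is_heap(node):
    """Check that no child has a smaller priority than its parent."""
    for child in (node.left, node.right):
        if child is not None and (child.priority < node.priority or not is_heap(child)):
            return False
    return True


rng = random.Random(2024)
root = None
for key in range(1000):          # sorted input: the worst case for a plain BST
    root = insert(root, key, rng)
print(height(root))
```

Inserting the keys in sorted order would produce a path of 1000 nodes in an unbalanced binary search tree; here the random priorities keep the height close to what a random insertion order would give.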
Variants of the treap including the zip tree and zip-zip tree replace the tree rotations by "zipping" operations that split and merge trees, and that limit the number of random bits that need to be generated and stored alongside the keys. The result of these optimizations is still a tree with a random structure, but one that does not exactly match the random permutation model. [12]
The number of binary trees with $n$ nodes is a Catalan number. [13] For $n=1,2,3,\dots$ these numbers of trees are $1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, \dots$
Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. Trees generated from a model in this distribution are sometimes called random binary Catalan trees. [14] They have expected depth proportional to the square root of $n$, rather than to the logarithm. [15] More precisely, the expected depth of a randomly chosen node in an $n$-node tree of this type is $\sqrt{\pi n}-3+O(1/\sqrt{n})$.
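Both the Catalan counts and the average node depth can be computed exactly for small $n$ by splitting on the size of the left subtree. The sketch below is illustrative; the estimate $\sqrt{\pi n}-3$ used for comparison is the standard asymptotic form and is treated here as an assumption to be checked, not as a quotation from the source.

```python
import math
from fractions import Fraction


def tree_counts_and_path_lengths(max_n):
    """Return (count, total) lists indexed by the number of nodes n.

    count[n] is the number of binary trees with n nodes, and total[n] is the
    sum, over all those trees, of the depths of all their nodes (root depth 0).
    """
    count = [1] + [0] * max_n
    total = [0] * (max_n + 1)
    for n in range(1, max_n + 1):
        for k in range(n):              # k nodes in the left subtree
            m = n - 1 - k               # m nodes in the right subtree
            count[n] += count[k] * count[m]
            total[n] += total[k] * count[m] + total[m] * count[k]
        total[n] += (n - 1) * count[n]  # every non-root node sits one level deeper
    return count, total


count, total = tree_counts_and_path_lengths(200)
print(count[1:11])

n = 200
exact = Fraction(total[n], n * count[n])    # average depth of a random node
print(float(exact), math.sqrt(math.pi * n) - 3)
```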
The expected Strahler number of a uniformly random $n$-node binary tree is $\log_4 n+O(1)$, lower than the expected Strahler number of random binary search trees. [17]
Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees. However, it has other applications, including:
An algorithm of Jean-Luc Rémy generates a uniformly random binary tree of a specified size in time linear in the size, by the following process. Start with a tree consisting of a single external node. Then, while the current tree has not reached the target size, repeatedly choose one of its nodes (internal or external) uniformly at random. Replace the chosen node by a new internal node, having the chosen node as one of its children (equally likely left or right), and having a new external node as its other child. Stop when the target size is reached. [22]
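The process described above can be sketched as follows. This is an illustrative implementation, not Rémy's original code: it assumes that "size" means the number of internal nodes, keeps all nodes in a list so that a uniform choice takes constant time, and uses hypothetical names throughout.

```python
import random
from collections import Counter


class Node:
    def __init__(self):
        self.parent = None
        self.left = None     # both children None means an external node
        self.right = None


def remy(num_internal, rng=random):
    """Grow a uniformly random extended binary tree with num_internal internal nodes."""
    root = Node()
    nodes = [root]                        # every node, internal or external
    for _ in range(num_internal):
        chosen = rng.choice(nodes)
        internal, leaf = Node(), Node()
        # Put the new internal node where the chosen node used to hang.
        internal.parent = chosen.parent
        if chosen.parent is None:
            root = internal
        elif chosen.parent.left is chosen:
            chosen.parent.left = internal
        else:
            chosen.parent.right = internal
        # The chosen node becomes the left or right child with equal probability.
        if rng.random() < 0.5:
            internal.left, internal.right = chosen, leaf
        else:
            internal.left, internal.right = leaf, chosen
        chosen.parent = leaf.parent = internal
        nodes += [internal, leaf]
    return root


def shape(node):
    """Nested-tuple shape: None for external, (left, right) for internal."""
    if node.left is None:
        return None
    return (shape(node.left), shape(node.right))


rng = random.Random(1)
counts = Counter(shape(remy(3, rng)) for _ in range(5000))
for tree_shape, c in counts.items():
    print(c, tree_shape)
```

With three internal nodes there are five possible shapes, and the sample counts should all be close to 1000.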
The Galton–Watson process describes a family of distributions on trees in which the number of children at each node is chosen randomly, independently of other nodes. For binary trees, two versions of the Galton–Watson process are in use, differing only in whether an extended binary tree with only one node, an external root node, is allowed:
Trees generated in this way have been called binary Galton–Watson trees. In the special case where $p=\tfrac12$, with $p$ the probability that a node is chosen to be internal, they are called critical binary Galton–Watson trees. [23]
The probability $p=\tfrac12$ marks a phase transition for the binary Galton–Watson process: for $p\le\tfrac12$ the resulting tree is almost certainly finite, whereas for $p>\tfrac12$ it is infinite with positive probability. More precisely, for any $p$, the probability that the tree remains finite is $\min\left\{1,\tfrac{1-p}{p}\right\}$.
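A short derivation shows where a threshold at $p=\tfrac12$ comes from. It assumes the version of the model in which every node, including the root, is internal with probability $p$ and external with probability $1-p$, and writes $q$ for the probability that the tree is finite. The tree is finite when the root is external, or when the root is internal and both of its subtrees are finite:

```latex
q = (1-p) + p\,q^{2}
\;\Longleftrightarrow\;
p\,q^{2} - q + (1-p) = 0
\;\Longleftrightarrow\;
(q-1)\bigl(p\,q-(1-p)\bigr) = 0
\;\Longrightarrow\;
q \in \left\{1,\ \tfrac{1-p}{p}\right\}.
```

By the standard theory of branching processes, the probability of finiteness is the smallest nonnegative root, giving $q=\min\{1,(1-p)/p\}$. For example, $p=\tfrac23$ gives $q=\tfrac12$.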
Another way to generate the same trees is to make a sequence of coin flips, with probability $p$ of heads and probability $1-p$ of tails, until the first flip at which the number of tails exceeds the number of heads (for the model in which an external root is allowed) or exceeds one plus the number of heads (when the root must be internal), and then use this sequence of coin flips to determine the choices made by the recursive generation process, in depth-first order. [25]
Because the number of internal nodes equals the number of heads in this coin flip sequence, all trees with a given number of nodes are generated from (unique) coin flip sequences of the same length, and are equally likely, regardless of $p$. That is, the choice of $p$ affects the variation in the size of trees generated by this process, but for a given size the trees are generated uniformly at random. [26] For values of $p$ below the critical probability $\tfrac12$, smaller values of $p$ will produce trees with a smaller expected size, while larger values of $p$ will produce trees with a larger expected size. At the critical probability there is no finite bound on the expected size of trees generated by this process. More precisely, for any $p$, the expected number of nodes at depth $i$ in the tree is $(2p)^i$, and the expected size of the tree can be obtained by summing the expected numbers of nodes at each depth. For $p<\tfrac12$ this gives a geometric series $1+2p+4p^2+8p^3+\cdots=\frac{1}{1-2p}$ for the expected tree size, but for $p=\tfrac12$ this gives 1 + 1 + 1 + 1 + ⋯, a divergent series. [27]
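The expected size can be checked by simulation using the coin-flip description. The sketch below assumes the model in which an external root is allowed, so that flipping stops as soon as tails outnumber heads; each head is an internal node and each tail an external node. The cap `max_nodes` is a safeguard added here, not part of the process.

```python
import random


def galton_watson_size(p, rng, max_nodes=10**6):
    """Total number of nodes (internal plus external), or None if the cap is hit."""
    heads = tails = 0
    while tails <= heads:
        if heads + tails >= max_nodes:
            return None          # tree too large (possibly infinite when p > 1/2)
        if rng.random() < p:
            heads += 1           # an internal node
        else:
            tails += 1           # an external node
    return heads + tails


def expected_size(p):
    """Sum of the geometric series 1 + 2p + 4p^2 + ..., valid for p < 1/2."""
    return 1 / (1 - 2 * p)


rng = random.Random(7)
p = 0.25
sizes = [galton_watson_size(p, rng) for _ in range(20000)]
print(sum(sizes) / len(sizes), expected_size(p))
```

With $p=0.25$ the series sums to 2, and the sample mean should be close to that value.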
For $p=\tfrac12$, any particular tree with $n$ internal nodes is generated with probability $2^{-(2n+1)}$, and the probability that a random tree has this size is this probability multiplied by a Catalan number, $\frac{1}{2^{2n+1}}\cdot\frac{1}{n+1}\binom{2n}{n}$.
Galton–Watson processes were originally developed to study the spread and extinction of human surnames, and have been widely applied more generally to the dynamics of human or animal populations. These processes have been generalized to models where the probability of being an internal or external node at a given level of the tree (a generation, in the population dynamics application) is not fixed, but depends on the number of nodes at the previous level. [29] A version of this process, with the critical probability $p=\tfrac12$, has been studied as a model for speciation, where it is known as the critical branching process. In this process, each species has an exponentially distributed lifetime, and over the course of its lifetime produces child species at a rate equal to the lifetime. When a child is produced, the parent continues as the left branch of the evolutionary tree, and the child becomes the right branch. [30]
Another application of critical Galton–Watson trees (in the version where the root must be internal) arises in the Karger–Stein algorithm for finding minimum cuts in graphs, using a recursive edge contraction process. This algorithm calls itself twice recursively, with each call having probability at least $\tfrac12$ of preserving the correct solution value. The random tree models the subtree of correct recursive calls. The algorithm succeeds on a graph of $n$ vertices whenever this random tree of correct recursive calls has a branch of depth at least $2\log_2 n$, reaching the base case of its recursion. The success probability is $\Omega(1/\log n)$, producing one of the logarithmic factors in the algorithm's runtime. [31]
Devroye and Robson consider a related continuous-time random process in which each external node is eventually replaced by an internal node with two external children, at an exponentially distributed time after its first appearance as an external node. The number of external nodes in the tree, at any time, is modeled by a simple birth process or Yule process in which the members of a population give birth at a constant rate: giving birth to one child, in the Yule process, corresponds to being replaced by two children, in Devroye and Robson's model. If this process is stopped at any fixed time, the result is a binary tree of a random size (depending on the stopping time), distributed according to the random permutation model for that size. Devroye and Robson use this model as part of an algorithm to quickly generate trees in the random permutation model, described by their numbers of nodes at each depth rather than by their exact structure. [32] A discrete variant of this process starts with a tree consisting of a single external node, and repeatedly replaces a randomly-chosen external node by an internal node with two external children. Again, if this is stopped at a fixed time (with a fixed size), the resulting tree is distributed according to the random permutation model for that size. [1]
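The discrete variant is small enough to analyze exactly. The following sketch (an illustration with hypothetical names, not Devroye and Robson's algorithm) tracks the exact probability of every tree shape as external nodes are replaced one at a time, using nested tuples with `None` for an external node.

```python
from collections import defaultdict
from fractions import Fraction


def expansions(tree):
    """Yield each tree obtained by replacing one external node (None) of `tree`."""
    if tree is None:
        yield (None, None)
        return
    left, right = tree
    for new_left in expansions(left):
        yield (new_left, right)
    for new_right in expansions(right):
        yield (left, new_right)


def growth_distribution(num_internal):
    """Exact distribution over shapes after num_internal replacement steps."""
    dist = {None: Fraction(1)}
    for _ in range(num_internal):
        nxt = defaultdict(Fraction)
        for tree, prob in dist.items():
            options = list(expansions(tree))   # one option per external node
            for grown in options:
                nxt[grown] += prob / len(options)
        dist = dict(nxt)
    return dist


for tree, prob in growth_distribution(3).items():
    print(prob, tree)
```

After three steps the balanced shape has probability 1/3 and the four path shapes have probability 1/6 each, the same distribution that the random permutation model gives for three keys.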
Another form of binary tree, the binary trie or digital search tree, has a collection of binary numbers labeling some of its external nodes. The internal nodes of the tree represent prefixes of their binary representations that are shared by two or more of the numbers. The left and right children of an internal node are obtained by extending the corresponding prefix by one more bit, a zero or a one bit respectively. If this extension does not match any of the given numbers, or it matches only one of them, the result is an external node; otherwise it is another internal node. Random binary tries have been studied, for instance for sets of random real numbers generated independently in the unit interval. Despite the fact that these trees may have some empty external nodes, they tend to be better balanced than random binary search trees. For $n$ uniformly random real numbers in the unit interval, or more generally for any square-integrable probability distribution on the unit interval, the average depth of a node is asymptotically $\log_2 n$, and the average height of the whole tree is asymptotically $2\log_2 n$. The analysis of these trees can be applied to the computational complexity of trie-based sorting algorithms. [33]
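The trie definition above can be turned into a short depth computation. This sketch makes two simplifying assumptions that are not in the source: keys are distinct bit strings of equal length (64 random bits stand in for random real numbers), and only the depths of the external nodes that hold keys are reported.

```python
import math
import random


def trie_depths(keys, depth=0):
    """Depths of the external nodes holding each key.

    `keys` must be distinct bit strings of equal length. A prefix shared by two
    or more keys is an internal node; otherwise the node is external.
    """
    if len(keys) <= 1:
        return [depth] * len(keys)      # an empty external node holds no key
    zeros = [k for k in keys if k[depth] == "0"]
    ones = [k for k in keys if k[depth] == "1"]
    return trie_depths(zeros, depth + 1) + trie_depths(ones, depth + 1)


rng = random.Random(0)
n = 1024
keys = sorted({format(rng.getrandbits(64), "064b") for _ in range(n)})
depths = trie_depths(keys)
print(sum(depths) / len(depths), max(depths), math.log2(n))
```

For a sample of this size the average depth should sit slightly above $\log_2 n$ and the maximum depth near $2\log_2 n$, though both carry lower-order terms that are still visible at $n=1024$.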
A variant of the trie, the radix tree or compressed trie, eliminates empty external nodes and their parent internal nodes. The remaining internal nodes correspond to prefixes for which both possible extensions, by a zero or a one bit, are used by at least one of the randomly chosen numbers. For a radix tree for $n$ uniformly distributed binary numbers, the shortest leaf-root path has length $\log_2 n-\log_2\log n+o(\log\log n)$ and the longest leaf-root path has length $\log_2 n+\sqrt{2\log_2 n}+o(\sqrt{\log n})$, both with high probability. [34]
Luc Devroye and Paul Kruszewski describe a recursive process for constructing random binary trees with $n$ nodes. It generates a real-valued random variable $x$ in the unit interval $(0,1)$, assigns the first $xn$ nodes (rounded down to an integer number of nodes) to the left subtree, the next node to the root, and the remaining nodes to the right subtree. Then, it continues recursively using the same process in the left and right subtrees. If $x$ is chosen uniformly at random in the interval, the result is the same as the random binary search tree generated by a random permutation of the nodes, as any node is equally likely to be chosen as root. However, this formulation allows other distributions to be used instead. For instance, in the uniformly random binary tree model, once a root is fixed each of its two subtrees must also be uniformly random, so the uniformly random model may also be generated by a different choice of distribution (depending on $n$) for $x$. As they show, by choosing a beta distribution on $x$ and by using an appropriate choice of shape to draw each of the branches, the mathematical trees generated by this process can be used to create realistic-looking botanical trees. [35]
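The splitting process can be sketched in a few lines. This is an illustration under stated assumptions rather than Devroye and Kruszewski's code: a tree is a nested tuple `(left, right)` with `None` for an empty subtree, `draw` is any zero-argument function returning a number in the unit interval, and the beta parameters in the usage lines are arbitrary placeholders.

```python
import random


def split_tree(n, draw=random.random):
    """Random binary tree with n nodes built by repeated splitting.

    Returns None for n == 0, otherwise a pair (left subtree, right subtree).
    """
    if n == 0:
        return None
    # floor(x * n) nodes go left; the guard keeps one node for the root
    # even if draw() returns exactly 1.0.
    left_size = min(int(draw() * n), n - 1)
    return (split_tree(left_size, draw), split_tree(n - 1 - left_size, draw))


def size(tree):
    """Number of nodes in a tree produced by split_tree."""
    return 0 if tree is None else 1 + size(tree[0]) + size(tree[1])


rng = random.Random(5)
uniform_tree = split_tree(50, rng.random)                       # random permutation model
beta_tree = split_tree(50, lambda: rng.betavariate(2.0, 2.0))   # hypothetical beta split
print(size(uniform_tree), size(beta_tree))
```

With a uniform draw and three nodes, the left subtree gets 0, 1, or 2 nodes with equal probability, so the balanced shape appears with probability 1/3, as in the random permutation model.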