Cartesian tree

Last updated

A sequence of numbers and the Cartesian tree derived from them. Cartesian tree.svg
A sequence of numbers and the Cartesian tree derived from them.

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

Contents

Cartesian trees were introduced by Vuillemin (1980) in the context of geometric range searching data structures. They have also been used in the definition of the treap and randomized binary search tree data structures for binary search problems, in comparison sort algorithms that perform efficiently on nearly-sorted inputs, and as the basis for pattern matching algorithms. A Cartesian tree for a sequence can be constructed in linear time.

Definition

Cartesian trees are defined using binary trees, which are a form of rooted tree. To construct the Cartesian tree for a given sequence of distinct numbers, set its root to be the minimum number in the sequence, [1] and recursively construct its left and right subtrees from the subsequences before and after this number, respectively. As a base case, when one of these subsequences is empty, there is no left or right child. [2]

It is also possible to characterize the Cartesian tree directly rather than recursively, using its ordering properties. In any tree, the subtree rooted at any node consists of all other nodes that can reach it by repeatedly following parent pointers. The Cartesian tree for a sequence of distinct numbers is defined by the following properties:

  1. The Cartesian tree for a sequence is a binary tree with one node for each number in the sequence.
  2. A symmetric (in-order) traversal of the tree results in the original sequence. Equivalently, for each node, the numbers in its left subtree are earlier than it in the sequence, and the numbers in the right subtree are later.
  3. The tree has the min-heap property: the parent of any non-root node has a smaller value than the node itself. [1]

These two definitions are equivalent: the tree defined recursively as described above is the unique tree that has the properties listed above. If a sequence of numbers contains repetitions, a Cartesian tree can be determined for it by following a consistent tie-breaking rule before applying the above construction. For instance, the first of two equal elements can be treated as the smaller of the two. [2]

History

Cartesian trees were introduced and named by Vuillemin (1980), who used them as an example of the interaction between geometric combinatorics and the design and analysis of data structures. In particular, Vuillemin used these structures to analyze the average-case complexity of concatenation and splitting operations on binary search trees. The name is derived from the Cartesian coordinate system for the plane: in one version of this structure, as in the two-dimensional range searching application discussed below, a Cartesian tree for a point set has the sorted order of the points by their -coordinates as its symmetric traversal order, and it has the heap property according to the -coordinates of the points. Vuillemin described both this geometric version of the structure, and the definition here in which a Cartesian tree is defined from a sequence. Using sequences instead of point coordinates provides a more general setting that allows the Cartesian tree to be applied to non-geometric problems as well. [2]

Efficient construction

A Cartesian tree can be constructed in linear time from its input sequence. One method is to process the sequence values in left-to-right order, maintaining the Cartesian tree of the nodes processed so far, in a structure that allows both upwards and downwards traversal of the tree. To process each new value , start at the node representing the value prior to in the sequence and follow the path from this node to the root of the tree until finding a value smaller than . The node becomes the right child of , and the previous right child of becomes the new left child of . The total time for this procedure is linear, because the time spent searching for the parent of each new node can be charged against the number of nodes that are removed from the rightmost path in the tree. [3]

An alternative linear-time construction algorithm is based on the all nearest smaller values problem. In the input sequence, define the left neighbor of a value to be the value that occurs prior to , is smaller than , and is closer in position to than any other smaller value. The right neighbor is defined symmetrically. The sequence of left neighbors can be found by an algorithm that maintains a stack containing a subsequence of the input. For each new sequence value , the stack is popped until it is empty or its top element is smaller than , and then is pushed onto the stack. The left neighbor of is the top element at the time is pushed. The right neighbors can be found by applying the same stack algorithm to the reverse of the sequence. The parent of in the Cartesian tree is either the left neighbor of or the right neighbor of , whichever exists and has a larger value. The left and right neighbors can also be constructed efficiently by parallel algorithms, making this formulation useful in efficient parallel algorithms for Cartesian tree construction. [4]

Another linear-time algorithm for Cartesian tree construction is based on divide-and-conquer. The algorithm recursively constructs the tree on each half of the input, and then merges the two trees. The merger process involves only the nodes on the left and right spines of these trees: the left spine is the path obtained by following left child edges from the root until reaching a node with no left child, and the right spine is defined symmetrically. As with any path in a min-heap, both spines have their values in sorted order, from the smallest value at their root to their largest value at the end of the path. To merge the two trees, apply a merge algorithm to the right spine of the left tree and the left spine of the right tree, replacing these two paths in two trees by a single path that contains the same nodes. In the merged path, the successor in the sorted order of each node from the left tree is placed in its right child, and the successor of each node from the right tree is placed in its left child, the same position that was previously used for its successor in the spine. The left children of nodes from the left tree and right children of nodes from the right tree remain unchanged. The algorithm is parallelizable since on each level of recursion, each of the two sub-problems can be computed in parallel, and the merging operation can be efficiently parallelized as well. [5]

It is possible to maintain the Cartesian tree of a dynamic input, subject to insertions of elements and lazy deletion of elements, in logarithmic amortized time per operation. Here, lazy deletion means that a deletion operation is performed by marking an element in the tree as being a deleted element, but not actually removing it from the tree. When the number of marked elements reaches a constant fraction of the size of the whole tree, it is rebuilt, keeping only its unmarked elements. [6]

Applications

Range searching and lowest common ancestors

Two-dimensional range-searching using a Cartesian tree: the bottom point (red in the figure) within a three-sided region with two vertical sides and one horizontal side (if the region is nonempty) can be found as the nearest common ancestor of the leftmost and rightmost points (the blue points in the figure) within the slab defined by the vertical region boundaries. The remaining points in the three-sided region can be found by splitting it by a vertical line through the bottom point and recursing. Cartesian tree range searching.svg
Two-dimensional range-searching using a Cartesian tree: the bottom point (red in the figure) within a three-sided region with two vertical sides and one horizontal side (if the region is nonempty) can be found as the nearest common ancestor of the leftmost and rightmost points (the blue points in the figure) within the slab defined by the vertical region boundaries. The remaining points in the three-sided region can be found by splitting it by a vertical line through the bottom point and recursing.

Cartesian trees form part of an efficient data structure for range minimum queries. An input to this kind of query specifies a contiguous subsequence of the original sequence; the query output should be the minimum value in this subsequence. [7] In a Cartesian tree, this minimum value can be found at the lowest common ancestor of the leftmost and rightmost values in the subsequence. For instance, in the subsequence (12,10,20,15,18) of the example sequence, the minimum value of the subsequence (10) forms the lowest common ancestor of the leftmost and rightmost values (12 and 18). Because lowest common ancestors can be found in constant time per query, using a data structure that takes linear space to store and can be constructed in linear time, the same bounds hold for the range minimization problem. [8]

Bender & Farach-Colton (2000) reversed this relationship between the two data structure problems by showing that data structures for range minimization could also be used for finding lowest common ancestors. Their data structure associates with each node of the tree its distance from the root, and constructs a sequence of these distances in the order of an Euler tour of the (edge-doubled) tree. It then constructs a range minimization data structure for the resulting sequence. The lowest common ancestor of any two vertices in the given tree can be found as the minimum distance appearing in the interval between the initial positions of these two vertices in the sequence. Bender and Farach-Colton also provide a method for range minimization that can be used for the sequences resulting from this transformation, which have the special property that adjacent sequence values differ by one. As they describe, for range minimization in sequences that do not have this form, it is possible to use Cartesian trees to reduce the range minimization problem to lowest common ancestors, and then to use Euler tours to reduce lowest common ancestors to a range minimization problem with this special form. [9]

The same range minimization problem may also be given an alternative interpretation in terms of two dimensional range searching. A collection of finitely many points in the Cartesian plane can be used to form a Cartesian tree, by sorting the points by their -coordinates and using the -coordinates in this order as the sequence of values from which this tree is formed. If is the subset of the input points within some vertical slab defined by the inequalities , is the leftmost point in (the one with minimum -coordinate), and is the rightmost point in (the one with maximum -coordinate) then the lowest common ancestor of and in the Cartesian tree is the bottommost point in the slab. A three-sided range query, in which the task is to list all points within a region bounded by the three inequalities and , can be answered by finding this bottommost point , comparing its -coordinate to , and (if the point lies within the three-sided region) continuing recursively in the two slabs bounded between and and between and . In this way, after the leftmost and rightmost points in the slab are identified, all points within the three-sided region can be listed in constant time per point. [3]

The same construction, of lowest common ancestors in a Cartesian tree, makes it possible to construct a data structure with linear space that allows the distances between pairs of points in any ultrametric space to be queried in constant time per query. The distance within an ultrametric is the same as the minimax path weight in the minimum spanning tree of the metric. [10] From the minimum spanning tree, one can construct a Cartesian tree, the root node of which represents the heaviest edge of the minimum spanning tree. Removing this edge partitions the minimum spanning tree into two subtrees, and Cartesian trees recursively constructed for these two subtrees form the children of the root node of the Cartesian tree. The leaves of the Cartesian tree represent points of the metric space, and the lowest common ancestor of two leaves in the Cartesian tree is the heaviest edge between those two points in the minimum spanning tree, which has weight equal to the distance between the two points. Once the minimum spanning tree has been found and its edge weights sorted, the Cartesian tree can be constructed in linear time. [11]

As a binary search tree

The Cartesian tree of a sorted sequence is just a path graph, rooted at its leftmost endpoint. Binary searching in this tree degenerates to sequential search in the path. However, a different construction uses Cartesian trees to generate binary search trees of logarithmic depth from sorted sequences of values. This can be done by generating priority numbers for each value, and using the sequence of priorities to generate a Cartesian tree. This construction may equivalently be viewed in the geometric framework described above, in which the -coordinates of a set of points are the values in a sorted sequence and the -coordinates are their priorities. [12]

This idea was applied by Seidel & Aragon (1996), who suggested the use of random numbers as priorities. The self-balancing binary search tree resulting from this random choice is called a treap, due to its combination of binary search tree and min-heap features. An insertion into a treap can be performed by inserting the new key as a leaf of an existing tree, choosing a priority for it, and then performing tree rotation operations along a path from the node to the root of the tree to repair any violations of the heap property caused by this insertion; a deletion can similarly be performed by a constant amount of change to the tree followed by a sequence of rotations along a single path in the tree. [12] A variation on this data structure called a zip tree uses the same idea of random priorities, but simplifies the random generation of the priorities, and performs insertions and deletions in a different way, by splitting the sequence and its associated Cartesian tree into two subsequences and two trees and then recombining them. [13]

If the priorities of each key are chosen randomly and independently once whenever the key is inserted into the tree, the resulting Cartesian tree will have the same properties as a random binary search tree, a tree computed by inserting the keys in a randomly chosen permutation starting from an empty tree, with each insertion leaving the previous tree structure unchanged and inserting the new node as a leaf of the tree. Random binary search trees have been studied for much longer than treaps, and are known to behave well as search trees. The expected length of the search path to any given value is at most , and the whole tree has logarithmic depth (its maximum root-to-leaf distance) with high probability. More formally, there exists a constant such that the depth is with probability tending to one as the number of nodes tends to infinity. The same good behavior carries over to treaps. It is also possible, as suggested by Aragon and Seidel, to reprioritize frequently-accessed nodes, causing them to move towards the root of the treap and speeding up future accesses for the same keys. [12]

In sorting

Pairs of consecutive sequence values (shown as the thick red edges) that bracket a sequence value (the darker blue point). The cost of including this value in the sorted order produced by the Levcopoulos-Petersson algorithm is proportional to the logarithm of this number of bracketing pairs. Bracketing pairs.svg
Pairs of consecutive sequence values (shown as the thick red edges) that bracket a sequence value (the darker blue point). The cost of including this value in the sorted order produced by the Levcopoulos–Petersson algorithm is proportional to the logarithm of this number of bracketing pairs.

Levcopoulos & Petersson (1989) describe a sorting algorithm based on Cartesian trees. They describe the algorithm as based on a tree with the maximum at the root, [14] but it can be modified straightforwardly to support a Cartesian tree with the convention that the minimum value is at the root. For consistency, it is this modified version of the algorithm that is described below.

The Levcopoulos–Petersson algorithm can be viewed as a version of selection sort or heap sort that maintains a priority queue of candidate minima, and that at each step finds and removes the minimum value in this queue, moving this value to the end of an output sequence. In their algorithm, the priority queue consists only of elements whose parent in the Cartesian tree has already been found and removed. Thus, the algorithm consists of the following steps: [14]

  1. Construct a Cartesian tree for the input sequence
  2. Initialize a priority queue, initially containing only the tree root
  3. While the priority queue is non-empty:
    • Find and remove the minimum value in the priority queue
    • Add this value to the output sequence
    • Add the Cartesian tree children of the removed value to the priority queue

As Levcopoulos and Petersson show, for input sequences that are already nearly sorted, the size of the priority queue will remain small, allowing this method to take advantage of the nearly-sorted input and run more quickly. Specifically, the worst-case running time of this algorithm is , where is the sequence length and is the average, over all values in the sequence, of the number of consecutive pairs of sequence values that bracket the given value (meaning that the given value is between the two sequence values). They also prove a lower bound stating that, for any and (non-constant) , any comparison-based sorting algorithm must use comparisons for some inputs. [14]

In pattern matching

The problem of Cartesian tree matching has been defined as a generalized form of string matching in which one seeks a substring (or in some cases, a subsequence) of a given string that has a Cartesian tree of the same form as a given pattern. Fast algorithms for variations of the problem with a single pattern or multiple patterns have been developed, as well as data structures analogous to the suffix tree and other text indexing structures. [15]

Notes

  1. 1 2 In some references, the ordering of the numbers is reversed, so the root node instead holds the maximum value and the tree has the max-heap property rather than the min-heap property.
  2. 1 2 3 Vuillemin (1980).
  3. 1 2 Gabow, Bentley & Tarjan (1984).
  4. Berkman, Schieber & Vishkin (1993).
  5. Shun & Blelloch (2014).
  6. Bialynicka-Birula & Grossi (2006).
  7. Gabow, Bentley & Tarjan (1984); Bender & Farach-Colton (2000).
  8. Harel & Tarjan (1984); Schieber & Vishkin (1988).
  9. Bender & Farach-Colton (2000).
  10. Hu (1961); Leclerc (1981)
  11. Demaine, Landau & Weimann (2009).
  12. 1 2 3 Seidel & Aragon (1996).
  13. Tarjan, Levy & Timmel (2021).
  14. 1 2 3 Levcopoulos & Petersson (1989).
  15. Park et al. (2019); Park et al. (2020); Song et al. (2021); Kim & Cho (2021); Nishimoto et al. (2021); Oizumi et al. (2022)

Related Research Articles

<span class="mw-page-title-main">AVL tree</span> Self-balancing binary search tree

In computer science, an AVL tree is a self-balancing binary search tree. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

<span class="mw-page-title-main">Binary search tree</span> Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree. The time complexity of operations on the binary search tree is linear with respect to the height of the tree.

<span class="mw-page-title-main">Heap (data structure)</span> Computer science data structure

In computer science, a heap is a tree-based data structure that satisfies the heap property: In a max heap, for any given node C, if P is a parent node of C, then the key of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap is called the root node.

In computer science, a red–black tree is a specialised binary search tree data structure noted for fast storage and retrieval of ordered information, and a guarantee that operations will complete within a known time. Compared to other self-balancing binary search trees, the nodes in a red-black tree hold an extra bit called "color" representing "red" and "black" which is used when re-organising the tree to ensure that it is always approximately balanced.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

<span class="mw-page-title-main">Binary heap</span> Variant of heap data structure

A binary heap is a heap data structure that takes the form of a binary tree. Binary heaps are a common way of implementing priority queues. The binary heap was introduced by J. W. J. Williams in 1964, as a data structure for heapsort.

<span class="mw-page-title-main">Smoothsort</span> Comparison-based sorting algorithm

In computer science, smoothsort is a comparison-based sorting algorithm. A variant of heapsort, it was invented and published by Edsger Dijkstra in 1981. Like heapsort, smoothsort is an in-place algorithm with an upper bound of O(n log n) operations (see big O notation), but it is not a stable sort. The advantage of smoothsort is that it comes closer to O(n) time if the input is already sorted to some degree, whereas heapsort averages O(n log n) regardless of the initial sorted state.

<span class="mw-page-title-main">Treap</span>

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

In computer science, a binomial heap is a data structure that acts as a priority queue but also allows pairs of heaps to be merged. It is important as an implementation of the mergeable heap abstract data type, which is a priority queue supporting merge operation. It is implemented as a heap similar to a binary heap but using a special tree structure that is different from the complete binary trees used by binary heaps. Binomial heaps were invented in 1978 by Jean Vuillemin.

In computer science, a Fibonacci heap is a data structure for priority queue operations, consisting of a collection of heap-ordered trees. It has a better amortized running time than many other priority queue data structures including the binary heap and binomial heap. Michael L. Fredman and Robert E. Tarjan developed Fibonacci heaps in 1984 and published them in a scientific journal in 1987. Fibonacci heaps are named after the Fibonacci numbers, which are used in their running time analysis.

<span class="mw-page-title-main">Self-balancing binary search tree</span> Any node-based binary search tree that automatically keeps its height the same

In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions. These operations when designed for a self-balancing binary search tree, contain precautionary measures against boundlessly increasing tree height, so that these abstract data structures receive the attribute "self-balancing".

In computer science, a leftist tree or leftist heap is a priority queue implemented with a variant of a binary heap. Every node x has an s-value which is the distance to the nearest leaf in subtree rooted at x. In contrast to a binary heap, a leftist tree attempts to be very unbalanced. In addition to the heap property, leftist trees are maintained so the right descendant of each node has the lower s-value.

In computer science, the all nearest smaller values problem is the following task: for each position in a sequence of numbers, search among the previous positions for the last position that contains a smaller value. This problem can be solved efficiently both by parallel and non-parallel algorithms: Berkman, Schieber & Vishkin (1993), who first identified the procedure as a useful subroutine for other parallel programs, developed efficient algorithms to solve it in the Parallel Random Access Machine model; it may also be solved in linear time on a non-parallel computer using a stack-based algorithm. Later researchers have studied algorithms to solve it in other models of parallel computation.

<span class="mw-page-title-main">Random binary tree</span> Binary tree selected at random

In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Different distributions have been used, leading to different properties for these trees.

In computer science, one approach to the dynamic optimality problem on online algorithms for binary search trees involves reformulating the problem geometrically, in terms of augmenting a set of points in the plane with as few additional points as possible to avoid rectangles with only two points on their boundary.

In mathematics and computer science, a stack-sortable permutation is a permutation whose elements may be sorted by an algorithm whose internal storage is limited to a single stack data structure. The stack-sortable permutations are exactly the permutations that do not contain the permutation pattern 231; they are counted by the Catalan numbers, and may be placed in bijection with many other combinatorial objects with the same counting function including Dyck paths and binary trees.

In computer science, an optimal binary search tree (Optimal BST), sometimes called a weight-balanced binary tree, is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.

In computer science, a weak heap is a data structure for priority queues, combining features of the binary heap and binomial heap. It can be stored in an array as an implicit binary tree like a binary heap, and has the efficiency guarantees of binomial heaps.

In computer science, join-based tree algorithms are a class of algorithms for self-balancing binary search trees. This framework aims at designing highly-parallelized algorithms for various balanced binary search trees. The algorithmic framework is based on a single operation join. Under this framework, the join operation captures all balancing criteria of different balancing schemes, and all other functions join have generic implementation across different balancing schemes. The join-based algorithms can be applied to at least four balancing schemes: AVL trees, red–black trees, weight-balanced trees and treaps.

References