Interval tree

Last updated

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, [1] for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

Contents

The trivial solution is to visit each interval and test whether it intersects the given point or interval, which requires time, where is the number of intervals in the collection. Since a query may return all intervals, for example if the query is a large interval intersecting all intervals in the collection, this is asymptotically optimal; however, we can do better by considering output-sensitive algorithms, where the runtime is expressed in terms of , the number of intervals produced by the query. Interval trees have a query time of and an initial creation time of , while limiting memory consumption to . After creation, interval trees may be dynamic, allowing efficient insertion and deletion of an interval in time. If the endpoints of intervals are within a small integer range (e.g., in the range ), faster and in fact optimal data structures exist [2] [3] with preprocessing time and query time for reporting intervals containing a given query point (see [2] for a very simple one).

Naive approach

In a simple case, the intervals do not overlap and they can be inserted into a simple binary search tree and queried in time. However, with arbitrarily overlapping intervals, there is no way to compare two intervals for insertion into the tree since orderings sorted by the beginning points or the ending points may be different. A naive approach might be to build two parallel trees, one ordered by the beginning point, and one ordered by the ending point of each interval. This allows discarding half of each tree in time, but the results must be merged, requiring time. This gives us queries in , which is no better than brute-force.

Interval trees solve this problem. This article describes two alternative designs for an interval tree, dubbed the centered interval tree and the augmented tree.

Centered interval tree

Queries require time, with being the total number of intervals and being the number of reported results. Construction requires time, and storage requires space.

Construction

Given a set of intervals on the number line, we want to construct a data structure so that we can efficiently retrieve all intervals overlapping another interval or point.

We start by taking the entire range of all the intervals and dividing it in half at (in practice, should be picked to keep the tree relatively balanced). This gives three sets of intervals, those completely to the left of which we'll call , those completely to the right of which we'll call , and those overlapping which we'll call .

The intervals in and are recursively divided in the same manner until there are no intervals left.

The intervals in that overlap the center point are stored in a separate data structure linked to the node in the interval tree. This data structure consists of two lists, one containing all the intervals sorted by their beginning points, and another containing all the intervals sorted by their ending points.

The result is a binary tree with each node storing:

Intersecting

Given the data structure constructed above, we receive queries consisting of ranges or points, and return all the ranges in the original set overlapping this input.

With a point

The task is to find all intervals in the tree that overlap a given point . The tree is walked with a similar recursive algorithm as would be used to traverse a traditional binary tree, but with extra logic to support searching the intervals overlapping the "center" point at each node.

For each tree node, is compared to , the midpoint used in node construction above. If is less than , the leftmost set of intervals, , is considered. If is greater than , the rightmost set of intervals, , is considered.

All intervals in
S
center
{\displaystyle S_{\textrm {center}}}
that begin before
x
{\displaystyle x}
must overlap
x
{\displaystyle x}
if
x
{\displaystyle x}
is less than
x
center
{\displaystyle x_{\textrm {center}}}
.
Similarly, the same technique also applies in checking a given interval. If a given interval ends at y and y is less than
x
center
{\displaystyle x_{\textrm {center}}}
, all intervals in
S
center
{\displaystyle S_{\textrm {center}}}
that begin before y must also overlap the given interval. Check if a point overlaps the intervals in S center of a centered interval tree.svg
All intervals in that begin before must overlap if is less than .
Similarly, the same technique also applies in checking a given interval. If a given interval ends at y and y is less than , all intervals in that begin before y must also overlap the given interval.

As each node is processed as we traverse the tree from the root to a leaf, the ranges in its are processed. If is less than , we know that all intervals in end after , or they could not also overlap . Therefore, we need only find those intervals in that begin before . We can consult the lists of that have already been constructed. Since we only care about the interval beginnings in this scenario, we can consult the list sorted by beginnings. Suppose we find the closest number no greater than in this list. All ranges from the beginning of the list to that found point overlap because they begin before and end after (as we know because they overlap which is larger than ). Thus, we can simply start enumerating intervals in the list until the startpoint value exceeds .

Likewise, if is greater than , we know that all intervals in must begin before , so we find those intervals that end after using the list sorted by interval endings.

If exactly matches , all intervals in can be added to the results without further processing and tree traversal can be stopped.

With an interval

For a result interval to intersect our query interval one of the following must hold:

  • the start and/or end point of is in ; or
  • completely encloses .

We first find all intervals with start and/or end points inside using a separately-constructed tree. In the one-dimensional case, we can use a search tree containing all the start and end points in the interval set, each with a pointer to its corresponding interval. A binary search in time for the start and end of reveals the minimum and maximum points to consider. Each point within this range references an interval that overlaps and is added to the result list. Care must be taken to avoid duplicates, since an interval might both begin and end within . This can be done using a binary flag on each interval to mark whether or not it has been added to the result set.

Finally, we must find intervals that enclose . To find these, we pick any point inside and use the algorithm above to find all intervals intersecting that point (again, being careful to remove duplicates).

Higher dimensions

The interval tree data structure can be generalized to a higher dimension with identical query and construction time and space.

First, a range tree in dimensions is constructed that allows efficient retrieval of all intervals with beginning and end points inside the query region . Once the corresponding ranges are found, the only thing that is left are those ranges that enclose the region in some dimension. To find these overlaps, interval trees are created, and one axis intersecting is queried for each. For example, in two dimensions, the bottom of the square (or any other horizontal line intersecting ) would be queried against the interval tree constructed for the horizontal axis. Likewise, the left (or any other vertical line intersecting ) would be queried against the interval tree constructed on the vertical axis.

Each interval tree also needs an addition for higher dimensions. At each node we traverse in the tree, is compared with to find overlaps. Instead of two sorted lists of points as was used in the one-dimensional case, a range tree is constructed. This allows efficient retrieval of all points in that overlap region .

Deletion

If after deleting an interval from the tree, the node containing that interval contains no more intervals, that node may be deleted from the tree. This is more complex than a normal binary tree deletion operation.

An interval may overlap the center point of several nodes in the tree. Since each node stores the intervals that overlap it, with all intervals completely to the left of its center point in the left subtree, similarly for the right subtree, it follows that each interval is stored in the node closest to the root from the set of nodes whose center point it overlaps.

Normal deletion operations in a binary tree (for the case where the node being deleted has two children) involve promoting a node further from the leaf to the position of the node being deleted (usually the leftmost child of the right subtree, or the rightmost child of the left subtree).

Deleting a node with two children from a binary search tree using the in-order predecessor (rightmost node in the left subtree, labelled 6). Binary search tree delete.svg
Deleting a node with two children from a binary search tree using the in-order predecessor (rightmost node in the left subtree, labelled 6).

As a result of this promotion, some nodes that were above the promoted node will become its descendants; it is necessary to search these nodes for intervals that also overlap the promoted node, and move those intervals into the promoted node. As a consequence, this may result in new empty nodes, which must be deleted, following the same algorithm again.

Balancing

The same issues that affect deletion also affect rotation operations; rotation must preserve the invariant that nodes are stored as close to the root as possible.

Augmented tree

An augmented tree with low value as the key and maximum high as the extra annotation.
For example, when testing if the given interval [40 ,60) overlaps the intervals in the tree shown above, we see that it does not overlap the interval [20, 36) in the root, but since the root's low value (20) is less than the sought high value (60), we must search the right subtree. The left subtree's maximum high of 41 exceeds the sought low value (40), so we must search the left subtree as well. However, both descendants of the [3, 41) node have maximum highs less than 40, so the left subtree search ends there and it is not necessary to search them. Example of augmented tree with low value as the key and maximum high as extra annotation.png
An augmented tree with low value as the key and maximum high as the extra annotation.
For example, when testing if the given interval [40 ,60) overlaps the intervals in the tree shown above, we see that it does not overlap the interval [20, 36) in the root, but since the root's low value (20) is less than the sought high value (60), we must search the right subtree. The left subtree's maximum high of 41 exceeds the sought low value (40), so we must search the left subtree as well. However, both descendants of the [3, 41) node have maximum highs less than 40, so the left subtree search ends there and it is not necessary to search them.

Another way to represent intervals is described in Cormen et al. (2009 , Section 14.3: Interval trees, pp. 348354).

Both insertion and deletion require time, with being the total number of intervals in the tree prior to the insertion or deletion operation.

An augmented tree can be built from a simple ordered tree, for example a binary search tree or self-balancing binary search tree, ordered by the 'low' values of the intervals. An extra annotation is then added to every node, recording the maximum upper value among all the intervals from this node down. Maintaining this attribute involves updating all ancestors of the node from the bottom up whenever a node is added or deleted. This takes only O(h) steps per node addition or removal, where h is the height of the node added or removed in the tree. If there are any tree rotations during insertion and deletion, the affected nodes may need updating as well.

Now, it is known that two intervals and overlap only when both and . When searching the trees for nodes overlapping with a given interval, you can immediately skip:

Membership queries

Some performance may be gained if the tree avoids unnecessary traversals. These can occur when adding intervals that already exist or removing intervals that don't exist.

A total order can be defined on the intervals by ordering them first by their lower bounds and then by their upper bounds. Then, a membership check can be performed in time, versus the time required to find duplicates if intervals overlap the interval to be inserted or removed. This solution has the advantage of not requiring any additional structures. The change is strictly algorithmic. The disadvantage is that membership queries take time.

Alternately, at the rate of memory, membership queries in expected constant time can be implemented with a hash table, updated in lockstep with the interval tree. This may not necessarily double the total memory requirement, if the intervals are stored by reference rather than by value.

Java example: Adding a new interval to the tree

The key of each node is the interval itself, hence nodes are ordered first by low value and finally by high value, and the value of each node is the end point of the interval:

publicvoidadd(Intervali){put(i,i.getEnd());}

Java example: Searching a point or an interval in the tree

To search for an interval, one walks the tree, using the key (n.getKey()) and high value (n.getValue()) to omit any branches that cannot overlap the query. The simplest case is a point query:

// Search for all intervals containing "p", starting with the// node "n" and adding matching intervals to the list "result"publicvoidsearch(IntervalNoden,Pointp,List<Interval>result){// Don't search nodes that don't existif(n==null)return;// If p is to the right of the rightmost point of any interval// in this node and all children, there won't be any matches.if(p.compareTo(n.getValue())>0)return;// Search left childrensearch(n.getLeft(),p,result);// Check this nodeif(n.getKey().contains(p))result.add(n.getKey());// If p is to the left of the start of this interval,// then it can't be in any child to the right.if(p.compareTo(n.getKey().getStart())<0)return;// Otherwise, search right childrensearch(n.getRight(),p,result);}

where

a.compareTo(b) returns a negative value if a < b
a.compareTo(b) returns zero if a = b
a.compareTo(b) returns a positive value if a > b

The code to search for an interval is similar, except for the check in the middle:

// Check this nodeif(n.getKey().overlapsWith(i))result.add(n.getKey());

overlapsWith() is defined as:

publicbooleanoverlapsWith(Intervalother){returnstart.compareTo(other.getEnd())<=0&&end.compareTo(other.getStart())>=0;}

Higher dimensions

Augmented trees can be extended to higher dimensions by cycling through the dimensions at each level of the tree. For example, for two dimensions, the odd levels of the tree might contain ranges for the x-coordinate, while the even levels contain ranges for the y-coordinate. This approach effectively converts the data structure from an augmented binary tree to an augmented kd-tree, thus significantly complicating the balancing algorithms for insertions and deletions.

A simpler solution is to use nested interval trees. First, create a tree using the ranges for the y-coordinate. Now, for each node in the tree, add another interval tree on the x-ranges, for all elements whose y-range is the same as that node's y-range.

The advantage of this solution is that it can be extended to an arbitrary number of dimensions using the same code base.

At first, the additional cost of the nested trees might seem prohibitive, but this is usually not so. As with the non-nested solution earlier, one node is needed per x-coordinate, yielding the same number of nodes for both solutions. The only additional overhead is that of the nested tree structures, one per vertical interval. This structure is usually of negligible size, consisting only of a pointer to the root node, and possibly the number of nodes and the depth of the tree.

Medial- or length-oriented tree

A medial- or length-oriented tree is similar to an augmented tree, but symmetrical, with the binary search tree ordered by the medial points of the intervals. There is a maximum-oriented binary heap in every node, ordered by the length of the interval (or half of the length). Also we store the minimum and maximum possible value of the subtree in each node (thus the symmetry).

Overlap test

Using only start and end values of two intervals , for , the overlap test can be performed as follows:

and

This can be simplified using the sum and difference:

Which reduces the overlap test to:

Adding interval

Adding new intervals to the tree is the same as for a binary search tree using the medial value as the key. We push onto the binary heap associated with the node, and update the minimum and maximum possible values associated with all higher nodes.

Searching for all overlapping intervals

Let's use for the query interval, and for the key of a node (compared to of intervals)

Starting with root node, in each node, first we check if it is possible that our query interval overlaps with the node subtree using minimum and maximum values of node (if it is not, we don't continue for this node).

Then we calculate for intervals inside this node (not its children) to overlap with query interval (knowing ):

and perform a query on its binary heap for the 's bigger than

Then we pass through both left and right children of the node, doing the same thing.

In the worst-case, we have to scan all nodes of the binary search tree, but since binary heap query is optimum, this is acceptable (a 2- dimensional problem can not be optimum in both dimensions)

This algorithm is expected to be faster than a traditional interval tree (augmented tree) for search operations. Adding elements is a little slower in practice, though the order of growth is the same.

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

<span class="mw-page-title-main">Treap</span>

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans' 1986 article.

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

<i>k</i>-d tree Multidimensional search tree for points in k dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. K-dimensional is that which concerns exactly k orthogonal axes or a space of any number of dimensions. k-d trees are a useful data structure for several applications, such as:

<i>m</i>-ary tree Tree data structure in which each node has at most m children.

In graph theory, an m-ary tree is an arborescence in which each node has no more than m children. A binary tree is the special case where m = 2, and a ternary tree is another case with m = 3 that limits its children to three.

In graph theory and computer science, the lowest common ancestor (LCA) of two nodes v and w in a tree or directed acyclic graph (DAG) T is the lowest node that has both v and w as descendants, where we define each node to be a descendant of itself.

In computer science, fractional cascading is a technique to speed up a sequence of binary searches for the same value in a sequence of related data structures. The first binary search in the sequence takes a logarithmic amount of time, as is standard for binary searches, but successive searches in the sequence are faster. The original version of fractional cascading, introduced in two papers by Chazelle and Guibas in 1986, combined the idea of cascading, originating in range searching data structures of Lueker (1978) and Willard (1978), with the idea of fractional sampling, which originated in Chazelle (1983). Later authors introduced more complex forms of fractional cascading that allow the data structure to be maintained as the data changes by a sequence of discrete insertion and deletion events.

<span class="mw-page-title-main">Segment tree</span> Computer science data structure

In computer science, a segment tree, also known as a statistic tree, is a tree data structure used for storing information about intervals, or segments. It allows querying which of the stored segments contain a given point. It is, in principle, a static structure and cannot be modified once built. A similar data structure is the interval tree.

In computer science, a range tree is an ordered tree data structure to hold a list of points. It allows all points within a given range to be reported efficiently, and is typically used in two or higher dimensions. Range trees were introduced by Jon Louis Bentley in 1979. Similar data structures were discovered independently by Lueker, Lee and Wong, and Willard. The range tree is an alternative to the k-d tree. Compared to k-d trees, range trees offer faster query times of but worse storage of , where n is the number of points stored in the tree, d is the dimension of each point and k is the number of points reported by a given query.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

In computer science, the Bx tree is a query that is used to update efficient B+ tree-based index structures for moving objects.

<span class="mw-page-title-main">Fenwick tree</span> Data structure

A Fenwick tree or binary indexed tree(BIT) is a data structure that can efficiently update values and calculate prefix sums in an array of values.

In data structures, a range query consists of pre-processing some input data into a data structure to efficiently answer any number of queries on any subset of the input. Particularly, there is a group of problems that have been extensively studied where the input is an array of unsorted numbers and a query consists of computing some function, such as the minimum, on a specific range of the array.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

In computer science, a fractal tree index is a tree data structure that keeps data sorted and allows searches and sequential access in the same time as a B-tree but with insertions and deletions that are asymptotically faster than a B-tree. Like a B-tree, a fractal tree index is a generalization of a binary search tree in that a node can have more than two children. Furthermore, unlike a B-tree, a fractal tree index has buffers at each node, which allow insertions, deletions and other changes to be stored in intermediate locations. The goal of the buffers is to schedule disk writes so that each write performs a large amount of useful work, thereby avoiding the worst-case performance of B-trees, in which each disk write may change a small amount of data on disk. Like a B-tree, fractal tree indexes are optimized for systems that read and write large blocks of data. The fractal tree index has been commercialized in databases by Tokutek. Originally, it was implemented as a cache-oblivious lookahead array, but the current implementation is an extension of the Bε tree. The Bε is related to the Buffered Repository Tree. The Buffered Repository Tree has degree 2, whereas the Bε tree has degree Bε. The fractal tree index has also been used in a prototype filesystem. An open source implementation of the fractal tree index is available, which demonstrates the implementation details outlined below.

In computer science, the augmented map is an abstract data type (ADT) based on ordered maps, which associates each ordered map an augmented value. For an ordered map with key type , comparison function on and value type , the augmented value is defined based on two functions: a base function and a combine function , where is the type of the augmented value. The base function converts a single entry in to an augmented value, and the combine function combines multiple augmented values. The combine function is required to be associative and have an identity . We extend the definition of the associative function as follows:

The PH-tree is a tree data structure used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is space partitioning index with a structure similar to that of a quadtree or octree. However, unlike quadtrees, it uses a splitting policy based on tries and similar to Crit bit trees that is based on the bit-representation of the keys. The bit-based splitting policy, when combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerated trees and the need for rebalancing.

References

  1. https://personal.us.es/almar/cg/08windowing.pdf [ bare URL PDF ]
  2. 1 2 Jens M. Schmidt. Interval Stabbing Problems in Small Integer Ranges. DOI. ISAAC'09, 2009
  3. Range Queries#Semigroup operators