Hilbert R-tree

Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension of the B+-tree to multidimensional objects.

The performance of R-trees depends on the quality of the algorithm that clusters the data rectangles on a node. Hilbert R-trees use space-filling curves, and specifically the Hilbert curve, to impose a linear ordering on the data rectangles.

There are two types of Hilbert R-trees: one for static databases, and one for dynamic databases. In both cases Hilbert space-filling curves are used to achieve better ordering of multidimensional objects in the node. This ordering has to be "good", in the sense that it should group "similar" data rectangles together, to minimize the area and perimeter of the resulting minimum bounding rectangles (MBRs). Packed Hilbert R-trees are suitable for static databases in which updates are very rare or in which there are no updates at all.

The dynamic Hilbert R-tree is suitable for dynamic databases where insertions, deletions, or updates may occur in real time. Moreover, dynamic Hilbert R-trees employ a flexible deferred-splitting mechanism to increase space utilization. This is achieved by imposing an ordering on the R-tree nodes: the Hilbert R-tree sorts rectangles according to the Hilbert value of the center of each rectangle's MBR. (The Hilbert value of a point is the length of the Hilbert curve from the origin to the point.) Given this ordering, every node has a well-defined set of sibling nodes; thus, deferred splitting can be used. By adjusting the split policy, the Hilbert R-tree can achieve a degree of space utilization as high as desired. In contrast, other R-tree variants have no control over space utilization.

The basic idea

Although the following example is for a static environment, it explains the intuitive principles for good R-tree design. These principles are valid for both static and dynamic databases.

Roussopoulos and Leifker proposed a method for building a packed R-tree that achieves almost 100% space utilization. The idea is to sort the data on the x or y coordinate of one of the corners of the rectangles. Sorting on any of the four coordinates gives similar results. In this discussion points or rectangles are sorted on the x coordinate of the lower left corner of the rectangle, referred to as a "lowx packed R-tree". The sorted list of rectangles is scanned; successive rectangles are assigned to the same R-tree leaf node until that node is full; a new leaf node is then created, and the scanning of the sorted list continues. Thus, the nodes of the resulting R-tree will be fully packed, with the possible exception of the last node at each level. This leads to space utilization ≈100%. Higher levels of the tree are created in a similar way.

Figure 1 highlights the problem of the lowx packed R-tree. Figure 1 [Right] shows the leaf nodes of the R-tree that the lowx packing method will create for the points of Figure 1 [Left]. The fact that the resulting parent nodes cover little area explains why the lowx packed R-tree achieves excellent performance for point queries. However, the fact that the parents have large perimeters explains the degradation of performance for region queries. This is consistent with the analytical formulas for R-tree performance. [1] Intuitively, the packing algorithm should ideally assign nearby points to the same leaf node. Because the lowx packed R-tree ignores the y coordinate, it tends to violate this empirical rule.


Figure 1: [Left] 200 points uniformly distributed; [Right] MBR of nodes generated by the "lowx packed R-tree" algorithm

The sections below describe two variants of the Hilbert R-tree. The first index is suitable for static databases in which updates are very rare or in which there are no updates at all. The nodes of the resulting R-tree will be fully packed, with the possible exception of the last node at each level. Thus, the space utilization is ≈100%; this structure is called a packed Hilbert R-tree. The second index, called a dynamic Hilbert R-tree, supports insertions and deletions, and is suitable for a dynamic environment.

Packed Hilbert R-trees

The following provides a brief introduction to the Hilbert curve. The basic Hilbert curve on a 2×2 grid, denoted by H1, is shown in Figure 2. To derive a curve of order i, each vertex of the basic curve is replaced by the curve of order i − 1, which may be appropriately rotated and/or reflected. Figure 2 also shows the Hilbert curves of order two and three. When the order of the curve tends to infinity, like other space-filling curves, the resulting curve is a fractal, with a fractal dimension of two. [1] [2] The Hilbert curve can be generalized for higher dimensionalities. Algorithms for drawing the two-dimensional curve of a given order can be found in [3] and [2]. An algorithm for higher dimensionalities is given in [4].

The path of a space-filling curve imposes a linear ordering on the grid points; this path may be calculated by starting at one end of the curve and following the path to the other end. The actual coordinate values of each point can be calculated. However, for the Hilbert curve this is much harder than, for example, for the Z-order curve. Figure 2 shows one such ordering for a 4×4 grid (see curve H2). For example, the point (0,0) on the H2 curve has a Hilbert value of 0, while the point (1,1) has a Hilbert value of 2. The Hilbert value of a rectangle is defined as the Hilbert value of its center.
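
To make the ordering concrete, the following Python sketch computes the Hilbert value of a grid point using the standard bit-manipulation conversion (the function name and the order parameter are illustrative; this is the well-known xy-to-d algorithm, not code from the cited papers):

def hilbert_value(order, x, y):
    # Hilbert value (distance along the curve) of the grid point (x, y)
    # on a 2**order x 2**order grid.
    n = 1 << order          # grid side length
    d = 0
    s = n >> 1              # bit mask, from the most significant bit down
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the lower bits line up with the
        # order-(i-1) curve inside it.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

For the H2 curve (order 2), hilbert_value(2, 0, 0) returns 0 and hilbert_value(2, 1, 1) returns 2, matching the values quoted above.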


Figure 2: Hilbert curves of order 1, 2, and 3

The Hilbert curve imposes a linear ordering on the data rectangles; the sorted list is then traversed, assigning each set of C rectangles to a node in the R-tree. The final result is that the set of data rectangles on the same node will be close to each other in the linear ordering, and most likely in the native space; thus, the resulting R-tree nodes will have smaller areas. Figure 2 illustrates the intuitive reasons why Hilbert-based methods result in good performance. The data is composed of points (the same points as given in Figure 1). By grouping the points according to their Hilbert values, the MBRs of the resulting R-tree nodes tend to be small square-like rectangles. This indicates that the nodes will likely have small areas and small perimeters. Small area values result in good performance for point queries; small area and small perimeter values lead to good performance for larger queries.

Algorithm Hilbert-Pack

(packs rectangles into an R-tree)
Step 1. Calculate the Hilbert value for each data rectangle
Step 2. Sort data rectangles on ascending Hilbert values
Step 3. /* Create leaf nodes (level l=0) */

While (there are more rectangles)
    generate a new R-tree node
    assign the next C rectangles to this node

Step 4. /* Create nodes at higher level (l + 1) */

While (there are more than 1 nodes at level l)
    sort nodes at level l on ascending creation time
    repeat Step 3

The assumption here is that the data are static or that the frequency of modification is low. This is a simple heuristic for constructing an R-tree with ≈100% space utilization which at the same time will have good response time.
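
A minimal Python sketch of Hilbert-Pack, reusing the hilbert_value function above (it assumes integer rectangle coordinates that already lie on the 2**order grid, and represents nodes as plain dicts purely for illustration):

def mbr(rects):
    # Minimum bounding rectangle of a list of (xlow, ylow, xhigh, yhigh) tuples.
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def hilbert_pack(rects, C, order=16):
    # Steps 1-2: sort the data rectangles on the Hilbert value of their center.
    def key(r):
        return hilbert_value(order, (r[0] + r[2]) // 2, (r[1] + r[3]) // 2)
    level = [{"mbr": r, "children": None} for r in sorted(rects, key=key)]
    # Steps 3-4: assign each run of C entries to a node; repeat per level
    # until a single root remains.
    while len(level) > 1:
        level = [{"mbr": mbr([e["mbr"] for e in grp]), "children": grp}
                 for grp in (level[i:i + C] for i in range(0, len(level), C))]
    return level[0]

Every node except possibly the last one at each level receives exactly C entries, which is where the ≈100% space utilization comes from.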

Dynamic Hilbert R-trees

The performance of R-trees depends on the quality of the algorithm that clusters the data rectangles on a node. Hilbert R-trees use space-filling curves, and specifically the Hilbert curve, to impose a linear ordering on the data rectangles. The Hilbert value of a rectangle is defined as the Hilbert value of its center.

Tree structure

The Hilbert R-tree has the following structure. A leaf node contains at most Cl entries, each of the form (R, obj_id), where Cl is the capacity of the leaf, R is the MBR of the real object (xlow, xhigh, ylow, yhigh), and obj_id is a pointer to the object description record. The main difference between the Hilbert R-tree and the R*-tree [5] is that non-leaf nodes also contain information about the LHVs (largest Hilbert values). Thus, a non-leaf node in the Hilbert R-tree contains at most Cn entries of the form (R, ptr, LHV), where Cn is the capacity of a non-leaf node, R is the MBR that encloses all the children of that node, ptr is a pointer to the child node, and LHV is the largest Hilbert value among the data rectangles enclosed by R. Notice that since a non-leaf node picks one of the Hilbert values of its children to be its own LHV, there is no extra cost for calculating the Hilbert values of the MBRs of non-leaf nodes. Figure 3 illustrates some rectangles organized in a Hilbert R-tree. The Hilbert values of the centers are the numbers near the "x" symbols (shown only for the parent node "II"). The LHVs are in [brackets]. Figure 4 shows how the tree of Figure 3 is stored on the disk; the contents of the parent node "II" are shown in more detail. Every data rectangle in node "I" has a Hilbert value v ≤ 33; similarly, every rectangle in node "II" has a Hilbert value greater than 33 and ≤ 107, etc.
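
The node layout just described might be rendered in Python as follows (a sketch; the class and field names are illustrative, obj_id is typed as a plain int for simplicity, and rectangles use the (xlow, ylow, xhigh, yhigh) ordering of the earlier sketches rather than the tuple order in the text):

from dataclasses import dataclass, field
from typing import Tuple

Rect = Tuple[float, float, float, float]    # (xlow, ylow, xhigh, yhigh)

@dataclass
class LeafEntry:
    r: Rect            # MBR of the real object
    obj_id: int        # pointer to the object description record

@dataclass
class NodeEntry:
    r: Rect            # MBR enclosing all children of the subtree
    ptr: "Node"        # pointer to the child node
    lhv: int           # largest Hilbert value among the data rectangles under R

@dataclass
class Node:
    leaf: bool                                     # True for leaf nodes
    entries: list = field(default_factory=list)    # at most Cl (leaf) or Cn entries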


Figure 3: Data rectangles organized in a Hilbert R-tree (Hilbert values and largest Hilbert values (LHVs) are in brackets)

A plain R-tree splits a node on overflow, creating two nodes from the original one. This policy is called a 1-to-2 splitting policy. It is also possible to defer the split, waiting until two nodes are full and then splitting them into three. Note that this is similar to the B*-tree split policy. This method is referred to as the 2-to-3 splitting policy.

In general, this can be extended to an s-to-(s+1) splitting policy, where s is the order of the splitting policy. To implement the order-s splitting policy, the overflowing node tries to push some of its entries to one of its s - 1 siblings; if all of them are full, then an s-to-(s+1) split needs to be done. The s - 1 siblings are called the cooperating siblings.

Next, the algorithms for searching, insertion, and overflow handling are described in detail.

Searching

The searching algorithm is similar to the one used in other R-tree variants. Starting from the root, it descends the tree and examines all nodes that intersect the query rectangle. At the leaf level, it reports all entries that intersect the query window w as qualified data items.

Algorithm Search(node Root, rect w):
S1. Search non-leaf nodes:

Invoke Search for every entry whose MBR intersects the query window w.

S2. Search leaf nodes:

Report all entries that intersect the query window w as candidates.
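
With the Node sketch from the tree-structure section, the search algorithm is a short recursive function (intersects is an ordinary axis-aligned overlap test added here for completeness; it is not part of the original pseudocode):

def intersects(a, b):
    # True when rectangles a and b overlap (treating them as closed regions).
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search(node, w):
    # S1/S2: descend into every subtree whose MBR intersects the window w;
    # at the leaf level, report the qualifying obj_ids.
    if node.leaf:
        return [e.obj_id for e in node.entries if intersects(e.r, w)]
    hits = []
    for e in node.entries:
        if intersects(e.r, w):
            hits.extend(search(e.ptr, w))
    return hits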


Figure 4: The file structure for the Hilbert R-tree

Insertion

To insert a new rectangle r in the Hilbert R-tree, the Hilbert value h of the center of the new rectangle is used as a key. At each level, the node whose LHV is the minimum among those greater than h is chosen. When a leaf node is reached, the rectangle r is inserted in its correct order according to h. After a new rectangle is inserted in a leaf node N, AdjustTree is called to fix the MBRs and largest Hilbert values in the upper-level nodes.

Algorithm Insert(node Root, rect r): /* Inserts a new rectangle r in the Hilbert R-tree. h is the Hilbert value of the rectangle. */
I1. Find the appropriate leaf node:

Invoke ChooseLeaf(r, h) to select a leaf node L in which to place r.

I2. Insert r in a leaf node L:

If L has an empty slot, insert r in L in the
appropriate place according to the Hilbert order and return.
If L is full, invoke HandleOverflow(L, r), which
will return a new leaf if a split was inevitable.

I3. Propagate changes upward:

Form a set S that contains L, its cooperating siblings,
and the new leaf (if any).
Invoke AdjustTree(S).

I4. Grow tree taller:

If node split propagation caused the root to split, create
a new root whose children are the two resulting nodes.

Algorithm ChooseLeaf(rect r, int h):
/* Returns the leaf node in which to place a new rectangle r. */
C1. Initialize:

Set N to be the root node.

C2. Leaf check:

If N is a leaf node, return N.

C3. Choose subtree:

If N is a non-leaf node, choose the entry (R, ptr, LHV)
with the minimum LHV value greater than h.

C4. Descend until a leaf is reached:

Set N to the node pointed to by ptr and repeat from C2.
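
A sketch of ChooseLeaf over the same Node structure (the fallback to the last entry, taken when h exceeds every LHV in the node, is an implementation choice the pseudocode leaves open):

def choose_leaf(root, h):
    n = root
    while not n.leaf:
        # C3: follow the entry with the minimum LHV greater than h;
        # if no such entry exists, descend into the rightmost subtree.
        candidates = [e for e in n.entries if e.lhv > h]
        e = min(candidates, key=lambda c: c.lhv) if candidates else n.entries[-1]
        n = e.ptr
    return n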

Algorithm AdjustTree(set S):
/* S is a set of nodes that contains the node being updated, its cooperating siblings (if overflow has occurred) and the newly
created node NN (if split has occurred). The routine ascends from the leaf level towards the root, adjusting MBR and LHV of nodes that cover the nodes in S. It propagates splits (if any) */
A1. If root level is reached, stop.
A2. Propagate node split upward:

Let Np be the parent node of N.
If N has been split, let NN be the new node.
Insert NN in Np in the correct order according to its Hilbert
value if there is room. Otherwise, invoke HandleOverflow(Np, NN).
If Np is split, let PP be the new node.

A3. Adjust the MBR's and LHV's in the parent level:

Let P be the set of parent nodes for the nodes in S.
Adjust the corresponding MBR's and LHV's of the nodes in P appropriately.

A4. Move up to next level:

Let S become the set of parent nodes P, with
NN = PP if Np was split.
Repeat from A1.

Deletion

In the Hilbert R-tree, there is no need to re-insert orphaned nodes whenever a parent node underflows. Instead, keys can be borrowed from the siblings, or the underflowing node is merged with its siblings. This is possible because the nodes have a clear ordering (according to the largest Hilbert value, LHV); in contrast, in R-trees there is no such ordering among sibling nodes. Notice that deletion operations require s cooperating siblings, while insertion operations require s - 1 siblings.

Algorithm Delete(r):
D1. Find the host leaf:

Perform an exact match search to find the leaf node L
that contains r.

D2. Delete r:

Remove r from node L.

D3. If L underflows,

borrow some entries from the s cooperating siblings.
If all the siblings are ready to underflow,
merge s + 1 nodes into s nodes and
adjust the resulting nodes.

D4. Adjust MBR and LHV in parent levels.

Form a set S that contains L and its cooperating
siblings (if underflow has occurred).
Invoke AdjustTree(S).
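
A possible rendering of the borrow-or-merge step D3 (min_fill, the minimum number of entries per node, and hkey, a function giving an entry's Hilbert value, are assumed parameters; the function returns the node the parent should drop after a merge, or None):

def handle_underflow(node, siblings, min_fill, hkey):
    group = siblings + [node]      # the s cooperating siblings plus node: s + 1 nodes
    pool = sorted([e for n in group for e in n.entries], key=hkey)
    removed = None
    if len(pool) < min_fill * len(group):
        # All siblings are ready to underflow: merge s + 1 nodes into s.
        removed = group.pop()
        removed.entries = []
    # Borrow/redistribute: spread the pooled entries evenly, in Hilbert order.
    q, r = divmod(len(pool), len(group))
    i = 0
    for j, n in enumerate(group):
        take = q + (1 if j < r else 0)
        n.entries = pool[i:i + take]
        i += take
    return removed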

Overflow handling

The overflow handling algorithm in the Hilbert R-tree treats an overflowing node either by moving some of its entries to one of the s - 1 cooperating siblings or by splitting s nodes into s + 1 nodes.

Algorithm HandleOverflow(node N, rect r):
/* return the new node if a split occurred. */
H1. Let ε be a set that contains all the entries from N

and its s - 1 cooperating siblings.

H2. Add r to ε.
H3. If at least one of the s - 1 cooperating siblings is not full,

distribute ε evenly among the s nodes according to Hilbert values.

H4. If all the s cooperating siblings are full,

create a new node NN and
distribute ε evenly among the s + 1 nodes according
to Hilbert values.
Return NN.
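
The matching sketch for HandleOverflow (hkey is again the assumed per-entry Hilbert key, and C stands for the node capacity Cl or Cn):

def handle_overflow(node, siblings, new_entry, C, hkey):
    group = siblings + [node]      # the s - 1 cooperating siblings plus N: s nodes
    # H1/H2: pool all entries of the group together with the new entry r.
    pool = sorted([e for n in group for e in n.entries] + [new_entry], key=hkey)
    new_node = None
    if all(len(n.entries) >= C for n in group):
        # H4: all s nodes are full, so perform an s-to-(s + 1) split.
        new_node = Node(leaf=node.leaf)
        group.append(new_node)
    # H3/H4: distribute the pooled entries evenly, preserving Hilbert order.
    q, r = divmod(len(pool), len(group))
    i = 0
    for j, n in enumerate(group):
        take = q + (1 if j < r else 0)
        n.entries = pool[i:i + take]
        i += take
    return new_node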

Notes

  1. I. Kamel and C. Faloutsos. On packing R-trees. In Proceedings of the Second International ACM Conference on Information and Knowledge Management (CIKM), pages 490–499, Washington, D.C., 1993.
  2. H. Jagadish. Linear clustering of objects with multiple attributes. In Proc. of ACM SIGMOD Conf., pages 332–342, Atlantic City, NJ, May 1990.
  3. J. Griffiths. An algorithm for displaying a class of space-filling curves. Software: Practice and Experience, 16(5):403–411, May 1986.
  4. T. Bially. Space-filling curves: their generation and their application to bandwidth reduction. IEEE Transactions on Information Theory, IT-15(6):658–664, November 1969.
  5. Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. (1990). "The R*-tree: an efficient and robust access method for points and rectangles". Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD '90, p. 322. doi:10.1145/93597.98741. ISBN 0897913655. S2CID 11567855. Archived from the original (PDF) on 2018-04-17. Retrieved 2015-09-02.
