K-D-B-tree

Last updated

In computer science, a K-D-B-tree (k-dimensional B-tree) is a tree data structure for subdividing a k-dimensional search space. The aim of the K-D-B-tree is to provide the search efficiency of a balanced k-d tree, while providing the block-oriented storage of a B-tree for optimizing external memory accesses. [1]

Contents

Informal description

Much like the k-d tree, a K-D-B-tree organizes points in k-dimensional space, useful for tasks such as range-searching and multi-dimensional database queries. K-D-B-trees subdivide space into two subspaces by comparing elements in a single domain. Using a 2-D-B-tree (2-dimensional K-D-B-tree) as an example, space is subdivided in the same manner as a k-d tree: using a point in just one of the domains, or axes in this case, all other values are either less than or greater than the current value, and fall to the left and right of the splitting plane respectively.

Unlike a k-d tree, each half-space is not its own node. Instead, as in a B-tree, nodes in the K-D-B-tree are stored as pages and the tree stores a pointer to the root page.

Structure

The basic structure of a K-D-B-tree. KBDtreeStructure.png
The basic structure of a K-D-B-tree.

The K-D-B-tree contains two types of pages:

Page overflows occur when inserting an element into a K-D-B-tree results in the size of a node exceeding its optimal size. Since the purpose of the K-D-B-tree is to optimize external memory accesses like those from a hard-disk, a page is considered to have overflowed or be overfilled if the size of the node exceeds the external memory page size.

Throughout insertion/deletion operations, the K-D-B-tree maintains a certain set of properties:

Operations on a K-D-B-tree

Queries

Queries on a K-D-B-tree are a range search over intervals in all domains or axes in the tree. This collection of intervals is called the query region. In k-space, the query region can be visualized as a bounding volume around some subspace in the entire k-dimensional search space. A query can fall into one of three categories:

Algorithm

  1. If the root of the tree is null, terminate, otherwise let page be root.
  2. If page is a point page, return every point in a (point, location) pair that lies within the query region.
  3. Otherwise, page is a region page, so for all (region, child) pairs where region and query region intersect, set page to be child and recurse from step 2.

Insertions

Since an insertion into a K-D-B-tree may require the splitting of a page in the case of a page overflow, it is important to first define the splitting operation.

Splitting algorithm

First, a region page is split along some plane to create two new region pages, the left and right pages. These pages are filled with the regions from the old region page, and the old region page is deleted. Then, for every (region, child) in the original region page, remembering child is a page and region specifies an actual bounding region:

  1. If region lies entirely to the left of the splitting plane, add (region, child) to the left page.
  2. If region lies entirely to the right of the splitting plane, add (region, child) to the right page.
  3. Otherwise:
    1. Recursively split child by the splitting plane, resulting in the pages new_left_page and new_right_page
    2. Split region by the splitting plane, resulting in left_region and right_region
    3. Add (left_region, new_left_page) to the left page, and (right_region, new_right_page) to the right page.

Insertion algorithm

The importance of choosing the correct splitting domain. KDBTreeSplits.png
The importance of choosing the correct splitting domain.

Using the splitting algorithm, insertions of a new (point, location) pair can be implemented as follows:

  1. If the root page is null, simply make the root page a new point page containing (point, location)
  2. If an exact match query on point to find the page that point should be added to. If it already exists in the page, terminate.
  3. Add (point, location) to the page. If the page overflows, let page denote that page.
  4. Let old_page be equal to page. Choose some element and a domain/axis to define a plane to split page by that results in two pages that will not also result in one of the pages being overfilled with the addition of a new point. Split page by the plane to make two new pages, new_left_page and new_right_page, and two new regions left_region and right_region.
  5. If page was the root page, go to step 6. Otherwise, page becomes the parent of page. Replace (region, old_page) in page with (left_region, new_left_page) and (right_region, new_right_page). If page overflows, repeat step 4, otherwise terminate.
  6. Let left_region be the entire search space to the left of the splitting plane, and right_region be the search space to the right, resulting from the split in Step 4. Set the root page to be a page containing to the regions left_region and right_region.

It is important to take care in the domain and element chosen to split page by, since it is desirable to try to balance the number of points on either side of the splitting plane. In some cases, a poor choice of splitting domain can result in undesirable splits. It is also possible that a page cannot be split by a certain domain.

Deletions

Deletions from a K-D-B-tree are incredibly simple if no minimum requirements are placed on storage utilization. Using an exact match query to find a (point, location) pair, we simply remove the record from the tree if it exists.

Reorganization algorithm

Since deletions can result in pages that contain very little data, it may be necessary to reorganize the K-D-B-tree to meet some minimum storage utilization criteria. The reorganization algorithm to be used when a page contains too little data is as follows:

  1. Let page be the parent of P, containing (region, P).
  2. Find regions in page such that the regions are adjacent and the union of which forms a rectangular region. These regions are considered "joinable". Let R denote the set of these regions.
  3. Merge the set R into one page S, and if the S is overfull, repeatedly split until none of the resulting pages are overfull.
  4. Replace the set R of regions in page with the resulting pages from splitting S.

Like in a k-d tree, updates in a K-D-B-tree may result in the requirement for the splitting of several nodes recursively. This is incredibly inefficient and can result in sub-optimal memory utilization as it may result in many near-empty leaves. Lomet and Salzberg proposed a structure called the hB-tree (holey brick tree) to improve performance of K-D-B-trees by limiting the splits that occur after an insertion to only one root-to-leaf path. This was achieved by storing regions not only as rectangles, but as rectangles with a rectangle removed from the center. [2]

BKD tree

More recently, the Bkd-tree was proposed as a means to provide the fast queries and near 100% space utilization of a static K-D-B-tree. Instead of maintaining a single tree and re-balancing, a set of K-D-B-trees are maintained and rebuilt at regular intervals. [3] In this case, is the size of the memory buffer in number of points.

Related Research Articles

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

<span class="mw-page-title-main">R-tree</span> Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

<span class="mw-page-title-main">Rope (data structure)</span> Data structure for storing strings

In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a better query performance. Like the standard R-tree, it can store both point and spatial data. It was proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger in 1990.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

<i>k</i>-d tree Multidimensional search tree for points in k dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key and creating point clouds. k-d trees are a special case of binary space partitioning trees.

<span class="mw-page-title-main">Z-order curve</span> Mapping function that preserves data point locality

In mathematical analysis and computer science, functions which are Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It is named in France after Henri Lebesgue, who studied it in 1904, and named in the United States after Guy Macdonald Morton, who first applied the order to file sequencing in 1966. The z-value of a point in multidimensions is simply calculated by interleaving the binary representations of its coordinate values. Once the data are sorted into this ordering, any one-dimensional data structure can be used, such as simple one dimensional arrays, binary search trees, B-trees, skip lists or hash tables. The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension to B+-tree for multidimensional objects.

A min/max kd-tree is a k-d tree with two scalar values - a minimum and a maximum - assigned to its nodes. The minimum/maximum of an inner node is equal to the minimum/maximum of its children's minima/maxima.

<span class="mw-page-title-main">Segment tree</span> Computer science data structure

In computer science, a segment tree, also known as a statistic tree, is a tree data structure used for storing information about intervals, or segments. It allows querying which of the stored segments contain a given point. It is, in principle, a static structure and cannot be modified once built. A similar data structure is the interval tree.

In computer science, a range tree is an ordered tree data structure to hold a list of points. It allows all points within a given range to be reported efficiently, and is typically used in two or higher dimensions. Range trees were introduced by Jon Louis Bentley in 1979. Similar data structures were discovered independently by Lueker, Lee and Wong, and Willard. The range tree is an alternative to the k-d tree. Compared to k-d trees, range trees offer faster query times of but worse storage of , where n is the number of points stored in the tree, d is the dimension of each point and k is the number of points reported by a given query.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of numbers. The smallest number in the sequence is at the root of the tree; its left and right subtrees are constructed recursively from the subsequences to the left and right of this number. When all numbers are distinct, the Cartesian tree is uniquely defined from the properties that it is heap-ordered and that a symmetric (in-order) traversal of the tree returns the original sequence.

In computer science, the Bx tree is a query that is used to update efficient B+ tree-based index structures for moving objects.

In computer science, a ball tree, balltree or metric tree, is a space partitioning data structure for organizing points in a multi-dimensional space. A ball tree partitions data points into a nested set of balls. The resulting data structure has characteristics that make it useful for a number of applications, most notably nearest neighbor search.

The Priority R-tree is a worst-case asymptotically optimal alternative to the spatial tree R-tree. It was first proposed by Arge, De Berg, Haverkort and Yi, K. in an article from 2004. The prioritized R-tree is essentially a hybrid between a k-dimensional tree and a r-tree in that it defines a given object's N-dimensional bounding volume as a point in N-dimensions, represented by the ordered pair of the rectangles. The term prioritized arrives from the introduction of four priority-leaves that represents the most extreme values of each dimensions, included in every branch of the tree. Before answering a window-query by traversing the sub-branches, the prioritized R-tree first checks for overlap in its priority nodes. The sub-branches are traversed by checking whether the least value of the first dimension of the query is above the value of the sub-branches. This gives access to a quick indexation by the value of the first dimension of the bounding box.

A relaxed K-d tree or relaxed K-dimensional tree is a data structure which is a variant of K-d trees. Like K-dimensional trees, a relaxed K-dimensional tree stores a set of n-multidimensional records, each one having a unique K-dimensional key x=(x0,... ,xK−1). Unlike K-d trees, in a relaxed K-d tree, the discriminants in each node are arbitrary. Relaxed K-d trees were introduced in 1998.

In computer science, a priority search tree is a tree data structure for storing points in two dimensions. It was originally introduced by Edward M. McCreight. It is effectively an extension of the priority queue with the purpose of improving the search time from O(n) to O(s + log n) time, where n is the number of points in the tree and s is the number of points returned by the search.

The PH-tree is a tree data structure used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is space partitioning index with a structure similar to that of a quadtree or octree. However, unlike quadtrees, it uses a splitting policy based on tries and similar to Crit bit trees that is based on the bit-representation of the keys. The bit-based splitting policy, when combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerated trees and the need for rebalancing.

References

  1. Robinson, John (1981). "The K-D-B-tree". Proceedings of the 1981 ACM SIGMOD international conference on Management of data - SIGMOD '81. Sigmod '81. pp. 10–18. doi:10.1145/582318.582321. ISBN   978-0897910408. S2CID   27482172 . Retrieved Apr 8, 2014.
  2. Lomet, David; Betty Salzberg (Dec 1990). "The hB-tree: a multiattribute indexing method with good guaranteed performance". ACM Transactions on Database Systems. 15 (4): 625–658. CiteSeerX   10.1.1.63.2056 . doi:10.1145/99935.99949. S2CID   15333693.
  3. Procopiuc, Octavian; Agarwal, Pankaj; Arge, Lars; Vitter, Jeffrey Scott (2003). Bkd-Tree: A Dynamic Scalable kd-Tree . Advances in Spatial and Temporal Databases. Lecture Notes in Computer Science. Vol. 2750. pp.  46–65. CiteSeerX   10.1.1.134.3206 . doi:10.1007/978-3-540-45072-6_4. ISBN   978-3-540-40535-1. S2CID   12784232.