Vantage-point tree

A vantage-point tree (or VP tree) is a metric tree that segregates data in a metric space by choosing a position in the space (the "vantage point") and partitioning the data points into two parts: those points that are nearer to the vantage point than a threshold, and those points that are not. By recursively applying this procedure to partition the data into smaller and smaller sets, a tree data structure is created where neighbors in the tree are likely to be neighbors in the space. [1]

One generalization is called a multi-vantage-point tree (or MVP tree): a data structure for indexing objects from large metric spaces for similarity search queries. It uses more than one point to partition each level. [2] [3]

History

Peter Yianilos claimed that the vantage-point tree was discovered independently by him and by Jeffrey Uhlmann. [1] However, Uhlmann had published the method earlier, in 1991. [4] Uhlmann called the data structure a metric tree; the name VP-tree was proposed by Yianilos. Vantage-point trees have been generalized to non-metric spaces using Bregman divergences by Nielsen et al. [5]

This iterative partitioning process is similar to that of a k-d tree, but uses circular (or spherical, hyperspherical, etc.) rather than rectilinear partitions. In two-dimensional Euclidean space, this can be visualized as a series of circles segregating the data.

The vantage-point tree is particularly useful for organizing data from a non-standard metric space into a metric tree.

Understanding a vantage-point tree

The way a vantage-point tree stores data can be represented by a circle. [6] Each node of the tree contains an input point and a radius. All points in the left subtree of a given node lie inside the circle, and all points in the right subtree lie outside it. The tree itself does not need any other information about the data being stored: all it requires is a distance function that satisfies the properties of a metric space. [6]
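
To make the description concrete, the following is a minimal Python sketch of this structure. It is illustrative only: the names VPNode and build_vp_tree are invented here, and the random vantage-point choice and median-radius split are common simple strategies rather than anything mandated by the cited papers.

```python
# Minimal VP-tree construction sketch (illustrative; not from the cited papers).
# Each node stores one vantage point and a radius (the median distance to the
# remaining points); points within the radius go to the "inside" subtree,
# the rest to the "outside" subtree.
import random
import statistics
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class VPNode:
    point: object                 # the vantage point stored at this node
    radius: float                 # threshold separating the two subtrees
    inside: Optional["VPNode"]    # subtree of points with distance <= radius
    outside: Optional["VPNode"]   # subtree of points with distance > radius


def build_vp_tree(points: Sequence, dist: Callable) -> Optional[VPNode]:
    if not points:
        return None
    points = list(points)
    # A random vantage point is a common, simple selection strategy.
    vp = points.pop(random.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    distances = [dist(vp, p) for p in points]
    radius = statistics.median(distances)  # median split keeps the tree balanced
    inside = [p for p, d in zip(points, distances) if d <= radius]
    outside = [p for p, d in zip(points, distances) if d > radius]
    return VPNode(vp, radius,
                  build_vp_tree(inside, dist),
                  build_vp_tree(outside, dist))
```

Note that the sketch touches the data only through the dist function, matching the point above: the tree needs nothing beyond a metric.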

Searching through a vantage-point tree

A vantage-point tree can be used to find the nearest neighbor of a point x. The search algorithm is recursive. At any given step we are working with a node of the tree that has a vantage point v and a threshold distance t. The point of interest x will be some distance d from the vantage point v. If d is less than t, then recursively search the subtree of the node that contains the points closer to the vantage point than the threshold t; otherwise, recursively search the subtree that contains the points farther from the vantage point than the threshold t. If the recursive search finds a neighboring point n whose distance to x is less than |t − d|, then it cannot help to search the other subtree of this node; the discovered point n is returned. Otherwise, the other subtree also needs to be searched recursively.
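
The following Python sketch implements this search, reusing the hypothetical VPNode structure from the sketch above; the function name nearest and the bookkeeping details are assumptions of this sketch, not taken from the cited papers.

```python
# Nearest-neighbor search sketch (illustrative). `best_d` is the distance to
# the closest point found so far; the second subtree is visited only when the
# current best distance is at least |t - d|, i.e. when the pruning rule in
# the text cannot rule it out.
import math


def nearest(node, x, dist, best=None, best_d=math.inf):
    if node is None:
        return best, best_d
    d = dist(x, node.point)
    if d < best_d:                        # the vantage point itself is a candidate
        best, best_d = node.point, d
    # Descend first into the subtree on x's side of the boundary.
    if d < node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    best, best_d = nearest(first, x, dist, best, best_d)
    if best_d >= abs(node.radius - d):    # only then can the far side help
        best, best_d = nearest(second, x, dist, best, best_d)
    return best, best_d


# Example usage with 2-D points and the Euclidean metric, assuming
# build_vp_tree from the construction sketch:
# tree = build_vp_tree([(0, 0), (1, 2), (3, 1), (5, 5)], math.dist)
# nearest(tree, (4, 4), math.dist)  # -> ((5, 5), 1.414...)
```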

A similar approach works for finding the k nearest neighbors of a point x. In the recursion, the other subtree is searched for the remaining k − k′ nearest neighbors of x whenever only k′ (< k) of the nearest neighbors found so far have distance less than |t − d|.
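
A corresponding sketch for k nearest neighbors keeps a max-heap of the k best candidates found so far in place of the single best point; this heap-based bookkeeping is one common way to realize the rule above and is an assumption of this sketch rather than something from the cited sources.

```python
# k-nearest-neighbor search sketch (illustrative), reusing the VPNode structure
# from the construction sketch. The heap holds (-distance, point) pairs so that
# heap[0] is always the *worst* of the k best candidates; the pruning test uses
# that k-th best distance (infinity until k candidates exist).
import heapq
import math


def k_nearest(node, x, dist, k, heap=None):
    if heap is None:
        heap = []                                  # entries: (-distance, point)
    if node is None:
        return heap
    d = dist(x, node.point)
    if len(heap) < k:
        heapq.heappush(heap, (-d, node.point))     # points must compare on ties
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node.point))  # evict the current worst
    if d < node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    k_nearest(first, x, dist, k, heap)
    kth_best = -heap[0][0] if len(heap) == k else math.inf
    if kth_best >= abs(node.radius - d):           # far side may still contribute
        k_nearest(second, x, dist, k, heap)
    return heap
```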

Advantages of a vantage-point tree

  1. The index is built directly from the distances, instead of first inferring multidimensional coordinates for the domain. [6] This avoids pre-processing steps.
  2. Updating a vantage-point tree is relatively easy compared to the FastMap approach. After insertions or deletions, FastMap must eventually rescan the data; the rescan is costly, and it is unclear when it will be triggered. [6]
  3. Distance-based methods are flexible. The structure is "able to index objects that are represented as feature vectors of a fixed number of dimensions". [6]

Complexity

The time cost to build a vantage-point tree is approximately O(n log n): for each of the n elements, the tree is descended through about log n levels to find its placement. There is an additional constant factor of k, where k is the number of vantage points per tree node. [3]

The time cost to search a vantage-point tree to find a single nearest neighbor is O(log n). There are log n levels, each involving k distance calculations, where k is the number of vantage points (elements) at that position in the tree.

The time cost of a range search in a vantage-point tree, which may be the most important kind of query, can vary greatly depending on the specifics of the algorithm used and its parameters. Brin's paper [3] gives the results of experiments with several vantage-point algorithms with various parameters, measuring cost as the number of distance calculations.

The space cost of a vantage-point tree is approximately O(n). Each element is stored once, and each element in each non-leaf node requires a pointer to its descendant nodes. (See Brin [3] for details of one implementation choice; the number of elements per node is a parameter that plays a role.)

With n points there are O(n²) pairwise distances between points. However, the creation of a vantage-point tree requires that only O(n log n) distances be calculated explicitly, and a search requires only O(log n) distance calculations. For example, if x and y are points and it is known that the distance d(x, y) is small then any point z that is far from x will also necessarily be almost as far from y because the metric space's triangle inequality gives d(y, z) ≥ d(x, z) − d(x, y).
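
The inequality is easy to check numerically; the short sketch below samples random points in the plane and verifies the bound for the Euclidean metric (an illustration of the inequality only, not part of any tree implementation).

```python
# Check d(y, z) >= d(x, z) - d(x, y) for random points under the Euclidean metric.
import math
import random

for _ in range(1000):
    x, y, z = [(random.random(), random.random()) for _ in range(3)]
    assert math.dist(y, z) >= math.dist(x, z) - math.dist(x, y) - 1e-12
```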

References

  1. Yianilos, Peter N. (1993). "Data structures and algorithms for nearest neighbor search in general metric spaces". Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics. pp. 311–321.
  2. Bozkaya, Tolga; Ozsoyoglu, Meral (September 1999). "Indexing Large Metric Spaces for Similarity Search Queries". ACM Transactions on Database Systems. 24 (3): 361–404. doi:10.1145/328939.328959. ISSN 0362-5915. S2CID 6486308.
  3. Brin, Sergey (September 1995). "Near Neighbor Search in Large Metric Spaces". VLDB '95: Proceedings of the 21st International Conference on Very Large Data Bases. Zurich, Switzerland: Morgan Kaufmann. pp. 574–584. ISBN 9781558603790.
  4. Uhlmann, Jeffrey (1991). "Satisfying General Proximity/Similarity Queries with Metric Trees". Information Processing Letters. 40 (4): 175–179. doi:10.1016/0020-0190(91)90074-r.
  5. Nielsen, Frank; et al. (2009). "Bregman Vantage Point Trees for Efficient Nearest Neighbor Queries". Proceedings of the IEEE International Conference on Multimedia and Expo (ICME). IEEE. pp. 878–881.
  6. Fu, Ada Wai-chee; Chan, Polly Mei-shuen; Cheung, Yin-Ling; Moon, Yiu Sang (2000). "Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances". The VLDB Journal. 9 (2): 154–173.