Cover tree

Last updated October 17, 2021

The cover tree is a type of data structure in computer science that is specifically designed to facilitate the speed-up of a nearest neighbor search. It is a refinement of the Navigating Net data structure, and related to a variety of other data structures developed for indexing intrinsically low-dimensional data.^[1]

The tree can be thought of as a hierarchy of levels with the top level containing the root point and the bottom level containing every point in the metric space. Each level C is associated with an integer value i that decrements by one as the tree is descended. Each level C in the cover tree has three important properties:

Nesting: $C_{i}\subseteq C_{i-1}$
Covering: For every point $p\in C_{i-1}$ , there exists a point $q\in C_{i}$ such that the distance from $p$ to $q$ is less than or equal to $2^{i}$ and exactly one such $q$ is a parent of $p$ .
Separation: For all points $p,q\in C_{i}$ , the distance from $p$ to $q$ is greater than $2^{i}$ .

Complexity

Find

Like other metric trees the cover tree allows for nearest neighbor searches in $O(\eta *\log {n})$ where $\eta$ is a constant associated with the dimensionality of the dataset and n is the cardinality. To compare, a basic linear search requires $O(n)$ , which is a much worse dependence on $n$ . However, in high-dimensional metric spaces the $\eta$ constant is non-trivial, which means it cannot be ignored in complexity analysis. Unlike other metric trees, the cover tree has a theoretical bound on its constant that is based on the dataset's expansion constant or doubling constant (in the case of approximate NN retrieval). The bound on search time is $O(c^{12}\log {n})$ where $c$ is the expansion constant of the dataset.

Insert

Although cover trees provide faster searches than the naive approach, this advantage must be weighed with the additional cost of maintaining the data structure. In a naive approach adding a new point to the dataset is trivial because order does not need to be preserved, but in a cover tree it can take $O(c^{6}\log {n})$ time. However, this is an upper-bound, and some techniques have been implemented that seem to improve the performance in practice.^[2]

Space

The cover tree uses implicit representation to keep track of repeated points. Thus, it only requires O(n) space.

Related Research Articles

In mathematics, a 3-sphere is a higher-dimensional analogue of a sphere. It may be embedded in 4-dimensional Euclidean space as the set of points equidistant from a fixed central point. Analogous to how the boundary of a ball in three dimensions is an ordinary sphere, the boundary of a ball in four dimensions is a 3-sphere. A 3-sphere is an example of a 3-manifold and an n-sphere.

In mathematical physics, Minkowski space is a combination of three-dimensional Euclidean space and time into a four-dimensional manifold where the spacetime interval between any two events is independent of the inertial frame of reference in which they are recorded. Although initially developed by mathematician Hermann Minkowski for Maxwell's equations of electromagnetism, the mathematical structure of Minkowski spacetime was shown to be implied by the postulates of special relativity.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information. In Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior. The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the statistician Ronald Fisher. The Fisher information is also used in the calculation of the Jeffreys prior, which is used in Bayesian statistics.

Bosonic string theory is the original version of string theory, developed in the late 1960s. It is so called because it only contains bosons in the spectrum.

In mathematics, a hyperbolic metric space is a metric space satisfying certain metric relations between points. The definition, introduced by Mikhael Gromov, generalizes the metric properties of classical hyperbolic geometry and of trees. Hyperbolicity is a large-scale property, and is very useful to the study of certain infinite groups called Gromov-hyperbolic groups.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

<i>k</i>-d tree Multidimensional search tree for points in k dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key and creating point clouds. k-d trees are a special case of binary space partitioning trees.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

A vantage-point tree is a metric tree that segregates data in a metric space by choosing a position in the space and partitioning the data points into two parts: those points that are nearer to the vantage point than a threshold, and those points that are not. By recursively applying this procedure to partition the data into smaller and smaller sets, a tree data structure is created where neighbors in the tree are likely to be neighbors in the space.

A metric tree is any tree data structure specialized to index data in metric spaces. Metric trees exploit properties of metric spaces such as the triangle inequality to make accesses to the data more efficient. Examples include the M-tree, vp-trees, cover trees, MVP Trees, and BK-trees.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

The closest pair of points problem or closest pair problem is a problem of computational geometry: given n points in metric space, find a pair of points with the smallest distance between them. The closest pair problem for points in the Euclidean plane was among the first geometric problems that were treated at the origins of the systematic study of the computational complexity of geometric algorithms.

In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.

Isomap is a nonlinear dimensionality reduction method. It is one of several widely used low-dimensional embedding methods. Isomap is used for computing a quasi-isometric, low-dimensional embedding of a set of high-dimensional data points. The algorithm provides a simple method for estimating the intrinsic geometry of a data manifold based on a rough estimate of each data point’s neighbors on the manifold. Isomap is highly efficient and generally applicable to a broad range of data sources and dimensionalities.

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

In computer science, the B^x tree is basically a query that is used to update efficient B+ tree-based index structures for moving objects.

M-trees are tree data structures that are similar to R-trees and B-trees. It is constructed using a metric and relies on the triangle inequality for efficient range and k-nearest neighbor (k-NN) queries. While M-trees can perform well in many conditions, the tree can also have large overlap and there is no clear strategy on how to best avoid overlap. In addition, it can only be used for distance functions that satisfy the triangle inequality, while many advanced dissimilarity functions used in information retrieval do not satisfy this.

In computer science, a ball tree, balltree or metric tree, is a space partitioning data structure for organizing points in a multi-dimensional space. The ball tree gets its name from the fact that it partitions data points into a nested set of hyperspheres known as "balls". The resulting data structure has characteristics that make it useful for a number of applications, most notably nearest neighbor search.

In theoretical physics, an M2-brane, is a spatially extended mathematical object (brane) that appears in string theory and in related theories. In particular, it is a solution of eleven-dimensional supergravity which possesses a three-dimensional world volume.

References

Notes

↑ Kenneth Clarkson. Nearest-neighbor searching and metric space dimensions. In G. Shakhnarovich, T. Darrell, and P. Indyk, editors, Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15--59. MIT Press, 2006.
↑ "Cover Tree".

Bibliography

Alina Beygelzimer, Sham Kakade, and John Langford. Cover Trees for Nearest Neighbor. In Proc. International Conference on Machine Learning (ICML), 2006.
JL's Cover Tree page. John Langford's page links to papers and code.
A C++ Cover Tree implementation on GitHub.
A cover tree implementation in Java.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[clarkson-1] Kenneth Clarkson. Nearest-neighbor searching and metric space dimensions. In G. Shakhnarovich, T. Darrell, and P. Indyk, editors, Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15--59. MIT Press, 2006.

[2] "Cover Tree".

[1]

[2]

v t e Tree data structures
Search trees (dynamic sets/associative arrays)	2–3 2–3–4 AA (a,b) AVL B B+ B* B^x (Optimal) Binary search Dancing HTree Interval Order statistic (Left-leaning) Red–black Scapegoat Splay T Treap UB Weight-balanced
Heaps	Binary Binomial Brodal Fibonacci Leftist Pairing Skew van Emde Boas Weak
Tries	Ctrie C-trie (compressed ADT) Hash Radix Suffix Ternary search X-fast Y-fast
Spatial data partitioning trees	Ball BK BSP Cartesian Hilbert R k-d (implicit k-d) M Metric MVP Octree Priority R Quad R R+ R* Segment VP X
Other trees	Cover Exponential Fenwick Finger Fractal tree index Fusion Hash calendar iDistance K-ary Left-child right-sibling Link/cut Log-structured merge Merkle PQ Range SPQR Top