Nearest neighbor search

Last updated

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

Contents

Formally, the nearest-neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q  M, find the closest point in S to q. Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points.

Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. [1]

Applications

The nearest neighbor search problem arises in numerous fields of application, including:

Methods

Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.

Exact methods

The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. [5]

The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results.

Space partitioning

Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) [6] in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) [7] Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R* tree. [8] R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances.

In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods.

Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm.

(Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result.

Approximation methods

An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. [9]

Greedy search in proximity neighborhood graphs

Proximity graph methods (such as navigable small world graphs [10] and HNSW [11] [12] ) are considered the current state-of-the-art for the approximate nearest neighbors search.

The methods are based on greedy traversing in proximity neighborhood graphs in which every point is uniquely associated with vertex . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex by computing the distances from the query q to each vertex of its neighborhood , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself.

The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, [13] in the VoroNet system for the plane, [14] in the RayNet system for the , [15] and in the Navigable Small World, [10] Metrized Small World [16] and HNSW [11] [12] algorithms for the general case of spaces with a distance function. These works were preceded by a pioneering paper by Toussaint, in which he introduced the concept of a relative neighborhood graph. [17]

Locality sensitive hashing

Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability. [18]

Nearest neighbor search in spaces with small intrinsic dimension

The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c12 log n) where c is the expansion constant of the dataset.

In the special case where the data is a dense 3D map of geometric points, the projection geometry of the sensing technique can be used to dramatically simplify the search problem. This approach requires that the 3D data is organized by a projection to a two-dimensional grid and assumes that the data is spatially smooth across neighboring grid cells with the exception of object boundaries. These assumptions are valid when dealing with 3D sensor data in applications such as surveying, robotics and stereo vision but may not hold for unorganized data in general. In practice this technique has an average search time of O(1) or O(K) for the k-nearest neighbor problem when applied to real world stereo vision data. [4]

Vector approximation files

In high-dimensional spaces, tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the datasets in a first run. The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation. [19]

The VA-file approach is a special case of a compression based search, where each feature component is compressed uniformly and independently. The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering. The database is clustered and the most "promising" clusters are retrieved. Huge gains over VA-File, tree-based indexes and sequential scan have been observed. [20] [21] Also note the parallels between clustering and LSH.

Variants

There are numerous variants of the NNS problem and the two most well-known are the k-nearest neighbor search and the ε-approximate nearest neighbor search.

k-nearest neighbors

k-nearest neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.

Approximate nearest neighbor

In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor. In those cases, we can use an algorithm which doesn't guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.

Algorithms that support the approximate nearest neighbor search include locality-sensitive hashing, best bin first and balanced box-decomposition tree based search. [22]

Nearest neighbor distance ratio

Nearest neighbor distance ratio does not apply the threshold on the direct distance from the original point to the challenger neighbor but on a ratio of it depending on the distance to the previous neighbor. It is used in CBIR to retrieve pictures through a "query by example" using the similarity between local features. More generally it is involved in several matching problems.

Fixed-radius near neighbors

Fixed-radius near neighbors is the problem where one wants to efficiently find all points given in Euclidean space within a given fixed distance from a specified point. The distance is assumed to be fixed, but the query point is arbitrary.

All nearest neighbors

For some applications (e.g. entropy estimation), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points. This could, of course, be achieved by running a nearest-neighbor search once for every point, but an improved strategy would be an algorithm that exploits the information redundancy between these N queries to produce a more efficient search. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.

Given a fixed dimension, a semi-definite positive norm (thereby including every Lp norm), and n points in this space, the nearest neighbour of every point can be found in O(n log n) time and the m nearest neighbours of every point can be found in O(mn log n) time. [23] [24]

See also

Related Research Articles

<span class="mw-page-title-main">Nonlinear dimensionality reduction</span> Summary of algorithms for nonlinear dimensionality reduction

Nonlinear dimensionality reduction, also known as manifold learning, refers to various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

<span class="mw-page-title-main">Steiner tree problem</span> On short connecting networks with added vertices

In combinatorial mathematics, the Steiner tree problem, or minimum Steiner tree problem, named after Jakob Steiner, is an umbrella term for a class of problems in combinatorial optimization. While Steiner tree problems may be formulated in a number of settings, they all require an optimal interconnect for a given set of objects and a predefined objective function. One well-known variant, which is often used synonymously with the term Steiner tree problem, is the Steiner tree problem in graphs. Given an undirected graph with non-negative edge weights and a subset of vertices, usually referred to as terminals, the Steiner tree problem in graphs requires a tree of minimum weight that contains all terminals and minimizes the total weight of its edges. Further well-known variants are the Euclidean Steiner tree problem and the rectilinear minimum Steiner tree problem.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

<span class="mw-page-title-main">Euclidean minimum spanning tree</span> Shortest network connecting points

A Euclidean minimum spanning tree of a finite set of points in the Euclidean plane or higher-dimensional Euclidean space connects the points by a system of line segments with the points as endpoints, minimizing the total length of the segments. In it, any two points can reach each other along a path through the line segments. It can be found as the minimum spanning tree of a complete graph with the points as vertices and the Euclidean distances between points as edge weights.

<i>k</i>-d tree Multidimensional search tree for points in k dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. K-dimensional is that which concerns exactly k orthogonal axes or a space of any number of dimensions. k-d trees are a useful data structure for several applications, such as:

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression. These are also of interest while wanting to find a representative using some distance other than squared euclidean distance.

A vantage-point tree is a metric tree that segregates data in a metric space by choosing a position in the space and partitioning the data points into two parts: those points that are nearer to the vantage point than a threshold, and those points that are not. By recursively applying this procedure to partition the data into smaller and smaller sets, a tree data structure is created where neighbors in the tree are likely to be neighbors in the space.

The k-medoids problem is a clustering problem similar to k-means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm. Both the k-means and k-medoids algorithms are partitional and attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers, and thereby allows for greater interpretability of the cluster centers than in k-means, where the center of a cluster is not necessarily one of the input data points. Furthermore, k-medoids can be used with arbitrary dissimilarity measures, whereas k-means generally requires Euclidean distance for efficient solutions. Because k-medoids minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances, it is more robust to noise and outliers than k-means.

<span class="mw-page-title-main">Closest pair of points problem</span>

The closest pair of points problem or closest pair problem is a problem of computational geometry: given points in metric space, find a pair of points with the smallest distance between them. The closest pair problem for points in the Euclidean plane was among the first geometric problems that were treated at the origins of the systematic study of the computational complexity of geometric algorithms.

<span class="mw-page-title-main">Nearest neighbor graph</span> Type of directed graph

The nearest neighbor graph (NNG) is a directed graph defined for a set of points in a metric space, such as the Euclidean distance in the plane. The NNG has a vertex for each point, and a directed edge from p to q whenever q is a nearest neighbor of p, a point whose distance from p is minimum among all the given points other than p itself.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together, marking as outliers points that lie alone in low-density regions . DBSCAN is one of the most common, and most commonly cited, clustering algorithms.

<span class="mw-page-title-main">Isomap</span>

Isomap is a nonlinear dimensionality reduction method. It is one of several widely used low-dimensional embedding methods. Isomap is used for computing a quasi-isometric, low-dimensional embedding of a set of high-dimensional data points. The algorithm provides a simple method for estimating the intrinsic geometry of a data manifold based on a rough estimate of each data point’s neighbors on the manifold. Isomap is highly efficient and generally applicable to a broad range of data sources and dimensionalities.

In pattern recognition, the iDistance is an indexing and query processing technique for k-nearest neighbor queries on point data in multi-dimensional metric spaces. The kNN query is one of the hardest problems on multi-dimensional data, especially when the dimensionality of the data is high. The iDistance is designed to process kNN queries in high-dimensional spaces efficiently and it is especially good for skewed data distributions, which usually occur in real-life data sets. The iDistance can be augmented with machine learning models to learn the data distributions for searching and storing the multi-dimensional data.

<span class="mw-page-title-main">David Mount</span> American computer scientist

David Mount is a professor at the University of Maryland, College Park department of computer science whose research is in computational geometry.

Large margin nearest neighbor (LMNN) classification is a statistical machine learning algorithm for metric learning. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a sub-class of convex optimization.

In computer science, a ball tree, balltree or metric tree, is a space partitioning data structure for organizing points in a multi-dimensional space. A ball tree partitions data points into a nested set of balls. The resulting data structure has characteristics that make it useful for a number of applications, most notably nearest neighbor search.

In the theory of cluster analysis, the nearest-neighbor chain algorithm is an algorithm that can speed up several methods for agglomerative hierarchical clustering. These are methods that take a collection of points as input, and create a hierarchy of clusters of points by repeatedly merging pairs of smaller clusters to form larger clusters. The clustering methods that the nearest-neighbor chain algorithm can be used for include Ward's method, complete-linkage clustering, and single-linkage clustering; these all work by repeatedly merging the closest two clusters but use different definitions of the distance between clusters. The cluster distances for which the nearest-neighbor chain algorithm works are called reducible and are characterized by a simple inequality among certain cluster distances.

<span class="mw-page-title-main">Delone set</span> Well-spaced set of points in a metric space

In the mathematical theory of metric spaces, ε-nets, ε-packings, ε-coverings, uniformly discrete sets, relatively dense sets, and Delone sets are several closely related definitions of well-spaced sets of points, and the packing radius and covering radius of these sets measure how well-spaced they are. These sets have applications in coding theory, approximation algorithms, and the theory of quasicrystals.

<span class="mw-page-title-main">Farthest-first traversal</span> Sequence of points far from previous points

In computational geometry, the farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously-selected points. The same concept can also be applied to a finite set of geometric points, by restricting the selected points to belong to the set or equivalently by considering the finite metric space generated by these points. For a finite metric space or finite set of geometric points, the resulting sequence forms a permutation of the points, also known as the greedy permutation.

References

Citations

  1. Cayton, Lawerence (2008). "Fast nearest neighbor retrieval for bregman divergences". Proceedings of the 25th International Conference on Machine Learning. pp. 112–119. doi:10.1145/1390156.1390171. ISBN   9781605582054. S2CID   12169321.
  2. Qiu, Deyuan, Stefan May, and Andreas Nüchter. "GPU-accelerated nearest neighbor search for 3D registration." International conference on computer vision systems. Springer, Berlin, Heidelberg, 2009.
  3. Becker, Ducas, Gama, and Laarhoven. "New directions in nearest neighbor searching with applications to lattice sieving." Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms (pp. 10-24). Society for Industrial and Applied Mathematics.
  4. 1 2 Bewley, A.; Upcroft, B. (2013). Advantages of Exploiting Projection Structure for Segmenting Dense 3D Point Clouds (PDF). Australian Conference on Robotics and Automation.
  5. Weber, Roger; Schek, Hans-J.; Blott, Stephen (1998). "A quantitative analysis and performance study for similarity search methods in high dimensional spaces" (PDF). VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases. pp. 194–205.
  6. Andrew Moore. "An introductory tutorial on KD trees" (PDF). Archived from the original (PDF) on 2016-03-03. Retrieved 2008-10-03.
  7. Lee, D. T.; Wong, C. K. (1977). "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees". Acta Informatica . 9 (1): 23–29. doi:10.1007/BF00263763. S2CID   36580055.
  8. Roussopoulos, N.; Kelley, S.; Vincent, F. D. R. (1995). "Nearest neighbor queries". Proceedings of the 1995 ACM SIGMOD international conference on Management of data – SIGMOD '95. p. 71. doi: 10.1145/223784.223794 . ISBN   0897917316.
  9. Andoni, A.; Indyk, P. (2006-10-01). "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions". 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). pp. 459–468. CiteSeerX   10.1.1.142.3471 . doi:10.1109/FOCS.2006.49. ISBN   978-0-7695-2720-8.
  10. 1 2 Malkov, Yury; Ponomarenko, Alexander; Logvinov, Andrey; Krylov, Vladimir (2012), Navarro, Gonzalo; Pestov, Vladimir (eds.), "Scalable Distributed Algorithm for Approximate Nearest Neighbor Search Problem in High Dimensional General Metric Spaces", Similarity Search and Applications, Berlin, Heidelberg: Springer Berlin Heidelberg, vol. 7404, pp. 132–147, doi:10.1007/978-3-642-32153-5_10, ISBN   978-3-642-32152-8 , retrieved 2024-01-16
  11. 1 2 Malkov, Yury; Yashunin, Dmitry (2016). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". arXiv: 1603.09320 [cs.DS].
  12. 1 2 Malkov, Yu A.; Yashunin, D. A. (2020-04-01). "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (4): 824–836. doi:10.1109/TPAMI.2018.2889473. ISSN   0162-8828.
  13. Arya, Sunil; Mount, David (1993). "Approximate Nearest Neighbor Queries in Fixed Dimensions". Proceedings of the Fourth Annual {ACM/SIGACT-SIAM} Symposium on Discrete Algorithms, 25–27 January 1993, Austin, Texas.: 271–280.
  14. Olivier, Beaumont; Kermarrec, Anne-Marie; Marchal, Loris; Rivière, Etienne (2006). "Voro Net: A scalable object network based on Voronoi tessellations" (PDF). 2007 IEEE International Parallel and Distributed Processing Symposium. Vol. RR-5833. pp. 23–29. doi:10.1109/IPDPS.2007.370210. ISBN   1-4244-0909-8. S2CID   8844431.
  15. Olivier, Beaumont; Kermarrec, Anne-Marie; Rivière, Etienne (2007). "Peer to Peer Multidimensional Overlays: Approximating Complex Structures". Principles of Distributed Systems. Lecture Notes in Computer Science. Vol. 4878. pp. 315–328. CiteSeerX   10.1.1.626.2980 . doi:10.1007/978-3-540-77096-1_23. ISBN   978-3-540-77095-4.
  16. Malkov, Yury; Ponomarenko, Alexander; Krylov, Vladimir; Logvinov, Andrey (2014). "Approximate nearest neighbor algorithm based on navigable small world graphs". Information Systems. 45: 61–68. doi:10.1016/j.is.2013.10.006. S2CID   9896397.
  17. Toussaint, Godfried (1980). "The relative neighbourhood graph of a finite planar set". Pattern Recognition. 12 (4): 261–268. Bibcode:1980PatRe..12..261T. doi:10.1016/0031-3203(80)90066-7.
  18. A. Rajaraman & J. Ullman (2010). "Mining of Massive Datasets, Ch. 3".
  19. Weber, Roger; Blott, Stephen. "An Approximation-Based Data Structure for Similarity Search" (PDF). S2CID   14613657. Archived from the original (PDF) on 2017-03-04.{{cite journal}}: Cite journal requires |journal= (help)
  20. Ramaswamy, Sharadh; Rose, Kenneth (2007). "Adaptive cluster-distance bounding for similarity search in image databases". ICIP.
  21. Ramaswamy, Sharadh; Rose, Kenneth (2010). "Adaptive cluster-distance bounding for high-dimensional indexing". TKDE.
  22. Arya, S.; Mount, D. M.; Netanyahu, N. S.; Silverman, R.; Wu, A. (1998). "An optimal algorithm for approximate nearest neighbor searching" (PDF). Journal of the ACM. 45 (6): 891–923. CiteSeerX   10.1.1.15.3125 . doi:10.1145/293347.293348. S2CID   8193729. Archived from the original (PDF) on 2016-03-03. Retrieved 2009-05-29.
  23. Clarkson, Kenneth L. (1983), "Fast algorithms for the all nearest neighbors problem", 24th IEEE Symp. Foundations of Computer Science, (FOCS '83), pp. 226–232, doi:10.1109/SFCS.1983.16, ISBN   978-0-8186-0508-6, S2CID   16665268 .
  24. Vaidya, P. M. (1989). "An O(n log n) Algorithm for the All-Nearest-Neighbors Problem". Discrete and Computational Geometry . 4 (1): 101–115. doi: 10.1007/BF02187718 .

Sources

Further reading