Hierarchical navigable small world

The Hierarchical navigable small world (HNSW) algorithm is a graph-based approximate nearest neighbor search technique used in many vector databases. [1] [2] Nearest neighbor search without an index requires computing the distance from the query to every point in the database, which is computationally prohibitive for large datasets. For high-dimensional data, tree-based exact vector search techniques such as the k-d tree and R-tree do not perform well enough because of the curse of dimensionality. To remedy this, approximate k-nearest neighbor searches have been proposed, such as locality-sensitive hashing (LSH) and product quantization (PQ), which trade some accuracy for speed. [2] The HNSW graph offers an approximate k-nearest neighbor search whose cost scales logarithmically even for high-dimensional data.
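For contrast, a brute-force exact k-nearest-neighbor search must compute the distance from the query to every stored vector, which is exactly the linear scan that indexes such as HNSW are designed to avoid. The following is a minimal illustrative sketch in Python; the array sizes and the use of NumPy are assumptions for the example, not part of any particular system.

```python
import numpy as np

def brute_force_knn(database: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact k-NN by scanning the whole database: O(n * d) work per query."""
    # Squared Euclidean distance from the query to every stored vector.
    dists = np.sum((database - query) ** 2, axis=1)
    # Indices of the k smallest distances, then sorted by distance.
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

# Example: 100,000 vectors of dimension 128 (sizes chosen purely for illustration).
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)
print(brute_force_knn(db, q, k=10))
```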

It extends earlier work on navigable small world graphs, presented at the Similarity Search and Applications (SISAP) conference in 2012, by adding a hierarchy of layers that is used to find entry points into the main graph more quickly. [3] HNSW-based libraries are among the best performers in the approximate nearest neighbors benchmark. [4] [5] [6]

Use in vector databases

HNSW is a key method for approximate nearest neighbor search in high-dimensional vector databases, for example in the context of embeddings from neural networks in large language models. Databases that use HNSW as their search index include MariaDB, [7] MongoDB Atlas, [8] and Milvus. [9]

Several of these use either the hnswlib library [10] provided by the original authors, or the FAISS library.
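As an illustration of how such an index is typically used, the following sketch builds and queries an HNSW index with the hnswlib library; the parameter values (M, ef_construction, ef) and data are placeholders chosen for the example, not recommendations from the original authors.

```python
import hnswlib
import numpy as np

dim = 128
num_elements = 100_000

rng = np.random.default_rng(0)
data = rng.standard_normal((num_elements, dim)).astype(np.float32)

# Build the index; 'l2' selects squared-Euclidean distance.
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef controls the size of the dynamic candidate list at query time
# (higher = more accurate but slower).
index.set_ef(50)

labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape, distances.shape)  # (5, 10) (5, 10)
```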

Related Research Articles

Nonlinear dimensionality reduction: Projection of data onto lower-dimensional manifolds

Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear decomposition methods, onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories: agglomerative ("bottom-up") approaches, in which each observation starts in its own cluster and pairs of clusters are merged moving up the hierarchy, and divisive ("top-down") approaches, in which all observations start in one cluster and splits are performed recursively moving down the hierarchy.
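As a rough illustration of the agglomerative strategy, the sketch below applies SciPy's hierarchical clustering routines to a small random point set; the data and the choice of Ward linkage are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))      # 20 two-dimensional points (illustrative)

# Agglomerative ("bottom-up") clustering: repeatedly merge the closest clusters.
Z = linkage(X, method="ward")

# Cut the resulting dendrogram into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```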

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.

Cluster analysis: Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method. It was first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. Most often, it is used for classification, as a k-NN classifier, the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
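A minimal sketch of such a classifier, assuming NumPy arrays of training points and labels, might look as follows; the function and variable names are illustrative only.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X: np.ndarray, train_y: np.ndarray, x: np.ndarray, k: int = 5):
    """Assign x to the class most common among its k nearest training points."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)  # plurality vote of their labels
    return votes.most_common(1)[0][0]
```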

Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data where a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset, such as images, 3-D trajectories and gene expression. They are also of interest when a representative must be chosen using some distance other than squared Euclidean distance.
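Under the assumption of Euclidean dissimilarity, a medoid can be selected with a short brute-force computation such as the following sketch; any other dissimilarity matrix could be substituted for the distance matrix.

```python
import numpy as np

def medoid_index(points: np.ndarray) -> int:
    """Index of the point whose summed distance to all other points is minimal."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # pairwise distance matrix
    return int(np.argmin(dists.sum(axis=1)))      # the medoid minimizes the row sum
```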

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

Community structure: Concept in graph theory

In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community or communities, and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
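One common LSH family, hashing by random hyperplanes for cosine similarity, can be sketched as follows; the class name, bit width and seed are illustrative assumptions, and other LSH families exist for other distance measures.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Random-hyperplane LSH: vectors with high cosine similarity agree on most bits."""
    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))   # one hyperplane per signature bit

    def signature(self, v: np.ndarray) -> int:
        bits = self.planes @ v > 0                          # which side of each hyperplane
        return int("".join("1" if b else "0" for b in bits), 2)

lsh = RandomHyperplaneLSH(dim=128)
rng = np.random.default_rng(1)
v = rng.standard_normal(128)
w = v + 0.05 * rng.standard_normal(128)                     # a slightly perturbed copy of v
print(bin(lsh.signature(v) ^ lsh.signature(w)).count("1"))  # few differing bits expected
```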

Nearest neighbor graph: Type of directed graph

The nearest neighbor graph (NNG) is a directed graph defined for a set of points in a metric space, such as the plane with Euclidean distance. The NNG has a vertex for each point, and a directed edge from p to q whenever q is a nearest neighbor of p, a point whose distance from p is minimum among all the given points other than p itself.
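A brute-force construction of the NNG under Euclidean distance can be sketched as follows; it is quadratic in the number of points, so it is purely illustrative.

```python
import numpy as np

def nearest_neighbor_graph(points: np.ndarray) -> list[tuple[int, int]]:
    """Directed edges (p, q) where q is a nearest neighbor of p (p itself excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)                 # a point is never its own neighbor
    return [(p, int(q)) for p, q in enumerate(np.argmin(dists, axis=1))]
```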

Spectral clustering: Clustering methods

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
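A minimal sketch of normalized spectral clustering from a precomputed similarity matrix, assuming NumPy and scikit-learn are available and using the symmetric normalized Laplacian, might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S: np.ndarray, k: int) -> np.ndarray:
    """Cluster points given a precomputed symmetric, non-negative similarity matrix S."""
    d = S.sum(axis=1)
    # Symmetric normalized graph Laplacian: L = I - D^{-1/2} S D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    # Eigenvectors of the k smallest eigenvalues give a k-dimensional embedding.
    _, vecs = np.linalg.eigh(L)          # eigenvalues returned in ascending order
    embedding = vecs[:, :k]
    # Row-normalize and run k-means in the reduced space.
    norms = np.maximum(np.linalg.norm(embedding, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding / norms)
```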

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.

Feature learning: Set of learning techniques in machine learning

In machine learning (ML), feature learning or representation learning is a set of techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.

The following outline is provided as an overview of, and topical guide to, machine learning:

An ordered key-value store (OKVS) is a data storage paradigm that can support multi-model databases. An OKVS is an ordered mapping of bytes to bytes: it keeps the key-value pairs sorted in the lexicographic order of the keys. OKVS systems provide different sets of features and performance trade-offs. Most of them are shipped as a library without network interfaces, so that they can be embedded in another process. Most OKVS support ACID guarantees. Some OKVS are distributed databases. Ordered key-value stores have found their way into many modern database systems, including NewSQL database systems.
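A toy in-memory sketch of the idea, keeping byte keys sorted so that ordered range scans are possible, might look like this; real OKVS implementations are persistent and far more sophisticated, so this is illustration only.

```python
import bisect

class TinyOKVS:
    """Toy ordered key-value store: byte keys kept in lexicographic order."""
    def __init__(self):
        self._keys: list[bytes] = []
        self._vals: list[bytes] = []

    def put(self, key: bytes, value: bytes) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value                 # overwrite existing key
        else:
            self._keys.insert(i, key)             # insert while preserving order
            self._vals.insert(i, value)

    def range_scan(self, start: bytes, end: bytes):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._vals[i]
            i += 1
```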

A vector database, vector store or vector search engine is a database that can store vectors along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor algorithms, so that one can search the database with a query vector to retrieve the closest matching database records.

Milvus is a distributed vector database developed by Zilliz. It is available as both open-source software and a cloud service.

FAISS is an open-source library for similarity search and clustering of vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
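For example, FAISS exposes an HNSW-based index type; the following sketch builds and queries one, with the data, M value and ef settings being illustrative assumptions.

```python
import faiss
import numpy as np

dim = 128
rng = np.random.default_rng(0)
xb = rng.standard_normal((10_000, dim)).astype(np.float32)   # database vectors
xq = rng.standard_normal((5, dim)).astype(np.float32)        # query vectors

index = faiss.IndexHNSWFlat(dim, 32)      # 32 = number of neighbors per node (M)
index.hnsw.efConstruction = 200           # build-time search breadth
index.add(xb)

index.hnsw.efSearch = 64                  # query-time search breadth
distances, labels = index.search(xq, 10)  # 10 nearest neighbors per query
print(labels.shape)                       # (5, 10)
```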

References

  1. Malkov, Yu A.; Yashunin, D. A. (2016). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". arXiv:1603.09320 [cs.DS].
  2. Malkov, Yury A.; Yashunin, Dmitry A. (1 April 2020). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (4): 824–836. arXiv:1603.09320. doi:10.1109/TPAMI.2018.2889473. PMID 30602420.
  3. Malkov, Yury; Ponomarenko, Alexander; Logvinov, Andrey; Krylov, Vladimir (2012). "Scalable Distributed Algorithm for Approximate Nearest Neighbor Search Problem in High Dimensional General Metric Spaces". In Navarro, Gonzalo; Pestov, Vladimir (eds.). Similarity Search and Applications. Lecture Notes in Computer Science. Vol. 7404. Berlin, Heidelberg: Springer. pp. 132–147. doi:10.1007/978-3-642-32153-5_10. ISBN 978-3-642-32153-5.
  4. Aumüller, Martin; Bernhardsson, Erik; Faithfull, Alexander (2017). "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms". In Beecks, Christian; Borutta, Felix; Kröger, Peer; Seidl, Thomas (eds.). Similarity Search and Applications. Lecture Notes in Computer Science. Vol. 10609. Cham: Springer International Publishing. pp. 34–49. arXiv:1807.05614. doi:10.1007/978-3-319-68474-1_3. ISBN 978-3-319-68474-1.
  5. Aumüller, Martin; Bernhardsson, Erik; Faithfull, Alexander (2020). "ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms". Information Systems. 87: 101374. arXiv:1807.05614. doi:10.1016/j.is.2019.02.006.
  6. "ANN-Benchmarks". ann-benchmarks.com. Retrieved 2024-03-19.
  7. "MariaDB Vector". MariaDB.org. Retrieved 2024-07-30.
  8. "MongoDB Atlas Vector Search". mongodb.com. Retrieved 2024-09-17.
  9. "How to Pick a Vector Index in Your Milvus Instance: A Visual Guide". zilliz.com. Retrieved 2024-10-10.
  10. "nmslib/hnswlib". nmslib. 2024-03-18. Retrieved 2024-03-19.