FAISS

Last updated
Faiss
Original author(s) Hervé Jégou
Matthijs Douze
Jeff Johnson
Developer(s) Meta Platforms (Meta AI)
Initial releaseFebruary 22, 2017;7 years ago (2017-02-22) [1]
Stable release
v1.9.0 / October 4, 2024;3 months ago (2024-10-04).: [2]
Repository github.com/facebookresearch/faiss
Written in C++, Python, CUDA
Operating system Various, including Windows, macOS, and Linux
Platform x86, ARM, PowerPC
Type Similarity search library
License MIT License
Website faiss.ai

Faiss (Facebook AI Similarity Search) [3] is an open-source library for similarity search and clustering of vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

Contents

Faiss is written in C++ with complete wrappers for Python and C. Some of the most useful algorithms are implemented on the GPU using CUDA. [4]

Features

Faiss is organized as a toolbox that contains a variety of indexing methods that commonly involve a chain of components (preprocessing, compression, non-exhaustive search, etc.). The scope of the library is intentionally limited to focus on ANNS algorithmic implementation and to avoid facilities related to database functionality, distributed computing or feature extraction algorithms. [5]

Faiss is designed with the following assumptions: [5]

The following major categories of indexing methods are supported:

The following families of vector quantization methods are supported:

Faiss focuses on euclidean distance and inner product distance for floating-point data. The limited support of other distances (manhattan distance, Lp distance, etc.) is also available.

Faiss code supports multithreading via OpenMP, utilizes BLAS via OpenBLAS or Intel MKL, and also uses custom SIMD kernels for x86 and ARM Neon CPUs.

Besides the similarity search, Faiss provides the following useful facilities:

Faiss has a standalone Vector Codec functionality for the lossy compression of vectors, allowing to trade the representation accuracy for the binary size. [17]

Applications

Typical Faiss applications [5] include recommender systems, data mining, [18] text retrieval and content moderation. [19]

Faiss was reported to index 1.5 trillion 144-dimensional vectors for internal Meta Platforms applications. [5] [20]

Faiss is used in vector databases as a core component of a search engine (OpenSearch, [21] [22] Milvus, [23] Vearch [24] ).

Faiss is often considered as a baseline in similarity search benchmarks. [25] [26] [27] [28]

Faiss has an integration with Haystack, [29] LangChain [30] frameworks.

Various advanced code snippets for Faiss can be found on its snippets wiki page and case studies wiki page.

See also

Related Research Articles

Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. Developed in the early 1980s by Robert M. Gray, it was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms. In simpler terms, vector quantization chooses a set of points to represent a larger set of points.

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The curse generally refers to issues that arise when the number of datapoints is small relative to the intrinsic dimension of the data.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

<span class="mw-page-title-main">Bag-of-words model in computer vision</span> Image classification model

In computer vision, the bag-of-words model sometimes called bag-of-visual-words model can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

<span class="mw-page-title-main">Object detection</span> Computer technology related to computer vision and image processing

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.

<span class="mw-page-title-main">MNIST database</span> Database of handwritten digits

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.

fastText is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model allows one to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages. Several papers describe the techniques used by fastText.

In natural language processing, a sentence embedding is a representation of a sentence as a vector of numbers which encodes meaningful semantic information.

A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.

Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, and specifically object detection and localization. The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object and also the category of the object. In general, R-CNN architectures perform selective search over feature maps outputted by a CNN.

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.

<span class="mw-page-title-main">Knowledge graph embedding</span> Dimensionality reduction of graph-based semantic data objects [machine learning task]

In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.

Graph neural networks (GNN) are specialized artificial neural networks that are designed for tasks whose inputs are graphs.

<span class="mw-page-title-main">Vision transformer</span> Machine learning model for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

A vector database, vector store or vector search engine is a database that can store vectors along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor algorithms, so that one can search the database with a query vector to retrieve the closest matching database records.

<span class="mw-page-title-main">Hierarchical navigable small world</span> Clustering and community detection algorithm

The Hierarchical navigable small world (HNSW) algorithm is a graph-based approximate nearest neighbor search technique used in many vector databases. Nearest neighbor search without an index involves computing the distance from the query to each point in the database, which for large datasets is computationally prohibitive. For high-dimensional data, tree-based exact vector search techniques such as the k-d tree and R-tree do not perform well enough because of the curse of dimensionality. To remedy this, approximate k-nearest neighbor searches have been proposed, such as locality-sensitive hashing (LSH) and product quantization (PQ) that trade performance for accuracy. The HNSW graph offers an approximate k-nearest neighbor search which scales logarithmically even in high-dimensional data.

Milvus is a distributed vector database developed by Zilliz. It is available as both open-source software and a cloud service.

References

  1. "Faiss: A library for efficient similarity search". 29 March 2017.
  2. "v1.9.0". GitHub .
  3. "Official Faiss wiki". GitHub .
  4. Johnson, Jeff; Douze, Matthijs; Jégou, Hervé (2017). "Billion-scale similarity search with GPUs". arXiv: 1702.08734 [cs.CV].
  5. 1 2 3 4 Douze, Matthijs; Guzhva, Alexandr; Deng, Chengqi; Johnson, Jeff; Szilvasy, Gergely; Mazaré, Pierre-Emmanuel; Lomeli, Maria; Hosseini, Lucas; Jégou, Hervé (2024). "The Faiss library". arXiv: 2401.08281 [cs.LG].
  6. Sivic; Zisserman (2003). "Video Google: A text retrieval approach to object matching in videos". Proceedings Ninth IEEE International Conference on Computer Vision. pp. 1470–1477 vol.2. doi:10.1109/ICCV.2003.1238663. ISBN   0-7695-1950-4.
  7. Fu, Cong; Xiang, Chao; Wang, Changxu; Cai, Deng (2017). "Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph". arXiv: 1707.00143 [cs.LG].
  8. Jégou, H; Douze, M; Schmid, C (January 2011). "Product Quantization for Nearest Neighbor Search" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 33 (1): 117–128. doi:10.1109/TPAMI.2010.57. PMID   21088323.
  9. Douze, Matthijs; Jégou, Hervé; Perronnin, Florent (2016). "Polysemous Codes". Computer Vision – ECCV 2016. Lecture Notes in Computer Science. Vol. 9906. pp. 785–801. doi:10.1007/978-3-319-46475-6_48. ISBN   978-3-319-46474-9.
  10. Ge, Tiezheng; He, Kaiming; Ke, Qifa; Sun, Jian (April 2014). "Optimized Product Quantization". IEEE Transactions on Pattern Analysis and Machine Intelligence. 36 (4): 744–755. doi:10.1109/TPAMI.2013.240. PMID   26353197.
  11. Andre, Fabien; Kermarrec, Anne-Marie; Le Scouarnec, Nicolas (1 May 2021). "Quicker ADC : Unlocking the Hidden Potential of Product Quantization With SIMD". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (5): 1666–1677. arXiv: 1812.09162 . doi:10.1109/TPAMI.2019.2952606. PMID   31722477.
  12. Babenko, Artem; Lempitsky, Victor (June 2014). "Additive Quantization for Extreme Vector Compression". 2014 IEEE Conference on Computer Vision and Pattern Recognition. pp. 931–938. doi:10.1109/CVPR.2014.124. ISBN   978-1-4799-5118-5.
  13. Liu, Shicong; Lu, Hongtao; Shao, Junru (2015). "Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search". arXiv: 1509.05195 [cs.CV].
  14. Martinez, Julieta; Zakhmi, Shobhit; Hoos, Holger H.; Little, James J. (2018). "LSQ++: Lower Running Time and Higher Recall in Multi-codebook Quantization". Computer Vision – ECCV 2018. Lecture Notes in Computer Science. Vol. 11220. pp. 508–523. doi:10.1007/978-3-030-01270-0_30. ISBN   978-3-030-01269-4.
  15. Huijben, Iris A. M.; Douze, Matthijs; Muckley, Matthew; van Sloun, Ruud J. G.; Verbeek, Jakob (2024). "Residual Quantization with Implicit Neural Codebooks". arXiv: 2401.14732 [cs.LG].
  16. Sandhawalia, Harsimrat; Jegou, Herve (March 2010). "Searching with expectations". 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (PDF). pp. 1242–1245. doi:10.1109/ICASSP.2010.5495403. ISBN   978-1-4244-4295-9.
  17. "Faiss vector codecs". GitHub .
  18. Lin, Hangfei; Miao, Li; Ziai, Amir (2023). "RAFIC: Retrieval-Augmented Few-shot Image Classification". arXiv: 2312.06868 [cs.CV].
  19. "Perceptual hashing tools". GitHub .
  20. "Indexing 1T vectors". GitHub .
  21. "OpenSearch Approximate k-NN search".
  22. "Amazon OpenSearch Service now supports efficient vector query filters for Faiss".
  23. "Milvus Knowhere". GitHub .
  24. "Vearch: AI-native distributed vector database". GitHub .
  25. Harsha Vardhan Simhadri; Williams, George; Aumüller, Martin; Douze, Matthijs; Babenko, Artem; Baranchuk, Dmitry; Chen, Qi; Hosseini, Lucas; Krishnaswamy, Ravishankar; Srinivasa, Gopal; Suhas Jayaram Subramanya; Wang, Jingdong (2022). "Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search". arXiv: 2205.03763 [cs.LG].
  26. Harsha Vardhan Simhadri; Aumüller, Martin; Ingber, Amir; Douze, Matthijs; Williams, George; Magdalen Dobson Manohar; Baranchuk, Dmitry; Liberty, Edo; Liu, Frank; Landrum, Ben; Karjikar, Mazin; Dhulipala, Laxman; Chen, Meng; Chen, Yue; Ma, Rui; Zhang, Kai; Cai, Yuzheng; Shi, Jiayang; Chen, Yizhuo; Zheng, Weiguo; Wan, Zihao; Yin, Jie; Huang, Ben (2024). "Results of the Big ANN: NeurIPS'23 competition". arXiv: 2409.17424 [cs.IR].
  27. "Benchmarking nearest neighbors". GitHub .
  28. "annbench: a lightweight benchmark for approximate nearest neighbor search". GitHub .
  29. "Use a Faiss vector database with Haystack".
  30. "Faiss integration with Langchain".