Vector database

Last updated

A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor (ANN) algorithms, [1] [2] so that one can search the database with a query vector to retrieve the closest matching database records.

Contents

Vectors are mathematical representations of data in a high-dimensional space. In this space, each dimension corresponds to a feature of the data, with the number of dimensions ranging from a few hundred to tens of thousands, depending on the complexity of the data being represented. A vector's position in this space represents its characteristics. Words, phrases, or entire documents, as well as images, audio, and other types of data, can all be vectorized. [3]

These feature vectors may be computed from the raw data using machine learning methods such as feature extraction algorithms, word embeddings [4] or deep learning networks. The goal is that semantically similar data items receive feature vectors close to each other.

Vector databases can be used for similarity search, multi-modal search, recommendations engines, large language models (LLMs), etc. [5]

Vector databases are also often used to implement Retrieval-Augmented Generation (RAG), a method to improve domain-specific responses of large language models. The retrieval component of a RAG can be any search system, but is most often implemented as a vector database. Text documents describing the domain of interest are collected, and for each document or document section, a feature vector (known as an "embedding") is computed, typically using a deep learning network, and stored in a vector database. Given a user prompt, the feature vector of the prompt is computed, and the database is queried to retrieve the most relevant documents. These are then automatically added into the context window of the large language model, and the large language model proceeds to create a response to the prompt given this context. [6]

Techniques

The most important techniques for similarity search on high-dimensional vectors include:

and combinations of these techniques.[ citation needed ]

In recent benchmarks, HNSW-based implementations have been among the best performers. [7] [8] Conferences such as the International Conference on Similarity Search and Applications, SISAP and the Conference on Neural Information Processing Systems (NeurIPS) host competitions on vector search in large databases.

Implementations

NameLicense
Apache Cassandra [9] [10] Apache License 2.0
Chroma [11] [12] Apache License 2.0 [13]
Azure Cosmos DB [14] Proprietary (Managed Service)
Couchbase [15] [16] BSL 1.1 [17]
Elasticsearch [18] Server Side Public License, Elastic License [19]
HDF5 Query Indexing [20] BSD 3-Clause [21]
Lantern [22] BSL 1.1 [23]
LlamaIndex [24] MIT License [25]
Milvus [26] [27] Apache License 2.0
MongoDB Atlas [28] Server Side Public License (Managed service)
Neo4j [29] [30] GPL v3 (Community Edition) [31]
OpenSearch [32] [33] [34] Apache License 2.0 [35]
Pinecone [36] Proprietary (Managed Service)
Postgres with pgvector [37] PostgreSQL License [38]
Qdrant [39] Apache License 2.0 [40]
Redis Stack [41] [42] Redis Source Available License [43]
Snowflake [44] Proprietary (Managed Service)
SurrealDB [45] BSL 1.1 [46]
Vespa [47] Apache License 2.0 [48]
Weaviate [49] BSD 3-Clause [50]

See also

Related Research Articles

A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on strict theory to retrieve information. A well known example is the Structured Query Language (SQL).

A time series database is a software system that is optimized for storing and serving time series through associated pairs of time(s) and value(s). In some fields, time series may be called profiles, curves, traces or trends. Several early time series databases are associated with industrial applications which could efficiently store measured values from sensory equipment, but now are used in support of a much wider range of applications. In many cases, the repositories of time-series data will utilize compression algorithms to manage the data efficiently. Although it is possible to store time-series data in many different database types, the design of these systems with time as a key index is distinctly different from relational databases which reduce discrete relationships through referential models.

A spatial database is a general-purpose database that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data.

A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are perpetually stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data.

Redis is a source-available, in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Redis offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Redis is the most popular NoSQL database, and one of the most popular databases overall. Redis is used in companies like Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, Amazon and OpenAI.

FlockDB was an open-source distributed, fault-tolerant graph database for managing wide but shallow network graphs. It was initially used by Twitter to store relationships between users, e.g. followings and favorites. FlockDB differs from other graph databases, e.g. Neo4j in that it was not designed for multi-hop graph traversal but rather for rapid set operations, not unlike the primary use-case for Redis sets. FlockDB was posted on GitHub shortly after Twitter released its Gizzard framework, which it used to query the FlockDB distributed datastore. The database is licensed under the Apache License.

<span class="mw-page-title-main">Couchbase Server</span> Open-source NoSQL database

Couchbase Server, originally known as Membase, is a source-available, distributed multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serve many concurrent users by creating, storing, retrieving, aggregating, manipulating and presenting data. In support of these kinds of application needs, Couchbase Server is designed to provide easy-to-scale key-value, or JSON document access, with low latency and high sustainability throughput. It is designed to be clustered from a single machine to very large-scale deployments spanning many machines.

<span class="mw-page-title-main">Open-core model</span> Business model monetizing commercial open-source software

The open-core model is a business model for the monetization of commercially produced open-source software. The open-core model primarily involves offering a "core" or feature-limited version of a software product as free and open-source software, while offering "commercial" versions or add-ons as proprietary software. The term was coined by Andrew Lampitt in 2008.

<span class="mw-page-title-main">Elasticsearch</span> Search engine

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-licensed under the (source-available) Server Side Public License and the Elastic license, while other parts fall under the proprietary (source-available) Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.

Cypher is a declarative graph query language that allows for expressive and efficient data querying in a property graph.

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare the relative performance of NoSQL database management systems.

<span class="mw-page-title-main">ClickHouse</span> Open-source database management system

ClickHouse is an open-source column-oriented DBMS for online analytical processing (OLAP) that allows users to generate analytical reports using SQL queries in real-time. ClickHouse Inc. is headquartered in the San Francisco Bay Area with the subsidiary, ClickHouse B.V., based in Amsterdam, Netherlands.

<span class="mw-page-title-main">GraphBLAS</span> API for graph data and graph operations

GraphBLAS is an API specification that defines standard building blocks for graph algorithms in the language of linear algebra. GraphBLAS is built upon the notion that a sparse matrix can be used to represent graphs as either an adjacency matrix or an incidence matrix. The GraphBLAS specification describes how graph operations can be efficiently implemented via linear algebraic methods over different semirings.

deepset German natural language processing startup

deepset is an enterprise software vendor that provides developers with the tools to build production-ready natural language processing (NLP) systems. It was founded in 2018 in Berlin by Milos Rusic, Malte Pietsch, and Timo Möller. deepset authored and maintains the open source software Haystack and its commercial SaaS offering deepset Cloud.

Llama is a family of autoregressive large language models released by Meta AI starting in February 2023. The latest version is Llama 3 released in April 2024.

LangChain is a framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.

<span class="mw-page-title-main">Hierarchical navigable small world</span> Clustering and community detection algorithm

The Hierarchical navigable small world (HNSW) algorithm is a graph-based approximate nearest neighbor search technique used in many vector databases. Nearest neighbor search without an index involves computing the distance from the query to each point in the database, which for large datasets is computationally prohibitive. For high-dimensional data, tree-based exact vector search techniques such as the k-d tree and R-tree do not perform well enough because of the curse of dimensionality. To remedy this, approximate k-nearest neighbor searches have been proposed, such as locality-sensitive hashing (LSH) and product quantization (PQ) that trade performance for accuracy. The HNSW graph offers an approximate k-nearest neighbor search which scales logarithmically even in high-dimensional data.

<span class="mw-page-title-main">Valkey</span> Freely available in-memory key–value database

Valkey is an open-source in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Valkey offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Valkey is the successor to Redis, the most popular NoSQL database, and one of the most popular databases overall. Valkey or its predecessor Redis are used in companies like Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, Amazon and OpenAI.

References

  1. Roie Schwaber-Cohen. "What is a Vector Database & How Does it Work". Pinecone. Retrieved 18 November 2023.
  2. "What is a vector database". Elastic . Retrieved 18 November 2023.
  3. "Vector database". learn.microsoft.com. 2023-12-26. Retrieved 2024-01-11.
  4. Evan Chaki (2023-07-31). "What is a vector database?". Microsoft. A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes.
  5. "Vector database". learn.microsoft.com. 2023-12-26. Retrieved 2024-01-11.
  6. Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich (2020). "Retrieval-augmented generation for knowledge-intensive NLP tasks". Advances in Neural Information Processing Systems 33: 9459–9474. arXiv: 2005.11401 .
  7. Aumüller, Martin; Bernhardsson, Erik; Faithfull, Alexander (2017), Beecks, Christian; Borutta, Felix; Kröger, Peer; Seidl, Thomas (eds.), "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms", Similarity Search and Applications, vol. 10609, Cham: Springer International Publishing, pp. 34–49, arXiv: 1807.05614 , doi:10.1007/978-3-319-68474-1_3, ISBN   978-3-319-68473-4 , retrieved 2024-03-19
  8. Aumüller, Martin; Bernhardsson, Erik; Faithfull, Alexander (2017). "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms". In Beecks, Christian; Borutta, Felix; Kröger, Peer; Seidl, Thomas (eds.). Similarity Search and Applications. Lecture Notes in Computer Science. Vol. 10609. Cham: Springer International Publishing. pp. 34–49. arXiv: 1807.05614 . doi:10.1007/978-3-319-68474-1_3. ISBN   978-3-319-68474-1.
  9. "5 Hard Problems in Vector Search, and How Cassandra Solves Them". TheNewStack. 2023-09-22. Retrieved 2023-09-22.
  10. "Vector Search quickstart" . Retrieved 2023-11-21.
  11. Palazzolo, Stephanie. "Vector database Chroma scored $18 million in seed funding at a $75 million valuation. Here's why its technology is key to helping generative AI startups". Business Insider. Retrieved 2023-11-16.
  12. MSV, Janakiram (2023-07-28). "Exploring Chroma: The Open Source Vector Database for LLMs". The New Stack. Retrieved 2023-11-16.
  13. "chroma/LICENSE at main · chroma-core/chroma". GitHub.
  14. "Vector database". learn.microsoft.com. 26 December 2023. Retrieved 2024-01-10.
  15. "Couchbase aims to boost developer database productivity with Capella IQ AI tool". VentureBeat. 2023-08-30.
  16. "Investor Presentation Third Quarter Fiscal 2024". Couchbase Investor Relations. 2023-12-06.
  17. Anderson, Scott (2021-03-26). "Couchbase Adopts BSL License". The Couchbase Blog. Retrieved 2024-02-14.
  18. Kerner, Sean (23 May 2023). "Elasticsearch Relevance Engine brings new vectors to generative AI". VentureBeat . Retrieved 18 November 2023.
  19. "elasticsearch/LICENSE.txt at main · elastic/elasticsearch". GitHub.
  20. "HDF5 Query Indexing". GitHub . 27 Sep 2019. Retrieved 3 May 2024.
  21. "HDFGroup/COPYING at master · HDFGroup/hdf5". GitHub. Retrieved 2023-10-29.
  22. "Lantern". 2024-04-05. Retrieved 2024-04-05.
  23. "lantern/LICENSE at main /lanterndata/lantern". GitHub. Retrieved 2024-04-10.
  24. Wiggers, Kyle (2023-06-06). "LlamaIndex adds private data to large language models". TechCrunch. Retrieved 2023-10-29.
  25. "llama_index/LICENSE at main · run-llama/llama_index". GitHub. Retrieved 2023-10-29.
  26. "Open Source Vector Database – Milvus – LFAI & DATA" . Retrieved 29 October 2023.
  27. Liao, Ingrid Lunden and Rita (2022-08-24). "Zilliz raises $60M, relocates to SF". TechCrunch. Retrieved 2023-10-29.
  28. "Introducing Atlas Vector Search: Build Intelligent Applications with Semantic Search and AI Over Any Type of Data". MongoDB. 2023-06-22.
  29. "Neo4j enhances its graph database with vector search". itbrief. 2023-08-22.
  30. "Vector search indexes". neo4j.
  31. "Neo4j licensing".
  32. "Using OpenSearch as a Vector Database". OpenSearch.org. 2023-08-02. Retrieved 2024-02-07.
  33. Pan, James Jie; Wang, Jianguo; Li, Guoliang (2023-10-21), Survey of Vector Database Management Systems, arXiv: 2310.14021
  34. "AWS debuts new AI-powered data management and analysis tools". SiliconANGLE. 2023-07-26. Retrieved 2024-02-07.
  35. "OpenSearch license". github.
  36. "Pinecone leads 'explosion' in vector databases for generative AI". VentureBeat. 2023-07-14. Retrieved 2023-10-29.
  37. "pgvector". GitHub. Retrieved 2023-11-27.
  38. "pgvector/License". GitHub. Retrieved 2023-11-27.
  39. Sawers, Paul (2023-04-19). "Qdrant, an open-source vector database startup, wants to help AI developers leverage unstructured data". TechCrunch. Retrieved 2023-10-29.
  40. "qdrant/LICENSE at master · qdrant/qdrant". GitHub. Retrieved 2023-10-29.
  41. "Using Redis as a Vector Database with OpenAI | OpenAI Cookbook". cookbook.openai.com. Retrieved 2024-02-10.
  42. "Redis as a vector database quick start guide". Redis. Retrieved 2024-01-31.
  43. "Search and query". Redis. Retrieved 2024-02-10.
  44. "Vector data type and vector similarity functions — General Availability". Snowflake. 2024-05-17. Retrieved 2024-05-17.
  45. Wiggers, Kyle (2023-01-04). "SurrealDB raises $6M for its database-as-a-service offering". TechCrunch. Retrieved 2024-01-19.
  46. "SurrealDB | License FAQs | The ultimate multi-model database". SurrealDB. Retrieved 2024-02-14.
  47. Riley, Duncan (4 October 2023). "Yahoo spins off AI scaling engine Vespa as an independent company". siliconANGLE. Retrieved 18 November 2023.
  48. "vespa/LICENSE at master · vespa-engine/vespa". GitHub.
  49. "Weaviate reels in $50M for its AI-optimized vector database". SiliconANGLE. 2023-04-21. Retrieved 2023-10-29.
  50. "weaviate/LICENSE at master · weaviate/weaviate". GitHub. Retrieved 2023-10-29.