Vector space model

Last updated May 11, 2024

Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System ^{[ citation needed ]}.

Definitions

In this section we consider a particular vector space model based on the bag-of-words representation. Documents and queries are represented as vectors.

d_{j}=(w_{1,j},w_{2,j},\dotsc ,w_{n,j})

q=(w_{1,q},w_{2,q},\dotsc ,w_{n,q})

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).

The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

Vector operations can be used to compare documents with queries.^[1]

Applications

Candidate documents from the corpus can be retrieved and ranked using a variety of methods. Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as a vector with same dimension as the vectors that represent the other documents.

In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

\cos {\theta }={\frac {\mathbf {d_{2}} \cdot \mathbf {q} }{\left\|\mathbf {d_{2}} \right\|\left\|\mathbf {q} \right\|}}

Where $\mathbf {d_{2}} \cdot \mathbf {q}$ is the intersection (i.e. the dot product) of the document (d₂ in the figure to the right) and the query (q in the figure) vectors, $\left\|\mathbf {d_{2}} \right\|$ is the norm of vector d₂, and $\left\|\mathbf {q} \right\|$ is the norm of vector q. The norm of a vector is calculated as such:

\left\|\mathbf {q} \right\|={\sqrt {\sum _{i=1}^{n}q_{i}^{2}}}

Using the cosine the similarity between document d_j and query q can be calculated as:

\mathrm {cos} (d_{j},q)={\frac {\mathbf {d_{j}} \cdot \mathbf {q} }{\left\|\mathbf {d_{j}} \right\|\left\|\mathbf {q} \right\|}}={\frac {\sum _{i=1}^{N}d_{i,j}q_{i}}{{\sqrt {\sum _{i=1}^{N}d_{i,j}^{2}}}{\sqrt {\sum _{i=1}^{N}q_{i}^{2}}}}}

As all vectors under consideration by this model are element-wise nonnegative, a cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered). See cosine similarity for further information.^[1]

Term frequency-inverse document frequency weights

In the classic vector space model proposed by Salton, Wong and Yang ^[2] the term-specific weights in the document vectors are products of local and global parameters. The model is known as term frequency-inverse document frequency model. The weight vector for document d is $\mathbf {v} _{d}=[w_{1,d},w_{2,d},\ldots ,w_{N,d}]^{T}$ , where

w_{t,d}=\mathrm {tf} _{t,d}\cdot \log {\frac {|D|}{|\{d'\in D\,|\,t\in d'\}|}}

and

$\mathrm {tf} _{t,d}$ is term frequency of term t in document d (a local parameter)
$\log {\frac {|D|}{|\{d'\in D\,|\,t\in d'\}|}}$ is inverse document frequency (a global parameter). $|D|$ is the total number of documents in the document set; $|\{d'\in D\,|\,t\in d'\}|$ is the number of documents containing the term t.

Advantages

The vector space model has the following advantages over the Standard Boolean model:

Allows ranking documents according to their possible relevance
Allows retrieving items with a partial term overlap^[1]

Most of these advantages are a consequence of the difference in the density of the document collection representation between Boolean and term frequency-inverse document frequency approaches. When using Boolean weights, any document lies in a vertex in a n-dimensional hypercube. Therefore, the possible document representations are $2^{n}$ and the maximum Euclidean distance between pairs is ${\sqrt {n}}$ . As documents are added to the document collection, the region defined by the hypercube's vertices become more populated and hence denser. Unlike Boolean, when a document is added using term frequency-inverse document frequency weights, the inverse document frequencies of the terms in the new document decrease while that of the remaining terms increase. In average, as documents are added, the region where documents lie expands regulating the density of the entire collection representation. This behavior models the original motivation of Salton and his colleagues that a document collection represented in a low density region could yield better retrieval results.

Limitations

The vector space model has the following limitations:

Query terms are assumed to be independent, so phrases might not be represented well in the ranking
Semantic sensitivity; documents with similar context but different term vocabulary won't be associated^[1]

Many of these difficulties can, however, be overcome by the integration of various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.

Models based on and extending the vector space model

Models based on and extending the vector space model include:

Software that implements the vector space model

The following software packages may be of interest to those wishing to experiment with vector models and implement search services based upon them.

Free open source software

Apache Lucene. Apache Lucene is a high-performance, open source, full-featured text search engine library written entirely in Java.
OpenSearch (software) and Solr : the 2 most famous search engine software (many smaller exist) based on Lucene.
Gensim is a Python+NumPy framework for Vector Space modelling. It contains incremental (memory-efficient) algorithms for term frequency-inverse document frequency, Latent Semantic Indexing, Random Projections and Latent Dirichlet Allocation.
Weka. Weka is a popular data mining package for Java including WordVectors and Bag Of Words models.
Word2vec. Word2vec uses vector spaces for word embeddings.

Related Research Articles

In mathematics, an operator is generally a mapping or function that acts on elements of a space to produce elements of another space. There is no general definition of an operator, but the term is often used in place of function when the domain is a set of functions or other structured objects. Also, the domain of an operator is often difficult to characterize explicitly, and may be extended so as to act on related objects.

Flux describes any effect that appears to pass or travel through a surface or substance. Flux is a concept in applied mathematics and vector calculus which has many applications to physics. For transport phenomena, flux is a vector quantity, describing the magnitude and direction of the flow of a substance or property. In vector calculus flux is a scalar quantity, defined as the surface integral of the perpendicular component of a vector field over a surface.

In mathematics, the complex conjugate of a complex number is the number with an equal real part and an imaginary part equal in magnitude but opposite in sign. That is, if $and are real numbers then the complex conjugate of is The complex conjugate of is often denoted as or .$

In mechanics and geometry, the 3D rotation group, often denoted SO(3), is the group of all rotations about the origin of three-dimensional Euclidean space $under the operation of composition.$

In physics, especially in multilinear algebra and tensor analysis, covariance and contravariance describe how the quantitative description of certain geometric or physical entities changes with a change of basis. Briefly, a contravariant vector is a list of numbers that transforms oppositely to a change of basis, and a covariant vector is a list of numbers that transforms in the same way. Contravariant vectors are often just called vectors and covariant vectors are called covectors or dual vectors. The terms covariant and contravariant were introduced by James Joseph Sylvester in 1851.

Gerard A. "Gerry" Salton was a professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time, and "the father of Information Retrieval". His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard. It was the very first system to use the now popular vector space model for information retrieval.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In plasmas and electrolytes, the Debye length, is a measure of a charge carrier's net electrostatic effect in a solution and how far its electrostatic effect persists. With each Debye length the charges are increasingly electrically screened and the electric potential decreases in magnitude by 1/e. A Debye sphere is a volume whose radius is the Debye length. Debye length is an important parameter in plasma physics, electrolytes, and colloids. The corresponding Debye screening wave vector $for particles of density, charge at a temperature is given by in Gaussian units. Expressions in MKS units will be given below. The analogous quantities at very low temperatures are known as the Thomas-Fermi length and the Thomas-Fermi wave vector. They are of interest in describing the behaviour of electrons in metals at room temperature.$

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. It was often used as a weighting factor in searches of information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf.

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions described by the query.

<span class="mw-page-title-main">Hyperboloid model</span> Model of n-dimensional hyperbolic geometry

In geometry, the hyperboloid model, also known as the Minkowski model after Hermann Minkowski, is a model of n-dimensional hyperbolic geometry in which points are represented by points on the forward sheet S⁺ of a two-sheeted hyperboloid in (n+1)-dimensional Minkowski space or by the displacement vectors from the origin to those points, and m-planes are represented by the intersections of (m+1)-planes passing through the origin in Minkowski space with S⁺ or by wedge products of m vectors. Hyperbolic space is embedded isometrically in Minkowski space; that is, the hyperbolic distance function is inherited from Minkowski space, analogous to the way spherical distance is inherited from Euclidean distance when the n-sphere is embedded in (n+1)-dimensional Euclidean space.

In geometry, various formalisms exist to express a rotation in three dimensions as a mathematical transformation. In physics, this concept is applied to classical mechanics where rotational kinematics is the science of quantitative description of a purely rotational motion. The orientation of an object at a given instant is described with the same tools, as it is defined as an imaginary rotation from a reference placement in space, rather than an actually observed rotation from a previous placement in space.

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval $For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in .$

Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query $q$ and a collection $D$ of documents that match the query, the problem is to rank, that is, sort, the documents in $D$ according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Zero-forcing precoding is a method of spatial signal processing by which a multiple antenna transmitter can null the multiuser interference in a multi-user MIMO wireless communication system. When the channel state information is perfectly known at the transmitter, the zero-forcing precoder is given by the pseudo-inverse of the channel matrix. Zero-forcing has been used in LTE mobile networks.

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the Vector Space Model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the Standard Boolean model it wasn't.

The Generalized vector space model is a generalization of the vector space model used in information retrieval. Wong et al. presented an analysis of the problems that the pairwise orthogonality assumption of the vector space model (VSM) creates. From here they extended the VSM to the generalized vector space model (GVSM).

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.

In physics and geometry, there are two closely related vector spaces, usually three-dimensional but in general of any finite dimension. Position space is the set of all position vectorsr in Euclidean space, and has dimensions of length; a position vector defines a point in space. Momentum space is the set of all momentum vectorsp a physical system can have; the momentum vector of a particle corresponds to its motion, with units of [mass][length][time]⁻¹.

References

1 2 3 4 Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. (2016). Information retrieval: implementing and evaluating search engines (First MIT Press paperback ed.). Cambridge, Massachusetts London, England: The MIT Press. ISBN 978-0-262-52887-0.
↑ G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, v.18 n.11, p.613–620, Nov. 1975

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:0-1] 1 2 3 4 Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. (2016). Information retrieval: implementing and evaluating search engines (First MIT Press paperback ed.). Cambridge, Massachusetts London, England: The MIT Press. ISBN 978-0-262-52887-0.

[2] G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, v.18 n.11, p.613–620, Nov. 1975

[1]

[2]