Extended Boolean model

Last updated

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the Vector Space Model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the Standard Boolean model it wasn't. [1]

Contents

Thus, the extended Boolean model can be considered as a generalization of both the Boolean and vector space models; those two are special cases if suitable settings and definitions are employed. Further, research has shown effectiveness improves relative to that for Boolean query processing. Other research has shown that relevance feedback and query expansion can be integrated with extended Boolean query processing.

Definitions

In the Extended Boolean model, a document is represented as a vector (similarly to in the vector model). Each i dimension corresponds to a separate term associated with the document.

The weight of term Kx associated with document dj is measured by its normalized Term frequency and can be defined as:

where Idfx is inverse document frequency and fx,j the term frequency for term x in document j.

The weight vector associated with document dj can be represented as:

The 2 Dimensions Example

2D Extended Boolean model OR example.png
Figure 1: The similarities of q = (KxKy) with documents dj and dj+1.
2D Extended Boolean model AND example.png
Figure 2: The similarities of q = (KxKy) with documents dj and dj+1.

Considering the space composed of two terms Kx and Ky only, the corresponding term weights are w1 and w2. [2] Thus, for query qor = (KxKy), we can calculate the similarity with the following formula:

For query qand = (KxKy), we can use:

Generalizing the idea and P-norms

We can generalize the previous 2D extended Boolean model example to higher t-dimensional space using Euclidean distances.

This can be done using P-norms which extends the notion of distance to include p-distances, where 1 p is a new parameter. [3]

:

Examples

Consider the query q = (K1K2) K3. The similarity between query q and document d can be computed using the formula:

Improvements over the Standard Boolean Model

Lee and Fox [4] compared the Standard and Extended Boolean models with three test collections, CISI, CACM and INSPEC. Using P-norms they obtained an average precision improvement of 79%, 106% and 210% over the Standard model, for the CISI, CACM and INSPEC collections, respectively.
The P-norm model is computationally expensive because of the number of exponentiation operations that it requires but it achieves much better results than the Standard model and even Fuzzy retrieval techniques. The Standard Boolean model is still the most efficient.

Further reading

See also

Related Research Articles

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels.

<span class="mw-page-title-main">Chi-squared distribution</span> Probability distribution and special case of gamma distribution

In probability theory and statistics, the chi-squared distribution with degrees of freedom is the distribution of a sum of the squares of independent standard normal random variables. The chi-squared distribution is a special case of the gamma distribution and is one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing and in construction of confidence intervals. This distribution is sometimes called the central chi-squared distribution, a special case of the more general noncentral chi-squared distribution.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

A language model is a probability distribution over sequences of words. Given any sequence of words of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences, language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. It is used by many IR systems to this day. The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms.

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in .

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector , and an observation drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

In probability and statistics, the Hellinger distance is used to quantify the similarity between two probability distributions. It is a type of f-divergence. The Hellinger distance is defined in terms of the Hellinger integral, which was introduced by Ernst Hellinger in 1909.

SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model. SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores which are influenced by different factors, and can be solved in several ways, such as introducing an evidence weight factor, inserting additional terms that are neglected by SimRank or using PageRank-based alternatives.

Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the weights of the a, b, c variables listed below in the Algorithm section.

In natural language processing and information retrieval, cluster labeling is the problem of picking descriptive, human-readable labels for the clusters produced by a document clustering algorithm; standard clustering algorithms do not typically produce any such labels. Cluster labeling algorithms examine the contents of the documents per cluster to find a labeling that summarize the topic of each cluster and distinguish the clusters from each other.

Fuzzy retrieval techniques are based on the Extended Boolean model and the Fuzzy set theory. There are two classical fuzzy retrieval models: Mixed Min and Max (MMM) and the Paice model. Both models do not provide a way of evaluating query weights, however this is considered by the P-norms algorithm.

The Generalized vector space model is a generalization of the vector space model used in information retrieval. Wong et al. presented an analysis of the problems that the pairwise orthogonality assumption of the vector space model (VSM) creates. From here they extended the VSM to the generalized vector space model (GVSM).

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

The query likelihood model is a language model used in information retrieval. A language model is constructed for each document in the collection. It is then possible to rank each document by the probability of specific documents given a query. This is interpreted as being the likelihood of a document being relevant given a query.

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.

References

  1. Salton, Gerard; Fox, Edward A.; Wu, Harry (1983), "Extended Boolean information retrieval", Communications of the ACM, Communications of the ACM, Volume 26, Issue 11, 26 (11): 1022–1036, doi:10.1145/182.358466, hdl: 1813/6351 , S2CID   207180535
  2. "Lusheng Wang". Archived from the original on 2011-09-27. Retrieved 2009-12-01.
  3. Garcia, Dr. E., The Extended Boolean Model - Weighted Queries: Term Weights, p-Norm Queries and Multiconcept Types. Boolean OR Extended? AND that is the Query
  4. Lee, W. C.; Fox, E. A. (1988), Experimental Comparison of Schemes for Interpreting Boolean Queries (PDF)