Boolean model of information retrieval

The (standard) Boolean model of information retrieval (BIR) [1] is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. [2] The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms (a bag-of-words model). Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the Boolean conditions described by the query.

Definitions


An index term is a word or expression, which may be stemmed, describing or characterizing a document, such as a keyword given for a journal article. Let $T = \{t_1, t_2, \ldots, t_n\}$ be the set of all such index terms.

A document is any subset of $T$. Let $D$ be the set of all documents.


$T = \{t_1, t_2, \ldots, t_n\}$ is a series of words or small phrases (index terms). Each of those words or small phrases is named $t_i$, where $i$ is the number of the term in the series/list. You can think of $T$ as "Terms" and $t_n$ as "index term $n$".

The words or small phrases (index terms $t_i$) can exist in documents. These documents then form a series/list $D$ where each individual document is called $D_j$. These documents ($D_j$) can contain words or small phrases (index terms $t_i$); for example, $D_1$ could contain the terms $t_1$ and $t_2$ from $T$. There is an example of this in the following section.

Index terms are generally chosen to represent words that carry more meaning and correspond to what the content of an article or document might be about. Terms like "the" and "like" appear in nearly all documents, whereas "Bayesian" appears in only a small fraction of documents. Therefore, rarer terms like "Bayesian" are a better choice to be selected in the sets. This relates to entropy (information theory). There are multiple types of operations that can be applied to index terms used in queries to make them more generic and more relevant. One such operation is stemming.


A query is a Boolean expression $Q$ in normal form: $Q = (W_1 \lor W_2 \lor \cdots) \land \cdots \land (W_i \lor W_{i+1} \lor \cdots)$, where $W_i$ is true for $D_j$ when $t_i \in D_j$. (Equivalently, $Q$ could be expressed in disjunctive normal form.)

Any query is a selection of index terms ($t_i$ or $W_i$) picked from the set $T$ of terms, which are combined using Boolean operators to form a set of conditions.

These conditions are then applied to the set $D$ of documents, which contain the same index terms ($t_i$) from the set $T$.

We seek to find the set of documents that satisfy $Q$. This operation is called retrieval and consists of the following two steps:

  1. For each $W_i$ in $Q$, find the set $S_i$ of documents that satisfy $W_i$: $S_i = \{ D_j \mid W_i \}$
  2. Then the set of documents that satisfy $Q$ is given by: $(S_1 \cup S_2 \cup \cdots) \cap \cdots \cap (S_i \cup S_{i+1} \cup \cdots)$, where $\cup$ means OR and $\cap$ means AND as Boolean operators.
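A minimal sketch of this two-step procedure in Python, assuming documents are stored as sets of index terms; the names `retrieve` and `query_cnf` are illustrative, not part of the model:

```python
def retrieve(documents, query_cnf):
    """Evaluate a Boolean query given in conjunctive normal form.

    documents: dict mapping a document id to its set of index terms.
    query_cnf: list of clauses; each clause is a list of terms joined
    by OR, and the clauses themselves are joined by AND.
    """
    result = set(documents)  # start from the ids of all documents
    for clause in query_cnf:
        # Step 1: S_i = documents satisfying at least one term of the clause.
        s_i = {doc_id for doc_id, terms in documents.items()
               if any(term in terms for term in clause)}
        # Step 2: AND across clauses is set intersection.
        result &= s_i
    return result
```

For example, the query ("probability" AND "decision-making") would be passed as `query_cnf = [["probability"], ["decision-making"]]`.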

Example

Let the set of original (real) documents be, for example, $D = \{D_1, D_2, D_3\}$

where

= "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

= "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."

= "Bayesian epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."

Let the set of terms be: $T = \{t_1 = \text{"Bayes' principle"},\ t_2 = \text{"probability"},\ t_3 = \text{"decision-making"},\ t_4 = \text{"Bayesian epistemology"}\}$

Then, the set of documents is as follows: $D = \{D_1, D_2, D_3\}$

where

$D_1 = \{t_1, t_2\} = \{\text{"Bayes' principle"}, \text{"probability"}\}$
$D_2 = \{t_2, t_3\} = \{\text{"probability"}, \text{"decision-making"}\}$
$D_3 = \{t_2, t_4\} = \{\text{"probability"}, \text{"Bayesian epistemology"}\}$

Let the query $Q$ be ("probability" AND "decision-making"): $Q = \text{"probability"} \land \text{"decision-making"}$

Then to retrieve the relevant documents:

  1. Firstly, the following sets $S_1$ and $S_2$ of documents are obtained (retrieved): $S_1 = \{D_1, D_2, D_3\}$ and $S_2 = \{D_2\}$, where $S_1$ corresponds to the documents which contain the term "probability" and $S_2$ to those which contain the term "decision-making".
  2. Finally, the following documents are retrieved in response to $Q$: $S_1 \cap S_2 = \{D_1, D_2, D_3\} \cap \{D_2\} = \{D_2\}$, where the query looks for documents that are contained in both sets using the intersection operator.

This means that the original document $D_2$ is the answer to $Q$.

If there is more than one document with the same representation (the same subset of index terms $t_i$), every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).
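The example can be replayed directly in code. A self-contained sketch, assuming the document representations given above:

```python
# Document representations from the example: each document is the
# set of index terms it contains.
docs = {
    "D1": {"Bayes' principle", "probability"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "Bayesian epistemology"},
}

# Q = "probability" AND "decision-making"
s1 = {d for d, terms in docs.items() if "probability" in terms}      # S1 = {'D1', 'D2', 'D3'}
s2 = {d for d, terms in docs.items() if "decision-making" in terms}  # S2 = {'D2'}

print(s1 & s2)  # the intersection implements AND -> {'D2'}
```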

Advantages

The BIR rests on a clean formalism: Boolean logic and classical set theory are well understood, the model is simple to implement, and its results are exact and easy to explain, since a document either satisfies the query expression or it does not.

Disadvantages

Exact matching may retrieve too few or too many documents: the result set of a Boolean query is often either too small or too big. Because all retrieved documents satisfy the query equally, the model provides no ranking of the results, and it does not consider term weights in queries. Formulating complex information needs as Boolean expressions can also be difficult for users.

Data structures and algorithms

From a purely formal, mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems must be solved that relate to algorithms and data structures, such as, for example, the choice of terms (manual or automatic selection or both), stemming, hash tables, inverted file structure, and so on. [4]

Hash sets

Another possibility is to use hash sets. Each document is represented by a hash table which contains every single term of that document. Since hash table size grows and shrinks in real time with the addition and removal of terms, each document occupies much less space in memory. However, performance suffers because the operations are more complex than with bit vectors: in the worst case, performance can degrade from O(n) to O(n²). In the average case, the slowdown is not much worse than with bit vectors, and the space usage is much more efficient.
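A brief sketch of this representation, using Python's built-in `set` (a hash table) as the per-document structure; the helper `satisfies` is an illustrative name, not an established API:

```python
# Each document is a hash set of its terms, so a membership test
# ("does this document contain term t?") is O(1) on average.
doc1 = {"bayes", "principle", "probability"}
doc2 = {"bayesian", "decision-making", "probability"}

def satisfies(doc_terms, required, forbidden=()):
    """True if the document contains every required term (AND)
    and none of the forbidden ones (NOT)."""
    return (all(t in doc_terms for t in required)
            and not any(t in doc_terms for t in forbidden))

print(satisfies(doc1, ["probability"], forbidden=["decision-making"]))  # True
print(satisfies(doc2, ["probability", "decision-making"]))              # True
```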

Signature file

Each document can be summarized by a Bloom filter representing the set of words in that document, stored in a fixed-length bitstring, called a signature. The signature file contains one such superimposed code bitstring for every document in the collection. Each query can also be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature. [5] [6] [7]

The signature file approach is used in BitFunnel.
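A toy sketch of the idea, assuming 64-bit signatures and two hash probes per word (both parameters are illustrative choices, not prescribed by the technique):

```python
import hashlib

SIG_BITS = 64  # fixed signature length (illustrative)

def bit_positions(word, probes=2):
    # Derive `probes` bit positions for a word from a cryptographic digest.
    digest = hashlib.sha256(word.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % SIG_BITS
            for i in range(probes)]

def signature(words):
    # Superimpose (OR together) the codes of all words in the set.
    sig = 0
    for word in words:
        for pos in bit_positions(word):
            sig |= 1 << pos
    return sig

def may_match(doc_sig, query_sig):
    # A document may match if its signature covers every query bit.
    # False positives are possible; false negatives are not.
    return (doc_sig & query_sig) == query_sig

doc_sig = signature({"probability", "decision-making", "utility"})
query_sig = signature({"probability", "decision-making"})
print(may_match(doc_sig, query_sig))  # True (possibly a false positive)
```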

Inverted file

An inverted index file contains two parts: a vocabulary containing all the terms used in the collection, and for each distinct term an inverted index that lists every document that mentions that term. [5] [6]
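A minimal sketch of such an index, with posting-list intersection implementing an AND query; `build_inverted_index` is an illustrative name:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping a document id to its set of terms.
    Returns the vocabulary as a mapping from each distinct term to the
    set of documents (its posting list) that mention it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

docs = {
    "D1": {"bayes", "probability"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "bayesian", "epistemology"},
}
index = build_inverted_index(docs)

# AND is the intersection of posting lists; OR would be their union.
print(index["probability"] & index["decision-making"])  # {'D2'}
```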

References

  1. Lancaster, F. W.; Fayen, E. G. (1973). Information Retrieval On-Line. Los Angeles, California: Melville Publishing Co.
  2. "Information Retrieval". MIT Press. Retrieved 2023-12-09.
  3. Shokraneh, Farhad (6 August 2024). "Stop searching and you will find it: Search-Resistant Concepts in systematic review searches". BMJ Evidence-Based Medicine. doi:10.1136/bmjebm-2023-112798.
  4. Wartik, Steven (1992). "Boolean operations". Information Retrieval: Data Structures & Algorithms. Prentice-Hall. ISBN 0-13-463837-9. Archived from the original on 2013-09-28.
  5. Zobel, Justin; Moffat, Alistair; Ramamohanarao, Kotagiri. "Inverted Files Versus Signature Files for Text Indexing".
  6. Goodwin, Bob; et al. (2017). "BitFunnel: Revisiting Signatures for Search".
  7. Startin, Richard. "Bit-Sliced Signatures and Bloom Filters".