Binary Independence Model

The Binary Independence Model (BIM) [1] [2] in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions that make estimating document/query similarity feasible.

Definitions

The Binary Independence Assumption is that documents are binary vectors. That is, only the presence or absence of terms in documents is recorded. Terms are independently distributed in the set of relevant documents and they are also independently distributed in the set of irrelevant documents. The representation is an ordered set of Boolean variables. That is, the representation of a document or query is a vector with one Boolean element for each term under consideration. More specifically, a document is represented by a vector $d = (x_1, \ldots, x_m)$ where $x_t = 1$ if term $t$ is present in the document $d$ and $x_t = 0$ if it is not. Many documents can have the same vector representation with this simplification. Queries are represented in a similar way.

"Independence" signifies that terms in the document are considered independently of each other and no association between terms is modeled. This assumption is very limiting, but it has been shown that it gives good enough results for many situations. This independence is the "naive" assumption of a Naive Bayes classifier, where properties that imply each other are nonetheless treated as independent for the sake of simplicity. This assumption allows the representation to be treated as an instance of a vector space model by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.
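
The following is a minimal sketch, in Python, of building such binary term-presence vectors; the toy documents, the query, and the helper name to_binary_vector are invented for this example.

    # Build binary (presence/absence) vectors over a fixed vocabulary.
    docs = [
        "information retrieval with probabilistic models",
        "boolean retrieval of documents",
        "probabilistic ranking of documents by relevance",
    ]

    # One dimension per distinct term, in a fixed order.
    vocab = sorted({term for doc in docs for term in doc.split()})

    def to_binary_vector(text):
        """Return x = (x_1, ..., x_m) with x_t = 1 iff term t occurs in the text."""
        present = set(text.split())
        return [1 if term in present else 0 for term in vocab]

    doc_vectors = [to_binary_vector(doc) for doc in docs]
    query_vector = to_binary_vector("probabilistic retrieval")   # queries use the same representation

Two documents containing exactly the same set of terms map to the same vector, illustrating the simplification noted above.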

The probability $P(R\mid d,q)$ that a document $d$ is relevant for a query $q$ derives from the probability of relevance of the term vector $x$ of that document, $P(R\mid x,q)$. By using the Bayes rule we get:

$$P(R\mid x,q) = \frac{P(x\mid R,q)\,P(R\mid q)}{P(x\mid q)}$$

where $P(x\mid R=1,q)$ and $P(x\mid R=0,q)$ are the probabilities of observing the representation $x$ given a relevant or a nonrelevant document, respectively. The exact probabilities cannot be known beforehand, so estimates from statistics about the collection of documents must be used.

$P(R=1\mid q)$ and $P(R=0\mid q)$ indicate the prior probability of retrieving a relevant or nonrelevant document respectively for a query $q$. If, for instance, we knew the percentage of relevant documents in the collection, then we could use it to estimate these probabilities. Since a document is either relevant or nonrelevant to a query we have that:

$$P(R=1\mid x,q) + P(R=0\mid x,q) = 1$$
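
Under the Binary Independence Assumption, the likelihood of a document vector factors over its terms; the following standard step (a sketch, not spelled out above) shows how ranking reduces to a product over terms:

$$P(x \mid R,q) = \prod_{t=1}^{m} P(x_t \mid R,q)$$

so that the odds of relevance used to rank a document become

$$O(R \mid x,q) = \frac{P(R=1\mid x,q)}{P(R=0\mid x,q)} = \frac{P(R=1\mid q)}{P(R=0\mid q)} \prod_{t=1}^{m} \frac{P(x_t \mid R=1,q)}{P(x_t \mid R=0,q)}$$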

Query Terms Weighting

Given a binary query and the dot product as the similarity function between a document and a query, the problem is to assign weights to the terms in the query such that the retrieval effectiveness will be high. Let $p_i$ and $q_i$ be the probability that a relevant document and an irrelevant document, respectively, contains the $i$-th term. Yu and Salton, [1] who first introduced BIM, propose that the weight of the $i$-th term is an increasing function of $\frac{p_i(1-q_i)}{(1-p_i)\,q_i}$. Thus, if $\frac{p_i(1-q_i)}{(1-p_i)\,q_i}$ is higher than $\frac{p_j(1-q_j)}{(1-p_j)\,q_j}$, the weight of term $i$ will be higher than that of term $j$. Yu and Salton [1] showed that such a weight assignment to query terms yields better retrieval effectiveness than if query terms are equally weighted. Robertson and Spärck Jones [2] later showed that if the $i$-th term is assigned the weight $\log\frac{p_i(1-q_i)}{(1-p_i)\,q_i}$, then optimal retrieval effectiveness is obtained under the Binary Independence Assumption.
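
As a minimal sketch, assuming relevance judgments are available, the following Python computes the Robertson–Spärck Jones weight $\log\frac{p_i(1-q_i)}{(1-p_i)q_i}$ for one term; the 0.5 smoothing constant and all names are choices made for this example.

    import math

    def rsj_weight(rel_with_term, rel_total, nonrel_with_term, nonrel_total):
        """Relevance weight log[ p(1-q) / ((1-p) q) ] for a single query term.

        p: probability a relevant document contains the term,
        q: probability a nonrelevant document contains the term.
        A 0.5 correction keeps the estimates away from 0 and 1.
        """
        p = (rel_with_term + 0.5) / (rel_total + 1.0)
        q = (nonrel_with_term + 0.5) / (nonrel_total + 1.0)
        return math.log(p * (1.0 - q) / ((1.0 - p) * q))

    # Example: 8 of 10 relevant documents contain the term,
    # but only 9 of 90 nonrelevant documents do.
    w = rsj_weight(8, 10, 9, 90)

    # With the dot product as the similarity function, a document's score is the
    # sum of the weights of the query terms it contains.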

The Binary Independence Model was introduced by Yu and Salton. [1] The name Binary Independence Model was coined by Robertson and Spärck Jones, [2] who based their derivation on the log-odds of relevance from the probabilistic relevance model; the log-odds $\log\frac{P(R=1\mid x,q)}{P(R=0\mid x,q)}$ was shown to be rank equivalent to the probability of relevance $P(R=1\mid x,q)$ by Luk, [3] obeying the probability ranking principle. [4]
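
The rank equivalence is a consequence of the log-odds being a strictly increasing function of the probability; a one-line sketch:

$$\log \frac{P(R=1\mid x,q)}{P(R=0\mid x,q)} = \log \frac{P(R=1\mid x,q)}{1 - P(R=1\mid x,q)}$$

which is monotonically increasing in $P(R=1\mid x,q)$, so sorting documents by either quantity produces the same ranking.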

See also

Further reading

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assume that the features are conditionally independent given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

Gerard A. "Gerry" Salton was a professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time, and "the father of Information Retrieval". His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard. It was the very first system to use the now popular vector space model for information retrieval.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
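
A minimal sketch of this pipeline in Python with NumPy, using an invented toy count matrix; keeping k = 2 concepts is an arbitrary choice for the example.

    import numpy as np

    # Toy term-document count matrix: rows are terms, columns are documents.
    X = np.array([
        [2, 0, 1],
        [1, 1, 0],
        [0, 2, 1],
        [1, 0, 2],
    ], dtype=float)

    # Truncated SVD keeps only the k strongest "concepts".
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sim_d0_d2 = cosine(doc_vecs[0], doc_vecs[2])   # values near 1 mean similar documents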

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. Like the bag-of-words model, it models a document as a multiset of words, without word order. It is a refinement over the simple bag-of-words model, by allowing the weight of words to depend on the rest of the corpus.
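
A minimal sketch in Python of one common tf–idf variant (raw term frequency times log(N/df)); the toy documents are invented, and many other weighting variants exist.

    import math

    docs = [
        ["binary", "independence", "model"],
        ["probabilistic", "retrieval", "model"],
        ["boolean", "retrieval"],
    ]
    N = len(docs)

    def tf_idf(term, doc):
        """Raw term frequency in the document times inverse document frequency."""
        tf = doc.count(term)
        df = sum(1 for d in docs if term in d)      # documents containing the term
        return tf * math.log(N / df) if df else 0.0

    weight = tf_idf("retrieval", docs[1])   # "retrieval" occurs in 2 of 3 documents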

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions described by the query.
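
A minimal sketch in Python, with invented documents, of Boolean retrieval as set membership: the query "retrieval AND model" returns exactly the documents containing both terms.

    # Documents as sets of terms; no ranking, only exact Boolean matching.
    docs = {
        "d1": {"binary", "independence", "model"},
        "d2": {"probabilistic", "retrieval"},
        "d3": {"boolean", "retrieval", "model"},
    }

    # Query: retrieval AND model.
    hits = [name for name, terms in docs.items()
            if "retrieval" in terms and "model" in terms]
    # hits == ["d3"]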

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.
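
A minimal sketch in Python of a commonly used form of the BM25 scoring function (with the widespread k1 = 1.2, b = 0.75 defaults and the idf variant that adds 1 inside the logarithm); the function and parameter names are chosen for the example.

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, N, avgdl, k1=1.2, b=0.75):
        """Score one document (a list of terms) against a query.

        doc_freq[t] is the number of documents in the collection containing t,
        N is the collection size and avgdl the average document length.
        """
        dl = len(doc_terms)
        score = 0.0
        for t in query_terms:
            n_t = doc_freq.get(t, 0)
            f = doc_terms.count(t)                       # term frequency in this document
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        return score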

Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving Normalized DCG (nDCG or NDCG). NDCG is often used to measure effectiveness of search engine algorithms and related applications. Using a graded relevance scale of documents in a search-engine result set, DCG sums the usefulness, or gain, of the results discounted by their position in the result list. NDCG is DCG normalized by the maximum possible DCG of the result set when ranked from highest to lowest gain, thus adjusting for the different numbers of relevant results for different queries.
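
A minimal sketch in Python, using the common rel/log2(position + 1) gain; a variant with gain 2^rel - 1 is also widely used.

    import math

    def dcg(relevances):
        """Discounted cumulative gain of graded relevance scores, in ranked order."""
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

    def ndcg(relevances):
        """DCG divided by the DCG of the ideal (descending) ordering."""
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    score = ndcg([3, 2, 3, 0, 1, 2])   # graded relevance of a ranked result list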

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. Most search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
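
A minimal sketch in Python of comparing a query and a document by the cosine of the angle between their term-weight vectors; the two vectors are invented and share one vocabulary order.

    import math

    def cosine_similarity(u, v):
        """Cosine of the angle between two equal-length term-weight vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = [1.0, 0.0, 1.0, 0.0]   # weights over a 4-term vocabulary
    doc_vec   = [0.5, 0.2, 0.9, 0.0]
    relevance_estimate = cosine_similarity(query_vec, doc_vec)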

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. How strongly the relevant and irrelevant documents influence the revised query is dictated by the weights of the variables a, b, and c in the algorithm, as in the sketch below.
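
A minimal sketch in Python of the query revision step, assuming the query and all documents are vectors over one vocabulary; the defaults a = 1.0, b = 0.75, c = 0.15 are common textbook choices, not fixed by the algorithm.

    def rocchio(query_vec, relevant, nonrelevant, a=1.0, b=0.75, c=0.15):
        """Move the query toward the centroid of relevant documents and away
        from the centroid of nonrelevant ones."""
        dims = len(query_vec)
        rel_c = ([sum(d[i] for d in relevant) / len(relevant) for i in range(dims)]
                 if relevant else [0.0] * dims)
        non_c = ([sum(d[i] for d in nonrelevant) / len(nonrelevant) for i in range(dims)]
                 if nonrelevant else [0.0] * dims)
        return [a * query_vec[i] + b * rel_c[i] - c * non_c[i] for i in range(dims)]

    revised = rocchio([1.0, 0.0, 1.0],
                      relevant=[[0.9, 0.1, 0.8]],
                      nonrelevant=[[0.0, 0.7, 0.1]])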

In natural language processing and information retrieval, cluster labeling is the problem of picking descriptive, human-readable labels for the clusters produced by a document clustering algorithm; standard clustering algorithms do not typically produce any such labels. Cluster labeling algorithms examine the contents of the documents per cluster to find a labeling that summarizes the topic of each cluster and distinguishes the clusters from each other.

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the vector space model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the standard Boolean model it would not be.

Fuzzy retrieval techniques are based on the Extended Boolean model and fuzzy set theory. There are two classical fuzzy retrieval models: Mixed Min and Max (MMM) and the Paice model. Neither model provides a way of evaluating query weights; however, this is considered by the P-norms algorithm.

The Generalized vector space model is a generalization of the vector space model used in information retrieval. Wong et al. presented an analysis of the problems that the pairwise orthogonality assumption of the vector space model (VSM) creates. From here they extended the VSM to the generalized vector space model (GVSM).

The probabilistic relevance model was devised by Stephen E. Robertson and Karen Spärck Jones as a framework for probabilistic models to come. It is a formalism of information retrieval useful to derive ranking functions used by search engines and web search engines in order to rank matching documents according to their relevance to a given search query.

The query likelihood model is a language model used in information retrieval. A language model is constructed for each document in the collection. It is then possible to rank each document by the probability of the document given the query, which is interpreted as the likelihood that the document is relevant to the query.
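
A minimal sketch in Python of scoring a document by the log-probability of the query under a unigram language model of that document, with Jelinek–Mercer smoothing against the whole collection; lambda = 0.5 and all names are assumptions of the example.

    import math

    def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
        """Sum of log P(term) under a smoothed unigram model of the document."""
        dl, cl = len(doc_terms), len(collection_terms)
        score = 0.0
        for t in query_terms:
            p_doc = doc_terms.count(t) / dl if dl else 0.0
            p_col = collection_terms.count(t) / cl if cl else 0.0
            p = lam * p_doc + (1 - lam) * p_col
            if p == 0.0:
                return float("-inf")     # a query term unseen anywhere in the collection
            score += math.log(p)
        return score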

In machine learning, a ranking SVM is a variant of the support vector machine algorithm, which is used to solve certain ranking problems. The ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that ranking SVM also can be used to solve other problems such as Rank SIFT.

Evaluation measures for an information retrieval (IR) system assess how well an index, search engine, or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms.

References

  1. Yu, C. T.; Salton, G. (1976). "Precision Weighting – An Effective Automatic Indexing Method" (PDF). Journal of the ACM. 23: 76–88. doi:10.1145/321921.321930. hdl:1813/7313.
  2. Robertson, S. E.; Spärck Jones, K. (1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129. doi:10.1002/asi.4630270302.
  3. Luk, R. W. P. (2022). "Why is information retrieval a scientific discipline?". Foundations of Science. 27 (2): 427–453. doi:10.1007/s10699-020-09685-x. hdl:10397/94873.
  4. Robertson, S. E. (1977). "The Probability Ranking Principle in IR". Journal of Documentation. 33 (4): 294–304. doi:10.1108/eb026647.