Fuzzy retrieval

Last updated

Fuzzy retrieval techniques are based on the Extended Boolean model and the Fuzzy set theory. There are two classical fuzzy retrieval models: Mixed Min and Max (MMM) and the Paice model. Both models do not provide a way of evaluating query weights, however this is considered by the P-norms algorithm.

Contents

Mixed Min and Max model (MMM)

In fuzzy-set theory, an element has a varying degree of membership, say dA, to a given set A instead of the traditional membership choice (is an element/is not an element).
In MMM [1] each index term has a fuzzy set associated with it. A document's weight with respect to an index term A is considered to be the degree of membership of the document in the fuzzy set associated with A. The degree of membership for union and intersection are defined as follows in Fuzzy set theory:

According to this, documents that should be retrieved for a query of the form A or B, should be in the fuzzy set associated with the union of the two sets A and B. Similarly, the documents that should be retrieved for a query of the form A and B, should be in the fuzzy set associated with the intersection of the two sets. Hence, it is possible to define the similarity of a document to the or query to be max(dA, dB) and the similarity of the document to the and query to be min(dA, dB). The MMM model tries to soften the Boolean operators by considering the query-document similarity to be a linear combination of the min and max document weights.

Given a document D with index-term weights dA1, dA2, ..., dAn for terms A1, A2, ..., An, and the queries:

Qor = (A1 or A2 or ... or An)
Qand = (A1 and A2 and ... and An)

the query-document similarity in the MMM model is computed as follows:

SlM(Qor, D) = Cor1 * max(dA1, dA2, ..., dAn) + Cor2 * min(dA1, dA2, ..., dAn)
SlM(Qand, D) = Cand1 * min(dA1, dA2, ..., dAn) + Cand2 * max(dA1, dA2 ..., dAn)

where Cor1, Cor2 are "softness" coefficients for the or operator, and Cand1, Cand2 are softness coefficients for the and operator. Since we would like to give the maximum of the document weights more importance while considering an or query and the minimum more importance while considering an and query, generally we have Cor1 > Cor2 and Cand1 > Cand2. For simplicity it is generally assumed that Cor1 = 1 - Cor2 and Cand1 = 1 - Cand2.

Lee and Fox [2] experiments indicate that the best performance usually occurs with Cand1 in the range [0.5, 0.8] and with Cor1 > 0.2. In general, the computational cost of MMM is low, and retrieval effectiveness is much better than with the Standard Boolean model.

Paice model

The Paice model [3] is a general extension to the MMM model. In comparison to the MMM model that considers only the minimum and maximum weights for the index terms, the Paice model incorporates all of the term weights when calculating the similarity:

where r is a constant coefficient and wdi is arranged in ascending order for and queries and descending order for or queries. When n = 2 the Paice model shows the same behavior as the MMM model.

The experiments of Lee and Fox [2] have shown that setting the r to 1.0 for and queries and 0.7 for or queries gives good retrieval effectiveness. The computational cost for this model is higher than that for the MMM model. This is because the MMM model only requires the determination of min or max of a set of term weights each time an and or or clause is considered, which can be done in O(n). The Paice model requires the term weights to be sorted in ascending or descending order, depending on whether an and clause or an or clause is being considered. This requires at least an 0(n log n) sorting algorithm. A good deal of floating point calculation is needed too.

Improvements over the Standard Boolean model

Lee and Fox [2] compared the Standard Boolean model with MMM and Paice models with three test collections, CISI, CACM and INSPEC. These are the reported results for average mean precision improvement:

CISICACMINSPEC
MMM68%109%195%
Paice77%104%206%

These are very good improvements over the Standard model. MMM is very close to Paice and P-norm results which indicates that it can be a very good technique, and is the most efficient of the three.

Recent work

In 2005, Kang et al. [4] have devised a fuzzy retrieval system indexed by concept identification.

If we look at documents on a pure Tf-idf approach, even eliminating stop words, there will be words more relevant to the topic of the document than others and they will have the same weight because they have the same term frequency. If we take into account the user intent on a query we can better weight the terms of a document. Each term can be identified as a concept in a certain lexical chain that translates the importance of that concept for that document.
They report improvements over Paice and P-norm on the average precision and recall for the Top-5 retrieved documents.

Zadrozny [5] revisited the fuzzy information retrieval model. He further extends the fuzzy extended Boolean model by:

The proposed model makes it possible to grasp both imprecision and uncertainty concerning the textual information representation and retrieval.

See also

Further reading

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

Fuzzy set operations are a generalization of crisp set operations for fuzzy sets. There is in fact more than one possible generalization. The most widely used operations are called standard fuzzy set operations; they comprise: fuzzy complements, fuzzy intersections, and fuzzy unions.

In the field of information retrieval, divergence from randomness (DFR), is a generalization of one of the very first models, Harter's 2-Poisson indexing-model. It is one type of probabilistic model. It is basically used to test the amount of information carried in the documents. The 2-Poisson model is based on the hypothesis that the level of the documents is related to a set of documents which contains words that occur in relatively greater extent than in the rest of the documents. It is not a 'model', but a framework for weighting terms using probabilistic methods, and it has a special relationship for term weighting based on the notion of elite.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. Like the bag-of-words model, it models a document as a multiset of words, without word order. It is a refinement over the simple bag-of-words model, by allowing the weight of words to depend on the rest of the corpus.

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions described by the query.

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents. Query expansion involves techniques such as:

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

The Dice-Sørensen coefficient is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Lee Raymond Dice and Thorvald Sørensen, who published in 1945 and 1948 respectively.

A web query or web search query is a query that a user enters into a web search engine to satisfy their information needs. Web search queries are distinctive in that they are often plain text and boolean search directives are rarely used. They vary greatly from standard query languages, which are governed by strict syntax rules as command languages with keyword or positional parameters.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the Vector Space Model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the Standard Boolean model it wasn't.

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently.

Christopher D Paice was one of the pioneers of research into stemming. The Paice-Husk stemmer was published in 1990 and his method of evaluation of stemmer performance by means of Error Rate with Respect to Truncation (ERRT) was the first direct method of comparing under-stemming and over-stemming errors. Apart from his pioneering work on stemming algorithms and evaluation methods he made other research contributions in the area of Information Retrieval, anaphora resolution and automatic abstracting.

References

  1. Fox, E. A.; S. Sharat (1986), A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval, Technical Report TR-86-1, Virginia Tech, Department of Computer Science
  2. 1 2 3 Lee, W. C.; E. A. Fox (1988), Experimental Comparison of Schemes for Interpreting Boolean Queries
  3. Paice, C. D. (1984), Soft Evaluation of Boolean Search Queries in Information Retrieval Systems, Information Technology, Res. Dev. Applications, 3(1), 33-42
  4. Kang, Bo-Yeong; Dae-Won Kim; Hae-Jung Kim (2005), "Fuzzy Information Retrieval Indexed by Concept Identification", Text, Speech and Dialogue, Lecture Notes in Computer Science, vol. 3658, Springer Berlin / Heidelberg, pp. 179–186, doi:10.1007/11551874_23, ISBN   978-3-540-28789-6
  5. Zadrozny, Sławomir; Nowacka, Katarzyna (2009), "Fuzzy information retrieval model revisited", Fuzzy Sets and Systems, 160 (15), Elsevier North-Holland, Inc.: 2173–2191, doi:10.1016/j.fss.2009.02.012