Query expansion


Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. [1] In the context of search engines, query expansion involves evaluating a user's input (the words typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents. Query expansion involves techniques such as:

Finding synonyms of words, and searching for the synonyms as well
Finding semantically related words (e.g. antonyms, meronyms, hyponyms, hypernyms)
Finding all the various morphological forms of words by stemming each word in the search query
Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
Re-weighting the terms in the original query


Query expansion is a methodology studied in the field of computer science, particularly within the realm of natural language processing and information retrieval.

Precision and recall trade-offs

Search engines invoke query expansion to increase the quality of user search results. It is assumed that users do not always formulate search queries using the best terms. A term may be a poor choice simply because, for example, the database does not contain documents that use it.

By stemming a user-entered term, more documents are matched, as the alternate word forms of the term are matched as well, increasing the total recall. This comes at the expense of reduced precision. Expanding a search query to also cover synonyms of a user-entered term likewise increases recall at the expense of precision: precision is the fraction of retrieved documents that are relevant, and the enlarged result set typically contains a higher proportion of irrelevant documents. A larger recall can also harm perceived result quality in itself, since many users do not want more results to comb through, regardless of the precision.
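To make the trade-off concrete, the following is a minimal Python sketch of recall-oriented expansion by stemming and synonym lookup; the toy stem() rules and the SYNONYMS table are illustrative stand-ins for a real stemmer and lexical resource, not any particular engine's implementation.

```python
# A minimal sketch of recall-oriented query expansion via stemming and
# synonyms. The stem() rules and the SYNONYMS table are toy stand-ins for
# a real stemmer (e.g. Porter) and a real lexical resource (e.g. WordNet).

SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def stem(word: str) -> str:
    """Crude suffix stripping; real systems use e.g. the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def expand_query(query: str) -> set[str]:
    """Return the original terms plus stems and synonyms, increasing recall."""
    terms: set[str] = set()
    for word in query.lower().split():
        root = stem(word)
        terms.update({word, root})
        terms.update(SYNONYMS.get(word, []))   # synonyms of the surface form
        terms.update(SYNONYMS.get(root, []))   # synonyms of the stemmed form
    return terms

print(expand_query("buying cars"))
# {'buying', 'buy', 'purchase', 'cars', 'car', 'automobile', 'vehicle'}
```

Every added term matches documents the original query would have missed, which is exactly why recall rises while precision tends to fall.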

The goal of query expansion in this regard is that, by increasing recall, precision can potentially increase (rather than decrease, as the trade-off above would suggest) by adding to the result set pages which are more relevant (of higher quality), or at least equally relevant. Pages which have the potential to be more relevant to the user's desired query, but which would not be included in the result set without query expansion, are now included. At the same time, many current commercial search engines use word frequency (tf-idf) to assist in ranking.[citation needed] By ranking the occurrences of both the user-entered words and their synonyms and alternate morphological forms, documents with a higher density (high frequency and close proximity) of these terms tend to migrate higher in the search results, leading to higher-quality results near the top, despite the larger recall.
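As a rough illustration of this ranking point, the sketch below scores documents by summed tf-idf over an expanded term set, so documents dense in the user's terms or their variants rise to the top; the corpus and term set are invented toy data, not a real engine's scoring function.

```python
# Toy tf-idf ranking over an expanded term set: documents containing many
# occurrences of the query terms or their variants score higher, keeping
# quality high at the top of the results despite the larger recall.
import math

docs = [
    "buy a used car at a car dealership",
    "purchase an automobile online",
    "recipes for apple pie",
]
tokenized = [d.split() for d in docs]

def tfidf(term: str, doc: list[str]) -> float:
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in tokenized if term in d)   # document frequency
    return tf * math.log(len(tokenized) / df) if df else 0.0

def score(doc: list[str], terms: set[str]) -> float:
    return sum(tfidf(t, doc) for t in terms)

expanded = {"buy", "purchase", "car", "automobile"}
ranked = sorted(tokenized, key=lambda d: score(d, expanded), reverse=True)
print(" | ".join(" ".join(d) for d in ranked))   # both car documents outrank the pie recipe
```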

Query expansion methods

Automatic methods for query expansion were proposed in 1960 by Maron and Kuhns. [2] Modern query expansion methods either rely on document collection analysis (global or local) [3] or are dictionary- or ontology-based. [4] Global analysis of the document collection is applied to find relations between terms. Local analysis refers to the relevance feedback introduced by Rocchio, [5] who proposed manually judging some of the retrieved documents and using this feedback information to expand the query. Since collecting users' judgments can be challenging, only the top retrieved documents are assumed to be relevant instead; this is so-called pseudo-relevance feedback (PRF). [6] Pseudo-relevance feedback is effective on average but can damage results for some queries, [7] especially difficult ones, since for those the top retrieved documents are probably non-relevant. Pseudo-relevant documents are used to find expansion candidate terms that co-occur with many query terms. [8] This idea was further developed within the relevance language model formalism in positional relevance [9] and proximity relevance models, [10] which consider the distance to query terms in the pseudo-relevant documents. Another direction in query expansion is the representation of index and query terms in a vector space, which can be used to find related terms at query time using semantic vectors or word embeddings. [11] [12]
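The following is a minimal sketch of the PRF loop described above, assuming a hypothetical retrieve() function standing in for any first-pass ranker (e.g. BM25); it illustrates the general idea rather than any of the cited methods exactly.

```python
# Pseudo-relevance feedback sketch: assume the top-k retrieved documents are
# relevant and add their most frequent co-occurring terms to the query.
# retrieve() is a hypothetical stand-in returning documents as token lists.
from collections import Counter

def prf_expand(query_terms, retrieve, k=10, n_terms=5, stopwords=frozenset()):
    top_docs = retrieve(query_terms)[:k]          # pseudo-relevant documents
    counts = Counter()
    for doc in top_docs:
        counts.update(t for t in doc
                      if t not in query_terms and t not in stopwords)
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion          # expanded query for a second pass
```

The danger the literature notes is visible in the sketch: if the top-k documents are not actually relevant, the most frequent terms drawn from them will drag the second-pass query further off topic.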

More generally, query expansion, together with its counterpart document expansion, is today often implemented on top of vector databases, using various encoding schemes based on deep learning. [13]
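As a sketch of the embedding-based variant mentioned above, the function below finds a term's nearest neighbours in a word-embedding space; the embeddings mapping is an assumed input (e.g. from a pretrained word2vec or GloVe model) with unit-normalized vectors, so cosine similarity reduces to a dot product.

```python
# Embedding-based expansion sketch: the nearest neighbours of a query term in
# vector space are candidate expansion terms. `embeddings` maps each term to a
# unit-normalized vector and is assumed to come from a pretrained model.
import numpy as np

def nearest_terms(term: str, embeddings: dict, top_n: int = 3) -> list[str]:
    if term not in embeddings:
        return []
    q = embeddings[term]
    # With unit vectors, the dot product equals cosine similarity.
    sims = {t: float(np.dot(q, v)) for t, v in embeddings.items() if t != term}
    return sorted(sims, key=sims.get, reverse=True)[:top_n]
```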


Software libraries

ReQue: a configurable workflow and dataset collection for query refinement, with an extensible toolkit of query refinement methods. [14] [15]

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Relevance may include concerns such as timeliness, authority or novelty of the result.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.


The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity, and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.

Human–computer information retrieval (HCIR) is the study and engineering of information retrieval techniques that bring human intelligence into the search process. It combines the fields of human-computer interaction (HCI) and information retrieval (IR) and creates systems that improve search by taking into account the human context, or through a multi-step search process that provides the opportunity for human feedback.

A Web query topic classification/categorization is a problem in information science. The task is to assign a Web search query to one or more predefined categories, based on its topics. The importance of query classification is underscored by many services provided by Web search. A direct application is to provide better search result pages for users with interests of different categories. For example, the users issuing a Web query "apple" might expect to see Web pages related to the fruit apple, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on the query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Different from the document classification tasks, queries submitted by Web search users are usually short and ambiguous; also the meanings of the queries are evolving over time. Therefore, query topic classification is much more difficult than traditional document classification tasks.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving Normalized DCG (nDCG or NDCG). NDCG is often used to measure effectiveness of search engine algorithms and related applications. Using a graded relevance scale of documents in a search-engine result set, DCG sums the usefulness, or gain, of the results discounted by their position in the result list. NDCG is DCG normalized by the maximum possible DCG of the result set when ranked from highest to lowest gain, thus adjusting for the different numbers of relevant results for different queries.
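To make the definition concrete, here is a minimal sketch of DCG and nDCG with the common logarithmic position discount; the graded relevance labels in the example are invented.

```python
# DCG sums graded relevance discounted by log2 of position; nDCG divides by
# the DCG of the ideal (descending) ordering, giving a score in [0, 1].
import math

def dcg(rels: list[float]) -> float:
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels: list[float]) -> float:
    ideal = dcg(sorted(rels, reverse=True))   # best possible ordering
    return dcg(rels) / ideal if ideal else 0.0

print(round(ndcg([3, 2, 3, 0, 1, 2]), 2))    # 0.96 for this toy ranking
```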

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Cyril Cleverdon was a British librarian and computer scientist who is best known for his work on the evaluation of information retrieval systems.

XML retrieval, or XML information retrieval, is the content-based retrieval of documents structured with XML. As such it is used for computing relevance of XML documents.

RetrievalWare is an enterprise search engine emphasizing natural language processing and semantic networks which was commercially available from 1992 to 2007 and is especially known for its use by government intelligence agencies.

Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal professionals. Its importance has increased because of the vast and quickly increasing amount of legal documents available through electronic means. Legal information retrieval is a part of the growing field of legal informatics.

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter the query is dictated by the weights of the a, b, and c variables.
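A minimal sketch of the Rocchio update in the vector space model follows, with the a, b, c weights mentioned above; the default values shown are common textbook choices, not values prescribed by the algorithm itself.

```python
# Rocchio update: move the query vector toward the centroid of the relevant
# documents and away from the centroid of the non-relevant ones.
import numpy as np

def rocchio(query, relevant, non_relevant, a=1.0, b=0.75, c=0.15):
    """q' = a*q + b*mean(relevant) - c*mean(non_relevant)."""
    q = a * np.asarray(query, dtype=float)
    if len(relevant):
        q = q + b * np.mean(relevant, axis=0)
    if len(non_relevant):
        q = q - c * np.mean(non_relevant, axis=0)
    return np.maximum(q, 0.0)   # negative term weights are typically clipped to zero
```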

Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently.

Evaluation measures for an information retrieval (IR) system assess how well an index, search engine, or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms.

ChengXiang Zhai is a computer scientist. He is a Donald Biggar Willett Professor in Engineering in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher's keywords. Query understanding methods generally take place before the search engine retrieves and ranks results. It is related to natural language processing but specifically focused on the understanding of search queries. Query understanding is at the heart of technologies like Amazon Alexa, Apple's Siri, Google Assistant, IBM's Watson, and Microsoft's Cortana.

References


  1. Vechtomova, Olga; Wang, Ying (2006). "A study of the effect of term proximity on query expansion". Journal of Information Science. 32 (4): 324–333. CiteSeerX 10.1.1.552.5987. doi:10.1177/0165551506065787. S2CID 7265523.
  2. Maron, M. E. and Kuhns, J. L. 1960. On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM 7, 3, 216–244.
  3. C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Computing Surveys, 44(1):1-50, Jan. 2012.
  4. J. Bhogal, A. Macfarlane, and P. Smith. A review of ontology based query expansion. Inf. Process. Manage., 43(4):866-886, July 2007.
  5. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System, p. 313-323. 1971.
  6. C. Buckley. Automatic query expansion using SMART: TREC 3. In Proceedings of The third Text REtrieval Conference (TREC-3). NIST Special Publication, p. 69-80. National Institute of Standards and Technology, 1995.
  7. G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness, and selective application of query expansion. Advances in Information Retrieval, p. 127-137, 2004.
  8. J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 4-11. ACM, 1996.
  9. Y. Lv and C. Zhai. Positional relevance model for pseudo-relevance feedback. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, page 579-586. ACM, 2010.
  10. L. Ermakova, J. Mothe, and E. Nikitina. 2016. Proximity relevance model for query expansion. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 1054-1059. DOI: https://doi.org/10.1145/2851613.2851696
  11. Sahlgren, Magnus, Jussi Karlgren, Rickard Cöster, and Timo Järvinen. 2003. "Automatic query expansion using random indexing." In Advances in Cross-Language Information Retrieval: Third Workshop of the Cross-Language Evaluation Forum (CLEF). Springer.
  12. S. Kuzi, A. Shtok, and O. Kurland. 2016. Query Expansion Using Word Embeddings. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 1929-1932. DOI: https://doi.org/10.1145/2983323.2983876
  13. Lin, Jimmy; Nogueira, Rodrigo; Yates, Andrew (2020-10-13). "Pretrained Transformers for Text Ranking: BERT and Beyond". arXiv: 2010.06467 [cs.IR].
  14. Mahtab Tamannaee, Hossein Fani, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri: ReQue: A Configurable Workflow and Dataset Collection for Query Refinement. CIKM 2020: 3165-3172
  15. Hossein Fani, Mahtab Tamannaee, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri; An Extensible Toolkit of Query Refinement Methods and Gold Standard Dataset Generation. In Advances in Information Retrieval: 43rd European Conference on IR Research (ECIR'21), 2021.
