Query expansion

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. [1] In the context of search engines, query expansion involves evaluating a user's input (what words were typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents. Query expansion involves techniques such as:

- finding synonyms of words, and searching for the synonyms as well
- finding semantically related words
- finding all the various morphological forms of words by stemming each word in the search query
- fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
- re-weighting the terms in the original query

Query expansion is a methodology studied in the field of computer science, particularly within the realm of natural language processing and information retrieval.

Precision and recall trade-offs

Search engines invoke query expansion to increase the quality of user search results. It is assumed that users do not always formulate search queries using the best terms. "Best" in this case may mean, for example, that the database does not contain documents using the exact terms the user entered, even though it contains relevant documents phrased differently.

By stemming a user-entered term, more documents are matched, as alternate word forms of the term are matched as well, increasing total recall. This comes at the expense of precision. Expanding a search query to also cover synonyms of a user-entered term likewise increases recall at the expense of precision. The reason follows from how precision is calculated: precision is the fraction of retrieved documents that are relevant, so as expansion retrieves additional documents, the less relevant matches among them drive precision down. A larger result set can also hurt perceived search quality in itself, since many users do not want more results to comb through, regardless of the precision.
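
The trade-off can be made concrete with a small worked example. The following is a minimal sketch in Python; the four-document corpus, the relevance judgments, and the crude suffix-stripping stemmer are all toy assumptions (a real system would use a proper stemmer such as Porter's). Matching on stems retrieves more of the relevant documents, raising recall, but also pulls in a non-relevant document, lowering precision.

```python
# A minimal sketch of the recall/precision trade-off, assuming a toy
# corpus, toy relevance judgments, and a crude suffix-stripping stemmer
# (a stand-in for a real stemmer such as Porter's).

def crude_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
        word = word[:-1]
    return word

docs = {
    1: "tips for running a marathon",      # relevant
    2: "he runs every morning",            # relevant
    3: "the river runs through the town",  # not relevant
    4: "history of the marathon",          # relevant
}
relevant = {1, 2, 4}

def retrieve(query_terms, stem=False):
    """Match documents sharing a (possibly stemmed) term with the query."""
    norm = crude_stem if stem else (lambda w: w)
    q = {norm(t) for t in query_terms}
    return {
        doc_id
        for doc_id, text in docs.items()
        if q & {norm(w) for w in text.split()}
    }

for stem in (False, True):
    hits = retrieve(["running"], stem=stem)
    precision = len(hits & relevant) / len(hits)
    recall = len(hits & relevant) / len(relevant)
    print(f"stemming={stem}: precision={precision:.2f} recall={recall:.2f}")
```

Without stemming, only document 1 matches (precision 1.00, recall 0.33); with stemming, documents 1-3 match (recall rises to 0.67 while precision falls to 0.67).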

The goal of query expansion in this regard is that, by increasing recall, precision can potentially increase as well (rather than decrease, as the reasoning above would suggest), because the result set comes to include pages that are more relevant (of higher quality), or at least equally relevant. Pages which have the potential to be more relevant to the user's desired query, but which would not be included in the result set without query expansion, are now included. At the same time, many current commercial search engines use word frequency (tf-idf) to assist in ranking. [citation needed] By ranking occurrences of both the user-entered words and their synonyms and alternate morphological forms, documents in which those terms appear with higher density (high frequency and close proximity) tend to migrate higher up in the search results, preserving quality near the top of the results despite the larger recall.
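
As an illustration of this ranking effect, the following minimal sketch uses scikit-learn's TfidfVectorizer and cosine similarity; the three-document corpus and the synonym terms appended to the query are assumptions made up for the example. The expanded query surfaces a relevant document that shares no terms with the original query, while tf-idf weighting keeps the document dense in the original terms at the top of the ranking.

```python
# A minimal sketch of tf-idf ranking over an expanded query, using
# scikit-learn. The corpus and the synonyms appended to the query are
# toy assumptions made up for this example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "cheap flights to paris and london",
    "affordable plane tickets for european travel",
    "paris travel guide and city map",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

def rank(query):
    """Score every document against the query by cosine similarity."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return sorted(zip(scores, docs), reverse=True)

original = "cheap flights"
expanded = "cheap flights affordable plane tickets"  # synonyms appended

for query in (original, expanded):
    print(query)
    for score, doc in rank(query):
        print(f"  {score:.3f}  {doc}")
```

With the original query, the second document scores zero; with the expanded query it is retrieved, while the first document, dense in the original terms, stays ranked first.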

Query expansion methods

Automatic methods for query expansion were proposed in 1960 by Maron and Kuhns. [2] Modern query expansion methods either rely on an analysis of the document collection (global or local) [3] or are dictionary- or ontology-based. [4] Global analysis of the document collection is used to find relations between terms. Local analysis refers to the relevance feedback introduced by Rocchio. [5] Rocchio proposed manually judging some of the retrieved documents and using this feedback to expand the query. Since collecting users' judgments can be challenging, only the top-ranked retrieved documents are assumed to be relevant. This is so-called pseudo-relevance feedback (PRF). [6] Pseudo-relevance feedback is effective on average but can damage results for some queries, [7] especially difficult ones, since the top retrieved documents are then likely non-relevant. Pseudo-relevant documents are used to find expansion candidate terms that co-occur with many query terms. [8] This idea was further developed within the relevance language model formalism in positional relevance [9] and proximity relevance models, [10] which consider the distance to query terms in the pseudo-relevant documents. Another direction in query expansion is the application of word embeddings. [11]
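
The following is a minimal sketch of the PRF loop under strong simplifying assumptions: a bag-of-words term-count ranker stands in for a real first-pass retriever, and expansion candidates are simply the most frequent unseen terms in the top-k documents, whereas the methods cited above weight candidates by co-occurrence or proximity to the query terms.

```python
# A minimal sketch of pseudo-relevance feedback (PRF). The corpus, the
# term-count "retriever", and the frequency-based candidate selection are
# simplifying assumptions; real PRF methods weight candidates by
# co-occurrence or proximity to the query terms.
from collections import Counter

docs = [
    "query expansion improves recall in information retrieval",
    "relevance feedback uses judged documents to reformulate the query",
    "stemming and synonyms are classic expansion techniques",
    "gardening tips for spring flowers",
]

def score(query_terms, text):
    """Count occurrences of query terms in the text (toy ranker)."""
    words = text.split()
    return sum(words.count(t) for t in query_terms)

def retrieve(query_terms):
    """First-pass retrieval: rank documents by term count."""
    ranked = sorted(docs, key=lambda d: score(query_terms, d), reverse=True)
    return [d for d in ranked if score(query_terms, d) > 0]

def expand_prf(query_terms, k=2, n_terms=2):
    """Treat the top-k results as relevant and add their frequent terms."""
    pseudo_relevant = retrieve(query_terms)[:k]
    candidates = Counter(
        w
        for d in pseudo_relevant
        for w in d.split()
        if w not in query_terms and len(w) > 4  # crude stop-word filter
    )
    return query_terms + [w for w, _ in candidates.most_common(n_terms)]

print(expand_prf(["query", "expansion"]))
# e.g. ['query', 'expansion', 'improves', 'recall']
```

The expanded query is then submitted for a second retrieval pass; the quality of the final results depends entirely on whether the top-k first-pass documents really were relevant, which is why PRF can hurt difficult queries.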

An alternative to query expansion is document expansion, which reformulates the text of the documents being searched rather than the text of the query. [12]
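
A minimal sketch of the idea follows, assuming a hypothetical predict_queries function in place of a learned generator (e.g. the sequence-to-sequence models surveyed in [12]): extra text is appended to each document before indexing, so that the unchanged query can match it.

```python
# A minimal sketch of document expansion. `predict_queries` is a
# hypothetical stand-in for a learned model that generates queries a
# document is likely to answer; here it returns canned strings.

def predict_queries(doc: str) -> list[str]:
    canned = {
        "paris travel guide": ["what to see in paris", "paris tourism"],
    }
    return canned.get(doc, [])

def expand_document(doc: str) -> str:
    """Append predicted queries to the document text before indexing."""
    return " ".join([doc, *predict_queries(doc)])

print(expand_document("paris travel guide"))
# -> "paris travel guide what to see in paris paris tourism"
```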

Software libraries

ReQue provides a configurable workflow and dataset collection for query refinement, including an extensible toolkit of query refinement methods and gold standard dataset generation. [13] [14]

Related Research Articles

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Relevance may include concerns such as timeliness, authority or novelty of the result.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity, and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.

A web query or web search query is a query that a user enters into a web search engine to satisfy their information needs. Web search queries are distinctive in that they are often plain text and Boolean search directives are rarely used. They vary greatly from standard query languages, which are governed by strict syntax rules as command languages with keyword or positional parameters.

In pattern recognition, information retrieval and classification, precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.
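
For reference, the standard set-based definitions used in the trade-off discussion above are:

```latex
\text{precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|},
\qquad
\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}
```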

Human–computer information retrieval (HCIR) is the study and engineering of information retrieval techniques that bring human intelligence into the search process. It combines the fields of human-computer interaction (HCI) and information retrieval (IR) and creates systems that improve search by taking into account the human context, or through a multi-step search process that provides the opportunity for human feedback.

A Web query topic classification/categorization is a problem in information science. The task is to assign a Web search query to one or more predefined categories, based on its topics. The importance of query classification is underscored by many services provided by Web search. A direct application is to provide better search result pages for users with interests in different categories. For example, users issuing the Web query "apple" might expect to see Web pages related to the fruit apple, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Unlike document classification tasks, queries submitted by Web search users are usually short and ambiguous, and their meanings evolve over time. Therefore, query topic classification is much more difficult than traditional document classification tasks.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
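
In its common formulation, the gain accumulated at rank position \(p\) is:

```latex
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}
```

where \(rel_i\) is the graded relevance of the result at position \(i\); the logarithmic term is the rank discount.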

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications, such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Cyril Cleverdon was a British librarian and computer scientist who is best known for his work on the evaluation of information retrieval systems.

XML retrieval, or XML information retrieval, is the content-based retrieval of documents structured with XML. As such it is used for computing relevance of XML documents.

Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal professionals. Its importance has increased because of the vast and quickly increasing amount of legal documents available through electronic means. Legal information retrieval is a part of the growing field of legal informatics.

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System, developed between 1960 and 1964. Like many other retrieval systems, the Rocchio feedback approach was developed using the Vector Space Model. The algorithm is based on the assumption that most users have a general conception of which documents should be denoted as relevant or non-relevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and non-relevant documents allowed to enter a query is dictated by the weights of the variables a, b, and c in the reformulation formula below.
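
In its common textbook form (notation varies by author), the reformulated query vector combines the original query vector with the centroids of the judged relevant and non-relevant document sets:

```latex
\vec{q}_m = a\,\vec{q}_0
          + b\,\frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
          - c\,\frac{1}{|D_{nr}|} \sum_{\vec{d}_k \in D_{nr}} \vec{d}_k
```

where \(\vec{q}_0\) is the original query vector, \(D_r\) and \(D_{nr}\) are the sets of relevant and non-relevant documents, and the weights a, b, and c control the relative contribution of each component.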

Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently.

Evaluation measures for an information retrieval system are used to assess how well the search results satisfied the user's query intent. The field of information retrieval has used various types of quantitative metrics for this purpose, based either on observed user behaviour or on scores from prepared benchmark test sets. Besides benchmarking with such measures, an evaluation of an information retrieval system should also include a validation of the measures used, i.e. an assessment of how well they measure what they are intended to measure and how well the system fits its intended use case. Metrics are often split into two types: online metrics look at users' interactions with the search system, while offline metrics measure theoretical relevance, in other words how likely each result, or the search engine results page (SERP) as a whole, is to meet the information needs of the user.

Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher's keywords. Query understanding methods generally take place before the search engine retrieves and ranks results. It is related to natural language processing but is specifically focused on the understanding of search queries. Query understanding is at the heart of technologies like Amazon Alexa, Apple's Siri, Google Assistant, IBM's Watson, and Microsoft's Cortana.

References

  1. Vechtomova, Olga; Wang, Ying (2006). "A study of the effect of term proximity on query expansion". Journal of Information Science. 32 (4): 324-333. CiteSeerX 10.1.1.552.5987. doi:10.1177/0165551506065787. S2CID 7265523.
  2. Maron, M. E. and Kuhns, J. L. 1960. On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM 7, 3, 216–244.
  3. C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Computing Surveys, 44(1):1-50, Jan. 2012.
  4. J. Bhogal, A. Macfarlane, and P. Smith. A review of ontology based query expansion. Inf. Process. Manage., 43(4):866-886, July 2007.
  5. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System, p. 313-323. 1971.
  6. C. Buckley. Automatic query expansion using SMART: TREC 3. In Proceedings of the Third Text REtrieval Conference (TREC-3). NIST Special Publication, p. 69-80. National Institute of Standards and Technology, 1995.
  7. G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness, and selective application of query expansion. Advances in Information Retrieval, p. 127-137, 2004.
  8. J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 4-11. ACM, 1996.
  9. Y. Lv and C. Zhai. Positional relevance model for pseudo-relevance feedback. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 579-586. ACM, 2010.
  10. L. Ermakova, J. Mothe, and E. Nikitina. 2016. Proximity relevance model for query expansion. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 1054-1059. DOI: https://doi.org/10.1145/2851613.2851696
  11. S. Kuzi, A. Shtok, and O. Kurland. 2016. Query Expansion Using Word Embeddings. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 1929-1932. DOI: https://doi.org/10.1145/2983323.2983876
  12. Lin, Jimmy; Nogueira, Rodrigo; Yates, Andrew (2020-10-13). "Pretrained Transformers for Text Ranking: BERT and Beyond". arXiv: 2010.06467 [cs.IR].
  13. Mahtab Tamannaee, Hossein Fani, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri: ReQue: A Configurable Workflow and Dataset Collection for Query Refinement. CIKM 2020: 3165-3172
  14. Hossein Fani, Mahtab Tamannaee, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri: An Extensible Toolkit of Query Refinement Methods and Gold Standard Dataset Generation. In Advances in Information Retrieval: 43rd European Conference on IR Research (ECIR'21), 2021.
