Query expansion
   HOME

TheInfoList



OR:

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in
information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other c ...
operations, particularly in the context of
query understanding Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher’s keywords. Query understanding methods generally take place before the search engine retrieves and ranks results ...
. In the context of
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
s, query expansion involves evaluating a user's input (what words were typed into the search query area, and sometimes other types of
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
) and expanding the search query to match additional documents. Query expansion involves techniques such as: * Finding
synonym A synonym is a word, morpheme, or phrase that means exactly or nearly the same as another word, morpheme, or phrase in a given language. For example, in the English language, the words ''begin'', ''start'', ''commence'', and ''initiate'' are al ...
s of words, and searching for the synonyms as well * Finding semantically related words (e.g.
antonym In lexical semantics, opposites are words lying in an inherently incompatible binary relationship. For example, something that is ''long'' entails that it is not ''short''. It is referred to as a 'binary' relationship because there are two members ...
s, meronyms, hyponyms, hypernyms) * Finding all the various morphological forms of words by stemming each word in the search query * Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results * Re-weighting the terms in the original query Query expansion is a methodology studied in the field of
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, particularly within the realm of
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
and
information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other c ...
.


Precision and recall trade-offs

Search engines invoke query expansion to increase the quality of user search results. It is assumed that users do not always formulate search queries using the best terms. Best in this case may be because the database does not contain the user entered terms. By stemming a user-entered term, more documents are matched, as the alternate word forms for a user entered term are matched as well, increasing the total
recall Recall may refer to: * Recall (bugle call), a signal to stop * Recall (information retrieval), a statistical measure * ''ReCALL'' (journal), an academic journal about computer-assisted language learning * Recall (memory) * ''Recall'' (Overwatc ...
. This comes at the expense of reducing the precision. By expanding a search query to search for the synonyms of a user entered term, the recall is also increased at the expense of precision. This is due to the nature of the equation of how precision is calculated, in that a larger recall implicitly causes a decrease in precision, given that factors of recall are part of the denominator. It is also inferred that a larger recall negatively impacts overall search result quality, given that many users do not want more results to comb through, regardless of the precision. The goal of query expansion in this regard is by increasing recall, precision can potentially increase (rather than decrease as mathematically equated), by including in the result set pages which are more relevant (of higher quality), or at least equally relevant. Pages which would not be included in the result set, which have the potential to be more relevant to the user's desired query, are included, and without query expansion would not have, regardless of
relevance Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the second topic when considering the first. The concept of relevance is studied in many different fields, including cognitive sc ...
. At the same time, many of the current commercial search engines use word frequency ( tf-idf) to assist in ranking. By ranking the occurrences of both the user entered words and synonyms and alternate morphological forms, documents with a higher density (high frequency and close proximity) tend to migrate higher up in the search results, leading to a higher quality of the search results near the top of the results, despite the larger recall.


Query expansion methods

Automatic methods for query expansion were proposed in 1960 by Maron and Kuhns. Modern query expansion methods either imply document collection analysis (global or local) or are dictionary- or ontology-based. The global analysis of the document collection is applied for searching for relations between terms. The local analysis refers to the
relevance feedback Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not th ...
introduced by Rocchio. Rocchio proposed to judge manually some of the retrieved documents and use this feedback information to expand the query. Since collecting users' judgment can be challenging, only the first top retrieved documents are considered as relevant. This is so called pseudo-
relevance feedback Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not th ...
(PRF). Pseudo-relevance feedback is efficient in average but can damage results for some queries, especially difficult ones since the top retrieved documents are probably non-relevant. Pseudo-relevant documents are used to find expansion candidate terms that co-occur with many query terms. This idea was further developed within the relevance
language model A language model is a probability distribution over sequences of words. Given any sequence of words of length , a language model assigns a probability P(w_1,\ldots,w_m) to the whole sequence. Language models generate probabilities by training on ...
formalism in positional relevance and proximity relevance models which consider the distance to query terms in the pseudo-relevant documents. Another direction in query expansion is the application of word embeddings. An alternative to query expansion is document expansion, which reformulates the text of the documents being searched rather than the text of the query.


See also

* Document retrieval *
Information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other c ...
*
Linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Ling ...
*
Morphology (linguistics) In linguistics, morphology () is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words such as stems, root words, prefixes, and suffixes. Morp ...
*
Natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
*
Search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
* Search engine indexing * Stemming


Software libraries


QueryTermAnalyzer
open-source, C#. Machine learning based query term weight and synonym analyzer for query expansion.
LucQE
- open-source, Java. Provides a framework along with several implementations that allow to perform query expansion with the use of Apache
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
. *
Xapian Xapian is a free and open-source probabilistic information retrieval library, released under the GNU General Public License (GPL). It is a full-text search engine library for programmers. It is written in C++, with bindings to allow use from P ...
is an open-source search library which includes support for query expansion
ReQue
open-source, Python. A configurable software framework and a collection of gold standard datasets for training and evaluating supervised query expansion methods.Hossein Fani, Mahtab Tamannaee, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri; An Extensible Toolkit of Query Refinement Methods and Gold Standard Dataset Generation. In Advances in Information Retrieval: 43rd European Conference on IR Research (ECIR'21), 2021.


References


Citations


Sources

* D. Abberley, D. Kirby, S. Renals, and T. Robinson, The THISL broadcast news retrieval system. In ''Proc. ESCA ETRW Workshop Accessing Information in Spoken Audio'', (Cambridge), pp. 14–19, 1999. Section o

- Concise, mathematical overview. * R. Navigli, P. Velardi
An Analysis of Ontology-based Query Expansion Strategies
''Proc. of Workshop on Adaptive Text Extraction and Mining (ATEM 2003)'', in the ''14th European Conference on Machine Learning (ECML 2003)'', Cavtat-Dubrovnik, Croatia, September 22-26th, 2003, pp. 42–49 - An analysis of query expansion methods relying on WordNet as the reference ontology. * Y. Qiu and H.P. Frei

In ''Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval'', Pittsburgh, SIGIR Forum, ACM Press, June 1993 - Academic document on a specific method of query expansion * Efthimis N. Efthimiadis

In: Martha E. Williams (ed.), ''Annual Review of Information Systems and Technology (ARIST)'', v31, pp 121–187, 1996 - An introduction for less-technical viewers. {{DEFAULTSORT:Query Expansion Search algorithms