Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently. Furnas et al. (1987) were perhaps the first to quantitatively study the vocabulary mismatch problem. Their results show that on average 80% of the times different people (experts in the same field) will name the same thing differently. There are usually tens of possible names that can be attributed to the same thing. This research motivated the work on

latent semantic indexing Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...

. The vocabulary mismatch between user created queries and relevant documents in a corpus causes the term mismatch problem in

information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...

. Zhao and Callan (2010)Zhao, L. and Callan, J., Term Necessity Prediction, Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010). Toronto, Canada, 2010. were perhaps the first to quantitatively study the vocabulary mismatch problem in a retrieval setting. Their results show that an average query term fails to appear in 30-40% of the documents that are relevant to the user query. They also showed that this probability of mismatch is a central probability in one of the fundamental probabilistic retrieval models, the

Binary Independence Model The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible. Defini ...

. They developed novel term weight prediction methods that can lead to potentially 50-80% accuracy gains in retrieval over strong keyword retrieval models. Further research along the line shows that expert users can use Boolean Conjunctive Normal Form expansion to improve retrieval performance by 50-300% over unexpanded keyword queries.Zhao, L. and Callan, J., Automatic term mismatch diagnosis for selective query expansion, SIGIR 2012.

Techniques that may reduce mismatch

Stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morpholog ...

* Full-text indexing instead of only indexing keywords or abstracts * Indexing text on inbound links from other documents (or other

social tagging Folksonomy is a classification system in which end users apply public tags to online items, typically to make those items easier for themselves or others to find later. Over time, this can give rise to a classification system based on those tags ...

) *

Query expansion Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves ...

. A 2012 study by Zhao and Callan using expert created manual

conjunctive normal form In Boolean logic, a formula is in conjunctive normal form (CNF) or clausal normal form if it is a conjunction of one or more clauses, where a clause is a disjunction of literals; otherwise put, it is a product of sums or an AND of ORs. As a can ...

queries has shown that searchonym expansion in the Boolean conjunctive normal form is much more effective than the traditional bag of word expansion e.g. Rocchio expansion. * Translation-based models

References

{{Reflist Linguistic research Information retrieval techniques Natural language processing