Compound-term processing, in information retrieval, is search result matching on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term.
Compound-term processing is a new approach to an old problem: how can one improve the relevance of search results while maintaining ease of use? Using this technique, a search for "survival rates following a triple heart bypass in elderly people" will locate documents about this topic even if no document contains this precise phrase. This can be done with a concept search, which itself uses compound-term processing: it extracts the key concepts automatically (in this case "survival rates", "triple heart bypass" and "elderly people") and uses them to select the most relevant documents.
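As a rough illustration of the concept-extraction step, the following minimal sketch splits a query into candidate compound terms wherever a stop word occurs. The stop-word list and the function name `extract_concepts` are assumptions made for this example; the article does not say how a concept search actually identifies key concepts, and a production system would likely rely on statistical or linguistic models rather than a fixed word list.

```python
# Hypothetical stop-word list chosen for this example only; a real system
# would use a much larger list or a statistical model instead.
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "on", "following"}


def extract_concepts(query):
    """Split a query into candidate compound terms at stop words."""
    concepts, current = [], []
    for word in query.lower().split():
        if word in STOP_WORDS:
            # A stop word closes the run of content words collected so far.
            if current:
                concepts.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        concepts.append(" ".join(current))
    return concepts


query = "survival rates following a triple heart bypass in elderly people"
print(extract_concepts(query))
# ['survival rates', 'triple heart bypass', 'elderly people']
```

The extracted terms could then be matched against documents as units rather than as independent words.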
Techniques
In August 2003, Concept Searching Limited introduced the idea of using statistical compound-term processing.
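One way a statistical approach might identify compound terms is to look for adjacent words that co-occur more often than chance would predict. The sketch below scores word pairs by pointwise mutual information (PMI). PMI is a common collocation measure substituted here for illustration; the article does not state which statistic Concept Searching Limited uses, and the function name `bigram_pmi`, the `min_count` threshold and the sample documents are all assumptions.

```python
import math
from collections import Counter


def bigram_pmi(documents, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    words, pairs = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))

    total_words = sum(words.values())
    total_pairs = sum(pairs.values())
    scores = {}
    for (w1, w2), n in pairs.items():
        if n < min_count:
            continue  # ignore pairs that are too rare to score reliably
        p_pair = n / total_pairs
        p_w1 = words[w1] / total_words
        p_w2 = words[w2] / total_words
        # PMI > 0 means the pair occurs more often than independence predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical sample corpus.
docs = [
    "heart bypass surgery was performed",
    "the heart bypass was successful",
    "recovery after heart bypass surgery",
    "the surgery was long",
]
scores = bigram_pmi(docs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Because the scores depend only on counts, nothing in the sketch is tied to a particular language, although splitting on whitespace is itself a simplification.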
CLAMOUR is a European collaborative project that aims to find a better way to classify industrial information and statistics when collecting and disseminating them. CLAMOUR appears to use a linguistic approach rather than one based on statistical modelling.
History
Techniques for probabilistic weighting of single-word terms date back to at least 1976, in the landmark publication by Stephen E. Robertson and Karen Spärck Jones. Robertson stated that the assumption of word independence is not justified and exists as a matter of mathematical convenience. His objection to term independence was not new: it dates back to at least 1964, when H. H. Williams stated that "[t]he assumption of independence of words in a document is usually made as a matter of mathematical convenience".
In 2004, Anna Lynn Patterson filed patents on "phrase-based searching in an information retrieval system", to which Google subsequently acquired the rights.
Adaptability
Statistical compound-term processing is more adaptable than the process described by Patterson. Her process is targeted at searching the World Wide Web, where extensive statistical knowledge of common searches can be used to identify candidate phrases. Statistical compound-term processing is better suited to enterprise search applications, where such a priori knowledge is not available.
Statistical compound-term processing is also more adaptable than the linguistic approach taken by the CLAMOUR project, which must consider the syntactic properties of the terms (such as part of speech, gender and number) and their combinations. CLAMOUR is highly language-dependent, whereas the statistical approach is language-independent.
Applications
Compound-term processing allows information-retrieval applications, such as search engines, to perform their matching on the basis of multi-word concepts rather than on single words in isolation, which can be highly ambiguous.
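The difference between matching on single words and matching on multi-word concepts can be sketched with two small inverted indexes. Everything here is hypothetical: the compound-term list is written by hand (a real system would identify terms automatically), the sample documents are invented, and compound terms are found by plain substring matching, which is cruder than a real indexer.

```python
from collections import defaultdict

# Hypothetical, hand-written list of compound terms for this example.
COMPOUND_TERMS = ["heart bypass", "road bypass"]


def build_indexes(documents):
    """Build a single-word index and a compound-term index."""
    word_index = defaultdict(set)
    compound_index = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for word in lowered.split():
            word_index[word].add(doc_id)
        for term in COMPOUND_TERMS:
            if term in lowered:  # simple substring match
                compound_index[term].add(doc_id)
    return word_index, compound_index


documents = {
    1: "Recovery after a heart bypass operation",
    2: "The new road bypass opens next year",
}
word_index, compound_index = build_indexes(documents)

print(sorted(word_index["bypass"]))            # [1, 2]
print(sorted(compound_index["heart bypass"]))  # [1]
```

The single word "bypass" retrieves both documents, while the compound term retrieves only the medical one.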
Early search engines looked for documents containing the words entered by the user into the search box. These are known as keyword search engines. Boolean search engines add a degree of sophistication by allowing the user to specify additional requirements. For example, "Tiger NEAR Woods AND (golf OR golfing) NOT Volkswagen" uses the operators "NEAR", "AND", "OR" and "NOT" to specify the conditions that these words must satisfy. A phrase search is simpler to use, but requires that the exact phrase specified appear in the results.
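The Boolean query and the phrase search described above can be sketched as simple predicates over a tokenised document. This is a sketch under stated assumptions: the exact meaning of "NEAR" varies between search engines, so the three-word window is arbitrary, the query is hard-coded rather than parsed, and the function names and sample texts are hypothetical.

```python
def positions(tokens, word):
    """Return every index at which `word` occurs in `tokens`."""
    return [i for i, t in enumerate(tokens) if t == word]


def near(tokens, a, b, window=3):
    """True if words a and b occur within `window` positions of each other."""
    return any(abs(i - j) <= window
               for i in positions(tokens, a)
               for j in positions(tokens, b))


def contains_phrase(tokens, phrase):
    """True if the exact word sequence `phrase` occurs in `tokens`."""
    target = phrase.lower().split()
    n = len(target)
    return any(tokens[i:i + n] == target
               for i in range(len(tokens) - n + 1))


def matches_boolean(text):
    """Hard-coded: Tiger NEAR Woods AND (golf OR golfing) NOT Volkswagen."""
    tokens = text.lower().split()
    return (near(tokens, "tiger", "woods")
            and ("golf" in tokens or "golfing" in tokens)
            and "volkswagen" not in tokens)


print(matches_boolean("Tiger Woods wins another golf title"))        # True
print(matches_boolean("Volkswagen sponsors Tiger Woods golf event")) # False
print(matches_boolean("a tiger was seen in the woods near the golf course"))  # False
print(contains_phrase("woods and tiger".split(), "tiger woods"))     # False
```

The third document fails only because "tiger" and "woods" are five positions apart, which shows how much the result depends on the user choosing operators and parameters well.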
See also
* Concept Searching Limited
* Enterprise search
* Information retrieval