A concept search (or conceptual search) is an automated information retrieval method that is used to search electronically stored unstructured text (for example, digital archives, email, and scientific literature) for information that is conceptually similar to the information provided in a search query. In other words, the ''ideas'' expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Development

Concept search techniques were developed because of limitations imposed by classical Boolean keyword search technologies when dealing with large, unstructured digital collections of text. Keyword searches often return results that include many non-relevant items (false positives) or that exclude too many relevant items (false negatives) because of the effects of synonymy and polysemy. Synonymy means that two or more words in the same language have the same meaning, and polysemy means that many individual words have more than one meaning.

Polysemy is a major obstacle for all computer systems that attempt to deal with human language. In English, the most frequently used terms have several common meanings. For example, the word fire can mean a combustion activity, to terminate employment, to launch, or to excite (as in fire up). For the 200 most-polysemous terms in English, the typical verb has more than twelve common meanings, or senses. The typical noun from this set has more than eight common senses. For the 2000 most-polysemous terms in English, the typical verb has more than eight common senses and the typical noun has more than five.

In addition to the problems of polysemy and synonymy, keyword searches can miss inadvertently misspelled words as well as variations on the stems (or roots) of words (for example, strike vs. striking). Keyword searches are also susceptible to errors introduced by optical character recognition (OCR) scanning processes, which can introduce random errors into the text of documents (often referred to as noisy text).

A concept search can overcome these challenges by employing word sense disambiguation (WSD) and other techniques to help it derive the actual meanings of the words, and their underlying concepts, rather than simply matching character strings as keyword search technologies do.
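The failure modes described above can be illustrated with a minimal sketch of exact-token Boolean matching. The function name `boolean_and_search` and the three sample documents are hypothetical and chosen only for illustration; they do not come from any particular search engine.

```python
def boolean_and_search(query, documents):
    """Return ids of documents that contain every query term as an exact token."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]


# Hypothetical mini-collection; the searcher wants documents about ending employment.
documents = {
    "d1": "the manager decided to fire the employee",
    "d2": "the company will dismiss two workers",
    "d3": "a fire broke out in the warehouse",
}

hits = boolean_and_search("fire", documents)
print(hits)
```

Here `d3` matches only because of polysemy (a false positive for this information need), while `d2` is missed because it uses a synonym (a false negative).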

Approaches

In general, information retrieval research and technology can be divided into two broad categories: semantic and statistical. Information retrieval systems that fall into the semantic category will attempt to implement some degree of syntactic and semantic analysis of the natural language text that a human user would provide (also see computational linguistics). Systems that fall into the statistical category will find results based on statistical measures of how closely they match the query. However, systems in the semantic category also often rely on statistical methods to help them find and retrieve information.

Efforts to provide information retrieval systems with semantic processing capabilities have basically used three approaches:
* Auxiliary structures
* Local co-occurrence statistics
* Transform techniques (particularly matrix decompositions)

Auxiliary structures

A variety of techniques based on artificial intelligence (AI) and natural language processing (NLP) have been applied to semantic processing, and most of them have relied on the use of auxiliary structures such as controlled vocabularies and ontologies. Controlled vocabularies (dictionaries and thesauri) and ontologies allow broader terms, narrower terms, and related terms to be incorporated into queries. Controlled vocabularies are one way to overcome some of the most severe constraints of Boolean keyword queries. Over the years, additional auxiliary structures of general interest, such as the large synonym sets of WordNet, have been constructed. It was shown that concept search that is based on auxiliary structures, such as WordNet, can be efficiently implemented by reusing retrieval models and data structures of classical information retrieval. Later approaches have implemented grammar to expand the range of semantic constructs. The creation of data models that represent sets of concepts within a specific domain (''domain ontologies''), and which can incorporate the relationships among terms, has also been implemented in recent years.

Handcrafted controlled vocabularies contribute to the efficiency and comprehensiveness of information retrieval and related text analysis operations, but they work best when topics are narrowly defined and the terminology is standardized. Controlled vocabularies require extensive human input and oversight to keep up with the rapid evolution of language. They are also not well suited to the growing volumes of unstructured text covering an unlimited number of topics and containing thousands of unique terms, because new terms and topics need to be constantly introduced. Controlled vocabularies are also prone to capturing a particular worldview at a specific point in time, which makes them difficult to modify if concepts in a certain topic area change (Bradford, R. B., ''Why LSI? Latent Semantic Indexing and Information Retrieval'', White Paper, Content Analyst Company, LLC, 2008).
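As a rough sketch of how a controlled vocabulary can feed broader, narrower, and related terms into a query, the example below uses a tiny hand-built thesaurus. The `THESAURUS` dictionary, its entries, and the `expand_query` function are all hypothetical; a real system would load a maintained resource such as WordNet or a domain ontology instead.

```python
# Hypothetical hand-built thesaurus: each term maps to broader, narrower,
# and related terms. The entries are illustrative only.
THESAURUS = {
    "car": {"broader": ["vehicle"], "narrower": ["sedan"], "related": ["automobile"]},
    "fire": {"broader": [], "narrower": ["wildfire"], "related": ["blaze"]},
}


def expand_query(terms, relations=("related",)):
    """Return the query terms plus thesaurus terms for the chosen relations."""
    expanded = []
    for term in terms:
        candidates = [term]
        entry = THESAURUS.get(term.lower(), {})
        for relation in relations:
            candidates.extend(entry.get(relation, []))
        # Preserve order and drop duplicates.
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "repair"]))
print(expand_query(["car"], relations=("broader", "narrower")))
```

Terms missing from the thesaurus (here, "repair") pass through unchanged, which also shows the maintenance burden noted above: coverage is only as good as the handcrafted entries.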

Local co-occurrence statistics

Information retrieval systems incorporating this approach count the number of times that groups of terms appear together (co-occur) within a sliding window of terms or sentences (for example, ± 5 sentences or ± 50 words) within a document. The approach is based on the idea that words that occur together in similar contexts have similar meanings. It is local in the sense that the sliding window of terms and sentences used to determine the co-occurrence of terms is relatively small.

This approach is simple, but it captures only a small portion of the semantic information contained in a collection of text. At the most basic level, numerous experiments have shown that only approximately a quarter of the information contained in text is local in nature. In addition, to be most effective, this method requires prior knowledge about the content of the text, which can be difficult to obtain with large, unstructured document collections.
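The counting step described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a window measured in tokens; the function name `cooccurrence_counts` and the sample sentence are hypothetical, and the window is kept far smaller than the ± 50 words mentioned above so the result is easy to inspect.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered term pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


tokens = "the river bank flooded the bank of the river".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts.most_common(3))
```

Pairs that never fall inside the same window, such as "flooded" and "of" here, receive no count at all, which is one way to see why purely local statistics capture only part of the semantic information in a collection.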

Transform techniques

Some of the most powerful approaches to semantic processing are based on the use of mathematical transform techniques. Matrix decomposition techniques have been the most successful. Some widely used matrix decomposition techniques include the following:
* Independent component analysis
* Semi-discrete decomposition
* Non-negative matrix factorization
* Singular value decomposition

Matrix decomposition techniques are data-driven, which avoids many of the drawbacks associated with auxiliary structures. They are also global in nature, which means they are capable of much more robust information extraction and representation of semantic information than techniques based on local co-occurrence statistics.

Independent component analysis is a technique that creates sparse representations in an automated fashion, and the semi-discrete and non-negative matrix approaches sacrifice accuracy of representation in order to reduce computational complexity.

Singular value decomposition (SVD) was first applied to text at Bell Labs in the late 1980s. It was used as the foundation for a technique called latent semantic indexing (LSI) because of its ability to find the semantic meaning that is latent in a collection of text. At first, the SVD was slow to be adopted because of the resource requirements needed to work with large datasets. However, the use of LSI has significantly expanded in recent years as earlier challenges in scalability and performance have been overcome, and open-source implementations such as Gensim are now available. LSI is being used in a variety of information retrieval and text processing applications, although its primary application has been for concept searching and automated document categorization.
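For readers who want the linear algebra behind LSI, the conventional formulation can be written out as below. The notation is an assumption introduced here, not taken from the text: $A$ is an $m \times n$ term-document matrix, $k$ is the number of retained dimensions, $q$ is a query expressed as a term vector, and $\hat{d}_j$ is the $j$-th row of $V_k$.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\mathsf{T}} q, \qquad
\operatorname{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Keeping only the $k$ largest singular values is what makes the representation global: every document contributes to the reduced space in which queries and documents are then compared by cosine similarity.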

Uses

* eDiscovery – Concept-based search technologies are increasingly being used for Electronic Document Discovery (EDD or eDiscovery) to help enterprises prepare for litigation. In eDiscovery, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis is much more efficient than traditional linear review techniques. Concept-based searching is becoming accepted as a reliable and efficient search method that is more likely to produce relevant results than keyword or Boolean searches.
* Enterprise Search and Enterprise Content Management (ECM) – Concept search technologies are being widely used in enterprise search. As the volume of information within the enterprise grows, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis has become essential. In 2004 the Gartner Group estimated that professionals spend 30 percent of their time searching, retrieving, and managing information (Laplanche, R., Delgado, J., Turck, M., ''Concept Search Technology Goes Beyond Keywords'', Information Outlook, July 2004). The research company IDC found that a 2,000-employee corporation can save up to $30 million per year by reducing the time employees spend trying to find information and duplicating existing documents.
* Content-based image retrieval (CBIR) – Content-based approaches are being used for the semantic retrieval of digitized images and video from large visual corpora. One of the earliest content-based image retrieval systems to address the semantic problem was the ImageScape search engine. In this system, the user could make direct queries for multiple visual objects such as sky, trees, water, etc. using spatially positioned icons in a WWW index containing more than ten million images and videos using keyframes. The system used information theory to determine the best features for minimizing uncertainty in the classification (Lew, M. S., Sebe, N., Djeraba, C., Jain, R., ''Content-based Multimedia Information Retrieval: State of the Art and Challenges'', ACM Transactions on Multimedia Computing, Communications, and Applications, February 2006). The semantic gap is often mentioned in regard to CBIR. The semantic gap refers to the gap between the information that can be extracted from visual data and the interpretation that the same data have for a user in a given situation. The ACM SIGMM Workshop on Multimedia Information Retrieval is dedicated to studies of CBIR.
* Multimedia and publishing – Concept search is used by the multimedia and publishing industries to provide users with access to news, technical information, and subject matter expertise coming from a variety of unstructured sources. Content-based methods for multimedia information retrieval (MIR) have become especially important when text annotations are missing or incomplete.
* Digital libraries and archives – Images, videos, music, and text items in digital libraries and digital archives are being made accessible to large groups of users (especially on the Web) through the use of concept search techniques. For example, the Executive Daily Brief (EDB), a business information monitoring and alerting product developed by EBSCO Publishing, uses concept search technology to provide corporate end users with access to a digital library containing a wide array of business content. In a similar manner, the Music Genome Project spawned Pandora, which employs concept searching to spontaneously create individual music libraries or ''virtual'' radio stations.
* Genomic Information Retrieval (GIR) – Genomic Information Retrieval uses concept search techniques applied to genomic literature databases to overcome the ambiguities of scientific literature.
* Human resources staffing and recruiting – Many human resources staffing and recruiting organizations have adopted concept search technologies to return candidate resumes that are more accurate and relevant than loosely related keyword results.

Effective searching

The effectiveness of a concept search can depend on a variety of elements including the dataset being searched and the search engine that is used to process queries and display results. However, most concept search engines work best for certain kinds of queries:
* Effective queries are composed of enough text to adequately convey the intended concepts. Effective queries may include full sentences, paragraphs, or even entire documents. Queries composed of just a few words are not as likely to return the most relevant results.
* Effective queries do not include concepts that are not the object of the search. Including too many unrelated concepts in a query can negatively affect the relevancy of the result items. For example, searching for information about ''boating on the Mississippi River'' would be more likely to return relevant results than a search for ''boating on the Mississippi River on a rainy day in the middle of the summer in 1967.''
* Effective queries are expressed in a full-text, natural language style similar to that of the documents being searched. For example, using queries composed of excerpts from an introductory science textbook would not be as effective for concept searching if the dataset being searched is made up of advanced, college-level science texts. Substantial queries that better represent the overall concepts, styles, and language of the items for which the query is being conducted are generally more effective.

As with all search strategies, experienced searchers generally refine their queries through multiple searches, starting with an initial ''seed'' query to obtain conceptually relevant results that can then be used to compose and/or refine additional queries for increasingly more relevant results. Depending on the search engine, using query concepts found in result documents can be as easy as selecting a document and performing a ''find similar'' function. Changing a query by adding terms and concepts to improve result relevance is called ''query expansion''. The use of ontologies such as WordNet has been studied to expand queries with conceptually-related words.

Relevance feedback

Relevance feedback is a feature that helps users determine if the results returned for their queries meet their information needs. In other words, relevance is assessed relative to an information need, not a query. A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query (Manning, C. D., Raghavan, P., Schütze, H., ''Introduction to Information Retrieval'', Cambridge University Press, 2008). It is a way to involve users in the retrieval process in order to improve the final result set. Users can refine their queries based on their initial results to improve the quality of their final results.

In general, concept search relevance refers to the degree of similarity between the concepts expressed in the query and the concepts contained in the results returned for the query. The more similar the concepts in the results are to the concepts contained in the query, the more relevant the results are considered to be. Results are usually ranked and sorted by relevance so that the most relevant results are at the top of the list of results and the least relevant results are at the bottom of the list.

Relevance feedback has been shown to be very effective at improving the relevance of results. A concept search decreases the risk of missing important result items because all of the items that are related to the concepts in the query will be returned whether or not they contain the same words used in the query.
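One common way to turn user judgments into a refined query is the classic Rocchio update, shown here as a minimal sketch. The text above does not name a specific algorithm, so Rocchio is a swapped-in illustration; the function name `rocchio`, the sparse dictionary representation, and the default weights `alpha`, `beta`, and `gamma` are assumptions chosen for the example.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Shift a term-weight query vector toward relevant and away from non-relevant documents.

    All vectors are sparse dicts mapping term -> weight.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    def centroid(docs, term):
        # Mean weight of `term` across `docs`; 0.0 when no documents were judged.
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    updated = {}
    for term in terms:
        weight = (
            alpha * query.get(term, 0.0)
            + beta * centroid(relevant, term)
            - gamma * centroid(non_relevant, term)
        )
        if weight > 0:  # negative weights are conventionally clipped to zero
            updated[term] = weight
    return updated


# Hypothetical judgments: the user wants the animal, not the car brand.
new_query = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    non_relevant=[{"jaguar": 1.0, "car": 1.0}],
)
print(sorted(new_query))
```

The refined query gains the term "cat" from the document judged relevant and does not pick up "car", which mirrors the idea that relevance is judged against the information need and not the literal query words.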

Ranking will continue to be a part of any modern information retrieval system. However, the problems of heterogeneous data, scale, and non-traditional discourse types reflected in the text, along with the fact that search engines will increasingly be integrated components of complex information management processes, not just stand-alone systems, will require new kinds of system responses to a query. For example, one of the problems with ranked lists is that they might not reveal relations that exist among some of the result items (Callan, J., Allan, J., Clarke, C. L. A., Dumais, S., Evans, D. A., Sanderson, M., Zhai, C., ''Meeting of the MINDS: An Information Retrieval Research Agenda'', ACM SIGIR Forum, Vol. 41 No. 2, December 2007).

Guidelines for evaluating a concept search engine

# Result items should be relevant to the information need expressed by the concepts contained in the query statements, even if the terminology used by the result items is different from the terminology used in the query.
# Result items should be sorted and ranked by relevance.
# Relevant result items should be quickly located and displayed. Even complex queries should return relevant results fairly quickly.
# Query length should be ''non-fixed'', i.e., a query can be as long as deemed necessary. A sentence, a paragraph, or even an entire document can be submitted as a query.
# A concept query should not require any special or complex syntax. The concepts contained in the query can be clearly and prominently expressed without using any special rules.
# Combined queries using concepts, keywords, and metadata should be allowed.
# Relevant portions of result items should be usable as query text simply by selecting the item and telling the search engine to ''find similar'' items.
# Query-ready indexes should be created relatively quickly.
# The search engine should be capable of performing federated searches. Federated searching enables concept queries to be used for simultaneously searching multiple datasources for information, which is then merged, sorted, and displayed in the results.
# A concept search should not be affected by misspelled words, typographical errors, or OCR scanning errors in either the query text or in the text of the data set being searched.

Conferences and forums

Formalized search engine evaluation has been ongoing for many years. For example, the Text REtrieval Conference (TREC) was started in 1992 to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. Most of today's commercial search engines include technology first developed in TREC (Croft, B., Metzler, D., Strohman, T., ''Search Engines: Information Retrieval in Practice'', Addison Wesley, 2009).

In 1997, a Japanese counterpart of TREC was launched, called National Institute of Informatics Test Collection for IR Systems (NTCIR). NTCIR conducts a series of evaluation workshops for research in information retrieval, question answering, automatic summarization, etc. A European series of workshops called the Cross-Language Evaluation Forum (CLEF) was started in 2001 to aid research in multilingual information access. In 2002, the Initiative for the Evaluation of XML Retrieval (INEX) was established for the evaluation of content-oriented XML retrieval systems.

Precision and recall have been two of the traditional performance measures for evaluating information retrieval systems. Precision is the fraction of the retrieved result documents that are relevant to the user's information need. Recall is the fraction of relevant documents in the entire collection that are returned as result documents.

Although the workshops and publicly available test collections used for search engine testing and evaluation have provided substantial insights into how information is managed and retrieved, the field has only scratched the surface of the challenges people and organizations face in finding, managing, and using information now that so much information is available. Scientific data about how people use the information tools available to them today is still incomplete because experimental research methodologies have not been able to keep up with the rapid pace of change. Many challenges, such as contextualized search, personal information management, information integration, and task support, still need to be addressed.
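The precision and recall definitions given in this section translate directly into a few lines of code. The sketch below assumes documents are identified by hashable ids and that a complete set of relevance judgments is available; the function name `precision_recall` and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids against the relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant documents exist in the collection.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found, so a system can trade one measure against the other by returning more or fewer results.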

References

{{Reflist}}

External links

Text Retrieval Conference (TREC)
NIST

National Institute of Informatics, Tokyo
Cross-Language Evaluation Forum (CLEF)

University of Duisburg-Essen
INEX (Initiative for the Evaluation of XML Retrieval)
University of Duisburg (archived 2007)