Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as

newspaper article An article or piece is a written work published in a print or electronic medium. It may be for the purpose of propagating news, research results, academic analysis, or debate. News articles A news article discusses current or recent news of eit ...

s, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words. Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of

information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...

where the information is stored primarily in the form of text. Text databases became decentralized thanks to the

personal computer A personal computer (PC) is a multi-purpose microcomputer whose size, capabilities, and price make it feasible for individual use. Personal computers are intended to be operated directly by an end user, rather than by a computer expert or tec ...

. Text retrieval is a critical area of study today, since it is the fundamental basis of all

internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...

search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...

Description

Document retrieval systems find information to given criteria by matching text records (''documents'') against user queries, as opposed to

expert system In artificial intelligence, an expert system is a computer system emulating the decision-making ability of a human expert. Expert systems are designed to solve complex problems by reasoning through bodies of knowledge, represented mainly as if� ...

s that answer questions by

inferring Inferences are steps in reasoning, moving from premises to logical consequences; etymologically, the word ''wikt:infer, infer'' means to "carry forward". Inference is theoretically traditionally divided into deductive reasoning, deduction and in ...

over a logical knowledge database. A document retrieval system consists of a database of documents, a

classification algorithm {{Commons category, Classification algorithms This category is about statistical classification algorithms. For more information, see Statistical classification. Categorical data Algorithms In mathematics and computer science, an algori ...

to build a full text index, and a user interface to access the database. A document retrieval system has two main tasks: # Find relevant documents to user queries # Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank. Internet

search engines A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...

are classical applications of document retrieval. The vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using

statistical Statistics (from German: ''Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industria ...

natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...

techniques.

Variations

There are two main classes of indexing schemata for document retrieval systems: ''form based'' (or ''word based''), and ''content based'' indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.

Form based

Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A

suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...

algorithm is an example for form based indexing.

Content based

The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm. A ''signature file'' is a technique that creates a ''quick and dirty'' filter, for example a

Bloom filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...

, that will keep all the documents that match to the query and ''hopefully'' a few ones that do not. The way this is done is by creating for each file a signature, typically a hash coded version. One method is superimposed coding. A post-processing step is done to discard the false alarms. Since in most cases this structure is inferior to

inverted file In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of do ...

s in terms of speed, size and functionality, it is not used widely. However, with proper parameters it can beat the inverted files in certain environments.

Example: PubMed

The

PubMed PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintain the ...

form interface features the "related articles" search which works through a comparison of words from the documents' title, abstract, and MeSH terms using a word-weighted algorithm.

References

External links

{{Commons category
Formal Foundation of Information Retrieval
Buckinghamshire Chilterns University College Information retrieval genres Electronic documents Substring indices Search engine software