In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.

Indexing

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning". This is what some tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive".
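The indexing and search stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular engine works: the stop list, the `STEMS` lookup table (a stand-in for a real language-specific stemmer), and the sample documents are all invented for the example.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stop list; real stop lists vary by system.
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}

# Assumed lookup table standing in for a real language-specific stemmer.
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each stemmed term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue  # stop words are not indexed
            term = STEMS.get(word, word)
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Answer a one-word query from the index alone, without rescanning text."""
    term = STEMS.get(word.lower(), word.lower())
    return sorted(index.get(term, {}))


# Hypothetical sample documents.
documents = {
    1: "She drives the car to work",
    2: "He drove and she was driven",
    3: "The car is parked",
}
index = build_index(documents)
print(search(index, "driven"))  # documents indexed under "drive"
print(index["car"])             # word positions of "car" per document
```

Because the query is stemmed the same way as the indexed words, a search for "driven" finds documents that contain only "drives" or "drove". Storing positions costs extra space but makes phrase and proximity queries possible later.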

The precision vs. recall tradeoff

Recall measures the quantity of relevant results returned by a search, while precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned to all relevant results. Precision is the ratio of the number of relevant results returned to the total number of results returned.

The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only 1 relevant result of 3 possible relevant results was returned, so the recall is a very low ratio of 1/3, or 33%. The precision for the example is a very low 1/4, or 25%, since only 1 of the 4 results returned was relevant.

Due to the ambiguities of natural language, full-text-search systems typically include options such as stop words to increase precision and stemming to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision.
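The two ratios can be checked against the worked example with a short Python sketch. The document identifiers below are hypothetical; they only reproduce the counts from the example (3 relevant results in total, 4 results returned, 1 of them relevant).

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned results."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: "g" for relevant (green), "r" for irrelevant (red).
relevant = {"g1", "g2", "g3"}
returned = {"g1", "r1", "r2", "r3"}

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.25 recall=0.33"
```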

False-positive problem

Full-text searching is likely to retrieve many documents that are not relevant to the ''intended'' search question. Such documents are called ''false positives'' (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the document/data universe into "financial institution", "place to sit", "place to store" etc. Depending on the occurrences of words relevant to the categories, search terms or a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.
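The idea of placing a result into categories according to the occurrences of category-related words can be illustrated with a simple counting sketch. Note that this is plain occurrence counting, not a Bayesian clustering algorithm, and that the cue words for each category are hypothetical and hard-coded purely for illustration.

```python
import re

# Hypothetical cue words for the three categories named in the text; a real
# system would learn such associations from data rather than hard-code them.
CATEGORY_CUES = {
    "financial institution": {"money", "loan", "account", "deposit"},
    "place to sit": {"river", "shore", "grass", "picnic"},
    "place to store": {"blood", "data", "seed", "memory"},
}


def categorize(text, min_hits=1):
    """Return every category whose cue words occur at least min_hits times."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(word in cues for word in words)
        for category, cues in CATEGORY_CUES.items()
    }
    return [category for category, score in scores.items() if score >= min_hits]


print(categorize("The bank approved the loan and opened an account"))
print(categorize("We sat on the bank of the river"))
print(categorize("The blood bank keeps an account of every deposit"))
```

The third sentence lands in two categories at once, which mirrors the point that a result can be placed in one or more categories. A search interface could then let the user keep only the intended sense of "bank".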

Performance improvements

The deficiencies of full-text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.

Improved querying tools

* Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
* Field-restricted search. Some search engines enable users to limit full-text searches to a particular field within a stored data record, such as "Title" or "Author."
* Boolean queries. Searches that use Boolean operators (for example, AND, OR, and NOT) can dramatically increase the precision of a full-text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the search retrieves too few documents, the OR operator can be used to increase recall; consider, for example, a query that accepts either "online" or "Internet" together with "encyclopedia". Such a search will retrieve documents about online encyclopedias that use the term "Internet" instead of "online." The increase in precision gained from AND and NOT is very commonly counter-productive, since it usually comes with a dramatic loss of recall.
* Phrase search. A phrase search matches only those documents that contain a specified phrase.
* Concept search. A search that is based on multi-word concepts, for example compound term processing. This type of search is becoming popular in many e-discovery solutions.
* Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
* Proximity search. A proximity search matches only those documents that contain two or more words that are separated by no more than a specified number of words; with a limit of two, for example, a search would retrieve only those documents in which the words occur within two words of each other.
* Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
* Fuzzy search. A fuzzy search looks for documents that match the given terms and some variation around them (using, for instance, edit distance to limit the amount of variation).
* Wildcard search. A search that substitutes one or more characters in a search query for a wildcard character such as an asterisk. For example, using the asterisk in the search query "s*n" will find "sin", "son", "sun", etc. in a text.
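Several of the querying tools listed above can be sketched over a small in-memory corpus. The following Python example is illustrative only: the corpus, the function names, and the choice to measure proximity as a difference in word positions are assumptions made for the sketch, not the syntax or behaviour of any particular search engine.

```python
import re
from fnmatch import fnmatchcase

DOCS = {  # hypothetical mini-corpus
    1: "the free online encyclopedia",
    2: "an encyclopedia on the internet",
    3: "the sun and the sin of the son",
    4: "online shopping guide",
}
TOKENS = {d: re.findall(r"[a-z]+", text.lower()) for d, text in DOCS.items()}
VOCABULARY = {word for words in TOKENS.values() for word in words}


def docs_with(term):
    """Set of documents containing the term; combine sets for Boolean queries."""
    return {d for d, words in TOKENS.items() if term in words}


def phrase(words_in_order):
    """Documents containing the given words consecutively and in order."""
    n = len(words_in_order)
    return {
        d for d, words in TOKENS.items()
        if any(words[i:i + n] == words_in_order for i in range(len(words) - n + 1))
    }


def within(first, second, max_gap):
    """Documents where the two terms occur at most max_gap positions apart."""
    hits = set()
    for d, words in TOKENS.items():
        pos_a = [i for i, w in enumerate(words) if w == first]
        pos_b = [i for i, w in enumerate(words) if w == second]
        if any(abs(a - b) <= max_gap for a in pos_a for b in pos_b):
            hits.add(d)
    return hits


def wildcard(pattern):
    """Vocabulary words matching a shell-style pattern such as 's*n'."""
    return sorted(w for w in VOCABULARY if fnmatchcase(w, pattern))


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def fuzzy(term, max_distance=1):
    """Vocabulary words within max_distance edits of the term."""
    return sorted(w for w in VOCABULARY if edit_distance(term, w) <= max_distance)


# AND narrows the result, OR widens it, NOT (set difference) excludes.
print(docs_with("encyclopedia") & docs_with("online"))
print(docs_with("encyclopedia") & (docs_with("online") | docs_with("internet")))
print(docs_with("online") - docs_with("encyclopedia"))
print(phrase(["online", "encyclopedia"]), within("free", "encyclopedia", 2))
print(wildcard("s*n"), fuzzy("encyclopdia"))
```

Set intersection, union, and difference map directly onto AND, OR, and NOT, which makes the precision and recall effects easy to see: the AND query returns one document, while adding the OR alternative returns two.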

Improved search algorithms

The PageRank algorithm developed by Google gives more prominence to documents to which other Web pages have linked. See Search engine for additional examples.
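The effect of link-based ranking can be illustrated with a simplified PageRank-style iteration. This is a textbook-style sketch, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank-style scores by repeatedly redistributing rank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to its link targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web in which every other page links to "home".
links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
scores = pagerank(links)
best = max(scores, key=scores.get)
print(best)  # prints "home", the page with the most incoming links
```

In a full-text search engine, a score of this kind would typically be combined with how well the document text matches the query, so that heavily linked pages rise among otherwise similar matches.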

Software

The following is a partial list of available software products whose predominant purpose is to perform full-text indexing and searching. Some of these are accompanied by detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full-text search may be accomplished.

Free and open source software

* Apache Lucene
* Apache Solr
* ArangoSearch
* BaseX
* KinoSearch
* Lemur/Indri
* mnoGoSearch
* OpenSearch
* PostgreSQL
* Searchdaimon
* Sphinx
* Swish-e
* Terrier IR Platform
* Xapian

Proprietary software

* Algolia
* Autonomy Corporation
* Azure Search
* Bar Ilan Responsa Project
* Basis database
* Brainware
* BRS/Search
* Concept Searching Limited
* Dieselpoint
* dtSearch
* Elasticsearch
* Endeca
* Exalead
* Fast Search & Transfer
* Inktomi
* Lucid Imagination
* MarkLogic
* SAP HANA
* Swiftype
* Thunderstone Software LLC.
* Vivísimo