In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques appeared in the 1960s, for example IBM STAIRS from 1969, and became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as the former AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.

Indexing

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning". This is what some tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive".
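The indexing and search stages described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine works: the stop list, the three-entry stem table, and the function names `build_index` and `search` are all assumptions made for the example, and a real indexer would use a proper language-specific stemmer.

```python
import re
from collections import defaultdict

# Illustrative stop list and stem table (assumptions for this sketch);
# real systems use much larger, language-specific resources.
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each indexed term to {doc_id: [positions]}.

    Positions count every token, including stop words, so the relative
    position of a term within the original text is preserved.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue  # stop words get no index entry
            term = STEMS.get(word, word)
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Return the sorted ids of documents containing the (stemmed) word.

    Only the index is consulted, never the original document text.
    """
    term = STEMS.get(word.lower(), word.lower())
    return sorted(index.get(term, {}))


docs = {
    1: "She drives the car to work",
    2: "He drove and she walked",
    3: "The road was driven in the rain",
}
index = build_index(docs)
print(search(index, "drove"))  # prints [1, 2, 3]
```

Because all three verb forms are stored under "drive", a query for any one of them finds all three documents, which is the recall benefit of stemming.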

The precision vs. recall tradeoff

Recall measures the quantity of relevant results returned by a search, while precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned to all relevant results. Precision is the ratio of the number of relevant results returned to the total number of results returned.

The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only 1 relevant result of 3 possible relevant results was returned, so the recall is a very low ratio of 1/3, or 33%. The precision for the example is a very low 1/4, or 25%, since only 1 of the 4 results returned was relevant.

Due to the ambiguities of natural language, full-text-search systems typically include options like filtering to increase precision and stemming to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision.
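The two ratios can be computed directly from the set of returned results and the set of relevant results. The sketch below reproduces the numbers from the example (1 of 3 relevant results returned, 1 of 4 returned results relevant); the document ids and the function name `precision_recall` are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for sets of returned and relevant ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids mirroring the example: 3 relevant documents exist,
# 4 documents are returned, and only 1 returned document is relevant.
relevant = {"d1", "d2", "d3"}
returned = {"d1", "d7", "d8", "d9"}
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints precision=0.25 recall=0.33
```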

False-positive problem

Full-text searching is likely to retrieve many documents that are not relevant to the ''intended'' search question. Such documents are called ''false positives'' (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram to the right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the document/data universe into "financial institution", "place to sit", "place to store", etc. Depending on the occurrences of words relevant to the categories, search terms or a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.
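As a rough illustration of placing a result into a category based on the words it contains, the sketch below uses a small multinomial naive Bayes classifier. This is a supervised stand-in chosen for brevity, not the clustering method the text refers to, and it assigns exactly one category rather than several. The seed snippets, category labels, and function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-labelled seed snippets for three senses of "bank".
SEEDS = {
    "financial institution": [
        "the bank approved the loan and opened an account",
        "deposit money at the bank to earn interest",
    ],
    "place to sit": [
        "we sat on the river bank and watched the water",
        "the grassy bank beside the stream was a quiet seat",
    ],
    "place to store": [
        "the blood bank stores donations for hospitals",
        "a seed bank stores samples in cold storage",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(seeds):
    """Count word occurrences per category and collect the vocabulary."""
    counts = defaultdict(Counter)
    for category, snippets in seeds.items():
        for snippet in snippets:
            counts[category].update(tokenize(snippet))
    vocabulary = {word for c in counts.values() for word in c}
    return counts, vocabulary


def classify(text, counts, vocabulary):
    """Return the category with the highest log-probability score.

    Uses a uniform prior over categories and add-one (Laplace) smoothing.
    Words outside the training vocabulary are ignored.
    """
    scores = {}
    for category, word_counts in counts.items():
        denominator = sum(word_counts.values()) + len(vocabulary)
        score = 0.0
        for word in tokenize(text):
            if word in vocabulary:
                score += math.log((word_counts[word] + 1) / denominator)
        scores[category] = score
    return max(scores, key=scores.get)


counts, vocabulary = train(SEEDS)
print(classify("the bank charged interest on the loan", counts, vocabulary))
# prints financial institution
```

A search interface could use such labels to group results for "bank" by sense, letting the user discard the groups that are false positives for the intended question.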

Performance improvements

The deficiencies of full-text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.

Improved querying tools

* Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
* Field-restricted search. Some search engines enable users to limit full-text searches to a particular field within a stored data record, such as "Title" or "Author".
* Boolean queries. Searches that use Boolean operators can dramatically increase the precision of a full-text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the search retrieves too few documents, the OR operator can be used to increase recall; for example, adding "Internet" as an OR alternative to "online" will also retrieve documents about online encyclopedias that use the term "Internet" instead of "online". The increase in precision from AND and NOT is very commonly counter-productive, since it usually comes with a dramatic loss of recall.
* Phrase search. A phrase search matches only those documents that contain a specified phrase.
* Concept search. A search that is based on multi-word concepts, for example compound term processing. This type of search is becoming popular in many e-discovery solutions.
* Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text, with their immediate context.
* Proximity search. A proximity search matches only those documents that contain two or more words separated by no more than a specified number of words; with a distance of two, for example, it would retrieve only those documents in which the words occur within two words of each other.
* Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
* Fuzzy search. A fuzzy search matches documents that contain the given terms or some variation of them (using, for instance, edit distance to limit how much variation is allowed).
* Wildcard search. A search that substitutes one or more characters in a search query with a wildcard character such as an asterisk. For example, using the asterisk in the search query "s*n" will find "sin", "son", "sun", etc. in a text.
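Several of the querying tools listed above can be sketched over a small positional index. The following is a minimal illustration under stated assumptions: the three-document corpus and all function names are hypothetical, "within N words" is interpreted as a position difference of at most N, and wildcard matching borrows shell-style patterns from Python's `fnmatch` module rather than any search engine's query syntax.

```python
import fnmatch
import re
from collections import defaultdict

# Hypothetical three-document corpus used only for illustration.
DOCS = {
    1: "the free online encyclopedia",
    2: "an internet encyclopedia that is free to read",
    3: "the sun set while her son sang",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


# Positional index: term -> {doc_id: [positions]}.
INDEX = defaultdict(dict)
for doc_id, text in DOCS.items():
    for pos, term in enumerate(tokenize(text)):
        INDEX[term].setdefault(doc_id, []).append(pos)


def docs_with(term):
    """Set of document ids containing the term (basis for AND/OR/NOT)."""
    return set(INDEX.get(term, {}))


def within(first, second, max_gap):
    """Proximity search: ids where the terms are at most max_gap apart."""
    hits = set()
    for d in docs_with(first) & docs_with(second):
        if any(abs(a - b) <= max_gap
               for a in INDEX[first][d] for b in INDEX[second][d]):
            hits.add(d)
    return hits


def phrase(words):
    """Phrase search: ids where the words occur consecutively, in order."""
    candidates = set.intersection(*(docs_with(w) for w in words))
    return {
        d for d in candidates
        if any(all(start + i in INDEX[w][d] for i, w in enumerate(words))
               for start in INDEX[words[0]][d])
    }


def wildcard(pattern):
    """Wildcard search: indexed terms matching a pattern such as 's*n'."""
    return sorted(fnmatch.filter(INDEX.keys(), pattern))


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def fuzzy(term, max_distance=1):
    """Fuzzy search: indexed terms within max_distance edits of the term."""
    return sorted(t for t in INDEX if edit_distance(term, t) <= max_distance)


# Boolean operators map onto set operations over the posting sets.
both = docs_with("encyclopedia") & docs_with("online")            # AND
either = docs_with("encyclopedia") & (docs_with("online")
                                      | docs_with("internet"))    # OR
without = docs_with("encyclopedia") - docs_with("online")         # NOT
print(sorted(both), sorted(either), sorted(without))
print(wildcard("s*n"), fuzzy("encyclopdia"))
```

In this corpus the AND query matches only document 1, while widening "online" with OR "internet" also brings in document 2, which shows the recall gain described in the Boolean bullet.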

Improved search algorithms

The PageRank algorithm developed by Google gives more prominence to documents to which other Web pages have linked. See Search engine for additional examples.
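The idea that links from other pages raise a document's prominence can be illustrated with a simplified, textbook-style power iteration. This is a sketch under stated assumptions, not Google's production system: the damping factor of 0.85, the iteration count, the four-page link graph, and the function name `pagerank` are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key. Pages without outgoing links
    spread their score evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Score held by pages with no outgoing links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not links[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Each page passes an equal share of its score to every link target.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "hub".
web = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # prints hub
```

Pages "b" and "c" receive no incoming links, so they keep only the small baseline score, while "hub" ranks highest because the most pages link to it.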

Software

The following is a partial list of available software products whose predominant purpose is to perform full-text indexing and searching. Some of these are accompanied by detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full-text search may be accomplished.

Free and open source software

* Apache Lucene
* Apache Solr
* ArangoSearch
* BaseX
* KinoSearch
* Lemur/Indri
* MariaDB
* mnoGoSearch
* MySQL
* OpenSearch
* PostgreSQL
* Searchdaimon
* Sphinx
* Swish-e
* Terrier IR Platform
* Xapian

Proprietary software

* Algolia
* Autonomy Corporation
* Azure Search
* Bar Ilan Responsa Project
* Basis database
* Brainware
* BRS/Search
* Concept Searching Limited
* Dieselpoint
* dtSearch
* Elasticsearch
* Endeca
* Exalead
* Fast Search & Transfer
* Inktomi
* Lucid Imagination
* MarkLogic
* SAP HANA
* Swiftype
* Thunderstone Software LLC.
* Vivísimo