1.1 Index design factors 1.2 Index data structures 1.3 Challenges in parallelism 1.4 Inverted indices 1.5 Index merging 1.6 The forward index 1.7 Compression
2 Document parsing
3 See also 4 References 5 Further reading
Indexing The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval. Index design factors Major factors in designing a search engine's architecture include:
How data enters the index, or how words or subject features are added
to the index during text corpus traversal, and whether multiple
indexers can work asynchronously. The indexer must first check whether
it is updating old content or adding new content. Traversal typically
correlates to the data collection policy.
Index data structures
Figuratively structured like a tree, supports linear time lookup.
Built by storing the suffixes of words. The suffix tree is a type of
trie. Tries support extendable hashing, which is important for search
engine indexing. Used for searching for patterns in
Inverted index Stores a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
Citation index Stores citations or hyperlinks between documents to support citation analysis, a subject of Bibliometrics. Ngram index Stores sequences of length of data to support other types of retrieval or text mining. Document-term matrix Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.
Challenges in parallelism A major challenge in the design of search engines is the management of serial computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture. Inverted indices Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:
the Document 1, Document 3, Document 4, Document 5, Document 7
cow Document 2, Document 3, Document 4
says Document 5
moo Document 7
This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to helto the query. Such topics are the central research focus of information retrieval. The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two dimensional array. The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. Index merging The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index. The forward index The forward index stores a list of words for each document. The following is a simplified form of the forward index:
Document 1 the,cow,says,moo
Document 2 the,cat,and,the,hat
Document 3 the,dish,ran,away,with,the,spoon
The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables Asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index. Compression Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full text, Internet search engine.
It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character The average number of characters in any given word on a page may be estimated at 5 (Wikipedia:Size comparisons)
Given this scenario, an uncompressed index (assuming a non-conflated,
simple, index) for 2 billion web pages would need to store 500 billion
word entries. At 1 byte per character, or 5 bytes per word, this would
require 2500 gigabytes of storage space alone. This space requirement
may be even larger for a fault-tolerant distributed storage
architecture. Depending on the compression technique chosen, the index
can be reduced to a fraction of this size. The tradeoff is the time
and processing power required to perform compression and
Notably, large scale search engine designs incorporate the cost of
storage as well as the costs of electricity to power the storage. Thus
compression is a measure of cost.
Document parsing breaks apart the components (words) of a document or
other form of media for insertion into the forward and inverted
indices. The words found are called tokens, and so, in the context of
search engine indexing and natural language processing, parsing is
more commonly referred to as tokenization. It is also sometimes called
word boundary disambiguation, tagging, text segmentation, content
analysis, text analysis, text mining, concordance generation, speech
segmentation, lexing, or lexical analysis. The terms 'indexing',
'parsing', and 'tokenization' are used interchangeably in corporate
Natural language processing
Word Boundary Ambiguity Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
Diverse File Formats In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
Faulty Storage The quality of the natural language data may not always be perfect. An unspecified number of documents, particular on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
Unlike literate humans, computers do not understand the structure of a
natural language document and cannot automatically recognize words and
sentences. To a computer, a document is only a sequence of bytes.
Computers do not 'know' that a space character separates words in a
document. Instead, humans must program the computer to identify what
constitutes an individual or distinct word referred to as a token.
Such a program is commonly called a tokenizer or parser or lexer. Many
search engines, as well as other natural language processing software,
incorporate specialized programs for parsing, such as
YACC or Lex.
During tokenization, the parser identifies sequences of characters
which represent words and other elements, such as punctuation, which
are represented by numeric codes, some of which are non-printing
control characters. The parser can also identify entities such as
email addresses, phone numbers, and URLs. When identifying each token,
several characteristics may be stored, such as the token's case
(upper, lower, mixed, proper), language or encoding, lexical category
(part of speech, like 'noun' or 'verb'), position, sentence number,
sentence position, length, and line number.
Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser. Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:
ZIP - Zip archive file
RAR - Roshal ARchive file
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:
Including hundreds or thousands of words in a section which is hidden
from view on the computer screen, but visible to the indexer, by use
of formatting (e.g. hidden "div" tag in HTML, which may incorporate
the use of
Some search engines incorporate section recognition, the
identification of major parts of a document, prior to tokenization.
Not all the documents in a corpus read like a well-written book,
divided into organized chapters and pages. Many documents on the web,
such as newsletters and corporate reports, contain erroneous content
and side-sections which do not contain primary material (that which
the document is about). For example, this article displays a side menu
with links to other web pages. Some file formats, like
Content in different sections is treated as related in the index, when in reality it is not Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.
Section analysis may require the search engine to implement the
rendering logic of each document, essentially an abstract
representation of the actual document, and then index the
representation instead. For example, some content on the
This section possibly contains original research. Please improve it by verifying the claims made and adding inline citations. Statements consisting only of original research should be removed. (November 2013) (Learn how and when to remove this template message)
Indexing often has to recognize the
Controlled vocabulary Database index Full text search Information extraction Key Word in Context Selection-based search Site map Text retrieval Information literacy
^ Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed
Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo,
^ Sikos, L. F. (August 2016). "RDF-powered semantic video annotation
tools with concept mapping to Linked
R. Bayer and E. McCreight. Organization and maintenance of large
ordered indices. Acta Informatica, 173-189, 1972.
Donald E. Knuth. The Art of Computer Programming, volume 1 (3rd ed.):
fundamental algorithms, Addison Wesley Longman Publishing Co. Redwood
City, CA, 1997.
Donald E. Knuth. The art of computer programming, volume 3: (2nd ed.)
sorting and searching, Addison Wesley Longman Publishing Co. Redwood
City, CA, 1998.
Gerald Salton. Automatic text processing, Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, 1988.
Gerard Salton. Michael J. McGill, Introduction to Modern Information
Retrieval, McGraw-Hill, Inc., New York, NY, 1986.
Gerard Salton. Lesk, M.E.: Computer evaluation of indexing and text
processing. Journal of the ACM. January 1968.
Gerard Salton. The SMART Retrieval System - Experiments in Automatic
Document Processing. Prentice Hall Inc., Englewood Cliffs, 1971.
Gerard Salton. The Transformation, Analysis, and Retrieval of
Information by Computer, Addison-Wesley, Reading, Mass., 1989.
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval.
Chapter 8. ACM Press 1999.
G. K. Zipf. Human Behavior and the Principle of Least Effort.
Adelson-Velskii, G.M., Landis, E. M.: An information organization
algorithm. DANSSSR, 146, 263-266 (1962).
Edward H. Sussenguth Jr., Use of tree structures for processing files,
Communications of the ACM, v.6 n.5, p. 272-279, May 1963
Harman, D.K., et al.: Inverted files. In Information Retrieval: Data
Structures and Algorithms, Prentice-Hall, pp 28–43, 1992.
Lim, L., et al.: Characterizing Web Document Change, LNCS 2118,
Lim, L., et al.: Dynamic Maintenance of Web Indexes Using Landmarks.
Proc. of the 12th W3 Conference, 2003.
Moffat, A., Zobel, J.: Self-Indexing Inverted Files for Fast Text
Retrieval. ACM TIS, 349–379, October 1996, Volume 14, Number 4.
v t e
Web search engine
Image search Video search engine Enterprise search Semantic search
Protocols and standards
Z39.50 Search/Retrieve Web Service Search/Retrieve via URL OpenSearch Representational State Transfer Website Parse Template Wide area information server
Search engine Desktop se