Stop words are the words in a stop list (or ''stoplist'' or ''negative dictionary'') which are filtered out ("stopped") before or after processing of natural language data (i.e. text) because they are deemed to have little semantic value or are otherwise insignificant for the task at hand. There is no single universal list of stop words used by all natural language processing (NLP) tools, nor any agreed-upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in information retrieval systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever".
History of stop words
A predecessor concept was used in creating some concordances. For example, the first Hebrew concordance, by Isaac Nathan ben Kalonymus, contained a one-page list of unindexed words, made up of nonsubstantive prepositions and conjunctions similar to common modern stop words.
Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept when introducing his Key Word in Context automatic indexing process. The phrase "stop word", which is not in Luhn's 1959 presentation, and the associated terms "stop list" and "stoplist" appear in the literature shortly afterward.
Although it is commonly assumed that stop lists include only the most frequent words in a language, C.J. Van Rijsbergen proposed the first standardized list that was not based on word frequency information. The "Van list" included 250 English words. Martin Porter's word stemming program, developed in the 1980s, built on the Van list, and the Porter list is now commonly used as a default stop list in a variety of software applications.
In 1990, Christopher Fox proposed the first general stop list based on empirical word frequency information derived from the Brown Corpus:
This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.
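The three steps described in the quotation (frequency cut-off, culling useful index terms, adding extra words) can be sketched in a few lines. This is a minimal illustration, not Fox's implementation: the function name `build_stop_list`, its parameters, the regular-expression tokenizer, and the toy corpus are all assumptions made for the example, and the cut-off is lowered from 300 to 1 so that the toy text produces a result.

```python
import re
from collections import Counter


def build_stop_list(text, min_count, keep_as_index_terms=(), extra_words=()):
    """Build a frequency-based stop list (hypothetical helper, not Fox's code)."""
    # Assumed tokenizer: lowercase, then take runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Step 1: start from tokens occurring more than min_count times.
    candidates = {word for word, n in counts.items() if n > min_count}
    # Step 2: cull words judged too important as potential index terms.
    candidates -= set(keep_as_index_terms)
    # Step 3: add words chosen by hand.
    candidates |= set(extra_words)
    return sorted(candidates)


toy_corpus = "the cat sat on the mat and the dog sat on the log"
stop_list = build_stop_list(
    toy_corpus,
    min_count=1,                  # the quoted paper uses 300 on the Brown corpus
    keep_as_index_terms={"sat"},  # illustrative choice of a word to keep indexable
    extra_words={"a", "an"},      # illustrative hand-picked additions
)
print(stop_list)  # ['a', 'an', 'on', 'the']
```

Which words to cull or add is an editorial judgment in the quoted procedure; the sketch only shows where those judgments enter the computation.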
In SEO terminology, stop words are the most common words that many search engines used to avoid for the purposes of saving space and time in processing of large data during crawling or indexing.
For some search engines, these are some of the most common, short function words, such as ''the'', ''is'', ''at'', ''which'', and ''on''. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words, including lexical words such as "want", from a query in order to improve performance.
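The phrase-search problem is easy to reproduce with a naive filter. The sketch below is illustrative only: the stop list contains just the five function words named above, and `strip_stop_words` is a hypothetical helper, not the behavior of any particular search engine.

```python
# The five example function words from the text; real stop lists vary widely.
STOP_WORDS = {"the", "is", "at", "which", "on"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the lowercased query terms that survive stop-word filtering."""
    return [term for term in query.lower().split() if term not in stop_words]


for query in ["The Who", "The The", "Take That"]:
    print(query, "->", strip_stop_words(query))
```

With this list, "The Who" is reduced to the single term `who` and "The The" to an empty query, so neither band name can be matched as a phrase. "Take That" survives only because neither of its words is in this five-word list; a larger list that includes ''that'' would reduce it to one term.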
In recent years the SEO best practices around stop words have evolved along with the fields of machine learning and NLP. In February 2021, John Mueller, Webmaster Trends Analyst at Google, tweeted "I wouldn't worry about stop words at all; write naturally. Search engines look at much, much more than individual words. 'To be or not to be' just is a collection of stop words, but stop words alone don't do it any justice."
See also
* Concept mining
* Filler (linguistics)
* Index (search engine)
* Information extraction
* Query expansion
* Stemming
* Text mining
External links
* Full-Text Stopwords in MySQL
* English Stop Words (CSV)
* German Stop Words and phrases another list o
* Polish Stop Words
* Collection of stop words in 29 languages