
Stop words are the words in a stop list (or "stoplist" or "negative dictionary") which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are considered insignificant. There is no single universal list of stop words used by all natural language processing tools, nor any agreed-upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in information retrieval systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever".
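
As a minimal sketch of the filtering step described above, the following Python snippet removes stop words from a tokenized sentence. The stop list, the tokenizer, and the function names are all hypothetical choices made for illustration; as noted, no standard list exists and a real application would pick its own.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are chosen per task.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on", "a", "of"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("The cat is on the mat, which is at the door.")
filtered = remove_stop_words(tokens)
print(filtered)
```

Storing the list in a set keeps each membership test cheap, and passing it as a parameter reflects the point that the choice of stop words depends on the purpose.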


History of stop words

A predecessor concept was used in creating some concordances. For example, the first Hebrew concordance, Isaac Nathan ben Kalonymus's Me’ir Nativ, contained a one-page list of unindexed words, with nonsubstantive prepositions and conjunctions which are similar to modern stop words. Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept when introducing his Keyword-in-Context automatic indexing process. The phrase "stop word", which is not in Luhn's 1959 presentation, and the associated terms "stop list" and "stoplist" appear in the literature shortly afterward. Although it is commonly assumed that stoplists include only the most frequent words in a language, it was C. J. Van Rijsbergen who proposed the first standardized list which was not based on word frequency information. The "Van list" included 250 English words. Martin Porter's word stemming program, developed in the 1980s, built on the Van list, and the Porter list is now commonly used as a default stoplist in a variety of software applications. In 1990, Christopher Fox proposed the first general stop list based on empirical word frequency information derived from the Brown Corpus:
This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.
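
The procedure in the quoted abstract (threshold on frequency, cull useful index terms, add extra words) can be sketched roughly as follows. This is an illustration under stated assumptions, not Fox's actual implementation: the function name, its parameters, and the toy corpus are hypothetical, and the culled and added words in a real list were chosen by human judgement.

```python
from collections import Counter


def build_stop_list(tokens, min_count, culled=(), added=()):
    """Frequency-based stop list: keep tokens occurring more than
    min_count times, drop the manually culled ones, add the extras."""
    counts = Counter(tokens)
    frequent = {word for word, n in counts.items() if n > min_count}
    return (frequent - set(culled)) | set(added)


# Toy corpus standing in for the Brown Corpus (hypothetical data).
toy = "the cat and the dog and the bird saw the time".split()
stop_list = build_stop_list(toy, min_count=1, culled={"and"}, added={"of"})
print(sorted(stop_list))

# Sanity check of the figures quoted above: 278 frequent tokens,
# 32 culled, then 26 and 149 added, giving the final 421 words.
assert 278 - 32 + 26 + 149 == 421
```

With the Brown Corpus, the threshold in the abstract corresponds to `min_count=300`.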
In SEO terminology, stop words are the most common words that many search engines used to skip in order to save space and time when processing large volumes of data during crawling or indexing. For some search engines, these are some of the most common, short function words, such as "the", "is", "at", "which", and "on". In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words, including lexical words such as "want", from a query in order to improve performance. In recent years the SEO best practices around stop words have evolved along with the fields of machine learning and natural language processing. In February 2021, John Mueller, Webmaster Trends Analyst at Google, tweeted: "I wouldn't worry about stop words at all; write naturally. Search engines look at much, much more than individual words. 'To be or not to be' just is a collection of stop words, but stop words alone don't do it any justice."
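
The phrase-search problem mentioned above can be illustrated with a short sketch. The stop list here is hypothetical: it takes the five function words named in the text and adds "who" and "that" as an assumption, to show how a query made up of a band name can lose most or all of its terms.

```python
# Hypothetical stop list: the function words named in the text,
# plus "who" and "that" (an assumption made for this illustration).
STOP_WORDS = frozenset({"the", "is", "at", "which", "on", "who", "that"})


def strip_stop_words(query):
    """Return the lowercased query terms that survive stop word removal."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


for query in ("The Who", "The The", "Take That"):
    print(query, "->", strip_stop_words(query))
```

Under this list, two of the three queries are reduced to no terms at all, so an engine that filters naively has nothing left to match against.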


See also

* Concept mining
* Filler (linguistics)
* Function words
* Index (search engine)
* Information extraction
* Natural language processing
* Query expansion
* Stemming
* Text mining



External links


* List of English Stop Words (PHP array, CSV)
* English Stop Words (CSV)
* German Stop Words and phrases; another list o
* Polish Stop Words
* Collection of stop words in 29 languages (archive)

