A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus.
Amazon.com uses this concept in determining keywords for a given book or chapter, since keywords of a book or chapter are likely to appear disproportionately within that section.
Christian Rudder has also used this concept with data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender in his book ''Dataclysm''. SIPs two or three words long (for example adjective, adjective, noun, or adverb, adverb, verb) can signal the author's attitude, premise or conclusions to the reader, or express an important idea.
Another use of SIPs is as a detection tool for plagiarism. Unique or nearly unique combinations of words can be searched for online, and if they have appeared in a published text, the search will identify where. This method checks only texts that have been published and digitized online.
For example, if a student's submission contained the phrase "garden style, praising irregularity in design", searching for that phrase using Google.com would yield the original Wikipedia article about Sir William Temple, English political figure and essayist.
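The phrase-matching idea behind this kind of check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local dictionary of texts stands in for a web search, the function names `word_ngrams` and `find_shared_phrases` are hypothetical, and the phrase length of five words is an arbitrary choice.

```python
import re


def word_ngrams(text, n):
    """Return the set of n-word sequences in the text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_shared_phrases(submission, sources, n=5):
    """Map each source title to the n-word phrases it shares with the submission.

    `sources` is a hypothetical {title: text} dictionary standing in for
    the published, digitized texts that a search engine would cover.
    """
    submitted = word_ngrams(submission, n)
    matches = {}
    for title, text in sources.items():
        shared = submitted & word_ngrams(text, n)
        if shared:
            matches[title] = sorted(shared)
    return matches
```

A long exact match is weak evidence on its own if the phrase is common, which is why the approach works best with word combinations that are rare in general text. Like the online search it imitates, the sketch can only find matches in texts that are actually in `sources`.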
Example
While common words such as "the" appear frequently in most texts, a phrase such as "explicit Boolean algorithm" might occur much more often in a document about computers than it does in general English. Therefore, "explicit Boolean algorithm" would be considered a statistically improbable phrase in that context.
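One simple way to make this comparison concrete is to score each phrase by the ratio of its relative frequency in the document to its relative frequency in a background corpus. The Python sketch below is a minimal illustration, not the method used by any particular service: the function names, the restriction to two-word phrases, the add-one smoothing and the `min_count` threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text, split it into words, and return adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(document, background, top=5, min_count=2):
    """Rank the document's two-word phrases by log frequency ratio.

    Returns up to `top` (phrase, score) pairs, highest score first.
    A positive score means the phrase is relatively more frequent in the
    document than in the background corpus.
    """
    doc_counts = Counter(bigrams(document))
    bg_counts = Counter(bigrams(background))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for phrase, count in doc_counts.items():
        if count < min_count:
            continue  # skip phrases too rare in the document to judge
        # Add-one smoothing (an assumption of this sketch) keeps the ratio
        # finite for phrases that never occur in the background corpus.
        p_doc = (count + 1) / (doc_total + vocab)
        p_bg = (bg_counts[phrase] + 1) / (bg_total + vocab)
        scores[" ".join(phrase)] = math.log(p_doc / p_bg)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

The sketch treats the text as one unbroken word sequence, so some pairs span sentence boundaries, and it does not filter out function words such as "the". A practical system would need a much larger background corpus and would probably handle both.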
Some statistically improbable phrases of Darwin's ''On the Origin of Species'' could be: ''genera descended, transitional gradations, unknown progenitor, fossiliferous formations, closely allied forms, profitable variations, transitional grades, very distinct species'' and ''mongrel offspring''.
Sociologically Improbable Phrases, Crooked Timber, April 2005
See also
* Collocation – Any series of words that co-occur more often than would be expected by chance
* Googlewhack – A pair of words occurring on a single webpage, as indexed by Google
* tf-idf – A statistic used in information retrieval and text mining
* Complex specified information – a concept used to argue for the "intelligent design" theory