
The Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of ''n''-grams found in printed sources published between 1500 and 2022 in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish.
There are also some specialized English corpora, such as American English, British English, and English Fiction.
The program can search for a word or a phrase, including misspellings or gibberish. The ''n''-grams are matched with the text within the selected corpus and, if found in 40 or more books, are then displayed as a graph. The Google Books Ngram Viewer supports searches for parts of speech and wildcards. It is routinely used in research.
History
During development, Google teamed up with two Harvard researchers, Jean-Baptiste Michel and Erez Lieberman Aiden, and quietly released the program on December 16, 2010.
Before the release, it was difficult to quantify the rate of linguistic change because no database had been designed for this purpose, said Steven Pinker, a well-known linguist who was one of the co-authors of the ''Science'' paper published on the same day. The Google Books Ngram Viewer was developed in the hope of opening a new window to quantitative research in the humanities, and from the very beginning the publicly available database contained 500 billion words from 5.2 million books.
The intended audience was scholarly, but the Google Books Ngram Viewer made it easy for anyone with a computer to see a graph representing the diachronic change in the use of words and phrases. Lieberman said in response to ''The New York Times'' that the developers aimed to provide even children with the ability to browse cultural trends throughout history. In the ''Science'' paper, Lieberman and his collaborators called the method of high-volume data analysis in digitized texts "culturomics".
Usage
Commas delimit user-entered search terms, where each comma-separated term is searched in the database as an ''n''-gram (for example, "nursery school" is a 2-gram or bigram). The Ngram Viewer then returns a plotted line chart. Note that due to limitations on the size of the Ngram database, only matches found in at least 40 books are indexed.
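The comma-delimited query format and the 40-book threshold can be illustrated with a minimal Python sketch. The function names `parse_query` and `is_indexed` are hypothetical and are not part of any Google API. The sketch also assumes that words are separated by whitespace, which is a simplification of how the real corpus is tokenized.

```python
MIN_BOOKS = 40  # indexing threshold stated in the text above


def parse_query(query: str) -> list[tuple[str, int]]:
    """Split a comma-separated query into (term, n) pairs.

    n is the number of whitespace-separated words in the term,
    so "nursery school" is reported as a 2-gram.
    """
    terms = (part.strip() for part in query.split(","))
    return [(term, len(term.split())) for term in terms if term]


def is_indexed(book_count: int, min_books: int = MIN_BOOKS) -> bool:
    """Return True if an n-gram appears in enough books to be indexed."""
    return book_count >= min_books


if __name__ == "__main__":
    for term, n in parse_query("nursery school, kindergarten, child care"):
        print(f"{term!r} is a {n}-gram")
    # Hypothetical book counts on either side of the threshold.
    print(is_indexed(39), is_indexed(40))
```

Under these assumptions, a term found in 39 books would not be plotted at all, while one found in 40 books would.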
Limitations
The data sets of the Ngram Viewer have been criticized for their reliance upon inaccurate optical character recognition (OCR) and for including large numbers of incorrectly dated and categorized texts. Because of these errors, and because the corpora are uncontrolled for bias (such as the increasing amount of scientific literature, which causes other terms to appear to decline in popularity), care must be taken in using them to study language or test theories. Furthermore, the data sets may not reflect general linguistic or cultural change and can only hint at such an effect because, to avoid potential copyright infringement, they do not include metadata such as publication date, author, length, or genre.
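The composition bias mentioned above can be made concrete with a small arithmetic sketch in Python. It assumes that a term's charted frequency is its yearly count divided by the total number of tokens for that year. All numbers are invented for illustration, and `relative_frequency` is a hypothetical helper.

```python
def relative_frequency(term_count: int, total_tokens: int) -> float:
    """Share of a year's tokens accounted for by one term."""
    return term_count / total_tokens


# Invented figures: the term is used exactly as often in both years.
TERM_COUNT = 1_000
GENERAL_TOKENS = 1_000_000   # non-scientific text, same size in both years
SCIENCE_TOKENS_LATE = 1_000_000  # scientific literature added in the later year

early = relative_frequency(TERM_COUNT, GENERAL_TOKENS)
late = relative_frequency(TERM_COUNT, GENERAL_TOKENS + SCIENCE_TOKENS_LATE)

if __name__ == "__main__":
    print(f"early share: {early:.6f}")
    print(f"late share:  {late:.6f}")
```

In this toy case the term's share of the corpus halves even though its absolute use is unchanged, which is why an apparent decline in a chart does not by itself show a decline in usage.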
Systemic errors such as the confusion of ''s'' and ''f'' in pre-19th-century texts (due to the use of ''ſ'', the long ''s'', which is similar in appearance to ''f'') can cause systemic bias. Although the Google Books team claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that frequencies given for languages such as Chinese may only be accurate from 1970 onward. Earlier parts of the corpus show no results at all for common terms, and data for some years contain more than 50% noise.
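One way to gauge the effect of the long ''s'' problem is to look up the likely misread spelling alongside the correct one. The Python sketch below is a rough illustration only. It assumes that every lowercase ''s'' except a word-final one was printed as ''ſ'' and read by OCR as ''f'', which simplifies the historical typesetting rules. The helper names and the sample counts are hypothetical.

```python
def long_s_misreading(word: str) -> str:
    """Return the spelling OCR would produce under the simplified rule:
    every lowercase 's' except a word-final one is read as 'f'."""
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last


def combined_count(counts: dict[str, int], word: str) -> int:
    """Add the count of the misread variant to the count of the word."""
    variant = long_s_misreading(word)
    total = counts.get(word, 0)
    if variant != word:  # avoid counting the same spelling twice
        total += counts.get(variant, 0)
    return total


if __name__ == "__main__":
    sample = {"best": 10, "beft": 5}  # invented counts
    print(long_s_misreading("best"))
    print(combined_count(sample, "best"))
```

The variant can collide with a genuine word (under this rule "case" becomes "cafe"), so combined counts of this kind are only a rough check and not a correction.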
Guidelines for doing research with data from Google Ngram have been proposed that try to address some of the issues discussed above.
See also
* Google Trends
* Lexical analysis