The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10¹⁰) words per language, which gave rise to the corpus family's name. In the creation of the TenTen corpora, data crawled from the World Wide Web are processed with natural language processing tools developed by the Natural Language Processing Centre at the Faculty of Informatics at Masaryk University (Brno, Czech Republic) and by the Lexical Computing company (developer of the Sketch Engine).

Corpus linguistics

In corpus linguistics, a text corpus is a large and structured collection of texts that are electronically stored and processed. It is used for testing hypotheses about languages, validating linguistic rules, and studying the frequency distribution of words (n-grams) within languages. Electronically processed corpora provide fast search. Text processing procedures such as tokenization, part-of-speech tagging and word-sense disambiguation enrich corpus texts with detailed linguistic information. This makes it possible to narrow the search to particular parts of speech, word sequences or a specific part of the corpus. The first text corpora were created in the 1960s, such as the 1-million-word Brown Corpus of American English. Over time, many further corpora were produced (such as the British National Corpus and the LOB Corpus), and work also began on larger corpora and on corpora covering languages other than English. This development was linked with the emergence of corpus creation tools that help achieve larger size, wider coverage and cleaner data.
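As a rough illustration of what a frequency distribution of n-grams looks like, the following minimal sketch counts unigrams and bigrams in a tiny sample text. The tokenizer here is a deliberately naive assumption (lowercasing, whitespace splitting, punctuation stripping) and the function names are hypothetical; real corpus tools use language-specific tokenizers.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, split on whitespace, strip punctuation."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"'()")
        if token:
            tokens.append(token)
    return tokens


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# Most frequent bigrams first, e.g. for inspecting common word sequences.
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

The same counting scales to large corpora only with indexing and streaming, which is what a corpus manager provides.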

Production of TenTen corpora

The procedure by which TenTen corpora are produced is based on the creators' earlier research in preparing web corpora and their subsequent processing. First, a huge amount of text data is downloaded from the World Wide Web by the dedicated SpiderLing web crawler. In a later stage, these texts undergo cleaning with the jusText tool, which removes non-textual material such as navigation links, headers and footers from the HTML source code of web pages, so that only full sentences are preserved. Finally, the ONION tool is applied to remove duplicate text portions from the corpus, which naturally occur on the World Wide Web due to practices such as quoting, citing and copying.
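The cleaning and de-duplication stages can be pictured with a simplified sketch. This is not how jusText or ONION are actually implemented: the cleaning rule (a minimum word count plus sentence-final punctuation), the word n-gram overlap test, the threshold values and all function names below are assumptions chosen only to illustrate the general idea.

```python
def looks_like_sentence_text(paragraph, min_words=5):
    """Hypothetical cleaning heuristic: keep paragraphs that are long enough
    and end in sentence-final punctuation; drop menu items and similar."""
    words = paragraph.split()
    return len(words) >= min_words and paragraph.rstrip().endswith((".", "!", "?"))


def shingles(paragraph, n=3):
    """Return the set of word n-grams in a paragraph."""
    words = paragraph.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=3, max_seen_ratio=0.5):
    """Drop a paragraph when more than max_seen_ratio of its n-grams
    already occurred in previously kept paragraphs."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        if not grams:
            continue  # shorter than n words: nothing to compare
        seen_ratio = len(grams & seen) / len(grams)
        if seen_ratio <= max_seen_ratio:
            kept.append(paragraph)
            seen |= grams
    return kept


raw_paragraphs = [
    "Home | About | Contact",
    "The corpus was crawled from the web in several stages.",
    "Click here",
    "The corpus was crawled from the web in several stages.",
    "Duplicate passages are common because pages quote each other.",
]
cleaned = [p for p in raw_paragraphs if looks_like_sentence_text(p)]
unique = deduplicate(cleaned)
```

In this toy input, the navigation line and the "Click here" fragment are dropped by the cleaning step, and the repeated sentence is dropped by the de-duplication step.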

TenTen corpora data structure

TenTen corpora follow a specific metadata structure that is common to all of them. Metadata is contained in structural attributes that relate to individual documents and paragraphs in the corpus. Some TenTen corpora can feature additional specific attributes.

Document attributes

* top-level domain – domain at the highest level of the hierarchical Domain Name System (e.g. "com")
* website – identification string defining a realm of administrative autonomy within the Internet (e.g. "wikipedia.org")
* web domain – collection of related web pages (e.g. "la.wikipedia.org")
* crawl date – date when the document was downloaded from the Web
* url – the Uniform Resource Locator referring to the document's source
* wordcount – number of words in the document
* length – classification of the document into a range by its length measured in thousands of words
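To make the document attributes more concrete, here is a minimal sketch of how some of them could be derived from a URL and a document's text. The class and function names, the field names and the sample crawl date are hypothetical, and counting words by whitespace splitting is an assumption. The "website" attribute is left out because deriving it reliably needs knowledge of public domain suffixes, and "length" is left out because the ranges used are not stated here.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class DocumentAttributes:
    url: str          # source URL of the document
    tld: str          # top-level domain, e.g. "org"
    web_domain: str   # full host name, e.g. "la.wikipedia.org"
    crawl_date: str   # date the document was downloaded
    wordcount: int    # number of whitespace-separated words (an assumption)


def describe_document(url, text, crawl_date):
    """Derive illustrative document attributes from a URL and its text."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return DocumentAttributes(
        url=url,
        tld=tld,
        web_domain=host,
        crawl_date=crawl_date,
        wordcount=len(text.split()),
    )


doc = describe_document(
    "https://la.wikipedia.org/wiki/Lingua_Latina",
    "Lingua Latina est lingua Indoeuropaea",
    "2018-10-01",  # hypothetical crawl date
)
```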

Paragraph attributes

* heading – a numeric attribute distinguishing headers and similar titles from ordinary body text (1 if the paragraph is a heading, 0 otherwise)
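A numeric flag of this kind makes it straightforward to separate headings from body text when processing a document. The sketch below assumes a hypothetical in-memory `Paragraph` record; it does not reflect the storage format actually used by the corpora.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    heading: int  # 1 if the paragraph is a heading, 0 otherwise


paragraphs = [
    Paragraph("Production of TenTen corpora", heading=1),
    Paragraph("Text data is downloaded from the Web and then cleaned.", heading=0),
    Paragraph("Available TenTen corpora", heading=1),
]

# Split the document into ordinary body text and headings.
body_text = [p.text for p in paragraphs if p.heading == 0]
headings = [p.text for p in paragraphs if p.heading == 1]
```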

Available TenTen corpora

The following corpora can be accessed through the Sketch Engine as of October 2018:

External links

TenTen Corpus Family (at the Sketch Engine website)