List of text corpora

Text corpora (singular: ''text corpus'') are large and structured sets of texts, which have been systematically collected. Text corpora are used by corpus linguists and within other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.


English language

* American National Corpus
* Bank of English
* BookCorpus
* British National Corpus
* Bergen Corpus of London Teenage Language (COLT)
* Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB
* Corpus of Contemporary American English (COCA), 425 million words, 1990–2011. Freely searchable online.
* Corpus Resource Database (CoRD), more than 80 English language corpora.
* Coruña Corpus, a corpus of late Modern English scientific writing covering the period 1700–1900, developed by the Muste research group at the University of A Coruña
* DBLP Discovery Dataset (D3), a corpus of computer science publications with sentient metadata.
* GUM corpus, the open source Georgetown University Multilayer corpus, with very many annotation layers
* International Corpus of English
* Oxford English Corpus
* RE3D (Relationship and Entity Extraction Evaluation Dataset)
* Santa Barbara Corpus of Spoken American English
* Scottish Corpus of Texts & Speech
* Strathy Corpus of Canadian English


European languages


* CETENFolha
* The Corpus of Electronic Texts
* Corpus Inscriptionum Insularum Celticarum (CIIC), covering Primitive Irish inscriptions in Ogham
* Google Books Ngram Corpus
* The Georgian Language Corpus
* Thesaurus Linguae Graecae (Ancient Greek)
* Eastern Armenian National Corpus (EANC), 110 million words. Freely searchable online.
* Spanish text corpus by Molino de Ideas, which contains 660 million words.
* CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words). Compiled at the University of Vilnius, Lithuania.
* Reference Corpus of Contemporary Portuguese (CRPC)
* Turkish National Corpus
* CoRoLa - The Reference Corpus of the Contemporary Romanian Language (Corpus reprezentativ al limbii române contemporane)
* TS Corpus - A large set of Turkish corpora. TS Corpus is a Free&Independent Project that aims to build Turkish corpora, NLP tools and linguistic datasets...
* MacMorpho - an annotated corpus of Brazilian Portuguese text


Slavic


East Slavic


* Belarusian N-korpus
* Russian National Corpus
* General Internet Corpus of Russian
* General regionally annotated corpus of Ukrainian
* Ukrainian Language Corpus on the Mova.info Linguistic Portal
* Ukrainian Language Corpus
* Araneum Russicum
* Russian Corpus of Biographical Texts
* RuTweetCorp
* RusAge: Corpus for Age-Based Text Classification


South Slavic

* Bulgarian National Corpus
* Macedonian Electronic Corpus
* Croatian Language Corpus
* Croatian National Corpus
* Slovenian National Corpus


West Slavic

* Czech National Corpus
* National Corpus of Polish


German

* German Reference Corpus (DeReKo), more than 4 billion words of contemporary written German.
* Free corpus of German mistakes from people with dyslexia


Middle Eastern Languages

* Corpus Inscriptionum Semiticarum
* Kanaanäische und Aramäische Inschriften
* Hamshahri Corpus (Persian)
* Persian in MULTEXT-EAST corpus (Persian)
* Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
* TEP: Tehran English-Persian Parallel Corpus
* TMC: Tehran Monolingual Corpus, standard corpus for Persian language modeling
* PTC: ''Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus'' (in Persian: ''Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz''), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp.
* Kurdish-corpus.uok.ac.ir (Kurdish-corpus Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
* Bijankhan Corpus: A Contemporary Persian Corpus for NLP researches, University of Tehran, 2012
* Neo-Assyrian Text Corpus Project
* Quranic Arabic Corpus (Classical Arabic)
* Electronic Text Corpus of Sumerian Literature
* Open Richly Annotated Cuneiform Corpus
* Asosoft text corpus, Central Kurdish (Sorani)


Devanagari


* Nepali Text Corpus (90+ million running words/6.5+ million sentences)


East Asian Languages

* Kotonoha Japanese language corpus
* LIVAC Synchronous Corpus (Chinese)


South Asian Languages


* SinMin dataset (Sinhala)


Parallel corpora of diverse languages


* Chinese/English Political Interpreting Corpus (CEPIC) consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington DC and London, as well as their translated/interpreted texts. Developed by Jun Pan and HKBU Library.
* Europarl Corpus - proceedings of the European Parliament from 1996 to 2012
* EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database
* OPUS: Open source Parallel Corpus in many languages
* Tatoeba: a parallel corpus which contains over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each; a further 81 languages have from 100 to 1,000 sentences each.
* NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie) (legacy repo)
* SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.
* Parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)
* The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together.


Comparable Corpora


* Corpus of Political Speeches contains four collections of political speeches in English and Chinese: The Corpus of U.S. Presidential Speeches (1789–2015), The Corpus of Policy Address by Hong Kong Governors (1984–1996) and Hong Kong Chief Executives (1997–2014), The Corpus of Speeches given on New Year's days and Double Tenth days by Taiwan Presidents (1978–2014), and The Corpus of Report on the Work of the Government by Premiers of the People's Republic of China (1984–2013). Developed by HKBU Library.
* WaCky - The Web-As-Corpus Kool Yinitiative, web as corpus (eng, fre, deu, ita)
* Disambiguating Similar Language Corpora Collection (DSLCC) (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)
* Wikipedia Comparable Corpora (41 million aligned Wikipedia articles for 253 language pairs)
* The TenTen Corpus Family – comparable web corpora of target size 10 billion words. These corpora are available in the corpus management system Sketch Engine. TenTen corpora currently exist for more than 30 languages (such as the English TenTen corpus, Arabic TenTen corpus, Spanish TenTen corpus and Russian TenTen corpus). An overview of existing TenTen corpora can be found at https://www.sketchengine.co.uk/documentation/tenten-corpora/
* Timestamped JSI web corpora – web corpora of news articles crawled from a list of RSS feeds. Newsfeed corpora are being prepared in the framework of a project implemented by the Jožef Stefan Institute, a Slovenian scientific research institute, and published in Sketch Engine. More information about the project is on the project websites.


L2 (English) Corpora

* Cambridge Learner Corpus
* Corpus of Academic Written and Spoken English (CAWSE), a collection of Chinese students’ English language samples in academic settings. Freely downloadable online.
* English as a Lingua Franca in Academic Settings (ELFA), an academic ELF corpus.
* International Corpus of Learner English (ICLE), a corpus of learner written English.
* Louvain International Database of Spoken English Interlanguage (LINDSEI), a corpus of learner spoken English.
* Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.
* University of Pittsburgh English Language Institute Corpus (PELIC)
* Vienna-Oxford International Corpus of English (VOICE), an ELF corpus.


See also

* Ancient text corpora