List Of Text Corpora
Text corpora (singular: ''text corpus'') are large, structured sets of texts that have been systematically collected. They are used by corpus linguists and in other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.

English language
* American National Corpus
* Bank of English
* BookCorpus
* British National Corpus
* Bergen Corpus of London Teenage Language (COLT)
* Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB
* Corpus of Contemporary American English (COCA): 425 million words, 1990–2011; freely searchable online
* Corpus Resource Database (CoRD): more than 80 English language corpora
* Coruña Corpus: a corpus of late Modern English scientific writing covering the period 1700–1900, developed by the MuStE research group at the University of A Coruña
* DBLP Discovery Dataset (D3): a corpus o ...


Text Corpus
In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents being searched.

Overview
A corpus may contain texts in a single language (''monolingual corpus'') or text data in multiple languages (''multilingual corpus''). To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or ''POS-tagging'', in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form o ...



Ogham
Ogham (Middle Irish: ''ogum'', ''ogom'', later ''ogam'') is an Early Medieval alphabet used primarily to write the early Irish language (in the "orthodox" inscriptions, 4th to 6th centuries AD), and later the Old Irish language (scholastic ogham, 6th to 9th centuries). There are roughly 400 surviving orthodox inscriptions on stone monuments throughout Ireland and western Britain, the bulk of which are in southern Munster. The largest number outside Ireland are in Pembrokeshire, Wales. The vast majority of the inscriptions consist of personal names. According to the High Medieval ''Bríatharogam'', the names of various trees can be ascribed to individual letters. For this reason, ogham is sometimes known as the Celtic tree alphabet. The etymology of the word ''ogam'' or ''ogham'' remains unclear. One possible origin is the Irish ''og-úaim'' 'point-seam', referring to the seam made by the point of a sharp weapon.

Origins
It is generally thought that th ...


Hamshahri Corpus
The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper ''Hamshahri'', one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG). Later, a team headed by Ale Ahmad built on this corpus and created the first Persian text collection suitable for eval ...



Kanaanäische Und Aramäische Inschriften
Kanaanäische und Aramäische Inschriften (in English, Canaanite and Aramaic Inscriptions), or KAI, is the standard source for the original text of Canaanite and Aramaic inscriptions not contained in the Hebrew Bible and Old Testament. It was first published from 1960 to 1964 in three volumes by the German Orientalists Herbert Donner and Wolfgang Röllig, and has been updated in numerous subsequent editions. The work attempted to "integrate philology, palaeography and cultural history" in the commented re-editing of a selection of Canaanite and Aramaic inscriptions, using the "pertinent source material for the Phoenician, Punic, Moabite, pre-exile-Hebrew and Ancient Aramaic cultures." Röllig and Donner had the support of William F. Albright in Baltimore, James Germain Février in Paris and Giorgio Levi Della Vida in Rome during the compilation of the first edition.

Editions
The 4th edition was published between 1966 and 1969, and a 5th edition was published in 2002. However, the 5t ...



Corpus Inscriptionum Semiticarum
The Corpus Inscriptionum Semiticarum ("Corpus of Semitic Inscriptions", abbreviated CIS) is a collection of ancient inscriptions in Semitic languages produced from the end of the 2nd millennium BC until the rise of Islam. It was published in Latin. In a note recovered after his death, Ernest Renan stated: "Of all I have done, it is the Corpus I like the most." The first part was published in 1881, fourteen years after the beginning of the project. Renan justified the fourteen-year delay in the preface to the volume, pointing to the calamity of the Franco-Prussian war and the difficulties that arose in printing the Phoenician characters, whose first engraving proved incorrect in light of inscriptions discovered subsequently. A smaller collection, the Répertoire d'Épigraphie Sémitique ("Repertory of Semitic Epigraphy", abbreviated RES), was subsequently created to present the Semitic inscriptions without delay and in a deliberately concise way as they became known, and was published in French rather than Latin. The was for the ...


German Reference Corpus
The German Reference Corpus (German: Deutsches Referenzkorpus, DeReKo) is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Leibniz Institute for the German Language (IDS) in Mannheim, Germany. The corpus archive is continuously updated and expanded. As of August 2010, it comprised more than 4.0 billion word tokens and constituted the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.

Alternative names
The German Reference Corpus is often referred to by other names, such as ''Mannheim corpora'', ''IDS corpora'', ''COSMAS corpora'' and the corresponding German translations. The name ''Deutsches Referenzkorpus (DeReKo)'' was originally used for a specific portion of the current archive which was collected between 1999 and 2002 by a number of institutio ...




National Corpus Of Polish
The National Corpus of Polish (Polish: Narodowy Korpus Języka Polskiego, NKJP) is the largest and most important corpus of the Polish language. A linguistic corpus is a collection of texts in which one can find the typical uses of a word or phrase, as well as its meaning and grammatical function.

Description
The National Corpus of Polish is a shared initiative of four institutions: the Institute of Computer Science and the Institute of the Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź. It has been registered as a research and development project of the Ministry of Science and Higher Education. The intended size of the whole National Corpus of Polish is over 1 billion words, of which a 300-million-word subcorpus has been carefully balanced, and a manually annotated 1-million-word corpus has been released under an open license. The corpus is accessible online ...


Czech National Corpus
The Czech National Corpus (CNC) (Czech: Český národní korpus) is a large electronic corpus of written and spoken Czech, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students (mainly for spoken and parallel data acquisition), 270 publishers (as text providers), and other similar research projects.

Areas of focus
The Czech National Corpus focuses systematically on the following areas:
* Synchronic written corpora: the SYN-series corpora map the Czech language of the 20th and 21st centuries (especially the last twenty years) and form the core of the project. Texts are enriched with metadata, lemmatization, and morphological tagging.
* Contemporary spontaneous spoken Czech: the ORAL-series corpora contain contemporary, spontaneous spoken language used in informal situations through ...


Slovenian National Corpus
The Slovenian National Corpus FidaPLUS is a 621-million-word (token) corpus of the Slovenian language, gathered from selected texts written in Slovenian in different genres and styles, mainly from books and newspapers. The FidaPLUS database is an upgrade of the older FIDA corpus, which was developed between 1997 and 2000, with added texts published up to 2006. It was the result of an applied research project of the Faculty of Arts and the Faculty of Social Sciences, both at the University of Ljubljana, and the Jožef Stefan Institute's Department of Knowledge Technologies. The corpus is available via the corpus manager Sketch Engine. This version of the FidaPLUS corpus contains word sketches, an automatic corpus-derived overview of word's gram ...


Croatian National Corpus
The Croatian National Corpus (Croatian: Hrvatski nacionalni korpus, ''HNK'') is the largest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations and the expression of the need for a general-purpose, representative, multi-million-word corpus of Croatian started to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms. The initial composition was divided into two constituents:
# ''30-million corpus of contemporary Croatian'' (30m), which included samples from texts from 1990 on. The criteria for inclusion of text samples were: written by native speakers, different fields, genres and topics. Translated text or poetry were exclude ...




Croatian Language Corpus
The Croatian Language Corpus (CLC) (Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

Background
The CLC was initially funded as a sub-project of the research program ''Riznica'' (''Croatian Language Repository'') by the Ministry of Science, Education, and Sports of the Republic of Croatia (MZOŠ) (project no. 0212010) from May 2005. In a second development phase, since 2007, the further extension and development of the CLC has been embedded within the research program ''The Croatian Language Repository'' (CLR), which was funded by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012). Because the CLR is a research program (PI Dunja Brozović Rončević) with numerous subsumed independent research projects that make use of the CLC, the corpus is mainly developed as a by-product of those research projects. Currently Dunja Brozović Rončević and Damir Ćavar are in charge of the corpus development. ...


Macedonian Electronic Corpus
Macedonian most often refers to someone or something from or related to Macedonia. Macedonian(s) may specifically refer to:

People
Modern
* Macedonians (ethnic group), a nation and a South Slavic ethnic group primarily associated with North Macedonia
* Macedonians (Greeks), the Greek people inhabiting or originating from Macedonia, a geographic and administrative region of Greece
* Macedonian Bulgarians, the Bulgarian people from the region of Macedonia
* Macedo-Romanians, an outdated and now rarely used term for the Aromanians and Megleno-Romanians, both small Eastern Romance ethno-linguistic groups present in the region of Macedonia
* Macedonians (obsolete terminology), an outdated and now rarely used umbrella term for all the inhabitants of the region, regardless of their ethnic origin, as well as for the local Slavs and Macedo-Romanians, as regional and ethnographic communities and not as separate ethnic groups
Ancient
* Ancient Macedonians, ...