The Czech National Corpus (CNC) (Czech: Český národní korpus) is a large electronic corpus of written and spoken Czech, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students (mainly for spoken and parallel data acquisition), 270 publishers (as text providers), and other similar research projects.

Areas of focus

The Czech National Corpus focuses systematically on the following areas:

* Synchronic written corpora: the SYN-series corpora map the Czech language of the 20th and 21st centuries (especially the last twenty years) and form the core of the project. Texts are enriched with metadata, lemmatization, and morphological tagging.
* Contemporary spontaneous spoken Czech: the ORAL-series corpora contain contemporary, spontaneous spoken language used in informal situations throughout the Czech Republic (as opposed to the prepared, broadcast, or scripted texts generally found in spoken corpora).
* Multilingual parallel corpus: InterCorp is a large corpus of Czech texts aligned at the sentence level with translations to or from more than 30 languages. The core of the corpus consists of manually aligned and proofread fiction texts.
* Diachronic corpus of Czech: the DIAKORP corpus of historical Czech includes texts from the 14th century onwards. The current focus of DIAKORP is on the 19th century. The long-term goal is to create a corpus covering the period from 1850 to the present and to interconnect its data with the SYN series.
* Specialised linguistic data: the ICNC is also involved in collecting language data for specific research purposes, including DIALEKT (dialectal speech), CzeSL (texts written by non-native learners of Czech), DEAF (Czech texts written by the deaf), and Jerome (translated and non-translated Czech).

External links

Český národní korpus

Institute of the Czech National Corpus