
The Croatian Language Corpus (CLC; HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).


Background

The CLC was initially funded from May 2005 as a sub-project of the research program ''Riznica'' (''Croatian Language Repository'') by the Ministry of Science, Education, and Sports of the Republic of Croatia (MZOŠ), under project no. 0212010. In a second development phase, from 2007 onward, the extension and development of the CLC was embedded in the research program ''The Croatian Language Repository'' (CLR), granted by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012). The CLR is a research program (PI Dunja Brozović Rončević) that subsumes numerous independent research projects making use of the CLC, so the corpus is developed mainly as a by-product of those projects. Dunja Brozović Rončević and Damir Ćavar are currently in charge of corpus development.


Goals

One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels: lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources for standard Croatian, several corpora from different development phases of Croatian are being created as well, including digitizations of manuscripts and Croatian dictionaries.


Format and Availability

From the outset, the collected and digitized texts in the CLC were annotated using the Text Encoding Initiative (TEI) P5 XML standard. Currently, approximately 90 million tokens are available in the TEI P5 XML format. The corpus can be accessed online via the PhiloLogic interface (see The ARTFL Project, Department of Romance Languages and Literatures, The University of Chicago). It is virtualized into various sub-corpora, and individual or specific definitions of sub-corpora can be provided on demand.
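
As a rough illustration of what working with TEI P5 XML can look like, the following Python sketch reads word tokens and their lemmas from a small TEI fragment. The sample text, the function name `tokens_with_lemmas`, and the choice of `<w>` elements with a `lemma` attribute are assumptions made for this example; the exact element and attribute inventory used in the CLC is not described here and may differ.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample fragment (not taken from the CLC): each <w> element
# holds one token, with its lemma in the "lemma" attribute.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="jezik">jezika</w>
      <w lemma="biti">je</w>
      <w lemma="korpus">korpusi</w>
    </s>
  </p></body></text>
</TEI>"""


def tokens_with_lemmas(tei_xml):
    """Return (token, lemma) pairs for every <w> element, in document order."""
    root = ET.fromstring(tei_xml)
    return [(w.text, w.get("lemma")) for w in root.iterfind(".//tei:w", TEI_NS)]


pairs = tokens_with_lemmas(SAMPLE)
for token, lemma in pairs:
    print(token, lemma)
```

Counting the returned pairs gives a token count for the fragment; the same traversal could in principle be applied file by file to a larger collection.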


Content

The CLC is assembled from selected Croatian texts covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of Croatian began to take its final shape, i.e. from the second half of the 19th century onward. The CLC consists of:
* fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
* non-fiction
* scientific publications from various domains and university textbooks
* school books
* literature translated by outstanding Croatian translators
* online journals and newspapers
* books from the pre-standardization period of Croatian, adapted to present-day standard Croatian


Cooperation

The realization of the CLC was made possible in cooperation with:
* Školska knjiga d.d.
* Croatian Academy of Sciences and Arts (HAZU)
* Stoljeća hrvatske književnosti, Matica hrvatska



External links


* Croatian Language Corpus (CLC) website and PhiloLogic interface
* ''Croatian National Corpus'', another Croatian corpus by the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb