The Croatian National Corpus (Croatian: Hrvatski nacionalni korpus, ''HNK'') is the biggest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić.
The theoretical foundations of a general-purpose, representative and multi-million corpus of Croatian, and expressions of the need for one, had started to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The initial composition was divided into two constituents:
# ''30-million corpus of contemporary Croatian'' (30m), which included samples from texts dating from 1990 onward. The criteria for inclusion were that texts be written by native speakers and cover different fields, genres and topics. Translated texts and poetry were excluded.
# ''Croatian Electronic Text Archive'' (HETA), which included complete texts, particularly serial publications (volumes, series, editions etc.) that would have unbalanced the 30m had they been inserted there.

In 2004, with the adoption of the third-generation corpus concept, the two-constituent structure was abandoned in favor of several subcorpora and a larger overall size. Since 2005 HNK has comprised 105 million tokens and is composed of a number of subcorpora, which can be searched individually or together as a whole corpus.

In 2004 HNK also migrated to a new server platform, the Manatee/Bonito client-server architecture. Searching the HNK (currently still offered with free test access) requires the free client program Bonito. The author of this corpus manager is Pavel Rychlý of the Natural Language Processing Laboratory of the Faculty of Informatics, Masaryk University, in Brno, Czech Republic. Its interface supports complex and elaborate queries over the corpus, different types of statistical results, total or partial word lists built according to different query criteria (with their frequencies), frequency distributions of types, automatic collocation detection, etc.

The latest version of the corpus (version 3) has 216.8 million tokens. Online search is available through the web interface Bonito 2, which is part of NoSketch Engine, a limited version of the Sketch Engine software.

References

External links

* Free online search
* Croatian National Corpus website
* ''Hrvatska jezična riznica'', another online Croatian corpus, by the Institute of Croatian Language and Linguistics
