Russian National Corpus

The Russian National Corpus (Russian: Национальный корпус русского языка, "National Corpus of the Russian Language") is a corpus of the Russian language that has been partially accessible through an online query interface since April 29, 2004. It is being created by the Institute of Russian Language, Russian Academy of Sciences.

It currently contains more than 1 billion word forms that are automatically lemmatized and POS-/grammeme-tagged, i.e. all the possible morphological analyses for each orthographic form are ascribed to it. Lemmata, POS, grammatical items, and their combinations are searchable. Additionally, 6 million word forms are in the subcorpus with manually resolved homonymy. The subcorpus with resolved morphological homonymy is also automatically accentuated. The whole corpus has searchable tagging for lexical semantics (LS), including morphosemantic POS subclasses (proper noun, reflexive pronoun, etc.), LS characteristics proper (thematic class, causativity, evaluation), and derivation (diminutive, adverb formed from adjective, etc.).

The RNC also includes the following subcorpora:
* a treebank of syntactic dependencies (largely based on Igor Mel'čuk's Meaning-Text Theory);
* English⇔Russian, German⇒Russian, Ukrainian⇔Russian and Belarusian⇔Russian parallel corpora;
* a large (100+ million words) separate corpus of modern newspapers (2001–2011);
* a corpus of Russian poetry, where the rhyming words and poetic prosody (including meter, stanzas, etc.) are additionally tagged;
* a corpus of Russian dialects with specific dialect grammar tagging;
* a multimedia corpus with searchable tagged fragments of Russian-language movies;
* a corpus showing the history of Russian stress;
* an educational subcorpus reflecting school standards.

All the texts carry tags with metatextual information: the author, the author's birth date, the creation date, the text size, and the text genre (general fiction, detective story, newspaper article, etc.). All these categories are browsable and searchable separately. Users can define their own subcorpus and search for combinations of lemmata, POS/grammeme tags, and semantic tags only within that subset.
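The two ideas described above, ambiguous tagging (every possible analysis is kept for each word form) and searching within a user-defined subcorpus, can be illustrated with a minimal Python sketch. This is not the RNC's implementation, query language, or tagset: the class names, the tag labels (NOUN, VERB, nom, past, ...), the authors, and the dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str              # hypothetical POS label, not the RNC tagset
    grammemes: frozenset  # hypothetical grammeme labels


@dataclass
class Token:
    form: str
    analyses: list        # all possible readings; homonymy is NOT resolved


@dataclass
class Text:
    author: str           # metatextual information (hypothetical values)
    created: int
    genre: str
    tokens: list


def matches(token, lemma=None, pos=None, grammemes=()):
    """True if at least one analysis of the token meets every constraint."""
    required = frozenset(grammemes)
    return any(
        (lemma is None or a.lemma == lemma)
        and (pos is None or a.pos == pos)
        and required <= a.grammemes
        for a in token.analyses
    )


def search(texts, keep=lambda text: True, **query):
    """Return (author, form) pairs for matching tokens.

    `keep` is a predicate on text metadata; it plays the role of a
    user-defined subcorpus.
    """
    return [
        (text.author, tok.form)
        for text in texts if keep(text)
        for tok in text.tokens if matches(tok, **query)
    ]


def A(lemma, pos, *grammemes):
    """Shorthand constructor for an Analysis."""
    return Analysis(lemma, pos, frozenset(grammemes))


# The forms "стекло" and "стекла" are ambiguous between the noun
# "стекло" (glass) and past-tense forms of the verb "стечь" (to flow down).
texts = [
    Text("Author A", 1950, "fiction", [
        Token("стекло", [
            A("стекло", "NOUN", "nom", "sg", "neut"),
            A("стекло", "NOUN", "acc", "sg", "neut"),
            A("стечь", "VERB", "past", "sg", "neut"),
        ]),
    ]),
    Text("Author B", 2005, "newspaper", [
        Token("стекла", [
            A("стекло", "NOUN", "gen", "sg", "neut"),
            A("стекло", "NOUN", "nom", "pl", "neut"),
            A("стечь", "VERB", "past", "sg", "fem"),
        ]),
    ]),
]

# Search by lemma across the whole toy corpus.
print(search(texts, lemma="стечь"))
# Search by POS plus grammeme, restricted to a fiction-only subcorpus.
print(search(texts, keep=lambda t: t.genre == "fiction",
             pos="NOUN", grammemes=["nom"]))
```

Because homonymy is left unresolved in this sketch, a query for the verb "стечь" also returns tokens that a human reader might interpret as the noun; a manually disambiguated subcorpus would keep only one analysis per token and avoid such matches.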


See also

* General Internet Corpus of Russian

