Russian National Corpus

The Russian National Corpus (Russian: Национальный корпус русского языка, "National Corpus of the Russian Language") is a corpus of the Russian language that has been partially accessible through an online query interface since April 29, 2004. It is being created by the Institute of Russian Language, Russian Academy of Sciences.

It currently contains more than 1 billion word forms that are automatically lemmatized and POS-/grammeme-tagged, i.e. all the possible morphological analyses for each orthographic form are ascribed to it. Lemmata, POS, grammatical items, and their combinations are searchable. Additionally, 6 million word forms are in a subcorpus with manually resolved homonymy. This subcorpus with resolved morphological homonymy is also automatically accentuated. The whole corpus has searchable tagging for lexical semantics (LS), including morphosemantic POS subclasses (proper noun, reflexive pronoun etc.), LS characteristics proper (thematic class, causativity, evaluation), and derivation (diminutive, adverb formed from adjective etc.).

The RNC also includes the following subcorpora:

* a treebank of syntactic dependencies (largely based on Igor Mel'čuk's Meaning-Text Theory);
* English⇔Russian, German⇒Russian, Ukrainian⇔Russian and Belarusian⇔Russian parallel corpora;
* a large (100+ million words) separate corpus of modern newspapers (2001–2011);
* a corpus of Russian poetry, in which rhyming words and poetic prosody (including meter, stanzas etc.) are additionally tagged;
* a corpus of Russian dialects with specific dialect grammar tagging;
* a multimedia corpus with searchable tagged fragments of Russian-language movies;
* a corpus showing the history of Russian stress;
* an educational subcorpus reflecting school standards.

All texts carry tags with metatextual information: the author, the author's birth date, the creation date, the text size, and the text genre (general fiction, detective story, newspaper article etc.). All these categories can be browsed and searched separately. Users can also define their own subcorpus, so that combinations of lemmata, POS/grammeme tags and semantic tags are searched only within that subset.
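The difference between the automatically tagged main corpus, where every possible analysis is attached to a word form, and the subcorpus with resolved homonymy can be illustrated with a small sketch. This is not the RNC's data format or query interface: the lexicon, the tag names, and the functions `tag` and `search` are hypothetical placeholders chosen only to show how unresolved homonymy affects a lemma or grammeme search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible morphological reading of a word form."""
    lemma: str
    pos: str
    grammemes: frozenset


# Hypothetical toy lexicon; the tag names are illustrative placeholders.
TOY_LEXICON = {
    # "стали" is ambiguous: a form of the noun "сталь" (steel)
    # or the past plural of the verb "стать" (to become).
    "стали": [
        Analysis("сталь", "S", frozenset({"f", "sg", "gen"})),
        Analysis("стать", "V", frozenset({"praet", "pl"})),
    ],
    "мы": [Analysis("мы", "SPRO", frozenset({"pl", "nom"}))],
    "друзьями": [Analysis("друг", "S", frozenset({"m", "pl", "ins"}))],
}


def tag(tokens):
    """Attach every known analysis to each token, without disambiguation."""
    return [(tok, TOY_LEXICON.get(tok.lower(), [])) for tok in tokens]


def search(tagged, lemma=None, pos=None, grammemes=()):
    """Return positions where at least one analysis matches all criteria."""
    wanted = frozenset(grammemes)
    hits = []
    for i, (_form, analyses) in enumerate(tagged):
        for a in analyses:
            if lemma is not None and a.lemma != lemma:
                continue
            if pos is not None and a.pos != pos:
                continue
            if not wanted <= a.grammemes:
                continue
            hits.append(i)
            break  # one matching reading is enough for a hit
    return hits


# "Мы стали друзьями" ("We became friends")
tagged = tag(["Мы", "стали", "друзьями"])

print(search(tagged, pos="V", grammemes={"pl"}))  # [1]
print(search(tagged, lemma="сталь"))              # [1]
```

In this sketch the query for the lemma "сталь" also matches the verb form, because both readings stay attached to "стали". A manually disambiguated subcorpus would keep only the verb reading for this sentence, so the same query would return no hit.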


See also

* General Internet Corpus of Russian

