GICR

The General Internet Corpus of Russian (GICR) is a corpus of Russian internet texts that has been accessible on request through an online query interface since 2013. It includes extensive text material from the blogosphere, social networks, major news sources and literary magazines.


Goals of the project

GICR is an educational and scientific project. Independent researchers and research groups use its materials to address many tasks in computational linguistics. While other Russian corpus projects focus on fiction and edited texts, the General Internet Corpus gives linguists a timely opportunity to study the language as it is actually used, with all its slang and regional peculiarities.

The corpus supports research in the following areas:
* Linguistic research of a wide range: dialectology, word distribution, the language of social networks, the influence of gender, age and other factors on language, the frequency of words, fixed expressions and constructions, stylistic features of texts from different segments of the Internet, etc.
* Social media analysis
* Corpus-based machine learning for evaluating automatic tagging

At various times, student papers and independent studies based on the project material have been carried out by students, graduates and employees of MSU, MIPT, Russian State University for the Humanities, Novosibirsk State University, Higher School of Economics, Russian Academy of Sciences, SFU, CSU, SGMP and IAAS of MSU.

Scientific project leaders:
* Belikov V. - RSUH, Moscow, Russia
* Selegey V. - RSUH, ABBYY, Moscow, Russia
* Sharoff S. - RSUH, Moscow, Russia; University of Leeds, UK

Organizations involved in supporting GICR:
* Russian State University for the Humanities
* ABBYY Company
* Moscow Institute of Physics and Technology
* Skolkovo Institute of Science and Technology


Size and content of the corpus

As of summer 2016, the corpus contains 19.8 billion tokens, of which 49% are from VKontakte, 40% from LiveJournal, another 4% from Mail.ru Blogs and News, and 2% from Russian Magazine Hall.
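
To give a sense of scale, the shares quoted above can be turned into approximate token counts. The following is only a rough sketch: the total and the percentages are taken from the text, while the function and variable names are illustrative. The listed shares add up to 95%, and the text does not itemize the rest.

```python
# Approximate token counts per source, derived from the shares quoted in the text.
TOTAL_TOKENS = 19.8e9  # corpus size reported for summer 2016

# Shares as stated in the text; they sum to 95%, the remainder is not itemized there.
SHARES = {
    "VKontakte": 0.49,
    "LiveJournal": 0.40,
    "Mail.ru Blogs and News": 0.04,
    "Russian Magazine Hall": 0.02,
}


def tokens_per_source(total, shares):
    """Return estimated token counts per source, in billions (3 decimals)."""
    return {name: round(total * share / 1e9, 3) for name, share in shares.items()}


estimates = tokens_per_source(TOTAL_TOKENS, SHARES)
unlisted_share = round(1.0 - sum(SHARES.values()), 2)

for name, billions in estimates.items():
    print(f"{name}: ~{billions} billion tokens")
print(f"Not itemized: {unlisted_share:.0%}")
```

Under these figures, the two largest segments alone account for roughly 17.6 billion tokens.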
The sources collected in the news segment are RIA Novosti, Regnum, Lenta.ru and Rosbalt.

Texts are provided with meta-markup (date of creation of the text; sex, place and year of birth of the author; Internet genre; etc.), and all texts have automatic morphological tagging and lemmatization. Most of the texts were created in 2013–2014, although some segments, such as Russian Magazine Hall, include texts going back to 1994. GICR is one of the few mega-corpus projects, that is, corpora whose available size reaches several billion words.
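
As an illustration of how such annotation can be used, the sketch below models a document with metadata and lemmatized tokens and counts a lemma within a metadata-defined subset. It does not reflect GICR's actual storage format or query interface: every class, field name, tag label and sample record here is a hypothetical stand-in invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form assigned by lemmatization
    tag: str    # morphological tag (tag labels here are made up)


@dataclass
class Document:
    created: str                      # date of creation (ISO format is assumed)
    genre: str                        # Internet genre label (hypothetical values)
    author_sex: Optional[str] = None
    author_birth_year: Optional[int] = None
    author_birthplace: Optional[str] = None
    tokens: List[Token] = field(default_factory=list)


def lemma_frequency(docs, lemma, **meta):
    """Count a lemma in documents whose metadata equal all given values."""
    count = 0
    for doc in docs:
        if all(getattr(doc, key) == value for key, value in meta.items()):
            count += sum(1 for tok in doc.tokens if tok.lemma == lemma)
    return count


# Invented sample records, not real corpus data.
docs = [
    Document("2013-05-01", "blog", "f", 1990, "Moscow",
             [Token("кошки", "кошка", "NOUN"), Token("спят", "спать", "VERB")]),
    Document("2014-02-10", "news", None, None, None,
             [Token("кошка", "кошка", "NOUN")]),
]

print(lemma_frequency(docs, "кошка"))                # all documents
print(lemma_frequency(docs, "кошка", genre="blog"))  # blog documents only
```

Counting by lemma rather than by word form groups inflected variants together, which is what makes comparisons across genres or author groups meaningful for a highly inflected language like Russian.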


Access

The GICR interface is currently in beta. Searching the corpus is free of charge, but access is granted to researchers on request.


See also

* Text corpus
* Corpus linguistics
* Russian National Corpus
* Internet linguistics


Further reading


* Belikov V., Kopylov N., Piperski A., Selegey V., Sharoff S. (2013) Big and diverse is beautiful: A large corpus of Russian to study linguistic variation. In Web as Corpus Workshop (WAC-8).
* Lagutin M. B., Katinskaya A. Y., Selegey V. P., Sharoff S., Sorokin A. A. (2015) Automatic Classification of Web Texts Using Functional Text Dimensions. In Dialogue, Russian International Conference on Computational Linguistics, Bekasovo.
* Katinskaya A., Sharoff S. (2015) Applying Multi-dimensional Analysis to a Russian Webcorpus: Searching for Evidence of Genres. In Proc. of the Workshop on Balto-Slavic Natural Language Processing associated with the International Conference RANLP, Hissar, Bulgaria.


External links


Official site of GICR