The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited to language modeling and related research areas in natural language processing. The corpus is extracted from the Hamshahri Corpus and the ISNA news agency website. The quality of the Hamshahri Corpus was improved for language modeling purposes through a series of tokenization and spell-checking steps. TMC comprises more than 250 million words. The number of unique words with a frequency of two or more is about 300 thousand, a relatively favorable figure for a highly inflected language such as Persian. TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use, after obtaining permission from the corpus aggregator.
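The vocabulary figure above counts only word types that occur at least twice. A minimal sketch of that kind of count is given below; it assumes the text has already been split into tokens, and the function name `vocabulary_size` and the toy sample are illustrative only, not part of TMC or its tooling.

```python
from collections import Counter


def vocabulary_size(tokens, min_freq=2):
    """Return the number of distinct word types occurring at least min_freq times."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy, pre-tokenized sample (an assumption for illustration, not TMC data).
sample = ["کتاب", "خوب", "است", "و", "کتاب", "بد", "است"]

print(len(sample))                          # total tokens in the sample
print(vocabulary_size(sample))              # types seen two or more times
print(vocabulary_size(sample, min_freq=1))  # all distinct types
```

Filtering out words seen only once is a common way to keep misspellings and other noise out of a vocabulary count, which may be why the figure is reported with that threshold.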


See also

* Bijankhan Corpus
* Hamshahri Corpus


External links


TMC description page