The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited for language modeling and related research areas in natural language processing. The corpus is extracted from the Hamshahri Corpus and the ISNA news agency website. The quality of the Hamshahri corpus was improved for language modeling purposes through a series of tokenization and spell-checking steps. TMC comprises more than 250 million words. The corpus contains about 300 thousand unique words that occur two or more times, which is relatively good for a highly inflectional language like Persian. TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use, after obtaining permission from the corpus aggregator.
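The vocabulary figure above counts only word types that occur at least twice. As a rough illustration of how such a statistic can be computed, here is a minimal Python sketch. It assumes plain whitespace tokenization and a toy two-sentence sample; the function name `vocabulary_stats` and the `min_freq` parameter are hypothetical, and this is not the tokenization or spell-checking pipeline used to build TMC.

```python
from collections import Counter


def vocabulary_stats(lines, min_freq=2):
    """Return (total token count, set of word types seen at least min_freq times).

    Assumption: tokens are separated by whitespace. The real corpus was
    prepared with dedicated tokenization and spell-checking steps.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total_tokens = sum(counts.values())
    vocab = {word for word, freq in counts.items() if freq >= min_freq}
    return total_tokens, vocab


# Toy sample for illustration only (not taken from TMC).
sample = [
    "این یک جمله است",
    "این یک آزمایش است",
]
total, vocab = vocabulary_stats(sample)
print(total, len(vocab))
```

Dropping words that occur only once is a common way to keep rare typos and noise out of a language model vocabulary, which may be why the figure is reported with that threshold.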

External links

TMC description page
