The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper ''Hamshahri'', one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG) of the University of Tehran. Later, a team headed by Ale Ahmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.
This corpus was created by crawling the online news articles on the ''Hamshahri'' website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.
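The HTML-processing step can be illustrated with a minimal sketch. It is not the pipeline the corpus builders used; it only shows one common way to reduce a crawled page to plain text with Python's standard library. The names `TextExtractor` and `html_to_text` are hypothetical, and the choice to drop `<script>` and `<style>` contents is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments and collapse all runs of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real crawl would also need to isolate the article body from navigation and advertising markup, which depends on the site's page templates and is not covered here.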
Version 1.0
The collection contains more than 160,000 articles covering subject categories including politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document sizes range from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average size of 1.8 KB.
The corpus is available for download in several formats:
* Tagged text: 560 MB
* SQL Server 2000 tables: 712 MB
Version 2.0
The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:
* More News: 323,616 text stories in 3,206 XML files (one file for each day)
* Increased Time Span: from 22 June 1996 to 13 May 2007
* Bigger in Size: 1.42 GB uncompressed
* Standard Container: Unicode XML
* Included Images: images have been extracted from the news stories and preserved (available in an additional package), which makes the corpus suitable for image retrieval tasks.
* Categorized News: the news stories have been categorized semi-automatically (appropriate for text categorization and classification tasks).
The corpus is available for download in XML format.
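As a rough illustration of working with one daily XML file, the sketch below reads stories and counts them per category. The element names used here (`CORPUS`, `DOC`, `DOCID`, `CAT`, `TEXT`) and the sample data are assumptions made for the example, not the documented schema; check them against the actual files before use. The functions `iter_stories` and `category_counts` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for one daily file; tag names are assumed, not confirmed.
SAMPLE = """<CORPUS>
  <DOC><DOCID>1</DOCID><CAT>sport</CAT><TEXT>متن خبر اول</TEXT></DOC>
  <DOC><DOCID>2</DOCID><CAT>politics</CAT><TEXT>متن خبر دوم</TEXT></DOC>
  <DOC><DOCID>3</DOCID><CAT>sport</CAT><TEXT>متن خبر سوم</TEXT></DOC>
</CORPUS>"""


def iter_stories(xml_text):
    """Yield one dict per <DOC> element found in the XML text."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("DOC"):
        yield {
            "id": doc.findtext("DOCID", default="").strip(),
            "category": doc.findtext("CAT", default="").strip(),
            "text": doc.findtext("TEXT", default="").strip(),
        }


def category_counts(xml_text):
    """Count stories per category label."""
    return Counter(story["category"] for story in iter_stories(xml_text))


if __name__ == "__main__":
    for category, count in category_counts(SAMPLE).most_common():
        print(category, count)
```

For the full collection, the same functions could be applied to each file in turn (for example, after reading it with `open(path, encoding="utf-8")`) and the per-file counters summed.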
See also
* Bijankhan Corpus
* Persian Today Corpus
* Tehran Monolingual Corpus
* Text corpus
* Information retrieval
External links
Hamshahri Corpus Homepage
irBlogs Collection Homepage