Hamshahri Corpus





Hamshahri Corpus
The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian text corpus based on the Iranian newspaper ''Hamshahri'', one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG) of the University of Tehran. Later, a team headed by Ale Ahmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks. The corpus was created by crawling the online news articles on the newspaper's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.
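The HTML-processing step described above can be illustrated with a minimal sketch. It is not the DBRG pipeline, whose details are not given here; it assumes only that each crawled page is available as an HTML string. The names `TextExtractor`, `html_to_text`, and `to_record`, and the `id`/`text` record layout, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace inside each text node.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_record(doc_id, html):
    """Build a simple corpus record (hypothetical layout: id plus plain text)."""
    return {"id": doc_id, "text": html_to_text(html)}
```

A real news crawl would also need to separate article bodies from navigation menus and advertisements, which this sketch does not attempt.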


Version 1.0




Hamshahri Corpus Logo
''Hamshahri'' (Persian: همشهری, "Fellow citizen") is a major national Iranian Persian-language newspaper.

History and profile

''Hamshahri'' is published by the municipality of Tehran and was founded by Gholamhossein Karbaschi. It is the first coloured daily newspaper in Iran and has over 60 pages of classified advertisements. The newspaper is distributed within the limits of Tehran municipality. It has a daily circulation of over 400,000 copies, which is on par with major American daily newspapers such as the ''San Francisco Chronicle'', ''Boston Globe'', and ''Chicago Tribune''. According to a domestic poll of how citizens of Tehran view television and print media, released by Iran's Ministry of Culture and Islamic Guidance, ''Hamshahri'' was the most read daily in Tehran with 44.1% in March 2014. In 1997's Iranian presidential election, Hamshahri newspaper, then run by former mayor of Tehran, Gholamhossein Karbaschi, was accu ...


Bijankhan Corpus
The Bijankhan corpus (Persian: پیکرهٔ بی‌جن‌خان) is a tagged corpus suitable for natural language processing (NLP) research on the Persian language. The collection is gathered from daily news and common texts. All documents in it are categorized by subject (political, cultural, and so on) into about 4300 different subject categories. The corpus contains about 2.6 million manually tagged words, with a tag set of 550 Persian part-of-speech tags. The Bijankhan corpus was created by the Database Research Group at the University of Tehran. The corpus is non-free in that it is not free for commercial use, although these restrictions vary by country. The Bijankhan corpus is named after Mahmood Bijankhan, professor of linguistics at the University of Tehran, in recognition of his contributions in this area.

See also
* Hamshahri Corpus


Applied Linguistics
Applied linguistics is an interdisciplinary field which identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, psychology, communication research, information science, natural language processing, anthropology, and sociology.

Domain

Applied linguistics is an interdisciplinary field. Major branches of applied linguistics include bilingualism and multilingualism, conversation analysis, contrastive linguistics, language assessment, literacies, discourse analysis, language pedagogy, second language acquisition, language planning and policy, interlinguistics, stylistics, language teacher education, forensic linguistics, and translation.

Journals

Major journals of the field include ''Research Methods in Applied Linguistics'', ''Annual Review of Applied Linguistics'', ''Applied Linguistics'', ''Studies in Second Language Acquisition'', ''Applied Psycholinguistics'', ''Internat ...



Persian-language Newspapers
Persian, also known by its endonym Farsi, is a Western Iranian language belonging to the Iranian branch of the Indo-Iranian subdivision of the Indo-European languages. Persian is a pluricentric language predominantly spoken and used officially within Iran, Afghanistan, and Tajikistan in three mutually intelligible standard varieties, namely Iranian Persian (officially known as ''Persian''), Dari Persian (officially known as ''Dari'' since 1964) and Tajiki Persian (officially known as ''Tajik'' since 1999). It is also spoken natively in the Tajik variety by a significant population within Uzbekistan, as well as within other regions with a Persianate history in the cultural sphere of Greater Iran. It is written officially within Iran and Afghanistan in the Persian alphabet, a derivation of the Arabic script, and within Tajikistan in the Tajik alphabet, a derivatio ...


Persian Corpora
Persian may refer to:
* People and things from Iran, historically called ''Persia'' in the English language
** Persians, the majority ethnic group in Iran, not to be conflated with the Iranic peoples
** Persian language, an Iranian language of the Indo-European family, native language of ethnic Persians
*** Persian alphabet, a writing system based on the Perso-Arabic script
* People and things from the historical Persian Empire

Other uses
* Persian (patience), a card game
* Persian (roll), a pastry native to Thunder Bay, Ontario
* Persian (wine)
* Persian, Indonesia, on the island of Java
* Persian cat, a long-haired breed of cat characterized by its round face and shortened muzzle
* The Persian, a character from Gaston Leroux's ''The Phantom of the Opera''
* Persian, a generation I Pokémon species
* Alpha Indi, star also known as "The Persian"

See also
* Persian Empire (disambiguation)
* Persian expedition (disambiguation) or Persian campaign
* Persian Gulf (disambiguat ...



Information Retrieval
Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds. Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals and other documents, and that stores and manages those documents. Web search engines are the most visible IR applications.

Overview

An information retrieval process begins when a user or searcher enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In inf ...




Tehran Monolingual Corpus
The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited for language modeling and related research areas in natural language processing. The corpus is extracted from the Hamshahri Corpus and the ISNA news agency website. The quality of the Hamshahri Corpus text was improved for language modeling through a series of tokenization and spell-checking steps. TMC comprises more than 250 million words. It contains about 300 thousand unique words (counting only words with a frequency of two or more), which is relatively good for a highly inflectional language like Persian. TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use, after obtaining permission from the corpus aggregator.

See also
* Bijankhan Corpus
* Hamshahri Corpus




Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems overlap, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. T ...




Text Corpus
In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents which is being searched.

Overview

A corpus may contain texts in a single language (''monolingual corpus'') or text data in multiple languages (''multilingual corpus''). To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or ''POS-tagging'', in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form o ...



