Bijankhan Corpus
The Bijankhan corpus (Persian: پیکرهٔ بی‌جن‌خان) is a tagged corpus suitable for natural language processing (NLP) research on the Persian language. The collection was gathered from daily news and common texts. Its documents are categorized by subject (political, cultural, and so on) into about 4300 subject categories. The corpus contains about 2.6 million manually tagged words, with a tag set of 550 Persian part-of-speech tags. The Bijankhan corpus was created by the Database Research Group at the University of Tehran. The corpus is non-free in that it is not free for commercial use, although these restrictions vary by country. It is named after Mahmood Bijankhan, professor of linguistics at the University of Tehran, in recognition of his contributions in this area.
See also: Hamshahri Corpus




Part-of-speech Tagging
In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.

Principle
Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, ...
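
The ambiguity described above can be illustrated with a minimal sketch of a stochastic baseline: a unigram tagger that assigns each word the tag it received most often in hand-tagged training data. The toy sentences, the tag names, and the function names below are assumptions made for illustration; they are not taken from the Bijankhan corpus or from any particular tagger.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sentences, invented for illustration only.
TAGGED = [
    [("the", "DET"), ("dogs", "NOUN"), ("run", "VERB")],
    [("a", "DET"), ("long", "ADJ"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB"), ("daily", "ADV")],
]


def train_unigram_tagger(sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word by lookup; unseen words fall back to a default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train_unigram_tagger(TAGGED)
print(tag(["a", "long", "run"], model))
```

Because "run" is tagged as a verb twice and as a noun once in the toy data, this baseline labels it a verb even in "a long run", where it is a noun. A word list alone cannot resolve such cases, which is why practical taggers also use the surrounding context.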


Text Corpus
In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and statistical hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents which is being searched.

Overview
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). In order to make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form o ...


Natural Language Processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.

History
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, t ...


Persian Language
Persian, also known by its endonym Farsi, is a Western Iranian language belonging to the Iranian branch of the Indo-Iranian subdivision of the Indo-European languages. Persian is a pluricentric language predominantly spoken and used officially within Iran, Afghanistan, and Tajikistan in three mutually intelligible standard varieties, namely Iranian Persian (officially known as Persian), Dari Persian (officially known as Dari since 1964) and Tajiki Persian (officially known as Tajik since 1999) (Siddikzoda, S., "Tajik Language: Farsi or not Farsi?", Media Insight Central Asia #27, August 2002). It is also spoken natively in the Tajik variety by a significant population within Uzbekistan, as well as within other regions with a Persianate history in the cultural sphere of Greater Iran. It is written officially within Iran and Afghanistan in the Persian alphabet, a derivation of the Arabic script, and within Tajikistan in the Tajik alphabet, a der ...


Database Research Group
In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations, including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues, including supporting concurrent access and fault tolerance. A database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS software additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an applicati ...


University Of Tehran
The University of Tehran (Tehran University or UT; Persian: دانشگاه تهران) is the most prominent university located in Tehran, Iran. Based on its historical, socio-cultural, and political pedigree, as well as its research and teaching profile, UT has been nicknamed "The Mother University of Iran" (Persian: دانشگاه مادر). In international rankings, UT has been ranked as one of the best universities in the Middle East and is among the top universities of the world. It is also the premier knowledge-producing institute among all OIC countries. Tehran University of Medical Sciences was ranked 7th in the Islamic World University Ranking in 2021. The university offers more than 111 bachelor's degree programs, 177 master's degree programs, and 156 PhD programs. Many of the departments were absorbed into the University of Tehran from the Dar al-Funun established in 1851 and the Tehran School of Political Sciences established in 1899. The main campus of the univers ...


Iran And Copyright Issues
Iran, officially the Islamic Republic of Iran, and also called Persia, is a country located in Western Asia. It is bordered by Iraq and Turkey to the west, by Azerbaijan and Armenia to the northwest, by the Caspian Sea and Turkmenistan to the north, by Afghanistan and Pakistan to the east, and by the Gulf of Oman and the Persian Gulf to the south. By area, it is the 17th-largest country. Iran has a population of 86 million, making it the 17th-most populous country in the world, and the second-largest in the Middle East. Its largest cities, in descending order, are the capital Tehran, Mashhad, Isfahan, Karaj, Shiraz, and Tabriz. The country is home to one of the world's oldest civilizations, beginning with the formation of the Elamite kingdoms in the fourth millennium BC. It was first unified by the Medes, an ancient Iranian people, in the seventh century BC, and reached its territorial height in the sixth century BC, when Cyrus the Great fo ...


Mahmood Bijankhan
Mahmood Bijankhan (Persian: محمود بی‌جن‌خان; born 1958 in Abadan) is an Iranian linguist and professor of linguistics at the University of Tehran. He is the creator of the Bijankhan Corpus and a winner of the Khwarizmi International Award. Bijankhan received his BSc in applied mathematics from the University of Texas at Arlington (1981) and his MA (1990) and PhD (1996) in linguistics from the University of Tehran. He is known for his research on Persian phonetics and phonology and for creating Persian corpora.

Books
* Phonology: Optimality Theory, Tehran: SAMT, 2006
* A Feasibility Study for Analysis of Ezafe in Persian Using Pattern Matching, Tehran: Research Center for Culture, Art and Communication, 2008
* Persian Language and Computers (ed.), Tehran: SAMT, 2011
* Frequency Dictionary, Tehran: University of Tehran Press, 2013
* Phonetic System of the Persian Language, Tehran: SAMT, 2014

See also: Bijankhan Corpus


Hamshahri Corpus
The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian-language text corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the DBRG Group (see "DBRG News", Database Research Group of University of Tehran). Later, a team headed by Ale Ahmad (see "Hamshahri", Database Research Group) built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks. This corpus was created by crawling the online news articles from Hamshahri's website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.


Version 1.0

< ...




Persian Today Corpus
Persian may refer to:
* People and things from Iran, historically called Persia in the English language
** Persians, the majority ethnic group in Iran, not to be conflated with the Iranic peoples
** Persian language, an Iranian language of the Indo-European family, native language of ethnic Persians
*** Persian alphabet, a writing system based on the Perso-Arabic script
* People and things from the historical Persian Empire

Other uses
* Persian (patience), a card game
* Persian (roll), a pastry native to Thunder Bay, Ontario
* Persian (wine)
* Persian, Indonesia, on the island of Java
* Persian cat, a long-haired breed of cat characterized by its round face and shortened muzzle
* The Persian, a character from Gaston Leroux's The Phantom of the Opera
* Persian, a generation I Pokémon species
* Alpha Indi, star also known as "The Persian"

See also
* Persian Empire (disambiguation)
* Persian expedition (disambiguation) or Persian campaign
* Persian Gulf (disambiguat ...