OCR in Indian Languages

Indic OCR refers to the process of converting images of text written in Indic scripts into e-text using optical character recognition (OCR) techniques. More broadly, it can also refer to OCR systems for the Brahmic scripts used by languages of South Asia and Southeast Asia, not just the scripts of the Indian subcontinent, which are all written in an abugida-based writing system. OCR for Latin characters is still not 100% accurate, but a relatively high degree of accuracy has been achieved. Comparable accuracy has not yet been achieved for Indic scripts. This is due in part to the writing systems of Indic languages, as well as a lack of standard representation, encoding, and support among operating systems and keyboards. The Centre for Development of Advanced Computing (C-DAC) and Technology Development for Indian Languages, the premier R&D organisation of the Ministry of Electronics and Information Technology (also known as MeitY) of India, have carried out many projects relating to OCR. Their projects include OCR for Malayalam, Odia, Punjabi, Telugu and the Devanagari script.
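The article does not say how OCR accuracy is measured. One common convention, assumed here for illustration, is the character error rate (CER): the edit distance between the reference text and the OCR output, divided by the length of the reference. The sketch below counts Unicode code points, which is only one possible unit for Indic text, and the sample "OCR output" is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of code points in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# The Hindi word "Hindi", written as explicit code points (6 in total).
reference = "\u0939\u093f\u0928\u094d\u0926\u0940"
# Hypothetical OCR output in which the vowel sign U+093F was missed.
hypothesis = "\u0939\u0928\u094d\u0926\u0940"

print(character_error_rate(reference, hypothesis))
```

Because a single visible syllable in an abugida can span several code points, one misread glyph may count as more than one error under this measure.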


Properties of Indian writing systems

There are 22 officially recognised languages in India. Of these, Hindi, Bengali and Punjabi are the most widely spoken Indo-Aryan languages and are also the fourth, seventh and tenth most widely spoken languages in the world respectively. Two or more languages can be written with the same script. For example, Devanagari is used to write Hindi, Marathi, Rajasthani, Sanskrit, Bhojpuri and others, while Eastern Nagari is used to write Bengali, Assamese, Manipuri and others. Apart from basic characters such as consonants and vowels, most Indic languages combine two or more basic characters to form compound characters. The shape of a compound character is more complex than that of its constituent basic characters. Some Indo-Aryan languages (including Hindi and Punjabi) have a horizontal line over the characters, while other Indo-Aryan languages (including Gujarati) and the Dravidian languages (Malayalam, Kannada, Tamil, and Telugu) do not. These are some of the main challenges in creating a single OCR system for all Indic languages. Indic OCR also generally includes support for recently invented scripts in India such as Ol Chiki, Warang Citi and Mundari Bani, which were mainly created for writing Munda languages of the Austroasiatic family. The concept of upper/lower case is absent in Indic scripts. Apart from Urdu, Sindhi, Kashmiri and Thaana (the writing system of the Maldivian language), all other Indic languages are written from left to right.
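The properties described above can be seen in how Unicode encodes Indic text. The following is a minimal sketch using Python's standard `unicodedata` module. The helper names `script_of` and `count_conjunct_joins` are illustrative assumptions, and the conjunct test is only a rough approximation: it looks for a letter, a virama-class mark (canonical combining class 9), and another letter, and it ignores joiner characters and font behaviour.

```python
import unicodedata

# KA + VIRAMA + SSA: three code points that typically render as one compound glyph.
conjunct = "\u0915\u094d\u0937"

for ch in conjunct:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")


def script_of(ch):
    """Rough script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def count_conjunct_joins(text):
    """Count letter + virama + letter sequences (approximate conjunct count)."""
    count = 0
    for i in range(1, len(text) - 1):
        if (unicodedata.combining(text[i]) == 9
                and unicodedata.category(text[i - 1]) == "Lo"
                and unicodedata.category(text[i + 1]) == "Lo"):
            count += 1
    return count


# Indic letters have no case, so case conversion leaves them unchanged.
print(conjunct.upper() == conjunct)

# The same consonant KA in two scripts.
print(script_of("\u0915"), script_of("\u0995"))

# The Hindi word "Hindi" contains one such join (NA + VIRAMA + DA).
print(count_conjunct_joins("\u0939\u093f\u0928\u094d\u0926\u0940"))
```

An OCR system has to recognise the rendered compound shape and then emit the underlying sequence of code points, which is part of why these scripts are harder to process than scripts with a fixed, small set of glyphs.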


Examples

* SanskritOCR - OCR software for Sanskrit, Hindi and other Indo-Aryan languages based on the Devanagari script. SanskritOCR was developed by Dr. Oliver Hellwig, a Sanskrit scholar at the Department for Languages and Cultures of Southern Asia, Freie Universität Berlin, Germany. The official website is in German. The interface of earlier versions of the software was also in German, but later versions have an English interface too.
* E-aksharayan - Optical character recognition engine for Indian languages.
* Chitrankan - This technology was developed by ISI, Kolkata, and transferred to C-DAC. It processes printed Hindi text from a scanner or from an image.
* Indic OCR models for Tesseract (software)
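As an illustration of the last entry, the sketch below builds a command line for the `tesseract` executable with one or more Indic language models selected through the `-l` option. It assumes that Tesseract and the corresponding traineddata files are installed. The file name `page.png` and the helper names are hypothetical, and the mapping covers only a few of the languages mentioned in this article.

```python
import shutil
import subprocess

# Traineddata codes used by the Tesseract language data files
# (assumed to be installed on the machine running the sketch).
TESSERACT_LANGS = {
    "Hindi": "hin",
    "Bengali": "ben",
    "Punjabi": "pan",
    "Gujarati": "guj",
    "Odia": "ori",
    "Malayalam": "mal",
    "Kannada": "kan",
    "Tamil": "tam",
    "Telugu": "tel",
}


def build_command(image_path, languages):
    """Return the argument list for a tesseract run that prints text to stdout."""
    codes = "+".join(TESSERACT_LANGS[name] for name in languages)
    return ["tesseract", image_path, "stdout", "-l", codes]


def run_ocr(image_path, languages):
    """Run tesseract on an image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, languages),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Only the command is built here; run_ocr needs a real image and an installation.
print(build_command("page.png", ["Hindi", "Telugu"]))
```

Joining several codes with `+` asks Tesseract to load more than one language model for the same page, which can help with documents that mix scripts.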


OCR in use

OCR has been used for Wikisource and other projects.

