OCRopus

OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0. It has a highly modular design built around command-line interfaces. OCRopus is developed under the lead of Thomas Breuel from the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and was sponsored by Google.


Description

OCRopus was designed primarily for high-volume digitization projects of books, such as Google Books, the Internet Archive or libraries, and aims to support a large number of languages and fonts. It can also be used for desktop and office applications or for applications for visually impaired people.

The main components of OCRopus are:
* analysis of the document layout
* optical character recognition
* use of statistical language models

One or more command-line scripts are available for each of these components. The modular approach allows custom workflows to be assembled and individual steps to be exchanged.

By default, OCRopus comes with a model for English texts and a model for text in Fraktur. These models refer to the script and are largely independent of the actual language. New characters or language variants can be trained either from scratch or in addition to an existing model.

Recent text recognition is based on recurrent neural networks (LSTM) and does not require a language model. This makes it possible to train language-independent models, for which good recognition results have been shown for English, German and French at the same time. In addition to the Latin script, there are results for other scripts such as Sanskrit, Urdu, Devanagari and Greek.

Very good recognition rates can be achieved through appropriate training. This extra effort is particularly worthwhile for difficult documents or for scripts that are no longer in common use, which other OCR software does not focus on.


History

On 9 April 2007, OCRopus was announced as a Google-sponsored project to develop advanced OCR technologies. Funding was granted for three years and covered in particular PhD and postdoctoral positions at DFKI and the University of Kaiserslautern. In return, OCRopus was also used for automatic text recognition in Google Book Search. The software was released under an open source license from the start to facilitate collaboration between industrial and academic research. OCRopus has received further funding from the Andrew W. Mellon Foundation and the BMBF (the German Federal Ministry of Education and Research).

The first alpha version, 0.1, was released on 22 October 2007. Several pre-releases followed between December 2007 and May 2009, leading to the stable version 0.4.4 in March 2010. Originally, the software was developed in C++, Python and Lua, with Jam as the build system. The source code was completely refactored into Python modules and released as version 0.5 (June 2012).

Initially, Tesseract was used as the only text recognition module. From 2009 (version 0.4), Tesseract was supported only as a plugin, and a self-developed text recognizer (also segment-based) was used instead. This recognizer was then combined with OpenFST for language modeling after the recognition step. From 2013 onwards, additional recognition with recurrent neural networks (LSTM) was offered; since the release of version 1.0 in November 2014 it has been the only recognizer.

The source code is hosted on GitHub and is maintained and developed by a developer community. The current version of OCRopus is 1.3.3 (December 2017). Thomas Breuel also developed a successor, OCRopus 2, and is actively working on OCRopus 4.


Usage

OCRopus can be used from the command line. Once installed, it is invoked by specifying the input images. It writes the recognized text directly to standard output or saves it as hOCR (HTML-based) code in files, which can then be converted to a searchable PDF. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).

Example of the OCRopus calls to recognize the text in an image:

    # perform binarization
    ocropus-nlbin tests/ersch.png -o book

    # perform page layout analysis
    ocropus-gpageseg book/0001.bin.png

    # perform text line recognition (with a fraktur model)
    ocropus-rpred -m models/fraktur.pyrnn.gz book/0001/*.bin.png

    # generate HTML output
    ocropus-hocr book/0001.bin.png -o book/0001.html

Other tools cover the training side of OCRopus. There are OCRopus models to extract text from Latin, Greek, Cyrillic and Indic scripts.
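The four calls in the example above can also be driven from a script. The following is a minimal sketch, not part of OCRopus itself: the helper names `build_steps` and `run_steps` are hypothetical, the sketch assumes the first page is named `0001` as in the example, and it uses only the commands and options shown above. By default it only prints the commands instead of running them.

```python
import glob
import shlex
import subprocess


def build_steps(image, workdir="book", model="models/fraktur.pyrnn.gz", page="0001"):
    """Return the four OCRopus calls from the example as argument lists.

    Assumption: the binarized page is written to <workdir>/<page>.bin.png
    and its text lines to <workdir>/<page>/, as in the example above.
    """
    page_image = f"{workdir}/{page}.bin.png"
    line_pattern = f"{workdir}/{page}/*.bin.png"
    return [
        ["ocropus-nlbin", image, "-o", workdir],                        # binarization
        ["ocropus-gpageseg", page_image],                               # page layout analysis
        ["ocropus-rpred", "-m", model, line_pattern],                   # text line recognition
        ["ocropus-hocr", page_image, "-o", f"{workdir}/{page}.html"],   # hOCR (HTML) output
    ]


def run_steps(steps, dry_run=True):
    """Print each step (dry run) or execute the steps in order."""
    for step in steps:
        args = []
        for arg in step:
            # No shell is involved, so wildcard patterns are expanded here,
            # right before the step runs (the files come from earlier steps).
            matches = sorted(glob.glob(arg)) if "*" in arg else []
            args.extend(matches or [arg])
        if dry_run:
            print(shlex.join(args))
        else:
            subprocess.run(args, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_steps(build_steps("tests/ersch.png"))
```

Keeping the steps as plain argument lists mirrors the modular design described earlier: a single step can be replaced or dropped without touching the others. Passing `dry_run=False` executes the commands and requires the OCRopus tools to be installed and on the `PATH`.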


External links

* Ocropy wiki on GitHub
* IUPR Publication Server (papers behind many of the algorithms used in OCRopus)