OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0 with a very modular design using command-line interfaces. OCRopus is developed under the lead of Thomas Breuel from the German Research Centre for Artificial Intelligence in Kaiserslautern, Germany and was sponsored by Google.

Description

OCRopus was designed especially for use in high-volume digitization projects of books, such as Google Books, the Internet Archive or libraries. A large number of languages and fonts are to be supported. However, it can also be used for desktop and office applications or for applications for visually impaired people.

The main components of OCRopus are:
* analysis of the document layout
* optical character recognition
* use of statistical language models

One or more scripts (command-line programs) are available for each of these components. The modular approach allows individual workflows to be used and individual steps to be exchanged. By default, OCRopus comes with a model for English texts and a model for text in Fraktur. These models refer to the script and are largely independent of the actual language. New characters or language variants can be trained from scratch or added to an existing model.

Recent text recognition is based on recurrent neural networks (LSTM) and does not require a language model. This makes it possible to train language-independent models, for which good recognition results have been shown for English, German and French at the same time. In addition to the Latin script, there are results for other scripts such as Sanskrit, Urdu, Devanagari and Greek.

Very good recognition rates can be achieved with appropriate training. This extra effort is particularly worthwhile for difficult documents or for scripts that are no longer common today and are not in the focus of other OCR software.

History

On 9 April 2007, OCRopus was announced as a Google-sponsored project to develop advanced OCR technologies. Funding was granted for a period of three years and covered in particular PhD and postdoctoral positions at DFKI and the University of Kaiserslautern. In return, OCRopus was also used for automatic text recognition in Google Book Search. The software was released under an open source license from the start to facilitate collaboration between industrial and academic research. OCRopus has received further funding from the Andrew W. Mellon Foundation and the BMBF.

The first alpha version 0.1 was released on 22 October 2007, and several pre-releases followed between December 2007 and May 2009, reaching a stable version 0.4.4 in March 2010. Originally, the software was developed in C++, Python and Lua with Jam as a build system. The source code was completely refactored into Python modules and released as version 0.5 (June 2012).

Initially, Tesseract was used as the only text recognition module. From 2009 (version 0.4), Tesseract was supported only as a plugin. Instead, a self-developed text recognizer (also segment-based) was used. This recognizer was then used together with OpenFST for language modeling after the recognition step. From 2013 onwards, an additional recognizer based on recurrent neural networks (LSTM) was offered, which became the only recognizer with the release of version 1.0 in November 2014.

The source code is hosted on GitHub and is maintained and developed by a developer community. The current version of OCRopus is 1.3.3 (December 2017). Thomas Breuel also developed a successor, OCRopus 2, and is actively working on OCRopus 4.

Usage

OCRopus can be used from the command line. Once installed, it is invoked by specifying the input images. It prints the recognized text directly to standard output or writes it as hOCR (HTML-based) code into files, which can then be transformed into a searchable PDF. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line). Example OCRopus calls to recognize the text in an image:

 # perform binarization
 ocropus-nlbin tests/ersch.png -o book
 
 # perform page layout analysis
 ocropus-gpageseg book/0001.bin.png
 
 # perform text line recognition (with a fraktur model)
 ocropus-rpred -m models/fraktur.pyrnn.gz book/0001/*.bin.png
 
 # generate HTML output
 ocropus-hocr book/0001.bin.png -o book/0001.html
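
The four calls above can also be chained from a small driver script. The following is a minimal sketch, not part of OCRopus: the function names `build_pipeline`, `expand` and `run_pipeline` are hypothetical, the page number `0001` is simply taken from the example above, and only the command names and the `-o` and `-m` options shown there are used. By default it only prints the commands (a dry run).

```python
import glob
import subprocess


def build_pipeline(image, outdir="book", page="0001",
                   model="models/fraktur.pyrnn.gz"):
    """Return the four example calls as argument lists (hypothetical helper)."""
    page_image = f"{outdir}/{page}.bin.png"      # binarized page, as in the example
    line_pattern = f"{outdir}/{page}/*.bin.png"  # text line images of that page
    return [
        ["ocropus-nlbin", image, "-o", outdir],              # binarization
        ["ocropus-gpageseg", page_image],                    # page layout analysis
        ["ocropus-rpred", "-m", model, line_pattern],        # text line recognition
        ["ocropus-hocr", page_image, "-o", f"{outdir}/{page}.html"],  # HTML output
    ]


def expand(argv):
    """Expand glob patterns as a shell would; words without matches stay as-is."""
    expanded = []
    for word in argv:
        has_magic = any(c in word for c in "*?[")
        matches = sorted(glob.glob(word)) if has_magic else []
        expanded.extend(matches or [word])
    return expanded


def run_pipeline(steps, dry_run=True):
    """Print the commands, or run them in order and stop at the first failure."""
    for argv in steps:
        if dry_run:
            print(" ".join(argv))
        else:
            # Patterns are expanded just before each step, so files written
            # by earlier steps (e.g. the line images) are picked up.
            subprocess.run(expand(argv), check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline("tests/ersch.png"))
```

Calling `run_pipeline(..., dry_run=False)` would execute the steps and requires the OCRopus commands to be installed and on the search path.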

Other tools concentrate on the training part of OCRopus. There are OCRopus models to extract text from Latin, Greek, Cyrillic and Indic scripts.
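
The hOCR files mentioned above can be post-processed with standard HTML tooling. The sketch below uses only the Python standard library and assumes that each recognized line is an element with class `ocr_line` whose `title` attribute carries a `bbox x0 y0 x1 y1` property, as the hOCR format prescribes. The names `HocrLineParser` and `parse_bbox` and the sample markup are made up for illustration and are not actual OCRopus output.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) pairs for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0   # nesting depth inside the current ocr_line element
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.lines.append((self._bbox, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


# Invented sample markup in hOCR style.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>Erste Zeile</span>
<span class='ocr_line' title='bbox 10 60 310 90'>Zweite <b>Zeile</b></span>
</div>"""

if __name__ == "__main__":
    parser = HocrLineParser()
    parser.feed(SAMPLE)
    for bbox, text in parser.lines:
        print(bbox, text)
```

Keeping the bounding box next to each line of text is what allows a later step to place an invisible text layer over the page image, for example when producing a searchable PDF.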

External links

* Ocropy wiki on GitHub
* IUPR Publication Server (papers behind many of the algorithms used in OCRopus)