Tesseract (software)
   HOME

TheInfoList



OR:

Tesseract is an
optical character recognition Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scen ...
engine for various operating systems. It is
free software Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, no ...
, released under the Apache License. Originally developed by
Hewlett-Packard The Hewlett-Packard Company, commonly shortened to Hewlett-Packard ( ) or HP, was an American multinational information technology company headquartered in Palo Alto, California. HP developed and provided a wide variety of hardware components ...
as proprietary software in the 1980s, it was released as open source in 2005 and development has been sponsored by
Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...
since 2006.Announcing Tesseract OCR
- The official Google blog
In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.


History

The Tesseract engine was originally developed as proprietary software at
Hewlett Packard The Hewlett-Packard Company, commonly shortened to Hewlett-Packard ( ) or HP, was an American multinational information technology company headquartered in Palo Alto, California. HP developed and provided a wide variety of hardware components ...
labs in
Bristol, England Bristol () is a city, ceremonial county and unitary authority in England. Situated on the River Avon, it is bordered by the ceremonial counties of Gloucestershire to the north and Somerset to the south. Bristol is the most populous city in S ...
and
Greeley, Colorado Greeley is the home rule municipality city that is the county seat and the most populous municipality of Weld County, Colorado, United States. The city population was 108,795 at the 2020 United States Census, an increase of 17.12% since the 2010 ...
between 1985 and 1994, with more changes made in 1996 to port to Windows, and some migration from C to
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
in 1998. A lot of the code was written in C, and then some more was written in C++. Since then, all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the
University of Nevada, Las Vegas The University of Nevada, Las Vegas (UNLV) is a public land-grant research university in Paradise, Nevada. The campus is about east of the Las Vegas Strip. It was formerly part of the University of Nevada from 1957 to 1969. It includes the S ...
(UNLV). Tesseract development has been sponsored by
Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...
since 2006. Version 4 adds
LSTM Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) ca ...
based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages. Additionally 37
scripts Script may refer to: Writing systems * Script, a distinctive writing system, based on a repertoire of specific elements or symbols, or that repertoire * Script (styles of handwriting) ** Script typeface, a typeface with characteristics of handw ...
are supported. So it is for example possible to recognize text with a mix of Western and Central European languages by using the model for the
Latin script The Latin script, also known as Roman script, is an alphabetic writing system based on the letters of the classical Latin alphabet, derived from a form of the Greek alphabet which was in use in the ancient Greek city of Cumae, in southern Italy ...
it is written in. Version 5 was released in 2021, after more than two years of testing and developing.


Features

Tesseract was in the top three OCR engines in terms of character accuracy in 1995. It is available for
Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which ...
,
Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for serv ...
and
Mac OS X macOS (; previously OS X and originally Mac OS X) is a Unix operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac (computer), Mac computers. Within the market of ...
. However, due to limited resources it is only rigorously tested by developers under Windows and
Ubuntu Ubuntu ( ) is a Linux distribution based on Debian and composed mostly of free and open-source software. Ubuntu is officially released in three editions: ''Desktop'', ''Server'', and ''Core'' for Internet of things devices and robots. All the ...
. Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, and so inputting multi-columned text, images, or equations produced garbled output. Since version 3.00 Tesseract has supported output text formatting,
hOCR hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Ma ...
positional information and page-layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is
monospaced A monospaced font, also called a fixed-pitch, fixed-width, or non-proportional font, is a font whose letters and characters each occupy the same amount of horizontal space. This contrasts with variable-width fonts, where the letters and spac ...
or proportionally spaced. The initial versions of Tesseract could only recognize English-language text. Tesseract v2 added six additional Western languages (French, Italian, German, Spanish, Brazilian Portuguese, Dutch). Version 3 extended language support significantly to include ideographic (Chinese & Japanese) and right-to-left (e.g. Arabic, Hebrew) languages, as well as many more scripts. New languages included Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, German (
Fraktur Fraktur () is a calligraphic hand of the Latin alphabet and any of several blackletter typefaces derived from this hand. The blackletter lines are broken up; that is, their forms contain many angles when compared to the curves of the Antiqu ...
script), Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. V3.04, released in July 2015, added an additional 39 language/script combinations, bringing the total count of support languages to over 100. New language codes included: amh (Amharic), asm (Assamese), aze_cyrl (Azerbaijana in Cyrillic script), bod (Tibetan), bos (Bosnian), ceb (Cebuano), cym (Welsh), dzo (Dzongkha), fas (Persian), gle (Irish), guj (Gujarati), hat (Haitian and Haitian Creole), iku (Inuktitut), jav (Javanese), kat (Georgian), kat_old (Old Georgian), kaz (Kazakh), khm (Central Khmer), kir (Kyrgyz), kur (Kurdish), lao (Lao), lat (Latin), mar (Marathi), mya (Burmese), nep (Nepali), ori (Oriya), pan (Punjabi), pus (Pashto), san (Sanskrit), sin (Sinhala), srp_latn (Serbian in Latin script), syr (Syriac), tgk (Tajik), tir (Tigrinya), uig (Uyghur), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek in Cyrillic script), yid (Yiddish). In addition, Tesseract can be trained to work in other languages. Tesseract can process right-to-left text such as Arabic or Hebrew, many Indic scripts as well as CJK quite well. Accuracy rates are shown in this presentation for Tesseract tutorial at DAS 2016, Santorini by Ray Smith. Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks including layout analysis by using a frontend such as
OCRopus OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0 with a very modular design using command-line interfaces. OCRopus is developed under the lead of Thomas Breuel from the Ge ...
. Tesseract's output will have very poor quality if the input images are not preprocessed to suit it: Images (especially
screenshot screenshot (also known as screen capture or screen grab) is a digital image that shows the contents of a computer display. A screenshot is created by the operating system or software running on the device powering the display. Additionally, s ...
s) must be scaled up such that the text
x-height upright 2.0, alt=A diagram showing the line terms used in typography In typography, the x-height, or corpus size, is the distance between the baseline and the mean line of lowercase letters in a typeface. Typically, this is the height of the let ...
is at least 20 pixels, any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be
high-pass filter A high-pass filter (HPF) is an electronic filter that passes signals with a frequency higher than a certain cutoff frequency and attenuates signals with frequencies lower than the cutoff frequency. The amount of attenuation for each frequency d ...
ed, or Tesseract's binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters.


User interfaces

Tesseract is executed from the
command-line interface A command-line interpreter or command-line processor uses a command-line interface (CLI) to receive commands from a user in the form of lines of text. This provides a means of setting parameters for the environment, invoking executables and pro ...
.Google Code – Tesseract Readme
/ref> While Tesseract is not supplied with a GUI, there are many separate projects which provide a GUI for it. One common example is
OCRFeeder OCRFeeder is an optical character recognition suite for GNOME, which also supports virtually any command-line OCR engine, such as CuneiForm, GOCR, Ocrad and Tesseract. It converts paper documents to digital document files and can serve to make t ...
.


Reception

In a July 2007 article on Tesseract, Anthony Kay of ''
Linux Journal ''Linux Journal'' (''LJ'') is an American monthly technology magazine originally published by Specialized System Consultants, Inc. (SSC) in Seattle, Washington since 1994. In December 2006 the publisher changed to Belltown Media, Inc. in Houston ...
'' termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as
The GIMP GIMP ( ; GNU Image Manipulation Program) is a free and open-source raster graphics editor used for image manipulation (retouching) and image editing, free-form drawing, transcoding between different image file formats, and more specialized task ...
and
Netpbm Netpbm (formerly Pbmplus) is an open-source package of graphics programs and a programming library. It is used mainly in the Unix world, where one can find it included in all major open-source operating system distributions, but also works on Mic ...
." In November 2020,
Brewster Kahle Brewster Lurton Kahle ( ; born October 21, 1960)Alexa Internet profile
, via juggle.com. accessed Novemb ...
from the
Internet Archive The Internet Archive is an American digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including websites, software applications/games, music, ...
praised Tesseract saying:


See also

*
LibTIFF LibTIFF is a Library (computing), library for reading and writing Tagged Image File Format (abbreviated TIFF) files. The set also contains command line tools for processing TIFFs. It is distributed in source code and can be found as executable, bi ...


References


External links

* {{Google FOSS Free software programmed in C Free software programmed in C++ Optical character recognition software HP software Google software Formerly proprietary software Software using the Apache license