hOCR is an open standard of data representation for formatted text obtained from

optical character recognition Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scen ...

(OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using

Extensible Markup Language Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...

(XML) in the form of

Hypertext Markup Language The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript ...

(HTML) or

XHTML Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated. While HTML, prior ...

Software

The following OCR software can output the recognition result as hOCR file: *

OCRopus OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0 with a very modular design using command-line interfaces. OCRopus is developed under the lead of Thomas Breuel from the ...

Tesseract In geometry, a tesseract is the four-dimensional analogue of the cube; the tesseract is to the cube as the cube is to the square. Just as the surface of the cube consists of six square faces, the hypersurface of the tesseract consists of eig ...

Cuneiform Cuneiform is a logo-syllabic script that was used to write several languages of the Ancient Middle East. The script was in active use from the early Bronze Age until the beginning of the Common Era. It is named for the characteristic wedge-sha ...

* HebOCR
gcv2hocr

Example

The following example is an extract of an hOCR file: ...

Die Darlehenssumme ist in ihrem ursprünglichen Umfange zu ver- ... The recognized text is stored in normal text nodes of the HTML file. The distribution into separate lines and words is here given by the surrounding ''span'' tags. Moreover, the usual HTML entities are used, for example the ''p'' tag for a paragraph. Additional information is given in the properties such as: * different layout elements such as "ocr_par", "ocr_line", "ocrx_word" * geometric information for each element with a bounding box "bbox" * language information "lang" * some confidence values "x_wconf"

hOCR Docs

Visit https://kba.github.io/hocr-spec/1.2 to view in depth about hOCR.

bbox

General

The Layout of the Bounding Box Object or bbox Object is Grammar. * property-name = "bbox" * property-value = uint uint uint uint

Example

bbox 0 0 100 200 The bbox - short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and the lower-right corner (x1, y1). the values are with reference to the top-left corner of the document image and measured in pixels the order of the values are x0 y0 x1 y1 = "left top right bottom"

= Usage

= Use x_bboxes below for character bounding boxes Do not use bbox unless the bounding box of the layout component is, in fact, rectangular, some non-rectangular layout components may have rectangular bounding boxes if the non-rectangularity is caused by floating elements around which text flows. ... The bounding box bbox of this line is shown in blue and it is span by the upper-left corner (10, 20) and the lower-right corner (160, 30). All coordinates are measured with reference to the top-left corner of the document image which border is drawn in black.

Searchable PDF Files

The hOCR format is most commonly used in order to make searchable PDF files or as an extracted metadata of the PDF file. In order to create searchable PDF files we can use a scanned document image and a .hocr file of the particular image. We can use the following open source tools in order to achieve that.

hocr-tools

hocr-tools is an open source library written in Python that supports both Python 2.x Versions and

Python 3 The programming language Python was conceived in the late 1980s, and its implementation was started in December 1989 by Guido van Rossum at CWI in the Netherlands as a successor to ABC capable of exception handling and interfacing with the ...

.x Versions. It has a command line utility attached in the scripts called hocr-pdf that enables us to convert standard hocr files to a searchable pdf file. It is also worth noting that the version for dealing with hocr files in RTL or non-

Latin Latin (, or , ) is a classical language belonging to the Italic branch of the Indo-European languages. Latin was originally a dialect spoken in the lower Tiber area (then known as Latium) around present-day Rome, but through the power of the ...

scripts like

Arabic Arabic (, ' ; , ' or ) is a Semitic languages, Semitic language spoken primarily across the Arab world.Semitic languages: an international handbook / edited by Stefan Weninger; in collaboration with Geoffrey Khan, Michael P. Streck, Janet C ...

, we need to use the

GitHub GitHub, Inc. () is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous ...

Repo at the moment. hocr-pdf We can use the hocr-pdf utility using the following basic syntax. hocr-pdf—savefile final.pdf folder_images_and_hocr The folder_images_and_hocr must contain the respective .jpg and .hocr format files with their file extensions changed.

Known Issues

Some of the known issues of hocr-pdf script in PyPI installation are the following. * Not up to date with GitHub Repository. * hocr-pdf is broken on line 134 due to decodebytes() depreciated after Python 3.1

Known Fixes

Compile hocr-tools using latest GitHub Repository.

hocr2pdf

hocr2pdf is another library that supports the conversion of hocr files. It is written in

C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...

and is cross-compatible with other libraries. It also has support for utf-8 languages but that may require some additional debugging and browsing through some google conversation records to achieve that. According to Ubuntu Manpages,

ExactImage is a fast C++ image processing library. Unlike many other library frameworks it allows operation in several color spaces and bit depths natively, resulting in low memory and computational requirements. hocr2pdf creates well layouted, searchable PDF files from hOCR (annotated
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
) input obtained from an OCR system.

hOCR to PDF Attempts

In addition to the following discussed and stable libraries there have been many contributions to the hOCR formant over the years with support from many of the early adopters of this format. You can get access to inlaying text on an Image with hOCR and converting that in a PDF file using Python2 with this 12 Year old script as of 2021. This script can also be updated and made functional by converting tha
Python 2 Source code to Python 3
Supported Context.
HOCRConverter
by jbrinley (Documentation)

HOCRConverter

The HOCRConverter is a script written in Python 2.x that can used in order to convert a hOCR file with a specified image file in order to convert it to a searchable pdf file. You can see the documentation using the link above. from HocrConverter import HocrConverter hocr = HocrConverter("myHocrFile.html") # this can be done by changing .hocr to .html and vise-versa hocr.to_text("output.txt") hocr.to_pdf("myImageFile.png", "output.pdf")

Known Issues

* Has not been tested. * Does not natively support Python 3.x

References

External links

specification of current version
(1.2 as of February 2021) * * {{GitHub, UB-Mannheim/ocr-fileformat, ocr-fileformat – Software that validates and transforms various OCR file formats including hOCR, link=no Optical character recognition Microformats

Software

Example

hOCR Docs

bbox

General

Example

= Usage

Searchable PDF Files

hocr-tools

Known Issues

Known Fixes

hocr2pdf

hOCR to PDF Attempts

HOCRConverter

Known Issues

See also

References

External links