HOCR
hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.

Software

The following OCR software can output the recognition result as an hOCR file:
* OCRopus
* Tesseract
* Cuneiform
* HebOCR
* gcv2hocr

Example

The following example is an extract of an hOCR file: ... ...

The recognized text is stored in normal text nodes of the HTML file. The division into separate lines and words is given by the surrounding span tags. Moreover, the usual HTML elements are used, for example the p tag for a paragraph. Additional information is given in the attributes, such as:
* different layout elements such as "ocr_par", "ocr_line", "ocrx_word"
* geometric information for each element with a bounding box "bbox"
* langua ...
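As a minimal sketch of how such a file might be read, the Python snippet below pulls the word text and bounding boxes out of a small hOCR fragment. The fragment is a hypothetical one written for this illustration (it is not the extract elided above), and the helper names parse_title and extract_words are assumptions, not part of any hOCR tool. It assumes the layout class sits in the class attribute and that properties such as bbox are stored in the title attribute, separated by semicolons.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment, invented for this illustration.
SAMPLE_HOCR = """<div class="ocr_page" title="bbox 0 0 800 600">
  <p class="ocr_par" title="bbox 10 10 400 60">
    <span class="ocr_line" title="bbox 10 10 400 60">
      <span class="ocrx_word" title="bbox 10 10 120 60; x_wconf 93">Hello</span>
      <span class="ocrx_word" title="bbox 140 10 400 60; x_wconf 88">world</span>
    </span>
  </p>
</div>"""


def parse_title(title):
    """Split a title attribute into {property name: list of string values}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def extract_words(hocr):
    """Return (text, bbox) pairs for every span whose class includes 'ocrx_word'."""
    root = ET.fromstring(hocr)
    words = []
    for elem in root.iter("span"):
        if "ocrx_word" in elem.get("class", "").split():
            props = parse_title(elem.get("title", ""))
            bbox = tuple(int(v) for v in props.get("bbox", []))
            words.append((elem.text or "", bbox))
    return words


words = extract_words(SAMPLE_HOCR)
for text, bbox in words:
    print(text, bbox)
```

The sketch relies on the fragment being well-formed XML without a namespace. A complete XHTML file would typically declare the XHTML namespace, in which case the tag name passed to iter would need the namespace prefix, and an HTML (non-XML) hOCR file would need a lenient HTML parser instead.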



Tesseract (software)
Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005, and development has been sponsored by Google since 2006 ("Announcing Tesseract OCR", The official Google blog). In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.


History

The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with more changes made in 1996 to port it to Windows, and some migration from C (programming lan ...


ALTO (XML)
ALTO (Analyzed Layout and Text Object) is an open XML Schema developed by the EU-funded project METAe. The standard was initially developed to describe the OCR text and layout information of pages of digitized material. The goal was to describe the layout and text in a form that allows the original appearance to be reconstructed from the digitized information, similar to a lossless image saving operation. ALTO is often used in combination with the Metadata Encoding and Transmission Standard (METS), which describes the whole digitized object and creates references across the ALTO files, e.g. a reading sequence description. The standard has been hosted by the Library of Congress since 2010 and is maintained by the Editorial Board established at the same time. From the final version of the ALTO standard in June 2004 (version 1.0) up to version 1.4, ALTO was maintained by CCS Content Conversion Specialists GmbH, Hamburg.

Versions

The latest schem ...


Optical Character Recognition
Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example, from a television broadcast). Widely used as a form of data entry from printed paper data records (whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any suitable documentation), it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intellig ...





OCRopus
OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0, with a very modular design using command-line interfaces. OCRopus is developed under the lead of Thomas Breuel from the German Research Centre for Artificial Intelligence in Kaiserslautern, Germany, and was sponsored by Google.

Description

OCRopus was designed especially for use in high-volume digitization projects of books, such as Google Books, Internet Archive or libraries. A large number of languages and fonts are to be supported. However, it can also be used for desktop and office applications or for applications for visually impaired people. The main components of OCRopus are:
* analysis of the document layout
* optical character recognition
* use of statistical language models

Single or multiple scripts are available for these components. The modular approach allows individual workflows to be used and individual steps to be exchanged. By d ...



XHTML
Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated. While HTML, prior to HTML5, was defined as an application of Standard Generalized Markup Language (SGML), a flexible markup language framework, XHTML is an application of XML, a more restrictive subset of SGML. XHTML documents are well-formed and may therefore be parsed using standard XML parsers, unlike HTML, which requires a lenient HTML-specific parser. XHTML 1.0 became a World Wide Web Consortium (W3C) recommendation on 26 January 2000. XHTML 1.1 became a W3C recommendation on 31 May 2001. The standard known as XHTML5 is being developed as an XML adaptation of the HTML5 specification.

Overview

XHTML 1.0 is "a reformulation of the three HTML 4 document types as applications of XML 1.0". The World Wide Web Consortium (W3C) also continues to maintai ...
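The difference between a strict XML parser and a lenient HTML parser can be sketched with Python's standard library. The two markup strings and the helper names xml_parses, html_start_tags and TagCollector are hypothetical, chosen only for this illustration. The first string is well-formed in the XHTML sense, the second leaves its tags unclosed as HTML permits.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical inputs for illustration only.
WELL_FORMED = "<p>Line one<br/>Line two</p>"  # every element is closed
TAG_SOUP = "<p>Line one<br>Line two"          # unclosed tags, acceptable to HTML parsers


class TagCollector(HTMLParser):
    """Lenient HTML parser that records the start tags it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_parses(markup):
    """Return True if a standard XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_start_tags(markup):
    """Return the start tags seen by the lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(xml_parses(WELL_FORMED), xml_parses(TAG_SOUP))
print(html_start_tags(TAG_SOUP))
```

The XML parser rejects the second string outright, while the HTML parser still reports its tags. This is only a demonstration of well-formedness checking, not a validation against any XHTML document type.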




CuneiForm (software)
CuneiForm Cognitive OpenOCR is a freely distributed open-source OCR system developed by the Russian software company Cognitive Technologies. CuneiForm OCR was developed by Cognitive Technologies as a commercial product in 1993. The system came with the most popular models of scanners, MFPs and software in Russia and the rest of the world: Corel Draw, Hewlett-Packard, Epson, Xerox, Samsung, Brother, Mustek, OKI, Canon, Olivetti, etc. In 2008 Cognitive Technologies released the program's source code.

Features

CuneiForm is a system developed for transforming electronic copies of paper documents and image files into an editable form, in automatic or semi-automatic mode, without changing the structure or the original fonts of the document. The system includes two components for single and batch processing of electronic documents. The list of languages supported by the system: Besides, the system supports a mixture of Russian and English. Recognition of other mixed languages is only suppo ...



Python (programming Language)
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library. Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020. Python consistently ranks as ...


Extensible Markup Language
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications, all of them free open standards, define XML. The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. Several schema systems exist to aid in the definition of XML-based languages, while programmers have developed many application programming interfaces (APIs) to aid the processing of XML data.

Overview

The main purpose of XML is serialization, i ...
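As a rough illustration of XML used for serialization, the sketch below stores a flat record as XML text and reconstructs it again with Python's standard xml.etree.ElementTree module. The record contents, the element name "record" and the helpers to_xml and from_xml are assumptions made up for this example, and the sketch only handles flat dictionaries of non-empty strings.

```python
import xml.etree.ElementTree as ET


def to_xml(record):
    """Serialize a flat dict of strings into an XML string (hypothetical layout)."""
    root = ET.Element("record")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Reconstruct the flat dict from the XML produced by to_xml."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


record = {"name": "example", "pages": "12"}  # invented sample data
serialized = to_xml(record)
restored = from_xml(serialized)

print(serialized)
print(restored == record)
```

Keys must be valid XML element names for this layout to work. A more general design would store the key in an attribute instead of using it as the tag name.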



Right-to-left Script
In a right-to-left, top-to-bottom script (commonly shortened to right to left or abbreviated RTL, RL-TB or R2L), writing starts from the right of the page and continues to the left, proceeding from top to bottom for new lines. Arabic, Hebrew, Persian, Pashto, Urdu, Kashmiri and Sindhi are the most widespread R2L writing systems in modern times. Right-to-left can also refer to top-to-bottom, right-to-left (TB-RL or vertical) scripts of tradition, such as Chinese, Japanese, and Korean, though in modern times they are also commonly written left to right (with lines going from top to bottom). Books designed for predominantly vertical TBRL text open in the same direction as those for RTL horizontal text: the spine is on the right and pages are numbered from right to left. These scripts can be contrasted with many common modern left-to-right writing systems, where writing starts from the left of the page and continues to the right. The Arabic script is mostly but not exclusively right-to-left; mathematical expressions, numeric dates and numbers bearing units ...



Latin
Latin is a classical language belonging to the Italic branch of the Indo-European languages. Latin was originally a dialect spoken in the lower Tiber area (then known as Latium) around present-day Rome, but through the power of the Roman Republic it became the dominant language in the Italian region and subsequently throughout the Roman Empire. Even after the fall of Western Rome, Latin remained the common language of international communication, science, scholarship and academia in Europe until well into the 18th century, when other regional vernaculars (including its own descendants, the Romance languages) supplanted it in common academic and political usage, and it eventually became a dead language in the modern linguistic definition. Latin is a highly inflected language, with three distinct genders (masculine, feminine, and neuter), six or seven noun cases (nominative, accusative, genitive, dative, ablative, and vocative), five declensions, four verb conjuga ...



Arabic
Arabic is a Semitic language spoken primarily across the Arab world (Semitic Languages: An International Handbook, edited by Stefan Weninger in collaboration with Geoffrey Khan, Michael P. Streck and Janet C. E. Watson; Walter de Gruyter GmbH & Co. KG, Berlin/Boston, 2011). Having emerged in the 1st century, it is named after the Arab people; the term "Arab" was initially used to describe those living in the Arabian Peninsula, as perceived by geographers from ancient Greece. Since the 7th century, Arabic has been characterized by diglossia, with an opposition between a standard prestige language (i.e., Literary Arabic: Modern Standard Arabic (MSA) or Classical Arabic) and diverse vernacular varieties, which serve as mother tongues. Colloquial dialects vary significantly from MSA, impeding mutual intelligibility. MSA is only acquired through formal education and is not spoken natively. It is ...