Optical character recognition

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast).

Widely used as a form of data entry from printed paper data records (whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation), it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, with support for a variety of image file format inputs. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components.


History

Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.

In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM.


Visually impaired users

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil used the technology to create a reading machine that let blind people have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications.

In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in Internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have built-in OCR functionality will typically use an OCR API to extract the text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image, back to the device app for further processing (such as text-to-speech) or display.

Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters.


Applications

OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents. The software can be used for:
* Entering data for business documents, e.g. checks, passports, invoices, bank statements and receipts
* Automatic number-plate recognition
* Passport recognition and information extraction in airports
* Automatically extracting key information from insurance documents
* Traffic-sign recognition
* Extracting business card information into a contact list
* Creating textual versions of printed documents, e.g. book scanning for Project Gutenberg
* Making electronic images of printed documents searchable, e.g. Google Books
* Converting handwriting in real-time to control a computer (pen computing)
* Defeating or testing the robustness of CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR
* Assistive technology for blind and visually impaired users
* Writing instructions for vehicles by identifying CAD images in a database that are appropriate to the vehicle design as it changes in real time
* Making scanned documents searchable by converting them to PDFs


Types

* Optical character recognition (OCR): targets typewritten text, one glyph or character at a time.
* Optical word recognition: targets typewritten text, one word at a time (for languages that use a space as a word divider). Usually just called "OCR".
* Intelligent character recognition (ICR): also targets handwritten printscript or cursive text, one glyph or character at a time, usually involving machine learning.
* Intelligent word recognition (IWR): also targets handwritten printscript or cursive text, one word at a time. This is especially useful for languages where glyphs are not separated in cursive script.

OCR is generally an offline process, which analyses a static document. There are cloud-based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition".


Techniques


Pre-processing

OCR software often pre-processes images to improve the chances of successful recognition. Techniques include:
* De-skewing: if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical.
* Despeckling: removal of positive and negative spots, smoothing edges.
* Binarization: conversion of an image from color or greyscale to black-and-white (called a binary image because there are two colors). The task is performed as a simple way of separating the text (or any other desired image component) from the background. Binarization is necessary because most commercial recognition algorithms work only on binary images, as it is simpler to do so. In addition, the effectiveness of binarization significantly influences the quality of character recognition, so the binarization method must be chosen carefully for a given input image type, since the quality of the binary result depends on the type of image (scanned document, scene text image, degraded historical document, etc.).
* Line removal: cleaning up non-glyph boxes and lines.
* Layout analysis or zoning: identification of columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables.
* Line and word detection: establishment of a baseline for word and character shapes, separating words as necessary.
* Script recognition: in multilingual documents, the script may change at the level of the words, so the script must be identified before the right OCR can be invoked to handle it.
* Character isolation or segmentation: for per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected.
* Normalization of aspect ratio and scale.

Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.
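
Binarization and fixed-pitch segmentation can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular OCR product: the fixed threshold of 128, the list-of-lists image representation, and all function names are hypothetical choices made for this example, and practical systems typically choose the threshold adaptively for the image type.

```python
from typing import List

Image = List[List[int]]  # rows of pixels; grayscale uses 0 (black) to 255 (white)


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Return a binary image: 1 for ink (dark pixels), 0 for background.

    The fixed threshold of 128 is an assumption made for illustration.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def column_ink(binary: Image) -> List[int]:
    """Count the ink pixels in each column (a vertical projection profile)."""
    return [sum(column) for column in zip(*binary)]


def best_grid_offset(binary: Image, pitch: int) -> int:
    """Choose the grid offset whose vertical lines intersect the least ink."""
    ink = column_ink(binary)

    def crossed_ink(offset: int) -> int:
        return sum(ink[x] for x in range(offset, len(ink), pitch))

    return min(range(pitch), key=crossed_ink)


def split_fixed_pitch(binary: Image, pitch: int) -> List[Image]:
    """Cut a single text line into character cells along the chosen grid."""
    width = len(binary[0])
    offset = best_grid_offset(binary, pitch)
    bounds = sorted({0, width, *range(offset, width, pitch)})
    cells = []
    for left, right in zip(bounds, bounds[1:]):
        cell = [row[left:right] for row in binary]
        if any(any(row) for row in cell):  # skip slices that contain no ink
            cells.append(cell)
    return cells


# Three 3-pixel-wide dark blocks, each followed by one light column.
gray_line = [[30, 30, 30, 220] * 3 for _ in range(3)]
binary_line = binarize(gray_line)
cells = split_fixed_pitch(binary_line, pitch=4)
print(best_grid_offset(binary_line, pitch=4), len(cells))
```

The grid search only works when every character occupies the same horizontal pitch. For proportional fonts the column profile alone is not sufficient, for the reasons given above.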


Text recognition

There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.
* ''Matrix matching'' involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as ''pattern matching'', ''pattern recognition'', or ''image correlation''. This relies on the input glyph being correctly isolated from the rest of the image, and the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique that early physical photocell-based OCR implemented, rather directly.
* ''Feature extraction'' decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and most modern OCR software. Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.

Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The second pass is known as adaptive recognition and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).

Modern OCR software includes Google Docs OCR, ABBYY FineReader, and Transym. Others like OCRopus and Tesseract use neural networks which are trained to recognize whole lines of text instead of focusing on single characters.

A technique known as iterative OCR automatically crops a document into sections based on the page layout. OCR is then performed on each section individually using variable character confidence level thresholds to maximize page-level OCR accuracy. A patent from the United States Patent Office has been issued for this method.

The OCR result can be stored in the standardized ALTO format, a dedicated XML schema maintained by the United States Library of Congress. Other common formats include hOCR and PAGE XML.

For a list of optical character recognition software, see Comparison of optical character recognition software.
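
Matrix matching, as described above, can be sketched in a few lines of Python. The tiny 3x5 templates, the agreement score (fraction of identical pixels), and the function names are assumptions made for illustration only; they are not taken from any specific OCR engine. The sketch returns a ranked list of candidate characters, best match first.

```python
from typing import Dict, List, Tuple

Glyph = List[str]  # rows of '#' (ink) and '.' (background), all the same size

# Hypothetical stored glyphs for a single font at a single scale.
TEMPLATES: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_agreement(a: Glyph, b: Glyph) -> float:
    """Fraction of pixel positions at which two same-sized glyphs agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def rank_candidates(glyph: Glyph, templates: Dict[str, Glyph]) -> List[Tuple[str, float]]:
    """Compare the glyph with every stored template, best score first."""
    scores = [(label, pixel_agreement(glyph, t)) for label, t in templates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# An isolated input glyph: a "T" with one stray ink pixel in the fourth row.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
ranked = rank_candidates(noisy_t, TEMPLATES)
print([label for label, _ in ranked])
```

Because the comparison is pixel by pixel, the input glyph must already be isolated and scaled to the template size, which is why this approach degrades when new fonts are encountered.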


Post-processing

OCR accuracy can be increased if the output is constrained by a lexicon, a list of words that are allowed to occur in a document. This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.

The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.

''Near-neighbor analysis'' can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together. For example, "Washington, D.C." is generally far more common in English than "Washington DOC".

Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.

The Levenshtein distance algorithm has also been used in OCR post-processing to further optimize results from an OCR API.
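
Lexicon-constrained correction with the Levenshtein distance can be sketched as follows. This is a minimal illustration rather than the post-processing of any named OCR system: the three-word lexicon, the `max_distance` cutoff, and the function names are assumptions chosen for the example.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def correct(word: str, lexicon: Iterable[str], max_distance: int = 1) -> str:
    """Replace word with the closest lexicon entry, but only if it is within
    max_distance edits; otherwise keep the OCR output unchanged."""
    entries = list(lexicon)
    if not entries:
        return word
    best = min(entries, key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


lexicon = ["optical", "character", "recognition"]  # hypothetical lexicon
fixed = correct("recogn1tion", lexicon)    # one substitution away from an entry
kept = correct("Kurzweil", lexicon)        # proper noun, far from every entry
print(fixed, kept)
```

The distance cutoff reflects the problem noted above: without it, words that are legitimately absent from the lexicon, such as proper nouns, would be forced onto the nearest lexicon entry.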


Application-specific optimizations

In recent years, the major OCR technology providers began to tweak OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be had by taking into account business rules, standard expression, or rich information contained in color images. This strategy is called "Application-Oriented OCR" or "Customized OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver's licenses, and automobile manufacturing.

''The New York Times'' has adapted the OCR technology into a proprietary tool it calls ''Document Helper'', which enables its interactive news team to accelerate the processing of documents that need to be reviewed. The newspaper notes that the tool lets the team process as many as 5,400 pages per hour in preparation for reporters to review the contents.


Workarounds

There are several techniques for solving the problem of character recognition by means other than improved OCR algorithms.


Forcing better input

Special fonts like OCR-A, OCR-B, or MICR fonts, with precisely specified sizing, spacing, and distinctive character shapes, allow a higher accuracy rate during transcription in bank check processing. Several prominent OCR engines were designed to capture text in popular fonts such as Arial or Times New Roman, and are incapable of capturing text in these specialized fonts, which are very different from popularly used fonts. As Google Tesseract can be trained to recognize new fonts, it can recognize OCR-A, OCR-B and MICR fonts.

''Comb fields'' are pre-printed boxes that encourage humans to write more legibly, one glyph per box. These are often printed in a dropout color which can be easily removed by the OCR system.

Palm OS used a special set of glyphs, known as Graffiti, which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware. Users would need to learn how to write these special glyphs.

Zone-based OCR restricts the image to a specific part of a document. This is often referred to as ''Template OCR''.
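
The zone-based approach can be sketched as a template that maps field names to rectangles, with only those rectangles passed on to recognition. The field names, coordinates, and helper functions below are hypothetical and serve only to illustrate the idea; the recognition step itself is left out.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]            # rows of pixel values
Zone = Tuple[int, int, int, int]   # (top, left, bottom, right); bottom and right are exclusive


def crop(image: Image, zone: Zone) -> Image:
    """Return only the pixels inside the given rectangle."""
    top, left, bottom, right = zone
    return [row[left:right] for row in image[top:bottom]]


def extract_zones(image: Image, template: Dict[str, Zone]) -> Dict[str, Image]:
    """Cut every named zone of the template out of the page image."""
    return {name: crop(image, zone) for name, zone in template.items()}


# A 4x6 stand-in "page" whose pixel values encode their own row and column.
page = [[row * 10 + col for col in range(6)] for row in range(4)]
# Hypothetical template for a form with a date field and a total field.
form_template = {"date": (0, 0, 2, 3), "total": (2, 3, 4, 6)}
zones = extract_zones(page, form_template)
print(zones["date"], zones["total"])
```

Each cropped zone would then be handed to an OCR engine separately, which lets the engine apply field-specific constraints, such as digits only for a total.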


Crowdsourcing

Crowdsourcing humans to perform the character recognition can quickly process images, like computer-driven OCR, but with higher accuracy than that obtained via computers. Practical systems include the Amazon Mechanical Turk and reCAPTCHA. The National Library of Finland has developed an online interface for users to correct OCRed texts in the standardized ALTO format. Crowdsourcing has also been used not to perform character recognition directly but to invite software developers to develop image processing algorithms, for example, through the use of rank-order tournaments.


Accuracy

Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents. It ran the ''Annual Test of OCR Accuracy'', the most authoritative such evaluation, from 1992 to 1996.

Recognition of typewritten Latin-script text is still not 100% accurate, even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%; total accuracy can be achieved by human review or Data Dictionary Authentication. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character), are still the subject of active research. The MNIST database is commonly used for testing systems' ability to recognize handwritten digits.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (a lexicon of words) is not used to correct non-existent words produced by the software, a character error rate of 1% (99% accuracy) may result in an error rate of 5% or worse if the measurement is based on whether each whole word was recognized with no incorrect letters. Using a large enough dataset is important in neural-network-based handwriting recognition solutions. On the other hand, producing natural datasets is very complicated and time-consuming.

An example of the difficulties inherent in digitizing old text is the inability of OCR to differentiate between the "long s" and "f" characters.

Web-based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by pen computing software, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the ''Amount'' line of a check (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script.

Most programs allow users to set "confidence rates". This means that if the software does not achieve the desired level of accuracy, a user can be notified for manual review.

An error introduced by OCR scanning is sometimes termed a ''scanno'' (by analogy with the term ''typo'').
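
The gap between character-level and word-level error rates mentioned above can be illustrated with a short calculation. This is only a sketch under a simplifying assumption that is not in the source: each character is misrecognized independently with the same probability. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(char_error_rate: float, word_length: int) -> float:
    """Chance that a word contains at least one wrong character.

    Assumes (for illustration only) that character errors are independent
    and equally likely at every position.
    """
    # A word is correct only if every one of its characters is correct.
    prob_word_correct = (1.0 - char_error_rate) ** word_length
    return 1.0 - prob_word_correct


if __name__ == "__main__":
    # 1% character error rate, words of various lengths.
    for length in (3, 5, 8):
        rate = word_error_rate(0.01, length)
        print(f"{length}-letter word: {rate:.2%} word error rate")
```

Under these assumptions a five-letter word already has a word error rate of roughly 4.9%, which is consistent with the "5% or worse" figure quoted for a 1% character error rate. Real OCR errors tend to cluster, so measured figures may differ.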
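
The two ideas above, constraining recognition to a small dictionary (as on the ''Amount'' line of a check) and flagging low-confidence results for manual review, might be combined roughly as follows. This is a minimal sketch, not how any particular OCR product works: the lexicon, the 0.75 threshold, and the function names are hypothetical, and the string-similarity ratio from Python's standard `difflib` module merely stands in for an engine-reported confidence score.

```python
from difflib import SequenceMatcher

# Hypothetical small lexicon for a written-out amount line.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "twenty", "thirty", "forty", "fifty", "hundred",
    "thousand", "and", "dollars",
]


def best_match(raw: str, lexicon: list[str]) -> tuple[str, float]:
    """Return the lexicon word most similar to `raw` and its similarity (0..1)."""
    scored = [(SequenceMatcher(None, raw.lower(), w).ratio(), w) for w in lexicon]
    score, word = max(scored)
    return word, score


def correct_line(tokens: list[str], lexicon: list[str], threshold: float = 0.75):
    """Snap each raw token to the lexicon; collect tokens that need human review."""
    corrected, needs_review = [], []
    for token in tokens:
        word, score = best_match(token, lexicon)
        if score >= threshold:
            corrected.append(word)
        else:
            # Similarity is too low to trust: keep the raw token and flag it.
            corrected.append(token)
            needs_review.append(token)
    return corrected, needs_review


if __name__ == "__main__":
    raw_tokens = ["two", "hundrcd", "fcrty", "xqz"]
    print(correct_line(raw_tokens, AMOUNT_WORDS))
```

A smaller lexicon leaves fewer plausible candidates for each noisy token, which is why restricting the vocabulary can raise recognition rates; the threshold trades automatic corrections against the volume of items sent to a reviewer.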


Unicode

Characters to support OCR were added to the Unicode Standard in June 1993, with the release of version 1.1. Some of these characters are mapped from fonts specific to MICR, OCR-A or OCR-B.
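
The OCR characters can be inspected with Python's standard `unicodedata` module. The sketch below assumes the block spans U+2440 to U+245F, the hex range given under External links; the helper name `named_ocr_characters` is hypothetical, and which code points carry names depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Assumed block range, taken from the hex range 2440-245F cited in this article.
OCR_BLOCK = range(0x2440, 0x2460)


def named_ocr_characters() -> list[tuple[str, str]]:
    """List (code point, name) pairs for assigned characters in the block."""
    result = []
    for code_point in OCR_BLOCK:
        # Unassigned code points have no name; skip them.
        name = unicodedata.name(chr(code_point), None)
        if name is not None:
            result.append((f"U+{code_point:04X}", name))
    return result


if __name__ == "__main__":
    for code, name in named_ocr_characters():
        print(code, name)
```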


External links


Unicode OCR, Hex Range: 2440-245F
Optical Character Recognition in Unicode
