Document Processing

Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed are related to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. The term can also include the phase of digitizing the document using a scanner and the phase of interpreting the document, for example using natural language processing (NLP) or image classification technologies. It is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog archives and historical documents.


Background

Document processing was initially, and to some extent still is, a kind of production line work dealing with the treatment of documents, such as letters and parcels, with the aim of sorting them or extracting data from them, sometimes at massive scale. This work could be performed in-house or through business process outsourcing. Document processing can indeed involve some kind of externalized manual labor, such as mechanical Turk. As an example of manual document processing, as recently as 2007, document processing for "millions of visa and citizenship applications" involved the use of "approximately 1,000 contract workers" working to "manage mail room and data entry." While document processing involved data entry via keyboard well before the use of a computer mouse or a computer scanner, a 1990 article in ''The New York Times'' regarding what it called the "paperless office" stated that "document processing begins with the scanner". In this context, a former Xerox vice-president, Paul Strassmann, expressed a critical opinion, saying that computers add to, rather than reduce, the volume of paper in an office. It was said that the engineering and maintenance documents for an airplane weigh "more than the airplane itself".


Automatic document processing

As the ''state of the art'' advanced, document processing transitioned to handling "document components ... as database entities." A technology called automatic document processing, or sometimes intelligent document processing (IDP), emerged as a specific form of intelligent process automation (IPA), combining artificial intelligence techniques such as machine learning (ML), natural language processing (NLP) or intelligent character recognition (ICR) to extract data from several types of documents.
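
The idea of turning document components into database entities can be illustrated with a very small sketch. It assumes the text of an invoice has already been transcribed (for example by OCR) and uses hand-written regular expressions in place of the machine learning components a real system would rely on. The `InvoiceRecord` type, the `extract_invoice` function and both patterns are hypothetical names and rules chosen for this illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceRecord:
    """Hypothetical database entity holding fields pulled from one invoice."""
    invoice_number: Optional[str]
    total: Optional[float]


# Illustrative patterns only; real invoices vary widely in wording and layout.
INVOICE_NO = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
# \b keeps "Subtotal" from being mistaken for "Total".
TOTAL = re.compile(
    r"\bTotal\s*(?:due)?\s*:?\s*\$?\s*([0-9][0-9,]*\.?[0-9]*)", re.IGNORECASE
)


def extract_invoice(text: str) -> InvoiceRecord:
    """Extract a structured record from already-transcribed invoice text."""
    number = INVOICE_NO.search(text)
    total = TOTAL.search(text)
    return InvoiceRecord(
        invoice_number=number.group(1) if number else None,
        # Strip thousands separators before converting to a number.
        total=float(total.group(1).replace(",", "")) if total else None,
    )


# Toy transcription standing in for OCR output.
sample = (
    "ACME Corp\n"
    "Invoice No: INV-2041\n"
    "Subtotal: $1,100.00\n"
    "Total due: $1,234.50\n"
)
record = extract_invoice(sample)
print(record)
```

Missing fields are returned as `None` rather than guessed, so that a downstream step (or a human operator) can decide how to handle incomplete documents.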


Applications

Automatic document processing applies to a whole range of documents, whether structured or not. For instance, in the world of business and finance, technologies may be used to process paper-based invoices, forms, purchase orders, contracts, and currency bills. Financial institutions use intelligent document processing to process high volumes of forms such as regulatory forms or loan documents. IDP uses AI to extract and classify data from documents, replacing manual data entry. In medicine, document processing methods have been developed to facilitate patient follow-up and streamline administrative procedures, in particular by digitizing medical or laboratory analysis reports. The goal is also to standardize medical databases. Algorithms are also directly used to assist physicians in medical diagnosis, e.g. by analyzing magnetic resonance images or microscopic images. Document processing is also widely used in the humanities and digital humanities, in order to extract historical big data from archives or heritage collections. Specific approaches were developed for various sources, including textual documents, such as newspaper archives, but also images and maps.


Technologies

From the 1980s onward, traditional computer vision algorithms were widely used to solve document processing problems, but they were gradually replaced by neural network technologies in the 2010s. However, traditional computer vision technologies are still used in some sectors, sometimes in conjunction with neural networks. Many technologies support the development of document processing, in particular optical character recognition (OCR) and handwritten text recognition (HTR), which allow the text to be transcribed automatically. Text segments themselves are identified using instance or object detection algorithms, which can sometimes also be used to detect the structure of the document. The latter problem is sometimes also addressed with semantic segmentation algorithms. These technologies often form the core of document processing. However, other algorithms may intervene before or after these processes. Document digitization technologies are also involved, whether in the form of classical or three-dimensional scanning. The digitization of 3D documents can in particular rely on derivatives of photogrammetry. Sometimes, specific 2D scanners must also be developed to adapt to the size of the documents or for reasons of scanning ergonomics. Document processing also depends on the digital encoding of the documents in a suitable file format. Furthermore, the processing of heterogeneous databases can rely on image classification technologies. At the other end of the chain are various image completion, extrapolation or data cleanup algorithms. For textual documents, the interpretation can use natural language processing (NLP) technologies.
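
As an illustration of the traditional computer vision side, the following minimal sketch locates text lines with a horizontal projection profile: the page is binarized, and each run of consecutive pixel rows containing ink is treated as one line. It assumes a clean, deskewed grayscale scan represented as a list of rows of 0-255 values. The function names and the toy page are hypothetical, and the sketch is not meant to represent the method of any particular system.

```python
from typing import List, Tuple


def binarize(image: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map grayscale pixels (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def find_text_lines(
    binary: List[List[int]], min_ink: int = 1
) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows containing ink."""
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being scanned, if any
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink  # projection of this row onto the y axis
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:  # the page ends in the middle of a line
        lines.append((start, len(binary) - 1))
    return lines


# Toy 7x6 "scan": W is white background, B is black ink.
W, B = 255, 0
page = [
    [W, W, W, W, W, W],
    [W, B, B, W, B, W],
    [W, B, W, W, B, W],
    [W, W, W, W, W, W],
    [W, W, W, W, W, W],
    [W, B, B, B, W, W],
    [W, W, W, W, W, W],
]
lines = find_text_lines(binarize(page))
print(lines)
```

Each detected row range could then be cropped and passed to an OCR or HTR step. Skewed, noisy or multi-column pages would need additional preprocessing or a learned layout model, which this sketch does not attempt.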


See also

* Document automation
* Document modelling
* Data processing
* Document imaging
* Duplex scanning
* Text mining
* Workflow

