Document processing

Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure or layout of the document and then its content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed are related to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. [1] The term can also cover the phase of digitizing the document using a scanner and the phase of interpreting it, for example using natural language processing (NLP) or image classification technologies. Document processing is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog archives and historical documents.
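
The stages described above (digitization, then layout, then content) can be illustrated with a minimal sketch in Python. The helper functions and the toy page below are purely illustrative assumptions, not part of any standard document-processing API.

```python
# Illustrative sketch of two document-processing stages on a toy
# grayscale "scan": binarization (digitization) and line segmentation
# (layout extraction). All names here are hypothetical.

def binarize(gray, threshold=128):
    """Digitization step: map grayscale pixels to ink (1) / background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def segment_lines(bitmap):
    """Layout step: split the page into horizontal text lines, i.e.
    maximal runs of rows that contain at least one ink pixel."""
    lines, current = [], []
    for y, row in enumerate(bitmap):
        if any(row):
            current.append(y)
        elif current:
            lines.append((current[0], current[-1]))
            current = []
    if current:
        lines.append((current[0], current[-1]))
    return lines

# A 6-row toy page: two "text lines" separated by blank rows.
page = [
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
bitmap = binarize(page)
print(segment_lines(bitmap))  # → [(0, 1), (3, 3)]
```

In a real system each detected line would then be passed to an OCR or HTR engine for content extraction.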

Background

Document processing was initially, and to some extent still is, a kind of production-line work dealing with the treatment of documents such as letters and parcels, with the aim of sorting and extracting data, often on a massive scale. This work could be performed in-house or through business process outsourcing. [2] [3] Document processing can indeed involve externalized manual labor, such as through Amazon Mechanical Turk.

As an example of manual document processing: as recently as 2007, [4] document processing for "millions of visa and citizenship applications" relied on "approximately 1,000 contract workers" to "manage mail room and data entry."

While document processing involved data entry via keyboard well before the use of a computer mouse or scanner, a 1990 article in The New York Times about what it called the "paperless office" stated that "document processing begins with the scanner". [5] In this context, a former Xerox vice-president, Paul Strassman, expressed a critical opinion, saying that computers add to rather than reduce the volume of paper in an office. [5] It was said that the engineering and maintenance documents for an airplane weigh "more than the airplane itself"[citation needed].

Automatic document processing

As the state of the art advanced, document processing transitioned to handling "document components ... as database entities." [6]

A technology called automatic document processing, or sometimes intelligent document processing (IDP), emerged as a specific form of Intelligent Process Automation (IPA), combining artificial intelligence technologies such as Machine Learning (ML), Natural Language Processing (NLP) and Intelligent Character Recognition (ICR) to extract data from several types of documents. [7] [8]

Applications

Automatic document processing applies to a whole range of documents, whether structured or not. For instance, in the world of business and finance, these technologies may be used to process paper-based invoices, forms, purchase orders, contracts, and currency bills. [9] Financial institutions use intelligent document processing to handle high volumes of forms, such as regulatory filings or loan documents; IDP uses AI to extract and classify data from documents, replacing manual data entry. [10]
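
As an illustration of the data-extraction step, the following Python sketch pulls a few fields out of OCR'd invoice text using rules. Real IDP systems combine such rules with machine learning; the field names and patterns here are assumptions for the example only.

```python
import re

# Hedged sketch of rule-based field extraction from OCR output.
# The patterns and field names are illustrative, not from any product.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\w+)", re.I),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"Total\s*:?\s*\$?([\d,]+\.\d{2})", re.I),
}

def extract_fields(ocr_text):
    """Return whichever fields the rules can find in the OCR text."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(ocr_text)
        if m:
            found[name] = m.group(1)
    return found

sample = "ACME Corp\nInvoice No: A1234\nDate: 2021-04-21\nTotal: $1,250.00"
print(extract_fields(sample))
# → {'invoice_number': 'A1234', 'date': '2021-04-21', 'total': '1,250.00'}
```

A production system would add validation (e.g. checksum rules, date plausibility) and fall back to a human reviewer when confidence is low.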

In medicine, document processing methods have been developed to facilitate patient follow-up and streamline administrative procedures, in particular by digitizing medical or laboratory analysis reports. The goal is also to standardize medical databases. [11] Algorithms are also directly used to assist physicians in medical diagnosis, e.g. by analyzing magnetic resonance images, [12] [13] or microscopic images. [14]

Document processing is also widely used in the humanities and digital humanities, in order to extract historical big data from archives or heritage collections. Specific approaches were developed for various sources, including textual documents, such as newspaper archives, [15] but also images, [16] or maps. [17] [18]

Technologies

Although traditional computer vision algorithms were widely used from the 1980s onward to solve document processing problems, [19] [20] they were gradually replaced by neural network technologies in the 2010s. [21] However, traditional computer vision techniques are still used in some sectors, sometimes in conjunction with neural networks.

Many technologies support the development of document processing, in particular optical character recognition (OCR) and handwritten text recognition (HTR), which allow text to be transcribed automatically. Text segments themselves are identified using instance segmentation or object detection algorithms, which can sometimes also be used to detect the structure of the document; the latter problem is sometimes also addressed with semantic segmentation algorithms.
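
One classical building block for detecting text blocks and document structure is connected-component analysis. The sketch below, on an illustrative toy bitmap, reports each 4-connected group of ink pixels as a bounding box; in a real pipeline these boxes would seed line, word, or region detection.

```python
from collections import deque

def find_components(bitmap):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    ink regions (value 1) in a 2-D binary grid, via breadth-first search."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] == 1 and not seen[y][x]:
                queue = deque([(y, x)])
                seen[y][x] = True
                top, left, bottom, right = y, x, y, x
                while queue:
                    cy, cx = queue.popleft()
                    top, bottom = min(top, cy), max(bottom, cy)
                    left, right = min(left, cx), max(right, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

# Toy page with three ink blobs.
page = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
print(find_components(page))  # → [(0, 0, 1, 1), (0, 4, 1, 4), (3, 1, 3, 3)]
```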

These technologies often form the core of document processing. However, other algorithms may intervene before or after these processes. Document digitization technologies are also involved, whether in the form of classical or three-dimensional scanning. [22] The digitization of 3D documents can in particular rely on derivatives of photogrammetry. Sometimes, specific 2D scanners must also be developed to adapt to the size of the documents or for reasons of scanning ergonomics. [16] Document processing also depends on encoding the digitized documents in a suitable file format. Furthermore, the processing of heterogeneous databases can rely on image classification technologies.

At the other end of the chain are various image completion, extrapolation or data cleanup algorithms. For textual documents, the interpretation can use natural language processing (NLP) technologies.
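
A minimal example of such a data-cleanup step is post-OCR correction of character confusions in fields known to be numeric. The confusion table below is illustrative and not taken from any particular OCR engine.

```python
# Illustrative post-OCR cleanup: map common letter/digit look-alikes
# to digits in fields known to be numeric, and drop stray spaces.
CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

def clean_numeric_field(text):
    """Repair look-alike characters and remove spaces in a numeric field."""
    return "".join(CONFUSIONS.get(ch, ch) for ch in text if ch != " ")

print(clean_numeric_field("1O5 B2"))  # → 10582
```

Real systems make such corrections conditional on field type, since the same substitution would corrupt genuine text.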

Related Research Articles

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Optical character recognition

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.

Handwriting recognition

Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most plausible words.

Image scanner

An image scanner—often abbreviated to just scanner—is a device that optically scans images, printed text, handwriting or an object and converts it to a digital image. Commonly used in offices are variations of the desktop flatbed scanner where the document is placed on a glass window for scanning. Hand-held scanners, where the device is moved by hand, have evolved from text scanning "wands" to 3D scanners used for industrial design, reverse engineering, test and measurement, orthotics, gaming and other applications. Mechanically driven scanners that move the document are typically used for large-format documents, where a flatbed design would be impractical.

Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.
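
The idea can be sketched with a minimal frequency-based extractive summarizer, which scores sentences by the average frequency of their words and keeps the top-scoring ones. This is a toy illustration, not a production algorithm.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Toy extractive summarizer: keep the n sentences whose words are,
    on average, the most frequent in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Preserve the original order of the selected sentences.
    return " ".join(s for s in sentences if s in top)

text = ("Scanners digitize documents. "
        "OCR reads the digitized documents. "
        "Cats are nice.")
print(summarize(text))  # → Scanners digitize documents.
```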

3D scanning

3D scanning is the process of analyzing a real-world object or environment to collect three dimensional data of its shape and possibly its appearance. The collected data can then be used to construct digital 3D models.

Optical music recognition (OMR) is a field of research that investigates how to computationally read musical notation in documents. The goal of OMR is to teach the computer to read and interpret sheet music and produce a machine-readable version of the written music score. Once captured digitally, the music can be saved in commonly used file formats, e.g. MIDI and MusicXML. In the past it has, misleadingly, also been called "music optical character recognition". Due to significant differences, this term should no longer be used.

Thomas Huang

Thomas Shi-Tao Huang was a Chinese-born American computer scientist, electrical engineer, and writer. He was a researcher and professor emeritus at the University of Illinois at Urbana-Champaign (UIUC). Huang was one of the leading figures in computer vision, pattern recognition and human computer interaction.

Intelligent character recognition (ICR) is used to extract handwritten text from images. It is a more sophisticated type of OCR technology that recognizes different handwriting styles and fonts to intelligently interpret data on forms and physical documents.

Computer-aided diagnosis

Computer-aided detection (CADe) and computer-aided diagnosis (CADx) are systems that assist doctors in the interpretation of medical images. Imaging techniques in X-ray, MRI, endoscopy, and ultrasound diagnostics yield a great deal of information that the radiologist or other medical professional has to analyze and evaluate comprehensively in a short time. CAD systems scan digital images or videos for typical appearances and highlight conspicuous sections, such as possible diseases, in order to support the decision taken by the professional.

CuneiForm Cognitive OpenOCR is a freely distributed open-source OCR system developed by Russian software company Cognitive Technologies.

Digital pathology

Digital pathology is a sub-field of pathology that focuses on data management based on information generated from digitized specimen slides. Through the use of computer-based technology, digital pathology utilizes virtual microscopy. Glass slides are converted into digital slides that can be viewed, managed, shared and analyzed on a computer monitor. With the practice of Whole-Slide Imaging (WSI), another name for virtual microscopy, the field of digital pathology is growing and has applications in diagnostic medicine, with the goal of achieving efficient and cheaper diagnosis, prognosis, and prediction of diseases, owing to advances in machine learning and artificial intelligence in healthcare.

Forms processing is a process by which one can capture information entered into data fields and convert it into an electronic format. This can be done manually or automatically, but the general process is that hard copy data is filled out by humans and then "captured" from their respective fields and entered into a database or other electronic format.

Optical braille recognition

Optical braille recognition is technology to capture and process images of braille characters into natural language characters. It is used to convert braille documents for people who cannot read them into text, and for preservation and reproduction of the documents.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features on its own through the optimization of its filters. Vanishing and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by the regularized weights and the far smaller number of connections. For example, a fully connected layer would require 10,000 weights per neuron to process an image of 100 × 100 pixels, whereas a convolutional layer needs only 25 learnable weights for a 5 × 5 kernel, shared across all positions in the image. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
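
The parameter counts in this comparison can be checked directly: a fully connected neuron needs one weight per input pixel, while a convolutional layer shares a single kernel across all positions in the image.

```python
# Worked numbers for the weight comparison: a fully connected neuron
# over a 100 × 100 image versus a shared 5 × 5 convolution kernel.
image_h, image_w = 100, 100
kernel = 5

fully_connected_weights = image_h * image_w  # one weight per pixel, per neuron
conv_kernel_weights = kernel * kernel        # shared across all positions

print(fully_connected_weights)  # → 10000
print(conv_kernel_weights)      # → 25
```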

Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. Use of technology to help people with emotion recognition is a relatively nascent research area. Generally, the technology works best if it uses multiple modalities in context. To date, the most work has been conducted on automating the recognition of facial expressions from video, spoken expressions from audio, written expressions from text, and physiology as measured by wearables.

Scene text

Scene text is text that appears in an image captured by a camera in an outdoor environment.

Studierfenster

Studierfenster or StudierFenster (SF) is a free, non-commercial, open-science, client/server-based medical image processing online framework. It offers capabilities such as viewing medical data (computed tomography (CT), magnetic resonance imaging (MRI), etc.) in two- and three-dimensional space directly in standard web browsers such as Google Chrome, Mozilla Firefox, Safari, and Microsoft Edge. Other functionalities include the calculation of medical metrics (Dice score and Hausdorff distance), manual slice-by-slice outlining of structures in medical images (segmentation), manual placement of (anatomical) landmarks in medical image data, viewing medical data in virtual reality, facial reconstruction and registration of medical data for augmented reality, one-click showcases for COVID-19 and veterinary scans, and a radiomics module.

References

  1. Len Asprey; Michael Middleton (2003). Integrative Document & Content Management: Strategies for Exploiting Enterprise Knowledge. Idea Group Inc (IGI). ISBN   9781591400554.
  2. Vinod V. Sople (2009-05-25). Business Process Outsourcing: A Supply Chain of Expertises. PHI Learning Pvt. Ltd. ISBN   978-8120338159.
  3. Mark Kobayashi-Hillary (2005-12-05). Outsourcing to India: The Offshore Advantage. Springer Science & Business Media. ISBN   9783540247944.
  4. Julia Preston (December 2, 2007). "Immigration Contractor Trims Wages". The New York Times .
  5. Lawrence M. Fisher (July 7, 1990). "Paper, Once Written Off, Keeps a Place in the Office". The New York Times.
  6. Al Young; Dayle Woolstein; Jay Johnson (February 1996). "Unknown Title". Object Magazine. p. 51.
  7. "Intelligent Document processing" (PDF). Department of Computer Science – University of Bari. 2005-04-07. Retrieved 2018-09-08.
  8. Esposito, Floriana; Ferilli, Stefano; Basile, Teresa M. A.; Di Mauro, Nicola (2005). "Intelligent Document Processing". Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), Seoul, South Korea. pp. 1100–1104. doi:10.1109/ICDAR.2005.144. S2CID 17302169.
  9. US Patent 7873576B2, John E. Jones; William J. Jones & Frank M. Csultis, "Financial document processing system", issued 2011-01-18.
  10. Bridgwater, Adrian. "Appian Adds Google Cloud Intelligence To Low-Code Automation Mix". Forbes. Retrieved 2021-04-21.
  11. Adamo, Francesco; Attivissimo, Filippo; Di Nisio, Attilio; Spadavecchia, Maurizio (February 2015). "An automatic document processing system for medical data extraction". Measurement. 61: 88–99. Bibcode:2015Meas...61...88A. doi:10.1016/j.measurement.2014.10.032 . Retrieved 31 January 2021.
  12. Changwan, Kim; Seong-Il, Lee; Won Joon, Cho (September 2020). "Volumetric assessment of extrusion in medial meniscus posterior root tears through semi-automatic segmentation on 3-tesla magnetic resonance images". Orthopaedics & Traumatology: Surgery & Research. 101 (5): 963–968. doi:10.1016/j.rcot.2020.06.003. S2CID   225215597 . Retrieved 31 January 2021.
  13. Despotović, Ivana; Bart, Goossens; Wilfried, Philips (1 March 2015). "MRI Segmentation of the Human Brain: Challenges, Methods, and Applications". Computational Intelligence Techniques in Medicine. 2015: 963–968. doi: 10.1155/2015/450341 . PMC   4402572 . PMID   25945121.
  14. Putzua, Lorenzo; Caocci, Giovanni; Di Rubertoa, Cecilia (November 2014). "Leucocyte classification for leukaemia detection using image processing techniques". Artificial Intelligence in Medicine. 63 (3): 179–191. doi:10.1016/j.artmed.2014.09.002. hdl: 11584/94592 . PMID   25241903.
  15. Ehrmann, Maud; Romanello, Matteo; Clematide, Simon; Ströbel, Phillip; Barman, Raphaël (2020). "Language Resources for Historical Newspapers: the Impresso Collection". Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France. pp. 958–968.
  16. Seguin, Benoit; Costiner, Lisandra; di Lenardo, Isabella; Kaplan, Frédéric (April 1, 2018). "New Techniques for the Digitization of Art Historical Photographic Archives - the Case of the Cini Foundation in Venice". Archiving 2018 Final Program and Proceedings. Society for Imaging Science and Technology. pp. 1–5. doi:10.2352/issn.2168-3204.2018.1.0.2.
  17. Ares Oliveira, Sofia; di Lenardo, Isabella; Tourenc, Bastien; Kaplan, Frédéric (11 July 2019). A deep learning approach to Cadastral Computing. Digital Humanities Conference. Utrecht, Netherlands.
  18. Petitpierre, Rémi (July 2020). Neural networks for semantic segmentation of historical city maps: Cross-cultural performance and the impact of figurative diversity (MSc). arXiv: 2101.12478 . doi:10.13140/RG.2.2.10973.64484.
  19. Fujisawa, H.; Nakano, Y.; Kurino, K. (July 1992). "Segmentation methods for character recognition: from segmentation to document structure analysis". Proceedings of the IEEE. 80 (7): 1079–1092. doi:10.1109/5.156471 . Retrieved 3 February 2021.
  20. Tang, Yuan Y.; Lee, Seong-Whan; Suen, Ching Y. (1996). "Automatic document processing: a survey". Pattern Recognition. 29 (12): 1931–1952. Bibcode:1996PatRe..29.1931T. doi:10.1016/S0031-3203(96)00044-1 . Retrieved 3 February 2021.
  21. Ares Oliveira, Sofia; Seguin, Benoit; Kaplan, Frederic (5–8 August 2018). dhSegment: A Generic Deep-Learning Approach for Document Segmentation. 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). Niagara Falls, NY, USA: IEEE. arXiv: 1804.10371 . doi:10.1109/ICFHR-2018.2018.00011.
  22. "Revolutionary Scanning Technology for Art". Artmyn. Retrieved 3 February 2021.