Forms processing

Forms processing is the process of capturing information entered into the data fields of a form and converting it into an electronic format. This can be done manually or automatically, but the general process is the same: people fill out hard-copy forms, and the data is then "captured" from the respective fields and entered into a database or other electronic format.

Overview

In the broadest sense, forms processing systems range from those that handle small application forms to those that handle large-scale survey forms with multiple pages. Manual forms processing has several common problems: it demands a great deal of tedious human effort, the data keyed in by the operator may contain typos, and the lengthy process consumes many hours of labor. Processing forms with software can resolve these problems or reduce them to a great extent. Most methods for forms processing address the following areas.

Manual data entry

This method of data processing involves human operators keying in data found on the form. Manual data entry has disadvantages in speed, accuracy and cost. Based on average professional typist speeds of 50 to 80 wpm, [1] one could generously estimate about two hundred pages per hour for forms with fifteen one-word fields (not counting the time for reading and sorting pages). In contrast, modern commercial scanners can scan and digitize up to 200 pages per minute. [2] The second major disadvantage of manual data entry is the likelihood of typographical errors. Once the cost of labor and working space is factored in, manual data entry is a very inefficient process.
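
The estimate above can be reproduced with simple arithmetic. The following is a rough sketch that assumes, as the text does, one typed word per field and no time spent reading or handling pages; the function name is illustrative only.

```python
def pages_per_hour(words_per_minute: float, fields_per_page: int) -> float:
    """Pages a typist could key per hour, assuming one word per field."""
    pages_per_minute = words_per_minute / fields_per_page
    return pages_per_minute * 60


# Typist range quoted in the text: 50 to 80 wpm, fifteen one-word fields per form.
slow = pages_per_hour(50, 15)
fast = pages_per_hour(80, 15)

# Scanner figure quoted in the text: up to 200 pages per minute.
scanner = 200 * 60

print(f"typist: {slow:.0f} to {fast:.0f} pages/hour")
print(f"scanner: {scanner} pages/hour")
```

Under these assumptions the typist manages roughly 200 to 320 pages per hour, against 12,000 pages per hour of raw scanning.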

Automated forms processing

This method automates data processing by using pre-defined templates and configurations. A template, in this case, is a map of the document that details where the data fields are located within the form or document. Automatic form input systems are preferable to manual data entry, since they reduce the problems encountered during manual data processing.
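
One way to picture a template is as a list of named rectangles, each marking where a field sits on the page. The sketch below is only an illustration under that assumption: `FieldZone`, `crop` and the tiny 6x8 "page" are hypothetical and do not correspond to any particular product's template format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldZone:
    """Rectangle on the page, in pixels, where one field's data is expected."""
    name: str
    left: int
    top: int
    width: int
    height: int


def crop(page, zone):
    """Return the sub-image covered by a zone; page is a list of pixel rows."""
    rows = page[zone.top:zone.top + zone.height]
    return [row[zone.left:zone.left + zone.width] for row in rows]


# Hypothetical template for a tiny 6x8 "page"; a real template would use
# coordinates at scan resolution.
TEMPLATE = [
    FieldZone("surname", left=0, top=0, width=4, height=2),
    FieldZone("date", left=4, top=3, width=4, height=2),
]

# Stand-in page whose pixel values encode their own position (10 * row + col).
page = [[10 * r + c for c in range(8)] for r in range(6)]

# Each cropped snippet would then be handed to a recognition component.
snippets = {zone.name: crop(page, zone) for zone in TEMPLATE}
```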

Automatic form input systems use different types of recognition methods such as optical character recognition (OCR) for machine print, optical mark reading (OMR) for check/mark sense boxes, bar code recognition (BCR) for barcodes, and intelligent character recognition (ICR) for hand print.

With automated forms processing technology, users can convert scanned document images into a computer-readable format such as ANSI, XML, CSV or PDF, or input the data directly into a database.
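
As a small illustration of the output stage, captured field values can be written to CSV or XML with Python's standard library. The records, field names and element names below are invented for the example; a real system would define its own schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample of captured data: one dict per processed form.
records = [
    {"form_id": "0001", "surname": "Smith", "amount": "120.50"},
    {"form_id": "0002", "surname": "O'Neil", "amount": "75.00"},
]


def to_csv(rows):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(rows):
    """Serialize the records as XML, one <form> element per record."""
    root = ET.Element("forms")
    for row in rows:
        form = ET.SubElement(root, "form", id=row["form_id"])
        for name, value in row.items():
            if name != "form_id":
                ET.SubElement(form, name).text = value
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(records)
xml_text = to_xml(records)
```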

Forms processing has developed beyond basic data capture. It not only encompasses a recognition process but also helps manage the complete life cycle of a document, from scanning through data extraction and often to delivery into a back-end system. In some cases it may also include processing the data or generating well-formatted results through calculations and analysis. An automated forms processing system can be valuable if there is a need to process hundreds or thousands of images every day.

First Step: Assessment of the form structure

The first step in understanding automated forms processing is to analyze the type of form from which data is to be extracted. For the purpose of extracting data, forms can be classified into one of two high-level categories. Four categories have been proposed; [3] however, the document capture industry has settled on these two:

  1. Fixed forms. This type of form is defined as one in which the data to be extracted is always found in the same absolute position on a page. This allows a type of lens grid to be applied to the document and every subsequent occurrence of this document in order to extract the data. An example of a fixed form is a typical credit application form. [4]
  2. Semi-structured (or unstructured) forms. This type of form is one in which the location of the data, and of the fields holding the data, varies from document to document. It is perhaps most easily defined by the fact that it is not a fixed form. In the document capture industry, a semi-structured form is also called an unstructured form. Examples of these types of forms include letters, contracts, and invoices. According to a study by AIIM, about 80% of the documents in an organization fall under the semi-structured definition. [5]
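
The practical difference between the two categories can be sketched on already-recognized text. In the hypothetical example below, a fixed form is read by absolute position, while a semi-structured document is searched for a label wherever it occurs. The sample lines, the character offsets and the invoice-number pattern are assumptions made for illustration, not a description of how any specific capture product works.

```python
import re


def read_fixed(lines, row, start, end):
    """Fixed form: the value always occupies the same row and columns."""
    return lines[row][start:end].strip()


# Semi-structured form: look for a label and take the token that follows it.
INVOICE_NO = re.compile(r"invoice\s*(?:no\.?|number)\s*[:#]?\s*(\S+)", re.IGNORECASE)


def read_by_label(lines):
    """Return the invoice number from the first line that carries the label."""
    for line in lines:
        match = INVOICE_NO.search(line)
        if match:
            return match.group(1)
    return None


# Every copy of the fixed form has the date of birth in row 1, columns 10-19.
fixed_form = ["NAME      Jane Doe  ", "DOB       1990-01-31"]
dob = read_fixed(fixed_form, row=1, start=10, end=20)

# Two invoices from different senders place the number in different spots.
invoice_a = ["ACME Ltd", "Invoice No: INV-1042", "Total 99.00"]
invoice_b = ["Total 12.00", "", "Our invoice number #7781 is due in 30 days"]
number_a = read_by_label(invoice_a)
number_b = read_by_label(invoice_b)
```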

Although the components (described below) used to extract data from either type of form are the same, the way in which they are applied varies considerably with the type of document.

Components

The components used in data processing with an automatic form-input system include:

  1. OCR – Optical character recognition
  2. OMR – Optical mark recognition
  3. ICR – Intelligent character recognition
  4. BCR – Barcode recognition
  5. MICR – Magnetic ink character recognition

OCR recognizes machine-printed uppercase and lowercase alphabetic characters, accented characters, digits, many currency symbols, arithmetic symbols, expanded punctuation characters and more.

ICR recognizes hand-printed American and European English characters using pre-defined character sets: uppercase, lowercase and mixed-case alphabetic characters, digits, currency symbols (including $ (dollar), ¢ (cent), € (euro), £ (pound) and ¥ (yen)), arithmetic characters and punctuation characters (including period, comma, single quote, double quote, ! & ( ) ? @ { } \ # % * + – / : ; < = >).

MICR is a recognition technology that facilitates the processing of cheques printed in MICR fonts. It minimizes the chance of errors in cheque clearing and makes the transfer of funds easier and faster. MICR provides a secure, high-speed method of scanning and processing information.

Optical Mark Recognition (OMR) identifies bubbles filled in by hand or check boxes on printed forms. Usually OMR supports single and multiple mark recognition. The fields to be recognized can be specified as grids (rows by columns) or single bubbles.
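
A grid field of this kind can be sketched as follows. The example assumes a binarized image (0 for white, 1 for dark) that divides evenly into cells, and treats a cell as marked when at least half of its pixels are dark; the function name and the 0.5 threshold are illustrative assumptions, and production OMR engines use more robust measures.

```python
def read_marks(image, rows, cols, fill_threshold=0.5):
    """Return a rows x cols grid of booleans: True where a cell looks filled.

    image is a list of equal-length rows of 0 (white) / 1 (dark) pixels whose
    height and width are exact multiples of rows and cols.
    """
    cell_h = len(image) // rows
    cell_w = len(image[0]) // cols
    marks = []
    for r in range(rows):
        row_marks = []
        for c in range(cols):
            # Count the dark pixels inside this cell.
            dark = sum(
                image[y][x]
                for y in range(r * cell_h, (r + 1) * cell_h)
                for x in range(c * cell_w, (c + 1) * cell_w)
            )
            row_marks.append(dark / (cell_h * cell_w) >= fill_threshold)
        marks.append(row_marks)
    return marks


# Two questions (rows) with three options (columns); each bubble is 2x2 pixels.
image = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
]
marks = read_marks(image, rows=2, cols=3)

# Indices of the marked options per question; more than one index in a row
# would indicate a multiple mark.
selected = [[i for i, filled in enumerate(row) if filled] for row in marks]
```

The stray dark pixel in the first cell of the second row covers only a quarter of the cell, so it falls below the threshold and is not counted as a mark.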

Barcode recognition can read more than 20 industry-standard 1D and 2D barcode types, including Code39, CODABAR, Interleaved 2 of 5, Code93 and more. It automatically detects all barcodes in an image or in a specified area within the image.

Process

The process of automated forms processing typically includes the following steps:

  1. A batch of completed forms is scanned using a high-speed scanner
  2. Images are cleaned with document image processing algorithms to improve accuracy
  3. Forms are classified based on original template forms and the fields are extracted using the appropriate recognition components
  4. Fields that the system flags with low confidence are queued for verification by a human operator
  5. Verified data is saved into a database or exported to a searchable text format such as CSV, XML or PDF
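
The verification step in the list above can be sketched as a simple routing rule. The example assumes that each recognized field carries a confidence score between 0 and 1; the `FieldResult` type, the `route` function and the 0.85 threshold are hypothetical choices for illustration, since real engines report and calibrate confidence in their own ways.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    form_id: str
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a recognition engine


def route(results, threshold=0.85):
    """Split results into auto-accepted fields and a human verification queue."""
    accepted, review_queue = [], []
    for result in results:
        if result.confidence >= threshold:
            accepted.append(result)
        else:
            review_queue.append(result)
    return accepted, review_queue


# Invented recognition output for two forms.
results = [
    FieldResult("0001", "surname", "Smith", 0.98),
    FieldResult("0001", "date", "2O24-01-05", 0.41),  # letter O misread for zero
    FieldResult("0002", "surname", "Jones", 0.85),
]
accepted, review_queue = route(results)
```

Only the fields in `review_queue` need an operator's attention; the rest can go straight to the export or database step.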

Prerequisites

Though automated forms processing has many great advantages over manual data entry, it still comes with some limitations. To achieve the best accuracy, some prerequisites should be followed.

  1. Scan format: the file format of the scanned image, its resolution (DPI), and its color mode
  2. Configuration: the layout of the scanned image needs to be configured for the automation
  3. Recognition: the predefined output formats
  4. Result/analysis: any specific format required for presenting the captured values

One very important consideration is indexing: determining the metadata that will be used to describe the data contained within the documents. This attribute perhaps drives the forms processing solution more than any other.

References

  1. Teresia R. Ostrach (1997), Typing Speed: How Fast is Average (PDF), archived from the original (PDF) on 2012-05-02
  2. "Kodak intros 200 page-per-minute i1860 commercial scanner". Engadget. Retrieved 2011-11-04.
  3. Kuznetsov, Sergei O.; Mandal, Deba P.; Kundu, Malay K.; Pal, Sankar Kumar (2011-06-25). Pattern Recognition and Machine Intelligence: 4th International Conference, PReMI 2011, Moscow, Russia, June 27 - July 1, 2011, Proceedings. Springer. ISBN 9783642217869.
  4. Vassylyev, Artur (10 June 2008). "CAPTURING SEMI-STRUCTURED FORMS AND DOCUMENTS: CHALLENGES AND AVAILABLE TECHNOLOGIES" (PDF). Archived from the original (PDF) on 2017-04-28. Retrieved 4 April 2017.
  5. "Forms Processing- user experiences of text and handwriting recognition (OCR/ICR)" (PDF). Archived from the original (PDF) on 28 April 2017. Retrieved 4 April 2017.