Intelligent word recognition

Intelligent word recognition (IWR) [1] is the recognition of unconstrained handwritten words. [2] Unlike its predecessor, optical character recognition (OCR), which works character by character, IWR recognizes entire handwritten words or phrases. [3] IWR technology matches handwritten or printed words against a user-defined dictionary, which significantly reduces the character errors typical of character-based recognition engines.
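
The dictionary-matching idea can be illustrated with a minimal sketch. It assumes that a noisy character-level guess for the word is already available and simply snaps it to the closest entry in a user-defined dictionary. Real IWR engines match the word image itself against the lexicon; the function name `best_dictionary_match`, the `min_score` threshold, and the sample dictionary are hypothetical, and the similarity measure (Python's standard `difflib.SequenceMatcher`) is a stand-in chosen only for illustration.

```python
from difflib import SequenceMatcher


def best_dictionary_match(raw_word, dictionary, min_score=0.6):
    """Return (entry, score) for the closest dictionary entry.

    Returns (None, score) when no entry reaches min_score, so that the
    word can be routed to manual review instead of being forced to match.
    """
    best_entry, best_score = None, 0.0
    for entry in dictionary:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, raw_word.lower(), entry.lower()).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_score < min_score:
        return None, best_score
    return best_entry, best_score


# Hypothetical user-defined dictionary for a "city" field on a form.
CITY_DICTIONARY = ["Boston", "Austin", "Houston", "Dallas"]

if __name__ == "__main__":
    # A single misread character is absorbed by the dictionary lookup.
    print(best_dictionary_match("Bostan", CITY_DICTIONARY))
```

Restricting the output to dictionary entries is what removes isolated character errors: a misread letter changes the similarity score slightly but rarely changes which entry is closest.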

Newer products combine IWR, OCR, and intelligent character recognition (ICR), which allows a single system to process both constrained documents (hand printed or machine printed) and unconstrained ones (freeform cursive). IWR also eliminates a large percentage of the manual data entry for handwritten documents that, in the past, could only be keyed by a human, which allows the workflow to be automated.

For cursive handwriting, the system breaks each analyzed word into a sequence of graphemes, or subparts of letters. These curves, shapes, and lines make up letters, and IWR considers the shapes and their groupings to calculate a confidence value for the word in question. [4]
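
As a rough illustration of grapheme-based scoring, the sketch below expands each candidate word into an expected grapheme sequence and combines per-segment classifier scores into one confidence value using a geometric mean. Everything in it is an assumption made for the example: the grapheme inventory in `LETTER_GRAPHEMES`, the score format, and the geometric-mean rule are hypothetical and are not taken from any particular IWR engine or from the cited paper.

```python
import math

# Hypothetical inventory: each letter decomposes into sub-letter strokes.
LETTER_GRAPHEMES = {
    "c": ["arc"],
    "a": ["arc", "downstroke"],
    "d": ["arc", "ascender"],
    "l": ["ascender"],
    "t": ["ascender", "crossbar"],
}


def word_to_graphemes(word):
    """Expand a word into the grapheme sequence it is expected to produce."""
    sequence = []
    for letter in word:
        sequence.extend(LETTER_GRAPHEMES[letter])
    return sequence


def word_confidence(word, observed):
    """Geometric mean of the scores of the expected grapheme at each segment.

    observed is a list with one dict per segment, mapping grapheme labels
    to scores in [0, 1]. A length mismatch or a zero score yields 0.0.
    """
    expected = word_to_graphemes(word)
    if len(expected) != len(observed):
        return 0.0
    log_sum = 0.0
    for grapheme, scores in zip(expected, observed):
        p = scores.get(grapheme, 0.0)
        if p == 0.0:
            return 0.0
        log_sum += math.log(p)
    return math.exp(log_sum / len(expected))


# Hypothetical per-segment scores for one handwritten word image.
OBSERVED = [
    {"arc": 0.9, "ascender": 0.1},
    {"arc": 0.8, "downstroke": 0.2},
    {"downstroke": 0.7, "arc": 0.3},
    {"ascender": 0.6, "arc": 0.4},
    {"crossbar": 0.6, "ascender": 0.4},
]

if __name__ == "__main__":
    for candidate in ["cat", "cad", "call"]:
        print(candidate, round(word_confidence(candidate, OBSERVED), 3))
```

The candidate with the highest confidence would be reported as the recognized word. A production system would also need to handle segmentations of different lengths rather than rejecting them outright.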

IWR is not meant to replace ICR and OCR engines, which work well with printed data. Rather, it reduces the number of character errors associated with those engines and is best suited to real-world documents that consist mostly of freeform, hard-to-recognize data, for which character-based engines are inherently unsuitable. [5]

Related Research Articles

Optical character recognition: Computer recognition of visual text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo or from subtitle text superimposed on an image.

Handwriting recognition: Ability of a computer to receive and interpret intelligible handwritten input

Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most plausible words.

Optical mark recognition is the process of reading information that people mark on surveys, tests and other paper documents.

Cursive: Style of penmanship in which characters are written joined together in a flowing manner

Cursive is any style of penmanship in which characters are written joined in a flowing manner, generally for the purpose of making writing faster, in contrast to block letters. Its functionality and modern-day usage vary across languages and regions; it is used publicly in artistic and formal documents as well as in private communication. Formal cursive is generally joined, but casual cursive is a combination of joins and pen lifts. The writing style can be further divided as "looped", "italic" or "connected".

Ol Chiki script: Alphabetic script for Santal people

The Ol Chiki script, also known as Ol Chemetʼ, Ol Ciki, Ol, and sometimes the Santali alphabet, was invented by Pandit Raghunath Murmu in 1925 and is the official writing system for Santali, an Austroasiatic language recognized as an official regional language in India. It has 30 letters, the forms of which are intended to evoke natural shapes. The script is written from left to right, and has two forms. Unicode does not maintain a distinction between these two, as is typical for print and cursive forms of scripts. In both forms, this alphabet was invented as a unicameral script.

The shapes of the letters are not arbitrary but reflect the names of the letters, which are words, usually the names of objects or actions that the pictorial shape of the character represents in conventionalized form.

Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed are related to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. The term can also include the phase of digitizing the document using a scanner and the phase of interpreting the document, for example using natural language processing (NLP) or image classification technologies. It is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog archives and historical documents.

Automatic identification and data capture (AIDC) refers to the methods of automatically identifying objects, collecting data about them, and entering them directly into computer systems, without human involvement. Technologies typically considered as part of AIDC include QR codes, bar codes, radio frequency identification (RFID), biometrics, magnetic stripes, optical character recognition (OCR), smart cards, and voice recognition. AIDC is also commonly referred to as "Automatic Identification", "Auto-ID" and "Automatic Data Capture".

In computer science, intelligent character recognition (ICR) is an advanced form of optical character recognition (OCR) or, more specifically, a handwriting recognition system that allows a computer to learn fonts and different styles of handwriting during processing, improving accuracy and recognition levels.

Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data. While text analytics is a growing and mature field that has great value because of the huge amounts of data being produced, processing of noisy text is gaining in importance because many common applications produce noisy text data. Noisy unstructured text data is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. Text produced by processing spontaneous speech with automatic speech recognition, or printed or handwritten text with optical character recognition, also contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter case information, pause-filling words such as “um” and “uh”, and other texting and speech disfluencies. Such text can be seen in large amounts in contact centers, chat rooms, optical character recognition (OCR) of text documents, short message service (SMS) text, etc. Documents with historical language can also be considered noisy with respect to today's knowledge about the language. Such text contains useful historical, religious and ancient medical knowledge. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.

Russian cursive: Handwritten form of Russian Cyrillic

Russian cursive is a variant of the Russian alphabet used for writing by hand. It is typically referred to as (ру́сский) рукопи́сный шрифт (rússky) rukopísny shrift, "(Russian) handwritten font". It is the handwritten form of the modern Russian Cyrillic script, used instead of the block letters seen in printed material. In addition, Russian italics for lowercase letters are often based on Russian cursive. Most handwritten Russian, especially in personal letters and schoolwork, uses the cursive alphabet. In Russian schools most children are taught from first grade how to write with this script.

TeleForm is a forms processing application originally developed by Cardiff Software and now owned by OpenText.

A text entry interface or text entry device is an interface that is used to enter text information in an electronic device. A commonly used device is a mechanical computer keyboard. Most laptop computers have an integrated mechanical keyboard, and desktop computers are usually operated primarily using a keyboard and mouse. Devices such as smartphones and tablets mean that interfaces such as virtual keyboards and voice recognition are becoming more popular as text entry systems.

OCR-A: Typeface designed for early computer OCR

OCR-A is a font created in 1968, in the early days of computer optical character recognition, when there was a need for a font that could be recognized not only by the computers of that day, but also by humans. OCR-A uses simple, thick strokes to form recognizable characters. The font is monospaced (fixed-width), with the printer required to place glyphs 0.254 cm apart, and the reader required to accept any spacing between 0.2286 cm and 0.4572 cm.

Forms processing is a process by which one can capture information entered into data fields and convert it into an electronic format. This can be done manually or automatically, but the general process is that hard copy data is filled out by humans and then "captured" from their respective fields and entered into a database or other electronic format.

OCR-B: Typeface

OCR-B is a monospace font developed in 1968 by Adrian Frutiger for Monotype, following the European Computer Manufacturers Association standard. Its function was to facilitate optical character recognition by specific electronic devices, originally for financial and bank-oriented uses. It was accepted as the world standard in 1973. It follows the ISO 1073-2:1976 (E) standard, refined in 1979. It includes all ASCII symbols, and other symbols needed in the bank environment. It is widely used for the human-readable digits in UPC/EAN barcodes. It is also used for machine-readable passports. It shares that purpose with OCR-A, but it is easier for the human eye and brain to read and it has a less technical look than OCR-A.

Handwritten biometric recognition: Process of identifying the author of a given text from the handwriting style

Handwritten biometric recognition is the process of identifying the author of a given text from the handwriting style. Handwritten biometric recognition belongs to behavioural biometric systems because it is based on something that the user has learned to do.

Scan-Optics LLC, founded in 1968, is an enterprise content management services company and optical character recognition (OCR) and image scanner manufacturer headquartered in Manchester, Connecticut.

Sayre's paradox is a dilemma encountered in the design of automated handwriting recognition systems. A standard statement of the paradox is that a cursively written word cannot be recognized without being segmented and cannot be segmented without being recognized. The paradox was first articulated in a 1973 publication by Kenneth M. Sayre, after whom it was named.

In Codice Ratio is a research project designed to study and use novel techniques such as optical character recognition and artificial intelligence to digitize works in the Vatican Apostolic Archive, most of which are handwritten.

References

  1. "IWR - Intelligent Word Recognition | AcronymFinder". www.acronymfinder.com. Retrieved February 28, 2019.
  2. "What is IWR? (Intelligent Word Recognition)". eFileCabinet. January 4, 2016. Retrieved February 28, 2019.
  3. "intelligent-character-recognition-icr".
  4. Álvarez, D.; Fernández, R.A.; Sánchez, L. (November 2017). "Fuzzy system for intelligent word recognition using a regular grammar". Journal of Applied Logic. 24: 45–53. doi:10.1016/j.jal.2016.11.023.
  5. "What is Intelligent Word Recognition | IGI Global". www.igi-global.com. Retrieved February 28, 2019.