Optical character recognition

Video of the process of scanning and real-time optical character recognition (OCR) with a portable scanner

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). [1]

OCR is widely used as a form of data entry from printed paper records, whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and they support a variety of image file format inputs. [2] Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components.

History

Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. [3] In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. [4] Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters. [5]

In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM.

Visually impaired users

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s. [3] [6] ) Kurzweil used the technology to create a reading machine for blind people to have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.[ citation needed ] In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications.

In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications such as real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have built-in OCR functionality typically use an OCR API to extract the text from the image file captured by the device. [7] [8] The OCR API returns the extracted text, along with information about the location of the detected text in the original image, to the device app for further processing (such as text-to-speech) or display.

Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters.

Applications

OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents.

The software can be used for:

Types

OCR is generally an offline process, which analyses a static document. There are cloud-based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. [14] Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition".

Techniques

Pre-processing

OCR software often pre-processes images to improve the chances of successful recognition. Techniques include: [15]

Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character. [22]
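The grid alignment described above for fixed-pitch fonts can be illustrated with a minimal sketch. It assumes a binarized image given as rows of 0 (white) and 1 (black) pixels, and it tries every candidate pitch and offset in a caller-supplied range. The function names `column_ink` and `fit_grid` are hypothetical and do not come from any particular OCR engine.

```python
def column_ink(image):
    """Count black pixels (1s) in each column of a binary image."""
    height = len(image)
    width = len(image[0])
    return [sum(image[y][x] for y in range(height)) for x in range(width)]


def fit_grid(image, min_pitch, max_pitch):
    """Return (pitch, offset) of the vertical grid whose lines cross the
    least ink on average. Ties keep the first (smallest) pitch found."""
    ink = column_ink(image)
    width = len(ink)
    best = None
    for pitch in range(min_pitch, max_pitch + 1):
        for offset in range(pitch):
            lines = range(offset, width, pitch)
            cost = sum(ink[x] for x in lines) / len(lines)
            if best is None or cost < best[0]:
                best = (cost, pitch, offset)
    return best[1], best[2]


# Toy image: three glyphs, each 3 pixels wide, separated by 1 blank column.
rows = ["###.###.###."] * 3
image = [[1 if ch == "#" else 0 for ch in row] for row in rows]
pitch, offset = fit_grid(image, min_pitch=3, max_pitch=6)
print(pitch, offset)
```

On this toy image the grid lines settle on the blank columns between glyphs. Real scans are noisy, so a practical system would likely tolerate some ink on the grid lines rather than look for an exact zero.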

Text recognition

There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters. [23]

Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The second pass is known as adaptive recognition and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded). [22]

As of December 2016, modern OCR software includes Google Docs OCR, ABBYY FineReader, and Transym. [26] [ needs update ] Others like OCRopus and Tesseract use neural networks which are trained to recognize whole lines of text instead of focusing on single characters.

A technique known as iterative OCR automatically crops a document into sections based on page layout. OCR is performed on the sections individually using variable character confidence level thresholds to maximize page-level OCR accuracy. A patent from the United States Patent Office has been issued for this method. [27]

The OCR result can be stored in the standardized ALTO format, a dedicated XML schema maintained by the United States Library of Congress. Other common formats include hOCR and PAGE XML.
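As a rough illustration of what such output looks like to a consumer, the sketch below reads word text and word confidence from a heavily simplified ALTO-style fragment using Python's standard library. The sample document is invented for this example and is not schema-validated. `read_words` is a hypothetical helper that matches elements by local name so that it does not depend on a specific ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Invented, simplified ALTO-style fragment (assumption: v3 namespace).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Optical" HPOS="10" VPOS="20" WIDTH="60" HEIGHT="12" WC="0.98"/>
      <SP/>
      <String CONTENT="charactcr" HPOS="75" VPOS="20" WIDTH="80" HEIGHT="12" WC="0.61"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_words(alto_xml):
    """Return (text, word_confidence) pairs for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            words.append((elem.get("CONTENT"), float(elem.get("WC", "nan"))))
    return words


for text, confidence in read_words(SAMPLE_ALTO):
    print(text, confidence)
```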

For a list of optical character recognition software, see Comparison of optical character recognition software.

Post-processing

OCR accuracy can be increased if the output is constrained by a lexicon, that is, a list of words that are allowed to occur in a document. [15] This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy. [22]

The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.

Near-neighbor analysis can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together. [28] For example, "Washington, D.C." is generally far more common in English than "Washington DOC".
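A minimal sketch of this idea, assuming a table of word-pair counts is already available, is shown below. The counts and the helper name `pick_candidate` are made up for illustration. A real system would derive such frequencies from a large corpus.

```python
# Hypothetical co-occurrence counts: (previous word, candidate) -> frequency.
BIGRAM_COUNTS = {
    ("washington", "d.c."): 950,
    ("washington", "doc"): 2,
}


def pick_candidate(previous_word, candidates, counts=BIGRAM_COUNTS):
    """Choose the candidate most often seen after previous_word."""
    prev = previous_word.lower()
    return max(candidates, key=lambda word: counts.get((prev, word.lower()), 0))


print(pick_candidate("Washington", ["DOC", "D.C."]))
```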

Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.

The Levenshtein Distance algorithm has also been used in OCR post-processing to further optimize results from an OCR API. [29]
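One plausible way to combine edit distance with a lexicon is sketched below: an OCR token that is not in the lexicon is replaced by the closest entry, provided the distance is small. The `max_distance` cutoff and the helper name `correct` are assumptions made for this example and are not taken from the cited source.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct(word, lexicon, max_distance=2):
    """Return the nearest lexicon entry, or the word unchanged if nothing
    is within max_distance (for example a proper noun)."""
    if not lexicon or word in lexicon:
        return word
    best = min(lexicon, key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


lexicon = ["recognition", "reception", "character"]
print(correct("rec0gnition", lexicon))
print(correct("Kurzweil", lexicon))
```

Leaving distant words untouched is one way to soften the problem noted above with words that are missing from the lexicon.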

Application-specific optimizations

In recent years,[ when? ] the major OCR technology providers began to tweak OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be had by taking into account business rules, standard expression,[ clarification needed ] or rich information contained in color images. This strategy is called "Application-Oriented OCR" or "Customized OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver's licenses, and automobile manufacturing.

The New York Times has adapted OCR technology into a proprietary tool called Document Helper, which enables its interactive news team to accelerate the processing of documents that need to be reviewed. The team notes that it can process as many as 5,400 pages per hour in preparation for reporters to review the contents. [30]

Workarounds

There are several techniques for solving the problem of character recognition by means other than improved OCR algorithms.

Forcing better input

Special fonts like OCR-A, OCR-B, or MICR fonts, with precisely specified sizing, spacing, and distinctive character shapes, allow a higher accuracy rate during transcription in bank check processing. Several prominent OCR engines were designed to capture text in popular fonts such as Arial or Times New Roman, and are incapable of capturing text in these specialized fonts, which differ markedly from popularly used fonts. Because Google Tesseract can be trained to recognize new fonts, it can recognize OCR-A, OCR-B and MICR fonts. [31]

Comb fields are pre-printed boxes that encourage humans to write more legibly, one glyph per box. [28] These are often printed in a dropout color which can be easily removed by the OCR system. [28]

Palm OS used a special set of glyphs, known as Graffiti, which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware. Users would need to learn how to write these special glyphs.

Zone-based OCR restricts the image to a specific part of a document. This is often referred to as Template OCR.

Crowdsourcing

Crowdsourcing character recognition to humans can process images quickly, like computer-driven OCR, but with higher accuracy than that obtained via computers. Practical systems include the Amazon Mechanical Turk and reCAPTCHA. The National Library of Finland has developed an online interface for users to correct OCRed texts in the standardized ALTO format. [32] Crowdsourcing has also been used not to perform character recognition directly but to invite software developers to develop image processing algorithms, for example, through the use of rank-order tournaments. [33]

Accuracy

Occurrence of laft and last in Google's n-grams database, in English documents from 1700 to 1900, based on OCR scans for the "English 2009" corpus
Occurrence of laft and last in Google's n-grams database, based on OCR scans for the "English 2012" corpus
Searches for words with a long s in English 2012 or later are normalized to an s.

Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission of fostering the improvement of automated technologies for understanding machine-printed documents, and it conducted the Annual Test of OCR Accuracy, the most authoritative such evaluation, from 1992 to 1996. [35]

Recognition of typewritten, Latin script text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%; [36] total accuracy can be achieved by human review or Data Dictionary Authentication. Other areas including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character) are still the subject of active research. The MNIST database is commonly used for testing systems' ability to recognize handwritten digits.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (a lexicon of words) is not used to correct non-existent words produced by the software, a character error rate of 1% (99% accuracy) may result in an error rate of 5% or worse if the measurement is based on whether each whole word was recognized with no incorrect letters. [37] Using a large enough dataset is important in neural-network-based handwriting recognition solutions. On the other hand, producing natural datasets is very complicated and time-consuming. [38]
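The gap between character-level and word-level error rates can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that character errors are independent and that every word has the same length. Both are simplifications introduced here, not claims from the cited source.

```python
def word_error_rate(char_accuracy, word_length):
    """Probability that a word contains at least one wrong character,
    assuming independent errors and a fixed word length."""
    return 1.0 - char_accuracy ** word_length


# 99% character accuracy, 5-letter words: roughly a 4.9% word error rate.
print(round(word_error_rate(0.99, 5), 4))
# Longer words fare worse under the same character accuracy.
print(round(word_error_rate(0.99, 8), 4))
```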

An example of the difficulties inherent in digitizing old text is the inability of OCR to differentiate between the "long s" and "f" characters. [39] [34]

Web-based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years[ when? ] (see Tablet PC history). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by pen computing software, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.[ citation needed ]

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a check (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script.[ citation needed ]

Most programs allow users to set "confidence rates". This means that if the software does not achieve the user's desired level of accuracy, the user can be notified for manual review.
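A simple way to picture this is a filter over per-word confidence scores, sketched below. The `(text, confidence)` pair format, the 0.90 default threshold and the function name `split_by_confidence` are assumptions made for illustration. Actual products expose confidence in their own formats.

```python
def split_by_confidence(words, threshold=0.90):
    """Separate recognized words into those accepted automatically and
    those queued for manual review."""
    accepted, review = [], []
    for text, confidence in words:
        if confidence >= threshold:
            accepted.append(text)
        else:
            review.append(text)
    return accepted, review


recognized = [("Optical", 0.98), ("charactcr", 0.61), ("recognition", 0.95)]
accepted, review = split_by_confidence(recognized)
print("accepted:", accepted)
print("needs review:", review)
```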

An error introduced by OCR scanning is sometimes termed a scanno (by analogy with the term typo). [40] [41]

Unicode

Characters to support OCR were added to the Unicode Standard in June 1993, with the release of version 1.1.

Some of these characters are mapped from fonts specific to MICR, OCR-A or OCR-B.
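The assigned characters in the Optical Character Recognition block (U+2440 to U+245F) can be listed with Python's standard `unicodedata` module, as in the sketch below. The helper name `assigned_ocr_characters` is hypothetical, and the result reflects the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata


def assigned_ocr_characters():
    """Return (code point label, character name) for every assigned
    character in the Optical Character Recognition block."""
    result = []
    for code_point in range(0x2440, 0x2460):
        # name() returns the default (None) for unassigned code points.
        name = unicodedata.name(chr(code_point), None)
        if name is not None:
            result.append((f"U+{code_point:04X}", name))
    return result


for label, name in assigned_ocr_characters():
    print(label, name)
```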

Optical Character Recognition [1] [2]
Official Unicode Consortium code chart (PDF)
         0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
U+244x   ⑀  ⑁  ⑂  ⑃  ⑄  ⑅  ⑆  ⑇  ⑈  ⑉  ⑊
U+245x
Notes
1. ^ As of Unicode version 15.1
2. ^ Empty cells indicate non-assigned code points

References

  1. OnDemand, HPE Haven. "OCR Document". Archived from the original on April 15, 2016.
  2. OnDemand, HPE Haven. "undefined". Archived from the original on April 19, 2016.
  3. Schantz, Herbert F. (1982). The history of OCR, optical character recognition. [Manchester Center, Vt.]: Recognition Technologies Users Association. ISBN 9780943072012.
  4. Dhavale, Sunita Vikrant (2017). Advanced Image-Based Spam Detection and Filtering Techniques. Hershey, PA: IGI Global. p. 91. ISBN 9781683180142.
  5. d'Albe, E. E. F. (July 1, 1914). "On a Type-Reading Optophone". Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 90 (619): 373–375. Bibcode:1914RSPSA..90..373D. doi:10.1098/rspa.1914.0061.
  6. "The History of OCR". Data Processing Magazine. 12: 46. 1970.
  7. "Extracting text from images using OCR on Android". June 27, 2015. Archived from the original on March 15, 2016.
  8. "[Tutorial] OCR on Google Glass". October 23, 2014. Archived from the original on March 5, 2016.
  9. Zeng, Qing-An (2015). Wireless Communications, Networking and Applications: Proceedings of WCNA 2014. Springer. ISBN 978-81-322-2580-5.
  10. "[javascript] Using OCR and Entity Extraction for LinkedIn Company Lookup". July 22, 2014. Archived from the original on April 17, 2016.
  11. "How To Crack Captchas". andrewt.net. June 28, 2006. Retrieved June 16, 2013.
  12. "Breaking a Visual CAPTCHA". Cs.sfu.ca. December 10, 2002. Retrieved June 16, 2013.
  13. Resig, John (January 23, 2009). "John Resig – OCR and Neural Nets in JavaScript". Ejohn.org. Retrieved June 16, 2013.
  14. Tappert, C. C.; Suen, C. Y.; Wakahara, T. (1990). "The state of the art in online handwriting recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 12 (8): 787. doi:10.1109/34.57669. S2CID 42920826.
  15. "Optical Character Recognition (OCR) – How it works". Nicomsoft.com. Retrieved June 16, 2013.
  16. Sezgin, Mehmet; Sankur, Bulent (2004). "Survey over image thresholding techniques and quantitative performance evaluation" (PDF). Journal of Electronic Imaging. 13 (1): 146. Bibcode:2004JEI....13..146S. doi:10.1117/1.1631315. Archived from the original (PDF) on October 16, 2015. Retrieved May 2, 2015.
  17. Gupta, Maya R.; Jacobson, Nathaniel P.; Garcia, Eric K. (2007). "OCR binarisation and image pre-processing for searching historical documents" (PDF). Pattern Recognition. 40 (2): 389. Bibcode:2007PatRe..40..389G. doi:10.1016/j.patcog.2006.04.043. Archived from the original (PDF) on October 16, 2015. Retrieved May 2, 2015.
  18. Trier, Oeivind Due; Jain, Anil K. (1995). "Goal-directed evaluation of binarisation methods" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 17 (12): 1191–1201. doi:10.1109/34.476511. Archived (PDF) from the original on October 16, 2015. Retrieved May 2, 2015.
  19. Milyaev, Sergey; Barinova, Olga; Novikova, Tatiana; Kohli, Pushmeet; Lempitsky, Victor (2013). "Image Binarization for End-to-End Text Understanding in Natural Images". 2013 12th International Conference on Document Analysis and Recognition (PDF). pp. 128–132. doi:10.1109/ICDAR.2013.33. ISBN 978-0-7695-4999-6. S2CID 8947361. Archived (PDF) from the original on November 13, 2017. Retrieved May 2, 2015.
  20. Pati, P.B.; Ramakrishnan, A.G. (2008). "Word Level Multi-script Identification". Pattern Recognition Letters. 29 (9): 1218–1229. Bibcode:2008PaReL..29.1218P. doi:10.1016/j.patrec.2008.01.027.
  21. "Basic OCR in OpenCV | Damiles". Blog.damiles.com. November 20, 2008. Retrieved June 16, 2013.
  22. Smith, Ray (2007). "An Overview of the Tesseract OCR Engine" (PDF). Archived from the original (PDF) on September 28, 2010. Retrieved May 23, 2013.
  23. "OCR Introduction". Dataid.com. Retrieved June 16, 2013.
  24. "How OCR Software Works". OCRWizard. Archived from the original on August 16, 2009. Retrieved June 16, 2013.
  25. "The basic pattern recognition and classification with openCV | Damiles". Blog.damiles.com. November 14, 2008. Retrieved June 16, 2013.
  26. Assefi, Mehdi (December 2016). "OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym". ResearchGate.
  27. "How the Best OCR Technology Captures 99.91% of Data". www.bisok.com. Retrieved May 27, 2021.
  28. Woodford, Chris (January 30, 2012). "How does OCR document scanning work?". Explain that Stuff. Retrieved June 16, 2013.
  29. "How to optimize results from the OCR API when extracting text from an image? - Haven OnDemand Developer Community". Archived from the original on March 22, 2016.
  30. Fehr, Tiff (March 26, 2019). "How We Sped Through 900 Pages of Cohen Documents in Under 10 Minutes". The New York Times. ISSN 0362-4331. Retrieved June 16, 2023.
  31. "Train Your Tesseract". Train Your Tesseract. September 20, 2018. Retrieved September 20, 2018.
  32. "What is the point of an online interactive OCR text editor? - Fenno-Ugrica". February 21, 2014.
  33. Riedl, C.; Zanibbi, R.; Hearst, M. A.; Zhu, S.; Menietti, M.; Crusan, J.; Metelsky, I.; Lakhani, K. (February 20, 2016). "Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms". International Journal on Document Analysis and Recognition. 19 (2): 155. arXiv:1410.6751. doi:10.1007/s10032-016-0260-8. S2CID 11873638.
  34. "Google Books Ngram Viewer". books.google.com. Retrieved July 20, 2023. When we generated the original Ngram Viewer corpora in 2009, our OCR wasn't as good […]. This was especially obvious in pre-19th century English, where the elongated medial-s (ſ) was often interpreted as an f, […]. Here's evidence of the improvements we've made since then, using the corpus operator to compare the 2009, 2012 and 2019 versions […]
  35. "Code and Data to evaluate OCR accuracy, originally from UNLV/ISRI". Google Code Archive.
  36. Holley, Rose (April 2009). "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs". D-Lib Magazine. Retrieved January 5, 2014.
  37. Suen, C.Y.; Plamondon, R.; Tappert, A.; Thomassen, A.; Ward, J.R.; Yamamoto, K. (May 29, 1987). Future Challenges in Handwriting and Computer Applications. 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987. Retrieved October 3, 2008.
  38. Mohseni, Ayda; Azmi, Reza; Maleki, Arvin and Layeghi, Kamran (2019). Comparison of Synthesized and Natural Datasets in Neural Network Based Handwriting Solutions. ITCT.
  39. Kapidakis, Sarantos; Mazurek, Cezary and Werla, Marcin (2015). Research and Advanced Technology for Digital Libraries. Springer. p. 257. ISBN 9783319245928.
  40. Atkinson, Kristine H. (2015). "Reinventing nonpatent literature for pharmaceutical patenting". Pharmaceutical Patent Analyst. 4 (5): 371–375. doi:10.4155/ppa.15.21. PMID 26389649.
  41. http://www.hoopoes.com/jargon/entry/scanno.shtml (dead link)