OCRopus

Developer(s) Thomas Breuel, DFKI
Initial release 9 April 2007 [1]
Stable release 1.3.3 / 16 December 2017
Written in C++ and Python
Operating system FreeBSD, Linux, Mac OS X
Type Optical character recognition
License Apache License v2.0
Website github.com/ocropus/ocropy

OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0 with a very modular design using command-line interfaces.

OCRopus is developed under the lead of Thomas Breuel from the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and was sponsored by Google.

Description

OCRopus was designed especially for high-volume digitization projects of books, such as Google Books, the Internet Archive, or libraries. It is intended to support a large number of languages and fonts. [2] It can also be used for desktop and office applications or in applications for visually impaired people.

OCRopus is divided into main components, and one or more scripts are available for each of them. This modular approach allows individual workflows to be assembled and individual steps to be exchanged.

By default, OCRopus comes with a model for English texts and a model for text in Fraktur. These models refer to the script and are largely independent of the actual language. [3] New characters or language variants can be trained either from scratch or added later.

Recent text recognition is based on recurrent neural networks (LSTM) and does not require a language model. This makes it possible to train language-independent models for which good recognition results in English, German and French have been shown at the same time. [4] In addition to the Latin script, there are results for other scripts such as Sanskrit, Urdu, Devanagari, and Greek.

Very good recognition rates can be achieved with appropriate training. This extra effort is particularly worthwhile for difficult documents or for scripts that are no longer in common use and are therefore not the focus of other OCR software. [5] [6]
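Rates of this kind are commonly reported as a character error rate: the edit distance between the OCR output and a manually corrected transcription, divided by the length of that transcription. The source does not say which measure is meant, so the following is only a minimal sketch under that assumption. The function names `edit_distance` and `character_error_rate` are hypothetical and not part of OCRopus.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def character_error_rate(truth, ocr):
    """Edit distance normalised by the length of the ground-truth text."""
    if not truth:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(truth, ocr) / len(truth)
```

A model can then be compared before and after training by averaging this value over a set of transcribed lines.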

History

On 9 April 2007, OCRopus was announced as a Google-sponsored project to develop advanced OCR technologies. [1] Funding was granted for three years and covered in particular PhD and postdoctoral positions at DFKI and the University of Kaiserslautern. In return, OCRopus was also used for automatic text recognition in Google Book Search. [7] The software was released under an open-source license from the start to facilitate collaboration between industrial and academic research. [8] OCRopus has received further funding from the Andrew W. Mellon Foundation and the BMBF. [9]

The first alpha version, 0.1, was released on 22 October 2007. Several pre-releases followed between December 2007 and May 2009, leading to the stable version 0.4.4 in March 2010. [10] Originally, the software was developed in C++, Python and Lua, with Jam as the build system. The source code was then completely refactored into Python modules and released as version 0.5 (June 2012). [11]

Initially, Tesseract was the only text recognition module. From 2009 (version 0.4), Tesseract was supported only as a plugin, and a self-developed, likewise segment-based text recognizer was used instead. [12] This recognizer was combined with OpenFST [13] for language modeling after the recognition step. From 2013 onwards, recognition with recurrent neural networks (LSTM) was offered in addition; since the release of version 1.0 in November 2014 it has been the only recognizer. [14] [15]

The source code is managed over GitHub and is maintained and developed by a developer community. [16] The current version of OCRopus is 1.3.3 (December 2017). [17]

The OCR software kraken, which is used by the transcription platform eScriptorium, is a fork of OCRopus. It added support for right-to-left scripts. [18] Calamari is another fork, based on kraken.

Thomas Breuel also developed a successor OCRopus 2 and is actively working on OCRopus 4. [19]

Usage

Workflow diagram of the separate command line tools from OCRopus.

OCRopus can be used from the command line. Once installed, it can be invoked by specifying the input images. It will output the recognized text to standard output directly or write it as hOCR (HTML-based) code into files, from which it then can be transformed to a searchable PDF. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line). [20]
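Because hOCR is ordinary HTML, the recognized lines and their positions can be read back with a standard HTML parser. The following is a minimal sketch using only the Python standard library. It assumes text lines are marked with the class `ocr_line` and carry a `bbox x0 y0 x1 y1` property in the `title` attribute, as the hOCR format defines. The sample markup is a hand-written illustration, not actual OCRopus output, and the names `HocrLineParser`, `parse_bbox` and `extract_lines` are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; skipped so nesting depth stays correct.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_line = False
        self._depth = 0
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._in_line:
            self._depth += 1  # nested element, e.g. a word span
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._in_line = True
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._in_line:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._in_line:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._chunks).split())
            self.lines.append((self._bbox, text))
            self._in_line = False


def extract_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    return parser.lines


# Hand-written sample in hOCR style (illustrative only).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
<span class='ocr_line' title='bbox 10 20 300 60'>first line</span>
<span class='ocr_line' title='bbox 10 70 320 110'>second <b>line</b></span>
</div>
"""
```

Calling `extract_lines(SAMPLE)` yields one `(bbox, text)` pair per line. A PDF generator would place such text invisibly at those coordinates on top of the page image to make the scan searchable.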

Example for the OCRopus calls to recognize the text in an image:

    # perform binarization
    ocropus-nlbin tests/ersch.png -o book

    # perform page layout analysis
    ocropus-gpageseg book/0001.bin.png

    # perform text line recognition (with a fraktur model)
    ocropus-rpred -m models/fraktur.pyrnn.gz book/0001/*.bin.png

    # generate HTML output
    ocropus-hocr book/0001.bin.png -o book/0001.html
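The four calls above can also be chained from a small script. The following is a minimal sketch, assuming the OCRopus command-line tools are on the `PATH` and that output files follow the naming seen in the example (`book/0001.bin.png`, `book/0001/*.bin.png`). The function names `build_pipeline` and `run_pipeline` are hypothetical and not part of OCRopus. Only the options shown in the example (`-o`, `-m`) are used.

```python
import glob
import subprocess


def build_pipeline(image, outdir="book", page="0001",
                   model="models/fraktur.pyrnn.gz"):
    """Return the four OCRopus calls as argument lists, in execution order.

    The paths mirror the example above and are assumptions for other inputs.
    """
    page_png = f"{outdir}/{page}.bin.png"
    return [
        ["ocropus-nlbin", image, "-o", outdir],                    # binarization
        ["ocropus-gpageseg", page_png],                            # layout analysis
        ["ocropus-rpred", "-m", model,
         f"{outdir}/{page}/*.bin.png"],                            # line recognition
        ["ocropus-hocr", page_png, "-o", f"{outdir}/{page}.html"], # hOCR output
    ]


def run_pipeline(steps):
    """Run each step in turn, stopping at the first failure."""
    for argv in steps:
        expanded = []
        for arg in argv:
            # Expand wildcards just before the call: the line images
            # only exist once the segmentation step has finished.
            matches = sorted(glob.glob(arg)) if "*" in arg else []
            expanded.extend(matches or [arg])
        subprocess.run(expanded, check=True)
```

For example, `run_pipeline(build_pipeline("tests/ersch.png"))` would issue the same sequence as the manual calls. Keeping the command construction separate from execution makes it easy to swap a single step, in line with the modular design described earlier.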

Other tools concentrate on the training part of OCRopus. There are OCRopus models to extract text from Latin, Greek, Cyrillic and Indic scripts. [21]

References

  1. Breuel, Thomas (9 April 2007). "Announcing the OCRopus Open Source OCR System". Google Developers Blog. Retrieved 29 December 2017.
  2. Breuel, Thomas (2009). "Recent progress on the OCRopus OCR system". Proceedings of the International Workshop on Multilingual OCR - MOCR '09. New York, NY, USA: ACM. pp. 2:1–2:10. doi:10.1145/1577802.1577805. ISBN 9781605586984. S2CID 16920122.
  3. "Models". ocropy wiki. Retrieved 5 January 2018.
  4. Ul-Hasan, Adnan; Breuel, Thomas M. (2013). "Can we build language-independent OCR using LSTM networks?". Proceedings of the 4th International Workshop on Multilingual OCR - MOCR '13. New York, NY, USA: ACM. pp. 9:1–9:5. doi:10.1145/2505377.2505394. ISBN 9781450321143. S2CID 15054318.
  5. Springmann, Uwe (1 December 2016). "OCR für alte Drucke". Informatik-Spektrum (in German). 39 (6): 459–462. doi:10.1007/s00287-016-1004-3. ISSN 0170-6012. S2CID 26680054.
  6. Simistira, F.; Ul-Hassan, A.; Papavassiliou, V.; Gatos, B.; Katsouros, V.; Liwicki, M. (August 2015). "Recognition of historical Greek polytonic scripts using LSTM networks". 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 766–770. doi:10.1109/icdar.2015.7333865. ISBN 978-1-4799-1805-8. S2CID 39049104.
  7. "Research project OCRopus". dfki.de. Retrieved 5 January 2018.
  8. Breuel, Thomas M. (28 January 2008). "The OCRopus open source OCR system". In Yanikoglu, Berrin A; Berkner, Kathrin (eds.). Document Recognition and Retrieval XV. Vol. 6815. pp. 68150F–68150F–15. Bibcode:2008SPIE.6815E..0FB. CiteSeerX 10.1.1.99.8505. doi:10.1117/12.783598. S2CID 14728635.
  9. "ocropus project website". Google Project Hosting. January 2019. Archived from the original on 24 December 2012.
  10. "Older versions - ocropy". GitHub. Retrieved 5 January 2018.
  11. "OCRopus 0.5". Google Groups. 2 June 2012.
  12. OCRopus doesn't even link with Tesseract by default.
  13. Official OpenFST website.
  14. "ocropy - release v1.0". GitHub. 2 November 2014. Retrieved 5 January 2018.
  15. Breuel, T. M.; Ul-Hasan, A.; Al-Azawi, M. A.; Shafait, F. (August 2013). "High-Performance OCR for Printed English and Fraktur Using LSTM Networks". 2013 12th International Conference on Document Analysis and Recognition. pp. 683–687. doi:10.1109/icdar.2013.140. ISBN 978-0-7695-4999-6. S2CID 7244356.
  16. "ocropy: Python-based tools for document analysis and OCR". GitHub. Retrieved 5 January 2018.
  17. "Releases ocropy". GitHub. Retrieved 5 January 2018.
  18. "Kraken - a Universal Text Recognizer for the Humanities". Retrieved 23 January 2024.
  19. "The OCRopus OCR System and Related Software". GitHub. Retrieved 27 August 2021.
  20. "ocropy wiki". GitHub. Retrieved 30 December 2017.
  21. "ocropy models". GitHub. Retrieved 13 March 2018.