Tesseract (software)

Tesseract
Original author(s) Ray Smith, Hewlett-Packard [1]
Developer(s) Google and others
Stable release
5.4.1 [2] / 11 June 2024
Repository
Written in C and C++
Operating system Linux, Windows, and macOS
Available in Interface: English
Recognition:

Afrikaans, Albanian, Arabic, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Czech, Cherokee, Croatian, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hebrew, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Maltese, Malay, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Vietnamese [3]

(more can be added using included training files) [4]
Type Optical character recognition
License Apache License 2.0
Website github.com/tesseract-ocr

Tesseract is an optical character recognition engine for various operating systems. [5] It is free software, released under the Apache License. [1] [6] [7] Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development was sponsored by Google in 2006. [8]

In 2006, Tesseract was considered one of the most accurate open-source OCR engines available. [7] [9]

History

The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with further changes made in 1996 to port it to Windows and a partial migration from C to C++ in 1998. Most of the code was written in C, with some written in C++. Since then, all the code has been converted so that it compiles with a C++ compiler.[ citation needed ] Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development was sponsored by Google in 2006. [8]

Version 4 added an LSTM-based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages. [10] In addition, 37 scripts are supported.

Version 5 was released in 2021, after more than two years of testing and development. [11]

Features

Tesseract was in the top three OCR engines in terms of character accuracy in 1995. [12] It is available for Linux, Windows and Mac OS X. [6] [7]

Tesseract, up to and including version 2, could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, and so inputting multi-columned text, images, or equations produced garbled output. Since version 3, Tesseract has supported output text formatting, hOCR [13] positional information and page-layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportionally spaced. [7]
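The positional information in hOCR output can be read with an ordinary HTML parser. The following is a minimal sketch, assuming word elements are `span` tags of class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` property, as the hOCR format describes. The sample markup is hand-written for illustration rather than real Tesseract output, and the names `WordBoxParser`, `parse_bbox` and `extract_words` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]


def parse_bbox(title: str) -> Optional[BBox]:
    """Extract the 'bbox x0 y0 x1 y1' property from an hOCR title attribute."""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            x0, y0, x1, y1 = (int(value) for value in fields[1:])
            return x0, y0, x1, y1
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs from spans of class 'ocrx_word'."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Words without a parsable bbox are skipped (bbox stays None).
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr: str) -> List[Tuple[str, BBox]]:
    """Return every recognized word with its bounding box."""
    parser = WordBoxParser()
    parser.feed(hocr)
    return parser.words


# Hand-written sample in the hOCR style (an assumption, not captured output).
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 109 92 173 116; x_wconf 93'>world</span>"
    "</span>"
)

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The sketch assumes word spans are not nested inside one another, so a production reader would likely need a more careful traversal.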

The initial versions of Tesseract could only recognize English-language text.

Tesseract v2 added six additional Western languages (French, Italian, German, Spanish, Brazilian Portuguese, Dutch).

Version 3 extended language support significantly to include ideographic (Chinese & Japanese) and right-to-left (e.g. Arabic, Hebrew) languages, as well as many more scripts. New languages included Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, German (Fraktur script), Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese.

Version 3.04, released in July 2015, added 39 language/script combinations, bringing the total count of supported languages to over 100. New language codes included: amh (Amharic), asm (Assamese), aze_cyrl (Azerbaijani in Cyrillic script), bod (Tibetan), bos (Bosnian), ceb (Cebuano), cym (Welsh), dzo (Dzongkha), fas (Persian), gle (Irish), guj (Gujarati), hat (Haitian and Haitian Creole), iku (Inuktitut), jav (Javanese), kat (Georgian), kat_old (Old Georgian), kaz (Kazakh), khm (Central Khmer), kir (Kyrgyz), kur (Kurdish), lao (Lao), lat (Latin), mar (Marathi), mya (Burmese), nep (Nepali), ori (Oriya), pan (Punjabi), pus (Pashto), san (Sanskrit), sin (Sinhala), srp_latn (Serbian in Latin script), syr (Syriac), tgk (Tajik), tir (Tigrinya), uig (Uyghur), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek in Cyrillic script), yid (Yiddish). [14] Tesseract can be trained to work in other languages. [7]

Tesseract can process right-to-left text such as Arabic or Hebrew, many Indic scripts, and CJK quite well. Accuracy rates are given in Ray Smith's presentation for the Tesseract tutorial at DAS 2016 in Santorini. [15]

Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus. [16]

Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels; [17] any rotation or skew must be corrected, or no text will be recognized; low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page; and dark borders must be manually removed, or they will be misinterpreted as characters. [18]
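The scaling requirement reduces to simple arithmetic. The following is a minimal sketch, assuming the x-height of the input text has already been measured in pixels. The 20-pixel threshold is the figure quoted above; the helper names `upscale_factor` and `scaled_size` are illustrative and not part of Tesseract.

```python
import math
from typing import Tuple

MIN_X_HEIGHT_PX = 20  # minimum x-height quoted in the text above


def upscale_factor(measured_x_height_px: float,
                   target_px: float = MIN_X_HEIGHT_PX) -> float:
    """Return the enlargement factor needed for the x-height to reach target_px.

    Hypothetical helper: returns 1.0 when the text is already large enough.
    """
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, target_px / measured_x_height_px)


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """Scale image dimensions, rounding up so the target is never undershot."""
    return math.ceil(width * factor), math.ceil(height * factor)


if __name__ == "__main__":
    factor = upscale_factor(8)             # screenshot text with an 8 px x-height
    print(factor)                          # 2.5
    print(scaled_size(1280, 720, factor))  # (3200, 1800)
```

The resulting dimensions can then be passed to whatever image tool performs the resize. Skew correction, high-pass filtering and border removal are separate steps that this sketch does not cover.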

User interfaces

Tesseract configuration window in OCRFeeder

Tesseract is executed from the command-line interface. [19] While Tesseract is not supplied with a GUI, there are many separate projects which provide a GUI for it. [20] One common example is OCRFeeder. [21] Another is gImageReader, a cross-platform open-source GUI.
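Because Tesseract is driven from the command line, scripts usually call it as a subprocess. The following is a minimal sketch, assuming the invocation form `tesseract imagename outputbase [-l lang] [configfile]` documented in the Tesseract manual page, with multiple languages joined by `+` and the `hocr` config name requesting hOCR output. The helper `build_tesseract_command` and the file names are hypothetical; the helper only assembles the argument list and does not run anything.

```python
from typing import List, Sequence


def build_tesseract_command(image_path: str,
                            output_base: str,
                            languages: Sequence[str] = ("eng",),
                            hocr: bool = False) -> List[str]:
    """Assemble an argument list for the `tesseract` command-line program.

    Hypothetical helper: language codes (e.g. "eng", "deu") are joined
    with "+" for the -l option.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if hocr:
        command.append("hocr")  # config name that requests hOCR output
    return command


if __name__ == "__main__":
    print(build_tesseract_command("page.png", "page", ["eng", "deu"]))
    # To actually run it (requires Tesseract to be installed):
    #   import subprocess
    #   subprocess.run(build_tesseract_command("page.png", "page"), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.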

Reception

In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm." [5]

In November 2020, Brewster Kahle of the Internet Archive praised Tesseract, saying:

Tesseract has made a major step forward in the last few years. When we last evaluated the accuracy it was not as good as the proprietary OCR, but that has changed– we have done evaluations and it is just as good, and can get better for our application because of its new architecture. [22]

References

  1. Google (2008). "tesseract-ocr". GitHub. Retrieved 8 March 2016.
  2. Stefan Weil. "Release 5.4.1 · tesseract-ocr/tesseract". Retrieved 12 June 2024.
  3. "Languages supported in different versions of Tesseract". Archived from the original on 8 August 2022. Retrieved 21 November 2022.
  4. "Tesseract documentation – Traineddata files ... – Language data files for Tesseract". Archived from the original on 5 September 2022. Retrieved 21 November 2022.
  5. Kay, Anthony (July 2007). "Tesseract: an Open-Source Optical Character Recognition Engine". Linux Journal. Retrieved 28 September 2011.
  6. Vincent, Luc (August 2006). "Announcing Tesseract OCR". Archived from the original on 26 October 2006. Retrieved 26 June 2008.
  7. Canonical Ltd. (February 2011). "OCR". Retrieved 11 February 2011.
  8. Announcing Tesseract OCR - The official Google blog
  9. Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". Archived from the original on 28 May 2022. Retrieved 18 July 2008.
  10. "TESSERACT(1) Manual Page". GitHub. Retrieved 15 March 2018.
  11. Schmidt, Julia (1 December 2021). "OCR Engine Tesseract 5.0 converts to float for faster training and recognition • DEVCLASS". DEVCLASS. Retrieved 20 December 2021.
  12. Rice, Stephen V.; Jenkins, Frank R.; Nartker, Thomas A. "The Fourth Annual Test of OCR Accuracy". expervision.com. Retrieved 21 May 2013.
  13. Tesseract Project (February 2011). "Issue 263: patch to enable hOCR output". Archived from the original on 13 November 2012. Retrieved 26 February 2011.
  14. "langdata - Source training data for Tesseract for lots of languages". GitHub. Retrieved 6 November 2016.
  15. "Training LSTM networks on 100 languages and test results" (PDF). GitHub. Retrieved 18 March 2018.
  16. Announcing the OCRopus Open Source OCR System Archived 2007-04-14 at the Wayback Machine (Thomas Breuel, OCRopus Project Leader).
  17. "FAQ - tesseract-ocr - Frequently Asked Questions - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". Archived from the original on 23 December 2015. Retrieved 30 May 2014.
  18. "ImproveQuality - tesseract-ocr - Advice on improving the quality of your output. - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". 27 January 2014. Archived from the original on 20 September 2015. Retrieved 30 May 2014.
  19. Google Code – Tesseract Readme
  20. "3rdParty - tesseract-ocr - GUIs and Other Projects using Tesseract OCR". github.com. Retrieved 9 March 2024.
  21. "OCRFeeder". GNOME wiki. Retrieved 12 January 2019.
  22. Brewster Kahle (23 November 2020). "FOSS wins again: Free and Open Source Communities comes through on 19th Century Newspapers (and Books and Periodicals...) - Internet Archive Blogs". blog.archive.org. Retrieved 1 December 2020.