Tesseract (software)

Tesseract
	Tesseract 4.1.1 reading an image.
Original authors	Ray Smith, Hewlett-Packard
Developers	Google and others
Stable release	5.5.1 / 25 May 2025
Repository	github.com/tesseract-ocr/tesseract.git
Written in	C++
Operating system	Linux, Windows, and macOS
Available in	Interface: English; Recognition: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Cherokee, Chinese, Corsican, Croatian, Czech, Danish, Dutch, Dzongkha, English, Esperanto, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Inuktitut, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Malayalam, Macedonian, Maltese, Malay, Maori, Marathi, Mongolian, Nepali, Norwegian, Occitan, Oriya, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Spanish, Sundanese, Swahili, Swedish, Syriac, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Tigrinya, Tongan, Turkish, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Welsh, West Frisian, Yiddish, Yoruba (more can be added using included training files)
Type	Optical character recognition
License	Apache License 2.0
Website	github.com/tesseract-ocr

Last updated November 16, 2025

Tesseract is an optical character recognition engine for various operating systems.^[5] It is free software, released under the Apache License.^[1]^[6]^[7] Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development was sponsored by Google in 2006.^[8]

In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.^[7]^[9]


History

The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England, and Greeley, Colorado, United States, between 1985 and 1994. Further changes were made in 1996 to port it to Windows, and part of the code was migrated from C to C++ in 1998. At that stage most of the code was written in C, with some in C++; all of it has since been converted to C++.^[1] Very little work was done in the following decade. Tesseract was then released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV), and Google began sponsoring its development in 2006.^[8]

Version 4 added an LSTM-based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages.^[10] In addition, 37 scripts are supported.

Since 2018, the Mannheim University Library has contributed to the development of Tesseract through several projects. Most of these were funded by the German Research Foundation.^[11]^[12]

Version 5 was released in 2021.^[13]

Development

In 1995, Tesseract was one of the top three OCR engines in terms of character accuracy.^[14] It is available for Linux, Windows and macOS.^[6]^[7]

Up to and including version 2, Tesseract could only accept TIFF images of simple one-column text as input. These early versions did not include layout analysis, so multi-column text, images, or equations produced garbled output. Since version 3, Tesseract has supported output text formatting, hOCR^[15] positional information and page-layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportionally spaced.^[7]
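The positional information in hOCR output can be read back with an ordinary HTML parser. The following is a minimal sketch, assuming the common hOCR layout in which each recognized word is a span with class ocrx_word whose title attribute contains "bbox x0 y0 x1 y1". The names HocrWordParser, parse_bbox and SAMPLE are hypothetical, and the sample markup is hand-written for illustration rather than real Tesseract output.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Word spans without a bbox are skipped (bbox stays None).
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


# Hand-written sample in the assumed hOCR layout (not real OCR output).
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 580 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 109 92 190 116; x_wconf 93'>world</span>"
    "</span>"
)

if __name__ == "__main__":
    parser = HocrWordParser()
    parser.feed(SAMPLE)
    for text, bbox in parser.words:
        print(text, bbox)
```

The sketch handles only flat word spans. Markup nested inside a word span would need extra bookkeeping.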

The initial versions of Tesseract could only recognize English-language text.

Tesseract v2 added six additional Western languages (French, Italian, German, Spanish, Brazilian Portuguese, Dutch).

Version 3 extended language support significantly to include ideographic (Chinese & Japanese) and right-to-left (e.g. Arabic, Hebrew) languages, as well as many more scripts. New languages included Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, German (Fraktur script), Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese.

Version 3.04, released in July 2015, added a further 39 language/script combinations, bringing the total number of supported languages to over 100. New language codes included: amh (Amharic), asm (Assamese), aze_cyrl (Azerbaijani in Cyrillic script), bod (Tibetan), bos (Bosnian), ceb (Cebuano), cym (Welsh), dzo (Dzongkha), fas (Persian), gle (Irish), guj (Gujarati), hat (Haitian and Haitian Creole), iku (Inuktitut), jav (Javanese), kat (Georgian), kat_old (Old Georgian), kaz (Kazakh), khm (Central Khmer), kir (Kyrgyz), kur (Kurdish), lao (Lao), lat (Latin), mar (Marathi), mya (Burmese), nep (Nepali), ori (Oriya), pan (Punjabi), pus (Pashto), san (Sanskrit), sin (Sinhala), srp_latn (Serbian in Latin script), syr (Syriac), tgk (Tajik), tir (Tigrinya), uig (Uyghur), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek in Cyrillic script), yid (Yiddish).^[16] Tesseract can also be trained to work in other languages.^[7]
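These codes are the values passed to the command-line option -l, which accepts several codes joined with "+" when a page mixes languages. As a small sketch, the helper below builds that argument from a list of codes. The table KNOWN_CODES holds only a handful of the codes listed above, and both it and the function name language_argument are hypothetical names chosen for this example.

```python
# A few of the language codes listed above (deliberately not the full set).
KNOWN_CODES = {
    "amh": "Amharic",
    "bod": "Tibetan",
    "cym": "Welsh",
    "fas": "Persian",
    "kat": "Georgian",
    "srp_latn": "Serbian in Latin script",
    "uzb_cyrl": "Uzbek in Cyrillic script",
}


def language_argument(codes):
    """Join language codes into the value expected by the -l option."""
    codes = list(codes)
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [code for code in codes if code not in KNOWN_CODES]
    if unknown:
        raise ValueError("unknown language code(s): " + ", ".join(unknown))
    return "+".join(codes)


if __name__ == "__main__":
    print(language_argument(["cym", "fas"]))
```

Checking against a fixed table is only a convenience for the sketch. In practice the usable codes depend on which trained data files are installed.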

Accuracy rates for other languages were presented by Ray Smith at DAS 2016 in Santorini.^[17]

Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus.^[18]

Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels;^[19] any rotation or skew must be corrected, or no text will be recognized; low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page; and dark borders must be manually removed, or they will be misinterpreted as characters.^[20]
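The scaling requirement reduces to simple arithmetic: if the measured x-height is below 20 pixels, the image is enlarged by the ratio of the two. The sketch below works out that factor and the resulting image size. The function names are hypothetical, and the x-height is assumed to have been measured by some other means. Rounding the new size up is a choice made here, not something the sources prescribe.

```python
import math

MIN_X_HEIGHT_PX = 20  # minimum x-height given in the text above


def upscale_factor(measured_x_height_px, minimum=MIN_X_HEIGHT_PX):
    """Return the factor by which to enlarge an image (never below 1.0)."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, minimum / measured_x_height_px)


def scaled_size(width, height, factor):
    """Return the (width, height) in pixels after scaling, rounded up."""
    return math.ceil(width * factor), math.ceil(height * factor)


if __name__ == "__main__":
    # A hypothetical 800x600 screenshot whose text has an 8-pixel x-height.
    factor = upscale_factor(8)
    print(factor, scaled_size(800, 600, factor))
```

For the 8-pixel case the factor is 2.5, so the 800x600 image becomes 2000x1500. An image whose x-height is already 20 pixels or more is left at its original size.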

User interfaces

Tesseract is executed from the command-line interface.^[21] While Tesseract is not supplied with a GUI, many separate projects provide one.^[22] One common example is OCRFeeder;^[23] another is gImageReader, a cross-platform open-source GUI.
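Because Tesseract is a command-line program, scripts usually drive it by assembling an argument list and running it as a subprocess. The sketch below assumes the usual invocation form, tesseract IMAGE OUTPUTBASE -l LANG [CONFIGFILE], where a config name such as hocr selects the output format. The file names page.png and out, and the helper names build_command and run_ocr, are hypothetical.

```python
import shutil
import subprocess


def build_command(image_path, output_base, lang="eng", config=None):
    """Assemble the argument list for one Tesseract run."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if config:
        cmd.append(config)  # e.g. "hocr" to request hOCR output
    return cmd


def run_ocr(image_path, output_base, lang="eng", config=None):
    """Run Tesseract, failing early if the executable is not installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_command(image_path, output_base, lang, config)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr() to actually execute it.
    print(" ".join(build_command("page.png", "out", lang="eng", config="hocr")))
```

Keeping command construction separate from execution makes the argument list easy to inspect or log before anything is run.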

Reception

In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."^[5]

In November 2020, Brewster Kahle from the Internet Archive praised Tesseract, saying:

Tesseract has made a major step forward in the last few years. When we last evaluated the accuracy it was not as good as the proprietary OCR, but that has changed– we have done evaluations and it is just as good, and can get better for our application because of its new architecture.^[24]

References

[1] "tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)". GitHub. 2025. Retrieved 5 August 2025.
[2] "Release 5.5.1 · tesseract-ocr/tesseract". Retrieved 25 May 2025.
[3] "Languages supported in different versions of Tesseract". Archived from the original on 8 August 2022. Retrieved 21 November 2022.
[4] "Tesseract documentation – Traineddata files ... – Language data files for Tesseract". Archived from the original on 5 September 2022. Retrieved 21 November 2022.
[5] Kay, Anthony (July 2007). "Tesseract: an Open-Source Optical Character Recognition Engine". Linux Journal. Retrieved 28 September 2011.
[6] Vincent, Luc (August 2006). "Announcing Tesseract OCR". Archived from the original on 26 October 2006. Retrieved 26 June 2008.
[7] Canonical Ltd. (February 2011). "OCR". Retrieved 11 February 2011.
[8] "Announcing Tesseract OCR". The official Google blog.
[9] Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". Archived from the original on 28 May 2022. Retrieved 18 July 2008.
[10] "TESSERACT(1) Manual Page". GitHub. Retrieved 5 August 2025.
[11] "Optimized use of OCR methods – Tesseract as a component of the OCR-D workflow". DFG. Retrieved 5 August 2025.
[12] Weil, Stefan; Kamlah, Jan; Schmidt, Thomas (2024). "Abschlussbericht zu DFG-Projekt "Workflow für werkspezifisches Training auf Basis generischer Modelle mit OCR-D sowie Ground-Truth-Aufwertung"" (in German). Mannheim: Mannheim University Library. Retrieved 5 August 2025.
[13] Schmidt, Julia (1 December 2021). "OCR Engine Tesseract 5.0 converts to float for faster training and recognition". DEVCLASS. Retrieved 20 December 2021.
[14] Rice, Stephen V.; Jenkins, Frank R.; Nartker, Thomas A. "The Fourth Annual Test of OCR Accuracy". expervision.com. Retrieved 21 May 2013.
[15] Tesseract Project (February 2011). "Issue 263: patch to enable hOCR output". Archived from the original on 13 November 2012. Retrieved 26 February 2011.
[16] "langdata - Source training data for Tesseract for lots of languages". GitHub. Retrieved 6 November 2016.
[17] "Training LSTM networks on 100 languages and test results" (PDF). GitHub. Retrieved 5 August 2025.
[18] Breuel, Thomas (OCRopus Project Leader). "Announcing the OCRopus Open Source OCR System". Archived 14 April 2007 at the Wayback Machine.
[19] "FAQ - tesseract-ocr - Frequently Asked Questions". Google Project Hosting. Archived from the original on 23 December 2015. Retrieved 30 May 2014.
[20] "ImproveQuality - tesseract-ocr - Advice on improving the quality of your output". Google Project Hosting. 27 January 2014. Archived from the original on 20 September 2015. Retrieved 30 May 2014.
[21] "Tesseract Readme". Google Code.
[22] "3rdParty - tesseract-ocr - GUIs and Other Projects using Tesseract OCR". github.com. Retrieved 9 March 2024.
[23] "OCRFeeder". GNOME wiki. Retrieved 12 January 2019.
[24] Kahle, Brewster (23 November 2020). "FOSS wins again: Free and Open Source Communities comes through on 19th Century Newspapers (and Books and Periodicals...)". Internet Archive Blogs. blog.archive.org. Retrieved 1 December 2020.

External links

Official website

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.



