eScriptorium

eScriptorium
Initial release: 2018
Stable release: v0.14.0 [1] / 24 October 2023
Repository
Operating system: platform independent

eScriptorium is a platform for manual or automated segmentation and text recognition of historical manuscripts and prints.

Details

Screenshot with eScriptorium transcription of Johann Reinhold Forster's diary Journal of a Voyage on Board the Resolution 1772-1774 Vol. 1

The software is open source and can therefore be installed freely on users' own computers. It is developed at Paris Sciences et Lettres University as part of the projects Scripta [2] and RESILIENCE [3], with contributions from other institutions, and is partly funded by the EU's Horizon 2020 funding program and a grant from the Andrew W. Mellon Foundation.

Scanned pages from manuscripts and prints can be imported into eScriptorium, and the resulting transcriptions can be exported in various formats (plain text, ALTO or PAGE XML, TEI). First, the text regions and the text lines within them are identified in the images, either manually or automatically (segmentation). The text lines are then transcribed, again either manually or automatically. [4]
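As an illustration of what can be done with an exported transcription, the following minimal Python sketch reads an ALTO XML document and collects the text of each line. It assumes ALTO version 4 (other versions use a different namespace URI) and works on a small hand-written sample rather than a real eScriptorium export; the sample content and the function name `extract_lines` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of ALTO version 4 (assumption: other ALTO versions use another URI).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, hand-written sample; a real export carries more metadata.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="page_1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="40">
            <String CONTENT="Journal of a Voyage"/>
          </TextLine>
          <TextLine ID="line_2" HPOS="100" VPOS="260" WIDTH="800" HEIGHT="40">
            <String CONTENT="on Board the Resolution"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_lines(alto_xml: str) -> list[str]:
    """Return the transcribed text of every TextLine, in document order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # A line may hold one String per word or a single String for the
        # whole line; joining with spaces covers both cases.
        parts = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE_ALTO):
        print(line)
```

The line coordinates (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) are ignored here; a script that needs the layout produced by segmentation could read them from the same `TextLine` elements.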

Both automatic segmentation and text recognition can be trained using manually created or corrected examples (ground truth). Models trained in this way can be shared with others and thus easily reused. [5]

At the heart of eScriptorium is the free OCR software Kraken by Benjamin Kiessling, a derivative of the OCR software OCRopus. Kraken is suitable for handwritten and printed texts and also supports right-to-left scripts such as Hebrew and Arabic. [6]

Comparable programs that offer similar functions to eScriptorium are OCR4All [7] and Transkribus.

References

  1. "v0.14.0". Retrieved 2024-01-21.
  2. "Scripta-PSL. History and practices of writing". Retrieved 2022-03-13.
  3. "RESILIENCE - The Religious Studies Research Infrastructure". Retrieved 2022-03-13.
  4. "eScriptorium Documentation". Retrieved 2024-01-21.
  5. "Export data - eScriptorium Documentation". Retrieved 2024-01-21.
  6. "mittagessen/kraken: OCR engine for all the languages". Retrieved 2022-03-13.
  7. "OCR4all | forTEXT". Retrieved 2023-06-20.
