Page Analysis and Ground Truth Elements

Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents. [1] Comparable to ALTO, it describes the layout and structure of a page and its contents.

PAGE XML can be used to describe:[ citation needed ]

The format is developed by the Pattern Recognition & Image Analysis Research Lab (PRImA) at the University of Salford, Manchester.[ citation needed ]

It was designed for use with automatic segmentation and transcription techniques (OCR and HTR): PAGE aims to support each step in the processing chain for document image analysis, from image enhancement through layout analysis to OCR.[ citation needed ]

The PAGE XML schema is notably used as an export and import format by automatic transcription software such as eScriptorium [2] and Transkribus. [3] It is also an export format used by Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts. [4]
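
As a rough illustration of how a tool might read a PAGE XML export, the following Python sketch extracts the text lines and their outline polygons from a small, made-up document. The sample file, the function names `parse_points` and `read_lines`, and the choice of the 2019-07-15 schema namespace are assumptions made for this example; real exports may use a different schema version and contain many more element types.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; exports from other tools may declare a different one.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample document, written by hand for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,200 900,200 900,300 100,300"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text):
    """Collect region id, line id, polygon and transcription for each text line."""
    root = ET.fromstring(xml_text)
    results = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            coords = line.find("pc:Coords", NS)
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
            results.append({
                "region": region.get("id"),
                "line": line.get("id"),
                "polygon": parse_points(coords.get("points")) if coords is not None else [],
                "text": (unicode_el.text or "") if unicode_el is not None else "",
            })
    return results


if __name__ == "__main__":
    for entry in read_lines(SAMPLE):
        print(entry["region"], entry["line"], entry["text"])
```

The sketch only follows `TextRegion`, `TextLine`, `Coords` and `TextEquiv`/`Unicode`; a fuller reader would also check the namespace declared in the file before choosing which schema version to expect.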

References

  1. "PAGE-XML". July 12, 2022 – via GitHub.
  2. "eScripta – Digital Tools and Techniques for the Study of Ancient Writing".
  3. "How To Export Documents from Transkribus". READ-COOP.
  4. Kiessling, Benjamin (April 5, 2022). "The Kraken OCR system" – via GitHub.