Page Analysis and Ground Truth Elements

Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents.[1] Like ALTO, another XML format, it describes the organisation and structure of a page and its contents.

PAGE XML can be used to describe:[citation needed]

The format is developed by the Pattern Recognition & Image Analysis Lab (PRIMA) at the University of Salford, Greater Manchester.[citation needed]

It was designed to be used alongside automatic segmentation and transcription techniques, namely optical character recognition (OCR) and handwritten text recognition (HTR). PAGE aims to support each step in the processing chain for document image analysis, from image enhancement through layout analysis to OCR.[citation needed]

The PAGE XML schema is used as an export and import format by automatic transcription software such as eScriptorium[2] and Transkribus.[3] It is also an export format of Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts,[4] and of the OCR software Tesseract.[5]
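
As an illustration of how software might consume such an export, the following Python sketch reads the text lines and their outlines from a PAGE XML document. The element and attribute names (PcGts, Page, TextRegion, TextLine, Coords with a points attribute, TextEquiv and Unicode) and the namespace are assumptions based on the 2019-07-15 version of the schema and are not stated in this article. The sample document, its file name and the helper names parse_points and read_lines are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, assuming the 2019-07-15 PAGE schema namespace.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,200 900,200 900,300 100,300"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn an 'x1,y1 x2,y2 ...' string into a list of (x, y) integer pairs."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text):
    """Return (line id, polygon, text) for every TextLine in the document."""
    root = ET.fromstring(xml_text)
    # Take the namespace from the root tag (assumed to be namespaced),
    # so that files written against other schema versions are also accepted.
    ns = {"pc": root.tag[1:root.tag.index("}")]}
    lines = []
    for line in root.iterfind(".//pc:TextLine", ns):
        coords = line.find("pc:Coords", ns)
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
        polygon = parse_points(coords.get("points")) if coords is not None else []
        text = (unicode_el.text or "") if unicode_el is not None else ""
        lines.append((line.get("id"), polygon, text))
    return lines


if __name__ == "__main__":
    for line_id, polygon, text in read_lines(SAMPLE):
        print(line_id, len(polygon), text)
```

Reading the namespace from the root element rather than hard-coding it is a design choice of this sketch, since files produced by different tools may be written against different versions of the schema.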

References

  1. "PAGE-XML". GitHub. July 12, 2022.
  2. "eScripta – Digital Tools and Techniques for the Study of Ancient Writing".
  3. "How To Export Documents from Transkribus". READ-COOP.
  4. Kiessling, Benjamin (April 5, 2022). "The Kraken OCR system". GitHub.
  5. "Tesseract Open Source OCR Engine". GitHub. Retrieved 2025-07-07.