hOCR

hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML. [1]

Software

The following OCR software can output the recognition result as an hOCR file:

Example

The following example is an extract of an hOCR file:

...
<p class="ocr_par" lang="deu" title="bbox930">
  <span class="ocr_line" title="bbox 348 797 1482 838; baseline -0.009 -6">
    <span class="ocrx_word" title="bbox 348 805 402 832; x_wconf 93">Die</span>
    <span class="ocrx_word" title="bbox 421 804 697 832; x_wconf 90">Darlehenssumme</span>
    <span class="ocrx_word" title="bbox 717 803 755 831; x_wconf 96">ist</span>
    <span class="ocrx_word" title="bbox 773 803 802 831; x_wconf 96">in</span>
    <span class="ocrx_word" title="bbox 821 803 917 830; x_wconf 96">ihrem</span>
    <span class="ocrx_word" title="bbox 935 799 1180 838; x_wconf 95">ursprünglichen</span>
    <span class="ocrx_word" title="bbox 1199 797 1343 832; x_wconf 95">Umfange</span>
    <span class="ocrx_word" title="bbox 1362 805 1399 823; x_wconf 95">zu</span>
    <span class="ocrx_word" title="bbox 1417 x_wconf 96">ver-</span>
  </span>
...

The recognized text is stored in normal text nodes of the HTML file. The division into separate lines and words is given by the surrounding span tags. The usual HTML elements are also used, for example the p tag for a paragraph. Additional information is given in attributes, such as the element class (ocr_par, ocr_line, ocrx_word), the language (lang), and the properties listed in the title attribute (bbox, baseline, x_wconf).

bbox

General

The bbox property consists of the keyword bbox followed by four integer values.

Example

bbox 0 0 100 200

The bbox - short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and the lower-right corner (x1, y1).

The values are given with reference to the top-left corner of the document image and are measured in pixels.

The order of the values is x0 y0 x1 y1 = "left top right bottom".
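As a minimal sketch of how such properties might be read, the following Python splits a title string on semicolons and converts the bbox values to integers. The function names parse_title and bbox_of are hypothetical and are not part of any hOCR library; the input string is taken from the extract above.

```python
def parse_title(title):
    """Split an hOCR title string into {property name: [value strings]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def bbox_of(title):
    """Return (x0, y0, x1, y1) as integers, or None if bbox is absent or incomplete."""
    values = parse_title(title).get("bbox", [])
    if len(values) != 4:
        return None
    try:
        x0, y0, x1, y1 = (int(v) for v in values)
    except ValueError:
        return None
    return x0, y0, x1, y1


# Title of the first word ("Die") in the extract above.
x0, y0, x1, y1 = bbox_of("bbox 348 805 402 832; x_wconf 93")
width, height = x1 - x0, y1 - y0  # 54 by 27 pixels
```

Returning None for an incomplete bbox is a design choice of this sketch; it lets truncated values such as those in the extract be skipped rather than misread.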

Usage

Use the x_bboxes property for character bounding boxes.

Do not use bbox unless the bounding box of the layout component is, in fact, rectangular. Some non-rectangular layout components may still have rectangular bounding boxes if the non-rectangularity is caused by floating elements around which text flows.

<span class="ocr_line" id="line_1" title="bbox 10 20 160 30"></span>

The bounding box of this line, shown in blue, is spanned by the upper-left corner (10, 20) and the lower-right corner (160, 30). All coordinates are measured with reference to the top-left corner of the document image, whose border is drawn in black. [3]
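The span structure and bbox values can be read with Python's standard html.parser module. The following is only a sketch: the class name WordExtractor is hypothetical, and it assumes that ocrx_word spans contain plain text with no nested spans, as in the extract shown earlier.

```python
from html.parser import HTMLParser


class WordExtractor(HTMLParser):
    """Collect (text, bbox) for every span whose class includes ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._title = None  # title of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._title = attrs.get("title") or ""
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            bbox = None
            for chunk in self._title.split(";"):
                parts = chunk.split()
                # Accept only a complete bbox: keyword plus four values.
                if parts[:1] == ["bbox"] and len(parts) == 5:
                    bbox = tuple(int(p) for p in parts[1:])
            self.words.append(("".join(self._text), bbox))
            self._title = None


sample = (
    '<span class="ocr_line" title="bbox 348 797 1482 838; baseline -0.009 -6">'
    '<span class="ocrx_word" title="bbox 348 805 402 832; x_wconf 93">Die</span> '
    '<span class="ocrx_word" title="bbox 421 804 697 832; x_wconf 90">Darlehenssumme</span>'
    "</span>"
)

parser = WordExtractor()
parser.feed(sample)
parser.close()
for text, bbox in parser.words:
    print(text, bbox)
```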

Searchable PDF files

The hOCR format is most commonly used to make searchable PDF files, or as extracted metadata accompanying a PDF file. A searchable PDF file can be created from a scanned document image and the .hocr file for that image, using the following open source tools.

hocr-tools

Source: [4]

hocr-tools is an open source library written in Python. Its scripts include a command-line utility called hocr-pdf that converts standard hOCR files to a searchable PDF file. For hOCR files in RTL or non-Latin scripts such as Arabic, the version from the GitHub repository is needed at the moment.

hocr-pdf

The hocr-pdf utility has the following basic syntax.

hocr-pdf --savefile final.pdf folder_images_and_hocr

The folder_images_and_hocr directory must contain each page image as a .jpg file together with its .hocr file; the two files share the same base name and differ only in their file extension.
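Before running hocr-pdf, it can help to confirm that every image has a matching hOCR file. The sketch below assumes the pairing rule of a shared base name with .jpg and .hocr extensions; the function name find_page_pairs is hypothetical, and the demonstration uses a temporary directory with empty placeholder files rather than real scans.

```python
import tempfile
from pathlib import Path


def find_page_pairs(folder):
    """Return (pairs, missing): JPEG/hOCR pairs sharing a base name, and JPEGs without an hOCR file."""
    pairs, missing = [], []
    for image in sorted(Path(folder).glob("*.jpg")):
        hocr = image.with_suffix(".hocr")
        if hocr.exists():
            pairs.append((image, hocr))
        else:
            missing.append(image)
    return pairs, missing


# Demonstration with placeholder files: page2.jpg has no hOCR partner.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page1.jpg", "page1.hocr", "page2.jpg"):
        (Path(tmp) / name).touch()
    pairs, missing = find_page_pairs(tmp)
    print("paired:", [image.name for image, _ in pairs])
    print("missing hOCR:", [image.name for image in missing])
```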

Known issues

Known issues of the hocr-pdf script in the PyPI installation include the following.

  • It is not up to date with the GitHub repository.
  • hocr-pdf breaks at line 134 because it calls decodestring(), which has been a deprecated alias since Python 3.1; decodebytes() should be used instead. [5]

Known fixes

Install hocr-tools from the latest version of the GitHub repository.
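The warning quoted in reference [5] points at a call of the form zlib.decompress(base64.decodestring(font)). As an illustration only, the same operation written with base64.decodebytes() looks as follows; the font value here is a stand-in created for the example, not the data embedded in the real script.

```python
import base64
import zlib

# Stand-in for the embedded, compressed and base64-encoded font data.
font = base64.encodebytes(zlib.compress(b"example font bytes"))

# decodebytes() replaces the deprecated decodestring() alias.
uncompressed = bytearray(zlib.decompress(base64.decodebytes(font)))
```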

hocr2pdf

hocr2pdf [6] is another tool that supports the conversion of hOCR files. It is written in C++ and can be used alongside other libraries. It also supports UTF-8 text, although getting this to work may require some additional debugging and searching through online discussion archives.

According to Ubuntu Manpages,

ExactImage is a fast C++ image processing library. Unlike many other library frameworks it allows operation in several color spaces and bit depths natively, resulting in low memory and computational requirements. hocr2pdf creates well layouted, searchable PDF files from hOCR (annotated HTML) input obtained from an OCR system.

hOCR to PDF attempts

In addition to the stable libraries discussed above, the hOCR format has received many contributions over the years from its early adopters. One of them is a Python 2 script, about 12 years old as of 2021, that overlays the hOCR text on an image and converts the result to a PDF file. The script could be made functional again by porting its Python 2 source code to Python 3.

- HOCRConverter by jbrinley (Documentation [7])

HOCRConverter

The HOCRConverter is a script written in Python 2.x that can be used to convert an hOCR file, together with a specified image file, to a searchable PDF file. See the documentation linked above.

from HocrConverter import HocrConverter

hocr = HocrConverter("myHocrFile.html")  # this can be done by changing .hocr to .html and vice versa
hocr.to_text("output.txt")
hocr.to_pdf("myImageFile.png", "output.pdf")

Known issues

  • Has not been tested.
  • Does not natively support Python 3.x

References

  1. Breuel, T. (2007-09-01). "The hOCR Microformat for OCR Workflow and Results" (PDF). Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). Vol. 2. pp. 1063–1067. doi:10.1109/ICDAR.2007.4377078. ISBN 978-0-7695-2822-9. S2CID 7565957.
  2. "Ghostscript documentation". ghostscript.com. Retrieved March 1, 2024.
  3. "hOCR - OCR Workflow and Output embedded in HTML". kba.cloud. Retrieved 18 December 2021. This article incorporates text from this source, which is in the public domain.
  4. ocropus (2021-12-12). "hocr-tools". GitHub.
  5. Ahmad, Muneeb (2021-12-12). "decodebytes() Depreciated in hocr-pdf use decodestring()". GitHub. Retrieved 2021-12-12. /home/muneeb/.local/bin/hocr-pdf:134: DeprecationWarning: decodestring() is a deprecated alias since Python 3.1, use decodebytes() uncompressed = bytearray(zlib.decompress(base64.decodestring(font)))
  6. http://manpages.ubuntu.com/manpages/trusty/man1/hocr2pdf.1.html
  7. Brinley, Jonathan (2009-04-02). "Convert hOCR to PDF". x+3. Archived from the original on 2021-02-06. Retrieved 2021-12-12.