Document layout analysis

In computer vision and natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires segmenting text zones from non-textual ones and arranging them in their correct reading order. [1] Detecting and labeling the different zones (or blocks) embedded in a document as text body, illustrations, math symbols, and tables is called geometric layout analysis. [2] Text zones, however, also play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis.

Document layout analysis is the union of geometric and logical labeling. It is typically performed before a document image is sent to an OCR engine, but it can also be used to detect duplicate copies of the same document in large archives, or to index documents by their structure or pictorial content.

Document layout is formally defined in the international standard ISO 8613-1:1989.

Overview of methods

There are two main approaches to document layout analysis. The first is the class of bottom-up approaches, which iteratively parse a document based on the raw pixel data: the document is typically first parsed into connected regions of black and white pixels, and these regions are then grouped into words, text lines, and finally text blocks. [3] [4] The second is the class of top-down approaches, which attempt to iteratively cut up a document into columns and blocks based on white space and geometric information. [4]

The bottom-up approaches are the traditional ones, and they have the advantage that they require no assumptions about the overall structure of the document. On the other hand, bottom-up approaches require iterative segmentation and clustering, which can be time-consuming. [4] Top-down approaches are newer, and have the advantage that they parse the global structure of a document directly, thus eliminating the need to iteratively cluster together the possibly hundreds or even thousands of characters/symbols which appear on a document. They tend to be faster, but in order to operate robustly they typically require a number of assumptions to be made about the layout of the document. [4]

Two issues are common to any approach to document layout analysis: noise and skew. Noise refers to image noise, such as salt-and-pepper noise or Gaussian noise. Skew refers to the fact that a document image may be rotated so that the text lines are not perfectly horizontal. It is a common assumption in both document layout analysis algorithms and optical character recognition algorithms that the characters in the document image are oriented so that text lines are horizontal. Therefore, if skew is present, it is important to rotate the document image so as to remove it.

It follows that the first steps in any document layout analysis pipeline are to remove image noise and to estimate the skew angle of the document.
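The sketch below illustrates these two preliminary steps under simple assumptions: the page is a grayscale NumPy array with dark ink on a light background, noise is reduced with a median filter, and skew is estimated by a projection-profile search over candidate angles. The library choices and the profile-based skew estimator are generic illustrative conveniences, not the specific methods of the cited papers.

```python
# Minimal sketch of the two preliminary steps: noise removal and skew
# estimation via a projection-profile search. Assumes a grayscale page
# image as a 2-D NumPy array (0 = black ink, 255 = white paper).
import numpy as np
from scipy import ndimage

def denoise_and_binarize(gray):
    """Median-filter salt-and-pepper noise, then binarize (ink = 1)."""
    smoothed = ndimage.median_filter(gray, size=3)
    threshold = smoothed.mean()          # Otsu's method would be a more robust choice
    return (smoothed < threshold).astype(np.uint8)

def estimate_skew(binary, limit=5.0, step=0.25):
    """Return the rotation angle (degrees) whose horizontal projection
    profile has maximum variance; sharp row peaks mean horizontal text."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-limit, limit + step, step):
        rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)    # ink pixels per row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(gray):
    binary = denoise_and_binarize(gray)
    angle = estimate_skew(binary)
    # Rotate the original image so text lines become horizontal.
    return ndimage.rotate(gray, angle, reshape=False, order=1, cval=255)
```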

Example of a bottom-up approach

In this section we will walk through the steps of a bottom-up document layout analysis algorithm developed in 1993 by O'Gorman. [3] The steps in this approach are as follows (illustrative code sketches of several of these steps follow the list):

  1. Preprocess the image to remove Gaussian and salt-and-pepper noise. Note that some noise removal filters may consider commas and periods as noise, so some care must be taken.
  2. Convert the image into a binary image, i.e. convert each pixel value to completely white or completely black.
  3. Segment the image into connected components of black pixels. These are the symbols of the image. For each symbol, compute a bounding box and centroid.
  4. For each symbol, determine its k nearest neighbors, where k is an integer greater than or equal to four. O'Gorman suggests k = 5 in his paper as a good compromise between robustness and speed. The reason to use at least k = 4 is that, for a symbol in a document, the two or three nearest symbols are the ones right next to it on the same text line; the fourth-nearest symbol is typically on the line right above or below, and it is important to include these symbols in the nearest-neighbor calculation for the line- and block-grouping steps that follow.
  5. Each nearest-neighbor pair of symbols is related by a vector pointing from one symbol's centroid to the other symbol's centroid. If these vectors are plotted for every pair of nearest-neighbor symbols, one gets what is called the docstrum for the document. One can also use the angle Θ from the horizontal and the distance D between two nearest-neighbor symbols to create a nearest-neighbor angle histogram and a nearest-neighbor distance histogram.
  6. Using the nearest-neighbor angle histogram, the skew of the document can be calculated. If the skew is acceptably low, continue to the next step. If it is not, rotate the image so as to remove the skew and return to step 3.
  7. The nearest-neighbor distance histogram has several peaks, and these peaks typically represent between-character spacing, between-word spacing, and between-line spacing. Calculate these values from the histogram and set them aside.
  8. For each symbol, look at its nearest neighbors and flag any of them that are a distance away which is within some tolerance of the between-character spacing distance or between-word spacing distance. For each nearest neighbor symbol which is flagged, draw a line segment connecting their centroids.
  9. Symbols connected to their neighbors by line segments form text lines. Using all the centroids in a text line, one can compute an actual line segment representing the text line with linear regression. This is important because it is unlikely that all the centroids of symbols in a text line are actually collinear.
  10. For each pair of text lines, one can compute a minimum distance between their corresponding line segments. If this distance is within some tolerance of the between-line spacing calculated in step 7, then the two text lines are grouped into the same text block.
  11. Finally, one can calculate a bounding box for each text block, and the document layout analysis is complete.
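As an illustration of steps 3 through 7, the following sketch computes connected components and their centroids, finds each symbol's k nearest neighbors, and derives a skew estimate and a rough character-spacing estimate from the docstrum angle and distance histograms. It assumes a binary image (ink = 1) such as the one produced by the preprocessing sketch above; the crude single-peak picking is an assumption made for brevity, not O'Gorman's exact procedure.

```python
# Illustrative sketch of steps 3-7: connected components, k nearest
# neighbors, and the docstrum angle/distance statistics. k = 5 follows
# O'Gorman's suggestion.
import numpy as np
from scipy import ndimage
from scipy.spatial import cKDTree

def docstrum_statistics(binary, k=5):
    # Step 3: connected components of black pixels and their centroids.
    labels, n = ndimage.label(binary)
    centroids = np.array(ndimage.center_of_mass(binary, labels, range(1, n + 1)))

    # Step 4: k nearest neighbors of every symbol (the query returns the
    # symbol itself first, so ask for k + 1 and drop column 0).
    tree = cKDTree(centroids)
    dists, idx = tree.query(centroids, k=k + 1)
    dists, idx = dists[:, 1:], idx[:, 1:]

    # Step 5: vectors between neighbor centroids -> angles and distances.
    vecs = centroids[idx] - centroids[:, None, :]
    angles = np.degrees(np.arctan2(vecs[..., 0], vecs[..., 1]))  # rows are y
    angles = (angles + 90.0) % 180.0 - 90.0                      # fold into [-90, 90)

    # Step 6: skew estimate = dominant peak of the nearest-neighbor
    # angle histogram (close to 0 degrees for horizontal text).
    angle_hist, angle_edges = np.histogram(angles, bins=180, range=(-90, 90))
    skew = angle_edges[np.argmax(angle_hist)]

    # Step 7: distance histogram; its strongest peak approximates the
    # between-character spacing (a crude stand-in for full peak analysis).
    dist_hist, dist_edges = np.histogram(dists, bins=50)
    char_spacing = dist_edges[np.argmax(dist_hist)]

    return skew, char_spacing, centroids, idx, dists
```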
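Steps 8 and 9 can then be sketched as follows: neighbor pairs whose distance is close to the estimated character or word spacing are linked, connected groups of symbols are taken as candidate text lines, and each line is fitted by linear regression. The union-find grouping and the 20% tolerance are illustrative choices rather than values from the paper, and word_spacing is assumed to be read off the distance histogram in the same way as the character spacing.

```python
# Sketch of steps 8-9: link symbols whose neighbor distance matches the
# estimated spacings, take connected groups as text lines, and fit each
# line by least squares.
import numpy as np

def group_text_lines(centroids, idx, dists, char_spacing, word_spacing, tol=0.2):
    n = len(centroids)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    # Step 8: flag neighbor pairs within tolerance of either spacing.
    for i in range(n):
        for j, d in zip(idx[i], dists[i]):
            if (abs(d - char_spacing) <= tol * char_spacing or
                    abs(d - word_spacing) <= tol * word_spacing):
                union(i, int(j))

    # Step 9: each connected group of symbols is a candidate text line;
    # fit y = a*x + b through its centroids by linear regression.
    lines = {}
    for i in range(n):
        lines.setdefault(find(i), []).append(i)
    fits = []
    for members in lines.values():
        if len(members) < 2:
            continue
        ys, xs = centroids[members, 0], centroids[members, 1]
        slope, intercept = np.polyfit(xs, ys, 1)
        fits.append((members, slope, intercept))
    return fits
```

Steps 10 and 11 (merging lines into blocks by comparing the fitted segments against the between-line spacing, and computing bounding boxes) follow the same pattern and are omitted from the sketch.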

References

  1. Baird, H. S. (July 1992). "Anatomy of a versatile page reader". Proceedings of the IEEE. 80 (7): 1059–1065. CiteSeerX 10.1.1.40.8060. doi:10.1109/5.156469.
  2. Cattoni, R.; Coianiz, T.; Messelodi, S.; Modena, C. M. "Geometric Layout Analysis Techniques for Document Image Understanding: a Review". ITC-irst Technical Report TR#9703-09.
  3. O'Gorman, L. (1993). "The document spectrum for page layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 15 (11): 1162–1173. doi:10.1109/34.244677.
  4. Seong-Whan Lee; Dae-Seok Ryu (2001). "Parameter-free geometric document layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 23 (11): 1240–1256. CiteSeerX 10.1.1.574.7875. doi:10.1109/34.969115.