Document mosaicing


Document mosaicing is a process that stitches multiple, overlapping snapshot images of a document together to produce one large, high-resolution composite. The document is slid by hand under a stationary, over-the-desk camera until every part of it has passed through the camera's field of view. As the document is slid under the camera, its motion is coarsely tracked by the vision system, and snapshots are taken periodically so that successive snapshots overlap by about 50%. The system then finds the overlapping pairs and stitches them together, pair by pair, until the whole document is assembled into a single image. [1]


Document mosaicing can be divided into four main processes: tracking, feature detection, correspondence establishment and image mosaicing.

Tracking (simple correlation process)

In this process, the motion of the document slid under the camera is coarsely tracked by the system, using a simple correlation process. In the first frame, a small patch is extracted from the centre of the image as a correlation template. In the next frame, the correlation is computed over a search area four times the size of the patch, and the peak of the correlation function indicates the motion of the paper. The template is then resampled from this frame, and tracking continues until the template reaches the edge of the document, at which point another snapshot is taken and the tracking process is repeated until the whole document has been imaged. The snapshots are stored in an ordered list to facilitate pairing the overlapping images in later processes.
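A minimal sketch of one tracking update is given below, assuming greyscale frames held as NumPy arrays. The 64-pixel patch size and the use of scikit-image's match_template for the normalised cross-correlation are illustrative assumptions, not details from the original system; handling of the image borders is ignored for simplicity.

```python
import numpy as np
from skimage.feature import match_template

PATCH = 64  # side length of the correlation template (assumed)

def extract_template(frame):
    """Take a PATCH x PATCH template from the centre of the first frame."""
    cy, cx = frame.shape[0] // 2, frame.shape[1] // 2
    return frame[cy - PATCH // 2:cy + PATCH // 2,
                 cx - PATCH // 2:cx + PATCH // 2]

def track(template, next_frame, last_pos):
    """One tracking step: correlate the template over a window whose area is
    roughly four times the patch area, centred on the last known position."""
    y, x = last_pos                        # template's top-left in last frame
    m = PATCH // 2                         # margin giving a 2*PATCH-wide window
    y0, x0 = max(0, y - m), max(0, x - m)
    window = next_frame[y0:y0 + 2 * PATCH, x0:x0 + 2 * PATCH]
    corr = match_template(window, template)  # normalised cross-correlation
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return (y0 + dy, x0 + dx)              # correlation peak = new position
```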

Feature detection for efficient matching

Registration, the process of finding the transformation that aligns one image with another, is at the heart of mosaicing, and there are two main approaches to it. [2] [3]

In document mosaicing, each image is segmented into a hierarchy of columns, lines and words, so that organised sets of features can be matched across images. Skew angle estimation and column, line and word finding are examples of the feature detection operations involved.

Skew angle estimation

First, the angle that the rows of text make with the image raster lines (the skew angle) is estimated; it is assumed to lie in the range ±20°. A small patch of text is selected at random from the image and rotated through this range until the variance of the pixel intensities summed along the raster lines is maximised. [4] The variance peaks at the true skew angle because the raster-line sums then alternate between dense text lines and nearly blank inter-line gaps.

To ensure that the recovered skew angle is accurate, the system repeats this calculation on many image patches and derives the final estimate as the average of the individual angles, weighted by the variance of the pixel intensities of each patch.
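The search can be sketched as follows; the 0.5° angular step and the use of scipy.ndimage.rotate are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(patch, max_angle=20.0, step=0.5):
    """Rotate the patch through +/-max_angle and return the angle at which
    the variance of the raster-line sums is greatest, plus that variance."""
    best_angle, best_var = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(patch, angle, reshape=False, order=1)
        row_sums = rotated.sum(axis=1)  # intensities summed along raster lines
        if row_sums.var() > best_var:
            best_angle, best_var = angle, row_sums.var()
    return best_angle, best_var

def estimate_page_skew(patches):
    """Combine per-patch estimates, weighting each angle by its variance."""
    results = [estimate_skew(p) for p in patches]
    angles = np.array([a for a, _ in results])
    weights = np.array([v for _, v in results])
    return float((angles * weights).sum() / weights.sum())
```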

Column, line and word finding

In this operation, the de-skewed document is segmented into a hierarchy of columns, lines and words. Sensitivity to illumination and page coloration is removed by applying a Sobel operator to the de-skewed image and thresholding the output, which yields a binarised gradient image. [5]

The operation can be roughly separated into three steps: column segmentation, line segmentation and word segmentation (a code sketch follows the list).

  1. Columns are easily segmented from the binarised gradient image by summing pixels vertically: gaps in the vertical sums mark the column boundaries.
  2. The lines of each column are segmented in the same way, but by summing pixels horizontally within the column.
  3. Finally, individual words are segmented by applying the vertical summing within each segmented line.
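A rough sketch of these three steps using projection profiles is shown below, assuming the binarised gradient image is a 2-D NumPy array of zeros and ones; the zero threshold separating runs is an illustrative choice.

```python
import numpy as np

def runs_above(profile, threshold=0):
    """Return (start, end) pairs of indices where the 1-D profile exceeds the
    threshold; the gaps between runs separate columns, lines or words."""
    active = np.r_[False, profile > threshold, False]
    edges = np.flatnonzero(np.diff(active.astype(int)))
    return list(zip(edges[::2], edges[1::2]))  # end index is exclusive

def segment(binary):
    """Split the page into columns, columns into lines, and lines into words,
    returning word bounding boxes as (row0, row1, col0, col1)."""
    words = []
    for c0, c1 in runs_above(binary.sum(axis=0)):          # 1. columns
        column = binary[:, c0:c1]
        for r0, r1 in runs_above(column.sum(axis=1)):      # 2. lines
            line = column[r0:r1, :]
            for w0, w1 in runs_above(line.sum(axis=0)):    # 3. words
                words.append((r0, r1, c0 + w0, c0 + w1))
    return words
```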

These segmentations are important because the document mosaic is created by matching the lower-right corners of words in overlapping image pairs. Moreover, the segmentation reliably organises each image into a hierarchy of rows and columns.

The segmentation operation involves a considerable amount of summing over the binarised gradient image, which is made cheap by constructing a matrix of partial sums [6] whose elements are given by

S(m, n) = Σ_{i ≤ m} Σ_{j ≤ n} p(i, j),

where p(i, j) is the value of pixel (i, j) in the binarised gradient image, so that S(m, n) holds the sum of all pixels above and to the left of (m, n).

The matrix of partial sums is calculated in one pass through the binarised gradient image, after which the sum over any rectangular region can be read off with four lookups. [6]
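A short sketch of the idea, with NumPy's cumulative sums standing in for the explicit one-pass recurrence:

```python
import numpy as np

def partial_sums(binary):
    """S[m, n] = sum of binary[0..m, 0..n], built in one pass over the image."""
    return binary.cumsum(axis=0).cumsum(axis=1)

def rect_sum(S, r0, r1, c0, c1):
    """Sum of binary[r0..r1, c0..c1] (inclusive) via inclusion-exclusion."""
    total = S[r1, c1]
    if r0 > 0:
        total -= S[r0 - 1, c1]
    if c0 > 0:
        total -= S[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += S[r0 - 1, c0 - 1]
    return total
```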

Establishing correspondences

The two images are now organised in a hierarchy of linked lists: each page links to its columns, each column to its lines, and each line to its words.

At the bottom of the structure, the length of each word is recorded. Establishing correspondences between the two images is then reduced to searching the corresponding structures for groups of words with matching lengths.
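One possible shape for this hierarchy, assuming Python dataclasses; the field names are illustrative rather than taken from the original paper.

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    length: int                 # width of the word in pixels
    corner: tuple               # lower-right corner, used later for matching

@dataclass
class Line:
    words: list = field(default_factory=list)    # list of Word

@dataclass
class Column:
    lines: list = field(default_factory=list)    # list of Line

@dataclass
class Page:
    columns: list = field(default_factory=list)  # list of Column
```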

Seed match finding

Seed match finding is done by comparing each row in image 1 with each row in image 2, word by word. If the lengths (in pixels) of two words, one from each image, and of their immediate neighbours agree within a predefined tolerance (5 pixels, for example), the two words are assumed to match. Two rows are assumed to match if they share three or more word matches. The seed match finding operation terminates when two consecutive pairs of matching rows are found.
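The search might look as follows, with rows represented simply as lists of word lengths in pixels; the handling of words at the ends of a row, where a neighbour is missing, is an assumption.

```python
TOL = 5  # tolerance on word lengths, in pixels

def words_match(row1, i, row2, j):
    """Two words match if they and their immediate neighbours (where present)
    agree in length within TOL pixels."""
    for d in (-1, 0, 1):
        a, b = i + d, j + d
        if 0 <= a < len(row1) and 0 <= b < len(row2):
            if abs(row1[a] - row2[b]) > TOL:
                return False
    return True

def rows_match(row1, row2):
    """Two rows match if they share three or more word matches."""
    hits = sum(words_match(row1, i, row2, j)
               for i in range(len(row1)) for j in range(len(row2)))
    return hits >= 3

def find_seed(rows1, rows2):
    """Return (r1, r2) such that rows r1, r1+1 of image 1 match rows
    r2, r2+1 of image 2, i.e. two consecutive pairs of matching rows."""
    for r1 in range(len(rows1) - 1):
        for r2 in range(len(rows2) - 1):
            if rows_match(rows1[r1], rows2[r2]) and \
               rows_match(rows1[r1 + 1], rows2[r2 + 1]):
                return r1, r2
    return None
```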

Match list building

After a seed match has been found, the next step is to build the match list from which the correspondence points of the two images are generated. This is done by searching for further matching pairs of rows outward from the seed rows.
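Continuing the sketch above, the match list could be grown outward from the seed like this (reusing rows_match from the previous sketch):

```python
def build_match_list(rows1, rows2, seed):
    """Grow the list of matching row pairs upward and downward from the seed."""
    r1, r2 = seed
    matches = [(r1, r2), (r1 + 1, r2 + 1)]
    i, j = r1 - 1, r2 - 1                      # walk upward from the seed
    while i >= 0 and j >= 0 and rows_match(rows1[i], rows2[j]):
        matches.append((i, j))
        i, j = i - 1, j - 1
    i, j = r1 + 2, r2 + 2                      # walk downward past the seed
    while i < len(rows1) and j < len(rows2) and rows_match(rows1[i], rows2[j]):
        matches.append((i, j))
        i, j = i + 1, j + 1
    return sorted(matches)
```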

Image mosaicing

Figure 5: Mosaicing of two document images. Blurring is evident in the affine mosaic (b), but not in the mosaic constructed using a plane-to-plane projectivity (a). Close-ups of typical seams of (a) and (b) are shown in (c) and (d) respectively.

Given the list of corresponding points of the two images, the next process is to find the transformation that aligns the overlapping portions of the images. Assuming a pinhole camera model, the transformation between pixels (u, v) of image 1 and pixels (u′, v′) of image 2 is described by a plane-to-plane projectivity [7]:

u′ = (p1·u + p2·v + p3) / (p7·u + p8·v + 1),
v′ = (p4·u + p5·v + p6) / (p7·u + p8·v + 1).   (Eq. 1)

The eight parameters of the projectivity can be found from four pairs of matching points. The RANSAC regression technique [8] is used to reject outlying matches and to estimate the projectivity from the remaining good matches.

The projectivity is fine-tuned using correlation at the corners of the overlapping portion to obtain four correspondences to sub-pixel accuracy. Image 1 is then transformed into image 2's coordinate system using Eq. 1. A typical result of the process is shown in Figure 5.
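A compact sketch of this estimation and warping step, using OpenCV's RANSAC-based homography fitting as a modern stand-in for the paper's own implementation; the 5-pixel reprojection threshold is an arbitrary choice.

```python
import numpy as np
import cv2

def register_pair(img1, img2, pts1, pts2):
    """pts1, pts2: N x 2 arrays (N >= 4) of corresponding points, e.g. the
    lower-right corners of matched words. Returns the projectivity and
    image 1 warped into image 2's coordinate frame."""
    H, inliers = cv2.findHomography(pts1.astype(np.float32),
                                    pts2.astype(np.float32),
                                    cv2.RANSAC, 5.0)  # rejects outlying matches
    h, w = img2.shape[:2]
    warped = cv2.warpPerspective(img1, H, (w, h))     # apply Eq. 1
    return H, warped
```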

Composing many images

Finally, the whole-page composition is built up by mapping all the images into the coordinate system of an "anchor" image, normally the one nearest the page centre. The transformations to the anchor frame are calculated by concatenating the pair-wise transformations found earlier. The raw document mosaic is shown in Figure 6.
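In homogeneous coordinates the concatenation is just matrix multiplication; a minimal sketch, assuming each pair-wise 3×3 homography is stored in a dictionary keyed by the image pair it relates:

```python
import numpy as np

def to_anchor(pairwise, path):
    """pairwise: dict mapping (i, j) to the 3x3 homography taking image i
    into image j's frame. path: image indices from some image to the anchor.
    Returns the homography taking the first image in the path to the anchor."""
    H = np.eye(3)
    for i, j in zip(path, path[1:]):
        H = pairwise[(i, j)] @ H   # left-multiply each successive step
    return H
```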

However, images that are not consecutive in the ordered list may also overlap. This problem can be solved by building hierarchical sub-mosaics. As shown in Figure 7, image 1 and image 2 are registered, as are image 3 and image 4, creating two sub-mosaics; the two sub-mosaics are then stitched together in a further mosaicing process.

Application areas

Document mosaicing can be applied in various areas, such as interacting with paper on the DigitalDesk [9] and building video mosaics for virtual environments. [10]



References

  1. Zappalá, Anthony; Gee, Andrew; Taylor, Michael (1999). "Document mosaicing". Image and Vision Computing. 17 (8): 589–595. doi:10.1016/S0262-8856(98)00178-4.
  2. Mann, S.; Picard, R. W. (1995). "Video orbits of the projective group: A new perspective on image mosaicing". Technical Report (Perceptual Computing Section), MIT Media Laboratory (338). CiteSeerX 10.1.1.56.6000.
  3. Brown, L. G. (1992). "A survey of image registration techniques". ACM Computing Surveys. 24 (4): 325–376. CiteSeerX 10.1.1.35.2732. doi:10.1145/146370.146374. S2CID 14576088.
  4. Bloomberg, Dan S.; Kopec, Gary E.; Dasari, Lakshmi (1995). "Measuring document image skew and orientation" (PDF). In Vincent, Luc M.; Baird, Henry S. (eds.). Document Recognition II. Proceedings of the SPIE. Vol. 2422. pp. 302–315. Bibcode:1995SPIE.2422..302B. doi:10.1117/12.205832. S2CID 5106427.
  5. Taylor, M. J.; Zappala, A.; Newman, W. M.; Dance, C. R. (1999). "Documents through cameras". Image and Vision Computing. 17 (11): 831–844. doi:10.1016/S0262-8856(98)00155-3.
  6. Preparata, F. P.; Shamos, M. I. (1985). Computational Geometry: An Introduction. Monographs in Computer Science. Springer-Verlag. ISBN 9780387961316.
  7. Mundy, J. L.; Zisserman, A. (1992). "Appendix: Projective geometry for machine vision". Geometric Invariance in Computer Vision. Cambridge, MA: MIT Press. CiteSeerX 10.1.1.17.1329. ISBN 9780262132855.
  8. Fischler, Martin A.; Bolles, Robert C. (1981). "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography" (PDF). Communications of the ACM. 24 (6): 381–395. doi:10.1145/358669.358692. S2CID 972888.
  9. Wellner, P. (1993). "Interacting with paper on the DigitalDesk". Communications of the ACM. 36 (7): 87–97. CiteSeerX 10.1.1.53.7526. doi:10.1145/159544.159630. S2CID 207174911.
  10. Szeliski, R. (1996). "Video mosaics for virtual environments". IEEE Computer Graphics and Applications. 16 (2): 22–30. doi:10.1109/38.486677.
