Template matching


Template matching [1] is a technique in digital image processing for finding small parts of an image which match a template image. It can be used for quality control in manufacturing, [2] navigation of mobile robots, [3] or edge detection in images. [4]


The main challenges in a template matching task are detection of occlusion, when a sought-after object is partly hidden in an image; detection of non-rigid transformations, when an object is distorted or imaged from different angles; sensitivity to illumination and background changes; background clutter; and scale changes. [5]

Feature-based approach

The hidden layer outputs a vector that holds classification information about the image and is used in the template matching algorithm as the features of the image.

The feature-based approach to template matching relies on the extraction of image features, such as shapes, textures, and colors, that match the target image or frame. This approach is usually achieved using neural networks and deep-learning classifiers such as VGG, AlexNet, and ResNet. [citation needed] Convolutional neural networks (CNNs), which many modern classifiers are based on, process an image by passing it through different hidden layers, producing a vector at each layer with classification information about the image. These vectors are extracted from the network and used as the features of the image. Feature extraction using deep neural networks, like CNNs, has proven extremely effective and has become the standard in state-of-the-art template matching algorithms. [6]
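
As an illustrative sketch only (not from the cited sources), the comparison step might look like the following C++ fragment, where the hypothetical extractFeatures() stands in for a CNN forward pass returning a hidden-layer activation vector; a candidate window of the search image is then scored against the template by the cosine similarity of their feature vectors:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in for a CNN forward pass: a real implementation
    // would run the image through a network (e.g., VGG) and return a
    // hidden-layer activation vector describing the image content.
    std::vector<double> extractFeatures(const std::vector<double>& image) {
        return image; // placeholder only
    }

    // Cosine similarity between two feature vectors: close to 1 when the
    // vectors point in nearly the same direction.
    double cosineSimilarity(const std::vector<double>& a,
                            const std::vector<double>& b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (std::size_t i = 0; i < a.size(); i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb));
    }

    // A window of the search image is declared a match when its features
    // are sufficiently similar to the template's features.
    bool isMatch(const std::vector<double>& templateImage,
                 const std::vector<double>& searchWindow,
                 double threshold) {
        return cosineSimilarity(extractFeatures(templateImage),
                                extractFeatures(searchWindow)) > threshold;
    }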

This feature-based approach is often more robust than the template-based approach described below. As such, it has become the state-of-the-art method for template matching, as it can match templates with non-rigid and out-of-plane transformations, as well as high background clutter and illumination changes. [7] [8] [9]

Template-based approach

For templates without strong features, or when the bulk of the template image constitutes the matching image as a whole, a template-based approach may be effective. Since template-based matching may require sampling of a large number of data points, it is often desirable to reduce the number of sampling points by reducing the resolution of the search and template images by the same factor before performing the operation on the resultant downsized images. This pre-processing method creates a multi-scale, or pyramid, representation of images, providing a reduced search window of data points within a search image so that the template does not have to be compared with every viable data point. Pyramid representations are a method of dimensionality reduction, a common aim of machine learning on data sets that suffer the curse of dimensionality.

Common challenges

In instances where the template may not provide a direct match, it may be useful to implement eigenspaces to create templates that detail the matching object under a number of different conditions, such as varying perspectives, illuminations, color contrasts, or object poses. [10] For example, if an algorithm is looking for a face, its template eigenspaces may consist of images (i.e., templates) of faces in different positions to the camera, in different lighting conditions, or with different expressions (i.e., poses).
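
A minimal sketch of this idea, assuming a hypothetical matchScore() helper that returns a dissimilarity score (lower is better) for one template against the search image; matching against a set of condition-specific templates then reduces to keeping the best score:

    #include <cstddef>
    #include <limits>
    #include <vector>

    using Image = std::vector<std::vector<double>>;

    // Hypothetical helper (assumed, not from the cited sources): best
    // dissimilarity score of a single template anywhere in the search image.
    double matchScore(const Image& search, const Image& templ);

    // Try every condition-specific template (pose, lighting, expression, ...)
    // and report which one matches the search image best.
    std::size_t bestTemplateIndex(const Image& search,
                                  const std::vector<Image>& templates) {
        double best = std::numeric_limits<double>::max();
        std::size_t bestIndex = 0;
        for (std::size_t k = 0; k < templates.size(); k++) {
            double score = matchScore(search, templates[k]);
            if (score < best) {
                best = score;
                bestIndex = k;
            }
        }
        return bestIndex;
    }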

It is also possible for a matching image to be obscured or occluded by an object. In these cases, it is unreasonable to provide a multitude of templates to cover each possible occlusion. For example, the search object may be a playing card, and in some of the search images, the card is obscured by the fingers of someone holding the card, or by another card on top of it, or by some other object in front of the camera. In cases where the object is malleable or poseable, motion becomes an additional problem, and problems involving both motion and occlusion become ambiguous. [11] In these cases, one possible solution is to divide the template image into multiple sub-images and perform matching on each subdivision.
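
A sketch of the subdivision strategy, under the same assumed matchScore() helper as above: the template is cut into a grid of sub-images, each matched independently, so a partly occluded object can still be located when most of its visible sub-templates match well:

    #include <cstddef>
    #include <vector>

    using Image = std::vector<std::vector<double>>;

    // Hypothetical helper, as above: lower scores mean better matches.
    double matchScore(const Image& search, const Image& subTemplate);

    // Count how many grid cells of the template still match the search
    // image; occlusion typically spoils only a few of them.
    int countMatchingSubTemplates(const Image& search, const Image& templ,
                                  std::size_t gridRows, std::size_t gridCols,
                                  double threshold) {
        const std::size_t subH = templ.size() / gridRows;
        const std::size_t subW = templ[0].size() / gridCols;
        int matches = 0;
        for (std::size_t r = 0; r < gridRows; r++) {
            for (std::size_t c = 0; c < gridCols; c++) {
                // Copy out the sub-template for grid cell (r, c).
                Image sub(subH, std::vector<double>(subW));
                for (std::size_t i = 0; i < subH; i++)
                    for (std::size_t j = 0; j < subW; j++)
                        sub[i][j] = templ[r * subH + i][c * subW + j];
                if (matchScore(search, sub) < threshold)
                    matches++;
            }
        }
        return matches;
    }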

Deformable templates in computational anatomy

Template matching is a central tool in computational anatomy (CA). In this field, a deformable template model is used to model the space of human anatomies and their orbits under the group of diffeomorphisms, functions which smoothly deform an object. [12] Template matching arises as an approach to finding the unknown diffeomorphism that acts on a template image to match the target image.

Template matching algorithms in CA have come to be called large deformation diffeomorphic metric mappings (LDDMMs). Currently, there are LDDMM template matching algorithms for matching anatomical landmark points, curves, surfaces, and volumes.

Template-based matching explained using cross correlation or sum of absolute differences

A basic method of template matching, sometimes called "Linear Spatial Filtering", uses an image patch (i.e., the "template image" or "filter mask") tailored to a specific feature of the search image that is to be detected. [citation needed] This technique can be easily performed on grey images or edge images, where the additional variable of color is either not present or not relevant. Cross correlation techniques compare the similarities of the search and template images. Their outputs should be highest at places where the image structure matches the template structure, i.e., where large search image values get multiplied by large template image values.

This method is normally implemented by first picking out a part of a search image to use as a template. Let $S(x, y)$ denote the value of a search image pixel, where $(x, y)$ are the coordinates of the pixel in the search image. For simplicity, assume pixel values are scalar, as in a greyscale image. Similarly, let $T(x_t, y_t)$ denote the value of a template pixel, where $(x_t, y_t)$ are the coordinates of the pixel in the template image. To apply the filter, simply move the center (or origin) of the template image over each point in the search image and calculate the sum of products, similar to a dot product, between the pixel values in the search and template images over the whole area spanned by the template. More formally, if $(x_s, y_s)$ is the center (or origin) of the template image, then the cross correlation at each point $(x_s, y_s)$ in the search image can be computed as:

$$\mathrm{CC}(x_s, y_s) = \sum_{(x_t, y_t) \in T} S(x_s + x_t, y_s + y_t)\, T(x_t, y_t)$$

For convenience, $T$ denotes both the pixel values of the template image and its domain, the bounds of the template. Note that all possible positions of the template with respect to the search image are considered. Since cross correlation values are greatest when the values of the search and template pixels align, the best matching position corresponds to the maximum value of $\mathrm{CC}(x_s, y_s)$ over the search image $S$. Another way to handle translation problems on images using template matching is to compare the intensities of the pixels, using the sum of absolute differences (SAD) measure. To formulate this, let $I_S(x_s, y_s)$ and $I_T(x_t, y_t)$ denote the light intensity of pixels in the search and template images with coordinates $(x_s, y_s)$ and $(x_t, y_t)$, respectively. Then, moving the center (or origin) of the template to a point $(x_s, y_s)$ in the search image as before, the sum of absolute differences between the template and search pixel intensities at that point is:

$$\mathrm{SAD}(x_s, y_s) = \sum_{(x_t, y_t) \in T} \bigl| I_S(x_s + x_t, y_s + y_t) - I_T(x_t, y_t) \bigr|$$

With this measure, the lowest SAD gives the best position for the template, rather than the greatest as with cross correlation. SAD tends to be relatively simple to implement and understand, but it also tends to be relatively slow to execute. A simple C++ implementation of SAD template matching is given below.

Implementation

In this simple implementation, it is assumed that the method described above is applied to grey images, which is why Grey is used as the pixel intensity. The final position in this implementation gives the top-left location of where the template image best matches the search image.

    // Assumes: S is the search image (S_rows x S_cols), T the template image
    // (T_rows x T_cols), each pixel has a Grey intensity member, and
    // VALUE_MAX is larger than any possible SAD.
    minSAD = VALUE_MAX;

    // Loop through the search image.
    for (size_t x = 0; x <= S_cols - T_cols; x++) {
        for (size_t y = 0; y <= S_rows - T_rows; y++) {
            SAD = 0.0;

            // Loop through the template image.
            for (size_t j = 0; j < T_cols; j++) {
                for (size_t i = 0; i < T_rows; i++) {
                    pixel p_SearchIMG = S[y + i][x + j];
                    pixel p_TemplateIMG = T[i][j];
                    SAD += abs(p_SearchIMG.Grey - p_TemplateIMG.Grey);
                }
            }

            // Save the best found position.
            if (minSAD > SAD) {
                minSAD = SAD;
                position.bestRow = y;
                position.bestCol = x;
                position.bestSAD = SAD;
            }
        }
    }

One way to perform template matching on color images is to decompose the pixels into their color components and measure the quality of match between the color template and search image using the sum of the SAD computed for each color separately.
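
As a small sketch of this per-channel scheme (assuming a pixel type with Red, Green, and Blue members, analogous to the Grey member used above), the SAD contribution of one pixel pair becomes the sum of the per-channel absolute differences:

    #include <cstdlib>

    struct pixel { int Red, Green, Blue; };

    // Color SAD term for one pair of pixels: sum the absolute differences
    // of the color components; accumulate this exactly as in the greyscale
    // loop above.
    int colorSAD(const pixel& s, const pixel& t) {
        return abs(s.Red - t.Red) + abs(s.Green - t.Green) + abs(s.Blue - t.Blue);
    }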

Speeding up the process

In the past, this type of spatial filtering was normally only used in dedicated hardware solutions because of the computational complexity of the operation. [13] However, this complexity can be reduced by performing the filtering in the frequency domain of the image, an approach referred to as 'frequency domain filtering' that relies on the convolution theorem.
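
The outline below is a sketch of that idea, with hypothetical fft2d()/ifft2d() helpers standing in for any 2-D FFT routine (both images assumed zero-padded to a common size). By the convolution theorem, cross correlation in the spatial domain becomes an element-wise product in the frequency domain, with one spectrum conjugated:

    #include <complex>
    #include <cstddef>
    #include <vector>

    using Image    = std::vector<std::vector<double>>;
    using Spectrum = std::vector<std::vector<std::complex<double>>>;

    // Hypothetical 2-D FFT helpers (assumed, not a specific library's API).
    Spectrum fft2d(const Image& img);
    Image ifft2d(const Spectrum& spec);

    // Cross correlation via the convolution theorem: multiply the spectrum
    // of the search image by the complex conjugate of the template's
    // spectrum, then transform back. Peaks mark candidate match positions.
    Image crossCorrelate(const Image& search, const Image& paddedTemplate) {
        Spectrum S = fft2d(search);
        Spectrum T = fft2d(paddedTemplate);
        for (std::size_t u = 0; u < S.size(); u++)
            for (std::size_t v = 0; v < S[0].size(); v++)
                S[u][v] *= std::conj(T[u][v]);
        return ifft2d(S);
    }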

Another way of speeding up the matching process is through the use of an image pyramid. This is a series of images, at different scales, which are formed by repeatedly filtering and subsampling the original image in order to generate a sequence of reduced resolution images. [14] These lower resolution images can then be searched for the template (with a similarly reduced resolution), in order to yield possible start positions for searching at the larger scales. The larger images can then be searched in a small window around the start position to find the best template location.
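
A minimal sketch of the pyramid construction step, halving the resolution by averaging 2×2 blocks: applying it repeatedly to both the search and template images yields the reduced-resolution sequence, and a match found at a coarse level, scaled up by two per level, gives the start position for a small-window search at the next finer level.

    #include <cstddef>
    #include <vector>

    using Image = std::vector<std::vector<double>>;

    // One pyramid level: filter and subsample by averaging each 2x2 block.
    Image downsample(const Image& img) {
        const std::size_t rows = img.size() / 2;
        const std::size_t cols = img[0].size() / 2;
        Image out(rows, std::vector<double>(cols));
        for (std::size_t i = 0; i < rows; i++)
            for (std::size_t j = 0; j < cols; j++)
                out[i][j] = (img[2 * i][2 * j] + img[2 * i][2 * j + 1] +
                             img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0;
        return out;
    }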

Other methods can handle problems such as translation, scale, image rotation and even all affine transformations. [15] [16] [17]

Improving the accuracy of the matching

Improvements can be made to the matching method by using more than one template (eigenspaces); these additional templates can have different scales and rotations.

It is also possible to improve the accuracy of the matching method by hybridizing the feature-based and template-based approaches. [18] Naturally, this requires that the search and template images have features that are apparent enough to support feature matching.

Similar methods

Other methods which are similar include 'Stereo matching', 'Image registration' and 'Scale-invariant feature transform'.

Examples of use

Template matching has various applications and is used in such fields as face recognition (see facial recognition system) and medical image processing. Systems have been developed and used in the past to count the number of faces that walk across part of a bridge within a certain amount of time. Other systems include automated calcified nodule detection within digital chest X-rays. [19] Recently, this method has also been implemented in geostatistical simulation, where it can provide a fast algorithm. [20]

See also

Digital image processing
Edge detection
Canny edge detector
Scale-invariant feature transform
Motion estimation
Scale space
Feature (computer vision)
Difference of Gaussians
Corner detection
Blob detection
Generalized Hough transform
Speeded up robust features (SURF)
Sum of absolute differences
Kadir–Brady saliency detector
Object recognition
Part-based models
Harris affine region detector
Computer stereo vision
Heat kernel signature
Kernel (image processing)

References

  1. R. Brunelli, Template Matching Techniques in Computer Vision: Theory and Practice, Wiley, ISBN 978-0-470-51706-2, 2009.
  2. Aksoy, M. S.; Torkul, O.; Cedimoglu, I. H. (2004). "An industrial visual inspection system that uses inductive learning". Journal of Intelligent Manufacturing. 15 (4): 569–574. doi:10.1023/B:JIMS.0000034120.86709.8c. S2CID 35493679.
  3. Kyriacou, Theocharis; Bugmann, Guido; Lauria, Stanislao (April 2005). "Vision-based urban navigation procedures for verbally instructed robots". Robotics and Autonomous Systems. 51 (1): 69–80.
  4. Wang, Ching Yang (1985). "Edge Detection Using Template Matching (Image Processing, Threshold Logic, Analysis, Filters)". Ph.D. dissertation, Duke University, 288 pages; AAT 8523046.
  5. Talmi, Itamar; Mechrez, Roey; Zelnik-Manor, Lihi (2016). "Template Matching with Deformable Diversity Similarity". arXiv:1612.02190 [cs.CV].
  6. Zhang, Richard; Isola, Phillip; Efros, Alexei A.; Shechtman, Eli; Wang, Oliver (2018). "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric". arXiv:1801.03924 [cs.CV].
  7. Talmi, Itamar; Mechrez, Roey; Zelnik-Manor, Lihi (2016). "Template Matching with Deformable Diversity Similarity". arXiv:1612.02190 [cs.CV].
  8. Li, Yuhai; Jian, L.; Jinwen, T.; Honbo, X. (2005). "A fast rotated template matching based on point feature". Proceedings of the SPIE. 6043: 453–459. MIPPR 2005: SAR and Multispectral Image Processing.
  9. Sirmacek, B.; Unsalan, C. (April 2009). "Urban Area and Building Detection Using SIFT Keypoints and Graph Theory". IEEE Transactions on Geoscience and Remote Sensing. 47 (4): 1156–1167.
  10. Mateos, Luis A.; Shao, Dan; Kropatsch, Walter G. (2009). "Expanding Irregular Graph Pyramid for an Approaching Object". CIARP 2009: 885–891.
  11. Jurie, F.; Dhome, M. (2002). "Real time robust template matching". British Machine Vision Conference: 123–131.
  12. Christensen, G. E.; Rabbitt, R. D.; Miller, M. I. (October 1996). "Deformable template model using large deformation kinematics". IEEE Transactions on Image Processing. 5 (10): 1435–1447. doi:10.1109/83.536892. PMID 18290061.
  13. Gonzalez, R.; Woods, R.; Eddins, S. (2004). Digital Image Processing Using MATLAB. Prentice Hall.
  14. Adelson, E. H.; Anderson, C. H.; Bergen, J. R.; Burt, P. J.; Ogden, J. M. "Pyramid methods in image processing". http://web.mit.edu/persci/people/adelson/pub_pdfs/RCA84.pdf
  15. Yuan, Po (1993). "Translation, scale, rotation and threshold invariant pattern recognition system". M.S.E.E. thesis, The University of Texas at Dallas, 62 pages; AAT EP13780.
  16. Kim, H. Y.; Araújo, S. A. (2007). "Grayscale Template-Matching Invariant to Rotation, Scale, Translation, Brightness and Contrast". IEEE Pacific-Rim Symposium on Image and Video Technology, Lecture Notes in Computer Science. 4872: 100–113.
  17. Korman, S.; Reichman, D.; Tsur, G.; Avidan, S. (2013). "FAsT-Match: Fast Affine Template Matching". CVPR 2013.
  18. Yuen, C. T.; Rizon, M.; San, W. S.; Seong, T. C. (2010). "Facial Features for Template Matching Based Face Recognition". American Journal of Engineering and Applied Sciences. 3 (1): 899–903.
  19. Aberneithy, Ashley (2007). "Automatic Detection of Calcified Nodules of Patients with Tuberculous". University College London.
  20. Tahmasebi, P.; Hezarkhani, A.; Sahimi, M. (2012). "Multiple-point geostatistical modeling based on the cross-correlation functions". Computational Geosciences. 16 (3): 779–797.