Harris affine region detector

Last updated August 10, 2024

In the fields of computer vision and image analysis, the Harris affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points or interest points so as to make correspondences between images, recognize textures, categorize objects or build panoramas.

Overview

The Harris affine detector can identify similar regions between images that are related through affine transformations and have different illuminations. These affine-invariant detectors should be capable of identifying similar regions in images taken from different viewpoints that are related by a simple geometric transformation: scaling, rotation and shearing. These detected regions have been called both invariant and covariant: the regions are detected independently of the image transformation (invariant), while their shape changes covariantly with the transformation.^[1] The naming convention matters less than the underlying idea, which is that these interest points are designed to be compatible across images taken from several viewpoints. Other detectors that are affine-invariant include the Hessian affine region detector, maximally stable extremal regions, the Kadir–Brady saliency detector, edge-based regions (EBR) and intensity-extrema-based regions (IBR).

Mikolajczyk and Schmid (2002) first described the Harris affine detector as it is used today in An Affine Invariant Interest Point Detector.^[2] Earlier works in this direction include the use of affine shape adaptation by Lindeberg and Garding for computing affine invariant image descriptors and in this way reducing the influence of perspective image deformations,^[3] the use of affine-adapted feature points for wide baseline matching by Baumberg^[4] and the first use of scale-invariant feature points by Lindeberg;^[5]^[6] see ^[7] for an overview of the theoretical background. The Harris affine detector relies on the combination of corner points detected through Harris corner detection, multi-scale analysis through Gaussian scale space and affine normalization using an iterative affine shape adaptation algorithm. The algorithm follows an iterative approach to detecting these regions:

1. Identify initial region points using scale-invariant Harris–Laplace detector.
2. For each initial point, normalize the region to be affine invariant using affine shape adaptation.
3. Iteratively estimate the affine region: selection of proper integration scale, differentiation scale and spatially localize interest points.
4. Update the affine region using these scales and spatial localizations.
5. Repeat step 3 if the stopping criterion is not met.

Algorithm description

Harris–Laplace detector (initial region points)

The Harris affine detector relies heavily on both the Harris measure and a Gaussian scale space representation. Therefore, a brief examination of both follows. For more exhaustive derivations see corner detection and Gaussian scale space or their associated papers.^[6]^[8]

Harris corner measure

The Harris corner detector algorithm relies on a central principle: at a corner, the image intensity will change largely in multiple directions. This can alternatively be formulated by examining the changes of intensity due to shifts in a local window. Around a corner point, the image intensity will change greatly when the window is shifted in an arbitrary direction. Following this intuition and through a clever decomposition, the Harris detector uses the second moment matrix as the basis of its corner decisions. (See corner detection for a more complete derivation.) The matrix $A$ has also been called the autocorrelation matrix and has values closely related to the derivatives of image intensity.

A(\mathbf {x} )=\sum _{p,q}w(p,q){\begin{bmatrix}I_{x}^{2}(p,q)&I_{x}I_{y}(p,q)\\I_{x}I_{y}(p,q)&I_{y}^{2}(p,q)\\\end{bmatrix}}

where $I_{x}$ and $I_{y}$ are the respective derivatives (of pixel intensity) in the $x$ and $y$ direction at point ( $p$ , $q$ ); $p$ and $q$ are the position parameters of the weighting function $w$ . The off-diagonal entries are the product of $I_{x}$ and $I_{y}$ , while the diagonal entries are squares of the respective derivatives. The weighting function $w(x,y)$ can be uniform, but is more typically an isotropic, circular Gaussian,

w(x,y)=g(x,y,\sigma )={\frac {1}{2\pi \sigma ^{2}}}e^{\left(-{\frac {x^{2}+y^{2}}{2\sigma ^{2}}}\right)}

that acts to average in a local region while weighting those values near the center more heavily.

As it turns out, this $A$ matrix describes the shape of the autocorrelation measure as the window location shifts. Thus, if we let $\lambda _{1}$ and $\lambda _{2}$ be the eigenvalues of $A$ , then these values will provide a quantitative description of how the autocorrelation measure changes in space: its principal curvatures. As Harris and Stephens (1988) point out, the $A$ matrix centered on corner points will have two large, positive eigenvalues.^[8] Rather than extracting these eigenvalues using methods like singular value decomposition, the Harris measure based on the trace and determinant is used:

R=\det(A)-\alpha \operatorname {trace} ^{2}(A)=\lambda _{1}\lambda _{2}-\alpha (\lambda _{1}+\lambda _{2})^{2}

where $\alpha$ is a constant. Corner points have large, positive eigenvalues and would thus have a large Harris measure. Thus, corner points are identified as local maxima of the Harris measure that are above a specified threshold.

{\begin{aligned}\{x_{c}\}={\big \{}x_{c}\mid R(x_{c})>R(x_{i}),\forall x_{i}\in W(x_{c}){\big \}},\\R(x_{c})>t_{\text{threshold}}\end{aligned}}

where $\{x_{c}\}$ is the set of all corner points, $R(x)$ is the Harris measure calculated at $x$ , $W(x_{c})$ is an 8-neighbor set centered on $x_{c}$ and $t_{\text{threshold}}$ is a specified threshold.
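The Harris measure described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function names, the central-difference derivatives, the Gaussian window width `sigma=1.5`, the value `alpha=0.05` and the synthetic test image are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Sampled, normalized 1-D Gaussian truncated at 3*sigma (assumed cutoff)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing with edge padding (the weighting function w)."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def harris_response(img, sigma=1.5, alpha=0.05):
    """R = det(A) - alpha * trace(A)^2, evaluated at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))  # derivatives along axis 0 (y) and axis 1 (x)
    Axx = smooth(Ix * Ix, sigma)
    Axy = smooth(Ix * Iy, sigma)
    Ayy = smooth(Iy * Iy, sigma)
    return (Axx * Ayy - Axy**2) - alpha * (Axx + Ayy) ** 2

def local_maxima(R, threshold):
    """Pixels strictly greater than their 8 neighbours and above the threshold."""
    H, W = R.shape
    pad = np.pad(R, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones(R.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            is_max &= R > pad[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    return np.argwhere(is_max & (R > threshold))

# Synthetic test image: a bright square whose four corners should respond.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
R = harris_response(img)
corners = local_maxima(R, threshold=0.5 * R.max())
```

On this image the response is positive near the corners of the square, negative along its edges (one dominant eigenvalue) and zero in flat areas, which matches the eigenvalue interpretation given above.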

Gaussian scale-space

A Gaussian scale space representation of an image is the set of images that result from convolving a Gaussian kernel of various sizes with the original image. In general, the representation can be formulated as:

L(\mathbf {x} ,s)=G(s)\otimes I(\mathbf {x} )

where $G(s)$ is an isotropic, circular Gaussian kernel as defined above. The convolution with a Gaussian kernel smooths the image using a window the size of the kernel. A larger scale, $s$ , corresponds to a smoother resultant image. Mikolajczyk and Schmid (2001) point out that derivatives and other measurements must be normalized across scales.^[9] A derivative of order $m$ , $D_{i_{1},...i_{m}}$ , must be normalized by a factor $s^{m}$ in the following manner:

D_{i_{1},\dots ,i_{m}}(\mathbf {x} ,s)=s^{m}L_{i_{1},\dots ,i_{m}}(\mathbf {x} ,s)

These derivatives, or any arbitrary measure, can be adapted to a scale space representation by calculating this measure using a set of scales recursively where the $n$ th scale is $s_{n}=k^{n}s_{0}$ . See scale space for a more complete description.
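A small worked example may help show why the $s^{m}$ factor matters. The sketch below assumes that $s$ is the standard deviation of the Gaussian kernel and uses a sinusoidal signal $\sin(\omega x)$ , for which the amplitude of the smoothed $m$ th derivative has the closed form $\omega ^{m}e^{-\omega ^{2}s^{2}/2}$ . The function name and the chosen values of $\omega$ , $k$ and $s_{0}$ are hypothetical.

```python
import numpy as np

def derivative_amplitude(omega, s, m=1, normalized=True):
    """Amplitude of the m-th derivative of sin(omega * x) after Gaussian
    smoothing with standard deviation s (closed form for a sinusoid)."""
    amplitude = omega**m * np.exp(-((omega * s) ** 2) / 2.0)
    return s**m * amplitude if normalized else amplitude

s0, k = 1.0, 1.4
scales = s0 * k ** np.arange(12)          # s_n = k^n * s_0
omega = 0.2                               # signal frequency (assumed)

raw = derivative_amplitude(omega, scales, normalized=False)
norm = derivative_amplitude(omega, scales, normalized=True)
best_scale = scales[np.argmax(norm)]      # sampled scale closest to the peak
```

Without normalization the derivative amplitude only decays as the scale grows, so the finest scale always wins. With the $s^{m}$ factor the response peaks at a scale tied to the signal's own size (here near $1/\omega$ for $m=1$ ), which is what makes responses comparable across scales.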

Combining Harris detector across Gaussian scale-space

The Harris–Laplace detector combines the traditional 2D Harris corner detector with the idea of a Gaussian scale space representation in order to create a scale-invariant detector. Harris-corner points are good starting points because they have been shown to have good rotational and illumination invariance in addition to identifying the interesting points of the image.^[10] However, the points are not scale invariant and thus the second-moment matrix must be modified to reflect a scale-invariant property. Let us denote $M=\mu (\mathbf {x} ,\sigma _{\mathit {I}},\sigma _{\mathit {D}})$ as the scale-adapted second-moment matrix used in the Harris–Laplace detector.

M=\mu (\mathbf {x} ,\sigma _{\mathit {I}},\sigma _{\mathit {D}})=\sigma _{D}^{2}g(\sigma _{I})\otimes {\begin{bmatrix}L_{x}^{2}(\mathbf {x} ,\sigma _{D})&L_{x}L_{y}(\mathbf {x} ,\sigma _{D})\\L_{x}L_{y}(\mathbf {x} ,\sigma _{D})&L_{y}^{2}(\mathbf {x} ,\sigma _{D})\end{bmatrix}}

^[11]

where $g(\sigma _{I})$ is the Gaussian kernel of scale $\sigma _{I}$ and $\mathbf {x} =(x,y)$ . Similar to the Gaussian-scale space, $L(\mathbf {x} )$ is the Gaussian-smoothed image. The $\mathbf {\otimes }$ operator denotes convolution. $L_{x}(\mathbf {x} ,\sigma _{D})$ and $L_{y}(\mathbf {x} ,\sigma _{D})$ are the derivatives in their respective direction applied to the smoothed image and calculated using a Gaussian kernel with scale $\sigma _{D}$ . In terms of our Gaussian scale-space framework, the $\sigma _{I}$ parameter determines the current scale at which the Harris corner points are detected.

Building upon this scale-adapted second-moment matrix, the Harris–Laplace detector is a twofold process: applying the Harris corner detector at multiple scales and automatically choosing the characteristic scale.

Multi-scale Harris corner points

The algorithm searches over a fixed number of predefined scales. This set of scales is defined as:

\{\sigma _{1}\dots \sigma _{n}\}=\{k^{1}\sigma _{0}\dots k^{n}\sigma _{0}\}

Mikolajczyk and Schmid (2004) use $k=1.4$ . For each integration scale, $\sigma _{I}$ , chosen from this set, the appropriate differentiation scale is chosen to be a constant factor of the integration scale: $\sigma _{D}=s\sigma _{I}$ . Mikolajczyk and Schmid (2004) used $s=0.7$ .^[11] Using these scales, the interest points are detected using a Harris measure on the $\mu (\mathbf {x} ,\sigma _{\mathit {I}},\sigma _{\mathit {D}})$ matrix. The cornerness, like the typical Harris measure, is defined as:

{\mathit {cornerness}}=\det(\mu (\mathbf {x} ,\sigma _{\mathit {I}},\sigma _{\mathit {D}}))-\alpha \operatorname {trace} ^{2}(\mu (\mathbf {x} ,\sigma _{\mathit {I}},\sigma _{\mathit {D}}))

Like the traditional Harris detector, corner points are those local (8 point neighborhood) maxima of the cornerness that are above a specified threshold.
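The scale schedule and the cornerness test can be illustrated as follows. This is a minimal sketch: the helper names are hypothetical, the value `alpha=0.05` and the base scale `sigma0=1.0` are assumptions (the text only says that $\alpha$ is a constant), and the two example matrices are made up to represent a corner-like and an edge-like neighborhood.

```python
import numpy as np

def integration_scales(sigma0, k=1.4, n=8):
    """Predefined integration scales sigma_1..sigma_n = k^1*sigma0 .. k^n*sigma0."""
    return sigma0 * k ** np.arange(1, n + 1)

def differentiation_scale(sigma_i, s=0.7):
    """Differentiation scale as a constant fraction of the integration scale."""
    return s * sigma_i

def cornerness(mu, alpha=0.05):
    """det(mu) - alpha * trace(mu)^2 for a 2x2 second-moment matrix."""
    mu = np.asarray(mu, dtype=float)
    return np.linalg.det(mu) - alpha * np.trace(mu) ** 2

sigma_I = integration_scales(sigma0=1.0)
sigma_D = differentiation_scale(sigma_I)

corner_like = [[2.0, 0.0], [0.0, 1.5]]   # two large eigenvalues
edge_like = [[2.0, 0.0], [0.0, 0.0]]     # one dominant eigenvalue

corner_score = cornerness(corner_like)
edge_score = cornerness(edge_like)
```

A matrix with two large eigenvalues yields a positive cornerness, while a matrix dominated by a single eigenvalue yields a negative one, so thresholding the measure separates corners from edges at every scale of the schedule.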

Characteristic scale identification

An iterative algorithm based on Lindeberg (1998) both spatially localizes the corner points and selects the characteristic scale.^[6] The iterative search has three key steps, which are carried out for each point $\mathbf {x}$ that was initially detected at scale $\sigma _{I}$ by the multi-scale Harris detector ( $k$ indicates the $k$ th iteration):

1. Choose the scale $\sigma _{I}^{(k+1)}$ that maximizes the Laplacian-of-Gaussians (LoG) over a predefined range of neighboring scales. The neighboring scales are typically chosen from a range that is within a two scale-space neighborhood. That is, if the original points were detected using a scaling factor of $1.4$ between successive scales, a two scale-space neighborhood is the range $t\in [0.7,\dots ,1.4]$ . Thus the Gaussian scales examined are: $\sigma _{I}^{(k+1)}=t\sigma _{I}^{(k)}$ . The LoG measurement is defined as:

|\operatorname {LoG} (\mathbf {x} ,\sigma _{I})|=\sigma _{I}^{2}\left|L_{xx}(\mathbf {x} ,\sigma _{I})+L_{yy}(\mathbf {x} ,\sigma _{I})\right|

where $L_{xx}$ and $L_{yy}$ are the second derivatives in their respective directions.^[12] The $\sigma _{I}^{2}$ factor (as discussed above in Gaussian scale-space) is used to normalize the LoG across scales and make these measures comparable, thus making a maximum relevant. Mikolajczyk and Schmid (2001) demonstrate that the LoG measure attains the highest percentage of correctly detected corner points in comparison to other scale-selection measures.^[9] The scale which maximizes this LoG measure in the two scale-space neighborhood is deemed the characteristic scale, $\sigma _{I}^{(k+1)}$ , and used in subsequent iterations. If no extremum (maximum) of the LoG is found, this point is discarded from future searches.

2. Using the characteristic scale, the points are spatially localized. That is to say, the point $\mathbf {x} ^{(k+1)}$ is chosen such that it maximizes the Harris corner measure (cornerness as defined above) within an 8×8 local neighborhood.
3. Stopping criterion: $\sigma _{I}^{(k+1)}=\sigma _{I}^{(k)}$ and $\mathbf {x} ^{(k+1)}=\mathbf {x} ^{(k)}$ .

If the stopping criterion is not met, then the algorithm repeats from step 1 using the new $k+1$ points and scale. When the stopping criterion is met, the found points represent those that maximize the LoG across scales (scale selection) and maximize the Harris corner measure in a local neighborhood (spatial selection).
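The scale-selection step of this loop can be illustrated on a case where the answer is known in closed form. The sketch below assumes an ideal Gaussian blob $e^{-r^{2}/(2\sigma _{0}^{2})}$ , for which the scale-normalized LoG at the blob centre is $2\sigma _{I}^{2}\sigma _{0}^{2}/(\sigma _{0}^{2}+\sigma _{I}^{2})^{2}$ and peaks at $\sigma _{I}=\sigma _{0}$ . The function names, the sampling of $t$ and the rule of rejecting a maximum on the boundary of the range are assumptions of this example; spatial localization is omitted because the blob centre is fixed.

```python
import numpy as np

def normalized_log_at_center(sigma_i, blob_sigma):
    """Scale-normalized |LoG| at the centre of a Gaussian blob (closed form)."""
    total = blob_sigma**2 + sigma_i**2
    laplacian = -2.0 * blob_sigma**2 / total**2
    return sigma_i**2 * abs(laplacian)

def select_scale(sigma_prev, blob_sigma, t_values=None):
    """Pick the scale t * sigma_prev, t in [0.7, 1.4], that maximizes the LoG.
    Returns None when the maximum lies on the boundary of the range, i.e. no
    extremum was found and the point would be discarded."""
    if t_values is None:
        t_values = np.linspace(0.7, 1.4, 15)
    candidates = t_values * sigma_prev
    responses = np.array([normalized_log_at_center(c, blob_sigma) for c in candidates])
    best = int(np.argmax(responses))
    if best == 0 or best == len(candidates) - 1:
        return None
    return candidates[best]

# Iterate until the selected scale stops changing (stopping criterion on sigma).
sigma = 2.5
for _ in range(10):
    new_sigma = select_scale(sigma, blob_sigma=3.0)
    if new_sigma is None or np.isclose(new_sigma, sigma):
        break
    sigma = new_sigma
```

Starting from an initial scale of 2.5, the loop settles on the blob's own scale of 3. A starting scale far from the blob size, such as 1.0, has no LoG maximum inside its search range, which corresponds to the case where the point is discarded.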

Affine-invariant points

Mathematical theory

The Harris–Laplace detected points are scale invariant and work well for isotropic regions that are viewed from the same viewing angle. In order to be invariant to arbitrary affine transformations (and viewpoints), the mathematical framework must be revisited. The second-moment matrix $\mathbf {\mu }$ is defined more generally for anisotropic regions:

\mu (\mathbf {x} ,\Sigma _{I},\Sigma _{D})=\det(\Sigma _{D})g(\Sigma _{I})*(\nabla L(\mathbf {x} ,\Sigma _{D})\nabla L(\mathbf {x} ,\Sigma _{D})^{T})

where $\Sigma _{I}$ and $\Sigma _{D}$ are covariance matrices defining the integration and the differentiation Gaussian kernel scales. Although this may look significantly different from the second-moment matrix in the Harris–Laplace detector, it is in fact a generalization of it. The earlier $\mu$ matrix was the 2D-isotropic version in which the covariance matrices $\Sigma _{I}$ and $\Sigma _{D}$ were 2x2 identity matrices multiplied by factors $\sigma _{I}$ and $\sigma _{D}$ , respectively. In the new formulation, one can think of Gaussian kernels as multivariate Gaussian distributions as opposed to a uniform Gaussian kernel. A uniform Gaussian kernel can be thought of as an isotropic, circular region. Similarly, a more general Gaussian kernel defines an ellipsoid. In fact, the eigenvectors and eigenvalues of the covariance matrix define the rotation and size of the ellipsoid. This representation therefore completely defines an arbitrary elliptical affine region over which to integrate or differentiate.

The goal of the affine invariant detector is to identify regions in images that are related through affine transformations. We thus consider a point $\mathbf {x} _{L}$ and the transformed point $\mathbf {x} _{R}=A\mathbf {x} _{L}$ , where $A$ is an affine transformation. In the case of images, both $\mathbf {x} _{R}$ and $\mathbf {x} _{L}$ live in $\mathbb {R} ^{2}$ space. The second-moment matrices are related in the following manner:^[3]

{\begin{aligned}\mu (\mathbf {x} _{L},\Sigma _{I,L},\Sigma _{D,L})&{}=A^{T}\mu (\mathbf {x} _{R},\Sigma _{I,R},\Sigma _{D,R})A\\M_{L}&{}=\mu (\mathbf {x} _{L},\Sigma _{I,L},\Sigma _{D,L})\\M_{R}&{}=\mu (\mathbf {x} _{R},\Sigma _{I,R},\Sigma _{D,R})\\M_{L}&{}=A^{T}M_{R}A\\\Sigma _{I,R}&{}=A\Sigma _{I,L}A^{T}{\text{ and }}\Sigma _{D,R}=A\Sigma _{D,L}A^{T}\end{aligned}}

where $\Sigma _{I,b}$ and $\Sigma _{D,b}$ are the covariance matrices for the $b$ reference frame. If we continue with this formulation and enforce that

{\begin{aligned}\Sigma _{I,L}=\sigma _{I}M_{L}^{-1}\\\Sigma _{D,L}=\sigma _{D}M_{L}^{-1}\end{aligned}}

where $\sigma _{I}$ and $\sigma _{D}$ are scalar factors, one can show that the covariance matrices for the related point are similarly related:

{\begin{aligned}\Sigma _{I,R}=\sigma _{I}M_{R}^{-1}\\\Sigma _{D,R}=\sigma _{D}M_{R}^{-1}\end{aligned}}

By requiring the covariance matrices to satisfy these conditions, several nice properties arise. One of these properties is that the square root of the second-moment matrix, $M^{\tfrac {1}{2}}$ , will transform the original anisotropic region into isotropic regions that are related simply through a pure rotation matrix $R$ . These new isotropic regions can be thought of as a normalized reference frame. The following equations formulate the relation between the normalized points $x_{R}^{'}$ and $x_{L}^{'}$ :

{\begin{aligned}A=M_{R}^{-{\tfrac {1}{2}}}RM_{L}^{\tfrac {1}{2}}\\x_{R}^{'}=M_{R}^{\tfrac {1}{2}}x_{R}\\x_{L}^{'}=M_{L}^{\tfrac {1}{2}}x_{L}\\x_{R}^{'}=Rx_{L}^{'}\end{aligned}}

The rotation matrix can be recovered using gradient methods like those in the SIFT descriptor. As discussed with the Harris detector, the eigenvalues and eigenvectors of the second-moment matrix, $M=\mu (\mathbf {x} ,\Sigma _{I},\Sigma _{D})$ , characterize the curvature and shape of the pixel intensities. That is, the eigenvector associated with the largest eigenvalue indicates the direction of largest change and the eigenvector associated with the smallest eigenvalue defines the direction of least change. In the 2D case, the eigenvectors and eigenvalues define an ellipse. For an isotropic region, the region should be circular in shape and not elliptical. This is the case when the eigenvalues have the same magnitude. Thus a measure of the isotropy around a local region is defined as the following:

{\mathcal {Q}}={\frac {\lambda _{\min }(M)}{\lambda _{\max }(M)}}

where $\lambda$ denotes an eigenvalue. This measure has the range $[0,1]$ . A value of $1$ corresponds to perfect isotropy.
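These relations can be checked numerically. The sketch below picks an arbitrary affine map $A$ and an arbitrary symmetric positive-definite $M_{R}$ (both made-up values), derives $M_{L}=A^{T}M_{R}A$ , and solves $A=M_{R}^{-{\tfrac {1}{2}}}RM_{L}^{\tfrac {1}{2}}$ for $R$ . The helper names are hypothetical, and the matrix square root is computed through an eigendecomposition, which is one of several possible choices.

```python
import numpy as np

def sqrtm_spd(M):
    """Square root of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def isotropy(M):
    """Q = lambda_min(M) / lambda_max(M) for a symmetric matrix."""
    w = np.linalg.eigvalsh(M)          # eigenvalues in ascending order
    return w[0] / w[-1]

A = np.array([[1.2, 0.5], [-0.3, 0.9]])     # made-up affine map, x_R = A x_L
M_R = np.array([[2.0, 0.4], [0.4, 1.0]])    # made-up second-moment matrix at x_R
M_L = A.T @ M_R @ A                         # transformation rule for second moments

# Solve A = M_R^{-1/2} R M_L^{1/2} for R.
R = sqrtm_spd(M_R) @ A @ np.linalg.inv(sqrtm_spd(M_L))

x_L = np.array([0.7, -1.1])
x_R = A @ x_L
xn_L = sqrtm_spd(M_L) @ x_L                 # normalized point in the left frame
xn_R = sqrtm_spd(M_R) @ x_R                 # normalized point in the right frame

q_before = isotropy(M_L)
inv_sqrt = np.linalg.inv(sqrtm_spd(M_L))
q_after = isotropy(inv_sqrt @ M_L @ inv_sqrt)   # second moments in the normalized frame
```

In this example `R` comes out orthogonal with determinant one, and `xn_R` equals `R @ xn_L`, so after normalization the two frames differ only by a rotation. The isotropy measure rises from `q_before`, which is well below one, to one in the normalized frame.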

Iterative algorithm

Using this mathematical framework, the Harris affine detector algorithm iteratively discovers the second-moment matrix that transforms the anisotropic region into a normalized region in which the isotropic measure is sufficiently close to one. The algorithm uses this shape adaptation matrix, $U$ , to transform the image into a normalized reference frame. In this normalized space, the interest points' parameters (spatial location, integration scale and differentiation scale) are refined using methods similar to the Harris–Laplace detector. The second-moment matrix is computed in this normalized reference frame and should have an isotropic measure close to one at the final iteration. At every $k$ th iteration, each interest region is defined by several parameters that the algorithm must discover: the $U^{(k)}$ matrix, position $\mathbf {x} ^{(k)}$ , integration scale $\sigma _{I}^{(k)}$ and differentiation scale $\sigma _{D}^{(k)}$ . Because the detector computes the second-moment matrix in the transformed domain, it is convenient to denote this transformed position as $\mathbf {x} _{w}^{(k)}$ where $U^{(k)}\mathbf {x} _{w}^{(k)}=\mathbf {x} ^{(k)}$ .

1. The detector initializes the search space with points detected by the Harris–Laplace detector.
$U^{(0)}={\mathit {identity}}$ and $\mathbf {x} ^{(0)}$ , $\sigma _{D}^{(0)}$ , and $\sigma _{I}^{(0)}$ are those from the Harris–Laplace detector.
2. Apply the shape adaptation matrix from the previous iteration, $U^{(k-1)}$ , to generate the normalized reference frame, $U^{(k-1)}\mathbf {x} _{w}^{(k-1)}=\mathbf {x} ^{(k-1)}$ . For the first iteration, $U^{(0)}$ is applied.
3. Select the integration scale, $\sigma _{I}^{(k)}$ , using a method similar to the Harris–Laplace detector. The scale is chosen as the one that maximizes the Laplacian of Gaussian (LoG). The search space consists of the scales within a two scale-space neighborhood of the previous iteration's scale.
$\sigma _{I}^{(k)}={\underset {\sigma _{I}=t\sigma _{I}^{(k-1)} \atop t\in [0.7,\dots ,1.4]}{\operatorname {argmax} }}\,\sigma _{I}^{2}\left|L_{xx}(\mathbf {x} ,\sigma _{I})+L_{yy}(\mathbf {x} ,\sigma _{I})\right|$
The integration scale in the $U$ -normalized space differs significantly from that in the non-normalized space. Therefore, it is necessary to search for the integration scale rather than reuse the scale from the non-normalized space.
4. Select the differentiation scale, $\sigma _{D}^{(k)}$ . In order to reduce the search space and degrees of freedom, the differentiation scale is taken to be related to the integration scale through a constant factor: $\sigma _{D}^{(k)}=s\sigma _{I}^{(k)}$ . The constant factor is less than one, because differentiation must take place at a finer scale than the integration that averages it. Mikolajczyk and Schmid (2001) note that a factor that is too small makes smoothing (integration) too significant in comparison to differentiation, while a factor that is too large does not allow the integration to average the covariance matrix.^[9] It is common to choose $s\in [0.5,0.75]$ . From this set, the chosen scale will maximize the isotropic measure ${\mathcal {Q}}={\frac {\lambda _{\min }(\mu )}{\lambda _{\max }(\mu )}}$ .
$\sigma _{D}^{(k)}={\underset {\sigma _{D}=s\sigma _{I}^{(k)},\;s\in [0.5,\dots ,0.75]}{\operatorname {argmax} }}\,{\frac {\lambda _{\min }(\mu (\mathbf {x} _{w}^{(k)},\sigma _{I}^{(k)},\sigma _{D}))}{\lambda _{\max }(\mu (\mathbf {x} _{w}^{(k)},\sigma _{I}^{(k)},\sigma _{D}))}}$
where $\mu (\mathbf {x} _{w}^{(k)},\sigma _{I}^{(k)},\sigma _{D})$ is the second-moment matrix evaluated in the normalized reference frame. This maximization process causes the eigenvalues to converge to the same value.
5. Spatial localization: select the point $\mathbf {x} _{w}^{(k)}$ that maximizes the Harris corner measure ( ${\mathit {cornerness}}$ ) within an 8-point neighborhood around the previous $\mathbf {x} _{w}^{(k-1)}$ point.
$\mathbf {x} _{w}^{(k)}={\underset {\mathbf {x} _{w}\in W(\mathbf {x} _{w}^{(k-1)})}{\operatorname {argmax} }}\,\det(\mu (\mathbf {x} _{w},\sigma _{I}^{(k)},\sigma _{D}^{(k)}))-\alpha \operatorname {trace} ^{2}(\mu (\mathbf {x} _{w},\sigma _{I}^{(k)},\sigma _{D}^{(k)}))$
where $\mu$ is the second-moment matrix as defined above. The window $W(\mathbf {x} _{w}^{(k-1)})$ is the set of 8-nearest neighbors of the previous iteration's point in the normalized reference frame.
Because the spatial localization was done in the $U$ -normalized reference frame, the newly chosen point must be transformed back to the original reference frame. This is achieved by transforming a displacement vector and adding this to the previous point:
$\mathbf {x} ^{(k)}=\mathbf {x} ^{(k-1)}+U^{(k-1)}\cdot (\mathbf {x} _{w}^{(k)}-\mathbf {x} _{w}^{(k-1)})$
6. As mentioned above, the square root of the second-moment matrix defines the transformation matrix that generates the normalized reference frame. We thus need to save this matrix: $\mu _{i}^{(k)}=\mu ^{-{\tfrac {1}{2}}}(\mathbf {x} _{w}^{(k)},\sigma _{I}^{(k)},\sigma _{D}^{(k)})$ . The transformation matrix $U$ is updated: $U^{(k)}=\mu _{i}^{(k)}\cdot U^{(k-1)}$ . In order to ensure that the image gets sampled correctly and that the image is expanded in the direction of the least change (smallest eigenvalue), the maximum eigenvalue is fixed: $\lambda _{\max }(U^{(k)})=1$ . Using this updating method, the final $U$ matrix takes the following form:
$U=\prod _{k}\mu _{i}^{(k)}\cdot U^{(0)}=\prod _{k}(\mu ^{-{\tfrac {1}{2}}})^{(k)}\cdot U^{(0)}$
7. If the stopping criterion is not met, continue to the next iteration at step 2. Because the algorithm iteratively solves for the $U$ normalization matrix that transforms an anisotropic region into an isotropic region, it makes sense to stop when the isotropic measure, ${\mathcal {Q}}={\frac {\lambda _{\min }(\mu )}{\lambda _{\max }(\mu )}}$ , is sufficiently close to its maximum value 1. Sufficiently close implies the following stopping condition:
$1-{\frac {\lambda _{\min }(\mu _{i}^{(k)})}{\lambda _{\max }(\mu _{i}^{(k)})}}<\varepsilon _{C}$
Mikolajczyk and Schmid (2004) had good success with $\varepsilon _{C}=0.05$ .
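The update of the shape adaptation matrix can be sketched with a deliberately simplified model. The code below is not the full detector: it assumes the local image structure is summarized by one fixed second-moment matrix `M` in the image frame and that warping by $U$ changes it to $U^{T}MU$ , so image resampling, scale selection and spatial localization are all left out. The function names are hypothetical, the largest singular value of $U$ is used as a stand-in for $\lambda _{\max }(U)$ , and the divergence threshold of 6 is applied to the ratio of singular values of $U$ .

```python
import numpy as np

def inv_sqrtm_spd(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(w)) @ V.T

def isotropy(M):
    """Q = lambda_min(M) / lambda_max(M) for a symmetric matrix."""
    w = np.linalg.eigvalsh(M)
    return w[0] / w[-1]

def adapt_shape(M, eps_c=0.05, t_diverge=6.0, max_iter=20):
    """Toy shape adaptation: returns (U, number of updates, status)."""
    U = np.eye(2)                              # U^(0) = identity
    for iteration in range(max_iter):
        mu_w = U.T @ M @ U                     # second moments in the normalized frame
        if 1.0 - isotropy(mu_w) < eps_c:       # stopping criterion
            return U, iteration, "converged"
        U = inv_sqrtm_spd(mu_w) @ U            # U^(k) = mu^(-1/2) . U^(k-1)
        sing = np.linalg.svd(U, compute_uv=False)
        U = U / sing[0]                        # fix the largest stretch to 1
        if sing[0] / sing[-1] > t_diverge:     # region too elongated: give up
            return U, iteration + 1, "diverged"
    return U, max_iter, "max_iter"

M = np.array([[4.0, 1.0], [1.0, 1.0]])         # made-up anisotropic structure
U, n_updates, status = adapt_shape(M)
q_final = isotropy(U.T @ M @ U)
```

In this idealized setting a single update already makes the normalized second-moment matrix isotropic. On real images the matrix changes whenever the region is resampled and the scales are reselected, which is why the detector described above needs several iterations.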

Computation and implementation

The computational complexity of the Harris-affine detector is broken into two parts: initial point detection and affine region normalization. The initial point detection algorithm, Harris–Laplace, has complexity ${\mathcal {O}}(n)$ where $n$ is the number of pixels in the image. The affine region normalization algorithm automatically detects the scale and estimates the shape adaptation matrix, $U$ . This process has complexity ${\mathcal {O}}((m+k)p)$ , where $p$ is the number of initial points, $m$ is the size of the search space for the automatic scale selection and $k$ is the number of iterations required to compute the $U$ matrix.^[11]

Some methods exist to reduce the complexity of the algorithm at the expense of accuracy. One method is to eliminate the search in the differentiation scale step. Rather than choose a factor $s$ from a set of factors, the sped-up algorithm chooses the scale to be constant across iterations and points: $\sigma _{D}=s\sigma _{I},\;s={\text{constant}}$ . Although this reduction in search space might decrease the complexity, this change can severely affect the convergence of the $U$ matrix.

Analysis

Convergence

One can imagine that this algorithm might identify duplicate interest points at multiple scales. Because the Harris affine algorithm looks at each initial point given by the Harris–Laplace detector independently, there is no discrimination between identical points. In practice, it has been shown that these points will ultimately all converge to the same interest point. After finishing identifying all interest points, the algorithm accounts for duplicates by comparing the spatial coordinates ( $\mathbf {x}$ ), the integration scale $\sigma _{I}$ , the isotropic measure ${\tfrac {\lambda _{\min }(U)}{\lambda _{\max }(U)}}$ and skew.^[11] If these interest point parameters are similar within a specified threshold, then they are labeled duplicates. The algorithm discards all these duplicate points except for the interest point that's closest to the average of the duplicates. Typically 30% of the Harris affine points are distinct and dissimilar enough to not be discarded.^[11]

Mikolajczyk and Schmid (2004) showed that a substantial share of the initial points (40%) do not converge. The algorithm detects this divergence by stopping the iterative algorithm if the inverse of the isotropic measure is larger than a specified threshold: ${\tfrac {\lambda _{\max }(U)}{\lambda _{\min }(U)}}>t_{\text{diverge}}$ . Mikolajczyk and Schmid (2004) use $t_{\text{diverge}}=6$ . Of those that did converge, the typical number of required iterations was 10.^[2]

Quantitative measure

Quantitative analysis of affine region detectors takes into account both the accuracy of point locations and the overlap of regions across two images. Mikolajczyk and Schmid (2004) extend the repeatability measure of Schmid et al. (1998) as the ratio of point correspondences to minimum detected points of the two images.^[11]^[13]

R_{\text{score}}={\frac {C(A,B)}{\min(n_{A},n_{B})}}

where $C(A,B)$ is the number of corresponding points in images $A$ and $B$ , and $n_{A}$ and $n_{B}$ are the numbers of detected points in the respective images. Because each image represents 3D space, one image might contain objects that are not in the second image and whose interest points therefore have no chance of corresponding. In order to make the repeatability measure valid, one removes these points and considers only points that lie in both images; $n_{A}$ and $n_{B}$ only count those points such that $x_{A}=H\cdot x_{B}$ . For a pair of images related through a homography matrix $H$ , two points $\mathbf {x_{a}}$ and $\mathbf {x_{b}}$ are said to correspond if:

1. The error in pixel location is less than 1.5 pixels: $\|\mathbf {x_{a}} -H\cdot \mathbf {x_{b}} \|<1.5$
2. The overlap error of the two affine points ( $\epsilon _{S}$ ) is less than a specified threshold (typically 40%).^[1] For affine regions, this overlap error is the following:
$\epsilon _{S}=1-{\frac {\mu _{a}\cap (H^{T}\mu _{b}H)}{\mu _{a}\cup (H^{T}\mu _{b}H)}}$
where $\mu _{a}$ and $\mu _{b}$ are the recovered elliptical regions whose points satisfy: $\mathbf {x} ^{T}\mu \mathbf {x} =1$ . Basically, this measure takes a ratio of areas: the area of overlap (intersection) and the total area (union). Perfect overlap would have a ratio of one and an $\epsilon _{S}=0$ . Different scales affect the region of overlap and thus must be taken into account by normalizing the area of each region of interest. Regions with an overlap error as high as 50% are viable detectors to be matched with a good descriptor.^[1]

A second measure, a matching score, more practically assesses the detector's ability to identify matching points between images. Mikolajczyk and Schmid (2005) use a SIFT descriptor to identify matching points. In addition to being the closest points in SIFT-space, two matched points must also have a sufficiently small overlap error (as defined in the repeatability measure). The matching score is the ratio of the number of matched points and the minimum of the total detected points in each image:
$M_{\text{score}}={\frac {M(A,B)}{\min(n_{A},n_{B})}}$ ,^[1]
where $M(A,B)$ is the number of matching points and $n_{A}$ and $n_{B}$ are the numbers of detected regions in the respective images.
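The correspondence test and the two scores can be sketched as below. Several simplifications are assumed: $H$ is treated as a 2x2 linear map (a local linearization of the homography), both ellipses are centred at the origin, the areas are estimated by counting grid cells, and the scale normalization of the regions mentioned above is left out. All function names and the example values are hypothetical.

```python
import numpy as np

def overlap_error(mu_a, mu_b, H=None, extent=4.0, n=401):
    """1 - intersection/union of the ellipses x^T mu x <= 1, estimated on a grid.
    Region b is mapped into the frame of region a as H^T mu_b H."""
    H = np.eye(2) if H is None else np.asarray(H, dtype=float)
    xs = np.linspace(-extent, extent, n)
    X, Y = np.meshgrid(xs, xs)
    P = np.stack([X.ravel(), Y.ravel()], axis=1)            # grid points, shape (n*n, 2)

    def inside(mu):
        return np.einsum("ij,jk,ik->i", P, mu, P) <= 1.0    # x^T mu x per grid point

    a = inside(np.asarray(mu_a, dtype=float))
    b = inside(H.T @ np.asarray(mu_b, dtype=float) @ H)
    return 1.0 - np.count_nonzero(a & b) / np.count_nonzero(a | b)

def corresponds(x_a, x_b, mu_a, mu_b, H, max_dist=1.5, max_overlap_error=0.4):
    """Both conditions: small location error and small overlap error."""
    location_error = np.linalg.norm(np.asarray(x_a) - H @ np.asarray(x_b))
    return location_error < max_dist and overlap_error(mu_a, mu_b, H) < max_overlap_error

def score(count, n_a, n_b):
    """Shared form of the repeatability and matching scores."""
    return count / min(n_a, n_b)

unit_circle = np.eye(2)              # x^T I x <= 1, radius 1
big_circle = np.eye(2) / 4.0         # radius 2

same = overlap_error(unit_circle, unit_circle)
nested = overlap_error(unit_circle, big_circle)
is_match = corresponds([10.0, 5.0], [10.5, 5.0], unit_circle, unit_circle, np.eye(2))
repeatability = score(30, 50, 60)
```

For the two nested circles the intersection covers a quarter of the union, so the overlap error is about 0.75 and the pair would be rejected at the 40% threshold, whereas identical regions give an error of zero.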

Robustness to affine and other transformations

Mikolajczyk et al. (2005) have done a thorough analysis of several state-of-the-art affine region detectors: Harris affine, Hessian affine, MSER,^[14] IBR & EBR^[15] and salient^[16] detectors.^[1] Mikolajczyk et al. analyzed both structured images and textured images in their evaluation. Linux binaries of the detectors and their test images are freely available at their webpage. A brief summary of the results of Mikolajczyk et al. (2005) follows; see A comparison of affine region detectors for a more quantitative analysis.

Viewpoint Angle Change: The Harris affine detector has reasonable (average) robustness to these types of changes. The detector maintains a repeatability score of above 50% until the viewpoint angle exceeds 40 degrees. The detector tends to detect a high number of repeatable and matchable regions even under a large viewpoint change.
Scale Change: The Harris affine detector remains very consistent under scale changes. Although the number of points declines considerably at large scale changes (above 2.8), the repeatability (50–60%) and matching scores (25–30%) remain very constant especially with textured images. This is consistent with the high-performance of the automatic scale selection iterative algorithm.
Blurred Images: The Harris affine detector remains very stable under image blurring. Because the detector does not rely on image segmentation or region boundaries, the repeatability and matching scores remain constant.
JPEG Artifacts: The Harris affine detector degrades similarly to other affine detectors: repeatability and matching scores drop significantly above 80% compression.
Illumination Changes: The Harris affine detector, like other affine detectors, is very robust to illumination changes: repeatability and matching scores remain constant under decreasing light. This should be expected because the detectors rely heavily on relative intensities (derivatives) and not absolute intensities.

General trends

Harris affine region points tend to be small and numerous. Both the Harris-affine detector and Hessian-affine consistently identify double the number of repeatable points as other affine detectors: ~1000 regions for an 800×640 image.^[1] Small regions are less likely to be occluded but have a smaller chance of overlapping neighboring regions.
The Harris affine detector responds well to textured scenes in which there are a lot of corner-like parts. It also performs very well for some structured scenes, such as buildings. This is complementary to MSER, which tends to do better with well-structured (segmentable) scenes.
Overall the Harris affine detector performs very well, but still behind MSER and Hessian-affine in all cases but blurred images.
Harris-affine and Hessian-affine detectors are less accurate than others: their repeatability score increases as the overlap threshold is increased.
The detected affine-invariant regions may still differ in their rotation and illumination. Any descriptor that uses these regions must account for these remaining differences when using the regions for matching or other comparisons.

Applications

Content-based image retrieval ^[17]^[18]
Model-based recognition
Object retrieval in video^[19]
Visual data mining: identifying important objects, characters and scenes in videos^[20]
Object recognition and categorization^[21]
Remotely sensed image analysis: Object detection from remotely sensed images^[22]

Software packages

Affine Covariant Features: K. Mikolajczyk maintains a web page that contains Linux binaries of the Harris-affine detector in addition to other detectors and descriptors. Matlab code is also available that can be used to illustrate and compute the repeatability of various detectors. Code and images are also available to reproduce the results found in the Mikolajczyk et al. (2005) paper.
lip-vireo: binary code for Linux, Windows and SunOS from the VIREO research group. See the project homepage (archived 2017-05-11 at the Wayback Machine) for more.

References

[1] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir and L. Van Gool, A comparison of affine region detectors. In IJCV 65(1/2):43-72, 2005.
[2] Mikolajczyk, K. and Schmid, C. 2002. An affine invariant interest point detector. In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada (PDF). Archived from the original (PDF) on 2004-07-23. Retrieved 2007-12-11.
[3] T. Lindeberg and J. Garding (1997). "Shape-adapted smoothing in estimation of 3-D depth cues from affine distortions of local 2-D structure". Image and Vision Computing 15: pp. 415–434.
[4] A. Baumberg (2000). "Reliable feature matching across widely separated views". Proceedings of IEEE Conference on Computer Vision and Pattern Recognition: pages I:1774–1781.
[5] Lindeberg, Tony, Scale-Space Theory in Computer Vision, Kluwer Academic Publishers, 1994, ISBN 0-7923-9418-6.
[6] T. Lindeberg (1998). "Feature detection with automatic scale selection". International Journal of Computer Vision 30 (2): pp. 77–116.
[7] Lindeberg, T. (2008). "Scale-space". In Wah, Benjamin (ed.). Encyclopedia of Computer Science and Engineering. Vol. IV. John Wiley and Sons. pp. 2495–2504. doi:10.1002/9780470050118.ecse609. ISBN 978-0470050118.
[8] C. Harris and M. Stephens (1988). "A combined corner and edge detector". Proceedings of the 4th Alvey Vision Conference: pages 147–151. Archived 2007-09-16 at the Wayback Machine.
[9] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, pages 525-531, 2001.
[10] Schmid, C., Mohr, R., and Bauckhage, C. 2000. Evaluation of interest point detectors. International Journal of Computer Vision, 37(2):151–172.
[11] Mikolajczyk, K. and Schmid, C. 2004. Scale & affine invariant interest point detectors. International Journal on Computer Vision 60(1):63-86.
[12] "Spatial Filters: Laplacian/Laplacian of Gaussian". Archived from the original on 2007-11-20. Retrieved 2007-12-11.
[13] C. Schmid, R. Mohr, and C. Bauckhage. Comparing and evaluating interest points. In International Conference on Computer Vision, pp. 230–235, 1998.
[14] J. Matas, O. Chum, M. Urban, and T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions. In BMVC p. 384-393, 2002.
[15] T. Tuytelaars and L. Van Gool, Matching widely separated views based on affine invariant regions. In IJCV 59(1):61-85, 2004.
[16] T. Kadir, A. Zisserman, and M. Brady, An affine invariant salient region detector. In ECCV p. 404-416, 2004.
[17] http://staff.science.uva.nl/~gevers/pub/overview.pdf
[18] R. Datta, J. Li, and J. Z. Wang, “Content-based image retrieval – Approaches and trends of the new age,” In Proc. Int. Workshop on Multimedia Information Retrieval, pp. 253–262, 2005. IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 127–142, 2005. Archived 2007-09-28 at the Wayback Machine.
[19] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, Nice, France, 2003.
[20] J. Sivic and A. Zisserman. Video data mining using configurations of viewpoint invariant regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, USA, pp. 488–495, 2004.
[21] G. Dorko and C. Schmid. Selection of scale invariant neighborhoods for object class recognition. In Proceedings of International Conference on Computer Vision, Nice, France, pp. 634–640, 2003.
[22] Beril Sirmacek and Cem Unsalan (January 2011). "A probabilistic framework to detect buildings in aerial and satellite images" (PDF). IEEE Transactions on Geoscience and Remote Sensing. 49 (1): 211–221. Bibcode:2011ITGRS..49..211S. doi:10.1109/TGRS.2010.2053713. S2CID 10637950.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
