# Scale space

Scale-space theory is a framework for multi-scale signal representation developed by the computer vision, image processing and signal processing communities with complementary motivations from physics and biological vision. It is a formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, the scale-space representation, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures. The parameter $t$ in this family is referred to as the scale parameter, with the interpretation that image structures of spatial size smaller than about ${\sqrt {t}}$ have largely been smoothed away in the scale-space level at scale $t$.

The main type of scale space is the linear (Gaussian) scale space, which has wide applicability as well as the attractive property of being possible to derive from a small set of scale-space axioms. The corresponding scale-space framework encompasses a theory for Gaussian derivative operators, which can be used as a basis for expressing a large class of visual operations for computerized systems that process visual information. This framework also allows visual operations to be made scale invariant, which is necessary for dealing with the size variations that may occur in image data, because real-world objects may be of different sizes and in addition the distance between the object and the camera may be unknown and may vary depending on the circumstances.

## Definition

The notion of scale space applies to signals of arbitrary numbers of variables. The most common case in the literature applies to two-dimensional images, which is what is presented here. For a given image $f(x,y)$ , its linear (Gaussian) scale-space representation is a family of derived signals $L(x,y;t)$ defined by the convolution of $f(x,y)$ with the two-dimensional Gaussian kernel

$g(x,y;t)={\frac {1}{2\pi t}}e^{-(x^{2}+y^{2})/2t}\,$ such that

$L(\cdot ,\cdot ;t)\ =g(\cdot ,\cdot ;t)*f(\cdot ,\cdot ),$ where the semicolon in the argument of $L$ implies that the convolution is performed only over the variables $x,y$ , while the scale parameter $t$ after the semicolon just indicates which scale level is being defined. This definition of $L$ works for a continuum of scales $t\geq 0$ , but typically only a finite discrete set of levels in the scale-space representation would be actually considered.

The scale parameter $t=\sigma ^{2}$ is the variance of the Gaussian filter and as a limit for $t=0$ the filter $g$ becomes an impulse function such that $L(x,y;0)=f(x,y),$ that is, the scale-space representation at scale level $t=0$ is the image $f$ itself. As $t$ increases, $L$ is the result of smoothing $f$ with a larger and larger filter, thereby removing more and more of the details that the image contains. Since the standard deviation of the filter is $\sigma ={\sqrt {t}}$, details that are significantly smaller than this value are to a large extent removed from the image at scale parameter $t$.
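The definition above can be sketched numerically. The following is a minimal illustration, not a reference implementation: it assumes a sampled Gaussian kernel truncated at four standard deviations and renormalized, separable convolution, and reflecting image borders. The helper names `gaussian_kernel`, `convolve_axis` and `scale_space` are hypothetical.

```python
import numpy as np

def gaussian_kernel(t, radius_sigmas=4.0):
    """Sampled 1-D Gaussian with variance t, renormalized to sum to 1."""
    radius = max(1, int(np.ceil(radius_sigmas * np.sqrt(t))))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2.0 * t))
    return g / g.sum()

def convolve_axis(img, kernel, axis):
    """Convolve along one axis, reflecting the image at its borders."""
    r = len(kernel) // 2
    pad = [(0, 0)] * img.ndim
    pad[axis] = (r, r)
    padded = np.pad(img, pad, mode="reflect")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def scale_space(f, scales):
    """Return {t: L(., .; t)}; the level t = 0 is the image itself."""
    f = f.astype(float)
    levels = {}
    for t in scales:
        if t == 0:
            levels[t] = f.copy()
        else:
            k = gaussian_kernel(t)
            # The 2-D Gaussian is separable: smooth rows, then columns.
            levels[t] = convolve_axis(convolve_axis(f, k, 0), k, 1)
    return levels

rng = np.random.default_rng(0)
f = rng.random((64, 64))          # synthetic test image (an assumption)
scales = [0, 1, 4, 16]
levels = scale_space(f, scales)
for t in scales:
    print(t, float(levels[t].std()))
```

The printed standard deviations shrink as $t$ grows, reflecting the progressive suppression of fine-scale structure.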

### Why a Gaussian filter?

When faced with the task of generating a multi-scale representation one may ask: could any filter g of low-pass type and with a parameter t which determines its width be used to generate a scale space? The answer is no, as it is of crucial importance that the smoothing filter does not introduce new spurious structures at coarse scales that do not correspond to simplifications of corresponding structures at finer scales. In the scale-space literature, a number of different ways have been expressed to formulate this criterion in precise mathematical terms.

The conclusion from several different axiomatic derivations that have been presented is that the Gaussian scale space constitutes the canonical way to generate a linear scale space, based on the essential requirement that new structures must not be created when going from a fine scale to any coarser scale. Conditions, referred to as scale-space axioms, that have been used for deriving the uniqueness of the Gaussian kernel include linearity, shift invariance, semi-group structure, non-enhancement of local extrema, scale invariance and rotational invariance. The uniqueness claimed in the arguments based on scale invariance has been criticized in the literature, and alternative self-similar scale-space kernels have been proposed. The Gaussian kernel is, however, a unique choice according to the scale-space axiomatics based on causality or non-enhancement of local extrema.
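Of the axioms listed, the semi-group structure is easy to check numerically: smoothing with variance $t_{1}$ followed by variance $t_{2}$ should match a single smoothing with variance $t_{1}+t_{2}$. The sketch below uses one-dimensional sampled Gaussians with generous truncation radii; the function name `sampled_gaussian` and the chosen values of `t1`, `t2` are illustrative assumptions.

```python
import numpy as np

def sampled_gaussian(t, radius):
    """1-D Gaussian with variance t sampled at integer positions."""
    x = np.arange(-radius, radius + 1, dtype=float)
    return np.exp(-x**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

t1, t2 = 2.0, 3.0
r1, r2 = 12, 12   # more than six standard deviations for both kernels

# Cascade: g(.; t1) * g(.; t2), which has support radius r1 + r2.
cascade = np.convolve(sampled_gaussian(t1, r1), sampled_gaussian(t2, r2))
# Direct: g(.; t1 + t2) on the same support.
direct = sampled_gaussian(t1 + t2, r1 + r2)

max_abs_diff = float(np.max(np.abs(cascade - direct)))
print(max_abs_diff)
```

The remaining difference comes only from sampling and truncation, which are small at these scales.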

### Alternative definition

Equivalently, the scale-space family can be defined as the solution of the diffusion equation (for example in terms of the heat equation),

$\partial _{t}L={\frac {1}{2}}\nabla ^{2}L,$ with initial condition $L(x,y;0)=f(x,y)$ . This formulation of the scale-space representation L means that it is possible to interpret the intensity values of the image f as a "temperature distribution" in the image plane and that the process that generates the scale-space representation as a function of t corresponds to heat diffusion in the image plane over time t (assuming the thermal conductivity of the material equal to the arbitrarily chosen constant ½). Although this connection may appear superficial for a reader not familiar with differential equations, it is indeed the case that the main scale-space formulation in terms of non-enhancement of local extrema is expressed in terms of a sign condition on partial derivatives in the 2+1-D volume generated by the scale space, thus within the framework of partial differential equations. Furthermore, a detailed analysis of the discrete case shows that the diffusion equation provides a unifying link between continuous and discrete scale spaces, which also generalizes to nonlinear scale spaces, for example, using anisotropic diffusion. Hence, one may say that the primary way to generate a scale space is by the diffusion equation, and that the Gaussian kernel arises as the Green's function of this specific partial differential equation.
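The diffusion formulation can be illustrated with a simple explicit finite-difference scheme. This is a sketch under stated assumptions: unit grid spacing, a five-point Laplacian, periodic borders via `np.roll`, and a step size `dt` chosen well inside the stability limit. The function name `diffuse` is hypothetical. Starting from a unit impulse, the result after diffusion time $t$ should have variance $t$ along each axis, as the Gaussian kernel $g(x,y;t)$ does.

```python
import numpy as np

def diffuse(f, t, dt=0.1):
    """Explicit Euler steps of dL/dt = 0.5 * Laplacian(L), periodic borders."""
    L = f.astype(float).copy()
    n_steps = int(round(t / dt))
    for _ in range(n_steps):
        lap = (np.roll(L, 1, 0) + np.roll(L, -1, 0) +
               np.roll(L, 1, 1) + np.roll(L, -1, 1) - 4.0 * L)
        L += 0.5 * dt * lap
    return L

n = 65
f = np.zeros((n, n))
f[n // 2, n // 2] = 1.0           # unit impulse as initial condition
t = 4.0
L = diffuse(f, t)

# Second moment of the diffused impulse along x, measured from the centre.
y, x = np.mgrid[:n, :n] - n // 2
var_x = float((L * x**2).sum() / L.sum())
print(var_x, float(L.sum()), float(L.max()))
```

The total "heat" is conserved and the maximum never exceeds the initial maximum, in line with the requirement that smoothing must not create new structure.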

## Motivations

The motivation for generating a scale-space representation of a given data set originates from the basic observation that real-world objects are composed of different structures at different scales. This implies that real-world objects, in contrast to idealized mathematical entities such as points or lines, may appear in different ways depending on the scale of observation. For example, the concept of a "tree" is appropriate at the scale of meters, while concepts such as leaves and molecules are more appropriate at finer scales. For a computer vision system analysing an unknown scene, there is no way to know a priori what scales are appropriate for describing the interesting structures in the image data. Hence, the only reasonable approach is to consider descriptions at multiple scales in order to be able to capture the unknown scale variations that may occur. Taken to the limit, a scale-space representation considers representations at all scales. 

Another motivation to the scale-space concept originates from the process of performing a physical measurement on real-world data. In order to extract any information from a measurement process, one has to apply operators of non-infinitesimal size to the data. In many branches of computer science and applied mathematics, the size of the measurement operator is disregarded in the theoretical modelling of a problem. The scale-space theory on the other hand explicitly incorporates the need for a non-infinitesimal size of the image operators as an integral part of any measurement as well as any other operation that depends on a real-world measurement. 

There is a close link between scale-space theory and biological vision. Many scale-space operations show a high degree of similarity with receptive field profiles recorded from the mammalian retina and the first stages in the visual cortex. In these respects, the scale-space framework can be seen as a theoretically well-founded paradigm for early vision, which in addition has been thoroughly tested by algorithms and experiments.  

## Gaussian derivatives

At any scale in scale space, we can apply local derivative operators to the scale-space representation:

$L_{x^{m}y^{n}}(x,y;t)=\left(\partial _{x^{m}y^{n}}L\right)(x,y;t).$ Due to the commutative property between the derivative operator and the Gaussian smoothing operator, such scale-space derivatives can equivalently be computed by convolving the original image with Gaussian derivative operators. For this reason they are often also referred to as Gaussian derivatives:

$L_{x^{m}y^{n}}(\cdot ,\cdot ;t)=\partial _{x^{m}y^{n}}g(\cdot ,\cdot ;\,t)*f(\cdot ,\cdot ).$ The uniqueness of the Gaussian derivative operators as local operations derived from a scale-space representation can be obtained by similar axiomatic derivations as are used for deriving the uniqueness of the Gaussian kernel for scale-space smoothing.
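The equivalence between differentiating the smoothed image and convolving with a Gaussian derivative kernel can be checked on a simple test image. The sketch below assumes sampled derivative kernels up to order two, truncated at five standard deviations, with reflecting borders; the helper names are hypothetical, and the array convention `img[y, x]` (axis 1 is $x$) is an assumption of this example.

```python
import numpy as np

def gaussian_derivative_kernel(t, order, radius_sigmas=5.0):
    """Sampled 1-D Gaussian (order 0) or its first or second derivative."""
    radius = int(np.ceil(radius_sigmas * np.sqrt(t)))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2.0 * t))
    g /= g.sum()
    if order == 0:
        return g
    if order == 1:
        return -x / t * g
    if order == 2:
        return (x**2 / t**2 - 1.0 / t) * g
    raise ValueError("only orders 0, 1 and 2 are sketched here")

def convolve_axis(img, kernel, axis):
    """Convolve along one axis, reflecting the image at its borders."""
    r = len(kernel) // 2
    pad = [(0, 0)] * img.ndim
    pad[axis] = (r, r)
    padded = np.pad(img, pad, mode="reflect")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def gaussian_derivative(f, t, m, n):
    """L_{x^m y^n}: order m along x (axis 1), order n along y (axis 0)."""
    out = convolve_axis(f.astype(float), gaussian_derivative_kernel(t, m), 1)
    return convolve_axis(out, gaussian_derivative_kernel(t, n), 0)

yy, xx = np.mgrid[:64, :64].astype(float)
f = 3.0 * xx + 2.0 * yy           # linear ramp: true gradient is (3, 2)
t = 4.0

Lx = gaussian_derivative(f, t, 1, 0)          # Gaussian derivative kernel
Ly = gaussian_derivative(f, t, 0, 1)
L = gaussian_derivative(f, t, 0, 0)           # plain smoothing
Lx_fd = np.gradient(L, axis=1)                # differentiate after smoothing

interior = (slice(16, 48), slice(16, 48))     # away from border effects
print(float(Lx[interior].mean()), float(Ly[interior].mean()),
      float(Lx_fd[interior].mean()))
```

Away from the borders both routes recover the slope of the ramp, up to small truncation error in the sampled kernels.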

### Visual front end

These Gaussian derivative operators can in turn be combined by linear or non-linear operators into a larger variety of different types of feature detectors, which in many cases can be well modelled by differential geometry. Specifically, invariance (or more appropriately covariance) to local geometric transformations, such as rotations or local affine transformations, can be obtained by considering differential invariants under the appropriate class of transformations or alternatively by normalizing the Gaussian derivative operators to a locally determined coordinate frame determined from e.g. a preferred orientation in the image domain, or by applying a preferred local affine transformation to a local image patch (see the article on affine shape adaptation for further details).

When Gaussian derivative operators and differential invariants are used in this way as basic feature detectors at multiple scales, the uncommitted first stages of visual processing are often referred to as a visual front-end. This overall framework has been applied to a large variety of problems in computer vision, including feature detection, feature classification, image segmentation, image matching, motion estimation, computation of shape cues and object recognition. The set of Gaussian derivative operators up to a certain order is often referred to as the N-jet and constitutes a basic type of feature within the scale-space framework.

## Detector examples

Following the idea of expressing visual operations in terms of differential invariants computed at multiple scales using Gaussian derivative operators, we can express an edge detector from the set of points that satisfy the requirement that the gradient magnitude

$L_{v}={\sqrt {L_{x}^{2}+L_{y}^{2}}}$ should assume a local maximum in the gradient direction

$\nabla L=(L_{x},L_{y})^{T}.$ By working out the differential geometry, it can be shown that this differential edge detector can equivalently be expressed from the zero-crossings of the second-order differential invariant

${\tilde {L}}_{v}^{2}=L_{x}^{2}\,L_{xx}+2\,L_{x}\,L_{y}\,L_{xy}+L_{y}^{2}\,L_{yy}=0$ that satisfy the following sign condition on a third-order differential invariant:

${\tilde {L}}_{v}^{3}=L_{x}^{3}\,L_{xxx}+3\,L_{x}^{2}\,L_{y}\,L_{xxy}+3\,L_{x}\,L_{y}^{2}\,L_{xyy}+L_{y}^{3}\,L_{yyy}<0.$ Similarly, multi-scale blob detectors at any given fixed scale can be obtained from local maxima and local minima of either the Laplacian operator (also referred to as the Laplacian of Gaussian)

$\nabla ^{2}L=L_{xx}+L_{yy}\,$ or the determinant of the Hessian matrix

$\operatorname {det} HL(x,y;t)=(L_{xx}L_{yy}-L_{xy}^{2}).$ In an analogous fashion, corner detectors and ridge and valley detectors can be expressed as local maxima, minima or zero-crossings of multi-scale differential invariants defined from Gaussian derivatives. The algebraic expressions for the corner and ridge detection operators are, however, somewhat more complex and the reader is referred to the articles on corner detection and ridge detection for further details.
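As a small illustration of the two blob measures at a fixed scale, the sketch below evaluates the Laplacian and the determinant of the Hessian on a synthetic bright Gaussian blob. The test image, the scale `t = 4`, the `img[y, x]` array convention and all helper names are assumptions made for this example. For a bright blob, the Laplacian is expected to reach a minimum at the blob centre and the determinant of the Hessian a maximum.

```python
import numpy as np

def gaussian_derivative_kernel(t, order, radius_sigmas=5.0):
    """Sampled 1-D Gaussian (order 0) or its first or second derivative."""
    radius = int(np.ceil(radius_sigmas * np.sqrt(t)))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2.0 * t))
    g /= g.sum()
    if order == 0:
        return g
    if order == 1:
        return -x / t * g
    return (x**2 / t**2 - 1.0 / t) * g       # order 2

def convolve_axis(img, kernel, axis):
    """Convolve along one axis, reflecting the image at its borders."""
    r = len(kernel) // 2
    pad = [(0, 0)] * img.ndim
    pad[axis] = (r, r)
    padded = np.pad(img, pad, mode="reflect")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def gaussian_derivative(f, t, m, n):
    """L_{x^m y^n}: order m along x (axis 1), order n along y (axis 0)."""
    out = convolve_axis(f.astype(float), gaussian_derivative_kernel(t, m), 1)
    return convolve_axis(out, gaussian_derivative_kernel(t, n), 0)

n = 65
yy, xx = np.mgrid[:n, :n].astype(float) - n // 2
f = np.exp(-(xx**2 + yy**2) / (2.0 * 16.0))   # bright blob at the centre

t = 4.0
Lxx = gaussian_derivative(f, t, 2, 0)
Lyy = gaussian_derivative(f, t, 0, 2)
Lxy = gaussian_derivative(f, t, 1, 1)

laplacian = Lxx + Lyy
det_hessian = Lxx * Lyy - Lxy**2

lap_min = tuple(int(i) for i in
                np.unravel_index(np.argmin(laplacian), laplacian.shape))
det_max = tuple(int(i) for i in
                np.unravel_index(np.argmax(det_hessian), det_hessian.shape))
print(lap_min, det_max)
```

A full detector would search for all local extrema rather than the single global one used here for brevity.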

Scale-space operations have also been frequently used for expressing coarse-to-fine methods, in particular for tasks such as image matching and for multi-scale image segmentation.

## Scale selection

The theory presented so far describes a well-founded framework for representing image structures at multiple scales. In many cases it is, however, also necessary to select locally appropriate scales for further analysis. This need for scale selection originates from two major reasons: (i) real-world objects may have different sizes, and these sizes may be unknown to the vision system, and (ii) the distance between the object and the camera can vary, and this distance information may also be unknown a priori. A highly useful property of scale-space representation is that image representations can be made invariant to scales, by performing automatic local scale selection based on local maxima (or minima) over scales of scale-normalized derivatives

$L_{\xi ^{m}\eta ^{n}}(x,y;t)=t^{(m+n)\gamma /2}L_{x^{m}y^{n}}(x,y;t)$ where $\gamma \in [0,1]$ is a parameter that is related to the dimensionality of the image feature. This algebraic expression for scale normalized Gaussian derivative operators originates from the introduction of $\gamma$ -normalized derivatives according to

$\partial _{\xi }=t^{\gamma /2}\partial _{x}\quad$ and $\quad \partial _{\eta }=t^{\gamma /2}\partial _{y}.$ It can be theoretically shown that a scale selection module working according to this principle will satisfy the following scale covariance property: if for a certain type of image feature a local maximum is assumed in a certain image at a certain scale $t_{0}$ , then under a rescaling of the image by a scale factor $s$ the local maximum over scales in the rescaled image will be transformed to the scale level $s^{2}t_{0}$ . 
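The scale covariance property can be illustrated on a Gaussian blob. For an input $f=g(\cdot ,\cdot ;t_{0})$, the semi-group property gives $L=g(\cdot ,\cdot ;t_{0}+t)$, so the magnitude of the $\gamma =1$ normalized Laplacian at the blob centre is $t/(\pi (t_{0}+t)^{2})$, which is maximal at $t=t_{0}$. The sketch below checks this numerically on a coarse geometric grid of scales; the grid, the image size and the helper names are assumptions of this example, not part of the theory.

```python
import numpy as np

def gaussian_derivative_kernel(t, order, radius_sigmas=5.0):
    """Sampled 1-D Gaussian (order 0) or its second derivative (order 2)."""
    radius = int(np.ceil(radius_sigmas * np.sqrt(t)))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2.0 * t))
    g /= g.sum()
    return g if order == 0 else (x**2 / t**2 - 1.0 / t) * g

def convolve_axis(img, kernel, axis):
    """Convolve along one axis, reflecting the image at its borders."""
    r = len(kernel) // 2
    pad = [(0, 0)] * img.ndim
    pad[axis] = (r, r)
    padded = np.pad(img, pad, mode="reflect")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def laplacian_at_scale(f, t):
    """L_xx + L_yy at scale t, using separable Gaussian derivative kernels."""
    g0 = gaussian_derivative_kernel(t, 0)
    g2 = gaussian_derivative_kernel(t, 2)
    Lxx = convolve_axis(convolve_axis(f, g2, 1), g0, 0)
    Lyy = convolve_axis(convolve_axis(f, g0, 1), g2, 0)
    return Lxx + Lyy

def selected_scale(t0, scales, n=129):
    """Scale maximizing |t * Laplacian| at the centre of a blob of variance t0."""
    c = n // 2
    yy, xx = np.mgrid[:n, :n].astype(float) - c
    f = np.exp(-(xx**2 + yy**2) / (2.0 * t0)) / (2.0 * np.pi * t0)
    # gamma = 1 and m + n = 2, so the normalization factor is t.
    responses = [abs(t * laplacian_at_scale(f, t)[c, c]) for t in scales]
    return scales[int(np.argmax(responses))]

scales = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
small = selected_scale(8.0, scales)
large = selected_scale(32.0, scales)   # blob rescaled by s = 2, so t0 -> s^2 t0
print(small, large)
```

Rescaling the blob by a factor $s=2$ moves the selected scale from $t_{0}$ to $s^{2}t_{0}$, as the covariance property predicts.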

### Scale invariant feature detection

Following this approach of gamma-normalized derivatives, it can be shown that different types of scale adaptive and scale invariant feature detectors can be expressed for tasks such as blob detection, corner detection, ridge detection, edge detection and spatio-temporal interest point detection (see the specific articles on these topics for in-depth descriptions of how these scale-invariant feature detectors are formulated). Furthermore, the scale levels obtained from automatic scale selection can be used for determining regions of interest for subsequent affine shape adaptation to obtain affine invariant interest points or for determining scale levels for computing associated image descriptors, such as locally scale adapted N-jets.

Recent work has shown that more complex operations, such as scale-invariant object recognition, can also be performed in this way, by computing local image descriptors (N-jets or local histograms of gradient directions) at scale-adapted interest points obtained from scale-space extrema of the normalized Laplacian operator (see also scale-invariant feature transform) or the determinant of the Hessian (see also SURF); see also the Scholarpedia article on the scale-invariant feature transform for a more general outlook of object recognition approaches based on receptive field responses in terms of Gaussian derivative operators or approximations thereof.

An image pyramid is a discrete representation in which a scale space is sampled in both space and scale. For scale invariance, the scale factors should be sampled exponentially, for example as integer powers of 2 or ${\sqrt {2}}$. When properly constructed, the ratio of the sample rates in space and scale is held constant so that the impulse response is identical in all levels of the pyramid. Fast, O(N), algorithms exist for computing a scale invariant image pyramid, in which the image or signal is repeatedly smoothed then subsampled. Values for scale space between pyramid samples can easily be estimated using interpolation within and between scales, allowing for scale and position estimates with sub-resolution accuracy.
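The smooth-then-subsample recursion can be sketched as follows. The five-tap binomial filter and the subsampling factor of 2 are assumptions chosen for illustration, and the names `smooth` and `build_pyramid` are hypothetical. Because each level has a quarter of the pixels of the previous one, the total work is bounded by a geometric series, which is where the O(N) cost comes from.

```python
import numpy as np

# Five-tap binomial filter, a common discrete approximation of a Gaussian.
BINOMIAL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def smooth(img):
    """Separable binomial smoothing with reflecting borders."""
    out = img
    r = len(BINOMIAL) // 2
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, BINOMIAL, mode="valid"), axis, padded)
    return out

def build_pyramid(img, levels):
    """Repeatedly smooth and keep every second sample along each axis."""
    pyramid = [img.astype(float)]
    for _ in range(levels - 1):
        pyramid.append(smooth(pyramid[-1])[::2, ::2])
    return pyramid

img = np.random.default_rng(0).random((64, 64))   # synthetic test image
pyramid = build_pyramid(img, 4)
shapes = [p.shape for p in pyramid]
total_pixels = sum(p.size for p in pyramid)
print(shapes, total_pixels)
```

The total number of stored pixels stays below 4/3 of the original image size, regardless of the number of levels.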

In a scale-space representation, the existence of a continuous scale parameter makes it possible to track zero crossings over scales leading to so-called deep structure. For features defined as zero-crossings of differential invariants, the implicit function theorem directly defines trajectories across scales, and at those scales where bifurcations occur, the local behaviour can be modelled by singularity theory.
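A one-dimensional toy example gives a feel for how zero crossings behave across scales. The sketch below smooths a periodic signal in the Fourier domain, where Gaussian smoothing with variance $t$ multiplies each frequency $\omega$ by $e^{-\omega ^{2}t/2}$, and counts sign changes at a few scales. The test signal, the list of scales and the function names are assumptions of this example; it only counts crossings and does not link them into trajectories.

```python
import numpy as np

def smooth_periodic(signal, t):
    """Gaussian smoothing with variance t, applied in the Fourier domain."""
    omega = 2.0 * np.pi * np.fft.fftfreq(signal.size)
    spectrum = np.fft.fft(signal) * np.exp(-0.5 * t * omega**2)
    return np.fft.ifft(spectrum).real

def count_zero_crossings(signal):
    """Number of sign changes between neighbouring samples (periodic)."""
    s = np.sign(signal)
    return int(np.count_nonzero(s != np.roll(s, 1)))

# Slow oscillation (period 64) plus a weaker fast one (period 8).
x = np.arange(256) + 0.5
f = np.sin(2.0 * np.pi * x / 64.0) + 0.5 * np.sin(2.0 * np.pi * x / 8.0)

scales = [0.0, 1.0, 4.0, 16.0]
counts = [count_zero_crossings(smooth_periodic(f, t)) for t in scales]
print(counts)
```

As the fast component is suppressed, the extra crossings it caused merge and disappear, leaving only the crossings of the slow oscillation at the coarsest scale.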

Extensions of linear scale-space theory concern the formulation of non-linear scale-space concepts more committed to specific purposes. These non-linear scale-spaces often start from the equivalent diffusion formulation of the scale-space concept, which is subsequently extended in a non-linear fashion. A large number of evolution equations have been formulated in this way, motivated by different specific requirements (see the abovementioned book references for further information). It should be noted, however, that not all of these non-linear scale-spaces satisfy similar "nice" theoretical requirements as the linear Gaussian scale-space concept. Hence, unexpected artifacts may sometimes occur and one should be very careful not to use the term "scale-space" for just any type of one-parameter family of images.

A first-order extension of the isotropic Gaussian scale space is provided by the affine (Gaussian) scale space. One motivation for this extension originates from the common need for computing image descriptors for real-world objects that are viewed under a perspective camera model. To handle such non-linear deformations locally, partial invariance (or more correctly covariance) to local affine deformations can be achieved by considering affine Gaussian kernels with their shapes determined by the local image structure; see the article on affine shape adaptation for theory and algorithms. Indeed, this affine scale space can also be expressed from a non-isotropic extension of the linear (isotropic) diffusion equation, while still being within the class of linear partial differential equations.

There exists a more general extension of the Gaussian scale-space model to affine and spatio-temporal scale-spaces. In addition to variabilities over scale, which the original scale-space theory was designed to handle, this generalized scale-space theory also comprises other types of variabilities caused by geometric transformations in the image formation process, including variations in viewing direction approximated by local affine transformations, and relative motions between objects in the world and the observer, approximated by local Galilean transformations. This generalized scale-space theory leads to predictions about receptive field profiles in good qualitative agreement with receptive field profiles measured by cell recordings in biological vision.

There are strong relations between scale-space theory and wavelet theory, although these two notions of multi-scale representation have been developed from somewhat different premises. There has also been work on other multi-scale approaches, such as pyramids and a variety of other kernels, that do not satisfy or require the same conditions as true scale-space descriptions do.

## Relations to biological vision and hearing

There are interesting relations between scale-space representation and biological vision and hearing. Neurophysiological studies of biological vision have shown that there are receptive field profiles in the mammalian retina and visual cortex that can be well modelled by linear Gaussian derivative operators, in some cases also complemented by a non-isotropic affine scale-space model, a spatio-temporal scale-space model and/or non-linear combinations of such linear operators.

Regarding biological hearing, there are receptive field profiles in the inferior colliculus and the primary auditory cortex that can be well modelled by spectro-temporal receptive fields, which in turn can be well modelled by Gaussian derivatives over logarithmic frequencies and windowed Fourier transforms over time, with the window functions being temporal scale-space kernels.

## Implementation issues

When implementing scale-space smoothing in practice there are a number of different approaches that can be taken in terms of continuous or discrete Gaussian smoothing, implementation in the Fourier domain, in terms of pyramids based on binomial filters that approximate the Gaussian or using recursive filters. More details about this are given in a separate article on scale space implementation.
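As one concrete instance of the binomial approach mentioned above, repeated convolution with the three-tap filter `[1, 2, 1] / 4` yields binomial kernels whose variance grows by 1/2 per pass, so `n` passes approximate a Gaussian with $t=n/2$. The sketch below compares such a kernel with a sampled Gaussian of the same variance; the function name `binomial_kernel` and the choice `n = 8` are illustrative assumptions.

```python
import numpy as np

def binomial_kernel(n):
    """n-fold self-convolution of [1, 2, 1] / 4; its variance is n / 2."""
    k = np.array([1.0])
    for _ in range(n):
        k = np.convolve(k, [0.25, 0.5, 0.25])
    return k

n = 8
k = binomial_kernel(n)
x = np.arange(k.size, dtype=float) - k.size // 2

# Variance of the binomial kernel (its mean is zero by symmetry).
variance = float((k * x**2).sum())

# Sampled Gaussian with the same variance, for comparison.
gauss = np.exp(-x**2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)
max_abs_diff = float(np.max(np.abs(k - gauss)))
print(variance, max_abs_diff)
```

The agreement improves as `n` grows, which is why cascaded binomial filters are a popular building block for pyramids.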

## References

1. Ijima, T. "Basic theory on normalization of pattern (in case of typical one-dimensional pattern)". Bull. Electrotech. Lab. 26, 368–388, 1962. (in Japanese)
2. Koenderink, Jan "The structure of images", Biological Cybernetics, 50:363–370, 1984.
3. Lindeberg, T., Scale-Space Theory in Computer Vision, Kluwer Academic Publishers, 1994, ISBN 0-7923-9418-6.
4. T. Lindeberg (1994). "Scale-space theory: A basic tool for analysing structures at different scales". Journal of Applied Statistics (Supplement on Advances in Applied Statistics: Statistics and Images: 2). 21 (2). pp. 224–270. doi:10.1080/757582976.
5. Florack, Luc, Image Structure, Kluwer Academic Publishers, 1997.
6. ter Haar Romeny, Bart M. (2008). Front-End Vision and Multi-Scale Image Analysis: Multi-scale Computer Vision Theory and Applications, written in Mathematica. Springer Science & Business Media. ISBN 978-1-4020-8840-7.
7. Lindeberg, Tony (2008). "Scale-space". In Benjamin Wah (ed.). Encyclopedia of Computer Science and Engineering. IV. John Wiley and Sons. pp. 2495–2504. doi:10.1002/9780470050118.ecse609. ISBN 978-0470050118.
8. M. Felsberg and G. Sommer "The Monogenic Scale-Space: A Unifying Approach to Phase-Based Image Processing in Scale Space", Journal of Mathematical Imaging and Vision, 21(1): 5–28, 2004.
9. R. Duits, L. Florack, J. de Graaf and B. ter Haar Romeny "On the Axioms of Scale Space Theory", Journal of Mathematical Imaging and Vision, 20(3): 267–298, 2004.
10. Burt, Peter and Adelson, Ted, "The Laplacian Pyramid as a Compact Image Code", IEEE Trans. Communications, 9:4, 532–540, 1983.
11. Koenderink, Jan and van Doorn, A. J. (1986), ‘Dynamic shape’, Biological Cybernetics 53, 383–396.
12. Damon, J. (1995), ‘Local Morse theory for solutions to the heat equation and Gaussian blurring’, Journal of Differential Equations 115(2), 386–401.
13. ter Haar Romeny, Bart M. (Editor), Geometry-Driven Diffusion in Computer Vision, Kluwer Academic Publishers, 1994.
14. Young, R. A. "The Gaussian derivative model for spatial vision: Retinal mechanisms", Spatial Vision, 2:273–293, 1987.