Foreground detection is one of the major tasks in the field of computer vision and image processing; its aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing (object recognition etc.).
Many applications do not need to know everything about the evolution of movement in a video sequence, but only require information about changes in the scene, because an image's regions of interest are the objects (humans, cars, text etc.) in its foreground. After a stage of image preprocessing (which may include image denoising and morphological filtering), object localisation is required, and it may make use of this technique.
Foreground detection separates foreground from background based on these changes taking place in the foreground. It is a set of techniques that typically analyze video sequences recorded in real time with a stationary camera.
All detection techniques are based on modelling the background of the image, i.e. setting a model of the background and detecting which changes occur. Defining the background can be very difficult when it contains shapes, shadows, and moving objects. In defining the background, it is assumed that stationary objects may vary in color and intensity over time.
Scenarios where these techniques apply tend to be very diverse. Sequences can be highly variable, with very different lighting, interior and exterior settings, image quality, and noise. In addition to processing in real time, systems need to be able to adapt to these changes.
A very good foreground detection system should be able to develop a background model that is as accurate as possible, adapt to sudden and gradual illumination changes, cope with repetitive motion and long-term scene changes, and keep up with real-time processing.
Background subtraction is a widely used approach for detecting moving objects in videos from static cameras. The rationale of the approach is to detect moving objects from the difference between the current frame and a reference frame, often called the "background image" or "background model". Background subtraction is mostly done when the image in question is part of a video stream. Background subtraction provides important cues for numerous applications in computer vision, for example surveillance tracking or human pose estimation.
Background subtraction is generally based on a static-background hypothesis which is often not applicable in real environments. In indoor scenes, reflections or animated images on screens lead to background changes. Similarly, due to wind, rain or illumination changes brought by the weather, static-background methods have difficulties with outdoor scenes. [1]
The temporal average filter is a method proposed by Velastin. This system estimates the background model from the median of all pixels over a number of previous images. The system uses a buffer with the pixel values of the last frames to update the median for each image.
To model the background, the system examines all images in a given time period called the training time. During this period, the system simply accumulates frames and computes the median, pixel by pixel, of all the frames in the buffer.
After the training period, each pixel of every new frame is compared with the previously computed background value. If the input pixel is within a threshold of the background, the pixel is considered to match the background model and its value is included in the buffer. Otherwise, if the value is outside this threshold, the pixel is classified as foreground and is not included in the buffer.
This method cannot be considered very efficient because it has no rigorous statistical basis and requires a buffer, which incurs a high computational cost.
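As a rough illustration of the temporal median method above, here is a minimal Python/NumPy sketch for grayscale frames. The function names, the selective buffer update, and the threshold value of 30 are illustrative choices, not part of the original method.

```python
import numpy as np

def median_background(buffer):
    """Background model: per-pixel median of the buffered frames."""
    return np.median(buffer, axis=0)

def process_frame(buffer, frame, threshold=30):
    """Classify pixels against the median background, then update the buffer.
    Only pixels matching the background are pushed into the buffer."""
    background = median_background(buffer)
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    foreground_mask = diff > threshold
    # Selective update: foreground pixels keep their previous buffered value.
    newest = np.where(foreground_mask, buffer[-1], frame)
    buffer = np.concatenate([buffer[1:], newest[np.newaxis]], axis=0)
    return foreground_mask, buffer
```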
A robust background subtraction algorithm should be able to handle lighting changes, repetitive motions from clutter, and long-term scene changes. [2] The following analyses make use of a function V(x,y,t), a video sequence where t is the time dimension and x and y are the pixel location variables, e.g. V(1,2,3) is the pixel intensity at pixel location (1,2) of the image at t = 3 in the video sequence.
A motion detection algorithm begins with the segmentation step, in which foreground or moving objects are segmented from the background. The simplest way to implement this is to take an image as background and compare it with the frame obtained at time t, denoted by I(t). Using simple arithmetic, we can segment out the objects with the image subtraction technique of computer vision: for each pixel in I(t), take the pixel value denoted by P[I(t)] and subtract from it the corresponding pixel at the same position in the background image, denoted by P[B].
In mathematical notation, this is written as:

$$P[F(t)] = P[I(t)] - P[B]$$
The background is assumed to be the frame at time t. This difference image would only show some intensity for the pixel locations which have changed between the two frames. Though we have seemingly removed the background, this approach will only work for cases where all foreground pixels are moving and all background pixels are static. [2] A threshold "Threshold" is put on this difference image to improve the subtraction (see Image thresholding):

$$|P[F(t)] - P[F(t+1)]| > \text{Threshold}$$
This means that the difference image's pixel intensities are 'thresholded' or filtered on the basis of the value of Threshold. [3] The accuracy of this approach depends on the speed of movement in the scene; faster movements may require higher thresholds.
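A minimal sketch of this differencing-and-thresholding step, assuming grayscale frames stored as 8-bit NumPy arrays (the default threshold of 25 is illustrative):

```python
import numpy as np

def frame_difference(current, previous, threshold=25):
    """Binary foreground mask: a pixel is foreground where the absolute
    difference between the two frames exceeds the threshold."""
    diff = np.abs(current.astype(np.int16) - previous.astype(np.int16))
    return (diff > threshold).astype(np.uint8) * 255
```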
For calculating the image containing only the background, a series of preceding images are averaged. For calculating the background image at the instant t:

$$B(x,y,t) = \frac{1}{N} \sum_{i=1}^{N} V(x,y,t-i)$$
where N is the number of preceding images taken for averaging. This averaging refers to averaging the corresponding pixels in the given images. N depends on the video speed (number of images per second in the video) and the amount of movement in the video. [4] After calculating the background B(x,y,t), we can then subtract it from the image V(x,y,t) at time t and threshold it. Thus the foreground is:

$$|V(x,y,t) - B(x,y,t)| > \text{Th}$$
where Th is a threshold value. Similarly, we can also use median instead of mean in the above calculation of B(x,y,t).
Usage of global and time-independent thresholds (same Th value for all pixels in the image) may limit the accuracy of the above two approaches. [2]
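A short sketch of the mean-filter and median-filter approaches just described, assuming a list of preceding grayscale NumPy frames; the function name and the default threshold are illustrative:

```python
import numpy as np

def average_foreground(preceding, current, th=25, use_median=False):
    """Estimate B(x,y,t) as the per-pixel mean (or median) of the N
    preceding frames, then return the mask |V(x,y,t) - B(x,y,t)| > Th."""
    stack = np.stack(preceding).astype(np.float32)  # shape (N, H, W)
    background = np.median(stack, axis=0) if use_median else stack.mean(axis=0)
    return np.abs(current.astype(np.float32) - background) > th
```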
For this method, Wren et al. [5] propose fitting a Gaussian probability density function (pdf) over the most recent frames. To avoid fitting the pdf from scratch at each new frame time t, a running (or online cumulative) average is computed.
The pdf of every pixel is characterized by a mean μ_t and a variance σ_t². The following is a possible initial condition (assuming that initially every pixel is background):

$$\mu_0 = I_0, \qquad \sigma_0^2 = \text{some default value}$$
where I_t is the value of the pixel's intensity at time t. In order to initialize the variance, we can, for example, use the variance in x and y from a small window around each pixel.
Note that the background may change over time (e.g. due to illumination changes or non-static background objects). To accommodate that change, at every frame t, every pixel's mean and variance must be updated as follows:

$$\mu_t = \rho I_t + (1-\rho)\mu_{t-1}$$
$$\sigma_t^2 = d^2\rho + (1-\rho)\sigma_{t-1}^2$$
$$d = |I_t - \mu_t|$$
where ρ determines the size of the temporal window that is used to fit the pdf (usually ρ = 0.01) and d is the Euclidean distance between the mean and the value of the pixel.
We can now classify a pixel as background if its current intensity lies within some confidence interval of its distribution's mean:

$$\frac{|I_t - \mu_t|}{\sigma_t} > k \rightarrow \text{foreground}, \qquad \frac{|I_t - \mu_t|}{\sigma_t} \leq k \rightarrow \text{background}$$
where the parameter k is a free threshold (usually k = 2.5). A larger value of k allows for a more dynamic background, while a smaller k increases the probability of a transition from background to foreground due to more subtle changes.
In a variant of the method, a pixel's distribution is only updated if it is classified as background. This is to prevent newly introduced foreground objects from fading into the background. The update formula for the mean is changed accordingly:

$$\mu_t = M\mu_{t-1} + (1-M)\left(\rho I_t + (1-\rho)\mu_{t-1}\right)$$
where M = 1 when I_t is considered foreground and M = 0 otherwise. So when M = 1, that is, when the pixel is detected as foreground, the mean stays the same. As a result, a pixel, once it has become foreground, can only become background again when its intensity gets close to what it was before turning foreground. This method, however, has several issues: it only works if all pixels are initially background pixels (or if foreground pixels are annotated as such). Also, it cannot cope with gradual background changes: if a pixel is categorized as foreground for too long a period of time, the background intensity at that location may have changed (because the illumination has changed, etc.). As a result, once the foreground object is gone, the new background intensity might no longer be recognized as such.
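A minimal sketch of the running Gaussian average with this selective-update variant, for grayscale frames. The values ρ = 0.01 and k = 2.5 are the typical values quoted above; the variance initialization constant is illustrative.

```python
import numpy as np

RHO = 0.01  # temporal window parameter rho (typical value from the text)
K = 2.5     # confidence threshold k (typical value from the text)

def step(frame, mean, var, selective=True):
    """One frame of the running Gaussian average. A pixel is foreground when
    |I_t - mu| > k * sigma; with selective=True only background pixels
    (M = 0) update their mean and variance."""
    frame = frame.astype(np.float32)
    d = np.abs(frame - mean)
    foreground = d > K * np.sqrt(var)
    update = ~foreground if selective else np.ones_like(foreground)
    mean = np.where(update, RHO * frame + (1 - RHO) * mean, mean)
    var = np.where(update, RHO * d ** 2 + (1 - RHO) * var, var)
    return foreground, mean, var

# Initialization, assuming the first frame is all background:
# mean = first_frame.astype(np.float32)
# var = np.full_like(mean, 50.0)  # illustrative default variance
```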
The mixture of Gaussians method models each pixel as a mixture of Gaussians and uses an online approximation to update the model. In this technique, it is assumed that every pixel's intensity values in the video can be modeled using a Gaussian mixture model. [6] A simple heuristic determines which intensities most probably belong to the background. The pixels which do not match these are called the foreground pixels. Foreground pixels are grouped using 2D connected component analysis. [6]
At any time t, a particular pixel (x₀, y₀)'s history is:

$$X_1, \ldots, X_t = \{ V(x_0, y_0, i) : 1 \leq i \leq t \}$$
This history is modeled by a mixture of K Gaussian distributions. Each pixel is characterized by its intensity in RGB color space, and in the multidimensional case the probability of observing the current pixel is given by:

$$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta(X_t, \mu_{i,t}, \Sigma_{i,t})$$

$$\eta(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{3/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X_t - \mu)^T \Sigma^{-1} (X_t - \mu)\right)$$

where K is the number of distributions, ω_{i,t} is the weight associated with the ith Gaussian at time t, and μ_{i,t} and Σ_{i,t} are the mean and covariance matrix of that Gaussian, respectively.
Once the parameter initialization is done, a first foreground detection can be made and then the parameters are updated. The first B Gaussian distributions whose cumulative weight exceeds the threshold T are retained as the background distribution:

$$B = \operatorname{argmin}_b \left( \sum_{i=1}^{b} \omega_{i,t} > T \right)$$
The other distributions are considered to represent a foreground distribution. Then, when a new frame arrives at time t + 1, a match test is made for each pixel: a pixel matches a Gaussian distribution if the Mahalanobis distance satisfies

$$\sqrt{(X_{t+1} - \mu_{i,t})^T \Sigma_{i,t}^{-1} (X_{t+1} - \mu_{i,t})} < k\,\sigma_{i,t}$$
where k is a constant threshold equal to 2.5. Then, two cases can occur:
Case 1: A match is found with one of the K Gaussians. For the matched component i, the update is done as follows: [7]

$$\omega_{i,t+1} = (1-\alpha)\omega_{i,t} + \alpha$$
Power and Schoonees [3] used the same algorithm to segment the foreground of the image:

$$\mu_{i,t+1} = (1-\rho)\mu_{i,t} + \rho X_{t+1}$$
$$\sigma_{i,t+1}^2 = (1-\rho)\sigma_{i,t}^2 + \rho (X_{t+1} - \mu_{i,t+1})^T (X_{t+1} - \mu_{i,t+1})$$
The essential approximation to ρ is given by: [8]

$$\rho = \frac{\alpha}{\omega_{i,t+1}}$$
Case 2: No match is found with any of the K Gaussians. In this case, the least probable distribution k is replaced with a new one with parameters:

$$\omega_{k,t+1} = \text{low prior weight}$$
$$\mu_{k,t+1} = X_{t+1}$$
$$\sigma_{k,t+1}^2 = \text{large initial variance}$$
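For illustration only, here is a simplified single-pixel, grayscale sketch of this match-and-update loop. The learning rate α, the low prior weight and the large initial variance are illustrative constants, and a real implementation would vectorize this over all pixels.

```python
import numpy as np

def mog_update_pixel(x, w, mu, var, k=2.5, alpha=0.005, w0=0.05, var0=900.0):
    """One Stauffer-Grimson-style update for a single grayscale pixel.
    x: new intensity; w, mu, var: length-K arrays of weights, means, variances."""
    d = np.abs(x - mu) / np.sqrt(var)  # Mahalanobis distance in the 1-D case
    if (d < k).any():
        # Case 1: update the closest matching component.
        i = int(np.argmin(d))
        w = (1 - alpha) * w
        w[i] += alpha                  # w_{i,t+1} = (1 - alpha) w_{i,t} + alpha
        rho = alpha / w[i]             # the approximation rho = alpha / w_{i,t+1}
        mu[i] = (1 - rho) * mu[i] + rho * x
        var[i] = (1 - rho) * var[i] + rho * (x - mu[i]) ** 2
    else:
        # Case 2: replace the least probable component.
        i = int(np.argmin(w))
        w[i], mu[i], var[i] = w0, x, var0
    return w / w.sum(), mu, var        # keep the weights normalized
```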
Once the parameter maintenance is done, foreground detection can be performed, and so on. An online K-means approximation is used to update the Gaussians. Numerous improvements of this original method developed by Stauffer and Grimson [6] have been proposed, and a complete survey can be found in Bouwmans et al. [7] A standard method of adaptive backgrounding is averaging the images over time, creating a background approximation which is similar to the current static scene except where motion occurs.
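In practice, one rarely implements this from scratch: OpenCV, for example, ships a ready-made Gaussian-mixture background subtractor (MOG2, an improved variant of the Stauffer-Grimson approach). A minimal usage sketch, where the video path is a placeholder:

```python
import cv2

cap = cv2.VideoCapture("video.mp4")  # placeholder path
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)    # 255 = foreground, 127 = shadow
    cv2.imshow("foreground mask", mask)
    if cv2.waitKey(30) & 0xFF == 27:  # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```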
Several surveys which concern categories or sub-categories of models can be found as follows:
For more details, please see [19].
Several comparison/evaluation papers can be found in the literature:
The Background Subtraction Website (T. Bouwmans, Univ. La Rochelle, France) contains a comprehensive list of the references in the field, and links to available datasets and software.
The BackgroundSubtractorCNT library implements a very fast and high-quality algorithm written in C++ based on OpenCV. It is targeted at low-spec hardware but works just as fast on modern Linux and Windows. (For more information: https://github.com/sagi-z/BackgroundSubtractorCNT)
The BGS Library (A. Sobral, Univ. La Rochelle, France) provides a C++ framework to perform background subtraction algorithms. The code works either on Windows or on Linux. Currently the library offers more than 30 BGS algorithms. (For more information: https://github.com/andrewssobral/bgslibrary)