Computer stereo vision

Last updated

Computer stereo vision is the extraction of 3D information from digital images, such as those obtained by a CCD camera. By comparing information about a scene from two vantage points, 3D information can be extracted by examining the relative positions of objects in the two panels. This is similar to the biological process of stereopsis.

Contents

Outline

In traditional stereo vision, two cameras, displaced horizontally from one another, are used to obtain two differing views on a scene, in a manner similar to human binocular vision. By comparing these two images, the relative depth information can be obtained in the form of a disparity map, which encodes the difference in horizontal coordinates of corresponding image points. The values in this disparity map are inversely proportional to the scene depth at the corresponding pixel location.

For a human to compare the two images, they must be superimposed in a stereoscopic device, with the image from the right camera being shown to the observer's right eye and from the left one to the left eye.

In a computer vision system, several pre-processing steps are required. [1]

  1. The image must first be undistorted, such that barrel distortion and tangential distortion are removed. This ensures that the observed image matches the projection of an ideal pinhole camera.
  2. The image must be projected back to a common plane to allow comparison of the image pairs, known as image rectification.
  3. An information measure which compares the two images is minimized. This gives the best estimate of the position of features in the two images, and creates a disparity map.
  4. Optionally, the received disparity map is projected into a 3d point cloud. By utilising the cameras' projective parameters, the point cloud can be computed such that it provides measurements at a known scale.

Active stereo vision

The active stereo vision is a form of stereo vision which actively employs a light such as a laser or a structured light to simplify the stereo matching problem. The opposed term is passive stereo vision.

Conventional structured-light vision (SLV)

The conventional structured-light vision (SLV) employs a structured light or laser, and finds projector-camera correspondences. [2] [3]

Conventional active stereo vision (ASV)

The conventional active stereo vision (ASV) employs a structured light or laser, however, the stereo matching is performed only for camera-camera correspondences, in the same way as the passive stereo vision.

Structured-light stereo (SLS)

There is a hybrid technique, which utilizes both camera-camera and projector-camera correspondences. [4]

Applications

3D stereo displays find many applications in entertainment, information transfer and automated systems. Stereo vision is highly important in fields such as robotics to extract information about the relative position of 3D objects in the vicinity of autonomous systems. Other applications for robotics include object recognition, [5] where depth information allows for the system to separate occluding image components, such as one chair in front of another, which the robot may otherwise not be able to distinguish as a separate object by any other criteria.

Scientific applications for digital stereo vision include the extraction of information from aerial surveys, for calculation of contour maps or even geometry extraction for 3D building mapping, photogrammetric satellite mapping, or calculation of 3D heliographical information such as obtained by the NASA STEREO project.

Detailed definition

Diagram describing relationship of image displacement to depth with stereoscopic images, assuming flat co-planar images Stereoscopic images, depth to displacement relationship assuming flat co-planar images..png
Diagram describing relationship of image displacement to depth with stereoscopic images, assuming flat co-planar images

A pixel records color at a position. The position is identified by position in the grid of pixels (x, y) and depth to the pixel z.

Stereoscopic vision gives two images of the same scene, from different positions. In the adjacent diagram light from the point A is transmitted through the entry points of pinhole cameras at B and D, onto image screens at E and H.

In the attached diagram the distance between the centers of the two camera lens is BD = BC + CD. The triangles are similar,

So assuming the cameras are level, and image planes are flat on the same plane, the displacement in the y axis between the same pixel in the two images is,

Where k is the distance between the two cameras times the distance from the lens to the image.

The depth component in the two images are and , given by,

These formulas allow for the occlusion of voxels, seen in one image on the surface of the object, by closer voxels seen in the other image, on the surface of the object.

Image rectification

Where the image planes are not co-planar, image rectification is required to adjust the images as if they were co-planar. This may be achieved by a linear transformation.

The images may also need rectification to make each image equivalent to the image taken from a pinhole camera projecting to a flat plane.

Smoothness

Smoothness is a measure of the similarity of colors. Given the assumption that a distinct object has a small number of colors, similarly-colored pixels are more likely to belong to a single object than to multiple objects.

The method described above for evaluating smoothness is based on information theory, and an assumption that the influence of the color of a voxel influences the color of nearby voxels according to the normal distribution on the distance between points. The model is based on approximate assumptions about the world.

Another method based on prior assumptions of smoothness is auto-correlation.

Smoothness is a property of the world rather than an intrinsic property of an image. An image comprising random dots would have no smoothness, and inferences about neighboring points would be useless.

In principle, smoothness, as with other properties of the world, should be learned. This appears to be what the human vision system does.[ citation needed ]

Information measure

Least squares information measure

The normal distribution is

Probability is related to information content described by message length L,

so,

For the purposes of comparing stereoscopic images, only the relative message length matters. Based on this, the information measure I, called the Sum of Squares of Differences (SSD) is,

where,

Because of the cost in processing time of squaring numbers in SSD, many implementations use Sum of Absolute Difference (SAD) as the basis for computing the information measure. Other methods use normalized cross correlation (NCC).

Information measure for stereoscopic images

The least squares measure may be used to measure the information content of the stereoscopic images, [6] given depths at each point . Firstly the information needed to express one image in terms of the other is derived. This is called .

A color difference function should be used to fairly measure the difference between colors. The color difference function is written cd in the following. The measure of the information needed to record the color matching between the two images is,

An assumption is made about the smoothness of the image. Assume that two pixels are more likely to be the same color, the closer the voxels they represent are. This measure is intended to favor colors that are similar being grouped at the same depth. For example, if an object in front occludes an area of sky behind, the measure of smoothness favors the blue pixels all being grouped together at the same depth.

The total measure of smoothness uses the distance between voxels as an estimate of the expected standard deviation of the color difference,

The total information content is then the sum,

The z component of each pixel must be chosen to give the minimum value for the information content. This will give the most likely depths at each pixel. The minimum total information measure is,

The depth functions for the left and right images are the pair,

Methods of implementation

The minimization problem is NP-complete. This means a general solution to this problem will take a long time to reach. However methods exist for computers based on heuristics that approximate the result in a reasonable amount of time. Also methods exist based on neural networks. [7] Efficient implementation of stereoscopic vision is an area of active research.

See also

Related Research Articles

In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes a particular aspect of a probability distribution. There are different ways to quantify kurtosis for a theoretical distribution, and there are corresponding ways of estimating it using a sample from a population. Different measures of kurtosis may have different interpretations.

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

<span class="mw-page-title-main">Variance</span> Statistical measure of how far values spread from their average

In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. It is the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by , , , , or .

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

<span class="mw-page-title-main">Log-normal distribution</span> Probability distribution

In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, X = exp(Y), has a log-normal distribution. A random variable that is log-normally distributed takes only positive real values. It is a convenient and useful model for measurements in the natural sciences, engineering, as well as medicine, economics and other fields. It can be applied to diverse quantities such as energies, concentrations, lengths, prices of financial instruments, and other metrics, while acknowledging the inherent uncertainty in all measurements.

In probability theory, Chebyshev's inequality provides an upper bound on the probability of deviation of a random variable from its mean. More specifically, the probability that a random variable deviates from its mean by more than is at most , where is any positive constant and is the standard deviation.

<span class="mw-page-title-main">Covariance matrix</span> Measure of covariance of components of a random vector

In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form

<span class="mw-page-title-main">Rayleigh distribution</span> Probability distribution

In probability theory and statistics, the Rayleigh distribution is a continuous probability distribution for nonnegative-valued random variables. Up to rescaling, it coincides with the chi distribution with two degrees of freedom. The distribution is named after Lord Rayleigh.

Similarity may refer to:

<span class="mw-page-title-main">Cross-correlation</span> Covariance and correlation

In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.

The structural similarityindex measure (SSIM) is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. It is also used for measuring the similarity between two images. The SSIM index is a full reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference.

In mathematics, a π-system on a set is a collection of certain subsets of such that

<span class="mw-page-title-main">Optical transfer function</span> Function that specifies how different spatial frequencies are captured by an optical system

The optical transfer function (OTF) of an optical system such as a camera, microscope, human eye, or projector specifies how different spatial frequencies are captured or transmitted. It is used by optical engineers to describe how the optics project light from the object or scene onto a photographic film, detector array, retina, screen, or simply the next item in the optical transmission chain. A variant, the modulation transfer function (MTF), neglects phase effects, but is equivalent to the OTF in many situations.

Expected shortfall (ES) is a risk measure—a concept used in the field of financial risk measurement to evaluate the market risk or credit risk of a portfolio. The "expected shortfall at q% level" is the expected return on the portfolio in the worst of cases. ES is an alternative to value at risk that is more sensitive to the shape of the tail of the loss distribution.

<span class="mw-page-title-main">Active contour model</span>

Active contour model, also called snakes, is a framework in computer vision introduced by Michael Kass, Andrew Witkin, and Demetri Terzopoulos for delineating an object outline from a possibly noisy 2D image. The snakes model is popular in computer vision, and snakes are widely used in applications like object tracking, shape recognition, segmentation, edge detection and stereo matching.

In the fields of computer vision and image analysis, the Harris affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points or interest points so to make correspondences between images, recognize textures, categorize objects or build panoramas.

<span class="mw-page-title-main">3D reconstruction</span> Process of capturing the shape and appearance of real objects

In computer vision and computer graphics, 3D reconstruction is the process of capturing the shape and appearance of real objects. This process can be accomplished either by active or passive methods. If the model is allowed to change its shape in time, this is referred to as non-rigid or spatio-temporal reconstruction.

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.

The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN). Unlike the earlier inception score (IS), which evaluates only the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images. The FID metric does not completely replace the IS metric. Classifiers that achieve the best (lowest) FID score tend to have greater sample variety while classifiers achieving the best (highest) IS score tend to have better quality within individual images.

References

  1. Bradski, Gary; Kaehler, Adrian. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly.
  2. Je, Changsoo; Lee, Sang Wook; Park, Rae-Hong (2004). "High-Contrast Color-Stripe Pattern for Rapid Structured-Light Range Imaging". Computer Vision - ECCV 2004. Lecture Notes in Computer Science. Vol. 3021. pp. 95–107. arXiv: 1508.04981 . doi:10.1007/978-3-540-24670-1_8. ISBN   978-3-540-21984-2. S2CID   13277591.
  3. Je, Changsoo; Lee, Sang Wook; Park, Rae-Hong (2012). "Colour-stripe permutation pattern for rapid structured-light range imaging". Optics Communications. 285 (9): 2320–2331. Bibcode:2012OptCo.285.2320J. doi:10.1016/j.optcom.2012.01.025.
  4. Jang, Wonkwi; Je, Changsoo; Seo, Yongduek; Lee, Sang Wook (2013). "Structured-light stereo: Comparative analysis and integration of structured-light and active stereo for measuring dynamic shape". Optics and Lasers in Engineering. 51 (11): 1255–1264. Bibcode:2013OptLE..51.1255J. doi:10.1016/j.optlaseng.2013.05.001.
  5. Sumi, Yasushi; Kawai, Yoshihiro; Yoshimi, Takashi; Tomita, Fumiaki (2002). "3D Object Recognition in Cluttered Environments by Segment-Based Stereo Vision". International Journal of Computer Vision. 46 (1): 5–23. doi:10.1023/A:1013240031067. S2CID   22926546.
  6. Lazaros, Nalpantidis; Sirakoulis, Georgios Christou; Gasteratos1, Antonios (2008). "Review of Stereo Vision Algorithms: From Software to Hardware". International Journal of Optomechatronics. 2 (4): 435–462. doi: 10.1080/15599610802438680 . S2CID   18115413.{{cite journal}}: CS1 maint: numeric names: authors list (link)
  7. WANG, JUNG-HUA; HSIAO, CHIH-PING (1999). "On disparity matching in stereo vision via a neural network framework". Proc. Natl. Sci. Counc. ROC(A). 23 (5): 665–678. CiteSeerX   10.1.1.105.9067 .