Triangulation (computer vision)

Last updated August 20, 2024

In computer vision, triangulation refers to the process of determining a point in 3D space given its projections onto two, or more, images. In order to solve this problem it is necessary to know the parameters of the camera projection function from 3D to 2D for the cameras involved, in the simplest case represented by the camera matrices. Triangulation is sometimes also referred to as reconstruction or intersection.

The triangulation problem is in principle trivial. Each point in an image corresponds to a line in 3D space: all 3D points on that line are projected to the same image point. If a pair of corresponding points in two or more images can be found, they must be the projections of a common 3D point $\mathbf {x}$. The lines generated by the image points must therefore intersect at $\mathbf {x}$, and the coordinates of $\mathbf {x}$ can be computed algebraically in a variety of ways, as presented below.

In practice, however, the coordinates of image points cannot be measured with arbitrary accuracy. Instead, various types of noise, such as geometric noise from lens distortion or interest point detection error, lead to inaccuracies in the measured image coordinates. As a consequence, the lines generated by the corresponding image points do not always intersect in 3D space. The problem, then, is to find a 3D point which optimally fits the measured image points. In the literature there are multiple proposals for how to define optimality and how to find the optimal 3D point. Since they are based on different optimality criteria, the various methods produce different estimates of the 3D point x when noise is involved.

Introduction

In the following, it is assumed that triangulation is made on corresponding image points from two views generated by pinhole cameras.

Figure (TriangulationIdeal.svg): The ideal case of epipolar geometry. A 3D point $\mathbf {x}$ is projected onto two camera images through lines (green) which pass through each camera's focal point, $\mathbf {O} _{1}$ and $\mathbf {O} _{2}$. The resulting image points are $\mathbf {y} _{1}$ and $\mathbf {y} _{2}$. The green lines intersect at $\mathbf {x}$.

Figure (TriangulationReal.svg): In practice, the image points $\mathbf {y} _{1}$ and $\mathbf {y} _{2}$ cannot be measured with arbitrary accuracy. Instead, points $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ are detected and used for the triangulation. The corresponding projection lines (blue) do not, in general, intersect in 3D space and may also not pass through the point $\mathbf {x}$.

The first figure above illustrates the epipolar geometry of a pair of stereo cameras described by the pinhole model. A point $\mathbf {x}$ in 3D space is projected onto each image plane along a line (green) which passes through the camera's focal point, $\mathbf {O} _{1}$ or $\mathbf {O} _{2}$ , resulting in the two corresponding image points $\mathbf {y} _{1}$ and $\mathbf {y} _{2}$ . If $\mathbf {y} _{1}$ and $\mathbf {y} _{2}$ are given and the geometry of the two cameras is known, the two projection lines (green) can be determined, and they must intersect at $\mathbf {x}$ . Using basic linear algebra, that intersection point can be determined in a straightforward way.
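As a minimal sketch of this ideal-case computation, the Python code below back-projects each image point to a line and intersects the two lines. It assumes finite cameras whose matrices can be written as $[\mathbf {M} \mid \mathbf {p} _{4}]$ with an invertible $3\times 3$ block $\mathbf {M}$ . The function names `backproject` and `intersect_lines`, the example camera matrices and the example point are illustrative choices, not taken from the description above.

```python
import numpy as np

def backproject(C, y):
    """Return (O, d): camera centre and line direction for image point y.

    Assumption: C = [M | p4] is a finite camera, i.e. M is invertible.
    """
    M, p4 = C[:, :3], C[:, 3]
    O = -np.linalg.solve(M, p4)   # centre: C @ [O, 1] = 0
    d = np.linalg.solve(M, y)     # every point O + s*d projects to s*y
    return O, d

def intersect_lines(O1, d1, O2, d2):
    """Solve O1 + s*d1 = O2 + u*d2 for s and u (least squares)."""
    A = np.column_stack([d1, -d2])
    (s, u), *_ = np.linalg.lstsq(A, O2 - O1, rcond=None)
    return O1 + s * d1, O2 + u * d2

# Hypothetical example: two cameras displaced along the first axis.
C1 = np.hstack([np.eye(3), np.zeros((3, 1))])
C2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
x = np.array([0.5, 0.2, 4.0])
x_h = np.append(x, 1.0)           # homogeneous 3D point
y1, y2 = C1 @ x_h, C2 @ x_h       # noise-free image points

p1, p2 = intersect_lines(*backproject(C1, y1), *backproject(C2, y2))
```

With noise-free image points, `p1` and `p2` coincide and equal the original point. With noisy measurements they would in general differ, which is the situation discussed next.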

The second figure above shows the real case. The positions of the image points $\mathbf {y} _{1}$ and $\mathbf {y} _{2}$ cannot be measured exactly. The reason is a combination of factors such as:

Geometric distortion, for example lens distortion, which means that the 3D to 2D mapping of the camera deviates from the pinhole camera model. To some extent these errors can be compensated for, leaving a residual geometric error.
A single ray of light from $\mathbf {x}$ is dispersed in the lens system of the cameras according to a point spread function. Recovering the corresponding image point from measurements of the dispersed intensity function in the images introduces errors.
In a digital camera, the image intensity function is only measured at discrete sensor elements. Inexact interpolation of the discrete intensity function has to be used to recover the true one.
The image points $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ used for triangulation are often found using various types of feature extractors, for example detectors of corners or of interest points in general. There is an inherent localization error for any type of feature extraction based on neighborhood operations.

As a consequence, the measured image points are $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ instead of $\mathbf {y} _{1}$ and $\mathbf {y} _{2}$ . Their projection lines (blue), however, do not have to intersect in 3D space or come close to $\mathbf {x}$ . In fact, these lines intersect if and only if $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ satisfy the epipolar constraint defined by the fundamental matrix. Given the measurement noise in $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ it is rather likely that the epipolar constraint is not satisfied and the projection lines do not intersect.
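The epipolar constraint can be checked numerically. The sketch below assumes the common convention $\mathbf {y} _{2}^{T}\,\mathbf {F} \,\mathbf {y} _{1}=0$ and builds $\mathbf {F}$ from the two camera matrices using the standard construction $\mathbf {F} =[\mathbf {e} _{2}]_{\times }\,\mathbf {C} _{2}\,\mathbf {C} _{1}^{+}$ , where $\mathbf {e} _{2}$ is the image of the first camera centre in the second image. The helper names, camera matrices and the size of the perturbation are hypothetical and chosen only for illustration.

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(C1, C2):
    """F such that y2 @ F @ y1 = 0 for exact correspondences (assumed convention)."""
    _, _, Vt = np.linalg.svd(C1)
    O1 = Vt[-1]                    # homogeneous centre of camera 1 (null vector of C1)
    e2 = C2 @ O1                   # epipole in image 2
    return skew(e2) @ C2 @ np.linalg.pinv(C1)

def epipolar_residual(F, y1, y2):
    """Zero exactly when the two projection lines intersect."""
    return float(y2 @ F @ y1)

# Hypothetical example cameras and 3D point.
C1 = np.hstack([np.eye(3), np.zeros((3, 1))])
C2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
x_h = np.array([0.5, 0.2, 4.0, 1.0])
y1, y2 = C1 @ x_h, C2 @ x_h

F = fundamental_from_cameras(C1, C2)
exact = epipolar_residual(F, y1, y2)

# Perturb the second image point across its epipolar line to mimic noise.
y2_noisy = y2 + np.array([0.0, 0.05, 0.0])
noisy = epipolar_residual(F, y1, y2_noisy)
```

Here `exact` is zero up to rounding, while `noisy` is clearly non-zero, so the projection lines of the perturbed pair no longer intersect.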

This observation leads to the problem which is solved in triangulation: which 3D point $\mathbf {x} _{\rm {est}}$ is the best estimate of $\mathbf {x}$ given $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ and the geometry of the cameras? The answer is often found by defining an error measure which depends on $\mathbf {x} _{\rm {est}}$ and then minimizing this error. In the following sections, some of the various methods for computing $\mathbf {x} _{\rm {est}}$ presented in the literature are briefly described.

All triangulation methods produce $\mathbf {x} _{\rm {est}}=\mathbf {x}$ in the case that $\mathbf {y} _{1}=\mathbf {y} '_{1}$ and $\mathbf {y} _{2}=\mathbf {y} '_{2}$ , that is, when the epipolar constraint is satisfied (except for singular points, see below). The methods differ in what happens when the constraint is not satisfied.

Properties

A triangulation method can be described in terms of a function $\tau \,$ such that

\mathbf {x} \sim \tau (\mathbf {y} '_{1},\mathbf {y} '_{2},\mathbf {C} _{1},\mathbf {C} _{2})

where $\mathbf {y} '_{1},\mathbf {y} '_{2}$ are the homogeneous coordinates of the detected image points, $\mathbf {C} _{1},\mathbf {C} _{2}$ are the camera matrices, and $\mathbf {x}$ is the homogeneous representation of the resulting 3D point. The $\sim \,$ sign implies that $\tau \,$ is only required to produce a vector which is equal to $\mathbf {x}$ up to multiplication by a non-zero scalar, since homogeneous vectors are involved.

Before looking at the specific methods, that is, specific functions $\tau \,$ , there are some general concepts related to the methods that need to be explained. Which triangulation method is chosen for a particular problem depends to some extent on these characteristics.

Singularities

Some of the methods fail to correctly compute an estimate of $\mathbf {x}$ if it lies in a certain subset of the 3D space, corresponding to some combination of $\mathbf {y} '_{1},\mathbf {y} '_{2},\mathbf {C} _{1},\mathbf {C} _{2}$ . A point in this subset is then a singularity of the triangulation method. The reason for the failure can be that some equation system to be solved is under-determined or that the projective representation of $\mathbf {x} _{\rm {est}}$ becomes the zero vector for the singular points.

Invariance

In some applications, it is desirable that the triangulation is independent of the coordinate system used to represent 3D points: if the triangulation problem is formulated in one coordinate system and then transformed into another, the resulting estimate $\mathbf {x} _{\rm {est}}$ should transform in the same way. This property is commonly referred to as invariance. Not every triangulation method assures invariance, at least not for general types of coordinate transformations.

For a homogeneous representation of 3D coordinates, the most general transformation is a projective transformation, represented by a $4\times 4$ matrix $\mathbf {T}$ . If the homogeneous coordinates are transformed according to

\mathbf {\bar {x}} \sim \mathbf {T} \,\mathbf {x}

then the camera matrices $\mathbf {C} _{k}$ must transform as

\mathbf {\bar {C}} _{k}\sim \mathbf {C} _{k}\,\mathbf {T} ^{-1}

to produce the same homogeneous image coordinates $\mathbf {y} _{k}$ :

\mathbf {y} _{k}\sim \mathbf {\bar {C}} _{k}\,\mathbf {\bar {x}} =\mathbf {C} _{k}\,\mathbf {x}
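The relation above can be verified numerically. The following sketch uses an arbitrary invertible $4\times 4$ matrix and an arbitrary camera matrix; all numerical values and the helper `same_up_to_scale` are hypothetical and serve only to illustrate the meaning of the $\sim$ sign.

```python
import numpy as np

def same_up_to_scale(a, b, tol=1e-9):
    """True if the homogeneous 3-vectors a and b are parallel (a ~ b)."""
    cross = np.linalg.norm(np.cross(a, b))
    return cross <= tol * np.linalg.norm(a) * np.linalg.norm(b)

# Arbitrary example camera matrix C_k and homogeneous 3D point x.
C = np.array([[800.0, 0.0, 320.0, 10.0],
              [0.0, 800.0, 240.0, -5.0],
              [0.0, 0.0, 1.0, 0.5]])
x = np.array([0.5, 0.2, 4.0, 1.0])

# Arbitrary invertible projective transformation T.
T = np.array([[1.0, 0.2, 0.0, 0.3],
              [0.0, 1.0, 0.1, -0.2],
              [0.1, 0.0, 1.0, 0.5],
              [0.0, 0.05, 0.0, 1.0]])

x_bar = 3.0 * (T @ x)            # any non-zero scale represents the same point
C_bar = C @ np.linalg.inv(T)     # transformed camera matrix

y = C @ x
y_bar = C_bar @ x_bar            # equals 3 * y, i.e. the same image point
consistent = same_up_to_scale(y, y_bar)
```

The two image vectors differ by the scalar factor introduced in `x_bar` but represent the same image point, so `consistent` is true.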

If the triangulation function $\tau$ is invariant to $\mathbf {T}$ then the following relation must be valid

\mathbf {\bar {x}} _{\rm {est}}\sim \mathbf {T} \,\mathbf {x} _{\rm {est}}

from which follows that

\tau (\mathbf {y} '_{1},\mathbf {y} '_{2},\mathbf {C} _{1},\mathbf {C} _{2})\sim \mathbf {T} ^{-1}\,\tau (\mathbf {y} '_{1},\mathbf {y} '_{2},\mathbf {C} _{1}\,\mathbf {T} ^{-1},\mathbf {C} _{2}\,\mathbf {T} ^{-1}),

for all $\mathbf {y} '_{1},\mathbf {y} '_{2}$ .

For each triangulation method, it can be determined if this last relation is valid. If it is, it may be satisfied only for a subset of the projective transformations, for example, rigid or affine transformations.

Computational complexity

The function $\tau$ is only an abstract representation of a computation which, in practice, may be relatively complex. Some methods result in a $\tau$ which is a closed-form continuous function while others need to be decomposed into a series of computational steps involving, for example, SVD or finding the roots of a polynomial. Yet another class of methods results in $\tau$ which must rely on iterative estimation of some parameters. This means that both the computation time and the complexity of the operations involved may vary between the different methods.

Methods

Mid-point method

Each of the two image points $\mathbf {y} '_{1}$ and $\mathbf {y} '_{2}$ has a corresponding projection line (blue in the second figure above), here denoted as $\mathbf {L} '_{1}$ and $\mathbf {L} '_{2}$ , which can be determined given the camera matrices $\mathbf {C} _{1},\mathbf {C} _{2}$ . Let $d\,$ be a distance function between a 3D line $\mathbf {L}$ and a 3D point $\mathbf {x}$ such that $d(\mathbf {L} ,\mathbf {x} )$ is the Euclidean distance between $\mathbf {L}$ and $\mathbf {x}$ . The midpoint method finds the point $\mathbf {x} _{\rm {est}}$ which minimizes

d(\mathbf {L} '_{1},\mathbf {x} )^{2}+d(\mathbf {L} '_{2},\mathbf {x} )^{2}

It turns out that $\mathbf {x} _{\rm {est}}$ lies exactly at the middle of the shortest line segment which joins the two projection lines.
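A minimal sketch of the midpoint method follows. It assumes that each projection line has already been obtained from its camera matrix and is given as a point on the line (for instance the camera centre) together with a direction vector. The function name `midpoint_triangulate` and the example lines are hypothetical.

```python
import numpy as np

def midpoint_triangulate(O1, d1, O2, d2):
    """Midpoint of the shortest segment joining the lines O1 + s*d1 and O2 + u*d2."""
    A = np.column_stack([d1, -d2])
    if np.linalg.matrix_rank(A) < 2:
        # Parallel lines have no unique closest pair of points.
        raise ValueError("projection lines are parallel; midpoint is undefined")
    # Least squares gives the parameters of the mutually closest points.
    (s, u), *_ = np.linalg.lstsq(A, O2 - O1, rcond=None)
    p1 = O1 + s * d1              # closest point on the first line
    p2 = O2 + u * d2              # closest point on the second line
    return 0.5 * (p1 + p2)

# Two skew lines: the first axis, and a line parallel to the second axis
# passing through (0, 0, 2).
O1, d1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
O2, d2 = np.array([0.0, 0.0, 2.0]), np.array([0.0, 1.0, 0.0])
x_est = midpoint_triangulate(O1, d1, O2, d2)
```

For this example the closest points are $(0,0,0)$ and $(0,0,2)$, so `x_est` is $(0,0,1)$. Note that the sketch treats the projection lines as infinite in both directions, so it does not check whether the estimate lies in front of the cameras.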

Direct linear transformation

Via the essential matrix

The problem to be solved here is how to compute the 3D coordinates $(x_{1},x_{2},x_{3})$ of a point given its corresponding normalized image coordinates $(y_{1},y_{2})$ and $(y'_{1},y'_{2})$ in the two images. If the essential matrix is known and the corresponding rotation and translation transformations have been determined, this algorithm (described in Longuet-Higgins' paper) provides a solution.

Let $\mathbf {r} _{k}$ denote row k of the rotation matrix $\mathbf {R}$ :

\mathbf {R} ={\begin{pmatrix}-\mathbf {r} _{1}-\\-\mathbf {r} _{2}-\\-\mathbf {r} _{3}-\end{pmatrix}}

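The 3D coordinates in the two camera coordinate systems are related by ${\tilde {\mathbf {x} }}'=\mathbf {R} \,({\tilde {\mathbf {x} }}-\mathbf {t} )$ , and each camera maps 3D coordinates to normalized image coordinates by division with the third coordinate, so that ${\tilde {\mathbf {x} }}=x_{3}\,\mathbf {y}$ with $\mathbf {y} =(y_{1},y_{2},1)^{T}$ . Combining these relations gives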

y'_{1}={\frac {x'_{1}}{x'_{3}}}={\frac {\mathbf {r} _{1}\cdot ({\tilde {\mathbf {x} }}-\mathbf {t} )}{\mathbf {r} _{3}\cdot ({\tilde {\mathbf {x} }}-\mathbf {t} )}}={\frac {\mathbf {r} _{1}\cdot (\mathbf {y} -\mathbf {t} /x_{3})}{\mathbf {r} _{3}\cdot (\mathbf {y} -\mathbf {t} /x_{3})}}

or

x_{3}={\frac {(\mathbf {r} _{1}-y'_{1}\,\mathbf {r} _{3})\cdot \mathbf {t} }{(\mathbf {r} _{1}-y'_{1}\,\mathbf {r} _{3})\cdot \mathbf {y} }}

Once $x_{3}$ is determined, the other two coordinates can be computed as

{\begin{pmatrix}x_{1}\\x_{2}\end{pmatrix}}=x_{3}{\begin{pmatrix}y_{1}\\y_{2}\end{pmatrix}}

The above derivation is not unique. It is also possible to start with an expression for $y'_{2}$ and derive an expression for $x_{3}$ according to

x_{3}={\frac {(\mathbf {r} _{2}-y'_{2}\,\mathbf {r} _{3})\cdot \mathbf {t} }{(\mathbf {r} _{2}-y'_{2}\,\mathbf {r} _{3})\cdot \mathbf {y} }}

In the ideal case, when the camera maps the 3D points according to a perfect pinhole camera and the resulting 2D points can be detected without any noise, the two expressions for $x_{3}$ are equal. In practice, however, they are not and it may be advantageous to combine the two estimates of $x_{3}$ , for example, in terms of some sort of average.
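The two expressions for $x_{3}$ and their combination can be sketched as follows. The plain arithmetic mean is only one possible choice of average and is an assumption of this sketch, as are the function names and the example rotation, translation and 3D point.

```python
import numpy as np

def depth_estimate(r_k, r_3, yk_prime, t, y):
    """x3 = ((r_k - y'_k r_3) . t) / ((r_k - y'_k r_3) . y) for k = 1 or 2."""
    v = r_k - yk_prime * r_3
    return (v @ t) / (v @ y)

def triangulate_with_motion(R, t, y12, y12_prime):
    """Recover (x1, x2, x3) from normalized image coordinates in both views."""
    y = np.array([y12[0], y12[1], 1.0])
    x3_a = depth_estimate(R[0], R[2], y12_prime[0], t, y)   # from y'_1
    x3_b = depth_estimate(R[1], R[2], y12_prime[1], t, y)   # from y'_2
    x3 = 0.5 * (x3_a + x3_b)     # assumed combination: arithmetic mean
    return x3 * y                # (x1, x2) = x3 * (y1, y2)

# Hypothetical noise-free example: rotation about the second axis.
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.3, 0.0])
x_true = np.array([0.5, 0.2, 4.0])

x_prime = R @ (x_true - t)                 # coordinates in the second camera
y = x_true[:2] / x_true[2]                 # unprimed image coordinates
y_prime = x_prime[:2] / x_prime[2]         # primed image coordinates
x_rec = triangulate_with_motion(R, t, y, y_prime)
```

In this noise-free example both depth estimates agree and `x_rec` reproduces `x_true`. For some camera configurations one of the denominators can be zero or close to zero (for example the $y'_{2}$ expression when there is no rotation and the translation is along the first axis), in which case that estimate should presumably be excluded from the average.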

Other extensions of the above computations are also possible. The derivation above starts with an expression for the primed image coordinates and derives 3D coordinates in the unprimed system. It is also possible to start with the unprimed image coordinates and obtain primed 3D coordinates, which can finally be transformed into unprimed 3D coordinates. Again, in the ideal case the result should be equal to the above expressions, but in practice they may deviate.

A final remark relates to the fact that if the essential matrix is determined from corresponding image coordinates, which is often the case when 3D points are determined in this way, the translation vector $\mathbf {t}$ is known only up to an unknown positive scaling. As a consequence, the reconstructed 3D points, too, are determined only up to a positive scaling.

References

Richard Hartley and Andrew Zisserman (2003). Multiple View Geometry in computer vision. Cambridge University Press. ISBN 978-0-521-54051-3.

External links

Two view and multi-view triangulation in Matlab
