Camera auto-calibration

Camera auto-calibration is the process of determining internal camera parameters directly from multiple uncalibrated images of unstructured scenes. In contrast to classic camera calibration, auto-calibration does not require any special calibration objects in the scene. In the visual effects industry, camera auto-calibration is often part of the "Match Moving" process where a synthetic camera trajectory and intrinsic projection model are solved to reproject synthetic content into video.

Camera auto-calibration is a form of sensor ego-structure discovery: the subjective effects of the sensor are separated from the objective effects of the environment, leading to a reconstruction of the perceived world without the bias imposed by the measurement device. This is achieved via the fundamental assumption that images are projected from a Euclidean space through a linear, 5-degree-of-freedom (in the simplest case) pinhole camera model with non-linear optical distortion. The linear pinhole parameters are the focal length, the aspect ratio, the skew, and the 2D principal point. With only a set of uncalibrated (or calibrated) images, a scene may be reconstructed up to a six-degree-of-freedom Euclidean transform and an isotropic scaling.
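
These five parameters are commonly collected in an upper-triangular intrinsic matrix; in one common (assumed) notation, with focal length $f$, aspect ratio $a$, skew $s$, and principal point $(u_0, v_0)$:

```latex
K = \begin{bmatrix} f & s & u_0 \\ 0 & a f & v_0 \\ 0 & 0 & 1 \end{bmatrix}
```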

A mathematical theory for general multi-view camera self-calibration was first demonstrated in 1992 by Olivier Faugeras, QT Luong, and Stephen J. Maybank. For 3D scenes and general motions, each pair of views provides two constraints on the 5-degree-of-freedom calibration, so three views are the minimum needed for full calibration with intrinsic parameters fixed between views. High-quality modern imaging sensors and optics may also provide further prior constraints on the calibration, such as zero skew (an orthogonal pixel grid) and unit aspect ratio (square pixels). Integrating these priors reduces the minimal number of images needed to two. It is also possible to auto-calibrate a sensor from a single image given supporting information in a structured scene: for example, calibration may be obtained if multiple sets of parallel lines or objects of known shape (e.g. circular) are identified.
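
As a counting sketch of the three-view minimum under these assumptions: with fixed intrinsics across $m$ views, every pair of views contributes two constraints, so the five unknowns require

```latex
2\binom{m}{2} = m(m-1) \ge 5 \quad\Longrightarrow\quad m \ge 3 .
```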

Problem statement

Given a set of cameras $P_i$ and 3D points $X_j$ reconstructed up to a projective ambiguity (using, for example, the bundle adjustment method), we wish to find a rectifying homography $H$ such that $\{P_i H,\ H^{-1} X_j\}$ is a metric reconstruction. The internal camera parameters can then be calculated from the camera matrix factorization $P_i H = K_i [R_i \mid t_i]$.
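
A minimal sketch of that final factorization step, assuming NumPy/SciPy and a camera matrix that has already been upgraded to metric (all names here are illustrative):

```python
import numpy as np
from scipy.linalg import rq

def factor_camera_matrix(P):
    """Split a 3x4 camera matrix P ~ K [R | t] into intrinsics and pose.

    A sketch only; P is assumed to be the metric-upgraded matrix P_i H
    from the text, and the decomposition is defined up to scale.
    """
    K, R = rq(P[:, :3])                 # upper-triangular K times rotation R
    S = np.diag(np.sign(np.diag(K)))    # RQ is unique only up to diagonal signs
    K, R = K @ S, S @ R                 # S @ S = I, so the product K @ R is unchanged
    t = np.linalg.solve(K, P[:, 3])     # since P[:, 3] = K @ t
    K /= K[2, 2]                        # conventional normalization K[2,2] = 1
    return K, R, t
```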

Solution domains

Algorithms

Related Research Articles

Inverse kinematics: Computing joint values of a kinematic chain from a known end position

In computer animation and robotics, inverse kinematics is the mathematical process of calculating the variable joint parameters needed to place the end of a kinematic chain, such as a robot manipulator or animation character's skeleton, in a given position and orientation relative to the start of the chain. Given joint parameters, the position and orientation of the chain's end, e.g. the hand of the character or robot, can typically be calculated directly using multiple applications of trigonometric formulas, a process known as forward kinematics. However, the reverse operation is, in general, much more challenging.
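
A minimal sketch for the classic two-link planar arm, one of the cases where the inverse problem still has a closed form (link lengths and the elbow-down choice are assumptions of this example):

```python
import math

def two_link_ik(x, y, l1, l2):
    """Inverse kinematics for a planar 2-link arm (elbow-down solution).

    Returns shoulder and elbow angles placing the end effector at (x, y).
    """
    d2 = x * x + y * y
    c2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)  # law of cosines
    if abs(c2) > 1:
        raise ValueError("target out of reach")
    theta2 = math.acos(c2)
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2
```

Forward kinematics recovers the target from the returned angles, which makes a convenient check of any such solver.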

Vanishing point: Aspect of perspective drawing

A vanishing point is a point on the image plane of a perspective rendering where the two-dimensional perspective projections of mutually parallel lines in three-dimensional space appear to converge. When the set of parallel lines is perpendicular to a picture plane, the construction is known as one-point perspective, and their vanishing point corresponds to the oculus, or "eye point", from which the image should be viewed for correct perspective geometry. Traditional linear drawings use objects with one to three sets of parallels, defining one to three vanishing points.
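
In homogeneous coordinates this convergence point is easy to compute: the line through two points is their cross product, and two lines meet at theirs. A minimal NumPy sketch (coordinates illustrative):

```python
import numpy as np

def vanishing_point(p1, q1, p2, q2):
    """Intersection of lines (p1, q1) and (p2, q2), given as homogeneous points."""
    l1 = np.cross(p1, q1)          # line through p1 and q1
    l2 = np.cross(p2, q2)          # line through p2 and q2
    v = np.cross(l1, l2)           # their intersection
    return v / v[2]                # assumes the vanishing point is finite

# Example: two image segments assumed to be projections of parallel 3D lines
v = vanishing_point([100, 400, 1], [200, 300, 1],
                    [500, 400, 1], [400, 300, 1])   # -> (300, 200, 1)
```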

In the fields of computing and computer vision, pose represents the position and orientation of an object, usually in three dimensions. Poses are often stored internally as transformation matrices. The term “pose” is largely synonymous with the term “transform”, but a transform may often include scale, whereas pose does not.

In computer vision, the fundamental matrix $F$ is a 3×3 matrix which relates corresponding points in stereo images. In epipolar geometry, with homogeneous image coordinates $\mathbf{x}$ and $\mathbf{x}'$ of corresponding points in a stereo image pair, $F\mathbf{x}$ describes an epipolar line on which the corresponding point $\mathbf{x}'$ in the other image must lie. That means, for all pairs of corresponding points, $\mathbf{x}'^{\top} F \mathbf{x} = 0$ holds.
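
Both statements translate directly into a few lines of NumPy; here $F$ is assumed to be given already, e.g. from an eight-point estimate:

```python
import numpy as np

def epipolar_line(F, x):
    """l' = F x: the line in the second image on which x' must lie."""
    return F @ np.asarray(x, dtype=float)

def epipolar_residual(F, x, x_prime):
    """The constraint value x'^T F x, which should vanish for true matches."""
    return float(np.asarray(x_prime, dtype=float) @ F @ np.asarray(x, dtype=float))
```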

Camera resectioning is the process of estimating the parameters of a pinhole camera model approximating the camera that produced a given photograph or video; it determines which incoming light ray is associated with each pixel on the resulting image. Basically, the process determines the pose of the pinhole camera.

Epipolar geometry: Geometry of stereo vision

Epipolar geometry is the geometry of stereo vision. When two cameras view a 3D scene from two distinct positions, there are a number of geometric relations between the 3D points and their projections onto the 2D images that lead to constraints between the image points. These relations are derived based on the assumption that the cameras can be approximated by the pinhole camera model.

Image rectification

Image rectification is a transformation process used to project images onto a common image plane. This process has several degrees of freedom and there are many strategies for transforming images to the common plane. Image rectification is used in computer stereo vision to simplify the problem of finding matching points between images, and in geographic information systems to merge images taken from multiple perspectives into a common map coordinate system.
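
In the stereo setting, a minimal OpenCV sketch of computing the rectifying transforms; the camera parameters here are assumptions of the example:

```python
import numpy as np
import cv2

# Hypothetical calibrated stereo pair: shared intrinsics K, no distortion,
# and a 10 cm pure-translation baseline between the cameras.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
d = np.zeros(5)
R = np.eye(3)
T = np.array([[0.1], [0.0], [0.0]])

R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K, d, K, d, (640, 480), R, T)
# Per-pixel remap tables for the first camera:
map1, map2 = cv2.initUndistortRectifyMap(K, d, R1, P1, (640, 480), cv2.CV_32FC1)
```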

Image sensor format: Shape and size of a digital camera's image sensor

In digital photography, the image sensor format is the shape and size of the image sensor.

Pinhole camera model

The pinhole camera model describes the mathematical relationship between the coordinates of a point in three-dimensional space and its projection onto the image plane of an ideal pinhole camera, where the camera aperture is described as a point and no lenses are used to focus light. The model does not include, for example, geometric distortions or blurring of unfocused objects caused by lenses and finite sized apertures. It also does not take into account that most practical cameras have only discrete image coordinates. This means that the pinhole camera model can only be used as a first order approximation of the mapping from a 3D scene to a 2D image. Its validity depends on the quality of the camera and, in general, decreases from the center of the image to the edges as lens distortion effects increase.
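
With intrinsic matrix $K$ and camera pose $[R \mid t]$ (notation assumed here), the ideal pinhole projection of a world point $(X, Y, Z)$ to pixel $(u, v)$ can be written, up to the projective scale $\lambda$, as:

```latex
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K \, [\, R \mid t \,] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```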

Bundle adjustment

In photogrammetry and computer stereo vision, bundle adjustment is simultaneous refining of the 3D coordinates describing the scene geometry, the parameters of the relative motion, and the optical characteristics of the camera(s) employed to acquire the images, given a set of images depicting a number of 3D points from different viewpoints. Its name refers to the geometrical bundles of light rays originating from each 3D feature and converging on each camera's optical center, which are adjusted optimally according to an optimality criterion involving the corresponding image projections of all points.
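
A compact sketch of this optimization, assuming SciPy and a fixed, shared intrinsic matrix K; the parameter packing and all names are illustrative:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs, K):
    """Stacked pixel residuals over all observations.

    params packs a 6-dof pose per camera (rotation vector + translation)
    followed by 3 coordinates per 3D point; intrinsics K are held fixed.
    """
    poses = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for c, p, uv in zip(cam_idx, pt_idx, obs):
        R = Rotation.from_rotvec(poses[c, :3]).as_matrix()
        x = K @ (R @ pts[p] + poses[c, 3:])
        res.append(x[:2] / x[2] - uv)       # reprojection error in pixels
    return np.concatenate(res)

# x0 stacks the initial poses and points; at scale, passing a sparse
# Jacobian pattern to least_squares greatly speeds up the solve.
# result = least_squares(reprojection_residuals, x0,
#                        args=(n_cams, n_pts, cam_idx, pt_idx, obs, K))
```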

Visual servoing, also known as vision-based robot control and abbreviated VS, is a technique which uses feedback information extracted from a vision sensor to control the motion of a robot. One of the earliest papers that talks about visual servoing was from the SRI International Labs in 1979.

In computer vision, the trifocal tensor is a 3×3×3 array of numbers that incorporates all projective geometric relationships among three views. It relates the coordinates of corresponding points or lines in three views, being independent of the scene structure and depending only on the relative motion among the three views and their intrinsic calibration parameters. Hence, the trifocal tensor can be considered a generalization of the fundamental matrix to three views. Although the tensor is made up of 27 elements, only 18 of them are independent: three cameras have 3 × 11 = 33 degrees of freedom, of which 15 are absorbed by the projective ambiguity of the reconstruction.

Computer stereo vision is the extraction of 3D information from digital images, such as those obtained by a CCD camera. By comparing information about a scene from two vantage points, 3D information can be obtained by examining the relative positions of objects in the two views. This is similar to the biological process of stereopsis.

In Euclidean and projective geometry, just as two (distinct) points determine a line, five points determine a conic. There are additional subtleties for conics that do not exist for lines, and thus the statement and its proof for conics are both more technical than for lines.
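
The count behind this statement, in brief: a general conic has six coefficients defined only up to a common scale, hence five degrees of freedom, and each point $(x, y)$ it passes through imposes one linear condition on them:

```latex
a x^2 + b x y + c y^2 + d x + e y + f = 0 .
```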

3D reconstruction from multiple images: Creation of a 3D model from a set of images

3D reconstruction from multiple images is the creation of three-dimensional models from a set of images. It is the reverse process of obtaining 2D images from 3D scenes.

Homography (computer vision): Relation of two images with software

In the field of computer vision, any two images of the same planar surface in space are related by a homography. This has many practical applications, such as image rectification, image registration, or camera motion—rotation and translation—between two images. Once camera resectioning has been done from an estimated homography matrix, this information may be used for navigation, or to insert models of 3D objects into an image or video, so that they are rendered with the correct perspective and appear to have been part of the original scene.
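
A short sketch of estimating and applying such a homography with OpenCV; the point correspondences are illustrative:

```python
import numpy as np
import cv2

# Four correspondences between two images of the same plane.
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float32)
dst = np.array([[12, 7], [110, 5], [118, 102], [9, 98]], dtype=np.float32)

H, _ = cv2.findHomography(src, dst)   # 3x3 homography, defined up to scale

# Transfer a point from the first image into the second:
p = H @ np.array([0.5, 0.5, 1.0])
u, v = p[:2] / p[2]
```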

The Laguerre formula provides the acute angle $\varphi$ between two proper real lines in terms of the cross-ratio of their points at infinity $P_1, P_2$ with the two circular points at infinity $I, J$: $\varphi = \left| \frac{1}{2i} \log \operatorname{cr}(P_1, P_2; I, J) \right|$.

Chessboards arise frequently in computer vision theory and practice because their highly structured geometry is well-suited for algorithmic detection and processing. The appearance of chessboards in computer vision can be divided into two main areas: camera calibration and feature extraction. This article provides a unified discussion of the role that chessboards play in the canonical methods from these two areas, including references to the seminal literature, examples, and pointers to software implementations.
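
As an illustration of the calibration use, a minimal OpenCV sketch; the file paths and board size are assumptions of the example:

```python
import glob
import numpy as np
import cv2

# One board pose per image; several views are needed to constrain intrinsics.
pattern = (9, 6)                                  # inner corners of the board
obj = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in glob.glob("boards/*.png"):            # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(obj)
        img_pts.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
```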

Perspective-n-Point is the problem of estimating the pose of a calibrated camera given a set of n 3D points in the world and their corresponding 2D projections in the image. The camera pose consists of 6 degrees-of-freedom (DOF) which are made up of the rotation and 3D translation of the camera with respect to the world. This problem originates from camera calibration and has many applications in computer vision and other areas, including 3D pose estimation, robotics and augmented reality. A commonly used solution to the problem exists for n = 3 called P3P, and many solutions are available for the general case of n ≥ 3. A solution for n = 2 exists if feature orientations are available at the two points. Implementations of these solutions are also available in open source software.
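
A minimal sketch of the n = 4 case with OpenCV's solver; all coordinates and the intrinsic matrix are illustrative:

```python
import numpy as np
import cv2

# Four known 3D points (e.g. the corners of a 10 cm square marker)
# and their detected pixel locations, with an assumed calibrated K.
obj = np.array([[0, 0, 0], [0.1, 0, 0], [0.1, 0.1, 0], [0, 0.1, 0]], np.float32)
img = np.array([[320, 240], [410, 245], [405, 330], [318, 325]], np.float32)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(obj, img, K, None)  # None: no lens distortion
```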

QT Luong is a French-Vietnamese-born American photographer known for his work in the U.S. National Parks, as well as for work in the theory of computer vision. In 2022, Luong received the Ansel Adams Award for Conservation Photography from the Sierra Club.
