Bundle adjustment

Last updated May 24, 2024

In photogrammetry and computer stereo vision, bundle adjustment is simultaneous refining of the 3D coordinates describing the scene geometry, the parameters of the relative motion, and the optical characteristics of the camera(s) employed to acquire the images, given a set of images depicting a number of 3D points from different viewpoints. Its name refers to the geometrical bundles of light rays originating from each 3D feature and converging on each camera's optical center, which are adjusted optimally according to an optimality criterion involving the corresponding image projections of all points.

Uses

Bundle adjustment is almost always ^{[ citation needed ]} used as the last step of feature-based 3D reconstruction algorithms. It amounts to an optimization problem on the 3D structure and viewing parameters (i.e., camera pose and possibly intrinsic calibration and radial distortion), to obtain a reconstruction which is optimal under certain assumptions regarding the noise pertaining to the observed^[1] image features: If the image error is zero-mean Gaussian, then bundle adjustment is the Maximum Likelihood Estimator.^[2]^: 2 Bundle adjustment was originally conceived in the field of photogrammetry during the 1950s and has increasingly been used by computer vision researchers during recent years.^[2]^: 2

General approach

Bundle adjustment boils down to minimizing the reprojection error between the image locations of observed and predicted image points, which is expressed as the sum of squares of a large number of nonlinear, real-valued functions. Thus, the minimization is achieved using nonlinear least-squares algorithms. Of these, Levenberg–Marquardt has proven to be one of the most successful due to its ease of implementation and its use of an effective damping strategy that lends it the ability to converge quickly from a wide range of initial guesses. By iteratively linearizing the function to be minimized in the neighborhood of the current estimate, the Levenberg–Marquardt algorithm involves the solution of linear systems termed the normal equations. When solving the minimization problems arising in the framework of bundle adjustment, the normal equations have a sparse block structure owing to the lack of interaction among parameters for different 3D points and cameras. This can be exploited to gain tremendous computational benefits by employing a sparse variant of the Levenberg–Marquardt algorithm which explicitly takes advantage of the normal equations zeros pattern, avoiding storing and operating on zero-elements.^[2]^: 3

Mathematical definition

Bundle adjustment amounts to jointly refining a set of initial camera and structure parameter estimates for finding the set of parameters that most accurately predict the locations of the observed points in the set of available images. More formally,^[3] assume that $n$ 3D points are seen in $m$ views and let $\mathbf {x} _{ij}$ be the projection of the $i$ th point on image $j$ . Let $\displaystyle v_{ij}$ denote the binary variables that equal 1 if point $i$ is visible in image $j$ and 0 otherwise. Assume also that each camera $j$ is parameterized by a vector $\mathbf {a} _{j}$ and each 3D point $i$ by a vector $\mathbf {b} _{i}$ . Bundle adjustment minimizes the total reprojection error with respect to all 3D point and camera parameters, specifically

\min _{\mathbf {a} _{j},\,\mathbf {b} _{i}}\displaystyle \sum _{i=1}^{n}\;\displaystyle \sum _{j=1}^{m}\;v_{ij}\,d(\mathbf {Q} (\mathbf {a} _{j},\,\mathbf {b} _{i}),\;\mathbf {x} _{ij})^{2},

where $\mathbf {Q} (\mathbf {a} _{j},\,\mathbf {b} _{i})$ is the predicted projection of point $i$ on image $j$ and $d(\mathbf {x} ,\,\mathbf {y} )$ denotes the Euclidean distance between the image points represented by vectors $\mathbf {x}$ and $\mathbf {y}$ . Because the minimum is computed over many points and many images, bundle adjustment is by definition tolerant to missing image projections, and if the distance metric is chosen reasonably (e.g., Euclidean distance), bundle adjustment will also minimize a physically meaningful criterion.

Related Research Articles

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

In mathematics and computing, the Levenberg–Marquardt algorithm, also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

Fuzzy clustering is a form of clustering in which each data point can belong to more than one cluster.

In computer vision, the fundamental matrix $is a 3\times3 matrix which relates corresponding points in stereo images. In epipolar geometry, with homogeneous image coordinates, x and x', of corresponding points in a stereo image pair, Fx describes a line on which the corresponding point x' on the other image must lie. That means, for all pairs of corresponding points holds$

Camera resectioning is the process of estimating the parameters of a pinhole camera model approximating the camera that produced a given photograph or video; it determines which incoming light ray is associated with each pixel on the resulting image. Basically, the process determines the pose of the pinhole camera.

The method of iteratively reweighted least squares (IRLS) is used to solve certain optimization problems with objective functions of the form of a p-norm:

In computer vision, triangulation refers to the process of determining a point in 3D space given its projections onto two, or more, images. In order to solve this problem it is necessary to know the parameters of the camera projection function from 3D to 2D for the cameras involved, in the simplest case represented by the camera matrices. Triangulation is sometimes also referred to as reconstruction or intersection.

Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ( $).$

Camera auto-calibration is the process of determining internal camera parameters directly from multiple uncalibrated images of unstructured scenes. In contrast to classic camera calibration, auto-calibration does not require any special calibration objects in the scene. In the visual effects industry, camera auto-calibration is often part of the "Match Moving" process where a synthetic camera trajectory and intrinsic projection model are solved to reproject synthetic content into video.

3D reconstruction from multiple images is the creation of three-dimensional models from a set of images. It is the reverse process of obtaining 2D images from 3D scenes.

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

In computer vision, pattern recognition, and robotics, point-set registration, also known as point-cloud registration or scan matching, is the process of finding a spatial transformation that aligns two point clouds. The purpose of finding such a transformation includes merging multiple data sets into a globally consistent model, and mapping a new measurement to a known data set to identify features or to estimate its pose. Raw 3D point cloud data are typically obtained from Lidars and RGB-D cameras. 3D point clouds can also be generated from computer vision algorithms such as triangulation, bundle adjustment, and more recently, monocular image depth estimation using deep learning. For 2D point set registration used in image processing and feature-based image registration, a point set may be 2D pixel coordinates obtained by feature extraction from an image, for example corner detection. Point cloud registration has extensive applications in autonomous driving, motion estimation and 3D reconstruction, object detection and pose estimation, robotic manipulation, simultaneous localization and mapping (SLAM), panorama stitching, virtual and augmented reality, and medical imaging.

<span class="mw-page-title-main">Homography (computer vision)</span> Relation of two images with software

In the field of computer vision, any two images of the same planar surface in space are related by a homography. This has many practical applications, such as image rectification, image registration, or camera motion—rotation and translation—between two images. Once camera resectioning has been done from an estimated homography matrix, this information may be used for navigation, or to insert models of 3D objects into an image or video, so that they are rendered with the correct perspective and appear to have been part of the original scene.

Chessboards arise frequently in computer vision theory and practice because their highly structured geometry is well-suited for algorithmic detection and processing. The appearance of chessboards in computer vision can be divided into two main areas: camera calibration and feature extraction. This article provides a unified discussion of the role that chessboards play in the canonical methods from these two areas, including references to the seminal literature, examples, and pointers to software implementations.

Perspective-n-Point is the problem of estimating the pose of a calibrated camera given a set of $n$ 3D points in the world and their corresponding 2D projections in the image. The camera pose consists of 6 degrees-of-freedom (DOF) which are made up of the rotation and 3D translation of the camera with respect to the world. This problem originates from camera calibration and has many applications in computer vision and other areas, including 3D pose estimation, robotics and augmented reality. A commonly used solution to the problem exists for $n = 3$ called P3P, and many solutions are available for the general case of $n \geq 3$ . A solution for $n = 2$ exists if feature orientations are available at the two points. Implementations of these solutions are also available in open source software.

Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may be an over-complete spanning set. This problem setup also allows the dimensionality of the signals being represented to be higher than the one of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.

References

↑ B. Triggs; P. McLauchlan; R. Hartley; A. Fitzgibbon (1999). "Bundle Adjustment — A Modern Synthesis" (PDF). ICCV '99: Proceedings of the International Workshop on Vision Algorithms. Springer-Verlag. pp. 298–372. doi:10.1007/3-540-44480-7_21. ISBN 3-540-67973-1.
1 2 3 M.I.A. Lourakis and A.A. Argyros (2009). "SBA: A Software Package for Generic Sparse Bundle Adjustment" (PDF). ACM Transactions on Mathematical Software. 36 (1): 1–30. doi:10.1145/1486525.1486527. S2CID 474253.
↑ R.I. Hartley and A. Zisserman (2004). Multiple View Geometry in computer vision (2nd ed.). Cambridge University Press. ISBN 978-0-521-54051-3.

External links

Software

: Apero/MicMac, a free open source photogrammetric software. Cecill-B licence.
sba: A Generic Sparse Bundle Adjustment C/C++ Package Based on the Levenberg–Marquardt Algorithm (C, MATLAB). GPL.
cvsba Archived 2013-10-24 at the Wayback Machine : An OpenCV wrapper for sba library (C++). GPL.
ssba: Simple Sparse Bundle Adjustment package based on the Levenberg–Marquardt Algorithm (C++). LGPL.
OpenCV: Computer Vision library in the Images stitching module. BSD license.
mcba: Multi-Core Bundle Adjustment (CPU/GPU). GPL3.
libdogleg: General-purpose sparse non-linear least squares solver, based on Powell's dogleg method. LGPL.
ceres-solver: A Nonlinear Least Squares Minimizer. BSD license.
g2o: General Graph Optimization (C++) - framework with solvers for sparse graph-based non-linear error functions. LGPL.
DGAP: The program DGAP implement the photogrammetric method of bundle adjustment invented by Helmut Schmid and Duane Brown. GPL.
Bundler: A structure-from-motion (SfM) system for unordered image collections (for instance, images from the Internet) by Noah Snavely. GPL.
COLMAP: A general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line interface. BSD license.
Theia: A computer vision library aimed at providing efficient and reliable algorithms for Structure from Motion (SfM). New BSD license.
Ames Stereo Pipeline has a tool for bundle adjustment (Apache II licence).

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[triggs1999-1] B. Triggs; P. McLauchlan; R. Hartley; A. Fitzgibbon (1999). "Bundle Adjustment — A Modern Synthesis" (PDF). ICCV '99: Proceedings of the International Workshop on Vision Algorithms. Springer-Verlag. pp. 298–372. doi:10.1007/3-540-44480-7_21. ISBN 3-540-67973-1.

[sba2009-2] 1 2 3 M.I.A. Lourakis and A.A. Argyros (2009). "SBA: A Software Package for Generic Sparse Bundle Adjustment" (PDF). ACM Transactions on Mathematical Software. 36 (1): 1–30. doi:10.1145/1486525.1486527. S2CID 474253.

[3] R.I. Hartley and A. Zisserman (2004). Multiple View Geometry in computer vision (2nd ed.). Cambridge University Press. ISBN 978-0-521-54051-3.

[1]

[2]

[3]