Structure from motion (SfM) [1] is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences that may be coupled with local motion signals. It is studied in the fields of computer vision and visual perception.
Humans perceive a great deal of information about the three-dimensional structure in their environment by moving around it. When the observer moves, objects around them move different amounts depending on their distance from the observer. This is known as motion parallax, and from this depth information can be used to generate an accurate 3D representation of the world around them. [2]
Finding structure from motion presents a similar problem to finding structure from stereo vision. In both instances, the correspondence between images and the reconstruction of 3D object needs to be found.
To find correspondence between images, features such as corner points (edges with gradients in multiple directions) are tracked from one image to the next. One of the most widely used feature detectors is the scale-invariant feature transform (SIFT). It uses the maxima from a difference-of-Gaussians (DOG) pyramid as features. The first step in SIFT is finding a dominant gradient direction. To make it rotation-invariant, the descriptor is rotated to fit this orientation. [3] Another common feature detector is the SURF (speeded-up robust features). [4] In SURF, the DOG is replaced with a Hessian matrix-based blob detector. Also, instead of evaluating the gradient histograms, SURF computes for the sums of gradient components and the sums of their absolute values. [5] Its usage of integral images allows the features to be detected extremely quickly with high detection rate. [6] Therefore, comparing to SIFT, SURF is a faster feature detector with drawback of less accuracy in feature positions. [5] Another type of feature recently made practical for structure from motion are general curves (e.g., locally an edge with gradients in one direction), part of a technology known as pointless SfM, [7] [8] useful when point features are insufficient, common in man-made environments. [9]
The features detected from all the images will then be matched. One of the matching algorithms that track features from one image to another is the Lucas–Kanade tracker. [10]
Sometimes some of the matched features are incorrectly matched. This is why the matches should also be filtered. RANSAC (random sample consensus) is the algorithm that is usually used to remove the outlier correspondences. In the paper of Fischler and Bolles, RANSAC is used to solve the location determination problem (LDP), where the objective is to determine the points in space that project onto an image into a set of landmarks with known locations. [11]
The feature trajectories over time are then used to reconstruct their 3D positions and the camera's motion. [12] An alternative is given by so-called direct approaches, where geometric information (3D structure and camera motion) is directly estimated from the images, without intermediate abstraction to features or corners. [13]
There are several approaches to structure from motion. In incremental SfM, [14] camera poses are solved for and added one by one to the collection. In global SfM, [15] [16] the poses of all cameras are solved for at the same time. A somewhat intermediate approach is out-of-core SfM, where several partial reconstructions are computed that are then integrated into a global solution.
Structure-from-motion photogrammetry with multi-view stereo provides hyperscale landform models using images acquired from a range of digital cameras and optionally a network of ground control points. The technique is not limited in temporal frequency and can provide point cloud data comparable in density and accuracy to those generated by terrestrial and airborne laser scanning at a fraction of the cost. [17] [18] [19] Structure from motion is also useful in remote or rugged environments where terrestrial laser scanning is limited by equipment portability and airborne laser scanning is limited by terrain roughness causing loss of data and image foreshortening. The technique has been applied in many settings such as rivers, [20] badlands, [21] sandy coastlines, [22] [23] fault zones, [24] landslides, [25] [26] and coral reef settings. [27] SfM has been also successfully applied for the assessment of changes [28] and large wood accumulation volume [29] and porosity [30] in fluvial systems, the characterization of rock masses through the determination of some properties as the orientation, persistence, etc. of discontinuities. [31] [32] as well as for the evaluation of the stability of rock cut slopes [33] . A full range of digital cameras can be utilized, including digital SLR's, compact digital cameras and even smart phones. Generally though, higher accuracy data will be achieved with more expensive cameras, which include lenses of higher optical quality. The technique therefore offers exciting opportunities to characterize surface topography in unprecedented detail and, with multi-temporal data, to detect elevation, position and volumetric changes that are symptomatic of earth surface processes. Structure from motion can be placed in the context of other digital surveying methods.
Cultural heritage is present everywhere. Its structural control, documentation and conservation is one of humanity's main duties (UNESCO). Under this point of view, SfM is used in order to properly estimate situations as well as planning and maintenance efforts and costs, control and restoration. Because serious constraints often exist connected to the accessibility of the site and impossibility to install invasive surveying pillars that did not permit the use of traditional surveying routines (like total stations), SfM provides a non-invasive approach for the structure, without the direct interaction between the structure and any operator. The use is accurate as only qualitative considerations are needed. It is fast enough to respond to the monument’s immediate management needs. [34] The first operational phase is an accurate preparation of the photogrammetric surveying where is established the relation between best distance from the object, focal length, the ground sampling distance (GSD) and the sensor’s resolution. With this information the programmed photographic acquisitions must be made using vertical overlapping of at least 60% (figure 02). [35]
Furthermore, structure-from-motion photogrammetry represents a non-invasive, highly flexible and low-cost methodology to digitalize historical documents. [36]
Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.
Photogrammetry is the science and technology of obtaining reliable information about physical objects and the environment through the process of recording, measuring and interpreting photographic images and patterns of electromagnetic radiant imagery and other phenomena.
Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.
3D scanning is the process of analyzing a real-world object or environment to collect three dimensional data of its shape and possibly its appearance. The collected data can then be used to construct digital 3D models.
Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.
In photogrammetry and computer stereo vision, bundle adjustment is simultaneous refining of the 3D coordinates describing the scene geometry, the parameters of the relative motion, and the optical characteristics of the camera(s) employed to acquire the images, given a set of images depicting a number of 3D points from different viewpoints. Its name refers to the geometrical bundles of light rays originating from each 3D feature and converging on each camera's optical center, which are adjusted optimally according to an optimality criterion involving the corresponding image projections of all points.
An area of computer vision is active vision, sometimes also called active computer vision. An active vision system is one that can manipulate the viewpoint of the camera(s) in order to investigate the environment and get better information from it.
In computer vision and computer graphics, 3D reconstruction is the process of capturing the shape and appearance of real objects. This process can be accomplished either by active or passive methods. If the model is allowed to change its shape in time, this is referred to as non-rigid or spatio-temporal reconstruction.
A structured-light 3D scanner is a 3D scanning device for measuring the three-dimensional shape of an object using projected light patterns and a camera system.
In robotics and computer vision, visual odometry is the process of determining the position and orientation of a robot by analyzing the associated camera images. It has been used in a wide variety of robotic applications, such as on the Mars Exploration Rovers.
The Institute of Photogrammetry and GeoInformation (IPI) is a research institute that is part of the consortium of institutes operating under the aegis of Leibniz University situated in Hannover, Germany. The current research at IPI focuses both on terrestrial and extraterrestrial image interpretation. The basic themes of research revolve around computer vision, 3D geometry, image processing and machine learning. IPI contributes regularly with state-of-the-art methods to interpret high resolution images received from the HRSC probe of the Mars Express mission.
PhotoModeler is a software application that performs image-based modeling and close range photogrammetry – producing 3D models and measurements from photography. The software is used for close-range, aerial and uav photogrammetry.
3D reconstruction from multiple images is the creation of three-dimensional models from a set of images. It is the reverse process of obtaining 2D images from 3D scenes.
A digital outcrop model (DOM), also called a virtual outcrop model, is a digital 3D representation of the outcrop surface, mostly in a form of textured polygon mesh.
Dlib is a general purpose cross-platform software library written in the programming language C++. Its design is heavily influenced by ideas from design by contract and component-based software engineering. Thus it is, first and foremost, a set of independent software components. It is open-source software released under a Boost Software License.
Pix4D is a Swiss software company that specializes in photogrammetry. It was founded in 2011 as a spinoff from the École Polytechnique Fédérale de Lausanne (EPFL) Computer Vision Lab in Switzerland. It develops a suite of software products that use photogrammetry and computer vision algorithms to transform DSLR, fisheye, RGB, thermal and multispectral images into 3D maps and 3D modeling. The company has 7 international offices, with its headquarters in Lausanne, Switzerland.
An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.
OpenDroneMap is an open source photogrammetry toolkit to process aerial imagery into maps and 3D models. The software is hosted and distributed freely on GitHub.
Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.