3D pose estimation

Last updated December 16, 2024

Pose estimation in a motion capture system

3D pose estimation is the process of predicting the transformation of an object from a user-defined reference pose, given an image or a 3D scan. It arises in computer vision and robotics, where the pose or transformation of an object can be used for alignment of computer-aided design models, identification, grasping, or manipulation of the object.

The image data from which the pose of an object is determined can be a single image, a stereo image pair, or an image sequence in which, typically, the camera moves with a known velocity. The objects considered can be rather general, including living beings or body parts such as a head or hands. The methods used to determine the pose of an object, however, are usually specific to a class of objects and cannot generally be expected to work well for other types of objects.

From an uncalibrated 2D camera

It is possible to estimate the 3D rotation and translation of a 3D object from a single 2D photo if an approximate 3D model of the object is known and the corresponding points in the 2D image are known. A common technique for solving this is POSIT,^[1] developed in 1995, which estimates the 3D pose directly from the 3D model points and the 2D image points and corrects the errors iteratively until a good estimate is found from a single image.^[2] Most implementations of POSIT only work on non-coplanar points (in other words, they do not work with flat objects or planes).^[3]
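The iteration can be illustrated with a minimal sketch in the spirit of POSIT, using only the Python standard library. It is not a reference implementation: the function names (`posit`, `solve_normal_equations`) are hypothetical, image coordinates are assumed to be measured relative to the image centre, and the focal length (in pixels) is assumed to be known. The first model point serves as the reference point of the object.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(a):
    """Return the normalized vector and its original length."""
    n = math.sqrt(dot(a, a))
    return [x / n for x in a], n

def solve_normal_equations(A, b):
    """Least-squares solution of A x = b for an N x 3 matrix A (Cramer's rule).

    The determinant is zero for coplanar model points, which is why this
    sketch, like most POSIT implementations, needs non-coplanar points.
    """
    cols = [[row[c] for row in A] for c in range(3)]
    G = [[dot(cols[r], cols[c]) for c in range(3)] for r in range(3)]  # A^T A
    v = [dot(cols[r], b) for r in range(3)]                            # A^T b
    det = dot(G[0], cross(G[1], G[2]))
    return [dot(v, cross(G[1], G[2])) / det,
            dot(G[0], cross(v, G[2])) / det,
            dot(G[0], cross(G[1], v)) / det]

def posit(model_pts, image_pts, focal, iterations=100, tol=1e-12):
    """Estimate rotation rows (i, j, k) and translation of model_pts[0]."""
    m0 = model_pts[0]
    A = [[p[c] - m0[c] for c in range(3)] for p in model_pts[1:]]
    x0, y0 = image_pts[0]
    eps = [0.0] * len(A)  # start from a scaled orthographic approximation
    for _step in range(iterations):
        xs = [u * (1 + e) - x0 for (u, _v), e in zip(image_pts[1:], eps)]
        ys = [v * (1 + e) - y0 for (_u, v), e in zip(image_pts[1:], eps)]
        i, s1 = unit(solve_normal_equations(A, xs))
        j, s2 = unit(solve_normal_equations(A, ys))
        k = unit(cross(i, j))[0]
        z0 = focal / (0.5 * (s1 + s2))           # depth of the reference point
        new_eps = [dot(a, k) / z0 for a in A]    # perspective correction terms
        change = max(abs(n - e) for n, e in zip(new_eps, eps))
        eps = new_eps
        if change < tol:
            break
    return [i, j, k], [x0 * z0 / focal, y0 * z0 / focal, z0]
```

Each pass solves two small least-squares problems and then updates the correction terms `eps`, which is the iterative error correction described above. With noise-free correspondences and an object that is small relative to its distance from the camera, the estimate typically settles after a handful of iterations.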

Another approach is to register a 3D CAD model over the photograph of a known object by optimizing a suitable distance measure with respect to the pose parameters.^[4]^[5] The distance measure is computed between the object in the photograph and the 3D CAD model projection at a given pose. Perspective projection or orthogonal projection is possible depending on the pose representation used. This approach is appropriate for applications where a 3D CAD model of a known object (or object category) is available.
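The idea of scoring a candidate pose by a distance measure can be sketched as follows. This is only an illustration under simplifying assumptions: the distance measure here is a plain mean squared distance between projected model points and observed 2D points (not the illumination-invariant loss of the cited work), the pose is restricted to a single rotation angle about the z axis plus a fixed translation, and all function names are hypothetical.

```python
import math

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def project(points, R, t, focal=None):
    """Project 3D model points at pose (R, t); orthographic if focal is None."""
    image = []
    for p in points:
        X, Y, Z = (sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
        if focal is None:
            image.append((X, Y))                          # orthographic: drop depth
        else:
            image.append((focal * X / Z, focal * Y / Z))  # perspective division
    return image

def pose_cost(model_pts, observed, R, t, focal=None):
    """Mean squared 2D distance between projected and observed points."""
    projected = project(model_pts, R, t, focal)
    return sum((u - x) ** 2 + (v - y) ** 2
               for (u, v), (x, y) in zip(projected, observed)) / len(observed)

def estimate_yaw(model_pts, observed, t, candidates, focal=None):
    """Pick the candidate angle with the lowest cost (exhaustive 1D search)."""
    return min(candidates,
               key=lambda a: pose_cost(model_pts, observed, rot_z(a), t, focal))
```

A practical system would optimize all six pose parameters with a continuous optimizer rather than an exhaustive search over one angle; the sketch only shows how the projection model and the distance measure fit together.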

From a calibrated 2D camera

Given a 2D image of an object and a camera that is calibrated with respect to a world coordinate system, it is also possible to find the pose of the 3D object, which is modeled in its own object coordinate system.^[6] This works as follows.

Extracting 3D from 2D

Starting with a 2D image, image points are extracted which correspond to corners in an image. The projection rays from the image points are reconstructed from the 2D points so that the 3D points, which must be incident with the reconstructed rays, can be determined.
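As a rough sketch of the ray reconstruction, assume a pinhole camera whose calibration is given by focal lengths `fx`, `fy` and principal point `cx`, `cy` in pixels, together with an optional camera-to-world rotation `R_wc` and camera centre. These parameter names are assumptions for illustration and do not come from the cited source.

```python
import math

def pixel_to_ray(u, v, fx, fy, cx, cy, R_wc=None, center=(0.0, 0.0, 0.0)):
    """Return (origin, unit direction) of the projection ray through pixel (u, v)."""
    d = [(u - cx) / fx, (v - cy) / fy, 1.0]   # direction in camera coordinates
    if R_wc is not None:                      # rotate into world coordinates
        d = [sum(R_wc[r][c] * d[c] for c in range(3)) for r in range(3)]
    n = math.sqrt(sum(x * x for x in d))
    return tuple(center), tuple(x / n for x in d)

def point_ray_distance(p, origin, direction):
    """Perpendicular distance from 3D point p to the line carrying the ray."""
    w = [p[i] - origin[i] for i in range(3)]
    s = sum(w[i] * direction[i] for i in range(3))        # projection onto ray
    perp = [w[i] - s * direction[i] for i in range(3)]
    return math.sqrt(sum(x * x for x in perp))
```

A 3D point that projects to a given pixel has zero distance to the ray through that pixel, which is the incidence condition mentioned above.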

Pseudocode

The algorithm for estimating the pose is based on the iterative closest point algorithm. The main idea is to determine the correspondences between 2D image features and points on the 3D model curve.

(a) Reconstruct projection rays from the image points
(b) Estimate the nearest point of each projection ray to a point on the 3D contour
(c) Estimate the pose of the contour with the use of this correspondence set
(d) goto (b)

The above algorithm does not account for images containing an object that is partially occluded. The following algorithm assumes that all contours are rigidly coupled, meaning the pose of one contour defines the pose of another contour.

(a) Reconstruct projection rays from the image points
(b) For each projection ray R:
     (c) For each 3D contour:
          (c1) Estimate the nearest point P1 of ray R to a point on the contour
          (c2) if (n == 1) choose P1 as actual P for the point-line correspondence
          (c3) else compare P1 with P:
                   if dist(P1, R) is smaller than dist(P, R) then
                       choose P1 as new P
     (d) Use (P, R) as correspondence set.
(e) Estimate pose with this correspondence set
(f) Transform contours, goto (b)
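The correspondence search in steps (b) to (d) can be sketched as below. The sketch assumes that each contour is available as a list of sampled 3D points and that each ray is a pair of an origin and a unit direction; both representations, and the function names, are assumptions made for illustration. The pose update of steps (e) and (f) is not shown.

```python
import math

def point_ray_distance(p, ray):
    """Perpendicular distance from 3D point p to the line carrying the ray."""
    origin, direction = ray
    w = [p[i] - origin[i] for i in range(3)]
    s = sum(w[i] * direction[i] for i in range(3))
    perp = [w[i] - s * direction[i] for i in range(3)]
    return math.sqrt(sum(x * x for x in perp))

def build_correspondences(rays, contours):
    """For each ray, keep the closest sampled point over all contours."""
    pairs = []
    for ray in rays:                                   # step (b)
        best = None
        for contour in contours:                       # step (c)
            # (c1) nearest sampled point of this contour to the ray
            p1 = min(contour, key=lambda p: point_ray_distance(p, ray))
            # (c2)/(c3) keep it if it is the first or the closest so far
            if best is None or point_ray_distance(p1, ray) < point_ray_distance(best, ray):
                best = p1
        pairs.append((best, ray))                      # step (d)
    return pairs
```

Because every ray is matched against all contours and only the closest point is kept, a ray is paired with a single contour, and the remaining contours follow through the rigid coupling.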

Estimating pose through comparison

Some systems estimate pose by comparing an input image against a database of images of an object at different rotations and translations. The accuracy of these systems is limited to situations that are represented in their database of images; however, the goal is to recognize a pose rather than determine it.^[7]
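A minimal sketch of such a lookup is shown below, assuming that every database image has already been reduced to a numeric feature vector and labeled with its pose. The feature representation, the squared Euclidean distance, and the function names are hypothetical choices made for illustration, not the method of the cited report.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize_pose(query_features, database):
    """Return the pose label of the database entry closest to the query.

    `database` is a list of (feature_vector, pose) pairs.
    """
    _features, pose = min(
        database, key=lambda entry: squared_distance(entry[0], query_features))
    return pose
```

The result can only ever be one of the stored poses, which reflects the limitation noted above: a pose that is not represented in the database is mapped to its nearest stored neighbour.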

Software

posest, a GPL C/C++ library for 6DoF pose estimation from 3D-2D correspondences.
diffgeom2pose, a fast MATLAB solver for 6DoF pose estimation from only two 3D-2D correspondences of points with directions (vectors), or points on curves (point-tangents). The points can be SIFT features attributed with feature directions.
MINUS, a C++ package for (relative) pose estimation from three views. It includes the case of three corresponding points with lines at these points (as in feature positions and orientations, or curve points with tangents), and the case of three corresponding points and one line correspondence.
Nvidia FoundationPose, a unified deep learning foundation model for 6DoF object pose estimation and tracking, supporting both 3D model-based and 3D model-free setups.

References

[1] Javier Barandiaran (28 December 2017). "POSIT tutorial". OpenCV.
[2] Daniel F. Dementhon; Larry S. Davis (1995). "Model-based object pose in 25 lines of code". International Journal of Computer Vision. 15 (1–2): 123–141. doi:10.1007/BF01450852. S2CID 14501637. Retrieved 2010-05-29.
[3] Javier Barandiaran. "POSIT tutorial with OpenCV and OpenGL". Archived from the original on 20 June 2010. Retrieved 29 May 2010.
[4] Srimal Jayawardena; Marcus Hutter; Nathan Brewer (2011). "A Novel Illumination-Invariant Loss for Monocular 3D Pose Estimation". 2011 International Conference on Digital Image Computing: Techniques and Applications. pp. 37–44. CiteSeerX 10.1.1.766.3931. doi:10.1109/DICTA.2011.15. ISBN 978-1-4577-2006-2. S2CID 17296505.
[5] Srimal Jayawardena; Di Yang; Marcus Hutter (2011). "3D Model Assisted Image Segmentation". 2011 International Conference on Digital Image Computing: Techniques and Applications. pp. 51–58. CiteSeerX 10.1.1.751.8774. doi:10.1109/DICTA.2011.17. ISBN 978-1-4577-2006-2. S2CID 1665253.
[6] Bodo Rosenhahn. "Foundations about 2D-3D Pose Estimation". CV Online. Retrieved 2008-06-09.
[7] Vassilis Athitsos; Stan Sclaroff (April 1, 2003). Estimating 3D Hand Pose from a Cluttered Image (PDF) (Technical report). Boston University Computer Science Tech. Archived from the original (PDF) on 2019-07-31.

Bibliography

Rosenhahn, B. "Foundations about 2D-3D Pose Estimation."
Rosenhahn, B. "Pose Estimation of 3D Free-form Contours in Conformal Geometry."
Athitsos, V. "Estimating 3D Hand Pose from a Cluttered Image."

External links

Estimación de una Postura 3D

