Match moving

In visual effects, match moving is a technique that allows the insertion of 2D elements, other live-action elements, or computer graphics (CG) into live-action footage with correct position, scale, orientation, and motion relative to the photographed objects in the shot. It also allows for the removal of live-action elements from the shot. The term is used loosely to describe several different methods of extracting camera motion information from a motion picture. Sometimes referred to as motion tracking or camera solving, match moving is related to rotoscoping and photogrammetry. Match moving is sometimes confused with motion capture, which records the motion of objects, often human actors, rather than the camera. Typically, motion capture requires special cameras and sensors and a controlled environment (although recent developments such as the Kinect camera and Apple's Face ID have begun to change this). Match moving is also distinct from motion control photography, which uses mechanical hardware to execute multiple identical camera moves. Match moving, by contrast, is typically a software-based technology, applied after the fact to normal footage recorded in uncontrolled environments with an ordinary camera.

Match moving is primarily used to track the movement of a camera through a shot so that an identical virtual camera move can be reproduced in a 3D animation program. When new animated elements are composited back into the original live-action shot, they will appear in perfectly matched perspective and therefore appear seamless.

As it is mostly software-based, match moving has become increasingly affordable as the cost of computer power has declined; it is now an established visual-effects tool and is even used in live television broadcasts as part of providing effects such as the yellow virtual down-line in American football.

Principle

The process of match moving can be broken down into two steps.

Tracking

The first step is identifying and tracking features. A feature is a specific point in the image that a tracking algorithm can lock onto and follow through multiple frames (SynthEyes calls them blips). Features are often selected because they are bright or dark spots, edges, or corners, depending on the particular tracking algorithm. Popular programs use template matching based on a normalized cross-correlation (NCC) score and root-mean-square (RMS) error. What is important is that each feature represents a specific point on the surface of a real object. As a feature is tracked, it becomes a series of two-dimensional coordinates that represent the position of the feature across a series of frames. This series is referred to as a "track". Once tracks have been created, they can be used immediately for 2-D motion tracking or used to calculate 3-D information.
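As a rough illustration of template matching, the sketch below follows one feature from one grayscale frame to the next by maximizing an NCC score over a small search window. It is a minimal pure-Python sketch and not the algorithm of any particular program; the function names (ncc, crop, track_feature), the list-of-rows image layout, and the square template are assumptions made for this example.

```python
import math

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized grayscale patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a flat patch has no texture to correlate with
    return sum(x * y for x, y in zip(da, db)) / denom

def crop(image, top, left, size):
    """Cut a size x size patch whose top-left pixel is (left, top)."""
    return [row[left:left + size] for row in image[top:top + size]]

def track_feature(image, template, prev_xy, radius):
    """Find the patch position near prev_xy that best matches the template.

    prev_xy is the (x, y) of the template's top-left corner in the previous
    frame; the search covers +/- radius pixels and stays inside the image.
    """
    size = len(template)
    height, width = len(image), len(image[0])
    px, py = prev_xy
    best_score, best_xy = -2.0, prev_xy  # NCC is always >= -1
    for y in range(max(0, py - radius), min(height - size, py + radius) + 1):
        for x in range(max(0, px - radius), min(width - size, px + radius) + 1):
            score = ncc(template, crop(image, y, x, size))
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

Calling track_feature once per frame, each time starting from the previous result, yields the series of 2-D coordinates that the text calls a track. A low best score is a hint that the feature was lost or occluded.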

Calibration

The second step involves solving for 3D motion. This process attempts to derive the motion of the camera by solving the inverse-projection of the 2-D paths for the position of the camera. This process is referred to as calibration.

When a point on the surface of a three-dimensional object is photographed, its position in the 2-D frame can be calculated by a 3-D projection function. We can consider a camera to be an abstraction that holds all the parameters necessary to model a camera in a real or virtual world. Therefore, a camera is a vector that includes as its elements the position of the camera, its orientation, focal length, and other possible parameters that define how the camera focuses light onto the film plane. Exactly how this vector is constructed is not important as long as there is a compatible projection function P.

The projection function P takes as its input a camera vector (denoted camera) and another vector, the position of a 3-D point in space (denoted xyz), and returns a 2-D point that has been projected onto a plane in front of the camera (denoted XY). We can express this:

XY = P(camera, xyz)
An illustration of feature projection. Around the rendering of a 3-D structure, red dots represent points that are chosen by the tracking process. Cameras at frame i and j project the view onto a plane depending on the parameters of the camera. In this way features tracked in 2-D correspond to real points in a 3D space. Although this particular illustration is computer-generated, match moving is normally done on real objects.
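One possible concrete form of P is the ideal pinhole camera. The sketch below assumes a camera vector made of a position, a 3x3 world-to-camera rotation matrix, and a focal length; that layout, the function name project, and the choice of +z as the viewing direction are assumptions for illustration, since the text leaves the construction of the camera vector open.

```python
def project(camera, xyz):
    """P(camera, xyz) -> XY for an ideal pinhole camera looking down +z.

    camera is (position, rotation, focal), where rotation is a 3x3
    world-to-camera matrix given as three rows.
    """
    position, rotation, focal = camera
    # Express the point relative to the camera, then rotate into camera axes.
    d = [xyz[k] - position[k] for k in range(3)]
    p_cam = [sum(rotation[r][k] * d[k] for k in range(3)) for r in range(3)]
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Dividing by depth is the step that discards depth information.
    return (focal * p_cam[0] / p_cam[2], focal * p_cam[1] / p_cam[2])
```

With an identity rotation, a camera at the origin, and a focal length of 2, the points (1, 2, 4) and (2, 4, 8) both project to (0.5, 1.0): two different depths, one image position.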

The projection function transforms the 3-D point and strips away the component of depth. Without knowing the depth, an inverse projection function can only return a set of possible 3-D points that form a line emanating from the nodal point of the camera lens and passing through the projected 2-D point. We can express the inverse projection as:

xyz ∈ P'(camera, XY)

or

{xyz:P(camera, xyz) = XY}
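For a pinhole camera, the set P'(camera, XY) is a ray, which can be represented by an origin and a direction. The sketch below assumes the same hypothetical camera layout as a (position, rotation, focal) triple with a world-to-camera rotation matrix; inverse_project and point_on_ray are illustrative names, not part of any match moving package.

```python
def inverse_project(camera, XY):
    """Return (origin, direction) of the ray of all xyz with P(camera, xyz) = XY."""
    position, rotation, focal = camera
    # In camera space the ray passes through (X, Y, focal) on the image plane.
    d_cam = (XY[0], XY[1], focal)
    # Rotate back to world space with the transpose of the rotation matrix.
    direction = [sum(rotation[r][k] * d_cam[r] for r in range(3)) for k in range(3)]
    return list(position), direction

def point_on_ray(ray, depth):
    """Pick one candidate 3-D point; every depth > 0 projects to the same XY."""
    origin, direction = ray
    return [origin[k] + depth * direction[k] for k in range(3)]
```

Every positive depth gives a different candidate xyz, which is why a single 2-D observation cannot fix the position of the point.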

Let's say we are in a situation where the features we are tracking are on the surface of a rigid object such as a building. Since we know that the real point xyz will remain in the same place in real space from one frame of the image to the next, we can treat the point as a constant even though we do not know where it is. So:

xyz_i = xyz_j

where the subscripts i and j refer to arbitrary frames in the shot we are analyzing. Since this is always true then we know that:

P'(camera_i, XY_i) ∩ P'(camera_j, XY_j) ≠ {}

Because the value of XY_i has been determined by the tracking program for all frames that the feature is tracked through, we can solve the reverse projection function between any two frames as long as P'(camera_i, XY_i) ∩ P'(camera_j, XY_j) is a small set. The set of possible camera vector pairs that solve the equation at i and j is denoted C_{i,j}:

C_{i,j} = {(camera_i, camera_j) : P'(camera_i, XY_i) ∩ P'(camera_j, XY_j) ≠ {}}

So there is a set of camera vector pairs C_{i,j} for which the intersection of the inverse projections of two points XY_i and XY_j is a non-empty, hopefully small, set centering on a theoretical stationary point xyz.

In other words, imagine a black point floating in a white void and a camera. For any position in space that we place the camera, there is a set of corresponding parameters (orientation, focal length, etc.) that will photograph that black point exactly the same way. Since C has an infinite number of members, one point is never enough to determine the actual camera position.

As we start adding tracking points, we can narrow the possible camera positions. For example, suppose we have a set of points {xyz_{i,0}, ..., xyz_{i,n}} and {xyz_{j,0}, ..., xyz_{j,n}}, where i and j still refer to frames and n is an index to one of many tracking points we are following. From these we can derive a set of camera vector pair sets {C_{i,j,0}, ..., C_{i,j,n}}.

In this way multiple tracks allow us to narrow the possible camera parameters. The set of possible camera parameters that fit, F, is the intersection of all sets:

F = C_{i,j,0} ∩ ... ∩ C_{i,j,n}

The fewer elements there are in this set, the closer we can come to extracting the actual parameters of the camera. In reality, errors introduced by the tracking process require a more statistical approach to determining a good camera vector for each frame; optimization algorithms and bundle block adjustment are often used. Unfortunately, there are so many elements to a camera vector that when every parameter is free we still might not be able to narrow F down to a single possibility, no matter how many features we track. The more we can restrict the various parameters, especially focal length, the easier it becomes to pinpoint the solution.
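The statistical approach usually means minimizing reprojection error: the distance between where a candidate camera projects each point and where the tracker actually saw it. Real solvers adjust all cameras and points together; the toy sketch below only shows the cost being minimized, under strong assumptions that do not hold in practice (known 3-D points, a known camera position, no rotation, and a focal length chosen from a short list of candidates). All names are hypothetical.

```python
import math

def project_simple(position, focal, xyz):
    """Pinhole projection for an unrotated camera at `position` looking down +z."""
    x, y, z = (xyz[k] - position[k] for k in range(3))
    return (focal * x / z, focal * y / z)

def rms_reprojection_error(position, focal, points, observations):
    """RMS distance between projected points and their tracked 2-D positions."""
    total = 0.0
    for xyz, (u, v) in zip(points, observations):
        pu, pv = project_simple(position, focal, xyz)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(points))

def best_focal(position, candidates, points, observations):
    """Pick the candidate focal length with the lowest reprojection error."""
    return min(
        candidates,
        key=lambda f: rms_reprojection_error(position, f, points, observations),
    )
```

Restricting the search to one free parameter is what makes this toy version trivial, which mirrors the point in the text: the more parameters are constrained in advance, the easier the solution is to pin down.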

In all, the 3D solving process is the process of narrowing down the possible solutions to the motion of the camera until we reach one that suits the needs of the composite we are trying to create.

Point-cloud projection

Once the camera position has been determined for every frame, it is possible to estimate the position of each feature in real space by inverse projection. The resulting set of points is often referred to as a point cloud because of its raw, nebula-like appearance. Since point clouds often reveal some of the shape of the 3-D scene, they can be used as a reference for placing synthetic objects or by a reconstruction program to create a 3-D version of the actual scene.
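One simple way to estimate a feature's position from two solved cameras is to intersect the two inverse-projection rays. Because of tracking noise the rays rarely meet exactly, so the sketch below returns the midpoint of the shortest segment between them. This midpoint method is only one possible choice, picked here for brevity, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(origin_i, dir_i, origin_j, dir_j):
    """Estimate xyz as the midpoint of the closest points on two rays."""
    w = [origin_i[k] - origin_j[k] for k in range(3)]
    a, b, c = dot(dir_i, dir_i), dot(dir_i, dir_j), dot(dir_j, dir_j)
    d, e = dot(dir_i, w), dot(dir_j, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays: the camera did not move enough to give parallax.
        raise ValueError("rays are parallel; cannot triangulate")
    s = (b * e - c * d) / denom  # parameter along the ray from frame i
    t = (a * e - b * d) / denom  # parameter along the ray from frame j
    p_i = [origin_i[k] + s * dir_i[k] for k in range(3)]
    p_j = [origin_j[k] + t * dir_j[k] for k in range(3)]
    return [(p_i[k] + p_j[k]) / 2.0 for k in range(3)]
```

Repeating this for every tracked feature produces the point cloud. The parallel-ray case shows why shots with little camera translation are hard to solve in 3-D.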

Ground-plane determination

The camera and point cloud need to be oriented in some kind of space. Therefore, once calibration is complete, it is necessary to define a ground plane. Normally, this is a unit plane that determines the scale, orientation and origin of the projected space. Some programs attempt to do this automatically, though more often the user defines this plane. Since shifting the ground plane simply applies the same transformation to all of the points, the actual position of the plane is really a matter of convenience.
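The sketch below illustrates the origin and scale parts of that transformation: one solved point is chosen as the origin, and two solved points with a known real-world separation fix the scale. Orientation, which would need a rotation as well, is deliberately left out to keep the example short. The function name and its parameters are assumptions for illustration.

```python
import math

def set_origin_and_scale(points, origin, ref_a, ref_b, real_distance):
    """Translate points so `origin` becomes (0, 0, 0) and scale uniformly.

    ref_a and ref_b are two solved points whose true separation,
    real_distance, is known (for example, measured on set).
    """
    solved_distance = math.dist(ref_a, ref_b)
    if solved_distance == 0.0:
        raise ValueError("reference points coincide")
    scale = real_distance / solved_distance
    return [tuple(scale * (p[k] - origin[k]) for k in range(3)) for p in points]
```

The same call would be applied to the solved camera positions, so the cameras and the point cloud stay consistent with each other.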

Reconstruction

3-D reconstruction is the interactive process of recreating a photographed object using tracking data. This technique is related to photogrammetry. In this particular case we are referring to using match moving software to reconstruct a scene from incidental footage.

A reconstruction program can create three-dimensional objects that mimic the real objects from the photographed scene. Using data from the point cloud and the user's estimation, the program can create a virtual object and then extract a texture from the footage that can be projected onto the virtual object as a surface texture.

2-D vs. 3-D

Match moving has two forms. Some compositing programs, such as Shake, Adobe Substance, Adobe After Effects, and Discreet Combustion, include two-dimensional motion tracking capabilities. Two-dimensional match moving only tracks features in two-dimensional space, without regard to camera movement or distortion. It can be used to add motion blur or image stabilization effects to footage. This technique is sufficient to create realistic effects when the original footage does not include major changes in camera perspective. For example, a billboard deep in the background of a shot can often be replaced using two-dimensional tracking.
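As a small example of what a 2-D track can be used for, the sketch below turns a single track into per-frame offsets that hold the feature at its first-frame position, which is the simplest (translation-only) form of stabilization. The function name is hypothetical, and real tools typically also handle rotation and scale using more than one track.

```python
def stabilization_offsets(track):
    """Per-frame (dx, dy) shifts that pin the tracked feature to frame 0.

    track is a list of (x, y) positions, one per frame.
    """
    ref_x, ref_y = track[0]
    return [(ref_x - x, ref_y - y) for x, y in track]
```

Applying the negated offsets to an inserted element instead makes that element follow the feature, which is the billboard-replacement case in its simplest form.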

Three-dimensional match moving tools make it possible to extrapolate three-dimensional information from two-dimensional photography. These tools allow users to derive camera movement and other relative motion from arbitrary footage. The tracking information can be transferred to computer graphics software and used to animate virtual cameras and simulated objects. Programs capable of 3-D match moving include:

Automatic vs. interactive tracking

There are two methods by which motion information can be extracted from an image. Interactive tracking, sometimes referred to as "supervised tracking", relies on the user to follow features through a scene. Automatic tracking relies on computer algorithms to identify and track features through a shot. The tracked points' movements are then used to calculate a "solution". This solution is composed of all the camera's information, such as the motion, focal length, and lens distortion.

The advantage of automatic tracking is that the computer can create many points faster than a human can. A large number of points can be analyzed with statistics to determine the most reliable data. The disadvantage of automatic tracking is that, depending on the algorithm, the computer can be easily confused as it tracks objects through the scene. Automatic tracking methods are particularly ineffective in shots involving fast camera motion such as that seen with hand-held camera work and in shots with repetitive subject matter like small tiles or any sort of regular pattern where one area is not very distinct. This tracking method also suffers when a shot contains a large amount of motion blur, making the small details it needs harder to distinguish.
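One way such statistics might be applied is to score every automatic track by its error and discard the ones that are far worse than the rest. The sketch below uses the median absolute deviation (MAD) as the yardstick; this particular rule, the threshold factor k, and the floor value are assumptions chosen for illustration, not the method of any specific tracker.

```python
from statistics import median

def reject_unreliable_tracks(errors, k=3.0, floor=1e-6):
    """Keep tracks whose error is not far above the median error.

    errors maps a track name to a single error value (for example, its
    RMS reprojection error). Tracks more than k MADs above the median
    are dropped; floor avoids a zero threshold when all errors agree.
    """
    med = median(errors.values())
    mad = median(abs(e - med) for e in errors.values())
    threshold = max(k * mad, floor)
    return {name: e for name, e in errors.items() if e - med <= threshold}
```

Because the median is insensitive to a few very bad tracks, this kind of filter works best when good tracks are the majority, which is the situation automatic tracking with many points is meant to create.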

The advantage of interactive tracking is that a human user can follow features through an entire scene and will not be confused by features that are not rigid. A human user can also determine where features are in a shot that suffers from motion blur; it is extremely difficult for an automatic tracker to correctly find features with high amounts of motion blur. The disadvantage of interactive tracking is that the user will inevitably introduce small errors as they follow objects through the scene, which can lead to what is called "drift".

Professional-level motion tracking is usually achieved using a combination of interactive and automatic techniques. An artist can remove points that are clearly anomalous and use "tracking mattes" to block confusing information out of the automatic tracking process. Tracking mattes are also employed to cover areas of the shot which contain moving elements such as an actor or a spinning ceiling fan.

Tracking mattes

A tracking matte is similar in concept to a garbage matte used in traveling matte compositing. However, the purpose of a tracking matte is to prevent tracking algorithms from using unreliable, irrelevant, or non-rigid tracking points. For example, in a scene where an actor walks in front of a background, the tracking artist will want to use only the background to track the camera through the scene, knowing that motion of the actor will throw off the calculations. In this case, the artist will construct a tracking matte to follow the actor through the scene, blocking that information from the tracking process.
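In code terms, a tracking matte can be thought of as a per-frame mask that filters feature positions before they reach the solver. The sketch below assumes the matte is a list of rows of booleans, with True marking blocked pixels; that representation and the function name are assumptions for this example.

```python
def apply_tracking_matte(features, matte):
    """Return only the features that fall on unblocked pixels.

    features is a list of (x, y) positions in one frame; matte[y][x] is
    True where the artist has blocked tracking. Features outside the
    frame are dropped as well.
    """
    kept = []
    for x, y in features:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < len(matte) and 0 <= col < len(matte[0])
        if inside and not matte[row][col]:
            kept.append((x, y))
    return kept
```

For the walking-actor example, the matte would be animated to follow the actor, and this filter would run once per frame with that frame's matte.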

Refining

Since there are often multiple possible solutions to the calibration process and a significant amount of error can accumulate, the final step to match moving often involves refining the solution by hand. This could mean altering the camera motion itself or giving hints to the calibration mechanism. This interactive calibration is referred to as "refining".

Most match moving applications are based on similar algorithms for tracking and calibration. Often, the initial results obtained are similar. However, each program has different refining capabilities.

Real time

On-set, real-time camera tracking is becoming more widely used in feature film production to allow elements that will be inserted in post-production to be visualised live on-set. This has the benefit of helping the director and actors improve performances by actually seeing set extensions or CGI characters whilst (or shortly after) they do a take. They no longer need to perform to green or blue screens with no feedback on the end result. Eye-line references, actor positioning, and CGI interaction can now be done live on-set, giving everyone confidence that the shot is correct and will work in the final composite.

To achieve this, a number of components from hardware to software need to be combined. Software collects the camera's movement in all six degrees of freedom, as well as metadata such as zoom, focus, iris and shutter elements, from many different types of hardware devices. These range from motion capture systems, such as the active LED-marker-based system from PhaseSpace and passive systems such as Motion Analysis or Vicon, to rotary encoders fitted to camera cranes and dollies such as Technocranes and Fisher Dollies, and inertial and gyroscopic sensors mounted directly to the camera. There are also laser-based tracking systems that can be attached to anything, including Steadicams, to track cameras outside in the rain at distances of up to 30 meters.

Motion control cameras can also be used as a source or destination for 3-D camera data. Camera moves can be pre-visualised in advance and then converted into motion control data that drives a camera crane along precisely the same path as the 3-D camera. Encoders on the crane can also be used in real time on-set to reverse this process and generate live 3-D cameras. The data can be sent to any number of different 3-D applications, allowing 3-D artists to modify their CGI elements live on set as well. The main advantage is that set design issues that would be time-consuming and costly to fix later can be sorted out during the shooting process, ensuring that the actors "fit" within each environment for each shot whilst they do their performances.

Real-time motion capture systems can also be mixed into the camera data stream, allowing virtual characters to be inserted into live shots on-set. This dramatically improves the interaction between real and non-real MoCap-driven characters, as both plate and CGI performances can be choreographed together.
