Volumetric capture

Volumetric capture or volumetric video is a technique that captures a three-dimensional space, such as a location or performance. [1] This type of volumography acquires data that can be viewed on flat screens as well as with 3D displays and VR headsets. Consumer-facing formats are numerous, and the required motion capture techniques lean on computer graphics, photogrammetry, and other computation-based methods. The viewer generally experiences the result in a real-time engine and has direct input in exploring the generated volume.

History

A multi-camera setup recording a "bullet time" effect

Recording talent without the limitation of a flat screen has been depicted in science fiction for a long time. Holograms and 3D real-world visuals have featured prominently in Star Wars, Blade Runner, and many other science-fiction productions over the years. Through growing advancements in the fields of computer graphics, optics, and data processing, this fiction has slowly evolved into a reality. Volumetric video is the logical next step after stereoscopic movies and 360° videos in that it combines the visual quality of photography with the immersion and interactivity of spatialized content, and it could prove to be the most important development in recording human performance since the creation of contemporary cinema.

Computer graphics and VFX

Creating 3D models from video, photography, and other ways of measuring the world has always been an important topic in computer graphics. The ultimate goal is to imitate reality in minute detail while giving creatives the power to build worlds atop this foundation to match their vision. Traditionally, artists create these worlds using modeling and rendering techniques developed over decades since the birth of computer graphics. Visual effects in movies and video games paved the way for advances in photogrammetry, scanning devices, and the computational backend needed to handle the data these intensive methods produce. Generally, these advances have emerged as by-products of the push for more sophisticated visuals in entertainment and media rather than as goals of the field itself.

LIDAR

Leica HDS-3000 LIDAR

LIDAR scanning is a survey method that uses densely packed laser-sampled points to scan static objects into a point cloud. It requires physical scanners and produces enormous amounts of data. In 2007 the band Radiohead used it extensively to create the music video for "House of Cards", capturing point-cloud performances of the singer's face and of select environments in one of the first uses of this technology for volumetric capture. Director James Frost collaborated with media artist Aaron Koblin to capture the 3D point clouds used for the clip; while the final output was still a rendered, flat representation of the data, the capture approach and the authors' mindset were already ahead of their time. Point clouds, being distinct samples of three-dimensional space carrying position and color, create a high-fidelity representation of the real world at the cost of a huge amount of data. Viewing this data in real time, however, was not yet possible.

Structured light

Xbox One's Kinect

In 2010 Microsoft brought the Kinect to market, a consumer product that used structured light in the infrared spectrum to generate a depth image of the scene from its camera. While the intent was to innovate in user input and gameplay, it was very quickly adopted as a generic capture device for 3D data in the volumetric capture community. By projecting a known pattern onto the space and capturing how objects in the scene distort it, the resulting capture can be computed into different outputs. Artists and hobbyists started to build tools and projects around the affordable device, sparking a growing interest in volumetric capture as a creative medium.
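
A common way such depth captures were repurposed is to back-project every depth pixel into a 3D point. The following is a minimal sketch in Python with NumPy; the pinhole intrinsics and the constant synthetic depth image are illustrative assumptions, not values from an actual Kinect.

```python
# Minimal sketch: turn a Kinect-style depth image into a 3D point cloud by
# back-projecting each pixel through an assumed pinhole camera model.
import numpy as np

fx, fy = 580.0, 580.0        # assumed focal lengths in pixels
cx, cy = 320.0, 240.0        # assumed principal point for a 640x480 depth image

depth = np.full((480, 640), 2.0, dtype=np.float32)   # stand-in depth map, metres

# Pixel grid -> camera-space rays -> 3D points scaled by the measured depth.
v, u = np.mgrid[0:480, 0:640]
x = (u - cx) / fx * depth
y = (v - cy) / fy * depth
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
print(points.shape)   # (307200, 3): one point per depth pixel
```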

Researchers at Microsoft then constructed an entire capture stage using multiple cameras, Kinect devices, and algorithms that generated a full volumetric capture from the combined optical and depth information. This became the Microsoft Mixed Reality Capture Studio, used today both by their research division and in select commercial experiences such as the Blade Runner 2049 VR experience. There are currently three studios in operation: Redmond, WA; San Francisco, CA; and London, England. While this remains a setup aimed at the high-end market, the affordable price of a single Kinect device led more experimental artists and independent directors to become active in the volumetric capture field. [2] Two results of this activity are Depthkit and EF EVE™. EF EVE™ supports an unlimited number of Azure Kinect sensors on one PC, providing full volumetric capture with a simple setup, along with automatic sensor calibration and VFX functionality. Depthkit is a software suite that captures geometry data with a single structured light sensor, including the Azure Kinect, [3] as well as high-quality color detail from an attached witness camera.

Photogrammetry

3D animation of a photogrammetric model of Trim Castle

Photogrammetry describes the process of measuring data from photographic reference. While the technique is as old as photography itself, advances in volumetric capture research have made it possible to recover ever more geometry and texture detail from a large number of input images. The result is usually split into two composited sources: static geometry and full performance capture. For static geometry, sets are captured with a large number of overlapping digital images, which are aligned to each other using shared features and used as the basis for triangulation and depth estimation. This information is interpreted as 3D geometry, resulting in a near-perfect replica of the set. Full performance capture, by contrast, uses an array of synchronized video cameras to capture real-time information; frame by frame, a set of points or geometry is generated that can be played back at speed, producing a full volumetric performance capture that can be composited into any environment. In 2008, 4DViews [4] installed an early volumetric video capture system at the DigiCast studio in Tokyo, Japan. 8i followed in 2015, and more recently Intel, Microsoft, [5] and Samsung [6] have built their own capture stages for performance capture and photogrammetry.
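
The triangulation step at the heart of photogrammetric reconstruction can be illustrated with two calibrated views. Below is a minimal sketch of linear (DLT) triangulation using NumPy; the camera matrices and the test point are synthetic assumptions rather than data from a real capture rig.

```python
# Minimal sketch of two-view linear triangulation (DLT): recover a 3D point
# from its pixel coordinates in two calibrated views.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Solve for the 3D point whose projections match x1 in view 1 and x2 in view 2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous least-squares solution: right singular vector of the smallest singular value.
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])    # assumed shared intrinsics
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # first camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0], [0]])])   # second camera 0.5 m to the right

point = np.array([0.2, 0.1, 2.0, 1.0])                          # known 3D point (homogeneous)
x1 = (P1 @ point)[:2] / (P1 @ point)[2]                         # its projection in view 1
x2 = (P2 @ point)[:2] / (P2 @ point)[2]                         # its projection in view 2
print(triangulate(P1, P2, x1, x2))                              # -> approx [0.2 0.1 2.0]
```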

Virtual reality

Virtual reality headset

As volumetric video developed into a commercially applicable approach to environment and performance capture, the ability to move about the results with six degrees of freedom and true stereoscopy required a new type of display device. With the rise of consumer-facing VR in 2016 through devices such as the Oculus Rift and HTC Vive, this suddenly became possible. Stereoscopic viewing and the ability to rotate and move the head, as well as move within a small space, allow immersion into environments well beyond what was possible in the past. The photographic nature of the captures, combined with this immersion and the resulting interactivity, is a large step toward the holy grail of true virtual reality. With the rise of 360° video content, demand for capture with six degrees of freedom is growing, and VR in particular drives applications for this technology, slowly fusing cinema, games, and art with the field of volumetric capture research.

Light fields

Lytro Illum camera, a second-generation light field camera

A light field describes, at a given sample point, the incoming light from all directions. This information can be used in post-processing to generate effects such as depth of field, and it allows the user to move their head slightly. Since 2006, Lytro has developed consumer-facing cameras for capturing light fields. Fields can be captured inside-out with a camera or generated outside-in from renderings of 3D geometry, and they represent a huge amount of information ready to be manipulated. Data rates remain a large issue, but the technique has significant potential because it samples light directly and can display the result in a variety of ways.

Another by-product of this technique is a reasonably accurate depth map of the scene, meaning each pixel carries information about its distance from the camera. Facebook used this idea in its Surround360 camera family to capture 360° video footage that is stitched with the help of depth maps. Extracting this raw data is possible and allows a high-resolution capture of any stage. Again, the data rates and the fidelity of the depth maps are major bottlenecks, but they may be overcome with more advanced depth estimation techniques, compression, and parametric light fields.
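
The relationship underlying such depth maps is the classic stereo formula: depth equals focal length times baseline divided by disparity. A minimal sketch, with all values as illustrative assumptions:

```python
# Minimal sketch of converting per-pixel stereo disparity into per-pixel depth,
# the kind of depth map used when stitching multi-camera footage.
import numpy as np

focal_px = 800.0     # assumed focal length in pixels
baseline_m = 0.10    # assumed distance between the two cameras, in metres

disparity_px = np.array([[40.0, 20.0],
                         [10.0,  5.0]])   # toy 2x2 disparity map

# depth = focal_length * baseline / disparity (valid where disparity > 0)
depth_m = focal_px * baseline_m / disparity_px
print(depth_m)   # [[ 2.  4.] [ 8. 16.]] metres
```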

Workflows

Different workflows for generating volumetric video are currently available. These are not mutually exclusive and are often used effectively in combination. The following are two of them:

Mesh-based

This approach generates a more traditional 3D triangle mesh, similar to the geometry used in computer games and visual effects. The data volume is usually smaller, but quantizing real-world data into a lower-resolution representation limits resolution and visual fidelity. Trade-offs are generally made between mesh density and final experience performance.

Photogrammetry is usually used as the base for static meshes and is then augmented with performance capture of talent via the same underlying technology, videogrammetry. Intensive clean-up is required to create the final set of triangles. To extend beyond the physical world, CG techniques can be deployed to further enhance the captured data, employing artists to build onto and into the static mesh as necessary. Playback is usually handled by a real-time engine and resembles a traditional game pipeline in implementation, allowing interactive lighting changes and flexible ways of compositing static and animated meshes together.
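
The trade-off between mesh density and playback performance is commonly managed by decimating the mesh. Below is a minimal sketch of vertex-clustering decimation with NumPy; the grid-based clustering and the toy planar mesh are illustrative assumptions, not the method used by any particular capture studio.

```python
# Minimal sketch of vertex-clustering decimation: merge all vertices that fall
# into the same grid cell and drop triangles that collapse as a result.
import numpy as np

def decimate(vertices, triangles, cell_size):
    """Collapse vertices sharing a grid cell into one averaged vertex."""
    cells = np.floor(vertices / cell_size).astype(np.int64)
    # Map every vertex to the index of its occupied cell.
    _, remap, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    remap = remap.ravel()
    new_vertices = np.zeros((counts.size, 3))
    np.add.at(new_vertices, remap, vertices)
    new_vertices /= counts[:, None]
    # Re-index the triangles and drop any that collapsed to an edge or a point.
    new_tris = remap[triangles]
    keep = ((new_tris[:, 0] != new_tris[:, 1]) &
            (new_tris[:, 1] != new_tris[:, 2]) &
            (new_tris[:, 0] != new_tris[:, 2]))
    return new_vertices, new_tris[keep]

# Toy usage: a dense planar grid reduced to a much coarser mesh.
n = 50
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
vertices = np.stack([xs.ravel(), ys.ravel(), np.zeros(n * n)], axis=1)
idx = np.arange(n * n).reshape(n, n)
tris = np.concatenate([
    np.stack([idx[:-1, :-1].ravel(), idx[1:, :-1].ravel(), idx[:-1, 1:].ravel()], axis=1),
    np.stack([idx[1:, :-1].ravel(), idx[1:, 1:].ravel(), idx[:-1, 1:].ravel()], axis=1),
])
v2, t2 = decimate(vertices, tris, cell_size=0.1)
print(len(vertices), "->", len(v2), "vertices;", len(tris), "->", len(t2), "triangles")
```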

Point-based

Recently the spotlight has shifted towards point-based volumetric capture. The resulting data is represented as points or particles in 3D space that carry attributes such as color and point size. This allows for greater information density and higher-resolution content. The required data rates are high, and current graphics hardware, being optimized for mesh-based rendering pipelines, is not well suited to rendering points directly.

The main advantage of points is the potential for higher spatial resolution. Points can either be scattered onto triangle meshes with pre-computed lighting or used directly from a LIDAR scanner. [7] Talent performance is captured in the same way as in the mesh-based approach, but more time and computational power can be spent at production time to further improve the data. At playback, level of detail can be used to manage the computational load on the playback device by increasing or decreasing the amount of geometry rendered. [8] Interactive lighting changes are harder to realize because the bulk of the data is pre-baked: the lighting information stored with the points is accurate and high-fidelity, but it cannot easily be changed for a given situation. Another benefit of point capture is that computer graphics can be rendered at very high quality and also stored as points, opening the door to a seamless blend of real and imagined elements.
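
One simple way to realize level of detail for point data is to store the points in a pre-shuffled order and draw only as large a prefix as the playback device's frame budget allows, so that any prefix is a uniform subsample. A minimal sketch, with point counts and budgets as illustrative assumptions:

```python
# Minimal sketch of a prefix-based level-of-detail scheme for a point cloud.
import numpy as np

rng = np.random.default_rng(0)
num_points = 2_000_000
positions = rng.random((num_points, 3), dtype=np.float32)        # xyz placeholders
colors = rng.integers(0, 256, (num_points, 3), dtype=np.uint8)   # rgb placeholders

# Shuffle once at production time so any prefix is a uniform subsample.
order = rng.permutation(num_points)
positions, colors = positions[order], colors[order]

# At playback, pick the level of detail from the per-frame point budget.
for budget in (2_000_000, 500_000, 100_000):
    lod_positions = positions[:budget]
    lod_colors = colors[:budget]
    print(f"budget {budget:>9}: drawing {len(lod_positions)} points")
```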

After capture and data generation, editing and compositing are done within a real-time engine, connecting the recorded actions to tell the intended story. The final product can then be viewed either as a flat rendering of the captured data or interactively in a VR headset.

One goal of the point-based approach to volumetric capture is to stream point data from the cloud to users at home, allowing the creation and dissemination of realistic virtual worlds on demand. A second, more recently considered goal is a real-time data stream of live events. This requires very high bandwidth, as each pixel also carries depth information (effectively becoming a voxel).

Promises

With this general understanding of the technology in mind, this section describes the advances on the horizon for entertainment and other industries, as well as the potential this technology has to change the media landscape.

True immersion

As volumetric video evolves into global capture and display hardware evolves to match, we will enter an era of true immersion, where the nuances of captured environments combined with those of captured performances convey emotion in a whole new medium, blurring the boundary between real and virtual worlds. This breakthrough in sensory trickery will spark an evolution in the way we consume media, and while technologies for other senses such as smell and proprioception are still in research and development, one day in the not-so-distant future we will travel convincingly to new locales, both real and imagined. Tourism and journalism will find new life in the ability to transport a viewer or visitor safely to a location, while fields such as architectural visualization and civil engineering will find ways to build entire structures and cities and explore them without a single swing of a hammer.

Full capture and re-use

Once a capture is created and saved, it can be re-used, and even re-purposed, for circumstances beyond the initially envisioned scope. A virtual set enables volumetric videographers and cinematographers to create stories and plan shots without needing a crew or even being present at the physical set, and a proper visualization can help an actor or performer block out a scene or action with the comfort that their practice is not at the expense of the rest of the production. Old sets can be captured digitally before being torn down, allowing them to persist as places to revisit and explore for entertainment and inspiration, and multiple sets can be kit-bashed to tighten the iteration loops of set design, sound design, coloring, and many other aspects of production.

Traditional skillsets

One area of concern with the growing field of volumetric capture is shrinking demand for traditional skillsets such as modeling, lighting, and animation. However, as the stack of production-oriented volumetric capture technologies grows, so too may the demand for traditional skillsets.

Volumetric capture excels at capturing static data or pre-rendered animated footage. It cannot, however, create an imaginary environment or natively allow for any level of interactivity. This is where skilled artists and developers will be in highest demand, creating seamless interactive events and assets to complement the existing geometry data, or using the existing data as a base on which to build, much as a digital painter might paint over a basic 3D render. The onus will be on artisans to keep up with the tools and workflows that best suit their skillsets, but the prudent will find that the production pipeline of the future offers many opportunities to streamline labor-intensive work, allowing investment in bigger creative challenges.

Most importantly, skills currently rendered semi-obsolete by advances in computer graphics and offline rendering will once again become relevant: real, hand-built sets and carefully tailored costumes rendered as high-fidelity volumetric captures will almost always be far more immersive than anything entirely CG. By combining these real-life set captures with volumetric captures of additional CG elements, we will be able to blend real life and imagination in ways previously possible only on a flat screen, creating new fields in areas like compositing and VFX.

Challenges

The capture and creation process for volumetric data is full of challenges and unsolved problems. It is the next step in cinematography and comes with issues that will be resolved over time.

Visual language

Every medium creates its own visual language, rules, and creative approaches, and volumetric video is still in its infancy in this regard. The situation compares to the addition of sound to moving pictures, when new design philosophies had to be created and tested. The language of film and the art of directing have been battle-hardened over more than 100 years, but in a fully six-degrees-of-freedom, interactive, and non-linear world, many of the traditional approaches cannot function. The more experiences are created and analyzed, the quicker the community can come to conclusions about this language of experiences.

Pipeline disruption

Current video and film-making pipelines and productions are not ready to simply go volumetric. Every step in the film-making process needs to be rethought and reinvented. On-set capture, direction of talent, editing, photography, storytelling, and much more all need time to adapt to volumetric workflows. Currently each production uses a variety of technologies while still working out the rules of engagement.

Data rates

To store and play back the captured data, enormous data sets need to be streamed to the consumer. Currently the most effective approach is to build and deliver bespoke apps. There is not yet a standard for generating volumetric video and making it experienceable at home. Compression of this data is starting to become available, with the Moving Picture Experts Group searching for a reasonable way to stream it. This would allow truly interactive, immersive projects to be distributed and worked on more efficiently, and it needs to be solved before the medium becomes mainstream.
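
A back-of-the-envelope calculation shows why compression is essential. Assuming an uncompressed stream of one million points per frame, each carrying a 32-bit xyz position and 8-bit RGB color, at 30 frames per second (all figures are illustrative assumptions, not data from any particular capture system):

```python
# Rough data-rate estimate for a raw, uncompressed volumetric point stream.
points_per_frame = 1_000_000
bytes_per_point = 3 * 4 + 3 * 1     # xyz as 32-bit floats + RGB as bytes = 15 B
frames_per_second = 30

raw_rate_bytes = points_per_frame * bytes_per_point * frames_per_second
print(f"{raw_rate_bytes / 1e6:.0f} MB/s, i.e. {raw_rate_bytes * 8 / 1e9:.1f} Gbit/s uncompressed")
# -> roughly 450 MB/s (about 3.6 Gbit/s), far beyond typical home bandwidth.
```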

Future applications

Besides applications in entertainment, several other industries have a vested interest in capturing scenes at the level of detail described above. Sports events would benefit greatly from a detailed replay of the state of a game. This is already happening in American football and baseball, as well as English football. [9] These 360° replays will enable viewers to analyze a match from multiple perspectives.

Documenting spaces of historical events, captured live or recreated, will greatly benefit the educational sector. Virtual lectures depicting major events in history with an immersive component will help future generations imagine spaces and learn collaboratively about events. The same approach can be abstracted to visualize micro-scale scenarios at the cellular level as well as epic events that changed the course of human history. A main advantage of virtual field trips is the democratisation of high-end educational scenarios: being able to visit a museum without physically being there reaches a broader audience and also enables institutions to show their entire inventory rather than the subsection currently on display.

Real estate and tourism could preview destinations accurately, and retail could become much more customized to the individual. Product capture has already been done for shoes, and magic mirrors can be used in stores to visualize the results. Shopping malls have started to embrace this to draw customers back, offering VR arcades and presenting merchandise virtually.

References

1. Vittorio Ferrari; Martial Hebert; Cristian Sminchisescu; Yair Weiss (2018). Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings. Springer. pp. 351–. ISBN 978-3-030-01270-0.
2. "RGBDToolkit Workshop". Eyebeam. Retrieved 2019-08-06.
3. "Announcing Azure Kinect support in Depthkit!". www.depthkit.tv. Retrieved 2019-08-06.
4. "Home". 4dviews.com.
5. "Bring life to mixed reality at Mixed Reality Capture Studios". Microsoft. 7 August 2023.
6. "Samsung HOLOLAB". 7 November 2018.
7. "Aspect 3D volumetric video". Level Five Supplies. Retrieved 2020-06-23.
8. "Volograms technology". Volograms. Retrieved 2020-06-23.
9. "Arsenal FC, Liverpool FC and Manchester City Bring Immersive Experiences to Fans with Intel True View".
