Gaussian splatting

Video rendered from a 3D Gaussian splatting model

Gaussian splatting is a volume rendering technique that renders volume data directly, without converting the data into surface or line primitives. [1] The technique was originally introduced as splatting by Lee Westover in the early 1990s. [2]


With advancements in computer graphics, newer methods such as 3D Gaussian splatting and 3D Temporal Gaussian splatting have been developed to offer real-time radiance field rendering and dynamic scene rendering respectively. [3] [4]

3D Gaussian splatting

Gaussian splatting model of a collapsed building taken from drone footage

3D Gaussian splatting is a technique used in the field of real-time radiance field rendering. [3] It enables high-quality, real-time synthesis of novel views of a scene from multiple photographs or videos, addressing a significant challenge in the field.

The method represents scenes with 3D Gaussians that retain properties of continuous volumetric radiance fields, integrating sparse points produced during camera calibration. It introduces an anisotropic representation using 3D Gaussians to model radiance fields, along with interleaved optimization and density control of the Gaussians. A fast visibility-aware rendering algorithm that supports anisotropic splatting and is tailored to GPU execution is also proposed. [3]
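Concretely, each Gaussian is described by its mean (center) and a full 3D covariance, which the method factors into a rotation and a non-uniform scaling so that the covariance always remains a valid, positive semi-definite ellipsoid:

    G(x) = \exp\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right), \qquad \Sigma = R\,S\,S^{\top}R^{\top}

where R is a rotation matrix (stored as a quaternion) and S is a diagonal scaling matrix.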

Method

This diagram illustrates the working of the proposed algorithm.

The method involves several key steps:

The method uses differentiable 3D Gaussian splatting, which is unstructured and explicit, allowing rapid rendering and projection to 2D splats. The covariance of each Gaussian can be thought of as describing the configuration of an ellipsoid; it can be mathematically decomposed into a scaling matrix and a rotation matrix, as in the sketch below. The gradients for all parameters are derived explicitly to avoid the overhead of automatic differentiation.
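The following is a minimal sketch of that decomposition, assuming NumPy and illustrative names (it is not the authors' CUDA implementation): the covariance is rebuilt from a scale vector and a rotation quaternion, which keeps it positive semi-definite throughout optimization.

    import numpy as np

    def quat_to_rotmat(q):
        """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
        w, x, y, z = q / np.linalg.norm(q)
        return np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])

    def covariance_from_scaling_rotation(scale, quat):
        """Sigma = R S S^T R^T: always a valid (positive semi-definite) covariance."""
        R = quat_to_rotmat(np.asarray(quat, dtype=float))
        S = np.diag(np.asarray(scale, dtype=float))
        return R @ S @ S.T @ R.T

    # Example: a Gaussian stretched along x and rotated 45 degrees about the z axis.
    sigma = covariance_from_scaling_rotation([0.5, 0.1, 0.1],
                                             [np.cos(np.pi/8), 0.0, 0.0, np.sin(np.pi/8)])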

The optimization creates a dense set of 3D Gaussians that represent the scene as accurately as possible. Each step of rendering is followed by a comparison to the training views available in the dataset.
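The loss optimized at each such step combines an L1 term with a structural dissimilarity (D-SSIM) term, weighted by lambda = 0.2. The PyTorch sketch below is a simplified illustration rather than the reference implementation: render_view stands in for a hypothetical differentiable Gaussian rasterizer, and d_ssim uses a single global window instead of the usual windowed SSIM.

    import torch
    import torch.nn.functional as F

    def d_ssim(img_a, img_b, c1=0.01**2, c2=0.03**2):
        """Simplified (single global window) structural dissimilarity between two images."""
        mu_a, mu_b = img_a.mean(), img_b.mean()
        var_a, var_b = img_a.var(), img_b.var()
        cov = ((img_a - mu_a) * (img_b - mu_b)).mean()
        ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
               ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
        return 1.0 - ssim

    def training_step(render_view, camera, gt_image, optimizer, gaussian_params, lam=0.2):
        """One optimization step: render a training view, compare to the photo, backpropagate."""
        optimizer.zero_grad()
        rendered = render_view(gaussian_params, camera)  # hypothetical differentiable rasterizer
        loss = (1 - lam) * F.l1_loss(rendered, gt_image) + lam * d_ssim(rendered, gt_image)
        loss.backward()
        optimizer.step()
        return loss.item()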

Results and evaluation

The authors tested their algorithm on 13 real scenes from previously published datasets and the synthetic Blender dataset. [6] They compared their method against state-of-the-art techniques such as Mip-NeRF360, [7] InstantNGP, [8] and Plenoxels. [5] Quantitative evaluation metrics used were PSNR, LPIPS, and SSIM.
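For reference, PSNR is computed from the mean squared error between a rendered view and the corresponding ground-truth image, with MAX the largest possible pixel value:

    \mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right)

Higher PSNR and SSIM, and lower LPIPS, indicate better reconstruction quality.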

Their fully converged model (30,000 iterations) achieves quality on par with or slightly better than Mip-NeRF360, [7] but with significantly reduced training time (35–45 minutes vs. 48 hours) and faster rendering (real-time vs. 10 seconds per frame). At 7,000 iterations (5–10 minutes of training), their method achieves comparable quality to InstantNGP [8] and Plenoxels. [5]

For synthetic bounded scenes (Blender dataset [6] ), they achieved state-of-the-art results even with random initialization, starting from 100,000 uniformly random Gaussians.

Limitations

Some limitations of the method include visual artifacts, such as elongated or "splotchy" Gaussians, in regions that are poorly observed in the training images; occasional popping artifacts caused by the view-dependent sorting of splats during rendering; and memory consumption that is considerably higher than that of NeRF-based approaches.

The authors note that some of these limitations could potentially be addressed through future improvements such as better culling approaches, antialiasing, regularization, and compression techniques.

3D Temporal Gaussian splatting

Extending 3D Gaussian splatting to dynamic scenes, 3D Temporal Gaussian splatting incorporates a time component, allowing for real-time rendering of dynamic scenes at high resolution. [4] It represents and renders dynamic scenes by modeling complex motions while maintaining efficiency. The method uses a HexPlane to connect adjacent Gaussians, providing an accurate representation of position and shape deformations. By keeping only a single set of canonical 3D Gaussians and predicting how they deform, it models their motion across different timestamps, as sketched below. [9]
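The following is a conceptual sketch of that idea, not the authors' HexPlane-based implementation: a single set of canonical Gaussian centres is kept, and a small hypothetical MLP predicts a per-Gaussian offset for any query timestamp. Names and the network design are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CanonicalGaussians(nn.Module):
        def __init__(self, num_gaussians: int):
            super().__init__()
            # Time-independent canonical centres; scales, rotations and opacities omitted for brevity.
            self.means = nn.Parameter(torch.randn(num_gaussians, 3))
            # Tiny deformation network: (x, y, z, t) -> (dx, dy, dz).
            self.deform = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))

        def means_at(self, t: float) -> torch.Tensor:
            """Deformed Gaussian centres at timestamp t (e.g. normalised to [0, 1])."""
            time = torch.full((self.means.shape[0], 1), float(t))
            return self.means + self.deform(torch.cat([self.means, time], dim=1))

    # Example: query the same canonical Gaussians at two timestamps.
    model = CanonicalGaussians(num_gaussians=1000)
    means_t0 = model.means_at(0.0)
    means_t1 = model.means_at(0.5)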

It is sometimes referred to as "4D Gaussian splatting"; however, this naming convention implies the use of 4D Gaussian primitives (parameterized by a four-dimensional mean vector and a 4×4 covariance matrix). Most work in this area still employs 3D Gaussian primitives, applying temporal constraints as an extra optimization parameter.
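Under that stricter reading, a true 4D primitive would be a single Gaussian over space-time:

    G(\mathbf{x}) = \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), \qquad \mathbf{x} = (x, y, z, t)^{\top},\ \boldsymbol{\mu} \in \mathbb{R}^{4},\ \Sigma \in \mathbb{R}^{4\times 4}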

The technique achieves real-time rendering of dynamic scenes at high resolution while maintaining quality. It shows potential applications for future developments in film and other media, although there are current limitations regarding the length of motion captured. [9]

Applications

3D Gaussian splatting has been adapted and extended across various computer vision and graphics applications, from dynamic scene rendering to autonomous driving simulations and 4D content creation. Examples include text-to-3D content generation, [10] simulation for end-to-end autonomous driving, [11] surface-aligned splatting for 3D mesh reconstruction, [12] dense RGB-D SLAM, [13] and text-to-4D synthesis with dynamic 3D Gaussians. [14]


References

  1. Westover, Lee Alan (July 1991). "SPLATTING: A Parallel, Feed-Forward Volume Rendering Algorithm" (PDF). Retrieved October 18, 2023.
  2. Huang, Jian (Spring 2002). "Splatting" (PPT). Retrieved 5 August 2011.
  3. Bernhard Kerbl; Georgios Kopanas; Thomas Leimkühler; George Drettakis (8 Aug 2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering". arXiv:2308.04079 [cs.GR].
  4. Guanjun Wu; Taoran Yi; Jiemin Fang; Lingxi Xie; Xiaopeng Zhang; Wei Wei; Wenyu Liu; Qi Tian; Xinggang Wang (12 Oct 2023). "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering". arXiv:2310.08528 [cs.CV].
  5. Fridovich-Keil, Sara; Yu, Alex; Tancik, Matthew; Chen, Qinhong; Recht, Benjamin; Kanazawa, Angjoo (June 2022). "Plenoxels: Radiance Fields without Neural Networks". 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 5491–5500. arXiv:2112.05131. doi:10.1109/cvpr52688.2022.00542. ISBN 978-1-6654-6946-3.
  6. Mildenhall, Ben; Srinivasan, Pratul P.; Tancik, Matthew; Barron, Jonathan T.; Ramamoorthi, Ravi; Ng, Ren (2020). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". Lecture Notes in Computer Science. Cham: Springer International Publishing. pp. 405–421. doi:10.1007/978-3-030-58452-8_24. ISBN 978-3-030-58451-1. Retrieved 2024-09-25.
  7. Barron, Jonathan T.; Mildenhall, Ben; Verbin, Dor; Srinivasan, Pratul P.; Hedman, Peter (June 2022). "Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields". 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 5460–5469. arXiv:2111.12077. doi:10.1109/cvpr52688.2022.00539. ISBN 978-1-6654-6946-3.
  8. Müller, Thomas; Evans, Alex; Schied, Christoph; Keller, Alexander (July 2022). "Instant neural graphics primitives with a multiresolution hash encoding". ACM Transactions on Graphics. 41 (4): 1–15. arXiv:2201.05989. doi:10.1145/3528223.3530127. ISSN 0730-0301.
  9. Franzen, Carl. "Actors' worst fears come true? New 3D Temporal Gaussian Splatting method captures human motion". venturebeat.com. VentureBeat. Retrieved October 18, 2023.
  10. Chen, Zilong; Wang, Feng; Wang, Yikai; Liu, Huaping (2024-06-16). "Text-to-3D using Gaussian Splatting". 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 21401–21412. arXiv:2309.16585. doi:10.1109/cvpr52733.2024.02022. ISBN 979-8-3503-5300-6.
  11. Chen, Li; Wu, Penghao; Chitta, Kashyap; Jaeger, Bernhard; Geiger, Andreas; Li, Hongyang (2024). "End-to-end Autonomous Driving: Challenges and Frontiers". IEEE Transactions on Pattern Analysis and Machine Intelligence. PP: 1–20. arXiv:2306.16927. doi:10.1109/tpami.2024.3435937. ISSN 0162-8828. PMID 39078757.
  12. Guédon, Antoine; Lepetit, Vincent (2024-06-16). "SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering". 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 5354–5363. arXiv:2311.12775. doi:10.1109/cvpr52733.2024.00512. ISBN 979-8-3503-5300-6.
  13. Keetha, Nikhil; Karhade, Jay; Jatavallabhula, Krishna Murthy; Yang, Gengshan; Scherer, Sebastian; Ramanan, Deva; Luiten, Jonathon (2024-06-16). "SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM". 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 21357–21366. doi:10.1109/cvpr52733.2024.02018. ISBN 979-8-3503-5300-6.
  14. Ling, Huan; Kim, Seung Wook; Torralba, Antonio; Fidler, Sanja; Kreis, Karsten (2024-06-16). "Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models". 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 8576–8588. arXiv:2312.13763. doi:10.1109/cvpr52733.2024.00819. ISBN 979-8-3503-5300-6.