Neural radiance field

A neural radiance field (NeRF) is a method based on deep learning for reconstructing a three-dimensional representation of a scene from two-dimensional images. The NeRF model enables downstream applications of novel view synthesis, scene geometry reconstruction, and obtaining the reflectance properties of the scene. Additional scene properties such as camera poses may also be jointly learned. First introduced in 2020, [1] it has since gained significant attention for its potential applications in computer graphics and content creation. [2]

Algorithm

The NeRF algorithm represents a scene as a radiance field parametrized by a deep neural network (DNN). The network predicts a volume density and view-dependent emitted radiance given the spatial location (x, y, z) and the viewing direction, expressed as spherical angles (θ, φ). By sampling many points along camera rays, traditional volume rendering techniques can produce an image. [1]
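The density-and-color outputs are combined per ray by standard emission-absorption volume rendering. A minimal sketch of that compositing step (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Classical volume rendering quadrature along one camera ray.

    densities: (N,) predicted volume density sigma_i at each sample
    colors:    (N, 3) predicted emitted radiance c_i at each sample
    deltas:    (N,) distance between adjacent samples
    Returns the composited RGB color of the ray.
    """
    alphas = 1.0 - np.exp(-densities * deltas)  # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                    # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)
```

A single fully opaque sample returns its own color; empty space (zero density) contributes nothing.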

Data collection

A NeRF needs to be retrained for each unique scene. The first step is to collect images of the scene from different angles, along with their respective camera poses. These images are standard 2D images and do not require a specialized camera or software; any camera can generate suitable datasets, provided the settings and capture method meet the requirements for structure from motion (SfM).

Obtaining camera poses requires tracking of the camera position and orientation, often through some combination of SLAM, GPS, or inertial estimation. Researchers often use synthetic data to evaluate NeRF and related techniques; for such data, the images (rendered through traditional non-learned methods) and their camera poses are reproducible and error-free. [3]

Training

For each sparse viewpoint (image and camera pose) provided, camera rays are marched through the scene, generating a set of 3D points with a given radiance direction (into the camera). For these points, volume density and emitted radiance are predicted using the multi-layer perceptron (MLP). An image is then generated through classical volume rendering. Because this process is fully differentiable, the error between the predicted image and the original image can be minimized with gradient descent over multiple viewpoints, encouraging the MLP to develop a coherent model of the scene. [1]
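The ray-marching and photometric-error steps described above can be sketched as follows (a simplified illustration; real implementations use stratified and hierarchical sampling, and the gradient flows through the renderer into the MLP):

```python
import numpy as np

def sample_along_ray(origin, direction, near, far, n_samples):
    """March a camera ray through the scene, returning 3D sample points
    and their depths along the ray."""
    t = np.linspace(near, far, n_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]
    return points, t

def photometric_loss(rendered, ground_truth):
    """Mean squared error between the rendered and original pixel colors,
    the quantity minimized with gradient descent during training."""
    return np.mean((rendered - ground_truth) ** 2)
```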

Variations and improvements

Early versions of NeRF were slow to optimize and required that all input views be taken with the same camera under the same lighting conditions. These performed best when limited to orbiting around individual objects, such as a drum set, plants, or small toys. [2] Since the original paper in 2020, many improvements have been made to the NeRF algorithm, with variations for special use cases.

Fourier feature mapping

In 2020, shortly after the release of NeRF, the addition of Fourier feature mapping improved training speed and image accuracy. Deep neural networks struggle to learn high-frequency functions in low-dimensional domains, a phenomenon known as spectral bias. To overcome this shortcoming, points are mapped to a higher-dimensional feature space before being fed into the MLP.

The mapping takes the form γ(v) = [a₁ cos(2π b₁ᵀv), a₁ sin(2π b₁ᵀv), …, aₘ cos(2π bₘᵀv), aₘ sin(2π bₘᵀv)]ᵀ, where v is the input point, the bⱼ are the frequency vectors, and the aⱼ are coefficients.

This allows for rapid convergence to high frequency functions, such as pixels in a detailed image. [4]
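A minimal version of this mapping, assuming Gaussian-sampled frequency vectors as in the Fourier-features paper (function name is illustrative):

```python
import numpy as np

def fourier_features(v, B, a=None):
    """Map input points to Fourier features before the MLP.

    v: (..., d) input coordinates
    B: (m, d) frequency vectors, e.g. drawn from a Gaussian
    a: (m,) optional coefficients; defaults to all ones
    """
    if a is None:
        a = np.ones(B.shape[0])
    proj = 2.0 * np.pi * v @ B.T  # (..., m) projections onto each frequency
    return np.concatenate([a * np.cos(proj), a * np.sin(proj)], axis=-1)
```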

Bundle-adjusting neural radiance fields

One limitation of NeRFs is the requirement of knowing accurate camera poses to train the model. Pose estimation methods are often imperfect, and in some settings the pose cannot be measured at all. These imperfections result in artifacts and suboptimal convergence. A method was therefore developed to optimize the camera poses along with the volumetric function itself. Called bundle-adjusting neural radiance fields (BARF), the technique uses a dynamic low-pass filter to progress from coarse to fine adjustment, minimizing error by finding the geometric transformation to the desired image. This corrects imperfect camera poses and greatly improves the quality of NeRF renders. [5]

Multiscale representation

Conventional NeRFs struggle to represent detail at all viewing distances, producing blurry images up close and overly aliased images from distant views. In 2021, researchers introduced mip-NeRF (the name derives from mipmap), a technique that improves the sharpness of details at different viewing scales. Rather than sampling a single ray per pixel, the technique fits a Gaussian to the conical frustum cast by the camera through each pixel. This effectively anti-aliases across all viewing scales. mip-NeRF also reduces overall image error and converges faster, at roughly half the size of a ray-based NeRF. [6]
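The key ingredient of mip-NeRF is an integrated positional encoding: the expected Fourier features of the Gaussian fitted to the frustum, which attenuates frequencies finer than the pixel footprint. A sketch assuming a diagonal covariance (the function name and fixed power-of-two frequency schedule are illustrative):

```python
import numpy as np

def integrated_pos_enc(mean, var, num_levels=4):
    """Expected positional encoding of a Gaussian with the given
    per-coordinate mean and variance.

    Large variance (a large pixel footprint) damps the high-frequency
    features toward zero, which is what anti-aliases distant content.
    """
    feats = []
    for level in range(num_levels):
        scale = 2.0 ** level
        damp = np.exp(-0.5 * (scale ** 2) * var)  # expectation under the Gaussian
        feats.append(np.sin(scale * mean) * damp)
        feats.append(np.cos(scale * mean) * damp)
    return np.concatenate(feats, axis=-1)
```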

Learned initializations

In 2021, researchers applied meta-learning to assign initial weights to the MLP. This significantly speeds up convergence by effectively giving the network a head start in gradient descent. Meta-learning also allows the MLP to learn an underlying representation of certain scene types. For example, given a dataset of famous tourist landmarks, an initialized NeRF could partially reconstruct a scene from a single image. [7]

NeRF in the wild

Conventional NeRFs are vulnerable to slight variations in input images (objects, lighting), often resulting in ghosting and artifacts. As a result, NeRFs struggle to represent dynamic scenes, such as bustling city streets with changing lighting and moving objects. In 2021, researchers at Google [2] developed a method for accounting for these variations, named NeRF in the Wild (NeRF-W). This method splits the neural network (MLP) into three separate models. The main MLP is retained to encode the static volumetric radiance. However, it operates in sequence with a separate MLP for appearance embedding (changes in lighting, camera properties) and an MLP for transient embedding (changes in scene objects). This allows the NeRF to be trained on diverse photo collections, such as those taken by mobile phones at different times of day. [8]

Relighting

In 2021, researchers added more outputs to the MLP at the heart of NeRFs. The outputs now included volume density, surface normal, material parameters, distance to the first surface intersection (in any direction), and visibility of the external environment in any direction. The inclusion of these new parameters lets the MLP learn material properties rather than pure radiance values. This facilitates a more complex rendering pipeline, calculating direct and global illumination, specular highlights, and shadows. As a result, the NeRF can render the scene under any lighting conditions without retraining. [9]

Plenoctrees

Although NeRFs had reached high levels of fidelity, their costly compute time made them unsuitable for many applications requiring real-time rendering, such as VR/AR and interactive content. Introduced in 2021, PlenOctrees (plenoptic octrees) enabled real-time rendering of pre-trained NeRFs by dividing the volumetric radiance function into an octree. Viewing direction is removed from the network input; instead, the network predicts a spherical radiance distribution for each region, which is stored in the octree and queried directly at render time. This makes rendering over 3000x faster than conventional NeRFs. [10]
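View-dependent color in such a leaf can be recovered by evaluating stored spherical-harmonic coefficients in the query direction. A sketch truncated to SH degree 1 (real models use higher degrees; the function name is illustrative):

```python
import numpy as np

def sh_color(k, d):
    """Evaluate view-dependent color from per-leaf SH coefficients.

    k: (4, 3) spherical-harmonic coefficients (degrees 0..1) per RGB channel
    d: (3,) unit viewing direction (x, y, z)
    """
    x, y, z = d
    # Real SH basis constants for degrees 0 and 1
    basis = np.array([0.28209479,
                      0.48860251 * y,
                      0.48860251 * z,
                      0.48860251 * x])
    return basis @ k
```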

Sparse Neural Radiance Grid

Similar to Plenoctrees, this method enabled real-time rendering of pretrained NeRFs. To avoid querying the large MLP for each point, this method bakes NeRFs into Sparse Neural Radiance Grids (SNeRG). A SNeRG is a sparse voxel grid containing opacity and color, with learned feature vectors to encode view-dependent information. A lightweight, more efficient MLP is then used to produce view-dependent residuals to modify the color and opacity. To enable this compressive baking, small changes to the NeRF architecture were made, such as running the MLP once per pixel rather than for each point along the ray. These improvements make SNeRG extremely efficient, outperforming Plenoctrees. [11]

Instant NeRFs

In 2022, researchers at Nvidia enabled real-time training of NeRFs through a technique known as Instant Neural Graphics Primitives. An innovative input encoding reduces computation, enabling real-time training of a NeRF, an improvement orders of magnitude over previous methods. The speedup stems from the use of spatial hash functions, which have O(1) access time, and from parallelized architectures that run fast on modern GPUs. [12]
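The multiresolution hash encoding indexes trainable feature vectors by hashing integer grid coordinates. A sketch of the hash itself, using the per-dimension primes given in the Instant-NGP paper (the wrapper function name is illustrative):

```python
# Per-dimension primes from the Instant-NGP paper's hash function
PRIMES = (1, 2654435761, 805459861)

def grid_hash(ix, iy, iz, table_size):
    """O(1) hash of an integer voxel coordinate into the feature table.

    XORs each coordinate multiplied by a large prime, then wraps the
    result into the table; collisions are resolved implicitly by training.
    """
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % table_size
```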

Plenoxels

Plenoxel (plenoptic volume element) uses a sparse voxel grid in place of the neural volumetric representation used by NeRFs. It removes the MLP entirely, instead performing gradient descent directly on the voxel coefficients. Plenoxel can match the fidelity of a conventional NeRF with orders of magnitude less training time. Published in 2022, this method showed that the MLP is not the essential component: the differentiable rendering pipeline is. [13]
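Querying such a voxel grid reduces to trilinear interpolation of the stored coefficients, and gradient descent then updates those coefficients directly. A minimal interpolation sketch (dense grid, no boundary handling):

```python
import numpy as np

def trilinear(grid, p):
    """Trilinearly interpolate a voxel grid of coefficients at point p.

    grid: (X, Y, Z) array of stored coefficients
    p:    (3,) query point in voxel coordinates
    """
    i = np.floor(p).astype(int)   # lower corner of the enclosing cell
    f = p - i                     # fractional position within the cell
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                out += w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return out
```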

Gaussian splatting

Gaussian splatting is a newer method that can outperform NeRF in render time and fidelity. Rather than representing the scene as a volumetric function, it uses a sparse cloud of 3D Gaussians. A point cloud is first generated (through structure from motion) and converted to Gaussians with initial covariance, color, and opacity. The Gaussians are directly optimized through stochastic gradient descent to match the input images. This saves computation by removing empty space and foregoing the need to query a neural network for each point; instead, all the Gaussians are simply "splatted" onto the screen, where they overlap to produce the desired image. [14]
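The compositing of depth-sorted Gaussians at a single pixel can be sketched as follows (a toy 2D version for one pixel; the real renderer tiles the screen and runs on the GPU, and all names here are illustrative):

```python
import numpy as np

def splat_pixel(means, inv_covs, colors, opacities, pixel):
    """Front-to-back alpha compositing of depth-sorted 2D Gaussians.

    means:     list of (2,) projected Gaussian centers
    inv_covs:  list of (2, 2) inverse 2D covariance matrices
    colors:    list of (3,) RGB colors
    opacities: list of scalar opacities in [0, 1]
    pixel:     (2,) screen coordinate being shaded
    """
    color = np.zeros(3)
    transmittance = 1.0
    for mu, inv_cov, c, o in zip(means, inv_covs, colors, opacities):
        d = pixel - mu
        alpha = o * np.exp(-0.5 * d @ inv_cov @ d)  # Gaussian falloff
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color
```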

Photogrammetry

Traditional photogrammetry is not neural, instead using robust geometric equations to obtain 3D measurements. NeRFs, unlike photogrammetric methods, do not inherently produce dimensionally accurate 3D geometry. While their results are often sufficient for extracting accurate geometry (e.g. via marching cubes [1]), the process is fuzzy, as with most neural methods. This limits NeRF to cases where the output image is valued over the raw scene geometry. However, NeRFs excel in situations with unfavorable lighting: photogrammetric methods break down completely when trying to reconstruct reflective or transparent objects in a scene, while a NeRF is able to infer the geometry. [15]

Applications

NeRFs have a wide range of applications, and are starting to grow in popularity as they become integrated into user-friendly applications. [3]

Content creation

NeRFs have huge potential in content creation, where on-demand photorealistic views are extremely valuable. [16] The technology democratizes a space previously accessible only to teams of VFX artists with expensive assets. Neural radiance fields now allow anyone with a camera to create compelling 3D environments. [3] NeRF has been combined with generative AI, allowing users with no modelling experience to instruct changes in photorealistic 3D scenes. [17] NeRFs have potential uses in video production, computer graphics, and product design.

Interactive content

The photorealism of NeRFs makes them appealing for applications where immersion is important, such as virtual reality or video games. NeRFs can be combined with classical rendering techniques to insert synthetic objects and create believable virtual experiences. [18]

Medical imaging

NeRFs have been used to reconstruct 3D CT scans from sparse or even single X-ray views. The model demonstrated high-fidelity renderings of chest and knee data. If adopted, this method could spare patients excess doses of ionizing radiation, allowing for safer diagnosis. [19]

Robotics and autonomy

The unique ability of NeRFs to understand transparent and reflective objects makes them useful for robots interacting in such environments. The use of NeRF allowed a robot arm to precisely manipulate a transparent wine glass, a task where traditional computer vision would struggle. [20]

NeRFs can also generate photorealistic human faces, making them valuable tools for human-computer interaction. Traditionally rendered faces can be uncanny, while other neural methods are too slow to run in real-time. [21]

References

  1. Mildenhall, Ben; Srinivasan, Pratul P.; Tancik, Matthew; Barron, Jonathan T.; Ramamoorthi, Ravi; Ng, Ren (2020). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". In Vedaldi, Andrea; Bischof, Horst; Brox, Thomas; Frahm, Jan-Michael (eds.). Computer Vision – ECCV 2020. Lecture Notes in Computer Science. Vol. 12346. Cham: Springer International Publishing. pp. 405–421. arXiv:2003.08934. doi:10.1007/978-3-030-58452-8_24. ISBN 978-3-030-58452-8. S2CID 213175590.
  2. "What is a Neural Radiance Field (NeRF)? | Definition from TechTarget". Enterprise AI. Retrieved 2023-10-24.
  3. Tancik, Matthew; Weber, Ethan; Ng, Evonne; Li, Ruilong; Yi, Brent; Kerr, Justin; Wang, Terrance; Kristoffersen, Alexander; Austin, Jake; Salahi, Kamyar; Ahuja, Abhik; McAllister, David; Kanazawa, Angjoo (2023-07-23). "Nerfstudio: A Modular Framework for Neural Radiance Field Development". Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings. pp. 1–12. arXiv:2302.04264. doi:10.1145/3588432.3591516. ISBN 9798400701597. S2CID 256662551.
  4. Tancik, Matthew; Srinivasan, Pratul P.; Mildenhall, Ben; Fridovich-Keil, Sara; Raghavan, Nithin; Singhal, Utkarsh; Ramamoorthi, Ravi; Barron, Jonathan T.; Ng, Ren (2020-06-18). "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains". arXiv: 2006.10739 [cs.CV].
  5. Lin, Chen-Hsuan; Ma, Wei-Chiu; Torralba, Antonio; Lucey, Simon (2021). "BARF: Bundle-Adjusting Neural Radiance Fields". arXiv: 2104.06405 [cs.CV].
  6. Barron, Jonathan T.; Mildenhall, Ben; Tancik, Matthew; Hedman, Peter; Martin-Brualla, Ricardo; Srinivasan, Pratul P. (2021-04-07). "Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields". arXiv: 2103.13415 [cs.CV].
  7. Tancik, Matthew; Mildenhall, Ben; Wang, Terrance; Schmidt, Divi; Srinivasan, Pratul (2021). "Learned Initializations for Optimizing Coordinate-Based Neural Representations". arXiv: 2012.02189 [cs.CV].
  8. Martin-Brualla, Ricardo; Radwan, Noha; Sajjadi, Mehdi S. M.; Barron, Jonathan T.; Dosovitskiy, Alexey; Duckworth, Daniel (2020). "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections". arXiv: 2008.02268 [cs.CV].
  9. Srinivasan, Pratul P.; Deng, Boyang; Zhang, Xiuming; Tancik, Matthew; Mildenhall, Ben; Barron, Jonathan T. (2020). "NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis". arXiv: 2012.03927 [cs.CV].
  10. Yu, Alex; Li, Ruilong; Tancik, Matthew; Li, Hao; Ng, Ren; Kanazawa, Angjoo (2021). "PlenOctrees for Real-time Rendering of Neural Radiance Fields". arXiv: 2103.14024 [cs.CV].
  11. Hedman, Peter; Srinivasan, Pratul P.; Mildenhall, Ben; Barron, Jonathan T.; Debevec, Paul (2021). "Baking Neural Radiance Fields for Real-Time View Synthesis". arXiv: 2103.14645 [cs.CV].
  12. Müller, Thomas; Evans, Alex; Schied, Christoph; Keller, Alexander (2022-07-04). "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding". ACM Transactions on Graphics. 41 (4): 1–15. arXiv: 2201.05989 . doi:10.1145/3528223.3530127. ISSN   0730-0301. S2CID   246016186.
  13. Fridovich-Keil, Sara; Yu, Alex; Tancik, Matthew; Chen, Qinhong; Recht, Benjamin; Kanazawa, Angjoo (2021). "Plenoxels: Radiance Fields without Neural Networks". arXiv: 2112.05131 [cs.CV].
  14. Kerbl, Bernhard; Kopanas, Georgios; Leimkuehler, Thomas; Drettakis, George (2023-07-26). "3D Gaussian Splatting for Real-Time Radiance Field Rendering". ACM Transactions on Graphics. 42 (4): 1–14. arXiv: 2308.04079 . doi: 10.1145/3592433 . ISSN   0730-0301. S2CID   259267917.
  15. "Why THIS is the Future of Imagery (and Nobody Knows it Yet)". YouTube.
  16. "Shutterstock Speaks About NeRFs At Ad Week | Neural Radiance Fields". neuralradiancefields.io. 2023-10-20. Retrieved 2023-10-24.
  17. Haque, Ayaan; Tancik, Matthew; Efros, Alexei; Holynski, Aleksander; Kanazawa, Angjoo (2023-06-01). "InstructPix2Pix: Learning to Follow Image Editing Instructions". 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 18392–18402. arXiv: 2211.09800 . doi:10.1109/cvpr52729.2023.01764. ISBN   979-8-3503-0129-8. S2CID   253581213.
  18. "Venturing Beyond Reality: VR-NeRF | Neural Radiance Fields". neuralradiancefields.io. 2023-11-08. Retrieved 2023-11-09.
  19. Corona-Figueroa, Abril; Frawley, Jonathan; Bond-Taylor, Sam; Bethapudi, Sarath; Shum, Hubert P. H.; Willcocks, Chris G. (2022-07-11). "MedNeRF: Medical Neural Radiance Fields for Reconstructing 3D-aware CT-Projections from a Single X-ray". 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). Vol. 2022. IEEE. pp. 3843–3848. doi:10.1109/embc48229.2022.9871757. ISBN 978-1-7281-2782-8. PMID 36085823. S2CID 246473192.
  20. Kerr, Justin; Fu, Letian; Huang, Huang; Avigal, Yahav; Tancik, Matthew; Ichnowski, Jeffrey; Kanazawa, Angjoo; Goldberg, Ken (2022-08-15). Evo-NeRF: Evolving NeRF for Sequential Robot Grasping of Transparent Objects. CoRL 2022 Conference.
  21. Aurora (2023-06-04). "Generating highly detailed human faces using Neural Radiance Fields". ILLUMINATION. Retrieved 2023-11-09.