Deep image prior

Last updated August 24, 2024

Deep image prior is a type of convolutional neural network used to enhance a given image with no prior training data other than the image itself. A neural network is randomly initialized and used as prior to solve inverse problems such as noise reduction, super-resolution, and inpainting. Image statistics are captured by the structure of a convolutional image generator rather than by any previously learned capabilities.

Method

Background

Inverse problems such as noise reduction, super-resolution, and inpainting can be formulated as the optimization task $x^{*}=min_{x}E(x;x_{0})+R(x)$ , where $x$ is an image, $x_{0}$ a corrupted representation of that image, $E(x;x_{0})$ is a task-dependent data term, and R(x) is the regularizer. This forms an energy minimization problem.

Deep neural networks learn a generator/decoder $x=f_{\theta }(z)$ which maps a random code vector $z$ to an image $x$ .

The image corruption method used to generate $x_{0}$ is selected for the specific application.

Specifics

In this approach, the $R(x)$ prior is replaced with the implicit prior captured by the neural network (where $R(x)=0$ for images that can be produced by a deep neural networks and $R(x)=+\infty$ otherwise). This yields the equation for the minimizer $\theta ^{*}=argmin_{\theta }E(f_{\theta }(z);x_{0})$ and the result of the optimization process $x^{*}=f_{\theta ^{*}}(z)$ .

The minimizer $\theta ^{*}$ (typically a gradient descent) starts from a randomly initialized parameters and descends into a local best result to yield the $x^{*}$ restoration function.

Overfitting

A parameter θ may be used to recover any image, including its noise. However, the network is reluctant to pick up noise because it contains high impedance while useful signal offers low impedance. This results in the θ parameter approaching a good-looking local optimum so long as the number of iterations in the optimization process remains low enough not to overfit data.

Deep Neural Network Model

Typically, the deep neural network model for deep image prior uses a U-Net like model without the skip connections that connect the encoder blocks with the decoder blocks. The authors in their paper mention that "Our findings here (and in other similar comparisons) seem to suggest that having deeper architecture is beneficial, and that having skip-connections that work so well for recognition tasks (such as semantic segmentation) is highly detrimental."^[1]

Applications

Denoising

The principle of denoising is to recover an image $x$ from a noisy observation $x_{0}$ , where $x_{0}=x+\epsilon$ . The distribution $\epsilon$ is sometimes known (e.g.: profiling sensor and photon noise^[2]) and may optionally be incorporated into the model, though this process works well in blind denoising.

The quadratic energy function $E(x,x_{0})=||x-x_{0}||^{2}$ is used as the data term, plugging it into the equation for $\theta ^{*}$ yields the optimization problem $min_{\theta }||f_{\theta }(z)-x_{0}||^{2}$ .

Super-resolution

Super-resolution is used to generate a higher resolution version of image x. The data term is set to $E(x;x_{0})=||d(x)-x_{0}||^{2}$ where d(·) is a downsampling operator such as Lanczos that decimates the image by a factor t.

Inpainting

Inpainting is used to reconstruct a missing area in an image $x_{0}$ . These missing pixels are defined as the binary mask $m\in \{0,1\}^{H\times V}$ . The data term is defined as $E(x;x_{0})=||(x-x_{0})\odot m||^{2}$ (where $\odot$ is the Hadamard product).

The intuition behind this is that the loss is computed only on the known pixels in the image, and the network is going to learn enough about the image to fill in unknown parts of the image even though the computed loss doesn't include those pixels. This strategy is used to remove image watermarks by treating the watermark as missing pixels in the image.

Flash–no-flash reconstruction

This approach may be extended to multiple images. A straightforward example mentioned by the author is the reconstruction of an image to obtain natural light and clarity from a flash–no-flash pair. Video reconstruction is possible but it requires optimizations to take into account the spatial differences.

Implementations

A reference implementation rewritten in Python 3.6 with the PyTorch 0.4.0 library was released by the author under the Apache 2.0 license: deep-image-prior ^[3]
A TensorFlow-based implementation written in Python 2 and released under the CC-SA 3.0 license: deep-image-prior-tensorflow
A Keras-based implementation written in Python 2 and released under the GPLv3: machine_learning_denoising

Example

See Astronomy Picture of the Day (APOD) of 2024-02-18 ^[4]

Related Research Articles

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are used for clustering, under the name model-based clustering, and also for density estimation.

Tomographic reconstruction is a type of multidimensional inverse problem where the challenge is to yield an estimate of a specific system from a finite number of projections. The mathematical basis for tomographic imaging was laid down by Johann Radon. A notable example of applications is the reconstruction of computed tomography (CT) where cross-sectional images of patients are obtained in non-invasive manner. Recent developments have seen the Radon transform and its inverse used for tasks related to realistic object insertion required for testing and evaluating computed tomography use in airport security.

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

In machine learning, the vanishing gradient problem is encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training each of the neural networks weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that as the sequence length increases, the gradient magnitude typically is expected to decrease, slowing the training process. In the worst case, this may completely stop the neural network from further training. As one example of the problem cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range $[-1,1]$ , and backpropagation computes gradients by the chain rule. This has the effect of multiplying $n$ of these small numbers to compute gradients of the early layers in an $n$ -layer network, meaning that the gradient decreases exponentially with $n$ while the early layers train very slowly.

In signal processing it is useful to simultaneously analyze the space and frequency characteristics of a signal. While the Fourier transform gives the frequency information of the signal, it is not localized. This means that we cannot determine which part of a signal produced a particular frequency. It is possible to use a short time Fourier transform for this purpose, however the short time Fourier transform limits the basis functions to be sinusoidal. To provide a more flexible space-frequency signal decomposition several filters have been proposed. The Log-Gabor filter is one such filter that is an improvement upon the original Gabor filter. The advantage of this filter over the many alternatives is that it better fits the statistics of natural images compared with Gabor filters and other wavelet filters.

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

Shrinkage fields is a random field-based machine learning technique that aims to perform high quality image restoration using low computational overhead.

An MRI artifact is a visual artifact in magnetic resonance imaging (MRI). It is a feature appearing in an image that is not present in the original object. Many different artifacts can occur during MRI, some affecting the diagnostic quality, while others may be confused with pathology. Artifacts can be classified as patient-related, signal processing-dependent and hardware (machine)-related.

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using theoretical tools from kernel methods.

An energy-based model (EBM) (also called a Canonical Ensemble Learning(CEL) or Learning via Canonical Ensemble (LCE)) is an application of canonical ensemble formulation of statistical physics for learning from data problems. The approach prominently appears in generative models (GMs).

A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks. Specifically, a wide variety of network architectures converges to a GP in the infinitely wide limit, in the sense of distribution. The concept constitutes an intensional definition, i.e., a NNGP is just a GP, but distinguished by how it is obtained.

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.

In mathematics, the Chambolle-Pock algorithm is an algorithm used to solve convex optimization problems. It was introduced by Antonin Chambolle and Thomas Pock in 2011 and has since become a widely used method in various fields, including image processing, computer vision, and signal processing.

Deep backward stochastic differential equation method is a numerical method that combines deep learning with Backward stochastic differential equation (BSDE). This method is particularly useful for solving high-dimensional problems in financial derivatives pricing and risk management. By leveraging the powerful function approximation capabilities of deep neural networks, deep BSDE addresses the computational challenges faced by traditional numerical methods in high-dimensional settings.

References

↑ https://sites.skoltech.ru/app/data/uploads/sites/25/2018/04/deep_image_prior.pdf ^{[ bare URL PDF ]}
↑ jo (2012-12-11). "profiling sensor and photon noise .. and how to get rid of it". darktable.
↑ "DmitryUlyanov/Deep-image-prior". GitHub . 3 June 2021.
↑ "Astronomy Picture of the Day".

Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (30 November 2017). "Deep Image Prior". arXiv: 1711.10925v2 [cs.CV].

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] ttps://sites.skoltech.ru/app/data/uploads/sites/25/2018/04/deep_image_prior.pdf ^{[ bare URL PDF ]}

[2] (2012-12-11). "profiling sensor and photon noise .. and how to get rid of it". darktable.

[3] "DmitryUlyanov/Deep-image-prior". GitHub . 3 June 2021.

[4] "Astronomy Picture of the Day".

[1]

[2]

[3]

[4]