# Deep Image Prior

Last updated

Deep Image Prior is a type of convolutional neural network used to enhance a given image with no prior training data other than the image itself. A neural network is randomly initialized and used as prior to solve inverse problems such as noise reduction, super-resolution, and inpainting. Image statistics is captured by the structure of a convolutional image generator rather than by any previously learned capabilities.

## Method

### Background

Inverse problems such as noise reduction, super-resolution, and inpainting can be formulated as the optimization task ${\displaystyle x^{*}=min_{x}E(x;x_{0})+R(x)}$, where ${\displaystyle x}$ is an image, ${\displaystyle x_{0}}$ a corrupted representation of that image, ${\displaystyle E(x;x_{0})}$ is a task-dependent data term, and R(x) is the regularizer. This forms an energy minimization problem.

Deep neural networks learn a generator/decoder ${\displaystyle x=f_{\theta }(z)}$ which maps a random code vector ${\displaystyle z}$ to an image ${\displaystyle x}$.

The image corruption method used to generate ${\displaystyle x_{0}}$ is selected for the specific application.

### Specifics

In this approach, the ${\displaystyle R(x)}$ prior is replaced with the implicit prior captured by the neural network (where ${\displaystyle R(x)=0}$ for images that can be produced by a deep neural networks and ${\displaystyle R(x)=+\infty }$ otherwise). This yields the equation for the minimizer ${\displaystyle \theta ^{*}=argmin_{\theta }E(f_{\theta }(z);x_{0})}$ and the result of the optimization process ${\displaystyle x^{*}=f_{\theta ^{*}}(z)}$.

The minimizer ${\displaystyle \theta ^{*}}$ (typically a gradient descent) starts from a randomly initialized parameters and descends into a local best result to yield the ${\displaystyle x^{*}}$ restoration function.

#### Overfitting

A parameter θ may be used to recover any image, including its noise. However, the network is reluctant to pick up noise because it contains high impedance while useful signal offers low impedance. This results in the θ parameter approaching a good-looking local optimum so long as the number of iterations in the optimization process remains low enough not to overfit data.

## Applications

### Denoising

The principle of denoising is to recover an image ${\displaystyle x}$ from a noisy observation ${\displaystyle x_{0}}$, where ${\displaystyle x_{0}=x+\epsilon }$. The distribution ${\displaystyle \epsilon }$ is sometimes known (e.g.: profiling sensor and photon noise [1] ) and may optionally be incorporated into the model, though this process works well in blind denoising.

The quadratic energy function ${\displaystyle E(x,x_{0})=||x-x_{0}||^{2}}$ is used as the data term, plugging it into the equation for ${\displaystyle \theta ^{*}}$ yields the optimization problem ${\displaystyle min_{\theta }||f_{\theta }(z)-x_{0}||^{2}}$.

### Super-resolution

Super-resolution is used to generate a higher resolution version of image x. The data term is set to ${\displaystyle E(x;x_{0})=||d(x)-x_{0}||^{2}}$ where d(·) is a downsampling operator such as Lanczos that decimates the image by a factor t.

### Inpainting

Inpainting is used to reconstruct a missing area in an image ${\displaystyle x_{0}}$. These missing pixels are defined as the binary mask ${\displaystyle m\in \{0,1\}^{H\times V}}$. The data term is defined as ${\displaystyle E(x;x_{0})=||(x-x_{0})\odot m||^{2}}$ (where ${\displaystyle \odot }$ is the Hadamard product).

### Flash-no-flash reconstruction

This approach may be extended to multiple images. A straightforward example mentioned by the author is the reconstruction of an image to obtain natural light and clarity from a flash-no-flash pair. Video reconstruction is possible but it requires optimizations to take into account the spatial differences.

## Related Research Articles

Pattern recognition is the automated recognition of patterns and regularities in data. Pattern recognition is closely related to artificial intelligence and machine learning, together with applications such as data mining and knowledge discovery in databases (KDD), and is often used interchangeably with these terms. However, these are distinguished: machine learning is one approach to pattern recognition, while other approaches include hand-crafted rules or heuristics; and pattern recognition is one approach to artificial intelligence, while other approaches include symbolic artificial intelligence. A modern definition of pattern recognition is:

The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information. In Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior. The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the statistician Ronald Fisher. The Fisher information is also used in the calculation of the Jeffreys prior, which is used in Bayesian statistics.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

A sensor array is a group of sensors, usually deployed in a certain geometry pattern, used for collecting and processing electromagnetic or acoustic signals. The advantage of using a sensor array over using a single sensor lies in the fact that an array adds new dimensions to the observation, helping to estimate more parameters and improve the estimation performance. For example an array of radio antenna elements used for beamforming can increase antenna gain in the direction of the signal while decreasing the gain in other directions, i.e., increasing signal-to-noise ratio (SNR) by amplifying the signal coherently. Another example of sensor array application is to estimate the direction of arrival of impinging electromagnetic waves. The related processing method is called array signal processing. Application examples of array signal processing include radar/sonar, wireless communications, seismology, machine condition monitoring, astronomical observations fault diagnosis, etc.

Tomographic reconstruction is a type of multidimensional inverse problem where the challenge is to yield an estimate of a specific system from a finite number of projections. The mathematical basis for tomographic imaging was laid down by Johann Radon. A notable example of applications is the reconstruction of computed tomography (CT) where cross-sectional images of patients are obtained in non-invasive manner. Recent developments have seen the Radon transform and its inverse used for tasks related to realistic object insertion required for testing and evaluating computed tomography use in airport security.

A Boltzmann machine is a type of stochastic recurrent neural network. It is a Markov random field. It was translated from statistical physics for use in cognitive science. The Boltzmann machine is based on stochastic spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model that is a stochastic Ising Model and applied to machine learning.

In image processing, a Gabor filter, named after Dennis Gabor, is a linear filter used for texture analysis, which means that it basically analyzes whether there are any specific frequency content in the image in specific directions in a localized region around the point or region of analysis. Frequency and orientation representations of Gabor filters are claimed by many contemporary vision scientists to be similar to those of the human visual system. They have been found to be particularly appropriate for texture representation and discrimination. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave.

Convex optimization is a subfield of mathematical optimization that studies the problem of minimizing convex functions over convex sets. Many classes of convex optimization problems admit polynomial-time algorithms, whereas mathematical optimization is in general NP-hard.

In Bayesian statistics, a maximum a posterior probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties. Examples are the regularized autoencoders, proven effective in learning representations for subsequent classification tasks, and Variational autoencoders, with their recent applications as generative models. Autoencoders are effectively used for solving many applied problems, from face recognition to acquiring the semantic meaning of words.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

The topological derivative is, conceptually, a derivative of a shape functional with respect to infinitesimal changes in its topology, such as adding an infinitesimal hole or crack. When used in higher dimensions than one, the term topological gradient is also used to name the first-order term of the topological asymptotic expansion, dealing only with infinitesimal singular domain perturbations. It has applications in shape optimization, topology optimization, image processing and mechanical modeling.

Location estimation in wireless sensor networks is the problem of estimating the location of an object from a set of noisy measurements. These measurements are acquired in a distributed manner by a set of sensors.

In signal processing it is useful to simultaneously analyze the space and frequency characteristics of a signal. While the Fourier transform gives the frequency information of the signal, it is not localized. This means that we cannot determine which part of a signal produced a particular frequency. It is possible to use a short time Fourier transform for this purpose, however the short time Fourier transform limits the basis functions to be sinusoidal. To provide a more flexible space-frequency signal decomposition several filters have been proposed. The Log-Gabor filter is one such filter that is an improvement upon the original Gabor filter. The advantage of this filter over the many alternatives is that it better fits the statistics of natural images compared with Gabor filters and other wavelet filters.

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

Shrinkage fields is a random field-based machine learning technique that aims to perform high quality image restoration using low computational overhead.

Batch normalization is a technique for improving the speed, performance, and stability of artificial neural networks. Batch normalization was introduced in a 2015 paper. It is used to normalize the input layer by adjusting and scaling the activations.

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel which describes the evolution of deep artificial neural network during their training by gradient descent.

## References

1. jo (2012-12-11). "profiling sensor and photon noise .. and how to get rid of it". darktable.
• Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (30 November 2017). "Deep Image Prior". arXiv:.