Discrete Universal Denoiser

In information theory and signal processing, the Discrete Universal Denoiser (DUDE) is a denoising scheme for recovering sequences over a finite alphabet that have been corrupted by a discrete memoryless channel. The DUDE was proposed in 2005 by Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. [1]

Overview

The Discrete Universal Denoiser [1] (DUDE) is a denoising scheme that estimates an unknown signal $x^n = (x_1, \dots, x_n)$ over a finite alphabet $\mathcal{X}$ from a noisy version $z^n$. While most denoising schemes in the signal processing and statistics literature deal with signals over an infinite alphabet (notably, real-valued signals), the DUDE addresses the finite-alphabet case. The noisy version $z^n$ is assumed to be generated by transmitting $x^n$ through a known discrete memoryless channel.

For a fixed context length parameter $k$, the DUDE counts the occurrences of all strings of length $2k+1$ appearing in $z^n$. The estimated value $\hat{x}_i$ is determined based on the two-sided length-$k$ context $(z_{i-k}, \dots, z_{i-1}; z_{i+1}, \dots, z_{i+k})$ of $z_i$, taking into account all the other tokens in $z^n$ with the same context, as well as the known channel matrix and the loss function being used.

The idea underlying the DUDE is best illustrated when $z^n$ is a realization of a random vector $Z^n$. If the conditional distribution of the noiseless symbol $x_i$ given its noisy context were available, the optimal estimator would be the Bayes response to that distribution. Fortunately, when the channel matrix is known and non-degenerate, this conditional distribution can be expressed in terms of the conditional distribution of the noisy symbol $Z_i$ given its noisy context. This latter conditional distribution, in turn, can be estimated from an individual observed noisy signal $z^n$ by virtue of the law of large numbers, provided $n$ is "large enough".

Applying the DUDE scheme with a context length $k$ to a sequence of length $n$ over a finite alphabet $\mathcal{Z}$ requires $O(n)$ operations and space $O(\min(n, |\mathcal{Z}|^{2k}))$.

Under certain assumptions, the DUDE is a universal scheme in the sense of asymptotically performing as well as an optimal denoiser, which has oracle access to the unknown sequence. More specifically, assume that the denoising performance is measured using a given single-symbol fidelity criterion, and consider the regime where the sequence length $n$ tends to infinity and the context length $k = k_n$ tends to infinity "not too fast". In the stochastic setting, where a doubly infinite noiseless sequence $\mathbf{x}$ is a realization of a stationary process $\mathbf{X}$, the DUDE asymptotically performs, in expectation, as well as the best denoiser, which has oracle access to the source distribution. In the single-sequence, or "semi-stochastic", setting with a fixed doubly infinite sequence $\mathbf{x}$, the DUDE asymptotically performs as well as the best "sliding window" denoiser, namely any denoiser that determines $\hat{x}_i$ from the window $z_{i-k}^{i+k}$, which has oracle access to $\mathbf{x}$.

The discrete denoising problem

[Figure: block diagram description of the discrete denoising problem]

Let $\mathcal{X}$ be the finite alphabet of a fixed but unknown original "noiseless" sequence $x^n = (x_1, \dots, x_n) \in \mathcal{X}^n$. The sequence is fed into a discrete memoryless channel (DMC). The DMC operates on each symbol $x_i$ independently, producing a corresponding random symbol $Z_i$ in a finite alphabet $\mathcal{Z}$. The DMC is known and given as a $|\mathcal{X}|$-by-$|\mathcal{Z}|$ Markov matrix $\Pi$, whose entries are $\Pi(x, z) = \mathbb{P}(Z = z \,|\, X = x)$. It is convenient to write $\pi_z$ for the $z$-column of $\Pi$. The DMC produces a random noisy sequence $Z^n = (Z_1, \dots, Z_n)$. A specific realization of this random vector will be denoted by $z^n$. A denoiser is a function $\hat{X}^n : \mathcal{Z}^n \to \mathcal{X}^n$ that attempts to recover the noiseless sequence $x^n$ from a distorted version $z^n$. A specific denoised sequence is denoted by $\hat{x}^n = \hat{X}^n(z^n)$. The problem of choosing the denoiser $\hat{X}^n$ is known as signal estimation, filtering, or smoothing. To compare candidate denoisers, we choose a single-symbol fidelity criterion $\Lambda : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ (for example, the Hamming loss) and define the per-symbol loss of the denoiser $\hat{X}^n$ at $(x^n, z^n)$ by

$$L_{\hat{X}^n}(x^n, z^n) = \frac{1}{n} \sum_{i=1}^{n} \Lambda\bigl(x_i, \hat{X}^n(z^n)_i\bigr).$$

Ordering the elements of the alphabet as $\mathcal{X} = \{a_1, \dots, a_{|\mathcal{X}|}\}$, the fidelity criterion can be given by a $|\mathcal{X}|$-by-$|\mathcal{X}|$ matrix, with columns of the form

$$\lambda_{\hat{x}} = \begin{pmatrix} \Lambda(a_1, \hat{x}) \\ \vdots \\ \Lambda(a_{|\mathcal{X}|}, \hat{x}) \end{pmatrix}.$$
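For example, under the Hamming loss, $\Lambda(x, \hat{x}) = \mathbf{1}\{x \neq \hat{x}\}$, the fidelity matrix is the all-ones matrix minus the identity:

$$\Lambda = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix}, \qquad \lambda_{\hat{x}} = \bigl(\mathbf{1}\{a_1 \neq \hat{x}\}, \dots, \mathbf{1}\{a_{|\mathcal{X}|} \neq \hat{x}\}\bigr)^\top.$$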

The DUDE scheme

Step 1: Calculating the empirical distribution in each context

The DUDE corrects symbols according to their context. The context length $k$ used is a tuning parameter of the scheme. For $k+1 \le i \le n-k$, define the left context of the $i$-th symbol in $z^n$ by $l^k(z^n, i) = (z_{i-k}, \dots, z_{i-1})$ and the corresponding right context as $r^k(z^n, i) = (z_{i+1}, \dots, z_{i+k})$. A two-sided context is a combination $(l, r)$ of a left and a right context.

The first step of the DUDE scheme is to calculate the empirical distribution of symbols in each possible two-sided context along the noisy sequence $z^n$. Formally, a given two-sided context $(l, r) \in \mathcal{Z}^k \times \mathcal{Z}^k$ that appears once or more along $z^n$ determines an empirical probability distribution $\mu(z^n, l, r)$ over $\mathcal{Z}$, whose value at the symbol $z$ is

$$\mu(z^n, l, r)[z] = \frac{\bigl|\{\, k+1 \le i \le n-k : z_{i-k}^{i+k} = l z r \,\}\bigr|}{\bigl|\{\, k+1 \le i \le n-k : l^k(z^n, i) = l,\ r^k(z^n, i) = r \,\}\bigr|},$$

namely the fraction of appearances of the context $(l, r)$ along $z^n$ in which the center symbol is $z$.

Thus, the first step of the DUDE scheme with context length $k$ is to scan the input noisy sequence once and store the length-$|\mathcal{Z}|$ empirical distribution vector $\mu(z^n, l, r)$ (or its non-normalized version, the count vector $\mathbf{m}(z^n, l, r)$) for each two-sided context $(l, r)$ found along $z^n$. Since there are at most $\min(n, |\mathcal{Z}|^{2k})$ possible two-sided contexts along $z^n$, this step requires $O(n)$ operations and storage $O(\min(n, |\mathcal{Z}|^{2k}))$.
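As an illustration, here is a minimal Python sketch of this counting pass; the function name and the encoding of symbols as integers are our assumptions, not part of the original scheme:

```python
from collections import Counter, defaultdict

def context_counts(z, k):
    """First DUDE pass: for every two-sided context (l, r) of length k
    appearing in the noisy sequence z, count how often each center
    symbol occurs. The Counter stored for (l, r) is the unnormalized
    empirical distribution m(z^n, l, r)."""
    counts = defaultdict(Counter)
    for i in range(k, len(z) - k):
        left = tuple(z[i - k:i])            # z_{i-k} .. z_{i-1}
        right = tuple(z[i + 1:i + k + 1])   # z_{i+1} .. z_{i+k}
        counts[(left, right)][z[i]] += 1
    return counts
```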

Step 2: Calculating the Bayes response to each context

Denote the column of the single-symbol fidelity criterion $\Lambda$ corresponding to the symbol $\hat{x} \in \mathcal{X}$ by $\lambda_{\hat{x}}$. We define the Bayes response to any vector $\mathbf{v}$ of length $|\mathcal{X}|$ with non-negative entries as

$$\hat{x}_{\text{Bayes}}(\mathbf{v}) = \operatorname*{argmin}_{\hat{x} \in \mathcal{X}} \lambda_{\hat{x}}^\top \mathbf{v}.$$

This definition is motivated in the Background section below.

The second step of the DUDE scheme is to calculate, for each two-sided context $(l, r)$ observed in the previous step along $z^n$, and for each symbol $z$ observed in each context (namely, any $z \in \mathcal{Z}$ such that $lzr$ is a substring of $z^n$), the Bayes response to the vector $\pi_z \odot \Pi^{-\top} \mu(z^n, l, r)$, namely

$$g(l, z, r) := \hat{x}_{\text{Bayes}}\bigl(\pi_z \odot \Pi^{-\top} \mu(z^n, l, r)\bigr).$$

Note that the sequence $z^n$ and the context length $k$ are implicit in the notation. Here, $\pi_z$ is the $z$-column of $\Pi$ and, for vectors $\mathbf{a}$ and $\mathbf{b}$, $\mathbf{a} \odot \mathbf{b}$ denotes their Schur (entrywise) product, defined by $(\mathbf{a} \odot \mathbf{b})_i = a_i b_i$. Matrix multiplication is evaluated before the Schur product, so that $\pi_z \odot \Pi^{-\top} \mu$ stands for $\pi_z \odot (\Pi^{-\top} \mu)$.

This formula assumes that the channel matrix $\Pi$ is square ($|\mathcal{X}| = |\mathcal{Z}|$) and invertible. When $|\mathcal{X}| \le |\mathcal{Z}|$ and $\Pi$ is not invertible, under the reasonable assumption that it has full row rank, we replace $\Pi^{-1}$ above with its Moore-Penrose pseudo-inverse $\Pi^\dagger = \Pi^\top (\Pi \Pi^\top)^{-1}$ and calculate instead

$$g(l, z, r) := \hat{x}_{\text{Bayes}}\bigl(\pi_z \odot (\Pi^\dagger)^\top \mu(z^n, l, r)\bigr).$$
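For concreteness, consider the binary case: $\mathcal{X} = \mathcal{Z} = \{0, 1\}$, a binary symmetric channel with crossover probability $\delta < 1/2$, and the Hamming loss. Writing $m_0$ and $m_1$ for the counts of the two center symbols in a given two-sided context, expanding $\pi_{z_i} \odot \Pi^{-\top} \mathbf{m}$ shows that $g$ flips the observed symbol $z_i$ if and only if

$$\frac{m_{z_i}}{m_{\bar{z}_i}} < \frac{2\delta(1-\delta)}{\delta^2 + (1-\delta)^2},$$

where $\bar{z}_i$ is the complement of $z_i$: a symbol is corrected precisely when it is sufficiently rare within its own context.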

By caching the inverse $\Pi^{-\top}$ or pseudo-inverse $(\Pi^\dagger)^\top$, and the values $g(l, z, r)$ for the relevant triples $(l, z, r)$, this step requires $O(n)$ operations and $O(\min(n, |\mathcal{Z}|^{2k+1}))$ storage.
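Continuing the sketch above, the second step might look as follows in Python, with `Pi` the channel matrix $\Pi$ (shape $|\mathcal{X}| \times |\mathcal{Z}|$) and `Lam` the loss matrix $\Lambda$; `numpy.linalg.pinv` covers both the invertible and the full-row-rank cases, since the pseudo-inverse of an invertible matrix equals its inverse:

```python
import numpy as np

def bayes_response(Lam, v):
    """Bayes response to a nonnegative vector v indexed by the clean
    alphabet: the reconstruction xhat minimizing lambda_xhat^T v,
    where lambda_xhat is the xhat-column of the loss matrix Lam."""
    return int(np.argmin(Lam.T @ v))

def dude_rule(Pi, Lam, counts, alphabet_size):
    """Precompute g(l, z, r) for every context/center pair found in
    step 1: the Bayes response to pi_z * (Pi^{-T} m(z^n, l, r))."""
    Pi_inv_T = np.linalg.pinv(Pi).T    # equals Pi^{-T} when Pi is invertible
    g = {}
    for (l, r), ctr in counts.items():
        m = np.array([ctr[a] for a in range(alphabet_size)], dtype=float)
        u = Pi_inv_T @ m               # estimate of the clean count vector
        for z in ctr:                  # only center symbols actually observed
            g[(l, z, r)] = bayes_response(Lam, Pi[:, z] * u)
    return g
```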

Step 3: Estimating each symbol by the Bayes response to its context

The third and final step of the DUDE scheme is to scan $z^n$ again and compute the actual denoised sequence $\hat{x}^n$. The denoised symbol chosen to replace $z_i$ is the Bayes response to the two-sided context of the symbol, namely

$$\hat{x}_i = g\bigl(l^k(z^n, i), z_i, r^k(z^n, i)\bigr), \qquad k+1 \le i \le n-k.$$

The first and last $k$ symbols, which lack a full two-sided context, can be left unchanged or set by any fixed rule, as their contribution to the per-symbol loss vanishes as $n \to \infty$.

This step requires $O(n)$ operations and uses the data structure constructed in the previous step.
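A sketch of this final pass, together with a toy experiment, building on the two sketches above; the source model (a sticky binary Markov chain through a binary symmetric channel) and all parameter choices are illustrative assumptions:

```python
def dude(z, k, Pi, Lam):
    """Full three-pass DUDE sketch. Boundary symbols without a full
    two-sided context are left as their noisy values."""
    counts = context_counts(z, k)
    g = dude_rule(Pi, Lam, counts, alphabet_size=Pi.shape[1])
    xhat = list(z)
    for i in range(k, len(z) - k):
        l, r = tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1])
        xhat[i] = g[(l, z[i], r)]
    return xhat

# Toy example: sticky binary Markov source through a BSC with delta = 0.1.
rng = np.random.default_rng(0)
n, delta = 100_000, 0.1
x = np.zeros(n, dtype=int)
for t in range(1, n):
    x[t] = x[t - 1] ^ (rng.random() < 0.05)   # flips state w.p. 0.05
z = x ^ (rng.random(n) < delta)               # pass through the BSC
Pi = np.array([[1 - delta, delta],
               [delta, 1 - delta]])
Lam = 1.0 - np.eye(2)                         # Hamming loss
xhat = np.array(dude(list(z), k=2, Pi=Pi, Lam=Lam))
print((xhat != x).mean(), "vs noisy", (z != x).mean())
```

On such a source the denoised error rate falls well below the raw channel error rate, since long runs of identical context symbols make corrupted centers easy to spot.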

In summary, the entire DUDE scheme requires $O(n)$ operations and $O(\min(n, |\mathcal{Z}|^{2k+1}))$ storage.

Asymptotic optimality properties

The DUDE is designed to be universally optimal, namely, optimal (in some sense, under some assumptions) regardless of the original sequence $x^n$.

Let $\hat{X}^{n, k_n}$ denote a sequence of DUDE schemes, as described above, where the $n$-th scheme uses a context length $k_n$ that is implicit in the notation. We only require that $\lim_{n \to \infty} k_n = \infty$ and that $k_n |\mathcal{Z}|^{2k_n} = o(n / \log n)$; for example, $k_n = \lceil c \log_{|\mathcal{Z}|} n \rceil$ with $0 < c < 1/2$ satisfies both requirements.

For a stationary source

Denote by $\mathcal{D}_n$ the set of all $n$-block denoisers, namely all maps $\hat{X}^n : \mathcal{Z}^n \to \mathcal{X}^n$.

Let $\mathbf{X} = (X_i)_{i=-\infty}^{\infty}$ be an unknown stationary source and let $\mathbf{Z}$ be the corresponding noisy process. Then

$$\lim_{n \to \infty} \mathbb{E}\, L_{\hat{X}^{n, k_n}}(X^n, Z^n) = \lim_{n \to \infty} \min_{\hat{X}^n \in \mathcal{D}_n} \mathbb{E}\, L_{\hat{X}^n}(X^n, Z^n),$$

and both limits exist. If, in addition, the source is ergodic, then

$$\limsup_{n \to \infty} L_{\hat{X}^{n, k_n}}(X^n, Z^n) \le \lim_{n \to \infty} \min_{\hat{X}^n \in \mathcal{D}_n} \mathbb{E}\, L_{\hat{X}^n}(X^n, Z^n) \qquad \text{almost surely}.$$

For an individual sequence

Denote by $\mathcal{D}_{n,k}$ the set of all $n$-block $k$-th order sliding window denoisers, namely all maps $\hat{X}^n \in \mathcal{D}_n$ of the form $\hat{X}^n(z^n)_i = f\bigl(z_{i-k}^{i+k}\bigr)$ with $f : \mathcal{Z}^{2k+1} \to \mathcal{X}$ arbitrary.

Let $\mathbf{x} = (x_i)_{i=-\infty}^{\infty}$ be a fixed, unknown, doubly infinite noiseless sequence and let $\mathbf{Z}$ be the corresponding noisy sequence. Then

$$\lim_{n \to \infty} \Bigl[ L_{\hat{X}^{n, k_n}}(x^n, Z^n) - \min_{\hat{X}^n \in \mathcal{D}_{n, k_n}} L_{\hat{X}^n}(x^n, Z^n) \Bigr] = 0 \qquad \text{almost surely}.$$

Non-asymptotic performance

Let $\hat{X}^{n,k}$ denote the DUDE with context length $k$ defined on $n$-blocks. Then there exist explicit constants $A$ and $C$ that depend on $(\Pi, \Lambda)$ alone, such that for any $n$ and any $x^n \in \mathcal{X}^n$ we have

$$\frac{A}{\sqrt{n}} \;\le\; \mathbb{E}\, L_{\hat{X}^{n,k}}(x^n, Z^n) - \min_{\hat{X}^n \in \mathcal{D}_{n,k}} \mathbb{E}\, L_{\hat{X}^n}(x^n, Z^n) \;\le\; \frac{C\, k\, |\mathcal{Z}|^{2k}}{\sqrt{n}},$$

where $Z^n$ is the noisy sequence corresponding to $x^n$ (whose randomness is due to the channel alone). [2]

In fact, the lower bound $A/\sqrt{n}$ holds, with the same constant, for any $n$-block denoiser $\hat{X}^n \in \mathcal{D}_n$. [1] The lower bound proof requires that the channel matrix $\Pi$ be square and that the pair $(\Pi, \Lambda)$ satisfy a certain technical condition.

Background

To motivate the particular definition of the DUDE using the Bayes response to a particular vector, we now find the optimal denoiser in the non-universal case, where the unknown sequence $x^n$ is a realization of a random vector $X^n$ whose distribution is known.

Consider first the case $n = 1$. Since the joint distribution of $(X, Z)$ is known, given the observed noisy symbol $z$, the unknown symbol $X \in \mathcal{X}$ is distributed according to the known distribution $\mathbb{P}(X = x \,|\, Z = z)$. By ordering the elements of $\mathcal{X}$, we can describe this conditional distribution on $\mathcal{X}$ using a probability vector $\mathbf{P}_{X|z}$, indexed by $\mathcal{X}$, whose $x$-entry is $\mathbb{P}(X = x \,|\, Z = z)$. Clearly, the expected loss for the choice of estimated symbol $\hat{x}$ is $\lambda_{\hat{x}}^\top \mathbf{P}_{X|z}$.

Define the Bayes envelope of a probability vector $\mathbf{v}$, describing a probability distribution on $\mathcal{X}$, as the minimal expected loss $B(\mathbf{v}) = \min_{\hat{x} \in \mathcal{X}} \lambda_{\hat{x}}^\top \mathbf{v}$, and the Bayes response to $\mathbf{v}$ as the prediction that achieves this minimum, $\hat{x}_{\text{Bayes}}(\mathbf{v}) = \operatorname*{argmin}_{\hat{x} \in \mathcal{X}} \lambda_{\hat{x}}^\top \mathbf{v}$. Observe that the Bayes response is scale invariant in the sense that $\hat{x}_{\text{Bayes}}(\mathbf{v}) = \hat{x}_{\text{Bayes}}(\alpha \mathbf{v})$ for $\alpha > 0$.

For the case $n = 1$, then, the optimal denoiser is $\hat{X}(z) = \hat{x}_{\text{Bayes}}(\mathbf{P}_{X|z})$. This optimal denoiser can be expressed using the marginal distribution $\mathbf{P}_Z$ of $Z$ alone, as follows. When the channel matrix $\Pi$ is invertible, we have $\mathbf{P}_{X|z} \propto \pi_z \odot \Pi^{-\top} \mathbf{P}_Z$, where $\pi_z$ is the $z$-th column of $\Pi$. This implies that the optimal denoiser is given equivalently by $\hat{X}(z) = \hat{x}_{\text{Bayes}}\bigl(\pi_z \odot \Pi^{-\top} \mathbf{P}_Z\bigr)$. When $|\mathcal{X}| \le |\mathcal{Z}|$ and $\Pi$ is not invertible, under the reasonable assumption that it has full row rank, we can replace $\Pi^{-1}$ with its Moore-Penrose pseudo-inverse $\Pi^\dagger$ and obtain

$$\hat{X}(z) = \hat{x}_{\text{Bayes}}\bigl(\pi_z \odot (\Pi^\dagger)^\top \mathbf{P}_Z\bigr).$$
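The identity behind this expression is Bayes' rule combined with the channel relation $\mathbf{P}_Z = \Pi^\top \mathbf{P}_X$, which gives $\mathbf{P}_X = \Pi^{-\top} \mathbf{P}_Z$:

$$\mathbf{P}_{X|z}(x) = \frac{\Pi(x, z)\, \mathbf{P}_X(x)}{\mathbf{P}_Z(z)} \;\propto\; \pi_z(x) \cdot \bigl(\Pi^{-\top} \mathbf{P}_Z\bigr)(x),$$

and the normalizing factor $\mathbf{P}_Z(z)$ can be dropped because the Bayes response is scale invariant.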

Turning now to arbitrary $n$, the optimal denoiser $\hat{X}^{\text{opt}}$ (with minimal expected loss) is therefore given, symbol by symbol, by the Bayes response to the conditional distribution of $X_i$ given the entire noisy sequence,

$$\hat{X}^{\text{opt}}(z^n)_i = \hat{x}_{\text{Bayes}}\bigl(\mathbf{P}_{X_i | z^n}\bigr),$$

where $\mathbf{P}_{X_i | z^n}$ is a vector indexed by $\mathcal{X}$, whose $x$-entry is $\mathbb{P}(X_i = x \,|\, Z^n = z^n)$. The conditional probability vector $\mathbf{P}_{X_i | z^n}$ is hard to compute. A derivation analogous to the case $n = 1$ shows that the optimal denoiser admits an alternative representation, namely $\hat{X}^{\text{opt}}(z^n)_i = \hat{x}_{\text{Bayes}}\bigl(\pi_{z_i} \odot \Pi^{-\top} \mathbf{P}_{Z_i | z^{n \setminus i}}\bigr)$, where $z^{n \setminus i} = (z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n)$ is the noisy sequence with the $i$-th symbol removed and $\mathbf{P}_{Z_i | z^{n \setminus i}}$ is the probability vector indexed by $\mathcal{Z}$ whose $z$-entry is $\mathbb{P}(Z_i = z \,|\, Z^{n \setminus i} = z^{n \setminus i})$. Again, $\Pi^{-1}$ is replaced by a pseudo-inverse if $\Pi$ is not square or not invertible.

When the distribution of $X^n$ (and therefore of $Z^n$) is not available, the DUDE replaces the unknown vector $\mathbf{P}_{Z_i | z^{n \setminus i}}$ with an empirical estimate obtained along the noisy sequence itself, namely with $\mu\bigl(z^n, l^k(z^n, i), r^k(z^n, i)\bigr)$. This leads to the above definition of the DUDE.

While the convergence arguments behind the optimality properties above are more subtle, we note that the above, combined with the Birkhoff ergodic theorem, is enough to prove that for a stationary ergodic source, the DUDE with context length $k$ is asymptotically optimal among all $k$-th order sliding window denoisers.

Extensions

The basic DUDE as described here assumes a signal with a one-dimensional index set over a finite alphabet, a known memoryless channel, and a context length that is fixed in advance. Relaxations of each of these assumptions have been considered in turn. [3] Specifically:

  - Infinite and continuous-amplitude alphabets [4] [5]
  - Channels with memory [8] [9]
  - Channel uncertainty, where the channel matrix is not known exactly [10] [11]
  - Adaptive, multi-directional, and variable-length context sets [12] [13] [14] [15]
  - Two-dimensional index sets, as arise in images [16]

Applications

Application to image denoising

A DUDE-based framework for grayscale image denoising [6] achieves state-of-the-art denoising for impulse-type noise channels (e.g., "salt and pepper" or M-ary symmetric noise), and good performance on the Gaussian channel (comparable to the Non-local means image denoising scheme on this channel). A different DUDE variant applicable to grayscale images is presented in [7].

Application to channel decoding of uncompressed sources

The DUDE has led to universal algorithms for channel decoding of uncompressed sources. [17]

Related Research Articles

<span class="mw-page-title-main">Discrete Fourier transform</span> Type of Fourier transform in discrete mathematics

In mathematics, the discrete Fourier transform (DFT) converts a finite sequence of equally-spaced samples of a function into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency. The interval at which the DTFT is sampled is the reciprocal of the duration of the input sequence. An inverse DFT (IDFT) is a Fourier series, using the DTFT samples as coefficients of complex sinusoids at the corresponding DTFT frequencies. It has the same sample-values as the original input sequence. The DFT is therefore said to be a frequency domain representation of the original input sequence. If the original sequence spans all the non-zero values of a function, its DTFT is continuous, and the DFT provides discrete samples of one cycle. If the original sequence is one cycle of a periodic function, the DFT provides all the non-zero values of one DTFT cycle.

<i>n</i>-sphere Generalized sphere of dimension n (mathematics)

In mathematics, an n-sphere or hypersphere is an n-dimensional generalization of the 1-dimensional circle and 2-dimensional sphere to any non-negative integer n. The n-sphere is the setting for n-dimensional spherical geometry.

<span class="mw-page-title-main">Probability density function</span> Function whose integral over a region describes the probability of an event occurring in that region

In probability theory, a probability density function (PDF), density function, or density of an absolutely continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.

<span class="mw-page-title-main">Multivariate random variable</span> Random variable with multiple component dimensions

In probability, and statistics, a multivariate random variable or random vector is a list or vector of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system — often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

In mathematics, a partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant. Partial derivatives are used in vector calculus and differential geometry.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Faà di Bruno's formula is an identity in mathematics generalizing the chain rule to higher derivatives. It is named after Francesco Faà di Bruno, although he was not the first to state or prove the formula. In 1800, more than 50 years before Faà di Bruno, the French mathematician Louis François Antoine Arbogast had stated the formula in a calculus textbook, which is considered to be the first published reference on the subject.

<span class="mw-page-title-main">Radon transform</span> Integral transform

In mathematics, the Radon transform is the integral transform which takes a function f defined on the plane to a function Rf defined on the (two-dimensional) space of lines in the plane, whose value at a particular line is equal to the line integral of the function over that line. The transform was introduced in 1917 by Johann Radon, who also provided a formula for the inverse transform. Radon further included formulas for the transform in three dimensions, in which the integral is taken over planes. It was later generalized to higher-dimensional Euclidean spaces and more broadly in the context of integral geometry. The complex analogue of the Radon transform is known as the Penrose transform. The Radon transform is widely applicable to tomography, the creation of an image from the projection data associated with cross-sectional scans of an object.

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided dice rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
  2. To derive a lower bound for the marginal likelihood of the observed data. This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.

In algebraic topology, a Steenrod algebra was defined by Henri Cartan to be the algebra of stable cohomology operations for mod cohomology.

In the mathematical theory of probability, a Doob martingale is a stochastic process that approximates a given random variable and has the martingale property with respect to the given filtration. It may be thought of as the evolving sequence of best approximations to the random variable based on information accumulated up to a certain time.

In mathematics, a holomorphic vector bundle is a complex vector bundle over a complex manifold X such that the total space E is a complex manifold and the projection map π : EX is holomorphic. Fundamental examples are the holomorphic tangent bundle of a complex manifold, and its dual, the holomorphic cotangent bundle. A holomorphic line bundle is a rank one holomorphic vector bundle.

In quantum computing and quantum communication, a stabilizer code is a class of quantum codes for performing quantum error correction. The toric code, and surface codes more generally, are types of stabilizer codes considered very important for the practical realization of quantum information processing.

In many-body theory, the term Green's function is sometimes used interchangeably with correlation function, but refers specifically to correlators of field operators or creation and annihilation operators.

In probability theory, the family of complex normal distributions, denoted or , characterizes complex random variables whose real and imaginary parts are jointly normal. The complex normal family has three parameters: location parameter μ, covariance matrix , and the relation matrix . The standard complex normal is the univariate distribution with , , and .

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

In mathematics, topological recursion is a recursive definition of invariants of spectral curves. It has applications in enumerative geometry, random matrix theory, mathematical physics, string theory, knot theory.

<span class="mw-page-title-main">Hyperbolastic functions</span> Mathematical functions

The hyperbolastic functions, also known as hyperbolastic growth models, are mathematical functions that are used in medical statistical modeling. These models were originally developed to capture the growth dynamics of multicellular tumor spheres, and were introduced in 2005 by Mohammad Tabatabai, David Williams, and Zoran Bursac. The precision of hyperbolastic functions in modeling real world problems is somewhat due to their flexibility in their point of inflection. These functions can be used in a wide variety of modeling problems such as tumor growth, stem cell proliferation, pharma kinetics, cancer growth, sigmoid activation function in neural networks, and epidemiological disease progression or regression.

References

  1. T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger. Universal discrete denoising: Known channel. IEEE Transactions on Information Theory, 51(1):5–28, 2005.
  2. K. Viswanathan and E. Ordentlich. Lower limits of discrete universal denoising. IEEE Transactions on Information Theory, 55(3):1374–1386, 2009.
  3. E. Ordentlich, G. Seroussi, S. Verdú, M. J. Weinberger, and T. Weissman. "Reflections on the DUDE" (PDF).
  4. A. Dembo and T. Weissman. Universal denoising for the finite-input general-output channel. IEEE Transactions on Information Theory, 51(4):1507–1517, April 2005.
  5. K. Sivaramakrishnan and T. Weissman. Universal denoising of discrete-time continuous-amplitude signals. In Proc. 2006 IEEE International Symposium on Information Theory (ISIT '06), Seattle, WA, USA, July 2006.
  6. G. Motta, E. Ordentlich, I. Ramírez, G. Seroussi, and M. Weinberger. The DUDE framework for continuous tone image denoising. IEEE Transactions on Image Processing, 20(1), January 2011.
  7. K. Sivaramakrishnan and T. Weissman. Universal denoising of continuous-amplitude signals with applications to images. In Proc. IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 2609–2612.
  8. C. D. Giurcaneanu and B. Yu. Efficient algorithms for discrete universal denoising for channels with memory. In Proc. 2005 IEEE International Symposium on Information Theory (ISIT '05), Adelaide, Australia, September 2005.
  9. R. Zhang and T. Weissman. Discrete denoising for channels with memory. Communications in Information and Systems (CIS), 5(2):257–288, 2005.
  10. G. M. Gemelos, S. Sigurjonsson, and T. Weissman. Universal minimax discrete denoising under channel uncertainty. IEEE Transactions on Information Theory, 52:3476–3497, 2006.
  11. G. M. Gemelos, S. Sigurjonsson, and T. Weissman. Algorithms for discrete denoising under channel uncertainty. IEEE Transactions on Signal Processing, 54(6):2263–2276, June 2006.
  12. E. Ordentlich, M. J. Weinberger, and T. Weissman. Multi-directional context sets with applications to universal denoising and compression. In Proc. 2005 IEEE International Symposium on Information Theory (ISIT '05), Adelaide, Australia, September 2005.
  13. J. Yu and S. Verdú. Schemes for bidirectional modeling of discrete stationary sources. IEEE Transactions on Information Theory, 52(11):4789–4807, 2006.
  14. S. Chen, S. N. Diggavi, S. Dusad, and S. Muthukrishnan. Efficient string matching algorithms for combinatorial universal denoising. In Proc. IEEE Data Compression Conference (DCC), Snowbird, Utah, March 2005.
  15. G. Gimel'farb. Adaptive context for a discrete universal denoiser. In Proc. Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops, SSPR 2004 and SPR 2004, Lisbon, Portugal, August 18–20, 2004, pp. 477–485.
  16. E. Ordentlich, G. Seroussi, S. Verdú, M. J. Weinberger, and T. Weissman. A universal discrete image denoiser and its application to binary images. In Proc. IEEE International Conference on Image Processing, Barcelona, Catalonia, Spain, September 2003.
  17. E. Ordentlich, G. Seroussi, S. Verdú, and K. Viswanathan. Universal algorithms for channel decoding of uncompressed sources. IEEE Transactions on Information Theory, 54(5):2243–2262, May 2008.