Entropy estimation

In various science and engineering applications, such as independent component analysis, [1] image analysis, [2] genetic analysis, [3] speech recognition, [4] manifold learning, [5] and time delay estimation, [6] it is useful to estimate the differential entropy of a system or process, given some observations.

The simplest and most common approach uses histogram-based estimation, but other approaches have been developed and used, each with its own benefits and drawbacks. [7] The main factor in choosing a method is often a trade-off between the bias and the variance of the estimate, [8] although the nature of the (suspected) distribution of the data may also be a factor. [7]

Histogram estimator

The histogram approach uses the idea that the differential entropy of a probability distribution $f(x)$ for a continuous random variable $X$,

$$H(X) = -\int f(x) \log f(x) \, dx,$$

can be approximated by first approximating $f(x)$ with a histogram of the observations, and then finding the discrete entropy of a quantization of $X$,

$$\hat{H}(X) = -\sum_{i=1}^{n} \hat{p}_i \log \frac{\hat{p}_i}{w_i},$$

with bin probabilities $\hat{p}_i$ given by that histogram, where $w_i$ is the width of the $i$th bin. The histogram is itself a maximum-likelihood (ML) estimate of the discretized frequency distribution [citation needed]. Histograms can be quick to calculate and simple, so this approach has some attraction. However, the estimate produced is biased, and although corrections can be made to the estimate, they may not always be satisfactory. [9]
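
As an illustration, here is a minimal sketch (not from the cited sources) of the histogram estimate above, in Python with NumPy; the number of bins is a free choice and is one source of the bias just mentioned.

```python
import numpy as np

def histogram_entropy(samples, bins=30):
    """Rough differential-entropy estimate from a histogram of 1-D samples."""
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)                  # bin widths w_i
    p = counts / counts.sum()                # bin probabilities from the histogram
    nz = p > 0                               # ignore empty bins (0 log 0 = 0)
    # Discrete entropy of the quantization plus the log-bin-width correction:
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

# Example: for N(0, 1) the true differential entropy is 0.5*log(2*pi*e) ~ 1.419 nats.
rng = np.random.default_rng(0)
print(histogram_entropy(rng.normal(size=10_000)))
```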

A method better suited for multidimensional probability density functions (pdf) is to first make a pdf estimate with some method, and then, from the pdf estimate, compute the entropy. A useful pdf estimation method is, for example, Gaussian mixture modeling (GMM), where the expectation–maximization (EM) algorithm is used to find an ML estimate of a weighted sum of Gaussian pdfs approximating the data pdf.
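
A hedged sketch of this pdf-first route, assuming scikit-learn's GaussianMixture for the EM fit: since a Gaussian mixture has no closed-form entropy, the sketch approximates h = −E[log f(X)] by Monte Carlo over samples drawn from the fitted model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_entropy(data, n_components=3, n_mc=100_000, seed=0):
    """Fit a GMM by EM, then estimate h = -E[log f(X)] by Monte Carlo."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(data)
    samples, _ = gmm.sample(n_mc)              # draw from the fitted mixture
    return -gmm.score_samples(samples).mean()  # score_samples = log-density

# Example: 2-D Gaussian data; the true differential entropy of this Gaussian is
# 0.5 * log((2*pi*e)**2 * det(cov)) ~ 3.12 nats.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 2.0]], size=5_000)
print(gmm_entropy(data))
```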

Estimates based on sample-spacings

If the data is one-dimensional, we can imagine taking all the observations and putting them in order of their value. The spacing between one value and the next then gives us a rough idea of (the reciprocal of) the probability density in that region: the closer together the values are, the higher the probability density. This is a very rough estimate with high variance, but can be improved, for example by thinking about the space between a given value and the one m away from it, where m is some fixed number. [7]

The probability density estimated in this way can then be used to calculate the entropy estimate, in a similar way to that given above for the histogram, but with some slight tweaks.
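
One well-known member of this family is Vasicek's m-spacing estimator; the sketch below is a minimal rendering of it (the rule of thumb m ≈ √n is an assumption of convenience, not taken from the cited survey).

```python
import numpy as np

def vasicek_entropy(samples, m=None):
    """Vasicek m-spacing estimate of differential entropy for 1-D data."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    if m is None:
        m = max(1, int(np.sqrt(n)))           # a common rule-of-thumb choice
    # Boundary convention: order-statistic indices below the first sample map to
    # the smallest value, and indices past the last sample map to the largest.
    upper = x[np.minimum(np.arange(n) + m, n - 1)]
    lower = x[np.maximum(np.arange(n) - m, 0)]
    spacings = upper - lower                  # the m-spacings X_(i+m) - X_(i-m)
    return np.mean(np.log(n / (2 * m) * spacings))

# Example: the exponential(1) distribution has true differential entropy 1 nat.
rng = np.random.default_rng(0)
print(vasicek_entropy(rng.exponential(size=10_000)))
```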

One of the main drawbacks with this approach is going beyond one dimension: the idea of lining the data points up in order falls apart in more than one dimension. However, using analogous methods, some multidimensional entropy estimators have been developed. [10] [11]

Estimates based on nearest-neighbours

For each point in our dataset, we can find the distance to its nearest neighbour. We can in fact estimate the entropy from the distribution of the nearest-neighbour-distance of our datapoints. [7] (In a uniform distribution these distances all tend to be fairly similar, whereas in a strongly nonuniform distribution they may vary a lot more.)
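
A common concrete instance is the Kozachenko–Leonenko estimator and its k-nearest-neighbour generalization; the sketch below, which uses SciPy for the neighbour search, is an illustration of one standard form of that estimator rather than code from the cited survey.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(data, k=1):
    """k-nearest-neighbour (Kozachenko-Leonenko style) entropy estimate, in nats."""
    x = np.asarray(data, dtype=float)
    n, d = x.shape
    tree = cKDTree(x)
    # Distances to the k nearest neighbours of each point (first column is the
    # point itself at distance 0, so query k+1 and keep the last column).
    r, _ = tree.query(x, k=k + 1)
    r_k = r[:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of the unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r_k))

# Example: 2-D standard normal, true differential entropy = log(2*pi*e) ~ 2.84 nats.
rng = np.random.default_rng(0)
print(knn_entropy(rng.normal(size=(5_000, 2)), k=3))
```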

Bayesian estimator

In the under-sampled regime, having a prior on the distribution can help the estimation. One such Bayesian estimator, known as the NSB (Nemenman–Shafee–Bialek) estimator, was proposed in the neuroscience context. [12] [13] The NSB estimator uses a mixture of Dirichlet priors, chosen such that the induced prior over the entropy is approximately uniform.
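
As a hedged building-block sketch (not the complete NSB procedure, which integrates over the Dirichlet concentration parameter), the code below computes the Bayesian posterior-mean entropy for a single symmetric Dirichlet prior with fixed concentration alpha; a mixture estimator of the NSB kind averages quantities of this form under its near-uniform prior on the entropy.

```python
import numpy as np
from scipy.special import digamma

def dirichlet_mean_entropy(counts, alpha=1.0):
    """Posterior-mean entropy (in nats) under a symmetric Dirichlet(alpha) prior.

    With observed counts n_i, the posterior is Dirichlet(n_i + alpha), and
    E[H] = psi(A + 1) - sum_i (a_i / A) * psi(a_i + 1),
    where a_i = n_i + alpha and A = sum_i a_i.
    """
    a = np.asarray(counts, dtype=float) + alpha
    A = a.sum()
    return digamma(A + 1) - np.sum((a / A) * digamma(a + 1))

# Example: a heavily under-sampled alphabet of size 100 with only 30 observations.
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(0, 100, size=30), minlength=100)
print(dirichlet_mean_entropy(counts, alpha=0.5))
```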

Estimates based on expected entropy

A new approach to the problem of entropy evaluation is to compare the expected entropy of a sample of a random sequence with the calculated entropy of the sample. The method gives very accurate results, but it is limited to calculations of random sequences modeled as Markov chains of the first order with small values of bias and correlations. This is the first known method that takes into account the size of the sample sequence and its impact on the accuracy of the calculation of entropy. [14] [15]

Deep neural network estimator

A deep neural network (DNN) can be used to estimate the joint entropy; such an estimator has been called the Neural Joint Entropy Estimator (NJEE). [16] In practice, the DNN is trained as a classifier that maps an input vector or matrix X to an output probability distribution over the possible classes of a random variable Y, given input X. For example, in an image classification task, the NJEE maps a vector of pixel values to probabilities over possible image classes. The probability distribution of Y is obtained by a softmax layer whose number of nodes equals the alphabet size of Y. NJEE uses continuously differentiable activation functions, so that the conditions for the universal approximation theorem hold. This method has been shown to provide a strongly consistent estimator and to outperform other methods in the case of large alphabet sizes. [16]
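
As a hedged illustration of the underlying idea, rather than the authors' published implementation, the sketch below (PyTorch, with a toy architecture and synthetic data chosen for the example) trains a small classifier and uses its average cross-entropy loss on held-out data as an estimate of the conditional entropy H(Y|X) in nats; for a well-calibrated model this loss approaches H(Y|X) from above.

```python
import torch
import torch.nn as nn

# Toy data: X is a noisy one-hot encoding of the class label Y.
torch.manual_seed(0)
n, alphabet = 20_000, 10
y = torch.randint(0, alphabet, (n,))
x = torch.nn.functional.one_hot(y, alphabet).float() + 0.5 * torch.randn(n, alphabet)

model = nn.Sequential(nn.Linear(alphabet, 64), nn.ReLU(),
                      nn.Linear(64, alphabet))   # outputs logits; softmax is applied in the loss
loss_fn = nn.CrossEntropyLoss()                  # -log softmax(logits)[y], in nats
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

train_x, train_y, test_x, test_y = x[:15_000], y[:15_000], x[15_000:], y[15_000:]
for _ in range(200):                             # a few full-batch training steps
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()

with torch.no_grad():
    # Average cross-entropy on held-out data: an estimate (upper bound) of H(Y|X).
    print("estimated H(Y|X) in nats:", loss_fn(model(test_x), test_y).item())
    # For comparison, H(Y) for a uniform Y over 10 classes is log(10) ~ 2.303 nats.
```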


References

  1. Dinh-Tuan Pham (2004) Fast algorithms for mutual information based independent component analysis. In Signal Processing, Volume 52, Issue 10, 2690–2700, doi:10.1109/TSP.2004.834398
  2. Chang, C.-I.; Du, Y.; Wang, J.; Guo, S.-M.; Thouin, P.D. (2006) Survey and comparative analysis of entropy and relative entropy thresholding techniques. In Vision, Image and Signal Processing, Volume 153, Issue 6, 837–850, doi:10.1049/ip-vis:20050032
  3. Martins, D. C. et al. (2008) Intrinsically Multivariate Predictive Genes. In Selected Topics in Signal Processing, Volume 2, Issue 3, 424–439, doi:10.1109/JSTSP.2008.923841
  4. Gue Jun Jung; Yung-Hwan Oh (2008) Information Distance-Based Subvector Clustering for ASR Parameter Quantization. In Signal Processing Letters, Volume 15, 209–212, doi:10.1109/LSP.2007.913132
  5. Costa, J.A.; Hero, A.O. (2004) Geodesic entropic graphs for dimension and entropy estimation in manifold learning. In Signal Processing, Volume 52, Issue 8, 2210–2221, doi:10.1109/TSP.2004.831130
  6. Benesty, J.; Yiteng Huang; Jingdong Chen (2007) Time Delay Estimation via Minimum Entropy. In Signal Processing Letters, Volume 14, Issue 3, March 2007, 157–160, doi:10.1109/LSP.2006.884038
  7. J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen (1997) Nonparametric entropy estimation: An overview. In International Journal of Mathematical and Statistical Sciences, Volume 6, pp. 17–39.
  8. T. Schürmann (2004) Bias analysis in entropy estimation. In J. Phys. A: Math. Gen., 37, pp. L295–L301, doi:10.1088/0305-4470/37/27/L02
  9. G. Miller (1955) Note on the bias of information estimates. In Information Theory in Psychology: Problems and Methods, pp. 95–100.
  10. E. G. Learned-Miller (2003) A new class of entropy estimators for multi-dimensional densities. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), vol. 3, April 2003, pp. 297–300.
  11. I. Lee (2010) Sample-spacings based density and entropy estimators for spherically invariant multidimensional data. In Neural Computation, vol. 22, issue 8, April 2010, pp. 2208–2227.
  12. Ilya Nemenman, Fariel Shafee, William Bialek (2003) Entropy and Inference, Revisited. Advances in Neural Information Processing Systems.
  13. Ilya Nemenman, William Bialek, de Ruyter (2004) Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E.
  14. Marek Lesniewicz (2014) Expected Entropy as a Measure and Criterion of Randomness of Binary Sequences. In Przeglad Elektrotechniczny, Volume 90, 1/2014, pp. 42–46.
  15. Marek Lesniewicz (2016) Analyses and Measurements of Hardware Generated Random Binary Sequences Modeled as Markov Chains. In Przeglad Elektrotechniczny, Volume 92, 11/2016, pp. 268–274.
  16. Shalev, Y., Painsky, A., and Ben-Gal, I. (2022). "Neural Joint Entropy Estimation" (PDF). IEEE Transactions on Neural Networks and Learning Systems (TNNLS).