Density estimation

Demonstration of density estimation using kernel density estimation: The true density is a mixture of two Gaussians centered around 0 and 3, shown as a solid blue curve. In each frame, 100 samples are generated from the distribution, shown in red. Centered on each sample, a Gaussian kernel is drawn in gray. Averaging the Gaussians yields the density estimate shown as the dashed black curve.

In statistics, probability density estimation or simply density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population. [1]

A variety of approaches to density estimation are used, including Parzen windows and a range of data clustering techniques, including vector quantization. The most basic form of density estimation is a rescaled histogram.
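
The rescaled histogram can be sketched in a few lines. The sketch below divides each bin count by the number of observations times the bin width, so that the bars integrate to one. The function name `histogram_density`, the sample values, and the bin settings are illustrative assumptions and do not come from the source.

```python
def histogram_density(data, lo, hi, n_bins):
    """Return (edges, heights) of a histogram rescaled to integrate to 1.

    Assumes every value lies in [lo, hi]; values equal to hi go in the last bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(data)
    # Dividing by n * width turns raw counts into density heights.
    heights = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, heights


# Hypothetical sample, chosen only for illustration.
sample = [0.2, 0.4, 0.5, 1.1, 1.3, 1.9, 2.5, 2.6, 3.4, 3.9]
edges, heights = histogram_density(sample, lo=0.0, hi=4.0, n_bins=4)

# The total area under the bars should be (numerically) one.
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights)
print(area)
```

The estimate depends on the choice of bin edges and bin count, which here are fixed by hand.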

Example

Estimated density of p(glu | diabetes=1) (red), p(glu | diabetes=0) (blue), and p(glu) (black)
Estimated probability p(diabetes=1 | glu)

We will consider records of the incidence of diabetes. The following is quoted verbatim from the data set description:

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes mellitus according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records. [2] [3]

In this example, we construct three density estimates for "glu" (plasma glucose concentration), one conditional on the presence of diabetes, the second conditional on the absence of diabetes, and the third not conditional on diabetes. The conditional density estimates are then used to construct the probability of diabetes conditional on "glu".

The "glu" data were obtained from the MASS package [4] of the R programming language. Within R, ?Pima.tr and ?Pima.te give a fuller account of the data.

The mean of "glu" in the diabetes cases is 143.1 and the standard deviation is 31.26. The mean of "glu" in the non-diabetes cases is 110.0 and the standard deviation is 24.29. From this we see that, in this data set, diabetes cases are associated with greater levels of "glu". This will be made clearer by plots of the estimated density functions.

The first figure shows density estimates of p(glu | diabetes=1), p(glu | diabetes=0), and p(glu). The density estimates are kernel density estimates using a Gaussian kernel. That is, a Gaussian density function is placed at each data point, and the sum of the density functions, divided by the number of data points so that the estimate integrates to one, is computed over the range of the data.
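
A minimal sketch of this construction, using only the Python standard library, is given here. The data values and the bandwidth of 10 are hypothetical and are not the Pima data. The helper name `gaussian_kde` is likewise an assumption made for illustration. The kernels are averaged rather than merely summed, so that the result integrates to one.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function x -> density estimate built from Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump per data point, averaged so the result integrates to 1.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical glucose-like values; NOT the Pima data.
values = [95.0, 102.0, 110.0, 118.0, 125.0, 140.0, 151.0, 168.0]
kde = gaussian_kde(values, bandwidth=10.0)  # bandwidth chosen by hand

# Evaluate on a grid that extends well beyond the data and check the area.
step = 0.5
grid = [40.0 + step * i for i in range(int((230.0 - 40.0) / step) + 1)]
area = sum(kde(x) for x in grid) * step  # simple Riemann sum
print(round(area, 3))
```

In practice the bandwidth would be chosen by a data-driven rule rather than fixed by hand as it is here.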

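From the density of "glu" conditional on diabetes, we can obtain the probability of diabetes conditional on "glu" via Bayes' rule. For brevity, "diabetes" is abbreviated "db." in this formula.

$$p(\text{db.}=1 \mid \text{glu}) = \frac{p(\text{glu} \mid \text{db.}=1)\, p(\text{db.}=1)}{p(\text{glu} \mid \text{db.}=1)\, p(\text{db.}=1) + p(\text{glu} \mid \text{db.}=0)\, p(\text{db.}=0)}$$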

The second figure shows the estimated posterior probability p(diabetes=1 | glu). From these data, it appears that an increased level of "glu" is associated with diabetes.
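
The posterior computation can be sketched as follows, assuming Gaussian kernel density estimates for the two class-conditional densities and class frequencies as the prior. The two small samples, the bandwidth of 12, and the names `gaussian_kde` and `posterior_db1` are hypothetical. They are not the Pima data or the code used to produce the figures.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function x -> Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical "glu" values for the two groups; NOT the Pima data.
glu_diabetic = [118.0, 129.0, 137.0, 145.0, 152.0, 166.0, 181.0]
glu_healthy = [82.0, 90.0, 97.0, 104.0, 108.0, 113.0, 121.0, 128.0, 136.0, 149.0]

p_glu_given_db1 = gaussian_kde(glu_diabetic, bandwidth=12.0)
p_glu_given_db0 = gaussian_kde(glu_healthy, bandwidth=12.0)

# Prior estimated from class frequencies in the (hypothetical) sample.
n1, n0 = len(glu_diabetic), len(glu_healthy)
prior_db1 = n1 / (n1 + n0)
prior_db0 = 1.0 - prior_db1


def posterior_db1(glu):
    """p(diabetes=1 | glu) via Bayes' rule with KDE class-conditional densities."""
    num = p_glu_given_db1(glu) * prior_db1
    den = num + p_glu_given_db0(glu) * prior_db0
    return num / den


for glu in (90.0, 120.0, 150.0, 180.0):
    print(glu, round(posterior_db1(glu), 3))
```

Far outside the range of both samples the denominator can underflow to zero, so a real implementation would need to guard that case.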

Application and purpose

A very natural use of density estimates is in the informal investigation of the properties of a given set of data. Density estimates can give a valuable indication of such features as skewness and multimodality in the data. In some cases they will yield conclusions that may then be regarded as self-evidently true, while in others all they will do is to point the way to further analysis and/or data collection. [5]

Histogram and density function for a Gumbel distribution

An important aspect of statistics is often the presentation of data back to the client, in order to explain and illustrate conclusions that may have been obtained by other means. Density estimates are ideal for this purpose, for the simple reason that they are fairly easy for non-mathematicians to understand.

More examples illustrating the use of density estimates for exploratory and presentational purposes, including the important case of bivariate data, are given in the literature. [7]

Density estimation is also frequently used in anomaly detection or novelty detection: [8] if an observation lies in a very low-density region, it is likely to be an anomaly or a novelty.
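
As a rough sketch of this idea, the code below fits a Gaussian kernel density estimate to observations assumed to be normal and flags new points whose estimated density falls below a threshold. The data, the bandwidth, and the thresholding rule (the lowest density seen among the training points) are all assumptions made for illustration. The cited review covers many alternatives.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function x -> Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical training observations assumed to represent normal behaviour.
train = [9.8, 10.1, 10.4, 9.6, 10.0, 10.3, 9.9, 10.2, 9.7, 10.5]
density = gaussian_kde(train, bandwidth=0.3)

# Assumed rule: flag anything less dense than the least dense training point.
threshold = min(density(x) for x in train)


def is_anomaly(x):
    """True if x lies in a region of lower density than any training point."""
    return density(x) < threshold


print(is_anomaly(10.05))  # a point in the middle of the training data
print(is_anomaly(13.0))   # a point far from all training data
```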

Kernel density estimation

Kernel density estimation of 100 normally distributed random numbers using different smoothing bandwidths.

In statistics, kernel density estimation (KDE) is the application of kernel smoothing to probability density estimation, i.e., a non-parametric method for estimating the probability density function of a random variable, using kernels as weights. KDE addresses a fundamental data smoothing problem in which inferences about the population are made from a finite data sample. In some fields, such as signal processing and econometrics, it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. [10] [11] One well-known application of kernel density estimation is estimating the class-conditional marginal densities of data when using a naive Bayes classifier, [12] [13] which can improve its prediction accuracy. [12]
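
The effect of the smoothing bandwidth mentioned in the figure caption can be illustrated with a small sketch. It builds Gaussian kernel density estimates of one sample at several bandwidths and counts the local maxima of each estimate on a grid. The sample, the bandwidth values, the grid, and the helper names `gaussian_kde` and `count_modes` are assumptions chosen for illustration.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function x -> Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def count_modes(density, lo, hi, step):
    """Count local maxima of `density` on an evenly spaced grid."""
    n_points = int(round((hi - lo) / step)) + 1
    ys = [density(lo + i * step) for i in range(n_points)]
    return sum(
        1 for i in range(1, n_points - 1) if ys[i - 1] < ys[i] >= ys[i + 1]
    )


# Small illustrative sample (an assumption, not data from the source).
sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

modes_by_bandwidth = {}
for h in (0.1, 1.0, 5.0):
    modes_by_bandwidth[h] = count_modes(
        gaussian_kde(sample, h), lo=-4.0, hi=8.0, step=0.01
    )
    print(h, modes_by_bandwidth[h])
```

A very small bandwidth leaves one bump per observation, while a very large one merges them into a single broad peak. Useful bandwidths lie between these extremes.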

References

  1. Bernacchia, Alberto; Pigolotti, Simone (June 2011). "Self-Consistent Method for Density Estimation". Journal of the Royal Statistical Society Series B: Statistical Methodology. 73 (3): 407–422. doi:10.1111/j.1467-9868.2011.00772.x.
  2. "Diabetes in Pima Indian Women - R documentation".
  3. Smith, J. W.; Everhart, J. E.; Dickson, W. C.; Knowler, W. C.; Johannes, R. S. (1988). R. A. Greenes (ed.). "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus". Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988). Los Alamitos, CA: 261–265. PMC 2245318.
  4. "Support Functions and Datasets for Venables and Ripley's MASS".
  5. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall. ISBN 978-0412246203.
  6. A calculator for probability distributions and density functions
  7. Givens, Geof H. (2013). Computational Statistics. Wiley. p. 330. ISBN 978-0-470-53331-4.
  8. Pimentel, Marco A.F.; Clifton, David A.; Clifton, Lei; Tarassenko, Lionel (2 January 2014). "A review of novelty detection". Signal Processing. 99 (June 2014): 215–249. doi:10.1016/j.sigpro.2013.12.026.
  9. An illustration of histograms and probability density functions
  10. Rosenblatt, M. (1956). "Remarks on Some Nonparametric Estimates of a Density Function". The Annals of Mathematical Statistics. 27 (3): 832–837. doi:10.1214/aoms/1177728190.
  11. Parzen, E. (1962). "On Estimation of a Probability Density Function and Mode". The Annals of Mathematical Statistics. 33 (3): 1065–1076. doi:10.1214/aoms/1177704472. JSTOR 2237880.
  12. Piryonesi, S. Madeh; El-Diraby, Tamer E. (2020-06-01). "Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems". Journal of Transportation Engineering, Part B: Pavements. 146 (2): 04020022. doi:10.1061/JPEODX.0000175. S2CID 216485629.
  13. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction: with 200 full-color illustrations. New York: Springer. ISBN 0-387-95284-5. OCLC 46809224.
