# Density estimation

Last updated Demonstration of density estimation using Kernel density estimation: The true density is mixture of two Gaussians centered around 0 and 3, shown with solid blue curve. In each frame, 100 samples are generated from the distribution, shown in red. Centered on each sample, a Gaussian kernel is drawn in gray. Averaging the Gaussians yields the density estimate shown in the dashed black curve.

In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population.

## Contents

A variety of approaches to density estimation are used, including Parzen windows and a range of data clustering techniques, including vector quantization. The most basic form of density estimation is a rescaled histogram.

## Example of density estimation

We will consider records of the incidence of diabetes. The following is quoted verbatim from the data set description:

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes mellitus according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records.  

In this example, we construct three density estimates for "glu" (plasma glucose concentration), one conditional on the presence of diabetes, the second conditional on the absence of diabetes, and the third not conditional on diabetes. The conditional density estimates are then used to construct the probability of diabetes conditional on "glu".

The "glu" data were obtained from the MASS package  of the R programming language. Within R, ?Pima.tr and ?Pima.te give a fuller account of the data.

The mean of "glu" in the diabetes cases is 143.1 and the standard deviation is 31.26. The mean of "glu" in the non-diabetes cases is 110.0 and the standard deviation is 24.29. From this we see that, in this data set, diabetes cases are associated with greater levels of "glu". This will be made clearer by plots of the estimated density functions.

The first figure shows density estimates of p(glu | diabetes=1), p(glu | diabetes=0), and p(glu). The density estimates are kernel density estimates using a Gaussian kernel. That is, a Gaussian density function is placed at each data point, and the sum of the density functions is computed over the range of the data.

From the density of "glu" conditional on diabetes, we can obtain the probability of diabetes conditional on "glu" via Bayes' rule. For brevity, "diabetes" is abbreviated "db." in this formula.

$p({\mbox{diabetes}}=1|{\mbox{glu}})={\frac {p({\mbox{glu}}|{\mbox{db.}}=1)\,p({\mbox{db.}}=1)}{p({\mbox{glu}}|{\mbox{db.}}=1)\,p({\mbox{db.}}=1)+p({\mbox{glu}}|{\mbox{db.}}=0)\,p({\mbox{db.}}=0)}}$ The second figure shows the estimated posterior probability p(diabetes=1 | glu). From these data, it appears that an increased level of "glu" is associated with diabetes.

### Script for example

The following R commands will create the figures shown above. These commands can be entered at the command prompt by using cut and paste.

library(MASS)data(Pima.tr)data(Pima.te)Pima<-rbind (Pima.tr,Pima.te)glu<-Pima[,'glu']d0<-Pima[,'type']=='No'd1<-Pima[,'type']=='Yes'base.rate.d1<-sum(d1)/(sum(d1)+sum(d0))glu.density<-density (glu)glu.d0.density<-density (glu[d0])glu.d1.density<-density (glu[d1])glu.d0.f<-approxfun(glu.d0.density$x,glu.d0.density$y)glu.d1.f<-approxfun(glu.d1.density$x,glu.d1.density$y)p.d.given.glu<-function(glu,base.rate.d1){p1<-glu.d1.f(glu)*base.rate.d1p0<-glu.d0.f(glu)*(1-base.rate.d1)p1/(p0+p1)}x<-1:250y<-p.d.given.glu (x,base.rate.d1)plot(x,y,type='l',col='red',xlab='glu',ylab='estimated p(diabetes|glu)')plot(density(glu[d0]),col='blue',xlab='glu',ylab='estimate p(glu),      p(glu|diabetes), p(glu|not diabetes)',main=NA)lines(density(glu[d1]),col='red')

Note that the above conditional density estimator uses bandwidths that are optimal for unconditional densities. Alternatively, one could use the method of Hall, Racine and Li (2004)  and the R np package  for automatic (data-driven) bandwidth selection that is optimal for conditional density estimates; see the np vignette  for an introduction to the np package. The following R commands use the npcdens() function to deliver optimal smoothing. Note that the response "Yes"/"No" is a factor.

library(np)fy.x<-npcdens(type~glu,nmulti=1,data=Pima)Pima.eval<-data.frame(type=factor("Yes"),glu=seq(min(Pima$glu),max(Pima$glu),length=250))plot(x,y,type='l',lty=2,col='red',xlab='glu',ylab='estimated p(diabetes|glu)')lines(Pima.eval\$glu,predict(fy.x,newdata=Pima.eval),col="blue")legend(0,1,c("Unconditional bandwidth","Conditional bandwidth"),col=c("red","blue"),lty=c(2,1))

The third figure uses optimal smoothing via the method of Hall, Racine, and Li  indicating that the unconditional density bandwidth used in the second figure above yields a conditional density estimate that may be somewhat undersmoothed.

## Application and Purpose

A very natural use of density estimates is in the informal investigation of the properties of a given set of data. Density estimates can give valuable indication of such features as skewness and multimodality in the data. In some cases they will yield conclusions that may then be regarded as self-evidently true, while in others all they will do is to point the way to further analysis and/or data collection. 

An important aspect of statistics is often the presentation of data back to the client in order to provide explanation and illustration of conclusions that may possibly have been obtained by other means. Density estimates are ideal for this purpose, for the simple reason that they are fairly easily comprehensible to non-mathematicians.

More examples illustrating the use of density estimates for exploratory and presentational purposes, including the important case of bivariate data. 

Density estimation is also frequently used in anomaly detection or novelty detection:  if an observation lies in a very low-density region, it is likely to be an anomaly or a novelty.

• In hydrology the histogram and estimated density function of rainfall and river discharge data, analysed with a probability distribution, are used to gain insight in their behaviour and frequency of occurrence.  An example is shown in the blue figure.

## Related Research Articles A histogram is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often of equal size.

Statistics is a field of inquiry that studies the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used and misused for making informed decisions in all areas of business and government.

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions. Nonparametric statistics is based on either being distribution-free or having a specified distribution but with the distribution's parameters unspecified. Nonparametric statistics includes both descriptive statistics and statistical inference. Nonparametric tests are often used when the assumptions of parametric tests are violated. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.

The following is a glossary of terms used in the mathematical sciences statistics and probability.

In statistics, semiparametric regression includes regression models that combine parametric and nonparametric models. They are often used in situations where the fully nonparametric model may not perform well or when the researcher wants to use a parametric model but the functional form with respect to a subset of the regressors or the density of the errors is not known. Semiparametric regression models are a particular type of semiparametric modelling and, since semiparametric models contain a parametric component, they rely on parametric assumptions and may be misspecified and inconsistent, just like a fully parametric model.

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function. Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. In many applications it is the right tail of the distribution that is of interest, but a distribution may have a heavy left tail, or both tails may be heavy.

A dot chart or dot plot is a statistical chart consisting of data points plotted on a fairly simple scale, typically using filled in circles. There are two common, yet very different, versions of the dot chart. The first has been used in hand-drawn graphs to depict distributions going back to 1884. The other version is described by William S. Cleveland as an alternative to the bar chart, in which dots are used to depict the quantitative values associated with categorical variables.

The term kernel is used in statistical analysis to refer to a window function. The term "kernel" has several distinct meanings in different branches of statistics.

In statistics, Kernel regression is a non-parametric technique to estimate the conditional expectation of a random variable. The objective is to find a non-linear relation between a pair of random variables X and Y.

A utilization distribution is a probability distribution giving the probability density that an animal is found at a given point in space. It is estimated from data sampling the location of an individual or individuals in space over a period of time using, for example, telemetry or GPS based methods. Mean shift is a non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing.

In various science/engineering applications, such as independent component analysis, image analysis, genetic analysis, speech recognition, manifold learning, and time delay estimation it is useful to estimate the differential entropy of a system or process, given some observations.

Maximum likelihood sequence estimation (MLSE) is a mathematical algorithm to extract useful data out of a noisy data stream. A plot is a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables. The plot can be drawn by hand or by a computer. In the past, sometimes mechanical or electronic plotters were used. Graphs are a visual representation of the relationship between variables, which are very useful for humans who can then quickly derive an understanding which may not have come from lists of values. Given a scale or ruler, graphs can also be used to read off the value of an unknown variable plotted as a function of a known one, but this can also be done with data presented in tabular form. Graphs of functions are used in mathematics, sciences, engineering, technology, finance, and other areas.

In statistics, adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate are varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional.

Kernel density estimation is a nonparametric technique for density estimation i.e., estimation of probability density functions, which is one of the fundamental questions in statistics. It can be viewed as a generalisation of histogram density estimation with improved statistical properties. Apart from histograms, other types of density estimators include parametric, spline, wavelet and Fourier series. Kernel density estimators were first introduced in the scientific literature for univariate data in the 1950s and 1960s and subsequently have been widely adopted. It was soon recognised that analogous estimators for multivariate data would be an important addition to multivariate statistics. Based on research carried out in the 1990s and 2000s, multivariate kernel density estimation has reached a level of maturity comparable to its univariate counterparts.

1. "Diabetes in Pima Indian Women - R documentation".
2. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988). R. A. Greenes (ed.). "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus". Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988). Los Alamitos, CA: 261–265. PMC  .CS1 maint: multiple names: authors list (link)
3. Peter Hall; Jeffrey S. Racine; Qi Li (2004). "Cross-Validation and the Estimation of Conditional Probability Densities". Journal of the American Statistical Association. 99 (468): 1015–1026. CiteSeerX  . doi:10.1198/016214504000000548.
4. Tristen Hayfield; Jeffrey S. Racine. "The np Package" (PDF).
5. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall. ISBN   978-0412246203.
6. Geof H., Givens (2013). Computational Statistics. Wiley. p. 330. ISBN   978-0-470-53331-4.
7. Pimentel, Marco A.F.; Clifton, David A.; Clifton, Lei; Tarassenko, Lionel (2 January 2014). "A review of novelty detection". Signal Processing. 99 (June 2014): 215–249. doi:10.1016/j.sigpro.2013.12.026.

Sources