
In statistical analysis, the term **kernel** refers to a window function; the term also carries several other distinct meanings in different branches of statistics.

In statistics, especially in Bayesian statistics, the kernel of a probability density function (pdf) or probability mass function (pmf) is the form of the pdf or pmf in which any factors that are not functions of any of the variables in the domain are omitted. Note that such factors may well be functions of the parameters of the pdf or pmf. These factors form part of the normalization factor of the probability distribution, and are unnecessary in many situations. For example, in pseudo-random number sampling, most sampling algorithms ignore the normalization factor. In addition, in Bayesian analysis of conjugate prior distributions, the normalization factors are generally ignored during the calculations, and only the kernel is considered. At the end, the form of the kernel is examined; if it matches that of a known distribution, the normalization factor can be reinstated. Otherwise, it may be unnecessary (for example, if the distribution only needs to be sampled from).
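As an illustrative sketch of this workflow (the Beta–Binomial setup, parameter values, and helper names here are example choices, not part of the text above): multiplying a Beta prior kernel by a Binomial likelihood kernel gives a product that matches a known Beta kernel, after which the normalization factor can be reinstated.

```python
import math

# Hypothetical example: Beta(a, b) prior with a Binomial(n, k) likelihood.
# Working only with kernels (all constant factors dropped), the posterior
# kernel in p is p^(a-1+k) * (1-p)^(b-1+n-k), which matches the kernel of
# Beta(a+k, b+n-k); the factor 1/B(a+k, b+n-k) is reinstated at the end.

def beta_kernel(p, a, b):
    """Beta density without its 1/B(a, b) normalization factor."""
    return p ** (a - 1) * (1 - p) ** (b - 1)

def binomial_kernel(k, n, p):
    """Binomial pmf without its C(n, k) factor, viewed as a function of p."""
    return p ** k * (1 - p) ** (n - k)

a, b, n, k = 2.0, 3.0, 10, 7
p = 0.4

# The product of the two kernels equals the kernel of Beta(a+k, b+n-k).
posterior_kernel = beta_kernel(p, a, b) * binomial_kernel(k, n, p)
assert math.isclose(posterior_kernel, beta_kernel(p, a + k, b + n - k))

# Reinstating the normalization factor 1/B(a+k, b+n-k) gives the full pdf.
beta_fn = math.gamma(a + k) * math.gamma(b + n - k) / math.gamma(a + b + n)
posterior_pdf = posterior_kernel / beta_fn
```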

For many distributions, the kernel can be written in closed form, but not the normalization constant.

An example is the normal distribution. Its probability density function is

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

and the associated kernel is

$$p(x \mid \mu, \sigma^2) \propto e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Note that the factor in front of the exponential has been omitted, even though it contains the parameter $\sigma^2$, because it is not a function of the domain variable $x$.
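A small numerical sketch of this point (the particular parameter values are arbitrary): integrating the kernel recovers exactly the constant that was omitted.

```python
import math

# The kernel of a normal density omits the 1/(sqrt(2*pi)*sigma) factor
# in front of the exponential; only the exponential term remains.
mu, sigma = 1.5, 2.0

def normal_kernel(x):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Integrating the kernel numerically (trapezoidal rule on a wide grid)
# recovers the omitted normalization constant sqrt(2*pi)*sigma.
lo, hi, n = mu - 10 * sigma, mu + 10 * sigma, 200_000
h = (hi - lo) / n
integral = h * (sum(normal_kernel(lo + i * h) for i in range(1, n))
                + 0.5 * (normal_kernel(lo) + normal_kernel(hi)))
assert math.isclose(integral, math.sqrt(2 * math.pi) * sigma, rel_tol=1e-6)
```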

The kernel of a reproducing kernel Hilbert space is used in the suite of techniques known as kernel methods to perform tasks such as statistical classification, regression analysis, and cluster analysis on data in an implicit space. This usage is particularly common in machine learning.

In nonparametric statistics, a kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables' density functions, or in kernel regression to estimate the conditional expectation of a random variable. Kernels are also used in time-series, in the use of the periodogram to estimate the spectral density where they are known as window functions. An additional use is in the estimation of a time-varying intensity for a point process where window functions (kernels) are convolved with time-series data.

Commonly, kernel widths must also be specified when running a non-parametric estimation.

A kernel is a non-negative real-valued integrable function *K*. For most applications, it is desirable to define the function to satisfy two additional requirements:

- Normalization: $\int_{-\infty}^{+\infty} K(u)\,du = 1;$
- Symmetry: $K(-u) = K(u)$ for all values of $u$.

The first requirement ensures that the method of kernel density estimation results in a probability density function. The second requirement ensures that the average of the corresponding distribution is equal to that of the sample used.

If *K* is a kernel, then so is the function *K** defined by *K**(*u*) = λ*K*(λ*u*), where λ > 0. This can be used to select a scale that is appropriate for the data.
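This scale property can be checked numerically; in the sketch below (kernel choice and λ are arbitrary), substituting $v = \lambda u$ shows why $\int \lambda K(\lambda u)\,du = \int K(v)\,dv = 1$.

```python
# Rescaling a kernel K to K*(u) = lam * K(lam * u) preserves the
# unit-integral property for any lam > 0.

def epanechnikov(u):
    # Epanechnikov kernel, zero outside its support |u| <= 1.
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def scaled(K, lam):
    return lambda u: lam * K(lam * u)

def integrate(f, lo, hi, n=100_000):
    """Composite trapezoidal rule."""
    h = (hi - lo) / n
    return h * (sum(f(lo + i * h) for i in range(1, n)) + 0.5 * (f(lo) + f(hi)))

K2 = scaled(epanechnikov, 2.0)   # support shrinks to |u| <= 1/2
assert abs(integrate(epanechnikov, -1, 1) - 1.0) < 1e-6
assert abs(integrate(K2, -1, 1) - 1.0) < 1e-6
```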

Several types of kernel functions are commonly used: uniform, triangular, Epanechnikov,^{[1]} quartic (biweight), tricube,^{[2]} triweight, Gaussian, quadratic^{[3]} and cosine.

In the table below, if $K$ is given with a bounded support, then $K(u) = 0$ for values of *u* lying outside the support.

| Kernel | Function $K(u)$ | Support | Efficiency^{[4]} relative to the Epanechnikov kernel |
|---|---|---|---|
| Uniform ("rectangular window") | $\tfrac{1}{2}$ | $\lvert u\rvert \le 1$ | 92.9% |
| Triangular | $1 - \lvert u\rvert$ | $\lvert u\rvert \le 1$ | 98.6% |
| Epanechnikov (parabolic) | $\tfrac{3}{4}(1 - u^2)$ | $\lvert u\rvert \le 1$ | 100% |
| Quartic (biweight) | $\tfrac{15}{16}(1 - u^2)^2$ | $\lvert u\rvert \le 1$ | 99.4% |
| Triweight | $\tfrac{35}{32}(1 - u^2)^3$ | $\lvert u\rvert \le 1$ | 98.7% |
| Tricube | $\tfrac{70}{81}(1 - \lvert u\rvert^3)^3$ | $\lvert u\rvert \le 1$ | 99.8% |
| Gaussian | $\tfrac{1}{\sqrt{2\pi}} e^{-\tfrac{1}{2}u^2}$ | $\mathbb{R}$ | 95.1% |
| Cosine | $\tfrac{\pi}{4} \cos\left(\tfrac{\pi}{2} u\right)$ | $\lvert u\rvert \le 1$ | 99.9% |
| Logistic | $\dfrac{1}{e^u + 2 + e^{-u}}$ | $\mathbb{R}$ | 88.7% |
| Sigmoid function | $\dfrac{2}{\pi}\,\dfrac{1}{e^u + e^{-u}}$ | $\mathbb{R}$ | 84.3% |
| Silverman kernel^{[5]} | $\tfrac{1}{2} e^{-\tfrac{\lvert u\rvert}{\sqrt{2}}} \sin\left(\tfrac{\lvert u\rvert}{\sqrt{2}} + \tfrac{\pi}{4}\right)$ | $\mathbb{R}$ | not applicable |
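Two of the tabulated efficiency figures can be reproduced numerically; the sketch below (kernel choices and quadrature settings are illustrative) uses the definition from footnote [4], eff(K) = √(∫u²K(u) du) · ∫K(u)² du, reported relative to the Epanechnikov kernel.

```python
import math

def integrate(f, lo, hi, n=200_000):
    """Composite trapezoidal rule."""
    h = (hi - lo) / n
    return h * (sum(f(lo + i * h) for i in range(1, n)) + 0.5 * (f(lo) + f(hi)))

def eff(K, lo, hi):
    # Efficiency measure: sqrt(second moment) times the integral of K^2.
    return math.sqrt(integrate(lambda u: u * u * K(u), lo, hi)) * \
        integrate(lambda u: K(u) ** 2, lo, hi)

epanechnikov = lambda u: 0.75 * (1 - u * u)
uniform = lambda u: 0.5
gaussian = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

# Smaller eff is better; the Epanechnikov kernel minimizes it, so the
# relative efficiencies below come out at or below 100%.
base = eff(epanechnikov, -1, 1)
assert math.isclose(base / eff(uniform, -1, 1), 0.930, abs_tol=1e-3)     # ~92.9%
assert math.isclose(base / eff(gaussian, -12, 12), 0.951, abs_tol=1e-3)  # ~95.1%
```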

- Kernel density estimation
- Kernel smoother
- Stochastic kernel
- Positive-definite kernel
- Density estimation
- Multivariate kernel density estimation


In probability theory, a **normal distribution** is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}.$$

In probability theory, the **central limit theorem** (**CLT**) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions. This theorem has seen many changes during the formal development of probability theory. Previous versions of the theorem date back to 1811, but in its modern general form, this fundamental result in probability theory was precisely stated as late as 1920, thereby serving as a bridge between classical and modern probability theory.

In probability theory and statistics, the **multivariate normal distribution**, **multivariate Gaussian distribution**, or **joint normal distribution** is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be *k*-variate normally distributed if every linear combination of its *k* components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

In probability and statistics, **Student's t-distribution** is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population's standard deviation is unknown. It was developed by English statistician William Sealy Gosset under the pseudonym "Student".

In statistics, an **effect size** is a number measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of a parameter for a hypothetical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event happening. Effect sizes complement statistical hypothesis testing, and play an important role in power analyses, sample size planning, and in meta-analyses. The cluster of data-analysis methods concerning effect sizes is referred to as estimation statistics.

In statistical inference, specifically predictive inference, a **prediction interval** is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.

**Linear discriminant analysis** (**LDA**), **normal discriminant analysis** (**NDA**), or **discriminant function analysis** is a generalization of **Fisher's linear discriminant**, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

In statistics, **kernel density estimation** (**KDE**) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the **Parzen–Rosenblatt window** method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.
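A minimal KDE sketch, assuming a Gaussian kernel and a hand-picked bandwidth `h` (both are illustrative choices, not prescriptions): the estimate is the average of scaled kernels centred on the data points.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Density estimate at x: average of scaled kernels centred on the data."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

data = [-1.2, -0.5, 0.1, 0.4, 1.3]   # toy sample
h = 0.6                              # bandwidth chosen by eye for this sketch

# Because the kernel integrates to 1, so does the estimate
# (checked here by midpoint quadrature on a wide grid).
lo, hi, n = -8.0, 8.0, 40_000
step = (hi - lo) / n
total = step * sum(kde(lo + (i + 0.5) * step, data, h) for i in range(n))
assert abs(total - 1.0) < 1e-5
```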

The following is a glossary of terms used in the mathematical sciences of statistics and probability.

In statistics, a **parametric model** or **parametric family** or **finite-dimensional model** is a particular class of statistical models. Specifically, a parametric model is a family of probability distributions that has a finite number of parameters.

In probability theory, the **inverse Gaussian distribution** is a two-parameter family of continuous probability distributions with support on (0,∞).

**Bootstrapping** is any test or metric that uses random sampling with replacement, and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

In probability theory, **Dirichlet processes** are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.

In probability and statistics, the **truncated normal distribution** is the probability distribution derived from that of a normally distributed random variable by bounding the random variable from either below or above. The truncated normal distribution has wide applications in statistics and econometrics. For example, it is used to model the probabilities of the binary outcomes in the probit model and to model censored data in the tobit model.

**Exact statistics**, such as that described in exact test, is a branch of statistics that was developed to provide more accurate results pertaining to statistical testing and interval estimation by eliminating procedures based on asymptotic and approximate statistical methods. The main characteristic of exact methods is that statistical tests and confidence intervals are based on exact probability statements that are valid for any sample size.

In statistics, **identifiability** is a property which a model must satisfy for precise inference to be possible. A model is **identifiable** if it is theoretically possible to learn the true values of this model's underlying parameters after obtaining an infinite number of observations from it. Mathematically, this is equivalent to saying that different values of the parameters must generate different probability distributions of the observable variables. Usually the model is identifiable only under certain technical restrictions, in which case the set of these requirements is called the **identification conditions**.

In probability theory, the **Mills ratio** of a continuous random variable $X$ is the function

$$m(x) = \frac{\bar{F}(x)}{f(x)},$$

where $f(x)$ is the probability density function and $\bar{F}(x) = \Pr[X > x]$ is the complementary cumulative distribution function.
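For the standard normal distribution the Mills ratio is (1 − Φ(x))/φ(x), which can be evaluated with the standard library alone; a small sketch:

```python
import math

def mills_ratio_std_normal(x):
    # Upper tail 1 - Phi(x) via the complementary error function.
    survival = 0.5 * math.erfc(x / math.sqrt(2))
    # Standard normal density phi(x).
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    return survival / pdf

# At x = 0 the ratio is 0.5 / phi(0) = sqrt(pi/2).
assert math.isclose(mills_ratio_std_normal(0.0), math.sqrt(math.pi / 2))
```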

In probability theory, a **log-Cauchy distribution** is a probability distribution of a random variable whose logarithm is distributed in accordance with a Cauchy distribution. If *X* is a random variable with a Cauchy distribution, then *Y* = exp(*X*) has a log-Cauchy distribution; likewise, if *Y* has a log-Cauchy distribution, then *X* = log(*Y*) has a Cauchy distribution.

In machine learning, the **kernel embedding of distributions** comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.
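One concrete comparison operation on embedded distributions is the maximum mean discrepancy (MMD), the RKHS distance between two empirical mean embeddings; the sketch below (function names, the Gaussian RBF kernel, and the bandwidth are assumptions for this example) computes the biased empirical estimate.

```python
import math

def rbf(x, y, gamma=1.0):
    # Gaussian RBF kernel on scalars; gamma is an illustrative choice.
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(X, Y, gamma=1.0):
    """Squared MMD: distance between the mean embeddings of X and Y in the RKHS."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy

X = [0.0, 0.2, -0.1, 0.1]
Y = [1.0, 1.2, 0.9, 1.1]   # sample shifted away from X

assert mmd2(X, X) < 1e-12                              # identical samples: zero distance
assert mmd2(X, Y) > mmd2(X, [0.05, 0.15, -0.05, 0.0])  # distant sample scores higher
```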

In statistics, the **variance function** is a smooth function which depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

- ↑ Named for Epanechnikov, V. A. (1969). "Non-Parametric Estimation of a Multivariate Probability Density". *Theory Probab. Appl.* **14** (1): 153–158. doi:10.1137/1114019.
- ↑ Altman, N. S. (1992). "An introduction to kernel and nearest neighbor nonparametric regression". *The American Statistician*. **46** (3): 175–185. doi:10.1080/00031305.1992.10475879. hdl:1813/31637.
- ↑ Cleveland, W. S.; Devlin, S. J. (1988). "Locally weighted regression: An approach to regression analysis by local fitting". *Journal of the American Statistical Association*. **83** (403): 596–610. doi:10.1080/01621459.1988.10478639.
- ↑ Efficiency is defined as $\sqrt{\int u^2 K(u)\,du}\, \int K(u)^2\,du$.
- ↑ Silverman, B. W. (1986). *Density Estimation for Statistics and Data Analysis*. Chapman and Hall, London.

- Li, Qi; Racine, Jeffrey S. (2007). *Nonparametric Econometrics: Theory and Practice*. Princeton University Press. ISBN 978-0-691-12161-1.
- Zucchini, Walter. "Applied Smoothing Techniques Part 1: Kernel Density Estimation" (PDF). Retrieved 6 September 2018.
- Comaniciu, D.; Meer, P. (2002). "Mean shift: A robust approach toward feature space analysis". *IEEE Transactions on Pattern Analysis and Machine Intelligence*. **24** (5): 603–619. CiteSeerX 10.1.1.76.8968. doi:10.1109/34.1000236.

This page is based on this Wikipedia article

Text is available under the CC BY-SA 4.0 license; additional terms may apply.

Images, videos and audio are available under their respective licenses.
