
In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on ''kernels'' as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.
One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.
Definition
Let (''x''<sub>1</sub>, ''x''<sub>2</sub>, ..., ''x''<sub>''n''</sub>) be independent and identically distributed samples drawn from some univariate distribution with an unknown density ''f''. We are interested in estimating the shape of this function ''f'' at any given point ''x''. Its ''kernel density estimator'' is
:<math>\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)</math>
where ''K'' is the kernel (a non-negative function) and ''h'' > 0 is a smoothing parameter called the ''bandwidth''. A kernel with subscript ''h'' is called the ''scaled kernel'' and defined as ''K''<sub>''h''</sub>(''x'') = (1/''h'') ''K''(''x''/''h''). Intuitively one wants to choose ''h'' as small as the data will allow; however, there is always a trade-off between the bias of the estimator and its variance. The choice of bandwidth is discussed in more detail below.
A range of
kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean square error sense, though the loss of efficiency is small for the kernels listed previously.
Due to its convenient mathematical properties, the normal kernel is often used, which means ''K''(''x'') = ''ϕ''(''x''), where ''ϕ'' is the standard normal density function.
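As a concrete illustration of the estimator defined above, the following Python sketch implements the formula directly with a normal (Gaussian) kernel. The data values and bandwidth are arbitrary choices for illustration, not taken from the article:

```python
import math

def gaussian_kernel(u):
    """Standard normal density phi(u), used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, h):
    """Kernel density estimate at x: (1/(n*h)) * sum_i K((x - x_i)/h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

# Illustrative data and bandwidth (arbitrary values).
data = [-1.2, -0.5, 0.1, 0.8, 1.5]
print(kde(0.0, data, h=0.5))
```

Because each scaled kernel ''K''<sub>''h''</sub> integrates to one, the estimate itself integrates to one and is therefore a valid probability density.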
The construction of a kernel density estimate finds interpretations in fields outside of density estimation. For example, in thermodynamics, this is equivalent to the amount of heat generated when heat kernels (the fundamental solution to the heat equation) are placed at each data point location ''x''<sub>''i''</sub>. Similar methods are used to construct discrete Laplace operators on point clouds for manifold learning (e.g. diffusion maps).
Example
Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. The diagram below, based on these 6 data points, illustrates this relationship:
For the histogram, first, the horizontal axis is divided into sub-intervals or bins which cover the range of the data: In this case, six bins each of width 2. Whenever a data point falls inside this interval, a box of height 1/12 is placed there. If more than one data point falls inside the same bin, the boxes are stacked on top of each other.
For the kernel density estimate, normal kernels with variance 2.25 (indicated by the red dashed lines) are placed on each of the data points ''x
i''. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate (compared to the discreteness of the histogram) illustrates how kernel density estimates converge faster to the true underlying density for continuous random variables.
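The summing of kernels described above can be sketched in Python. The kernel variance of 2.25 follows the text; the six data points used here are illustrative stand-ins, since the diagram's actual values are not listed in this excerpt:

```python
import math

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Six illustrative data points (stand-ins for the diagram's values).
points = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

def kde_sum(x, var=2.25):
    """Average of normal kernels with variance 2.25, one placed on each data point."""
    return sum(normal_pdf(x, xi, var) for xi in points) / len(points)

print(kde_sum(0.0))
```

Unlike the histogram, which jumps at bin edges, this estimate is a smooth, continuous function of ''x''.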
Bandwidth selection
The bandwidth of the kernel is a free parameter which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated random sample from the standard normal distribution
rug plot on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is ''undersmoothed'' since it contains too many spurious data artifacts arising from using a bandwidth ''h'' = 0.05, which is too small. The green curve is ''oversmoothed'' since using the bandwidth ''h'' = 2 obscures much of the underlying structure. The black curve with a bandwidth of ''h'' = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density. An extreme situation is encountered in the limit
(no smoothing), where the estimate is a sum of ''n''
delta functions centered at the coordinates of analyzed samples. In the other extreme limit
the estimate retains the shape of the used kernel, centered on the mean of the samples (completely smooth).
The most common optimality criterion used to select this parameter is the expected ''L''<sub>2</sub> risk function, also termed the mean integrated squared error:
:<math>\operatorname{MISE}(h) = \operatorname{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 \, dx\right]</math>
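A minimal numerical check of this criterion can be sketched as follows, using the integrated squared error of a single simulated sample as a stand-in for the expectation; the sample, the true density (standard normal), and the bandwidth values are illustrative assumptions:

```python
import math
import random

def kde(x, samples, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    n = len(samples)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2)
               for xi in samples) / (n * h * math.sqrt(2.0 * math.pi))

def true_f(x):
    """Assumed true density: standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def ise(samples, h, lo=-5.0, hi=5.0, steps=400):
    """Integrated squared error, int (f_hat(x) - f(x))^2 dx, by the midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += (kde(x, samples, h) - true_f(x)) ** 2
    return total * dx

random.seed(1)  # illustrative sample drawn from the true density
data = [random.gauss(0.0, 1.0) for _ in range(200)]
for h in (0.05, 0.3, 2.0):
    print(h, ise(data, h))
```

Averaging the ISE over many independent samples approximates the MISE; the bandwidth minimizing it balances the variance of an undersmoothed estimate against the bias of an oversmoothed one.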