In
statistics
Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
, kernel density estimation (KDE) is the application of
kernel smoothing for
probability density estimation, i.e., a
non-parametric
Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based on either being distri ...
method to
estimate
Estimation (or estimating) is the process of finding an estimate or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is der ...
the
probability density function
In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can ...
of a
random variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...
based on ''
kernels
Kernel may refer to:
Computing
* Kernel (operating system), the central component of most operating systems
* Kernel (image processing), a matrix used for image convolution
* Compute kernel, in GPGPU programming
* Kernel method, in machine learnin ...
'' as
weights. KDE answers a fundamental data smoothing problem where inferences about the
population
Population typically refers to the number of people in a single area, whether it be a city or town, region, country, continent, or the world. Governments typically quantify the size of the resident population within their jurisdiction using a ...
are made, based on a finite data
sample
Sample or samples may refer to:
Base meaning
* Sample (statistics), a subset of a population – complete data set
* Sample (signal), a digital discrete sample of a continuous analog signal
* Sample (material), a specimen or small quantity of s ...
. In some fields such as
signal processing
Signal processing is an electrical engineering subfield that focuses on analyzing, modifying and synthesizing ''signals'', such as audio signal processing, sound, image processing, images, and scientific measurements. Signal processing techniq ...
and
econometrics
Econometrics is the application of statistical methods to economic data in order to give empirical content to economic relationships. M. Hashem Pesaran (1987). "Econometrics," '' The New Palgrave: A Dictionary of Economics'', v. 2, p. 8 p. 8 ...
it is also termed the Parzen–Rosenblatt window method, after
Emanuel Parzen
Emanuel Parzen (April 21, 1929 – February 6, 2016) was an American statistician. He worked and published on signal detection theory and time series analysis, where he pioneered the use of kernel density estimation (also known as the Parzen w ...
and
Murray Rosenblatt, who are usually credited with independently creating it in its current form.
One of the famous applications of kernel density estimation is in estimating the class-conditional
marginal densities of data when using a
naive Bayes classifier,
which can improve its prediction accuracy.
Definition
Let (''x''
1, ''x''
2, ..., ''x
n'') be
independent and identically distributed
In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usual ...
samples drawn from some univariate distribution with an unknown
density
Density (volumetric mass density or specific mass) is the substance's mass per unit of volume. The symbol most often used for density is ''ρ'' (the lower case Greek letter rho), although the Latin letter ''D'' can also be used. Mathematical ...
''ƒ'' at any given point ''x''. We are interested in estimating the shape of this function ''ƒ''. Its ''kernel density estimator'' is
:
where ''K'' is the
kernel
Kernel may refer to:
Computing
* Kernel (operating system), the central component of most operating systems
* Kernel (image processing), a matrix used for image convolution
* Compute kernel, in GPGPU programming
* Kernel method, in machine learn ...
— a non-negative function — and is a
smoothing
In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. In smoothing, the dat ...
parameter called the ''bandwidth''. A kernel with subscript ''h'' is called the ''scaled kernel'' and defined as . Intuitively one wants to choose ''h'' as small as the data will allow; however, there is always a trade-off between the bias of the estimator and its variance. The choice of bandwidth is discussed in more detail below.
A range of
kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean square error sense, though the loss of efficiency is small for the kernels listed previously.
Due to its convenient mathematical properties, the normal kernel is often used, which means , where ''ϕ'' is the
standard normal density function.
The construction of a kernel density estimate finds interpretations in fields outside of density estimation.
For example, in
thermodynamics
Thermodynamics is a branch of physics that deals with heat, work, and temperature, and their relation to energy, entropy, and the physical properties of matter and radiation. The behavior of these quantities is governed by the four laws of the ...
, this is equivalent to the amount of heat generated when
heat kernel
In the mathematical study of heat conduction and diffusion, a heat kernel is the fundamental solution to the heat equation on a specified domain with appropriate boundary conditions. It is also one of the main tools in the study of the spectru ...
s (the fundamental solution to the
heat equation
In mathematics and physics, the heat equation is a certain partial differential equation. Solutions of the heat equation are sometimes known as caloric functions. The theory of the heat equation was first developed by Joseph Fourier in 1822 for t ...
) are placed at each data point locations ''x
i''. Similar methods are used to construct
discrete Laplace operator
In mathematics, the discrete Laplace operator is an analog of the continuous Laplace operator, defined so that it has meaning on a graph or a discrete grid. For the case of a finite-dimensional graph (having a finite number of edges and vertice ...
s on point clouds for
manifold learning
Nonlinear dimensionality reduction, also known as manifold learning, refers to various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low- ...
(e.g.
diffusion map
Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon which computes a family of embeddings of a data set into Euclidean space (often low-dimensional) whose coordinates can be computed from ...
).
Example
Kernel density estimates are closely related to
histograms
A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to " bin" (or "bucket") the range of values—that is, divide the en ...
, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. The diagram below based on these 6 data points illustrates this relationship:
For the histogram, first, the horizontal axis is divided into sub-intervals or bins which cover the range of the data: In this case, six bins each of width 2. Whenever a data point falls inside this interval, a box of height 1/12 is placed there. If more than one data point falls inside the same bin, the boxes are stacked on top of each other.
For the kernel density estimate, normal kernels with variance 2.25 (indicated by the red dashed lines) are placed on each of the data points ''x
i''. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate (compared to the discreteness of the histogram) illustrates how kernel density estimates converge faster to the true underlying density for continuous random variables.
Bandwidth selection
The
bandwidth
Bandwidth commonly refers to:
* Bandwidth (signal processing) or ''analog bandwidth'', ''frequency bandwidth'', or ''radio bandwidth'', a measure of the width of a frequency range
* Bandwidth (computing), the rate of data transfer, bit rate or thr ...
of the kernel is a
free parameter
A free parameter is a variable in a mathematical model which cannot be predicted precisely or constrained by the model and must be estimated experimentally or theoretically. A mathematical model, theory, or conjecture is more likely to be right ...
which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated
random sample
In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians atte ...
from the standard
normal distribution
In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is
:
f(x) = \frac e^
The parameter \mu ...
(plotted at the blue spikes in the
rug plot
Rug or RUG may refer to:
* Rug, or carpet, a textile floor covering
* Rug, slang for a toupée
* Ghent University (''Rijksunversiteit Gent'', or RUG)
* Really Useful Group, or RUG, a company set up by Andrew Lloyd Webber
* Rugby railway station, N ...
on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is ''undersmoothed'' since it contains too many spurious data artifacts arising from using a bandwidth ''h'' = 0.05, which is too small. The green curve is ''oversmoothed'' since using the bandwidth ''h'' = 2 obscures much of the underlying structure. The black curve with a bandwidth of ''h'' = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density. An extreme situation is encountered in the limit
(no smoothing), where the estimate is a sum of ''n''
delta functions centered at the coordinates of analyzed samples. In the other extreme limit
the estimate retains the shape of the used kernel, centered on the mean of the samples (completely smooth).
The most common optimality criterion used to select this parameter is the expected ''L''
2 risk function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...
, also termed the
mean integrated squared error
In statistics, the mean integrated squared error (MISE) is used in density estimation. The MISE of an estimate of an unknown probability density is given by
:\operatorname\, f_n-f\, _2^2=\operatorname\int (f_n(x)-f(x))^2 \, dx
where ''ƒ'' is ...
:
: