HOME

TheInfoList



OR:

In
statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
, the Freedman–Diaconis rule can be used to select the width of the bins to be used in a
histogram A histogram is a visual representation of the frequency distribution, distribution of quantitative data. To construct a histogram, the first step is to Data binning, "bin" (or "bucket") the range of values— divide the entire range of values in ...
. It is named after David A. Freedman and
Persi Diaconis Persi Warren Diaconis (; born January 31, 1945) is an American mathematician of Greek descent and former professional magician. He is the Mary V. Sunseri Professor of Statistics and Mathematics at Stanford University. He is particularly known f ...
. For a set of empirical measurements sampled from some
probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
, the Freedman–Diaconis rule is designed to approximately minimize the integral of the squared difference between the
histogram A histogram is a visual representation of the frequency distribution, distribution of quantitative data. To construct a histogram, the first step is to Data binning, "bin" (or "bucket") the range of values— divide the entire range of values in ...
(i.e., relative frequency density) and the density of the theoretical probability distribution. In detail, the Integrated Mean Squared Error (IMSE) is : \text = E\left \int_I (H(x) - f(x))^2 \right where H is the histogram approximation of f on the interval I computed with n data points sampled from the distribution f. E cdot/math> denotes the expectation across many independent draws of n data points. Under mild conditions, namely that f and its first two derivatives are L^2, Freedman and Diaconis show that the integral is minimised by choosing the bin width : h^* = \left( 6 / \int_^ f'(x)^2 dx \right)^n^ A formula which was derived earlier by Scott. Swapping the order of the integration and expectation is justified by
Fubini's Theorem In mathematical analysis, Fubini's theorem characterizes the conditions under which it is possible to compute a double integral by using an iterated integral. It was introduced by Guido Fubini in 1907. The theorem states that if a function is L ...
. The Freedman–Diaconis rule is derived by assuming that f is a
Normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac ...
, making it an example of a ''normal reference rule''. In this case \int f'(x)^2 = (4 \sqrt \sigma^3)^. Freedman and Diaconis use the
interquartile range In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the differen ...
to estimate the
standard deviation In statistics, the standard deviation is a measure of the amount of variation of the values of a variable about its Expected value, mean. A low standard Deviation (statistics), deviation indicates that the values tend to be close to the mean ( ...
: \sigma \sim \Phi^(0.75) - \Phi^(0.25) where \Phi is the
cumulative distribution function In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x. Ever ...
for a normal density. This gives the rule :\text=2\, where \operatorname(x) is the
interquartile range In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the differen ...
of the data and n is the number of observations in the sample x . In fact if the normal density is used the factor 2 in front comes out to be \sim 2.59, but 2 is the factor recommended by Freedman and Diaconis.


Other approaches

With the factor 2 replaced by approximately 2.59, the Freedman–Diaconis rule asymptotically matches
Scott's Rule Scott's rule is a method to select the number of bins in a histogram. Scott's rule is widely employed in data analysis software including R, Python and Microsoft Excel where it is the default bin selection method. For a set of n observations x_i ...
for data sampled from a normal distribution. Another approach is to use
Sturges's rule Sturges's rule is a method to choose the number of bins for a histogram. Given n observations, Sturges's rule suggests using : \hat = 1 + \log_2(n) bins in the histogram. This rule is widely employed in data analysis software including Python ...
: use a bin width so that there are about 1+\log_2n non-empty bins, however this approach is not recommended when the number of data points is large. For a discussion of the many alternative approaches to bin selection, see Birgé and Rozenholc.


References

Rules of thumb Statistical charts and diagrams Infographics {{statistics-stub