Sturges's Rule

Sturges's rule is a method to choose the number of bins for a histogram. Given n observations, Sturges's rule suggests using
: \hat{k} = 1 + \log_2(n)
bins in the histogram. This rule is widely employed in data analysis software including Python and R; in R it is the default bin selection method.

Sturges's rule comes from the binomial distribution, which is used as a discrete approximation to the normal distribution. If the function to be approximated, f, is binomially distributed, then
: f(y) = \binom{m}{y} p^y (1-p)^{m-y}
where m is the number of trials, p is the probability of success, and y = 0, 1, \ldots, m. Choosing p = 1/2 gives
: f(y) = \binom{m}{y} 2^{-m}
In this form we can consider 2^{-m} as the normalisation factor, and Sturges's rule is saying that the sample should result in a histogram with bin counts given by the binomial coefficients. Since the total sample size is fixed to n, we must have
: n = \sum_y \binom{m}{y} = 2^m
using the well-known formula for sums of the binomial coefficients. Taking base-2 logarithms of both sides gives m = \log_2(n), and finally using k = m + 1 (because the outcomes y = 0, 1, \ldots, m include 0) gives Sturges's rule. In general Sturges's rule does not give an integer answer, so the result is rounded up.
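The derivation can be checked numerically with a minimal Python sketch. The function names `sturges_bins` and `ideal_bin_counts` are illustrative choices, not part of any library; the sketch assumes the rounded-up form of the rule described above.

```python
import math


def sturges_bins(n: int) -> int:
    """Number of bins suggested by Sturges's rule, rounded up to an integer."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # ceil(1 + log2(n)) equals 1 + ceil(log2(n)) because 1 is an integer.
    return 1 + math.ceil(math.log2(n))


def ideal_bin_counts(m: int) -> list[int]:
    """Binomial coefficients C(m, y) for y = 0..m: the idealised bin counts."""
    return [math.comb(m, y) for y in range(m + 1)]


if __name__ == "__main__":
    # With m = 4 trials the idealised counts sum to 2**4 = 16 observations,
    # spread over m + 1 = 5 bins, which is what the rule returns for n = 16.
    counts = ideal_bin_counts(4)
    print(counts, sum(counts), sturges_bins(sum(counts)))

    for n in (10, 100, 1000, 10_000):
        print(n, sturges_bins(n))
```

For n = 10, 100, 1000 and 10 000 this gives 5, 8, 11 and 15 bins respectively, which shows how slowly the bin count grows with the sample size.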


Doane's formula

Doane (Doane DP, 1976, "Aesthetic frequency classification", American Statistician, 30: 181–183) proposed modifying Sturges's formula to add extra bins when the data is skewed. Using the method of moments estimator
: g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^3}{\left[ \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \right]^{3/2}}
along with its variance
: \sigma_{g_1}^2 = \frac{6(n-2)}{(n+1)(n+3)}
Doane proposed adding \log_2 \left( 1 + \frac{|g_1|}{\sigma_{g_1}} \right) extra bins, giving Doane's formula
: \hat{k} = 1 + \log_2(n) + \log_2 \left( 1 + \frac{|g_1|}{\sigma_{g_1}} \right)
For symmetric distributions, |g_1| \simeq 0 and this is equivalent to Sturges's rule. For asymmetric distributions, a number of additional bins will be used.
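Doane's formula can be sketched in plain Python as follows. The function name `doane_bins` is hypothetical, and rounding the result up is an assumption carried over from the treatment of Sturges's rule; the skewness and its standard deviation follow the method of moments expressions above.

```python
import math
from typing import Sequence


def doane_bins(data: Sequence[float]) -> int:
    """Bin count from Doane's formula, rounded up (rounding is an assumption)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations so that sigma_g1 > 0")

    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    if m2 == 0:
        raise ValueError("all observations are equal; skewness is undefined")

    g1 = m3 / m2 ** 1.5  # method of moments skewness estimate
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))

    k = 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)
    return math.ceil(k)


if __name__ == "__main__":
    symmetric = [-2, -1, 0, 1, 2]      # g1 = 0, so no extra bins are added
    skewed = [1] * 9 + [100]           # one large outlier, strong right skew
    print(doane_bins(symmetric), doane_bins(skewed))
```

On the symmetric sample the skewness term vanishes and the result matches Sturges's rule for n = 5; on the skewed sample of size 10 the formula adds bins beyond the 5 that Sturges's rule would suggest.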


Criticisms

Unlike the Freedman–Diaconis rule or Scott's rule, Sturges's rule is not based on any sort of optimisation procedure. It is simply posited based on the approximation of a normal curve by a binomial distribution. Hyndman has pointed out (Hyndman RJ. The problem with Sturges' rule for constructing histograms. Monash University. 1995 Jul 5:1-2) that any multiple of the binomial coefficients would also converge to a normal distribution, so any number of bins could be obtained following the derivation above. Scott shows that Sturges's rule in general produces oversmoothed histograms, i.e. histograms with too few bins, and advises against its use in favour of other rules such as the Freedman–Diaconis rule or Scott's rule.
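One way to read Hyndman's objection is sketched below. It assumes, purely for illustration, that the idealised bin counts are c times the binomial coefficients for some hypothetical constant c > 0, so that n = c \cdot 2^m and k = m + 1 = 1 + \log_2(n / c). The function name and the constant c are not taken from the cited note.

```python
import math


def bins_with_multiplier(n: int, c: float) -> float:
    """Unrounded bin count when bin counts are c * C(m, y), i.e. n = c * 2**m."""
    if n < 1 or c <= 0:
        raise ValueError("n must be positive and c must be greater than 0")
    return 1 + math.log2(n / c)


if __name__ == "__main__":
    n = 1024
    for c in (0.25, 1, 4):
        # c = 1 recovers Sturges's rule; other values of c shift the answer.
        print(c, bins_with_multiplier(n, c))
```

With n = 1024, the choices c = 0.25, 1 and 4 give 13, 11 and 9 bins respectively, so under this reading the same argument supports different bin counts depending on an arbitrary constant.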

