Scott's Rule
   HOME

TheInfoList



OR:

Scott's rule is a method to select the number of bins in a
histogram A histogram is a visual representation of the frequency distribution, distribution of quantitative data. To construct a histogram, the first step is to Data binning, "bin" (or "bucket") the range of values— divide the entire range of values in ...
. Scott's rule is widely employed in
data analysis Data analysis is the process of inspecting, Data cleansing, cleansing, Data transformation, transforming, and Data modeling, modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Da ...
software including R,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
and
Microsoft Excel Microsoft Excel is a spreadsheet editor developed by Microsoft for Microsoft Windows, Windows, macOS, Android (operating system), Android, iOS and iPadOS. It features calculation or computation capabilities, graphing tools, pivot tables, and a ...
where it is the default bin selection method. For a set of n observations x_i let \hat(x) be the histogram approximation of some function f(x). The integrated mean squared error (IMSE) is : \text = E\left \int_^ dx (\hat(x) - f(x))^2\right Where E cdot/math> denotes the expectation across many independent draws of n data points. By Taylor expanding to first order in h, the bin width, Scott showed that the optimal width is : h^* = \left( 6 / \int_^ f'(x)^2 dx \right)^n^ This formula is also the basis for the Freedman–Diaconis rule. By taking a ''normal reference'' i.e. assuming that f(x) is a
normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac ...
, the equation for h^* becomes : h^* = \left( 24 \sqrt \right)^ \sigma n^ \sim 3.5 \sigma n^ where \sigma is the
standard deviation In statistics, the standard deviation is a measure of the amount of variation of the values of a variable about its Expected value, mean. A low standard Deviation (statistics), deviation indicates that the values tend to be close to the mean ( ...
of the normal distribution and is estimated from the data. With this value of bin width Scott demonstrates that :\text \propto n^ showing how quickly the histogram approximation approaches the true distribution as the number of samples increases.


Terrell–Scott rule

Another approach developed by Terrell and Scott is based on the observation that, among all densities g(x) defined on a compact interval, say , x, < 1/2, with derivatives which are
absolutely continuous In calculus and real analysis, absolute continuity is a smoothness property of functions that is stronger than continuity and uniform continuity. The notion of absolute continuity allows one to obtain generalizations of the relationship betwe ...
, the density which minimises \int_^ dx (g^(x))^2 is : f_k(x) = \begin \frac(1-4x^2)^k, \quad &, x, \leq1/2\\ 0 &, x, >1/2 \end Using this with k=1 in the expression for h^* gives an
upper bound In mathematics, particularly in order theory, an upper bound or majorant of a subset of some preordered set is an element of that is every element of . Dually, a lower bound or minorant of is defined to be an element of that is less ...
on the value of bin width which is : h^*_ = \left( \frac \right)^. So, for functions satisfying the continuity conditions, at least : k_ = \frac = \left( 2n \right)^ bins should be used. This rule is also called the ''oversmoothed rule'' or the ''Rice rule'',Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University (chapter 2 "Graphing Distributions", section "Histograms") so called because both authors worked at
Rice University William Marsh Rice University, commonly referred to as Rice University, is a Private university, private research university in Houston, Houston, Texas, United States. Established in 1912, the university spans 300 acres. Rice University comp ...
. The Rice rule is often reported with the factor of 2 outside the cube root, 2\left(n \right)^, and may be considered a different rule. The key difference from Scott's rule is that this rule does not assume the data is normally distributed and the bin width only depends on the number of samples, not on any properties of the data. In general \left( 2n \right)^ is not an integer so \lceil \left( 2n \right)^ \rceil is used where \lceil \cdot \rceil denotes the
ceiling function In mathematics, the floor function is the function that takes as input a real number , and gives as output the greatest integer less than or equal to , denoted or . Similarly, the ceiling function maps to the least integer greater than or eq ...
.


References

{{reflist Rules of thumb Statistical charts and diagrams Infographics