In probability theory and statistics, Fisher's noncentral hypergeometric distribution is a generalization of the hypergeometric distribution where sampling probabilities are modified by weight factors. It can also be defined as the conditional distribution of two or more binomially distributed variables dependent upon their fixed sum.

The distribution may be illustrated by the following urn model. Assume, for example, that an urn contains ''m''1 red balls and ''m''2 white balls, totalling ''N'' = ''m''1 + ''m''2 balls. Each red ball has the weight ω1 and each white ball has the weight ω2. We will say that the odds ratio is ω = ω1 / ω2. Now we are taking balls randomly in such a way that the probability of taking a particular ball is proportional to its weight, but independent of what happens to the other balls. The number of balls taken of a particular color follows the binomial distribution. If the total number ''n'' of balls taken is known then the conditional distribution of the number of taken red balls for given ''n'' is Fisher's noncentral hypergeometric distribution. To generate this distribution experimentally, we have to repeat the experiment until it happens to give ''n'' balls.

If we want to fix the value of ''n'' prior to the experiment then we have to take the balls one by one until we have ''n'' balls. The balls are therefore no longer independent. This gives a slightly different distribution known as Wallenius' noncentral hypergeometric distribution. It is far from obvious why these two distributions are different. See the entry for noncentral hypergeometric distributions for an explanation of the difference between these two distributions and a discussion of which distribution to use in various situations.

The two distributions are both equal to the (central) hypergeometric distribution when the odds ratio is 1.

Unfortunately, both distributions are known in the literature as "the" noncentral hypergeometric distribution. It is important to be specific about which distribution is meant when using this name. Fisher's noncentral hypergeometric distribution was first given the name extended hypergeometric distribution (Harkness, 1965), and some authors still use this name today.
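The experiment described in the urn model can be sketched as a small simulation. This is only an illustration under stated assumptions: each weight is read as the odds of taking a ball, so a ball with weight w is taken with probability w / (1 + w) (the reading used in the Derivation section), and the names fisher_pmf and simulate_red_counts are hypothetical helpers, not part of any library. Runs whose total differs from ''n'' are discarded, and the observed frequencies of the red count are printed next to the exact probabilities.

```python
import random
from math import comb


def fisher_pmf(x, n, m1, m2, omega):
    """Exact probability of x red balls among n taken, for odds ratio omega."""
    lo, hi = max(0, n - m2), min(n, m1)
    if x < lo or x > hi:
        return 0.0
    term = lambda y: comb(m1, y) * comb(m2, n - y) * omega ** y
    return term(x) / sum(term(y) for y in range(lo, hi + 1))


def simulate_red_counts(n, m1, m2, omega1, omega2, samples, seed=0):
    """Repeat the independent-draw experiment, keeping only runs with n balls."""
    rng = random.Random(seed)
    # Assumption: a weight w is interpreted as odds, so p = w / (1 + w).
    p1, p2 = omega1 / (1 + omega1), omega2 / (1 + omega2)
    counts = {}
    accepted = 0
    while accepted < samples:
        red = sum(rng.random() < p1 for _ in range(m1))
        white = sum(rng.random() < p2 for _ in range(m2))
        if red + white == n:  # condition on the total number of balls taken
            counts[red] = counts.get(red, 0) + 1
            accepted += 1
    return {x: c / samples for x, c in counts.items()}


m1, m2, n = 5, 6, 4
omega1, omega2 = 1.5, 0.5  # odds ratio omega = 3
freq = simulate_red_counts(n, m1, m2, omega1, omega2, samples=20000)
for x in range(max(0, n - m2), min(n, m1) + 1):
    exact = fisher_pmf(x, n, m1, m2, omega1 / omega2)
    print(x, round(freq.get(x, 0.0), 3), round(exact, 3))
```

Discarding runs is wasteful when the required total ''n'' is unlikely, which is why practical samplers do not work this way; the sketch is meant only to mirror the verbal description.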


Univariate distribution

The probability function is
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \frac{\binom{m_1}{x}\binom{m_2}{n-x}\omega^x}{P_0} \, , \quad P_0 = \sum_{y=x_{\min}}^{x_{\max}} \binom{m_1}{y}\binom{m_2}{n-y}\omega^y \, ,
where m_2 = N - m_1 and the support is x_{\min} = \max(0, n-m_2) \le x \le x_{\max} = \min(n, m_1). Writing P_k = \sum_{y=x_{\min}}^{x_{\max}} \binom{m_1}{y}\binom{m_2}{n-y}\omega^y y^k, the mean is \mu = P_1/P_0 and the variance is \sigma^2 = P_2/P_0 - (P_1/P_0)^2.

An alternative expression of the distribution has both the number of balls taken of each color and the number of balls not taken as random variables, whereby the expression for the probability becomes symmetric.

The calculation time for the probability function can be high when the sum in ''P''0 has many terms. The calculation time can be reduced by calculating the terms in the sum recursively relative to the term for ''y'' = ''x'' and ignoring negligible terms in the tails (Liao and Rosen, 2001).

The mean can be approximated by:
:\mu \approx \frac{-2c}{b - \sqrt{b^2 - 4ac}} \, ,
where a=\omega-1, b=m_1 + n - N -(m_1+n)\omega, c=m_1 n \omega.

The variance can be approximated by:
:\sigma^2 \approx \frac{N}{N-1} \bigg/ \left( \frac{1}{\mu}+ \frac{1}{m_1-\mu}+ \frac{1}{n-\mu}+ \frac{1}{\mu+m_2-n} \right) .

Better approximations to the mean and variance are given by Levin (1984, 1990), McCullagh and Nelder (1989), Liao (1992), and Eisinga and Pelzer (2011). The saddlepoint methods for approximating the mean and the variance suggested by Eisinga and Pelzer (2011) offer extremely accurate results.
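The following sketch compares the exact mean and variance with the two approximations. It assumes the usual form of the probability function, with terms \binom{m_1}{y}\binom{m_2}{n-y}\omega^y normalized by their sum ''P''0, and it evaluates that sum directly rather than with the recursive scheme of Liao and Rosen. The function names are hypothetical and the parameter values are arbitrary.

```python
from math import comb, sqrt


def fnchypg_pmf(x, n, m1, N, omega):
    """Probability of x, by direct evaluation of the normalizing sum P0."""
    m2 = N - m1
    lo, hi = max(0, n - m2), min(n, m1)
    if x < lo or x > hi:
        return 0.0
    term = lambda y: comb(m1, y) * comb(m2, n - y) * omega ** y
    p0 = sum(term(y) for y in range(lo, hi + 1))
    return term(x) / p0


def exact_mean_var(n, m1, N, omega):
    """Mean and variance summed over the whole support."""
    m2 = N - m1
    xs = range(max(0, n - m2), min(n, m1) + 1)
    probs = [fnchypg_pmf(x, n, m1, N, omega) for x in xs]
    mean = sum(x * p for x, p in zip(xs, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))
    return mean, var


def approx_mean_var(n, m1, N, omega):
    """Closed-form approximations; undefined in degenerate cases (e.g. n = 0)."""
    m2 = N - m1
    a = omega - 1
    b = m1 + n - N - (m1 + n) * omega
    c = m1 * n * omega
    mu = -2 * c / (b - sqrt(b * b - 4 * a * c))
    var = (N / (N - 1)) / (
        1 / mu + 1 / (m1 - mu) + 1 / (n - mu) + 1 / (mu + m2 - n)
    )
    return mu, var


print(exact_mean_var(15, 20, 50, 2.5))
print(approx_mean_var(15, 20, 50, 2.5))
```

For ω = 1 both approximations reduce to the mean and variance of the central hypergeometric distribution, which gives a quick sanity check of the formulas.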


Properties

The following symmetry relations apply:
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(n-x;n,m_2,N,1/\omega)\,.
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(x;m_1,n,N,\omega)\,.
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(m_1-x;N-n,m_1,N,1/\omega)\,.

Recurrence relation:
:\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(x-1;n,m_1,N,\omega) \frac{(m_1-x+1)(n-x+1)}{x(m_2-n+x)}\omega\,.

The distribution is affectionately called "finchy-pig," based on the abbreviation convention above.
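These relations can be checked numerically. The sketch below assumes the standard ratio of consecutive probabilities, (m_1-x+1)(n-x+1)\omega / (x(m_2-n+x)), and uses it to build the whole table of probabilities without evaluating any binomial coefficient. For simplicity it starts at the lowest value of ''x'' rather than at the mode, so it does not guard against overflow in large problems. The function names are hypothetical.

```python
from math import comb, isclose


def fnchypg(x, n, m1, N, omega):
    """Direct evaluation from the definition, used as the reference."""
    m2 = N - m1
    lo, hi = max(0, n - m2), min(n, m1)
    if x < lo or x > hi:
        return 0.0
    term = lambda y: comb(m1, y) * comb(m2, n - y) * omega ** y
    return term(x) / sum(term(y) for y in range(lo, hi + 1))


def fnchypg_table(n, m1, N, omega):
    """All probabilities on the support, built with the recurrence relation."""
    m2 = N - m1
    lo, hi = max(0, n - m2), min(n, m1)
    weights = [1.0]  # unnormalized value at x = lo
    for x in range(lo + 1, hi + 1):
        ratio = (m1 - x + 1) * (n - x + 1) / (x * (m2 - n + x)) * omega
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return {lo + i: w / total for i, w in enumerate(weights)}


n, m1, N, omega = 7, 9, 20, 2.0
m2 = N - m1
table = fnchypg_table(n, m1, N, omega)
for x, p in table.items():
    checks = (
        isclose(p, fnchypg(x, n, m1, N, omega), rel_tol=1e-9),
        isclose(p, fnchypg(n - x, n, m2, N, 1 / omega), rel_tol=1e-9),
        isclose(p, fnchypg(x, m1, n, N, omega), rel_tol=1e-9),
        isclose(p, fnchypg(m1 - x, N - n, m1, N, 1 / omega), rel_tol=1e-9),
    )
    print(x, round(p, 6), all(checks))
```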


Derivation

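The univariate noncentral hypergeometric distribution may be derived alternatively as a conditional distribution in the context of two binomially distributed random variables, for example when considering the response to a particular treatment in two different groups of patients participating in a clinical trial. An important application of the noncentral hypergeometric distribution in this context is the computation of exact confidence intervals for the odds ratio comparing treatment response between the two groups.

Suppose ''X'' and ''Y'' are binomially distributed random variables counting the number of responders in two corresponding groups of size ''m''X and ''m''Y respectively,
: X \sim \operatorname{Bin}(m_X, \pi_X),\quad Y \sim \operatorname{Bin}(m_Y, \pi_Y) \, .
Their odds ratio is given as
: \omega = \frac{\omega_X}{\omega_Y} = \frac{\pi_X/(1-\pi_X)}{\pi_Y/(1-\pi_Y)} .
The responder prevalence \pi_i is fully defined in terms of the odds \omega_i, i \in \{X,Y\}, which correspond to the sampling bias in the urn scheme above, i.e.
:\pi_i = \frac{\omega_i}{1 + \omega_i}.

The trial can be summarized and analyzed in terms of the following contingency table.

{| class="wikitable"
! !! responder !! non-responder !! Total
|-
! Group X
| ''x'' || . || ''m''X
|-
! Group Y
| ''y'' || . || ''m''Y
|-
! Total
| ''n'' || . || ''N''
|}

In the table, n=x+y corresponds to the total number of responders across groups, and ''N'' to the total number of patients recruited into the trial. The dots denote corresponding frequency counts of no further relevance.

The sampling distribution of responders in group X conditional upon the trial outcome and prevalences, \Pr(X = x \mid X+Y = n,m_X,m_Y,\omega_X,\omega_Y), is noncentral hypergeometric:
: \begin{align}
F(X,\omega) :&= \Pr(X = x \mid X+Y = n,m_X,m_Y,\omega_X,\omega_Y)\\
&= \frac{\Pr(X = x, X+Y = n \mid m_X,m_Y,\omega_X,\omega_Y)}{\Pr(X+Y = n \mid m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\Pr(X = x \mid m_X,\omega_X) \Pr(Y = n-x \mid m_Y,\omega_Y)}{\Pr(X+Y = n \mid m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\binom{m_X}{x} \pi_X^x (1-\pi_X)^{m_X-x} \binom{m_Y}{n-x} \pi_Y^{n-x} (1-\pi_Y)^{m_Y-(n-x)}}{\Pr(X+Y = n \mid m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{\binom{m_X}{x} \omega_X^x (1-\pi_X)^{m_X} \binom{m_Y}{n-x} \omega_Y^{n-x} (1-\pi_Y)^{m_Y}}{\Pr(X+Y = n \mid m_X,m_Y,\omega_X,\omega_Y)}\\
&= \frac{(1-\pi_X)^{m_X} (1-\pi_Y)^{m_Y} \omega_Y^n \binom{m_X}{x} \binom{m_Y}{n-x} \omega^x}{(1-\pi_X)^{m_X} (1-\pi_Y)^{m_Y} \omega_Y^n \sum_{u=\max(0,n-m_Y)}^{\min(m_X,n)} \binom{m_X}{u} \binom{m_Y}{n-u} \omega^u}\\
&= \frac{\binom{m_X}{x} \binom{m_Y}{n-x} \omega^x}{\sum_{u=\max(0,n-m_Y)}^{\min(m_X,n)} \binom{m_X}{u} \binom{m_Y}{n-u} \omega^u}
\end{align}
Note that the denominator is essentially just the numerator, summed over all events of the joint sample space (X,Y) for which it holds that X+Y = n. Terms independent of ''X'' can be factored out of the sum and cancel out with the numerator.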
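The derivation can be checked numerically with a short sketch. It assumes the relation \pi_i = \omega_i / (1 + \omega_i) between prevalence and odds, computes the conditional probability directly from two binomial probability functions, and compares it with the closed form that depends only on the odds ratio \omega = \omega_X / \omega_Y. All function names and parameter values are illustrative.

```python
from math import comb, isclose


def binom_pmf(k, m, p):
    """Binomial probability of k successes in m trials."""
    if k < 0 or k > m:
        return 0.0
    return comb(m, k) * p ** k * (1 - p) ** (m - k)


def conditional_from_binomials(x, n, m_x, m_y, omega_x, omega_y):
    """Pr(X = x | X + Y = n), computed as joint probability over marginal."""
    pi_x = omega_x / (1 + omega_x)
    pi_y = omega_y / (1 + omega_y)
    joint = binom_pmf(x, m_x, pi_x) * binom_pmf(n - x, m_y, pi_y)
    marginal = sum(
        binom_pmf(u, m_x, pi_x) * binom_pmf(n - u, m_y, pi_y)
        for u in range(0, n + 1)
    )
    return joint / marginal


def closed_form(x, n, m_x, m_y, omega):
    """Final expression of the derivation, a function of the odds ratio only."""
    lo, hi = max(0, n - m_y), min(n, m_x)
    if x < lo or x > hi:
        return 0.0
    term = lambda u: comb(m_x, u) * comb(m_y, n - u) * omega ** u
    return term(x) / sum(term(u) for u in range(lo, hi + 1))


m_x, m_y, n = 12, 10, 9
for x in range(0, n + 1):
    direct = conditional_from_binomials(x, n, m_x, m_y, 0.8, 0.25)
    print(x, round(direct, 6), round(closed_form(x, n, m_x, m_y, 0.8 / 0.25), 6))
```

Doubling both odds (1.6 and 0.5) leaves the odds ratio, and therefore the conditional distribution, unchanged, which illustrates the cancellation of the terms that do not depend on ''X''.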


Multivariate distribution

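The distribution can be expanded to any number of colors ''c'' of balls in the urn. The multivariate distribution is used when there are more than two colors.

The probability function is
:\operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m},\boldsymbol{\omega}) = \frac{1}{P_0} \prod_{i=1}^{c} \binom{m_i}{x_i} \omega_i^{x_i} \, , \quad P_0 = \sum_{\mathbf{y} \in S} \prod_{i=1}^{c} \binom{m_i}{y_i} \omega_i^{y_i} \, ,
where S is the set of all vectors \mathbf{y} of non-negative integers with \sum_{i=1}^{c} y_i = n.

A simple approximation to the mean of x_i is
:\mu_i \approx \frac{m_i r \omega_i}{r \omega_i + 1} \, ,
where r is the unique positive solution to \sum_{i=1}^{c} \mu_i = n. Better approximations to the mean and variance are given by McCullagh and Nelder (1989).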


Properties

The order of the colors is arbitrary so that any colors can be swapped.

The weights can be arbitrarily scaled:
:\operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m}, \boldsymbol{\omega}) = \operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m}, r\boldsymbol{\omega})\,\, for all r \in \mathbb{R}_+.

Colors with zero number (''m''''i'' = 0) or zero weight (ω''i'' = 0) can be omitted from the equations.

Colors with the same weight can be joined:
: \begin{align}
& \operatorname{mfnchypg}\left(\mathbf{x};n,\mathbf{m}, (\omega_1,\ldots,\omega_{c-1},\omega_{c-1})\right) \\
& = \operatorname{mfnchypg}\left((x_1,\ldots,x_{c-1}+x_c); n,(m_1,\ldots,m_{c-1}+m_c), (\omega_1,\ldots,\omega_{c-1})\right)\, \cdot \\
& \qquad \operatorname{hypg}(x_c; x_{c-1}+x_c, m_c, m_{c-1}+m_c)
\end{align}
where \operatorname{hypg}(x;n,m,N) is the (univariate, central) hypergeometric distribution probability.
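The scaling and joining properties can be illustrated with a brute-force sketch. It assumes the usual form of the multivariate probability function, a product of \binom{m_i}{x_i}\omega_i^{x_i} over the colors divided by the sum of that product over all outcomes with total ''n''. The enumeration is practical only for small urns, and the function names are hypothetical.

```python
from itertools import product
from math import comb, isclose, prod


def mfnchypg(x, n, m, omega):
    """Multivariate probability by enumerating every outcome with total n."""
    if sum(x) != n or any(xi < 0 or xi > mi for xi, mi in zip(x, m)):
        return 0.0
    weight = lambda y: prod(
        comb(mi, yi) * wi ** yi for mi, yi, wi in zip(m, y, omega)
    )
    support = (
        y for y in product(*(range(min(mi, n) + 1) for mi in m)) if sum(y) == n
    )
    return weight(x) / sum(weight(y) for y in support)


def hypg(x, n, m, N):
    """Central hypergeometric probability of x marked items among n drawn."""
    if x < 0 or x > min(n, m) or n - x > N - m:
        return 0.0
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)


m = (4, 3, 5)
omega = (2.0, 0.5, 0.5)  # the last two colors share the same weight
x, n = (3, 1, 2), 6

p = mfnchypg(x, n, m, omega)
scaled = mfnchypg(x, n, m, tuple(7.0 * w for w in omega))
joined = mfnchypg((3, 1 + 2), n, (4, 3 + 5), (2.0, 0.5)) * hypg(2, 1 + 2, 5, 3 + 5)
print(p, scaled, joined)
```

The three printed values agree up to rounding error: scaling all weights by the same positive factor changes nothing, and the two equal-weight colors can be merged and then split again with a central hypergeometric probability.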


Applications

Fisher's noncentral hypergeometric distribution is useful for models of biased sampling or biased selection where the individual items are sampled independently of each other with no competition. The bias or odds can be estimated from an experimental value of the mean. Use Wallenius' noncentral hypergeometric distribution instead if items are sampled one by one with competition.

Fisher's noncentral hypergeometric distribution is used mostly for tests in contingency tables where a conditional distribution for fixed margins is desired. This can be useful, for example, for testing or measuring the effect of a medicine. See McCullagh and Nelder (1989).
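One possible way to estimate the odds from an observed mean is sketched below. It is not taken from the cited references: it simply relies on the fact that the exact mean increases with ω and finds, by bisection on log ω, the value whose mean matches the observation. The function names, the search interval and the iteration count are assumptions chosen for illustration, and the observed mean must lie strictly inside the range of possible values of ''x''.

```python
from math import comb, exp


def fisher_mean(n, m1, m2, omega):
    """Exact mean of the number of red balls for odds ratio omega."""
    lo, hi = max(0, n - m2), min(n, m1)
    ys = range(lo, hi + 1)
    terms = [comb(m1, y) * comb(m2, n - y) * omega ** y for y in ys]
    return sum(y * t for y, t in zip(ys, terms)) / sum(terms)


def estimate_odds(observed_mean, n, m1, m2,
                  log_lo=-20.0, log_hi=20.0, iterations=200):
    """Bisection on log(omega); relies on the mean increasing with omega."""
    for _ in range(iterations):
        mid = 0.5 * (log_lo + log_hi)
        if fisher_mean(n, m1, m2, exp(mid)) < observed_mean:
            log_lo = mid  # mean too small, so omega must be larger
        else:
            log_hi = mid
    return exp(0.5 * (log_lo + log_hi))


# Round trip: compute the mean for a known odds ratio, then recover it.
target = fisher_mean(8, 12, 10, 3.0)
estimate = estimate_odds(target, 8, 12, 10)
print(target, estimate)
```

With real data the observed mean is itself subject to sampling error, so the recovered value is only a point estimate of the odds ratio.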


Software available


* FisherHypergeometricDistribution in Mathematica.
* An implementation for the R programming language is available as the package BiasedUrn. It includes univariate and multivariate probability mass functions, distribution functions, quantiles, random variable generating functions, mean and variance.
* The R package MCMCpack includes the univariate probability mass function and random variable generating function.
* SAS System includes univariate probability mass function and distribution function.
* Implementation in C++ is available from www.agner.org.
* Calculation methods are described by Liao and Rosen (2001) and Fog (2008).


See also

* Noncentral hypergeometric distributions
* Wallenius' noncentral hypergeometric distribution
* Hypergeometric distribution
* Urn models
* Biased sample
* Bias
* Contingency table
* Fisher's exact test

