Wallenius' Noncentral Hypergeometric Distribution

In probability theory and statistics, Wallenius' noncentral hypergeometric distribution (named after Kenneth Ted Wallenius) is a generalization of the hypergeometric distribution where items are sampled with bias.

This distribution can be illustrated as an urn model with bias. Assume, for example, that an urn contains $m_1$ red balls and $m_2$ white balls, totalling $N = m_1 + m_2$ balls. Each red ball has the weight $\omega_1$ and each white ball has the weight $\omega_2$. We will say that the odds ratio is $\omega = \omega_1 / \omega_2$. We now take ''n'' balls, one by one, in such a way that the probability of taking a particular ball at a particular draw is equal to its proportion of the total weight of all balls that lie in the urn at that moment. The number of red balls $x_1$ that we get in this experiment is a random variable with Wallenius' noncentral hypergeometric distribution.

The matter is complicated by the fact that there is more than one noncentral hypergeometric distribution. Wallenius' noncentral hypergeometric distribution is obtained if balls are sampled one by one in such a way that there is competition between the balls. Fisher's noncentral hypergeometric distribution is obtained if the balls are sampled simultaneously or independently of each other. Unfortunately, both distributions are known in the literature as "the" noncentral hypergeometric distribution, so it is important to be specific about which distribution is meant when using this name. The two distributions are both equal to the (central) hypergeometric distribution when the odds ratio is 1.

The difference between these two probability distributions is subtle. See the Wikipedia entry on noncentral hypergeometric distributions for a more detailed explanation.
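The urn experiment described above can be imitated directly in code. The following is a minimal sketch, not a reference implementation: the function name `draw_wallenius` and its parameter names are chosen here for illustration, the white balls are given weight 1 so that each red ball has weight ω, and Python's standard `random` module is assumed as the source of randomness.

```python
import random


def draw_wallenius(n, m1, m2, omega, rng=random):
    """Simulate one biased urn experiment and return the number of red balls taken.

    n      -- number of balls drawn, one by one, without replacement
    m1, m2 -- initial numbers of red and white balls
    omega  -- odds ratio (weight of a red ball when a white ball has weight 1)
    rng    -- object with a random() method (illustrative assumption)
    """
    if not 0 <= n <= m1 + m2:
        raise ValueError("n must be between 0 and m1 + m2")
    red, white = m1, m2   # balls still in the urn
    x = 0                 # red balls taken so far
    for _ in range(n):
        red_weight = red * omega
        total_weight = red_weight + white
        # A red ball is taken with probability red_weight / total_weight,
        # which depends on everything drawn before: this is the competition.
        if rng.random() * total_weight < red_weight:
            red -= 1
            x += 1
        else:
            white -= 1
    return x
```

Repeating the experiment many times and tabulating the returned values gives an empirical estimate of the distribution; with `omega = 1` the result should approach the central hypergeometric distribution.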


Univariate distribution

Wallenius' distribution is particularly complicated because each ball has a probability of being taken that depends not only on its weight, but also on the total weight of its competitors. The weight of the competing balls, in turn, depends on the outcomes of all preceding draws.

This recursive dependency gives rise to a difference equation with a solution that is given in open form by the integral in the probability mass function:

$$\operatorname{wnchypg}(x;n,m_1,m_2,\omega) = \binom{m_1}{x}\binom{m_2}{n-x}\int_0^1 \left(1-t^{\omega/D}\right)^{x}\left(1-t^{1/D}\right)^{n-x}\,dt\,, \qquad D = \omega(m_1-x) + \left(m_2-(n-x)\right).$$

Closed form expressions for the probability mass function exist (Lyons, 1980), but they are not very useful for practical calculations because of extreme numerical instability, except in degenerate cases.

Several other calculation methods are used, including recursion, Taylor expansion and numerical integration (Fog, 2007, 2008).

The most reliable calculation method is recursive calculation of f(''x'',''n'') from f(''x'',''n''-1) and f(''x''-1,''n''-1) using the recursion formula given below under properties. The probabilities of all (''x'',''n'') combinations on all possible trajectories leading to the desired point are calculated, starting with f(0,0) = 1. The total number of probabilities to calculate is $n(x+1) - x^2$. Other calculation methods must be used when ''n'' and ''x'' are so big that this method is too inefficient.

The probability that all balls have the same color is easier to calculate. See the formula below under multivariate distribution.

No exact formula for the mean is known (short of complete enumeration of all probabilities). A reasonably accurate approximation is the solution μ of the equation

$$\frac{\mu}{m_1} + \left(1-\frac{n-\mu}{m_2}\right)^{\omega} = 1\,.$$

This equation can be solved for μ by Newton-Raphson iteration. The same equation can be used for estimating the odds from an experimentally obtained value of the mean.
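As an illustration of the Newton-Raphson approach to the approximate mean, the sketch below solves $\mu/m_1 + (1-(n-\mu)/m_2)^{\omega} = 1$ for μ, and inverts the same equation to estimate ω from an observed mean. The function names, the tolerance and the bisection fallback that keeps the iterate inside the feasible interval are assumptions made for this example; they are not taken from the cited literature.

```python
import math


def wallenius_mean_approx(n, m1, m2, omega, tol=1e-12, max_iter=100):
    """Approximate mean: root of mu/m1 + (1 - (n - mu)/m2)**omega - 1 = 0."""
    if m1 <= 0 or m2 <= 0 or omega <= 0 or not 0 <= n <= m1 + m2:
        raise ValueError("need m1, m2, omega > 0 and 0 <= n <= m1 + m2")
    lo = max(0.0, float(n - m2))   # smallest possible number of red balls
    hi = float(min(n, m1))         # largest possible number of red balls
    if lo >= hi:                   # n = 0 or n = N: the outcome is fixed
        return lo

    def g(mu):
        base = max(1.0 - (n - mu) / m2, 0.0)
        return mu / m1 + base ** omega - 1.0

    def dg(mu):
        # Clamp the base away from zero so a negative exponent cannot divide by zero.
        base = max(1.0 - (n - mu) / m2, 1e-300)
        return 1.0 / m1 + (omega / m2) * base ** (omega - 1.0)

    mu = 0.5 * (lo + hi)
    for _ in range(max_iter):
        value = g(mu)
        # g is increasing, so the sign of g(mu) tells which side the root is on.
        if value < 0.0:
            lo = mu
        else:
            hi = mu
        step = value / dg(mu)
        if abs(step) < tol:
            return mu
        new_mu = mu - step                 # Newton-Raphson step
        if not lo < new_mu < hi:           # safeguard: fall back to bisection
            new_mu = 0.5 * (lo + hi)
        mu = new_mu
    return mu


def estimate_odds(mean, n, m1, m2):
    """Solve the same equation for omega, given an observed mean number of red balls."""
    return math.log(1.0 - mean / m1) / math.log(1.0 - (n - mean) / m2)
```

For ω = 1 the equation is linear and the result reduces to the central hypergeometric mean $n m_1 / N$. `estimate_odds` requires a mean strictly between the smallest and largest possible number of red balls, since otherwise a logarithm is undefined.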


Properties of the univariate distribution

Wallenius' distribution has fewer symmetry relations than Fisher's noncentral hypergeometric distribution has. The only symmetry relates to the swapping of colors:

$$\operatorname{wnchypg}(x;n,m_1,m_2,\omega) = \operatorname{wnchypg}(n-x;n,m_2,m_1,1/\omega)\,.$$

Unlike Fisher's distribution, Wallenius' distribution has no symmetry relating to the number of balls ''not'' taken.

The following recursion formula is useful for calculating probabilities:

$$\operatorname{wnchypg}(x;n,m_1,m_2,\omega) = \operatorname{wnchypg}(x-1;n-1,m_1,m_2,\omega)\,\frac{(m_1-x+1)\,\omega}{(m_1-x+1)\,\omega+m_2+x-n} + \operatorname{wnchypg}(x;n-1,m_1,m_2,\omega)\,\frac{m_2+x-n+1}{(m_1-x)\,\omega+m_2+x-n+1}\,.$$

Another recursion formula is also known:

$$\operatorname{wnchypg}(x;n,m_1,m_2,\omega) = \operatorname{wnchypg}(x-1;n-1,m_1-1,m_2,\omega)\,\frac{m_1\omega}{m_1\omega+m_2} + \operatorname{wnchypg}(x;n-1,m_1,m_2-1,\omega)\,\frac{m_2}{m_1\omega+m_2}\,.$$

The probability is limited by

$$\operatorname{f}_1(x) \le \operatorname{wnchypg}(x;n,m_1,m_2,\omega) \le \operatorname{f}_2(x) \quad\text{for}\quad \omega < 1\,,$$

$$\operatorname{f}_1(x) \ge \operatorname{wnchypg}(x;n,m_1,m_2,\omega) \ge \operatorname{f}_2(x) \quad\text{for}\quad \omega > 1\,,$$

where

$$\operatorname{f}_1(x)=\binom{m_1}{x}\binom{m_2}{n-x}\,\frac{n!}{(m_1+m_2/\omega)^{\underline{x}}\,\left(m_2+\omega(m_1-x)\right)^{\underline{n-x}}}\,,$$

$$\operatorname{f}_2(x)=\binom{m_1}{x}\binom{m_2}{n-x}\,\frac{n!}{\left(m_1+(m_2-(n-x))/\omega\right)^{\underline{x}}\,(m_2+\omega m_1)^{\underline{n-x}}}\,,$$

and the underlined superscript indicates the falling factorial $a^{\underline{b}} = a(a-1)\ldots(a-b+1)$.
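The recursive calculation of f(''x'',''n'') from f(''x'',''n''-1) and f(''x''-1,''n''-1), starting with f(0,0) = 1, can be sketched as follows. Each step multiplies the probability of the previous state by the chance that the next ball is red or white, given the balls still in the urn (white balls are assumed to have weight 1). For simplicity this sketch fills in every row of the table rather than only the trajectories leading to one desired point, so it does more work than strictly necessary; the function names are illustrative.

```python
from math import comb


def wallenius_pmf_table(n, m1, m2, omega):
    """Return [P(X = x) for x = 0..n] after n draws, built up one draw at a time."""
    if not 0 <= n <= m1 + m2:
        raise ValueError("n must be between 0 and m1 + m2")
    probs = [1.0]                          # f(0, 0) = 1
    for k in range(1, n + 1):              # k = number of balls drawn so far
        new = [0.0] * (k + 1)
        for x in range(k + 1):
            p = 0.0
            if x >= 1:
                # Last ball red: previous state had x-1 red among k-1 draws.
                red_left = m1 - (x - 1)
                white_left = m2 - (k - x)
                if red_left > 0 and white_left >= 0:
                    p += probs[x - 1] * red_left * omega / (red_left * omega + white_left)
            if x <= k - 1:
                # Last ball white: previous state had x red among k-1 draws.
                red_left = m1 - x
                white_left = m2 - (k - 1 - x)
                if white_left > 0 and red_left >= 0:
                    p += probs[x] * white_left / (red_left * omega + white_left)
            new[x] = p
        probs = new
    return probs


def central_hypergeom_pmf(x, n, m1, m2):
    """Central hypergeometric probability, used only as a check for omega = 1."""
    return comb(m1, x) * comb(m2, n - x) / comb(m1 + m2, n)
```

Two quick sanity checks follow from the text: with `omega = 1` the table should agree with `central_hypergeom_pmf`, and swapping the colors while inverting ω should mirror the table, as stated by the symmetry relation.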


Multivariate distribution

The distribution can be expanded to any number of colors ''c'' of balls in the urn. The multivariate distribution is used when there are more than two colors.

The probability mass function can be calculated by various Taylor expansion methods or by numerical integration (Fog, 2008).

The probability that all balls have the same color, ''j'', can be calculated as:

$$\operatorname{mwnchypg}((0,\ldots,0,x_j,0,\ldots);n,\mathbf{m},\boldsymbol{\omega}) = \frac{m_j^{\,\underline{n}}}{\left(\frac{1}{\omega_j}\sum_{i=1}^{c} m_i\omega_i\right)^{\underline{n}}}$$

for $x_j = n \le m_j$, where the underlined superscript denotes the falling factorial.

A reasonably good approximation to the means $\mu_1,\ldots,\mu_c$ is given by the solution of

$$\left(1-\frac{\mu_1}{m_1}\right)^{1/\omega_1} = \left(1-\frac{\mu_2}{m_2}\right)^{1/\omega_2} = \ldots = \left(1-\frac{\mu_c}{m_c}\right)^{1/\omega_c}, \qquad \sum_{i=1}^{c}\mu_i = n\,.$$

The equation can be solved by defining θ so that

$$\mu_i = m_i\left(1-e^{\omega_i\theta}\right)$$

and solving

$$\sum_{i=1}^{c} \mu_i = n$$

for θ by Newton-Raphson iteration. The equation for the mean is also useful for estimating the odds from experimentally obtained values for the mean.

No good way of calculating the variance is known. The best known method is to approximate the multivariate Wallenius distribution by a multivariate Fisher's noncentral hypergeometric distribution with the same mean, and insert the mean as calculated above in the approximate formula for the variance of the latter distribution.
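The two calculations described in this section can be sketched in a few lines. The first function solves $\sum_i m_i(1-e^{\omega_i\theta}) = n$ for θ by Newton-Raphson iteration, starting from θ = 0, and returns the approximate means $\mu_i = m_i(1-e^{\omega_i\theta})$. The second evaluates the probability that all ''n'' balls have color ''j'' as a product of successive draw probabilities, which is the falling-factorial ratio written out term by term. Function names, the starting point and the stopping tolerance are assumptions made for this illustration, and all weights are assumed to be strictly positive.

```python
from math import exp, expm1


def wallenius_multivariate_mean_approx(n, m, omega, tol=1e-12, max_iter=200):
    """Approximate means mu_i = m_i * (1 - exp(omega_i * theta)) with sum(mu) = n."""
    total = sum(m)
    if not 0 <= n <= total or any(w <= 0 for w in omega):
        raise ValueError("need 0 <= n <= sum(m) and positive weights")
    if n == total:                 # every ball is taken; theta would be -infinity
        return [float(mi) for mi in m]
    theta = 0.0                    # all mu_i are 0 here, so the sum starts below n
    for _ in range(max_iter):
        # h(theta) = sum_i m_i * (1 - exp(omega_i * theta)) - n, with 1 - exp(a) = -expm1(a)
        h = sum(-mi * expm1(wi * theta) for mi, wi in zip(m, omega)) - n
        dh = -sum(mi * wi * exp(wi * theta) for mi, wi in zip(m, omega))
        step = h / dh
        theta -= step              # Newton-Raphson update
        if abs(step) < tol:
            break
    return [-mi * expm1(wi * theta) for mi, wi in zip(m, omega)]


def prob_all_same_color(j, n, m, omega):
    """Probability that all n balls drawn have color j (0-based index)."""
    if n > m[j]:
        return 0.0
    # base = (1 / omega_j) * sum_i m_i * omega_i, the base of the lower falling factorial
    base = sum(mi * wi for mi, wi in zip(m, omega)) / omega[j]
    p = 1.0
    for k in range(n):
        p *= (m[j] - k) / (base - k)
    return p
```

When all weights are equal, the approximate means reduce to $n m_i / N$, which gives a simple way to check the solver.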


Properties of the multivariate distribution

The order of the colors is arbitrary so that any colors can be swapped.

The weights can be arbitrarily scaled:

$$\operatorname{mwnchypg}(\mathbf{x};n,\mathbf{m},\boldsymbol{\omega}) = \operatorname{mwnchypg}(\mathbf{x};n,\mathbf{m},r\boldsymbol{\omega}) \quad\text{for all}\quad r \in \mathbb{R}_+\,.$$

Colors with zero number ($m_i = 0$) or zero weight ($\omega_i = 0$) can be omitted from the equations.

Colors with the same weight can be joined:

$$\operatorname{mwnchypg}\left(\mathbf{x};n,\mathbf{m},(\omega_1,\ldots,\omega_{c-1},\omega_{c-1})\right) = \operatorname{mwnchypg}\left((x_1,\ldots,x_{c-1}+x_c);n,(m_1,\ldots,m_{c-1}+m_c),(\omega_1,\ldots,\omega_{c-1})\right) \cdot \operatorname{hypg}(x_c;\,x_{c-1}+x_c,\,m_c,\,m_{c-1}+m_c)\,,$$

where $\operatorname{hypg}(x;n,m,N)$ is the (univariate, central) hypergeometric distribution probability.
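These properties can be checked numerically on small examples. The sketch below computes multivariate probabilities by conditioning on the color of the first ball drawn, which follows directly from the urn model and reduces to the second univariate recursion formula when there are two colors. It is an illustrative brute-force method with hypothetical function names, suitable only for small urns, and it assumes strictly positive weights; it is not one of the calculation methods cited in the text.

```python
from functools import lru_cache
from math import comb


def wallenius_multivariate_pmf(x, m, omega):
    """P(exactly x[i] balls of color i are taken) when sum(x) balls are drawn."""
    omega = tuple(float(w) for w in omega)

    @lru_cache(maxsize=None)
    def prob(x, m):
        if any(xi < 0 or xi > mi for xi, mi in zip(x, m)):
            return 0.0                      # impossible outcome
        if sum(x) == 0:
            return 1.0                      # nothing left to draw
        total_weight = sum(mi * wi for mi, wi in zip(m, omega))
        p = 0.0
        for i, (xi, mi) in enumerate(zip(x, m)):
            if xi == 0:
                continue
            # Probability that the first ball has color i, times the probability
            # of getting the remaining balls from the reduced urn.
            first = mi * omega[i] / total_weight
            x_rest = x[:i] + (xi - 1,) + x[i + 1:]
            m_rest = m[:i] + (mi - 1,) + m[i + 1:]
            p += first * prob(x_rest, m_rest)
        return p

    return prob(tuple(x), tuple(m))


def hypg(x, n, m, N):
    """Univariate central hypergeometric probability."""
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)
```

For example, multiplying every weight by the same positive factor should leave `wallenius_multivariate_pmf` unchanged, and merging two colors of equal weight should reproduce the product with `hypg` given in the joining formula.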


Complementary Wallenius' noncentral hypergeometric distribution

The balls that are ''not'' taken in the urn experiment have a distribution that is different from Wallenius' noncentral hypergeometric distribution, due to a lack of symmetry. The distribution of the balls not taken can be called the complementary Wallenius' noncentral hypergeometric distribution. Probabilities in the complementary distribution are calculated from Wallenius' distribution by replacing $n$ with $N - n$, $x_i$ with $m_i - x_i$, and $\omega_i$ with $1/\omega_i$.


Software available


* WalleniusHypergeometricDistribution in Mathematica.
* An implementation for the R programming language is available as the package BiasedUrn. It includes univariate and multivariate probability mass functions, distribution functions, quantiles, random variable generating functions, mean and variance.
* An implementation in C++ is available from www.agner.org.


See also

* Noncentral hypergeometric distributions
* Fisher's noncentral hypergeometric distribution
* Biased sample
* Bias
* Population genetics
* Fisher's exact test

