In statistics, ''G''-tests are likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.

The general formula for ''G'' is

: G = 2\sum_i O_i \ln\left(\frac{O_i}{E_i}\right) \, ,

where O_i \geq 0 is the observed count in a cell, E_i > 0 is the expected count under the null hypothesis, \ln denotes the natural logarithm, and the sum is taken over all non-empty cells. Furthermore, the total observed count must equal the total expected count:

: \sum_i O_i = \sum_i E_i = N \, ,

where N is the total number of observations.

''G''-tests have been recommended at least since the 1981 edition of ''Biometry'', a statistics textbook by Robert R. Sokal and F. James Rohlf.


Derivation

We can derive the value of the ''G''-test from the log-likelihood ratio test where the underlying model is a multinomial model.

Suppose we had a sample x = (x_1, \ldots, x_m) where each x_i is the number of times that an object of type i was observed. Furthermore, let n = \sum_{i=1}^m x_i be the total number of objects observed. If we assume that the underlying model is multinomial, then the test statistic is defined by

: \ln\left(\frac{L(\tilde{\theta} \mid x)}{L(\hat{\theta} \mid x)}\right) = \ln\left(\frac{\prod_{i=1}^m \tilde{\theta}_i^{x_i}}{\prod_{i=1}^m \hat{\theta}_i^{x_i}}\right) \, ,

where \tilde{\theta} is the null hypothesis and \hat{\theta} is the maximum likelihood estimate (MLE) of the parameters given the data. Recall that for the multinomial model, the MLE \hat{\theta}_i given some data is defined by

: \hat{\theta}_i = \frac{x_i}{n} \, .

Furthermore, we may represent each null-hypothesis parameter \tilde{\theta}_i as

: \tilde{\theta}_i = \frac{e_i}{n} \, .

Thus, by substituting the representations of \tilde{\theta} and \hat{\theta} in the log-likelihood ratio, the equation simplifies to

: \begin{align} \ln\left(\frac{L(\tilde{\theta} \mid x)}{L(\hat{\theta} \mid x)}\right) &= \ln \prod_{i=1}^m \left(\frac{e_i}{x_i}\right)^{x_i} \\ &= \sum_{i=1}^m x_i \ln\left(\frac{e_i}{x_i}\right) \, . \end{align}

Relabel the variables e_i with E_i and x_i with O_i. Finally, multiply by a factor of -2 (used to make the ''G''-test formula asymptotically equivalent to Pearson's chi-squared test formula) to achieve the form

: \begin{align} G &= -2 \sum_{i=1}^m O_i \ln\left(\frac{E_i}{O_i}\right) \\ &= 2 \sum_{i=1}^m O_i \ln\left(\frac{O_i}{E_i}\right) \, . \end{align}

Heuristically, one can imagine ~ O_i ~ as continuous and approaching zero, in which case ~ O_i \ln O_i \to 0 ~, and terms with zero observations can simply be dropped. However, the ''expected'' count must be strictly positive in every cell (~ E_i > 0 ~ \forall \, i ~) to apply the method.
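The resulting formula is straightforward to evaluate directly. A minimal sketch in Python (the die-roll counts below are invented for illustration):

```python
import math

def g_statistic(observed, expected):
    """G = 2 * sum(O_i * ln(O_i / E_i)); terms with zero observations are dropped."""
    assert abs(sum(observed) - sum(expected)) < 1e-9, "totals must match"
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Example: a die rolled 60 times, with a uniform expectation of 10 per face.
observed = [14, 8, 12, 9, 11, 6]
expected = [10.0] * 6
G = g_statistic(observed, expected)  # ~ 4.297
```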


Distribution and use

Given the null hypothesis that the observed frequencies result from random sampling from a distribution with the given expected frequencies, the distribution of ''G'' is approximately a chi-squared distribution, with the same number of degrees of freedom as in the corresponding chi-squared test.

For very small samples the multinomial test for goodness of fit, Fisher's exact test for contingency tables, or even Bayesian hypothesis selection are preferable to the ''G''-test. McDonald recommends always using an exact test (exact test of goodness-of-fit, Fisher's exact test) if the total sample size is less than 1000.

:There is nothing magical about a sample size of 1000, it's just a nice round number that is well within the range where an exact test, chi-square test, and ''G''–test will give almost identical ''p'' values. Spreadsheets, web-page calculators, and SAS shouldn't have any problem doing an exact test on a sample size of 1000.
:::: — John H. McDonald
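In practice, the ''G'' statistic is referred to a chi-squared distribution with the appropriate degrees of freedom to obtain a p-value. A sketch using SciPy, with hypothetical counts for a 4-category goodness-of-fit test:

```python
from scipy.stats import chi2, power_divergence

# Hypothetical observed counts; expected counts uniform over 4 categories.
observed = [30, 20, 25, 25]
expected = [25.0] * 4

# lambda_=0 selects the log-likelihood ratio statistic, i.e. the G-test.
G, p = power_divergence(observed, expected, lambda_=0)

# Equivalently, refer G to a chi-squared distribution with k - 1 = 3 df.
p_manual = chi2.sf(G, df=len(observed) - 1)
```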


Relation to the chi-squared test

The commonly used chi-squared tests for goodness of fit to a distribution and for independence in contingency tables are in fact approximations of the log-likelihood ratio on which the ''G''-tests are based.

The general formula for Pearson's chi-squared test statistic is

: \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \, .

The approximation of ''G'' by chi-squared is obtained by a second-order Taylor expansion of the natural logarithm around 1. To see this, consider

: G = 2\sum_i O_i \ln\left(\frac{O_i}{E_i}\right) \, ,

and let O_i = E_i + \delta_i with \sum_i \delta_i = 0 ~, so that the total number of counts remains the same. Upon substitution we find

: G = 2\sum_i (E_i + \delta_i) \ln\left(1 + \frac{\delta_i}{E_i}\right) \, .

A Taylor expansion can be performed using \ln(1 + x) = x - \frac{1}{2}x^2 + \mathcal{O}(x^3) . The result is

: G = 2\sum_i (E_i + \delta_i) \left(\frac{\delta_i}{E_i} - \frac{1}{2}\frac{\delta_i^2}{E_i^2} + \mathcal{O}\left(\delta_i^3\right) \right) \, ,

and distributing terms we find

: G = 2\sum_i \left( \delta_i + \frac{1}{2}\frac{\delta_i^2}{E_i} \right) + \mathcal{O}\left(\delta_i^3\right) \, .

Now, using the fact that ~ \sum_i \delta_i = 0 ~ and ~ \delta_i = O_i - E_i ~, we can write the result

: ~ G \approx \sum_i \frac{(O_i - E_i)^2}{E_i} ~.

This shows that G \approx \chi^2 when the observed counts ~ O_i ~ are close to the expected counts ~ E_i ~. When this difference is large, however, the ~ \chi^2 ~ approximation begins to break down. Here, the effects of outliers in data will be more pronounced, and this explains why ~ \chi^2 ~ tests fail in situations with little data.

For samples of a reasonable size, the ''G''-test and the chi-squared test will lead to the same conclusions. However, the approximation to the theoretical chi-squared distribution for the ''G''-test is better than for Pearson's chi-squared test. In cases where ~ O_i > 2 \cdot E_i ~ for some cell, the ''G''-test is always better than the chi-squared test. For testing goodness-of-fit the ''G''-test is infinitely more efficient than the chi-squared test in the sense of Bahadur, but the two tests are equally efficient in the sense of Pitman or in the sense of Hodges and Lehmann.
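The quality of the second-order approximation can be illustrated numerically. In the sketch below the counts are invented: one set deviates slightly from the expected values, the other strongly, so the gap between ''G'' and Pearson's statistic widens:

```python
import math

def g_stat(obs, exp):
    # G = 2 * sum(O_i * ln(O_i / E_i)), dropping zero-observation terms.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

def pearson_chi2(obs, exp):
    # Pearson's chi-squared: sum((O_i - E_i)^2 / E_i).
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

exp = [100.0, 100.0, 100.0]

obs_small = [104, 97, 99]   # small deviations: statistics nearly coincide
obs_large = [160, 60, 80]   # large deviations: Taylor approximation degrades

g_s, c_s = g_stat(obs_small, exp), pearson_chi2(obs_small, exp)
g_l, c_l = g_stat(obs_large, exp), pearson_chi2(obs_large, exp)
```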


Relation to Kullback–Leibler divergence

The ''G''-test statistic is proportional to the Kullback–Leibler divergence of the theoretical distribution from the empirical distribution:

: \begin{align} G &= 2\sum_i O_i \ln\left(\frac{O_i}{E_i}\right) = 2 N \sum_i o_i \ln\left(\frac{o_i}{e_i}\right) \\ &= 2 N \, D_{\mathrm{KL}}(o \parallel e) \, , \end{align}

where ''N'' is the total number of observations and o_i and e_i are the empirical and theoretical frequencies, respectively.
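This identity can be checked numerically. The sketch below uses scipy.stats.entropy, which returns the Kullback–Leibler divergence in nats when given two distributions; the counts are invented:

```python
from scipy.stats import entropy

observed = [30, 20, 25, 25]
expected = [25.0, 25.0, 25.0, 25.0]
N = sum(observed)

o = [x / N for x in observed]   # empirical frequencies o_i
e = [x / N for x in expected]   # theoretical frequencies e_i

# entropy(pk, qk) computes D_KL(pk || qk) in nats, so G = 2 * N * D_KL(o || e).
G_from_kl = 2 * N * entropy(o, e)
```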


Relation to mutual information

For analysis of contingency tables the value of ''G'' can also be expressed in terms of mutual information.

Let

: N = \sum_{ij} O_{ij} \; , \quad \pi_{ij} = \frac{O_{ij}}{N} \; , \quad \pi_{i.} = \frac{\sum_j O_{ij}}{N} \; , \quad \pi_{.j} = \frac{\sum_i O_{ij}}{N} \; .

Then ''G'' can be expressed in several alternative forms:

: G = 2 \cdot N \cdot \sum_{ij} \pi_{ij} \ln\left(\frac{\pi_{ij}}{\pi_{i.}\,\pi_{.j}}\right) \, ,

: G = 2 \cdot N \cdot \left[ H(r) + H(c) - H(r,c) \right] \, ,

: G = 2 \cdot N \cdot \operatorname{MI}(r,c) \, ,

where the entropy of a discrete random variable X \, is defined as

: H(X) = -\sum_{x \,\in\, \operatorname{Supp}(X)} p(x) \ln p(x) \, ,

and where

: \operatorname{MI}(r,c) = H(r) + H(c) - H(r,c) \,

is the mutual information between the row vector ''r'' and the column vector ''c'' of the contingency table.

It can also be shown that the inverse document frequency weighting commonly used for text retrieval is an approximation of ''G'' applicable when the row sum for the query is much smaller than the row sum for the remainder of the corpus. Similarly, the result of Bayesian inference applied to a choice of a single multinomial distribution for all rows of the contingency table taken together, versus the more general alternative of a separate multinomial per row, produces results very similar to the ''G'' statistic.
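As an illustrative check of the mutual-information form, the sketch below computes G = 2·N·MI for a hypothetical 2×2 table and compares it with SciPy's log-likelihood-ratio test of independence (Yates correction disabled so the statistics match exactly):

```python
import math
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of counts.
table = np.array([[10.0, 20.0],
                  [30.0, 40.0]])
N = table.sum()
pij = table / N              # joint cell proportions pi_ij
pi = pij.sum(axis=1)         # row marginals pi_i.
pj = pij.sum(axis=0)         # column marginals pi_.j

# G = 2 * N * MI(r, c), with MI = sum_ij pi_ij * ln(pi_ij / (pi_i. * pi_.j)).
mi = sum(pij[i, j] * math.log(pij[i, j] / (pi[i] * pj[j]))
         for i in range(2) for j in range(2))
G_mi = 2 * N * mi

# Cross-check against SciPy's G-test for independence.
G_scipy, p, dof, _ = chi2_contingency(table, correction=False,
                                      lambda_="log-likelihood")
```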


Application

* The McDonald–Kreitman test in statistical genetics is an application of the ''G''-test.
* Dunning introduced the test to the computational linguistics community, where it is now widely used.


Statistical software

* In R fast implementations can be found in th
AMR
an
Rfast
packages. For the AMR package, the command is g.test which works exactly like chisq.test from base R. R also has th

function in th

package. Note: Fisher's ''G''-test in th
GeneCycle Package
of the
R programming language R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners, bioinform ...
(fisher.g.test) does not implement the ''G''-test as described in this article, but rather Fisher's exact test of Gaussian white-noise in a time series. * Another R implementation to compute the G statistic and corresponding p-values is provided by the R packag
entropy
The commands are Gstat for the standard G statistic and the associated p-value and Gstatindep for the G statistic applied to comparing joint and product distributions to test independence. * In
SAS SAS or Sas may refer to: Arts, entertainment, and media * ''SAS'' (novel series), a French book series by Gérard de Villiers * ''Shimmer and Shine'', an American animated children's television series * Southern All Stars, a Japanese rock ba ...
, one can conduct ''G''-test by applying the /chisq option after the proc freq. * In
Stata Stata (, , alternatively , occasionally stylized as STATA) is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fie ...
, one can conduct a ''G''-test by applying the lr option after the tabulate command. * In
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
, use org.apache.commons.math3.stat.inference.GTest. * In
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
, use scipy.stats.power_divergence with lambda_=0.
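For instance, SciPy's power_divergence yields both the ''G'' statistic (lambda_=0) and Pearson's chi-squared statistic (lambda_=1) from the same call; the counts here are hypothetical:

```python
import math
from scipy.stats import power_divergence

observed = [14, 8, 12, 9, 11, 6]
expected = [10.0] * 6

# lambda_=0: log-likelihood ratio (G) statistic; lambda_=1: Pearson chi-squared.
G, p_g = power_divergence(observed, expected, lambda_=0)
chi2_stat, p_chi2 = power_divergence(observed, expected, lambda_=1)

# Sanity check against the closed-form definition of G.
G_manual = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
```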


References


External links


G2/Log-likelihood calculator