
A chi-squared test (also chi-square or <math>\chi^2</math> test) is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables (''two dimensions of the contingency table'') are independent in influencing the test statistic (''values within the table'').
The test is valid when the test statistic is chi-squared distributed under the null hypothesis; this applies specifically to Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. For contingency tables with smaller sample sizes, Fisher's exact test is used instead.
In the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a <math>\chi^2</math> frequency distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.
Test statistics that follow a <math>\chi^2</math> distribution occur when the observations are independent. There are also <math>\chi^2</math> tests for testing the null hypothesis of independence of a pair of random variables based on observations of the pairs.
''Chi-squared tests'' often refers to tests for which the distribution of the test statistic approaches the <math>\chi^2</math> distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase.
History
In the 19th century, statistical analytical methods were mainly applied in biological data analysis, and it was customary for researchers, such as Sir George Airy and Mansfield Merriman, to assume that observations followed a normal distribution. Their works were criticized by Karl Pearson in his 1900 paper.
At the end of the 19th century, Pearson noticed the existence of significant skewness within some biological observations. In order to model the observations regardless of whether they were normal or skewed, Pearson, in a series of articles published from 1893 to 1916, devised the Pearson distribution, a family of continuous probability distributions that includes the normal distribution and many skewed distributions. He also proposed a method of statistical analysis that consists of using the Pearson distribution to model the observations and performing a test of goodness of fit to determine how well the model fits them.
Pearson's chi-squared test
In 1900, Pearson published a paper on the <math>\chi^2</math> test which is considered to be one of the foundations of modern statistics. In this paper, Pearson investigated a test of goodness of fit.
Suppose that <math>n</math> observations in a random sample from a population are classified into <math>k</math> mutually exclusive classes with respective observed numbers of observations <math>x_i</math> (for <math>i = 1, 2, \ldots, k</math>), and a null hypothesis gives the probability <math>p_i</math> that an observation falls into the <math>i</math>th class. So we have the expected numbers <math>m_i = n p_i</math> for all <math>i</math>, where
: <math>\sum_{i=1}^{k} p_i = 1, \qquad \sum_{i=1}^{k} m_i = n \sum_{i=1}^{k} p_i = n = \sum_{i=1}^{k} x_i</math>
Pearson proposed that, if the null hypothesis is correct, the limiting distribution of the quantity given below as <math>n \to \infty</math> is the <math>\chi^2</math> distribution.
: <math>X^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i} = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - n</math>
Pearson dealt first with the case in which the expected numbers <math>m_i</math> are large enough known numbers in all cells, assuming every observation <math>x_i</math> may be taken as normally distributed, and reached the result that, in the limit as <math>n</math> becomes large, <math>X^2</math> follows the <math>\chi^2</math> distribution with <math>k - 1</math> degrees of freedom.
However, Pearson next considered the case in which the expected numbers depended on parameters that had to be estimated from the sample, and suggested that, with <math>m_i</math> denoting the true expected numbers and <math>m'_i</math> the estimated expected numbers, the difference
: <math>X^2 - {X'}^2 = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - \sum_{i=1}^{k} \frac{x_i^2}{m'_i}</math>
will usually be positive and small enough to be omitted. In conclusion, Pearson argued that if we regarded <math>{X'}^2</math> as also following the <math>\chi^2</math> distribution with <math>k - 1</math> degrees of freedom, the error in this approximation would not affect practical decisions. This conclusion caused some controversy in practical applications and was not settled for 20 years until Fisher's 1922 and 1924 papers.
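As an illustration of the goodness-of-fit statistic, the following minimal Python sketch computes Pearson's statistic in two ways: as the sum of squared deviations divided by the expected numbers, and as the sum of squared observations divided by the expected numbers, minus the sample size. The function names and the die-roll counts are hypothetical and chosen only for this example; they do not come from Pearson's paper.

```python
def pearson_statistic(observed, probabilities):
    """Sum over classes of (x_i - m_i)^2 / m_i, with m_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))


def pearson_statistic_shortcut(observed, probabilities):
    """Equivalent form: sum over classes of x_i^2 / m_i, minus n."""
    n = sum(observed)
    return sum(x * x / (n * p) for x, p in zip(observed, probabilities)) - n


# Hypothetical data: 60 rolls of a die, null hypothesis of a fair die.
rolls = [5, 8, 9, 8, 10, 20]
fair = [1 / 6] * 6

x2 = pearson_statistic(rolls, fair)
x2_shortcut = pearson_statistic_shortcut(rolls, fair)
print(x2, x2_shortcut)  # both forms agree up to rounding error
```

With six classes and no estimated parameters, the statistic would be compared against a chi-squared distribution with 6 − 1 = 5 degrees of freedom.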
Other examples of chi-squared tests
One test statistic that follows a chi-squared distribution exactly is the test that the variance of a normally distributed population has a given value based on a sample variance. Such tests are uncommon in practice because the true variance of the population is usually unknown. However, there are several statistical tests where the chi-squared distribution is approximately valid:
Fisher's exact test
For an exact test used in place of the 2 × 2 chi-squared test for independence when all the row and column totals were fixed by design, see Fisher's exact test. When the row or column margins (or both) are random variables (as in most common research designs) this tends to be overly conservative and underpowered.
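A minimal Python sketch of such an exact test for a 2 × 2 table follows. It treats all margins as fixed, so the top-left cell follows a hypergeometric distribution, and it uses one common two-sided convention (summing the probabilities of all tables that are no more probable than the observed one). The function name and the example table are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided exact p-value for a 2x2 table with all margins held fixed."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical table with every row and column total equal to 4.
p_value = fisher_exact_two_sided([[3, 1], [1, 3]])
print(p_value)
```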
Binomial test
For an exact test used in place of the 2 × 1 chi-squared test for goodness of fit, see binomial test.
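The idea can be sketched in Python as follows. The sketch assumes a two-sided test that sums the binomial probabilities of all outcomes no more probable than the observed one, which is one of several conventions in use; the function name and the example counts are hypothetical.

```python
from math import comb


def binomial_test_two_sided(successes, trials, p=0.5):
    """Exact two-sided p-value for observing `successes` in `trials` tries."""

    def pmf(i):
        # Binomial probability of exactly i successes.
        return comb(trials, i) * p ** i * (1 - p) ** (trials - i)

    p_observed = pmf(successes)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(pmf(i) for i in range(trials + 1)
               if pmf(i) <= p_observed * (1 + 1e-9))


# Hypothetical data: 8 observations in the first category out of 10,
# tested against equal category probabilities.
p_value = binomial_test_two_sided(8, 10, p=0.5)
print(p_value)
```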
Other chi-squared tests
* Cochran–Mantel–Haenszel chi-squared test
* McNemar's test, used in certain 2 × 2 tables with pairing
* Tukey's test of additivity
* The portmanteau test in time-series analysis, testing for the presence of autocorrelation
* Likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one)
Yates's correction for continuity
Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial frequencies in the table can be approximated by the continuous chi-squared distribution. This assumption is not quite correct and introduces some error.
To reduce the error in approximation, Frank Yates suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the absolute difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its ''p''-value.
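The effect of the correction can be sketched in Python for a 2 × 2 table. The function names and the example counts are hypothetical. The ''p''-value helper relies on the fact that, for one degree of freedom, the upper tail of the chi-squared distribution equals erfc(√(x/2)), so it applies only to tables with one degree of freedom.

```python
import math


def chi2_2x2(table, yates=False):
    """Pearson statistic for a 2x2 table, optionally with Yates's correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff -= 0.5  # continuity correction described above
            stat += diff ** 2 / expected
    return stat


def p_value_one_dof(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts; every expected value is 15.
table = [[10, 20], [20, 10]]
plain = chi2_2x2(table)
corrected = chi2_2x2(table, yates=True)
print(plain, p_value_one_dof(plain))
print(corrected, p_value_one_dof(corrected))
```

In this example the corrected statistic is smaller than the uncorrected one, so its ''p''-value is larger.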
Chi-squared test for variance in a normal population
If a sample of size <math>n</math> is taken from a population having a normal distribution, then there is a result (see distribution of the sample variance) which allows a test to be made of whether the variance of the population has a pre-determined value. For example, a manufacturing process might have been in stable condition for a long period, allowing a value for the variance to be determined essentially without error. Suppose that a variant of the process is being tested, giving rise to a small sample of <math>n</math> product items whose variation is to be tested. The test statistic <math>T</math> in this instance could be set to be the sum of squares about the sample mean, divided by the nominal value for the variance (i.e. the value to be tested as holding). Then <math>T</math> has a chi-squared distribution with <math>n - 1</math> degrees of freedom. For example, if the sample size is 21, the acceptance region for <math>T</math> with a significance level of 5% is between 9.59 and 34.17.
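A minimal Python sketch of this test is given below. It computes the sum of squares about the sample mean divided by the nominal variance and compares it with the acceptance region quoted above for a sample of 21 items. The function names, the sample values and the nominal variance of 38.5 are hypothetical, and the bounds 9.59 and 34.17 are valid only for a sample size of 21 and a 5% significance level.

```python
def variance_test_statistic(sample, nominal_variance):
    """Sum of squares about the sample mean, divided by the nominal variance."""
    n = len(sample)
    mean = sum(sample) / n
    sum_of_squares = sum((x - mean) ** 2 for x in sample)
    return sum_of_squares / nominal_variance


# Acceptance region from the text: 20 degrees of freedom, 5% significance level.
LOWER_BOUND = 9.59
UPPER_BOUND = 34.17


def within_acceptance_region(t, lower=LOWER_BOUND, upper=UPPER_BOUND):
    """True when the statistic gives no reason to reject the nominal variance."""
    return lower <= t <= upper


# Hypothetical sample of 21 measurements and a hypothetical nominal variance.
sample = [float(v) for v in range(21)]
t = variance_test_statistic(sample, nominal_variance=38.5)
print(t, within_acceptance_region(t))
```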
Example chi-squared test for categorical data
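Suppose there is a city of 1,000,000 residents with four neighborhoods: <math>A</math>, <math>B</math>, <math>C</math>, and <math>D</math>. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as:
{| class="wikitable"
! !! ''A'' !! ''B'' !! ''C'' !! ''D'' !! Total
|-
! White collar
| 90 || 60 || 104 || 95 || 349
|-
! Blue collar
| 30 || 50 || 51 || 20 || 151
|-
! No collar
| 30 || 40 || 45 || 35 || 150
|-
! Total
| 150 || 150 || 200 || 150 || 650
|}
Let us take the sample living in neighborhood <math>A</math>, 150, to estimate what proportion of the whole 1,000,000 live in neighborhood <math>A</math>. Similarly we take <math>\tfrac{349}{650}</math> to estimate what proportion of the 1,000,000 are white-collar workers. By the assumption of independence under the hypothesis we should "expect" the number of white-collar workers in neighborhood <math>A</math> to be
: <math>150 \times \frac{349}{650} \approx 80.54</math>
Then in that "cell" of the table, we have
: <math>\frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(90 - 80.54)^2}{80.54} \approx 1.11</math>
The sum of these quantities over all of the cells is the test statistic; in this case, <math>\approx 24.57</math>. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is
: <math>(\text{number of rows} - 1)(\text{number of columns} - 1) = (3 - 1)(4 - 1) = 6</math>
If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.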
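The calculation for this example can be sketched in Python. The cell counts below are an assumption of the sketch, chosen to agree with the totals quoted in the example (150 sampled residents in the first neighborhood, 650 overall), and the function names are hypothetical. The tail-probability helper uses a closed form that holds only when the number of degrees of freedom is even, which is the case here.

```python
import math


def chi2_independence(table):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_upper_tail_even_dof(x, dof):
    """Upper-tail probability of a chi-squared variable; even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even dof")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Rows: white collar, blue collar, no collar; columns: the four neighborhoods.
counts = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]
statistic, dof = chi2_independence(counts)
p_value = chi2_upper_tail_even_dof(statistic, dof)
print(round(statistic, 2), dof, p_value)
```

With these counts the ''p''-value falls well below 0.05, so the null hypothesis of independence would be rejected at the 5% level.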
A related issue is a test of homogeneity. Suppose that instead of giving every resident of each of the four neighborhoods an equal chance of inclusion in the sample, we decide in advance how many residents of each neighborhood to include. Then each resident has the same chance of being chosen as do all residents of the same neighborhood, but residents of different neighborhoods would have different probabilities of being chosen if the four sample sizes are not proportional to the populations of the four neighborhoods. In such a case, we would be testing "homogeneity" rather than "independence". The question is whether the proportions of blue-collar, white-collar, and no-collar workers in the four neighborhoods are the same. However, the test is done in the same way.
Applications
In cryptanalysis, the chi-squared test is used to compare the distribution of plaintext and (possibly) decrypted ciphertext. The lowest value of the test means that the decryption was successful with high probability. This method can be generalized for solving modern cryptographic problems.
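As a toy illustration of this use, the Python sketch below tries every shift of a Caesar cipher and keeps the one whose letter counts give the lowest chi-squared value against typical English letter frequencies. The choice of a Caesar cipher, the function names and the sample sentence are assumptions made for the example, and the frequency table holds approximate, commonly quoted percentages rather than authoritative values.

```python
import string

# Approximate English letter frequencies in percent (illustrative values).
ENGLISH_FREQ_PERCENT = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}


def chi2_against_english(text):
    """Chi-squared value of the letter counts in `text` versus English."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    if not letters:
        raise ValueError("text contains no letters")
    n = len(letters)
    total = sum(ENGLISH_FREQ_PERCENT.values())
    stat = 0.0
    for letter, percent in ENGLISH_FREQ_PERCENT.items():
        expected = n * percent / total
        observed = letters.count(letter)
        stat += (observed - expected) ** 2 / expected
    return stat


def caesar_shift(text, shift):
    """Shift lowercase letters by `shift` positions; leave other characters."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            shifted.append(ch)
    return "".join(shifted)


def best_caesar_shift(ciphertext):
    """Return the shift whose trial decryption has the lowest chi-squared value."""
    return min(range(26),
               key=lambda s: chi2_against_english(caesar_shift(ciphertext, -s)))


plaintext = ("it was the best of times it was the worst of times "
             "it was the age of wisdom it was the age of foolishness")
ciphertext = caesar_shift(plaintext, 3)
recovered_shift = best_caesar_shift(ciphertext)
print(recovered_shift, caesar_shift(ciphertext, -recovered_shift))
```

Short or unusual texts can defeat this approach, because their letter counts may differ strongly from the reference frequencies.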
In bioinformatics, the chi-squared test is used to compare the distribution of certain properties of genes (e.g., genomic content, mutation rate, interaction network clustering, etc.) belonging to different categories (e.g., disease genes, essential genes, genes on a certain chromosome, etc.).
See also
* Chi-squared test nomogram
* GEH statistic
* ''G''-test
* Minimum chi-square estimation
* Nonparametric statistics
* Wald test
* Wilson score interval