In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. Contingency tables are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term ''contingency table'' was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the ''Drapers' Company Research Memoirs Biometric Series I'' published in 1904.

A crucial problem of multivariate statistics is finding the (direct-)dependence structure underlying the variables contained in high-dimensional contingency tables. If some of the conditional independences are revealed, then even the storage of the data can be done in a smarter way (see Lauritzen (2002)). To do this, one can use information theory concepts, which draw their information only from the probability distribution, and that distribution is easily estimated from the contingency table by the relative frequencies.

A pivot table is a way to create contingency tables using spreadsheet software.


Example

Suppose there are two variables, sex (male or female) and handedness (right- or left-handed). Further suppose that 100 individuals are randomly sampled from a very large population as part of a study of sex differences in handedness. A contingency table can be created to display the numbers of individuals who are male and right-handed, male and left-handed, female and right-handed, and female and left-handed. In such a table each row corresponds to a sex and each column to a handedness category, with an additional row and column holding the totals.

The numbers of males, females, right-handed individuals, and left-handed individuals are called marginal totals. The grand total (the total number of individuals represented in the contingency table) is the number in the bottom right corner.

The table allows users to see at a glance that the proportion of men who are right-handed is about the same as the proportion of women who are right-handed, although the proportions are not identical. The strength of the association can be measured by the odds ratio, and the population odds ratio estimated by the sample odds ratio. The significance of the difference between the two proportions can be assessed with a variety of statistical tests, including Pearson's chi-squared test, the ''G''-test, Fisher's exact test, Boschloo's test, and Barnard's test, provided the entries in the table represent individuals randomly sampled from the population about which conclusions are to be drawn. If the proportions of individuals in the different columns vary significantly between rows (or vice versa), it is said that there is a ''contingency'' between the two variables. In other words, the two variables are ''not'' independent. If there is no contingency, it is said that the two variables are ''independent''.

The example above is the simplest kind of contingency table, a table in which each variable has only two levels; this is called a 2 × 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher-order contingency tables are difficult to represent visually. The relation between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, although such a practice is rare. For more on the use of a contingency table for the relation between two ordinal variables, see Goodman and Kruskal's gamma.
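
The construction described in this example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `contingency_table` and the cell counts are hypothetical, chosen only so that the sample adds up to 100 individuals.

```python
from collections import Counter

def contingency_table(pairs, row_levels, col_levels):
    """Cross-tabulate (row, column) observations and compute the totals."""
    counts = Counter(pairs)
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    row_totals = [sum(row) for row in table]        # marginal totals per row
    col_totals = [sum(col) for col in zip(*table)]  # marginal totals per column
    grand_total = sum(row_totals)                   # bottom right corner
    return table, row_totals, col_totals, grand_total

# Hypothetical sample of 100 individuals (counts are illustrative only).
observations = (
    [("male", "right")] * 43 + [("male", "left")] * 9
    + [("female", "right")] * 44 + [("female", "left")] * 4
)

table, row_totals, col_totals, grand_total = contingency_table(
    observations, ["male", "female"], ["right", "left"]
)

# Proportion of right-handed individuals within each sex.
right_share = [row[0] / total for row, total in zip(table, row_totals)]
```

Comparing the two entries of `right_share` is the "at a glance" comparison of row proportions that the example describes.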


Standard contents of a contingency table

* Multiple columns (historically, they were designed to use up all the white space of a printed page). Where each row refers to a specific sub-group in the population (in this case men or women), the columns are sometimes referred to as ''banner points'' or ''cuts'' (and the rows are sometimes referred to as ''stubs'').
* Significance tests. Typically, either ''column comparisons'', which test for differences between columns and display these results using letters, or ''cell comparisons'', which use color or arrows to identify a cell in a table that stands out in some way.
* ''Nets'' or ''netts'', which are sub-totals.
* One or more of: percentages, row percentages, column percentages, indexes or averages.
* Unweighted sample sizes (counts).


Measures of association

The degree of association between the two variables can be assessed by a number of coefficients. The following subsections describe a few of them. For a more complete discussion of their uses, see the main articles linked under each subsection heading.


Odds ratio

The simplest measure of association for a 2 × 2 contingency table is the odds ratio. Given two events, A and B, the odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A and the odds of B in the absence of A. Two events are independent if and only if the odds ratio is 1; if the odds ratio is greater than 1, the events are positively associated; if the odds ratio is less than 1, the events are negatively associated.

The odds ratio has a simple expression in terms of probabilities; given the joint probability distribution:

: \begin{array}{c|cc} & B = 1 & B = 0 \\ \hline A = 1 & p_{11} & p_{10} \\ A = 0 & p_{01} & p_{00} \end{array}

the odds ratio is:

: OR = \frac{p_{11} p_{00}}{p_{10} p_{01}}.
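
As a small sketch of this definition, the function below computes the cross-product ratio of the four cells. The function name `odds_ratio` and the example values are hypothetical. Cell counts can be passed in place of probabilities, because the common factor cancels in the ratio.

```python
def odds_ratio(p11, p10, p01, p00):
    """Odds ratio of a 2 x 2 table: (p11 * p00) / (p10 * p01).

    Raises ZeroDivisionError if p10 or p01 is zero.
    """
    return (p11 * p00) / (p10 * p01)

# Independent events: the joint probabilities factorise, so the ratio is 1.
independent = odds_ratio(0.2, 0.2, 0.3, 0.3)

# Hypothetical cell counts (illustrative only), passed instead of probabilities.
sample_or = odds_ratio(43, 9, 44, 4)
```

Here `independent` equals 1, and `sample_or` is below 1, which under the definition above indicates a negative association between the first row and the first column.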


Phi coefficient

A simple measure, applicable only to the case of 2 × 2 contingency tables, is the phi coefficient (φ) defined by

: \phi = \pm\sqrt{\frac{\chi^2}{N}},

where χ² is computed as in Pearson's chi-squared test, and ''N'' is the grand total of observations. φ varies from 0 (corresponding to no association between the variables) to 1 or −1 (complete association or complete inverse association), provided it is based on frequency data represented in 2 × 2 tables. Its sign equals the sign of the product of the main diagonal elements of the table minus the product of the off-diagonal elements. φ takes on the minimum value −1.0 or the maximum value +1.0 if and only if two diagonally opposite cells are empty; for fixed marginal totals, both extremes are attainable only when every marginal proportion is equal to 0.5.
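
The definition can be illustrated with a short sketch. It assumes the usual closed form of Pearson's chi-squared statistic for a 2 × 2 table without continuity correction; the function name `phi_coefficient` and the cell labels `a`, `b`, `c`, `d` are hypothetical.

```python
import math

def phi_coefficient(a, b, c, d):
    """Signed phi for the 2 x 2 table [[a, b], [c, d]].

    Assumes every row and column total is positive.
    """
    n = a + b + c + d
    cross = a * d - b * c  # main-diagonal product minus off-diagonal product
    # Pearson chi-squared for a 2 x 2 table, without continuity correction.
    chi2 = n * cross ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    magnitude = math.sqrt(chi2 / n)
    return magnitude if cross >= 0 else -magnitude

complete = phi_coefficient(30, 0, 0, 70)    # off-diagonal cells empty
inverse = phi_coefficient(0, 50, 50, 0)     # main-diagonal cells empty
no_assoc = phi_coefficient(25, 25, 25, 25)  # rows have identical proportions
```

The three calls give +1, −1, and 0 respectively, matching the range described above.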


Cramér's ''V'' and the contingency coefficient ''C''

Two alternatives are the ''contingency coefficient'' ''C'' and Cramér's ''V''. The formulae for the ''C'' and ''V'' coefficients are:

: C = \sqrt{\frac{\chi^2}{N + \chi^2}}

and

: V = \sqrt{\frac{\chi^2}{N(k-1)}},

''k'' being the number of rows or the number of columns, whichever is less.

''C'' suffers from the disadvantage that it does not reach a maximum of 1.0; notably, the highest it can reach in a 2 × 2 table is 0.707. It can reach values closer to 1.0 in contingency tables with more categories; for example, it can reach a maximum of 0.866 in a 4 × 4 table. It should, therefore, not be used to compare associations in different tables if they have different numbers of categories. ''C'' can be adjusted so it reaches a maximum of 1.0 when there is complete association in a table of any number of rows and columns by dividing ''C'' by \sqrt{\frac{k-1}{k}}, where ''k'' is the number of rows or columns, when the table is square, or by \sqrt[4]{\frac{r-1}{r} \times \frac{c-1}{c}}, where ''r'' is the number of rows and ''c'' is the number of columns.
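
The following sketch evaluates these formulae for a table given as a list of rows. It is an illustration under stated assumptions: the function names are hypothetical, Pearson's chi-squared statistic is computed from expected counts without any correction, and every row and column total is assumed to be positive.

```python
import math

def chi_squared(table):
    """Return (Pearson chi-squared statistic, grand total N) for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def contingency_c(table):
    chi2, n = chi_squared(table)
    return math.sqrt(chi2 / (n + chi2))

def cramers_v(table):
    chi2, n = chi_squared(table)
    k = min(len(table), len(table[0]))  # smaller of rows and columns
    return math.sqrt(chi2 / (n * (k - 1)))

def adjusted_c(table):
    """C divided by its maximum for an r x c table."""
    r, c = len(table), len(table[0])
    return contingency_c(table) / ((r - 1) / r * (c - 1) / c) ** 0.25

# Hypothetical tables with complete association.
perfect_2x2 = [[50, 0], [0, 50]]
perfect_4x4 = [[25 if i == j else 0 for j in range(4)] for i in range(4)]
```

For both example tables `cramers_v` and `adjusted_c` return 1 (up to rounding), while `contingency_c` stays below 1 and differs between the two table sizes, which is the comparability problem described above.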


Tetrachoric correlation coefficient

Another choice is the tetrachoric correlation coefficient, but it is only applicable to 2 × 2 tables. Polychoric correlation is an extension of the tetrachoric correlation to tables involving variables with more than two levels. Tetrachoric correlation assumes that the variable underlying each dichotomous measure is normally distributed. The coefficient provides "a convenient measure of [the Pearson product-moment] correlation when graduated measurements have been reduced to two categories." The tetrachoric correlation coefficient should not be confused with the Pearson correlation coefficient computed by assigning, say, values 0.0 and 1.0 to represent the two levels of each variable (which is mathematically equivalent to the φ coefficient).


Lambda coefficient

The lambda coefficient is a measure of the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0.0 (no association) to 1.0 (the maximum possible association). Asymmetric lambda measures the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions.
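
The "percentage improvement in prediction" can be made concrete with a sketch. It assumes the usual Goodman and Kruskal formulation, which the text above does not spell out: errors made by always guessing the most frequent category of the dependent variable are compared with errors made by guessing its most frequent category within each level of the other variable. The function names and counts are hypothetical, and the columns are treated as the dependent variable.

```python
def prediction_errors(table):
    """Errors when predicting the column category, without and with the row."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    without_rows = n - max(col_totals)                     # always guess the modal column
    with_rows = sum(sum(row) - max(row) for row in table)  # guess the modal column per row
    return without_rows, with_rows

def lambda_asymmetric(table):
    """Proportional reduction in error for predicting columns from rows.

    Undefined (division by zero) if all observations fall in one column.
    """
    without_rows, with_rows = prediction_errors(table)
    return (without_rows - with_rows) / without_rows

def lambda_symmetric(table):
    """Pooled reduction in error when predicting in both directions."""
    transposed = [list(col) for col in zip(*table)]
    w1, e1 = prediction_errors(table)
    w2, e2 = prediction_errors(transposed)
    return ((w1 - e1) + (w2 - e2)) / (w1 + w2)

# Hypothetical counts, for illustration only.
example = [[40, 10], [5, 45]]
lam_cols = lambda_asymmetric(example)
lam_both = lambda_symmetric(example)
```

In this example, knowing the row reduces the column prediction errors from 45 to 15, so `lam_cols` is 30/45.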


Uncertainty coefficient

The uncertainty coefficient, or Theil's U, is another measure for variables at the nominal level. Its values range from 0.0 (no association) to 1.0 (complete association, in which one variable is fully determined by the other). Because it is based on conditional entropy, the uncertainty coefficient is a conditional, asymmetric measure of association, which can be expressed as

: U(X, Y) \neq U(Y, X).

This asymmetry can lead to insights not as evident in symmetrical measures of association.
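
The asymmetry can be seen in a small sketch. It assumes the entropy-based definition of Theil's U, namely the mutual information of the two variables divided by the entropy of the variable being predicted, with probabilities estimated by relative frequencies. The function names and the counts are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability cells contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_coefficient(table):
    """U of the row variable given the column variable: I(R; C) / H(R).

    Undefined (division by zero) if all observations fall in one row.
    """
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    joint_p = [cell / n for row in table for cell in row]
    mutual_info = entropy(row_p) + entropy(col_p) - entropy(joint_p)
    return mutual_info / entropy(row_p)

def transpose(table):
    return [list(col) for col in zip(*table)]

# Hypothetical counts: each column determines the row, but not vice versa.
counts = [[30, 0, 0], [0, 30, 40]]
u_rows_given_cols = uncertainty_coefficient(counts)
u_cols_given_rows = uncertainty_coefficient(transpose(counts))
```

Here the column category identifies the row exactly, so `u_rows_given_cols` is 1 up to rounding, whereas `u_cols_given_rows` is clearly smaller because the second row still leaves two possible columns.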


Others

* Gamma test: No adjustment for either table size or ties.
* Kendall's tau: Adjustment for ties.
** Tau-b: Used for square tables.
** Tau-c: Used for rectangular tables.


See also

* Confusion matrix
* Pivot table, in spreadsheet software, cross-tabulates sampling data with counts (contingency table) and/or sums.
* TPL Tables is a tool for generating and printing crosstabs.
* The iterative proportional fitting procedure essentially manipulates contingency tables to match altered joint distributions or marginal sums.
* Multivariate statistics, in particular multivariate discrete probability distributions. Some procedures used in this context can be used in dealing with contingency tables.
* OLAP cube, a modern multidimensional computing form of contingency tables
* Panel data, multidimensional data over time


References


Further reading

* Andersen, Erling B. 1980. ''Discrete Statistical Models with Social Science Applications''. North Holland.


External links


* On-line analysis of contingency tables: calculator with examples
* http://statpages.org/ctab2x2.html Fisher and chi-squared calculator of 2 × 2 contingency table
* More Correlation Coefficients, March 24, 2008, G. David Garson, North Carolina State University
* CustomInsight.com Cross Tabulation
* StATS: Steve's Attempt to Teach Statistics. Odds ratio versus relative risk (January 9, 2001)
* Epi Info Community Health Assessment Tutorial Lesson 5 Analysis: Creating Statistics