HOME

TheInfoList



OR:

Log-linear analysis is a technique used in
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
to examine the relationship between more than two
categorical variable In statistics, a categorical variable (also called qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or ...
s. The technique is used for both
hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. ...
and model building. In both these uses, models are tested to find the most parsimonious (i.e., least complex) model that best accounts for the variance in the observed frequencies. (A
Pearson's chi-square test Pearson's chi-squared test (\chi^2) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests (e.g. ...
could be used instead of log-linear analysis, but that technique only allows for two of the variables to be compared at a time.)


Fitting criterion

Log-linear analysis uses a
likelihood ratio The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood functi ...
statistic \Chi^2 that has an approximate
chi-square distribution In probability theory and statistics, the chi-squared distribution (also chi-square or \chi^2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. The chi-square ...
when the sample size is large: :\Chi^2 = 2\sum O_ \ln \frac, where :\ln =
natural logarithm The natural logarithm of a number is its logarithm to the base of the mathematical constant , which is an irrational and transcendental number approximately equal to . The natural logarithm of is generally written as , , or sometimes, if ...
; :O_ = observed frequency in cell''ij'' (''i'' = row and ''j'' = column); :E_ = expected frequency in cell''ij''. :\Chi^2 = the deviance for the model.


Assumptions

There are three assumptions in log-linear analysis: 1. The observations are
independent Independent or Independents may refer to: Arts, entertainment, and media Artist groups * Independents (artist group), a group of modernist painters based in the New Hope, Pennsylvania, area of the United States during the early 1930s * Independ ...
and
random In common usage, randomness is the apparent or actual lack of pattern or predictability in events. A random sequence of events, symbols or steps often has no :wikt:order, order and does not follow an intelligible pattern or combination. Ind ...
; 2. Observed frequencies are normally distributed about expected frequencies over repeated samples. This is a good approximation if both (a) the expected frequencies are greater than or equal to 5 for 80% or more of the categories and (b) all expected frequencies are greater than 1. Violations to this assumption result in a large reduction in power. Suggested solutions to this violation are: delete a variable, combine levels of one variable (e.g., put males and females together), or collect more data. 3. The logarithm of the expected value of the response variable is a linear combination of the explanatory variables. This assumption is so fundamental that it is rarely mentioned, but like most linearity assumptions, it is rarely exact and often simply made to obtain a tractable model. Additionally, data should always be categorical. Continuous data can first be converted to categorical data, with some loss of information. With both continuous and categorical data, it would be best to use
logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...
. (Any data that is analysed with log-linear analysis can also be analysed with logistic regression. The technique chosen depends on the research questions.)


Variables

In log-linear analysis there is no clear distinction between what variables are the
independent Independent or Independents may refer to: Arts, entertainment, and media Artist groups * Independents (artist group), a group of modernist painters based in the New Hope, Pennsylvania, area of the United States during the early 1930s * Independ ...
or
dependent A dependant is a person who relies on another as a primary source of income. A common-law spouse who is financially supported by their partner may also be included in this definition. In some jurisdictions, supporting a dependant may enabl ...
variables. The variables are treated the same. However, often the theoretical background of the variables will lead the variables to be interpreted as either the independent or dependent variables.


Models

The goal of log-linear analysis is to determine which model components are necessary to retain in order to best account for the data. Model components are the number of
main effect In the design of experiments and analysis of variance, a main effect is the effect of an independent variable on a dependent variable averaged across the levels of any other independent variables. The term is frequently used in the context of facto ...
s and
interactions Interaction is action that occurs between two or more objects, with broad use in philosophy and the sciences. It may refer to: Science * Interaction hypothesis, a theory of second language acquisition * Interaction (statistics) * Interactions o ...
in the model. For example, if we examine the relationship between three variables—variable A, variable B, and variable C—there are seven model components in the saturated model. The three main effects (A, B, C), the three two-way interactions (AB, AC, BC), and the one three-way interaction (ABC) gives the seven model components. The log-linear models can be thought of to be on a continuum with the two extremes being the simplest model and the
saturated model In mathematical logic, and particularly in its subfield model theory, a saturated model ''M'' is one that realizes as many complete types as may be "reasonably expected" given its size. For example, an ultrapower model of the hyperreals is \al ...
. The simplest model is the model where all the expected frequencies are equal. This is true when the variables are not related. The saturated model is the model that includes all the model components. This model will always explain the data the best, but it is the least parsimonious as everything is included. In this model, observed frequencies equal expected frequencies, therefore in the likelihood ratio chi-square statistic, the ratio \frac=1 and \ln(1)=0. This results in the likelihood ratio chi-square statistic being equal to 0, which is the best model fit. Other possible models are the conditional equiprobability model and the mutual dependence model. Each log-linear model can be represented as a log-linear equation. For example, with the three variables (''A'', ''B'', ''C'') the saturated model has the following log-linear equation: :\ln(F_)=\lambda + \lambda_i^A + \lambda_j^B +\lambda_k^C + \lambda_^ + \lambda_^+ \lambda_^ + \lambda_^, \, where :F_ = expected frequency in cell''ijk''; :\lambda = the relative weight of each variable.


Hierarchical model

Log-linear analysis models can be hierarchical or nonhierarchical. Hierarchical models are the most common. These models contain all the lower order interactions and main effects of the interaction to be examined.


Graphical model

A log-linear model is graphical if, whenever the model contains all two-factor terms generated by a higher-order interaction, the model also contains the higher-order interaction. As a direct-consequence, graphical models are hierarchical. Moreover, being completely determined by its two-factor terms, a graphical model can be represented by an undirected graph, where the vertices represent the variables and the edges represent the two-factor terms included in the model.


Decomposable model

A log-linear model is decomposable if it is graphical and if the corresponding graph is chordal.


Model fit

The model fits well when the residuals (i.e., observed-expected) are close to 0, that is the closer the observed frequencies are to the expected frequencies the better the model fit. If the likelihood ratio chi-square statistic is non-significant, then the model fits well (i.e., calculated expected frequencies are close to observed frequencies). If the likelihood ratio chi-square statistic is significant, then the model does not fit well (i.e., calculated expected frequencies are not close to observed frequencies).
Backward elimination In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of ...
is used to determine which of the model components are necessary to retain in order to best account for the data. Log-linear analysis starts with the saturated model and the highest order interactions are removed until the model no longer accurately fits the data. Specifically, at each stage, after the removal of the highest ordered interaction, the likelihood ratio chi-square statistic is computed to measure how well the model is fitting the data. The highest ordered interactions are no longer removed when the likelihood ratio chi-square statistic becomes significant.


Comparing models

When two models are
nested ''Nested'' is the seventh studio album by Bronx-born singer, songwriter and pianist Laura Nyro, released in 1978 on Columbia Records. Following on from her extensive tour to promote 1976's ''Smile'', which resulted in the 1977 live album '' Seas ...
, models can also be compared using a chi-square difference test. The chi-square difference test is computed by subtracting the likelihood ratio chi-square statistics for the two models being compared. This value is then compared to the chi-square critical value at their difference in degrees of freedom. If the chi-square difference is smaller than the chi-square critical value, the new model fits the data significantly better and is the preferred model. Else, if the chi-square difference is larger than the critical value, the less parsimonious model is preferred.


Follow-up tests

Once the model of best fit is determined, the highest-order interaction is examined by conducting chi-square analyses at different levels of one of the variables. To conduct chi-square analyses, one needs to break the model down into a 2 × 2 or 2 × 1
contingency table In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business i ...
. For example, if one is examining the relationship among four variables, and the model of best fit contained one of the three-way interactions, one would examine its simple two-way interactions at different levels of the third variable.


Effect sizes

To compare effect sizes of the interactions between the variables,
odds ratios An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B. The odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due ...
are used. Odds ratios are preferred over chi-square statistics for two main reasons: 1. Odds ratios are independent of the sample size; 2. Odds ratios are not affected by unequal marginal distributions.


Software


For datasets with a few variables – general log-linear models


R
with th

function of th
MASS
package (se


IBM SPSS Statistics
with th

procedure


For datasets with hundreds of variables – decomposable models


Chordalysis
ref name=Petitjean/>


See also

*
Poisson regression In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable ''Y'' has a Poisson distribution, and assumes the logari ...
*
Log-linear model A log-linear model is a mathematical model that takes the form of a function whose logarithm equals a linear combination of the parameters of the model, which makes it possible to apply (possibly multivariate) linear regression. That is, it has ...


References


Further reading


Log-linear Models
* Simkiss, D.; Ebrahim, G. J.; Waterston, A. J. R. (Eds.) "Chapter 14: Analysing categorical data: Log-linear analysis". ''Journal of Tropical Pediatrics'', online only area, “Research methods II: Multivariate analysis” (pp. 144–153). Retrieved May 2012 from http://www.oxfordjournals.org/tropej/online/ma_chap14.pdf * Pugh, M. D. (1983). "Contributory fault and rape convictions: Log-linear models for blaming the victim". ''Social Psychology Quarterly, 46'', 233–242. * Tabachnick, B. G., & Fidell, L. S. (2007). ''Using Multivariate Statistics (5th ed.).'' New York, NY: Allyn and Bacon. {{Authority control Categorical variable interactions