Log-linear analysis

Log-linear analysis is a technique used in statistics to examine the relationship between more than two categorical variables. The technique is used for both hypothesis testing and model building. In both these uses, models are tested to find the most parsimonious (i.e., least complex) model that best accounts for the variance in the observed frequencies. (Pearson's chi-square test could be used instead of log-linear analysis, but that technique only allows two of the variables to be compared at a time. [1])

Fitting criterion

Log-linear analysis uses a likelihood ratio statistic that has an approximate chi-square distribution when the sample size is large: [2]

$$\Delta = 2\sum_{ij} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)$$

where

$\ln$ = natural logarithm;
$O_{ij}$ = observed frequency in cell $ij$ ($i$ = row and $j$ = column);
$E_{ij}$ = expected frequency in cell $ij$;
$\Delta$ = the deviance for the model. [3]
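As an illustration, the statistic can be computed directly from a table of observed counts. The sketch below (in Python, with entirely made-up counts) computes the deviance for the independence model of a two-way table and compares it to the chi-square distribution.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2 x 3 table of observed counts (illustrative values only).
observed = np.array([[30.0, 20.0, 10.0],
                     [20.0, 25.0, 15.0]])

# Expected counts under independence of rows and columns:
# E_ij = (row total * column total) / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Likelihood ratio (deviance) statistic: 2 * sum of O_ij * ln(O_ij / E_ij).
deviance = 2.0 * np.sum(observed * np.log(observed / expected))

# For an I x J independence model the degrees of freedom are (I - 1)(J - 1).
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(deviance, df)

print(f"deviance = {deviance:.3f}, df = {df}, p = {p_value:.3f}")
```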

Assumptions

There are three assumptions in log-linear analysis: [2]

1. The observations are independent and random;

2. Observed frequencies are normally distributed about expected frequencies over repeated samples. This is a good approximation if both (a) the expected frequencies are greater than or equal to 5 for 80% or more of the categories and (b) all expected frequencies are greater than 1 (a quick check of this rule of thumb is sketched at the end of this section). Violations of this assumption result in a large reduction in power. Suggested solutions to this violation are: delete a variable, combine levels of one variable (e.g., put males and females together), or collect more data.

3. The logarithm of the expected value of the response variable is a linear combination of the explanatory variables. This assumption is so fundamental that it is rarely mentioned, but like most linearity assumptions, it is rarely exact and often simply made to obtain a tractable model.

Additionally, the data should always be categorical. Continuous data can first be converted to categorical data, with some loss of information. If the data contain both continuous and categorical variables, logistic regression is usually the better choice. (Any data that can be analysed with log-linear analysis can also be analysed with logistic regression; the technique chosen depends on the research questions.)
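Assumption 2 above can be checked programmatically. A minimal sketch follows; the function name and the example counts are hypothetical, and the input is assumed to be the table of expected cell frequencies.

```python
import numpy as np

def check_expected_frequencies(expected):
    """Rule-of-thumb check for assumption 2: at least 80% of expected
    cell frequencies should be >= 5 and all should be greater than 1."""
    expected = np.asarray(expected, dtype=float)
    share_at_least_5 = np.mean(expected >= 5)
    all_greater_than_1 = np.all(expected > 1)
    return share_at_least_5 >= 0.80 and all_greater_than_1

# Example with hypothetical expected counts.
print(check_expected_frequencies([[12.4, 7.1, 5.8], [9.6, 5.9, 3.2]]))
```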

Variables

In log-linear analysis there is no clear distinction between independent and dependent variables; all variables are treated alike. However, the theoretical background of the variables often leads them to be interpreted as independent or dependent. [1]

Models

The goal of log-linear analysis is to determine which model components need to be retained in order to best account for the data. The model components are the main effects and interactions in the model. For example, if we examine the relationship between three variables (A, B, and C), there are seven model components in the saturated model: the three main effects (A, B, C), the three two-way interactions (AB, AC, BC), and the one three-way interaction (ABC).

Log-linear models can be thought of as lying on a continuum with the two extremes being the simplest model and the saturated model. The simplest model is the model in which all the expected frequencies are equal; this is true when the variables are not related. The saturated model is the model that includes all the model components. This model will always explain the data the best, but it is the least parsimonious, as everything is included. In the saturated model the observed frequencies equal the expected frequencies, so in the likelihood ratio chi-square statistic the ratio $O_{ij}/E_{ij} = 1$ and $\ln(O_{ij}/E_{ij}) = 0$. This results in the likelihood ratio chi-square statistic being equal to 0, which is the best possible model fit. [2] Other possible models are the conditional equiprobability model and the mutual dependence model. [1]

Each log-linear model can be represented as a log-linear equation. For example, with the three variables (A, B, C) the saturated model has the following log-linear equation: [1]

$$\ln F_{ijk} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

where

$F_{ijk}$ = the expected frequency in cell $ijk$;
$\lambda$ = the relative weight of each variable.
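In practice, log-linear models of this kind are often fitted as Poisson regressions on the cell counts. The sketch below uses statsmodels to fit the saturated model to an invented 2 × 2 × 2 frequency table; the variable names and counts are purely illustrative.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical frequency table: one row per cell of a 2 x 2 x 2 cross-classification.
data = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "C": ["c1", "c2", "c1", "c2", "c1", "c2", "c1", "c2"],
    "count": [25, 17, 14, 22, 18, 31, 27, 12],
})

# Saturated model: all main effects, two-way interactions and the three-way interaction.
saturated = smf.glm("count ~ C(A) * C(B) * C(C)", data=data,
                    family=sm.families.Poisson()).fit()

# The saturated model reproduces the observed counts exactly, so its deviance is ~0.
print(saturated.deviance)
```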

Hierarchical model

Log-linear analysis models can be hierarchical or nonhierarchical. Hierarchical models are the most common. These models contain all the lower order interactions and main effects of the interaction to be examined. [1]

Graphical model

A log-linear model is graphical if, whenever the model contains all two-factor terms generated by a higher-order interaction, the model also contains the higher-order interaction. [4] As a direct consequence, graphical models are hierarchical. Moreover, being completely determined by its two-factor terms, a graphical model can be represented by an undirected graph, where the vertices represent the variables and the edges represent the two-factor terms included in the model.

Decomposable model

A log-linear model is decomposable if it is graphical and if the corresponding graph is chordal.
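A minimal sketch of this check, using networkx on an invented edge set representing the two-factor terms of a hypothetical model:

```python
import networkx as nx

# Undirected graph of a hypothetical graphical model: vertices are variables,
# edges are the two-factor terms included in the model.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")])

# A graphical log-linear model is decomposable when this graph is chordal.
print(nx.is_chordal(G))   # True for this edge set
```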

Model fit

The model fits well when the residuals (i.e., observed minus expected frequencies) are close to 0; that is, the closer the observed frequencies are to the expected frequencies, the better the model fit. If the likelihood ratio chi-square statistic is non-significant, the model fits well (i.e., the calculated expected frequencies are close to the observed frequencies). If the likelihood ratio chi-square statistic is significant, the model does not fit well (i.e., the calculated expected frequencies are not close to the observed frequencies).

Backward elimination is used to determine which of the model components are necessary to retain in order to best account for the data. Log-linear analysis starts with the saturated model, and the highest-order interactions are removed until the model no longer accurately fits the data. Specifically, at each stage, after the removal of the highest-order interaction, the likelihood ratio chi-square statistic is computed to measure how well the model fits the data. Interactions are no longer removed when the likelihood ratio chi-square statistic becomes significant. [2]
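A minimal sketch of this procedure, reusing the hypothetical cell-count DataFrame `data` from the earlier statsmodels sketch (the formulas and the 0.05 cutoff are illustrative):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Candidate hierarchical models, from the saturated model downwards.
formulas = [
    "count ~ C(A) * C(B) * C(C)",                                      # saturated
    "count ~ C(A) + C(B) + C(C) + C(A):C(B) + C(A):C(C) + C(B):C(C)",  # ABC removed
    "count ~ C(A) + C(B) + C(C)",                                      # main effects only
]

retained = formulas[0]
for formula in formulas[1:]:
    fit = smf.glm(formula, data=data, family=sm.families.Poisson()).fit()
    p_value = chi2.sf(fit.deviance, fit.df_resid)
    if p_value <= 0.05:     # the simplified model no longer fits: stop removing terms
        break
    retained = formula      # the simplified model still fits: keep simplifying

print("retained model:", retained)
```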

Comparing models

When two models are nested, they can also be compared using a chi-square difference test. The chi-square difference test is computed by subtracting the likelihood ratio chi-square statistics of the two models being compared. This value is then compared to the chi-square critical value at their difference in degrees of freedom. If the chi-square difference is smaller than the chi-square critical value, the simpler model fits the data about as well as the more complex model, and the more parsimonious model is preferred. If the chi-square difference is larger than the critical value, the less parsimonious (more complex) model is preferred. [1]
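A minimal sketch of the difference test, again reusing the hypothetical `data` and the `saturated` fit from the earlier sketches:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# The reduced model omits the three-way interaction and is therefore
# nested within the saturated model.
reduced = smf.glm("count ~ C(A) + C(B) + C(C) + C(A):C(B) + C(A):C(C) + C(B):C(C)",
                  data=data, family=sm.families.Poisson()).fit()

# Difference in likelihood ratio statistics and in residual degrees of freedom.
delta_g2 = reduced.deviance - saturated.deviance
delta_df = reduced.df_resid - saturated.df_resid

# A non-significant difference means the simpler (more parsimonious) model is preferred.
p_value = chi2.sf(delta_g2, delta_df)
print(f"difference = {delta_g2:.3f}, df = {delta_df}, p = {p_value:.3f}")
```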

Follow-up tests

Once the model of best fit is determined, the highest-order interaction is examined by conducting chi-square analyses at different levels of one of the variables. To conduct chi-square analyses, one needs to break the model down into a 2 × 2 or 2 × 1 contingency table. [2]

For example, if one is examining the relationship among four variables, and the model of best fit contained one of the three-way interactions, one would examine its simple two-way interactions at different levels of the third variable.
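A minimal sketch of such follow-up tests, using scipy on an invented 2 × 2 × 2 array of observed counts (the values and the orientation of the array are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts, indexed as [A, B, C] with two levels each.
observed = np.array([[[25, 17], [14, 22]],
                     [[18, 31], [27, 12]]])

# Examine the A x B association separately at each level of C.
for k in range(observed.shape[2]):
    table = observed[:, :, k]               # 2 x 2 slice at level k of C
    chi2_stat, p, df, _ = chi2_contingency(table)
    print(f"C level {k}: chi2 = {chi2_stat:.2f}, df = {df}, p = {p:.3f}")
```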

Effect sizes

To compare effect sizes of the interactions between the variables, odds ratios are used; a worked example follows the list below. Odds ratios are preferred over chi-square statistics for two main reasons: [1]

1. Odds ratios are independent of the sample size;

2. Odds ratios are not affected by unequal marginal distributions.
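For a 2 × 2 table with cell counts a and b in the first row and c and d in the second, the odds ratio is (a·d)/(b·c). A minimal sketch with invented counts:

```python
import numpy as np

# Hypothetical 2 x 2 table: rows = groups, columns = outcome present / absent.
table = np.array([[30, 10],
                  [15, 25]])

a, b = table[0]
c, d = table[1]

# Odds ratio: ratio of the odds of the outcome in the two groups.
odds_ratio = (a * d) / (b * c)
print(odds_ratio)   # (30 * 25) / (10 * 15) = 5.0
```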

Software

For datasets with a few variables – general log-linear models

For datasets with hundreds of variables – decomposable models [5]

References

  1. Howell, D. C. (2009). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Learning. pp. 630–655.
  2. Field, A. (2005). Discovering statistics using SPSS (2nd ed.). Thousand Oaks, CA: SAGE Publications. pp. 695–718. ISBN 9780761944515.
  3. Agresti, Alan (2007). An Introduction to Categorical Data Analysis (2nd ed.). Hoboken, NJ: Wiley-Interscience. p. 212. doi:10.1002/0470114754. ISBN 978-0-471-22618-5.
  4. Christensen, R. (1997). Log-Linear Models and Logistic Regression (2nd ed.). Springer.
  5. Petitjean, F.; Webb, G. I.; Nicholson, A. E. (2013). Scaling log-linear analysis to high-dimensional data. International Conference on Data Mining. Dallas, TX, USA: IEEE. pp. 597–606.
