Conditional logistic regression

Conditional logistic regression is an extension of logistic regression that allows one to account for stratification and matching. Its main field of application is observational studies and in particular epidemiology. It was devised in 1978 by Norman Breslow, Nicholas Day, Katherine Halvorsen, Ross L. Prentice and C. Sabai. [1] It is the most flexible and general procedure for matched data.

Background

Observational studies use stratification or matching as a way to control for confounding.

Logistic regression can account for stratification by having a different constant term for each stratum. Let us denote by $Y_{i\ell} \in \{0,1\}$ the label (e.g. case status) of the $\ell$-th observation of the $i$-th stratum and by $X_{i\ell} \in \mathbb{R}^p$ the values of the corresponding predictors. We then take the likelihood of one observation to be

$$\mathbb{P}(Y_{i\ell} = 1 \mid X_{i\ell}) = \frac{\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i\ell})}{1 + \exp(\alpha_i + \boldsymbol{\beta}^\top X_{i\ell})},$$

where $\alpha_i$ is the constant term for the $i$-th stratum. The parameters in this model can be estimated using maximum likelihood estimation.

For example, consider estimating the impact of exercise on the risk of cardiovascular disease. If people who exercise more are younger, have better access to healthcare, or have other differences that improve their health, then a logistic regression of cardiovascular disease incidence on minutes spent exercising may overestimate the impact of exercise on health. To address this, we can group people based on demographic characteristics like age and zip code of their home residence. Each stratum is a group of people with similar demographics. The vector $X_{i\ell}$ contains information about the variable of interest (in this case, minutes spent exercising) for individual $\ell$ in stratum $i$. The value $\alpha_i$ is the impact of demographics on cardiovascular disease incidence $Y_{i\ell}$, which is assumed to be the same for all people in the stratum. The vector $\boldsymbol{\beta}$ (which, in this example, is just a scalar) is the quantity of interest: the impact of exercise on cardiovascular disease. We can also include control variables within $X_{i\ell}$.
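Continuing this example, a minimal Python sketch of the unconditional fit described above might look as follows; the simulated data, stratum sizes, variable names, and effect sizes are purely hypothetical, and the statsmodels package is assumed to be available.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: 10 demographic strata with 30 people each.
n_strata, per_stratum = 10, 30
stratum = np.repeat(np.arange(n_strata), per_stratum)
alpha = rng.uniform(-1.0, 1.0, size=n_strata)        # stratum effects (alpha_i)
exercise = rng.normal(size=n_strata * per_stratum)   # standardized minutes of exercise
p = 1.0 / (1.0 + np.exp(-(alpha[stratum] - 0.7 * exercise)))
cvd = rng.binomial(1, p)                              # cardiovascular disease indicator

# One indicator column per stratum plays the role of the alpha_i constants;
# no separate intercept is added because the indicators already span it.
design = pd.concat(
    [pd.Series(exercise, name="exercise"),
     pd.get_dummies(stratum, prefix="stratum", dtype=float)],
    axis=1,
)
fit = sm.Logit(cvd, design).fit(disp=False)
print(fit.params["exercise"])  # estimated effect of exercise (beta)
```

With a few large strata, as here, the unconditional fit behaves well; the next section explains why it breaks down when there are many small strata.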

Motivation

Logistic regression as described above works satisfactorily when the number of strata is small relative to the amount of data. If we hold the number of strata fixed and increase the amount of data, estimates of the model parameters ($\alpha_i$ for each stratum and the vector $\boldsymbol{\beta}$) converge to their true values.

Pathological behavior, however, occurs when we have many small strata, because the number of parameters grows with the amount of data. For example, if each stratum contains two datapoints, then the number of parameters in a model with $n$ datapoints is $\frac{n}{2} + p$, so the number of parameters is of the same order as the number of datapoints. In these settings, as we increase the amount of data, the asymptotic results on which maximum likelihood estimation is based are not valid and the resulting estimates are biased. Conditional logistic regression fixes this issue. In fact, it can be shown that the unconditional analysis of matched pair data results in an estimate of the odds ratio which is the square of the correct, conditional one. [2]

In addition to tests based on logistic regression, several other tests for matched data, such as McNemar's test and the Cochran–Mantel–Haenszel test, existed before conditional logistic regression. [5] However, they do not allow for the analysis of continuous predictors with arbitrary stratum sizes. All of these procedures also lack the flexibility of conditional logistic regression, and in particular the possibility of controlling for covariates.

Conditional likelihood

Conditional logistic regression uses a conditional likelihood approach that deals with the above pathological behavior by conditioning on the number of cases in each stratum. This eliminates the need to estimate the strata parameters.

When the strata are pairs, where the first observation is a case and the second is a control, this can be seen as follows:

$$
\begin{aligned}
&\mathbb{P}(Y_{i1}=1, Y_{i2}=0 \mid X_{i1}, X_{i2}, Y_{i1}+Y_{i2}=1) \\
&\quad = \frac{\mathbb{P}(Y_{i1}=1, Y_{i2}=0 \mid X_{i1}, X_{i2})}{\mathbb{P}(Y_{i1}=1, Y_{i2}=0 \mid X_{i1}, X_{i2}) + \mathbb{P}(Y_{i1}=0, Y_{i2}=1 \mid X_{i1}, X_{i2})} \\
&\quad = \frac{\dfrac{\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i1})}{1+\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i1})}\,\dfrac{1}{1+\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i2})}}{\dfrac{\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i1})}{1+\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i1})}\,\dfrac{1}{1+\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i2})} + \dfrac{1}{1+\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i1})}\,\dfrac{\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i2})}{1+\exp(\alpha_i + \boldsymbol{\beta}^\top X_{i2})}} \\
&\quad = \frac{\exp(\boldsymbol{\beta}^\top X_{i1})}{\exp(\boldsymbol{\beta}^\top X_{i1}) + \exp(\boldsymbol{\beta}^\top X_{i2})}.
\end{aligned}
$$

Note that the stratum parameter $\alpha_i$ cancels out of the final expression, so it no longer needs to be estimated.

With similar computations, the conditional likelihood of a stratum of size $m_i$, with the first $k_i$ observations being the cases, is

$$
\mathbb{P}\Bigl(Y_{i1}=\cdots=Y_{ik_i}=1,\; Y_{ik_i+1}=\cdots=Y_{im_i}=0 \;\Big|\; X_{i1},\ldots,X_{im_i},\; \textstyle\sum_{\ell=1}^{m_i} Y_{i\ell}=k_i\Bigr)
= \frac{\exp\!\Bigl(\sum_{\ell=1}^{k_i}\boldsymbol{\beta}^\top X_{i\ell}\Bigr)}{\sum_{J \in \mathcal{C}_{k_i}^{m_i}} \exp\!\Bigl(\sum_{j\in J}\boldsymbol{\beta}^\top X_{ij}\Bigr)},
$$

where $\mathcal{C}_{k_i}^{m_i}$ is the set of all subsets of size $k_i$ of the set $\{1, \ldots, m_i\}$.

The full conditional log likelihood is then simply the sum of the log likelihoods for each stratum. The estimator is then defined as the $\boldsymbol{\beta}$ that maximizes the conditional log likelihood.
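To make this concrete, the following Python sketch evaluates the conditional log likelihood by enumerating the subsets $\mathcal{C}_{k_i}^{m_i}$ directly; the function name and data layout are illustrative, and the brute-force enumeration is only practical for small strata.

```python
import itertools
import numpy as np

def conditional_log_likelihood(beta, strata):
    """Conditional log likelihood summed over all strata.

    beta   : (p,) array of coefficients.
    strata : list of (y, X) pairs, one per stratum, where y is a length-m_i
             array of 0/1 case indicators and X is an (m_i, p) array of
             the corresponding predictors.
    """
    total = 0.0
    for y, X in strata:
        scores = X @ beta                       # beta' X_il for each member
        cases = np.flatnonzero(y == 1)
        k = len(cases)
        numerator = scores[cases].sum()         # sum over the observed cases
        # Denominator: log-sum-exp over all subsets of size k of the stratum.
        log_terms = [scores[list(subset)].sum()
                     for subset in itertools.combinations(range(len(y)), k)]
        total += numerator - np.logaddexp.reduce(log_terms)
    return total

# Two hypothetical matched pairs with a single predictor.
strata = [
    (np.array([1, 0]), np.array([[1.2], [0.3]])),
    (np.array([0, 1]), np.array([[0.5], [2.0]])),
]
print(conditional_log_likelihood(np.array([0.8]), strata))
```

Maximizing this function over $\boldsymbol{\beta}$ (for example with a generic numerical optimizer) yields the conditional maximum likelihood estimate; production implementations use more efficient recursions for the denominator.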

Implementation

Conditional logistic regression is available in R as the function clogit in the survival package. It is in the survival package because the log likelihood of a conditional logistic model is the same as the log likelihood of a Cox model with a particular data structure. [3]

It is also available in Python through the statsmodels package starting with version 0.14. [4]
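A minimal usage sketch, assuming statsmodels 0.14 or later is installed and using simulated matched pairs with hypothetical variable names, might look like this:

```python
import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(1)

# Simulated matched pairs: 200 strata of 2 observations each.
n_strata, per_stratum = 200, 2
groups = np.repeat(np.arange(n_strata), per_stratum)
x = rng.normal(size=(n_strata * per_stratum, 1))
alpha = rng.normal(size=n_strata)                 # stratum effects, never estimated
p = 1.0 / (1.0 + np.exp(-(alpha[groups] + 0.7 * x[:, 0])))
y = rng.binomial(1, p)

# Strata are passed through the `groups` argument; their intercepts
# are conditioned away rather than estimated.
result = ConditionalLogit(y, x, groups=groups).fit()
print(result.summary())
```

Strata in which all outcomes are identical (concordant pairs) contribute no information to the conditional likelihood, so only discordant strata affect the estimate.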

Notes

  1. Breslow NE, Day NE, Halvorsen KT, Prentice RL, Sabai C (1978). "Estimation of multiple relative risk functions in matched case-control studies". Am J Epidemiol. 108 (4): 299–307. doi:10.1093/oxfordjournals.aje.a112623. PMID 727199.
  2. Breslow, N.E.; Day, N.E. (1980). Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. Lyon, France: IARC. pp. 249–251. Archived from the original on 2016-12-26. Retrieved 2016-11-04.
  3. Lumley, Thomas. "R documentation: Conditional logistic regression". Retrieved November 3, 2016.
  4. "statsmodels.discrete.conditional_models.ConditionalLogit". Retrieved March 25, 2023.
  5. Day, N. E.; Byar, D. P. (1979). "Testing hypotheses in case-control studies: equivalence of Mantel-Haenszel statistics and logit score tests". Biometrics. 35 (3): 623–630. doi:10.2307/2530253. JSTOR 2530253. PMID 497345.
