Local case-control sampling

Last updated August 23, 2022

In machine learning, local case-control sampling^[1] is an algorithm used to reduce the complexity of training a logistic regression classifier. It does so by selecting a small subsample of the original dataset for training. The algorithm assumes the availability of a (possibly unreliable) pilot estimate of the parameters, and performs a single pass over the entire dataset, using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods such as case-control sampling and weighted case-control sampling.

Imbalanced datasets

In classification, a dataset is a set of N data points $(x_{i},y_{i})_{i=1}^{N}$, where $x_{i}\in \mathbb {R} ^{d}$ is a feature vector and $y_{i}\in \{0,1\}$ is a label. Intuitively, a dataset is imbalanced when certain important statistical patterns are rare. The lack of observations of certain patterns does not always imply their irrelevance. For example, in medical studies of rare diseases, the small number of infected patients (cases) conveys the most valuable information for diagnosis and treatment.

Formally, an imbalanced dataset exhibits one or more of the following properties:

Marginal Imbalance. A dataset is marginally imbalanced if one class is rare compared to the other class. In other words, $\mathbb {P} (Y=1)\approx 0$.
Conditional Imbalance. A dataset is conditionally imbalanced when it is easy to predict the correct labels in most cases. For example, if $X\in \{0,1\}$, the dataset is conditionally imbalanced if $\mathbb {P} (Y=1\mid X=0)\approx 0$ and $\mathbb {P} (Y=1\mid X=1)\approx 1$.
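The two definitions can be checked empirically on a toy dataset. The sketch below is only an illustration: the helper name `imbalance_summary`, the sample size and the probabilities used to generate the data are all assumptions chosen for the example, not values from the source.

```python
import random


def imbalance_summary(data):
    """Return empirical P(Y=1), P(Y=1 | X=0), P(Y=1 | X=1) for binary (x, y) pairs."""
    marginal = sum(y for _, y in data) / len(data)
    conditional = {}
    for value in (0, 1):
        labels = [y for x, y in data if x == value]
        conditional[value] = sum(labels) / len(labels)
    return marginal, conditional[0], conditional[1]


rng = random.Random(0)

# Hypothetical toy data: X=1 is rare, and Y almost always equals X.
data = []
for _ in range(100_000):
    x = 1 if rng.random() < 0.02 else 0
    p = 0.99 if x == 1 else 0.001
    y = 1 if rng.random() < p else 0
    data.append((x, y))

marginal, p_given_0, p_given_1 = imbalance_summary(data)
print(marginal, p_given_0, p_given_1)
```

With these assumed settings the dataset is both marginally imbalanced (the empirical $\mathbb {P} (Y=1)$ is small) and conditionally imbalanced (the label is almost determined by $X$).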

Algorithm outline

In logistic regression, given the model $\theta =(\alpha ,\beta )$, the prediction is made according to $\mathbb {P} (Y=1\mid X=x;\theta )={\tilde {p}}_{\theta }(x)={\frac {\exp(\alpha +\beta ^{T}x)}{1+\exp(\alpha +\beta ^{T}x)}}$. The local case-control sampling algorithm assumes the availability of a pilot model ${\tilde {\theta }}=({\tilde {\alpha }},{\tilde {\beta }})$. Given the pilot model, the algorithm performs a single pass over the entire dataset to select the subset of samples to include in training the logistic regression model. For a sample $(x,y)$, define the acceptance probability as $a(x,y)=|y-{\tilde {p}}_{\tilde {\theta }}(x)|$. The algorithm proceeds as follows:

Generate independent $z_{i}\sim {\text{Bernoulli}}(a(x_{i},y_{i}))$ for $i\in \{1,\ldots ,N\}$.
Fit a logistic regression model to the subsample $S=\{(x_{i},y_{i}):z_{i}=1\}$, obtaining the unadjusted estimates ${\hat {\theta }}_{S}=({\hat {\alpha }}_{S},{\hat {\beta }}_{S})$.
The output model is ${\hat {\theta }}=({\hat {\alpha }},{\hat {\beta }})$, where ${\hat {\alpha }}\leftarrow {\hat {\alpha }}_{S}+{\tilde {\alpha }}$ and ${\hat {\beta }}\leftarrow {\hat {\beta }}_{S}+{\tilde {\beta }}$.
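The three steps above can be sketched in plain Python for a single scalar feature. This is a minimal illustration, not the reference implementation. The helper names (`sigmoid`, `fit_logistic`, `local_case_control`), the Newton solver, the simulated data and the pilot value are all assumptions made for the example.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(data, iters=25):
    """Maximum-likelihood fit of P(Y=1 | x) = sigmoid(a + b*x) for scalar x,
    using plain Newton iterations started from (0, 0)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p              # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w                 # negative Hessian entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def local_case_control(data, pilot, rng):
    """Subsample with acceptance probability |y - p_pilot(x)|, fit on the
    subsample, then add the pilot coefficients back."""
    a_pilot, b_pilot = pilot
    # Step 1: one pass over the data, keeping each point with probability a(x, y).
    subsample = [
        (x, y) for x, y in data
        if rng.random() < abs(y - sigmoid(a_pilot + b_pilot * x))
    ]
    # Step 2: unadjusted fit on the subsample.
    a_s, b_s = fit_logistic(subsample)
    # Step 3: adjust by the pilot.
    return (a_s + a_pilot, b_s + b_pilot), subsample


rng = random.Random(0)
true_alpha, true_beta = -3.0, 1.5    # assumed "true" model for the simulation

data = []
for _ in range(20_000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < sigmoid(true_alpha + true_beta * x) else 0
    data.append((x, y))

# Hypothetical, slightly inaccurate pilot.
theta_hat, subsample = local_case_control(data, pilot=(-2.8, 1.3), rng=rng)
print(len(subsample), theta_hat)
```

In this simulation only a fraction of the 20,000 points is kept, and the adjusted estimate should land close to the assumed true parameters even though the pilot is slightly off.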

The algorithm can be understood as selecting samples that surprise the pilot model. Intuitively, these samples are closer to the decision boundary of the classifier and are thus more informative.
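A short derivation shows why adding the pilot back in the final step is the right correction. It is sketched here under the assumption that the logistic model with parameter $\theta$ is correctly specified. A sample with label 1 is accepted with probability $1-{\tilde {p}}_{\tilde {\theta }}(x)$ and a sample with label 0 with probability ${\tilde {p}}_{\tilde {\theta }}(x)$, so the odds of $Y=1$ among accepted samples (those with $z=1$) are

$$
\frac{\mathbb{P}(Y=1\mid X=x,\,z=1)}{\mathbb{P}(Y=0\mid X=x,\,z=1)}
=\frac{{\tilde {p}}_{\theta }(x)\,\bigl(1-{\tilde {p}}_{\tilde {\theta }}(x)\bigr)}{\bigl(1-{\tilde {p}}_{\theta }(x)\bigr)\,{\tilde {p}}_{\tilde {\theta }}(x)} .
$$

Taking logarithms and using $\log {\frac {{\tilde {p}}_{\theta }(x)}{1-{\tilde {p}}_{\theta }(x)}}=\alpha +\beta ^{T}x$ gives

$$
\log \frac{\mathbb{P}(Y=1\mid X=x,\,z=1)}{\mathbb{P}(Y=0\mid X=x,\,z=1)}
=(\alpha -{\tilde {\alpha }})+(\beta -{\tilde {\beta }})^{T}x .
$$

The subsample therefore follows a logistic model with parameter $\theta -{\tilde {\theta }}$, which is what ${\hat {\theta }}_{S}$ estimates. Adding ${\tilde {\theta }}$ back recovers an estimate of $\theta$.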

Obtaining the pilot model

In practice, when a pilot model is naturally available, the algorithm can be applied directly to reduce the complexity of training. When no natural pilot exists, an estimate obtained from a subsample selected through another sampling technique can be used instead. In the original paper describing the algorithm, the authors propose using weighted case-control sampling with half of the assigned sampling budget. For example, if the objective is to use a subsample of size 1000, first estimate a model ${\tilde {\theta }}$ using $N_{h}=500$ samples from weighted case-control sampling, then collect another $N_{h}=500$ samples using local case-control sampling.
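The pilot step can be sketched as follows, assuming one common form of weighted case-control sampling: keep cases and controls with class-dependent probabilities chosen so that about half of the pilot budget goes to each class, then fit a logistic regression with inverse-probability weights. The exact scheme, the helper names and the simulated data are assumptions made for this illustration and are not taken from the original paper.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_weighted_logistic(data, iters=25):
    """Weighted maximum-likelihood fit of sigmoid(a + b*x) by Newton iterations.
    `data` holds (x, y, weight) triples; x is a scalar."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, wt in data:
            p = sigmoid(a + b * x)
            w = wt * p * (1.0 - p)
            g0 += wt * (y - p)
            g1 += wt * (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def weighted_case_control_pilot(data, budget, rng):
    """Draw roughly `budget` points, about half cases and half controls, and
    fit with inverse-probability weights. Assumes both classes are present."""
    n_cases = sum(y for _, y in data)
    n_controls = len(data) - n_cases
    keep = {
        1: min(1.0, budget / (2 * n_cases)),
        0: min(1.0, budget / (2 * n_controls)),
    }
    sample = [(x, y, 1.0 / keep[y]) for x, y in data if rng.random() < keep[y]]
    return fit_weighted_logistic(sample), len(sample)


rng = random.Random(1)
true_alpha, true_beta = -3.0, 1.5    # assumed "true" model for the simulation

data = []
for _ in range(20_000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < sigmoid(true_alpha + true_beta * x) else 0
    data.append((x, y))

# Half of a total budget of 1000 is spent on the pilot.
pilot, n_pilot = weighted_case_control_pilot(data, budget=500, rng=rng)
print(n_pilot, pilot)
```

The resulting `pilot` can then be passed to a local case-control sampling pass that collects the remaining half of the budget.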

Larger or smaller sample size

It is possible to control the sample size by multiplying the acceptance probability by a constant $c$. For a larger sample size, pick $c>1$ and adjust the acceptance probability to $\min(c\,a(x_{i},y_{i}),1)$. For a smaller sample size, the same strategy applies with $c<1$. When an exact number of samples is required, a convenient alternative is to uniformly downsample from a larger subsample selected by local case-control sampling.
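Both adjustments are small changes in code. In the sketch below, the function names `scaled_acceptance` and `downsample_exact` are hypothetical, and the list `larger` merely stands in for a subsample already selected by local case-control sampling.

```python
import random


def scaled_acceptance(y, p_pilot, c=1.0):
    """Acceptance probability min(c * |y - p_pilot|, 1) for one sample."""
    return min(c * abs(y - p_pilot), 1.0)


def downsample_exact(subsample, size, rng):
    """Uniformly pick exactly `size` points, without replacement."""
    if size > len(subsample):
        raise ValueError("subsample is smaller than the requested size")
    return rng.sample(subsample, size)


rng = random.Random(0)

# Stand-in for a subsample of 1200 points; exactly 1000 are kept.
larger = list(range(1200))
exact = downsample_exact(larger, 1000, rng)
print(len(exact))
```

Capping at 1 matters only for $c>1$. For $c\leq 1$ the scaled value is already a valid probability.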

Properties

The algorithm has the following properties. When the pilot is consistent, the estimates obtained from local case-control samples are consistent even under model misspecification. If the model is correct, the algorithm has exactly twice the asymptotic variance of logistic regression on the full dataset. For a larger sample size with $c>1$, the factor 2 improves to $1+{\frac {1}{c}}$.
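As a quick arithmetic check of the factor $1+{\frac {1}{c}}$ quoted above (the particular values of $c$ are chosen only for illustration):

$$
c=1:\ 1+\tfrac{1}{1}=2,\qquad
c=2:\ 1+\tfrac{1}{2}=1.5,\qquad
c=4:\ 1+\tfrac{1}{4}=1.25,\qquad
c=10:\ 1+\tfrac{1}{10}=1.1 .
$$

Setting $c=1$ recovers the factor 2 of the basic algorithm, and the factor decreases towards 1 as $c$ grows.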

References

[1] Fithian, William; Hastie, Trevor (2014). "Local case-control sampling: Efficient subsampling in imbalanced data sets". The Annals of Statistics. 42 (5): 1693–1724. arXiv:1306.3706. doi:10.1214/14-aos1220. PMC 4258397. PMID 25492979.

