Multilevel regression with poststratification

Last updated

Multilevel regression with poststratification (MRP) is a statistical technique used for correcting model estimates for known differences between a sample population (the population of the data one has), and a target population (a population one wishes to estimate for).

Contents

The poststratification refers to the process of adjusting the estimates, essentially a weighted average of estimates from all possible combinations of attributes (for example age and sex). Each combination is sometimes called a "cell". The multilevel regression is the use of a multilevel model to smooth noisy estimates in the cells with too little data by using overall or nearby averages.

One application is estimating preferences in sub-regions (e.g., states, individual constituencies) based on individual-level survey data gathered at other levels of aggregation (e.g., national surveys). [1]

Mathematical formulation

Following the MRP model description, [2] assume represents single outcome measurement and the population mean value of , , is the target parameter of interest. In the underlying population, each individual, , belongs to one of poststratification cells characterized by a unique set of covariates. The multilevel regression with poststratification model involves the following pair of steps:

MRP step 1 (multilevel regression): The multilevel regression model specifies a linear predictor for the mean , or the logit transform of the mean in the case of a binary outcome, in poststratification cell ,

where is the outcome measurement for respondent in cell , is the fixed intercept, is the unique covariate vector for cell , is a vector of regression coefficients (fixed effects), is the varying coefficient (random effect), maps the cell index to the corresponding category index of variable . All varying coefficients are exchangeable batches with independent normal prior distributions.

MRP step 2: poststratification: The poststratification (PS) estimate for the population parameter of interest is where is the estimated outcome of interest for poststratification cell and is the size of the -th poststratification cell in the population. Estimates at any subpopulation level are similarly derived where is the subset of all poststratification cells that comprise .

The technique and its advantages

The technique essentially involves using data from, for example, censuses relating to various types of people corresponding to different characteristics (e.g., age, race), in a first step to estimate the relationship between those types and individual preferences (i.e., multi-level regression of the dataset). This relationship is then used in a second step to estimate the sub-regional preference based on the number of people having each type/characteristic in that sub-region (a process known as "poststratification"). [3] In this way the need to perform surveys at sub-regional level, which can be expensive and impractical in an area (e.g., a country) with many sub-regions (e.g. counties, ridings, or states), is avoided. It also avoids issues with consistency of survey when comparing different surveys performed in different areas. [4] [1] Additionally, it allows the estimating of preference within a specific locality based on a survey taken across a wider area that includes relatively few people from the locality in question, or where the sample may be highly unrepresentative. [5]

History

The technique was originally developed by Gelman and T. Little in 1997, [6] building upon ideas of Fay and Herriot [7] and R. Little. [8] It was subsequently expanded on by Park, Gelman, and Bafumi in 2004 and 2006. It was proposed for use in estimating US-state-level voter preference by Lax and Philips in 2009. Warshaw and Rodden subsequently proposed it for use in estimating district-level public opinion in 2012. [1] Later, Wang et al. [9] used survey data of Xbox users to predict the outcome of the 2012 US presidential election. The Xbox gamers were 65% 18- to 29-year-olds and 93% male, while the electorate as a whole was 19% 18- to 29-year-olds and 47% male. Even though the original data was highly biased, after multilevel regression with poststratification the authors were able to get estimates that agreed with those coming from polls using large amounts of random and representative data. Since then it has also been proposed for use in the field of epidemiology. [5]

YouGov used the technique to successfully predict the overall outcome of the 2017 UK general election, [10] correctly predicting the result in 93% of constituencies. [11] In the 2019 and 2024 elections other pollsters used MRP including Survation [12] and Ipsos. [13]

Limitations and extensions

MRP can be extended to estimating the change of opinion over time [4] and when used to predict elections works best when used relatively close to the polling date, after nominations have closed. [14]

Both the "multilevel regression" and "poststratification" ideas of MRP can be generalized. Multilevel regression can be replaced by nonparametric regression [15] or regularized prediction, and poststratification can be generalized to allow for non-census variables, i.e. poststratification totals that are estimated rather than being known. [16]

Related Research Articles

<span class="mw-page-title-main">Autocorrelation</span> Correlation of a signal with a time-shifted copy of itself, as a function of shift

Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations of a random variable as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.

<span class="mw-page-title-main">Student's t-distribution</span> Probability distribution

In probability theory and statistics, Student's t distribution is a continuous probability distribution that generalizes the standard normal distribution. Like the latter, it is symmetric around zero and bell-shaped.

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In statistical inference, specifically predictive inference, a prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.

<span class="mw-page-title-main">Logistic distribution</span> Continuous probability distribution

In probability theory and statistics, the logistic distribution is a continuous probability distribution. Its cumulative distribution function is the logistic function, which appears in logistic regression and feedforward neural networks. It resembles the normal distribution in shape but has heavier tails. The logistic distribution is a special case of the Tukey lambda distribution.

In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. The underlying random variables may be random real numbers, or they may be random vectors, in which case the mixture distribution is a multivariate distribution.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

<span class="mw-page-title-main">Linear discriminant analysis</span> Method used in statistics, pattern recognition, and other fields

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

<span class="mw-page-title-main">Ordinary least squares</span> Method for estimating the unknown parameters in a linear regression model

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.

In statistics, the ordered logit model or proportional odds logistic regression is an ordinal regression model—that is, a regression model for ordinal dependent variables—first considered by Peter McCullagh. For example, if one question on a survey is to be answered by a choice among "poor", "fair", "good", "very good" and "excellent", and the purpose of the analysis is to see how well that response can be predicted by the responses to other questions, some of which may be quantitative, then ordered logistic regression may be used. It can be thought of as an extension of the logistic regression model that applies to dichotomous dependent variables, allowing for more than two (ordered) response categories.

Multilevel models are statistical models of parameters that vary at more than one level. An example could be a model of student performance that contains measures for individual students as well as measures for classrooms within which the students are grouped. These models can be seen as generalizations of linear models, although they can also extend to non-linear models. These models became much more popular after sufficient computing power and software became available.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, where each trial has probability of success . In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

In statistics, a generalized estimating equation (GEE) is used to estimate the parameters of a generalized linear model with a possible unmeasured correlation between observations from different timepoints.

<span class="mw-page-title-main">Poisson distribution</span> Discrete probability distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event. It can also be used for the number of events in other types of intervals than time, and in dimension greater than 1.

In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA not only aims at assessing the main effect of each independent variable but also if there is any interaction between them.

In mathematics, specifically group theory, finite groups of prime power order , for a fixed prime number and varying integer exponents , are briefly called finitep-groups.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.

Shrinkage fields is a random field-based machine learning technique that aims to perform high quality image restoration using low computational overhead.

References

  1. 1 2 3 Buttice, Matthew K.; Highton, Benjamin (Autumn 2013). "How Does Multilevel Regression and Poststratification Perform with Conventional National Surveys?" (PDF). Political Analysis. 21 (4): 449–451. doi:10.1093/pan/mpt017. JSTOR   24572674.
  2. Downes, Marnie Downes; at al. (August 2018). "Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples". American Journal of Epidemiology. 187 (8): 1780–1790. doi:10.1093/aje/kwy070. PMID   29635276.
  3. "What is MRP?". Survation.com. Survation. 5 November 2018. Retrieved 31 October 2019.
  4. 1 2 Gelman, Andrew; Lax, Jeffrey; Phillips, Justin; Gabry, Jonah; Trangucci, Robert (28 August 2018). "Using Multilevel Regression and Poststratification to Estimate Dynamic Public Opinion" (PDF): 1–3. Retrieved 31 October 2019.{{cite journal}}: Cite journal requires |journal= (help)
  5. 1 2 Downes, Marnie; Gurrin, Lyle C.; English, Dallas R.; Pirkis, Jane; Currier, Diane; Spital, Matthew J.; Carlin, John B. (9 April 2018). "Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples". American Journal of Epidemiology. 179 (8): 187. Retrieved 31 October 2019.
  6. Gelman, Andrew; Little, Thomas (1997). "Poststratification into many categories using hierarchical logistic regression". Survey Methodology. 23: 127–135.
  7. Fay, Robert; Herriot, Roger (1979). "Estimates of income for small places: An application of James-Stein procedures to census data". Journal of the American Statistical Association. 74 (423): 1001–1012. doi:10.1080/01621459.1979.10482505. JSTOR   2286322.
  8. Little, Roderick (1993). "Post-stratification: A modeler's perspective". Journal of the American Statistical Association. 88 (423): 1001–1012. doi:10.1080/01621459.1993.10476368. JSTOR   2290792.
  9. Wang, Wei; Rothschild, David; Goel, Sharad; Gelman, Andrew (2015). "Forecasting elections with non-representative polls" (PDF). International Journal of Forecasting. 31 (3): 980–991. doi: 10.1016/j.ijforecast.2014.06.001 .
  10. Revell, Timothy (9 June 2017). "How YouGov's experimental poll correctly called the UK election". New Scientist. Retrieved 31 October 2019.
  11. Cohen, Daniel (27 September 2019). "'I've never known voters be so promiscuous': the pollsters working to predict the next UK election". The Guardian. Retrieved 31 October 2019.
  12. Survation 2019 https://www.survation.com/2019-general-election-mrp-predictions-survation-and-dr-chris-hanretty/
  13. Ipsos 2024 https://www.ipsos.com/en-uk/uk-opinion-polls/ipsos-election-mrp
  14. James, William; MacLellan, Kylie (15 October 2019). "A question of trust: British pollsters battle to call looming election". Reuters . Retrieved 31 October 2019.
  15. Bisbee, James (2019). "BARP: Improving Mister P Using Bayesian Additive Regression Trees". American Political Science Review. 113 (4): 1060–1065. doi:10.1017/S0003055419000480. S2CID   201385400.
  16. Gelman, Andrew (28 October 2018). "MRP (or RPP) with non-census variables". Statistical Modeling, Causal Inference, and Social Science.