Multilevel regression with poststratification

Multilevel regression with poststratification (MRP) is a statistical technique for correcting model estimates for known differences between a sample population (the population from which the available data were drawn) and a target population (the population for which estimates are wanted).

Poststratification is the process of adjusting the estimates: the final estimate is essentially a weighted average of the estimates for all possible combinations of attributes (for example, age and sex), with each combination weighted by its share of the target population. Each combination is sometimes called a "cell". The multilevel regression smooths noisy estimates in cells with too little data by drawing on overall or nearby averages.
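The weighted average can be illustrated with a minimal Python sketch. The cell estimates and the counts below are hypothetical numbers chosen for illustration, and the function name `poststratify` is an assumption of this example rather than part of any library.

```python
def poststratify(cell_estimates, cell_counts):
    """Return the count-weighted average of per-cell estimates.

    cell_estimates: dict mapping a cell (e.g. (age, sex)) to its estimate.
    cell_counts: dict mapping the same cells to the number of people in them.
    """
    total = sum(cell_counts.values())
    return sum(cell_estimates[cell] * n for cell, n in cell_counts.items()) / total


# Hypothetical estimated support for a candidate in each (age, sex) cell.
cell_estimates = {
    ("18-29", "male"): 0.60,
    ("18-29", "female"): 0.70,
    ("30+", "male"): 0.40,
    ("30+", "female"): 0.50,
}

# Hypothetical sample that over-represents young men.
sample_counts = {
    ("18-29", "male"): 700,
    ("18-29", "female"): 100,
    ("30+", "male"): 150,
    ("30+", "female"): 50,
}

# Hypothetical target population, e.g. taken from a census.
population_counts = {
    ("18-29", "male"): 100,
    ("18-29", "female"): 100,
    ("30+", "male"): 400,
    ("30+", "female"): 400,
}

unadjusted = poststratify(cell_estimates, sample_counts)    # about 0.575
adjusted = poststratify(cell_estimates, population_counts)  # about 0.49
print(f"unadjusted: {unadjusted:.3f}, poststratified: {adjusted:.3f}")
```

Weighting by the sample's own composition reproduces the raw sample average, while weighting by the target population's composition gives the poststratified estimate. The gap between the two is the correction for the unrepresentative sample.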

One application is estimating preferences in sub-regions (e.g., states, individual constituencies) based on individual-level survey data gathered at other levels of aggregation (e.g., national surveys). [1]

The technique and its advantages

The technique essentially involves two steps. In the first step, a multilevel regression is fitted to individual-level survey data to estimate the relationship between various types of people, defined by characteristics such as age and race, and individual preferences. In the second step, this relationship is combined with data from, for example, censuses to estimate the sub-regional preference based on the number of people of each type in that sub-region (a process known as "poststratification"). [2] In this way the need to perform surveys at sub-regional level, which can be expensive and impractical in an area (e.g., a country) with many sub-regions (e.g., counties, ridings, or states), is avoided. It also avoids consistency problems that arise when comparing different surveys performed in different areas. [3] [1] Additionally, it allows preferences within a specific locality to be estimated from a survey taken across a wider area that includes relatively few people from the locality in question, or where the sample may be highly unrepresentative. [4]
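The two steps can be sketched in Python under strong simplifying assumptions. A real application fits a multilevel regression in the first step; here a fixed-strength shrinkage of each cell toward the overall mean stands in for that model. All data, names, and the `prior_strength` parameter are hypothetical.

```python
from collections import defaultdict


def fit_cell_estimates(responses, prior_strength=2.0):
    """Step 1 (simplified stand-in for a multilevel regression).

    responses: list of (cell, outcome) pairs with outcome 0 or 1.
    Each cell mean is pulled toward the overall mean; cells with few
    respondents are pulled more strongly.
    """
    overall = sum(y for _, y in responses) / len(responses)
    sums, counts = defaultdict(float), defaultdict(int)
    for cell, y in responses:
        sums[cell] += y
        counts[cell] += 1
    estimates = {
        cell: (sums[cell] + prior_strength * overall) / (counts[cell] + prior_strength)
        for cell in counts
    }
    return estimates, overall


def poststratify_region(cell_estimates, region_counts, fallback):
    """Step 2: weight cell estimates by the region's population counts.

    Cells never observed in the survey use the fallback estimate.
    """
    total = sum(region_counts.values())
    weighted = sum(
        cell_estimates.get(cell, fallback) * n for cell, n in region_counts.items()
    )
    return weighted / total


# Hypothetical national survey: 8 young respondents (6 in favour), 2 old (0 in favour).
survey = [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 0)] * 2

# Hypothetical census counts per sub-region.
census = {
    "region_a": {"young": 300, "old": 700},
    "region_b": {"young": 600, "old": 400},
}

cell_estimates, overall = fit_cell_estimates(survey)
region_estimates = {
    region: poststratify_region(cell_estimates, counts, fallback=overall)
    for region, counts in census.items()
}
print(cell_estimates)
print(region_estimates)
```

No respondent needs to be identified with a particular sub-region: the survey supplies the cell estimates and the census supplies the sub-regional weights. The "old" cell, with only two respondents, is moved from a raw mean of 0 to about 0.3, which illustrates how sparse cells borrow strength from the overall average.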

History

The technique was originally developed by Gelman and T. Little in 1997, [5] building upon ideas of Fay and Herriot [6] and R. Little. [7] It was subsequently expanded on by Park, Gelman, and Bafumi in 2004 and 2006. Lax and Phillips proposed it for estimating US state-level voter preference in 2009, and Warshaw and Rodden proposed it for estimating district-level public opinion in 2012. [1] Later, Wang et al. [8] used survey data from Xbox users to predict the outcome of the 2012 US presidential election. Of the Xbox gamers, 65% were aged 18 to 29 and 93% were male, whereas in the electorate as a whole 19% were aged 18 to 29 and 47% were male. Even though the original data were highly biased, after multilevel regression with poststratification the authors obtained estimates that agreed with those from polls using large amounts of random and representative data. Since then it has also been proposed for use in the field of epidemiology. [4]

YouGov used the technique to successfully predict the overall outcome of the 2017 UK general election, [9] correctly predicting the result in 93% of constituencies. [10]

Limitations and extensions

MRP can be extended to estimate changes in opinion over time. [3] When used to predict elections, it works best relatively close to polling day, after nominations have closed. [11]

Both the "multilevel regression" and "poststratification" ideas of MRP can be generalized. Multilevel regression can be replaced by nonparametric regression [12] or regularized prediction, and poststratification can be generalized to allow for non-census variables, i.e. poststratification totals that are estimated rather than being known. [13]

References

  1. Buttice, Matthew K.; Highton, Benjamin (Autumn 2013). "How Does Multilevel Regression and Poststratification Perform with Conventional National Surveys?" (PDF). Political Analysis. 21 (4): 449–451. doi:10.1093/pan/mpt017. JSTOR 24572674.
  2. "What is MRP?". Survation.com. Survation. 5 November 2018. Retrieved 31 October 2019.
  3. Gelman, Andrew; Lax, Jeffrey; Phillips, Justin; Gabry, Jonah; Trangucci, Robert (28 August 2018). "Using Multilevel Regression and Poststratification to Estimate Dynamic Public Opinion" (PDF): 1–3. Retrieved 31 October 2019.
  4. Downes, Marnie; Gurrin, Lyle C.; English, Dallas R.; Pirkis, Jane; Currier, Diane; Spittal, Matthew J.; Carlin, John B. (9 April 2018). "Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples". American Journal of Epidemiology. 187 (8): 1780–1790. Retrieved 31 October 2019.
  5. Gelman, Andrew; Little, Thomas (1997). "Poststratification into many categories using hierarchical logistic regression". Survey Methodology. 23: 127–135.
  6. Fay, Robert; Herriot, Roger (1979). "Estimates of income for small places: An application of James-Stein procedures to census data". Journal of the American Statistical Association. 74 (366a): 269–277. doi:10.1080/01621459.1979.10482505. JSTOR 2286322.
  7. Little, Roderick (1993). "Post-stratification: A modeler's perspective". Journal of the American Statistical Association. 88 (423): 1001–1012. doi:10.1080/01621459.1993.10476368. JSTOR 2290792.
  8. Wang, Wei; Rothschild, David; Goel, Sharad; Gelman, Andrew (2015). "Forecasting elections with non-representative polls" (PDF). International Journal of Forecasting. 31 (3): 980–991. doi:10.1016/j.ijforecast.2014.06.001.
  9. Revell, Timothy (9 June 2017). "How YouGov's experimental poll correctly called the UK election". New Scientist. Retrieved 31 October 2019.
  10. Cohen, Daniel (27 September 2019). "'I've never known voters be so promiscuous': the pollsters working to predict the next UK election". The Guardian. Retrieved 31 October 2019.
  11. James, William; MacLellan, Kylie (15 October 2019). "A question of trust: British pollsters battle to call looming election". Reuters. Retrieved 31 October 2019.
  12. Bisbee, James (2019). "BARP: Improving Mister P Using Bayesian Additive Regression Trees". American Political Science Review. 113 (4): 1060–1065. doi:10.1017/S0003055419000480. S2CID 201385400.
  13. Gelman, Andrew (28 October 2018). "MRP (or RPP) with non-census variables". Statistical Modeling, Causal Inference, and Social Science.