Ecological fallacy

An ecological fallacy (also ecological inference fallacy [1] or population fallacy) is a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong. From the conceptual standpoint of mereology, four common ecological fallacies are: confusion between ecological (group-level) correlations and individual correlations, confusion between group average and total average, Simpson's paradox, and confusion between higher average and higher likelihood.

From a statistical point of view, these ideas can be unified by specifying proper statistical models to make formal inferences, using aggregate data to infer unobserved relationships in individual-level data. [2]

Examples

Mean and median

An example of ecological fallacy is the assumption that a population mean has a simple interpretation when considering likelihoods for an individual.

For instance, if the mean score of a group is larger than zero, this does not imply that a random individual of that group is more likely to have a positive score than a negative one: if there are more negative scores than positive ones, a randomly chosen individual is more likely to have a negative score. Similarly, if a particular group of people is measured to have a lower mean IQ than the general population, it is an error to conclude that a randomly selected member of the group is more likely than not to have a lower IQ than the mean IQ of the general population; nor is it necessarily the case that a randomly selected member of the group is more likely than not to have a lower IQ than a randomly selected member of the general population. Mathematically, this comes from the fact that a distribution can have a positive mean but a negative median. This property is linked to the skewness of the distribution.

Consider the following numerical example:
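A minimal sketch with invented scores makes the point concrete: a group can have a positive mean even though most of its members score negatively.

```python
# Illustrative numbers (invented, not from the original article): five
# scores whose mean is positive even though most individuals score
# negatively.
scores = [-1, -1, -1, -1, 10]

mean = sum(scores) / len(scores)           # 6 / 5 = 1.2 > 0
negative = sum(1 for s in scores if s < 0)

print(f"mean = {mean}")                    # mean = 1.2
print(f"{negative} of {len(scores)} scores are negative")  # 4 of 5
```

The median here is −1, so a randomly chosen member most likely scores below zero despite the positive mean.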

Individual and aggregate correlations

Research dating back to Émile Durkheim suggests that predominantly Protestant localities have higher suicide rates than predominantly Catholic localities. [3] According to Freedman, [4] the idea that Durkheim's findings link, at an individual level, a person's religion to their suicide risk is an example of the ecological fallacy. A group-level relationship does not automatically characterize the relationship at the level of the individual.

Similarly, even if wealth is positively correlated with the tendency to vote Republican at the individual level in the United States, we observe that wealthier states tend to vote Democratic. For example, in the 2004 United States presidential election, the Republican candidate, George W. Bush, won the fifteen poorest states, and the Democratic candidate, John Kerry, won 9 of the 11 wealthiest states in the Electoral College. Yet 62% of voters with annual incomes over $200,000 voted for Bush, while only 36% of voters with annual incomes of $15,000 or less did. [5] Aggregate-level correlation will differ from individual-level correlation if voting preferences are affected by the total wealth of the state even after controlling for individual wealth. The true driving factor in voting preference could be self-perceived relative wealth; perhaps those who see themselves as better off than their neighbors are more likely to vote Republican. In this case, an individual would be more likely to vote Republican if they became wealthier, but more likely to vote for a Democrat if their neighbors' wealth increased (resulting in a wealthier state).

However, the observed difference in voting habits based on state- and individual-level wealth could also be explained by the common confusion between higher averages and higher likelihoods as discussed above. States may not be wealthier because they contain more wealthy people (i.e., more people with annual incomes over $200,000), but rather because they contain a small number of super-rich individuals; the ecological fallacy then results from incorrectly assuming that individuals in wealthier states are more likely to be wealthy.
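A numerical sketch of this mechanism (all incomes invented): one super-rich resident raises a state's mean income without raising the probability that a randomly chosen resident is wealthy.

```python
# Hypothetical two-state example: state A is "wealthier" on average only
# because of a single outlier, yet a random individual from A is less
# likely to be wealthy than one from B.
state_a = [30_000] * 99 + [10_000_000]   # one super-rich resident
state_b = [80_000] * 100

mean_a = sum(state_a) / len(state_a)     # 129,700: higher mean
mean_b = sum(state_b) / len(state_b)     # 80,000

THRESHOLD = 50_000                       # arbitrary "wealthy" cutoff
frac_a = sum(i > THRESHOLD for i in state_a) / len(state_a)  # 0.01
frac_b = sum(i > THRESHOLD for i in state_b) / len(state_b)  # 1.0
```

State A has the higher mean income but the far lower share of wealthy individuals, which is exactly the inference the ecological fallacy gets wrong.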

Many examples of ecological fallacies can be found in studies of social networks, which often combine analysis and implications from different levels. This has been illustrated in an academic paper on networks of farmers in Sumatra. [6]

Robinson's paradox

A 1950 paper by William S. Robinson computed the illiteracy rate and the proportion of the population born outside the US for each state and for the District of Columbia, as of the 1930 census. [7] He showed that these two figures had a negative correlation of −0.53; in other words, the greater the proportion of immigrants in a state, the lower its average illiteracy (or, equivalently, the higher its average literacy). However, when individuals were considered, the correlation between illiteracy and nativity was +0.12 (immigrants were on average more illiterate than native citizens). Robinson showed that the negative correlation at the level of state populations arose because immigrants tended to settle in states where the native population was more literate. He cautioned against deducing conclusions about individuals on the basis of population-level, or "ecological", data. In 2011, it was found that Robinson's calculations of the ecological correlations were based on incorrect state-level data: the correlation of −0.53 mentioned above is in fact −0.46. [8] Robinson's paper was seminal, but the term 'ecological fallacy' was not coined until 1958, by Selvin. [9]
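The reversal Robinson described can be reproduced with a synthetic sketch. The counts below are invented for illustration (not the 1930 census figures): within each state immigrants are more often illiterate, but immigrants cluster in the state whose natives are more literate, so the state-level association points the other way.

```python
# counts: (illiterate, total) per (state, nativity); all numbers invented
data = {
    ("A", "immigrant"): (80, 400),   # 20% illiterate, 40% of state A
    ("A", "native"):    (12, 600),   #  2% illiterate
    ("B", "immigrant"): (30, 50),    # 60% illiterate, 5% of state B
    ("B", "native"):    (285, 950),  # 30% illiterate
}

def rate(pairs):
    ill = sum(i for i, n in pairs)
    tot = sum(n for i, n in pairs)
    return ill / tot

# Individual level: pooled immigrants are MORE illiterate than natives.
imm = rate([v for k, v in data.items() if k[1] == "immigrant"])  # ~0.244
nat = rate([v for k, v in data.items() if k[1] == "native"])     # ~0.192

# Ecological level: state A has the larger immigrant share (40% vs 5%)
# yet the LOWER overall illiteracy rate.
rate_a = rate([v for k, v in data.items() if k[0] == "A"])  # 0.092
rate_b = rate([v for k, v in data.items() if k[0] == "B"])  # 0.315
```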

Formal problem

The correlation of aggregate quantities (or ecological correlation) is not equal to the correlation of individual quantities. Denote by X_i, Y_i two quantities at the individual level. The formula for the covariance of the aggregate quantities in groups of size N is

\operatorname{cov}\left( \sum_{i=1}^{N} X_i, \sum_{j=1}^{N} Y_j \right) = \sum_{i=1}^{N} \operatorname{cov}(X_i, Y_i) + \sum_{i \neq j} \operatorname{cov}(X_i, Y_j)

The covariance of two aggregated variables depends not only on the covariance of the two variables within the same individuals but also on the covariances of the variables between different individuals. In other words, the correlation of aggregate variables takes into account cross-sectional effects that are not relevant at the individual level.
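The role of the cross-individual terms can be checked numerically. The sketch below, with made-up samples for two individuals, verifies that the covariance of the sums equals the sum of the within-individual and cross-individual covariances:

```python
# Verify the bilinearity identity behind the covariance formula:
# cov(X1 + X2, Y1 + Y2)
#   = cov(X1, Y1) + cov(X2, Y2) + cov(X1, Y2) + cov(X2, Y1)
def cov(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Invented samples for two individuals (four observations each)
x1 = [1.0, 2.0, 4.0, 3.0]
x2 = [0.0, 1.0, 1.0, 2.0]
y1 = [2.0, 1.0, 5.0, 4.0]
y2 = [1.0, 3.0, 2.0, 0.0]

lhs = cov([a + b for a, b in zip(x1, x2)],
          [a + b for a, b in zip(y1, y2)])
rhs = cov(x1, y1) + cov(x2, y2) + cov(x1, y2) + cov(x2, y1)
```

The identity holds exactly (up to floating-point rounding), for any samples, because covariance is bilinear.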

The problem for correlations naturally entails a problem for regressions on aggregate variables: the fallacy is therefore an important issue for a researcher who wants to measure causal impacts. Start with a regression model at the individual level, where the outcome y_i is impacted by a regressor x_i:

y_i = \alpha + \beta x_i + \varepsilon_i

The regression model at the aggregate level is obtained by summing the individual equations:

\sum_{i=1}^{N} y_i = N \alpha + \beta \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} \varepsilon_i

Nothing prevents the regressors and the errors from being correlated at the aggregate level. Therefore, in general, running a regression on aggregate data does not estimate the same model as running a regression on individual data.
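A minimal sketch of this divergence, assuming a hypothetical contextual-effects model y = b·x + g·x̄_group with no noise: the aggregate regression recovers b + g, while the pooled individual regression of y on x recovers neither the individual effect b nor the aggregate slope.

```python
# OLS slope of ys on xs (simple regression, no intercept reported)
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

b, g = 1.0, 2.0                      # individual and contextual effects
groups = {"g1": [1.0, 2.0, 3.0], "g2": [4.0, 6.0, 8.0]}  # invented data

xs, ys = [], []
gx, gy = [], []                      # group means
for members in groups.values():
    xbar = sum(members) / len(members)
    for x in members:
        xs.append(x)
        ys.append(b * x + g * xbar)  # outcome depends on the group mean
    gx.append(xbar)
    gy.append(sum(b * x + g * xbar for x in members) / len(members))

individual_slope = slope(xs, ys)     # ~2.41: neither b nor b + g
aggregate_slope = slope(gx, gy)      # exactly b + g = 3.0
```

Here the aggregate regression is the right tool for an aggregate policy question (its slope is b + g), but reading its coefficient as the individual effect b would be an ecological fallacy.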

The aggregate model is correct if and only if

\operatorname{cov}\left( \sum_{i=1}^{N} x_i, \sum_{i=1}^{N} \varepsilon_i \right) = 0

This means that, controlling for the aggregate regressor \sum_i x_i, the aggregate error \sum_i \varepsilon_i does not determine the aggregate outcome.

Choosing between aggregate and individual inference

There is nothing wrong with running regressions on aggregate data if one is interested in the aggregate model. For instance, it is legitimate for the governor of a state to regress crime rate on police force at the state level if one is interested in the policy implication of a rise in police force. However, an ecological fallacy would occur if a city council deduced the impact of an increase in police force on the crime rate at the city level from the correlation at the state level.

Choosing to run aggregate or individual regressions to understand aggregate impacts on some policy depends on the following trade-off: aggregate regressions lose individual-level data but individual regressions add strong modeling assumptions. Some researchers suggest that the ecological correlation gives a better picture of the outcome of public policy actions, thus they recommend the ecological correlation over the individual level correlation for this purpose (Lubinski & Humphreys, 1996). Other researchers disagree, especially when the relationships among the levels are not clearly modeled. To prevent ecological fallacy, researchers with no individual data can model first what is occurring at the individual level, then model how the individual and group levels are related, and finally examine whether anything occurring at the group level adds to the understanding of the relationship. For instance, in evaluating the impact of state policies, it is helpful to know that policy impacts vary less among the states than do the policies themselves, suggesting that the policy differences are not well translated into results, despite high ecological correlations (Rose, 1973).

Group and total averages

Ecological fallacy can also refer to the following fallacy: the average for a group is approximated by the average in the total population divided by the group size. Suppose one knows the number of Protestants and the suicide rate in the USA, but one does not have data linking religion and suicide at the individual level. If one is interested in the suicide rate of Protestants, it is a mistake to estimate it by the total suicide rate divided by the share of Protestants in the population. Formally, denoting P(S \mid P) the suicide rate among Protestants, P(P) the proportion of Protestants, and P(S) the overall suicide rate, we generally have:

P(S \mid P) \neq \frac{P(S)}{P(P)}

However, the law of total probability gives

P(S) = P(S \mid P) \, P(P) + P(S \mid \neg P) \, (1 - P(P))

As we know that P(S \mid \neg P) is between 0 and 1, this equation gives bounds for P(S \mid P).
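Rearranging the law of total probability, the fact that the rate outside the group must lie between 0 and 1 pins the group rate to an interval. A minimal sketch with invented aggregate figures:

```python
# Bounding a group rate P(S|P) from aggregate data alone, given the
# overall rate p_s = P(S) and the group share p_g = P(P). All numbers
# are invented for illustration.
p_s = 0.012   # overall rate
p_g = 0.40    # share of the group in the population

# P(S) = P(S|P)*p_g + P(S|not P)*(1 - p_g), with P(S|not P) in [0, 1]:
lower = max(0.0, (p_s - (1 - p_g)) / p_g)
upper = min(1.0, p_s / p_g)
print(f"P(S|P) lies in [{lower:.3f}, {upper:.3f}]")  # [0.000, 0.030]
```

The naive estimate p_s / p_g is only the upper end of this interval, not the group rate itself.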

Simpson's paradox

A striking ecological fallacy is Simpson's paradox: when comparing two populations divided into groups, the average of some variable in the first population can be higher in every group and yet lower in the total population. Formally, when each value of Z refers to a different group and X refers to some treatment, it can happen that

\forall z: \; E(Y \mid X = 1, Z = z) > E(Y \mid X = 0, Z = z) \quad \text{and yet} \quad E(Y \mid X = 1) < E(Y \mid X = 0)

When the treatment effect does not depend on the group, Simpson's paradox is exactly the omitted variable bias for the regression of Y on X, where the regressor is a dummy variable and the omitted variable is a categorical variable defining the groups. The example is striking because the bias is large enough that the estimated parameters have opposite signs.
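The classic kidney-stone comparison (Charig et al., 1986) is a standard numerical illustration: treatment A wins within each group but loses overall, because it was disproportionately given the harder cases.

```python
# (successes, patients) per (stone size, treatment); Charig et al. data
success = {
    ("small", "A"): (81, 87),    ("small", "B"): (234, 270),
    ("large", "A"): (192, 263),  ("large", "B"): (55, 80),
}

def rate(group, treatment):
    s, n = success[(group, treatment)]
    return s / n

def overall(treatment):
    s = sum(v[0] for k, v in success.items() if k[1] == treatment)
    n = sum(v[1] for k, v in success.items() if k[1] == treatment)
    return s / n

# A beats B within both groups...
assert rate("small", "A") > rate("small", "B")   # 93% vs 87%
assert rate("large", "A") > rate("large", "B")   # 73% vs 69%
# ...yet loses overall.
assert overall("A") < overall("B")               # 78% vs 83%
```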

The ecological fallacy was discussed in a court challenge to the 2004 Washington gubernatorial election, in which a number of illegal voters were identified after the election; their votes were unknown, because the vote was by secret ballot. The challengers argued that illegal votes cast in the election would have followed the voting patterns of the precincts in which they had been cast, and that adjustments should be made accordingly. [10] An expert witness said this approach was like trying to figure out Ichiro Suzuki's batting average by looking at the batting average of the entire Seattle Mariners team, since the illegal votes were cast by an unrepresentative sample of each precinct's voters, and might be as different from the average voter in the precinct as Ichiro was from the rest of his team. [11] The judge determined that the challengers' argument was an ecological fallacy and rejected it. [12]

References

  1. Charles Ess; Fay Sudweeks (2001). Culture, technology, communication: towards an intercultural global village. SUNY Press. p. 90. ISBN   978-0-7914-5015-4. The problem lies with the 'ecological fallacy' (or fallacy of division)—the impulse to apply group or societal level characteristics to individuals within that group.
  2. King, Gary (1997). A Solution to the Ecological Inference Problem. Princeton University Press. ISBN 978-0-691-01240-7.
  3. Durkheim, Émile (1951) [1897]. Suicide: A Study in Sociology. Translated by John A. Spaulding and George Simpson. New York: The Free Press. ISBN 0-684-83632-7.
  4. Freedman, D. A. (1999). Ecological Inference and the Ecological Fallacy. International Encyclopedia of the Social & Behavioral Sciences, Technical Report No. 549. https://web.stanford.edu/class/ed260/freedman549.pdf
  5. Gelman, Andrew; Park, David; Shor, Boris; Bafumi, Joseph; Cortina, Jeronimo (2008). Red State, Blue State, Rich State, Poor State. Princeton University Press. ISBN 978-0-691-13927-2.
  6. Matous, Petr (2015). "Social networks and environmental management at multiple levels: soil conservation in Sumatra". Ecology and Society. 20 (3): 37. doi: 10.5751/ES-07816-200337 . hdl: 10535/9990 .
  7. Robinson, W.S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review. 15 (3): 351–357. doi:10.2307/2087176. JSTOR 2087176.
  8. Te Grotenhuis, Manfred; Eisinga, Rob; Subramanian, S.V. (2011). "Robinson's Ecological Correlations and the Behavior of Individuals: methodological corrections". Int J Epidemiol. 40 (4): 1123–1125. doi:10.1093/ije/dyr081. hdl:2066/99678. PMID 21596762. The data Robinson used and the corrections are available at http://www.ru.nl/mt/rob/downloads/
  9. Selvin, Hanan C. (1958). "Durkheim's Suicide and Problems of Empirical Research". American Journal of Sociology. 63 (6): 607–619. doi:10.1086/222356. S2CID 143488519.
  10. George Howland Jr. (May 18, 2005). "The Monkey Wrench Trial: Dino Rossi's challenge of the 2004 election is on shaky legal ground. But if he prevails, watch litigation become an option in close races everywhere". Seattle Weekly. Archived from the original on December 1, 2008. Retrieved December 17, 2008.
  11. Christopher Adolph (May 12, 2005). "Report on the 2004 Washington Gubernatorial Election". Expert witness report to the Chelan County Superior Court in Borders et al v. King County et al.
  12. Borders et al. v. King County et al., transcript of the decision by Chelan County Superior Court Judge John Bridges, June 6, 2005, published June 8, 2005. Archived 2008-10-18 at the Wayback Machine.
