Sampling bias

In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample [1] of a population (or non-human factors) in which not all individuals, or instances, were equally likely to have been selected. [2] If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.

Medical sources sometimes refer to sampling bias as ascertainment bias. [3] [4] Ascertainment bias has basically the same definition, [5] [6] but is still sometimes classified as a separate type of bias. [5]

Distinction from selection bias

Sampling bias is usually classified as a subtype of selection bias, [7] sometimes specifically termed sample selection bias, [8] [9] [10] but some classify it as a separate type of bias. [11] One distinction, albeit not universally accepted, is that sampling bias undermines the external validity of a test (the ability of its results to be generalized to the entire population), while selection bias mainly concerns internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.

However, selection bias and sampling bias are often used synonymously. [12]

Types

Symptom-based sampling

The study of medical conditions begins with anecdotal reports. By their nature, such reports only include those referred for diagnosis and treatment. A child who can't function in school is more likely to be diagnosed with dyslexia than a child who struggles but passes. A child examined for one condition is more likely to be tested for and diagnosed with other conditions, skewing comorbidity statistics. As certain diagnoses become associated with behavior problems or intellectual disability, parents try to prevent their children from being stigmatized with those diagnoses, introducing further bias. Studies carefully selected from whole populations are showing that many conditions are much more common and usually much milder than formerly believed.

Truncate selection in pedigree studies

Simple pedigree example of sampling bias

Geneticists are limited in how they can obtain data from human populations. As an example, consider a human characteristic, and suppose we want to decide whether it is inherited as a simple Mendelian trait. Following the laws of Mendelian inheritance, if the parents in a family do not have the characteristic but carry the allele for it, they are carriers (e.g. non-expressive heterozygotes). In this case each of their children has a 25% chance of showing the characteristic. The problem arises because we cannot tell which families have both parents as carriers (heterozygous) unless they have a child who exhibits the characteristic. The description follows the textbook by Sutton. [13]

The figure shows the pedigrees of all the possible families with two children when the parents are carriers (Aa).

The probability of each family being selected is given in the figure, along with the sample frequency of affected children. In this simple case, the researcher will look for a frequency of 4/7 or 5/8 for the characteristic, depending on the type of truncate selection used, rather than the 1/4 expected among all children of carrier parents.
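
The two frequencies can be reproduced with a short calculation. The sketch below is illustrative only. It assumes two selection rules that are not spelled out in the text: under "complete" selection every family with at least one affected child is equally likely to be sampled, and under "single" selection a family's chance of being sampled is proportional to its number of affected children. The function and variable names are hypothetical.

```python
from fractions import Fraction


def two_child_ascertainment(p_affected=Fraction(1, 4)):
    """Frequency of affected children among ascertained two-child families.

    Both parents are assumed to be carriers (Aa), so each child is
    affected independently with probability p_affected.
    """
    q = 1 - p_affected
    # Probability that a family has k affected children, k = 0, 1, 2.
    family_prob = {0: q * q, 1: 2 * p_affected * q, 2: p_affected * p_affected}

    def affected_frequency(selection_weight):
        # selection_weight(k): relative chance that a family with k
        # affected children ends up in the sample (assumed rule).
        weights = {k: prob * selection_weight(k) for k, prob in family_prob.items()}
        sampled_families = sum(weights.values())
        affected_children = sum(w * k for k, w in weights.items())
        return affected_children / (2 * sampled_families)  # two children per family

    complete = affected_frequency(lambda k: 1 if k >= 1 else 0)
    single = affected_frequency(lambda k: k)
    return complete, single


complete, single = two_child_ascertainment()
print(complete, single)  # 4/7 5/8
```

In both cases families with no affected children never enter the sample, which is why the observed frequency is well above 1/4.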

The caveman effect

An example of selection bias is called the "caveman effect". Much of our understanding of prehistoric peoples comes from caves, such as cave paintings made nearly 40,000 years ago. If there had been contemporary paintings on trees, animal skins or hillsides, they would have been washed away long ago. Similarly, evidence of fire pits, middens, burial sites, etc. is most likely to remain intact to the modern era in caves. Prehistoric people are associated with caves because that is where the data still exists, not necessarily because most of them lived in caves for most of their lives. [14]

Problems due to sampling bias

Sampling bias is problematic because a statistic computed from the sample may be systematically erroneous. It can lead to a systematic over- or under-estimation of the corresponding parameter in the population. Sampling bias occurs in practice because it is practically impossible to ensure perfect randomness in sampling. If the degree of misrepresentation is small, then the sample can be treated as a reasonable approximation to a random sample. Also, if the over- and under-represented groups do not differ markedly in the quantity being measured, then a biased sample can still yield a reasonable estimate.
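
A small numerical sketch may make the over- and under-estimation concrete. It is not taken from any cited source. The group shares, group means and relative inclusion probabilities are invented for illustration, and the formula is the large-sample expectation of the sample mean when groups are sampled at different rates.

```python
from fractions import Fraction


def expected_sample_mean(groups):
    """Large-sample expectation of the sample mean under unequal sampling.

    groups: list of (population_share, group_mean, relative_inclusion_probability)
    """
    weighted_total = sum(share * incl * mean for share, mean, incl in groups)
    normaliser = sum(share * incl for share, _, incl in groups)
    return weighted_total / normaliser


# Hypothetical population: group A is 60% of people with mean 0.4,
# group B is 40% of people with mean 0.7.
share_a, share_b = Fraction(3, 5), Fraction(2, 5)
mean_a, mean_b = Fraction(2, 5), Fraction(7, 10)

# Equal inclusion probabilities recover the population mean.
true_mean = expected_sample_mean([(share_a, mean_a, 1), (share_b, mean_b, 1)])

# Group B is assumed to be three times as likely to be sampled as group A.
biased_mean = expected_sample_mean([(share_a, mean_a, 1), (share_b, mean_b, 3)])

# Same unequal sampling, but the groups do not differ in the measured quantity.
half = Fraction(1, 2)
harmless = expected_sample_mean([(share_a, half, 1), (share_b, half, 3)])

print(true_mean, biased_mean, harmless)  # 13/25 3/5 1/2
```

With these made-up numbers the biased design overestimates the population mean (0.60 instead of 0.52), while the same design is harmless when both groups share the same mean.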

The word bias has a strong negative connotation. Indeed, biases sometimes come from deliberate intent to mislead or other scientific fraud. In statistical usage, bias merely represents a mathematical property, regardless of whether it is deliberate, unconscious, or due to imperfections in the instruments used for observation. While some individuals might deliberately use a biased sample to produce misleading results, more often a biased sample simply reflects the difficulty of obtaining a truly representative sample, or ignorance of the bias in the process of measurement or analysis. An example of how a bias can go unrecognized is the widespread use of a ratio (a.k.a. fold change) as a measure of difference in biology. Because it is easier to achieve a large ratio with two small numbers with a given difference, and relatively more difficult to achieve a large ratio with two large numbers with a larger difference, large significant differences may be missed when comparing relatively large numeric measurements. Some have called this a 'demarcation bias' because the use of a ratio (division) instead of a difference (subtraction) removes the results of the analysis from science into pseudoscience (see Demarcation problem).

Some samples use a biased statistical design which nevertheless allows the estimation of parameters. The U.S. National Center for Health Statistics, for example, deliberately oversamples from minority populations in many of its nationwide surveys in order to gain sufficient precision for estimates within these groups. [15] These surveys require the use of sample weights (see later on) to produce proper estimates across all ethnic groups. Provided that certain conditions are met (chiefly that the weights are calculated and used correctly) these samples permit accurate estimation of population parameters.

Historical examples

Example of biased sample: as of June 2008 55% of web browsers (Internet Explorer) in use did not pass the Acid2 test. Due to the nature of the test, the sample consisted mostly of web developers.

A classic example of a biased sample and the misleading results it produced occurred in 1936. In the early days of opinion polling, the American Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt, by a large margin. The result was the exact opposite. The Literary Digest survey represented a sample collected from readers of the magazine, supplemented by records of registered automobile owners and telephone users. This sample included an over-representation of wealthy individuals, who, as a group, were more likely to vote for the Republican candidate. In contrast, a poll of only 50 thousand citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.

Another classic example occurred in the 1948 presidential election. On election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning the grinning president-elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The Tribune was mistaken because its editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses. (In many cities, the Bell System telephone directory contained the same names as the Social Register.) In addition, the Gallup poll that the Tribune based its headline on was over two weeks old at the time of the printing. [17]

In air quality data, pollutants (such as carbon monoxide, nitrogen monoxide, nitrogen dioxide, or ozone) frequently show high correlations, as they stem from the same chemical process(es). These correlations depend on space (i.e., location) and time (i.e., period). Therefore, a pollutant distribution is not necessarily representative of every location and every period. If a low-cost measurement instrument is calibrated with field data in a multivariate manner, more precisely by collocation next to a reference instrument, the relationships between the different compounds are incorporated into the calibration model. Relocating the measurement instrument can then produce erroneous results. [18]

A twenty-first century example is the COVID-19 pandemic, where variations in sampling bias in COVID-19 testing have been shown to account for wide variations in both case fatality rates and the age distribution of cases across countries. [19] [20]

Statistical corrections for a biased sample

If entire segments of the population are excluded from a sample, then there are no adjustments that can produce estimates that are representative of the entire population. But if some groups are underrepresented and the degree of underrepresentation can be quantified, then sample weights can correct the bias. However, the success of the correction depends on the selection model chosen: if certain variables are missing, the methods used to correct the bias could be inaccurate. [21]

For example, a hypothetical population might include 10 million men and 10 million women. Suppose that a biased sample of 100 patients included 20 men and 80 women. A researcher could correct for this imbalance by attaching a weight of 2.5 to each man and 0.625 to each woman. This would adjust any estimates to achieve the same expected value as a sample that included exactly 50 men and 50 women, unless men and women differed in their likelihood of taking part in the survey. [citation needed]
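
The weights in this example are each group's population share divided by its sample share. The following sketch reproduces them and applies them to a made-up outcome. The helper names and the symptom counts are hypothetical and serve only to show the mechanics.

```python
def poststratification_weights(population_counts, sample_counts):
    """Weight per group = population share / sample share."""
    population_total = sum(population_counts.values())
    sample_total = sum(sample_counts.values())
    return {
        group: (population_counts[group] * sample_total)
        / (population_total * sample_counts[group])
        for group in sample_counts
    }


def weighted_mean(values, weights):
    """Weighted average of values using one weight per observation."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


weights = poststratification_weights(
    {"men": 10_000_000, "women": 10_000_000},
    {"men": 20, "women": 80},
)

# Hypothetical outcome: 6 of the 20 men and 48 of the 80 women report a symptom.
sample = (
    [("men", 1)] * 6 + [("men", 0)] * 14
    + [("women", 1)] * 48 + [("women", 0)] * 32
)
values = [outcome for _, outcome in sample]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, [weights[group] for group, _ in sample])

print(weights)  # {'men': 2.5, 'women': 0.625}
```

With these invented counts the unweighted estimate is 0.54, while the weighted estimate is 0.45, which is what a sample of 50 men and 50 women with the same group rates (0.30 and 0.60) would be expected to give.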

References

  1. "Sampling Bias". Medical Dictionary. Archived from the original on 10 March 2016. Retrieved 23 September 2009.
  2. "Biased sample". TheFreeDictionary. Retrieved 23 September 2009. Mosby's Medical Dictionary, 8th edition
  3. Weising K (2005). DNA fingerprinting in plants: principles, methods, and applications. London: Taylor & Francis Group. p. 180. ISBN 978-0-8493-1488-9.
  4. Ramírez i Soriano A (29 November 2008). Selection and linkage desequilibrium tests under complex demographies and ascertainment bias (PDF) (Ph.D. thesis). Universitat Pompeu Fabra. p. 34.
  5. Panacek EA (May 2009). "Error and Bias in Clinical Research" (PDF). SAEM Annual Meeting. New Orleans, LA: Society for Academic Emergency Medicine. Archived from the original (PDF) on 17 August 2016. Retrieved 14 November 2009.
  6. "Ascertainment Bias". Medilexicon Medical Dictionary. Archived from the original on 6 August 2016. Retrieved 14 November 2009.
  7. "Selection Bias". Dictionary of Cancer Terms. Archived from the original on 9 June 2009. Retrieved 23 September 2009.
  8. Ards S, Chung C, Myers SL (February 1998). "The effects of sample selection bias on racial differences in child abuse reporting". Child Abuse & Neglect. 22 (2): 103–15. doi:10.1016/S0145-2134(97)00131-2. PMID 9504213.
  9. Cortes C, Mohri M, Riley M, Rostamizadeh A (2008). "Sample Selection Bias Correction Theory" (PDF). Algorithmic Learning Theory. Lecture Notes in Computer Science. 5254: 38–53. arXiv:0805.2775. CiteSeerX 10.1.1.144.4478. doi:10.1007/978-3-540-87987-9_8. ISBN 978-3-540-87986-2. S2CID 842488.
  10. Cortes C, Mohri M (2014). "Domain adaptation and sample bias correction theory and algorithm for regression" (PDF). Theoretical Computer Science. 519: 103–126. CiteSeerX 10.1.1.367.6899. doi:10.1016/j.tcs.2013.09.027.
  11. Fadem B (2009). Behavioral Science. Lippincott Williams & Wilkins. p. 262. ISBN 978-0-7817-8257-9.
  12. Wallace R (2007). Maxcy-Rosenau-Last Public Health and Preventive Medicine (15th ed.). McGraw Hill Professional. p. 21. ISBN 978-0-07-159318-2.
  13. Sutton HE (1988). An Introduction to Human Genetics (4th ed.). Harcourt Brace Jovanovich. ISBN 978-0-15-540099-3.
  14. Berk RA (June 1983). "An Introduction to Sample Selection Bias in Sociological Data". American Sociological Review. 48 (3): 386–398. doi:10.2307/2095230. JSTOR 2095230.
  15. "Minority Health". National Center for Health Statistics. 2007.
  16. "Browser Statistics". Refsnes Data. June 2008. Retrieved 2008-07-05.
  17. Lienhard JH. "Gallup Poll". The Engines of Our Ingenuity. Retrieved 29 September 2007.
  18. Tancev G, Pascale C (October 2020). "The Relocation Problem of Field Calibrated Low-Cost Sensor Systems in Air Quality Monitoring: A Sampling Bias". Sensors. 20 (21): 6198. Bibcode:2020Senso..20.6198T. doi:10.3390/s20216198. PMC 7662848. PMID 33143233.
  19. Ward D (20 April 2020). Sampling Bias: Explaining Wide Variations in COVID-19 Case Fatality Rates. Preprint (Report). Bern, Switzerland. doi:10.13140/RG.2.2.24953.62564/1.
  20. Böttcher L, D'Orsogna MR, Chou T (May 2021). "Using excess deaths and testing statistics to determine COVID-19 mortalities". European Journal of Epidemiology. 36 (5): 545–558. doi:10.1007/s10654-021-00748-2. PMC 8127858.
  21. Cuddeback G, Wilson E, Orme JG, Combs-Orme T (2004). "Detecting and Statistically Correcting Sample Selection Bias". Journal of Social Service Research. 30 (3): 19–33. doi:10.1300/J079v30n03_02. S2CID 11685550.