Simple random sample

Last updated April 06, 2024

In statistics, a simple random sample (or SRS) is a subset of individuals (a sample) chosen at random from a larger set (a population), with every individual having the same probability of being chosen. In an SRS, each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals.^[1] Simple random sampling is a basic type of sampling and can be a component of other, more complex sampling methods.^[2]

Introduction

The principle of simple random sampling is that every set with the same number of items has the same probability of being chosen. For example, suppose N college students want a ticket for a basketball game, but only X < N tickets are available, so they agree on a fair way to decide who gets to go. Everybody is given a distinct number in the range from 0 to N-1, and random numbers are generated, either electronically or from a table of random numbers. Numbers outside the range from 0 to N-1 are ignored, as are any numbers previously selected. The first X accepted numbers identify the lucky ticket winners.
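The ticket procedure can be sketched in a few lines of Python. This is only an illustration: the function name `draw_ticket_winners` and the choice to simulate a random-number table by drawing fixed-width decimal numbers are assumptions, not part of the description above.

```python
import random

def draw_ticket_winners(num_students, num_tickets, rng=random):
    """Return the numbers (0 .. num_students-1) of the ticket winners."""
    if not 0 <= num_tickets <= num_students:
        raise ValueError("need 0 <= num_tickets <= num_students")
    # Mimic a table of random numbers: fixed-width decimal numbers,
    # some of which fall outside 0 .. num_students-1.
    width = len(str(max(num_students - 1, 0)))
    upper = 10 ** width
    winners = []
    already_selected = set()
    while len(winners) < num_tickets:
        candidate = rng.randrange(upper)
        if candidate >= num_students:
            continue  # outside the range: ignored
        if candidate in already_selected:
            continue  # previously selected: ignored
        already_selected.add(candidate)
        winners.append(candidate)
    return winners

winners = draw_ticket_winners(30, 5, random.Random(1))
```

In practice `random.sample(range(N), X)` from the Python standard library draws a sample without replacement directly; the loop is spelled out here only to mirror the steps in the text.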

In small populations and often in large ones, such sampling is typically done "without replacement", i.e., one deliberately avoids choosing any member of the population more than once. Although simple random sampling can be conducted with replacement instead, this is less common and would normally be described more fully as simple random sampling with replacement. Sampling done without replacement is no longer independent, but still satisfies exchangeability, hence most results of mathematical statistics still hold. Further, for a small sample from a large population, sampling without replacement is approximately the same as sampling with replacement, since the probability of choosing the same individual twice is low. Survey methodology textbooks generally consider simple random sampling without replacement as the benchmark to compute the relative efficiency of other sampling approaches.^[3]
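The claim that the two schemes nearly coincide for a small sample from a large population can be checked with a short calculation. The sketch below computes the probability that n draws with replacement from N individuals contain no repeat; the function name `prob_no_repeats` is an illustrative choice.

```python
def prob_no_repeats(sample_size, population_size):
    """Probability that draws with replacement never pick the same individual twice."""
    prob = 1.0
    for already_drawn in range(sample_size):
        # The next draw must avoid the individuals drawn so far.
        prob *= (population_size - already_drawn) / population_size
    return prob

small_from_large = prob_no_repeats(10, 1_000_000)   # very close to 1
large_fraction = prob_no_repeats(100, 1000)         # well below 1
```

When the sample is not small relative to the population, repeats become likely and the two schemes differ noticeably.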

An unbiased random selection of individuals is important so that if many samples were drawn, the average sample would accurately represent the population. However, this does not guarantee that a particular sample is a perfect representation of the population. Simple random sampling merely allows one to draw externally valid conclusions about the entire population based on the sample. The concept can be extended when the population is a geographic area.^[4] In this case, area sampling frames are relevant.

Conceptually, simple random sampling is the simplest of the probability sampling techniques. It requires a complete sampling frame, which may not be available or feasible to construct for large populations. Even if a complete frame is available, more efficient approaches may be possible if other useful information is available about the units in the population.

Advantages are that it is free of classification error, and it requires minimum previous knowledge of the population other than the frame. Its simplicity also makes it relatively easy to interpret data collected in this manner. For these reasons, simple random sampling best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items, or where the cost of sampling is small enough to make efficiency less important than simplicity. If these conditions do not hold, stratified sampling or cluster sampling may be a better choice.

Relationship between simple random sample and other methods

Equal probability sampling (epsem)

A sampling method for which each individual unit has the same chance of being selected is called equal probability sampling (epsem for short).

Using a simple random sample will always lead to an epsem, but not all epsem samples are SRS. For example, if a teacher has a class arranged in 5 rows of 6 columns and wants to take a random sample of 5 students, she might pick one of the 6 columns at random. This would be an epsem sample, but not all subsets of 5 pupils are equally likely here, as only the subsets that are arranged as a single column are eligible for selection. There are also ways of constructing multistage samples that are not SRS, while the final sample is still epsem.^[5] For example, systematic random sampling produces a sample for which each individual unit has the same probability of inclusion, but different sets of units have different probabilities of being selected.
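The classroom example can be verified by enumeration. The sketch below, with illustrative variable names, lists the six possible column samples, computes each student's inclusion probability exactly, and compares the number of possible samples with the number of 5-student subsets an SRS could produce.

```python
from fractions import Fraction
from math import comb

ROWS, COLS = 5, 6
students = [(row, col) for row in range(ROWS) for col in range(COLS)]

# Each possible sample is one full column, chosen with probability 1/COLS.
column_samples = [frozenset((row, col) for row in range(ROWS)) for col in range(COLS)]

# Inclusion probability of each student under column sampling.
inclusion = {
    student: sum(Fraction(1, COLS) for sample in column_samples if student in sample)
    for student in students
}

# Number of 5-student subsets that an SRS of size 5 could produce.
srs_subsets = comb(ROWS * COLS, ROWS)
```

Every student has inclusion probability 1/6, the same as under an SRS of size 5 from 30, yet only 6 of the 142506 possible subsets can ever be drawn.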

Samples that are epsem are self-weighting, meaning that the inverse of the selection probability is the same for every unit in the sample.

Distinction between a systematic random sample and a simple random sample

Consider a school with 1000 students, and suppose that a researcher wants to select 100 of them for further study. All their names might be put in a bucket and then 100 names might be pulled out. Not only does each person have an equal chance of being selected, we can also easily calculate the probability (P) of a given person being chosen, since we know the sample size (n) and the population size (N):

1. In the case that any given person can only be selected once (i.e., after selection a person is removed from the selection pool):

$$
\begin{aligned}
P &= 1 - \frac{N-1}{N} \cdot \frac{N-2}{N-1} \cdots \frac{N-n}{N-(n-1)} \\
  &= 1 - \frac{N-n}{N} \quad \text{(canceling)} \\
  &= \frac{n}{N} = \frac{100}{1000} = 10\%
\end{aligned}
$$

2. In the case that any selected person is returned to the selection pool (i.e., can be picked more than once):

$$P = 1 - \left(1 - \frac{1}{N}\right)^{n} = 1 - \left(\frac{999}{1000}\right)^{100} = 0.0952\dots \approx 9.5\%$$

This means that every student in the school has in any case approximately a 1 in 10 chance of being selected using this method. Further, any combination of 100 students has the same probability of selection.
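Both probabilities can be reproduced numerically. The following sketch multiplies out the product for the without-replacement case using exact fractions and evaluates the with-replacement formula in floating point; the function names are illustrative.

```python
from fractions import Fraction

def inclusion_prob_without_replacement(population_size, sample_size):
    """1 minus the probability of being missed on every one of the draws."""
    missed_every_draw = Fraction(1)
    for draw in range(sample_size):
        # On this draw, population_size - draw people remain in the pool.
        remaining = population_size - draw
        missed_every_draw *= Fraction(remaining - 1, remaining)
    return 1 - missed_every_draw

def inclusion_prob_with_replacement(population_size, sample_size):
    """1 minus the probability of being missed on every independent draw."""
    return 1 - (1 - 1 / population_size) ** sample_size

p_without = inclusion_prob_without_replacement(1000, 100)  # exactly 1/10
p_with = inclusion_prob_with_replacement(1000, 100)        # about 0.0952
```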

If a systematic pattern is introduced into random sampling, it is referred to as "systematic (random) sampling". An example would be if the students in the school had numbers attached to their names ranging from 0001 to 1000, and we chose a random starting point, e.g. 0533, and then picked every 10th name thereafter to give us our sample of 100 (starting over with 0003 after reaching 0993). In this sense, this technique is similar to cluster sampling, since the choice of the first unit will determine the remainder. This is no longer simple random sampling, because some combinations of 100 students have a larger selection probability than others – for instance, {3, 13, 23, ..., 993} has a 1/10 chance of selection, while {1, 2, 3, ..., 100} cannot be selected under this method.
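A minimal sketch of this wrap-around systematic scheme follows. It assumes students are labelled 1 to N and that the step divides N evenly; the function name `systematic_sample` is illustrative.

```python
def systematic_sample(population_size, step, start):
    """Every `step`-th label beginning at `start`, wrapping around past the end.

    Labels run from 1 to population_size; assumes step divides population_size.
    """
    count = population_size // step
    return sorted(
        (start - 1 + i * step) % population_size + 1
        for i in range(count)
    )

sample = systematic_sample(1000, 10, 533)

# Enumerate the samples reachable from every possible starting point.
distinct_samples = {tuple(systematic_sample(1000, 10, s)) for s in range(1, 1001)}
```

Starting at 533 yields exactly {3, 13, 23, ..., 993}, and across all 1000 starting points only 10 distinct samples can occur, which is why each of them has a 1/10 chance while every other combination of 100 students has probability zero.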

Sampling a dichotomous population

If the members of the population come in two kinds, say "red" and "black", the number of red elements in a sample of given size will vary by sample and hence is a random variable whose distribution can be studied. That distribution depends on the numbers of red and black elements in the full population. For a simple random sample with replacement, the distribution is a binomial distribution. For a simple random sample without replacement, one obtains a hypergeometric distribution.^[6]
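The two distributions of the number of red elements can be written out directly. In this sketch N is the population size, R the number of red elements, and n the sample size; these symbols and the function names are chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k red in n draws with replacement), where p is the red fraction R/N."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def hypergeometric_pmf(k, N, R, n):
    """P(k red in n draws without replacement) from N elements of which R are red."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

N, R, n = 20, 8, 5
with_replacement = [binomial_pmf(k, n, R / N) for k in range(n + 1)]
without_replacement = [hypergeometric_pmf(k, N, R, n) for k in range(n + 1)]
```

Both lists sum to 1; they differ most when the sample is a sizeable fraction of the population and converge as N grows with R/N held fixed.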

Algorithms

Several efficient algorithms for simple random sampling have been developed.^[7]^[8] A naive approach is the draw-by-draw algorithm: at each step, one item is chosen from the remaining set, each item being equally likely, and is removed from the set and added to the sample. This continues until the sample reaches the desired size $k$. The drawback of this method is that it requires random access to the set.
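A minimal sketch of the draw-by-draw idea, assuming the whole population fits in a list so that random access is available. The swap-with-last removal is an implementation convenience, not something the description prescribes.

```python
import random

def draw_by_draw(items, k, rng=random):
    """Draw k items one at a time, each uniformly from those still in the pool."""
    pool = list(items)  # random access is required
    if not 0 <= k <= len(pool):
        raise ValueError("need 0 <= k <= number of items")
    sample = []
    for _ in range(k):
        i = rng.randrange(len(pool))
        # Move the chosen item to the end so it can be removed in O(1).
        pool[i], pool[-1] = pool[-1], pool[i]
        sample.append(pool.pop())
    return sample

sample = draw_by_draw(range(1000), 100, random.Random(0))
```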

The selection-rejection algorithm developed by Fan et al. in 1962^[9] requires only a single pass over the data; however, it is a sequential algorithm and requires knowledge of the total count of items $n$, which is not available in streaming scenarios.
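One common way to express selection-rejection is to accept each item with probability (items still needed) / (items not yet examined). The sketch below follows that formulation; it is an illustration under that assumption rather than a transcription of the cited paper.

```python
import random

def selection_rejection(items, n, k, rng=random):
    """Single pass over `items` (length n known in advance), keeping exactly k."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    sample = []
    for examined, item in enumerate(items):
        needed = k - len(sample)
        if needed == 0:
            break
        remaining = n - examined  # items not yet examined, including this one
        # Accept with probability needed / remaining (exact integer test).
        if rng.randrange(remaining) < needed:
            sample.append(item)
    return sample

sample = selection_rejection(range(1000), 1000, 100, random.Random(0))
```

Because the acceptance probability reaches 1 whenever the number still needed equals the number remaining, the pass always ends with exactly k items, in their original order.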

A very simple random sort algorithm was proved correct by Sunter in 1977.^[10] The algorithm assigns each item a key drawn at random from the uniform distribution on $(0,1)$, sorts all items by key, and selects the $k$ items with the smallest keys.
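The random sort procedure translates almost directly into code. In this sketch the item's position is stored next to its key so that ties between keys never require comparing the items themselves; that detail is an implementation choice.

```python
import random

def random_sort_sample(items, k, rng=random):
    """Attach a uniform random key to each item, sort by key, keep the k smallest."""
    keyed = [(rng.random(), position, item) for position, item in enumerate(items)]
    keyed.sort()
    return [item for _key, _position, item in keyed[:k]]

sample = random_sort_sample(range(1000), 100, random.Random(0))
```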

J. Vitter in 1985^[11] proposed reservoir sampling algorithms, which are widely used. These algorithms do not require knowledge of the population size $n$ in advance, and the space they use depends only on the sample size, not on $n$.
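The simplest member of the reservoir family, often called Algorithm R, can be sketched as follows. It keeps the first k items and then lets each later item replace a random reservoir slot with probability k divided by the number of items seen so far. The function name and the use of a Python iterable as the stream are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for seen, item in enumerate(stream):
        if seen < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Item number seen+1 is kept with probability k / (seen + 1).
            slot = rng.randrange(seen + 1)
            if slot < k:
                reservoir[slot] = item
    return reservoir

# Works on a generator, whose length is not known in advance.
sample = reservoir_sample((x * x for x in range(1000)), 10, random.Random(0))
```

If the stream holds fewer than k items, this sketch simply returns all of them.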

Random sampling can also be accelerated by sampling from the distribution of gaps between samples^[12] and skipping over the gaps.
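The gap-skipping idea can be illustrated with a simple sequential search over the gap distribution. The sketch below draws one uniform number per selected item and advances until the probability of a gap at least that long drops to or below it. This is only the basic inverse-transform form under stated assumptions; it still takes time proportional to the population size and does not reproduce the faster methods of the cited paper.

```python
import random

def skip_sample(population_size, sample_size, rng=random):
    """Return sorted indices of a simple random sample, generated gap by gap."""
    if not 0 <= sample_size <= population_size:
        raise ValueError("need 0 <= sample_size <= population_size")
    chosen = []
    index = 0                    # next index not yet examined
    remaining = population_size  # indices not yet examined
    needed = sample_size         # selections still to make
    while needed > 0:
        v = rng.random()
        unselected = remaining - needed
        # Probability that the gap before the next selection is at least 1.
        prob_gap_at_least = unselected / remaining
        gap = 0
        while prob_gap_at_least > v:
            gap += 1
            unselected -= 1
            remaining -= 1
            prob_gap_at_least *= unselected / remaining
        index += gap             # skip over the gap
        chosen.append(index)     # select the item that ends the gap
        index += 1
        remaining -= 1
        needed -= 1
    return chosen

sample = skip_sample(1000, 100, random.Random(0))
```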

References

[1] Yates, Daniel S.; Moore, David S.; Starnes, Daren S. (2008). The Practice of Statistics (3rd ed.). Freeman. ISBN 978-0-7167-7309-2.
[2] Thompson, Steven K. (2012). Sampling. Wiley Series in Probability and Statistics (3rd ed.). Hoboken, NJ: John Wiley & Sons. ISBN 978-1-118-16293-4.
[3] Cochran, William Gemmell (1977). Sampling Techniques. Wiley Series in Probability and Mathematical Statistics (3rd ed.). New York: Wiley. ISBN 978-0-471-16240-7.
[4] Cressie, Noel A. C. (2015). Statistics for Spatial Data (Revised ed.). Hoboken, NJ: John Wiley & Sons. ISBN 978-1-119-11517-5.
[5] Peters, Tim J.; Eachus, Jenny I. (1995). "Achieving equal probability of selection under various random sampling strategies". Paediatric and Perinatal Epidemiology. 9 (2): 219–224.
[6] Ash, Robert B. (2008). Basic Probability Theory. Mineola, NY: Dover Publications. ISBN 978-0-486-46628-6. OCLC 190785258.
[7] Tillé, Yves (2006). Sampling Algorithms. Springer Series in Statistics. Springer. doi:10.1007/0-387-34240-0. ISBN 978-0-387-30814-2.
[8] Meng, Xiangrui (2013). "Scalable Simple Random Sampling and Stratified Sampling". Proceedings of the 30th International Conference on Machine Learning (ICML-13): 531–539.
[9] Fan, C. T.; Muller, Mervin E.; Rezucha, Ivan (1962). "Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers". Journal of the American Statistical Association. 57 (298): 387–402. doi:10.1080/01621459.1962.10480667. ISSN 0162-1459.
[10] Sunter, A. B. (1977). "List Sequential Sampling with Equal or Unequal Probabilities without Replacement". Applied Statistics. 26 (3): 261–268. doi:10.2307/2346966. JSTOR 2346966.
[11] Vitter, Jeffrey S. (1985). "Random Sampling with a Reservoir". ACM Transactions on Mathematical Software. 11 (1): 37–57. CiteSeerX 10.1.1.138.784. doi:10.1145/3147.3165. ISSN 0098-3500.
[12] Vitter, Jeffrey S. (1984). "Faster methods for random sampling". Communications of the ACM. 27 (7): 703–718. CiteSeerX 10.1.1.329.6400. doi:10.1145/358105.893. ISSN 0001-0782.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
