# Independent and identically distributed random variables

Last updated

In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. [1] This property is usually abbreviated as i.i.d. or iid or IID. Herein, i.i.d. is used, because it is the most prevalent.

## Introduction

In statistics, it is commonly assumed that observations in a sample are effectively i.i.d. The assumption (or requirement) that observations be i.i.d. tends to simplify the underlying mathematics of many statistical methods (see mathematical statistics and statistical theory). In practical applications of statistical modeling, however, the assumption may or may not be realistic. [2] To partially test how realistic the assumption is on a given data set, the correlation can be computed, lag plots drawn or turning point test performed. [3] The generalization of exchangeable random variables is often sufficient and more easily met.

The i.i.d. assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum (or average) of i.i.d. variables with finite variance approaches a normal distribution.

Often the i.i.d. assumption arises in the context of sequences of random variables. Then "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence). An i.i.d. sequence does not imply the probabilities for all elements of the sample space or event space must be the same. [4] For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.

## Definition

### Definition for two random variables

Suppose that the random variables ${\displaystyle X}$ and ${\displaystyle Y}$ are defined to assume values in ${\displaystyle I\subseteq \mathbb {R} }$. Let ${\displaystyle F_{X}(x)=\operatorname {P} (X\leq x)}$ and ${\displaystyle F_{Y}(y)=\operatorname {P} (Y\leq y)}$ be the cumulative distribution functions of ${\displaystyle X}$ and ${\displaystyle Y}$, respectively, and denote their joint cumulative distribution function by ${\displaystyle F_{X,Y}(x,y)=\operatorname {P} (X\leq x\land Y\leq y)}$.

Two random variables ${\displaystyle X}$ and ${\displaystyle Y}$ are identically distributed if and only if [5] ${\displaystyle F_{X}(x)=F_{Y}(x)\,\forall x\in I}$.

Two random variables ${\displaystyle X}$ and ${\displaystyle Y}$ are independent if and only if ${\displaystyle F_{X,Y}(x,y)=F_{X}(x)\cdot F_{Y}(y)\,\forall x,y\in I}$. (See further Independence (probability theory) § Two random variables.)

Two random variables ${\displaystyle X}$ and ${\displaystyle Y}$ are i.i.d. if they are independent and identically distributed, i.e. if and only if

{\displaystyle {\begin{aligned}&F_{X}(x)=F_{Y}(x)\,&\forall x\in I\\&F_{X,Y}(x,y)=F_{X}(x)\cdot F_{Y}(y)\,&\forall x,y\in I\end{aligned}}}

(Eq.1)

### Definition for more than two random variables

The definition extends naturally to more than two random variables. We say that ${\displaystyle n}$ random variables ${\displaystyle X_{1},\ldots ,X_{n}}$ are i.i.d. if they are independent (see further Independence (probability theory)#More than two random variables) and identically distributed, i.e. if and only if

{\displaystyle {\begin{aligned}&F_{X_{1}}(x)=F_{X_{k}}(x)\,&\forall k\in \{1,\ldots ,n\}{\text{ and }}\forall x\in I\\&F_{X_{1},\ldots ,X_{n}}(x_{1},\ldots ,x_{n})=F_{X_{1}}(x_{1})\cdot \ldots \cdot F_{X_{n}}(x_{n})\,&\forall x_{1},\ldots ,x_{n}\in I\end{aligned}}}

(Eq.2)

where ${\displaystyle F_{X_{1},\ldots ,X_{n}}(x_{1},\ldots ,x_{n})=\operatorname {P} (X_{1}\leq x_{1}\land \ldots \land X_{n}\leq x_{n})}$ denotes the joint cumulative distribution function of ${\displaystyle X_{1},\ldots ,X_{n}}$.

## Examples

The following are examples or applications of i.i.d. random variables:

• A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the Gambler's fallacy).
• A sequence of fair or loaded dice rolls is i.i.d.
• A sequence of fair or unfair coin flips is i.i.d.
• In signal processing and image processing the notion of transformation to i.i.d. implies two specifications, the "i.d." (i.d. = identically distributed) part and the "i." (i. = independent) part:
• (i.d.) the signal level must be balanced on the time axis;
• (i.) the signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).

The following are examples of data sampling that do not satisfy the i.i.d. assumption:

• A medical dataset where multiple samples are taken from multiple patients, it is very likely that samples from same patients may be correlated.
• Samples drawn from time dependent processes, for example, year-wise census data.

## Generalizations

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

### Exchangeable random variables

The most general notion which shares the main properties of i.i.d. variables are exchangeable random variables, introduced by Bruno de Finetti.[ citation needed ] Exchangeability means that while variables may not be independent, future ones behave like past ones – formally, any value of a finite sequence is as likely as any permutation of those values – the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization – for example, sampling without replacement is not independent, but is exchangeable.

### Lévy process

In stochastic calculus, i.i.d. variables are thought of as a discrete time Lévy process: each variable gives how much one changes from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize this to include continuous time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables—for instance, the Wiener process is the limit of the Bernoulli process.

## In Machine Learning

In machine learning theory, i.i.d. assumption is often made for training datasets to imply that all samples stem from the same generative process and that the generative process is assumed to have no memory of past generated samples.

## Related Research Articles

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success or failure. A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable , or just distribution function of , evaluated at , is the probability that will take a value less than or equal to .

In probability theory, the expected value of a random variable , denoted or , is a generalization of the weighted average, and is intuitively the arithmetic mean of a large number of independent realizations of . The expected value is also known as the expectation, mathematical expectation, mean, average, or first moment. Expected value is a key concept in economics, finance, and many other subjects.

Probability theory is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms. Typically these axioms formalise probability in terms of a probability space, which assigns a measure taking values between 0 and 1, termed the probability measure, to a set of outcomes called the sample space. Any specified subset of these outcomes is called an event. Central subjects in probability theory include discrete and continuous random variables, probability distributions, and stochastic processes, which provide mathematical abstractions of non-deterministic or uncertain processes or measured quantities that may either be single occurrences or evolve over time in a random fashion. Although it is not possible to perfectly predict random events, much can be said about their behavior. Two major results in probability theory describing such behaviour are the law of large numbers and the central limit theorem.

In probability and statistics, a random variable, random quantity, aleatory variable, or stochastic variable is described informally as a variable whose values depend on outcomes of a random phenomenon. The formal mathematical treatment of random variables is a topic in probability theory. In that context, a random variable is understood as a measurable function defined on a probability space that maps from the sample space to the real numbers.

Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes.

In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

In probability theory, there exist several different notions of convergence of random variables. The convergence of sequences of random variables to some limit random variable is an important concept in probability theory, and its applications to statistics and stochastic processes. The same concepts are known in more general mathematics as stochastic convergence and they formalize the idea that a sequence of essentially random or unpredictable events can sometimes be expected to settle down into a behavior that is essentially unchanging when items far enough into the sequence are studied. The different possible notions of convergence relate to how such a behavior can be characterized: two readily understood behaviors are that the sequence eventually takes a constant value, and that values in the sequence continue to change but can be described by an unchanging probability distribution.

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.

In probability theory, de Finetti's theorem states that positively correlated exchangeable observations are conditionally independent relative to some latent variable. An epistemic probability distribution could then be assigned to this variable. It is named in honor of Bruno de Finetti.

In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable which takes the value 1 with probability and the value 0 with probability . Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are boolean-valued: a single bit whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q. It can be used to represent a coin toss where 1 and 0 would represent "heads" and "tails", respectively, and p would be the probability of the coin landing on heads or tails, respectively. In particular, unfair coins would have

In probability theory, the probability generating function of a discrete random variable is a power series representation of the probability mass function of the random variable. Probability generating functions are often employed for their succinct description of the sequence of probabilities Pr(X = i) in the probability mass function for a random variable X, and to make available the well-developed theory of power series with non-negative coefficients.

In probability theory, the central limit theorem states that, under certain circumstances, the probability distribution of the scaled mean of a random sample converges to a normal distribution as the sample size increases to infinity. Under stronger assumptions, the Berry–Esseen theorem, or Berry–Esseen inequality, gives a more quantitative result, because it also specifies the rate at which this convergence takes place by giving a bound on the maximal error of approximation between the normal distribution and the true distribution of the scaled sample mean. The approximation is measured by the Kolmogorov–Smirnov distance. In the case of independent samples, the convergence rate is n−1/2, where n is the sample size, and the constant is estimated in terms of the third absolute normalized moments.

Renewal theory is the branch of probability theory that generalizes the Poisson process for arbitrary holding times. Instead of exponentially distributed holding times, a renewal process may have any independent and identically distributed (IID) holding times that have finite mean. A renewal-reward process additionally has a random sequence of rewards incurred at each holding time, which are IID but need not be independent of the holding times.

A Dynkin system, named after Eugene Dynkin, is a collection of subsets of another universal set satisfying a set of axioms weaker than those of σ-algebra. Dynkin systems are sometimes referred to as λ-systems or d-system. These set families have applications in measure theory and probability.

In probability theory and statistics, the characteristic function of any real-valued random variable completely defines its probability distribution. If a random variable admits a probability density function, then the characteristic function is the Fourier transform of the probability density function. Thus it provides an alternative route to analytical results compared with working directly with probability density functions or cumulative distribution functions. There are particularly simple results for the characteristic functions of distributions defined by the weighted sums of random variables.

In mathematics, a π-system on a set is a collection of certain subsets of such that

In probability theory, an indecomposable distribution is a probability distribution that cannot be represented as the distribution of the sum of two or more non-constant independent random variables: Z ≠ X + Y. If it can be so expressed, it is decomposable:Z = X + Y. If, further, it can be expressed as the distribution of the sum of two or more independent identically distributed random variables, then it is divisible:Z = X1 + X2.

In statistics, an exchangeable sequence of random variables is a sequence X1X2X3, ... whose joint probability distribution does not change when the positions in the sequence in which finitely many of them appear are altered. Thus, for example the sequences

In probability theory and statistics, a categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution,. The K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

## References

### Citations

1. Clauset, Aaron (2011). "A brief primer on probability distributions" (PDF). Santa Fe Institute.
2. Hampel, Frank (1998), "Is statistics too difficult?", Canadian Journal of Statistics, 26 (3): 497–513, doi:10.2307/3315772, hdl:, JSTOR   3315772 (§8).
3. Le Boudec, Jean-Yves (2010). Performance Evaluation Of Computer And Communication Systems (PDF). EPFL Press. pp. 46–47. ISBN   978-2-940222-40-7. Archived from the original (PDF) on 2013-10-12. Retrieved 2013-06-14.
4. Cover, T. M.; Thomas, J. A. (2006). Elements Of Information Theory. Wiley-Interscience. pp. 57–58. ISBN   978-0-471-24195-9.
5. Casella & Berger 2002 , Theorem 1.5.10