Independent and identically distributed random variables

A chart showing a uniform distribution

In probability theory and statistics, a collection of random variables is independent and identically distributed (i.i.d., iid, or IID) if each random variable has the same probability distribution as the others and all are mutually independent. [1] IID was first defined in statistics and finds application in many fields, such as data mining and signal processing.

Introduction

Statistics commonly deals with random samples. A random sample can be thought of as a set of objects that are chosen randomly. More formally, it is "a sequence of independent, identically distributed (IID) random data points."

In other words, the terms random sample and IID are synonymous. In statistics, "random sample" is the typical terminology, but in probability, it is more common to say "IID."

Application

Independent and identically distributed random variables are often used as an assumption, which tends to simplify the underlying mathematics. In practical applications of statistical modeling, however, this assumption may or may not be realistic. [3]

The i.i.d. assumption is also used in the central limit theorem, which states that the probability distribution of the suitably standardized sum (or average) of i.i.d. variables with finite variance approaches a normal distribution as the number of variables grows. [4]
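The central limit theorem statement can be checked numerically. The following is a minimal sketch, not a proof: it assumes Uniform(0, 1) draws, a sample size of 30, and a fixed seed, all chosen here purely for illustration. It standardizes many sample means and compares them with what a standard normal distribution would give.

```python
import random
import statistics

def standardized_means(n, trials, seed=0):
    """Draw `trials` sample means of n i.i.d. Uniform(0, 1) values and standardize them."""
    rng = random.Random(seed)
    mu = 0.5                    # mean of Uniform(0, 1)
    sigma = (1 / 12) ** 0.5     # standard deviation of Uniform(0, 1)
    out = []
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        out.append((mean - mu) / (sigma / n ** 0.5))
    return out

z = standardized_means(n=30, trials=20_000)

# Share of standardized means within one unit of 0;
# a standard normal distribution puts about 0.683 of its mass there.
within_one = sum(abs(v) <= 1 for v in z) / len(z)
print(statistics.mean(z), statistics.stdev(z), within_one)
```

The printed mean should be close to 0, the standard deviation close to 1, and the last value close to 0.683, even though the individual draws are uniform rather than normal.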

The i.i.d. assumption frequently arises in the context of sequences of random variables. Then, "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first-order Markov sequence). An i.i.d. sequence does not imply the probabilities for all elements of the sample space or event space must be the same. [5] For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.
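The contrast between a biased i.i.d. sequence and a first-order Markov sequence can be illustrated with a small simulation. This is a sketch under stated assumptions: the loading weights, the "sticky" transition rule, and the function names are hypothetical and do not come from the cited sources.

```python
import random
from collections import Counter

FACES = [1, 2, 3, 4, 5, 6]
LOADED = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # hypothetical loading: six comes up half the time

def iid_rolls(n, rng):
    """Each roll uses the same weights and ignores every earlier roll."""
    return rng.choices(FACES, weights=LOADED, k=n)

def sticky_rolls(n, rng, stay=0.8):
    """First-order Markov sequence: keep the previous face with probability `stay`,
    otherwise pick a face uniformly at random."""
    seq = [rng.choice(FACES)]
    for _ in range(n - 1):
        seq.append(seq[-1] if rng.random() < stay else rng.choice(FACES))
    return seq

def p_six_after_six(seq):
    """Estimate P(next = 6 | current = 6) from consecutive pairs."""
    after = [b for a, b in zip(seq, seq[1:]) if a == 6]
    return sum(b == 6 for b in after) / len(after)

rng = random.Random(1)
iid = iid_rolls(100_000, rng)
markov = sticky_rolls(100_000, rng)

p_six_iid = Counter(iid)[6] / len(iid)
p_six_markov = Counter(markov)[6] / len(markov)

print(p_six_iid, p_six_after_six(iid))        # both near 0.5: biased, yet i.i.d.
print(p_six_markov, p_six_after_six(markov))  # conditional far above marginal: not independent
```

For the loaded die, knowing the previous roll does not change the estimated chance of a six, although that chance is far from 1/6. For the sticky sequence, the previous roll changes it substantially.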

In signal processing and image processing, the notion of transformation to i.i.d. implies two specifications, the "i.d." part and the "i." part:

i.d. – The signal level must be balanced on the time axis.

i. – The signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).

Definition

Definition for two random variables

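Suppose that the random variables $X$ and $Y$ are defined to assume values in $I \subseteq \mathbb{R}$. Let $F_X(x) = \operatorname{P}(X \leq x)$ and $F_Y(y) = \operatorname{P}(Y \leq y)$ be the cumulative distribution functions of $X$ and $Y$, respectively, and denote their joint cumulative distribution function by $F_{X,Y}(x, y) = \operatorname{P}(X \leq x \land Y \leq y)$.

Two random variables $X$ and $Y$ are identically distributed if and only if [6] $F_X(x) = F_Y(x)$ for all $x \in I$.

Two random variables $X$ and $Y$ are independent if and only if $F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$ for all $x, y \in I$. (See further Independence (probability theory) § Two random variables.)

Two random variables $X$ and $Y$ are i.i.d. if they are independent and identically distributed, i.e. if and only if

$$F_X(x) = F_Y(x) \quad \forall x \in I, \qquad F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y) \quad \forall x, y \in I$$
(Eq.1)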
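For discrete variables, both conditions can be verified exactly. The sketch below writes F_X and F_Y for the marginal cumulative distribution functions and F_XY for the joint one (notation and helper names assumed here), builds them from a joint probability table for two dice, and checks that the marginals agree and that the joint function factorizes.

```python
from fractions import Fraction
from itertools import product

VALUES = range(1, 7)

def cdfs(joint):
    """Return (F_X, F_Y, F_XY) as functions built from a joint pmf {(x, y): prob}."""
    def F_XY(x, y):
        return sum(p for (a, b), p in joint.items() if a <= x and b <= y)
    def F_X(x):
        return F_XY(x, max(VALUES))   # let y run over all values
    def F_Y(y):
        return F_XY(max(VALUES), y)   # let x run over all values
    return F_X, F_Y, F_XY

def is_iid(joint):
    """Return (identically distributed?, independent?) for the pair (X, Y)."""
    F_X, F_Y, F_XY = cdfs(joint)
    identical = all(F_X(t) == F_Y(t) for t in VALUES)
    independent = all(F_XY(x, y) == F_X(x) * F_Y(y) for x, y in product(VALUES, VALUES))
    return identical, independent

# Two separate fair dice: every ordered pair has probability 1/36.
two_dice = {(x, y): Fraction(1, 36) for x, y in product(VALUES, VALUES)}
# One die read twice (Y = X): same marginals, but fully dependent.
same_die = {(x, x): Fraction(1, 6) for x in VALUES}

print(is_iid(two_dice))   # (True, True)
print(is_iid(same_die))   # (True, False)
```

The second case shows that identical distribution alone is not enough: the two readings have the same marginal distribution, yet the joint function does not factorize.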

Definition for more than two random variables

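The definition extends naturally to more than two random variables. We say that $n$ random variables $X_1, \ldots, X_n$ are i.i.d. if they are independent (see further Independence (probability theory) § More than two random variables) and identically distributed, i.e. if and only if

$$F_{X_1}(x) = F_{X_k}(x) \quad \forall k \in \{1, \ldots, n\},\ \forall x \in I, \qquad F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n) \quad \forall x_1, \ldots, x_n \in I$$
(Eq.2)

where $F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \operatorname{P}(X_1 \leq x_1 \land \ldots \land X_n \leq x_n)$ denotes the joint cumulative distribution function of $X_1, \ldots, X_n$.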

Definition for independence

In probability theory, two events, $A$ and $B$, are called independent if and only if $P(A \cap B) = P(A)P(B)$. In the following, $P(AB)$ is short for $P(A \cap B)$.

Suppose there are two events of the experiment, $A$ and $B$. If $P(A) > 0$, the conditional probability $P(B \mid A)$ is defined. Generally, the occurrence of $A$ has an effect on the probability of $B$; this is what conditional probability captures. Only when the occurrence of $A$ has no effect on the occurrence of $B$ does $P(B \mid A) = P(B)$ hold.

Note: If $P(A) > 0$ and $P(B) > 0$, then $A$ and $B$ cannot be both independent and mutually exclusive. Independent events with positive probability must be able to occur together, and mutually exclusive events with positive probability are necessarily dependent.

Suppose $A$, $B$, and $C$ are three events. If $P(AB) = P(A)P(B)$, $P(BC) = P(B)P(C)$, $P(AC) = P(A)P(C)$, and $P(ABC) = P(A)P(B)P(C)$ are satisfied, then the events $A$, $B$, and $C$ are mutually independent.

More generally, consider $n$ events $A_1, A_2, \ldots, A_n$. If, for every choice of $2, 3, \ldots, n$ of these events, the probability of their intersection equals the product of the probabilities of the individual events, then the events $A_1, A_2, \ldots, A_n$ are independent of each other.
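The condition on the triple intersection is not redundant. The following sketch uses a standard textbook construction with two fair coin tosses; the event names A, B, C and the helper functions are chosen here for illustration. The three events are pairwise independent but not mutually independent.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product("HT", repeat=2))   # four equally likely outcomes of two tosses

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def both(*events):
    """Intersection of several events."""
    return lambda w: all(e(w) for e in events)

A = lambda w: w[0] == "H"      # first toss is heads
B = lambda w: w[1] == "H"      # second toss is heads
C = lambda w: w[0] == w[1]     # the two tosses agree

pairwise = all(
    prob(both(E, F)) == prob(E) * prob(F)
    for E, F in [(A, B), (B, C), (A, C)]
)
triple = prob(both(A, B, C)) == prob(A) * prob(B) * prob(C)

print(pairwise, triple)   # True False
```

Each pair satisfies the product rule with probability 1/4, but the triple intersection has probability 1/4 rather than 1/8, so mutual independence fails.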

Examples

Example 1

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the gambler's fallacy).

Example 2

Toss a coin 10 times and record how many times the coin lands on heads.

  1. Independent – The outcome of one toss does not affect the outcome of any other toss, so the 10 results are independent of each other.
  2. Identically distributed – Regardless of whether the coin is fair (probability 1/2 of heads) or unfair, as long as the same coin is used for each flip, each flip will have the same probability as each other flip.

Such a sequence of i.i.d. variables with two possible outcomes is also called a Bernoulli process.
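Because the tosses are i.i.d., the recorded number of heads follows a binomial distribution. The sketch below computes that distribution exactly; the function name and the bias of 7/10 are hypothetical choices made for illustration.

```python
from fractions import Fraction
from math import comb

def heads_pmf(n, p):
    """P(exactly k heads in n i.i.d. tosses), k = 0..n, for heads probability p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

fair = heads_pmf(10, Fraction(1, 2))
unfair = heads_pmf(10, Fraction(7, 10))   # hypothetical bias, chosen for illustration

print(fair[5])                                   # 63/256
print(sum(k * q for k, q in enumerate(unfair)))  # 7
```

The same formula applies to the fair and the unfair coin; only the per-toss probability changes, which matches the point that a biased coin still yields i.i.d. tosses.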

Example 3

Roll a die 10 times and record how many times the result is 1.

  1. Independent – Each outcome of the die roll will not affect the next one, which means the 10 results are independent from each other.
  2. Identically distributed – Regardless of whether the die is fair or weighted, each roll will have the same probability as every other roll. In contrast, rolling 10 different dice, some of which are weighted and some of which are not, would not produce i.i.d. variables.

Example 4

Choose a card from a standard deck of cards containing 52 cards, then place the card back in the deck. Repeat this 52 times. Record the number of kings that appear.

  1. Independent – The outcome of one draw does not affect the next, so the 52 results are independent of each other. In contrast, if each drawn card is kept out of the deck, subsequent draws are affected by it (drawing one king makes drawing a second king less likely), and the results are not independent.
  2. Identically distributed – Because each drawn card is returned to the deck, the probability of drawing a king is 4/52 on every draw, so the distribution is identical each time.
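The effect of returning the card can be computed exactly by enumerating two consecutive draws. This is a minimal sketch in which the deck is reduced to kings and non-kings; the helper name is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

DECK = ["K"] * 4 + ["x"] * 48   # only king vs. non-king matters here

def p_second_king_given_first(pairs, first):
    """P(second card is a king | first card is `first`) over equally likely pairs."""
    matching = [(a, b) for a, b in pairs if a == first]
    return Fraction(sum(b == "K" for _, b in matching), len(matching))

with_replacement = list(product(DECK, repeat=2))      # card is put back
without_replacement = list(permutations(DECK, 2))     # card is kept out

print(p_second_king_given_first(with_replacement, "K"))     # 1/13
print(p_second_king_given_first(with_replacement, "x"))     # 1/13
print(p_second_king_given_first(without_replacement, "K"))  # 1/17
print(p_second_king_given_first(without_replacement, "x"))  # 4/51
```

With replacement, the chance of a king on the second draw is 4/52 = 1/13 whatever the first card was. Without replacement, it depends on the first card, so the draws are not independent.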

Generalizations

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

Exchangeable random variables

The most general notion that shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti. Exchangeability means that while the variables may not be independent, future ones behave like past ones. Formally, any finite sequence of values is as likely as any permutation of those values; that is, the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization: for example, sampling without replacement is not independent, but is exchangeable.
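The claim about sampling without replacement can be checked exactly on a small urn. In this sketch the urn contents (two red, three blue) and the number of draws are hypothetical; the code enumerates all ordered draws, builds the joint distribution of the colour sequence, and tests permutation invariance and independence separately.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

URN = ["R", "R", "B", "B", "B"]   # hypothetical urn: 2 red, 3 blue
DRAWS = 3

# Every ordered selection of distinct balls is equally likely.
orderings = list(permutations(range(len(URN)), DRAWS))
counts = Counter(tuple(URN[i] for i in idx) for idx in orderings)
joint = {seq: Fraction(c, len(orderings)) for seq, c in counts.items()}

# Exchangeable: reordering a colour sequence never changes its probability.
exchangeable = all(
    joint.get(tuple(seq[i] for i in perm), 0) == p
    for seq, p in joint.items()
    for perm in permutations(range(DRAWS))
)

def marginal(position, colour):
    """P(draw at `position` has `colour`)."""
    return sum(p for seq, p in joint.items() if seq[position] == colour)

# Independent would require P(R, R) = P(R) * P(R) for the first two draws.
p_rr = sum(p for seq, p in joint.items() if seq[0] == "R" and seq[1] == "R")
independent = p_rr == marginal(0, "R") * marginal(1, "R")

print(exchangeable, independent)   # True False
```

Each draw has the same marginal probability 2/5 of being red, and the joint distribution is symmetric, but two reds in a row has probability 1/10 rather than 4/25.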

Lévy process

In stochastic calculus, i.i.d. variables are thought of as the increments of a discrete-time Lévy process: each variable gives how much the process changes from one time step to the next. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process.

One may generalize this to include continuous-time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables. For instance, the Wiener process is the scaling limit of the random walk generated by a Bernoulli process.
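One aspect of this limit can be checked with exact arithmetic. The sketch below looks only at the value of a scaled random walk at a single time, not at the whole path, and assumes symmetric steps of +1 and -1; the function name is hypothetical. A standard Wiener process at time 1 is standard normal, so the probabilities should approach the standard normal value at 1.

```python
from math import isqrt
from statistics import NormalDist

def p_scaled_walk_at_most_one(n):
    """Exact P(S_n / sqrt(n) <= 1), where S_n is a sum of n i.i.d. +/-1 steps.

    n must be a perfect square so that sqrt(n) is an integer.
    """
    root = isqrt(n)
    if root * root != n:
        raise ValueError("n must be a perfect square")
    # S_n = 2K - n with K ~ Binomial(n, 1/2), so S_n <= root  <=>  K <= (n + root) // 2
    k_max = (n + root) // 2
    total, term = 0, 1                      # term starts at C(n, 0)
    for k in range(k_max + 1):
        total += term
        term = term * (n - k) // (k + 1)    # C(n, k + 1) from C(n, k)
    return total / 2 ** n

target = NormalDist().cdf(1.0)
errors = [abs(p_scaled_walk_at_most_one(n) - target) for n in (100, 400, 10_000)]
print(target, errors)
```

The gap to the normal value shrinks as the number of steps grows, consistent with the scaled walk behaving more and more like a Wiener process at that time point.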

In machine learning

Machine learning (ML) involves learning statistical relationships within data. To train ML models effectively, it is crucial to use data that is broadly generalizable. If the training data is insufficiently representative of the task, the model's performance on new, unseen data may be poor.

The i.i.d. hypothesis allows for a significant reduction in the number of individual cases required in the training sample, simplifying optimization calculations. In optimization problems, the assumption of independent and identical distribution simplifies the calculation of the likelihood function. Due to this assumption, the likelihood function can be expressed as:

$$l(\theta) = P(x_1, x_2, \ldots, x_n \mid \theta) = P(x_1 \mid \theta)\, P(x_2 \mid \theta) \cdots P(x_n \mid \theta)$$

To maximize the probability of the observed event, the logarithm of the likelihood is maximized over the parameter $\theta$. Specifically, it computes:

$$\underset{\theta}{\operatorname{argmax}}\ \log l(\theta)$$

where

$$\log l(\theta) = \log P(x_1 \mid \theta) + \log P(x_2 \mid \theta) + \cdots + \log P(x_n \mid \theta)$$

Working with a sum instead of a product is computationally convenient: a product of many probabilities can underflow in floating-point arithmetic, whereas a sum of logarithms stays within range. For densities with exponential factors, the log transformation also turns those factors into linear or polynomial terms, which simplifies the maximization.
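A small numerical sketch makes the benefit of the logarithm concrete. It assumes a Bernoulli model with a made-up sample of 2000 tosses and a simple grid search over the parameter; none of these choices come from the source. The product of per-observation probabilities underflows to zero in double precision, while the sum of their logarithms remains usable for maximization.

```python
import math

def likelihood(theta, data):
    """Product form: P(x_1 | theta) * ... * P(x_n | theta) for 0/1 observations."""
    out = 1.0
    for x in data:
        out *= theta if x == 1 else 1 - theta
    return out

def log_likelihood(theta, data):
    """Sum form: log P(x_1 | theta) + ... + log P(x_n | theta)."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

data = [1] * 1400 + [0] * 600             # hypothetical sample: 1400 heads in 2000 tosses
grid = [i / 100 for i in range(1, 100)]   # candidate values 0.01, ..., 0.99

theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

print(likelihood(0.7, data))       # 0.0, the product has underflowed
print(log_likelihood(0.7, data))   # a finite negative number
print(theta_hat)                   # 0.7, the sample proportion of heads
```

The grid search is only for illustration; in practice the maximizer is usually found analytically or with a numerical optimizer.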

There are two main reasons why this hypothesis is practically useful with the central limit theorem (CLT):

  1. Even if the sample originates from a complex non-Gaussian distribution, it can be well-approximated because the CLT allows it to be simplified to a Gaussian distribution ("for a large number of observable samples, the sum of many random variables will have an approximately normal distribution").
  2. The second reason is that the model's accuracy depends on the simplicity and representational power of the model unit, as well as the data quality. The simplicity of the unit makes it easy to interpret and scale, while the representational power and scalability improve model accuracy. In a deep neural network, for instance, each neuron is simple yet powerful in representation, layer by layer, capturing more complex features to enhance model accuracy.

References

  1. Clauset, Aaron (2011). "A brief primer on probability distributions" (PDF). Santa Fe Institute. Archived from the original (PDF) on 2012-01-20. Retrieved 2011-11-29.
  2. Stephanie (2016-05-11). "IID Statistics: Independent and Identically Distributed Definition and Examples". Statistics How To. Retrieved 2021-12-09.
  3. Hampel, Frank (1998). "Is statistics too difficult?". Canadian Journal of Statistics. 26 (3): 497–513. doi:10.2307/3315772. hdl:20.500.11850/145503. JSTOR 3315772. S2CID 53117661. (§8)
  4. Blum, J. R.; Chernoff, H.; Rosenblatt, M.; Teicher, H. (1958). "Central Limit Theorems for Interchangeable Processes". Canadian Journal of Mathematics. 10: 222–229. doi:10.4153/CJM-1958-026-0. S2CID 124843240.
  5. Cover, T. M.; Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience. pp. 57–58. ISBN 978-0-471-24195-9.
  6. Casella & Berger 2002, Theorem 1.5.10.
