In probability theory and statistics, a collection of random variables is independent and identically distributed (''i.i.d.'', ''iid'', or ''IID'') if each random variable has the same probability distribution as the others and all are mutually independent. IID was first defined in statistics and finds application in many fields, such as data mining and signal processing.

Introduction

Statistics commonly deals with random samples. A random sample can be thought of as a set of objects that are chosen randomly. More formally, it is "a sequence of independent, identically distributed (IID) random data points." In other words, the terms ''random sample'' and ''IID'' are synonymous. In statistics, "''random sample''" is the typical terminology, but in probability, it is more common to say "IID."

* Identically distributed means that there are no overall trends: the distribution does not fluctuate, and all items in the sample are taken from the same probability distribution.
* Independent means that the sample items are all independent events. In other words, they are not connected to each other in any way; knowledge of the value of one variable gives no information about the value of another, and vice versa.

Application

Independent and identically distributed random variables are often used as an assumption, which tends to simplify the underlying mathematics. In practical applications of statistical modeling, however, this assumption may or may not be realistic.

The i.i.d. assumption is also used in the central limit theorem, which states that the probability distribution of the suitably normalized sum (or average) of i.i.d. variables with finite variance approaches a normal distribution.

The i.i.d. assumption frequently arises in the context of sequences of random variables. Then, "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the $n$th random variable is a function of the previous random variable in the sequence (for a first-order Markov sequence).

An i.i.d. sequence does not imply that the probabilities for all elements of the sample space or event space must be the same. For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.

In signal and image processing, the notion of transformation to i.i.d. implies two specifications, the "i.d." part and the "i." part:

* i.d.: The signal level must be balanced on the time axis.
* i.: The signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).
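The loaded-dice remark and the central limit theorem statement above can be illustrated together. The following Python sketch is only an illustration under stated assumptions: the die weights (face 6 three times as likely as each other face), the sample sizes, and the function names are hypothetical choices, not taken from the text. It draws i.i.d. throws of a biased die, standardizes many sample means, and checks that they behave roughly like a standard normal variable.

```python
import random
import statistics

# Hypothetical loaded die: face 6 is three times as likely as each other face.
FACES = [1, 2, 3, 4, 5, 6]
WEIGHTS = [1, 1, 1, 1, 1, 3]

def sample_mean(rng, n):
    """Mean of n i.i.d. throws of the loaded die."""
    return statistics.fmean(rng.choices(FACES, weights=WEIGHTS, k=n))

def standardized_means(n, trials, seed=0):
    """Standardize `trials` sample means using the die's exact mean and variance."""
    total = sum(WEIGHTS)
    mu = sum(f * w for f, w in zip(FACES, WEIGHTS)) / total
    var = sum((f - mu) ** 2 * w for f, w in zip(FACES, WEIGHTS)) / total
    rng = random.Random(seed)
    scale = (var / n) ** 0.5  # standard deviation of the mean of n throws
    return [(sample_mean(rng, n) - mu) / scale for _ in range(trials)]

z = standardized_means(n=200, trials=5000)
# For a standard normal, about 68.3% of the mass lies within one standard deviation.
within_one_sd = sum(abs(v) <= 1 for v in z) / len(z)
print(f"mean={statistics.fmean(z):.3f}  sd={statistics.pstdev(z):.3f}  "
      f"within 1 sd={within_one_sd:.3f}")
```

Every throw uses the same biased distribution and is drawn independently, so the sequence is i.i.d. even though the outcomes are far from uniform; the approximate normality of the standardized means is what the central limit theorem would lead one to expect.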

Definition

Definition for two random variables

Suppose that the random variables $X$ and $Y$ are defined to assume values in $I \subseteq \mathbb{R}$. Let $F_X(x) = \operatorname{P}(X\leq x)$ and $F_Y(y) = \operatorname{P}(Y\leq y)$ be the cumulative distribution functions of $X$ and $Y$, respectively, and denote their joint cumulative distribution function by $F_{X,Y}(x,y) = \operatorname{P}(X\leq x \land Y\leq y)$.

Two random variables $X$ and $Y$ are independent if and only if $F_{X,Y}(x,y) = F_X(x) \cdot F_Y(y)$ for all $x,y \in I$. (For the simpler case of events, two events $A$ and $B$ are independent if and only if $P(A\land B) = P(A) \cdot P(B)$.)

Two random variables $X$ and $Y$ are identically distributed if and only if $F_X(x)=F_Y(x)$ for all $x \in I$.

Two random variables $X$ and $Y$ are i.i.d. if they are independent ''and'' identically distributed, i.e. if and only if

$$
\begin{aligned}
F_X(x) &= F_Y(x) && \forall x \in I, \\
F_{X,Y}(x,y) &= F_X(x) \cdot F_Y(y) && \forall x,y \in I.
\end{aligned}
$$
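The factorization condition for independence can be checked empirically at a single point. The Python sketch below is a minimal illustration, assuming uniform samples on [0, 1) and the hypothetical helper name `empirical_cdfs`; it compares the empirical joint cumulative distribution function with the product of the marginals, once for an independent pair and once for a dependent pair that has the same marginals.

```python
import random

def empirical_cdfs(pairs, x, y):
    """Return empirical F_X(x), F_Y(y) and F_{X,Y}(x, y) from (x_i, y_i) samples."""
    n = len(pairs)
    f_x = sum(a <= x for a, _ in pairs) / n
    f_y = sum(b <= y for _, b in pairs) / n
    f_xy = sum(a <= x and b <= y for a, b in pairs) / n
    return f_x, f_y, f_xy

rng = random.Random(1)
n = 100_000
# Independent pair: two separate draws from the same uniform distribution.
iid_pairs = [(rng.random(), rng.random()) for _ in range(n)]
# Dependent pair with identical marginals: Y is simply a copy of X.
copied_pairs = [(u, u) for u, _ in iid_pairs]

fx, fy, fxy = empirical_cdfs(iid_pairs, 0.3, 0.6)
gx, gy, gxy = empirical_cdfs(copied_pairs, 0.3, 0.6)
print(f"i.i.d.: F_XY={fxy:.3f}  F_X*F_Y={fx * fy:.3f}")
print(f"copied: F_XY={gxy:.3f}  F_X*F_Y={gx * gy:.3f}")
```

In the copied case both variables are identically distributed, yet the joint function does not factorize, which shows that identical distribution alone does not make two variables i.i.d.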

Definition for more than two random variables

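The definition extends naturally to more than two random variables. We say that $n$ random variables $X_1,\ldots,X_n$ are i.i.d. if they are independent ''and'' identically distributed, i.e. if and only if

$$
\begin{aligned}
F_{X_1}(x) &= F_{X_k}(x) && \forall k \in \{1,\ldots,n\} \text{ and } \forall x \in I, \\
F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) &= F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n) && \forall x_1,\ldots,x_n \in I,
\end{aligned}
$$

where $F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \operatorname{P}(X_1\leq x_1 \land \ldots \land X_n\leq x_n)$ denotes the joint cumulative distribution function of $X_1,\ldots,X_n$.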

Examples

Example 1

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the gambler's fallacy).
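The claim about streaks can be explored with a short simulation. The Python sketch below makes several assumptions that are not in the text: a single-zero wheel with 18 red, 18 black and 1 green pocket, a streak length of 5 instead of 20 (so that streaks occur often enough to be counted), and the hypothetical function name `black_after_red_streak`.

```python
import random

# Assumed wheel layout: 18 red, 18 black, 1 green pocket.
POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"]

def black_after_red_streak(spins, streak):
    """Estimate P(black) overall and P(black | previous `streak` spins were red)."""
    overall = sum(s == "black" for s in spins) / len(spins)
    hits = total = 0
    run = 0  # number of consecutive reds immediately before the current spin
    for s in spins:
        if run >= streak:
            total += 1
            hits += s == "black"
        run = run + 1 if s == "red" else 0
    return overall, hits / total

rng = random.Random(42)
spins = rng.choices(POCKETS, k=500_000)  # independent draws from one fixed distribution
overall, after_streak = black_after_red_streak(spins, streak=5)
print(f"P(black)={overall:.3f}  P(black | 5 reds in a row)={after_streak:.3f}")
```

Because the spins are generated independently from the same distribution, the two estimates should agree up to sampling noise.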

Example 2

Toss a coin 10 times and write down the results into variables $A_1,\ldots,A_{10}$.

# Independent: Each outcome $A_i$ will not affect any other outcome $A_j$ (for $i\neq j$ from 1 to 10), which means the variables $A_1,\ldots,A_{10}$ are independent of each other.
# Identically distributed: Regardless of whether the coin is fair (with a probability of 1/2 for heads) or biased, as long as the same coin is used for each flip, the probability of getting heads remains consistent across all flips.

Such a sequence of i.i.d. variables is also called a Bernoulli process.

Example 3

Roll a die 10 times and save the results into variables $A_1,\ldots,A_{10}$.

# Independent: Each outcome of the die roll will not affect the next one, which means the 10 variables are independent from each other.
# Identically distributed: Regardless of whether the die is fair or weighted, each roll will have the same probability of seeing each result as every other roll. In contrast, rolling 10 different dice, some of which are weighted and some of which are not, would not produce i.i.d. variables.

Example 4

Choose a card from a standard deck of cards containing 52 cards, then place the card back in the deck. Repeat this 52 times. Observe when a king appears.

# Independent: Each observation will not affect the next one, which means the 52 results are independent from each other. In contrast, if each card that is drawn is kept out of the deck, subsequent draws would be affected by it (drawing one king would make drawing a second king less likely), and the observations would not be independent.
# Identically distributed: Because each drawn card is returned to the deck, the probability of drawing a king is 4/52 on every draw, which means the distribution is identical each time.
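The contrast between drawing with and without replacement can be made exact with a small calculation. The following Python sketch uses exact fractions; the function name `p_second_king_given_first_king` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction

DECK_SIZE = 52
KINGS = 4

def p_second_king_given_first_king(replace):
    """Exact probability that the second draw is a king, given the first was a king."""
    if replace:
        # The king goes back into the deck, so the composition is unchanged.
        return Fraction(KINGS, DECK_SIZE)
    # The king stays out: one fewer king and one fewer card.
    return Fraction(KINGS - 1, DECK_SIZE - 1)

p_first = Fraction(KINGS, DECK_SIZE)
with_replacement = p_second_king_given_first_king(replace=True)
without_replacement = p_second_king_given_first_king(replace=False)
print(p_first, with_replacement, without_replacement)  # 1/13 1/13 1/17
```

With replacement, the conditional probability equals the unconditional one (4/52 = 1/13), as independence requires; without replacement it drops to 3/51 = 1/17.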

Generalizations

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

Exchangeable random variables

The most general notion that shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti. Exchangeability means that while variables may not be independent, future ones behave like past ones. Formally, any value of a finite sequence is as likely as any permutation of those values; that is, the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization. For example, sampling without replacement is not independent, but is exchangeable.
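The statement about sampling without replacement can be verified exactly on a small case. The Python sketch below assumes a hypothetical urn with 3 red and 2 blue balls; the urn contents and the function name `sequence_probability` are illustrative choices, not part of the text.

```python
from fractions import Fraction
from itertools import permutations

def sequence_probability(seq, reds, blues):
    """Exact probability of drawing the colour sequence `seq` without replacement."""
    p = Fraction(1)
    for colour in seq:
        total = reds + blues
        if colour == "R":
            p *= Fraction(reds, total)
            reds -= 1
        else:
            p *= Fraction(blues, total)
            blues -= 1
    return p

# Hypothetical urn: 3 red and 2 blue balls, three draws without replacement.
# Exchangeable: every ordering of two reds and one blue is equally likely.
probs = {seq: sequence_probability(seq, 3, 2) for seq in set(permutations("RRB"))}
print(probs)

# Not independent: P(R, R) differs from P(R) * P(R).
p_rr = sequence_probability("RR", 3, 2)
p_r = Fraction(3, 5)
print(p_rr, p_r * p_r)  # 3/10 9/25
```

All three orderings have probability 1/5, so the joint distribution is invariant under permutation, while the probability of two reds in a row (3/10) is not the product of the marginal probabilities (9/25).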

Lévy process

In stochastic calculus, i.i.d. variables are thought of as a discrete-time Lévy process: each variable gives how much one changes from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. This can be generalized to include continuous-time ''Lévy processes'', and many Lévy processes can be seen as limits of i.i.d. variables. For instance, the Wiener process is the limit of the Bernoulli process.
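The limiting relationship can be sketched numerically. The example below assumes the usual construction in which the Bernoulli outcomes are recoded as i.i.d. steps of +1 or -1 and the partial sum of $n$ steps is rescaled by $1/\sqrt{n}$; the step count, number of paths, and function name are hypothetical choices. Under that assumption, the endpoint of the rescaled walk should behave approximately like the value of a Wiener process at time 1, which has mean 0 and variance 1.

```python
import random
import statistics

def scaled_walk_endpoint(rng, n):
    """Sum n i.i.d. +/-1 steps and scale by 1/sqrt(n), approximating W(1)."""
    steps = rng.choices((-1, 1), k=n)
    return sum(steps) / n ** 0.5

rng = random.Random(7)
endpoints = [scaled_walk_endpoint(rng, n=400) for _ in range(5000)]
# A Wiener process has W(1) ~ Normal(0, 1), so mean near 0 and variance near 1 are expected.
print(f"mean={statistics.fmean(endpoints):.3f}  "
      f"variance={statistics.pvariance(endpoints):.3f}")
```

This only checks the distribution at a single time point; it is not a full demonstration of convergence of the whole path.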

In machine learning

Machine learning (ML) involves learning statistical relationships within data. To train ML models effectively, it is crucial to use data that is broadly generalizable. If the training data is insufficiently representative of the task, the model's performance on new, unseen data may be poor.

The i.i.d. hypothesis allows for a significant reduction in the number of individual cases required in the training sample, simplifying optimization calculations. In optimization problems, the assumption of independent and identical distribution simplifies the calculation of the likelihood function. Due to this assumption, the likelihood function can be expressed as:

$$l(\theta) = P(x_1, x_2, x_3, \ldots, x_n, \theta) = P(x_1, \theta) P(x_2, \theta) P(x_3, \theta) \cdots P(x_n, \theta)$$

To maximize the probability of the observed event, the logarithm of the likelihood is taken and maximized over the parameter $\theta$. Specifically, one computes:

$$\mathop{\rm argmax}\limits_\theta \log(l(\theta))$$

where

$$\log(l(\theta)) = \log(P(x_1, \theta)) + \log(P(x_2, \theta)) + \log(P(x_3, \theta)) + \cdots + \log(P(x_n, \theta))$$
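The product and sum forms of the likelihood can be compared on a concrete model. The Python sketch below is an illustration under assumptions that are not in the text: the data are i.i.d. Bernoulli observations with an arbitrarily chosen true parameter of 0.7, and the maximizer is found by a coarse grid search rather than by any particular optimizer.

```python
import math
import random

def likelihood(theta, xs):
    """Product form: P(x_1, theta) * ... * P(x_n, theta) for Bernoulli data."""
    return math.prod(theta if x else 1.0 - theta for x in xs)

def log_likelihood(theta, xs):
    """Sum form: log P(x_1, theta) + ... + log P(x_n, theta)."""
    return sum(math.log(theta if x else 1.0 - theta) for x in xs)

rng = random.Random(3)
true_theta = 0.7  # illustrative value, not from the text
xs = [rng.random() < true_theta for _ in range(2000)]

# Coarse grid search for the maximizer of the log-likelihood.
grid = [k / 100 for k in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, xs))

print(f"argmax on grid = {theta_hat:.2f}, sample mean = {sum(xs) / len(xs):.3f}")
# The raw product of 2000 probabilities underflows to 0.0 in double precision.
print(f"raw product at theta_hat = {likelihood(theta_hat, xs)}")
```

Besides turning the product into a sum, working with logarithms avoids the floating-point underflow that the raw product suffers for even moderately large samples, while the maximizer stays the same because the logarithm is monotonic.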

Computers are very efficient at performing multiple additions, but not as efficient at performing multiplications. This simplification enhances computational efficiency. The log transformation, in the process of maximizing, converts many exponential functions into linear functions.

There are two main reasons why this hypothesis is practically useful with the central limit theorem (CLT):

# Even if the sample originates from a complex non-Gaussian distribution, sums and averages of many i.i.d. observations can be well approximated, because the CLT allows their distribution to be simplified to a Gaussian distribution.
# The second reason is that the model's accuracy depends on the simplicity and representational power of the model unit, as well as the data quality. The simplicity of the unit makes it easy to interpret and scale, while the representational power and scalability improve model accuracy. In a deep neural network, for instance, each neuron is simple yet powerful in representation, layer by layer, capturing more complex features to enhance model accuracy.