In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usually abbreviated as ''i.i.d.'', ''iid'', or ''IID''. IID was first defined in statistics and finds application in different fields such as data mining and signal processing.

Introduction

In statistics, we commonly deal with random samples. A random sample can be thought of as a set of objects that are chosen randomly. More formally, it is “a sequence of independent, identically distributed (IID) random variables”. In other words, the terms ''random sample'' and ''IID'' are essentially synonymous. In statistics, “random sample” is the usual term, but in probability it is more common to say “IID”.
* Identically distributed means that there are no overall trends: the distribution does not fluctuate, and all items in the sample are taken from the same probability distribution.
* Independent means that the sample items are all independent events. They are not connected to each other in any way: knowledge of the value of one variable gives no information about the value of the other, and vice versa.

Application

Independent and identically distributed random variables are often used as an assumption, which tends to simplify the underlying mathematics. In practical applications of statistical modeling, however, the assumption may or may not be realistic.

The i.i.d. assumption is also used in the central limit theorem, which states that the probability distribution of the suitably normalized sum (or average) of i.i.d. variables with finite variance approaches a normal distribution.

Often the i.i.d. assumption arises in the context of sequences of random variables. Then "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the ''n''th random variable is a function of the previous random variable in the sequence (for a first order Markov sequence).

An i.i.d. sequence does not imply that the probabilities for all elements of the sample space or event space must be the same. For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.
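The two points above, that a loaded die still yields an i.i.d. sequence and that averages of i.i.d. variables are approximately normal, can be illustrated with a small simulation. This is only a sketch: the die weights, the number of rolls, and the seed are assumptions chosen for the example and do not come from any particular source.

```python
import random
import statistics

# Hypothetical loaded die: face 6 is three times as likely as any other face.
FACES = [1, 2, 3, 4, 5, 6]
WEIGHTS = [1, 1, 1, 1, 1, 3]

N_ROLLS = 200         # rolls averaged in one experiment
N_EXPERIMENTS = 5000  # number of independent experiments


def sample_mean(rng, n_rolls):
    """Average of n_rolls i.i.d. throws of the loaded die."""
    rolls = rng.choices(FACES, weights=WEIGHTS, k=n_rolls)
    return sum(rolls) / n_rolls


# Exact mean and variance of a single throw, computed from the weights.
total = sum(WEIGHTS)
mu = sum(f * w for f, w in zip(FACES, WEIGHTS)) / total
var = sum((f - mu) ** 2 * w for f, w in zip(FACES, WEIGHTS)) / total

rng = random.Random(0)  # fixed seed so the run is reproducible
means = [sample_mean(rng, N_ROLLS) for _ in range(N_EXPERIMENTS)]

# The central limit theorem suggests means ~ Normal(mu, var / N_ROLLS).
predicted_sd = (var / N_ROLLS) ** 0.5
print("mean: predicted", mu, "observed", statistics.fmean(means))
print("sd:   predicted", predicted_sd, "observed", statistics.pstdev(means))
```

Although a single throw is heavily biased toward 6, the observed mean and standard deviation of the averages should lie close to the values predicted from the normal approximation.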

Definition

Definition for two random variables

Suppose that the random variables $X$ and $Y$ are defined to assume values in $I \subseteq \mathbb{R}$. Let $F_X(x) = \operatorname{P}(X\leq x)$ and $F_Y(y) = \operatorname{P}(Y\leq y)$ be the cumulative distribution functions of $X$ and $Y$, respectively, and denote their joint cumulative distribution function by $F_{X,Y}(x,y) = \operatorname{P}(X\leq x \land Y\leq y)$.

Two random variables $X$ and $Y$ are identically distributed if and only if $F_X(x)=F_Y(x) \, \forall x \in I$.

Two random variables $X$ and $Y$ are independent if and only if $F_{X,Y}(x,y) = F_{X}(x) \cdot F_{Y}(y) \, \forall x,y \in I$.

Two random variables $X$ and $Y$ are i.i.d. if they are independent ''and'' identically distributed, i.e. if and only if

$$F_X(x) = F_Y(x) \quad \forall x \in I$$

$$F_{X,Y}(x,y) = F_X(x) \cdot F_Y(y) \quad \forall x,y \in I$$
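For discrete random variables, the two conditions can be checked exactly by enumeration. The following sketch assumes a hypothetical loaded die and compares two cases: two separate throws, and one throw read twice. The helper names are illustrative only. Because the variables take integer values, it is enough to test the cumulative distribution functions at the integers 1 to 6.

```python
from fractions import Fraction
from itertools import product

# Assumed pmf of one throw: face 6 has probability 3/8, the others 1/8 each.
PMF = {face: Fraction(1, 8) for face in range(1, 6)}
PMF[6] = Fraction(3, 8)


def cdf(joint, axis, t):
    """Marginal CDF of coordinate `axis` (0 for X, 1 for Y) at t."""
    return sum(p for outcome, p in joint.items() if outcome[axis] <= t)


def joint_cdf(joint, x, y):
    """Joint CDF P(X <= x and Y <= y)."""
    return sum(p for (a, b), p in joint.items() if a <= x and b <= y)


def is_identically_distributed(joint, points):
    return all(cdf(joint, 0, t) == cdf(joint, 1, t) for t in points)


def is_independent(joint, points):
    return all(
        joint_cdf(joint, x, y) == cdf(joint, 0, x) * cdf(joint, 1, y)
        for x, y in product(points, repeat=2)
    )


points = range(1, 7)

# Case 1: X and Y are two separate throws, so the joint pmf is a product.
two_throws = {(x, y): PMF[x] * PMF[y] for x, y in product(PMF, repeat=2)}
# Case 2: Y is a copy of X, so all mass sits on the diagonal.
same_throw = {(x, x): p for x, p in PMF.items()}

print(is_identically_distributed(two_throws, points),
      is_independent(two_throws, points))   # True True
print(is_identically_distributed(same_throw, points),
      is_independent(same_throw, points))   # True False
```

The second case shows that identical distribution alone does not give an i.i.d. pair: the two variables share a distribution but are completely dependent.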

Definition for more than two random variables

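The definition extends naturally to more than two random variables. We say that $n$ random variables $X_1,\ldots,X_n$ are i.i.d. if they are independent ''and'' identically distributed, i.e. if and only if

$$F_{X_1}(x) = F_{X_k}(x) \quad \forall k \in \{1,\ldots,n\} \text{ and } \forall x \in I$$

$$F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n) \quad \forall x_1,\ldots,x_n \in I$$

where $F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \operatorname{P}(X_1\leq x_1 \land \ldots \land X_n\leq x_n)$ denotes the joint cumulative distribution function of $X_1,\ldots,X_n$.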

Definition for independence

In probability theory, two events A and B are called independent if and only if P(A and B) = P(A)P(B). In the following, P(AB) is short for P(A and B).

Suppose A and B are two events of an experiment. If P(A) > 0, the conditional probability P(B | A) = P(AB) / P(A) is defined. In general, the occurrence of A affects the probability of B. Only when the occurrence of A has no effect on the occurrence of B do we have P(B | A) = P(B), which is equivalent to P(AB) = P(A)P(B).

Note: if P(A) > 0 and P(B) > 0, then A and B cannot be both independent and mutually exclusive. Independent events with positive probability must be able to occur together, since P(AB) = P(A)P(B) > 0, while mutually exclusive events with positive probability must be dependent, since P(AB) = 0.

Suppose A, B, and C are three events. If P(AB) = P(A)P(B), P(BC) = P(B)P(C), P(AC) = P(A)P(C), and P(ABC) = P(A)P(B)P(C) are all satisfied, then the events A, B, and C are mutually independent.

More generally, given n events A₁, A₂, ..., Aₙ, if for every choice of 2, 3, ..., n of these events the probability of their intersection equals the product of their individual probabilities, then the events A₁, A₂, ..., Aₙ are mutually independent.
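The condition on the triple intersection is not redundant. As a minimal sketch, the following enumeration uses two tosses of a fair coin, with events chosen purely for illustration, to exhibit three events that satisfy every pairwise product rule but fail the rule for all three together.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: two tosses of a fair coin, each outcome with probability 1/4.
OMEGA = list(product("HT", repeat=2))


def prob(event):
    """Probability of an event, given as a set of equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))


def mutually_independent(events):
    """Check the product rule for every sub-collection of 2 or more events."""
    for size in range(2, len(events) + 1):
        for group in combinations(events, size):
            intersection = set.intersection(*group)
            expected = Fraction(1)
            for event in group:
                expected *= prob(event)
            if prob(intersection) != expected:
                return False
    return True


A = {w for w in OMEGA if w[0] == "H"}   # first toss is heads
B = {w for w in OMEGA if w[1] == "H"}   # second toss is heads
C = {w for w in OMEGA if w[0] == w[1]}  # both tosses agree

print(mutually_independent([A, B]))     # True
print(mutually_independent([A, C]))     # True
print(mutually_independent([B, C]))     # True
print(mutually_independent([A, B, C]))  # False: P(ABC) = 1/4, not 1/8
```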

Examples

Example 1

* A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the Gambler's fallacy).
* A sequence of fair or loaded dice rolls is i.i.d.
* A sequence of fair or unfair coin flips is i.i.d.
* In signal processing and image processing, the notion of transformation to i.i.d. implies two specifications, the "i.d." part and the "i." part: (i.d.) the signal level must be balanced on the time axis; (i.) the signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).
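The roulette claim can be checked empirically. The sketch below assumes a European wheel with 18 red, 18 black and 1 green pocket, and a streak of three reds rather than twenty so that enough streaks occur. Both choices, along with the seed and the number of spins, are assumptions made for the illustration.

```python
import random

# Assumed wheel: European layout with 18 red, 18 black and 1 green pocket.
POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"]


def black_frequencies(n_spins=200_000, streak=3, seed=0):
    """Return (overall, after_streak) relative frequencies of black.

    after_streak only counts spins that directly follow `streak`
    consecutive reds.
    """
    rng = random.Random(seed)
    spins = [rng.choice(POCKETS) for _ in range(n_spins)]

    overall = spins.count("black") / n_spins

    followers = [
        spins[i]
        for i in range(streak, n_spins)
        if all(s == "red" for s in spins[i - streak:i])
    ]
    after_streak = followers.count("black") / len(followers)
    return overall, after_streak


overall, after_streak = black_frequencies()
print(f"black, all spins:        {overall:.3f}")
print(f"black, after three reds: {after_streak:.3f}")
print(f"exact probability 18/37: {18 / 37:.3f}")
```

Both frequencies should be close to 18/37: a run of reds carries no information about the next spin.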

Example 2

Toss a coin 10 times and record how many times the coin lands on heads.
# Independent: the outcome of one toss does not affect the outcome of any other, so the 10 results are independent of each other.
# Identically distributed: if the coin is made of homogeneous material, the probability of heads is 0.5 on every toss, so the distribution is identical for each toss.

Example 3

Roll a die 10 times and record how many times the result is 1.
# Independent: the outcome of one roll does not affect the next one, so the 10 results are independent of each other.
# Identically distributed: if the die is made of homogeneous material, the probability of rolling a 1 is 1/6 on every roll, so the distribution is identical for each roll.

Example 4

Choose a card from a standard deck of 52 cards, then place the card back in the deck. Repeat this 52 times and record the number of times a King appears.
# Independent: the outcome of one draw does not affect the next one, so the 52 results are independent of each other.
# Identically distributed: because the card is returned to the deck after each draw, the probability of drawing a King is 4/52 every time, so the distribution is identical for each draw.

Generalizations

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

Exchangeable random variables

The most general notion that shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti. Exchangeability means that while variables may not be independent, future ones behave like past ones. Formally, any value of a finite sequence is as likely as any permutation of those values: the joint probability distribution is invariant under the symmetric group. This provides a useful generalization. For example, sampling without replacement is not independent, but is exchangeable.
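The sampling-without-replacement example can be verified exactly on a small case. The sketch below assumes a hypothetical urn with 2 red and 3 blue balls and enumerates every ordered pair of draws.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical urn: 2 red balls and 3 blue balls, drawn without replacement.
URN = ["R", "R", "B", "B", "B"]


def joint_pmf(n_draws):
    """Exact joint pmf of the colours of the first n_draws draws."""
    orderings = list(permutations(range(len(URN)), n_draws))
    pmf = {}
    for picks in orderings:
        colours = tuple(URN[i] for i in picks)
        pmf[colours] = pmf.get(colours, 0) + Fraction(1, len(orderings))
    return pmf


pmf = joint_pmf(2)

# Exchangeable: swapping the order of the draws leaves the pmf unchanged.
exchangeable = all(pmf[(a, b)] == pmf[(b, a)] for (a, b) in pmf)

# Not independent: the joint pmf is not the product of the marginals.
p_first_red = pmf[("R", "R")] + pmf[("R", "B")]
p_second_red = pmf[("R", "R")] + pmf[("B", "R")]
independent = pmf[("R", "R")] == p_first_red * p_second_red

print(exchangeable, independent)   # True False
print(p_first_red, p_second_red)   # 2/5 2/5
```

Each draw has the same marginal distribution, and the joint distribution does not depend on the order of the draws, yet P(red, red) = 1/10 differs from (2/5)(2/5) = 4/25.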

Lévy process

In stochastic calculus, i.i.d. variables are thought of as a discrete time Lévy process: each variable gives the increment of the process from one time to the next. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize this to include continuous time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables. For instance, the Wiener process arises as the limit of suitably rescaled sums of a Bernoulli process.
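The link between i.i.d. increments and the Wiener process can be sketched numerically. The example assumes symmetric steps of +1 or -1, which is one common way of turning Bernoulli trials into a random walk, and rescales the endpoint by the square root of the number of steps. The step count, path count and seed are arbitrary choices.

```python
import random
import statistics


def random_walk(n_steps, rng):
    """Partial sums of i.i.d. +1/-1 steps (a symmetric Bernoulli walk)."""
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice((-1, 1))  # one i.i.d. increment
        path.append(position)
    return path


def scaled_endpoints(n_steps=200, n_paths=2000, seed=0):
    """Endpoints S_n / sqrt(n), mimicking a Wiener process at time 1."""
    rng = random.Random(seed)
    return [
        random_walk(n_steps, rng)[-1] / n_steps ** 0.5
        for _ in range(n_paths)
    ]


endpoints = scaled_endpoints()
print("mean:    ", statistics.fmean(endpoints))      # close to 0
print("variance:", statistics.pvariance(endpoints))  # close to 1
```

A Wiener process at time 1 has mean 0 and variance 1, and the rescaled endpoints of the walk should have approximately these moments.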

In machine learning

Why assume the data in machine learning are independent and identically distributed?

Machine learning uses large quantities of previously collected data to deliver faster, more accurate results. The historical data used for training must therefore be representative of the overall population. If the data are not representative, the rules learned from them will be poor or wrong. Under the i.i.d. hypothesis, the number of individual cases needed in the training sample can be greatly reduced. The assumption also makes maximization easy to carry out mathematically, because it simplifies the calculation of the likelihood function in optimization problems. Because of the assumption of independence, the likelihood function can be written as:

$$l(\theta) = P(x_1, x_2, x_3, \ldots, x_n, \theta) = P(x_1, \theta) P(x_2, \theta) P(x_3, \theta) \cdots P(x_n, \theta)$$

Here $P(x_i, \theta)$ denotes the probability (or density) of the observation $x_i$ under the parameter value θ. To maximize the probability of the observed event, take the logarithm and maximize over the parameter θ. That is, compute

$$\operatorname*{arg\,max}_\theta \log(l(\theta))$$

where

$$\log(l(\theta)) = \log(P(x_1, \theta)) + \log(P(x_2, \theta)) + \log(P(x_3, \theta)) + \cdots + \log(P(x_n, \theta))$$
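As a minimal sketch of this computation, consider a hypothetical i.i.d. sample of coin flips, where θ is the probability of heads. The data, the grid search and the function names are assumptions made for the example. A practical implementation would typically use a closed-form estimate or a numerical optimizer instead of a grid.

```python
import math

# Hypothetical i.i.d. sample of coin flips (1 = heads, 0 = tails):
# 2000 observations, 70% of them heads.
SAMPLE = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 200


def likelihood(theta, data):
    """Product of the per-observation probabilities P(x_i, theta)."""
    result = 1.0
    for x in data:
        result *= theta if x == 1 else 1.0 - theta
    return result


def log_likelihood(theta, data):
    """Sum of the per-observation log-probabilities."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)


# Maximize over a grid of candidate parameters (a deliberately simple search).
grid = [k / 100 for k in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, SAMPLE))

print(theta_hat)                          # 0.7, the fraction of heads
print(likelihood(theta_hat, SAMPLE))      # 0.0: the raw product underflows
print(log_likelihood(theta_hat, SAMPLE))  # finite, roughly -1221.7
```

The maximizer coincides with the observed fraction of heads. The raw product of 2000 probabilities is too small to represent in double precision, whereas the sum of logarithms remains an ordinary finite number.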

A sum of logarithms is easier to handle than a long product: it can be differentiated term by term, and it is numerically more stable, since multiplying many small probabilities can underflow in floating-point arithmetic. This simplification is the main source of the gain in computational convenience. The log transformation also helps during maximization, because it turns exponential factors in the likelihood into simpler terms.

There are two reasons why this hypothesis is convenient in practical applications, the first of which relies on the central limit theorem.
# Even if the sample comes from a more complex non-Gaussian distribution, it can still be approximated well, because the central limit theorem reduces the problem to a Gaussian distribution: for a large number of observable samples, "the sum of many random variables will have an approximately normal distribution".
# The accuracy of the model depends on the simplicity and representative power of the model unit, as well as on the data quality. Simple units are easy to interpret and scale, and the combination of representative power and scaling out of the units improves model accuracy. In a deep neural network, for example, each neuron is very simple but has strong representative power, and the layers build up successively more complex features to improve model accuracy.