Beliefs depend on the available information. This idea is formalized in probability theory by conditioning. Conditional probabilities, conditional expectations, and conditional probability distributions are treated on three levels: discrete probabilities, probability density functions, and measure theory. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.


Conditioning on the discrete level

Example: A fair coin is tossed 10 times; the random variable ''X'' is the number of heads in these 10 tosses, and ''Y'' is the number of heads in the first 3 tosses. In spite of the fact that ''Y'' emerges before ''X'', it may happen that someone knows ''X'' but not ''Y''.


Conditional probability

Given that ''X'' = 1, the conditional probability of the event ''Y'' = 0 is

: \mathbb{P} (Y=0 \mid X=1) = \frac{\binom{7}{1}}{\binom{10}{1}} = 0.7.

More generally,

: \begin{align} \mathbb{P} (Y=0 \mid X=x) &= \frac{\binom{7}{x}}{\binom{10}{x}} = \frac{7! \, (10-x)!}{(7-x)! \, 10!} && x = 0, 1, 2, 3, 4, 5, 6, 7; \\ \mathbb{P} (Y=0 \mid X=x) &= 0 && x = 8, 9, 10. \end{align}

One may also treat the conditional probability as a random variable, a function of the random variable ''X'', namely,

: \mathbb{P} (Y=0 \mid X) = \begin{cases} \binom{7}{X} / \binom{10}{X} & X \leqslant 7, \\ 0 & X > 7. \end{cases}

The expectation of this random variable is equal to the (unconditional) probability,

: \mathbb{E} ( \mathbb{P} (Y=0 \mid X) ) = \sum_x \mathbb{P} (Y=0 \mid X=x) \, \mathbb{P} (X=x) = \mathbb{P} (Y=0),

namely,

: \sum_{x=0}^{7} \frac{\binom{7}{x}}{\binom{10}{x}} \cdot \frac{1}{2^{10}} \binom{10}{x} = \frac{1}{8},

which is an instance of the law of total probability \mathbb{E}(\mathbb{P}(A \mid X)) = \mathbb{P}(A). Thus, \mathbb{P} (Y=0 \mid X=1) may be treated as the value of the random variable \mathbb{P} (Y=0 \mid X) corresponding to ''X'' = 1. On the other hand, \mathbb{P} (Y=0 \mid X=1) is well-defined irrespective of other possible values of ''X''.
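These numbers are easy to check by simulation. The following is a minimal sketch, assuming Python with NumPy; the sample size of one million trials is an arbitrary choice made here for illustration. It estimates P ( ''Y'' = 0 | ''X'' = 1 ) and, via the law of total probability, P ( ''Y'' = 0 ).

 import numpy as np

 rng = np.random.default_rng(0)
 tosses = rng.integers(0, 2, size=(1_000_000, 10))  # ten fair-coin tosses per trial
 X = tosses.sum(axis=1)                              # heads among all 10 tosses
 Y = tosses[:, :3].sum(axis=1)                       # heads among the first 3 tosses

 print((Y[X == 1] == 0).mean())  # empirical P(Y=0 | X=1), close to 0.7
 print((Y == 0).mean())          # empirical P(Y=0), close to 1/8 = 0.125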


Conditional expectation

Given that ''X'' = 1, the conditional expectation of the random variable ''Y'' is

: \mathbb{E} (Y \mid X=1) = \tfrac{3}{10}.

More generally,

: \mathbb{E} (Y \mid X=x) = \frac{3}{10} x, \qquad x = 0, \ldots, 10.

(In this example it appears to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable, a function of the random variable ''X'', namely,

: \mathbb{E} (Y \mid X) = \frac{3}{10} X.

The expectation of this random variable is equal to the (unconditional) expectation of ''Y'',

: \mathbb{E} ( \mathbb{E} (Y \mid X) ) = \sum_x \mathbb{E} (Y \mid X=x) \, \mathbb{P} (X=x) = \mathbb{E} (Y),

namely,

: \sum_{x=0}^{10} \frac{3x}{10} \cdot \frac{1}{2^{10}} \binom{10}{x} = \frac{3}{2},

or simply

: \mathbb{E} \left( \frac{3X}{10} \right) = \frac{3}{10} \mathbb{E} (X) = \frac{3}{10} \cdot 5 = \frac{3}{2},

which is an instance of the law of total expectation \mathbb{E}( \mathbb{E} (Y \mid X) ) = \mathbb{E}(Y).

The random variable \mathbb{E}(Y \mid X) is the best predictor of ''Y'' given ''X''. That is, it minimizes the mean square error \mathbb{E}(Y - f(X))^2 on the class of all random variables of the form ''f''(''X''). This class of random variables remains intact if ''X'' is replaced, say, with 2''X''. Thus, \mathbb{E}(Y \mid 2X) = \mathbb{E}(Y \mid X). It does not mean that \mathbb{E} (Y \mid 2X) = \tfrac{3}{10} \times 2X; rather, \mathbb{E} (Y \mid 2X) = \tfrac{3}{20} \times 2X = \tfrac{3}{10} X. In particular, \mathbb{E} (Y \mid 2X=2) = \tfrac{3}{10}. More generally, \mathbb{E}(Y \mid g(X)) = \mathbb{E}(Y \mid X) for every function ''g'' that is one-to-one on the set of all possible values of ''X''. The values of ''X'' are irrelevant; what matters is the partition (denote it α_''X'')

: \Omega = \{ X = x_1 \} \uplus \{ X = x_2 \} \uplus \dots

of the sample space Ω into disjoint sets \{ X = x_n \}. (Here x_1, x_2, \ldots are all possible values of ''X''.) Given an arbitrary partition α of Ω, one may define the random variable E ( ''Y'' | α ). Still, E ( E ( ''Y'' | α ) ) = E ( ''Y'' ).

Conditional probability may be treated as a special case of conditional expectation. Namely, P ( ''A'' | ''X'' ) = E ( ''Y'' | ''X'' ) if ''Y'' is the indicator of ''A''. Therefore the conditional probability also depends on the partition α_''X'' generated by ''X'' rather than on ''X'' itself; P ( ''A'' | ''g''(''X'') ) = P ( ''A'' | ''X'' ) = P ( ''A'' | α ), α = α_''X'' = α_''g''(''X''). On the other hand, conditioning on an event ''B'' is well-defined, provided that \mathbb{P}(B) \neq 0, irrespective of any partition that may contain ''B'' as one of several parts.
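As a rough numerical check, the linear form E ( ''Y'' | ''X'' = ''x'' ) = 0.3''x'' and the law of total expectation can be verified by simulation. A minimal sketch, assuming NumPy; the sample size and the values of ''x'' probed below are illustrative choices.

 import numpy as np

 rng = np.random.default_rng(1)
 tosses = rng.integers(0, 2, size=(1_000_000, 10))
 X = tosses.sum(axis=1)
 Y = tosses[:, :3].sum(axis=1)

 for x in (1, 4, 7):
     print(x, Y[X == x].mean(), 0.3 * x)  # empirical E(Y | X=x) versus 3x/10
 print(Y.mean())                          # E(E(Y|X)) = E(Y) = 1.5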


Conditional distribution

Given ''X'' = ''x'', the conditional distribution of ''Y'' is

: \mathbb{P} ( Y=y \mid X=x ) = \frac{\binom{3}{y} \binom{7}{x-y}}{\binom{10}{x}} = \frac{\binom{x}{y} \binom{10-x}{3-y}}{\binom{10}{3}}

for 0 ≤ ''y'' ≤ min ( 3, ''x'' ). It is the hypergeometric distribution H ( ''x''; 3, 7 ), or equivalently, H ( 3; ''x'', 10 − ''x'' ). The corresponding expectation 0.3''x'', obtained from the general formula

: n \frac{R}{R+W}

for H ( ''n''; ''R'', ''W'' ), is nothing but the conditional expectation E ( ''Y'' | ''X'' = ''x'' ) = 0.3''x''.

Treating H ( ''X''; 3, 7 ) as a random distribution (a random vector in the four-dimensional space of all measures on \{0, 1, 2, 3\}), one may take its expectation, getting the unconditional distribution of ''Y'', the binomial distribution Bin ( 3, 0.5 ). This fact amounts to the equality

: \sum_{x=0}^{10} \mathbb{P} ( Y=y \mid X=x ) \, \mathbb{P} (X=x) = \mathbb{P} (Y=y) = \frac{1}{2^3} \binom{3}{y}

for ''y'' = 0, 1, 2, 3, which is an instance of the law of total probability.
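The hypergeometric form of the conditional distribution can be checked empirically. A minimal sketch, assuming NumPy and the standard-library function math.comb; conditioning on ''x'' = 4 below is an arbitrary illustrative choice.

 import numpy as np
 from math import comb

 rng = np.random.default_rng(2)
 tosses = rng.integers(0, 2, size=(1_000_000, 10))
 X = tosses.sum(axis=1)
 Y = tosses[:, :3].sum(axis=1)

 x = 4                                                             # condition on X = 4
 emp = np.bincount(Y[X == x], minlength=4) / (X == x).sum()        # empirical P(Y=y | X=4)
 exact = [comb(3, y) * comb(7, x - y) / comb(10, x) for y in range(4)]  # H(4; 3, 7)
 print(emp)
 print(exact)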


Conditioning on the level of densities

Example. A point of the sphere ''x''² + ''y''² + ''z''² = 1 is chosen at random according to the uniform distribution on the sphere. The random variables ''X'', ''Y'', ''Z'' are the coordinates of the random point. The joint density of ''X'', ''Y'', ''Z'' does not exist (since the sphere is of zero volume), but the joint density f_{X,Y} of ''X'', ''Y'' exists,

: f_{X,Y} (x,y) = \begin{cases} \frac{1}{2\pi\sqrt{1-x^2-y^2}} & \text{if } x^2+y^2<1, \\ 0 & \text{otherwise}. \end{cases}

(The density is non-constant because of a non-constant angle between the sphere and the plane.) The density of ''X'' may be calculated by integration,

: f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x,y) \, \mathrm{d}y = \int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}} \frac{\mathrm{d}y}{2\pi\sqrt{1-x^2-y^2}} ;

surprisingly, the result does not depend on ''x'' in (−1, 1),

: f_X(x) = \begin{cases} 0.5 & \text{for } -1<x<1, \\ 0 & \text{otherwise}, \end{cases}

which means that ''X'' is distributed uniformly on (−1, 1). The same holds for ''Y'' and ''Z'' (and in fact, for ''aX'' + ''bY'' + ''cZ'' whenever ''a''² + ''b''² + ''c''² = 1).

Example. If instead the point is chosen uniformly in the solid ball ''x''² + ''y''² + ''z''² ≤ 1, the joint density of ''X'', ''Y'', ''Z'' does exist,

: f_{X,Y,Z}(x,y,z) = \frac{3}{4\pi} \quad \text{for } x^2+y^2+z^2<1,

and the marginal density of ''X'' is obtained by a similar integration,

: f_X(x) = \iint_{y^2+z^2<1-x^2} \frac{3}{4\pi} \, \mathrm{d}y \, \mathrm{d}z = \frac{3}{4}\left(1-x^2\right) \quad \text{for } -1<x<1,

which, in contrast to the spherical case, is not constant.
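The perhaps surprising uniformity of the marginal of ''X'' in the spherical case is easy to observe by simulation. A minimal sketch, assuming NumPy; it uses the standard device of normalizing Gaussian vectors to sample the sphere uniformly, and the sample size is arbitrary.

 import numpy as np

 rng = np.random.default_rng(3)
 v = rng.normal(size=(1_000_000, 3))
 v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform points on the unit sphere
 X = v[:, 0]

 # If X is uniform on (-1, 1), each of these four equal bins gets about 25% of the points.
 print(np.histogram(X, bins=[-1, -0.5, 0, 0.5, 1])[0] / len(X))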


Conditional probability


Calculation

Given that ''X'' = 0.5, the conditional probability of the event ''Y'' ≤ 0.75 is the integral of the conditional density,

: f_{Y \mid X=0.5}(y) = \frac{f_{X,Y}(0.5, y)}{f_X(0.5)} = \begin{cases} \frac{1}{\pi\sqrt{0.75-y^2}} & \text{for } -\sqrt{0.75} < y < \sqrt{0.75}, \\ 0 & \text{otherwise}; \end{cases}

: \mathbb{P} (Y \le 0.75 \mid X=0.5) = \int_{-\infty}^{0.75} f_{Y \mid X=0.5}(y) \, \mathrm{d}y = \int_{-\sqrt{0.75}}^{0.75} \frac{\mathrm{d}y}{\pi\sqrt{0.75-y^2}} = \tfrac12 + \tfrac{1}{\pi} \arcsin \sqrt{0.75} = \tfrac56.

More generally,

: \mathbb{P} (Y \le y \mid X=x) = \tfrac12 + \tfrac{1}{\pi} \arcsin \frac{y}{\sqrt{1-x^2}}

for all ''x'' and ''y'' such that −1 < ''x'' < 1 (otherwise the denominator f_X(''x'') vanishes) and \textstyle -\sqrt{1-x^2} < y < \sqrt{1-x^2} (otherwise the conditional probability degenerates to 0 or 1). One may also treat the conditional probability as a random variable, a function of the random variable ''X'', namely,

: \mathbb{P} (Y \le y \mid X) = \begin{cases} 0 & \text{for } X^2 \ge 1-y^2 \text{ and } y<0, \\ \frac12 + \frac{1}{\pi} \arcsin \frac{y}{\sqrt{1-X^2}} & \text{for } X^2 < 1-y^2, \\ 1 & \text{for } X^2 \ge 1-y^2 \text{ and } y>0. \end{cases}

The expectation of this random variable is equal to the (unconditional) probability,

: \mathbb{E} ( \mathbb{P} (Y\le y \mid X) ) = \int_{-1}^{+1} \mathbb{P} (Y\le y \mid X=x) f_X(x) \, \mathrm{d}x = \mathbb{P} (Y\le y),

which is an instance of the law of total probability E ( P ( ''A'' | ''X'' ) ) = P ( ''A'' ).
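This conditional probability can be approximated by conditioning on a thin slab around ''X'' = 0.5, anticipating the limiting interpretation discussed below. A minimal sketch, assuming NumPy; the slab half-width ε = 0.01 and the sample size are arbitrary illustrative choices.

 import numpy as np

 rng = np.random.default_rng(4)
 v = rng.normal(size=(5_000_000, 3))
 v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform points on the unit sphere
 X, Y = v[:, 0], v[:, 1]

 eps = 0.01
 slab = np.abs(X - 0.5) < eps             # condition on X being near 0.5
 print((Y[slab] <= 0.75).mean(), 5 / 6)   # both close to 0.8333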


Interpretation

The conditional probability P ( ''Y'' ≤ 0.75 | ''X'' = 0.5 ) cannot be interpreted as P ( ''Y'' ≤ 0.75, ''X'' = 0.5 ) / P ( ''X'' = 0.5 ), since the latter gives 0/0. Accordingly, P ( ''Y'' ≤ 0.75 | ''X'' = 0.5 ) cannot be interpreted via empirical frequencies, since the exact value ''X'' = 0.5 has no chance to appear at random, not even once during an infinite sequence of independent trials. The conditional probability can be interpreted as a limit,

: \begin{align} \mathbb{P} (Y\le 0.75 \mid X=0.5) &= \lim_{\varepsilon\to0+} \mathbb{P} (Y\le 0.75 \mid 0.5-\varepsilon < X < 0.5+\varepsilon) \\ &= \lim_{\varepsilon\to0+} \frac{ \mathbb{P} (Y\le 0.75, \; 0.5-\varepsilon < X < 0.5+\varepsilon) }{ \mathbb{P} (0.5-\varepsilon < X < 0.5+\varepsilon) } \\ &= \lim_{\varepsilon\to0+} \frac{ \int_{0.5-\varepsilon}^{0.5+\varepsilon} \mathrm{d}x \int_{-\infty}^{0.75} f_{X,Y}(x,y) \, \mathrm{d}y }{ \int_{0.5-\varepsilon}^{0.5+\varepsilon} f_X(x) \, \mathrm{d}x }. \end{align}


Conditional expectation

The conditional expectation E ( ''Y'' | ''X'' = 0.5 ) is of little interest; it vanishes just by symmetry. It is more interesting to calculate E ( |''Z''| | ''X'' = 0.5 ), treating |''Z''| as a function of ''X'', ''Y'':

: \begin{align} |Z| &= h(X,Y) = \sqrt{1-X^2-Y^2}; \\ \mathbb{E} ( |Z| \mid X=0.5 ) &= \int_{-\infty}^{+\infty} h(0.5,y) f_{Y \mid X=0.5}(y) \, \mathrm{d}y = \int_{-\sqrt{0.75}}^{+\sqrt{0.75}} \sqrt{0.75-y^2} \cdot \frac{\mathrm{d}y}{\pi\sqrt{0.75-y^2}} = \frac{2}{\pi} \sqrt{0.75}. \end{align}

More generally,

: \mathbb{E} ( |Z| \mid X=x ) = \frac{2}{\pi} \sqrt{1-x^2}

for −1 < ''x'' < 1. One may also treat the conditional expectation as a random variable, a function of the random variable ''X'', namely,

: \mathbb{E} ( |Z| \mid X ) = \frac{2}{\pi} \sqrt{1-X^2}.

The expectation of this random variable is equal to the (unconditional) expectation of |''Z''|,

: \mathbb{E} ( \mathbb{E} ( |Z| \mid X ) ) = \int_{-1}^{+1} \mathbb{E} ( |Z| \mid X=x ) f_X(x) \, \mathrm{d}x = \mathbb{E} (|Z|),

namely,

: \int_{-1}^{+1} \frac{2}{\pi} \sqrt{1-x^2} \cdot \frac{\mathrm{d}x}{2} = \tfrac12,

which is an instance of the law of total expectation E ( E ( ''Y'' | ''X'' ) ) = E ( ''Y'' ).

The random variable E ( |''Z''| | ''X'' ) is the best predictor of |''Z''| given ''X''. That is, it minimizes the mean square error E ( |''Z''| − ''f''(''X'') )² on the class of all random variables of the form ''f''(''X''). Similarly to the discrete case, E ( |''Z''| | ''g''(''X'') ) = E ( |''Z''| | ''X'' ) for every measurable function ''g'' that is one-to-one on (−1, 1).
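A slab-conditioning simulation analogous to the one above reproduces both E ( |''Z''| | ''X'' = 0.5 ) = (2/π)√0.75 and E ( |''Z''| ) = 1/2. A minimal sketch, assuming NumPy; ε and the sample size are arbitrary choices.

 import numpy as np

 rng = np.random.default_rng(5)
 v = rng.normal(size=(5_000_000, 3))
 v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform points on the unit sphere
 X, Z = v[:, 0], v[:, 2]

 x, eps = 0.5, 0.01
 slab = np.abs(X - x) < eps
 print(np.abs(Z[slab]).mean())            # empirical E(|Z| | X near 0.5)
 print(2 / np.pi * np.sqrt(1 - x**2))     # (2/pi)*sqrt(1-x^2), about 0.5513
 print(np.abs(Z).mean())                  # law of total expectation: E|Z| = 0.5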


Conditional distribution

Given ''X'' = ''x'', the conditional distribution of ''Y'', given by the density f_{Y \mid X=x}(y), is the (rescaled) arcsin distribution; its cumulative distribution function is

: F_{Y \mid X=x} (y) = \mathbb{P} ( Y \le y \mid X = x ) = \frac12 + \frac{1}{\pi} \arcsin \frac{y}{\sqrt{1-x^2}}

for all ''x'' and ''y'' such that ''x''² + ''y''² < 1. The corresponding expectation of ''h''(''x'', ''Y'') is nothing but the conditional expectation E ( ''h''(''X'',''Y'') | ''X'' = ''x'' ). The mixture of these conditional distributions, taken for all ''x'' (according to the distribution of ''X''), is the unconditional distribution of ''Y''. This fact amounts to the equalities

: \begin{align} & \int_{-1}^{+1} f_{Y \mid X=x} (y) f_X(x) \, \mathrm{d}x = f_Y(y), \\ & \int_{-1}^{+1} F_{Y \mid X=x} (y) f_X(x) \, \mathrm{d}x = F_Y(y), \end{align}

the latter being the instance of the law of total probability mentioned above.


What conditioning is not

On the discrete level, conditioning is possible only if the condition is of nonzero probability (one cannot divide by zero). On the level of densities, conditioning on ''X'' = ''x'' is possible even though P ( ''X'' = ''x'' ) = 0. This success may create the illusion that conditioning is ''always'' possible. Regrettably, it is not, for several reasons presented below.


Geometric intuition: caution

The result P ( ''Y'' ≤ 0.75 | ''X'' = 0.5 ) = 5/6, mentioned above, is geometrically evident in the following sense. The points (''x'', ''y'', ''z'') of the sphere ''x''² + ''y''² + ''z''² = 1 satisfying the condition ''x'' = 0.5 form a circle ''y''² + ''z''² = 0.75 of radius \sqrt{0.75} on the plane ''x'' = 0.5. The inequality ''y'' ≤ 0.75 holds on an arc. The length of the arc is 5/6 of the length of the circle, which is why the conditional probability is equal to 5/6.

This successful geometric explanation may create the illusion that the following question is trivial.

: A point of a given sphere is chosen at random (uniformly). Given that the point lies on a given plane, what is its conditional distribution?

It may seem evident that the conditional distribution must be uniform on the given circle (the intersection of the given sphere and the given plane). Sometimes it really is, but in general it is not. In particular, ''Z'' is distributed uniformly on (−1, +1) and independent of the ratio ''Y''/''X''; thus, P ( ''Z'' ≤ 0.5 | ''Y''/''X'' ) = 0.75. On the other hand, the inequality ''z'' ≤ 0.5 holds on an arc of the circle ''x''² + ''y''² + ''z''² = 1, ''y'' = ''cx'' (for any given ''c''). The length of the arc is 2/3 of the length of the circle. However, the conditional probability is 3/4, not 2/3. This is a manifestation of the classical Borel paradox.

Another example. A random rotation of the three-dimensional space is a rotation by a random angle around a random axis. Geometric intuition suggests that the angle is independent of the axis and distributed uniformly. However, the latter is wrong; small values of the angle are less probable.
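The discrepancy can also be seen numerically. A minimal sketch, assuming NumPy; the ratio value ''c'' = 1 and the window width are arbitrary illustrative choices. Conditioning on the ratio ''Y''/''X'' being near ''c'' gives about 0.75, not the arc-length fraction 2/3.

 import numpy as np

 rng = np.random.default_rng(6)
 v = rng.normal(size=(5_000_000, 3))
 v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform points on the unit sphere
 X, Y, Z = v[:, 0], v[:, 1], v[:, 2]

 c, eps = 1.0, 0.01
 near = np.abs(Y / X - c) < eps   # condition on the ratio Y/X being near c
 print((Z[near] <= 0.5).mean())   # close to 0.75, although the arc fraction is 2/3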


The limiting procedure

Given an event ''B'' of zero probability, the formula \textstyle \mathbb{P} (A \mid B) = \mathbb{P} ( A \cap B ) / \mathbb{P} (B) is useless. However, one can try \textstyle \mathbb{P} (A \mid B) = \lim_{n\to\infty} \mathbb{P} ( A \cap B_n ) / \mathbb{P} (B_n) for an appropriate sequence of events ''B''_''n'' of nonzero probability such that ''B''_''n'' ↓ ''B'' (that is, \textstyle B_1 \supset B_2 \supset \dots and \textstyle B_1 \cap B_2 \cap \dots = B). One example is given above. Two more examples are Brownian bridge and Brownian excursion.

In the latter two examples the law of total probability is irrelevant, since only a single event (the condition) is given. By contrast, in the example above the law of total probability applies, since the event ''X'' = 0.5 is included in a family of events ''X'' = ''x'', where ''x'' runs over (−1, 1), and these events form a partition of the probability space.

In order to avoid paradoxes (such as Borel's paradox), the following important distinction should be taken into account. If a given event is of nonzero probability, then conditioning on it is well-defined (irrespective of any other events), as was noted above. By contrast, if the given event is of zero probability, then conditioning on it is ill-defined unless some additional input is provided. Wrong choice of this additional input leads to wrong conditional probabilities (expectations, distributions). In this sense, "''the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible''" (Kolmogorov).

The additional input may be (a) a symmetry (invariance group); (b) a sequence of events ''B''_''n'' such that ''B''_''n'' ↓ ''B'', P ( ''B''_''n'' ) > 0; (c) a partition containing the given event. Measure-theoretic conditioning (below) investigates Case (c), discloses its relation to (b) in general and to (a) when applicable.

Some events of zero probability are beyond the reach of conditioning. An example: let ''X''_''n'' be independent random variables distributed uniformly on (0, 1), and ''B'' the event "''X''_''n'' → 0 as ''n'' → ∞"; what about P ( ''X''_''n'' < 0.5 | ''B'' )? Does it tend to 1, or not? Another example: let ''X'' be a random variable distributed uniformly on (0, 1), and ''B'' the event "''X'' is a rational number"; what about P ( ''X'' = 1/''n'' | ''B'' )? The only answer is that, once again, the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.


Conditioning on the level of measure theory

Example. Let ''Y'' be a random variable distributed uniformly on (0, 1), and ''X'' = ''f''(''Y''), where ''f'' is a given function. Two cases are treated below: ''f'' = ''f''_1 and ''f'' = ''f''_2, where ''f''_1 is the continuous piecewise-linear function

: f_1(y) = \begin{cases} 3y & \text{for } 0 \le y \le 1/3, \\ 1.5(1-y) & \text{for } 1/3 \le y \le 2/3, \\ 0.5 & \text{for } 2/3 \le y \le 1, \end{cases}

and ''f''_2 is the Weierstrass function.


Geometric intuition: caution

Given ''X'' = 0.75, two values of ''Y'' are possible, 0.25 and 0.5. It may seem evident that both values are of conditional probability 0.5, just because one point is congruent to another point. However, this is an illusion; see below.


Conditional probability

The conditional probability P ( ''Y'' ≤ 1/3 | ''X'' ) may be defined as the best predictor of the indicator

: I = \begin{cases} 1 & \text{if } Y \le 1/3, \\ 0 & \text{otherwise}, \end{cases}

given ''X''. That is, it minimizes the mean square error E ( ''I'' − ''g''(''X'') )² on the class of all random variables of the form ''g''(''X'').

In the case ''f'' = ''f''_1 the corresponding function ''g'' = ''g''_1 may be calculated explicitly. Proof:

: \begin{align} \mathbb{E} ( I - g(X) )^2 & = \int_0^{1/3} (1-g(3y))^2 \, \mathrm{d}y + \int_{1/3}^{2/3} g^2 (1.5(1-y)) \, \mathrm{d}y + \int_{2/3}^1 g^2 (0.5) \, \mathrm{d}y \\ & = \int_0^1 (1-g(x))^2 \frac{\mathrm{d}x}{3} + \int_{0.5}^1 g^2(x) \frac{\mathrm{d}x}{1.5} + \frac13 g^2(0.5) \\ & = \frac13 \int_0^{0.5} (1-g(x))^2 \, \mathrm{d}x + \frac13 g^2(0.5) + \frac13 \int_{0.5}^1 \left( (1-g(x))^2 + 2g^2(x) \right) \mathrm{d}x ; \end{align}

it remains to note that (1 − ''a'')² + 2''a''² is minimal at ''a'' = 1/3. Thus,

: g_1(x) = \begin{cases} 1 & \text{for } 0 < x < 0.5, \\ 0 & \text{for } x = 0.5, \\ 1/3 & \text{for } 0.5 < x < 1. \end{cases}

Alternatively, the limiting procedure may be used,

: g_1(x) = \lim_{\varepsilon\to0+} \mathbb{P} ( Y \le 1/3 \mid x-\varepsilon \le X \le x+\varepsilon ),

giving the same result. Thus, P ( ''Y'' ≤ 1/3 | ''X'' ) = ''g''_1(''X''). The expectation of this random variable is equal to the (unconditional) probability, E ( P ( ''Y'' ≤ 1/3 | ''X'' ) ) = P ( ''Y'' ≤ 1/3 ), namely,

: 1 \cdot \mathbb{P} (X<0.5) + 0 \cdot \mathbb{P} (X=0.5) + \frac13 \cdot \mathbb{P} (X>0.5) = 1 \cdot \frac16 + 0 \cdot \frac13 + \frac13 \cdot \left( \frac16 + \frac13 \right) = \frac13,

which is an instance of the law of total probability E ( P ( ''A'' | ''X'' ) ) = P ( ''A'' ).

In the case ''f'' = ''f''_2 the corresponding function ''g'' = ''g''_2 probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically. Indeed, the space ''L''_2(Ω) of all square integrable random variables is a Hilbert space; the indicator ''I'' is a vector of this space; and random variables of the form ''g''(''X'') form a (closed, linear) subspace. The orthogonal projection of this vector onto this subspace is well-defined. It can be computed numerically, using finite-dimensional approximations to the infinite-dimensional Hilbert space.

Once again, the expectation of the random variable P ( ''Y'' ≤ 1/3 | ''X'' ) = ''g''_2(''X'') is equal to the (unconditional) probability, E ( P ( ''Y'' ≤ 1/3 | ''X'' ) ) = P ( ''Y'' ≤ 1/3 ), namely,

: \int_0^1 g_2 (f_2(y)) \, \mathrm{d}y = \tfrac13.

However, the Hilbert space approach treats ''g''_2 as an equivalence class of functions rather than an individual function. Measurability of ''g''_2 is ensured, but continuity (or even Riemann integrability) is not. The value ''g''_2(0.5) is determined uniquely, since the point 0.5 is an atom of the distribution of ''X''. Other values ''x'' are not atoms, thus, corresponding values ''g''_2(''x'') are not determined uniquely. Once again, "''the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible''" (Kolmogorov).

Alternatively, the same function ''g'' (be it ''g''_1 or ''g''_2) may be defined as the Radon–Nikodym derivative

: g = \frac{\mathrm{d}\nu}{\mathrm{d}\mu},

where the measures μ, ν are defined by

: \begin{align} \mu (B) &= \mathbb{P} ( X \in B ), \\ \nu (B) &= \mathbb{P} ( X \in B, \, Y \le \tfrac13 ) \end{align}

for all Borel sets B \subset \mathbb R. That is, μ is the (unconditional) distribution of ''X'', while ν is one third of its conditional distribution,

: \nu (B) = \mathbb{P} ( X \in B \mid Y \le \tfrac13 ) \, \mathbb{P} ( Y \le \tfrac13 ) = \tfrac13 \mathbb{P} ( X \in B \mid Y \le \tfrac13 ).

Both approaches (via the Hilbert space, and via the Radon–Nikodym derivative) treat ''g'' as an equivalence class of functions; two functions ''g'' and ''g′'' are treated as equivalent if ''g''(''X'') = ''g′''(''X'') almost surely. Accordingly, the conditional probability P ( ''Y'' ≤ 1/3 | ''X'' ) is treated as an equivalence class of random variables; as usual, two random variables are treated as equivalent if they are equal almost surely.
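The phrase "computed numerically" can be made concrete with a crude finite-dimensional approximation: restrict ''g'' to be piecewise constant on small bins of ''x'' and estimate each bin value from simulated data. A minimal sketch, assuming NumPy; the helper name f1, the bin edges, and the sample size are illustrative choices, and the same recipe would apply to ''f''_2. It recovers the values of ''g''_1 derived above.

 import numpy as np

 def f1(y):
     return np.where(y <= 1/3, 3*y, np.where(y <= 2/3, 1.5*(1 - y), 0.5))

 rng = np.random.default_rng(7)
 Y = rng.uniform(size=2_000_000)
 X = f1(Y)
 I = (Y <= 1/3)                                # indicator of the event Y <= 1/3

 # Piecewise-constant approximation of g(x) = P(Y <= 1/3 | X = x) on two narrow bins
 for lo, hi in [(0.20, 0.21), (0.70, 0.71)]:
     sel = (X >= lo) & (X < hi)
     print((lo, hi), I[sel].mean())            # about 1 on (0, 0.5) and 1/3 on (0.5, 1)
 print(I[X == 0.5].mean())                     # the atom x = 0.5 gives exactly 0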


Conditional expectation

The conditional expectation \mathbb{E}(Y \mid X) may be defined as the best predictor of ''Y'' given ''X''. That is, it minimizes the mean square error \mathbb{E}(Y-h(X))^2 on the class of all random variables of the form ''h''(''X'').

In the case ''f'' = ''f''_1 the corresponding function ''h'' = ''h''_1 may be calculated explicitly. Proof:

: \begin{align} \mathbb{E}( Y - h_1(X) )^2 &= \int_0^1 \left ( y - h_1 ( f_1(y) ) \right )^2 \, \mathrm{d}y \\ &= \int_0^{1/3} (y-h_1(3y))^2 \, \mathrm{d}y + \int_{1/3}^{2/3} \left( y-h_1( 1.5(1-y) ) \right)^2 \, \mathrm{d}y + \int_{2/3}^1 \left ( y - h_1(\tfrac12) \right)^2 \, \mathrm{d}y \\ &= \int_0^1 \left( \frac{x}{3} - h_1(x) \right)^2 \frac{\mathrm{d}x}{3} + \int_{0.5}^1 \left (1 - \frac{x}{1.5} - h_1(x) \right)^2 \frac{\mathrm{d}x}{1.5} + \frac13 h_1^2(\tfrac12) - \frac59 h_1(\tfrac12) + \frac{19}{81} \\ &= \frac13 \int_0^{0.5} \left( h_1(x) - \frac{x}{3} \right)^2 \, \mathrm{d}x + \tfrac13 h_1^2(\tfrac12) - \tfrac59 h_1(\tfrac12) + \tfrac{19}{81} + \frac13 \int_{0.5}^1 \left( \left( h_1(x) - \frac{x}{3} \right)^2 + 2 \left ( h_1(x) - 1 + \frac{x}{1.5} \right)^2 \right) \, \mathrm{d}x; \end{align}

it remains to note that

: \left (a-\frac{x}{3} \right )^2 + 2 \left (a-1+\frac{x}{1.5} \right )^2

is minimal at a = \tfrac{2-x}{3}, and \tfrac13 a^2 - \tfrac59 a is minimal at a = \tfrac56. Thus,

: h_1(x) = \begin{cases} \frac{x}{3} & \text{for } 0 < x < \frac12, \\ \frac56 & \text{for } x = \frac12, \\ \frac13 (2-x) & \text{for } \frac12 < x < 1. \end{cases}

Alternatively, the limiting procedure may be used,

: h_1(x) = \lim_{\varepsilon\to0+} \mathbb{E} ( Y \mid x-\varepsilon \leqslant X \leqslant x+\varepsilon ),

giving the same result. Thus, \mathbb{E}(Y \mid X) = h_1(X). The expectation of this random variable is equal to the (unconditional) expectation, \mathbb{E}(\mathbb{E}(Y \mid X)) = \mathbb{E}(Y), namely,

: \int_0^1 h_1(f_1(y)) \, \mathrm{d}y = \int_0^{1/6} \frac{3y}{3} \, \mathrm{d}y + \int_{1/6}^{1/3} \frac{2-3y}{3} \, \mathrm{d}y + \int_{1/3}^{2/3} \frac{2-1.5(1-y)}{3} \, \mathrm{d}y + \int_{2/3}^1 \frac56 \, \mathrm{d}y = \frac12,

which is an instance of the law of total expectation \mathbb{E}(\mathbb{E}(Y \mid X)) = \mathbb{E}(Y).

In the case ''f'' = ''f''_2 the corresponding function ''h'' = ''h''_2 probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically in the same way as ''g''_2 above, as the orthogonal projection in the Hilbert space. The law of total expectation holds, since the projection does not change the scalar product with the constant 1 belonging to the subspace.

Alternatively, the same function ''h'' (be it ''h''_1 or ''h''_2) may be defined as the Radon–Nikodym derivative

: h = \frac{\mathrm{d}\nu}{\mathrm{d}\mu},

where the measures μ, ν are defined by

: \begin{align} \mu (B) &= \mathbb{P} (X \in B ), \\ \nu (B) &= \mathbb{E} (Y; \, X \in B ) \end{align}

for all Borel sets B \subset \R. Here \mathbb{E}(Y; A) is the restricted expectation, not to be confused with the conditional expectation \mathbb{E}(Y \mid A) = \mathbb{E}(Y; A)/\mathbb{P}(A).
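The same binning recipe approximates ''h'' numerically and can be compared with the explicit ''h''_1. A minimal sketch under the same assumptions as above; the helper names f1 and h1, the bin edges, and the sample size are illustrative choices.

 import numpy as np

 def f1(y):
     return np.where(y <= 1/3, 3*y, np.where(y <= 2/3, 1.5*(1 - y), 0.5))

 def h1(x):  # the explicit conditional expectation derived above
     return np.where(x < 0.5, x/3, np.where(x == 0.5, 5/6, (2 - x)/3))

 rng = np.random.default_rng(8)
 Y = rng.uniform(size=2_000_000)
 X = f1(Y)

 for lo, hi in [(0.20, 0.21), (0.70, 0.71)]:
     sel = (X >= lo) & (X < hi)
     print(Y[sel].mean(), float(h1((lo + hi) / 2)))  # empirical E(Y | X in bin) vs h1
 print(Y[X == 0.5].mean(), 5/6)                      # the atom x = 0.5
 print(h1(X).mean(), Y.mean())                       # law of total expectation: both near 1/2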


Conditional distribution

In the case ''f'' = ''f''_1 the conditional cumulative distribution function may be calculated explicitly, similarly to ''g''_1. The limiting procedure gives

: F_{Y \mid X=3/4} (y) = \mathbb{P} \left ( Y \leqslant y \mid X = \tfrac34 \right ) = \lim_{\varepsilon\to0+} \mathbb{P} \left ( Y \leqslant y \mid \tfrac34-\varepsilon \leqslant X \leqslant \tfrac34+\varepsilon \right ) = \begin{cases} 0 & \text{for } -\infty < y < \tfrac14, \\ \tfrac16 & \text{for } y = \tfrac14, \\ \tfrac13 & \text{for } \tfrac14 < y < \tfrac12, \\ \tfrac23 & \text{for } y = \tfrac12, \\ 1 & \text{for } \tfrac12 < y < \infty, \end{cases}

which cannot be correct, since a cumulative distribution function must be right-continuous!

This paradoxical result is explained by measure theory as follows. For a given ''y'' the corresponding F_{Y \mid X=x} (y) = \mathbb{P} ( Y \leqslant y \mid X = x) is well-defined (via the Hilbert space or the Radon–Nikodym derivative) as an equivalence class of functions (of ''x''). Treated as a function of ''y'' for a given ''x'' it is ill-defined unless some additional input is provided. Namely, a function (of ''x'') must be chosen within every (or at least almost every) equivalence class. Wrong choice leads to wrong conditional cumulative distribution functions.

A right choice can be made as follows. First, F_{Y \mid X=x} (y) = \mathbb{P} ( Y \leqslant y \mid X = x) is considered for rational numbers ''y'' only. (Any other dense countable set may be used equally well.) Thus, only a countable set of equivalence classes is used; all choices of functions within these classes are mutually equivalent, and the corresponding function of rational ''y'' is well-defined (for almost every ''x''). Second, the function is extended from rational numbers to real numbers by right continuity.

In general the conditional distribution is defined for almost all ''x'' (according to the distribution of ''X''), but sometimes the result is continuous in ''x'', in which case individual values are acceptable. In the considered example this is the case; the correct result for ''x'' = 0.75,

: F_{Y \mid X=3/4} (y) = \mathbb{P} \left ( Y \leqslant y \mid X = \tfrac34 \right ) = \begin{cases} 0 & \text{for } -\infty < y < \tfrac14, \\ \tfrac13 & \text{for } \tfrac14 \leqslant y < \tfrac12, \\ 1 & \text{for } \tfrac12 \leqslant y < \infty, \end{cases}

shows that the conditional distribution of ''Y'' given ''X'' = 0.75 consists of two atoms, at 0.25 and 0.5, of probabilities 1/3 and 2/3 respectively.

Similarly, the conditional distribution may be calculated for all ''x'' in (0, 0.5) or (0.5, 1). The value ''x'' = 0.5 is an atom of the distribution of ''X'', thus, the corresponding conditional distribution is well-defined and may be calculated by elementary means (the denominator does not vanish); the conditional distribution of ''Y'' given ''X'' = 0.5 is uniform on (2/3, 1). Measure theory leads to the same result.

The mixture of all conditional distributions is the (unconditional) distribution of ''Y''. The conditional expectation \mathbb{E}(Y \mid X=x) is nothing but the expectation with respect to the conditional distribution.

In the case ''f'' = ''f''_2 the corresponding F_{Y \mid X=x} (y) = \mathbb{P} ( Y \leqslant y \mid X = x) probably cannot be calculated explicitly. For a given ''y'' it is well-defined (via the Hilbert space or the Radon–Nikodym derivative) as an equivalence class of functions (of ''x''). The right choice of functions within these equivalence classes may be made as above; it leads to correct conditional cumulative distribution functions, thus, conditional distributions. In general, conditional distributions need not be atomic or absolutely continuous (nor mixtures of both types). Probably, in the considered example they are singular (like the Cantor distribution). Once again, the mixture of all conditional distributions is the (unconditional) distribution, and the conditional expectation is the expectation with respect to the conditional distribution.
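The limiting procedure at ''x'' = 0.75 is also easy to mimic by simulation. A minimal sketch under the same assumptions as above; ε = 0.005 and the atom-detection windows are arbitrary choices. It recovers the two atoms at 0.25 and 0.5 with weights 1/3 and 2/3.

 import numpy as np

 def f1(y):
     return np.where(y <= 1/3, 3*y, np.where(y <= 2/3, 1.5*(1 - y), 0.5))

 rng = np.random.default_rng(9)
 Y = rng.uniform(size=2_000_000)
 X = f1(Y)

 eps = 0.005
 sel = np.abs(X - 0.75) < eps                     # conditioning on 0.75-eps < X < 0.75+eps
 Ynear = Y[sel]
 print((np.abs(Ynear - 0.25) < 0.01).mean())      # weight of the atom at 0.25: about 1/3
 print((np.abs(Ynear - 0.50) < 0.01).mean())      # weight of the atom at 0.50: about 2/3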


Technical details


See also

* Conditional probability
* Conditional expectation
* Conditional probability distribution
* Joint probability distribution
* Borel's paradox
* Regular conditional probability
* Disintegration theorem
* Law of total variance
* Law of total cumulance


Notes


References
