Kosambi–Karhunen–Loève Theorem

In the theory of stochastic processes, the Karhunen–Loève theorem (named after Kari Karhunen and Michel Loève), also known as the Kosambi–Karhunen–Loève theorem, states that a stochastic process can be represented as an infinite linear combination of orthogonal functions, analogous to a Fourier series representation of a function on a bounded interval. The transformation is also known as the Hotelling transform and the eigenvector transform, and is closely related to principal component analysis (PCA), a technique widely used in image processing and in data analysis in many fields.

There exist many such expansions of a stochastic process: if the process is indexed over [''a'', ''b''], any orthonormal basis of L^2([a,b]) yields an expansion thereof in that form. The importance of the Karhunen–Loève theorem is that it yields the best such basis in the sense that it minimizes the total mean squared error.

In contrast to a Fourier series, where the coefficients are fixed numbers and the expansion basis consists of sinusoidal functions (that is, sine and cosine functions), the coefficients in the Karhunen–Loève theorem are random variables and the expansion basis depends on the process. In fact, the orthogonal basis functions used in this representation are determined by the covariance function of the process. One can think of the Karhunen–Loève transform as adapting to the process in order to produce the best possible basis for its expansion.

In the case of a ''centered'' stochastic process \{X_t\}_{t\in[a,b]} (''centered'' means \mathbf{E}[X_t]=0 for all t\in[a,b]) satisfying a technical continuity condition, X_t admits a decomposition
:X_t = \sum_{k=1}^\infty Z_k e_k(t)
where Z_k are pairwise uncorrelated random variables and the functions e_k are continuous real-valued functions on [a,b] that are pairwise orthogonal in L^2([a,b]). It is therefore sometimes said that the expansion is ''bi-orthogonal'' since the random coefficients Z_k are orthogonal in the probability space while the deterministic functions e_k are orthogonal in the time domain. The general case of a process X_t that is not centered can be brought back to the case of a centered process by considering X_t - \mathbf{E}[X_t], which is a centered process. Moreover, if the process is Gaussian, then the random variables Z_k are Gaussian and stochastically independent. This result generalizes the ''Karhunen–Loève transform''. An important example of a centered real stochastic process on [0,1] is the Wiener process; the Karhunen–Loève theorem can be used to provide a canonical orthogonal representation for it. In this case the expansion consists of sinusoidal functions.

The above expansion into uncorrelated random variables is also known as the ''Karhunen–Loève expansion'' or ''Karhunen–Loève decomposition''. The empirical version (i.e., with the coefficients computed from a sample) is known as the ''Karhunen–Loève transform'' (KLT), ''principal component analysis'', ''proper orthogonal decomposition (POD)'', ''empirical orthogonal functions'' (a term used in meteorology and geophysics), or the ''Hotelling transform''.


Formulation

*Throughout this article, we will consider a square-integrable zero-mean random process X_t defined over a probability space (\Omega, F, \mathbf{P}) and indexed over a closed interval [a, b], with covariance function K_X(s,t). In other words, we have:
::\forall t\in [a,b], \qquad X_t\in L^2(\Omega, F,\mathbf{P}), \quad \text{i.e. } \mathbf{E}[X_t^2]< \infty,
::\forall t\in [a,b], \qquad \mathbf{E}[X_t]=0,
::\forall t,s \in [a,b], \qquad K_X(s,t)=\mathbf{E}[X_s X_t].
:The square-integrability condition \mathbf{E}[X_t^2]< \infty is logically equivalent to K_X(s,t) being finite for all s,t \in [a,b].
*We associate to K_X a linear operator (more specifically a Hilbert–Schmidt integral operator) T_{K_X} defined in the following way:
:: \begin{align} &T_{K_X}&: L^2([a,b]) &\to L^2([a,b])\\ &&: f \mapsto T_{K_X}f &= \int_a^b K_X(s,\cdot) f(s) \, ds \end{align}
:Since T_{K_X} is a linear operator, it makes sense to talk about its eigenvalues λ_k and eigenfunctions e_k(t), which are found by solving the homogeneous Fredholm integral equation of the second kind
::\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t)
:(a numerical sketch of how this eigenproblem can be discretized is given below).
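In practice the eigenpairs of T_{K_X} usually have to be approximated numerically. Below is a minimal sketch of one standard approach, the Nyström method, assuming a uniform quadrature rule and, purely for illustration, the Wiener-process kernel \min(s,t) treated later in this article; the function name kl_eigenpairs and all parameter choices are hypothetical.

 import numpy as np
 
 def kl_eigenpairs(kernel, a, b, n=400):
     """Approximate eigenvalues/eigenfunctions of T f = int_a^b K(s, .) f(s) ds
     by a Nystrom discretization on a uniform grid."""
     t = np.linspace(a, b, n)
     w = np.full(n, (b - a) / n)          # simple quadrature weights
     K = kernel(t[:, None], t[None, :])   # kernel matrix K(s, t)
     sw = np.sqrt(w)
     A = sw[:, None] * K * sw[None, :]    # symmetrized weighted problem
     lam, V = np.linalg.eigh(A)           # eigenvalues in ascending order
     lam, V = lam[::-1], V[:, ::-1]       # reorder: decreasing
     phi = V / sw[:, None]                # eigenfunctions, L^2-normalized
     return t, lam, phi
 
 t, lam, phi = kl_eigenpairs(lambda s, u: np.minimum(s, u), 0.0, 1.0)
 k = np.arange(1, 6)
 print(lam[:5])                           # numerical eigenvalues
 print(1.0 / ((k - 0.5) ** 2 * np.pi ** 2))

The printed values should agree, up to discretization error, with the analytic Wiener-process eigenvalues 1/((k − 1/2)^2 π^2) derived in the Examples section.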


Statement of the theorem

Theorem. Let X_t be a zero-mean square-integrable stochastic process defined over a probability space (\Omega, F, \mathbf{P}) and indexed over a closed and bounded interval [''a'', ''b''], with continuous covariance function K_X(s,t). Then K_X(s,t) is a Mercer kernel and, letting e_k be an orthonormal basis on L^2([a,b]) formed by the eigenfunctions of T_{K_X} with respective eigenvalues \lambda_k, X_t admits the following representation
:X_t=\sum_{k=1}^\infty Z_k e_k(t)
where the convergence is in L^2, uniform in ''t'', and
:Z_k=\int_a^b X_t e_k(t)\, dt
Furthermore, the random variables Z_k have zero mean, are uncorrelated and have variance λ_k:
:\mathbf{E}[Z_k]=0,~\forall k\in\mathbb{N} \qquad \mbox{and}\qquad \mathbf{E}[Z_i Z_j]=\delta_{ij} \lambda_j,~\forall i,j\in \mathbb{N}
Note that by generalizations of Mercer's theorem we can replace the interval [''a'', ''b''] with other compact spaces ''C'' and the Lebesgue measure on [''a'', ''b''] with a Borel measure whose support is ''C''.


Proof

*The covariance function K_X satisfies the definition of a Mercer kernel. By Mercer's theorem, there consequently exists a set \{\lambda_k, e_k(t)\} of eigenvalues and eigenfunctions of T_{K_X} forming an orthonormal basis of L^2([a,b]), and K_X can be expressed as
::K_X(s,t)=\sum_{k=1}^\infty \lambda_k e_k(s) e_k(t)
*The process X_t can be expanded in terms of the eigenfunctions e_k as:
::X_t=\sum_{k=1}^\infty Z_k e_k(t)
:where the coefficients (random variables) Z_k are given by the projection of X_t on the respective eigenfunctions
::Z_k=\int_a^b X_t e_k(t) \,dt
*We may then derive
::\begin{align} \mathbf{E}[Z_k]&=\mathbf{E}\left[\int_a^b X_t e_k(t) \,dt\right]=\int_a^b \mathbf{E}[X_t]e_k(t)\,dt=0 \\ \mathbf{E}[Z_i Z_j]&=\mathbf{E}\left[ \int_a^b \int_a^b X_t X_s e_j(t)e_i(s)\, dt\, ds\right]\\ &=\int_a^b \int_a^b \mathbf{E}\left[X_t X_s\right]e_j(t)e_i(s) \, dt\, ds\\ &=\int_a^b \int_a^b K_X(s,t) e_j(t)e_i(s) \,dt \, ds\\ &=\int_a^b e_i(s)\left(\int_a^b K_X(s,t) e_j(t) \,dt\right) \, ds\\ &=\lambda_j \int_a^b e_i(s) e_j(s) \, ds\\ &=\delta_{ij}\lambda_j \end{align}
:where we have used the fact that the e_k are eigenfunctions of T_{K_X} and are orthonormal.
*Let us now show that the convergence is in L^2. Let
::S_N=\sum_{k=1}^N Z_k e_k(t).
:Then:
::\begin{align} \mathbf{E} \left[ \left| X_t-S_N \right|^2 \right] &=\mathbf{E} \left[ X_t^2 \right]+\mathbf{E} \left[ S_N^2 \right] - 2\mathbf{E} \left[ X_t S_N \right]\\ &=K_X(t,t)+\mathbf{E}\left[\sum_{k=1}^N \sum_{\ell=1}^N Z_k Z_\ell e_k(t)e_\ell(t) \right]-2\mathbf{E}\left[X_t\sum_{k=1}^N Z_k e_k(t)\right]\\ &=K_X(t,t)+\sum_{k=1}^N \lambda_k e_k(t)^2 -2\mathbf{E}\left[\sum_{k=1}^N \int_a^b X_t X_s e_k(s) e_k(t) \,ds \right]\\ &=K_X(t,t)-\sum_{k=1}^N \lambda_k e_k(t)^2 \end{align}
:which goes to 0 by Mercer's theorem.


Properties of the Karhunen–Loève transform


Special case: Gaussian distribution

Since the limit in the mean of jointly Gaussian random variables is jointly Gaussian, and jointly Gaussian random (centered) variables are independent if and only if they are orthogonal, we can also conclude:

Theorem. The variables Z_i have a joint Gaussian distribution and are stochastically independent if the original process X_t is Gaussian.

In the Gaussian case, since the variables Z_i are independent, we can say more:
: \lim_{N\to\infty} \sum_{i=1}^N e_i(t) Z_i(\omega) = X_t(\omega)
almost surely.


The Karhunen–Loève transform decorrelates the process

This is a consequence of the Z_k being pairwise uncorrelated (and, in the Gaussian case, independent).


The Karhunen–Loève expansion minimizes the total mean square error

In the introduction, we mentioned that the truncated Karhunen–Loève expansion is the best approximation of the original process in the sense that it minimizes the total mean-square error resulting from its truncation. Because of this property, it is often said that the KL transform optimally compacts the energy. More specifically, given any orthonormal basis \{f_k\} of L^2([a,b]), we may decompose the process X_t as:
:X_t(\omega)=\sum_{k=1}^\infty A_k(\omega) f_k(t)
where
:A_k(\omega)=\int_a^b X_t(\omega) f_k(t)\,dt
and we may approximate X_t by the finite sum
:\hat{X}_t(\omega)=\sum_{k=1}^N A_k(\omega) f_k(t)
for some integer ''N''.

Claim. Of all such approximations, the KL approximation is the one that minimizes the total mean square error (provided we have arranged the eigenvalues in decreasing order).
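A minimal numerical illustration of this claim in the discrete setting (treated below under Principal component analysis) is sketched here; the discretized covariance, the random competing orthonormal basis and the truncation order M = 8 are all illustrative choices.

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 # Illustrative covariance: discretized Brownian-motion kernel min(s, t)
 N = 64
 t = (np.arange(N) + 0.5) / N
 Sigma = np.minimum(t[:, None], t[None, :])
 
 # KL basis = eigenvectors of Sigma, ordered by decreasing eigenvalue
 lam, Phi = np.linalg.eigh(Sigma)
 lam, Phi = lam[::-1], Phi[:, ::-1]
 
 # An arbitrary competing orthonormal basis (random orthogonal matrix)
 Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
 
 def truncation_mse(basis, M):
     """Total mean-square error E||X - X_hat||^2 when keeping the first M
     basis vectors; it equals the sum of g^T Sigma g over discarded vectors."""
     G = basis[:, M:]
     return np.trace(G.T @ Sigma @ G)
 
 M = 8
 print("KL basis error:    ", truncation_mse(Phi, M))
 print("random basis error:", truncation_mse(Q, M))   # never smaller than the KL error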


Explained variance

An important observation is that since the random coefficients Z_k of the KL expansion are uncorrelated, the Bienaymé formula asserts that the variance of X_t is simply the sum of the variances of the individual components of the sum:
:\operatorname{Var}[X_t]=\sum_{k=1}^\infty e_k(t)^2 \operatorname{Var}[Z_k]=\sum_{k=1}^\infty \lambda_k e_k(t)^2
Integrating over [''a'', ''b''] and using the orthonormality of the e_k, we obtain that the total variance of the process is:
:\int_a^b \operatorname{Var}[X_t]\, dt=\sum_{k=1}^\infty \lambda_k
In particular, the total variance of the ''N''-truncated approximation is
:\sum_{k=1}^N \lambda_k.
As a result, the ''N''-truncated expansion explains
:\frac{\sum_{k=1}^N \lambda_k}{\sum_{k=1}^\infty \lambda_k}
of the variance; and if we are content with an approximation that explains, say, 95% of the variance, then we just have to determine an N\in\mathbb{N} such that
:\frac{\sum_{k=1}^N \lambda_k}{\sum_{k=1}^\infty \lambda_k} \geq 0.95.
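In the discrete case the explained-variance criterion can be evaluated directly from the eigenvalues of the covariance matrix; a brief sketch, assuming an arbitrary example covariance, follows.

 import numpy as np
 
 # Eigenvalues of a discretized covariance matrix (here the min(s, t) kernel)
 N = 256
 t = (np.arange(N) + 0.5) / N
 lam = np.linalg.eigvalsh(np.minimum(t[:, None], t[None, :]))[::-1]
 
 # Cumulative explained-variance ratio and the smallest truncation order
 # whose ratio reaches the 95% threshold
 ratio = np.cumsum(lam) / lam.sum()
 n95 = int(np.searchsorted(ratio, 0.95)) + 1
 print("components needed for 95% of the variance:", n95)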


The Karhunen–Loève expansion has the minimum representation entropy property

Given a representation of X_t=\sum_{k=1}^\infty W_k\varphi_k(t), for some orthonormal basis \varphi_k(t) and random W_k, we let p_k=\mathbb{E}[|W_k|^2]/\mathbb{E}[|X_t|^2_{L^2([a,b])}], so that \sum_{k=1}^\infty p_k=1. We may then define the representation entropy to be H(\{\varphi_k\})=-\sum_k p_k \log(p_k). Then we have H(\{\varphi_k\})\ge H(\{e_k\}), for all choices of \varphi_k. That is, the KL-expansion has minimal representation entropy.

Proof: Denote the coefficients obtained for the basis e_k(t) as p_k, and for \varphi_k(t) as q_k. Choose N\ge 1. Note that since e_k minimizes the mean squared error, we have that
: \mathbb{E}\left|\sum_{k=1}^N Z_ke_k(t)-X_t\right|_{L^2}^2\le \mathbb{E}\left|\sum_{k=1}^N W_k\varphi_k(t)-X_t\right|_{L^2}^2
Expanding the right hand side, we get:
: \mathbb{E}\left|\sum_{k=1}^N W_k\varphi_k(t)-X_t\right|_{L^2}^2 =\mathbb{E}|X_t|^2_{L^2} + \sum_{k=1}^N \sum_{\ell=1}^N \mathbb{E}[W_\ell \varphi_\ell(t)W_k^*\varphi_k^*(t)]_{L^2}-\sum_{k=1}^N \mathbb{E}[W_k \varphi_k X_t^*]_{L^2} - \sum_{k=1}^N \mathbb{E}[X_tW_k^*\varphi_k^*(t)]_{L^2}
Using the orthonormality of \varphi_k(t), and expanding X_t in the \varphi_k(t) basis, we get that the right hand side is equal to:
: \mathbb{E}|X_t|^2_{L^2}-\sum_{k=1}^N\mathbb{E}[|W_k|^2]
Carrying out the same computation for the left hand side gives \mathbb{E}|X_t|^2_{L^2}-\sum_{k=1}^N\mathbb{E}[|Z_k|^2], so, after dividing by \mathbb{E}[|X_t|^2_{L^2}], we obtain that:
: \sum_{k=1}^N p_k\ge \sum_{k=1}^N q_k
This implies that:
: -\sum_{k=1}^\infty p_k \log(p_k)\le -\sum_{k=1}^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first ''M'' vectors of a basis. These signals are modeled as realizations of a random vector ''Y''[''n''] of size ''N''. To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loève bases that diagonalize the covariance matrix of ''Y''. The random vector ''Y'' can be decomposed in an orthogonal basis
:\left\{ g_m \right\}_{0\le m \le N-1}
as follows:
:Y=\sum_{m=0}^{N-1} \left\langle Y, g_m \right\rangle g_m,
where each
:\left\langle Y, g_m \right\rangle =\sum_{n=0}^{N-1} Y[n] g_m^*[n]
is a random variable. The approximation from the first M \le N vectors of the basis is
:Y_M=\sum_{m=0}^{M-1} \left\langle Y, g_m \right\rangle g_m
The energy conservation in an orthogonal basis implies
:\varepsilon[M]= \mathbf{E} \left\{ \left\| Y-Y_M \right\|^2 \right\} =\sum_{m=M}^{N-1} \mathbf{E}\left\{ \left| \left\langle Y, g_m \right\rangle \right|^2 \right\}
This error is related to the covariance of ''Y'' defined by
:R[n,m]=\mathbf{E} \left\{ Y[n] Y^*[m] \right\}
For any vector ''x''[''n''] we denote by ''K'' the covariance operator represented by this matrix,
:\mathbf{E}\left\{\left|\langle Y,x \rangle\right|^2\right\}=\langle Kx,x \rangle =\sum_{n=0}^{N-1} \sum_{m=0}^{N-1} R[n,m] x[n] x^*[m]
The error \varepsilon[M] is therefore a sum of the last N-M coefficients of the covariance operator
:\varepsilon[M]= \sum_{m=M}^{N-1} \left\langle K g_m, g_m \right\rangle
The covariance operator ''K'' is Hermitian and positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations.

Theorem (Optimality of Karhunen–Loève basis). Let ''K'' be a covariance operator. For all M \ge 1, the approximation error
:\varepsilon[M]=\sum_{m=M}^{N-1}\left\langle K g_m, g_m \right\rangle
is minimum if and only if
:\left\{ g_m \right\}_{0\le m \le N-1}
is a Karhunen–Loève basis ordered by decreasing eigenvalues,
:\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_{m+1}, g_{m+1} \right\rangle, \qquad 0\le m \le N-2.


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \mathrm{H} is approximated with ''M'' vectors selected adaptively in an orthonormal basis for \mathrm{H}
:\mathrm{B}=\left\{ g_m \right\}_{m\in\mathbb{N}}
Let f_M be the projection of f over ''M'' vectors whose indices are in I_M:
:f_M=\sum_{m\in I_M} \left\langle f, g_m \right\rangle g_m
The approximation error is the sum of the remaining coefficients
:\varepsilon[M]=\left\| f-f_M \right\|^2=\sum_{m\notin I_M} \left| \left\langle f, g_m \right\rangle \right|^2
To minimize this error, the indices in I_M must correspond to the ''M'' vectors having the largest inner product amplitude
:\left| \left\langle f, g_m \right\rangle \right|.
These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the ''M'' approximation vectors independently of f. Let us sort
:\left\{ \left| \left\langle f, g_m \right\rangle \right| \right\}_{m\in\mathbb{N}}
in decreasing order
:\left| \left \langle f, g_{m_k} \right \rangle \right|\ge \left| \left \langle f, g_{m_{k+1}} \right \rangle \right|.
The best non-linear approximation is
:f_M=\sum_{k=1}^M \left\langle f, g_{m_k} \right\rangle g_{m_k}
It can also be written as inner product thresholding:
:f_M=\sum_{m=0}^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m
with
:T=\left| \left\langle f, g_{m_M} \right \rangle\right|, \qquad \theta_T(x)= \begin{cases} x & |x| \ge T \\ 0 & |x| < T \end{cases}
The non-linear error is
:\varepsilon[M]=\left\| f-f_M \right\|^2=\sum_{k=M+1}^{\infty} \left| \left\langle f, g_{m_k} \right\rangle \right|^2
This error goes quickly to zero as ''M'' increases if the sorted values of \left| \left\langle f, g_{m_k} \right\rangle \right| have a fast decay as ''k'' increases. This decay is quantified by computing the \ell^p norm of the signal inner products in B:
:\| f \|_{\mathrm{B},p} =\left( \sum_{m=0}^\infty \left| \left\langle f, g_m \right\rangle \right|^p \right)^{1/p}
The following theorem relates the decay of \varepsilon[M] to \| f\|_{\mathrm{B},p}.

Theorem (decay of error). If \| f\|_{\mathrm{B},p}<\infty with p<2 then
:\varepsilon[M] \le \frac{\|f\|_{\mathrm{B},p}^2}{\frac{2}{p}-1} M^{1-\frac{2}{p}}
and
:\varepsilon[M] = o\left( M^{1-\frac{2}{p}} \right).
Conversely, if \varepsilon[M] = o\left( M^{1-\frac{2}{p}} \right) then \| f\|_{\mathrm{B},q}<\infty for any q>p.
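A concrete sketch of this thresholding procedure, assuming an orthonormal DCT basis and a piecewise-smooth test signal (both purely illustrative choices), compares the linear approximation keeping the first ''M'' coefficients with the non-linear approximation keeping the ''M'' largest ones.

 import numpy as np
 from scipy.fft import dct, idct
 
 # Piecewise-smooth test signal with a jump
 n = 512
 t = np.linspace(0.0, 1.0, n, endpoint=False)
 f = np.sin(2 * np.pi * 3 * t) + (t > 0.6)
 
 c = dct(f, norm="ortho")               # coefficients in an orthonormal basis
 
 def linear_approx(c, M):
     """Keep the first M coefficients (indices fixed a priori)."""
     ck = np.zeros_like(c)
     ck[:M] = c[:M]
     return idct(ck, norm="ortho")
 
 def nonlinear_approx(c, M):
     """Keep the M largest-magnitude coefficients (adapted to the signal)."""
     ck = np.zeros_like(c)
     idx = np.argsort(np.abs(c))[-M:]
     ck[idx] = c[idx]
     return idct(ck, norm="ortho")
 
 M = 30
 print("linear error:    ", np.sum((f - linear_approx(c, M)) ** 2))
 print("non-linear error:", np.sum((f - nonlinear_approx(c, M)) ** 2))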


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y''[''n''] of size ''N'' that is a random shift modulo ''N'' of a deterministic signal ''f''[''n''] of zero mean
:\sum_{n=0}^{N-1}f[n]=0
:Y[n]=f[(n-p)\bmod N]
The random shift ''P'' is uniformly distributed on [0, ''N'' − 1]:
:\Pr ( P=p )=\frac{1}{N}, \qquad 0\le p<N
Clearly
:\mathbf{E}\{Y[n]\}=\frac{1}{N} \sum_{p=0}^{N-1} f[(n-p)\bmod N]=0
and
:R[n,k]=\mathbf{E} \{Y[n]Y[k]\}=\frac{1}{N}\sum_{p=0}^{N-1} f[(n-p)\bmod N]\,f[(k-p)\bmod N] = \frac{1}{N} f\circledast \bar{f}[n-k], \qquad \bar{f}[n]=f[-n]
Hence
:R[n,k]=R_Y[n-k], \qquad R_Y[k]=\frac{1}{N}f \circledast \bar{f}[k]
Since R_Y is ''N''-periodic, ''Y'' is a circular stationary random vector. The covariance operator is a circular convolution with R_Y and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis
:\left\{ \frac{1}{\sqrt{N}} e^{i 2\pi mn/N} \right\}_{0\le m<N}.
The power spectrum is the Fourier transform of R_Y:
:P_Y[m]= \hat{R}_Y[m]= \frac{1}{N} \left| \hat{f}[m] \right|^2
Example: Consider an extreme case where f[n]= \delta[n]-\delta[n-1]. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\{g_m[n]=\delta[n-m] \right\}_{0\le m<N}. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of ''Y'' and thus absorb a part of the signal energy.
:\mathbf{E} \left\{ \left| \left\langle Y[n], \frac{1}{\sqrt{N}} e^{i 2\pi mn/N} \right\rangle \right|^2 \right\}=P_Y[m]= \frac{4}{N}\sin^2 \left(\frac{\pi m}{N} \right)
Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f[n]= \delta[n]-\delta[n-1] then the discrete Fourier basis is extremely inefficient because f and hence ''Y'' have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of ''Y'' with M\ge 2 gives zero error.
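The comparison can be reproduced numerically; the following sketch, with arbitrary sizes and Monte Carlo draws, estimates both errors for f[n] = \delta[n]-\delta[n-1].

 import numpy as np
 
 rng = np.random.default_rng(2)
 N = 64
 f = np.zeros(N)
 f[0], f[1] = 1.0, -1.0                       # f[n] = delta[n] - delta[n-1]
 
 # Realizations Y[n] = f[(n - p) mod N] with a uniform random shift p
 p = rng.integers(0, N, size=2000)
 Y = np.stack([np.roll(f, shift) for shift in p])
 M = 4
 
 # Linear approximation in the Fourier (KL) basis: keep the M frequencies
 # with the largest power spectrum P_Y[m] = |f_hat[m]|^2 / N.
 F = np.fft.fft(Y, axis=1) / np.sqrt(N)       # orthonormal DFT coefficients
 keep = np.argsort(np.abs(np.fft.fft(f)) ** 2)[-M:]
 lin_err = np.mean(np.sum(np.abs(F) ** 2, axis=1)
                   - np.sum(np.abs(F[:, keep]) ** 2, axis=1))
 
 # Non-linear approximation in the Dirac basis: keep, for each realization,
 # its M largest samples (M >= 2 captures both non-zero samples exactly).
 sorted_sq = np.sort(Y ** 2, axis=1)
 nonlin_err = np.mean(np.sum(sorted_sq[:, :-M], axis=1))
 
 print("linear Fourier error:  ", lin_err)    # strictly positive
 print("non-linear Dirac error:", nonlin_err) # zero for M >= 2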


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind
:\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t).
However, when applied to a discrete and finite process \left(X_n\right)_{n\in\{1,\ldots,N\}}, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version.

We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the mean vector of ''X''), which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by:
:\Sigma_{ij}= \mathbf{E}[X_i X_j], \qquad \forall i,j \in \{1,\ldots,N\}
Rewriting the above integral equation to suit the discrete case, we observe that it turns into:
:\sum_{j=1}^N \Sigma_{ij} e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e
where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \{\lambda_i,\varphi_i\}_{i\in\{1,\ldots,N\}} for this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of \lambda_i. Let also Φ be the orthonormal matrix consisting of these eigenvectors:
:\begin{align} \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end{align}
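In practice Σ is estimated from data and diagonalized with a standard eigensolver; a minimal sketch, assuming synthetic data with a rank-3 latent structure (an illustrative choice), follows.

 import numpy as np
 
 rng = np.random.default_rng(3)
 
 # Toy data: n_samples observations of an N-dimensional vector X
 n_samples, N = 1000, 10
 A = rng.standard_normal((N, 3))
 X = rng.standard_normal((n_samples, 3)) @ A.T   # rank-3 latent structure
 
 X = X - X.mean(axis=0)                 # center: X := X - mu_X
 Sigma = (X.T @ X) / n_samples          # empirical covariance matrix
 
 lam, Phi = np.linalg.eigh(Sigma)       # orthonormal eigenvectors of Sigma
 order = np.argsort(lam)[::-1]          # list eigenvalues in decreasing order
 lam, Phi = lam[order], Phi[:, order]
 print(np.round(lam, 3))                # roughly three dominant eigenvalues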


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have:
:X =\sum_{i=1}^N \langle \varphi_i,X\rangle \varphi_i =\sum_{i=1}^N \varphi_i^T X \varphi_i
In a more compact form, the principal component transform of ''X'' is defined by:
:\begin{cases} Y=\Phi^T X \\ X=\Phi Y \end{cases}
The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i, and the inverse transform X=\Phi Y yields the expansion of ''X'' on the space spanned by the \varphi_i:
:X=\sum_{i=1}^N Y_i \varphi_i=\sum_{i=1}^N \langle \varphi_i,X\rangle \varphi_i
As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\{1,\ldots,N\} such that
:\frac{\sum_{i=1}^K \lambda_i}{\sum_{i=1}^N \lambda_i}\geq \alpha
where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE) (X. Tang, "Texture information in run-length matrices," IEEE Transactions on Image Processing, vol. 7, no. 11, pp. 1602–1609, Nov. 1998).
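Putting the previous steps together, a compact sketch of the truncated principal component transform might look as follows, assuming synthetic data and a 95% explained-variance threshold; the function name principal_component_transform is hypothetical.

 import numpy as np
 
 def principal_component_transform(X, alpha=0.95):
     """KL / principal component transform of the rows of X (samples x N),
     truncated so that a fraction alpha of the total variance is retained."""
     Xc = X - X.mean(axis=0)
     Sigma = (Xc.T @ Xc) / len(Xc)
     lam, Phi = np.linalg.eigh(Sigma)
     lam, Phi = lam[::-1], Phi[:, ::-1]           # decreasing eigenvalues
     K = int(np.searchsorted(np.cumsum(lam) / lam.sum(), alpha)) + 1
     Y = Xc @ Phi[:, :K]                          # Y = Phi^T X, first K components
     X_hat = Y @ Phi[:, :K].T + X.mean(axis=0)    # truncated reconstruction
     return Y, X_hat, K
 
 rng = np.random.default_rng(4)
 X = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 8))
 Y, X_hat, K = principal_component_transform(X)
 print(K, np.mean((X - X_hat) ** 2))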


Examples


The Wiener process

There are numerous equivalent characterizations of the Wiener process, which is a mathematical formalization of Brownian motion. Here we regard it as the centered standard Gaussian process W_t with covariance function
: K_W(t,s) = \operatorname{cov}(W_t,W_s) = \min (s,t).
We restrict the time domain to [''a'', ''b''] = [0,1] without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are
: e_k(t) = \sqrt{2} \sin \left( \left(k - \tfrac{1}{2}\right) \pi t \right)
and the corresponding eigenvalues are
: \lambda_k = \frac{1}{\left(k -\tfrac{1}{2}\right)^2 \pi^2}.
This gives the following representation of the Wiener process:

Theorem. There is a sequence \{Z_i\}_i of independent Gaussian random variables with mean zero and variance 1 such that
: W_t = \sqrt{2} \sum_{k=1}^\infty Z_k \frac{\sin \left(\left(k - \tfrac{1}{2}\right) \pi t\right)}{\left(k - \tfrac{1}{2}\right) \pi}.
Note that this representation is only valid for t\in[0,1]. On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.
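A short sketch of sampling approximate Wiener paths from this truncated series, with an arbitrary number of terms and grid points, is given below; the variance check at t = 1 relies on \operatorname{Var}[W_1]=\min(1,1)=1.

 import numpy as np
 
 def wiener_kl_paths(n_paths=5, n_terms=200, n_grid=500, seed=0):
     """Sample approximate Wiener-process paths on [0, 1] from the truncated
     Karhunen-Loeve series with e_k(t) = sqrt(2) sin((k - 1/2) pi t)."""
     rng = np.random.default_rng(seed)
     t = np.linspace(0.0, 1.0, n_grid)
     k = np.arange(1, n_terms + 1)
     Z = rng.standard_normal((n_paths, n_terms))       # i.i.d. N(0, 1)
     basis = np.sqrt(2) * np.sin((k[:, None] - 0.5) * np.pi * t[None, :])
     coeff = 1.0 / ((k - 0.5) * np.pi)                 # sqrt(lambda_k)
     return t, (Z * coeff) @ basis
 
 t, W = wiener_kl_paths(n_paths=2000)
 print(np.var(W[:, -1]))   # Var[W_1] should be close to 1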


The Brownian bridge

Similarly the Brownian bridge B_t=W_t-tW_1, which is a stochastic process with covariance function
:K_B(t,s)=\min(t,s)-ts,
can be represented as the series
:B_t = \sum_{k=1}^\infty Z_k \frac{\sqrt{2} \sin(k \pi t)}{k \pi}
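A corresponding sampling sketch for the bridge, again with arbitrary truncation and grid choices:

 import numpy as np
 
 def brownian_bridge_kl(n_paths=2000, n_terms=300, n_grid=201, seed=1):
     """Sample Brownian-bridge paths on [0, 1] from the truncated KL series
     B_t = sum_k Z_k sqrt(2) sin(k pi t) / (k pi)."""
     rng = np.random.default_rng(seed)
     t = np.linspace(0.0, 1.0, n_grid)
     k = np.arange(1, n_terms + 1)
     Z = rng.standard_normal((n_paths, n_terms))
     basis = np.sqrt(2) * np.sin(k[:, None] * np.pi * t[None, :])
     return t, (Z / (k * np.pi)) @ basis
 
 t, B = brownian_bridge_kl()
 # Empirical variance at t = 1/2 should approach K_B(1/2, 1/2) = 1/4
 print(np.var(B[:, len(t) // 2]))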


Applications

Adaptive optics systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the singular value decomposition (SVD). The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector-valued stochastic process then the left singular vectors are maximum likelihood estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting a continuous signal ''s''(''t'') from the channel output ''X''(''t''), where ''N''(''t'') is the channel noise, usually assumed to be a zero-mean Gaussian process with correlation function R_N (t, s) = E[N(t)N(s)]:
:H: X(t) = N(t),
:K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is
:R_N(t) = \tfrac{1}{2} N_0 \delta (t),
and it has a constant power spectral density. In a physically practical channel, the noise power is finite, so:
:S_N(f) = \begin{cases} \frac{N_0}{2} & |f|<\omega \\ 0 & |f|>\omega \end{cases}
Then the noise correlation function is a sinc function with zeros at \frac{n}{2\omega}, n \in \mathbf{Z}. Since these samples are uncorrelated and Gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing
: \Delta t = \frac{1}{2\omega} \text{ within } (0,T).
Let X_i = X(i\,\Delta t). We have a total of n = \frac{T}{\Delta t} = T(2\omega) = 2\omega T i.i.d. observations \{X_1, X_2,\ldots,X_n\} to develop the likelihood-ratio test. Define the signal samples S_i = S(i\,\Delta t); the problem becomes:
:H: X_i = N_i,
:K: X_i = N_i + S_i, \quad i = 1,2,\ldots,n.
The log-likelihood ratio
:\mathcal{L}(\underline{x}) = \log\frac{f_K(\underline{x})}{f_H(\underline{x})} \Leftrightarrow \Delta t \sum^n_{i=1} S_i x_i = \sum^n_{i=1} S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda
As \Delta t \to 0, let:
:G = \int^T_0 S(t)x(t) \, dt.
Then ''G'' is the test statistic and the Neyman–Pearson optimum detector is
:G(\underline{x}) > G_0 \Rightarrow K, \qquad G(\underline{x}) < G_0 \Rightarrow H.
As ''G'' is Gaussian, we can characterize it by finding its mean and variance. Then we get
:H: G \sim N \left (0,\tfrac{1}{2}N_0E \right )
:K: G \sim N \left (E,\tfrac{1}{2}N_0E \right )
where
:E = \int^T_0 S^2(t) \, dt
is the signal energy. The false alarm error is
:\alpha = \int^\infty_{G_0} N \left (0, \tfrac{1}{2}N_0E \right) \, dG \Rightarrow G_0 = \sqrt{\tfrac{1}{2}N_0E}\, \Phi^{-1}(1-\alpha)
and the probability of detection is:
:\beta = \int^\infty_{G_0} N \left (E, \tfrac{1}{2}N_0E \right) \, dG = 1-\Phi \left (\frac{G_0-E}{\sqrt{\tfrac{1}{2}N_0E}} \right ) = \Phi \left (\sqrt{\frac{2E}{N_0}} - \Phi^{-1}(1-\alpha) \right ),
where Φ is the cdf of a standard normal, or Gaussian, variable.
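The threshold and detection probability are straightforward to evaluate numerically; the sketch below assumes arbitrary example values for the signal energy, noise level and false-alarm rate.

 import numpy as np
 from scipy.stats import norm
 
 def np_detector_white_noise(E, N0, alpha):
     """Neyman-Pearson threshold G0 and detection probability beta for the
     statistic G = int_0^T S(t) x(t) dt, which is N(0, N0*E/2) under H and
     N(E, N0*E/2) under K."""
     sigma = np.sqrt(0.5 * N0 * E)
     G0 = sigma * norm.ppf(1.0 - alpha)          # fixes the false-alarm rate
     beta = norm.cdf(np.sqrt(2.0 * E / N0) - norm.ppf(1.0 - alpha))
     return G0, beta
 
 print(np_detector_white_noise(E=4.0, N0=1.0, alpha=0.05))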


Signal detection in colored noise

When ''N''(''t'') is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E[N(t)N(s)], we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''):
:N(t) = \sum^{\infty}_{i=1} N_i \Phi_i(t), \quad 0<t<T,
where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \{\Phi_i(t)\} are generated by the kernel R_N(t,s), i.e., solutions to
: \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname{Var}[N_i]= \lambda_i.
Do the expansion:
:S(t) = \sum^{\infty}_{i=1}S_i\Phi_i(t),
where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then
:X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i
under H and N_i + S_i under K. Let \overline{X} = \{X_1,X_2,\dots\}; we have
:N_i are independent Gaussian r.v.'s with variance \lambda_i
:under H: \{X_i\} are independent Gaussian r.v.'s.
::f_H[x(t)\mid 0<t<T] = f_H(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi \lambda_i}} \exp \left (-\frac{x_i^2}{2 \lambda_i} \right )
:under K: \{X_i - S_i\} are independent Gaussian r.v.'s.
::f_K[x(t)\mid 0<t<T] = f_K(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi \lambda_i}} \exp \left(-\frac{(x_i - S_i)^2}{2 \lambda_i} \right)
Hence, the log-LR is given by
:\mathcal{L}(\underline{x}) = \sum^{\infty}_{i=1} \frac{2S_i x_i - S_i^2}{2\lambda_i}
and the optimum detector is
:G = \sum^\infty_{i=1} \frac{S_i x_i}{\lambda_i} > G_0 \Rightarrow K, \qquad < G_0 \Rightarrow H.
Define
:k(t) = \sum^\infty_{i=1} \frac{S_i}{\lambda_i} \Phi_i(t), \quad 0<t<T,
then G = \int^T _0 k(t)x(t)\,dt.


How to find ''k''(''t'')

Since
:\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_{i=1} \frac{S_i}{\lambda_i} \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_{i=1} S_i \Phi_i(t) = S(t),
k(t) is the solution to
:\int^T_0 R_N(t,s)k(s)\,ds = S(t).
If ''N''(''t'') is wide-sense stationary,
:\int^T_0 R_N(t-s)k(s) \, ds = S(t),
which is known as the Wiener–Hopf equation. The equation can be solved by taking the Fourier transform, but this is not practically realizable since an infinite spectrum needs spatial factorization. A special case in which ''k''(''t'') is easy to calculate is white Gaussian noise:
:\int^T_0 \frac{N_0}{2}\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0<t<T.
The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Letting ''C'' = 1, this is just the result we arrived at in the previous section for detecting a signal in white noise.
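For a general noise correlation, k(t) can be approximated by discretizing the integral equation and solving a linear system. The sketch below assumes an exponential noise correlation and a sinusoidal signal, both purely illustrative; note that first-kind integral equations of this type can be ill-conditioned, so this is only a rough numerical sketch.

 import numpy as np
 
 def solve_correlator(R_N, S, T=1.0, n=300):
     """Numerically solve int_0^T R_N(t, s) k(s) ds = S(t) for k(t) by
     discretizing the integral on a uniform grid."""
     t = np.linspace(0.0, T, n)
     dt = T / n
     A = R_N(t[:, None], t[None, :]) * dt      # discretized integral operator
     k = np.linalg.solve(A, S(t))
     # residual of the discretized equation (should be near machine precision)
     print(np.max(np.abs(A @ k - S(t))))
     return t, k
 
 R_N = lambda t, s: np.exp(-5.0 * np.abs(t - s))   # assumed noise correlation
 S = lambda t: np.sin(2.0 * np.pi * t)             # assumed signal
 t, k = solve_correlator(R_N, S)
 # The detector statistic is then G = sum(k * x_samples) * dt for an observed x(t).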


Test threshold for Neyman–Pearson detector

Since X(t) is a Gaussian process,
:G = \int^T_0 k(t)x(t) \, dt
is a Gaussian random variable that can be characterized by its mean and variance.
:\begin{align} \mathbf{E}[G \mid H]&= \int^T_0 k(t)\mathbf{E}[x(t)\mid H]\,dt = 0 \\ \mathbf{E}[G \mid K]&= \int^T_0 k(t)\mathbf{E}[x(t)\mid K]\,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf{E}[G^2\mid H]&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) dt = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname{Var}[G \mid H]&= \mathbf{E}[G^2\mid H]- (\mathbf{E}[G \mid H])^2 = \rho \\ \mathbf{E}[G^2\mid K]&=\int^T_0\int^T_0 k(t)k(s) \mathbf{E}[x(t)x(s)]\,dt\,ds = \int^T_0\int^T_0 k(t)k(s)\left(R_N(t,s) +S(t)S(s)\right) \, dt\, ds = \rho + \rho^2\\ \operatorname{Var}[G \mid K]&= \mathbf{E}[G^2 \mid K]- (\mathbf{E}[G \mid K])^2 = \rho + \rho^2 -\rho^2 = \rho \end{align}
Hence, we obtain the distributions of ''G'' under ''H'' and ''K'':
:H: G \sim N(0,\rho)
:K: G \sim N(\rho, \rho)
The false alarm error is
:\alpha = \int^\infty_{G_0} N(0,\rho)\,dG = 1 - \Phi \left (\frac{G_0}{\sqrt{\rho}} \right ).
So the test threshold for the Neyman–Pearson optimum detector is
:G_0 = \sqrt{\rho}\, \Phi^{-1} (1-\alpha).
Its power of detection is
:\beta = \int^\infty_{G_0} N(\rho, \rho) \, dG = \Phi \left (\sqrt{\rho} - \Phi^{-1}(1 - \alpha) \right)
When the noise is a white Gaussian process, the signal power is
:\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


Prewhitening

For some types of colored noise, a typical practice is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, ''N''(''t'') is a wide-sense stationary colored noise with correlation function
:R_N(\tau) = \frac{B N_0}{4} e^{-B|\tau|},
:S_N(f) = \frac{N_0/2}{1+\left(\frac{\omega}{B}\right)^2}.
The transfer function of the prewhitening filter is
:H(f) = 1 + j \frac{\omega}{B}.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement K–L expansion to get an independent sequence of observations. In this case, the detection problem is described as follows:
:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \quad 0<t<T.
''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}. The K–L expansion of ''X''(''t'') is
:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),
where
:X_i = \int^T_0 X(t)\Phi_i(t)\,dt
and \Phi_i(t) are solutions to
: \int^T_0 R_X(t,s)\Phi_i(s)\,ds= \lambda_i \Phi_i(t).
So the X_i are an independent sequence of r.v.'s with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get
:Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 [N(t) + X(t)]\Phi_i(t)\,dt = N_i + X_i,
where
:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.
As ''N''(''t'') is Gaussian white noise, the N_i are an i.i.d. sequence of r.v.'s with zero mean and variance \tfrac{1}{2}N_0, so the problem is simplified as follows:
:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i
The Neyman–Pearson optimal test:
:\Lambda = \frac{f_{H_1}(\underline{y})}{f_{H_0}(\underline{y})} = Ce^{-\sum^\infty_{i=1}\tfrac{1}{2}y_i^2 \left( \frac{1}{\lambda_i+\tfrac{N_0}{2}} - \frac{1}{\tfrac{N_0}{2}} \right)},
so the log-likelihood ratio is
:\mathcal{L} = \ln(\Lambda) = K -\sum^\infty_{i=1}\tfrac{1}{2}y_i^2 \left( \frac{1}{\lambda_i+\tfrac{N_0}{2}} - \frac{1}{\tfrac{N_0}{2}} \right).
Since
:\widehat{X}_i = \frac{\lambda_i}{\tfrac{N_0}{2}+\lambda_i} Y_i
is just the minimum-mean-square estimate of X_i given Y_i,
:\mathcal{L} = K + \frac{1}{N_0} \sum^\infty_{i=1} Y_i \widehat{X}_i.
K–L expansion has the following property: If
:f(t) = \sum f_i \Phi_i(t), \quad g(t) = \sum g_i \Phi_i(t),
where
:f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt,
then
:\sum^\infty_{i=1} f_i g_i = \int^T_0 g(t)f(t)\,dt.
So let
:\widehat{X}(t\mid T) = \sum^\infty_{i=1} \widehat{X}_i \Phi_i(t), \qquad \text{so that} \qquad \mathcal{L} = K + \frac{1}{N_0} \int^T_0 Y(t) \widehat{X}(t\mid T) \, dt.
A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through
:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds.
By the orthogonality principle, ''Q''(''t'',''s'') satisfies
:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2} Q(t, \lambda) = R_X(t, \lambda), \qquad 0 < \lambda < T,\ 0 < t < T.
However, for practical reasons, it's necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get the estimate \widehat{X}(t\mid t). Specifically,
:Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda
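The per-coefficient likelihood ratio and its MMSE-estimator form can be checked numerically; the sketch below assumes an arbitrary eigenvalue decay, N_0 = 1 and a finite number of coefficients.

 import numpy as np
 from scipy.stats import norm
 
 def log_likelihood_ratio(y, lam, N0):
     """Log LR for Y_i = N_i (+ X_i) with N_i ~ N(0, N0/2), X_i ~ N(0, lam_i);
     also returns the equivalent form K + (1/N0) * sum(Y_i * X_hat_i)."""
     var0 = N0 / 2.0
     var1 = var0 + lam
     L = np.sum(norm.logpdf(y, scale=np.sqrt(var1))
                - norm.logpdf(y, scale=np.sqrt(var0)))
     x_hat = lam / var1 * y                   # minimum-mean-square estimate of X_i
     K = -0.5 * np.sum(np.log(var1 / var0))   # data-independent constant
     return L, K + np.sum(y * x_hat) / N0
 
 rng = np.random.default_rng(5)
 lam = 1.0 / np.arange(1, 51) ** 2            # assumed eigenvalue decay
 y = rng.normal(scale=np.sqrt(0.5 + lam))     # one draw under H_1 with N0 = 1
 print(log_likelihood_ratio(y, lam, N0=1.0))  # the two forms agree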


See also

* Principal component analysis
* Polynomial chaos
* Reproducing kernel Hilbert space
* Mercer's theorem


Notes


References

*Wu B., Zhu J., Najm F. (2005). "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of the Design Automation Conference, pp. 841–844.
*Wu B., Zhu J., Najm F. (2006). "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, pp. 1618–1636.


External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Prof. Ruye Wang at Harvey Mudd College
{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2\mathbb X_t, _^2/math>, so that \sum_^\infty p_k=1. We may then define the representation
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
to be H(\)=-\sum_i p_k \log(p_k). Then we have H(\)\ge H(\), for all choices of \varphi_k. That is, the KL-expansion has minimal representation entropy. Proof: Denote the coefficients obtained for the basis e_k(t) as p_k, and for \varphi_k(t) as q_k. Choose N\ge 1. Note that since e_k minimizes the mean squared error, we have that : \mathbb \left, \sum_^N Z_ke_k(t)-X_t\_^2\le \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 Expanding the right hand size, we get: : \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 =\mathbb, X_t^2, _ + \sum_^N \sum_^N \mathbb _\ell \varphi_\ell(t)W_k^*\varphi_k^*(t)-\sum_^N \mathbb _k \varphi_k X_t^* - \sum_^N \mathbb _tW_k^*\varphi_k^*(t) Using the orthonormality of \varphi_k(t), and expanding X_t in the \varphi_k(t) basis, we get that the right hand size is equal to: : \mathbb _t2_-\sum_^N\mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.
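One way to see the connection numerically: with centered observations stacked as columns, the left singular vectors of the data matrix coincide (up to sign) with the eigenvectors of the sample covariance matrix, i.e. the empirical K–L basis. A small sketch with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 8)) @ rng.standard_normal((8, 1000))  # 1000 observations as columns
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # left singular vectors, decreasing s
lam, Phi = np.linalg.eigh(np.cov(X))                # covariance eigenvectors, ascending eigenvalues
# Columns should agree up to sign, so this is approximately the identity matrix:
print(np.round(np.abs(U.T @ Phi[:, ::-1]), 3))
```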


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis test is used for detecting a continuous signal ''S''(''t'') from the channel output ''X''(''t''), where ''N''(''t'') is the channel noise, usually assumed to be a zero-mean Gaussian process with correlation function R_N(t, s) = E[N(t)N(s)]:
:H: X(t) = N(t),
:K: X(t) = N(t)+S(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is
:R_N(t) = \tfrac{1}{2} N_0 \delta(t),
and it has a constant power spectral density. In a physically practical channel, the noise power is finite, so:
:S_N(f) = \begin{cases} \frac{N_0}{2}, & |f| < w \\ 0, & |f| > w \end{cases}
Then the noise correlation function is a sinc function with zeros at \frac{n}{2w}, n \in \mathbf{Z}. Since samples taken at these spacings are uncorrelated and Gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing
: \Delta t = \frac{1}{2w} \text{ within } (0,''T'').
Let X_i = X(i\,\Delta t). We have a total of n = \frac{T}{\Delta t} = T(2w) = 2wT i.i.d. observations \{X_1, X_2, \ldots, X_n\} to develop the likelihood-ratio test. Define the signal samples S_i = S(i\,\Delta t); the problem becomes,
:H: X_i = N_i,
:K: X_i = N_i + S_i, \quad i = 1,2,\ldots,n.
The log-likelihood ratio
:\mathcal{L}(\underline{x}) = \log\frac{f_K(\underline{x})}{f_H(\underline{x})} \Leftrightarrow \Delta t \sum^n_{i=1} S_i x_i = \sum^n_{i=1} S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda
for some threshold λ. As \Delta t \to 0, let:
:G = \int^T_0 S(t)x(t) \, dt.
Then ''G'' is the test statistic and the Neyman–Pearson optimum detector is
:G(\underline{x}) > G_0 \Rightarrow K, \quad G(\underline{x}) < G_0 \Rightarrow H.
As ''G'' is Gaussian, we can characterize it by finding its mean and variance. Then we get
:H: G \sim N \left (0,\tfrac{1}{2}N_0E \right )
:K: G \sim N \left (E,\tfrac{1}{2}N_0E \right )
where
:\mathbf{E} = \int^T_0 S^2(t) \, dt
is the signal energy. The false alarm error is
:\alpha = \int^\infty_{G_0} N \left (0, \tfrac{1}{2}N_0E \right) \, dG \Rightarrow G_0 = \sqrt{\tfrac{1}{2}N_0E}\, \Phi^{-1}(1-\alpha)
and the probability of detection is:
:\beta = \int^\infty_{G_0} N \left (E, \tfrac{1}{2}N_0E \right) \, dG = 1-\Phi \left (\frac{G_0 - E}{\sqrt{\tfrac{1}{2}N_0E}} \right ) = \Phi \left (\sqrt{\frac{2E}{N_0}} - \Phi^{-1}(1-\alpha) \right ),
where Φ is the cdf of a standard normal, or Gaussian, variable.
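A numerical sketch of this detector; the signal, noise level and false-alarm rate below are arbitrary choices, and `scipy.stats.norm` supplies Φ and Φ⁻¹:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T, fs, N0, alpha = 1.0, 1000.0, 0.1, 0.01      # duration, sampling rate, noise level, false-alarm rate
t = np.arange(0.0, T, 1.0 / fs)
S = np.sin(2 * np.pi * 5 * t)                  # an arbitrary known signal S(t)
E = np.sum(S**2) / fs                          # signal energy E = int S^2 dt (Riemann sum)

G0 = np.sqrt(N0 * E / 2) * norm.ppf(1 - alpha)                # threshold G_0
beta = norm.cdf(np.sqrt(2 * E / N0) - norm.ppf(1 - alpha))    # theoretical detection probability

# One observation under K: x(t) = S(t) + N(t), white noise of PSD N0/2
x = S + rng.standard_normal(t.size) * np.sqrt(N0 * fs / 2)
G = np.sum(S * x) / fs                         # correlator statistic G = int S(t) x(t) dt
print("decide", "K" if G > G0 else "H", " beta =", round(beta, 3))
```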


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E[N(t)N(s)], we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use the K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''):
:N(t) = \sum^\infty_{i=1} N_i \Phi_i(t), \quad 0<t<T,
where N_i =\int^T_0 N(t)\Phi_i(t)\,dt and the orthonormal bases \{\Phi_i(t)\} are generated by the kernel R_N(t,s), i.e., solutions to
: \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname{Var}[N_i]= \lambda_i.
Do the expansion:
:S(t) = \sum^\infty_{i=1} S_i\Phi_i(t),
where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then
:X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i
under H and N_i + S_i under K. Let \overline{X} = \{X_1, X_2, \ldots\}; we have
:N_i are independent Gaussian r.v's with variance \lambda_i
:under H: \{X_i\} are independent Gaussian r.v's.
::f_H(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi\lambda_i}} \exp \left (-\frac{x_i^2}{2\lambda_i} \right )
:under K: \{X_i - S_i\} are independent Gaussian r.v's.
::f_K(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi\lambda_i}} \exp \left(-\frac{(x_i - S_i)^2}{2\lambda_i} \right)
Hence, the log-LR is given by
:\mathcal{L}(\underline{x}) = \sum^\infty_{i=1} \frac{2S_i x_i - S_i^2}{2\lambda_i}
and the optimum detector is
:G = \sum^\infty_{i=1} \frac{S_i x_i}{\lambda_i} > G_0 \Rightarrow K, \quad < G_0 \Rightarrow H.
Define
:k(t) = \sum^\infty_{i=1} \frac{S_i}{\lambda_i} \Phi_i(t), \quad 0<t<T,
then G = \int^T _0 k(t)x(t)\,dt.
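Discretizing on a grid turns the eigenfunction equation into an ordinary symmetric eigenvalue problem, which makes the decorrelation easy to sketch; the kernel, the signal and all names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 400, 1.0
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]
S = np.sin(2 * np.pi * 3 * t)                                   # known signal S(t)
R = 0.5 * np.exp(-5.0 * np.abs(t[:, None] - t[None, :]))        # assumed noise kernel R_N(t, s)

# Discrete analogue of  int_0^T R_N(t, s) Phi_i(s) ds = lambda_i Phi_i(t)
lam, Phi = np.linalg.eigh(R * dt)
Phi = Phi / np.sqrt(dt)                      # rescale so that  int Phi_i(t)^2 dt = 1

# One observation under K: x(t) = N(t) + S(t), with N drawn from the kernel R
x = rng.multivariate_normal(np.zeros(n), R) + S

S_i = Phi.T @ S * dt                         # S_i = int S(t) Phi_i(t) dt
x_i = Phi.T @ x * dt                         # decorrelated observations X_i
keep = lam > 1e-10                           # drop numerically negligible modes
G = np.sum(S_i[keep] * x_i[keep] / lam[keep])   # G = sum_i S_i X_i / lambda_i
print(G)
```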


How to find ''k''(''t'')

Since
:\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_{i=1} \frac{S_i}{\lambda_i} \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_{i=1} S_i \Phi_i(t) = S(t),
k(t) is the solution to
:\int^T_0 R_N(t,s)k(s)\,ds = S(t).
If ''N''(''t'') is wide-sense stationary,
:\int^T_0 R_N(t-s)k(s) \, ds = S(t),
which is known as the Wiener–Hopf equation. The equation can be solved formally by taking the Fourier transform, but this is not practically realizable, since the infinite-extent spectrum requires spectral factorization. A special case in which ''k''(''t'') is easy to calculate is white Gaussian noise:
:\int^T_0 \frac{N_0}{2}\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0<t<T.
The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Letting ''C'' = 1, this is just the result we arrived at in the previous section for detecting a signal in white noise.
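Equivalently, on a grid the integral equation becomes a linear system, so k(t) can be obtained with a single solve (same assumed kernel and signal as in the previous sketch):

```python
import numpy as np

n, T = 400, 1.0
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]
S = np.sin(2 * np.pi * 3 * t)
R = 0.5 * np.exp(-5.0 * np.abs(t[:, None] - t[None, :]))

k = np.linalg.solve(R * dt, S)       # Riemann-sum version of  int R_N(t, s) k(s) ds = S(t)
# For an observation x on the same grid, the statistic is G = int k(t) x(t) dt ~= np.sum(k * x) * dt.
```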


Test threshold for Neyman–Pearson detector

Since X(t) is a Gaussian process,
:G = \int^T_0 k(t)x(t) \, dt
is a Gaussian random variable that can be characterized by its mean and variance.
:\begin{align} \mathbf{E}[G \mid H] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid H]\,dt = 0 \\ \mathbf{E}[G \mid K] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid K]\,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf{E}[G^2\mid H] &= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) dt = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname{Var}[G \mid H] &= \mathbf{E}[G^2\mid H] - (\mathbf{E}[G \mid H])^2 = \rho \\ \mathbf{E}[G^2\mid K] &=\int^T_0\int^T_0 k(t)k(s) \mathbf{E}[x(t)x(s)]\,dt\,ds = \int^T_0\int^T_0 k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname{Var}[G \mid K] &= \mathbf{E}[G^2 \mid K] - (\mathbf{E}[G \mid K])^2 = \rho + \rho^2 -\rho^2 = \rho \end{align}
Hence, we obtain the distributions of ''G'' under ''H'' and ''K'':
:H: G \sim N(0,\rho)
:K: G \sim N(\rho, \rho)
The false alarm error is
:\alpha = \int^\infty_{G_0} N(0,\rho)\,dG = 1 - \Phi \left (\frac{G_0}{\sqrt{\rho}} \right ).
So the test threshold for the Neyman–Pearson optimum detector is
:G_0 = \sqrt{\rho}\, \Phi^{-1} (1-\alpha).
Its power of detection is
:\beta = \int^\infty_{G_0} N(\rho, \rho) \, dG = \Phi \left (\sqrt{\rho} - \Phi^{-1}(1 - \alpha) \right).
When the noise is a white Gaussian process, the signal power is
:\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.
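Continuing the same discretized example, the threshold and power follow directly from ρ; the block below is self-contained and all numbers are illustrative:

```python
import numpy as np
from scipy.stats import norm

n, T, alpha = 400, 1.0, 0.01
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]
S = np.sin(2 * np.pi * 3 * t)
R = 0.5 * np.exp(-5.0 * np.abs(t[:, None] - t[None, :]))
k = np.linalg.solve(R * dt, S)                      # k(t) from the previous sketch

rho = np.sum(k * S) * dt                            # rho = int k(t) S(t) dt
G0 = np.sqrt(rho) * norm.ppf(1 - alpha)             # Neyman-Pearson threshold G_0
beta = norm.cdf(np.sqrt(rho) - norm.ppf(1 - alpha)) # power of detection
print(rho, G0, beta)
```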


Prewhitening

For some types of colored noise, a typical practice is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, suppose N(t) is a wide-sense stationary colored noise with correlation function
:R_N(\tau) = \frac{B N_0}{4} e^{-B|\tau|},
:S_N(f) = \frac{N_0/2}{1+\left(\frac{w}{B}\right)^2},
where w = 2\pi f denotes the angular frequency. The transfer function of the prewhitening filter is then
:H(f) = 1 + j \frac{w}{B}.


Detection of a Gaussian random signal in additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement the K–L expansion to get an independent sequence of observations. In this case, the detection problem is described as follows:
:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \quad 0<t<T.
''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}. The K–L expansion of ''X''(''t'') is
:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),
where
:X_i = \int^T_0 X(t)\Phi_i(t)\,dt
and \Phi_i(t) are solutions to
: \int^T_0 R_X(t,s)\Phi_i(s)\,ds= \lambda_i \Phi_i(t).
So the X_i's are an independent sequence of r.v's with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get
:Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 [N(t) + X(t)]\Phi_i(t)\,dt = N_i + X_i,
where
:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.
As ''N''(''t'') is Gaussian white noise, the N_i's are an i.i.d. sequence of r.v's with zero mean and variance \tfrac{1}{2}N_0, so the problem simplifies as follows,
:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i
The Neyman–Pearson optimal test:
:\Lambda = \frac{f_{H_1}(\underline{y})}{f_{H_0}(\underline{y})} = Ce^{-\sum^\infty_{i=1}\tfrac{1}{2}y_i^2 \left( \frac{1}{N_0/2+\lambda_i} - \frac{1}{N_0/2} \right)},
so the log-likelihood ratio is
:\mathcal{L} = \ln(\Lambda) = K -\sum^\infty_{i=1}\tfrac{1}{2}y_i^2 \left( \frac{1}{\frac{N_0}{2}+\lambda_i} - \frac{1}{\frac{N_0}{2}} \right).
Since
:\widehat{X}_i = \frac{\lambda_i}{\frac{N_0}{2}+\lambda_i} Y_i
is just the minimum-mean-square estimate of X_i given Y_i,
:\mathcal{L} = K + \frac{1}{N_0} \sum^\infty_{i=1} Y_i \widehat{X}_i.
The K–L expansion has the following property: If
:f(t) = \sum f_i \Phi_i(t), \quad g(t) = \sum g_i \Phi_i(t),
where
:f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt,
then
:\sum^\infty_{i=1} f_i g_i = \int^T_0 g(t)f(t)\,dt.
So let
:\widehat{X}(t\mid T) = \sum^\infty_{i=1} \widehat{X}_i \Phi_i(t), \quad \mathcal{L} = K + \frac{1}{N_0} \int^T_0 Y(t) \widehat{X}(t\mid T) \, dt.
A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through
:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds.
By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies
:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2} Q(t, \lambda) = R_X(t, \lambda), \quad 0 < \lambda < T, \quad 0 < t < T.
However, for practical reasons, it is necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get the estimate \widehat{X}(t\mid t). Specifically,
:Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda
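A discrete sketch of the resulting estimator–correlator; the signal covariance, noise level and names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, N0 = 300, 1.0, 0.2
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]
RX = np.exp(-10.0 * np.abs(t[:, None] - t[None, :]))    # assumed signal covariance R_X(t, s)

lam, Phi = np.linalg.eigh(RX * dt)                       # K-L basis of the random signal
Phi = Phi / np.sqrt(dt)

# One observation under H1: Y(t) = N(t) + X(t), white noise of PSD N0/2
X = rng.multivariate_normal(np.zeros(n), RX)
Y = X + rng.standard_normal(n) * np.sqrt(N0 / (2 * dt))

Y_i = Phi.T @ Y * dt                                     # coefficients of Y in the K-L basis
X_hat = lam / (lam + N0 / 2) * Y_i                       # MMSE estimate of each X_i from Y_i
L = np.sum(Y_i * X_hat) / N0                             # test statistic (constant K omitted)
print(L)
```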


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

*Wu B., Zhu J., Najm F. (2005). "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of the Design Automation Conference, pp. 841–844.
*Wu B., Zhu J., Najm F. (2006). "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, pp. 1618–1636.


External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Pr. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement K–L expansion to get independent sequence of observation. In this case, the detection problem is described as follows: :H_0 : Y(t) = N(t) :H_1 : Y(t) = N(t) + X(t), \quad 0 ''X''(''t'') is a random process with correlation function R_X(t,s) = E\ The K–L expansion of ''X''(''t'') is :X(t) = \sum^\infty_ X_i \Phi_i(t), where :X_i = \int^T_0 X(t)\Phi_i(t)\,dt and \Phi_i(t) are solutions to : \int^T_0 R_X(t,s)\Phi_i(s)ds= \lambda_i \Phi_i(t). So X_i's are independent sequence of r.v's with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get :Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 (t) + X(t)Phi_i(t) = N_i + X_i, where :N_i = \int^T_0 N(t)\Phi_i(t)\,dt. As ''N''(''t'') is Gaussian white noise, N_i's are i.i.d sequence of r.v with zero mean and variance \tfracN_0, then the problem is simplified as follows, :H_0: Y_i = N_i :H_1: Y_i = N_i + X_i The Neyman–Pearson optimal test: :\Lambda = \frac = Ce^, so the log-likelihood ratio is :\mathcal = \ln(\Lambda) = K -\sum^\infty_\tfracy_i^2 \frac. Since :\widehat_i = \frac is just the minimum-mean-square estimate of X_i given Y_i's, :\mathcal = K + \frac \sum^\infty_ Y_i \widehat_i. K–L expansion has the following property: If :f(t) = \sum f_i \Phi_i(t), g(t) = \sum g_i \Phi_i(t), where :f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt. then :\sum^\infty_ f_i g_i = \int^T_0 g(t)f(t)\,dt. So let :\widehat(t\mid T) = \sum^\infty_ \widehat_i \Phi_i(t), \quad \mathcal = K + \frac \int^T_0 Y(t) \widehat(t\mid T) \, dt. Noncausal filter ''Q''(''t'',''s'') can be used to get the estimate through :\widehat(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds. By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies :\int^T_0 Q(t,s)R_X(s,t)\,ds + \tfrac Q(t, \lambda) = R_X(t, \lambda), 0 < \lambda < T, 0 However, for practical reasons, it's necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get estimate \widehat(t\mid t). Specifically, :Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

* * * * * * * *Wu B., Zhu J., Najm F.(2005) "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of Design Automation Conference(841-844) 2005 *Wu B., Zhu J., Najm F.(2006) "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25 Issue:9 (1618–1636) 2006 *


External links

* ''Mathematica'
KarhunenLoeveDecomposition
function. * ''E161: Computer Image Processing and Analysis'' notes by Pr. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
br>
{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2\mathbb X_t, _^2/math>, so that \sum_^\infty p_k=1. We may then define the representation
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
to be H(\)=-\sum_i p_k \log(p_k). Then we have H(\)\ge H(\), for all choices of \varphi_k. That is, the KL-expansion has minimal representation entropy. Proof: Denote the coefficients obtained for the basis e_k(t) as p_k, and for \varphi_k(t) as q_k. Choose N\ge 1. Note that since e_k minimizes the mean squared error, we have that : \mathbb \left, \sum_^N Z_ke_k(t)-X_t\_^2\le \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 Expanding the right hand size, we get: : \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 =\mathbb, X_t^2, _ + \sum_^N \sum_^N \mathbb _\ell \varphi_\ell(t)W_k^*\varphi_k^*(t)-\sum_^N \mathbb _k \varphi_k X_t^* - \sum_^N \mathbb _tW_k^*\varphi_k^*(t) Using the orthonormality of \varphi_k(t), and expanding X_t in the \varphi_k(t) basis, we get that the right hand size is equal to: : \mathbb _t2_-\sum_^N\mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still apply the K–L expansion to get an independent sequence of observations. In this case, the detection problem is described as follows:

:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \quad 0<t<T.

''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}. The K–L expansion of ''X''(''t'') is

:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),

where

:X_i = \int^T_0 X(t)\Phi_i(t)\,dt

and \Phi_i(t) are solutions to

: \int^T_0 R_X(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t).

So the X_i form an independent sequence of r.v.'s with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get

:Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 [N(t) + X(t)]\Phi_i(t)\,dt = N_i + X_i,

where

:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.

As ''N''(''t'') is Gaussian white noise, the N_i form an i.i.d. sequence of r.v.'s with zero mean and variance \tfrac{1}{2}N_0, so the problem simplifies as follows:

:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i

The Neyman–Pearson optimal test:

:\Lambda = \frac{f_{Y\mid H_1}}{f_{Y\mid H_0}} = C\exp\left(\sum^\infty_{i=1} \frac{y_i^2}{2}\,\frac{\lambda_i}{\tfrac{N_0}{2}\left(\tfrac{N_0}{2}+\lambda_i\right)}\right),

so the log-likelihood ratio is

:\mathcal{L} = \ln(\Lambda) = K + \sum^\infty_{i=1} \frac{y_i^2}{2}\,\frac{\lambda_i}{\tfrac{N_0}{2}\left(\tfrac{N_0}{2}+\lambda_i\right)},

where K = \ln C absorbs the normalization constants. Since

:\widehat{X}_i = \frac{\lambda_i}{\tfrac{N_0}{2}+\lambda_i} Y_i

is just the minimum mean-square estimate of X_i given Y_i,

:\mathcal{L} = K + \frac{1}{N_0} \sum^\infty_{i=1} Y_i \widehat{X}_i.

The K–L expansion has the following property: if

:f(t) = \sum f_i \Phi_i(t), \quad g(t) = \sum g_i \Phi_i(t),

where

:f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt,

then

:\sum^\infty_{i=1} f_i g_i = \int^T_0 f(t)g(t)\,dt.

So let

:\widehat{X}(t\mid T) = \sum^\infty_{i=1} \widehat{X}_i \Phi_i(t), \quad \mathcal{L} = K + \frac{1}{N_0} \int^T_0 Y(t) \widehat{X}(t\mid T) \, dt.

A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through

:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds.

By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies

:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2} Q(t, \lambda) = R_X(t, \lambda), \qquad 0 < \lambda < T,\; 0 < t < T.

However, for practical reasons, it is necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get the estimate \widehat{X}(t\mid t). Specifically,

:Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda.
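The estimator-correlator structure \mathcal{L} = K + \tfrac{1}{N_0}\sum_i Y_i \widehat{X}_i can be sketched numerically as follows. The signal covariance R_X, the grid and the truncation level are illustrative assumptions, and the constant K is omitted:

```python
import numpy as np

# Assumptions: Brownian-motion-like signal covariance R_X(t,s) = min(t,s), white noise level N0.
T, n, N0 = 1.0, 400, 0.2
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]
R_X = np.minimum(t[:, None], t[None, :])

# Discretized K-L basis of the signal covariance.
lam, Phi = np.linalg.eigh(R_X * dt)
order = np.argsort(lam)[::-1]
lam, Phi = lam[order], Phi[:, order] / np.sqrt(dt)

# Draw one signal via its K-L synthesis and add discretized white noise of PSD N0/2.
rng = np.random.default_rng(2)
m = 100
X = Phi[:, :m] @ (np.sqrt(lam[:m]) * rng.standard_normal(m))
Y = X + np.sqrt(N0 / (2.0 * dt)) * rng.standard_normal(n)

# Project, form the MMSE estimates X_hat_i, and accumulate the statistic (up to the constant K).
Y_i = Phi[:, :m].T @ Y * dt
X_hat = lam[:m] / (N0 / 2.0 + lam[:m]) * Y_i
stat = np.sum(Y_i * X_hat) / N0
print("estimator-correlator statistic =", stat)
```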


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...




External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Prof. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement K–L expansion to get independent sequence of observation. In this case, the detection problem is described as follows: :H_0 : Y(t) = N(t) :H_1 : Y(t) = N(t) + X(t), \quad 0 ''X''(''t'') is a random process with correlation function R_X(t,s) = E\ The K–L expansion of ''X''(''t'') is :X(t) = \sum^\infty_ X_i \Phi_i(t), where :X_i = \int^T_0 X(t)\Phi_i(t)\,dt and \Phi_i(t) are solutions to : \int^T_0 R_X(t,s)\Phi_i(s)ds= \lambda_i \Phi_i(t). So X_i's are independent sequence of r.v's with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get :Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 (t) + X(t)Phi_i(t) = N_i + X_i, where :N_i = \int^T_0 N(t)\Phi_i(t)\,dt. As ''N''(''t'') is Gaussian white noise, N_i's are i.i.d sequence of r.v with zero mean and variance \tfracN_0, then the problem is simplified as follows, :H_0: Y_i = N_i :H_1: Y_i = N_i + X_i The Neyman–Pearson optimal test: :\Lambda = \frac = Ce^, so the log-likelihood ratio is :\mathcal = \ln(\Lambda) = K -\sum^\infty_\tfracy_i^2 \frac. Since :\widehat_i = \frac is just the minimum-mean-square estimate of X_i given Y_i's, :\mathcal = K + \frac \sum^\infty_ Y_i \widehat_i. K–L expansion has the following property: If :f(t) = \sum f_i \Phi_i(t), g(t) = \sum g_i \Phi_i(t), where :f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt. then :\sum^\infty_ f_i g_i = \int^T_0 g(t)f(t)\,dt. So let :\widehat(t\mid T) = \sum^\infty_ \widehat_i \Phi_i(t), \quad \mathcal = K + \frac \int^T_0 Y(t) \widehat(t\mid T) \, dt. Noncausal filter ''Q''(''t'',''s'') can be used to get the estimate through :\widehat(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds. By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies :\int^T_0 Q(t,s)R_X(s,t)\,ds + \tfrac Q(t, \lambda) = R_X(t, \lambda), 0 < \lambda < T, 0 However, for practical reasons, it's necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get estimate \widehat(t\mid t). Specifically, :Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

* * * * * * * *Wu B., Zhu J., Najm F.(2005) "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of Design Automation Conference(841-844) 2005 *Wu B., Zhu J., Najm F.(2006) "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25 Issue:9 (1618–1636) 2006 *


External links

* ''Mathematica'
KarhunenLoeveDecomposition
function. * ''E161: Computer Image Processing and Analysis'' notes by Pr. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
br>
{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2/math> We may perform identical analysis for the e_k(t), and so rewrite the above inequality as: : \le Subtracting the common first term, and dividing by \mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement K–L expansion to get independent sequence of observation. In this case, the detection problem is described as follows: :H_0 : Y(t) = N(t) :H_1 : Y(t) = N(t) + X(t), \quad 0 ''X''(''t'') is a random process with correlation function R_X(t,s) = E\ The K–L expansion of ''X''(''t'') is :X(t) = \sum^\infty_ X_i \Phi_i(t), where :X_i = \int^T_0 X(t)\Phi_i(t)\,dt and \Phi_i(t) are solutions to : \int^T_0 R_X(t,s)\Phi_i(s)ds= \lambda_i \Phi_i(t). So X_i's are independent sequence of r.v's with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get :Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 (t) + X(t)Phi_i(t) = N_i + X_i, where :N_i = \int^T_0 N(t)\Phi_i(t)\,dt. As ''N''(''t'') is Gaussian white noise, N_i's are i.i.d sequence of r.v with zero mean and variance \tfracN_0, then the problem is simplified as follows, :H_0: Y_i = N_i :H_1: Y_i = N_i + X_i The Neyman–Pearson optimal test: :\Lambda = \frac = Ce^, so the log-likelihood ratio is :\mathcal = \ln(\Lambda) = K -\sum^\infty_\tfracy_i^2 \frac. Since :\widehat_i = \frac is just the minimum-mean-square estimate of X_i given Y_i's, :\mathcal = K + \frac \sum^\infty_ Y_i \widehat_i. K–L expansion has the following property: If :f(t) = \sum f_i \Phi_i(t), g(t) = \sum g_i \Phi_i(t), where :f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt. then :\sum^\infty_ f_i g_i = \int^T_0 g(t)f(t)\,dt. So let :\widehat(t\mid T) = \sum^\infty_ \widehat_i \Phi_i(t), \quad \mathcal = K + \frac \int^T_0 Y(t) \widehat(t\mid T) \, dt. Noncausal filter ''Q''(''t'',''s'') can be used to get the estimate through :\widehat(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds. By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies :\int^T_0 Q(t,s)R_X(s,t)\,ds + \tfrac Q(t, \lambda) = R_X(t, \lambda), 0 < \lambda < T, 0 However, for practical reasons, it's necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get estimate \widehat(t\mid t). Specifically, :Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

* * * * * * * *Wu B., Zhu J., Najm F.(2005) "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of Design Automation Conference(841-844) 2005 *Wu B., Zhu J., Najm F.(2006) "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25 Issue:9 (1618–1636) 2006 *


External links

* ''Mathematica'
KarhunenLoeveDecomposition
function. * ''E161: Computer Image Processing and Analysis'' notes by Pr. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
br>


{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2\mathbb X_t, _^2/math>, so that \sum_^\infty p_k=1. We may then define the representation
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
to be H(\)=-\sum_i p_k \log(p_k). Then we have H(\)\ge H(\), for all choices of \varphi_k. That is, the KL-expansion has minimal representation entropy. Proof: Denote the coefficients obtained for the basis e_k(t) as p_k, and for \varphi_k(t) as q_k. Choose N\ge 1. Note that since e_k minimizes the mean squared error, we have that : \mathbb \left, \sum_^N Z_ke_k(t)-X_t\_^2\le \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 Expanding the right hand size, we get: : \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 =\mathbb, X_t^2, _ + \sum_^N \sum_^N \mathbb _\ell \varphi_\ell(t)W_k^*\varphi_k^*(t)-\sum_^N \mathbb _k \varphi_k X_t^* - \sum_^N \mathbb _tW_k^*\varphi_k^*(t) Using the orthonormality of \varphi_k(t), and expanding X_t in the \varphi_k(t) basis, we get that the right hand size is equal to: : \mathbb _t2_-\sum_^N\mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by:
:\Sigma_{ij}= \mathbf{E}[X_i X_j], \qquad \forall i,j \in \{1,\ldots,N\}
Rewriting the above integral equation to suit the discrete case, we observe that it turns into:
:\sum_{j=1}^N \Sigma_{ij} e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e
where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \{\lambda_i,\varphi_i\}_{i\in\{1,\ldots,N\}} for this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of \lambda_i. Let Φ also be the orthonormal matrix consisting of these eigenvectors:
:\begin{align} \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end{align}
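As an illustration, a short Python sketch of this eigenvalue problem follows (NumPy assumed; the data matrix is a hypothetical set of observations of ''X'', one per row). It forms the empirical covariance matrix of centered data and extracts orthonormal eigenvectors sorted by decreasing eigenvalue.

    import numpy as np

    # Hypothetical data: 500 realizations of an N = 6 dimensional random vector X.
    rng = np.random.default_rng(1)
    N = 6
    A = rng.standard_normal((N, N))
    data = rng.standard_normal((500, N)) @ A.T    # rows are observations of X

    X = data - data.mean(axis=0)                  # center: X := X - mu_X
    Sigma = X.T @ X / X.shape[0]                  # Sigma_ij = E[X_i X_j] (empirical)

    # Sigma is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    lam, Phi = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]                 # list eigenpairs by decreasing eigenvalue
    lam, Phi = lam[order], Phi[:, order]          # columns of Phi are the eigenvectors phi_i

    assert np.allclose(Phi.T @ Phi, np.eye(N))    # orthonormality: Phi^T Phi = I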


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have:
:X =\sum_{i=1}^N \langle \varphi_i,X\rangle \varphi_i =\sum_{i=1}^N \left(\varphi_i^T X\right) \varphi_i
In a more compact form, the principal component transform of ''X'' is defined by:
:\begin{cases} Y=\Phi^T X \\ X=\Phi Y \end{cases}
The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i, and the inverse transform X=\Phi Y yields the expansion of ''X'' on the space spanned by the \varphi_i:
:X=\sum_{i=1}^N Y_i \varphi_i=\sum_{i=1}^N \langle \varphi_i,X\rangle \varphi_i
As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\{1,\ldots,N\} such that
:\frac{\sum_{i=1}^K \lambda_i}{\sum_{i=1}^N \lambda_i}\geq \alpha
where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE). X. Tang, "Texture information in run-length matrices," IEEE Transactions on Image Processing, vol. 7, no. 11, pp. 1602–1609, Nov. 1998.
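A minimal Python sketch of the transform and of the explained-variance truncation (NumPy assumed; the data and the threshold α = 0.95 are hypothetical choices): it computes Y = Φ^T X for each centered observation, selects the smallest K whose leading eigenvalues explain at least a fraction α of the total variance, and forms the rank-K reconstruction.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n_obs, alpha = 8, 1000, 0.95              # illustrative sizes and threshold

    data = rng.standard_normal((n_obs, N)) @ rng.standard_normal((N, N))
    X = data - data.mean(axis=0)                 # centered observations, one per row

    Sigma = X.T @ X / n_obs
    lam, Phi = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, Phi = lam[order], Phi[:, order]         # columns phi_i, eigenvalues decreasing

    Y = X @ Phi                                  # principal component transform, Y_i = phi_i^T X

    # Smallest K with (lam_1 + ... + lam_K) / (lam_1 + ... + lam_N) >= alpha
    K = int(np.searchsorted(np.cumsum(lam) / lam.sum(), alpha) + 1)

    X_K = Y[:, :K] @ Phi[:, :K].T                # rank-K reconstruction from the first K components
    rel_err = np.linalg.norm(X - X_K)**2 / np.linalg.norm(X)**2
    print(K, rel_err)                            # unexplained variance fraction, at most 1 - alpha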


Examples


The Wiener process

There are numerous equivalent characterizations of the Wiener process, which is a mathematical formalization of Brownian motion. Here we regard it as the centered standard Gaussian process W_t with covariance function
:K_W(t,s) = \operatorname{cov}(W_t,W_s) = \min(s,t).
We restrict the time domain to [''a'', ''b''] = [0,1] without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are
:e_k(t) = \sqrt{2} \sin \left( \left(k - \tfrac{1}{2}\right) \pi t \right)
and the corresponding eigenvalues are
:\lambda_k = \frac{1}{\left(k-\frac{1}{2}\right)^2\pi^2}.
This gives the following representation of the Wiener process:
Theorem. There is a sequence (Z_k)_{k\ge 1} of independent Gaussian random variables with mean zero and variance 1 such that
:W_t = \sqrt{2} \sum_{k=1}^\infty Z_k \frac{\sin \left(\left(k-\frac{1}{2}\right)\pi t\right)}{\left(k-\frac{1}{2}\right)\pi}.
Note that this representation is only valid for t\in[0,1]. On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L^2 norm and uniform in ''t''.
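The truncated series gives a simple way to simulate approximate Wiener paths. Below is a small Python sketch (NumPy assumed; the truncation level and time grid are arbitrary choices) that sums the first terms of the expansion and compares the empirical variance of W_t with K_W(t,t) = t.

    import numpy as np

    rng = np.random.default_rng(3)
    n_terms, n_paths = 200, 5000
    t = np.linspace(0.0, 1.0, 101)

    k = np.arange(1, n_terms + 1)
    freq = (k - 0.5) * np.pi                              # (k - 1/2) * pi
    basis = np.sqrt(2) * np.sin(np.outer(t, freq)) / freq # sqrt(2) sin((k-1/2) pi t) / ((k-1/2) pi)

    Z = rng.standard_normal((n_paths, n_terms))           # i.i.d. N(0,1) coefficients Z_k
    W = Z @ basis.T                                       # truncated K-L sum, one path per row

    # The empirical variance of W_t should approach K_W(t,t) = t as n_terms grows.
    print(np.max(np.abs(W.var(axis=0) - t)))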


The Brownian bridge

Similarly the Brownian bridge B_t=W_t-tW_1, which is a stochastic process with covariance function
:K_B(t,s)=\min(t,s)-ts,
can be represented as the series
:B_t = \sum_{k=1}^\infty Z_k \frac{\sqrt{2}\sin(k\pi t)}{k\pi},
where the Z_k are independent standard Gaussian random variables.
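A companion Python sketch for the bridge (NumPy assumed; truncation level arbitrary) draws paths from the truncated series, built from the standard eigenpairs √2 sin(kπt) and 1/(kπ)^2 of K_B, and checks that the paths vanish at both endpoints and that the empirical covariance approaches min(s,t) − st.

    import numpy as np

    rng = np.random.default_rng(4)
    n_terms, n_paths = 200, 5000
    t = np.linspace(0.0, 1.0, 51)

    k = np.arange(1, n_terms + 1)
    basis = np.sqrt(2) * np.sin(np.outer(t, k * np.pi)) / (k * np.pi)  # sqrt(lambda_k) e_k(t)

    Z = rng.standard_normal((n_paths, n_terms))
    B = Z @ basis.T                                       # truncated series for the bridge

    emp_cov = np.cov(B, rowvar=False)                     # empirical Cov(B_s, B_t)
    theo_cov = np.minimum.outer(t, t) - np.outer(t, t)    # min(s,t) - st

    print(np.abs(B[:, 0]).max(), np.abs(B[:, -1]).max())  # B_0 = B_1 = 0 up to rounding
    print(np.max(np.abs(emp_cov - theo_cov)))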


Applications

Adaptive optics systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). The Karhunen–Loève expansion is closely related to the singular value decomposition (SVD). The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector-valued stochastic process, then the left singular vectors are maximum likelihood estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis test is used for detecting a continuous signal ''s''(''t'') from the channel output ''X''(''t''), where ''N''(''t'') is the channel noise, usually assumed to be a zero-mean Gaussian process with correlation function R_N(t, s) = E[N(t)N(s)]:
:H: X(t) = N(t),
:K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is
:R_N(t) = \tfrac{1}{2} N_0 \delta(t),
and it has a constant power spectral density. In a physically practical channel, the noise power is finite, so:
:S_N(f) = \begin{cases} \frac{N_0}{2}, & |f|<\omega \\ 0, & |f|>\omega \end{cases}
Then the noise correlation function is a sinc function with zeros at \frac{n}{2\omega}, n \in \mathbf{Z}. Since samples of the noise taken at these spacings are uncorrelated and Gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing
:\Delta t = \frac{1}{2\omega} \text{ within } (0,T).
Let X_i = X(i\,\Delta t). We have a total of n = \frac{T}{\Delta t} = T(2\omega) = 2\omega T i.i.d. observations \{X_1, X_2,\ldots, X_n\} with which to develop the likelihood-ratio test. Define the signal samples S_i = S(i\,\Delta t); the problem becomes:
:H: X_i = N_i,
:K: X_i = N_i + S_i, \quad i = 1,2,\ldots,n.
The log-likelihood ratio test reduces to
:\mathcal{L}(\underline{x}) = \log\frac{f_K(\underline{x})}{f_H(\underline{x})} \quad \Leftrightarrow \quad \Delta t \sum^n_{i=1} S_i x_i = \sum^n_{i=1} S(i\,\Delta t)\,x(i\,\Delta t) \, \Delta t \gtrless \lambda.
As \Delta t \to 0, let:
:G = \int^T_0 S(t)x(t) \, dt.
Then ''G'' is the test statistic and the Neyman–Pearson optimum detector is
:G(\underline{x}) > G_0 \Rightarrow K; \qquad G(\underline{x}) < G_0 \Rightarrow H.
As ''G'' is Gaussian, we can characterize it by finding its mean and variance. Then we get
:H: G \sim N \left(0,\tfrac{1}{2}N_0 E \right)
:K: G \sim N \left(E,\tfrac{1}{2}N_0 E \right)
where
:E = \int^T_0 S^2(t) \, dt
is the signal energy. The false alarm error is
:\alpha = \int^\infty_{G_0} N \left(0, \tfrac{1}{2}N_0 E \right) \, dG \Rightarrow G_0 = \sqrt{\tfrac{1}{2}N_0 E}\, \Phi^{-1}(1-\alpha)
and the probability of detection is
:\beta = \int^\infty_{G_0} N \left(E, \tfrac{1}{2}N_0 E \right) \, dG = 1-\Phi \left(\frac{G_0-E}{\sqrt{\tfrac{1}{2}N_0 E}} \right) = \Phi \left(\sqrt{\frac{2E}{N_0}} - \Phi^{-1}(1-\alpha) \right),
where Φ is the cdf of the standard normal, or Gaussian, variable.
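The threshold and the detection probability can be computed, and the false-alarm rate checked by simulation, in a few lines of Python (NumPy and SciPy assumed; the signal, N_0 and α below are hypothetical illustrative choices).

    import numpy as np
    from scipy.stats import norm

    # Illustrative choices: a known signal on (0, T), noise level N0, false-alarm level alpha.
    T, N0, alpha = 1.0, 0.5, 0.05
    t = np.linspace(0.0, T, 2001)
    dt = t[1] - t[0]
    s = np.sin(2 * np.pi * 5 * t)                    # hypothetical known signal S(t)

    E = np.sum(s**2) * dt                            # signal energy, int S(t)^2 dt
    sigma = np.sqrt(0.5 * N0 * E)                    # std of G under both hypotheses
    G0 = sigma * norm.ppf(1 - alpha)                 # Neyman-Pearson threshold
    beta = norm.cdf(np.sqrt(2 * E / N0) - norm.ppf(1 - alpha))   # detection probability

    # Monte Carlo check of the false-alarm rate under H, with G = int S(t) x(t) dt.
    rng = np.random.default_rng(5)
    noise = rng.standard_normal((20000, t.size)) * np.sqrt(0.5 * N0 / dt)  # discretized white noise
    G = noise @ s * dt
    print(G0, beta, np.mean(G > G0))                 # empirical rate should be close to alpha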


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E[N(t)N(s)], we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use the K–L expansion to decorrelate the noise process and get independent Gaussian observation "samples". The K–L expansion of ''N''(''t'') is
:N(t) = \sum^{\infty}_{i=1} N_i \Phi_i(t), \quad 0<t<T,
where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \{\Phi_i(t)\} are generated by the kernel R_N(t,s), i.e., are solutions to
:\int^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname{var}[N_i]= \lambda_i.
Do the expansion
:S(t) = \sum^{\infty}_{i=1} S_i\Phi_i(t),
where S_i = \int^T_0 S(t)\Phi_i(t) \, dt, then
:X_i = \int^T_0 X(t)\Phi_i(t) \, dt = N_i
under H and N_i + S_i under K. Let \overline{X} = \{X_1,X_2,\ldots\}; we have:
:N_i are independent Gaussian r.v.'s with variance \lambda_i
:under H: \{X_i\} are independent Gaussian r.v.'s.
::f_H[x(t)\mid 0<t<T] = f_H(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi \lambda_i}} \exp \left(-\frac{x_i^2}{2 \lambda_i} \right)
:under K: \{X_i - S_i\} are independent Gaussian r.v.'s.
::f_K[x(t)\mid 0<t<T] = f_K(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2 \pi \lambda_i}} \exp \left(-\frac{(x_i - S_i)^2}{2 \lambda_i} \right)
Hence, the log-LR is given by
:\mathcal{L}(\underline{x}) = \sum^{\infty}_{i=1} \frac{2S_i x_i - S_i^2}{2\lambda_i}
and the optimum detector is
:G = \sum^\infty_{i=1} \frac{S_i x_i}{\lambda_i} > G_0 \Rightarrow K, \qquad < G_0 \Rightarrow H.
Define
:k(t) = \sum^\infty_{i=1} \frac{S_i}{\lambda_i} \Phi_i(t), \quad 0<t<T,
then G = \int^T_0 k(t)x(t)\,dt.


How to find ''k''(''t'')

Since
:\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_{i=1} \frac{S_i}{\lambda_i} \int^T_0 R_N(t,s)\Phi_i(s) \, ds = \sum^\infty_{i=1} S_i \Phi_i(t) = S(t),
k(t) is the solution to
:\int^T_0 R_N(t,s)k(s)\,ds = S(t).
If ''N''(''t'') is wide-sense stationary,
:\int^T_0 R_N(t-s)k(s) \, ds = S(t),
which is known as the Wiener–Hopf equation. The equation can be solved by taking the Fourier transform, but this is not practically realizable since an infinite spectrum needs spatial factorization. A special case in which ''k''(''t'') is easy to calculate is white Gaussian noise:
:\int^T_0 \frac{N_0}{2}\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0<t<T.
The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Letting ''C'' = 1, this is just the result we arrived at in the previous section for detecting a signal in white noise.
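In practice the integral equation can also be solved numerically by discretizing it on a grid, which turns \int_0^T R_N(t,s)k(s)\,ds = S(t) into a linear system. A minimal Python sketch (NumPy assumed; the covariance R_N(t,s) = e^{-|t-s|} and the signal are hypothetical examples):

    import numpy as np

    T, n = 1.0, 400
    t = np.linspace(0.0, T, n)
    dt = t[1] - t[0]

    R_N = np.exp(-np.abs(np.subtract.outer(t, t)))   # hypothetical colored-noise covariance
    S = np.sin(2 * np.pi * 3 * t)                    # hypothetical known signal

    # Discretize  int_0^T R_N(t,s) k(s) ds = S(t)  as  (R_N * dt) k = S  and solve for k.
    k = np.linalg.solve(R_N * dt, S)

    residual = R_N @ k * dt - S                      # should reproduce S(t) on the grid
    print(np.max(np.abs(residual)))

The resulting weights give the detector statistic G ≈ Σ_i k(t_i) x(t_i) Δt for any observed record x(t) sampled on the same grid.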


Test threshold for Neyman–Pearson detector

Since X(t) is a Gaussian process,
:G = \int^T_0 k(t)x(t) \, dt
is a Gaussian random variable that can be characterized by its mean and variance.
:\begin{align}
\mathbf{E}[G \mid H] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid H]\,dt = 0 \\
\mathbf{E}[G \mid K] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid K]\,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\
\mathbf{E}[G^2\mid H] &= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left(\int^T_0 k(s)R_N(t,s) \, ds \right) dt = \int^T_0 k(t)S(t) \, dt = \rho \\
\operatorname{var}[G \mid H] &= \mathbf{E}[G^2\mid H] - (\mathbf{E}[G \mid H])^2 = \rho \\
\mathbf{E}[G^2\mid K] &=\int^T_0\int^T_0 k(t)k(s) \mathbf{E}[x(t)x(s)]\,dt\,ds = \int^T_0\int^T_0 k(t)k(s)\left(R_N(t,s) +S(t)S(s)\right) dt\, ds = \rho + \rho^2\\
\operatorname{var}[G \mid K] &= \mathbf{E}[G^2 \mid K] - (\mathbf{E}[G \mid K])^2 = \rho + \rho^2 -\rho^2 = \rho
\end{align}
Hence, we obtain the distributions of ''G'' under ''H'' and ''K'':
:H: G \sim N(0,\rho)
:K: G \sim N(\rho, \rho)
The false alarm error is
:\alpha = \int^\infty_{G_0} N(0,\rho)\,dG = 1 - \Phi \left(\frac{G_0}{\sqrt{\rho}} \right),
so the test threshold for the Neyman–Pearson optimum detector is
:G_0 = \sqrt{\rho}\, \Phi^{-1}(1-\alpha).
Its power of detection is
:\beta = \int^\infty_{G_0} N(\rho, \rho) \, dG = \Phi \left(\sqrt{\rho} - \Phi^{-1}(1 - \alpha) \right).
When the noise is a white Gaussian process, the signal power is
:\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.
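Given ρ, the threshold and the power follow directly. The short Python sketch below (NumPy and SciPy assumed; it recomputes the hypothetical k(t) from the previous sketch so as to be self-contained) traces a few points of the receiver operating characteristic.

    import numpy as np
    from scipy.stats import norm

    T, n = 1.0, 400
    t = np.linspace(0.0, T, n)
    dt = t[1] - t[0]
    R_N = np.exp(-np.abs(np.subtract.outer(t, t)))   # same hypothetical noise covariance
    S = np.sin(2 * np.pi * 3 * t)                    # same hypothetical signal

    k = np.linalg.solve(R_N * dt, S)                 # detector weight k(t) on the grid
    rho = np.sum(k * S) * dt                         # rho = int k(t) S(t) dt

    for alpha in (0.01, 0.05, 0.10):
        G0 = np.sqrt(rho) * norm.ppf(1 - alpha)      # Neyman-Pearson threshold
        beta = norm.cdf(np.sqrt(rho) - norm.ppf(1 - alpha))   # power of detection
        print(alpha, G0, beta)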


Prewhitening

For some types of colored noise, a typical practice is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, suppose N(t) is a wide-sense stationary colored noise with correlation function
:R_N(\tau) = \frac{B N_0}{4} e^{-B|\tau|},
:S_N(f) = \frac{N_0/2}{1+\left(\frac{2\pi f}{B}\right)^2}.
The transfer function of the prewhitening filter is
:H(f) = 1 + j \frac{2\pi f}{B}.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement the K–L expansion to get an independent sequence of observations. In this case, the detection problem is described as follows:
:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \quad 0<t<T
''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}. The K–L expansion of ''X''(''t'') is
:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),
where
:X_i = \int^T_0 X(t)\Phi_i(t)\,dt
and \Phi_i(t) are solutions to
:\int^T_0 R_X(t,s)\Phi_i(s)\,ds= \lambda_i \Phi_i(t).
So the X_i form an independent sequence of r.v.'s with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get
:Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 \left[N(t) + X(t)\right]\Phi_i(t)\,dt = N_i + X_i,
where
:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.
As ''N''(''t'') is Gaussian white noise, the N_i form an i.i.d. sequence of r.v.'s with zero mean and variance \tfrac{1}{2}N_0, so the problem simplifies as follows:
:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i
The Neyman–Pearson optimal test is based on the likelihood ratio
:\Lambda = \frac{f_{H_1}(\underline{y})}{f_{H_0}(\underline{y})},
so the log-likelihood ratio is
:\mathcal{L} = \ln(\Lambda) = K + \sum^\infty_{i=1}\tfrac{1}{2}y_i^2 \frac{\lambda_i}{\tfrac{N_0}{2} \left(\tfrac{N_0}{2}+\lambda_i\right)}.
Since
:\widehat{X}_i = \frac{\lambda_i}{\tfrac{N_0}{2}+\lambda_i} Y_i
is just the minimum-mean-square estimate of X_i given Y_i,
:\mathcal{L} = K + \frac{1}{N_0} \sum^\infty_{i=1} Y_i \widehat{X}_i.
The K–L expansion has the following property: if
:f(t) = \sum f_i \Phi_i(t), \quad g(t) = \sum g_i \Phi_i(t),
where
:f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt,
then
:\sum^\infty_{i=1} f_i g_i = \int^T_0 g(t)f(t)\,dt.
So let
:\widehat{X}(t\mid T) = \sum^\infty_{i=1} \widehat{X}_i \Phi_i(t), \quad \mathcal{L} = K + \frac{1}{N_0} \int^T_0 Y(t) \widehat{X}(t\mid T) \, dt.
A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through
:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds.
By the orthogonality principle, ''Q''(''t'',''s'') satisfies
:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2} Q(t, \lambda) = R_X(t, \lambda), \qquad 0 < \lambda < T,\ 0<t<T.
However, for practical reasons, it is necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get the estimate \widehat{X}(t\mid t). Specifically,
:Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda
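A discrete analogue makes this estimator–correlator structure concrete. For a finite-dimensional model Y = X + N with X ~ N(0, R_X) and N ~ N(0, (N_0/2)I), the noncausal (smoothing) estimate is \widehat{X} = R_X\left(R_X + \tfrac{N_0}{2}I\right)^{-1}Y and the data-dependent part of the log-likelihood ratio is Y^T\widehat{X}/N_0. The Python sketch below (NumPy assumed; the covariance R_X and the value of N_0 are hypothetical choices) compares the statistic under both hypotheses.

    import numpy as np

    rng = np.random.default_rng(6)
    n, N0 = 64, 0.4
    t = np.linspace(0.0, 1.0, n)
    R_X = np.exp(-(np.subtract.outer(t, t) / 0.1)**2)     # hypothetical signal covariance
    sigma2 = N0 / 2.0

    Q = R_X @ np.linalg.inv(R_X + sigma2 * np.eye(n))     # noncausal MMSE (Wiener) smoother

    def statistic(y):
        x_hat = Q @ y                                     # estimator: X_hat = Q Y
        return y @ x_hat / N0                             # correlator: (1/N0) Y^T X_hat

    L_X = np.linalg.cholesky(R_X + 1e-8 * np.eye(n))      # jittered factor for sampling X ~ N(0, R_X)
    stats_H0, stats_H1 = [], []
    for _ in range(2000):
        noise = np.sqrt(sigma2) * rng.standard_normal(n)
        x = L_X @ rng.standard_normal(n)
        stats_H0.append(statistic(noise))                 # H0: Y = N
        stats_H1.append(statistic(noise + x))             # H1: Y = N + X

    print(np.mean(stats_H0), np.mean(stats_H1))           # statistic is larger on average under H1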


See also

* Principal component analysis
* Polynomial chaos
* Reproducing kernel Hilbert space
* Mercer's theorem




External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Prof. Ruye Wang at Harvey Mudd College.
Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement K–L expansion to get independent sequence of observation. In this case, the detection problem is described as follows: :H_0 : Y(t) = N(t) :H_1 : Y(t) = N(t) + X(t), \quad 0 ''X''(''t'') is a random process with correlation function R_X(t,s) = E\ The K–L expansion of ''X''(''t'') is :X(t) = \sum^\infty_ X_i \Phi_i(t), where :X_i = \int^T_0 X(t)\Phi_i(t)\,dt and \Phi_i(t) are solutions to : \int^T_0 R_X(t,s)\Phi_i(s)ds= \lambda_i \Phi_i(t). So X_i's are independent sequence of r.v's with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get :Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 (t) + X(t)Phi_i(t) = N_i + X_i, where :N_i = \int^T_0 N(t)\Phi_i(t)\,dt. As ''N''(''t'') is Gaussian white noise, N_i's are i.i.d sequence of r.v with zero mean and variance \tfracN_0, then the problem is simplified as follows, :H_0: Y_i = N_i :H_1: Y_i = N_i + X_i The Neyman–Pearson optimal test: :\Lambda = \frac = Ce^, so the log-likelihood ratio is :\mathcal = \ln(\Lambda) = K -\sum^\infty_\tfracy_i^2 \frac. Since :\widehat_i = \frac is just the minimum-mean-square estimate of X_i given Y_i's, :\mathcal = K + \frac \sum^\infty_ Y_i \widehat_i. K–L expansion has the following property: If :f(t) = \sum f_i \Phi_i(t), g(t) = \sum g_i \Phi_i(t), where :f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt. then :\sum^\infty_ f_i g_i = \int^T_0 g(t)f(t)\,dt. So let :\widehat(t\mid T) = \sum^\infty_ \widehat_i \Phi_i(t), \quad \mathcal = K + \frac \int^T_0 Y(t) \widehat(t\mid T) \, dt. Noncausal filter ''Q''(''t'',''s'') can be used to get the estimate through :\widehat(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds. By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies :\int^T_0 Q(t,s)R_X(s,t)\,ds + \tfrac Q(t, \lambda) = R_X(t, \lambda), 0 < \lambda < T, 0 However, for practical reasons, it's necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get estimate \widehat(t\mid t). Specifically, :Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

* * * * * * * *Wu B., Zhu J., Najm F.(2005) "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of Design Automation Conference(841-844) 2005 *Wu B., Zhu J., Najm F.(2006) "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25 Issue:9 (1618–1636) 2006 *


External links

* ''Mathematica'
KarhunenLoeveDecomposition
function. * ''E161: Computer Image Processing and Analysis'' notes by Pr. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
br>
{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2/math> We may perform identical analysis for the e_k(t), and so rewrite the above inequality as: : \le Subtracting the common first term, and dividing by \mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E[N(t)N(s)], we cannot obtain independent discrete observations by sampling at evenly spaced times. Instead, we can use the K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''):
:N(t) = \sum^\infty_{i=1} N_i \Phi_i(t), \quad 0<t<T,
where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \{\Phi_i(t)\} are generated by the kernel R_N(t,s), i.e., solutions to
: \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname{var}[N_i]= \lambda_i.
Do the expansion:
:S(t) = \sum^\infty_{i=1} S_i\Phi_i(t),
where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then
:X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i
under H and N_i + S_i under K. Let \overline{X} = \{X_1, X_2, \ldots\}; we have
:N_i are independent Gaussian r.v.'s with variance \lambda_i
:under H: \{X_i\} are independent Gaussian r.v.'s.
::f_H[x(t)\mid 0<t<T] = f_H(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi\lambda_i}} \exp \left (-\frac{x_i^2}{2\lambda_i} \right )
:under K: \{X_i - S_i\} are independent Gaussian r.v.'s.
::f_K[x(t)\mid 0<t<T] = f_K(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi\lambda_i}} \exp \left(-\frac{(x_i - S_i)^2}{2\lambda_i} \right)
Hence, the log-LR is given by
:\mathcal{L}(\underline{x}) = \sum^\infty_{i=1} \frac{2S_i x_i - S_i^2}{2\lambda_i}
and the optimum detector is
:G = \sum^\infty_{i=1} S_i x_i / \lambda_i > G_0 \Rightarrow K, \quad < G_0 \Rightarrow H.
Define
:k(t) = \sum^\infty_{i=1} \lambda_i^{-1} S_i \Phi_i(t), \quad 0<t<T,
then G = \int^T _0 k(t)x(t)\,dt.
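Numerically, the eigenpairs (λ_i, Φ_i) can be approximated by discretizing the kernel R_N(t,s) on a grid and diagonalizing the resulting matrix, after which the projections X_i = ∫ X(t)Φ_i(t) dt become matrix–vector products. A sketch under our own discretization choices (the exponential covariance is only an assumed example kernel):

```python
import numpy as np

T, n = 1.0, 400
t = np.linspace(0.0, T, n)
dt = T / (n - 1)

# Assumed colored-noise covariance R_N(t, s): exponential kernel for illustration.
R_N = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.2)

# Discretized Fredholm problem: (R_N * dt) phi = lambda * phi.
lam, Phi = np.linalg.eigh(R_N * dt)
idx = np.argsort(lam)[::-1]                   # sort eigenvalues in decreasing order
lam, Phi = lam[idx], Phi[:, idx]
Phi /= np.sqrt(dt)                            # normalize so that the integral of Phi_i^2 is 1

def kl_coefficients(x):
    """Project a sampled path x(t) onto the KL basis: X_i = integral of x(t) Phi_i(t) dt."""
    return Phi.T @ x * dt

# The coefficients of the noise are (approximately) independent with variances lam.
```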


How to find ''k''(''t'')

Since
:\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_{i=1} \lambda_i^{-1} S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_{i=1} S_i \Phi_i(t) = S(t),
k(t) is the solution to
:\int^T_0 R_N(t,s)k(s)\,ds = S(t).
If ''N''(''t'') is wide-sense stationary,
:\int^T_0 R_N(t-s)k(s) \, ds = S(t),
which is known as the Wiener–Hopf equation. The equation can be solved by taking the Fourier transform, but this is not practically realizable since an infinite spectrum needs spatial factorization. A special case in which ''k''(''t'') is easy to calculate is white Gaussian noise:
:\int^T_0 \frac{N_0}{2}\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0<t<T.
The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Letting ''C'' = 1, this is just the result we arrived at in the previous section for detection of a signal in white noise.
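In practice the integral equation ∫ R_N(t,s)k(s) ds = S(t) can be solved on a grid as a linear system. A sketch with our own grid, an assumed exponential noise covariance, and an assumed signal shape:

```python
import numpy as np

T, n = 1.0, 400
t = np.linspace(0.0, T, n)
dt = T / (n - 1)

R_N = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.2)   # assumed noise covariance R_N(t, s)
S = np.sin(2 * np.pi * 3 * t)                           # known signal

# Discretize the integral equation  (R_N * dt) k = S  and solve for k on the grid.
k = np.linalg.solve(R_N * dt, S)

rho = np.sum(k * S) * dt          # rho = integral of k(t) S(t) dt, used below for the threshold
def detector(x):
    """G = integral of k(t) x(t) dt for a sampled observation x."""
    return np.sum(k * x) * dt
```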


Test threshold for Neyman–Pearson detector

Since X(t) is a Gaussian process,
:G = \int^T_0 k(t)x(t) \, dt
is a Gaussian random variable that can be characterized by its mean and variance.
:\begin{align}
\mathbf{E}[G \mid H] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid H]\,dt = 0 \\
\mathbf{E}[G \mid K] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid K]\,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\
\mathbf{E}[G^2\mid H] &= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right)dt = \int^T_0 k(t)S(t) \, dt = \rho \\
\operatorname{var}[G \mid H] &= \mathbf{E}[G^2\mid H] - (\mathbf{E}[G \mid H])^2 = \rho \\
\mathbf{E}[G^2\mid K] &=\int^T_0\int^T_0 k(t)k(s) \mathbf{E}[x(t)x(s)]\,dt\,ds = \int^T_0\int^T_0 k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\
\operatorname{var}[G \mid K] &= \mathbf{E}[G^2 \mid K] - (\mathbf{E}[G \mid K])^2 = \rho + \rho^2 -\rho^2 = \rho
\end{align}
Hence, we obtain the distributions of ''G'' under ''H'' and ''K'':
:H: G \sim N(0,\rho)
:K: G \sim N(\rho, \rho)
The false alarm error is
:\alpha = \int^\infty_{G_0} N(0,\rho)\,dG = 1 - \Phi \left (\frac{G_0}{\sqrt{\rho}} \right ).
So the test threshold for the Neyman–Pearson optimum detector is
:G_0 = \sqrt{\rho}\;\Phi^{-1} (1-\alpha).
Its power of detection is
:\beta = \int^\infty_{G_0} N(\rho, \rho) \, dG = \Phi \left (\sqrt{\rho} - \Phi^{-1}(1 - \alpha) \right).
When the noise is a white Gaussian process (taking ''C'' = 1 above), the signal power is
:\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.
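These closed forms translate directly into code. A small helper (illustrative only; the example values of ρ and α are arbitrary) computes the threshold G_0 and the detection probability β using the standard-normal cdf and quantile from SciPy:

```python
import numpy as np
from scipy.stats import norm

def np_threshold_and_power(rho, alpha):
    """Threshold G0 and power beta for G ~ N(0, rho) under H and N(rho, rho) under K."""
    G0 = np.sqrt(rho) * norm.ppf(1 - alpha)               # gives false-alarm rate alpha
    beta = norm.cdf(np.sqrt(rho) - norm.ppf(1 - alpha))   # probability of detection
    return G0, beta

print(np_threshold_and_power(rho=4.0, alpha=0.05))
```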


Prewhitening

For some types of colored noise, a typical practice is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, if N(t) is a wide-sense stationary colored noise with correlation function
:R_N(\tau) = \frac{B N_0}{4} e^{-B|\tau|}
and power spectral density
:S_N(f) = \frac{N_0/2}{1+\left(\frac{\omega}{B}\right)^2},
the transfer function of the prewhitening filter is
:H(f) = 1 + j \frac{\omega}{B}.
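In discrete time, H(f) = 1 + jω/B acts (up to the sampling approximation) as x ↦ x + x′/B, so a crude prewhitener is the sample plus a scaled finite-difference derivative. A rough sketch under our own discretization, simulating the colored noise with an Ornstein–Uhlenbeck scheme whose stationary correlation is the assumed R_N above:

```python
import numpy as np

def prewhiten(x, dt, B):
    """Approximate y(t) = x(t) + x'(t)/B, the time-domain form of H(f) = 1 + j*omega/B."""
    dx = np.gradient(x, dt)            # finite-difference derivative (accurate well below Nyquist)
    return x + dx / B

rng = np.random.default_rng(2)
B, N0, dt, n = 20.0, 0.1, 1e-3, 20000
w = rng.standard_normal(n) * np.sqrt(dt)
noise = np.zeros(n)
for i in range(1, n):                  # Euler scheme: stationary variance B*N0/4, correlation exp(-B|tau|)
    noise[i] = noise[i - 1] - B * noise[i - 1] * dt + np.sqrt(B**2 * N0 / 2) * w[i]

white = prewhiten(noise, dt, B)        # approximately flat spectrum N0/2 after the filter
```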


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement the K–L expansion to get an independent sequence of observations. In this case, the detection problem is described as follows:
:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \quad 0<t<T.
''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}. The K–L expansion of ''X''(''t'') is
:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),
where
:X_i = \int^T_0 X(t)\Phi_i(t)\,dt
and \Phi_i(t) are solutions to
: \int^T_0 R_X(t,s)\Phi_i(s)\,ds= \lambda_i \Phi_i(t).
So the X_i form an independent sequence of r.v.'s with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get
:Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 [N(t) + X(t)]\Phi_i(t)\,dt = N_i + X_i,
where
:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.
As ''N''(''t'') is Gaussian white noise, the N_i form an i.i.d. sequence of r.v.'s with zero mean and variance \tfrac{1}{2}N_0, so the problem simplifies as follows:
:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i
The Neyman–Pearson optimal test:
:\Lambda = \frac{f_{H_1}(\underline{y})}{f_{H_0}(\underline{y})} = Ce^{\sum^\infty_{i=1} \frac{y_i^2}{2} \frac{\lambda_i}{\frac{N_0}{2}\left(\frac{N_0}{2}+\lambda_i\right)}},
so the log-likelihood ratio is
:\mathcal{L} = \ln(\Lambda) = K + \sum^\infty_{i=1}\tfrac{1}{2}y_i^2 \frac{\lambda_i}{\frac{N_0}{2}\left(\frac{N_0}{2}+\lambda_i\right)}.
Since
:\widehat{X}_i = \frac{\lambda_i}{\frac{N_0}{2}+\lambda_i} Y_i
is just the minimum-mean-square estimate of X_i given Y_i,
:\mathcal{L} = K + \frac{1}{N_0} \sum^\infty_{i=1} Y_i \widehat{X}_i.
The K–L expansion has the following property: If
:f(t) = \sum f_i \Phi_i(t), \quad g(t) = \sum g_i \Phi_i(t),
where
:f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt,
then
:\sum^\infty_{i=1} f_i g_i = \int^T_0 g(t)f(t)\,dt.
So let
:\widehat{X}(t\mid T) = \sum^\infty_{i=1} \widehat{X}_i \Phi_i(t), \quad \mathcal{L} = K + \frac{1}{N_0} \int^T_0 Y(t) \widehat{X}(t\mid T) \, dt.
A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through
:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds. By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies
:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2} Q(t, \lambda) = R_X(t, \lambda), \quad 0 < \lambda < T,\; 0 < t < T.
However, for practical reasons, it is necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get the estimate \widehat{X}(t\mid t). Specifically,
:Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda
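In the K–L domain the resulting "estimator–correlator" detector is easy to prototype: project Y(t) onto the eigenfunctions of R_X, shrink each coefficient by λ_i/(λ_i + N_0/2), and correlate. A sketch under our own grid and an assumed signal covariance (the data-independent constant K is omitted):

```python
import numpy as np

T, n = 1.0, 400
t = np.linspace(0.0, T, n)
dt = T / (n - 1)
N0 = 0.2

# Assumed covariance of the random signal X(t): exponential kernel for illustration.
R_X = 0.5 * np.exp(-np.abs(t[:, None] - t[None, :]) / 0.1)
lam, Phi = np.linalg.eigh(R_X * dt)
lam, Phi = lam[::-1], Phi[:, ::-1] / np.sqrt(dt)      # decreasing eigenvalues, unit-norm eigenfunctions

def estimator_correlator(y):
    """Data-dependent part of the log-likelihood ratio, (1/N0) * sum_i Y_i * Xhat_i."""
    Y = Phi.T @ y * dt                                # KL coefficients of the observation
    Xhat = lam / (lam + N0 / 2) * Y                   # MMSE shrinkage of each coefficient
    return np.sum(Y * Xhat) / N0

# Example observation under H1: a sample of X(t) plus white noise of level N0.
rng = np.random.default_rng(3)
x = Phi @ (np.sqrt(np.clip(lam, 0, None)) * rng.standard_normal(n))
y = x + rng.standard_normal(n) * np.sqrt(N0 / (2 * dt))
print(estimator_correlator(y))
```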


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

*Wu B., Zhu J., Najm F. (2005). "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of the Design Automation Conference, pp. 841–844.
*Wu B., Zhu J., Najm F. (2006). "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, Issue 9, pp. 1618–1636.


External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Prof. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2\mathbb X_t, _^2/math>, so that \sum_^\infty p_k=1. We may then define the representation
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
to be H(\)=-\sum_i p_k \log(p_k). Then we have H(\)\ge H(\), for all choices of \varphi_k. That is, the KL-expansion has minimal representation entropy. Proof: Denote the coefficients obtained for the basis e_k(t) as p_k, and for \varphi_k(t) as q_k. Choose N\ge 1. Note that since e_k minimizes the mean squared error, we have that : \mathbb \left, \sum_^N Z_ke_k(t)-X_t\_^2\le \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 Expanding the right hand size, we get: : \mathbb\left, \sum_^N W_k\varphi_k(t)-X_t\_^2 =\mathbb, X_t^2, _ + \sum_^N \sum_^N \mathbb _\ell \varphi_\ell(t)W_k^*\varphi_k^*(t)-\sum_^N \mathbb _k \varphi_k X_t^* - \sum_^N \mathbb _tW_k^*\varphi_k^*(t) Using the orthonormality of \varphi_k(t), and expanding X_t in the \varphi_k(t) basis, we get that the right hand size is equal to: : \mathbb _t2_-\sum_^N\mathbb W_k, ^2/math> We may perform identical analysis for the e_k(t), and so rewrite the above inequality as: : \le Subtracting the common first term, and dividing by \mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still apply the K–L expansion to obtain an independent sequence of observations. In this case, the detection problem is described as follows:

:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \quad 0 < t < T

''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}.

The K–L expansion of ''X''(''t'') is

:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),

where

:X_i = \int^T_0 X(t)\Phi_i(t)\,dt

and the \Phi_i(t) are solutions to

:\int^T_0 R_X(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t).

So the X_i form an independent sequence of random variables with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') in terms of \Phi_i(t), we get

:Y_i = \int^T_0 Y(t)\Phi_i(t)\,dt = \int^T_0 [N(t) + X(t)]\Phi_i(t)\,dt = N_i + X_i,

where

:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.

As ''N''(''t'') is Gaussian white noise, the N_i form an i.i.d. sequence of random variables with zero mean and variance \tfrac{1}{2}N_0, so the problem simplifies to

:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i

The Neyman–Pearson optimal test is based on the likelihood ratio

:\Lambda = \frac{f_{Y\mid H_1}}{f_{Y\mid H_0}} = C\exp\left(\sum^\infty_{i=1}\frac{y_i^2}{2}\,\frac{\lambda_i}{\tfrac{N_0}{2}\left(\tfrac{N_0}{2}+\lambda_i\right)}\right),

so the log-likelihood ratio is

:\mathcal{L} = \ln(\Lambda) = K + \sum^\infty_{i=1}\frac{y_i^2}{2}\,\frac{\lambda_i}{\tfrac{N_0}{2}\left(\tfrac{N_0}{2}+\lambda_i\right)}.

Since

:\widehat{X}_i = \frac{\lambda_i}{\tfrac{N_0}{2}+\lambda_i}\,Y_i

is just the minimum-mean-square estimate of X_i given Y_i,

:\mathcal{L} = K + \frac{1}{N_0}\sum^\infty_{i=1} Y_i \widehat{X}_i.

The K–L expansion has the following property: if

:f(t) = \sum_i f_i \Phi_i(t), \qquad g(t) = \sum_i g_i \Phi_i(t),

where

:f_i = \int_0^T f(t)\Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t)\,dt,

then

:\sum^\infty_{i=1} f_i g_i = \int^T_0 g(t)f(t)\,dt.

So, letting

:\widehat{X}(t\mid T) = \sum^\infty_{i=1}\widehat{X}_i \Phi_i(t), \qquad \mathcal{L} = K + \frac{1}{N_0}\int^T_0 Y(t)\,\widehat{X}(t\mid T)\,dt.

A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through

:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds.

By the
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies

:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2}\,Q(t,\lambda) = R_X(t,\lambda), \quad 0 < \lambda < T,\ 0 < t < T.

However, for practical reasons it is necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to obtain the estimate \widehat{X}(t\mid t). Specifically,

:Q(t,s) = h(t,s) + h(s,t) - \int^T_0 h(\lambda,t)h(s,\lambda)\,d\lambda.
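The estimator–correlator form of \mathcal{L} can be illustrated with a short simulation of the discretized problem (a sketch under assumed values: the eigenvalue sequence \lambda_i, the noise level N_0, the number of trials and all variable names are illustrative choices, not taken from the text):

```python
# Sketch: estimator-correlator statistic for H0: Y_i = N_i vs H1: Y_i = N_i + X_i,
# with Var(N_i) = N0/2 and Var(X_i) = lambda_i. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
N0 = 2.0
lam = 1.0 / np.arange(1, 51) ** 2        # assumed eigenvalue decay lambda_i
trials = 5000

def statistic(Y):
    # X_hat_i = lambda_i / (N0/2 + lambda_i) * Y_i  (MMSE estimate of X_i from Y_i),
    # then L ~ (1/N0) * sum_i Y_i * X_hat_i, dropping the additive constant K.
    X_hat = lam / (N0 / 2 + lam) * Y
    return (Y * X_hat).sum(axis=-1) / N0

N = rng.normal(0.0, np.sqrt(N0 / 2), (trials, lam.size))   # noise coefficients
X = rng.normal(0.0, np.sqrt(lam), (trials, lam.size))      # signal coefficients
print(statistic(N).mean(), statistic(N + X).mean())        # larger on average under H1
```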


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

*Wu B., Zhu J., Najm F. (2005). "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of the Design Automation Conference, pp. 841–844.
*Wu B., Zhu J., Najm F. (2006). "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, pp. 1618–1636.


External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Prof. Ruye Wang at Harvey Mudd College.
Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind :\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t). However, when applied to a discrete and finite process \left(X_n\right)_, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal but it can hold many more representations depending on the field of application. For instance it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered, otherwise we can let X:=X-\mu_X (where \mu_X is the
mean vector There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
of ''X'') which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by: :\Sigma_= \mathbf
_i X_j I, or i, is the ninth Letter (alphabet), letter and the third vowel letter of the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in Engl ...
\qquad \forall i,j \in \ Rewriting the above integral equation to suit the discrete case, we observe that it turns into: :\sum_^N \Sigma_ e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why the PCA has such a broad domain of applications. Since Σ is a positive definite symmetric matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \_ this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of . Let also be the orthonormal matrix consisting of these eigenvectors: :\begin \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have: :X =\sum_^N \langle \varphi_i,X\rangle \varphi_i =\sum_^N \varphi_i^T X \varphi_i In a more compact form, the principal component transform of ''X'' is defined by: :\begin Y=\Phi^T X \\ X=\Phi Y \end The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i and the inverse transform yields the expansion of on the space spanned by the \varphi_i: :X=\sum_^N Y_i \varphi_i=\sum_^N \langle \varphi_i,X\rangle \varphi_i As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\ such that :\frac\geq \alpha where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE).X. Tang, “Texture information in run-length matrices,” IEEE Transactions on Image Processing, vol. 7, No. 11, pp. 1602–1609, Nov. 1998


Examples


The Wiener process

There are numerous equivalent characterizations of the
Wiener process In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is o ...
which is a mathematical formalization of
Brownian motion Brownian motion, or pedesis (from grc, πήδησις "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position insi ...
. Here we regard it as the centered standard Gaussian process W''t'' with covariance function : K_W(t,s) = \operatorname(W_t,W_s) = \min (s,t). We restrict the time domain to 'a'', ''b'' ,1without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are : e_k(t) = \sqrt \sin \left( \left(k - \tfrac\right) \pi t \right) and the corresponding eigenvalues are : \lambda_k = \frac. This gives the following representation of the Wiener process: Theorem. There is a sequence ''i'' of independent Gaussian random variables with mean zero and variance 1 such that : W_t = \sqrt \sum_^\infty Z_k \frac. Note that this representation is only valid for t\in ,1 On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L2 norm and uniform in ''t''.


The Brownian bridge

Similarly the
Brownian bridge A Brownian bridge is a continuous-time stochastic process ''B''(''t'') whose probability distribution is the conditional probability distribution of a standard Wiener process ''W''(''t'') (a mathematical model of Brownian motion) subject to the co ...
B_t=W_t-tW_1 which is a
stochastic process In probability theory and related fields, a stochastic () or random process is a mathematical object usually defined as a family of random variables. Stochastic processes are widely used as mathematical models of systems and phenomena that appea ...
with covariance function :K_B(t,s)=\min(t,s)-ts can be represented as the series :B_t = \sum_^\infty Z_k \frac


Applications

Adaptive optics Adaptive optics (AO) is a technology used to improve the performance of optical systems by reducing the effect of incoming wavefront distortions by deforming a mirror in order to compensate for the distortion. It is used in astronomical tele ...
systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). Karhunen–Loève expansion is closely related to the
Singular Value Decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
. The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector valued stochastic process then the left singular vectors are
maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimates of the ensemble KL expansion.


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis testing is used for detecting continuous signal ''s''(''t'') from channel output ''X''(''t''), ''N''(''t'') is the channel noise, which is usually assumed zero mean Gaussian process with correlation function R_N (t, s) = E (t)N(s)/math> :H: X(t) = N(t), :K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is :R_N(t) = \tfrac N_0 \delta (t), and it has constant power spectrum density. In physically practical channel, the noise power is finite, so: :S_N(f) = \begin \frac &, f, w \end Then the noise correlation function is sinc function with zeros at \frac, n \in \mathbf. Since are uncorrelated and gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing : \Delta t = \frac \text (0,''T''). Let X_i = X(i\,\Delta t). We have a total of n = \frac = T(2\omega) = 2\omega T i.i.d observations \ to develop the likelihood-ratio test. Define signal S_i = S(i\,\Delta t), the problem becomes, :H: X_i = N_i, :K: X_i = N_i + S_i, i = 1,2,\ldots,n. The log-likelihood ratio :\mathcal(\underline) = \log\frac \Leftrightarrow \Delta t \sum^n_ S_i x_i = \sum^n_ S(i\,\Delta t)x(i\,\Delta t) \, \Delta t \gtrless \lambda_\cdot2 As , let: :G = \int^T_0 S(t)x(t) \, dt. Then ''G'' is the test statistics and the Neyman–Pearson optimum detector is :G(\underline) > G_0 \Rightarrow K < G_0 \Rightarrow H. As ''G'' is Gaussian, we can characterize it by finding its mean and variances. Then we get :H: G \sim N \left (0,\tfracN_0E \right ) :K: G \sim N \left (E,\tfracN_0E \right ) where :\mathbf = \int^T_0 S^2(t) \, dt is the signal energy. The false alarm error :\alpha = \int^\infty_ N \left (0, \tfracN_0E \right) \, dG \Rightarrow G_0 = \sqrt \Phi^(1-\alpha) And the probability of detection: :\beta = \int^\infty_ N \left (E, \tfracN_0E \right) \, dG = 1-\Phi \left (\frac \right ) = \Phi \left (\sqrt - \Phi^(1-\alpha) \right ), where Φ is the cdf of standard normal, or Gaussian, variable.


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E (t)N(s) we cannot sample independent discrete observations by evenly spacing the time. Instead, we can use K–L expansion to decorrelate the noise process and get independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''): :N(t) = \sum^_ N_i \Phi_i(t), \quad 0 where N_i =\int N(t)\Phi_i(t)\,dt and the orthonormal bases \ are generated by kernel R_N(t,s), i.e., solution to : \int ^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \quad \operatorname _i= \lambda_i. Do the expansion: :S(t) = \sum^_S_i\Phi_i(t), where S_i = \int^T _0 S(t)\Phi_i(t) \, dt, then :X_i = \int^T _0 X(t)\Phi_i(t) \, dt = N_i under H and N_i + S_i under K. Let \overline = \, we have :N_i are independent Gaussian r.v's with variance \lambda_i :under H: \ are independent Gaussian r.v's. ::f_H 0= f_H(\underline) = \prod^\infty_ \frac \exp \left (-\frac \right ) :under K: \ are independent Gaussian r.v's. ::f_K (t)\mid 0= f_K(\underline) = \prod^\infty_ \frac \exp \left(-\frac \right) Hence, the log-LR is given by :\mathcal(\underline) = \sum^_ \frac and the optimum detector is :G = \sum^\infty_ S_i x_i \lambda_i > G_0 \Rightarrow K, < G_0 \Rightarrow H. Define :k(t) = \sum^\infty_ \lambda_i S_i \Phi_i(t), 0 then G = \int^T _0 k(t)x(t)\,dt.


=How to find ''k''(''t'')

= Since :\int^T_0 R_N(t,s)k(s) \, ds = \sum^\infty_ \lambda_i S_i \int^T _0 R_N(t,s)\Phi_i (s) \, ds = \sum^\infty_ S_i \Phi_i(t) = S(t), k(t) is the solution to :\int^T_0 R_N(t,s)k(s)\,ds = S(t). If ''N''(''t'')is wide-sense stationary, :\int^T_0 R_N(t-s)k(s) \, ds = S(t), which is known as the Wiener–Hopf equation. The equation can be solved by taking fourier transform, but not practically realizable since infinite spectrum needs spatial factorization. A special case which is easy to calculate ''k''(''t'') is white Gaussian noise. :\int^T_0 \frac\delta(t-s)k(s) \, ds = S(t) \Rightarrow k(t) = C S(t), \quad 0 The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Let ''C'' = 1, this is just the result we arrived at in previous section for detecting of signal in white noise.


=Test threshold for Neyman–Pearson detector

= Since X(t) is a Gaussian process, :G = \int^T_0 k(t)x(t) \, dt, is a Gaussian random variable that can be characterized by its mean and variance. :\begin \mathbf \mid H&= \int^T_0 k(t)\mathbf (t)\mid H,dt = 0 \\ \mathbf \mid K&= \int^T_0 k(t)\mathbf (t)\mid K,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\ \mathbf ^2\mid H&= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left (\int^T_0 k(s)R_N(t,s) \, ds \right) = \int^T_0 k(t)S(t) \, dt = \rho \\ \operatorname \mid H&= \mathbf ^2\mid H- (\mathbf \mid H^2 = \rho \\ \mathbf ^2\mid K&=\int^T_0\int^T_0k(t)k(s) \mathbf (t)x(s),dt\,ds = \int^T_0\int^T_0k(t)k(s)(R_N(t,s) +S(t)S(s)) \, dt\, ds = \rho + \rho^2\\ \operatorname \mid K&= \mathbf K- (\mathbf K^2 = \rho + \rho^2 -\rho^2 = \rho \end Hence, we obtain the distributions of ''H'' and ''K'': :H: G \sim N(0,\rho) :K: G \sim N(\rho, \rho) The false alarm error is :\alpha = \int^\infty_ N(0,\rho)\,dG = 1 - \Phi \left (\frac \right ). So the test threshold for the Neyman–Pearson optimum detector is :G_0 = \sqrt \Phi^ (1-\alpha). Its power of detection is :\beta = \int^\infty_ N(\rho, \rho) \, dG = \Phi \left (\sqrt - \Phi^(1 - \alpha) \right) When the noise is white Gaussian process, the signal power is :\rho = \int^T_0 k(t)S(t) \, dt = \int^T_0 S(t)^2 \, dt = E.


=Prewhitening

= For some type of colored noise, a typical practise is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, N(t) is a wide-sense stationary colored noise with correlation function :R_N(\tau) = \frac e^ :S_N(f) = \frac The transfer function of prewhitening filter is :H(f) = 1 + j \frac.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement K–L expansion to get independent sequence of observation. In this case, the detection problem is described as follows: :H_0 : Y(t) = N(t) :H_1 : Y(t) = N(t) + X(t), \quad 0 ''X''(''t'') is a random process with correlation function R_X(t,s) = E\ The K–L expansion of ''X''(''t'') is :X(t) = \sum^\infty_ X_i \Phi_i(t), where :X_i = \int^T_0 X(t)\Phi_i(t)\,dt and \Phi_i(t) are solutions to : \int^T_0 R_X(t,s)\Phi_i(s)ds= \lambda_i \Phi_i(t). So X_i's are independent sequence of r.v's with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get :Y_i = \int^T_0 Y(t)\Phi_i(t) \, dt = \int^T_0 (t) + X(t)Phi_i(t) = N_i + X_i, where :N_i = \int^T_0 N(t)\Phi_i(t)\,dt. As ''N''(''t'') is Gaussian white noise, N_i's are i.i.d sequence of r.v with zero mean and variance \tfracN_0, then the problem is simplified as follows, :H_0: Y_i = N_i :H_1: Y_i = N_i + X_i The Neyman–Pearson optimal test: :\Lambda = \frac = Ce^, so the log-likelihood ratio is :\mathcal = \ln(\Lambda) = K -\sum^\infty_\tfracy_i^2 \frac. Since :\widehat_i = \frac is just the minimum-mean-square estimate of X_i given Y_i's, :\mathcal = K + \frac \sum^\infty_ Y_i \widehat_i. K–L expansion has the following property: If :f(t) = \sum f_i \Phi_i(t), g(t) = \sum g_i \Phi_i(t), where :f_i = \int_0^T f(t) \Phi_i(t)\,dt, \quad g_i = \int_0^T g(t)\Phi_i(t) \, dt. then :\sum^\infty_ f_i g_i = \int^T_0 g(t)f(t)\,dt. So let :\widehat(t\mid T) = \sum^\infty_ \widehat_i \Phi_i(t), \quad \mathcal = K + \frac \int^T_0 Y(t) \widehat(t\mid T) \, dt. Noncausal filter ''Q''(''t'',''s'') can be used to get the estimate through :\widehat(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds. By
orthogonality principle In statistics and signal processing, the orthogonality principle is a necessary and sufficient condition for the optimality of a Bayesian estimator. Loosely stated, the orthogonality principle says that the error vector of the optimal estimator (in ...
, ''Q''(''t'',''s'') satisfies :\int^T_0 Q(t,s)R_X(s,t)\,ds + \tfrac Q(t, \lambda) = R_X(t, \lambda), 0 < \lambda < T, 0 However, for practical reasons, it's necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get estimate \widehat(t\mid t). Specifically, :Q(t,s) = h(t,s) + h(s, t) - \int^T_0 h(\lambda, t)h(s, \lambda) \, d\lambda


See also

*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
*
Polynomial chaos Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogon ...
*
Reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in ...
*
Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most not ...


Notes


References

* * * * * * * *Wu B., Zhu J., Najm F.(2005) "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of Design Automation Conference(841-844) 2005 *Wu B., Zhu J., Najm F.(2006) "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25 Issue:9 (1618–1636) 2006 *


External links

* ''Mathematica'
KarhunenLoeveDecomposition
function. * ''E161: Computer Image Processing and Analysis'' notes by Pr. Ruye Wang at
Harvey Mudd College Harvey Mudd College (HMC) is a private college in Claremont, California, focused on science and engineering. It is part of the Claremont Colleges, which share adjoining campus grounds and resources. The college enrolls 902 undergraduate students ...
br>
{{DEFAULTSORT:Karhunen-Loeve theorem Probability theorems Signal estimation Theorems in statistics fr:Transformée de Karhunen-Loève>W_k, ^2/math> We may perform identical analysis for the e_k(t), and so rewrite the above inequality as: : \le Subtracting the common first term, and dividing by \mathbb X_t, ^2_/math>, we obtain that: : \sum_^N p_k\ge \sum_^N q_k This implies that: : -\sum_^\infty p_k \log(p_k)\le -\sum_^\infty q_k \log(q_k)


Linear Karhunen–Loève approximations

Consider a whole class of signals we want to approximate over the first vectors of a basis. These signals are modeled as realizations of a random vector of size . To optimize the approximation we design a basis that minimizes the average approximation error. This section proves that optimal bases are Karhunen–Loeve bases that diagonalize the covariance matrix of . The random vector can be decomposed in an orthogonal basis :\left\_ as follows: :Y=\sum_^ \left\langle Y, g_m \right\rangle g_m, where each :\left\langle Y, g_m \right\rangle =\sum_^ g_m^* /math> is a random variable. The approximation from the first vectors of the basis is :Y_M=\sum_^ \left\langle Y, g_m \right\rangle g_m The energy conservation in an orthogonal basis implies :\varepsilon \mathbf \left\ =\sum_^ \mathbf\left\ This error is related to the covariance of defined by :R n,m\mathbf \left\ For any vector we denote by the covariance operator represented by this matrix, :\mathbf\left\=\langle Kx,x \rangle =\sum_^ \sum_^ R ,m ^* /math> The error is therefore a sum of the last coefficients of the covariance operator :\varepsilon \sum_^ The covariance operator is Hermitian and Positive and is thus diagonalized in an orthogonal basis called a Karhunen–Loève basis. The following theorem states that a Karhunen–Loève basis is optimal for linear approximations. Theorem (Optimality of Karhunen–Loève basis). Let be a covariance operator. For all , the approximation error :\varepsilon \sum_^\left\langle K g_m, g_m \right\rangle is minimum if and only if :\left\_ is a Karhunen–Loeve basis ordered by decreasing eigenvalues. :\left\langle K g_m, g_m \right\rangle \ge \left\langle Kg_, g_ \right\rangle, \qquad 0\le m


Non-Linear approximation in bases

Linear approximations project the signal on ''M'' vectors a priori. The approximation can be made more precise by choosing the ''M'' orthogonal vectors depending on the signal properties. This section analyzes the general performance of these non-linear approximations. A signal f\in \Eta is approximated with M vectors selected adaptively in an orthonormal basis for \Eta :\Beta =\left\_ Let f_M be the projection of f over M vectors whose indices are in : :f_M=\sum_ \left\langle f, g_m \right\rangle g_m The approximation error is the sum of the remaining coefficients :\varepsilon \left\=\sum_^ \left\ To minimize this error, the indices in must correspond to the M vectors having the largest inner product amplitude :\left, \left\langle f, g_m \right\rangle \. These are the vectors that best correlate f. They can thus be interpreted as the main features of f. The resulting error is necessarily smaller than the error of a linear approximation which selects the M approximation vectors independently of f. Let us sort :\left\_ in decreasing order :\left, \left \langle f, g_ \right \rangle \\ge \left, \left \langle f, g_ \right \rangle \. The best non-linear approximation is :f_M=\sum_^M \left\langle f, g_ \right\rangle g_ It can also be written as inner product thresholding: :f_M=\sum_^\infty \theta_T \left( \left\langle f, g_m \right\rangle \right) g_m with :T=\left, \left\langle f, g_ \right \rangle\, \qquad \theta_T(x)= \begin x & , x, \ge T \\ 0 & , x, < T \end The non-linear error is :\varepsilon \left\=\sum_^ \left\ this error goes quickly to zero as M increases, if the sorted values of \left, \left\langle f, g_ \right\rangle \ have a fast decay as k increases. This decay is quantified by computing the \Iota^\Rho norm of the signal inner products in B: :\, f \, _ =\left( \sum_^\infty \left, \left\langle f, g_m \right\rangle \^p \right)^ The following theorem relates the decay of to \, f\, _ Theorem (decay of error). If \, f\, _<\infty with then :\varepsilon le \frac M^ and :\varepsilon o\left( M^ \right). Conversely, if \varepsilon o\left( M^ \right) then \, f\, _<\infty for any .


Non-optimality of Karhunen–Loève bases

To further illustrate the differences between linear and non-linear approximations, we study the decomposition of a simple non-Gaussian random vector in a Karhunen–Loève basis. Processes whose realizations have a random translation are stationary. The Karhunen–Loève basis is then a Fourier basis and we study its performance. To simplify the analysis, consider a random vector ''Y'' 'n''of size ''N'' that is random shift modulo ''N'' of a deterministic signal ''f'' 'n''of zero mean :\sum_^f 0 :Y f (n-p)\bmod N /math> The random shift ''P'' is uniformly distributed on , ''N'' − 1 :\Pr ( P=p )=\frac, \qquad 0\le p Clearly :\mathbf\=\frac \sum_^ f n-p)\bmod N0 and :R ,k\mathbf \=\frac\sum_^ f n-p)\bmod Nf k-p)\bmod N = \frac f\Theta \bar -k \quad \bar f n/math> Hence :R ,kR_Y -k \qquad R_Y \fracf \Theta \bar /math> Since RY is N periodic, Y is a circular stationary random vector. The covariance operator is a circular convolution with RY and is therefore diagonalized in the discrete Fourier Karhunen–Loève basis :\left\_. The power spectrum is Fourier transform of : :P_Y \hat_Y \frac \left, \hat \^2 Example: Consider an extreme case where f \delta \delta -1/math>. A theorem stated above guarantees that the Fourier Karhunen–Loève basis produces a smaller expected approximation error than a canonical basis of Diracs \left\_. Indeed, we do not know a priori the abscissa of the non-zero coefficients of ''Y'', so there is no particular Dirac that is better adapted to perform the approximation. But the Fourier vectors cover the whole support of Y and thus absorb a part of the signal energy. :\mathbf \left\=P_Y = \frac\sin^2 \left(\frac \right) Selecting higher frequency Fourier coefficients yields a better mean-square approximation than choosing a priori a few Dirac vectors to perform the approximation. The situation is totally different for non-linear approximations. If f \delta \delta -1/math> then the discrete Fourier basis is extremely inefficient because f and hence Y have an energy that is almost uniformly spread among all Fourier vectors. In contrast, since f has only two non-zero coefficients in the Dirac basis, a non-linear approximation of Y with gives zero error.


Principal component analysis

We have established the Karhunen–Loève theorem and derived a few properties thereof. We also noted that one hurdle in its application was the numerical cost of determining the eigenvalues and eigenfunctions of its covariance operator through the Fredholm integral equation of the second kind

:\int_a^b K_X(s,t) e_k(s)\,ds=\lambda_k e_k(t).

However, when applied to a discrete and finite process \left(X_n\right)_{n\in\{1,\ldots,N\}}, the problem takes a much simpler form and standard algebra can be used to carry out the calculations. Note that a continuous process can also be sampled at ''N'' points in time in order to reduce the problem to a finite version. We henceforth consider a random ''N''-dimensional vector X=\left(X_1~X_2~\ldots~X_N\right)^T. As mentioned above, ''X'' could contain ''N'' samples of a signal, but it can hold many other representations depending on the field of application. For instance, it could be the answers to a survey or economic data in an econometrics analysis. As in the continuous version, we assume that ''X'' is centered; otherwise we can let X:=X-\mu_X (where \mu_X is the mean vector of ''X''), which is centered. Let us adapt the procedure to the discrete case.


Covariance matrix

Recall that the main implication and difficulty of the KL transformation is computing the eigenvectors of the linear operator associated to the covariance function, which are given by the solutions to the integral equation written above. Define Σ, the covariance matrix of ''X'', as an ''N'' × ''N'' matrix whose elements are given by:

:\Sigma_{ij} = \mathbf{E}[X_i X_j], \qquad \forall i,j \in \{1,\ldots,N\}

Rewriting the above integral equation to suit the discrete case, we observe that it turns into:

:\sum_{j=1}^N \Sigma_{ij} e_j=\lambda e_i \quad \Leftrightarrow \quad \Sigma e=\lambda e

where e=(e_1~e_2~\ldots~e_N)^T is an ''N''-dimensional vector. The integral equation thus reduces to a simple matrix eigenvalue problem, which explains why PCA has such a broad domain of applications. Since Σ is a symmetric positive semidefinite matrix, it possesses a set of orthonormal eigenvectors forming a basis of \R^N, and we write \{\lambda_i, \varphi_i\}_{i\in\{1,\ldots,N\}} for this set of eigenvalues and corresponding eigenvectors, listed in decreasing values of \lambda_i. Let Φ also be the orthonormal matrix consisting of these eigenvectors:

:\begin{align} \Phi &:=\left(\varphi_1~\varphi_2~\ldots~\varphi_N\right)^T\\ \Phi^T \Phi &=I \end{align}


Principal component transform

It remains to perform the actual KL transformation, called the ''principal component transform'' in this case. Recall that the transform was found by expanding the process with respect to the basis spanned by the eigenvectors of the covariance function. In this case, we hence have:

:X =\sum_{i=1}^N \langle \varphi_i,X\rangle \varphi_i =\sum_{i=1}^N \varphi_i^T X \varphi_i

In a more compact form, the principal component transform of ''X'' is defined by:

:\begin{cases} Y=\Phi^T X \\ X=\Phi Y \end{cases}

The ''i''-th component of ''Y'' is Y_i=\varphi_i^T X, the projection of ''X'' on \varphi_i, and the inverse transform X=\Phi Y yields the expansion of ''X'' on the space spanned by the \varphi_i:

:X=\sum_{i=1}^N Y_i \varphi_i=\sum_{i=1}^N \langle \varphi_i,X\rangle \varphi_i

As in the continuous case, we may reduce the dimensionality of the problem by truncating the sum at some K\in\{1,\ldots,N\} such that

:\frac{\sum_{i=1}^K \lambda_i}{\sum_{i=1}^N \lambda_i}\geq \alpha

where α is the explained variance threshold we wish to set. We can also reduce the dimensionality through the use of multilevel dominant eigenvector estimation (MDEE) (X. Tang, "Texture information in run-length matrices," IEEE Transactions on Image Processing, vol. 7, no. 11, pp. 1602–1609, Nov. 1998).
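A minimal numerical sketch of the discrete procedure follows; the synthetic data, the choice α = 0.95, and the matrix sizes are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, N = 1000, 5

# Synthetic observations of an N-dimensional random vector X (rows = realizations).
A = rng.standard_normal((N, N))
data = rng.standard_normal((n_obs, N)) @ A.T

X = data - data.mean(axis=0)                   # center: X := X - mu_X

Sigma = (X.T @ X) / n_obs                      # sample covariance matrix, N x N

# Orthonormal eigenvectors/eigenvalues of the symmetric matrix Sigma,
# sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, Phi = eigvals[order], eigvecs[:, order]   # columns of Phi are the phi_i

Y = X @ Phi                                    # principal component transform (Y_i = phi_i^T X per row)
X_back = Y @ Phi.T                             # inverse transform X = Phi Y
assert np.allclose(X_back, X)

# Truncate at the smallest K whose explained variance ratio exceeds alpha.
alpha = 0.95
ratio = np.cumsum(lam) / np.sum(lam)
K = int(np.searchsorted(ratio, alpha) + 1)
print("K =", K, "explained variance =", ratio[K - 1])
```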


Examples


The Wiener process

There are numerous equivalent characterizations of the Wiener process, which is a mathematical formalization of Brownian motion. Here we regard it as the centered standard Gaussian process W_t with covariance function

: K_W(t,s) = \operatorname{Cov}(W_t,W_s) = \min(s,t).

We restrict the time domain to [''a'', ''b''] = [0,1] without loss of generality. The eigenvectors of the covariance kernel are easily determined. These are

: e_k(t) = \sqrt{2} \sin \left( \left(k - \tfrac{1}{2}\right) \pi t \right)

and the corresponding eigenvalues are

: \lambda_k = \frac{1}{\left(k-\tfrac{1}{2}\right)^2 \pi^2}.

This gives the following representation of the Wiener process:

Theorem. There is a sequence (Z_k)_{k\ge 1} of independent Gaussian random variables with mean zero and variance 1 such that

: W_t = \sqrt{2} \sum_{k=1}^\infty Z_k \frac{\sin\left(\left(k-\tfrac{1}{2}\right)\pi t\right)}{\left(k-\tfrac{1}{2}\right)\pi}.

Note that this representation is only valid for t\in[0,1]. On larger intervals, the increments are not independent. As stated in the theorem, convergence is in the L^2 norm and uniform in ''t''.
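As a quick illustrative check (the truncation level, number of paths, and time grid below are arbitrary assumptions), one can sample approximate Wiener paths from the truncated series and compare the empirical covariance with min(''s'', ''t''):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_paths, n_t = 500, 2000, 101

t = np.linspace(0.0, 1.0, n_t)
k = np.arange(1, K + 1)

# Basis functions sqrt(2) sin((k - 1/2) pi t) / ((k - 1/2) pi), shape (K, n_t).
phi = (np.sqrt(2.0) * np.sin((k[:, None] - 0.5) * np.pi * t[None, :])
       / ((k[:, None] - 0.5) * np.pi))

Z = rng.standard_normal((n_paths, K))     # i.i.d. N(0, 1) coefficients
W = Z @ phi                               # truncated KL expansion of W_t

# Empirical covariance at two fixed times versus min(s, t).
i, j = 30, 80
print(np.mean(W[:, i] * W[:, j]), "~", min(t[i], t[j]))
```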


The Brownian bridge

Similarly the Brownian bridge B_t=W_t-tW_1, which is a stochastic process with covariance function

:K_B(t,s)=\min(t,s)-ts,

can be represented as the series

:B_t = \sum_{k=1}^\infty Z_k \frac{\sqrt{2}\sin(k\pi t)}{k\pi}

where the Z_k are independent standard Gaussian random variables.


Applications

Adaptive optics systems sometimes use K–L functions to reconstruct wave-front phase information (Dai 1996, JOSA A). The Karhunen–Loève expansion is closely related to the singular value decomposition (SVD). The latter has myriad applications in image processing, radar, seismology, and the like. If one has independent vector observations from a vector-valued stochastic process, then the left singular vectors are maximum likelihood estimates of the ensemble KL expansion.
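The relation to the SVD can be sketched numerically: for centered observations stacked as the columns of a data matrix, the left singular vectors coincide (up to sign) with the eigenvectors of the sample covariance, i.e. with the empirical KL/PCA basis. A minimal sketch under these assumptions (the data model below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_obs = 6, 5000

A = rng.standard_normal((N, N))
X = A @ rng.standard_normal((N, n_obs))      # columns are observations
X -= X.mean(axis=1, keepdims=True)           # center

# Left singular vectors of the data matrix ...
U, s, _ = np.linalg.svd(X, full_matrices=False)

# ... versus eigenvectors of the sample covariance (the discrete KL basis).
eigvals, eigvecs = np.linalg.eigh(X @ X.T / n_obs)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Agreement up to the sign of each column: |<u_i, v_i>| should be close to 1.
print(np.abs(np.sum(U * eigvecs, axis=0)))
```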


Applications in signal estimation and detection


Detection of a known continuous signal ''S''(''t'')

In communication, we usually have to decide whether a signal from a noisy channel contains valuable information. The following hypothesis test is used for detecting a continuous signal ''s''(''t'') from the channel output ''X''(''t''), where ''N''(''t'') is the channel noise, usually assumed to be a zero-mean Gaussian process with correlation function R_N(t,s) = E[N(t)N(s)]:

:H: X(t) = N(t),
:K: X(t) = N(t)+s(t), \quad t\in(0,T)


Signal detection in white noise

When the channel noise is white, its correlation function is

:R_N(t) = \tfrac{1}{2} N_0 \delta(t),

and its power spectral density is constant. In a physically practical channel, the noise power is finite, so:

:S_N(f) = \begin{cases} \frac{N_0}{2} & |f| < w \\ 0 & |f| > w \end{cases}

Then the noise correlation function is a sinc function with zeros at \frac{n}{2w}, n \in \mathbf{Z}. Since samples taken at these zeros are uncorrelated and Gaussian, they are independent. Thus we can take samples from ''X''(''t'') with time spacing

:\Delta t = \frac{1}{2w} \text{ within } (0,T).

Let X_i = X(i\,\Delta t). We have a total of n = \frac{T}{\Delta t} = T(2w) = 2wT i.i.d. observations \{X_i\} to develop the likelihood-ratio test. Define the signal S_i = S(i\,\Delta t); the problem becomes:

:H: X_i = N_i,
:K: X_i = N_i + S_i, \qquad i = 1,2,\ldots,n.

The log-likelihood ratio test reduces to thresholding the statistic

:\Delta t \sum^n_{i=1} S_i x_i = \sum^n_{i=1} S(i\,\Delta t)\,x(i\,\Delta t)\,\Delta t.

As w\to\infty, let:

:G = \int^T_0 S(t)x(t)\,dt.

Then ''G'' is the test statistic and the Neyman–Pearson optimum detector is

:G(\underline{x}) > G_0 \Rightarrow K, \qquad G(\underline{x}) < G_0 \Rightarrow H.

As ''G'' is Gaussian, we can characterize it by finding its mean and variance. Then we get

:H: G \sim N\left(0,\tfrac{1}{2}N_0E\right)
:K: G \sim N\left(E,\tfrac{1}{2}N_0E\right)

where

:E = \int^T_0 S^2(t)\,dt

is the signal energy. The false alarm error is

:\alpha = \int^\infty_{G_0} N\left(0, \tfrac{1}{2}N_0E\right)\,dG \Rightarrow G_0 = \sqrt{\tfrac{1}{2}N_0E}\,\Phi^{-1}(1-\alpha)

and the probability of detection is

:\beta = \int^\infty_{G_0} N\left(E, \tfrac{1}{2}N_0E\right)\,dG = 1-\Phi\left(\frac{G_0-E}{\sqrt{\tfrac{1}{2}N_0E}}\right) = \Phi\left(\sqrt{\frac{2E}{N_0}} - \Phi^{-1}(1-\alpha)\right),

where Φ is the cdf of a standard normal, or Gaussian, variable.
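A discrete-time Monte Carlo sketch of this detector follows; the signal shape, N_0, α, the time grid, and the trial count are all arbitrary illustrative choices. It estimates the false-alarm and detection probabilities and compares them with the formulas above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T, n, trials = 1.0, 1000, 5000
N0, alpha = 0.1, 0.05

t = np.linspace(0.0, T, n, endpoint=False)
dt = T / n
S = np.sin(2 * np.pi * 5 * t)                      # known signal S(t)
E = np.sum(S ** 2) * dt                            # signal energy

sigma = np.sqrt(N0 / (2 * dt))                     # per-sample noise std so that R_N ~ (N0/2) delta(t)
G0 = np.sqrt(N0 * E / 2) * norm.ppf(1 - alpha)     # Neyman-Pearson threshold

noise = sigma * rng.standard_normal((trials, n))
G_H = noise @ S * dt                               # G under H: x(t) = noise
G_K = (noise + S) @ S * dt                         # G under K: x(t) = noise + S(t)

print("empirical alpha:", np.mean(G_H > G0), "target:", alpha)
print("empirical beta :", np.mean(G_K > G0),
      "theory:", norm.cdf(np.sqrt(2 * E / N0) - norm.ppf(1 - alpha)))
```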


Signal detection in colored noise

When N(t) is colored (correlated in time) Gaussian noise with zero mean and covariance function R_N(t,s) = E[N(t)N(s)], we cannot obtain independent discrete observations by evenly spacing the sampling times. Instead, we can use the K–L expansion to decorrelate the noise process and obtain independent Gaussian observation 'samples'. The K–L expansion of ''N''(''t''):

:N(t) = \sum^\infty_{i=1} N_i \Phi_i(t), \qquad 0 < t < T,

where N_i =\int^T_0 N(t)\Phi_i(t)\,dt and the orthonormal bases \{\Phi_i(t)\} are generated by the kernel R_N(t,s), i.e., solutions to

:\int^T_0 R_N(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t), \qquad \operatorname{Var}[N_i] = \lambda_i.

Do the expansion:

:S(t) = \sum^\infty_{i=1} S_i\Phi_i(t),

where S_i = \int^T_0 S(t)\Phi_i(t)\,dt, then

:X_i = \int^T_0 X(t)\Phi_i(t)\,dt = N_i

under H and N_i + S_i under K. Let \overline{X} = \{X_1,X_2,\ldots\}; we have

:N_i are independent Gaussian r.v.'s with variance \lambda_i
:under H: \{X_i\} are independent zero-mean Gaussian r.v.'s:
::f_H[x(t)\mid 0<t<T] = f_H(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{x_i^2}{2\lambda_i}\right)
:under K: \{X_i\} are independent Gaussian r.v.'s with mean S_i:
::f_K[x(t)\mid 0<t<T] = f_K(\underline{x}) = \prod^\infty_{i=1} \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{(x_i-S_i)^2}{2\lambda_i}\right)

Hence, the log-LR is given by

:\mathcal{L}(\underline{x}) = \sum^\infty_{i=1} \frac{2S_i x_i - S_i^2}{2\lambda_i}

and the optimum detector is

:G = \sum^\infty_{i=1} \frac{S_i x_i}{\lambda_i} > G_0 \Rightarrow K, \qquad \text{otherwise} \Rightarrow H.

Define

:k(t) = \sum^\infty_{i=1} \lambda_i^{-1} S_i \Phi_i(t), \qquad 0 < t < T,

then G = \int^T_0 k(t)x(t)\,dt.


How to find ''k''(''t'')

Since

:\int^T_0 R_N(t,s)k(s)\,ds = \sum^\infty_{i=1} \lambda_i^{-1} S_i \int^T_0 R_N(t,s)\Phi_i(s)\,ds = \sum^\infty_{i=1} S_i \Phi_i(t) = S(t),

k(t) is the solution to

:\int^T_0 R_N(t,s)k(s)\,ds = S(t).

If ''N''(''t'') is wide-sense stationary,

:\int^T_0 R_N(t-s)k(s)\,ds = S(t),

which is known as the Wiener–Hopf equation. The equation can be solved by taking the Fourier transform, but this is not practically realizable since an infinite spectrum needs spectral factorization. A special case in which ''k''(''t'') is easy to calculate is white Gaussian noise:

:\int^T_0 \frac{N_0}{2}\delta(t-s)k(s)\,ds = S(t) \Rightarrow k(t) = C S(t), \qquad 0 < t < T.

The corresponding impulse response is ''h''(''t'') = ''k''(''T'' − ''t'') = ''CS''(''T'' − ''t''). Letting ''C'' = 1, this is just the result we arrived at in the previous section for detecting a signal in white noise.
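In discrete time, ''k'' can also be obtained by solving the corresponding linear system directly. A minimal sketch follows; the exponential covariance kernel, the signal, and the grid size are illustrative assumptions.

```python
import numpy as np

n, T = 200, 1.0
dt = T / n
t = np.linspace(0.0, T, n, endpoint=False) + 0.5 * dt   # midpoint grid

S = np.sin(2 * np.pi * 3 * t)                            # signal to detect
R_N = 0.5 * np.exp(-5.0 * np.abs(t[:, None] - t[None, :]))   # colored-noise covariance R_N(t, s)

# Discretize  integral_0^T R_N(t, s) k(s) ds = S(t)  as  (R_N * dt) k = S  and solve for k.
k = np.linalg.solve(R_N * dt, S)

# Residual of the integral equation on the grid (should be near machine precision).
print(np.max(np.abs(R_N @ k * dt - S)))
```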


Test threshold for Neyman–Pearson detector

Since X(t) is a Gaussian process,

:G = \int^T_0 k(t)x(t)\,dt

is a Gaussian random variable that can be characterized by its mean and variance:

:\begin{align}
\mathbf{E}[G\mid H] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid H]\,dt = 0 \\
\mathbf{E}[G\mid K] &= \int^T_0 k(t)\mathbf{E}[x(t)\mid K]\,dt = \int^T_0 k(t)S(t)\,dt \equiv \rho \\
\mathbf{E}[G^2\mid H] &= \int^T_0 \int^T_0 k(t)k(s) R_N(t,s)\,dt\,ds = \int^T_0 k(t) \left(\int^T_0 k(s)R_N(t,s)\,ds \right) dt = \int^T_0 k(t)S(t)\,dt = \rho \\
\operatorname{Var}[G\mid H] &= \mathbf{E}[G^2\mid H] - (\mathbf{E}[G\mid H])^2 = \rho \\
\mathbf{E}[G^2\mid K] &=\int^T_0\int^T_0 k(t)k(s) \mathbf{E}[x(t)x(s)]\,dt\,ds = \int^T_0\int^T_0 k(t)k(s)(R_N(t,s) +S(t)S(s))\,dt\,ds = \rho + \rho^2\\
\operatorname{Var}[G\mid K] &= \mathbf{E}[G^2\mid K] - (\mathbf{E}[G\mid K])^2 = \rho + \rho^2 -\rho^2 = \rho
\end{align}

Hence, we obtain the distributions of ''G'' under ''H'' and ''K'':

:H: G \sim N(0,\rho)
:K: G \sim N(\rho, \rho)

The false alarm error is

:\alpha = \int^\infty_{G_0} N(0,\rho)\,dG = 1 - \Phi\left(\frac{G_0}{\sqrt{\rho}}\right).

So the test threshold for the Neyman–Pearson optimum detector is

:G_0 = \sqrt{\rho}\,\Phi^{-1}(1-\alpha).

Its power of detection is

:\beta = \int^\infty_{G_0} N(\rho, \rho)\,dG = \Phi\left(\sqrt{\rho} - \Phi^{-1}(1 - \alpha)\right).

When the noise is a white Gaussian process, the signal power is

:\rho = \int^T_0 k(t)S(t)\,dt = \int^T_0 S(t)^2\,dt = E.


Prewhitening

For some types of colored noise, a typical practice is to add a prewhitening filter before the matched filter to transform the colored noise into white noise. For example, suppose N(t) is a wide-sense stationary colored noise with correlation function

:R_N(\tau) = \frac{B N_0}{4} e^{-B|\tau|},
:S_N(f) = \frac{N_0/2}{1 + \left(\frac{\omega}{B}\right)^2}.

The transfer function of the prewhitening filter is

:H(f) = 1 + j\frac{\omega}{B}.


Detection of a Gaussian random signal in Additive white Gaussian noise (AWGN)

When the signal we want to detect from the noisy channel is also random, for example, a white Gaussian process ''X''(''t''), we can still implement the K–L expansion to get an independent sequence of observations. In this case, the detection problem is described as follows:

:H_0 : Y(t) = N(t)
:H_1 : Y(t) = N(t) + X(t), \qquad 0 < t < T

''X''(''t'') is a random process with correlation function R_X(t,s) = E\{X(t)X(s)\}. The K–L expansion of ''X''(''t'') is

:X(t) = \sum^\infty_{i=1} X_i \Phi_i(t),

where

:X_i = \int^T_0 X(t)\Phi_i(t)\,dt

and \Phi_i(t) are solutions to

:\int^T_0 R_X(t,s)\Phi_i(s)\,ds = \lambda_i \Phi_i(t).

So the X_i are an independent sequence of r.v.'s with zero mean and variance \lambda_i. Expanding ''Y''(''t'') and ''N''(''t'') by \Phi_i(t), we get

:Y_i = \int^T_0 Y(t)\Phi_i(t)\,dt = \int^T_0 [N(t) + X(t)]\Phi_i(t)\,dt = N_i + X_i,

where

:N_i = \int^T_0 N(t)\Phi_i(t)\,dt.

As ''N''(''t'') is Gaussian white noise, the N_i are an i.i.d. sequence of r.v.'s with zero mean and variance \tfrac{1}{2}N_0, so the problem simplifies as follows:

:H_0: Y_i = N_i
:H_1: Y_i = N_i + X_i

The Neyman–Pearson optimal test:

:\Lambda = \frac{f_{Y\mid H_1}}{f_{Y\mid H_0}} = Ce^{\sum^\infty_{i=1} \frac{y_i^2}{2} \frac{\lambda_i}{\tfrac{N_0}{2}\left(\tfrac{N_0}{2}+\lambda_i\right)}},

so the log-likelihood ratio is

:\mathcal{L} = \ln(\Lambda) = K + \sum^\infty_{i=1}\frac{y_i^2}{2} \frac{\lambda_i}{\tfrac{N_0}{2}\left(\tfrac{N_0}{2}+\lambda_i\right)}.

Since

:\widehat{X}_i = \frac{\lambda_i}{\tfrac{N_0}{2}+\lambda_i} Y_i

is just the minimum mean-square estimate of X_i given Y_i,

:\mathcal{L} = K + \frac{1}{N_0} \sum^\infty_{i=1} Y_i \widehat{X}_i.

The K–L expansion has the following property: if

:f(t) = \sum f_i \Phi_i(t), \qquad g(t) = \sum g_i \Phi_i(t),

where

:f_i = \int_0^T f(t) \Phi_i(t)\,dt, \qquad g_i = \int_0^T g(t)\Phi_i(t)\,dt,

then

:\sum^\infty_{i=1} f_i g_i = \int^T_0 f(t)g(t)\,dt.

So let

:\widehat{X}(t\mid T) = \sum^\infty_{i=1} \widehat{X}_i \Phi_i(t), \qquad \mathcal{L} = K + \frac{1}{N_0} \int^T_0 Y(t) \widehat{X}(t\mid T)\,dt.

A noncausal filter ''Q''(''t'',''s'') can be used to obtain the estimate through

:\widehat{X}(t\mid T) = \int^T_0 Q(t,s)Y(s)\,ds.

By the orthogonality principle, ''Q''(''t'',''s'') satisfies

:\int^T_0 Q(t,s)R_X(s,\lambda)\,ds + \tfrac{N_0}{2} Q(t,\lambda) = R_X(t,\lambda), \qquad 0 < \lambda < T,\ 0 < t < T.

However, for practical reasons, it is necessary to further derive the causal filter ''h''(''t'',''s''), where ''h''(''t'',''s'') = 0 for ''s'' > ''t'', to get the estimate \widehat{X}(t\mid t). Specifically,

:Q(t,s) = h(t,s) + h(s,t) - \int^T_0 h(\lambda, t)h(s,\lambda)\,d\lambda
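A discrete sketch of this estimator–correlator statistic; everything below is an illustrative assumption (a made-up exponential signal covariance, a finite grid, and the continuous KL expansion replaced by the eigendecomposition of the sampled covariance), not the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 200, 1.0
dt = T / n
t = np.linspace(0.0, T, n, endpoint=False)
N0 = 0.2

# Covariance of the Gaussian random signal X(t), sampled on the grid.
R_X = np.exp(-10.0 * np.abs(t[:, None] - t[None, :]))

# Discrete KL basis: eigenvectors of R_X*dt approximate (Phi_i, lambda_i), sorted descending.
lam, V = np.linalg.eigh(R_X * dt)
lam, V = lam[::-1], V[:, ::-1]

# One observation Y(t) under H1: Y = N + X.
L = np.linalg.cholesky(R_X + 1e-9 * np.eye(n))
X = L @ rng.standard_normal(n)
Y = X + np.sqrt(N0 / (2 * dt)) * rng.standard_normal(n)

Y_i = V.T @ Y * np.sqrt(dt)                   # KL coefficients of the observation
X_hat = lam / (N0 / 2 + lam) * Y_i            # MMSE estimate of each X_i from Y_i
stat = np.sum(Y_i * X_hat) / N0               # estimator-correlator statistic (up to the constant K)
print(stat)
```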


See also

* Principal component analysis
* Polynomial chaos
* Reproducing kernel Hilbert space
* Mercer's theorem


Notes


References

*Wu B., Zhu J., Najm F. (2005) "A Non-parametric Approach for Dynamic Range Estimation of Nonlinear Systems". In Proceedings of the Design Automation Conference (841–844), 2005.
*Wu B., Zhu J., Najm F. (2006) "Dynamic Range Estimation". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, Issue 9 (1618–1636), 2006.


External links

* ''Mathematica'' KarhunenLoeveDecomposition function.
* ''E161: Computer Image Processing and Analysis'' notes by Prof. Ruye Wang at Harvey Mudd College