In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents are statistically independent from each other. ICA is a special case of blind source separation. A common example application is the "cocktail party problem" of listening in on one person's speech in a noisy room.


Introduction

Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results. ICA is also used for analysis purposes on signals that are not supposed to have been generated by mixing.

A simple application of ICA is the "cocktail party problem", where the underlying speech signals are separated from sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. Note that a filtered and delayed signal is a copy of a dependent component, and thus the statistical independence assumption is not violated.

Mixing weights for constructing the ''M'' observed signals from the ''N'' components can be placed in an M \times N matrix. An important point is that if ''N'' sources are present, at least ''N'' observations (e.g. microphones, if the observed signal is audio) are needed to recover the original signals. When there are an equal number of observations and source signals, the mixing matrix is square (''M = N''). The underdetermined (''M < N'') and overdetermined (''M > N'') cases have also been investigated.

That ICA separation of mixed signals gives very good results rests on two assumptions and three effects of mixing source signals.

Two assumptions:
# The source signals are independent of each other.
# The values in each source signal have non-Gaussian distributions.

Three effects of mixing source signals:
# Independence: As per assumption 1, the source signals are independent; however, their signal mixtures are not. This is because the signal mixtures share the same source signals.
# Normality: According to the central limit theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution. Loosely speaking, a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the two original variables. Here we consider the value of each signal as the random variable.
# Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.

These principles contribute to the basic establishment of ICA. If the signals extracted from a set of mixtures are independent, have non-Gaussian histograms or have low complexity, then they must be source signals.
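As a concrete illustration of the mixing model, the following sketch builds two observed mixtures from two independent non-Gaussian sources (a minimal example in NumPy; the source waveforms and the 2 × 2 mixing matrix are arbitrary choices made for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 4000)

    # Two independent, non-Gaussian source signals (arbitrary choices for the example).
    s1 = np.sign(np.sin(3 * t))            # square wave
    s2 = rng.laplace(size=t.size)          # heavy-tailed noise
    S = np.vstack([s1, s2])                # N sources x T samples

    # Mixing matrix (M x N = 2 x 2): row i holds the weights with which
    # "microphone" i picks up each source.
    A = np.array([[1.0, 0.5],
                  [0.7, 1.2]])

    X = A @ S                              # M observed mixtures, one per row

ICA receives only X and must recover, up to order and scaling, both A and S.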


Defining component independence

ICA finds the independent components (also called factors, latent variables or sources) by maximizing the statistical independence of the estimated components. We may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The two broadest definitions of independence for ICA are

# Minimization of mutual information
# Maximization of non-Gaussianity

The minimization-of-mutual-information (MMI) family of ICA algorithms uses measures such as the Kullback–Leibler divergence and maximum entropy. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy.

Typical algorithms for ICA use centering (subtracting the mean to create a zero-mean signal), whitening (usually with the eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition. Whitening ensures that all dimensions are treated equally ''a priori'' before the algorithm is run (a minimal preprocessing sketch is given below). Well-known algorithms for ICA include infomax, FastICA, JADE, and kernel-independent component analysis, among others. In general, ICA cannot identify the actual number of source signals, a uniquely correct ordering of the source signals, or the proper scaling (including sign) of the source signals.

ICA is important to blind signal separation and has many practical applications. It is closely related to (or even a special case of) the search for a factorial code of the data, i.e., a new vector-valued representation of each data vector such that it gets uniquely encoded by the resulting code vector (loss-free coding), while the code components are statistically independent.
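The preprocessing mentioned above can be sketched as follows, here using centering and whitening via the eigenvalue decomposition of the sample covariance (a PCA- or SVD-based whitening would serve equally well; names and structure are illustrative):

    import numpy as np

    def center_and_whiten(X):
        # X: observed mixtures, shape (m, T). Assumes the covariance is full rank.
        Xc = X - X.mean(axis=1, keepdims=True)     # centering: zero-mean rows
        cov = np.cov(Xc)                           # m x m sample covariance
        eigvals, E = np.linalg.eigh(cov)           # eigenvalue decomposition
        V = np.diag(1.0 / np.sqrt(eigvals)) @ E.T  # whitening matrix
        Z = V @ Xc                                 # whitened mixtures
        return Z, V

After whitening, the covariance of Z is (approximately) the identity, so the remaining unmixing sought by ICA can be restricted to a rotation.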


Mathematical definitions

Linear independent component analysis can be divided into noiseless and noisy cases, where noiseless ICA is a special case of noisy ICA. Nonlinear ICA should be considered as a separate case.


General definition

The data are represented by the observed random vector \boldsymbol{x}=(x_1,\ldots,x_m)^T and the hidden components by the random vector \boldsymbol{s}=(s_1,\ldots,s_n)^T. The task is to transform the observed data \boldsymbol{x}, using a linear static transformation \boldsymbol{W} as \boldsymbol{s} = \boldsymbol{W} \boldsymbol{x}, into a vector of maximally independent components \boldsymbol{s}, measured by some function F(s_1,\ldots,s_n) of independence.


Generative model


Linear noiseless ICA

The components x_i of the observed random vector \boldsymbol{x}=(x_1,\ldots,x_m)^T are generated as a sum of the independent components s_k, k=1,\ldots,n:

: x_i = a_{i1} s_1 + \cdots + a_{ik} s_k + \cdots + a_{in} s_n

weighted by the mixing weights a_{ik}. The same generative model can be written in vector form as \boldsymbol{x}=\sum_{k=1}^{n} s_k \boldsymbol{a}_k, where the observed random vector \boldsymbol{x} is represented by the basis vectors \boldsymbol{a}_k=(a_{1k},\ldots,a_{mk})^T. The basis vectors \boldsymbol{a}_k form the columns of the mixing matrix \boldsymbol{A}=(\boldsymbol{a}_1,\ldots,\boldsymbol{a}_n) and the generative formula can be written as \boldsymbol{x}=\boldsymbol{A}\boldsymbol{s}, where \boldsymbol{s}=(s_1,\ldots,s_n)^T.

Given the model and realizations (samples) \boldsymbol{x}_1,\ldots,\boldsymbol{x}_N of the random vector \boldsymbol{x}, the task is to estimate both the mixing matrix \boldsymbol{A} and the sources \boldsymbol{s}. This is done by adaptively calculating the \boldsymbol{w} vectors and setting up a cost function which either maximizes the non-Gaussianity of the calculated s_k = \boldsymbol{w}^T \boldsymbol{x} or minimizes the mutual information. In some cases, a priori knowledge of the probability distributions of the sources can be used in the cost function. The original sources \boldsymbol{s} can be recovered by multiplying the observed signals \boldsymbol{x} with the inverse of the mixing matrix \boldsymbol{W}=\boldsymbol{A}^{-1}, also known as the unmixing matrix. Here it is assumed that the mixing matrix is square (n=m). If the number of basis vectors is greater than the dimensionality of the observed vectors, n>m, the task is overcomplete but is still solvable with the pseudoinverse.
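Assuming a known (or already estimated) mixing matrix, recovery reduces to applying the unmixing matrix; a minimal sketch under that assumption, with illustrative values:

    import numpy as np

    # Square mixing matrix A and observed mixtures X = A @ S.
    A = np.array([[1.0, 0.5],
                  [0.7, 1.2]])
    S = np.random.default_rng(1).laplace(size=(2, 1000))
    X = A @ S

    W = np.linalg.inv(A)                   # unmixing matrix for the square case (n = m)
    S_hat = W @ X                          # recovered sources
    assert np.allclose(S_hat, S)

    # Overcomplete case (n > m): no inverse exists; as noted in the text, the
    # Moore-Penrose pseudoinverse can be used instead (recovery is then not exact in general).
    A_over = np.array([[1.0, 0.5, 0.2],
                       [0.7, 1.2, 0.9]])   # m = 2 mixtures, n = 3 sources
    W_pinv = np.linalg.pinv(A_over)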


Linear noisy ICA

With the added assumption of zero-mean and uncorrelated Gaussian noise n\sim N(0,\operatorname{diag}(\Sigma)), the ICA model takes the form \boldsymbol{x}=\boldsymbol{A} \boldsymbol{s}+n.
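A generative sketch of this noisy model (illustrative values; the diagonal covariance keeps the noise components uncorrelated across channels):

    import numpy as np

    rng = np.random.default_rng(2)
    A = np.array([[1.0, 0.5],
                  [0.7, 1.2]])
    S = rng.laplace(size=(2, 1000))                 # non-Gaussian sources

    sigma = np.array([0.1, 0.2])                    # per-channel noise standard deviations
    noise = rng.normal(scale=sigma[:, None], size=(2, 1000))  # zero-mean uncorrelated Gaussian noise
    X = A @ S + noise                               # noisy ICA model x = A s + n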


Nonlinear ICA

The mixing of the sources does not need to be linear. Using a nonlinear mixing function f(\cdot, \theta) with parameters \theta, the nonlinear ICA model is x=f(s, \theta)+n.


Identifiability

The independent components are identifiable up to a permutation and scaling of the sources. This identifiability requires that:
* At most one of the sources s_k is Gaussian,
* The number of observed mixtures, m, must be at least as large as the number of estimated components n: m \ge n. Equivalently, the mixing matrix \boldsymbol{A} must be of full rank for its inverse to exist.


Binary ICA

A special variant of ICA is binary ICA, in which both signal sources and monitors are in binary form and observations from monitors are disjunctive mixtures of binary independent sources. The problem was shown to have applications in many domains including medical diagnosis, multi-cluster assignment, network tomography and internet resource management.

Let \{x_1, x_2, \ldots, x_m\} be the set of binary variables from m monitors and \{y_1, y_2, \ldots, y_n\} be the set of binary variables from n sources. Source-monitor connections are represented by the (unknown) mixing matrix \boldsymbol{G}, where g_{ij} = 1 indicates that the signal from the j-th source can be observed by the i-th monitor. The system works as follows: at any time, if a source j is active (y_j=1) and it is connected to the monitor i (g_{ij}=1), then the monitor i will observe some activity (x_i=1). Formally we have:

: x_i = \bigvee_{j=1}^n (g_{ij}\wedge y_j), \quad i = 1, 2, \ldots, m,

where \wedge is Boolean AND and \vee is Boolean OR. Noise is not explicitly modelled; rather, noise sources can be treated as independent sources.

The above problem can be heuristically solved (Johan Himberg and Aapo Hyvärinen, "Independent Component Analysis for Binary Data: An Experimental Study", Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001), San Diego, California, 2001) by assuming the variables are continuous and running FastICA on the binary observation data to get the mixing matrix \boldsymbol{G} (real values), then applying rounding techniques on \boldsymbol{G} to obtain the binary values. This approach has been shown to produce a highly inaccurate result.

Another method is to use dynamic programming: recursively breaking the observation matrix \boldsymbol{X} into its sub-matrices and running the inference algorithm on these sub-matrices. The key observation which leads to this algorithm is that the sub-matrix \boldsymbol{X}^0 of \boldsymbol{X} where x_{ij} = 0, \forall j corresponds to the unbiased observation matrix of hidden components that do not have a connection to the i-th monitor. Experimental results from Huy Nguyen and Rong Zheng ("Binary Independent Component Analysis with OR Mixtures", IEEE Transactions on Signal Processing, Vol. 59, Issue 7, July 2011, pp. 3168–3181) show that this approach is accurate under moderate noise levels.

The Generalized Binary ICA framework introduces a broader problem formulation which does not require any knowledge of the generative model. In other words, this method attempts to decompose a source into its independent components (as far as possible, and without losing any information) with no prior assumption on the way it was generated. Although this problem appears quite complex, it can be accurately solved with a branch and bound search tree algorithm or tightly upper bounded with a single multiplication of a matrix with a vector. A minimal sketch of the disjunctive (Boolean OR) generative model is given below.
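The following sketch generates observations from the disjunctive mixing model above, i.e. a monitor fires whenever at least one connected, active source reaches it (names, sizes and activation probabilities are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n_sources, n_monitors, T = 3, 4, 1000

    Y = rng.random((n_sources, T)) < 0.2            # binary source activity y_j(t)
    G = rng.random((n_monitors, n_sources)) < 0.5   # binary connections g_ij

    # x_i(t) = OR over j of (g_ij AND y_j(t)): Boolean OR of ANDs via an integer matrix product.
    X = (G.astype(int) @ Y.astype(int)) > 0

Recovering G and Y from X alone is the binary ICA problem that the heuristic and dynamic-programming approaches above address.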


Methods for blind source separation


Projection pursuit

Signal mixtures tend to have Gaussian probability density functions, and source signals tend to have non-Gaussian probability density functions. Each source signal can be extracted from a set of signal mixtures by taking the inner product of a weight vector and those signal mixtures, where this inner product provides an orthogonal projection of the signal mixtures. The remaining challenge is finding such a weight vector. One type of method for doing so is projection pursuit (James V. Stone (2004), ''Independent Component Analysis: A Tutorial Introduction'', The MIT Press, Cambridge, Massachusetts).

Projection pursuit seeks one projection at a time such that the extracted signal is as non-Gaussian as possible. This contrasts with ICA, which typically extracts ''M'' signals simultaneously from ''M'' signal mixtures, which requires estimating an ''M'' × ''M'' unmixing matrix. One practical advantage of projection pursuit over ICA is that fewer than ''M'' signals can be extracted if required, where each source signal is extracted from ''M'' signal mixtures using an ''M''-element weight vector.

We can use kurtosis to recover the multiple source signals by finding the correct weight vectors with the use of projection pursuit. The kurtosis of the probability density function of a signal, for a finite sample, is computed as

: K=\frac{\operatorname{E}[(\mathbf{y}-\mathbf{\overline{y}})^4]}{(\operatorname{E}[(\mathbf{y}-\mathbf{\overline{y}})^2])^2}-3

where \mathbf{\overline{y}} is the sample mean of \mathbf{y}, the extracted signals. The constant 3 ensures that Gaussian signals have zero kurtosis, super-Gaussian signals have positive kurtosis, and sub-Gaussian signals have negative kurtosis. The denominator is the variance of \mathbf{y}, and ensures that the measured kurtosis takes account of signal variance. The goal of projection pursuit is to maximize the kurtosis, making the extracted signal as non-normal as possible.

Using kurtosis as a measure of non-normality, we can now examine how the kurtosis of a signal \mathbf{y} = \mathbf{w}^T \mathbf{x} extracted from a set of ''M'' mixtures \mathbf{x}=(x_1,x_2,\ldots,x_M)^T varies as the weight vector \mathbf{w} is rotated around the origin. Given our assumption that each source signal \mathbf{s} is super-Gaussian, we would expect:
#the kurtosis of the extracted signal \mathbf{y} to be maximal precisely when \mathbf{y} = \mathbf{s}.
#the kurtosis of the extracted signal \mathbf{y} to be maximal when \mathbf{w} is orthogonal to the projected axes S_1 or S_2, because we know the optimal weight vector should be orthogonal to a transformed axis S_1 or S_2.

For multiple source mixture signals, we can use kurtosis and Gram–Schmidt orthogonalization (GSO) to recover the signals. Given ''M'' signal mixtures in an ''M''-dimensional space, GSO projects these data points onto an (''M''-1)-dimensional space by using the weight vector. We can guarantee the independence of the extracted signals with the use of GSO.

In order to find the correct value of \mathbf{w}, we can use the gradient descent method. We first whiten the data, transforming \mathbf{x} into a new mixture \mathbf{z}=(z_1,z_2,\ldots,z_M)^T, which has unit variance. This process can be achieved by applying singular value decomposition to \mathbf{x},

: \mathbf{x} = \mathbf{U} \mathbf{D} \mathbf{V}^T

rescaling each vector U_i=U_i/\operatorname{E}(U_i^2), and letting \mathbf{z} = \mathbf{U}. The signal extracted by a weight vector \mathbf{w} is \mathbf{y} = \mathbf{w}^T \mathbf{z}. If the weight vector \mathbf{w} has unit length, then the variance of y is also 1, that is \operatorname{E}[(\mathbf{w}^T \mathbf{z})^2]=1. The kurtosis can thus be written as:

: K=\frac{\operatorname{E}[(\mathbf{w}^T \mathbf{z})^4]}{(\operatorname{E}[(\mathbf{w}^T \mathbf{z})^2])^2}-3=\operatorname{E}[(\mathbf{w}^T \mathbf{z})^4]-3.

The updating process for \mathbf{w} is:

: \mathbf{w}_{new}=\mathbf{w}_{old}-\eta\operatorname{E}[\mathbf{z}(\mathbf{w}_{old}^T \mathbf{z})^3]

where \eta is a small constant to guarantee that \mathbf{w} converges to the optimal solution. After each update, we normalize \mathbf{w}_{new}=\frac{\mathbf{w}_{new}}{\|\mathbf{w}_{new}\|}, set \mathbf{w}_{old}=\mathbf{w}_{new}, and repeat the updating process until convergence. We can also use another algorithm to update the weight vector \mathbf{w}. A minimal sketch of this kurtosis-driven update appears at the end of this section.

Another approach is to use negentropy instead of kurtosis. Using negentropy is a more robust method than kurtosis, as kurtosis is very sensitive to outliers. The negentropy methods are based on an important property of the Gaussian distribution: a Gaussian variable has the largest entropy among all continuous random variables of equal variance. This is also the reason why we want to find the most non-Gaussian variables. A simple proof can be found in the article on differential entropy.

: J(x) = S(y) - S(x)

where y is a Gaussian random variable with the same covariance matrix as x, and

: S(x) = - \int p_x(u) \log p_x(u) \, du

is the differential entropy. An approximation for negentropy is

: J(x)=\frac{1}{12}(E(x^3))^2 + \frac{1}{48}(\operatorname{kurt}(x))^2

A proof can be found in the original papers of Comon; it has been reproduced in the book ''Independent Component Analysis'' by Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. This approximation also suffers from the same problem as kurtosis (sensitivity to outliers). Other approaches have been developed, such as

: J(y) = k_1(E(G_1(y)))^2 + k_2(E(G_2(y)) - E(G_2(v)))^2

where v is a zero-mean, unit-variance Gaussian variable and k_1, k_2 are positive constants. A common choice of G_1 and G_2 is

: G_1(u) = \frac{1}{a_1}\log(\cosh(a_1 u)) and G_2(u) = -\exp\left(-\frac{u^2}{2}\right)
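A minimal sketch of gradient-based projection pursuit on whitened data follows; the step here ascends the kurtosis of the projection, which is the appropriate direction for the super-Gaussian sources assumed above (learning rate, iteration count and initialization are illustrative):

    import numpy as np

    def extract_one_source(Z, eta=0.1, n_iter=500, seed=0):
        # Z: whitened mixtures, shape (M, T).
        rng = np.random.default_rng(seed)
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z                            # current projection, shape (T,)
            grad = (Z * y ** 3).mean(axis=1)     # E[z (w^T z)^3], gradient direction of E[y^4]
            w = w + eta * grad                   # step towards larger kurtosis
            w /= np.linalg.norm(w)               # keep unit length
        return w, w @ Z

To extract further sources, the recovered direction can be projected out of the data (Gram–Schmidt, as described above) and the procedure repeated.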


Based on infomax

Infomax ICA (Bell, A. J.; Sejnowski, T. J. (1995). "An Information-Maximization Approach to Blind Separation and Blind Deconvolution", ''Neural Computation'', 7, 1129–1159) is essentially a multivariate, parallel version of projection pursuit. Whereas projection pursuit extracts a series of signals one at a time from a set of ''M'' signal mixtures, ICA extracts ''M'' signals in parallel. This tends to make ICA more robust than projection pursuit (Stone 2004).

The projection pursuit method uses Gram–Schmidt orthogonalization to ensure the independence of the extracted signals, while ICA uses infomax and maximum likelihood estimation to ensure the independence of the extracted signals. The non-normality of the extracted signals is achieved by assigning an appropriate model, or prior, for the signals.

The process of ICA based on infomax, in short, is: given a set of signal mixtures \mathbf{x} and a set of identical independent model cumulative distribution functions (cdfs) g, we seek the unmixing matrix \mathbf{W} which maximizes the joint entropy of the signals \mathbf{Y}=g(\mathbf{y}), where \mathbf{y}=\mathbf{W}\mathbf{x} are the signals extracted by \mathbf{W}. Given the optimal \mathbf{W}, the signals \mathbf{Y} have maximum entropy and are therefore independent, which ensures that the extracted signals \mathbf{y}=g^{-1}(\mathbf{Y}) are also independent. g is an invertible function, and is the signal model. Note that if the source signal model probability density function (pdf) p_s matches the pdf of the extracted signal p_{\mathbf{y}}, then maximizing the joint entropy of \mathbf{Y} also maximizes the amount of mutual information between \mathbf{x} and \mathbf{Y}. For this reason, using entropy to extract independent signals is known as infomax.

Consider the entropy of the vector variable \mathbf{Y}=g(\mathbf{y}), where \mathbf{y}=\mathbf{W}\mathbf{x} is the set of signals extracted by the unmixing matrix \mathbf{W}. For a finite set of values sampled from a distribution with pdf p_{\mathbf{y}}, the entropy of \mathbf{Y} can be estimated as:

: H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln p_{\mathbf{Y}}(\mathbf{Y}^t)

The joint pdf p_{\mathbf{Y}} can be shown to be related to the joint pdf p_{\mathbf{y}} of the extracted signals by the multivariate form:

: p_{\mathbf{Y}}(Y)=\frac{p_{\mathbf{y}}(\mathbf{y})}{\left|\frac{\partial \mathbf{Y}}{\partial \mathbf{y}}\right|}

where \mathbf{J}=\frac{\partial \mathbf{Y}}{\partial \mathbf{y}} is the Jacobian matrix. We have |\mathbf{J}|=g'(\mathbf{y}), and g' is the pdf assumed for the source signals, g'=p_s; therefore,

: p_{\mathbf{Y}}(Y)=\frac{p_{\mathbf{y}}(\mathbf{y})}{\left|\frac{\partial \mathbf{Y}}{\partial \mathbf{y}}\right|}=\frac{p_{\mathbf{y}}(\mathbf{y})}{p_s(\mathbf{y})}

and therefore,

: H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln\frac{p_{\mathbf{y}}(\mathbf{y}^t)}{p_s(\mathbf{y}^t)}

We know that when p_{\mathbf{y}}=p_s, p_{\mathbf{Y}} is the uniform distribution, and H(\mathbf{Y}) is maximized. Since

: p_{\mathbf{y}}(\mathbf{y})=\frac{p_{\mathbf{x}}(\mathbf{x})}{\left|\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right|}=\frac{p_{\mathbf{x}}(\mathbf{x})}{|\mathbf{W}|}

where |\mathbf{W}| is the absolute value of the determinant of the unmixing matrix \mathbf{W}, it follows that

: H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln\frac{p_{\mathbf{x}}(\mathbf{x}^t)}{|\mathbf{W}|\, p_s(\mathbf{y}^t)}

so,

: H(\mathbf{Y})=\frac{1}{N}\sum_{t=1}^N \ln p_s(\mathbf{y}^t)+\ln|\mathbf{W}|+H(\mathbf{x})

Since H(\mathbf{x})=-\frac{1}{N}\sum_{t=1}^N\ln p_{\mathbf{x}}(\mathbf{x}^t), and changing \mathbf{W} does not affect H(\mathbf{x}), we can maximize the function

: h(\mathbf{W})=\frac{1}{N}\sum_{t=1}^N \ln p_s(\mathbf{y}^t)+\ln|\mathbf{W}|

to achieve the independence of the extracted signals. If the ''M'' marginal pdfs of the model joint pdf p_s are independent and we use the commonly assumed super-Gaussian model pdf for the source signals, p_s=(1-\tanh(s)^2), then we have

: h(\mathbf{W})=\frac{1}{N}\sum_{i=1}^M\sum_{t=1}^N \ln\bigl(1-\tanh(\mathbf{w}_i^T \mathbf{x}^t)^2\bigr)+\ln|\mathbf{W}|

In sum, given an observed signal mixture \mathbf{x}, the corresponding set of extracted signals \mathbf{y} and a source signal model p_s=g', we can find the optimal unmixing matrix \mathbf{W} and make the extracted signals independent and non-Gaussian. As in the projection pursuit situation, we can use the gradient descent method to find the optimal solution of the unmixing matrix; a minimal sketch follows.
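A minimal sketch of gradient ascent on h(\mathbf{W}) under the p_s=(1-\tanh(s)^2) source model above (learning rate and iteration count are illustrative; practical implementations often use the natural-gradient variant for speed and stability):

    import numpy as np

    def infomax_ica(X, eta=0.01, n_iter=2000, seed=0):
        # X: observed mixtures, shape (M, N samples), ideally centered/whitened first.
        rng = np.random.default_rng(seed)
        M, N = X.shape
        W = np.eye(M) + 0.1 * rng.normal(size=(M, M))
        for _ in range(n_iter):
            Y = W @ X                            # extracted signals y = W x
            # Gradient of h(W) = (1/N) sum ln(1 - tanh(y)^2) + ln|det W|
            grad = np.linalg.inv(W).T - (2.0 / N) * np.tanh(Y) @ X.T
            W = W + eta * grad                   # gradient ascent on h(W)
        return W, W @ X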


Based on maximum likelihood estimation

Maximum likelihood estimation (MLE) is a standard statistical tool for finding parameter values (e.g. the unmixing matrix \mathbf{W}) that provide the best fit of some data (e.g., the extracted signals \mathbf{y}) to a given model (e.g., the assumed joint probability density function (pdf) p_s of the source signals).

The ML "model" includes a specification of a pdf, which in this case is the pdf p_s of the unknown source signals \mathbf{s}. Using ML ICA, the objective is to find an unmixing matrix that yields extracted signals \mathbf{y} = \mathbf{W}\mathbf{x} with a joint pdf as similar as possible to the joint pdf p_s of the unknown source signals \mathbf{s}.

MLE is thus based on the assumption that if the model pdf p_s and the model parameters \mathbf{W} are correct, then a high probability should be obtained for the data \mathbf{x} that were actually observed. Conversely, if \mathbf{W} is far from the correct parameter values, then a low probability of the observed data would be expected.

Using MLE, we call the probability of the observed data for a given set of model parameter values (e.g., a pdf p_s and a matrix \mathbf{W}) the ''likelihood'' of the model parameter values given the observed data. We define a ''likelihood'' function \mathbf{L}(\mathbf{W}) of \mathbf{W}:

: \mathbf{L}(\mathbf{W}) = p_s (\mathbf{W}\mathbf{x})\,|\det \mathbf{W}|.

This equals the probability density at \mathbf{x}, since \mathbf{s} = \mathbf{W}\mathbf{x}. Thus, if we wish to find a \mathbf{W} that is most likely to have generated the observed mixtures \mathbf{x} from the unknown source signals \mathbf{s} with pdf p_s, then we need only find that \mathbf{W} which maximizes the ''likelihood'' \mathbf{L}(\mathbf{W}). The unmixing matrix that maximizes this likelihood is known as the MLE of the optimal unmixing matrix.

It is common practice to use the log ''likelihood'', because this is easier to evaluate. As the logarithm is a monotonic function, the \mathbf{W} that maximizes the function \mathbf{L}(\mathbf{W}) also maximizes its logarithm \ln \mathbf{L}(\mathbf{W}). This allows us to take the logarithm of the equation above, which yields the log ''likelihood'' function

: \ln \mathbf{L}(\mathbf{W}) =\sum_{i=1}^{M}\sum_{t=1}^{N} \ln p_s(\mathbf{w}^T_i\mathbf{x}_t) + N\ln|\det \mathbf{W}|

If we substitute a commonly used high-kurtosis model pdf for the source signals, p_s = (1-\tanh(s)^2), then we have

: \ln \mathbf{L}(\mathbf{W}) =\sum_{i=1}^{M} \sum_{t=1}^{N}\ln\bigl(1-\tanh(\mathbf{w}^T_i \mathbf{x}_t )^2\bigr) + N\ln |\det \mathbf{W}|

The matrix \mathbf{W} that maximizes this function is the ''maximum likelihood estimate''.
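A small sketch that simply evaluates this log likelihood for a candidate unmixing matrix (illustrative; maximizing it by gradient ascent leads to the same update as the infomax sketch above, since the two objectives coincide up to a factor of N):

    import numpy as np

    def log_likelihood(W, X):
        # X: observed mixtures (M, N samples); W: candidate unmixing matrix (M, M).
        M, N = X.shape
        Y = W @ X                                # candidate extracted signals
        # sum_i sum_t ln(1 - tanh(w_i^T x_t)^2) + N ln|det W|
        return np.sum(np.log(1.0 - np.tanh(Y) ** 2)) + N * np.log(abs(np.linalg.det(W)))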


History and background

The early general framework for independent component analysis was introduced by Jeanny Hérault and Bernard Ans in 1984, further developed by Christian Jutten in 1985 and 1986, and refined by Pierre Comon in 1991 (P. Comon, "Independent Component Analysis", Workshop on Higher-Order Statistics, July 1991; republished in J-L. Lacoume, ed., ''Higher Order Statistics'', pp. 29–38, Elsevier, Amsterdam, 1992), and popularized in his paper of 1994 (Pierre Comon (1994), "Independent component analysis, a new concept?", http://www.ece.ucsb.edu/wcsl/courses/ECE594/594C_F10Madhow/comon94.pdf). In 1995, Tony Bell and Terry Sejnowski introduced a fast and efficient ICA algorithm based on infomax, a principle introduced by Ralph Linsker in 1987.

There are many algorithms available in the literature which perform ICA. A widely used one, including in industrial applications, is the FastICA algorithm, developed by Hyvärinen and Oja, which uses negentropy as the cost function. Other examples are more closely related to blind source separation, where a more general approach is used. For example, one can drop the independence assumption and separate mutually correlated signals, thus statistically "dependent" signals. Sepp Hochreiter and Jürgen Schmidhuber showed how to obtain non-linear ICA or source separation as a by-product of regularization (1999). Their method does not require a priori knowledge about the number of independent sources.


Applications

ICA can be extended to analyze non-physical signals. For instance, ICA has been applied to discover discussion topics in a bag of news list archives. Some ICA applications are listed below:

* optical imaging of neurons
* neuronal spike sorting
* face recognition
* modelling receptive fields of primary visual neurons
* predicting stock market prices
* mobile phone communications
* colour-based detection of the ripeness of tomatoes
* removing artifacts, such as eye blinks, from EEG data
* predicting decision-making using EEG
* analysis of changes in gene expression over time in single-cell RNA-sequencing experiments
* studies of the resting state network of the brain
* astronomy and cosmology
* finance


Availability

ICA can be applied through the following software:

* SAS PROC ICA
* scikit-learn: Python implementation sklearn.decomposition.FastICA (see the usage sketch below)
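A brief usage sketch with scikit-learn's FastICA (the mixing setup is an arbitrary illustration; scikit-learn expects samples as rows):

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)
    S = np.c_[np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]  # samples x sources
    A = np.array([[1.0, 0.5],
                  [0.7, 1.2]])
    X = S @ A.T                                  # observed mixtures, samples x features

    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)                 # estimated sources (up to order, sign and scale)
    A_est = ica.mixing_                          # estimated mixing matrix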


See also

* Blind deconvolution
* Factor analysis
* Hilbert spectrum
* Image processing
* Non-negative matrix factorization (NMF)
* Nonlinear dimensionality reduction
* Projection pursuit
* Varimax rotation


References

* Comon, Pierre (1994). "Independent Component Analysis: a new concept?", ''Signal Processing'', 36(3):287–314. (The original paper describing the concept of ICA.)
* Hyvärinen, A.; Karhunen, J.; Oja, E. (2001). ''Independent Component Analysis''. New York: Wiley.
* Hyvärinen, A.; Oja, E. (2000). "Independent Component Analysis: Algorithms and Applications", ''Neural Networks'', 13(4–5):411–430. (Technical but pedagogical introduction.)
* Comon, P.; Jutten, C. (2010). ''Handbook of Blind Source Separation, Independent Component Analysis and Applications''. Academic Press, Oxford, UK.
* Lee, T.-W. (1998). ''Independent Component Analysis: Theory and Applications''. Boston, Mass.: Kluwer Academic Publishers.
* Acharyya, Ranjan (2008). ''A New Approach for Blind Source Separation of Convolutive Sources – Wavelet Based Separation Using Shrinkage Function''. ISBN 978-3639077971. (This book focuses on unsupervised learning with blind source separation.)


External links


* What is independent component analysis? by Aapo Hyvärinen
* A Tutorial on Independent Component Analysis
* FastICA as a package for Matlab, in R language, and C++
* ICALAB Toolboxes for Matlab, developed at RIKEN
* High Performance Signal Analysis Toolkit, providing C++ implementations of FastICA and Infomax
* ICA toolbox: Matlab tools for ICA with Bell–Sejnowski, Molgedey–Schuster and mean-field ICA, developed at DTU
* Demonstration of the cocktail party problem
* EEGLAB Toolbox: ICA of EEG for Matlab, developed at UCSD
* FMRLAB Toolbox: ICA of fMRI for Matlab, developed at UCSD
* MELODIC, part of the FMRIB Software Library
* Discussion of ICA used in a biomedical shape-representation context
* FastICA, CuBICA, JADE and TDSEP algorithms for Python and more
* Group ICA Toolbox and Fusion ICA Toolbox
* Tutorial: Using ICA for cleaning EEG signals