In statistics, kernel Fisher discriminant analysis (KFD), also known as generalized discriminant analysis and kernel discriminant analysis, is a kernelized version of linear discriminant analysis (LDA). It is named after
Ronald Fisher.


Linear discriminant analysis

Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, \mathbf{C}_1 and \mathbf{C}_2, we can calculate the mean value of each class, \mathbf{m}_1 and \mathbf{m}_2, as

: \mathbf{m}_i = \frac{1}{l_i}\sum_{n=1}^{l_i}\mathbf{x}_n^i,

where l_i is the number of examples of class \mathbf{C}_i. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small. This is formulated as maximizing, with respect to \mathbf{w}, the following ratio:

: J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}},

where \mathbf{S}_B is the between-class covariance matrix and \mathbf{S}_W is the total within-class covariance matrix:

: \begin{align} \mathbf{S}_B & = (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\text{T}} \\ \mathbf{S}_W & = \sum_{i=1,2}\sum_{n=1}^{l_i}(\mathbf{x}_n^i-\mathbf{m}_i)(\mathbf{x}_n^i-\mathbf{m}_i)^{\text{T}}. \end{align}

The maximum of the above ratio is attained at

: \mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1),

as can be shown by the Lagrange multiplier method (sketch of proof): maximizing J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}} is equivalent to maximizing

: \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}

subject to

: \mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} = 1.

This, in turn, is equivalent to maximizing I(\mathbf{w},\lambda) = \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w} - \lambda(\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} - 1), where \lambda is the Lagrange multiplier. At the maximum, the derivatives of I(\mathbf{w},\lambda) with respect to \mathbf{w} and \lambda must be zero. Taking \frac{\partial I}{\partial\mathbf{w}} = \mathbf{0} yields

: \mathbf{S}_B\mathbf{w} - \lambda\mathbf{S}_W\mathbf{w} = \mathbf{0},

which is trivially satisfied by \mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1) and \lambda = (\mathbf{m}_2-\mathbf{m}_1)^{\text{T}}\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1).
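
As a concrete illustration of the closed-form solution above, the following minimal NumPy sketch (not part of the original article; the names lda_direction, X1 and X2 are chosen purely for illustration) computes a unit vector in the direction \mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1) for two clouds of labeled points.

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher discriminant direction w ∝ S_W^{-1}(m_2 - m_1) for two classes.

    X1, X2: arrays of shape (l_1, d) and (l_2, d) holding the samples of
    class 1 and class 2 as rows.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

    # Total within-class scatter S_W: sum of centred outer products over both classes.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

    # Solve S_W w = m_2 - m_1 rather than forming the inverse explicitly.
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Toy usage: two Gaussian blobs separated along the first coordinate.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X2 = rng.normal(loc=[3.0, 0.0], size=(100, 2))
print(lda_direction(X1, X2))
```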


Extending LDA

To extend LDA to non-linear mappings, the data, given as the \ell points \mathbf{x}_i, can be mapped to a new feature space, F, via some function \phi. In this new feature space, the function that needs to be maximized is

: J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B^\phi\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w}},

where

: \begin{align} \mathbf{S}_B^\phi & = \left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)\left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)^{\text{T}} \\ \mathbf{S}_W^\phi & = \sum_{i=1,2}\sum_{n=1}^{l_i}\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi\right)\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi\right)^{\text{T}}, \end{align}

and

: \mathbf{m}_i^\phi = \frac{1}{l_i}\sum_{j=1}^{l_i}\phi(\mathbf{x}_j^i).

Further, note that \mathbf{w}\in F. Explicitly computing the mappings \phi(\mathbf{x}_i) and then performing LDA can be computationally expensive, and in many cases intractable. For example, F may be infinite-dimensional. Thus, rather than explicitly mapping the data to F, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and using kernel functions, in which the dot product in the new feature space is replaced by a kernel function, k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})\cdot\phi(\mathbf{y}).

LDA can be reformulated in terms of dot products by first noting that \mathbf{w} will have an expansion of the form

: \mathbf{w} = \sum_{i=1}^{l}\alpha_i\phi(\mathbf{x}_i).

Then note that

: \mathbf{w}^{\text{T}}\mathbf{m}_i^\phi = \frac{1}{l_i}\sum_{j=1}^{l}\sum_{k=1}^{l_i}\alpha_j k\left(\mathbf{x}_j,\mathbf{x}_k^i\right) = \boldsymbol{\alpha}^{\text{T}}\mathbf{M}_i,

where

: (\mathbf{M}_i)_j = \frac{1}{l_i}\sum_{k=1}^{l_i}k(\mathbf{x}_j,\mathbf{x}_k^i).

The numerator of J(\mathbf{w}) can then be written as:

: \mathbf{w}^{\text{T}}\mathbf{S}_B^\phi\mathbf{w} = \mathbf{w}^{\text{T}}\left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)\left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)^{\text{T}}\mathbf{w} = \boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}, \qquad \text{where} \qquad \mathbf{M} = (\mathbf{M}_2-\mathbf{M}_1)(\mathbf{M}_2-\mathbf{M}_1)^{\text{T}}.

Similarly, the denominator can be written as

: \mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w} = \boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}, \qquad \text{where} \qquad \mathbf{N} = \sum_{j=1,2}\mathbf{K}_j(\mathbf{I}-\mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}},

with the (n,m) component of \mathbf{K}_j defined as k(\mathbf{x}_n,\mathbf{x}_m^j), \mathbf{I} the identity matrix, and \mathbf{1}_{l_j} the matrix with all entries equal to 1/l_j. This identity can be derived by starting out with the expression for \mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w} and using the expansion of \mathbf{w} and the definitions of \mathbf{S}_W^\phi and \mathbf{m}_i^\phi:

: \begin{align}
\mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w}
&= \left(\sum_{i=1}^{l}\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\right)\left(\sum_{j=1,2}\sum_{n=1}^{l_j}\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)^{\text{T}}\right)\left(\sum_{k=1}^{l}\alpha_k\phi(\mathbf{x}_k)\right) \\
&= \sum_{j=1,2}\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)^{\text{T}}\alpha_k\phi(\mathbf{x}_k)\right) \\
&= \sum_{j=1,2}\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i k(\mathbf{x}_i,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{p=1}^{l_j}\alpha_i k(\mathbf{x}_i,\mathbf{x}_p^j)\right)\left(\alpha_k k(\mathbf{x}_k,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{q=1}^{l_j}\alpha_k k(\mathbf{x}_k,\mathbf{x}_q^j)\right) \\
&= \sum_{j=1,2}\left(\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j) - \frac{2\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j) + \frac{\alpha_i\alpha_k}{l_j^2}\sum_{p=1}^{l_j}\sum_{q=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_p^j)k(\mathbf{x}_k,\mathbf{x}_q^j)\right)\right) \\
&= \sum_{j=1,2}\left(\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j) - \frac{\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j)\right)\right) \\
&= \sum_{j=1,2}\boldsymbol{\alpha}^{\text{T}}\mathbf{K}_j\mathbf{K}_j^{\text{T}}\boldsymbol{\alpha} - \boldsymbol{\alpha}^{\text{T}}\mathbf{K}_j\mathbf{1}_{l_j}\mathbf{K}_j^{\text{T}}\boldsymbol{\alpha} \\
&= \boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}.
\end{align}

With these equations for the numerator and denominator of J(\mathbf{w}), the equation for J can be rewritten as

: J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}}{\boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}}.

Then, differentiating and setting equal to zero gives

: (\boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha})\mathbf{N}\boldsymbol{\alpha} = (\boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha})\mathbf{M}\boldsymbol{\alpha}.

Since only the direction of \mathbf{w}, and hence the direction of \boldsymbol{\alpha}, matters, the above can be solved for \boldsymbol{\alpha} as

: \boldsymbol{\alpha} = \mathbf{N}^{-1}(\mathbf{M}_2-\mathbf{M}_1).

Note that in practice, \mathbf{N} is usually singular and so a multiple of the identity is added to it:

: \mathbf{N}_\epsilon = \mathbf{N}+\epsilon\mathbf{I}.

Given the solution for \boldsymbol{\alpha}, the projection of a new data point is given by

: y(\mathbf{x}) = (\mathbf{w}\cdot\phi(\mathbf{x})) = \sum_{i=1}^{l}\alpha_i k(\mathbf{x}_i,\mathbf{x}).
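
In the two-class case the kernelized procedure therefore reduces to a handful of operations on kernel matrices. The sketch below is purely illustrative, not a reference implementation: it assumes a Gaussian (RBF) kernel, and the helpers rbf_kernel, kfd_two_class and project are hypothetical names introduced here. It builds \mathbf{M}_1, \mathbf{M}_2 and \mathbf{N}, solves the regularized system (\mathbf{N}+\epsilon\mathbf{I})\boldsymbol{\alpha} = \mathbf{M}_2-\mathbf{M}_1, and projects new points via y(\mathbf{x}) = \sum_i\alpha_i k(\mathbf{x}_i,\mathbf{x}).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Illustrative Gaussian kernel, k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kfd_two_class(X1, X2, gamma=1.0, eps=1e-3):
    """Two-class kernel Fisher discriminant coefficients alpha.

    X1, X2: sample matrices of shape (l_1, d) and (l_2, d).
    Returns (alpha, X), where X stacks all training points, so a new point x
    projects to y(x) = sum_i alpha_i * k(x_i, x).
    """
    X = np.vstack([X1, X2])                  # all l = l_1 + l_2 training points
    l = X.shape[0]

    M, N = [], np.zeros((l, l))
    for Xj in (X1, X2):
        lj = Xj.shape[0]
        Kj = rbf_kernel(X, Xj, gamma)        # (K_j)_{nm} = k(x_n, x_m^j), shape (l, l_j)
        M.append(Kj.mean(axis=1))            # (M_j)_n = (1/l_j) sum_m k(x_n, x_m^j)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T

    # Regularise N and solve (N + eps*I) alpha = M_2 - M_1.
    alpha = np.linalg.solve(N + eps * np.eye(l), M[1] - M[0])
    return alpha, X

def project(x_new, alpha, X, gamma=1.0):
    """Projection y(x) = sum_i alpha_i k(x_i, x) onto the discriminant."""
    return rbf_kernel(np.atleast_2d(x_new), X, gamma) @ alpha
```

Solving the regularized linear system directly, rather than explicitly inverting \mathbf{N}_\epsilon, is a standard numerical choice; the result is the same up to floating-point error.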


Multi-class KFD

The extension to cases where there are more than two classes is relatively straightforward. Let c be the number of classes. Then multi-class KFD involves projecting the data into a (c-1)-dimensional space using (c-1) discriminant functions

: y_i = \mathbf{w}_i^{\text{T}}\phi(\mathbf{x}), \qquad i = 1,\ldots,c-1.

This can be written in matrix notation

: \mathbf{y} = \mathbf{W}^{\text{T}}\phi(\mathbf{x}),

where the \mathbf{w}_i are the columns of \mathbf{W}. Further, the between-class covariance matrix is now

: \mathbf{S}_B^\phi = \sum_{i=1}^{c}l_i(\mathbf{m}_i^\phi-\mathbf{m}^\phi)(\mathbf{m}_i^\phi-\mathbf{m}^\phi)^{\text{T}},

where \mathbf{m}^\phi is the mean of all the data in the new feature space. The within-class covariance matrix is

: \mathbf{S}_W^\phi = \sum_{i=1}^{c}\sum_{n=1}^{l_i}(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi)(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi)^{\text{T}}.

The solution is now obtained by maximizing

: J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}}\mathbf{S}_B^\phi\mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}}\mathbf{S}_W^\phi\mathbf{W}\right|}.

The kernel trick can again be used and the goal of multi-class KFD becomes

: \mathbf{A}^* = \underset{\mathbf{A}}{\operatorname{argmax}}\;\frac{\left|\mathbf{A}^{\text{T}}\mathbf{M}\mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}}\mathbf{N}\mathbf{A}\right|},

where \mathbf{A} = [\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_{c-1}] and

: \begin{align} \mathbf{M} & = \sum_{j=1}^{c}l_j(\mathbf{M}_j-\mathbf{M}_*)(\mathbf{M}_j-\mathbf{M}_*)^{\text{T}} \\ \mathbf{N} & = \sum_{j=1}^{c}\mathbf{K}_j(\mathbf{I}-\mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}}. \end{align}

The \mathbf{M}_i are defined as in the above section and \mathbf{M}_* is defined as

: (\mathbf{M}_*)_j = \frac{1}{l}\sum_{k=1}^{l}k(\mathbf{x}_j,\mathbf{x}_k).

\mathbf{A}^* can then be computed by finding the (c-1) leading eigenvectors of \mathbf{N}^{-1}\mathbf{M}. Furthermore, the projection of a new input, \mathbf{x}_t, is given by

: \mathbf{y}(\mathbf{x}_t) = \left(\mathbf{A}^*\right)^{\text{T}}\mathbf{K}_t,

where the i^{\text{th}} component of \mathbf{K}_t is given by k(\mathbf{x}_i,\mathbf{x}_t).
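
A corresponding sketch for the multi-class objective, again purely illustrative and assuming the same Gaussian kernel as above, obtains the columns of \mathbf{A}^* as the (c-1) leading eigenvectors of (\mathbf{N}+\epsilon\mathbf{I})^{-1}\mathbf{M}, with a small ridge \epsilon\mathbf{I} used as the same practical regularizer mentioned in the two-class case. The name kfd_multi_class is hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Same illustrative Gaussian kernel as in the two-class sketch."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kfd_multi_class(class_data, gamma=1.0, eps=1e-3):
    """Multi-class KFD: columns of A are the (c-1) leading eigenvectors of N^{-1} M.

    class_data: list of c arrays, each of shape (l_i, d).
    Returns (A, X), where X stacks all training points.
    """
    X = np.vstack(class_data)                      # all l training points
    l, c = X.shape[0], len(class_data)

    M_star = rbf_kernel(X, X, gamma).mean(axis=1)  # (M_*)_j = (1/l) sum_k k(x_j, x_k)
    M = np.zeros((l, l))
    N = np.zeros((l, l))
    for Xj in class_data:
        lj = Xj.shape[0]
        Kj = rbf_kernel(X, Xj, gamma)
        d = (Kj.mean(axis=1) - M_star)[:, None]    # M_j - M_*
        M += lj * (d @ d.T)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T

    # Leading eigenvectors of (N + eps*I)^{-1} M give the c-1 discriminant directions.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N + eps * np.eye(l), M))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:c - 1]].real, X
```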


Classification using KFD

In both two-class and multi-class KFD, the class label of a new input \mathbf{x} can be assigned as

: f(\mathbf{x}) = \arg\min_j D(\mathbf{y}(\mathbf{x}),\bar{\mathbf{y}}_j),

where \bar{\mathbf{y}}_j is the projected mean for class j and D(\cdot,\cdot) is a distance function.
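
A minimal sketch of this decision rule, assuming Euclidean distance for D and the same illustrative Gaussian kernel as above (the helper names classify and class_indices are again hypothetical):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Same illustrative Gaussian kernel as in the sketches above."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def classify(x_new, coeffs, X_train, class_indices, gamma=1.0):
    """Assign x_new to the class whose projected mean is nearest in Euclidean distance.

    coeffs: alpha vector (two-class) or matrix A (multi-class), one row per
            training point.  class_indices: list of index arrays, one per class,
            selecting the rows of X_train belonging to that class.
    """
    A = np.asarray(coeffs).reshape(X_train.shape[0], -1)   # handle vector or matrix

    # Project the training points and the new point onto the discriminant(s).
    Y_train = rbf_kernel(X_train, X_train, gamma) @ A
    y_new = rbf_kernel(np.atleast_2d(x_new), X_train, gamma) @ A

    # Nearest projected class mean.
    means = np.stack([Y_train[idx].mean(axis=0) for idx in class_indices])
    return int(np.argmin(np.linalg.norm(means - y_new, axis=-1)))
```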


Applications

Kernel discriminant analysis has been used in a variety of applications. These include:

* Face recognition and detection
* Hand-written digit recognition
* Palmprint recognition
* Classification of malignant and benign cluster microcalcifications
* Seed classification
* Search for the Higgs boson at CERN


See also

* Factor analysis
* Kernel principal component analysis
* Kernel trick
* Linear discriminant analysis




External links


* Kernel Discriminant Analysis in C# - C# code to perform KFD.
* - Includes a method for performing KFD.
* Handwriting Recognition using Kernel Discriminant Analysis - C# code that demonstrates handwritten digit recognition using KFD.