In statistics, kernel Fisher discriminant analysis (KFD), also known as generalized discriminant analysis and kernel discriminant analysis, is a kernelized version of linear discriminant analysis (LDA). It is named after
Ronald Fisher.


Linear discriminant analysis

Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, \mathbf{C}_1 and \mathbf{C}_2, we can calculate the mean value of each class, \mathbf{m}_1 and \mathbf{m}_2, as

: \mathbf{m}_i = \frac{1}{l_i}\sum_{n=1}^{l_i}\mathbf{x}_n^i,

where l_i is the number of examples of class \mathbf{C}_i. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small. This is formulated as maximizing, with respect to \mathbf{w}, the following ratio:

: J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}},

where \mathbf{S}_B is the between-class covariance matrix and \mathbf{S}_W is the total within-class covariance matrix:

: \begin{align} \mathbf{S}_B & = (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\text{T}} \\ \mathbf{S}_W & = \sum_{i=1,2}\sum_{n=1}^{l_i}(\mathbf{x}_n^i-\mathbf{m}_i)(\mathbf{x}_n^i-\mathbf{m}_i)^{\text{T}}. \end{align}

The maximum of the above ratio is attained at

: \mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1),

as can be shown by the Lagrange multiplier method (sketch of proof): maximizing J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}} is equivalent to maximizing

: \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}

subject to

: \mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} = 1.

This, in turn, is equivalent to maximizing I(\mathbf{w},\lambda) = \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w} - \lambda(\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} - 1), where \lambda is the Lagrange multiplier. At the maximum, the derivatives of I(\mathbf{w},\lambda) with respect to \mathbf{w} and \lambda must be zero. Taking \frac{\partial I}{\partial\mathbf{w}} = \mathbf{0} yields

: \mathbf{S}_B\mathbf{w} - \lambda\mathbf{S}_W\mathbf{w} = \mathbf{0},

which is trivially satisfied by \mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1) and \lambda = (\mathbf{m}_2-\mathbf{m}_1)^{\text{T}}\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1).
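
As a concrete illustration of the closed-form solution above, the following minimal NumPy sketch (not part of the original article; the names lda_direction, X1 and X2 are chosen purely for illustration) computes a unit vector in the direction \mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1) for two clouds of labeled points.

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher discriminant direction w ∝ S_W^{-1}(m_2 - m_1) for two classes.

    X1, X2: arrays of shape (l_1, d) and (l_2, d) holding the samples of
    class 1 and class 2 as rows.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

    # Total within-class scatter S_W: sum of centred outer products over both classes.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

    # Solve S_W w = m_2 - m_1 rather than forming the inverse explicitly.
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Toy usage: two Gaussian blobs separated along the first coordinate.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X2 = rng.normal(loc=[3.0, 0.0], size=(100, 2))
print(lda_direction(X1, X2))
```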


Extending LDA

To extend LDA to non-linear mappings, the data, given as the \ell points \mathbf{x}_i, can be mapped to a new feature space, F, via some function \phi. In this new feature space, the function that needs to be maximized is

: J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B^\phi\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w}},

where

: \begin{align} \mathbf{S}_B^\phi & = \left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)\left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)^{\text{T}} \\ \mathbf{S}_W^\phi & = \sum_{i=1,2}\sum_{n=1}^{l_i}\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi\right)\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi\right)^{\text{T}}, \end{align}

and

: \mathbf{m}_i^\phi = \frac{1}{l_i}\sum_{j=1}^{l_i}\phi(\mathbf{x}_j^i).

Further, note that \mathbf{w}\in F. Explicitly computing the mappings \phi(\mathbf{x}_i) and then performing LDA can be computationally expensive, and in many cases intractable. For example, F may be infinite-dimensional. Thus, rather than explicitly mapping the data to F, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and using kernel functions, in which the dot product in the new feature space is replaced by a kernel function, k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})\cdot\phi(\mathbf{y}).

LDA can be reformulated in terms of dot products by first noting that \mathbf{w} will have an expansion of the form

: \mathbf{w} = \sum_{i=1}^{l}\alpha_i\phi(\mathbf{x}_i).

Then note that

: \mathbf{w}^{\text{T}}\mathbf{m}_i^\phi = \frac{1}{l_i}\sum_{j=1}^{l}\sum_{k=1}^{l_i}\alpha_j k\left(\mathbf{x}_j,\mathbf{x}_k^i\right) = \boldsymbol{\alpha}^{\text{T}}\mathbf{M}_i,

where

: (\mathbf{M}_i)_j = \frac{1}{l_i}\sum_{k=1}^{l_i}k(\mathbf{x}_j,\mathbf{x}_k^i).

The numerator of J(\mathbf{w}) can then be written as:

: \mathbf{w}^{\text{T}}\mathbf{S}_B^\phi\mathbf{w} = \mathbf{w}^{\text{T}}\left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)\left(\mathbf{m}_2^\phi-\mathbf{m}_1^\phi\right)^{\text{T}}\mathbf{w} = \boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}, \qquad \text{where} \qquad \mathbf{M} = (\mathbf{M}_2-\mathbf{M}_1)(\mathbf{M}_2-\mathbf{M}_1)^{\text{T}}.

Similarly, the denominator can be written as

: \mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w} = \boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}, \qquad \text{where} \qquad \mathbf{N} = \sum_{j=1,2}\mathbf{K}_j(\mathbf{I}-\mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}},

with the (n,m) component of \mathbf{K}_j defined as k(\mathbf{x}_n,\mathbf{x}_m^j), \mathbf{I} the identity matrix, and \mathbf{1}_{l_j} the matrix with all entries equal to 1/l_j. This identity can be derived by starting out with the expression for \mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w} and using the expansion of \mathbf{w} and the definitions of \mathbf{S}_W^\phi and \mathbf{m}_i^\phi:

: \begin{align}
\mathbf{w}^{\text{T}}\mathbf{S}_W^\phi\mathbf{w}
&= \left(\sum_{i=1}^{l}\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\right)\left(\sum_{j=1,2}\sum_{n=1}^{l_j}\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)^{\text{T}}\right)\left(\sum_{k=1}^{l}\alpha_k\phi(\mathbf{x}_k)\right) \\
&= \sum_{j=1,2}\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^\phi\right)^{\text{T}}\alpha_k\phi(\mathbf{x}_k)\right) \\
&= \sum_{j=1,2}\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i k(\mathbf{x}_i,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{p=1}^{l_j}\alpha_i k(\mathbf{x}_i,\mathbf{x}_p^j)\right)\left(\alpha_k k(\mathbf{x}_k,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{q=1}^{l_j}\alpha_k k(\mathbf{x}_k,\mathbf{x}_q^j)\right) \\
&= \sum_{j=1,2}\left(\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j) - \frac{2\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j) + \frac{\alpha_i\alpha_k}{l_j^2}\sum_{p=1}^{l_j}\sum_{q=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_p^j)k(\mathbf{x}_k,\mathbf{x}_q^j)\right)\right) \\
&= \sum_{j=1,2}\left(\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j) - \frac{\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j)\right)\right) \\
&= \sum_{j=1,2}\boldsymbol{\alpha}^{\text{T}}\mathbf{K}_j\mathbf{K}_j^{\text{T}}\boldsymbol{\alpha} - \boldsymbol{\alpha}^{\text{T}}\mathbf{K}_j\mathbf{1}_{l_j}\mathbf{K}_j^{\text{T}}\boldsymbol{\alpha} \\
&= \boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}.
\end{align}

With these equations for the numerator and denominator of J(\mathbf{w}), the equation for J can be rewritten as

: J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}}{\boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}}.

Then, differentiating and setting equal to zero gives

: (\boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha})\mathbf{N}\boldsymbol{\alpha} = (\boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha})\mathbf{M}\boldsymbol{\alpha}.

Since only the direction of \mathbf{w}, and hence the direction of \boldsymbol{\alpha}, matters, the above can be solved for \boldsymbol{\alpha} as

: \boldsymbol{\alpha} = \mathbf{N}^{-1}(\mathbf{M}_2-\mathbf{M}_1).

Note that in practice, \mathbf{N} is usually singular and so a multiple of the identity is added to it:

: \mathbf{N}_\epsilon = \mathbf{N}+\epsilon\mathbf{I}.

Given the solution for \boldsymbol{\alpha}, the projection of a new data point is given by

: y(\mathbf{x}) = (\mathbf{w}\cdot\phi(\mathbf{x})) = \sum_{i=1}^{l}\alpha_i k(\mathbf{x}_i,\mathbf{x}).
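
In the two-class case the kernelized procedure therefore reduces to a handful of operations on kernel matrices. The sketch below is purely illustrative, not a reference implementation: it assumes a Gaussian (RBF) kernel, and the helpers rbf_kernel, kfd_two_class and project are hypothetical names introduced here. It builds \mathbf{M}_1, \mathbf{M}_2 and \mathbf{N}, solves the regularized system (\mathbf{N}+\epsilon\mathbf{I})\boldsymbol{\alpha} = \mathbf{M}_2-\mathbf{M}_1, and projects new points via y(\mathbf{x}) = \sum_i\alpha_i k(\mathbf{x}_i,\mathbf{x}).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Illustrative Gaussian kernel, k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kfd_two_class(X1, X2, gamma=1.0, eps=1e-3):
    """Two-class kernel Fisher discriminant coefficients alpha.

    X1, X2: sample matrices of shape (l_1, d) and (l_2, d).
    Returns (alpha, X), where X stacks all training points, so a new point x
    projects to y(x) = sum_i alpha_i * k(x_i, x).
    """
    X = np.vstack([X1, X2])                  # all l = l_1 + l_2 training points
    l = X.shape[0]

    M, N = [], np.zeros((l, l))
    for Xj in (X1, X2):
        lj = Xj.shape[0]
        Kj = rbf_kernel(X, Xj, gamma)        # (K_j)_{nm} = k(x_n, x_m^j), shape (l, l_j)
        M.append(Kj.mean(axis=1))            # (M_j)_n = (1/l_j) sum_m k(x_n, x_m^j)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T

    # Regularise N and solve (N + eps*I) alpha = M_2 - M_1.
    alpha = np.linalg.solve(N + eps * np.eye(l), M[1] - M[0])
    return alpha, X

def project(x_new, alpha, X, gamma=1.0):
    """Projection y(x) = sum_i alpha_i k(x_i, x) onto the discriminant."""
    return rbf_kernel(np.atleast_2d(x_new), X, gamma) @ alpha
```

Solving the regularized linear system directly, rather than explicitly inverting \mathbf{N}_\epsilon, is a standard numerical choice; the result is the same up to floating-point error.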


Multi-class KFD

The extension to cases where there are more than two classes is relatively straightforward. Let c be the number of classes. Then multi-class KFD involves projecting the data into a (c-1)-dimensional space using (c-1) discriminant functions

: y_i = \mathbf{w}_i^{\text{T}}\phi(\mathbf{x}), \qquad i = 1,\ldots,c-1.

This can be written in matrix notation

: \mathbf{y} = \mathbf{W}^{\text{T}}\phi(\mathbf{x}),

where the \mathbf{w}_i are the columns of \mathbf{W}. Further, the between-class covariance matrix is now

: \mathbf{S}_B^\phi = \sum_{i=1}^{c}l_i(\mathbf{m}_i^\phi-\mathbf{m}^\phi)(\mathbf{m}_i^\phi-\mathbf{m}^\phi)^{\text{T}},

where \mathbf{m}^\phi is the mean of all the data in the new feature space. The within-class covariance matrix is

: \mathbf{S}_W^\phi = \sum_{i=1}^{c}\sum_{n=1}^{l_i}(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi)(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^\phi)^{\text{T}}.

The solution is now obtained by maximizing

: J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}}\mathbf{S}_B^\phi\mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}}\mathbf{S}_W^\phi\mathbf{W}\right|}.

The kernel trick can again be used and the goal of multi-class KFD becomes

: \mathbf{A}^* = \underset{\mathbf{A}}{\operatorname{argmax}}\;\frac{\left|\mathbf{A}^{\text{T}}\mathbf{M}\mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}}\mathbf{N}\mathbf{A}\right|},

where \mathbf{A} = [\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_{c-1}] and

: \begin{align} \mathbf{M} & = \sum_{j=1}^{c}l_j(\mathbf{M}_j-\mathbf{M}_*)(\mathbf{M}_j-\mathbf{M}_*)^{\text{T}} \\ \mathbf{N} & = \sum_{j=1}^{c}\mathbf{K}_j(\mathbf{I}-\mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}}. \end{align}

The \mathbf{M}_i are defined as in the above section and \mathbf{M}_* is defined as

: (\mathbf{M}_*)_j = \frac{1}{l}\sum_{k=1}^{l}k(\mathbf{x}_j,\mathbf{x}_k).

\mathbf{A}^* can then be computed by finding the (c-1) leading eigenvectors of \mathbf{N}^{-1}\mathbf{M}. Furthermore, the projection of a new input, \mathbf{x}_t, is given by

: \mathbf{y}(\mathbf{x}_t) = \left(\mathbf{A}^*\right)^{\text{T}}\mathbf{K}_t,

where the i^{\text{th}} component of \mathbf{K}_t is given by k(\mathbf{x}_i,\mathbf{x}_t).
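
A corresponding sketch for the multi-class objective, again purely illustrative and assuming the same Gaussian kernel as above, obtains the columns of \mathbf{A}^* as the (c-1) leading eigenvectors of (\mathbf{N}+\epsilon\mathbf{I})^{-1}\mathbf{M}, with a small ridge \epsilon\mathbf{I} used as the same practical regularizer mentioned in the two-class case. The name kfd_multi_class is hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Same illustrative Gaussian kernel as in the two-class sketch."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kfd_multi_class(class_data, gamma=1.0, eps=1e-3):
    """Multi-class KFD: columns of A are the (c-1) leading eigenvectors of N^{-1} M.

    class_data: list of c arrays, each of shape (l_i, d).
    Returns (A, X), where X stacks all training points.
    """
    X = np.vstack(class_data)                      # all l training points
    l, c = X.shape[0], len(class_data)

    M_star = rbf_kernel(X, X, gamma).mean(axis=1)  # (M_*)_j = (1/l) sum_k k(x_j, x_k)
    M = np.zeros((l, l))
    N = np.zeros((l, l))
    for Xj in class_data:
        lj = Xj.shape[0]
        Kj = rbf_kernel(X, Xj, gamma)
        d = (Kj.mean(axis=1) - M_star)[:, None]    # M_j - M_*
        M += lj * (d @ d.T)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T

    # Leading eigenvectors of (N + eps*I)^{-1} M give the c-1 discriminant directions.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N + eps * np.eye(l), M))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:c - 1]].real, X
```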


Classification using KFD

In both two-class and multi-class KFD, the class label of a new input \mathbf{x} can be assigned as

: f(\mathbf{x}) = \arg\min_j D(\mathbf{y}(\mathbf{x}),\bar{\mathbf{y}}_j),

where \bar{\mathbf{y}}_j is the projected mean for class j and D(\cdot,\cdot) is a distance function.
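
A minimal sketch of this decision rule, assuming Euclidean distance for D and the same illustrative Gaussian kernel as above (the helper names classify and class_indices are again hypothetical):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Same illustrative Gaussian kernel as in the sketches above."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def classify(x_new, coeffs, X_train, class_indices, gamma=1.0):
    """Assign x_new to the class whose projected mean is nearest in Euclidean distance.

    coeffs: alpha vector (two-class) or matrix A (multi-class), one row per
            training point.  class_indices: list of index arrays, one per class,
            selecting the rows of X_train belonging to that class.
    """
    A = np.asarray(coeffs).reshape(X_train.shape[0], -1)   # handle vector or matrix

    # Project the training points and the new point onto the discriminant(s).
    Y_train = rbf_kernel(X_train, X_train, gamma) @ A
    y_new = rbf_kernel(np.atleast_2d(x_new), X_train, gamma) @ A

    # Nearest projected class mean.
    means = np.stack([Y_train[idx].mean(axis=0) for idx in class_indices])
    return int(np.argmin(np.linalg.norm(means - y_new, axis=-1)))
```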


Applications

Kernel discriminant analysis has been used in a variety of applications. These include:

* Face recognition and detection
* Hand-written digit recognition
* Palmprint recognition
* Classification of malignant and benign cluster microcalcifications
* Seed classification
* Search for the Higgs boson at CERN


See also

* Factor analysis
* Kernel principal component analysis
* Kernel trick
* Linear discriminant analysis




External links


* Kernel Discriminant Analysis in C# - C# code to perform KFD.
* - Includes a method for performing KFD.
* Handwriting Recognition using Kernel Discriminant Analysis - C# code that demonstrates handwritten digit recognition using KFD.