Neighbourhood Components Analysis

Neighbourhood components analysis is a supervised learning method for classifying multivariate data into distinct classes according to a given distance metric over the data. Functionally, it serves the same purposes as the ''k''-nearest neighbors algorithm, and makes direct use of a related concept termed ''stochastic nearest neighbours''.


Definition

Neighbourhood components analysis aims at "learning" a distance metric by finding a linear transformation of input data such that the average leave-one-out (LOO) classification performance is maximized in the transformed space. The key insight of the algorithm is that a matrix A corresponding to the transformation can be found by defining a differentiable objective function for A, followed by the use of an iterative solver such as conjugate gradient descent. One of the benefits of this algorithm is that the number of classes k can be determined as a function of A, up to a scalar constant. This use of the algorithm therefore addresses the issue of model selection.


Explanation

In order to define A, we define an objective function describing classification accuracy in the transformed space and try to determine A^* such that this objective function is maximized:

A^* = \operatorname{argmax}_A f(A)


Leave-one-out (LOO) classification

Consider predicting the class label of a single data point by consensus of its k-nearest neighbours with a given distance metric. This is known as ''leave-one-out'' classification. However, the set of nearest-neighbours C_i can be quite different after passing all the points through a linear transformation. Specifically, the set of neighbours for a point can undergo discrete changes in response to smooth changes in the elements of A, implying that any objective function f(\cdot) based on the neighbours of a point will be ''piecewise-constant'', and hence ''not differentiable''.
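To make the piecewise-constant behaviour concrete, here is a minimal NumPy sketch (illustrative, not from the original paper; function and variable names are assumptions) of leave-one-out 1-NN accuracy under a linear transform A. Sweeping an entry of A smoothly changes the accuracy only in discrete jumps, which is why this quantity cannot be optimized by gradient methods directly.

```python
import numpy as np

def loo_knn_accuracy(A, X, y):
    """Fraction of points whose nearest neighbour (excluding itself),
    measured in the space transformed by A, shares their class label."""
    Z = X @ A.T                        # transform every point
    correct = 0
    for i in range(len(X)):
        d = np.sum((Z - Z[i]) ** 2, axis=1)
        d[i] = np.inf                  # leave the point itself out
        correct += int(y[np.argmin(d)] == y[i])
    return correct / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)

# Scaling one axis of A changes which points are nearest neighbours,
# so the accuracy moves in steps rather than continuously.
for s in (0.5, 1.0, 2.0):
    print(s, loo_knn_accuracy(np.diag([s, 1.0]), X, y))
```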


Solution

We can resolve this difficulty by using an approach inspired by stochastic gradient descent. Rather than considering the k-nearest neighbours at each transformed point in LOO-classification, we'll consider the entire transformed data set as ''stochastic nearest neighbours''. We define these using a softmax function of the squared Euclidean distance between a given LOO-classification point and each other point in the transformed space:

p_{ij} = \begin{cases} \dfrac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \ne i} \exp(-\|Ax_i - Ax_k\|^2)}, & \text{if } j \ne i \\ 0, & \text{if } j = i \end{cases}

The probability of correctly classifying data point i is the probability of classifying the points of each of its neighbours with the same class C_i:

p_i = \sum_{j \in C_i} p_{ij}

where p_{ij} is the probability of classifying neighbour j of point i.

Define the objective function using LOO classification, this time using the entire data set as stochastic nearest neighbours:

f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i

Note that under stochastic nearest neighbours, the consensus class for a single point i is the expected value of a point's class in the limit of an infinite number of samples drawn from the distribution over its neighbours j \in C_i, i.e.: P(\mathrm{Class}(X_i) = \mathrm{Class}(X_j)) = p_{ij}. Thus the predicted class is an affine combination of the classes of every other point, weighted by the softmax function for each j, where the set of neighbours is now the entire transformed data set.

This choice of objective function is preferable as it is differentiable with respect to A (denote x_{ij} = x_i - x_j):

\frac{\partial f}{\partial A} = -2A \sum_i \sum_{j \in C_i} p_{ij} \left( x_{ij} x_{ij}^T - \sum_k p_{ik} x_{ik} x_{ik}^T \right) = 2A \sum_i \left( p_i \sum_k p_{ik} x_{ik} x_{ik}^T - \sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T \right)

Obtaining a gradient for A means that it can be found with an iterative solver such as conjugate gradient descent. Note that in practice, most of the innermost terms of the gradient evaluate to insignificant contributions due to the rapidly diminishing contribution of distant points from the point of interest. This means that the inner sum of the gradient can be truncated, resulting in reasonable computation times even for large data sets.
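The stochastic-neighbour probabilities p_{ij}, the objective f(A), and its gradient translate directly into NumPy. The following is a sketch under assumed names (it is not the authors' implementation), using the second form of the gradient above; C_i is realized as a boolean same-class mask.

```python
import numpy as np

def nca_objective_and_grad(A, X, y):
    """Return f(A) and df/dA for the NCA objective f(A) = sum_i p_i."""
    Z = X @ A.T                                          # transformed points
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)                          # forces p_ii = 0
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)                    # softmax rows: p_ij
    same = y[:, None] == y[None, :]                      # mask for j in C_i
    p = (P * same).sum(axis=1)                           # p_i
    f = p.sum()
    n, d = X.shape
    G = np.zeros((d, d))
    for i in range(n):
        Xik = X[i] - X                                   # rows are x_i - x_k
        S_all = Xik.T @ (P[i][:, None] * Xik)            # sum_k p_ik x_ik x_ik^T
        S_same = Xik.T @ ((P[i] * same[i])[:, None] * Xik)
        G += p[i] * S_all - S_same
    return f, 2 * A @ G

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([0, 1] * 10)
f, grad = nca_objective_and_grad(np.eye(2), X, y)
```

The gradient can be checked against a finite difference of f, which is a useful sanity test before handing it to an iterative solver.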


Alternative formulation

"Maximizing f(\cdot) is equivalent to minimizing the L_1-distance between the predicted class distribution and the true class distribution (i.e.: where the p_i induced by A are all equal to 1). A natural alternative is the KL-divergence, which induces the following objective function and gradient:" (Goldberger 2005)

g(A) = \sum_i \log \left( \sum_{j \in C_i} p_{ij} \right) = \sum_i \log(p_i)

\frac{\partial g}{\partial A} = 2A \sum_i \left( \sum_k p_{ik} x_{ik} x_{ik}^T - \frac{\sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T}{p_i} \right)

In practice, optimization of A using this function tends to give similar performance results as with the original.
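Under the same assumed setup as the original objective, the log formulation g(A) = \sum_i \log(p_i) and its gradient can be sketched as follows (illustrative code, not the authors'; the only change from the f(A) gradient is the division of the same-class term by p_i):

```python
import numpy as np

def nca_log_objective_and_grad(A, X, y):
    """Return g(A) = sum_i log(p_i) and dg/dA."""
    Z = X @ A.T
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)                          # p_ii = 0
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)                    # p_ij
    same = y[:, None] == y[None, :]                      # j in C_i
    p = (P * same).sum(axis=1)                           # p_i
    g = np.log(p).sum()
    n, d = X.shape
    G = np.zeros((d, d))
    for i in range(n):
        Xik = X[i] - X
        S_all = Xik.T @ (P[i][:, None] * Xik)            # sum_k p_ik x_ik x_ik^T
        S_same = Xik.T @ ((P[i] * same[i])[:, None] * Xik)
        G += S_all - S_same / p[i]                       # the 1/p_i weighting
    return g, 2 * A @ G

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.array([0, 1] * 10)
g, grad = nca_log_objective_and_grad(np.eye(2), X, y)
```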


History and background

Neighbourhood components analysis was developed by Jacob Goldberger, Sam Roweis, Ruslan Salakhutdinov, and Geoff Hinton at the University of Toronto's department of computer science in 2004.


See also

* Spectral clustering
* Large margin nearest neighbor


References

* J. Goldberger, G. Hinton, S. Roweis, R. Salakhutdinov (2005). "Neighbourhood Components Analysis". ''Advances in Neural Information Processing Systems'' 17, 513–520.


External links


Software

* The MLPACK library contains an implementation, nca (C++)
* scikit-learn's NeighborhoodComponentsAnalysis implementation (Python)
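Assuming scikit-learn is installed, a typical use of its NeighborhoodComponentsAnalysis is to learn a low-rank transformation and then classify with k-NN in the transformed space (the dataset and parameter choices below are illustrative):

```python
# Usage sketch for scikit-learn's NeighborhoodComponentsAnalysis.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn a 2-dimensional transformation A, then run k-NN on A x.
model = make_pipeline(
    NeighborhoodComponentsAnalysis(n_components=2, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```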