In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset. In application to image segmentation, spectral clustering is known as segmentation-based object categorization.

Definitions

Given an enumerated set of data points, the similarity matrix may be defined as a symmetric matrix $A$, where $A_{ij} \geq 0$ represents a measure of the similarity between data points with indices $i$ and $j$. The general approach to spectral clustering is to use a standard clustering method (there are many such methods; k-means is discussed below) on relevant eigenvectors of a Laplacian matrix of $A$. There are many different ways to define a Laplacian, each with a different mathematical interpretation, so the clustering will also have different interpretations. The relevant eigenvectors are those corresponding to the several smallest eigenvalues of the Laplacian, except for the smallest eigenvalue, which has a value of 0. For computational efficiency, these eigenvectors are often computed as the eigenvectors corresponding to the several largest eigenvalues of a function of the Laplacian.

Laplacian matrix

Spectral clustering is well known to relate to partitioning of a mass-spring system, where each mass is associated with a data point and each spring stiffness corresponds to the weight of an edge describing the similarity of the two related data points, as in the spring system. Specifically, the classical reference explains that the eigenvalue problem describing transversal vibration modes of a mass-spring system is exactly the same as the eigenvalue problem for the graph Laplacian matrix defined as

$L := D - A$,

where $D$ is the diagonal matrix with entries

$D_{ii} = \sum_j A_{ij}$,

and $A$ is the adjacency matrix.

The masses that are tightly connected by the springs in the mass-spring system evidently move together from the equilibrium position in low-frequency vibration modes, so the components of the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian can be used for meaningful clustering of the masses. For example, assuming that all the springs and the masses are identical in the 2-dimensional spring system pictured, one would intuitively expect that the most loosely connected masses on the right-hand side of the system would move with the largest amplitude and in the opposite direction to the rest of the masses when the system is shaken. This expectation is confirmed by analyzing the components of the eigenvectors of the graph Laplacian corresponding to the smallest eigenvalues, i.e., the smallest vibration frequencies.
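The definition $L = D - A$ and the mass-spring intuition can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical toy graph (two triangles of unit-weight edges joined by one weak edge); the graph and all variable names are illustrative and not taken from the references.

```python
import numpy as np

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} with unit weights,
# joined by a single weak edge (2, 3) of weight 0.1.
A = np.array([
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1,   0, 1],
    [0, 0, 0, 1,   1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix, D_ii = sum_j A_ij
L = D - A                   # unnormalized graph Laplacian

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# The smallest eigenvalue is 0: shifting all masses equally stretches no spring.
print("smallest eigenvalue:", eigenvalues[0])

# The next eigenvector is the lowest nontrivial vibration mode; the two
# tightly connected groups move in opposite directions.
fiedler = eigenvectors[:, 1]
print("signs:", np.sign(fiedler))
```

The overall sign of an eigenvector is arbitrary, so only the fact that the two triangles receive opposite signs is meaningful.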

Laplacian matrix normalization

The goal of normalization is to make all diagonal entries of the Laplacian matrix equal to one, scaling the off-diagonal entries correspondingly. In a weighted graph, a vertex may have a large degree because of a small number of connected edges with large weights just as well as because of a large number of connected edges with unit weights.

A popular normalized spectral clustering technique is the normalized cuts algorithm or Shi–Malik algorithm introduced by Jianbo Shi and Jitendra Malik (Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on PAMI, Vol. 22, No. 8, Aug 2000), commonly used for image segmentation. It partitions points into two sets $(B_1, B_2)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the symmetric normalized Laplacian defined as

$L^\text{norm} := I - D^{-1/2} A D^{-1/2}$.

The vector $v$ is also the eigenvector corresponding to the second-largest eigenvalue of the symmetrically normalized adjacency matrix $D^{-1/2} A D^{-1/2}$.

The random walk (or left) normalized Laplacian is defined as

$L^\text{rw} := D^{-1} L = I - D^{-1} A$

and can also be used for spectral clustering. A mathematically equivalent algorithm takes the eigenvector $u$ corresponding to the second-largest eigenvalue of the random walk normalized adjacency matrix $P = D^{-1} A$ (the largest eigenvalue, 1, has a constant eigenvector that carries no clustering information).

The eigenvector $v$ of the symmetrically normalized Laplacian and the eigenvector $u$ of the left normalized Laplacian are related by the identity

$D^{-1/2} v = u$.
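The relations between the two normalized Laplacians can be verified on a small example. This is a minimal sketch, assuming NumPy and the same kind of hypothetical toy graph (two triangles joined by a weak edge); it checks that $u = D^{-1/2} v$ is an eigenvector of $L^\text{rw}$ with the same eigenvalue as $v$, and of $P = D^{-1} A$ with eigenvalue one minus that value.

```python
import numpy as np

# Hypothetical toy graph: two unit-weight triangles joined by a weak edge.
A = np.array([
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1,   0, 1],
    [0, 0, 0, 1,   1, 0],
], dtype=float)
n = len(A)
d = A.sum(axis=1)                      # vertex degrees

D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
D_inv = np.diag(1.0 / d)

A_sym = D_inv_sqrt @ A @ D_inv_sqrt    # symmetrically normalized adjacency
L_sym = np.eye(n) - A_sym              # symmetric normalized Laplacian
L_rw = np.eye(n) - D_inv @ A           # random walk (left) normalized Laplacian
P = D_inv @ A                          # random walk normalized adjacency

# Second-smallest eigenpair of the symmetric normalized Laplacian.
w, V = np.linalg.eigh(L_sym)
lam, v = w[1], V[:, 1]

# Map v to the eigenvector of the left normalized Laplacian.
u = D_inv_sqrt @ v

residual_rw = np.linalg.norm(L_rw @ u - lam * u)
residual_P = np.linalg.norm(P @ u - (1.0 - lam) * u)
second_largest_sym = np.linalg.eigvalsh(A_sym)[-2]

print(residual_rw, residual_P, second_largest_sym, 1.0 - lam)
```

Both residuals should be at the level of rounding error, and the second-largest eigenvalue of the normalized adjacency matrix should equal one minus the second-smallest eigenvalue of the normalized Laplacian.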

Cluster analysis via Spectral Embedding

Knowing the $n$-by-$k$ matrix $V$ of selected eigenvectors, a mapping of the original $n$ data points to a $k$-dimensional vector space, called spectral embedding, is performed using the rows of $V$. Now the analysis is reduced to clustering vectors with $k$ components, which may be done in various ways.

In the simplest case $k=1$, the selected single eigenvector $v$, called the Fiedler vector, corresponds to the second smallest eigenvalue. Using the components of $v$, one can place all points whose component in $v$ is positive in the set $B_+$ and the rest in $B_-$, thus bi-partitioning the graph and labeling the data points with two labels. This sign-based approach follows the intuitive explanation of spectral clustering via the mass-spring model: in the low-frequency vibration mode that the Fiedler vector $v$ represents, the data points of one cluster, identified with mutually strongly connected masses, move together in one direction, while the data points of the complement cluster, identified with the remaining masses, move together in the opposite direction. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in the same fashion.

In the general case $k>1$, any vector clustering technique can be used, e.g., DBSCAN.
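The sign-based bi-partition for $k=1$ can be sketched as follows. This is an illustrative example only, assuming NumPy, a hypothetical one-dimensional data set, and a Gaussian (RBF) similarity with unit bandwidth; the function name `fiedler_bipartition` is made up for this sketch.

```python
import numpy as np

def fiedler_bipartition(A):
    """Split node indices into (B_plus, B_minus) by the sign of the Fiedler
    vector of the symmetric normalized Laplacian of similarity matrix A."""
    d = A.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    # I - D^{-1/2} A D^{-1/2}, formed by row and column scaling.
    L_sym = np.eye(len(A)) - inv_sqrt_d[:, None] * A * inv_sqrt_d[None, :]
    _, V = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    v = V[:, 1]                        # Fiedler vector (second smallest)
    return np.flatnonzero(v > 0), np.flatnonzero(v <= 0)

# Hypothetical data: two groups of points on a line.
x = np.array([0.0, 0.2, 0.4, 3.0, 3.2, 3.4])

# Gaussian similarity with unit bandwidth (an assumption of this sketch).
A = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)
np.fill_diagonal(A, 0.0)               # no self-similarity

B_plus, B_minus = fiedler_bipartition(A)
print(sorted(B_plus.tolist()), sorted(B_minus.tolist()))
```

Which group ends up in `B_plus` depends on the arbitrary sign of the computed eigenvector; only the split itself is meaningful. Applying the same function recursively to each subset gives the hierarchical variant mentioned above.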

Algorithms

Basic Algorithm

1. Calculate the Laplacian $L$ (or the normalized Laplacian).
2. Calculate the first $k$ eigenvectors (the eigenvectors corresponding to the $k$ smallest eigenvalues of $L$).
3. Consider the matrix formed by the first $k$ eigenvectors; the $l$-th row defines the features of graph node $l$.
4. Cluster the graph nodes based on these features (e.g., using k-means clustering).

If the similarity matrix $A$ has not already been explicitly constructed, the efficiency of spectral clustering may be improved if the solution to the corresponding eigenvalue problem is performed in a matrix-free fashion (without explicitly manipulating or even computing the similarity matrix), as in the Lanczos algorithm.

For large-sized graphs, the second eigenvalue of the (normalized) graph Laplacian matrix is often ill-conditioned, leading to slow convergence of iterative eigenvalue solvers. Preconditioning is a key technology accelerating the convergence, e.g., in the matrix-free LOBPCG method. Spectral clustering has been successfully applied on large graphs by first identifying their community structure, and then clustering communities.

Spectral clustering is closely related to nonlinear dimensionality reduction, and dimension reduction techniques such as locally-linear embedding can be used to reduce errors from noise or outliers.
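The four steps of the basic algorithm can be sketched in a few lines. This is a minimal illustration, assuming NumPy, a dense eigensolver, and a hypothetical toy graph of three triangles; the small `kmeans` helper is a plain Lloyd iteration with a deterministic farthest-point start, written here only to keep the example self-contained, and is not a substitute for a library implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with farthest-point initialization (illustrative)."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen center.
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_clustering(A, k):
    L = np.diag(A.sum(axis=1)) - A     # step 1: Laplacian L = D - A
    _, V = np.linalg.eigh(L)           # step 2: eigenvalues ascending
    features = V[:, :k]                # step 3: row l = features of node l
    return kmeans(features, k)         # step 4: cluster the rows

# Hypothetical toy graph: three unit-weight triangles chained by weak edges.
A = np.zeros((9, 9))
for base in (0, 3, 6):
    for i in range(3):
        for j in range(i + 1, 3):
            A[base + i, base + j] = A[base + j, base + i] = 1.0
for i, j in [(2, 3), (5, 6)]:
    A[i, j] = A[j, i] = 0.05

labels = spectral_clustering(A, 3)
print(labels)
```

The label values themselves are arbitrary; what matters is that nodes of the same triangle share a label. For larger problems the dense `eigh` call would be replaced by an iterative solver that computes only the few needed eigenvectors.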

Costs

Denoting the number of data points by $n$, it is important to estimate the memory footprint and compute time, or number of arithmetic operations (AO) performed, as a function of $n$. Regardless of the spectral clustering algorithm, the two main costly items are the construction of the graph Laplacian and the determination of its $k$ eigenvectors for the spectral embedding. The last step, determining the labels from the $n$-by-$k$ matrix of eigenvectors, is typically the least expensive, requiring only $kn$ AO and creating just an $n$-by-$1$ vector of labels in memory.

The need to construct the graph Laplacian is common to all distance- or correlation-based clustering methods. Computing the eigenvectors is specific to spectral clustering only.

Constructing graph Laplacian

The graph Laplacian can be, and commonly is, constructed from the adjacency matrix. The construction can be performed matrix-free, i.e., without explicitly forming the matrix of the graph Laplacian and with no AO. It can also be performed in place of the adjacency matrix without increasing the memory footprint. Either way, the cost of constructing the graph Laplacian is essentially determined by the cost of constructing the $n$-by-$n$ graph adjacency matrix.

Moreover, a normalized Laplacian has exactly the same eigenvectors as the normalized adjacency matrix, but with the order of the eigenvalues reversed. Thus, instead of computing the eigenvectors corresponding to the smallest eigenvalues of the normalized Laplacian, one can equivalently compute the eigenvectors corresponding to the largest eigenvalues of the normalized adjacency matrix, without referring to the Laplacian matrix at all.

Naive constructions of the graph adjacency matrix, e.g., using the RBF kernel, make it dense, thus requiring $n^2$ memory and $n^2$ AO to determine each of the $n^2$ entries of the matrix. The Nyström method can be used to approximate the similarity matrix, but the approximate matrix is not elementwise positive, i.e., it cannot be interpreted as a distance-based similarity.

Algorithms to construct the graph adjacency matrix as a sparse matrix are typically based on a nearest neighbor search, which estimates or samples a neighborhood of a given data point for nearest neighbors and computes the non-zero entries of the adjacency matrix by comparing only pairs of neighbors. The number of selected nearest neighbors thus determines the number of non-zero entries, and is often fixed so that the memory footprint of the $n$-by-$n$ graph adjacency matrix is only $O(n)$, only $O(n)$ sequential arithmetic operations are needed to compute the $O(n)$ non-zero entries, and the calculations can be trivially run in parallel.
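A sparse adjacency matrix based on a nearest neighbor search can be sketched as follows. This is one possible construction, assuming NumPy and SciPy (`scipy.spatial.cKDTree` for the search and `scipy.sparse` for storage); the function name `knn_adjacency`, the Gaussian weights, and the symmetrization rule are choices made for this example rather than a prescribed method.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import cKDTree

def knn_adjacency(points, n_neighbors, sigma=1.0):
    """Sparse symmetric adjacency with Gaussian weights on k-nearest-neighbor edges."""
    n = len(points)
    tree = cKDTree(points)
    # Ask for one extra neighbor: for distinct points, the nearest one
    # (column 0) is the query point itself at distance 0.
    dist, idx = tree.query(points, k=n_neighbors + 1)
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx[:, 1:].ravel()
    vals = np.exp(-dist[:, 1:].ravel() ** 2 / (2.0 * sigma ** 2))
    A = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # Symmetrize: keep an edge if either endpoint selected the other.
    return A.maximum(A.T)

# Hypothetical data: 200 random points in the unit square.
rng = np.random.default_rng(0)
points = rng.random((200, 2))
n_neighbors = 5

A = knn_adjacency(points, n_neighbors)
print(A.shape, A.nnz)
```

With the number of neighbors fixed, the matrix stores at most $2 \cdot n \cdot$ `n_neighbors` non-zero entries, so the memory footprint grows linearly in $n$ instead of quadratically as in the dense RBF construction.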

Computing eigenvectors

The cost of computing the $n$-by-$k$ (with $k \ll n$) matrix of selected eigenvectors of the graph Laplacian is normally proportional to the cost of multiplying the $n$-by-$n$ graph Laplacian matrix by a vector, which varies greatly depending on whether the graph Laplacian matrix is dense or sparse. For the dense case the cost is thus $O(n^2)$. The cost $O(n^3)$ very commonly cited in the literature comes from choosing $k=n$ and is clearly misleading, since, e.g., in hierarchical spectral clustering $k=1$, as determined by the Fiedler vector.

In the sparse case of the $n$-by-$n$ graph Laplacian matrix with $O(n)$ non-zero entries, the cost of the matrix-vector product, and thus of computing the $n$-by-$k$ (with $k \ll n$) matrix of selected eigenvectors, is $O(n)$, with the memory footprint also only $O(n)$. Both are the optimal lower bounds on the complexity of clustering $n$ data points. Moreover, matrix-free eigenvalue solvers such as LOBPCG can efficiently run in parallel, e.g., on multiple GPUs with distributed memory, resulting not only in the high quality clusters that spectral clustering is famous for, but also in top performance.
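The matrix-free approach can be illustrated with an iterative solver that only needs matrix-vector products. The sketch below assumes SciPy and uses its ARPACK interface (`scipy.sparse.linalg.eigsh`) rather than LOBPCG, applied to the normalized adjacency matrix so that the largest eigenvalues are the ones sought; the graph (two sparse circulant clusters joined by one weak edge) and all names are hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

m = 100          # nodes per cluster (toy size chosen for this sketch)
n = 2 * m
rows, cols, vals = [], [], []
for start in (0, m):
    for i in range(m):
        for offset in (1, 7, 31):      # each node gets 6 unit-weight edges
            j = (i + offset) % m
            rows += [start + i, start + j]
            cols += [start + j, start + i]
            vals += [1.0, 1.0]
# One weak edge between the two clusters.
rows += [0, m]
cols += [m, 0]
vals += [0.01, 0.01]
A = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()

d = np.asarray(A.sum(axis=1)).ravel()
inv_sqrt_d = 1.0 / np.sqrt(d)

def matvec(x):
    # Apply D^{-1/2} A D^{-1/2} without forming the normalized matrix;
    # the cost is one sparse product, i.e. O(n) for O(n) non-zeros.
    x = np.asarray(x).ravel()
    return inv_sqrt_d * (A @ (inv_sqrt_d * x))

op = LinearOperator((n, n), matvec=matvec, dtype=np.float64)

# Two algebraically largest eigenpairs of the normalized adjacency operator.
top_vals, top_vecs = eigsh(op, k=2, which="LA")
order = np.argsort(top_vals)
second = top_vecs[:, order[0]]         # eigenvector of the second-largest eigenvalue

labels = (second > 0).astype(int)
print(top_vals[order], labels[:3], labels[-3:])
```

The solver touches the graph only through `matvec`, so the same pattern applies when the similarity entries are generated on the fly. The largest eigenvalue should be 1, and the signs of the second eigenvector should separate the two clusters.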

Software

Free software implementing spectral clustering is available in large open source projects like scikit-learn using LOBPCG with multigrid preconditioning or ARPACK, MLlib for pseudo-eigenvector clustering using the power iteration method, and R.

Relationship with other clustering methods

The ideas behind spectral clustering may not be immediately obvious. It may be useful to highlight relationships with other methods. In particular, it can be described in the context of kernel clustering methods, which reveals several similarities with other approaches.

Relationship with k-means

The weighted kernel k-means problem shares the objective function with the spectral clustering problem, which can be optimized directly by multi-level methods.

Relationship to DBSCAN

In the trivial case of determining connected graph components (the optimal clusters with no edges cut), spectral clustering is also related to a spectral version of DBSCAN clustering that finds density-connected components.

Measures to compare clusterings

Ravi Kannan, Santosh Vempala and Adrian Vetta proposed a bicriteria measure to define the quality of a given clustering. A clustering is an (α, ε)-clustering if the conductance of each cluster in the clustering is at least α and the weight of the inter-cluster edges is at most an ε fraction of the total weight of all the edges in the graph. The same paper also examines two approximation algorithms.

History and related literature

Spectral clustering has a long history. Spectral clustering as a machine learning method was popularized by Shi & Malik and Ng, Jordan, & Weiss. Ideas and network measures related to spectral clustering also play an important role in a number of applications apparently different from clustering problems. For instance, networks with stronger spectral partitions take longer to converge in opinion-updating models used in sociology and economics (Golub, Benjamin; Jackson, Matthew O. "How Homophily Affects the Speed of Learning and Best-Response Dynamics". The Quarterly Journal of Economics, Oxford University Press, 127 (3): 1287–1338, 2012-07-26. ISSN 0033-5533. doi:10.1093/qje/qjs021).
