SUBCLU is an algorithm for

clustering high-dimensional data Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology ...

by Karin Kailing,

Hans-Peter Kriegel Hans-Peter Kriegel (1 October 1948, Germany) is a German computer scientist and professor at the Ludwig Maximilian University of Munich and leading the Database Systems Group in the Department of Computer Science. He was previously professor at ...

and Peer Kröger.Karin Kailing,

and Peer Kröger.
Density-Connected Subspace Clustering for High-Dimensional Data
'. In: ''Proc. SIAM Int. Conf. on Data Mining (SDM'04)'', pp. 246-257, 2004. It is a

subspace clustering Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology c ...

algorithm that builds on the density-based clustering algorithm

DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: gi ...

. SUBCLU can find clusters in

axis-parallel In geometry, an axis-aligned object (axis-parallel, axis-oriented) is an object in ''n''-dimensional space whose shape is aligned with the coordinate axes of the space. Examples are axis-aligned rectangles (or hyperrectangles), the ones with edge ...

subspaces, and uses a bottom-up, greedy strategy to remain efficient.

Approach

SUBCLU uses a

monotonicity In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of orde ...

criteria: if a cluster is found in a subspace

S

, then each subspace

T \subseteq S

also contains a cluster. However, a cluster

C \subseteq DB

in subspace

S

is not necessarily a cluster in

T \subseteq S

, since clusters are required to be maximal, and more objects might be contained in the cluster in

T

that contains

C

. However, a density-connected set in a subspace

S

is also a density-connected set in

T \subseteq S

. This ''downward-closure property'' is utilized by SUBCLU in a way similar to the

Apriori algorithm AprioriRakesh Agrawal and Ramakrishnan SrikanFast algorithms for mining association rules Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994. is an algorithm for frequent ...

: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces

k+1

-dimensional candidate subspaces by combining

k

-dimensional subspaces with clusters sharing

k-1

attributes. After pruning irrelevant candidates,

is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces. In order to improve the runtime of

, only the points known to belong to clusters in one

k

-dimensional subspace (which is chosen to contain as little clusters as possible) are considered. Due to the downward-closure property, other point cannot be part of a

k+1

-dimensional cluster anyway.

Pseudocode

SUBCLU takes two parameters,

\epsilon\!\,

and

MinPts

, which serve the same role as in

. In a first step, DBSCAN is used to find 1D-clusters in each subspace spanned by a single attribute:

\mathtt(DB, eps, MinPts)

S_1 := \emptyset

C_1 := \emptyset

\mathtt\, a \in Attributes

C^ = \mathtt(DB, \, eps, MinPts)\!\,

\mathtt (C^ \neq \emptyset)

:::

S_1 := S_1 \cup \

:::

C_1 := C_1 \cup C^

\mathtt

\mathtt

: // In a second step,

k+1

-dimensional clusters are built from

k

-dimensional ones: :

k := 1\!\,

\mathtt (C_k \neq \emptyset)

\mathtt_ := \mathtt(S_k)\!\,

\mathtt\, cand \in \mathtt_

:::

\mathtt \min_ \sum_ , C_i,

:::

C^ := \emptyset

:::

\mathtt\, cl \in C^\mathtt

::::

C^ := C^ \cup \mathtt(cl, cand, eps, MinPts)

::::

\mathtt\, (C^ \neq \emptyset)

:::::

S_ := S_ \cup cand

:::::

C_ := C_ \cup C^

::::

\mathtt

:::

\mathtt

\mathtt

k:=k+1\!\,

\mathtt

\mathtt\!\,

The set

S_k

contains all the

k

-dimensional subspaces that are known to contain clusters. The set

C_k

contains the sets of clusters found in the subspaces. The

bestSubspace

is chosen to minimize the runs of DBSCAN (and the number of points that need to be considered in each run) for finding the clusters in the candidate subspaces. Candidate subspaces are generated much alike the

generates the frequent itemset candidates: Pairs of the

k

-dimensional subspaces are compared, and if they differ in one attribute only, they form a

k+1

-dimensional candidate. However, a number of irrelevant candidates are found as well; they contain a

k

-dimensional subspace that does not contain a cluster. Hence, these candidates are removed in a second step:

\mathtt(S_k)

\mathtt_ := \emptyset

\mathtt\,s_1 \in S_k

\mathtt\,s_2 \in S_k

:::

\mathtt\,(s_1\,\mathtt\,s_2\,\,\mathtt)

::::

\mathtt_ := \mathtt_ \cup \

:::

\mathtt

\mathtt

\mathtt

:''// Pruning of irrelevant candidate subspaces'' :

\mathtt\,cand \in \mathtt_

\mathtt\, k\texttt\,s \subset cand

:::

\mathtt \, (s \not \in S_k)

::::

\mathtt_ = \mathtt_ \setminus \

:::

\mathtt

\mathtt

\mathtt

\mathtt{end}\,\!

Availability

An example implementation of SUBCLU is available in the ELKI framework.

References

Cluster analysis algorithms