Clustering high-dimensional data is the

cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

of data with anywhere from a few dozen to many thousands of

dimension In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coor ...

s. Such

high-dimensional space In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coord ...

s of data are often encountered in areas such as

medicine Medicine is the science and Praxis (process), practice of caring for a patient, managing the diagnosis, prognosis, Preventive medicine, prevention, therapy, treatment, Palliative care, palliation of their injury or disease, and Health promotion ...

, where

DNA microarray A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to ...

technology can produce many measurements at once, and the clustering of

text document A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating ...

s, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

Problems

Four problems need to be overcome for clustering in high-dimensional data: * Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. This problem is known as the

curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. Th ...

. * The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless: ::

\lim_ \frac = 0

* A cluster is intended to group objects that are related, based on observations of their attribute's values. However, given a large number of attributes some of the attributes will usually not be meaningful for a given cluster. For example, in

newborn screening Newborn screening (NBS) is a public health program of screening in infants shortly after birth for conditions that are treatable, but not clinically evident in the newborn period. The goal is to identify infants at risk for these conditions ea ...

a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the ''local feature relevance'' problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient. * Given a large number of attributes, it is likely that some attributes are

correlated In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statisti ...

. Hence, clusters might exist in arbitrarily oriented

affine subspace In mathematics, an affine space is a geometric structure that generalizes some of the properties of Euclidean spaces in such a way that these are independent of the concepts of distance and measure of angles, keeping only the properties relate ...

s. Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.

Approaches

Approaches towards clustering in axis-parallel or arbitrarily oriented

s differ in how they interpret the overall goal, which is finding clusters in data with high dimensionality. An overall different approach is to find clusters based on

pattern A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated li ...

in the data matrix, often referred to as biclustering, which is a technique frequently utilized in

bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...

Subspace clustering

The adjacent image shows a mere two-dimensional space where a number of clusters can be identified. In the one-dimensional subspaces, the clusters

c_a

(in subspace

\

) and

c_b

c_c

c_d

(in subspace

\

) can be found.

c_c

cannot be considered a cluster in a two-dimensional (sub-)space, since it is too sparsely distributed in the

x

axis. In two dimensions, the two clusters

c_

and

c_

can be identified. The problem of subspace clustering is given by the fact that there are

2^d

different subspaces of a space with

d

dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms utilize some kind of

heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediat ...

to remain computationally feasible, at the risk of producing inferior results. For example, the ''downward-closure property'' (cf.

association rules Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.P ...

) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones, as any subspace T containing a cluster, will result in a full space S also to contain that cluster (i.e. S ⊆ T), an approach taken by most of the traditional algorithms such as CLIQUE,

SUBCLU SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger.Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data'. In: ''Proc. SIAM ...

. It is also possible to define a subspace using different degrees of relevance for each dimension, an approach taken by iMWK-Means, EBK-Modes and CBK-Modes.

Projected clustering

Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special

distance function In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general sett ...

together with a regular

clustering algorithm Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

. For example, the PreDeCon algorithm checks which attributes seem to support a clustering for each point, and adjusts the distance function such that dimensions with low

variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of number ...

are amplified in the distance function. In the figure above, the cluster

c_c

might be found using

DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: g ...

with a distance function that places less emphasis on the

x

-axis and thus exaggerates the low difference in the

y

-axis sufficiently enough to group the points into a cluster.

PROCLUS Proclus Lycius (; 8 February 412 – 17 April 485), called Proclus the Successor ( grc-gre, Πρόκλος ὁ Διάδοχος, ''Próklos ho Diádokhos''), was a Greek Neoplatonist philosopher, one of the last major classical philosophe ...

uses a similar approach with a

k-medoid The -medoids problem is a clustering problem similar to -means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm. Both the -means and -medoids algorithms are partitional (breaking the dataset up into group ...

clustering. Initial medoids are guessed, and for each medoid the subspace spanned by attributes with low variance is determined. Points are assigned to the medoid closest, considering only the subspace of that medoid in determining the distance. The algorithm then proceeds as the regular PAM algorithm. If the distance function weights attributes differently, but never with 0 (and hence never drops irrelevant attributes), the algorithm is called a ''"soft"-projected clustering algorithm''.

Projection-based clustering

Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space.Thrun, M. C., & Ultsch, A. : Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data, J. Classif., pp. 1-33, doi: 10.1007/s00357-020-09373-2. Typical projection-methods like

t-distributed stochastic neighbor embedding t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally de ...

(t-SNE), or neighbor retrieval visualizer (NerV) are used to project data explicitly into two dimensions disregarding the subspaces of higher dimension than two and preserving only relevant neighborhoods in high-dimensional data. In the next step, the Delaunay graph between the projected points is calculated, and each vertex between two projected points is weighted with the high-dimensional distance between the corresponding high-dimensional data points. Thereafter the shortest path between every pair of points is computed using the Dijkstra algorithm. The shortest paths are then used in the clustering process, which involves two choices depending on the structure type in the high-dimensional data. This Boolean choice can be decided by looking at the topographic map of high-dimensional structures. In a benchmarking of 34 comparable clustering methods, projection-based clustering was the only algorithm that always was able to find the high-dimensional distance or density-based structure of the dataset. Projection-based clustering is accessible in the open-source R package "ProjectionBasedClustering" on CRAN.

Hybrid approaches

Not all algorithms try to either find a unique cluster assignment for each point or all clusters in all subspaces; many settle for a result in between, where a number of possibly overlapping, but not necessarily exhaustive set of clusters are found. An example is FIRES, which is from its basic approach a subspace clustering algorithm, but uses a

too aggressive to credibly produce all subspace clusters. Another hybrid approach is to include a human-into-the-algorithmic-loop: Human domain expertise can help to reduce an exponential search space through heuristic selection of samples. This can be beneficial in the health domain where, e.g., medical doctors are confronted with high-dimensional descriptions of patient conditions and measurements on the success of certain therapies. An important question in such data is to compare and correlate patient conditions and therapy results along with combinations of dimensions. The number of dimensions is often very large, consequently one needs to map them to a smaller number of relevant dimensions to be more amenable for expert analysis. This is because irrelevant, redundant, and conflicting dimensions can negatively affect effectiveness and efficiency of the whole analytic process.

Correlation clustering

Another type of subspaces is considered in Correlation clustering (Data Mining).

Software

ELKI ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally at the database s ...

includes various subspace and correlation clustering algorithms *FCPS includes over fifty clustering algorithmsThrun, M. C., & Stier, Q.: Fundamental Clustering Algorithms Suite, SoftwareX, Vol. 13(C), pp. 100642, doi: 10.1016/j.softx.2020.100642, 2021.

References

Cluster analysis