t-SNE

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Sam Roweis and Geoffrey Hinton; Laurens van der Maaten proposed the ''t''-distributed variant. It is a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate.

t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing.

While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization, so a good understanding of the t-SNE parameters is necessary. Such "clusters" can be shown to appear even in non-clustered data, and may therefore be false findings. Interactive exploration may thus be necessary to choose parameters and validate results. It has been demonstrated that t-SNE is often able to recover well-separated clusters, and that, with special parameter choices, it approximates a simple form of spectral clustering.
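
To see this parameter sensitivity concretely, the following sketch re-embeds the same data at several perplexities; apparent cluster structure can change noticeably between panels. It assumes scikit-learn and matplotlib are available, and the digits dataset and the perplexity values 5, 30, and 100 are arbitrary illustrative choices, not recommendations.

```python
# Illustrative sketch: how the perplexity parameter changes a t-SNE plot.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [5, 30, 100]):
    # Each run re-embeds the same data; the visual "clusters" can look
    # quite different depending on this single parameter.
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(f"perplexity = {perplexity}")
plt.tight_layout()
plt.show()
```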


Details

Given a set of N high-dimensional objects \mathbf{x}_1, \dots, \mathbf{x}_N, t-SNE first computes probabilities p_{ij} that are proportional to the similarity of objects \mathbf{x}_i and \mathbf{x}_j, as follows. For i \neq j, define

: p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}

and set p_{i\mid i} = 0. Note that \sum_j p_{j\mid i} = 1 for all i. As Van der Maaten and Hinton explained: "The similarity of datapoint x_j to datapoint x_i is the conditional probability, p_{j\mid i}, that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i."

Now define

: p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}

This is motivated because p_i and p_j from the N samples are estimated as 1/N, so the conditional probability can be written as p_{i\mid j} = Np_{ij} and p_{j\mid i} = Np_{ji}. Since p_{ij} = p_{ji}, one obtains the previous formula. Also note that p_{ii} = 0 and \sum_{i,j} p_{ij} = 1.

The bandwidth of the Gaussian kernels \sigma_i is set in such a way that the entropy of the conditional distribution equals a predefined entropy, using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of \sigma_i are used in denser parts of the data space.

Since the Gaussian kernel uses the Euclidean distance \lVert\mathbf{x}_i - \mathbf{x}_j\rVert, it is affected by the curse of dimensionality: in high-dimensional data, when distances lose the ability to discriminate, the p_{ij} become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.

t-SNE aims to learn a d-dimensional map \mathbf{y}_1, \dots, \mathbf{y}_N (with \mathbf{y}_i \in \mathbb{R}^d and d typically chosen as 2 or 3) that reflects the similarities p_{ij} as well as possible. To this end, it measures similarities q_{ij} between two points in the map, \mathbf{y}_i and \mathbf{y}_j, using a very similar approach. Specifically, for i \neq j, define

: q_{ij} = \frac{(1 + \lVert\mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \lVert\mathbf{y}_k - \mathbf{y}_l\rVert^2)^{-1}}

and set q_{ii} = 0. Herein a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points, in order to allow dissimilar objects to be modeled far apart in the map.

The locations of the points \mathbf{y}_i in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P from the distribution Q, that is:

: \mathrm{KL}\left(P \parallel Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

The minimization of the Kullback–Leibler divergence with respect to the points \mathbf{y}_i is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
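
To make the two stages concrete, here is a minimal, unoptimized NumPy sketch of the computation described above. It is not a reference implementation: the fixed learning rate, the iteration count, and the absence of standard tricks such as momentum, early exaggeration, and Barnes-Hut approximation are simplifying assumptions for illustration only. The known closed form of the KL gradient, dC/dy_i = 4 \sum_j (p_{ij} - q_{ij})(1 + \lVert y_i - y_j\rVert^2)^{-1}(y_i - y_j), is used instead of numerical differentiation.

```python
# Minimal t-SNE sketch following the formulas above (illustrative only).
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances ||z_i - z_j||^2 for all pairs."""
    s = np.sum(Z ** 2, axis=1)
    d2 = s[:, None] + s[None, :] - 2.0 * Z @ Z.T
    return np.maximum(d2, 0.0)            # clip tiny negatives from rounding

def conditional_p(d2_row, i, sigma):
    """Row of conditional probabilities p_{j|i} for a given sigma_i."""
    d2 = d2_row.copy()
    d2[i] = np.inf                        # excludes j = i, so p_{i|i} = 0
    d2 -= d2.min()                        # stabilize exp (cancels in the ratio)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    return p / p.sum()

def joint_p(X, perplexity=30.0, n_bisect=64):
    """Calibrate each sigma_i by bisection so the entropy of p_{.|i}
    hits a target (log2 of the perplexity), then symmetrize:
    p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = X.shape[0]
    d2 = pairwise_sq_dists(X)
    target = np.log2(perplexity)
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi = 1e-10, 1e4
        for _ in range(n_bisect):
            sigma = 0.5 * (lo + hi)
            p = conditional_p(d2[i], i, sigma)
            h = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # Shannon entropy
            if h > target:                # larger sigma -> higher entropy
                hi = sigma
            else:
                lo = sigma
        P[i] = p
    return (P + P.T) / (2.0 * n)

def tsne(X, d=2, n_iter=500, lr=100.0, seed=0):
    P = joint_p(X)
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((X.shape[0], d))
    for _ in range(n_iter):
        # Student-t (Cauchy) kernel w_ij = (1 + ||y_i - y_j||^2)^(-1).
        w = 1.0 / (1.0 + pairwise_sq_dists(Y))
        np.fill_diagonal(w, 0.0)
        Q = w / w.sum()                   # q_ij, with q_ii = 0
        # KL gradient: dC/dy_i = 4 sum_j (p_ij - q_ij) w_ij (y_i - y_j).
        M = (P - Q) * w
        Y -= lr * 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)
    return Y
```

With arbitrary sample data, tsne(np.random.rand(100, 10)) returns a 100-by-2 array of map coordinates. Production implementations additionally use momentum, early exaggeration, and Barnes-Hut or interpolation-based approximations for speed.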


Software

* The R package Rtsne implements t-SNE in R.
* ELKI contains tSNE, also with Barnes-Hut approximation.
* scikit-learn, a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation.
* Tensorboard, the visualization kit associated with TensorFlow, also implements t-SNE (an online version is available).
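
As a usage illustration, a minimal call to the scikit-learn implementation might look like the following; the placeholder data and parameter values are arbitrary, and the method parameter selects between the exact solver and the Barnes-Hut approximation.

```python
# Minimal usage sketch of scikit-learn's t-SNE implementation.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)          # placeholder data: 200 points in 50-D
emb = TSNE(n_components=2,
           perplexity=30.0,          # controls the bandwidth calibration
           method="barnes_hut",      # Barnes-Hut approximation (the default)
           random_state=0).fit_transform(X)
print(emb.shape)                     # (200, 2)
```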




External links


* Visualizing Data Using t-SNE, a Google Tech Talk about t-SNE
* Implementations of t-SNE in various languages, a link collection maintained by Laurens van der Maaten