Affinity Propagation
   HOME

TheInfoList



OR:

In
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
and data mining, affinity propagation (AP) is a
clustering algorithm Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
based on the concept of "message passing" between data points. Unlike clustering algorithms such as -means or -medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Similar to -medoids, affinity propagation finds "exemplars," members of the input set that are representative of clusters.


Algorithm

Let through be a set of data points, with no assumptions made about their internal structure, and let be a function that quantifies the similarity between any two points, such that
iff In logic and related fields such as mathematics and philosophy, "if and only if" (shortened as "iff") is a biconditional logical connective between statements, where either both statements are true or both are false. The connective is bicon ...
is more similar to than to . For this example, the negative squared distance of two data points was used i.e. for points and , s(i,k) = - \left\, x_i - x_k \right\, ^2 The diagonal of (i.e. s(i,i)) is particularly important, as it represents the instance preference, meaning how likely a particular instance is to become an exemplar. When it is set to the same value for all inputs, it controls how many classes the algorithm produces. A value close to the minimum possible similarity produces fewer classes, while a value close to or larger than the maximum possible similarity produces many classes. It is typically initialized to the median similarity of all pairs of inputs. The algorithm proceeds by alternating between two message-passing steps, which update two matrices: * The "responsibility" matrix has values that quantify how well-suited is to serve as the exemplar for , relative to other candidate exemplars for . * The "availability" matrix contains values that represent how "appropriate" it would be for to pick as its exemplar, taking into account other points' preference for as an exemplar. Both matrices are initialized to all zeroes, and can be viewed as
log-probability In probability theory and computer science, a log probability is simply a logarithm of a probability. The use of log probabilities means representing probabilities on a logarithmic scale, instead of the standard , 1/math> unit interval. Since t ...
tables. The algorithm then performs the following updates iteratively: * First, responsibility updates are sent around: r(i,k) \leftarrow s(i,k) - \max_ \left\ * Then, availability is updated per ::a(i,k) \leftarrow \min \left( 0, r(k,k) + \sum_ \max(0, r(i',k)) \right) for i \neq k and ::a(k,k) \leftarrow \sum_ \max(0, r(i',k)). Iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or some predetermined number (of iterations) is reached. The exemplars are extracted from the final matrices as those whose 'responsibility + availability' for themselves is positive (i.e. (r(i,i) + a(i,i)) > 0).


Applications

The inventors of affinity propagation showed it is better for certain computer vision and
computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
tasks, e.g. clustering of pictures of human faces and identifying regulated transcripts, than -means, even when -means was allowed many random restarts and initialized using PCA. A study comparing affinity propagation and
Markov clustering In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a ...
on
protein interaction graph Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respond ...
partitioning found Markov clustering to work better for that problem. A semi-supervised variant has been proposed for
text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
applications. Another recent application was in economics, when the affinity propagation was used to find some temporal patterns in the output multipliers of the US economy between 1997 and 2017.


Software

* A
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
implementation is included in the
ELKI ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally at the database s ...
data mining framework. * A
Julia Julia is usually a feminine given name. It is a Latinate feminine form of the name Julio and Julius. (For further details on etymology, see the Wiktionary entry "Julius".) The given name ''Julia'' had been in use throughout Late Antiquity (e.g. ...
implementation of affinity propagation is contained in Julia Statistics's Clustering.jl package. * A
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
version is part of the
scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector m ...
library. * An R implementation is available in the "apcluster" package.


References

{{reflist Cluster analysis algorithms