In
statistics
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
, and especially in
biostatistics
Biostatistics (also known as biometry) is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experimen ...
, cophenetic correlation (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a
dendrogram
A dendrogram is a diagram representing a Tree (graph theory), tree graph. This diagrammatic representation is frequently used in different contexts:
* in hierarchical clustering, it illustrates the arrangement of the clusters produced by ...
preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of
DNA
Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
sequences, or other
taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters. This coefficient has also been proposed for use as a test for nested clusters.
Calculating the cophenetic correlation coefficient
Suppose that the original data have been modeled using a cluster method to produce a dendrogram ; that is, a simplified model in which data that are "close" have been grouped into a hierarchical tree. Define the following distance measures.
*
, the Euclidean distance between the ''i''th and ''j''th observations.
*
, the dendrogrammatic distance between the model points
and
. This distance is the height of the node at which these two points are first joined together.
Then, letting
be the average of the ''x''(''i'', ''j''), and letting
be the average of the ''t''(''i'', ''j''), the cophenetic correlation coefficient ''c'' is given by
:
Software implementation
It is possible to calculate the cophenetic correlation in
R using the dendextend R package.
In
Python, the
SciPy package also has an implementation.
In
MATLAB
MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementat ...
, the Statistic and Machine Learning toolbox contains an implementation.
See also
*
Cophenetic In the clustering of biological information such as data from microarray experiments, the cophenetic similarity or cophenetic distanceSokal, R. R. and F. J. Rohlf. 1962. The comparison of dendrograms by objective methods. Taxon, 11:33-40 of two ob ...
References
External links
Numerical example of cophenetic correlationComputing and displaying Cophenetic distances
{{DEFAULTSORT:Cophenetic Correlation
Covariance and correlation