Weighted correlation network analysis, also known as weighted gene co-expression network analysis (WGCNA), is a widely used data mining method, especially for studying biological networks based on pairwise correlations between variables. While it can be applied to most high-dimensional data sets, it has been most widely used in genomic applications. It allows one to define modules (clusters), intramodular hubs, and network nodes with regard to module membership, to study the relationships between co-expression modules, and to compare the network topology of different networks (differential network analysis). WGCNA can be used as a data reduction technique (related to oblique factor analysis), as a clustering method (fuzzy clustering), as a feature selection method (e.g. as a gene screening method), as a framework for integrating complementary (genomic) data (based on weighted correlations between quantitative variables), and as a data exploratory technique.
Although WGCNA incorporates traditional data exploratory techniques, its intuitive network language and analysis framework go beyond any single standard analysis technique. Since it uses network methodology and is well suited for integrating complementary genomic data sets, it can be interpreted as a systems biologic or systems genetic data analysis method. By selecting intramodular hubs in consensus modules, WGCNA also gives rise to network-based meta-analysis techniques.
History
The WGCNA method was developed by Steve Horvath, a professor of human genetics at the David Geffen School of Medicine at UCLA and of biostatistics at the UCLA Fielding School of Public Health, together with his colleagues at UCLA and (former) lab members (in particular Peter Langfelder, Bin Zhang, and Jun Dong). Much of the work arose from collaborations with applied researchers. In particular, weighted correlation networks were developed in joint discussions with cancer researchers Paul Mischel and Stanley F. Nelson and neuroscientists Daniel H. Geschwind and Michael C. Oldham, according to the acknowledgement section in.
Comparison between weighted and unweighted correlation networks
A weighted correlation network can be interpreted as a special case of a weighted network, dependency network or correlation network. Weighted correlation network analysis can be attractive for the following reasons:
* The network construction (based on soft thresholding the correlation coefficient) preserves the continuous nature of the underlying correlation information. For example, weighted correlation networks that are constructed on the basis of correlations between numeric variables do not require the choice of a hard threshold. Dichotomizing information and (hard) thresholding may lead to information loss.
* The network construction gives highly robust results with respect to different choices of the soft threshold. In contrast, results based on unweighted networks, constructed by thresholding a pairwise association measure, often strongly depend on the threshold.
* Weighted correlation networks facilitate a geometric interpretation based on the angular interpretation of the correlation, chapter 6 in.
* Resulting network statistics can be used to enhance standard data-mining methods such as cluster analysis, since (dis)similarity measures can often be transformed into weighted networks; see chapter 6 in.
* WGCNA provides powerful module preservation statistics, which can be used to quantify how similar a module is in another condition. Module preservation statistics also allow one to study differences between the modular structure of networks.
* Weighted networks and correlation networks can often be approximated by "factorizable" networks. Such approximations are often difficult to achieve for sparse, unweighted networks. Therefore, weighted (correlation) networks allow for a parsimonious parametrization (in terms of modules and module membership) (chapters 2, 6 in) and.
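The contrast between hard and soft thresholding described in the list above can be illustrated with a minimal Python sketch. It is not the WGCNA implementation: it assumes a power-function form of soft thresholding, |cor|^beta, and the cutoff tau = 0.5, the exponent beta = 6, the function names, and the toy profiles are all illustrative choices that do not come from this article.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def hard_threshold(cor, tau=0.5):
    """Unweighted edge: 1 if |cor| >= tau, else 0 (tau is a hypothetical cutoff)."""
    return 1 if abs(cor) >= tau else 0

def soft_threshold(cor, beta=6):
    """Weighted edge: |cor| ** beta (power form assumed; beta is illustrative)."""
    return abs(cor) ** beta

# Toy profiles across six samples (made-up numbers, not real data).
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "g2": [1.2, 1.9, 3.2, 3.8, 5.1, 6.3],  # closely tracks g1
    "g3": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],  # tracks g1 more loosely
    "g4": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],  # weakly related to g1
}

# Compute correlation, hard edge, and soft edge weight for every pair.
names = sorted(profiles)
edges = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(profiles[a], profiles[b])
        edges[(a, b)] = (r, hard_threshold(r), soft_threshold(r))

for (a, b), (r, hard, soft) in edges.items():
    print(f"{a}-{b}: cor={r:+.3f} hard={hard} soft={soft:.3f}")
```

In this sketch the hard threshold assigns the same edge value to every pair on the same side of the cutoff, whereas the soft-thresholded weights keep the ordering of the underlying correlations. Moving tau slightly can flip an unweighted edge on or off, while a small change in beta only rescales the weights.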
Method
First, one defines a gene co-expression similarity measure, which is used to define the network. We denote the gene co-expression similarity measure of a pair of genes i and j by $s_{ij}$. Many co-expression studies use the absolute value of the correlation as an unsigned co-expression similarity measure, $s_{ij}^{\mathrm{unsigned}} = |\mathrm{cor}(x_i, x_j)|$, where gene expression profiles $x_i$ and $x_j$ consist of the expression of genes i and j across multiple samples. However, using the absolute value of the correlation may obfuscate biologically relevant information, since no distinction is made between gene repression and activation. In contrast, in signed networks the similarity between genes reflects the sign of the correlation of their expression profiles. Varied transformation (or scaling) approaches can be considered if a signed co-expression measure between gene expression profiles $x_i$ and $x_j$
is needed. For example, one can (linearly) scale the correlations to be within the