Community Structure
   HOME

TheInfoList



OR:

In the study of
complex network In the context of network theory, a complex network is a graph (network) with non-trivial topological features—features that do not occur in simple networks such as lattices or random graphs but often occur in networks representing real s ...
s, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the particular case of ''non-overlapping'' community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But ''overlapping'' communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.


Properties

In the study of
networks Network, networking and networked may refer to: Science and technology * Network theory, the study of graphs as a representation of relations between discrete objects * Network science, an academic field that studies complex networks Mathematics ...
, such as computer and information networks, social networks and biological networks, a number of different characteristics have been found to occur commonly, including the small-world property,
heavy-tailed In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. In many applications it is the right tail of the distr ...
degree distribution In the study of graphs and networks, the degree of a node in a network is the number of connections it has to other nodes and the degree distribution is the probability distribution of these degrees over the whole network. Definition The degree o ...
s, and clustering, among others. Another common characteristic is community structure. In the context of networks, community structure refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network, as shown in the example image to the right. This inhomogeneity of connections suggests that the network has certain natural divisions within it. Communities are often defined in terms of the partition of the set of vertices, that is each node is put into one and only one community, just as in the figure. This is a useful simplification and most community detection methods find this type of community structure. However, in some cases a better representation could be one where vertices are in more than one community. This might happen in a social network where each vertex represents a person, and the communities represent the different groups of friends: one community for family, another community for co-workers, one for friends in the same sports club, and so on. The use of cliques for community detection discussed below is just one example of how such overlapping community structure can be found. Some networks may not have any meaningful community structure. Many basic network models, for example, such as the
random graph In mathematics, random graph is the general term to refer to probability distributions over graphs. Random graphs may be described simply by a probability distribution, or by a random process which generates them. The theory of random graphs li ...
and the
Barabási–Albert model The Barabási–Albert (BA) model is an algorithm for generating random scale-free networks using a preferential attachment mechanism. Several natural and human-made systems, including the Internet, the World Wide Web, citation networks, and s ...
, do not display community structure.


Importance

Community structures are quite common in real networks. Social networks include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc. Finding an underlying community structure in a network, if it exists, is important for a number of reasons. Communities allow us to create a large scale map of a network since individual communities act like meta-nodes in the network which makes its study easier. Individual communities also shed light on the function of the system represented by the network since communities often correspond to functional units of the system. In metabolic networks, such functional groups correspond to cycles or pathways whereas in the protein interaction network, communities correspond to proteins with similar functionality inside a biological cell. Similarly, citation networks form communities by research topic. Being able to identify these sub-structures within a network can provide insight into how network function and topology affect each other. Such insight can be useful in improving some algorithms on graphs such as
spectral clustering In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as ...
. Importantly, communities often have very different properties than the average properties of the networks. Thus, only concentrating on the average properties usually misses many important and interesting features inside the networks. For example, in a given social network, both gregarious and reticent groups might exists simultaneously. Existence of communities also generally affects various processes like rumour spreading or epidemic spreading happening on a network. Hence to properly understand such processes, it is important to detect communities and also to study how they affect the spreading processes in various settings. Finally, an important application that community detection has found in network science is the prediction of missing links and the identification of false links in the network. During the measurement process, some links may not get observed for a number of reasons. Similarly, some links could falsely enter into the data because of the errors in the measurement. Both these cases are well handled by community detection algorithm since it allows one to assign the probability of existence of an edge between a given pair of nodes.


Algorithms for finding communities

Finding communities within an arbitrary network can be a computationally difficult task. The number of communities, if any, within the network is typically unknown and the communities are often of unequal size and/or density. Despite these difficulties, however, several methods for community finding have been developed and employed with varying levels of success.


Minimum-cut method

One of the oldest algorithms for dividing networks into parts is the
minimum cut In graph theory, a minimum cut or min-cut of a graph is a cut (a partition of the vertices of a graph into two disjoint subsets) that is minimal in some metric. Variations of the minimum cut problem consider weighted graphs, directed graphs, ter ...
method (and variants such as ratio cut and normalized cut). This method sees use, for example, in load balancing for parallel computing in order to minimize communication between processor nodes. In the minimum-cut method, the network is divided into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized. The method works well in many of the applications for which it was originally intended but is less than ideal for finding community structure in general networks since it will find communities regardless of whether they are implicit in the structure, and it will find only a fixed number of them.


Hierarchical clustering

Another method for finding community structures in networks is hierarchical clustering. In this method one defines a
similarity measure In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such meas ...
quantifying some (usually topological) type of similarity between node pairs. Commonly used measures include the
cosine similarity In data analysis, cosine similarity is a measure of similarity between two sequences of numbers. For defining it, the sequences are viewed as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle betw ...
, the
Jaccard index The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is freque ...
, and the
Hamming distance In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chan ...
between rows of the
adjacency matrix In graph theory and computer science, an adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph. In the special case of a finite simp ...
. Then one groups similar nodes into communities according to this measure. There are several common schemes for performing the grouping, the two simplest being
single-linkage clustering In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of el ...
, in which two groups are considered separate communities if and only if all pairs of nodes in different groups have similarity lower than a given threshold, and
complete linkage clustering Complete-linkage clustering is one of several methods of agglomerative hierarchical clustering. At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all ...
, in which all nodes within every group have similarity greater than a threshold. An important step is how to determine the threshold to stop the agglomerative clustering, indicating a near-to-optimal community structure. A common strategy consist to build one or several metrics monitoring global properties of the network, which peak at given step of the clustering. An interesting approach in this direction is the use of various similarity or dissimilarity measures, combined through convex sums,. Another approximation is the computation of a quantity monitoring the density of edges within clusters with respect to the density between clusters, such as the partition density, which has been proposed when the similarity metric is defined between edges (which permits the definition of overlapping communities), and extended when the similarity is defined between nodes, which allows to consider alternative definitions of communities such as guilds (i.e. groups of nodes sharing a similar number of links with respect to the same neighbours but not necessarily connected themselves). These methods can be extended to consider multidimensional networks, for instance when we are dealing with networks having nodes with different types of links.


Girvan–Newman algorithm

Another commonly used algorithm for finding communities is the
Girvan–Newman algorithm The Girvan–Newman algorithm (named after Michelle Girvan and Mark Newman) is a hierarchical method used to detect communities in complex systems.Girvan M. and Newman M. E. J.Community structure in social and biological networks Proc. Natl. Acad. ...
. This algorithm identifies edges in a network that lie between communities and then removes them, leaving behind just the communities themselves. The identification is performed by employing the graph-theoretic measure
betweenness centrality In graph theory, betweenness centrality (or "betweeness centrality") is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such ...
, which assigns a number to each edge which is large if the edge lies "between" many pairs of nodes. The Girvan–Newman algorithm returns results of reasonable quality and is popular because it has been implemented in a number of standard software packages. But it also runs slowly, taking time O(''m''2''n'') on a network of ''n'' vertices and ''m'' edges, making it impractical for networks of more than a few thousand nodes.


Modularity maximization

In spite of its known drawbacks, one of the most widely used methods for community detection is modularity maximization. Modularity is a benefit function that measures the quality of a particular division of a network into communities. The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity. Since exhaustive search over all possible divisions is usually intractable, practical algorithms are based on approximate optimization methods such as greedy algorithms, simulated annealing, or spectral optimization, with different approaches offering different balances between speed and accuracy. A popular modularity maximization approach is the Louvain method, which iteratively optimizes local communities until global modularity can no longer be improved given perturbations to the current community state. An algorithm that utilizes the RenEEL scheme, which is an example of the
Extremal Ensemble Learning Extremal Ensemble Learning (EEL) is a machine learning algorithmic paradigm for graph partitioning. EEL creates an ensemble Ensemble may refer to: Art * Architectural ensemble * Ensemble (album), ''Ensemble'' (album), Kendji Girac 2015 album * ...
(EEL) paradigm, is currently the best modularity maximizing algorithm. The usefulness of modularity optimization is questionable, as it has been shown that modularity optimization often fails to detect clusters smaller than some scale, depending on the size of the network ( resolution limit); on the other hand the landscape of modularity values is characterized by a huge degeneracy of partitions with high modularity, close to the absolute maximum, which may be very different from each other.


Statistical inference

Methods based on
statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution, distribution of probability.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical ...
attempt to fit a
generative model In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsi ...
to the network data, which encodes the community structure. The overall advantage of this approach compared to the alternatives is its more principled nature, and the capacity to inherently address issues of
statistical significance In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis (simply by chance alone). More precisely, a study's defined significance level, denoted by \alpha, is the p ...
. Most methods in the literature are based on the stochastic block model as well as variants including mixed membership, degree-correction, and hierarchical structures.
Model selection Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the ...
can be performed using principled approaches such as
minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam ...
(or equivalently, Bayesian model selection) and
likelihood-ratio test In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after im ...
. Currently many algorithms exist to perform efficient inference of stochastic block models, including
belief propagation A belief is an attitude that something is the case, or that some proposition is true. In epistemology, philosophers use the term "belief" to refer to attitudes about the world which can be either true or false. To believe something is to take i ...
and agglomerative
Monte Carlo Monte Carlo (; ; french: Monte-Carlo , or colloquially ''Monte-Carl'' ; lij, Munte Carlu ; ) is officially an administrative area of the Principality of Monaco, specifically the ward of Monte Carlo/Spélugues, where the Monte Carlo Casino is ...
. In contrast to approaches that attempt to cluster a network given an objective function, this class of methods is based on generative models, which not only serve as a description of the large-scale structure of the network, but also can be used to ''generalize'' the data and predict the occurrence of missing or spurious links in the network.


Clique-based methods

Clique A clique ( AusE, CanE, or ), in the social sciences, is a group of individuals who interact with one another and share similar interests. Interacting with cliques is part of normative social development regardless of gender, ethnicity, or popular ...
s are subgraphs in which every node is connected to every other node in the clique. As nodes can not be more tightly connected than this, it is not surprising that there are many approaches to community detection in networks based on the detection of cliques in a graph and the analysis of how these overlap. Note that as a node can be a member of more than one clique, a node can be a member of more than one community in these methods giving an "''overlapping community structure''". One approach is to find the "''maximal cliques''". That is to find the cliques which are not the subgraph of any other clique. The classic algorithm to find these is the
Bron–Kerbosch algorithm In computer science, the Bron–Kerbosch algorithm is an enumeration algorithm for finding all maximal cliques in an undirected graph. That is, it lists all subsets of vertices with the two properties that each pair of vertices in one of the listed ...
. The overlap of these can be used to define communities in several ways. The simplest is to consider only maximal cliques bigger than a minimum size (number of nodes). The union of these cliques then defines a subgraph whose components (disconnected parts) then define communities. Such approaches are often implemented in
social network analysis software Social network analysis software (SNA software) is software which facilitates quantitative or qualitative analysis of social networks, by describing features of a network either through numerical or visual representation. Overview Networks can ...
such as UCInet. The alternative approach is to use cliques of fixed size k. The overlap of these can be used to define a type of k-regular
hypergraph In mathematics, a hypergraph is a generalization of a graph in which an edge can join any number of vertices. In contrast, in an ordinary graph, an edge connects exactly two vertices. Formally, an undirected hypergraph H is a pair H = (X,E) wh ...
or a structure which is a generalisation of the
line graph In the mathematical discipline of graph theory, the line graph of an undirected graph is another graph that represents the adjacencies between edges of . is constructed in the following way: for each edge in , make a vertex in ; for every ...
(the case when k=2) known as a "''Clique graph''". The clique graphs have vertices which represent the cliques in the original graph while the edges of the clique graph record the overlap of the clique in the original graph. Applying any of the previous community detection methods (which assign each node to a community) to the clique graph then assigns each clique to a community. This can then be used to determine community membership of nodes in the cliques. Again as a node may be in several cliques, it can be a member of several communities. For instance the clique percolation method defines communities as percolation clusters of k-cliques. To do this it finds all k-cliques in a network, that is all the complete sub-graphs of k-nodes. It then defines two k-cliques to be adjacent if they share k-1 nodes, that is this is used to define edges in a clique graph. A community is then defined to be the maximal union of k-cliques in which we can reach any k-clique from any other k-clique through series of k-clique adjacencies. That is communities are just the connected components in the clique graph. Since a node can belong to several different k-clique percolation clusters at the same time, the communities can overlap with each other.


Testing methods of finding communities algorithms

The evaluation of algorithms, to detect which are better at detecting community structure, is still an open question. It must be based on analyses of networks of known structure. A typical example is the "four groups" test, in which a network is divided into four equally-sized groups (usually of 32 nodes each) and the probabilities of connection within and between groups varied to create more or less challenging structures for the detection algorithm. Such benchmark graphs are a special case of the planted l-partition model of Condon and
Karp Karp may refer to: Places * Karp, Podlaskie Voivodeship, in north-east Poland * Karp, Lublin Voivodeship, in east Poland People * Karp (surname) * Karp Khachvankyan (1923–1998), Armenian actor and director Other uses * KARP-FM, a radio st ...
, or more generally of " stochastic block models", a general class of random network models containing community structure. Other more flexible benchmarks have been proposed that allow for varying group sizes and nontrivial degree distributions, such as LFR benchmark which is an extension of the four groups benchmark that includes heterogeneous distributions of node degree and community size, making it a more severe test of community detection methods. Commonly used computer-generated benchmarks start with a network of well-defined communities. Then, this structure is degraded by rewiring or removing links and it gets harder and harder for the algorithms to detect the original partition. At the end, the network reaches a point where it is essentially random. This kind of benchmark may be called "open". The performance on these benchmarks is evaluated by measures such as normalized
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
or variation of information. They compare the solution obtained by an algorithm with the original community structure, evaluating the similarity of both partitions.


Detectability

During recent years, a rather surprising result has been obtained by various groups which shows that a phase transition exists in the community detection problem, showing that as the density of connections inside communities and between communities become more and more equal or both become smaller (equivalently, as the community structure becomes too weak or the network becomes too sparse), suddenly the communities become undetectable. In a sense, the communities themselves still exist, since the presence and absence of edges is still correlated with the community memberships of their endpoints; but it becomes information-theoretically impossible to label the nodes better than chance, or even distinguish the graph from one generated by a null model such as the Erdos–Renyi model without community structure. This transition is independent of the type of algorithm being used to detect communities, implying that there exists a fundamental limit on our ability to detect communities in networks, even with optimal Bayesian inference (i.e., regardless of our computational resources). Consider a stochastic block model with total n nodes, q=2 groups of equal size, and let p_\text and p_\text be the connection probabilities inside and between the groups respectively. If p_\text>p_\text, the network would possess community structure since the link density inside the groups would be more than the density of links between the groups. In the sparse case, p_\text and p_\text scale as O(1/n) so that the average degree is constant: :p_\text=c_\text/n and p_\text=c_\text/n Then it becomes impossible to detect the communities when: :c_\text-c_\text=\sqrt


See also

*
Complex network In the context of network theory, a complex network is a graph (network) with non-trivial topological features—features that do not occur in simple networks such as lattices or random graphs but often occur in networks representing real s ...
*
Hierarchy A hierarchy (from Greek: , from , 'president of sacred rites') is an arrangement of items (objects, names, values, categories, etc.) that are represented as being "above", "below", or "at the same level as" one another. Hierarchy is an important ...
*
Network theory Network theory is the study of graphs as a representation of either symmetric relations or asymmetric relations between discrete objects. In computer science and network science, network theory is a part of graph theory: a network can be defi ...
*
Percolation theory In statistical physics and mathematics, percolation theory describes the behavior of a network when nodes or links are added. This is a geometric type of phase transition, since at a critical fraction of addition the network of small, disconnected ...


References


External links


Community detection in graphs
– an introduction
Are there implementations of algorithms for community detection in graphs?
Stack Overflow In software, a stack overflow occurs if the call stack pointer exceeds the stack bound. The call stack may consist of a limited amount of address space, often determined at the start of the program. The size of the call stack depends on many fac ...

What are the differences between community detection algorithms in igraph?
Stack Overflow In software, a stack overflow occurs if the call stack pointer exceeds the stack bound. The call stack may consist of a limited amount of address space, often determined at the start of the program. The size of the call stack depends on many fac ...
{{DEFAULTSORT:Community Structure Networks