Graphs: Community Structure
Graph communities are sets of tightly-connected nodes, which can usefully represent groups of entities characterised by proximity or direct interaction. These may be people grouped according to social relationships or proteins interacting for a given metabolic process. In this post we look at a fast heuristic used to extract non-overlapping communities (the Louvain algorithm) and a method for overlapping community detection, which relies on maximising the likelihood of observing a graph given relaxed generative models of node-community membership.
Modularity: Measuring the Presence of Community Structure
We define modularity, $Q$, as a measure of how well a network is partitioned[^1] into communities. Given a partition of a graph into disjoint groups, $s \in S$, the modularity is proportional to the sum, over those groups, of the difference between the actual and expected numbers of edges within each group
\[Q \propto \sum_{s \in S} \left( \vert E_s \vert - \mathbb{E}\left [ \vert E_s \vert \right ] \right)\]The expectation term, $\mathbb{E}\left [ \vert E_s \vert \right ]$, necessitates a null model over which to take the expectation.
It is worth emphasising that the value of the modularity depends on the choice of partition[^2], i.e. the specific allocation of elements into subgroups (subsets). We encounter negative values of the normalised modularity, $Q$, in the presence of “anticommunities”, where fewer links exist between nodes in a group than we would expect. These anticommunities are not cancelled out[^3]: the edges that a group’s members make to other nodes do not form compensating communities, since those other nodes lie outside the given group. It is only by reconfiguring the subgroups constituting the partition that we can calculate the maximum modularity over all possible partitions.
For the null model, we can use the configuration model[^4], which permits multi-edges and self-loops, yielding a multigraph (a Molloy–Reed random graph).
In the limit as the number of edges, $m$, becomes large, the expected number of edges between nodes $i$ and $j$ in the random null model is $k_i \cdot k_j / 2m$. Applying the normalising constant $1 / 2m$ (noting that $\sum_{ij} A_{ij} = 2m$ for an undirected graph) scales the modularity to lie in the range $[-1, 1]$, and we arrive at our final formula for modularity
\[Q(G, S) = \dfrac{1}{2m} \sum_{s \in S} \sum_{i \in s} \sum_{j \in s} \left( A_{ij} - \dfrac{k_i \cdot k_j}{2m} \right)\]Note that modularity considers self-loops (which will push it in the positive direction) but recall that these are vanishingly infrequent; $\mathcal{O}(1 / N)$. Usually, a value of $0.3$ to $0.7$ is considered to indicate significant community structure.
We can perform the same summation as a single sum over all node pairs using the Kronecker delta of the communities of nodes $i$ and $j$, $\delta(c_i, c_j)$, which is $1$ when $c_i = c_j$ and $0$ otherwise
\[Q(G, S) = \dfrac{1}{2m} \sum_{ij} \left [ A_{ij} - \dfrac{k_i \cdot k_j}{2m} \right ] \cdot \delta(c_i, c_j)\]Until now, we have simply computed a metric, the modularity, to measure the extent to which a given partition of the nodes is clustered into communities. We can flip this problem around and find communities by finding the partition that maximises the modularity for a given graph.
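To make this concrete, below is a minimal numpy sketch (the function and variable names are my own) that evaluates $Q$ from an adjacency matrix and a vector of community labels, using the Kronecker delta formulation:

```python
import numpy as np

def modularity(A: np.ndarray, communities: np.ndarray) -> float:
    """Q for an undirected graph: adjacency matrix A, one community
    label per node (the Kronecker delta formulation above)."""
    k = A.sum(axis=1)              # node degrees
    two_m = A.sum()                # 2m: each undirected edge counted twice
    delta = communities[:, None] == communities[None, :]  # delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Two triangles joined by a single edge: clear two-community structure
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))  # ~0.357
```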
Finding Communities: Maximising Modularity with the Louvain Algorithm
The Louvain algorithm (Blondel et al. 2008[^5]) is a heuristic algorithm for greedily maximising the graph modularity that works well in practice and is very fast, running in log-linear time in the size of the graph, $\mathcal{O}(N \log N)$. It can operate on weighted graphs and provides a hierarchical grouping of communities. It is widely used due to its rapid convergence.
Hierarchical clustering sees nodes clustered into groups, then those groups themselves clustered and so on. Dendrograms are equivalent ways of representing this structure. [Image Source: Ahn, Bagrow and Lehmann (2010) Link communities reveal multiscale complexity in networks. https://arxiv.org/abs/0903.3178v3]
The hierarchical nature of the clustering that emerges is an artefact of the method but is arguably justified as networks exhibit self-similarity. The authors note:
The algorithm is reminiscent of the self-similar nature of complex networks and naturally incorporates a notion of hierarchy, as communities of communities are built during the process. The height of the hierarchy that is constructed is determined by the number of passes and is generally a small number…
Louvain Algorithm
The Louvain algorithm operates in two phases at each pass and is repeated until the modularity converges (i.e. stops increasing).
WHILE the modularity is still increasing (compared to the previous pass):
- Phase (1): Partitioning: Modularity is maximised by allowing only local changes to node-community memberships, i.e. nodes are reallocated to different communities if it will increase the modularity.
- Phase (2): Aggregation: The identified communities are aggregated into super-nodes to build a new network.
The Louvain algorithm starts from a singleton partition in which each node is in its own community (a). The algorithm moves individual nodes from one community to another to find a partition (b). Based on this partition, an aggregate network is created (c). The algorithm then moves individual nodes in the aggregate network (d). These steps are repeated until the quality cannot be increased further. [Image Source: Blondel et al. (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. https://arxiv.org/pdf/0803.0476.pdf; Caption Source: Traag, Waltman and van Eck (2019) From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports. https://www.nature.com/articles/s41598-019-41695-z]
The algorithm begins by assigning each node to its own community. For each node, two operations are performed:
- the change in modularity, $\Delta Q$, is computed for $i$ being placed into a community with neighbour $j$
- node $i$ is moved into a community with the neighbour $j$ that maximises the change in modularity, $\Delta Q$
This process is terminated when a local maximum is reached such that the modularity is not increasing as a result of individual node reassignments.
Note that the outcome of this process depends on the order in which nodes are considered, as is characteristic of greedy algorithms. That said, empirical results suggest that the node ordering does not significantly impact the modularity achieved.
The authors state all of the above points eloquently in the original paper, also noting that their algorithm, by contrast to previous proposals, is limited more by memory than by computational constraints, being capable (in 2008) of identifying communities in a $118$ million node network in $152$ minutes. I quote[^6] from Section 2 (“Method”) of the paper:
Our algorithm is divided in two phases that are repeated iteratively. Assume that we start with a weighted network of $N$ nodes. First, we assign a different community to each node of the network. So, in this initial partition there are as many communities as there are nodes. Then, for each node $i$ we consider the neighbours $j$ of $i$ and we evaluate the gain of modularity that would take place by removing $i$ from its community and by placing it in the community of $j$. The node $i$ is then placed in the community for which this gain is maximum (in case of a tie we use a breaking rule), but only if this gain is positive. If no positive gain is possible, $i$ stays in its original community. This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved and the first phase is then complete.
Let us insist on the fact that a node may be, and often is, considered several times. This first phase stops when a local maxima of the modularity is attained, i.e., when no individual move can improve the modularity.
One should also note that the output of the algorithm depends on the order in which the nodes are considered. Preliminary results on several test cases seem to indicate that the ordering of the nodes does not have a significant influence on the modularity that is obtained. However the ordering can influence the computation time.
Efficiently Computing Modularity Gain (Phase 1: Partitioning)
Naively recomputing the modularity from scratch to evaluate each candidate node move would be computationally inefficient.
Instead, the equivalent result can be computed using the following expression for the gain in modularity, $\Delta Q$, obtained by moving an isolated node, $i$, into a community, $C$; this is combined additively with the complementary expression for the change in modularity from removing $i$ from its original community
\[\Delta Q (i \rightarrow C) = \underbrace{\left [ \frac{\Sigma_{in} + k_{i, in}}{2m} - {\left ( \frac{\Sigma_{tot} + k_{i}}{2m} \right )}^2 \right ]}_{\text{After merging node } i} - \underbrace{\left [ \overbrace{\frac{\Sigma_{in}}{2m} - \left (\frac{\Sigma_{tot}}{2m} \right )^2}^{\text{modularity of community } C} - \overbrace{\left ( \frac{k_i}{2m} \right )^2}^{\text{modularity of node } i} \right ]}_{\text{Before merging node } i}\]where:
- $\Sigma_{in}$ is the sum of the weights of the edges inside community $C$
- $\Sigma_{tot}$ is the sum of the weights of the edges incident to nodes in community $C$
- $k_i$ is the degree of node $i$ (i.e. the sum of the weights of the edges incident to node $i$)
- $k_{i, in}$ is the sum of the weights of the edges from $i$ to other nodes in $C$
- $m$ is the sum of the edge weights across the whole graph
The first term can be considered the modularity when node $i$ is placed in community $C$, i.e. after the merge, and the second can be considered the modularity when node $i$ is outside community $C$, i.e. before the merge.
This implementation is fast because the community-level summations can be cached and the remaining terms evaluated per node. Further improvements can be made by parallelizing the computation of modularity gain across nodes.
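As a sketch, the gain can be transcribed directly from the expression above, assuming the community-level sums $\Sigma_{in}$ and $\Sigma_{tot}$ are maintained incrementally as nodes move (the variable names are illustrative):

```python
def delta_q(sigma_in: float, sigma_tot: float, k_i: float,
            k_i_in: float, m: float) -> float:
    """Modularity gain from moving isolated node i into community C.
    sigma_in / sigma_tot are the cached community sums defined above."""
    after = (sigma_in + k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    before = (sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2
              - (k_i / (2 * m)) ** 2)
    return after - before
```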
Restructuring (Phase 2: Aggregation)
As mentioned, once the first partitioning phase is completed for a given pass, the communities obtained are contracted into super-nodes and a new network is created such that (see the sketch after this list):
- Super-nodes are connected if there is at least one edge between the nodes of the corresponding communities (for an unweighted graph)
- The edge weight between two super-nodes is the sum of the edge weights between their corresponding communities
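A minimal sketch of this aggregation step, assuming a dictionary-of-edge-weights graph representation (the names are my own). Note that within-community weight is kept as a self-loop on the super-node, which subsequent passes need in order to compute modularity correctly:

```python
from collections import defaultdict

def aggregate(edges: dict, community: dict) -> dict:
    """Contract communities into super-nodes. `edges` maps (u, v) -> weight;
    `community` maps node -> community label."""
    super_edges = defaultdict(float)
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        # Within-community weight (cu == cv) becomes a self-loop
        super_edges[tuple(sorted((cu, cv)))] += w
    return dict(super_edges)
```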
After this is complete, we begin a new pass on the aggregated network, iterating until the modularity no longer increases, and determine our final clusters as those on the level of the clustering hierarchy with the maximal modularity[^7].
Pseudocode for the whole sequential (as opposed to parallelized) Louvain algorithm is shown below.
Sequential Louvain Algorithm Pseudocode [Image Source: Scalable community detection with the Louvain algorithm by Navid Sedighpour. AmirKabir University of Technology (July 2016). https://www.slideshare.net/NavidSedighpour1/scalable-community-detection-with-the-louvain-algorithm]
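In practice, rather than implementing the pseudocode directly, one can call a library implementation; for example, recent versions of NetworkX ship one:

```python
import networkx as nx

# Louvain on Zachary's karate club graph (NetworkX >= 3.0)
G = nx.karate_club_graph()
communities = nx.community.louvain_communities(G, seed=42)
print(len(communities), "communities found")
print("Q =", nx.community.modularity(G, communities))
```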
Overlapping Communities
Individuals are typically members of several communities simultaneously, for example in social science contexts we might consider people’s social and professional circles at the same time or in molecular biology we might want to consider proteins’ roles in several metabolic pathways. Accordingly, we want graphs, as abstractions of these phenomena, to model this multiple community membership.
In terms of a graph adjacency matrix (with entries coloured by community), the single-membership scenario is expressed as most entries (edges) corresponding to a single colour, i.e. they are within-community edges, with very few mixed-colour entries representing between-community edges. Allowing multiple membership results in more of the entries being multicoloured, representing links that are within-community with respect to one community and between-community with respect to another.
The Community Affiliation Graph Model (AGM)
We can approach this problem by defining a generative model for graphs based on node-community affiliations: the Community Affiliation Graph Model (AGM).
Given a graph, $G$, we can then assume that $G$ was generated by an Affiliation Graph Model and find the model most likely to have generated the graph we observe, thereby discovering its community structure.
A bipartite community affiliation graph (shown below) links communities, $C$, to nodes, $V$, via memberships, $M$. The model is additionally parameterized with a probability for each community, $p_C$, that nodes of that community will link to each other.
A Bipartite Community Affiliation Graph maps communities (top) to nodes (bottom) via memberships, visualised as directed edges. [Image Source: Jaewon Yang and Jure Leskovec (2012) Community-Affiliation Graph Model for Overlapping Network Community Detection. IEEE 12th International Conference on Data Mining. https://cs.stanford.edu/people/jure/pubs/agmfit-icdm12.pdf]
We generate graphs given the parameters, $\{V, C, M, \{p_C\}\}$, as an Erdős–Rényi random graph within each of the communities, using the $p_C$ parameter to determine the edge connections according to independent Bernoulli trials.
However, edges now connect nodes that may belong to multiple communities, and therefore their probability of existing is determined by the number of communities the corresponding nodes have in common. Specifically, the probability that two nodes, $u$ and $v$, will be connected is the complement of the probability that they won’t be connected, which in turn is the product, over all mutual communities of $u$ and $v$, of the probability $1 - p_C$ that any given community, $C$, fails to connect them
\[P(u, v) = 1 - \prod_{C \in M_u \cap M_v}{(1 - p_C)}\]The case when $u$ and $v$ have no communities in common is handled by assigning all nodes membership of a background epsilon community.
AGMs allow us to express non-overlapping communities (as considered above), overlapping communities and nested communities.
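A small sketch of the AGM edge probability, with the background epsilon community folded in as a constant (the names and the `eps` value are illustrative, not from the paper):

```python
def agm_edge_prob(memberships_u: set, memberships_v: set,
                  p: dict, eps: float = 1e-4) -> float:
    """P(u, v) under the AGM: the complement of every shared community
    (plus the background epsilon community) failing to create the edge."""
    no_edge = 1 - eps  # background community shared by all nodes
    for c in memberships_u & memberships_v:
        no_edge *= 1 - p[c]
    return 1 - no_edge

print(agm_edge_prob({"A", "B"}, {"A", "B", "C"}, {"A": 0.5, "B": 0.2}))  # ~0.60
```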
Finally, we use maximum likelihood estimation to find the parameters of the Affiliation Graph Model, $F$, that is most likely to have generated the graph we observe, $G$. In order to do this we need a means to efficiently calculate $P(G \vert F)$, which we then maximise over the parameter space of $F$ e.g. using gradient descent: $\DeclareMathOperator*{\argmax}{arg\,max} \argmax_F P(G \vert F)$.
Defining the Likelihood of a Graph
Given an AGM, $F$, and a graph, $G$, we can define the graph likelihood, $P(G \vert F)$, as the probability that the model generates exactly the adjacency matrix we observe, computed as a product of per-edge probabilities.
Formally, we take the edge-specific probabilities calculated as we have seen above and multiply them for the edges that are present in the graph (e.g. $1$ entries in the adjacency matrix in the case of an unweighted graph) and multiply their complements for the edges that are absent ($0$ entries respectively). This leads naturally to the equation for the graph likelihood
\[P(G \vert F) = \prod_{(u,v) \in G} P(u, v) \prod_{(u,v) \notin G} (1 - P(u, v))\]The first product corresponds to present edges and the second to absent ones.
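In practice this is evaluated in the log domain to avoid numerical underflow over the many factors; a minimal sketch, with `edge_prob` standing in for the AGM probability defined earlier:

```python
import math
from itertools import combinations

def log_likelihood(nodes, edges: set, edge_prob) -> float:
    """log P(G | F): sum log P(u, v) over present edges and
    log(1 - P(u, v)) over absent node pairs."""
    ll = 0.0
    for u, v in combinations(nodes, 2):
        p = edge_prob(u, v)
        present = (u, v) in edges or (v, u) in edges
        ll += math.log(p) if present else math.log(1 - p)
    return ll
```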
Note that maximising this likelihood in its current formulation is a hard combinatorial problem, owing to the discrete node-community memberships and the fact that communities may overlap.
Since the bipartite graph representation of the AGM yields a difficult combinatorial problem for likelihood maximisation, we relax the AGM by giving the memberships continuous strengths.
This turns the bipartite graph we had previously into a weighted complete graph[^8] in which the membership strengths of node $u$ are represented as a row vector, $F_u$; the membership strength of node $u$ to community $A$, for example, would be an element of this vector, $F_{uA}$, where $F_{uA} = 0$ would indicate no membership of $u$ to $A$.
Given the above formulation, the probability of nodes $u$ and $v$ linking (i.e. the edge $(u, v)$ existing) is proportional to the strength of the shared memberships, which can be computed via the dot product of the nodes’ membership strength vectors
\[P(u, v) = 1 - \exp(-F_u \cdot F_v^T)\]For the purposes of maximising the likelihood, it is sufficient that the above quantity is proportional to the target quantity.
Given a network, $G(V, E)$, we maximise the log-likelihood, $l(F)$, of all the edges occurring and all the non-edges not occurring
\[l(F) = \sum_{(u, v) \in E} \log (1 - \exp(-F_u \cdot F_v^T)) - \sum_{(u, v) \notin E} F_u \cdot F_v^T\]For optimization, we start with a random AGM, $F$, and update the membership strength, $F_{uC}$, of a node $u$ to community $C$, while fixing the memberships of all other nodes. This procedure takes linear time in the degree of node $u$, as shown in the sketch below.
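A sketch of one such update, following the gradient of $l(F)$ with respect to the row $F_u$: the sum over non-neighbours is recovered from a cached sum over all rows, $\sum_{v \notin \mathcal{N}(u)} F_v = \sum_v F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$, which is what makes the step linear in the degree of $u$ (the step size and projection details are illustrative):

```python
import numpy as np

def update_row(F: np.ndarray, u: int, neighbours: list,
               sum_all: np.ndarray, lr: float = 0.01) -> None:
    """One projected gradient-ascent step on F_u, all other rows fixed.
    `sum_all` is the cached column sum of F, maintained by the caller."""
    # -(sum of F_v over non-neighbours of u), via the cached total
    grad = -(sum_all - F[u] - F[neighbours].sum(axis=0))
    for v in neighbours:
        dot = F[u] @ F[v]
        grad += F[v] * np.exp(-dot) / (1 - np.exp(-dot) + 1e-12)
    F[u] = np.maximum(F[u] + lr * grad, 0.0)  # project onto F_u >= 0
```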
For full details of the above formulation see Jaewon Yang and Jure Leskovec (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. Proceedings of the sixth ACM international conference on Web search and data mining. https://cs.stanford.edu/people/jure/pubs/bigclam-wsdm13.pdf.
References and Notes
This post is largely based on Lecture 4: Community Structure in Networks from Stanford CS224W - Machine Learning with Graphs, specifically the Fall 2019 Lectures. Further materials for study are signposted here.
[^1]: Given a set, $S$, a partition is a collection of disjoint subsets, $A_i$ for $i = 1, \dots, k$ (with $k \leq \vert S \vert$), of $S$ whose union is $S$, i.e. $\cup_i A_i = S$. Two subsets, $A_i$ and $A_j$, are disjoint if they are non-overlapping, i.e. $A_i \cap A_j = \emptyset$ for $i \neq j$. An equivalent definition of a partition is a function mapping each element of the set onto exactly one subset.
[^2]: Recall, we in fact defined modularity given a certain grouping of proposed communities.
[^3]: Indeed, if they were, the modularity metric would be useless.
[^4]: See Section 1.2.2 of the notes from Lecture 4 of Network Analysis and Modeling, CSCI 5352 by Prof. Aaron Clauset (2017) for the particular version of the configuration model we’ll use, which permits a multigraph structure.
[^5]: Blondel, Vincent D; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. https://arxiv.org/pdf/0803.0476.pdf
[^6]: I inserted paragraph breaks to separate out the different points.
[^7]: We could run the appropriate number of passes to obtain a desired hierarchical depth, but the Louvain algorithm is suited to returning a natural (according to the data) number of clusters, so terminating the algorithm prematurely at a desired depth may not return clusters with a high modularity given the number of clusters.
[^8]: The bipartite graph is said to be complete in that connections (possibly with very low weight) exist between all communities and nodes.