2 GRAPHS AND COMMUNITIES
A graph G(V,E) is defined as a set V of n vertices,
also called nodes, and a set E of m edges connecting
pairs of vertices. Edges may have a weight
representing a degree of relationship between nodes,
such as the strength of the connection or the length
of the path between the two nodes.
Nodes connected by an edge are said to be
neighbours.
Unweighted graphs can be thought of as a special
class of weighted graphs in which edges can only have
weight 0 or 1.
A graph is dense if the number of edges is close
to $n^2$, otherwise it is sparse.
A graph can be directed or undirected: given a
pair of nodes, the graph is directed if the order of
nodes in the pair matters, i.e. the edge starts from the
first node and ends on the second. Otherwise, if the
order does not matter, the graph is said to be
undirected.
One way to store the edges is by using the
adjacency matrix, an n by n matrix whose cell in the
i-th row and j-th column contains the weight of the
edge from node i to node j. Obviously, the adjacency
matrix of an undirected graph is symmetric.
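As an illustration of the adjacency matrix described above, the following minimal sketch builds the matrix from a list of weighted edges and checks the symmetry property. The helper names (`adjacency_matrix`, `is_symmetric`) and the three-node example are assumptions made for this illustration, not part of the original work.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n-by-n adjacency matrix from (i, j, weight) triples."""
    A = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        A[i][j] = w          # weight of the edge from node i to node j
        if not directed:
            A[j][i] = w      # undirected: the order of the pair does not matter
    return A


def is_symmetric(A):
    """Return True if A[i][j] == A[j][i] for every pair of nodes."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


# Hypothetical three-node example.
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 0.5)]
A_undirected = adjacency_matrix(3, edges)
A_directed = adjacency_matrix(3, edges, directed=True)
```

Here `is_symmetric(A_undirected)` holds, while `A_directed` stores each weight only in the direction of its edge and is therefore not symmetric.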
The degree k of a node is the number of edges
connecting it to other vertices. In a weighted graph,
the strength (or weighted degree) of a node is the
sum of the weights of those edges.
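Both quantities can be read directly from a row of the adjacency matrix. The sketch below assumes an undirected graph without self-loops, stored as a list of lists; the function names and the example matrix are illustrative only.

```python
def degree(A, i):
    """Number of edges incident to node i (non-zero entries in row i)."""
    return sum(1 for w in A[i] if w != 0)


def strength(A, i):
    """Weighted degree of node i: sum of the weights in row i."""
    return sum(A[i])


# Hypothetical undirected weighted graph on three nodes.
A = [
    [0.0, 2.0, 0.5],
    [2.0, 0.0, 1.0],
    [0.5, 1.0, 0.0],
]

deg0 = degree(A, 0)      # node 0 has two neighbours
str0 = strength(A, 0)    # 2.0 + 0.5
```

For an unweighted graph (weights 0 or 1) the two functions return the same value.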
A path is an ordered sequence of edges where
each edge starts from the end of the previous one.
A component in a graph is a maximal set of nodes
that can be reached from each other using paths
(Diestel, 2005). A graph is partitioned if it is
composed of more than one component.
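One common way to find the components is a breadth-first search started from every node that has not been visited yet. The sketch below assumes an undirected graph given as an adjacency matrix; the function name and the five-node example are hypothetical.

```python
from collections import deque


def components(A):
    """Return the components of an undirected graph as sorted lists of nodes."""
    n = len(A)
    seen = [False] * n
    result = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collecting every node reachable from `start`.
        seen[start] = True
        queue = deque([start])
        comp = []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in range(n):
                if A[u][v] != 0 and not seen[v]:
                    seen[v] = True
                    queue.append(v)
        result.append(sorted(comp))
    return result


# Hypothetical graph with edges 0-1, 1-2 and 3-4.
A = [[0] * 5 for _ in range(5)]
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i][j] = A[j][i] = 1

comps = components(A)
```

In this example `comps` contains two components, `[0, 1, 2]` and `[3, 4]`, so the graph is partitioned in the sense defined above.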
A community is defined as a subset of nodes
having more edges leading to members of the same
community than to other nodes in the graph. The
term community comes from the original application
of this concept to social networks; however,
community detection is now used to assess
robustness of network infrastructures and to analyse
interaction networks.
This definition of community is rather vague, so
a mathematical measure is needed in order to
compare different assignments of nodes to
communities in a graph.
Accordingly, different approaches to community
detection have been developed (Fortunato, 2010),
including clustering techniques such as k-means,
spectral methods, the maximization of a target
function and even game-theoretic algorithms.
Both spectral methods and k-means require a priori
knowledge of the number of communities, but we
wanted an algorithm able to detect the communities
automatically, without having to set such a parameter.
We decided to focus on modularity optimization
algorithms, because they do not require the number
of communities as a parameter and are mostly
deterministic.
2.1 Modularity
Given a graph containing nodes belonging to a set of
communities, the modularity measure (Newman,
2004; Newman, 2006) evaluates how well
connected the nodes inside a community are with
respect to the other nodes, using the following
formula:
\[
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{1}
\]
Q is the modularity; i and j are nodes; A is the
adjacency matrix of the graph; k is the (weighted)
degree of a node; m is half of the sum of all the
elements of A; $c_i$ is the community of node i; $\delta$ is
a function returning 1 if the communities passed as
parameters are the same, 0 otherwise.
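To make the formula concrete, the following sketch evaluates equation (1) term by term for a given assignment of nodes to communities. It is a direct, quadratic-time transcription intended for small examples, assuming an undirected graph stored as a symmetric adjacency matrix; the function names and the two-triangle graph are assumptions made for illustration.

```python
def modularity(A, community):
    """Modularity Q of an assignment, following equation (1).

    A         -- symmetric adjacency matrix (list of lists of weights)
    community -- community[i] is the community label of node i
    """
    n = len(A)
    k = [sum(row) for row in A]   # (weighted) degree of each node
    two_m = sum(k)                # sum of all the elements of A, i.e. 2m
    if two_m == 0:
        return 0.0                # convention chosen here for an edgeless graph
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:   # delta(c_i, c_j) = 1
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


def two_triangles():
    """Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge 2-3."""
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    A = [[0.0] * 6 for _ in range(6)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    return A


A = two_triangles()
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0] * 6)             # every node in one community
```

With m = 7, placing each triangle in its own community gives Q = 5/14 (about 0.357), whereas putting all nodes in a single community gives Q = 0. For a graph consisting of a single edge whose two endpoints are placed in different communities, the same function returns -0.5.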
The modularity value ranges from -0.5 to 1. In
theory, maximizing the modularity means that the
best partitioning of the graph has been found.
However, modularity maximization is a
computationally hard problem (Brandes et al., 2008),
and this definition of modularity is limited in its
ability to detect small communities (Fortunato and
Barthélemy, 2007).
2.2 Modularity Maximization
There are different algorithms for modularity
maximization: the original algorithm by Girvan and
Newman (Girvan and Newman, 2002; Newman and
Girvan, 2004) is too expensive in terms of
computational complexity given the size of our
clusters.
We then focused on an algorithm known as the
Louvain method.
2.2.1 Louvain Method
The Louvain method (Blondel et al., 2008) is a
greedy algorithm for modularity maximization. The
greedy approach uses a heuristic that locally
maximizes the modularity of the next state.
The algorithm starts with each node assigned to
its own community. It then proceeds as
follows: