the algorithm will still fail to see beyond minor barri-
ers.
We suggest the improvement of local community
identification algorithms in general by adding more
contextual information to their selection criteria. By
making a local algorithm look ahead further than one
edge from the community we decrease the shortsight-
edness of this approach and allow for a more informed
and balanced judgement on community membership,
at the cost of higher computational and situational
crawling complexity. This improvement can be ap-
plied to any local algorithm.
We propose the following example application, in
which we extend the algorithm's knowledge of the
network by offering it nodes one step beyond the
universe. Besides each universe node itself, we also
consider adding that node together with any
combination of its neighbors. This provides insight
into possible local community structure, and its
quality, that lies beyond the community universe. The
set of possible addition sets for community C at
distance 2, denoted as A(C), is defined as follows:
$$A(C) = \bigcup_{u \in U(C)} \{\, X \cup \{u\} \mid X \in \wp(b(u)) \,\} \tag{10}$$
where $b(u) = \{\, v \mid u \to v \,\}$ is the set of
nodes that can be reached from u in one step. Then we
are interested in:
$$\operatorname*{argmax}_{C' \in A(C)} \mu(C \cup C') \tag{11}$$
Note that the computational complexity will in-
crease rapidly as the lookahead distance k is in-
creased, especially in dense networks. Further re-
search should consider different distances for the
lookahead approach, consider the quality and com-
putational complexity tradeoff and determine an opti-
mum (if any).
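Equations (10) and (11) can be sketched in Python as
follows. This is a minimal illustration, not the
implementation used in our tests. It assumes the graph
is a dict mapping each node to its set of neighbors,
that U(C) consists of the nodes adjacent to C but not
in C, and that neighbors already in C are left out of
b(u) because adding them changes nothing. The function
names and the stand-in quality measure
`relative_density` are hypothetical.

```python
from itertools import combinations


def powerset(items):
    """Yield every subset of items, including the empty set, as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def universe(graph, community):
    """U(C), assumed here to be the nodes adjacent to C that are not in C."""
    community = set(community)
    return {v for c in community for v in graph[c]} - community


def addition_sets(graph, community):
    """A(C) as in Eq. (10): each universe node u joined with every subset of b(u)."""
    community = set(community)
    result = set()
    for u in universe(graph, community):
        # Assumption: neighbors of u that are already in C are skipped.
        neighbors = set(graph[u]) - community
        for subset in powerset(neighbors):
            result.add(subset | {u})
    return result


def relative_density(graph, nodes):
    """Stand-in quality measure: internal edges / (internal + outgoing edges)."""
    nodes = set(nodes)
    internal_twice = external = 0
    for n in nodes:
        for m in graph[n]:
            if m in nodes:
                internal_twice += 1  # every internal edge is seen from both ends
            else:
                external += 1
    internal = internal_twice // 2
    total = internal + external
    return internal / total if total else 0.0


def best_lookahead_addition(graph, community, quality):
    """Eq. (11): the addition set C' in A(C) that maximizes quality(C | C')."""
    community = set(community)
    return max(
        addition_sets(graph, community),
        key=lambda extra: quality(graph, community | extra),
        default=None,
    )
```

Each universe node u contributes up to 2^|b(u)|
candidate sets, which makes the growth in cost for
dense networks concrete. On two triangles joined by a
single edge, starting from one triangle, the sketch
selects the bridge node together with the rest of the
second triangle.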
4 VALIDATION
Our improvement to local community identification
algorithms as proposed in section three is validated
by a series of tests on synthetic networks. Since we
aim to verify the improvement of local identification
algorithms in the area where these often struggle we
will generate networks with a low average degree (4,
5 and 8). This will result in networks that contain a
lot of potentially troublesome start nodes. We will
run four tests on every node in the network, varying
the algorithm (regular or improved) and the commu-
nity measure. We measure community quality by the
widely known local modularity (Clauset, 2005) and
relative density (Schaeffer, 2005; Lancichinetti et al.,
2009) definitions.
For our test we generate an undirected network
according to the Barabási–Albert model (Albert and
Barabasi, 2002) and rewire it to create a flat commu-
nity structure as proposed by Bagrow (Bagrow, 2008).
The rewiring is done by creating k sets of nodes (rep-
resenting the communities) in the network and then
rewiring inter-community edges to intra-community
edges while preserving the degree distribution. Our
benchmark networks contain 128 nodes equally di-
vided amongst 4 communities.
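One way to realize a degree-preserving rewiring step
is sketched below. It is an assumption about the
mechanics, not a reproduction of Bagrow's procedure: two
inter-community edges (a, b) and (c, d), with a and c in
one community and b and d in another, are replaced by
the intra-community edges (a, c) and (b, d). All names
and parameters are hypothetical, and nodes are assumed
to be sortable (for example integers).

```python
import random
from collections import Counter


def degree_sequence(edge_set):
    """Count how many edges touch each node."""
    return Counter(node for edge in edge_set for node in edge)


def rewire_to_communities(edges, membership, max_attempts=10_000, seed=0):
    """Swap pairs of inter-community edges into intra-community edges.

    edges: iterable of undirected edges (u, v) without self-loops.
    membership: dict mapping each node to its community id.
    Every swap keeps the degree of all four nodes involved unchanged.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    for _ in range(max_attempts):
        # Edges whose endpoints lie in two different communities.
        inter = sorted(
            tuple(sorted(e)) for e in edge_set
            if len({membership[n] for n in e}) == 2
        )
        if len(inter) < 2:
            break
        (a, b), (c, d) = rng.sample(inter, 2)
        # Orient the second edge so that c shares a's community when possible.
        if membership[c] != membership[a]:
            c, d = d, c
        if membership[a] != membership[c] or membership[b] != membership[d]:
            continue  # the pair does not connect the same two communities
        if a == c or b == d:
            continue  # the swap would create a self-loop
        new_1, new_2 = frozenset((a, c)), frozenset((b, d))
        if new_1 in edge_set or new_2 in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new_1, new_2}
    return edge_set
```

Because a swap is skipped whenever it would duplicate an
existing edge, some inter-community edges can remain.
Comparing `degree_sequence` before and after the call
confirms that the degree distribution is unchanged.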
The quality of a community identification algorithm
is evaluated by quantifying the similarity between the
algorithm output and the synthetic community
structure. We adopt the Jaccard Similarity Coefficient
(JSC), which is defined as the size of the intersection
of both sets divided by the size of their union:
$$\mathrm{JSC}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{12}$$
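Equation (12) translates directly into a few lines of
Python. The function name and the convention for two
empty sets are assumptions made for this sketch.

```python
def jaccard_similarity(found, reference):
    """JSC from Eq. (12): size of the intersection over size of the union."""
    found, reference = set(found), set(reference)
    union = found | reference
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(found & reference) / len(union)
```

As an illustration, with communities of 32 nodes, a
result that consists of a neighboring community plus
the start node shares only that one node with the
reference community, which gives a JSC of 1/64.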
4.1 Results
Running these tests on 50 generated networks yields
the plot of the JSC score frequency shown in Figure
1. In the networks with an average degree of 4 and
5 we observe a significant increase in high similar-
ity and decrease of outliers for both quality measures
when our algorithm improvement is applied. There
is a relatively low gain for the more dense networks
with average degree 8. The plots also show that the re-
sult, while strongly improved, is not perfect yet. We
observe a couple of reasons why even the improved
algorithm is struggling for some start nodes.
First of all, when the boundaries of two communi-
ties are not very sharp and the start node is a bound-
ary node (as defined by the synthetic graph structure)
the algorithm may start off in the wrong direction and
identify the wrong community. Suppose we start with
node v that is a member of community C according
to the synthetic structure. If the algorithm identifies
$C' \cup \{v\}$, where $C'$
is another community defined by the
synthetic structure, then the result of the algorithm
may be a quite strong community. But the similar-
ity measure will yield a bad result because there is
very little overlap between the reference community
and the found community.
Also, the local algorithm may find a strong community
that is a subset of the community it is supposed to
find. The number of nodes that must be agglomerated
before a larger and stronger community emerges may be
too large for the lookahead algorithm to bridge. There
is a tradeoff between the time complexity of the
lookahead algorithm and its effectiveness at
identifying improvements beyond the universe
candidates.
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval