3 SYSTEM DESIGN
3.1 System Architecture
The system architecture is shown in Figure 1.
Figure 1: System Architecture.
Process P1 implements a focused crawler module, briefly
discussed in Section 3.2, that collects all required
information for each publication retrieved via ACM
Portal. P2 analyzes the information collected in P1 to
construct a set of weighted graphs representing
associations of different strength among index terms.
P3 computes all maximal weighted cliques in the graphs
constructed in P2. We describe processes P2 and P3 in
Section 3.3. The cliques represent the likelihood that
researchers working in a field characterized by a subset
of the index terms in a clique are also interested in
the other index terms of that clique. P7 provides a
component for visualizing the maximal weighted cliques
identified in P3. Processes P4, P5 and P6 implement a
meta-search engine application that allows the
evaluation of our ranking algorithm. The system provides
a search interface through which users submit queries
related to their areas of expertise. The queries are
initially queued and later re-submitted to ACM Portal in
order to retrieve the default top 10 results it
produces, together with their original ranking order.
Users also provide feedback on the quality and relevance
of the ranking of the results, which allows the two
ranking approaches to be compared.
3.2 Focused Crawler
We developed a module that extracts the publication
information (including index terms, authors, abstract
and publication date) for approximately 10,000 papers
by 15,457 authors available in ACM Portal.
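As an illustration of the kind of record such a module could hand to the later processes, the following is a minimal sketch. The class name `Publication`, its field names and the sample values are hypothetical and are not taken from our implementation; only the four kinds of information listed above are modelled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Publication:
    """One crawled paper (hypothetical record layout, for illustration only)."""
    authors: tuple[str, ...]      # author names as listed on the paper
    index_terms: tuple[str, ...]  # classification index terms assigned to the paper
    abstract: str
    publication_date: date


# Made-up sample record; the values are placeholders, not crawled data.
record = Publication(
    authors=("alice", "bob"),
    index_terms=("t1", "t2"),
    abstract="Placeholder abstract text.",
    publication_date=date(2010, 5, 1),
)
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch: the later graph-construction steps only read the collected data.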
3.3 Graph Model
Based on the collected papers, we constructed
different types of graphs representing different types
of associations among index terms.
In a Type I graph, two index terms t1 and t2 are
connected by an edge (t1, t2) with weight w if and
only if there are exactly w papers in the collection
indexed under both index terms t1 and t2. A Type I
graph represents the strongest type of association
between a pair of index terms: the fact that both terms
appear together in the same paper reveals a strong
affinity between the topics in the area of interest of
that paper.
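The Type I definition can be sketched in a few lines of Python. This is a minimal illustration, not our actual implementation: the function name `build_type1_graph`, the input layout (one list of index terms per paper) and the term labels are assumptions, and a pair of terms with no common paper is taken to have no edge.

```python
from collections import Counter
from itertools import combinations


def build_type1_graph(papers):
    """Return {(t1, t2): w}, where w is the number of papers indexed under both terms."""
    weights = Counter()
    for terms in papers:
        # Deduplicate the terms of a paper and sort them so that every
        # unordered pair is stored under a single canonical key.
        for t1, t2 in combinations(sorted(set(terms)), 2):
            weights[(t1, t2)] += 1
    return dict(weights)


# Hypothetical toy collection: each inner list holds the index terms of one paper.
papers = [
    ["t1", "t2"],
    ["t1", "t2", "t3"],
    ["t3", "t4"],
]
type1 = build_type1_graph(papers)
```

In this toy collection the edge (t1, t2) receives weight 2, because two papers carry both terms, while t1 and t4 never co-occur and therefore share no edge.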
In a Type II graph, two index terms t1 and t2 are
connected by an edge (t1, t2) with weight w if and
only if there are w distinct authors that have
published at least one paper where t1 appears but not
t2 and also at least one paper where t2 appears but
not t1. A Type II graph represents the second strongest
type of association: it relates index terms that fall
within the general area of interest of the same
researcher.
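The Type II definition can be sketched similarly. Again this is only an illustration under stated assumptions: the function name `build_type2_graph`, the input layout (a pair of author list and term list per paper) and the author and term labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_type2_graph(papers):
    """Return {(t1, t2): w}, where w counts the distinct authors who have a paper
    with t1 but not t2 and also a paper with t2 but not t1."""
    term_sets_by_author = defaultdict(list)
    for authors, terms in papers:
        for author in set(authors):
            term_sets_by_author[author].append(frozenset(terms))

    weights = defaultdict(int)
    for term_sets in term_sets_by_author.values():
        all_terms = sorted(set().union(*term_sets))
        for t1, t2 in combinations(all_terms, 2):
            has_t1_only = any(t1 in s and t2 not in s for s in term_sets)
            has_t2_only = any(t2 in s and t1 not in s for s in term_sets)
            if has_t1_only and has_t2_only:
                weights[(t1, t2)] += 1  # each author contributes at most once per pair
    return dict(weights)


# Hypothetical toy collection: (authors, index terms) for each paper.
papers = [
    (["alice"], ["t1"]),
    (["alice", "bob"], ["t2"]),
    (["bob"], ["t1", "t2"]),
    (["carol"], ["t1"]),
    (["carol"], ["t2", "t3"]),
]
type2 = build_type2_graph(papers)
```

Here alice and carol both connect t1 and t2, whereas bob does not: his only paper containing t1 also contains t2.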
3.3.1 Maximal Weighted Cliques
To examine the strongest types of index term
associations as well as their evolution over time,
we constructed a set of graphs of both types for a
set of different 5-year periods. Graphs representing
more recent periods are considered more relevant
than older graphs. Similarly, graphs
representing Type I associations are more important
than Type II. For each of these graphs, our system
computes all maximal weighted cliques, where we define
a maximal clique of minimum weight $w_0$ in the graph
$G$ to be any maximal clique $c$ such that for each
pair of nodes $v_1$, $v_2$ in $c$ there is an
(undirected) arc $e = (v_1, v_2)$ with weight
$w_e \geq w_0$. Computing all cliques in a graph is
an intractable problem (Garey and Johnson, 1979)
in both time and space, but in our case the
constructed graphs are of modest size, limited to
around 300 nodes each (the total number of index
terms specified by the ACM Classification scheme).
Furthermore, our algorithms take into consideration
only the strongest edges (those whose weight exceeds
a certain threshold). Given these restrictions, we
implemented a recursive algorithm following (Bron and
Kerbosch, 1973) that computes all maximal weighted
cliques for all graphs in our databases in less than
5 minutes of CPU time.
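The enumeration step can be sketched as follows. This is the basic recursive Bron-Kerbosch scheme without pivoting, applied to the graph restricted to edges of weight at least the threshold; it is an illustration, not our production code. The function name `maximal_cliques_min_weight`, the edge-dictionary input and the sample weights are assumptions, and the sketch reads the definition above as "maximal cliques of the thresholded graph". Nodes left without any sufficiently heavy edge are dropped, so single-node cliques are not reported.

```python
def maximal_cliques_min_weight(weights, w0):
    """Enumerate the maximal cliques of the graph that keeps only edges with weight >= w0.

    weights maps an unordered pair (u, v), stored once, to its edge weight.
    """
    # Build adjacency sets from the edges that pass the threshold.
    adj = {}
    for (u, v), w in weights.items():
        if w >= w0:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    if not adj:
        return []

    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend r,
        # x: nodes already explored that would also extend r.
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical toy graph with made-up weights.
weights = {
    ("t1", "t2"): 5,
    ("t1", "t3"): 4,
    ("t2", "t3"): 6,
    ("t3", "t4"): 3,
    ("t1", "t4"): 1,
}
cliques = maximal_cliques_min_weight(weights, w0=3)
```

With threshold 3 the edge (t1, t4) is discarded, leaving the two maximal cliques {t1, t2, t3} and {t3, t4}. Raising the threshold prunes more edges, which is what keeps the search small on graphs of a few hundred nodes.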
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods