PUBSEARCH
A Hierarchical Heuristic Scheme for Ranking Academic Search Results
Emanouil Amolochitis¹,², Ioannis T. Christou¹ and Zheng-Hua Tan³
¹ Athens Information Technology, 19Km Markopoulou Ave., PO Box 68, Paiania 19002, Greece
² CTiF, Aalborg University, Aalborg, Denmark
³ Dept. of Electronic Systems, Aalborg University, Aalborg, Denmark
Keywords: Academic Search, Search and Retrieval, Heuristic Document Re-ranking.
Abstract: In this paper we present PubSearch, a meta-search engine system for academic publications. We have
designed a ranking algorithm consisting of a hierarchical set of heuristic models including term frequency,
depreciated citation count and a graph-based score for associations among paper index terms. We used our
algorithm to re-rank the default search results produced by online digital libraries such as ACM Portal in
response to specific user-submitted queries. The experimental results show that the ranking algorithm used
by our system can provide a more relevant ranking scheme compared to ACM Portal.
1 INTRODUCTION
In this paper, we introduce PubSearch, a meta-search
engine that uses a hierarchical ranking algorithm to
re-rank the search results produced by available
online digital libraries such as ACM Portal that
provide a consistent scheme for indexing academic
publications.
After examining a set of more than ten thousand
publications retrieved from ACM Portal we have
constructed a set of graphs representing different
types of associations among index terms. In the
constructed graphs we have identified maximal
weighted cliques that represent frequently-
appearing, strongly-related index terms. Our ranking
algorithm uses these graphs so as to identify the
matching degree of a publication’s index terms
against the formed cliques.
Our system uses a hierarchical three-level
ordering of the search results; each level orders the
results and then clusters them together into buckets
based on different properties examined at each level.
Every level in the hierarchy (except the top) re-ranks
the results contained in each bucket produced by its
immediate higher level and places them in finer-
grain buckets resulting in an alternative ranking
order at the end of the process.
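The three-level ordering can be illustrated with a minimal sketch. It assumes that each result already carries one score per level and that buckets have a fixed width; the names `Result`, `bucket` and `rerank`, as well as the bucket widths, are hypothetical and are not taken from the PubSearch implementation (the actual bucket ranges are learned in a training phase).

```python
from dataclasses import dataclass
from math import floor


@dataclass
class Result:
    title: str
    tf: float      # level 1: normalized term-frequency score
    dcc: float     # level 2: depreciated citation count score
    clique: float  # level 3: clique matching score


def bucket(score: float, width: float) -> int:
    """Index of the fixed-width bucket a score falls into (assumed bucket shape)."""
    return floor(score / width)


def rerank(results, tf_width, dcc_width):
    """Order by TF bucket, then by DCC bucket inside each TF bucket,
    then by clique score inside each DCC bucket (best first)."""
    return sorted(
        results,
        key=lambda r: (bucket(r.tf, tf_width), bucket(r.dcc, dcc_width), r.clique),
        reverse=True,
    )


results = [
    Result("A", tf=0.012, dcc=40.0, clique=1.5),
    Result("B", tf=0.018, dcc=3.0, clique=9.0),
    Result("C", tf=0.034, dcc=1.0, clique=0.2),
    Result("D", tf=0.011, dcc=44.0, clique=2.5),
]
ranking = [r.title for r in rerank(results, tf_width=0.01, dcc_width=10.0)]
print(ranking)
```

In this toy input, A, B and D share a TF bucket; D and A then share the higher DCC bucket and are separated only by their clique scores, while B's high clique score cannot lift it out of its lower DCC bucket.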
2 RELATED WORK
CiteData (Harpale et al., 2010) addresses the
problem of lack of consistent datasets in the field of
personalized search for academic publications and
also shows that personalized search algorithms for
academic publications outperform non-personalized
methods.
In (Newman, 2001, 2004) the author shows that
different types of scientific networks reveal certain
collaboration patterns. Similarly in (Barabási et al.,
2001) the authors examine a number of journals to
identify network evolution and topology as well as
patterns of co-authorship at specific points in time.
The authors in (Liben-Nowell, 2007) introduce an
approach that examines collaboration network
topology as well as network member proximity in order
to predict the likelihood of future interactions.
In (Aljaber et al., 2009) the authors present a
publication representation scheme that attempts to
identify the important index terms covered by journal
articles; it does so by determining each publication's
context through an examination of relevant
synonymous vocabulary.
The aforementioned methods reveal that
examining network structure and topology as well as
attempting to identify the presence of clusters in
such networks can provide useful background
knowledge that can be utilized in information
retrieval applications.
Amolochitis E., T. Christou I. and Tan Z. (2012).
PUBSEARCH - A Hierarchical Heuristic Scheme for Ranking Academic Search Results.
In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pages 509-514
DOI: 10.5220/0003704705090514
Copyright © SciTePress
3 SYSTEM DESIGN
3.1 System Architecture
The system architecture is shown in Figure 1.
Figure 1: System Architecture.
Process P1 implements a focused crawler module
(briefly discussed in sub-section 3.2) that collects
all required information for each publication
retrieved via ACM Portal. Process P2 analyzes the
information collected by P1 in order to construct a
set of weighted graphs representing associations of
different strength among index terms. Process P3
computes all maximal weighted cliques identified in
the graphs constructed in P2. We describe processes
P2 and P3 in section 3.3. The cliques represent the
likelihood that researchers involved in a field
characterized by a subset of the index terms in a
clique might also be interested in the other index
terms of that clique. P7 provides a component for
visualizing the maximal weighted cliques identified
in P3. Processes P4, P5 and P6 implement a meta-
search engine application that allows the evaluation
of our ranking algorithm. The system provides a
search interface so that users can submit queries
related to areas of their expertise. The queries are
initially queued and later re-submitted to ACM Portal
in order to retrieve the default top 10 results produced
by the latter as well as their original ranking order.
The users also provide feedback evaluations on the
quality and relevance of the results’ ranking, which
allows the comparison of the two different ranking
approaches.
3.2 Focused Crawler
We have developed a module that extracted all
publication information for approximately 10000
papers of 15457 authors available in ACM Portal,
including index terms, authors, abstract and
publication date.
3.3 Graph Model
Based on the collected papers, we constructed
different types of graphs representing different types
of associations among index terms.
In a Type I graph, two index terms t1 and t2 are
connected by an edge (t1, t2) with weight w if and
only if there are exactly w papers in the collection
indexed under both index terms t1 and t2. Type I
graph represents the strongest type of association of
a pair of index terms; the fact that both terms appear
together in the same paper reveals a strong affinity
among the topics in the area of interest of the
particular paper.
In a Type II graph, two index terms t1 and t2 are
connected by an edge (t1, t2) with weight w if and
only if there are w distinct authors that have
published at least one paper where t1 appears but not
t2 and also at least one paper where t2 appears but
not t1. Type II graphs represent the second strongest
type of association: they reveal a relation among
index terms that fall within the general area of
interest of the same researcher.
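As an illustration of the two definitions, the following minimal sketch builds both edge-weight maps from a small in-memory collection. The function name `build_graphs` and the input layout (one pair of author set and index-term set per paper) are assumptions made for this example and do not describe the actual PubSearch data model.

```python
from collections import defaultdict
from itertools import combinations


def build_graphs(papers):
    """papers: iterable of (authors, index_terms) pairs, each a set of strings.
    Returns (type1, type2): dicts mapping a sorted term pair to its edge weight."""
    type1 = defaultdict(int)
    terms_by_author = defaultdict(list)  # author -> list of per-paper term sets
    for authors, terms in papers:
        for pair in combinations(sorted(terms), 2):
            type1[pair] += 1             # one more paper indexed under both terms
        for author in authors:
            terms_by_author[author].append(set(terms))

    type2 = defaultdict(int)
    for term_sets in terms_by_author.values():
        vocabulary = sorted(set().union(*term_sets))
        for t1, t2 in combinations(vocabulary, 2):
            only_t1 = any(t1 in s and t2 not in s for s in term_sets)
            only_t2 = any(t2 in s and t1 not in s for s in term_sets)
            if only_t1 and only_t2:
                type2[(t1, t2)] += 1     # one more author linking t1 and t2
    return dict(type1), dict(type2)


papers = [
    ({"alice"}, {"graphs", "search"}),
    ({"alice", "bob"}, {"search", "ranking"}),
    ({"bob"}, {"graphs"}),
    ({"carol"}, {"graphs", "search"}),
]
type1, type2 = build_graphs(papers)
print(type1)
print(type2)
```

Here "graphs" and "search" co-occur in two papers (Type I weight 2), whereas "graphs" and "ranking" never share a paper but are linked by two authors who published on each of them separately (Type II weight 2).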
3.3.1 Maximal Weighted Cliques
In order to examine the strongest types of index term
associations as well as their evolution in the time
dimension we have constructed a set of graphs of the
above mentioned different types for a set of different
5-year periods. Graphs representing more recent
periods are considered as more relevant when
compared to older graphs. Similarly graphs
representing type I associations are more important
than type II. For each of the aforementioned graphs,
our system computes all maximal weighted cliques,
where we define a maximal clique of minimum weight
$w_0$ in the graph $G$ to be any maximal clique $c$
such that for each pair of nodes $v_1$, $v_2$ in $c$
there is an (undirected) arc $e = (v_1, v_2)$ with
weight $w_e \geq w_0$. Computing all cliques in a graph is
an intractable problem (Garey and Johnson, 1979)
both in time and in space complexity, but in our
case, the constructed graphs are of very reasonable
size limited to around 300 nodes (total number of
index terms as specified by the ACM Classification
scheme) in each of the graphs. Furthermore our
algorithms take into consideration only the strongest
of edges (whose weight exceeds a certain threshold).
Given these restrictions, we implemented a recursive
algorithm following (Bron and Kerbosch, 1973) that
computes all maximally weighted cliques for all
graphs in our databases in less than 5 minutes of
CPU time.
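A minimal sketch of this step is given below, assuming that the edge threshold is applied first and that the basic (non-pivoting) form of the Bron and Kerbosch recursion is then run on the remaining graph. The function name `maximal_cliques` and the edge representation are hypothetical; the sketch does not reproduce the actual implementation or its clique weight score.

```python
def maximal_cliques(edges, w0):
    """edges: dict mapping (u, v) to an edge weight.
    Returns the maximal cliques of the subgraph that keeps only
    edges with weight >= w0 (nodes left without edges are ignored)."""
    adj = {}
    for (u, v), w in edges.items():
        if w >= w0 and u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate nodes, x: already explored nodes
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended: maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


edges = {
    ("a", "b"): 5, ("a", "c"): 4, ("b", "c"): 6,
    ("c", "d"): 3, ("d", "e"): 1,
}
found = maximal_cliques(edges, w0=3)
print(found)
```

With threshold 3 the weak edge (d, e) is dropped, so the sketch reports the cliques {a, b, c} and {c, d}. For graphs of about 300 nodes a pivoting variant would reduce the number of recursive calls, but the basic form is enough to show the idea.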
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods
Figure 2: Interactive Graph Visualization.
In order to visualize the strongest maximal
weighted cliques in the constructed graphs we used
Prefuse’s Information Visualization Toolkit (Heer et
al., 2005). These visualizations allow for an
interactive view of the most important types of
associations among strongly connected index terms
of interest.
A visualization of type I graphs is shown in
Figure 2 (http://hermes.ait.gr/scholarGraph/index).
3.4 Ranking Heuristic Hierarchy
The hierarchy of heuristics is shown in Figure 3.
Figure 3: Heuristic Hierarchy.
Initially the algorithm calculates the total term
frequency of all query terms appearing in each
publication result and normalizes the term frequency
value by dividing over the total number of terms of
the particular publication. The algorithm then
clusters together all results based on their TF score
into buckets of specific range (that is automatically
learned in a training phase of the system).
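The first level can be sketched as follows, under the simplifying assumptions that the text is tokenized on alphanumeric runs, that quoted keyword sequences are treated as independent terms, and that TF buckets have a fixed width. The names `normalized_tf` and `tf_bucket` and the width value are hypothetical.

```python
import re


def normalized_tf(query, text):
    """Total occurrences of the query terms in the text,
    divided by the total number of terms in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    hits = sum(1 for tok in tokens if tok in query_terms)
    return hits / len(tokens)


def tf_bucket(score, width):
    """Index of the fixed-width TF bucket (the width is an assumed parameter)."""
    return int(score // width)


text = "Graph search ranks graph nodes; search is fast."
score = normalized_tf("graph search", text)
print(score, tf_bucket(score, width=0.2))
```

The sample text has eight terms, four of which match the query, so the normalized score is 0.5.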
For each set of results that fall inside any given
TF bucket range, the algorithm performs another re-
ordering of these results, this time using as criterion
a depreciated citation count score. In principle, we
want to promote high impact recent publications at
the expense of older publications that may have
higher overall citation count but could be considered
as outdated. For this purpose we have introduced a
depreciated citation count formula that is defined as
a function of a publication’s citation count
depreciated by the years passed since its publication
date.
In this level therefore, the results within each TF
bucket (from the previous step) are ordered and
clustered together into new finer-grain buckets
(called DCC buckets) of specified range (also
learned during the training phase), according to a
depreciated citation count score, calculated for each
paper using the following formulae:

$$ c_p = n_p \, d_p \qquad (1) $$

$$ d_p = 1 - \frac{1 + \tanh\left(\frac{y_p - 10}{4}\right)}{2} \qquad (2) $$

where $n_p$ is the number of citations for the
specific paper $p$ according to Google Scholar, $y_p$ is
the number of years passed since the publication of
the paper $p$, $d_p$ is the resulting depreciation
factor, and $c_p$ is the (time-depreciated) citation-
based score for $p$.
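The effect of the depreciation can be checked numerically with a short sketch. It assumes that the depreciation factor reads d_p = 1 - (1 + tanh((y_p - 10) / 4)) / 2, which is our reading of Equation (2), and that c_p = n_p * d_p as in Equation (1); the function names are hypothetical.

```python
from math import tanh


def depreciation(years):
    """Assumed form of d_p: close to 1 for new papers, 0.5 at ten years,
    close to 0 for papers older than about twenty years."""
    return 1.0 - (1.0 + tanh((years - 10.0) / 4.0)) / 2.0


def depreciated_citations(citations, years):
    """c_p = n_p * d_p."""
    return citations * depreciation(years)


old_paper = depreciated_citations(200, years=20)
recent_paper = depreciated_citations(30, years=2)
print(old_paper, recent_paper)
```

Under this reading, a 20-year-old paper with 200 citations receives a lower score than a 2-year-old paper with 30 citations, which is the intended promotion of recent high-impact work.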
After the second-level clustering of the results
completes, we perform a final ordering of the results
within each DCC bucket by calculating the degree of
matching of each result’s index terms with the
maximal weighted cliques of all constructed graphs.
The calculation details are as follows.
Let $C$ be the set of all cliques to examine. Let $c_i$
denote the total number of index terms in clique $i$.
Let $d$ denote the total number of index terms of
publication $p$ and $p_i$ denote the total number of index
terms of publication $p$ that belong to clique $i$; for
each clique $i \in C$ the system calculates the
matching degree of all publication index terms with
those of the clique. In cases of a perfect match
(meaning that all index terms of $i$ appear as index
terms of $p$), in order to avoid bias towards
publications with a big number of index terms
against cliques with a small number of index terms,
we calculate the percentage match $m_i$ as follows:

$$ m_i = \frac{c_i}{d} \qquad (3) $$

For all remaining cases (non-perfect match) the
percentage matching is calculated using:

$$ m_i = \frac{p_i}{c_i} \qquad (4) $$
If $m_i > t$, where $t$ is a configurable threshold for
the accepted matching level (in our case $t = 0.75$), the
process continues; otherwise the system stops processing
the current clique and moves to the next one. When
the matching level is above $t$, the system
calculates a weight score $w_{p,i}$ representing the overall
value of the association of $p$ with clique $i$ as follows:

$$ w_{p,i} = w_i \times m_i \times es \times ac_i \qquad (5) $$

where $w_i$ is the weight score of the examined
maximal weighted clique $i$, and $ac_i$ is a score
related to the association type represented by the
graph to which the current clique belongs ($ac_i = 1$
for association type I, $ac_i = 0.6$ for type II). Finally,
$es$ is an exponential smoothing factor that
depreciates cliques of graphs covering older periods
in order to promote more recent ones. Since each
type of graph has a different significance, we
consider recent graphs of stronger association types
as more significant and thus we assign greater value
to maximal weighted cliques of such graphs.
The algorithm calculates for each publication a
total clique matching score $S_p$, which corresponds to
the sum of the matching scores of the publication’s index
terms with all maximal weighted cliques, and
determines the final ranking of the results
accordingly.

$$ S_p = \sum_{i \in C} w_{p,i} \qquad (6) $$
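The third-level computation can be sketched as follows. The sketch assumes that the matching degree is c_i / d for a perfect match and p_i / c_i otherwise, that only cliques with a matching degree strictly above t contribute, and that each contribution is the product of the clique weight, the matching degree, the smoothing factor es and the association score ac_i. The function names and the tuple layout of a clique are hypothetical, and the clique weights and es values in the example are made up.

```python
def match_degree(pub_terms, clique_terms):
    """Matching degree m_i of a publication's index terms against one clique.
    Both arguments are assumed to be non-empty sets."""
    shared = len(pub_terms & clique_terms)
    if shared == len(clique_terms):               # perfect match
        return len(clique_terms) / len(pub_terms)  # c_i / d
    return shared / len(clique_terms)              # p_i / c_i


def clique_score(pub_terms, cliques, t=0.75):
    """cliques: iterable of (terms, weight, es, ac) tuples.
    Returns the total clique matching score S_p."""
    total = 0.0
    for terms, weight, es, ac in cliques:
        m = match_degree(pub_terms, terms)
        if m > t:                                  # below-threshold cliques are skipped
            total += weight * m * es * ac
    return total


pub_terms = {"a", "b", "c", "d"}
cliques = [
    ({"a", "b", "c"}, 10.0, 1.0, 1.0),            # perfect match, m = 3/4, not above t
    ({"a", "b", "c", "d"}, 8.0, 0.5, 1.0),        # perfect match, m = 1
    ({"a", "b", "c", "d", "e"}, 5.0, 1.0, 0.6),   # partial match, m = 4/5
    ({"a", "x", "y"}, 9.0, 1.0, 1.0),             # partial match, m = 1/3
]
score = clique_score(pub_terms, cliques)
print(score)
```

Only the second and third cliques pass the threshold, contributing 4.0 and 2.4 respectively. The first clique shows the effect of the perfect-match rule: a small clique fully contained in a publication with more index terms is scored by c_i / d, which here does not exceed t.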
4 EXPERIMENTS DESIGN
In order to evaluate our ranking algorithm’s
accuracy we developed a meta-search engine
application that provides a user interface allowing
users to submit queries as in an ordinary search
engine. A number of researchers from different
computer science and electrical engineering
disciplines were asked to submit a number of queries
related to their area of expertise and for consistency
reasons all queries processed consisted of two-to-
four words, with the optional use of quotes for
specifying exact keyword sequences. Since
we need to be able to identify specific users,
registration is required. All submitted user queries
are re-submitted to ACM Portal by our system, and
the default top ten results as well as all related
publication information are extracted. The default
ranking order produced by ACM Portal is saved for
later comparison with the order suggested by our
own ranking algorithm. The system also attempts to
retrieve the full publication text (to be processed
later for calculating the query term frequency score)
in addition to the total number of citations via
Google Scholar.
When all required data are collected, our ranking
algorithm executes and generates an alternative
ranking scheme for the default ten results provided
by ACM Portal. When this process completes the
user is asked to provide relevance feedback (1 to 5
score where 1 stands for “least relevant” and 5
stands for “most relevant”) for the default top ten
results produced by ACM Portal. Since both systems
rank the same set of results, we use the
same feedback scores to evaluate both ranking
algorithms. The total feedback score $s(q)$ for each
submitted query $q$ is calculated as the weighted sum of
the feedback scores for each publication $p_i$ in the
result set using lexicographic ordering:

$$ s(q) = \sum_{i=1}^{n} 2^{n-i} f(p_i) \qquad (7) $$

where $n$ is the number of results and $f(p_i)$,
normalized in [0,1], is the relevance feedback
provided by the user for the publication $p_i$ appearing
in position $i$ in the list of results. This evaluation
scheme reflects the importance that users place on
Figure 4: Computational results. [Plot: horizontal axis, queries 1 to 33; vertical axis ticks from 1000 to 5500; series: PubSearch including max. weighted cliques, PubSearch excluding max. weighted cliques, ACM Portal.]
the top search results as opposed to results lower in
the ranking, and it designates as the stronger
ranking scheme the one that received higher scores
for the results in the highest positions, regardless
of the scores received for results in lower positions
of the results list.
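A minimal sketch of this scoring scheme follows, assuming that the score of a ranked list is the sum over positions i = 1..n of 2^(n-i) times the normalized feedback f(p_i). The function name `feedback_score` is hypothetical, and the mapping of the 1 to 5 user scores into [0,1] is left outside the sketch.

```python
def feedback_score(feedback):
    """feedback: relevance values f(p_i) in [0, 1], listed in ranked order
    (position 1 first). Returns s(q) = sum of 2**(n - i) * f(p_i)."""
    n = len(feedback)
    return sum(2 ** (n - i) * f for i, f in enumerate(feedback, start=1))


relevant_first = feedback_score([1.0, 0.0, 0.0])
relevant_last = feedback_score([0.0, 1.0, 1.0])
print(relevant_first, relevant_last)
```

A single fully relevant result in the first position (score 4.0) outweighs two fully relevant results in positions two and three (score 3.0), which reflects the emphasis on the top of the list.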
5 COMPUTATIONAL RESULTS
In an initial training phase, five volunteer users
(research scientists in the fields of computer science
and electrical engineering) submitted 12 queries in
total and provided feedback evaluations on the
ranking quality of the results. The training phase
resulted in a fine-tuning of the bucket ranges of each
of the three heuristics in the heuristic hierarchy of
our scheme.
We used another set of 33 queries from 12
different experts in computer science and electrical
engineering to measure the effectiveness of the
proposed re-ranking algorithm.
As it turns out, PubSearch compares well with
ACM Portal, and in fact outperforms ACM Portal in
20 out of 33 query instances, sometimes by a
significant margin. In Fig. 4, we compare the results
of ACM Portal against PubSearch with and without
the third and last heuristic in the hierarchy enabled;
as can be seen from the figure, the max. weighted
cliques heuristic also improves the performance of
PubSearch in 20 out of the 33 queries in total.
6 CONCLUSIONS AND FUTURE DIRECTIONS
The results indicate that the traditional information
retrieval metrics based on term frequency are
insufficient to determine accurately the relevance of
a specific publication with respect to a specific
query. On the other hand, term frequency along with
time-depreciated citation count is a good criterion
for the overall current value of a paper and, when
combined with the final clique score, provides an
even better indication of the value of papers of
similar or interdisciplinary nature.
REFERENCES
Aljaber, B., Stokes, N., Bailey, J., Pei, J., 2009. Document
clustering of scientific texts using citation contexts.
Journal of Information Retrieval, 13(2), pp. 101-131.
Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A.,
Vicsek, T., 2001. Evolution of the social network of
scientific collaborations. Physica A: Statistical
Mechanics and its Applications, 311(3-4), pp. 590-614.
Bron, C., Kerbosch, J., 1973. Algorithm 457: finding all
cliques of an undirected graph. Communications of the
ACM, 16(9), pp. 575-577.
Garey, M. R., Johnson, D. S., 1979. Computers and
intractability: A guide to the theory of NP-
Completeness. Freeman, San Francisco, CA.
Harpale, A., Yang, Y., Gopal, S., He, D., Yue, Z., 2010.
CiteData: A new multi-faceted dataset for evaluating
personalized search performance. In: Proc. ACM
Conf. on Information & Knowledge Management
CIKM 10, Oct. 26-30, 2010, Toronto, Canada.
Heer, J., Card, S., Landay, J., 2005. Prefuse: a toolkit for
interactive information visualization. In: Proc.
SIGCHI conference on Human factors in computing
systems.
Newman, M., 2001. The structure of scientific
collaboration networks. In: Proc. National Academy of
Sciences USA, 98, 404-409.
Newman, M., 2004. Coauthorship Networks and Patterns
of Scientific Collaboration. In: Proc. National
Academy of Sciences USA, 101, 5200-5205.
Samudrala, R., Moult, J., 1998. A graph-theoretic
algorithm for comparative modeling of protein
structure. Journal of Molecular Biology, 279(1), pp.
287-302.