Table 1: Examples of the subsumption procedure.

Bigram                Frequency  Trigram                           Frequency  Covered
adaptable user        9          adaptable user interface          8          Yes
ada programming       9          ada programming environment       2          Yes
                                 ada programming language          2
                                 ada programming support           3
                                 advanced ada programming          2
application software  39         application software development  3          No
                                 application software systems      2
                                 embedded application software     2
                                 mobile application software       2
                                 generic application software      2
row is an example of a cluster formed around the bigram "ada programming". It is covered by the corresponding trigrams but is not eliminated. Analysis of the list of such clusters shows that many bigrams, while covered by some set of trigrams, have a meaning of their own and could potentially serve for topic labeling. The last row is an example of a cluster built around the bigram "application software". The topic designated by this bigram is broad enough that it is not covered by the cluster members.
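The coverage decision illustrated by Table 1 can be sketched in a few lines. This is only an illustration: the exact coverage criterion is not restated in this section, so the rule below (the summed trigram frequencies divided by the bigram frequency, compared against a threshold) and the threshold value 0.8 are assumptions, and the names `coverage_ratio` and `is_covered` are hypothetical. Only the frequencies are taken from Table 1.

```python
# Frequencies copied from Table 1. The coverage rule itself is an assumption.
TABLE_1 = {
    "adaptable user": (9, {"adaptable user interface": 8}),
    "ada programming": (9, {
        "ada programming environment": 2,
        "ada programming language": 2,
        "ada programming support": 3,
        "advanced ada programming": 2,
    }),
    "application software": (39, {
        "application software development": 3,
        "application software systems": 2,
        "embedded application software": 2,
        "mobile application software": 2,
        "generic application software": 2,
    }),
}


def coverage_ratio(bigram_freq, trigram_freqs):
    """Summed frequency of the cluster's trigrams relative to the bigram frequency."""
    return sum(trigram_freqs.values()) / bigram_freq


def is_covered(bigram_freq, trigram_freqs, threshold=0.8):
    # threshold=0.8 is a hypothetical value, not one taken from the paper.
    return coverage_ratio(bigram_freq, trigram_freqs) >= threshold


covered = {bigram: is_covered(freq, trigrams)
           for bigram, (freq, trigrams) in TABLE_1.items()}
```

Under these assumptions the three clusters receive the same labels as in the Covered column of Table 1. A different rule or threshold could reproduce the same three labels, so the sketch should not be read as the criterion actually used.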
5.3 Experiments with Topic Re-ranking
As mentioned above, the data stored in DBLP spans 49 years, from 1959 to 2008. However, Figure 1 shows that scientific activity starts to grow only toward the mid-eighties. We therefore restrict our experiments to topics which appeared no earlier than 1988. (The sharp fall at the right end of the curve is explained by the fact that the data from 2007-2008 had not been completely introduced into the database by the time we downloaded the file.) Additionally, we restrict the minimal topic frequency to 5 for the bigrams and 2 for the trigrams.
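The selection criteria above amount to a simple filter. The following is a minimal sketch, assuming each topic is available as a record with its phrase, the year of its first appearance, and its frequency. The `Topic` class, the `keep_topic` function, and the sample records are hypothetical and serve only to show the thresholds stated in the text.

```python
from dataclasses import dataclass

MIN_FIRST_YEAR = 1988          # topics must appear no earlier than 1988
MIN_FREQUENCY = {2: 5, 3: 2}   # n-gram length -> minimal topic frequency


@dataclass(frozen=True)
class Topic:
    phrase: str
    first_year: int
    frequency: int


def keep_topic(topic):
    """Return True if the topic passes the year and frequency thresholds."""
    n = len(topic.phrase.split())
    min_freq = MIN_FREQUENCY.get(n)
    if min_freq is None:       # neither a bigram nor a trigram
        return False
    return topic.first_year >= MIN_FIRST_YEAR and topic.frequency >= min_freq


# Made-up records, used only to exercise the filter.
candidates = [
    Topic("foo bar", 1990, 5),       # kept
    Topic("foo baz", 1985, 40),      # dropped: appeared before 1988
    Topic("bar baz", 1995, 4),       # dropped: bigram frequency below 5
    Topic("foo bar baz", 1988, 2),   # kept
    Topic("bar baz qux", 2000, 1),   # dropped: trigram frequency below 2
]
selected = [t.phrase for t in candidates if keep_topic(t)]
```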
[Figure: "Publication number distribution by year"; x-axis: Year (1960 to 2000), y-axis: # of Publications (0 to 60000)]
Figure 1: Paper distribution in DBLP from 1959 to 2008.
5.3.1 Results of the Ranking by Citation
Table 2 lists the 20 top-ranked topics according to the citation ranking computed using equation (3).
We observe that the ranking results agree with our expectations, as almost all twenty topics designate broad areas of computer science. They are characterized by high numbers of both conferences and papers, and reflect "trendy" research directions of the last 15 years. The metric captures the high interest in a relatively new topic, "semantic web": despite having the shortest span (8 years) and a relatively recent emergence (2001), it ranks seventh in the overall list of topics.
As we descend toward the lower-ranked topics, we notice that they gradually become more focused. Table 3 shows more specific topics, which may also be multi-disciplinary technical terms, like "distance measure". Note that the number of papers in which these topics occur is still quite high, while the number of conferences drops to a moderate level.
5.3.2 Results of the Ranking by the Clustering Coefficient
Let us now look at the topic list ranked according to the clustering coefficient $cc'_{G_T}$ described in subsection 3.2. Table 4 shows 5 topics from the top and 5 topics from the bottom of the list. The top-ranked topics represent quite specific research fields such as theorem proving or cryptography. In contrast, the last five topics not only represent broad areas of computer science but also correspond exactly to the top-ranked topics according to the citation metric. This experiment confirms our expectation that the clustering coefficient may serve to distinguish between broad and focused topics and gives priority to the more specific ones. We do not discuss here the ranking results yielded by the ratio of the two clustering coefficients defined by equation (5). Analysis of the topic list has shown that these results do not support our predictions. Why this is so remains an open problem.
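To make concrete the intuition that a clustering coefficient separates focused topics from broad ones, here is a minimal sketch. It uses the standard local clustering coefficient of a node in an undirected graph as a stand-in, which is an assumption: the coefficient actually used is the variant defined in subsection 3.2, and it is not reproduced here. The toy graph, the node names, and the helper functions are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are linked to each other."""
    neighbours = adjacency[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return links / (k * (k - 1) / 2)


# Toy graph (made up): the neighbours of "focused" are all interlinked,
# while "broad" is a hub whose neighbours are mostly unrelated.
edges = [
    ("focused", "a"), ("focused", "b"), ("focused", "c"),
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("broad", "w"), ("broad", "x"), ("broad", "y"), ("broad", "z"),
    ("w", "x"),
]
adjacency = build_adjacency(edges)
cc_focused = local_clustering(adjacency, "focused")
cc_broad = local_clustering(adjacency, "broad")
```

In this toy graph the node with tightly interlinked neighbours receives a higher coefficient than the hub, which is the kind of behaviour the ranking above relies on. Whether the paper's own coefficient behaves identically depends on its definition in subsection 3.2.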
5.3.3 Results of the Ranking by tf.idf
Table 5 presents the 10 top entries from the topic list ranked according to tf.idf. Since this metric gives the maximal weight to items which occur in only 1 document, we set the minimal number of documents (i.e. conferences in our case) to 3. We do
so after the manual check of the results on an unre-
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval