TOPIC DETECTION IN BIBLIOGRAPHIC DATABASES
Maria Biryukov
MINE group, University of Luxembourg, 6, Richard Coudenhove-Kalergi, L-1359, Luxembourg, Luxembourg
Keywords:
Topic detection, Topic clustering, Graph properties, Knowledge discovery in databases.
Abstract:
Detection of research topics in scientific publications has attracted a lot of attention in the past few years. In this paper we introduce and compare various metrics for topic ranking that allow us to distinguish between general and focused topic terms. We use DBLP as a testbed for our experiments.
1 INTRODUCTION
Topic detection in scientific publications is an active research area in text mining and knowledge discovery in databases. Various techniques have been proposed for this purpose, ranging from language modeling (Wang et al., 2007; Jo et al., 2007; Diederich and Balke, 2007) to graph-based approaches and bibliometrics (Mann et al., 2006; Bird et al., 2009; Backstrom et al., 2006; Mei et al., 2008). In this paper we study several metrics for ranking research topics. Our metrics are based on the distribution of topics over publications and venues, and on the co-authorship relation. Using these metrics we show how to differentiate between general and specific topics. We also propose a way of grouping topics into semantically related clusters.
This paper is organized as follows: Section 2 describes the process of topic generation. In Section 3 the various ways of ranking topics are introduced. Section 4 outlines the approach for finding related topics. Section 5 presents the experiments and discusses the results. A summary of the paper and an outline of future work are given in Section 6.
2 TOPIC GENERATION
The goal of the current paper is to extract topics and build topic clusters by combining three sources of information: text, the co-authorship graph, and time. We start by extracting topics from publication titles, which constitute the textual component for the purposes of this paper.
2.1 Extracting Topics
In this paper a topic is defined as a collocation composed of n consecutive words, where 2 ≤ n ≤ 3. Requiring the topic components to form a collocation implies that they are semantically related, together convey a certain meaning which differs from the meaning of the individual words, and that the probability of their co-occurrence is higher than would be expected if the words were independent (Manning and Schütze, 1999). In this context, expressions like "data mining" or "disjunctive logic programming" are examples of topics. To determine whether or not a sequence forms a collocation we apply a likelihood ratio test for the binomial distribution (Dunning, 1993). This test belongs to the class of hypothesis tests in which one formulates two hypotheses: a null hypothesis expressing word independence, and an alternative hypothesis under which the words are semantically related and their co-occurrence is not a chance event. Equations 1 and 2 formalize these hypotheses for the case of two words but can be extended to longer expressions.
H_0 : P(w_2 | w_1) = p = P(w_2 | ¬w_1)    (1)

H_1 : P(w_2 | w_1) = p_1 ≠ p_2 = P(w_2 | ¬w_1)    (2)
By taking the ratio λ of the likelihoods of the two hypotheses one can say how much more likely one hypothesis is than the other. The null hypothesis H_0 is rejected if p_1 differs significantly from p_2. It has been shown in (Dunning, 1993) that the quantity −2 log λ is asymptotically χ²-distributed. Hence we can use the χ² distribution table to determine, for each word sequence, the confidence level of its −2 log λ value and compare it to the
threshold value required for a collocation, which is set to 10.83 at confidence level p = 0.001. All candidates which satisfy the threshold are considered valid collocations and make up the resulting list of preliminary topics. (A minimal code sketch of the test is given below.)
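As an illustration, here is a minimal Python sketch of the binomial log-likelihood ratio test, assuming plain corpus counts as input; in our experiments the extraction itself is done with the Ngram Statistics Package (see Section 5), so the function names below are illustrative only:

```python
import math

def log_l(k, n, x):
    """Binomial log-likelihood of k successes in n trials with success probability x."""
    eps = 1e-12                      # guard against log(0)
    x = min(max(x, eps), 1 - eps)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def llr(c1, c2, c12, n):
    """-2 log(lambda) for the bigram (w1, w2), following Dunning (1993).
    c1, c2: occurrences of w1 and w2; c12: occurrences of "w1 w2";
    n: total number of tokens in the corpus."""
    p = c2 / n                       # H0: P(w2|w1) = P(w2|not w1) = p
    p1 = c12 / c1                    # H1: P(w2|w1) = p1
    p2 = (c2 - c12) / (n - c1)       # H1: P(w2|not w1) = p2
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, n - c1, p))

# Keep the bigram if the score clears the chi-square threshold
# 10.83 (one degree of freedom, p = 0.001).
print(llr(c1=120, c2=95, c12=60, n=100_000) > 10.83)
```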
We discuss the topic lists in Section 5.
2.2 Topic Terms Refinement
As mentioned above, we allow topic terms composed of two or three words (bi- and tri-grams in the rest of this text). Any trigram can be seen as an extension of some bigram by one word. Presumably there are cases when the −2 log λ values are sufficiently high to retain both a bigram and its corresponding trigram(s) as topic terms. Thus we obtain terms like "generative model" as well as "discriminative generative model" and "probabilistic generative model". However, in some other cases selecting a trigram along with its bigrams may yield false positives. For example, in "world wide web" only the trigram itself makes sense; neither "world wide" nor "wide web" is valid by itself. To minimize such cases we complete the process of topic generation by applying the subsumption approach proposed in (Sanderson and Croft, 1999) for deriving concept hierarchies from text. The original idea is the following: given two terms x and y, x subsumes y if the documents in which y occurs are a subset of the documents in which x occurs. Since x subsumes y and is more frequent, x is the parent of y. We adopt this idea and modify it so that it serves in two different scenarios.
Cleaning the Topic List of Meaningless Collocations. Given a bigram x and its extension, a trigram y, we eliminate x as having no stand-alone meaning if it occurs in 80% of the documents (i.e., publication titles) in which y occurs. In other words, x is removed from the list of topics if it occurs as part of y in at least 80% of the cases. Note that we do not require a complete overlap between the occurrences of x and y: doing so would preserve a high number of meaningless bigrams just because of a few cases in which x did occur without y.
Defining Clusters of Lexically Related Terms. Given a bigram x and its multiple extensions Y = {y_1, y_2, ..., y_n}, a cluster is formed with x as the central term and {y_1, y_2, ..., y_n}, y_i ∈ Y, as the member terms. (Both scenarios are sketched in code below.)
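The following Python sketch illustrates both scenarios under the assumption that topic occurrences are available as sets of title identifiers; the function and variable names are hypothetical:

```python
def subsumed(bigram_titles, trigram_title_sets, threshold=0.8):
    """Scenario 1: the bigram x is dropped when at least `threshold`
    of the titles hosting x also host one of its extending trigrams."""
    covered = set().union(*trigram_title_sets)
    return len(bigram_titles & covered) >= threshold * len(bigram_titles)

def lexical_cluster(bigram, trigrams):
    """Scenario 2: a surviving bigram x becomes the central term of a
    cluster whose members are its extensions y_1, ..., y_n."""
    return {"center": bigram, "members": sorted(trigrams)}

# Example from Table 1: "adaptable user" occurs in 9 titles, 8 of which
# also host "adaptable user interface" -- 8/9 >= 0.8, so it is dropped.
print(subsumed({1, 2, 3, 4, 5, 6, 7, 8, 9}, [{1, 2, 3, 4, 5, 6, 7, 8}]))
```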
After the refinement we can proceed to study some of the topic properties.
3 RERANKING OF THE TOPICS
Since collocations are semantically meaningful units, the ranked list obtained in the way described above could already serve as the final ranked list of topics. However, we consider re-ranking for the following reasons. First of all, the two- and three-word collocations are generated separately, which results in two independent topic lists. Because bi- and tri-grams have different ranges of weights, there is no straightforward way to merge them into one ranked list of topics without recourse to external information. The second remark concerns the meaning of the collocation weight in general. The −2 log λ value of a topic reflects its relevance to the corpus as a whole. However, it fails to capture information about topic generality or specificity, nor does it shed light on topic relatedness. To overcome the lack of such information we define additional metrics for topic ranking, described in the following subsections.
3.1 Ranking of Topics by Citation
It is common to use citations as evidence of the importance of an object or event. To decide on the salience of a topic we define two types of citations: citation by title and citation by conference. The idea is to consider every appearance of a given topic after its first occurrence as a reference to (or citation of) the original topic. Note that at this point we incorporate the time dimension into the analysis. To compute the new weight weight_{t_i} of a topic t_i ∈ T, where T denotes the list of topics produced via the collocation extraction described in subsection 2.1, we define:
Citation by title, cite_{t,i}: the number of titles in which topic t_i occurs after its first appearance.

Citation by conference, cite_{c,i}: the number of different conferences in which topic t_i occurs after its first appearance.
Then the resulting topic weight is given by the product of the two types of citations:

weight_{t_i} = cite_{t,i} × cite_{c,i}    (3)
This metric favors topics which have high counts for both titles and conferences. Consequently we expect topics that reflect broad trends to outrank the more locally focused ones. (A sketch of the computation follows.)
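As an illustration, the following sketch computes this weight from a chronologically sorted stream of publication records; the record layout is an assumption, not the paper's actual data model:

```python
from collections import defaultdict

def citation_weights(papers):
    """Equation (3): weight = cite_t x cite_c, counting every occurrence
    of a topic after its first appearance.  `papers` is assumed to be an
    iterable of (year, conference, topics) records sorted by year."""
    seen = set()
    cite_t = defaultdict(int)    # titles hosting the topic after first use
    cite_c = defaultdict(set)    # distinct conferences after first use
    for _year, conference, topics in papers:
        for t in topics:
            if t in seen:
                cite_t[t] += 1
                cite_c[t].add(conference)
            else:
                seen.add(t)
    return {t: cite_t[t] * len(cite_c[t]) for t in seen}
```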
3.2 Ranking Topics by Co-authorship
So far only textual and temporal information has been used to create, refine, and re-rank the topics. The metric described in this subsection aims
at distinguishing between broad and focused topics as well, but it uses co-author graph properties to do so. Intuitively, more general topics will be spread among many authors who are not necessarily related to each other. More specific topics are expected to exhibit the opposite behavior, revealing tight co-author clusters. The degree of co-author connectivity is captured by the clustering coefficient, which quantifies how close the direct neighbors of a vertex are to forming a complete graph (clique) (Watts and Strogatz, 1998).
To compute the topic weight in this graph-based metric we build a co-authorship graph G_t for each topic t_i ∈ T, with vertices V' being the authors of all the papers in which t_i occurs, and edges E' defined by the co-authorship relation between the authors in G_t. The topic weight weight_{t_i} is given by the clustering coefficient of G_t, cc_{G_t}, and is computed as follows:
weight_{t_i} = cc_{G_t} = |E'| / ((|V'| × (|V'| − 1)) / 2)    (4)
where the numerator is the number of edges in G_t, and the denominator is the maximal number of edges G_t would have if it were a complete graph.
We observe that such graphs are sparse: they represent a set of typically unrelated cliques. That is, the edges in G_t mainly connect the authors of each given paper, while there are almost no edges between the authors of different papers. However, one may assume that some v'_i, v'_j ∈ V' are connected to each other, just not via the particular topic t_i. It follows that G_t might not fully reflect the co-authorship relations between the authors related to t_i. To remedy this we complete G_t with information from the global graph G = {V, E}, where V are the authors of all publications listed in the bibliographic database, and there is an edge e_{i,j} ∈ E between some v_i and v_j ∈ V if they have co-authored at least one paper.
at least one paper. The process of building G
t
is now modified in the following way: after the
authors of all papers containing t
i
are introduced
and appropriately connected in G
t
, every pair of
unconnected vertices v
i
,v
j
is checked for having an
edge in the global graph G. Should there be one, an
edge e
i, j
is added to the G
t
. After all the vertices
{V
0
} G
t
have been checked a new clustering co-
efficient cc
0
G
T
is computed with the updated number
of edges {E
00
} G
t
. It makes sense now to check
whether or not information from the graph G has
changed the author connectivity in G
t
. We do so by
computing a new weight of t
i
, weight
0
t
i
, which is the
ratio of the two clustering coefficients, cc
0
G
T
and cc
G
T
:
weight'_{t_i} = cc'_{G_t} / cc_{G_t} = |E''| / |E'|    (5)
We expect that the closer the value of weight'_{t_i} is to 1, the more general the term. (A sketch of both computations follows.)
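For illustration, a sketch of both computations using the networkx library is given below; the input layout (one author tuple per paper containing t_i, plus the global co-authorship graph G) is our assumption about the data, not a prescribed interface:

```python
import networkx as nx

def clustering_weights(topic_papers, global_graph):
    """Equations (4) and (5): clustering coefficient of the topic graph G_t,
    and the edge ratio after completing G_t from the global graph G.
    topic_papers: list of author tuples, one per paper containing t_i;
    global_graph: nx.Graph over all authors in the database."""
    g = nx.Graph()
    for authors in topic_papers:             # each paper forms a clique
        g.add_nodes_from(authors)
        g.add_edges_from((a, b) for i, a in enumerate(authors)
                         for b in authors[i + 1:])
    v, e_local = g.number_of_nodes(), g.number_of_edges()
    if v < 2 or e_local == 0:
        return 0.0, 1.0                      # degenerate topic graph
    cc = e_local / (v * (v - 1) / 2)         # equation (4)
    nodes = list(g)
    for i, a in enumerate(nodes):            # add edges that exist in G only
        for b in nodes[i + 1:]:
            if not g.has_edge(a, b) and global_graph.has_edge(a, b):
                g.add_edge(a, b)
    return cc, g.number_of_edges() / e_local # equation (5): |E''| / |E'|
```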
3.3 Ranking of Topics by tf.idf Value

Term frequency–inverse document frequency (tf.idf) is another way of separating terms into general and specific. Introduced in (Sparck Jones, 1972), it has been widely used in the field of information retrieval. We use it here as a benchmark for the two other metrics introduced in subsections 3.1 and 3.2. The metric combines the term's salience in the collection of documents (tf) with its informativeness (idf), presuming that more focused terms are concentrated in fewer documents than more general ones, which are spread throughout the collection. We apply this metric as follows:
term t_i = topic t_i ∈ T;

document d_j = c_j, where c_j is a conference from the list of all conferences C in the database;

tf_{i,j} is the number of titles in which t_i occurs;

cf_i is the number of different conferences in which t_i occurs.
The weight of each topic t_i ∈ T is given by (6):

weight(i, j) = (1 + log(tf_{i,j})) · log(|C| / cf_i)  if tf_{i,j} ≥ 1,  and 0 if tf_{i,j} = 0    (6)
where f(tf) = 1 + log(tf_{i,j}), tf > 0, is the dampening function (see page 542 of (Manning and Schütze, 1999) for details). We expect more general topics to be characterized not only by a high number of hosting titles but also by a high number of conferences in which they occur, as opposed to more specific ones, which are concentrated in a relatively small number of venues. (A transcription of equation (6) in code follows.)
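A direct Python transcription of equation (6); the function name is illustrative, and the logarithm base is taken as we read it from the formula:

```python
import math

def tfidf_weight(tf, cf, total_conferences):
    """Equation (6): dampened tf.idf of a topic.
    tf: number of titles in which the topic occurs;
    cf: number of distinct conferences in which it occurs;
    total_conferences: |C|, the number of conferences in the database."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(total_conferences / cf)

# e.g. a topic occurring in 88 titles across 76 of 3996 conferences
print(round(tfidf_weight(88, 76, 3996), 2))
```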
In Section 5 we compare the results of all three metrics.
4 FINDING RELATED TOPICS
The question we have not yet dealt with is how to identify semantically related topics. In subsection 2.2 we briefly showed how to group them lexically. However, that approach leaves out the semantic similarity of topics, being strongly restricted to their wording. In this section we describe how we plan to extend the graph analysis suggested above to enable semantic clustering.
The underlying assumption is that authors who share some topic t_i belong to the same community, and thus other topics T' they may share are possibly related to t_i. To check whether or not a topic t_j ∈ T' is related to a topic t_i, we use the updated graph G_t and the global graph G (as described in 3.2) and perform the following steps:
1. count how many authors in G_t share t_j (internal links, which we call i-links);

2. count how many authors in G share t_j (external links, e-links);

3. define score_{t_j} as the ratio e-links / i-links.
The higher the proportion of internal links, the more strongly t_i and t_j are related (see the sketch after this paragraph).
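A sketch of the three steps, assuming a lookup from authors to the topics they publish on; the `author_topics` mapping and the function name are hypothetical:

```python
def relatedness_score(topic_j, local_authors, global_authors, author_topics):
    """Section 4 scoring: ratio of external to internal links.
    A lower score means t_j is more strongly related to t_i, since a
    higher proportion of its authors already sit inside G_t."""
    i_links = sum(1 for a in local_authors            # authors in G_t
                  if topic_j in author_topics.get(a, ()))
    e_links = sum(1 for a in global_authors           # authors in G
                  if topic_j in author_topics.get(a, ()))
    return e_links / i_links if i_links else float("inf")
```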
Note that this metric is simple and straightforward. Alternatively, we can compute the strength of the topic relatedness using hypothesis testing, as a ratio of two likelihoods: L(H_1), which expresses that t_i and t_j are related, and L(H_0), which says that they are not. (A likelihood ratio was used in (Jo et al., 2007) to compute the probability of a token being a term, using a citation graph built from CiteSeer data.)
Thus we may detect the following relationship: text summarization: {multidocument text summarization, automatic summarization, information retrieval, text processing, ...}, where "text summarization" is the topic in question t_i, and "multidocument text summarization", "text processing", etc., constitute the set of related topics T'.
By taking the clustering coefficient into account (equation (5)) we may transform such clusters of related topics into a kind of hierarchy, with more general topics being parent nodes of the more specific ones. We may combine this information with time to capture the dynamic development of a broader topic as a whole, or to trace the evolution of its subtopics.
5 EXPERIMENTS
AND EVALUATION
In this section we discuss the experiments that have been performed to test the methods described above. We focus on conference publications and use the computer science bibliographic database DBLP as a test bed. Our experiments are run on the DBLP release from February 2008 (up-to-date versions of DBLP are available for download from http://dblp.uni-trier.de/xml/ in XML format).
5.1 Data Collection and Preparation
The XML file is parsed and the data is stored in a database, where it is organized into two independent sets. One is intended for the collocation extraction and contains the titles of conference papers. The initial list of 610895 items is preprocessed by converting to lower case, removing stop words (we use the list provided by the Lingua package (Potencier and Humphrey)), removing punctuation, and discarding titles which contain non-ASCII symbols. The latter constitute 2% of the total and are mostly French and German titles, with a few occurrences of mathematical notation. The resulting list contains 599456 titles. (A sketch of this cleaning step is given below.) In the second set we store complete information about the publications, including author names, title, year, and venue. It counts 610895 titles, 609053 authors, and 3996 conferences over 49 years, from 1959 to 2008.
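The cleaning steps can be summarized by the following sketch; the regular expressions and the stop-word set stand in for the actual Lingua-based pipeline and are assumptions:

```python
import re

def preprocess_titles(titles, stopwords):
    """Lower-case, strip punctuation and stop words, and drop titles
    containing non-ASCII symbols (about 2% of the data)."""
    cleaned = []
    for title in titles:
        if not all(ord(ch) < 128 for ch in title):
            continue                          # non-ASCII title: dropped
        words = re.findall(r"[a-z0-9]+", title.lower())
        kept = [w for w in words if w not in stopwords]
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned
```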
5.2 Evaluation of Topics on DBLP
The preprocessed list of titles serves as the input to the program which generates topics. (We use the Ngram Statistics Package for collocation extraction (Banerjee and Pedersen, 2003), with the log-likelihood ratio test as the statistical measure and 10.83 as the cutoff for the −2 log λ value.) The process yields 392994 bi- and 3150332 tri-grams. Since the titles were modified during the preparation stage, not all the collocations are valid. We then conduct a post-processing which amounts to:
1. matching collocations to the original titles: collocations that contain punctuation marks and/or stopwords, or whose components do not form a contiguous sequence in the title, are eliminated;
2. merging singular and plural cases into one entry;
3. subsumption, as described in subsection 2.2.
At the end of the post-processing we obtain a structure known in information retrieval as an inverted file, where for each entry the number of occurrences and an array of hosting titles are stored (see the sketch below). The number of retained topics is reduced to 124480.
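The post-processing output can be represented as follows; the `(topic, title_id)` pair stream is an assumed interface:

```python
from collections import defaultdict

def build_inverted_file(topic_occurrences):
    """Inverted file: for each retained topic, the occurrence count and
    the array of hosting titles."""
    index = defaultdict(list)
    for topic, title_id in topic_occurrences:
        index[topic].append(title_id)
    return {t: {"count": len(ids), "titles": ids}
            for t, ids in index.items()}
```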
Table 1 shows some examples of the subsumption process. The first row illustrates the elimination of the meaningless bigram "adaptable user". The second row is an example of a cluster formed around the bigram "ada programming": it is covered by the corresponding trigrams but is not eliminated. Analysis of the list of such clusters shows that many bigrams, while covered by some set of trigrams, have a meaning of their own and could potentially serve for topic labeling. The last row is an example of a cluster built around the bigram "application software": the topic designated by the bigram is broad enough and is not covered by the cluster members.
Table 1: Examples of subsumption procedure.
Bigram Frequency Trigram Frequency Covered
adaptable user 9 adaptable user interface 8 Yes
ada programming 9 ada programming environment 2
ada programming language 2
ada programming support 3
advanced ada programming 2 Yes
application software 39 application software development 3
application software systems 2
embedded application software 2
mobile application software 2
generic application software 2 No
5.3 Experiments with Topic Re-ranking
As mentioned above the data stored in DBLP spans
49 years, from 1959 to 2008. However it can be seen
from the Figure 1, that scientific activity starts to grow
toward mid eighties. That is the reason why we re-
strict our experiments to topics which appeared no
earlier than 1988. (The sharp fall of the curve to-
ward the end of 2010 is explained by the fact that
the data from 2007 2008 had not been completely
introduced into the database by the time we down-
loaded the file.). Additionally we restrict the minimal
topic frequency to 5 for the bi-grams, and 2 for the
tri-grams.
Figure 1: Paper distribution in DBLP from 1959 to 2008.
5.3.1 Results of the Ranking by Citation
Table 2 lists the 20 top-ranked topics according to the citation ranking computed using equation (3).
We observe that the ranking results agree with our expectations, as almost all twenty topics designate broad areas of computer science. They occur in high numbers of both conferences and papers, and reflect "trendy" research directions of the last 15 years. The metric captures the high interest in the relatively new topic "semantic web": despite its short span (8 years) and relatively recent emergence (2001), it scores seventh on the total list of topics.
As we descend toward the lower-ranked topics we notice that they gradually become more focused. Table 3 shows more specific topics, which may also be multi-disciplinary technical terms, like "distance measure". Note that the number of papers these topics occur in is still quite high, while the number of conferences drops to moderate.
5.3.2 Results of the Ranking by the Clustering
Coefficient
Let us now look at the topic list ranked according to the clustering coefficient cc'_{G_t} described in subsection 3.2. Table 4 shows 5 topics from the top and 5 topics from the bottom of the list. The top-ranked topics represent quite specific research fields such as theorem proving or cryptography. On the contrary, the last five topics not only represent broad areas of computer science, they also correspond exactly to the top-ranked topics according to the citation metric. This experiment confirms our expectation that the clustering coefficient may serve to distinguish between broad and focused topics, giving priority to the more specific ones. We do not discuss here the ranking results yielded by the ratio of the two clustering coefficients defined by equation (5): analysis of the topic list has shown that the results do not support our predictions, and why this is so remains an open problem.
5.3.3 Results of the Ranking by tf.idf
Table 5 presents the 10 top entries from the topic list ranked according to tf.idf. Since this metric gives the maximal weight to items which occur in a single document, we set the minimal number of documents (i.e., conferences in our case) to 3. We do so after a manual check of the results on an unrestricted set, which put forward dozens of terms like "session chair", "extended abstract", etc.
Table 2: The 20 top ranked topics by the citation metric.
topic    weight    # of conferences    # of titles    first year    span (years)
web service 2039826 654 3119 1994 13
sensor network 1777047 501 3547 1993 12
data mining 1045044 572 1827 1993 16
ad hoc network 1004598 441 2278 1995 13
wireless sensor network 648999 351 1849 1999 10
mobile agent 622362 474 1313 1994 15
wireless network 563178 371 1518 1992 17
semantic web 495624 386 1284 2001 8
multi agent system 492063 403 1221 1991 18
support vector machine 379874 341 1114 1996 13
mobile ad hoc 363025 325 1117 1998 11
virtual environment 359755 341 1055 1990 18
digital library 293112 236 1242 1991 17
association rule 261318 291 898 1993 16
face recognition 256522 251 1022 1990 18
context aware 241696 332 728 1996 12
web application 238924 322 742 1996 13
reinforcement learning 218240 248 880 1988 20
evolutionary algorithm 195487 233 839 1993 15
virtual reality 185472 288 644 1990 18
Table 3: Topics around the 500th rank.
topic    weight    # of conferences    # of titles    first year    span (years)
distance measure 6688 76 88 1990 15
heterogeneous computing 6649 61 109 1989 17
online game 6608 59 112 2001 7
aspect oriented programming 6528 64 102 1997 11
predictive control 6510 62 105 1995 11
Table 4: The 5 top and 5 bottom ranked topics according to the clustering coefficient.
topic    vertices    edges (local)    edges (global)    cc'_{G_t}
spiral architecture 19 40 43 0.25146
face authentication 112 913 1030 0.16570
blue gene 209 3059 3523 0.16208
proof planning 39 53 114 0.15385
proof carrying code 21 30 32 0.15238
...
wireless network 3311 4945 6737 0.00123
data mining 3641 5779 7563 0.00114
ad hoc network 4254 6183 8482 0.00094
web service 5732 10561 14698 0.00089
sensor network 6475 12883 16730 0.00080
Table 5: The 10 top ranked topics by tf.idf.
topic    weight by tf.idf    # of conferences    # of papers    rank by citation    rank by clustering coefficient
research note 40.05 4 128 4289 4680
interactive presentation 34.97 4 61 7293 8121
co chair 33.92 12 135 1745 1251
output analysis 33.75 4 51 8344 2000
parallel manipulator 33.16 10 99 2581 8759
poster abstract 32.80 7 68 4536 9119
workshop chair 32.74 4 44 9229 1579
simulation optimization 32.70 7 67 4557 7423
digital government 32.16 9 76 3431 5765
low voltage 31.68 36 337 288 5568
Despite this measure, we immediately notice that among the selected items there is a high number of non-topic terms such as "research note" or "interactive presentation". Topic and non-topic terms are mixed throughout the list. Note also that the figures in the last two columns, which correspond to the topic ranks assigned by the citation and clustering coefficient metrics respectively, do not allow us to establish any dependency between this metric and the other two. We explain this behavior by the fact that tf.idf is the least informed of the three metrics: it clearly prefers items with a high paper-to-conference ratio, which does not model the topic properties correctly.
6 SUMMARY AND FUTURE
WORK
In this paper we have described a method for extracting research topics from the titles of scientific publications. We have introduced and compared three different methods of topic ranking, aiming at distinguishing between general and specific topics. The rankings by citation and by clustering coefficient have yielded topic lists which correspond to our expectations: the first metric puts forward the broader topics, while the second favors the more focused ones. On the contrary, the tf.idf weighting has failed to generate a coherent list, mixing topic and non-topic terms. Such an outcome shows that the paper-to-conference relationship alone does not provide sufficient ground for topic ranking.
So far the topic extraction is based on publication titles only. One limitation of this approach is that it does not capture semantic relations between the topic terms, treating them as atomic. Extending the textual information with paper abstracts will alleviate this problem. It will also contribute to the process of finding related topics via the graph analysis that we have sketched in this paper.
ACKNOWLEDGEMENTS
I would like to thank Professor Christoph Schommer
for his attention, valuable comments and useful sug-
gestions while writing this paper.
REFERENCES
Banerjee, S. and Pedersen, T. (2003). The design, imple-
mentation, and use of the Ngram Statistic Package.
In Proceedings of the Fourth International Conference
on Intelligent Text Processing and Computational Lin-
guistics, pages 370–381.
Bird, C., Barr, E., Nash, A., Filkov, V., Devanbu, P., and Su,
Z. (2009). Structure and dynamics of research col-
laboration in computer science. In Proceedings of the
2009 SIAM International Conference on Data Mining.
SIAM.
Diederich, J. and Balke, W.-T. (2007). The semantic grow-
bag algorithm: Automatically deriving categorization
systems. In ECDL, pages 1–13.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
Jo, Y., Lagoze, C., and Giles, C. L. (2007). Detecting re-
search topics via the correlation between graphs and
texts. In KDD, pages 370–379.
Backstrom, L., Huttenlocher, D., Kleinberg, J. M., and Lan, X. (2006). Group formation in large social networks: membership, growth, and evolution. In KDD, pages 44–54.
Mann, G. S., Mimno, D. M., and McCallum, A. (2006).
Bibliometric impact measures leveraging topic analy-
sis. In JCDL, pages 65–74.
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA.
Mei, Q., Cai, D., Zhang, D., and Zhai, C. (2008). Topic modeling with network regularization. In Proceedings of the 17th International World Wide Web Conference (WWW), pages 101–110.
Potencier, F. and Humphrey, M. Lingua::StopWords: Stop words for several languages. http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/lib/Lingua/StopWords.pm.
Sanderson, M. and Croft, W. B. (1999). Deriving concept
hierarchies from text. In SIGIR, pages 206–213.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
Wang, X., McCallum, A., and Wei, X. (2007). Topical n-
grams: Phrase and topic discovery, with an application
to information retrieval. In ICDM, pages 697–702.
Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393:440–442.