Infinite Topic Modelling for Trend Tracking
Hierarchical Dirichlet Process Approaches with Wikipedia Semantic based Method
Yishu Miao¹, Chunping Li¹, Hui Wang² and Lu Zhang¹
¹School of Software, Tsinghua University, Beijing, China
²School of Computing and Mathematics, University of Ulster, Jordanstown, U.K.
Keywords: Hierarchical Dirichlet Process, Topic Modelling, Wikipedia, Temporal Analysis, News.
Abstract: The current affairs that people follow closely vary across periods, and the evolution of trends is reflected in media reports. This paper considers tracking trends by combining non-parametric Bayesian approaches with temporal information and presents two topic modelling methods. One utilizes an infinite temporal topic model which obtains the topic distribution over time by placing a time prior on the process that discovers topics dynamically. In order to better organize event trends, we present a second, progressive superposed topic model which simulates the whole evolutionary process of topics, including the generation of new topics, the evolution of stable topics and the vanishing of old topics, via a series of superposed topic distributions generated by a hierarchical Dirichlet process. Both approaches aim at solving this real-world task while avoiding the Markov assumption and removing the limit on the number of topics. Meanwhile, we employ Wikipedia-based semantic background knowledge to improve the discovered topics and their readability. The experiments are carried out on a corpus of BBC news from the American forum. The results demonstrate better organized topics, the evolutionary processes of topics over time, and the effectiveness of the models.
1 INTRODUCTION
At the outset of this work lies the observation that, in the analysis of time-stamped documents such as news corpora, people are concerned about what events have taken place and their entire evolutionary processes. As we can see from Figure 1, each colour represents a trend that accords with the timestamps on the timeline above. Correspondingly, there exist several articles related to each trend, such as “Media condemn N Korea nuclear test” on May 26th, “North Korea increases its leverage” on Jun 8th, “Obama ‘prepared’ for N Korea test” on Jun 21st and “Clinton’s high drama Korean mission” on Aug 6th. The successive articles indicate a period of attention paid by Americans worried about the Korean nuclear issue. Although the articles differ in their titles, they concern the same topic. Hence, we attempt to discover the latent topic via topic modelling on the content of these articles.
Topic modelling (Landauer and Dumais, 1997) (Hofmann, 1999) has been a prevailing method for text analysis and feature reduction. Topics are a “reduced description” (Blei et al., 2003) of the associated documents. Thus, we extract a trend via the bunch of probability-weighted words that represents a topic. As time elapses, some topics which drew people’s attention fade from public view, while new topics may conversely raise great awareness, accompanying the occurrence of big events. Moreover, there also exist topics which attract persistent attention and become significant parts of people’s daily life. All of these topics are crucial to trend tracking.
The purpose of trend tracking focuses on temporal information and related entities. (Blei and Lafferty, 2006), (Wang et al., 2008), (Wang and McCallum, 2006) and (AlSumait et al., 2008), among others, are topic models that extract temporal information while modelling topic distributions. However, considering the compatibility of the Markov assumption with the Dirichlet distribution, it is not amenable to use sequential modelling methods in traditional topic models with a Dirichlet prior. After deliberating on how topics are generated when applying non-parametric Bayesian approaches, we find that significant temporal information exists not only in the words of the documents, but also in the generative process of the topics.
Figure 1: Articles on timeline of year 2009.
Hence, we associate the models with time information, without a Markov assumption, in two approaches: modelling a distribution over time, or alternatively making use of the dynamic feature of the hierarchical Dirichlet process (Teh et al., 2006). When applied to real-world tasks such as trend tracking, models without the Markov assumption are much easier to handle, both in building a succinct graphical model and in keeping down the complexity of the inference algorithm.
Firstly, in the first approach we employ a time variable to model the evolutionary process of topics, based upon the publishing time of the documents, while discovering topics. As an infinite topic model, it is more likely to generate a new topic at a timestamp where the texts congregate. From the Chinese restaurant process perspective (Sudderth, 2006), a customer prefers to sit at a new table if he knows that many clients will come in at 12:00, for he does not want to have dinner with too many strangers.
Then, we present a superposition topic model which generates new topics progressively. In this model, we first divide the corpus into several parts in time order. After we obtain the topic distribution over terms on the first sub-corpus, the model has already generated the corresponding topics without setting the number of topics beforehand. Based on the previous distributions, the model assigns terms either to join the preceding topics or to hold together in groups as new topics. Besides, topics seldom discussed in the subsequent corpus will vanish after several iterations as vanished topics, while a majority of topics remain stable topics throughout their whole evolutionary processes owing to persistent public attention, such as Obama and Military in American news.
After introducing the two infinite topic modelling approaches, we consider incorporating a semantic method to improve the experimental results. Even conventional media, i.e. newspapers, incline to describe an event in hackneyed ways, which leads to a small share of trivial words and exaggerated rhetoric. In order to tackle this impediment, we employ Wikipedia semantic background knowledge to improve the readability of the discovered topics. By mapping the terms in the articles to Wikipedia concepts, we obtain a majority of the entities in the corpus and improve the granularity of the extracted topic distribution. Hence, we build an entity model to analyse event trends based on the means mentioned above. In this model, we do not discard the words after mapping them to entities; every word is sampled during the Gibbs sampling process and contributes to the generation of the topic distribution over entities.
The experiments and evaluation are mainly on BBC news from the American forum, which contains normative articles with precise publishing times. In Section 2, we review related developments in infinite topic models and temporal information analysis. Then, we introduce the two non-parametric Bayesian approaches and the entity model based on the Wikipedia semantic method in Section 3. In Section 4, we present the analysis of the experimental results according to the comparison of the different models. Finally, we give the concluding remarks and future work in Section 5.
2 RELATED WORK
In this section, we briefly introduce related work, including non-parametric topic modelling methods, temporal information analysis and semantic knowledge based methods.
Basically, the hierarchical Dirichlet process (HDP) (Teh et al., 2006), an extension of the Dirichlet process (DP) (Ferguson, 1973), is a typical implementation of the non-parametric Bayesian approach; infinite LDA (Heinrich, 2011), for instance, achieves a relatively satisfactory result with a low level of complexity. Besides, dHDP (Ren et al., 2008) assumes that each paragraph of a document is associated with a particular topic, and that the probability of a given topic being manifested in a specific document evolves dynamically between contiguously time-stamped documents. But it only provides an infinite number of time-stamps, while the topic number remains a limitation in the topic evolution scenario. Evo-HDP (Zhang et al., 2010) and the infinite dynamic topic models (Ahmed and Xing, 2010) can automatically determine the topic number, but both are based on the Markov assumption. (Balasubramanyan et al., 2009) is similar to our first approach, simply combining HDP and the TOT model, but
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
36
makes no use of time information when generating new topics. Besides, there also exists an approach to trend tracking via a combination of dynamic topic models and time-series methods without HDP (Hong et al., 2011).
Semantic based methods are usually used for analysing short textual data or for resolving multilingual and ambiguity problems, and several such approaches incorporate topic modelling. For example, (Kataria et al., 2011) uses Wikipedia annotations to label entities and learn word-entity associations. (Ni et al., 2009) takes advantage of the multilingual corpus of Wikipedia to extract universal topics and cluster terms of different languages. Besides, the plentiful context information of Wiki concepts is profitable for automatic topic labelling; (Lau et al., 2011) presents a typical approach to tackling the topic comprehension problem caused by unsupervised topic modelling. However, we employ Wikipedia as background knowledge in this scenario with the aim of improving the quality of topic clustering, which differs from the models mentioned above. Based on CorrLDA (Newman et al., 2006), we discover topic distributions over entities extracted via Wikipedia, without discarding the words sampled in the inference step, and improve the readability of the topics entirely.
Without the Markov assumption, our approaches are more succinct and easier to employ on real-world tasks than other existing methods. In this paper, we also compare the two approaches and discover the different features between them, which are presented in the experiment part.
3 INFINITE TOPIC MODELLING
FOR TREND TRACKING
At the beginning of this section, we briefly introduce Dirichlet process mixtures and HDP. Then the two infinite topic modelling approaches are discussed, each with a basic introduction, its graphical representation and its inference process. Afterwards, we present the Wikipedia based semantic approach and its implementation.
3.1 Hierarchical Dirichlet Process and Non-parametric Graphical Model
Figure 2: (a) Dirichlet process, (b) Hierarchical Dirichlet process, (c) Stick-breaking representation of HDP.

The DP has been used as a non-parametric Bayesian approach to estimate the number of components; it defines a distribution over distributions. We write $G \sim \mathrm{DP}(\alpha_0, G_0)$ for a DP, where $G_0$ is the base measure and the concentration parameter $\alpha_0$ is a positive real number. The Dirichlet mixture model, which employs a DP as a non-parametric prior on the latent parameter distribution, is one of the most significant applications of the DP. In Dirichlet process mixtures, we suppose:
$\theta_m \mid G \sim G, \qquad x_m \mid \theta_m \sim F(\theta_m) \qquad (1)$
$\theta_m$ denotes the parameter of the $m$th component, while $F(\theta_m)$ denotes the distribution given $\theta_m$; Figure 2(a) shows the graphical representation. From the stick-breaking construction perspective, we place a Dirichlet process prior on the latent parameter distribution, $G \sim \mathrm{DP}(\alpha, H)$. Hence the other representation of a Dirichlet process mixture is as follows:
$\pi \mid \alpha_0 \sim \mathrm{GEM}(\alpha_0), \qquad z_n \mid \pi \sim \pi$
$\theta_k \mid G_0 \sim G_0, \qquad x_n \mid \theta_{z_n} \sim F(\theta_{z_n}) \qquad (2)$
The random probability distribution $G$ on $\theta$ satisfies $G(\theta) = \sum_{k=1}^{\infty} \pi_k \, \delta(\theta, \theta_k)$. In this formula $\pi$, a random probability measure on the positive integers, satisfies $\sum_{k=1}^{\infty} \pi_k = 1$. GEM stands for Griffiths, Engen and McCloskey, and the construction of $\pi \mid \alpha_0 \sim \mathrm{GEM}(\alpha_0)$ can be presented via the stick-breaking construction as follows:
$\pi'_k \mid \alpha_0 \sim \mathrm{Beta}(1, \alpha_0), \qquad \phi_k \mid G_0 \sim G_0 \qquad (3)$
Then we define a random measure $G$ as:

$\pi_k = \pi'_k \prod_{l=1}^{k-1} (1 - \pi'_l), \qquad G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k} \qquad (4)$
Hence, this is an alternative way to express the Dirichlet mixture model.
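The stick-breaking construction (3)-(4) is straightforward to simulate. Below is a minimal sketch (ours, not from the paper) that draws a truncated sample of $\pi \sim \mathrm{GEM}(\alpha_0)$:

```python
import numpy as np

def stick_breaking(alpha0, truncation=100, seed=0):
    """Truncated stick-breaking sample of pi ~ GEM(alpha0), Eqs. (3)-(4)."""
    rng = np.random.default_rng(seed)
    # pi'_k ~ Beta(1, alpha0): the fraction broken off the remaining stick
    pi_prime = rng.beta(1.0, alpha0, size=truncation)
    # pi_k = pi'_k * prod_{l<k} (1 - pi'_l): length of the k-th piece
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
    return pi_prime * remaining

weights = stick_breaking(alpha0=1.0)
print(weights[:5], weights.sum())  # weights decay; the sum approaches 1
```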
Nevertheless, no sharing can occur between groups of data if a single Dirichlet process is applied. In order to link these mixture models, the base distribution can itself be distributed as a Dirichlet process,
InfiniteTopicModellingforTrendTracking-HierarchicalDirichletProcessApproacheswithWikipediaSemanticbased
Method
37
which then allows the groups to share statistical strength. This non-parametric Bayesian approach is the hierarchical Dirichlet process (Teh et al., 2006). An HDP defines a set of random probability measures $G_j$, one for each group $j$, each drawn from a $\mathrm{DP}(\alpha_0, G_0)$. Moreover, the global measure $G_0$ is itself drawn from a $\mathrm{DP}(\gamma, H)$. The definition is as follows, and Figure 2(b) shows the graphical model:
$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \qquad G_m \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$
$\theta_{mn} \mid G_m \sim G_m, \qquad x_{mn} \mid \theta_{mn} \sim F(\theta_{mn}) \qquad (5)$
Since the base measure $G_0$ is itself drawn from a DP, every child shares this measure and the children are conditionally independent of each other. Correspondingly, we give the stick-breaking construction perspective, whose graphical model is shown in Figure 2(c):
$\pi \mid \gamma \sim \mathrm{GEM}(\gamma), \qquad \theta_m \mid \alpha_0, \pi \sim \mathrm{DP}(\alpha_0, \pi), \qquad z_{mn} \mid \theta_m \sim \theta_m$
$\phi_k \mid H \sim H, \qquad x_{mn} \mid \phi_{z_{mn}} \sim F(\phi_{z_{mn}}) \qquad (6)$
As illustrated above, all the measures are drawn from the base DP, which means that the discrete base distribution is shared by every document in the corpus. The global cluster weights $\pi \sim \mathrm{GEM}(\gamma)$ follow a stick-breaking process and serve as the Dirichlet prior of the infinite topic distribution over terms.
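To make the two-level construction in (6) concrete, the following sketch (ours, not the paper's implementation; a finite truncation with hypothetical parameter values) draws the global weights and then a document-level topic distribution, using the fact that a DP draw over a finite discrete base measure $\pi$ reduces to a Dirichlet with parameter vector $\alpha_0 \pi$:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50                      # truncation level for the infinite topic set
gamma, alpha0 = 1.0, 1.0    # hypothetical concentration parameters

# Global weights: pi ~ GEM(gamma), truncated at K atoms
pi_prime = rng.beta(1.0, gamma, size=K)
pi = pi_prime * np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
pi /= pi.sum()              # fold the leftover stick into the truncation

# Document-level weights: theta_m ~ DP(alpha0, pi); with a finite discrete
# base measure this is a Dirichlet with parameter vector alpha0 * pi
theta_m = rng.dirichlet(alpha0 * pi)

# Topic assignments z_mn for a 10-word document, as in Eq. (6)
z_m = rng.choice(K, size=10, p=theta_m)
print(z_m)
```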
3.2 Infinite Temporal Topic Model
In this section, we present the infinite temporal topic model (ITTM), which incorporates time information during topic discovery, based on the infinite model of (Heinrich, 2011).
Ordinarily, HDP is used as a prior in mixture models to create an infinite topic model, limited to a specific time period. In fact, the universal time assignment of the documents can be taken advantage of and used as a time prior on the probability measures drawn from the global measure via the DP. To put it another way, the Dirichlet process obtains a higher probability of generating a new topic at a timestamp where the words congregate. Similar to the Topics over Time model, ITTM avoids discretization by associating continuous time distributions with topics. The time range is normalized between 0 and 1 so that the Beta distribution can easily form the peak of each topic's time distribution based on the time variable.

Figure 3: Infinite Temporal Topic Model. (a) Bayesian network, (b) Stick-breaking representation.
As illustrated in Figure 3, the ITTM can be represented from two perspectives. $\lambda$ is the parameter of the Beta distribution serving as the time prior, $t_{mn}$ represents the timestamp associated with word $n$ in document $m$, and $\psi$ denotes the parameter of its Beta distribution. The generative process is described as follows:
1. Draw an infinite-dimensional multinomial $\pi$: $\pi \mid \gamma \sim \mathrm{GEM}(\gamma)$
2. For each topic $z$, draw a multinomial $\phi_z$: $\phi_z \mid \beta \sim \mathrm{Dirichlet}(\beta)$
3. For each document $m$:
3.1 Draw a time prior $\xi \sim \mathrm{Beta}(\lambda)$
3.2 Update the probability measures: $\mu = N(t_m \mid \xi, \Sigma)$ and $\pi' = (\pi_{1:k-1}, \mu \pi_k)$, where $\pi = (\pi_{1:k-1}, \pi_k)$
3.3 Draw an infinite multinomial $\theta_m$: $\theta_m \mid \alpha_0, \pi \sim \mathrm{Dirichlet}(\alpha_0 \pi)$
3.4 For each word $n$ in the document:
i. Draw a topic $z_{mn} \sim \mathrm{Multinomial}(\theta_m)$
ii. Draw a word $w_{mn} \mid z_{mn} \sim \mathrm{Multinomial}(\phi_{z_{mn}})$
iii. Draw a time $t_{mn} \mid z_{mn} \sim \mathrm{Beta}(\psi_{z_{mn}})$
In order to implement the model, we use Gibbs sampling as the inference algorithm. Similar to infinite LDA, this model employs the Dirichlet as the base distribution, with $\pi \sim \mathrm{Dirichlet}(\gamma / K)$. Note that we acquire the hyper-parameter via the topics' global distribution over time. When updating the measures, we set $\pi'_k = \xi \pi_k$, where $\pi = (\pi_{1:k-1}, \pi_k)$ and the time prior is drawn from $\xi \sim \mathrm{Beta}(\lambda)$. In this process, we consider $K$ an infinite variable.
If there were no time prior to control the probability of a new topic being generated by the DP, new topics might be distributed evenly over time, which does not accord with the real distribution of topics. As the iterations accumulate, the probability of each word's topic assignment is gradually manipulated by the time factor, because the Beta distribution becomes sharper as its posterior is updated. Hence, we expect more topics to be generated at the timestamps where they congregate in reality, with the Beta distribution of each topic relatively smoother, so that we strike a balance between content relevance and time information.
Sampling z. The stick-breaking representation models the topic distribution over terms with a Dirichlet distribution, the same as LDA, so the conditional probability is:

$p(z_{mn} \mid \cdot) \propto (n_{m,z_{mn}} + \alpha \pi_{z_{mn}}) \cdot \dfrac{(1 - t_{mn})^{\psi_{z_{mn},1} - 1} \, t_{mn}^{\psi_{z_{mn},2} - 1}}{B(\psi_{z_{mn},1}, \psi_{z_{mn},2})} \cdot \dfrac{n_{z_{mn}}^{w_{mn}} + \beta_{w_{mn}} - 1}{\sum_{i=1}^{V} (n_{z_{mn}}^{i} + \beta_i) - 1} \qquad (7)$
As new topics may be generated, the measures $\pi$ should be updated; the posterior probability is:

$\pi \sim \mathrm{Dirichlet}(u_1, u_2, \ldots, u_{k-1}, \gamma) \qquad (8)$
where $u_j$ is the total number of words assigned to the $j$th topic in all documents, and the parameter $\gamma$ manipulates the probability of generating a new topic. $n_{m,z_{mn}}$ represents the number of words assigned to topic $z_{mn}$ in document $m$. Besides, the parameters of the Beta distribution associated with every topic capture the spikes of the trends; with $\bar{t}_z$ and $s_z^2$ the sample mean and variance of the timestamps assigned to topic $z$, they are updated as:

$\psi_{z,1} = \bar{t}_z \left( \dfrac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right), \qquad \psi_{z,2} = (1 - \bar{t}_z) \left( \dfrac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \qquad (9)$
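For illustration, here is a minimal sketch (ours, not the authors' code) of the method-of-moments update (9) for one topic's Beta parameters, given the timestamps currently assigned to it:

```python
import numpy as np

def update_beta_params(t_z):
    """Method-of-moments fit of the Beta parameters in Eq. (9) from the
    timestamps t_z currently assigned to one topic (normalised to (0,1))."""
    t_bar = float(np.mean(t_z))
    s2 = max(float(np.var(t_z)), 1e-6)   # guard against zero variance
    common = t_bar * (1.0 - t_bar) / s2 - 1.0
    return t_bar * common, (1.0 - t_bar) * common

# e.g. a topic whose articles cluster in early summer (invented data)
print(update_beta_params([0.40, 0.45, 0.47, 0.50, 0.52]))
```

The tighter the timestamps cluster, the larger both parameters become, sharpening the topic's time distribution as described above.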
3.3 Superposition Topic Model
In HDP, the topic number is determined by the corpus itself and the hyper-parameters associated with the Dirichlet process. That is, it finds the most suitable number of components on the basis of their internal aggregation and their mutual discrimination. A new component is generated when none of the existing components seems to fit. As mentioned above, the generation of new components via HDP is limited to a specific time period, which means the whole clustering process is based solely on content relevance. Hence, we present a superposition topic model (STM) to make use of the dynamic feature of HDP from another perspective.

Figure 4: Superposition Topic Model.
In the beginning, we discretize the corpus by time beforehand. The STM discovers topics on the initial part of the corpus as normal, without setting the number of topics. After achieving the initial distributions, the STM proceeds to the following part of the corpus, where it is more likely to generate new topics by HDP, because the focus of the documents shifts as time elapses. Likewise, if the contents of the new articles are nearly the same as the previous ones, the topics retain their old number. From the perspective of the Chinese restaurant process, a group of newcomers would rather hold together at a new table than join the previous tables with unfamiliar dishes. Hence, every new part of the corpus is processed upon the preceding topic distributions, and the STM is updated simultaneously. When we employ the STM on a news dataset, the progressively generated topics unfold the diversion of public attention correspondingly. Its graphical representation is shown in Figure 4, and the generative process is described as follows:
1. Draw an infinite-dimensional multinomial $\pi$: $\pi \mid \gamma \sim \mathrm{GEM}(\gamma)$
2. For each topic $z$, draw a multinomial $\phi_z$: $\phi_z \mid \beta \sim \mathrm{Dirichlet}(\beta)$
3. For each part $d$ of the corpus:
3.1 If $d \neq 1$, update the measures $\pi$: $\pi \sim \mathrm{Dirichlet}(u_1, u_2, \ldots, u_{k-1}, \gamma)$
3.2 Draw an infinite multinomial $\theta_m$: $\theta_m \mid \alpha_0, \pi \sim \mathrm{Dirichlet}(\alpha_0 \pi)$
3.3 For each word $n$ in the document:
i. Draw a topic $z_{mn} \sim \mathrm{Multinomial}(\theta_m)$
ii. Draw a word $w_{mn} \mid z_{mn} \sim \mathrm{Multinomial}(\phi_{z_{mn}})$
Since the global measures $\pi$ are updated in time order, the content relevance between the preceding documents and the newly arrived ones is the most significant inducement for generating a new topic. Hence, there is no need for us to get entangled in appending temporal information while increasing the complexity of the model structure, which Occam's razor explains primely. The HDP automatically determines whether a sampled term joins a previous topic or forms a new one, and apparently these judgements are affected by time to a large extent.
Sampling z. The conditional probability when sampling $z$ is:

$p(z_{mn} \mid \cdot) \propto (n_{m,z_{mn}} + \alpha_0 \pi_{z_{mn}}) \cdot \dfrac{n_{z_{mn}}^{w_{mn}} + \beta_{w_{mn}} - 1}{\sum_{i=1}^{V} (n_{z_{mn}}^{i} + \beta_i) - 1} \qquad (10)$
which is similar to traditional LDA. The updating step of the hyper-parameters is vital in this gradual inference process.
InfiniteTopicModellingforTrendTracking-HierarchicalDirichletProcessApproacheswithWikipediaSemanticbased
Method
39
Table 1: A comparison of topic distribution.
Original topic: healthcare (0.0480), bill (0.0452), insurance (0.0396), health (0.0387), reform (0.0347), coverage (0.0179), americans (0.0168), option (0.0128), applause (0.0088), committee (0.0078), finance (0.0072), secretary (0.0070)
Wiki topic: Healthcare (0.0607), Bill (0.0568), Insurance (0.0476), Reform (0.0439), Health (0.0423), Coverage (0.0205), Americans (0.0203), Option (0.0155), Debate (0.0111), Secretary (0.0104), Applause (0.0101), House (0.0101)
Even though all the topics in the corpus remain exchangeable, the words in preceding documents are not sampled again in the subsequent inference process. Documents in each part of the corpus are sampled only within their own sub-corpus, which belongs to a specific epoch, and words already sampled in another epoch are never reassigned to a different topic, in order to prevent topic drifting. Hence, the global cluster weights are progressively updated and determine the topic assignment of every term in the subsequent documents.
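As a toy sketch (ours; the counts are invented), the measure update of step 3.1 across epochs can be written as follows; the trailing $\gamma$ entry is the mass reserved for generating a new topic:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_global_weights(u, gamma, rng):
    """STM step 3.1: refresh the global topic weights from the per-topic
    word counts u accumulated over the preceding epochs; the trailing
    gamma entry reserves probability mass for a not-yet-seen topic."""
    return rng.dirichlet(np.append(np.asarray(u, dtype=float), gamma))

# Toy run over three epochs: as counts accumulate, the mass left for
# generating a new topic shrinks
u = np.array([40.0, 25.0, 10.0])   # words per topic after the first epoch
for epoch in range(2, 5):
    pi = update_global_weights(u, gamma=5.0, rng=rng)
    print(f"epoch {epoch}: new-topic mass = {pi[-1]:.3f}")
    u = u + rng.multinomial(200, pi[:-1] / pi[:-1].sum())
```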
3.3.1 Wikipedia Semantic Knowledge
Wikipedia concepts are commonly employed to overcome the drawbacks of the bag-of-words (BOW) representation in text analysis. In this paper, however, we exploit Wikipedia concepts as a means of extracting entities from a specific document, owing to their ontological properties. Besides, we also extract capitalized words as an auxiliary source, as there exists a minority of abbreviations and human names that Wikipedia is as yet incapable of interpreting. Hence, we integrate those words with the Wikipedia concepts as our entity vocabulary.
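As a minimal sketch (ours; WIKI_CONCEPTS is a hypothetical stand-in for the full inventory of Wikipedia titles), this integration could look like:

```python
import re

# Tiny stand-in for the Wikipedia concept inventory; in practice this
# would be the full set of Wikipedia article titles.
WIKI_CONCEPTS = {"healthcare", "insurance", "white house", "north korea"}

def extract_entities(text):
    """Sketch of the entity-vocabulary construction: Wikipedia concept
    matches plus capitalized words (covering names and abbreviations
    Wikipedia may not interpret)."""
    lowered = text.lower()
    entities = {c for c in WIKI_CONCEPTS if c in lowered}
    # Auxiliary: capitalized words (a crude proxy for names and acronyms)
    entities |= {t for t in text.split()
                 if re.fullmatch(r"[A-Z][A-Za-z.\-]*", t)}
    return entities

print(extract_entities("Obama defends healthcare reform at the White House"))
```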
For the sake of brevity, we extract only the document part of the graphical model representation, so that the approach becomes easier to comprehend and extend. In this model, we employ two variables to model the entities when discovering topics: $e_{mc}$ and $z_{mc}$ represent the $c$th entity in document $m$ and the topic assigned to that entity, respectively. According to the graphical model in Figure 5, the entity $e$ is an observed variable which depends on the assigned topic $z_{mc}$ and its topic distribution parameter $\phi$. Besides, $z_{mc}$ depends on the topic assignments of the words in document $m$. The generative process is described as follows:

Figure 5: Graphical Model of the Wikipedia Semantic Approach from the Document Perspective.
1. Draw the hierarchical Dirichlet process prior
2. For each topic $z$, draw a multinomial over words $\phi_z$: $\phi_z \mid \beta \sim \mathrm{Dirichlet}(\beta)$
3. For each topic $z$, draw a multinomial over entities $\phi^e_z$: $\phi^e_z \mid \beta \sim \mathrm{Dirichlet}(\beta)$
4. For each document $m$:
4.1 Draw an infinite multinomial $\theta_m$: $\theta_m \mid \alpha_0, \pi \sim \mathrm{Dirichlet}(\alpha_0 \pi)$
4.2 For each word $n$ in the document:
i. Draw a topic $z_{mn} \sim \mathrm{Multinomial}(\theta_m)$
ii. Draw a word $w_{mn} \mid z_{mn} \sim \mathrm{Multinomial}(\phi_{z_{mn}})$
4.3 For each entity $c$ in the document:
i. Draw a topic $z_{mc} \mid \mathbf{Z}_m, N_m \sim \mathrm{Multinomial}(\mathbf{Z}_m / N_m)$
ii. Draw an entity $e_{mc} \mid z_{mc} \sim \mathrm{Multinomial}(\phi^e_{z_{mc}})$

Here $\mathbf{Z}_m$ denotes the topic assignments of the $N_m$ words in document $m$, and we write $\phi^e$ for the entity-topic multinomials to distinguish them from the word-topic ones.
Theoretically, after the sampling process, the topic distribution over terms differs from the distribution over entities, and they may differ in the number of topics because of the dynamic property of the infinite model. However, after the Gibbs sampling process, their representative meanings turn out to be extraordinarily similar. Moreover, the topics over entities achieve better readability and lower perplexity (as illustrated in Table 1). When employed in the trend tracking scenario, the Wikipedia semantic based approach discovers more trend-specific topical entities, while the increase in the complexity of the inference algorithm is limited to a linear magnitude.
Sampling z. The word sampling process remains the same as the original form (7) or (10), while the conditional probability of the entity topic $z_{mc}$ is:

$p(z_{mc} \mid \cdot) \propto \dfrac{S_{m,z_{mc}}}{N_m} \cdot \dfrac{n_{z_{mc}}^{e_{mc}} + \beta_{e_{mc}} - 1}{\sum_{i=1}^{V} (n_{z_{mc}}^{i} + \beta_i) - 1} \qquad (11)$

where $S_{m,z_{mc}}$ denotes the number of words in document $m$ that have been assigned to topic $z_{mc}$, $S_{m,z_{mc}} / N_m$ represents the prior of generating topic $z_{mc}$ in document $m$, and $V$ denotes the size of the entity vocabulary.
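For illustration, here is a sketch (ours, with toy counts) of the entity-topic conditional in (11); we assume, as is usual in collapsed Gibbs sampling, that the counts already exclude the entity being resampled, which is where the $-1$ terms in (11) come from:

```python
import numpy as np

def entity_topic_conditional(S_m, N_m, n_ze, beta, e):
    """Unnormalised p(z_mc = k | .) of Eq. (11) for entity e.
    S_m[k]: words of document m assigned to topic k (the document prior);
    n_ze:   topic-by-entity count matrix, assumed to already exclude the
            entity being resampled (hence no explicit -1 terms here).
    """
    V = n_ze.shape[1]                     # size of the entity vocabulary
    prior = S_m / N_m
    likelihood = (n_ze[:, e] + beta) / (n_ze.sum(axis=1) + beta * V)
    return prior * likelihood

# Toy example: 3 topics, 4 entities
S_m = np.array([12.0, 5.0, 3.0]); N_m = 20.0
n_ze = np.array([[8., 1., 0., 2.], [0., 6., 1., 0.], [1., 0., 4., 0.]])
p = entity_topic_conditional(S_m, N_m, n_ze, beta=0.1, e=0)
print(p / p.sum())   # normalised distribution to sample z_mc from
```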
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
40
4 EXPERIMENT
4.1 BBC News
In this section, we present the discovered topics and their evolutionary processes on 3,500 BBC news articles, detailed in Table 2. We also examine with interest the differences between ITTM and STM. In the experiment, ITTM discovered 90 topics in 1000 iterations, while STM discovered 80 topics in 2400 iterations (200 iterations in every epoch-specific sub-corpus). For the purpose of interpreting model effectiveness, we analysed the raw corpus by keyword matching based on the discovered topics, so that we can evaluate the degree to which the tracked trends match reality.
Table 2: Details of the BBC news corpus.
Month     Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Articles  309  284  279  285  279  255  309  273  318  260  310  339
4.1.1 ITTM
As illustrated in Figure 6(a), there exist about 20 topics with a high heat score, while the others each stay at a low level. Most of these 20 topics concern politics or society, such as Obama, Military and Government, which illustrates that the vane of the news trend points at politics. In Figure 6(b), we find that the general heat score rises sharply from June into July and August. Looking back at the raw news articles, we find several booming news stories around July and August, such as the helicopter crash, Michael Jackson's death, the Dugard kidnapping case and so on. Unfortunately, ITTM did not organize them as separate topics, because such topics are more likely to be absorbed into topics with wide coverage like Topic Society or Topic Criminal.
4.1.2 STM
STM discovered 80 topics in the experiment, and we recorded the result for every sub-corpus, as shown in Table 3. In the first month, STM generates 34 topics, and the number rises gradually to a final 80. Since the global probability measure is shared across all sub-corpora, we can extract the evolutionary process by tracing the transformation of the topic distribution over terms, represented by $\phi$. By calculating the KL divergence between every two consecutive versions of a topic, we obtain an evolutionary process (shown partly in Table 4). We classify all topics into three categories: stable topics, new topics and vanished topics. In Table 4, the topics marked grey change only slightly during their whole lifetime and their KL divergence remains below 1.0, so we name them stable topics. Correspondingly, we call the topics generated by STM in each epoch new topics. Moreover, when an old topic vanishes, a new topic is generated concomitantly: the bold numbers exceeding 1.0 in Table 4 divide a topic into two parts, the latter of which are also counted as new topics, while the former are regarded as vanished topics. Table 5 illustrates the evolution of topics from vanished ones to new ones.

Figure 6: Topics of ITTM in 2009. (a) Vertex of every topic distribution, (b) General topic heat assigned to every month.
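For illustration, a minimal sketch (ours) of the KL computation used to build Table 4, comparing a topic's term distribution in two consecutive epochs; a value above 1.0 is the threshold used here for a topic diversion:

```python
import numpy as np

def kl_divergence(phi_prev, phi_curr, eps=1e-12):
    """KL(phi_prev || phi_curr) between a topic's term distributions in
    two consecutive epochs; values above 1.0 flag a topic diversion."""
    p = np.asarray(phi_prev) + eps
    q = np.asarray(phi_curr) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# A stable topic barely moves...
print(kl_divergence([0.5, 0.3, 0.2], [0.48, 0.32, 0.20]))   # ~0.001
# ...while a diverted topic shifts most of its mass
print(kl_divergence([0.5, 0.3, 0.2], [0.05, 0.15, 0.80]))   # > 1.0
```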
In total, the stable topics, new topics and vanished topics number 50, 49 and 19 respectively. In order to exhibit the evolutionary processes of the topics, we present 23 topics in two parts with their heat scores in Figure 7. Most topics in Figure 7(a) are stable ones, while most of those in Figure 7(b) are new booming topics. The peaks of the broken lines show the trends clearly, and they are readily explained by the big events of 2009, for instance the “Flu pandemic” in April, “Enough bomb-grade uranium of Iran” in April, the “Helicopter crash of Maryland” in July, the “Jaycee Lee Dugard abduction case” in August and the “Fort Hood shootings at US army base” in November. Meanwhile, stable topics such as Topic Politics, Guantanamo, Economy, Criminal and Iraq War receive persistent public attention.
InfiniteTopicModellingforTrendTracking-HierarchicalDirichletProcessApproacheswithWikipediaSemanticbased
Method
41
Figure 7: Topic trends of STM. (a) Mostly stable topics, (b) mostly new booming topics.
Afterwards, we employed the Wikipedia semantic based approach on STM and achieved better readability and more distinctly organized topics. Table 3 shows a significant decline in perplexity, and Table 1 presents more topic-representative entities with higher probabilities, which demonstrates that the approach contributes to tracking events to some extent.
4.2 Trend Tracking Analysis
For the purpose of better exhibiting the trend tracking results, we draw a curve via keyword matching which we consider to generally reflect the real trend of each topic. We thus obtain Figure 8, covering four topics: ‘Flu’, ‘Healthcare’, ‘Gay’ and ‘Train’. As shown in the figure, ITTM matches the spikes of the heat trends precisely, but is incapable of simulating the multi-spike trends of reality. For instance, in Figure 8(c) its curve drops from June and fails to match the peak in September. In contrast, STM simulates the real trend primely, despite the multiple spikes and some fluctuation over the year.
In general, ITTM generates a series of smooth curves to fit the real trends, extracting the spike of each topic's distribution over time while discovering topics. STM, by contrast, simulates the trends via the transformation of topic distributions, as the topics are dominated by the global probability measures. Even though the two approaches rest on different assumptions, both of them model the whole evolutionary processes of topics in general.
4.3 Model Effectiveness
On the basis of the experiments above, the findings suggest that the models are capable of tracking trends and yield a series of desirable results.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
42
Table 3: Experimental results of STM.
Month                     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Sum of topics             34    48    58    63    67    69    72    74    78    80    80    80
Generated new topics      34    14    10    7     4     2     3     2     4     2     0     0
Perplexity over terms     2817  2744  3132  2922  3172  2892  3358  3321  3427  3719  3676  3723
Perplexity over entities  1465  1668  2102  1910  2100  1899  2139  2157  2187  2414  2368  2342
Table 4: KL Divergence between one topic and its preceding one.
No Topic Jan Feb Mar Apr May Jun Jul Aug Sept Oct Nov Dec
Topic2 Politics - 0.170 0.056 0.036 0.030 0.027 0.043 0.027 0.012 0.008 0.013 0.013
Topic4 Guantanamo - 0.119 0.071 0.030 0.070 0.016 0.017 0.013 0.012 0.009 0.019 0.010
Topic7 Weather - 0.217 1.027 0.341 0.058 0.276 0.076 1.994 0.081 0.017 0.033 0.006
Topic11 Economy - 0.258 0.056 0.025 0.018 0.011 0.015 0.008 0.012 0.005 0.007 0.006
Topic12 Obama - 0.058 0.021 0.014 0.008 0.004 0.004 0.003 0.004 0.003 0.003 0.003
Topic13 Criminal - 0.275 0.147 0.047 0.052 0.021 0.019 0.018 0.015 0.013 0.014 0.008
Topic21 Flu - 0.631 0.259 4.143 0.401 0.008 0.025 0.003 0.005 0.007 0.006 0.005
Topic29 Healthcare - 2.998 0.373 0.072 0.043 0.039 1.008 0.227 0.352 0.048 0.094 0.213
Topic36 Army - - 1.338 1.585 0.027 0.061 0.250 0.155 0.061 0.066 1.890 0.015
Topic43 Local Politics - - 0.777 0.062 1.345 2.559 0.040 0.045 0.041 0.060 0.157 0.037
Topic44 Neoconservatism - - 1.993 0.055 0.154 0.001 1.060 0.041 0.031 0.116 0.014 0.122
Topic51 Court - - - 0.640 2.025 0.215 0.932 0.062 0.039 0.083 0.049 0.014
Topic56 Train - - - 1.120 0.427 2.431 0.280 0.030 0.074 0.033 0.123 0.013
Topic60 UN - - - - 1.357 2.755 1.346 0.027 0.451 0.050 0.040 0.018
Topic63 Helicopter Crash - - - - - 0.475 2.010 0.722 0.264 0.099 0.435 0.074
Table 5: Two examples of topic diversion.
Topic 56, May (‘Train’): British Trains Rail Travel Gordon Services Brown Sexual Peruvian Gdp Runs Minister Sort Position
Topic 56, Jun (‘Accident’): Bermuda Uighurs Train British China Trains Palau London Accident Foreign Government Four
Topic 7, Jul (‘Weather’): Weather Service Sheriff Myers Project Died Police County Storm Brother Snow Ms Couple
Topic 7, Aug (‘Criminal’): Garrido Ms Weather Dugard Police County Service Sheriff Project Myers Jaycee Phillip Probyn
Figure 8: Evolutionary processes of topics simulated by ITTM and STM: (a) Topic ‘Flu’, (b) Topic ‘Healthcare’, (c) Topic ‘Gay’, (d) Topic ‘Train’.
Table 6: Topic Perplexity.
Num. of articles  100   500   1000  2000  3500
ITTM              1493  2852  4288  4454  4514
STM               1435  2138  2640  3068  3551
STM&Wiki          1023  1336  1629  1779  2249
infinite LDA      1432  2008  2386  2718  3098
Likewise, we ran a further experiment on corpora of different magnitudes to reveal the effectiveness of each model. We employed ITTM, STM, STM&Wiki and infinite LDA on those corpora; the data are reported in Table 6.
Table 7: Experimental results on the Jan. 2010 corpus.
Model       ITTM  STM   STM&Wiki  infinite LDA
Sum topics  63    85    85        66
Perplexity  2365  2385  1461      2295
The results indicate that STM&Wiki obtains the best performance, while the perplexity of ITTM is slightly higher than that of the others. Furthermore, we prepared a corpus of Jan. 2010, containing 344 articles, for trend prediction. Table 7 reports the perplexity comparison on topic inference between these models.
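The paper does not spell out its perplexity formula; a common definition, assumed here as a sketch (ours), is the exponential of the negative average per-word log-likelihood on held-out documents:

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Held-out perplexity: exp(-sum_w log p(w) / N), where
    p(w | d) = sum_k theta[d, k] * phi[k, w]."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        word_probs = theta[d] @ phi[:, words]   # p(w) for each token
        log_lik += np.log(word_probs).sum()
        n_words += len(words)
    return float(np.exp(-log_lik / n_words))

# Toy check: 2 topics, a 3-word vocabulary, one 4-token document
phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
theta = np.array([[0.5, 0.5]])
print(perplexity([[0, 2, 2, 1]], theta, phi))
```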
InfiniteTopicModellingforTrendTracking-HierarchicalDirichletProcessApproacheswithWikipediaSemanticbased
Method
43
Table 8: Trend prediction (number of articles in each trend, and the number actually related to it).
ITTM  Articles  6 11 17 10 15 13 19 19 12 129 7 6 9 8 7
      Related   5  4  8  5  8  8 12  3  8  60 4 4 3 5 4
STM   Articles  7  9 16  6 85  8 15 96 17  13 6 6
      Related   3  5  6  3 48  5  6 42 10   9 4 3
iLDA  Articles 19  7  9  8  6  7 11 183 44 18
      Related   8  4  3  3  3  3  5  83 18 11
Then we organized the articles clustered by the models and made a manual evaluation (based on article titles and news descriptions), as shown in Table 8. The overall precision of trend prediction is 0.4896 for ITTM, 0.5070 for STM and 0.4519 for infinite LDA. Interestingly, we find that some trends contain many more articles than the others. The reason is that in the middle of Jan. 2010 a powerful earthquake rocked Haiti, which triggered a series of news reports on the disaster; most articles in those trends concern this event. After all, both ITTM and STM can predict a real-world trend successfully, even for a booming event like the “Haiti earthquake”.
5 CONCLUSIONS AND FUTURE
WORK
In this paper, we present two approaches incorporating HDP and temporal information for a real-world task without the Markov assumption, alongside a Wikipedia semantic based approach exploited to improve the topic modelling results. The models keep complexity at a low level with succinct graphical representations. The experimental results indicate their capability of tracking trends in the news media. As a significant finding, ITTM simulates the peaks of event trends precisely but fails to handle multi-spike situations, while STM is capable of tracking trends with fluctuations and of discovering new topics, stable topics and vanished topics. Owing to their flexibility and the absence of a limit on the number of topics, the models can easily be extended to other scenarios. Our future work may focus on tracking user interest by incorporating propagation algorithms based on the proposed models; the combination of infinite topic modelling with a location factor is also under consideration.
REFERENCES
Ahmed, A. and Xing, E. P. (2010). Timeline: A dynamic
hierarchical dirichlet process model for recovering
birth/death and evolution of topics in text stream. In
UAI ’10.
AlSumait, L., Barbara, D., and Domeniconi, C. (2008).
On-line lda: Adaptive topic models for mining text
streams with applications to topic detection and track-
ing. In ICDM ’08, pages 3–12.
Balasubramanyan, R., Cohen, W. W., and Hurst, M. (2009).
Modeling corpora of timestamped documents using
semisupervised nonparametric topic models. In NIPS.
Blei, D., Ng, A., Jordan, M., and Lafferty, J. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In ICML.
Ferguson, T. (1973). Bayesian analysis of some nonpara-
metric problems. Annals of Statistics, 1:209–230.
Heinrich, G. (2011). “Infinite LDA” – implementing the HDP with minimum code complexity. Technical note.
Hofmann, T. (1999). Probabilistic latent semantic indexing.
In SIGIR.
Hong, L., Yin, D., Guo, J., and Davison, B. D. (2011).
Tracking trends: Incorporating term volume into tem-
poral topic models. In KDD.
Kataria, S. S., Kumar, K. S., Rastogi, R., Sen, P., and Sen-
gamedu, S. H. (2011). Entity disambiguation with hi-
erarchical topic models. In KDD.
Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104:211–240.
Lau, J. H., Grieser, K., Newman, D., and Baldwin, T.
(2011). Automatic labelling of topic models. In Pro-
ceedings of the 49th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 1536–1545.
Newman, D., Chemudugunta, C., and Smyth, P. (2006). Statistical entity-topic models. In KDD.
Ni, X., Sun, J.-T., Hu, J., and Chen, Z. (2009). Mining mul-
tilingual topics from wikipedia. In WWW.
Ren, L., Dunson, D. B., and Carin, L. (2008). The dynamic
hierarchical dirichlet process. In ICML.
Sudderth, E. B. (2006). Graphical models for visual ob-
ject recognition and tracking. Doctoral Thesis, Mas-
sachusetts Institute of Technology.
Teh, Y., Jordan, M., Beal, M., and Blei, D. (2006). Hierarchical dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.
Wang, C., Blei, D. M., and Heckerman, D. (2008). Contin-
uous time dynamic topic models. In UAI ’08, pages
579–586.
Wang, X. and McCallum, A. (2006). Topics over time: a non-markov continuous-time model of topical trends. In KDD.
Zhang, J., Song, Y., Zhang, C., and Liu, S. (2010). Evo-
lutionary hierarchical dirichlet processes for multiple
correlated time-varying corpora. In KDD.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
44