Table 1: The 30 most significant co-occurrences and their significance values (cut to 3 digits) in the global context of "abu ghraib" on May 10, 2004.
prisoners 0.346, abuse 0.346, secretary 0.259, abu ghraib prison 0.259, iraqi 0.247, rumsfeld 0.221, military 0.218, prison 0.218, bush 0.210, prisoner 0.200, photographs 0.183, donald 0.183, secretary of defense 0.174, prisons 0.174, photos 0.174, the scandal 0.174, interrogation 0.163, naked 0.163, mistreatment 0.163, under 0.162, soldier 0.154, saddam 0.154, armed 0.154, defense 0.143, the bush 0.140, senate 0.140, videos 0.130, torture 0.130, arab 0.130, captured 0.130
text of the topic's name), and the notion of a concept to mean an equivalence class of semantically related words. The global context of a topic's name is the set of all its statistically significant co-occurrences within a corpus. We compute a term's set of co-occurrences on the basis of the term's joint appearance with its co-occurring terms within a predefined text window, using an appropriate measure of statistical significance for co-occurrence. The significance values are computed using the log-likelihood measure following (Dunning, 1993) and afterwards normalized to the actual corpus size. These significance values serve only to rank the co-occurring terms; their absolute values are not considered at all. Table 1 exemplifies the global context computed for the term "abu ghraib" based on the New York Times corpus of May 10, 2004. The value listed after each term indicates its statistical significance (normalized to the corpus size and multiplied by 10^6), which is used to rank the co-occurring terms (cf. Fig. 1).
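The following sketch illustrates how such a global context could be computed; it is not the authors' implementation, and the sentence-sized windows, the function names and the normalization by corpus size times 10^6 are assumptions made here for illustration:

import math
from collections import Counter
from itertools import combinations

def log_likelihood(k11, k12, k21, k22):
    """Dunning's (1993) log-likelihood ratio for a 2x2 contingency table."""
    def h(*counts):
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    # row sums: windows with / without term a
                - h(k11 + k21, k12 + k22))   # column sums: windows with / without term b

def global_context(windows, top_n=30, scale=1e6):
    """Global context of every term: its co-occurrences within the given
    text windows (each a list of tokens), ranked by normalized significance."""
    n_windows = len(windows)
    term_freq = Counter()   # number of windows containing a term
    pair_freq = Counter()   # number of windows containing both terms of a pair
    for tokens in windows:
        types = set(tokens)
        term_freq.update(types)
        pair_freq.update(combinations(sorted(types), 2))

    contexts = {}
    for (a, b), k11 in pair_freq.items():
        k12 = term_freq[a] - k11            # windows with a but not b
        k21 = term_freq[b] - k11            # windows with b but not a
        k22 = n_windows - k11 - k12 - k21   # windows with neither
        # normalize to the corpus size and scale, as done for Table 1
        sig = log_likelihood(k11, k12, k21, k22) / n_windows * scale
        contexts.setdefault(a, []).append((b, sig))
        contexts.setdefault(b, []).append((a, sig))
    return {t: sorted(cs, key=lambda c: -c[1])[:top_n] for t, cs in contexts.items()}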
3 THE SETTING
The processing of large and very large document collections poses several difficulties which make it hard to provide substantial help for a user who wants to access certain documents, especially when the exact item or its position is unknown to the user. The state-of-the-art interfaces for accessing large document collections are indices like Google and other search engines, which rely mainly on indexing all or statistically relevant terms, and structured catalogues like (web) OPACs, which need annotated metadata for each document and use these for filtering.
The most hampering aspect is the sheer amount of data itself and the complexity of its analysis. For instance, computing the global contexts of all terms in a corpus has a time and space complexity of O(n^2), where n, the number of types, is about 1,000,000 to 10,000,000. Therefore it is difficult to compute, and even to define, appropriate, useful and meaningful measures describing terms and their relations. Thus most analyses rely on term frequency, which is efficiently computable, and, for example, on relevance measures comparing the local term frequency in a document to the total frequency in a reference corpus.
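As a hypothetical illustration of such a frequency-based relevance measure (the concrete measure is not specified above), each document term could be scored by the log-ratio of its relative frequency in the document to its relative frequency in a reference corpus:

import math
from collections import Counter

def frequency_relevance(doc_tokens, ref_freq, ref_size):
    """Hypothetical relevance score: weight each document term by the log-ratio
    of its local relative frequency to its relative frequency in a reference
    corpus (add-one smoothing for terms unseen in the reference)."""
    doc_freq = Counter(doc_tokens)
    doc_size = len(doc_tokens)
    scores = {}
    for term, f_local in doc_freq.items():
        p_local = f_local / doc_size
        p_ref = (ref_freq.get(term, 0) + 1) / (ref_size + 1)
        scores[term] = p_local * math.log(p_local / p_ref)
    return scores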
We aim for a new paradigm for interacting with large time-related corpora. Obviously, it is impossible to present information about every document in a large collection at once: if there are, for instance, 1.6 million documents as in the New York Times corpus (cf. Sect. 6), only about 0.82 pixels per document are available for visualization, assuming a standard screen with 1280 × 1024 pixels. So an aggregated view on the content is necessary, and this view should enable a visualization-based interactive exploration of the collection which is driven by the user's attention and intent, providing details on demand.
Therefore we want to identify the most relevant terms in the sense that these terms are related to the most considerable developments over the time span of the corpus. We establish the measure of the volatility of a term (see next section) to capture the change of its global context, which indicates a change in the usage of the term. This allows us to provide an overview of the most strongly evolving topics as an entry point into the whole collection.
4 VOLATILITY COMPUTATION
The basis of our analysis is a set of time slice corpora. These are corpora belonging to a certain period of time, e.g. all newspaper articles of the same day. A change in the meaning of a term is assessed by comparing the term's global contexts across the different time slice corpora.
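A minimal sketch of building such time slice corpora, assuming the articles are available as (date, tokens) pairs (a hypothetical data layout, not the authors' preprocessing):

from collections import defaultdict

def build_time_slices(articles):
    """Group articles, given as (date, tokens) pairs, into time slice corpora,
    here one corpus per day."""
    slices = defaultdict(list)
    for date, tokens in articles:
        slices[date].append(tokens)
    # chronologically sorted list of per-day corpora
    return [docs for _day, docs in sorted(slices.items())]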
The measure of the change of meaning is volatility. It is derived from the risk measure widely used in econometrics and finance¹. It is based on the sequence of the significant co-occurrences in the global context, sorted according to their significance values (see Sect. 2), and measures the change of these sequences over different time slices. The rationale is that a change in the meaning of a certain term leads to a change in the usage of this term together with other terms, and therefore to a (possibly slight) change of its co-occurrences and their significance values in the time-slice-specific global context of the term. The exact algorithm to obtain the volatility of a certain term is shown in Fig. 1. For the detailed natural language processing background see (Holz and Teresniak, 2010).
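The following is only a rough sketch of this idea, not the algorithm of Fig. 1: it assumes that volatility can be approximated by the mean coefficient of variation of the co-occurrence ranks across time slices, and it reuses the hypothetical global_context() output sketched in Sect. 2.

import statistics

def volatility(contexts_per_slice, term):
    """Rough approximation of a term's volatility: the mean coefficient of
    variation of the ranks of its co-occurring terms across time slices.
    contexts_per_slice is a list of dicts (one per time slice) mapping a term
    to its co-occurrences sorted by descending significance, e.g. the output
    of the hypothetical global_context() sketched above."""
    rank_series = {}
    for slice_contexts in contexts_per_slice:
        for rank, (co_term, _sig) in enumerate(slice_contexts.get(term, []), start=1):
            rank_series.setdefault(co_term, []).append(rank)

    coefficients = []
    for ranks in rank_series.values():
        if len(ranks) < 2:
            continue  # no rank variation observable from a single time slice
        coefficients.append(statistics.pstdev(ranks) / statistics.mean(ranks))
    return statistics.mean(coefficients) if coefficients else 0.0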
¹ But it is calculated differently and is not based on the widely used gain/loss measures. For an overview of various approaches to volatility see (Taylor, 2007).