Table 1: The 30 most significant co-occurrences and their significance values (cut to 3 digits) in the global context of "abu ghraib" on May 10, 2004.
prisoners 0.346, abuse 0.346, secretary 0.259, abu ghraib prison 0.259, iraqi 0.247, rumsfeld 0.221, military 0.218, prison 0.218, bush 0.210, prisoner 0.200, photographs 0.183, donald 0.183, secretary of defense 0.174, prisons 0.174, photos 0.174, the scandal 0.174, interrogation 0.163, naked 0.163, mistreatment 0.163, under 0.162, soldier 0.154, saddam 0.154, armed 0.154, defense 0.143, the bush 0.140, senate 0.140, videos 0.130, torture 0.130, arab 0.130, captured 0.130
text of the topic's name), and the notion of a concept to mean an equivalence class of semantically related words. The global context of a topic's name is the set of all its statistically significant co-occurrences within a corpus. We compute a term's set of co-occurrences on the basis of the term's joint appearance with its co-occurring terms within a predefined text window, using an appropriate measure of statistical significance for co-occurrence. The significance values are computed using the log-likelihood measure following (Dunning, 1993) and afterwards normalized to the actual corpus size. These significance values serve only to rank the co-occurring terms; their absolute values are not considered at all. Table 1 exemplifies the global context computed for the term "abu ghraib" based on the New York Times corpus of May 10, 2004. The value listed after each term indicates its statistical significance (normalized to the corpus size and multiplied by 10^6), which is used to rank the co-occurring terms (cf. Fig. 1).
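The following sketch illustrates how such a global context could be computed; it is not the authors' implementation, and the sentence-sized windows, the function names and the normalization by corpus size times 10^6 are assumptions made here for illustration:

import math
from collections import Counter
from itertools import combinations

def log_likelihood(k11, k12, k21, k22):
    """Dunning's (1993) log-likelihood ratio for a 2x2 contingency table."""
    def h(*counts):
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    # row sums: windows with / without term a
                - h(k11 + k21, k12 + k22))   # column sums: windows with / without term b

def global_context(windows, top_n=30, scale=1e6):
    """Global context of every term: its co-occurrences within the given
    text windows (each a list of tokens), ranked by normalized significance."""
    n_windows = len(windows)
    term_freq = Counter()   # number of windows containing a term
    pair_freq = Counter()   # number of windows containing both terms of a pair
    for tokens in windows:
        types = set(tokens)
        term_freq.update(types)
        pair_freq.update(combinations(sorted(types), 2))

    contexts = {}
    for (a, b), k11 in pair_freq.items():
        k12 = term_freq[a] - k11            # windows with a but not b
        k21 = term_freq[b] - k11            # windows with b but not a
        k22 = n_windows - k11 - k12 - k21   # windows with neither
        # normalize to the corpus size and scale, as done for Table 1
        sig = log_likelihood(k11, k12, k21, k22) / n_windows * scale
        contexts.setdefault(a, []).append((b, sig))
        contexts.setdefault(b, []).append((a, sig))
    return {t: sorted(cs, key=lambda c: -c[1])[:top_n] for t, cs in contexts.items()}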
3 THE SETTING
The processing of large and very large document collections poses several difficulties which make it hard to provide substantial help for a user who wants to access certain documents, especially when the exact item or its position is unknown to the user. The state-of-the-art interfaces for accessing large document collections are indices like Google and other search engines, which rely mainly on indexing all or statistically relevant terms, and structured catalogues like (web) OPACs, which need annotated metadata for each document and use these for filtering.
The most hampering aspect is the sheer amount of data itself and the complexity of its analysis. For instance, computing the global contexts of all terms in a corpus has a time and space complexity of O(n^2), where n, the number of types, is about 1,000,000 to 10,000,000. Therefore it is difficult to compute, and even to define, appropriate, useful and meaningful measures describing terms and their relations. Thus most analyses rely on term frequency, which is efficiently computable, and, for example, on relevance measures comparing the local term frequency in a document to the total frequency in a reference corpus.
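As a hypothetical illustration of such a frequency-based relevance measure (the concrete measure is not specified above), each document term could be scored by the log-ratio of its relative frequency in the document to its relative frequency in a reference corpus:

import math
from collections import Counter

def frequency_relevance(doc_tokens, ref_freq, ref_size):
    """Hypothetical relevance score: weight each document term by the log-ratio
    of its local relative frequency to its relative frequency in a reference
    corpus (add-one smoothing for terms unseen in the reference)."""
    doc_freq = Counter(doc_tokens)
    doc_size = len(doc_tokens)
    scores = {}
    for term, f_local in doc_freq.items():
        p_local = f_local / doc_size
        p_ref = (ref_freq.get(term, 0) + 1) / (ref_size + 1)
        scores[term] = p_local * math.log(p_local / p_ref)
    return scores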
We aim for a new paradigm for interacting with large time-related corpora. Obviously, it is impossible to present information about every document in a large collection at once: if there are, for instance, 1.6 million documents as in the New York Times corpus (cf. Sect. 6), only about 0.82 pixels per document are available for visualization, assuming a standard screen with 1280 × 1024 pixels. So an aggregated view on the content is necessary, and this view should enable a visualization-based interactive exploration of the collection which is driven by the user's attention and intent, providing details on demand.
Therefore we want to identify the most relevant terms in the sense that these terms are related to the most considerable developments over the time span of the corpus. We establish the measure of the volatility of a term (see next section) to capture the change of its global context, which indicates a change in the usage of the term. This allows us to provide an overview of the most strongly evolving topics as an entry point into the whole collection.
4 VOLATILITY COMPUTATION
The basis of our analysis is a set of time slice corpora. These are corpora belonging to a certain period of time, e.g. all newspaper articles of the same day. A change in the meaning of a term is assessed by comparing the term's global contexts across the different time slice corpora.
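A minimal sketch of building such time slice corpora, assuming the articles are available as (date, tokens) pairs (a hypothetical data layout, not the authors' preprocessing):

from collections import defaultdict

def build_time_slices(articles):
    """Group articles, given as (date, tokens) pairs, into time slice corpora,
    here one corpus per day."""
    slices = defaultdict(list)
    for date, tokens in articles:
        slices[date].append(tokens)
    # chronologically sorted list of per-day corpora
    return [docs for _day, docs in sorted(slices.items())]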
The measure of the change of meaning is volatility. It is derived from the risk measure widely used in econometrics and finance¹. It is based on the sequence of the significant co-occurrences in the global context, sorted according to their significance values (see Sect. 2), and measures the change of these sequences over different time slices. The rationale is that a change in the meaning of a certain term leads to a change in the usage of this term together with other terms, and therefore to a (possibly slight) change of its co-occurrences and their significance values in the time-slice-specific global context of the term. The exact algorithm to obtain the volatility of a certain term is shown in Fig. 1. For the detailed natural language processing background see (Holz and Teresniak, 2010).
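The following is only a rough sketch of this idea, not the algorithm of Fig. 1: it assumes that volatility can be approximated by the mean coefficient of variation of the co-occurrence ranks across time slices, and it reuses the hypothetical global_context() output sketched in Sect. 2.

import statistics

def volatility(contexts_per_slice, term):
    """Rough approximation of a term's volatility: the mean coefficient of
    variation of the ranks of its co-occurring terms across time slices.
    contexts_per_slice is a list of dicts (one per time slice) mapping a term
    to its co-occurrences sorted by descending significance, e.g. the output
    of the hypothetical global_context() sketched above."""
    rank_series = {}
    for slice_contexts in contexts_per_slice:
        for rank, (co_term, _sig) in enumerate(slice_contexts.get(term, []), start=1):
            rank_series.setdefault(co_term, []).append(rank)

    coefficients = []
    for ranks in rank_series.values():
        if len(ranks) < 2:
            continue  # no rank variation observable from a single time slice
        coefficients.append(statistics.pstdev(ranks) / statistics.mean(ranks))
    return statistics.mean(coefficients) if coefficients else 0.0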
¹ But it is calculated differently and is not based on the widely used gain/loss measures. For an overview of various approaches to volatility see (Taylor, 2007).