be exploited to assist users. A common assumption
is that tags are representative of resource semantics.
Recently, this assumption started being questioned.
Quoting Zanardi and Capra: “... as tags are informally
defined, continually changing, and ungoverned, social
tagging has often been criticized for lowering,
rather than increasing, the efficiency of searching ...”
(Zanardi and Capra, 2008, p. 51). Hence, it seems
reasonable to use tags, especially popular ones, with
caution. Nonetheless, we believe that visually summarizing
the contents of documents that people associate
with such tags helps users learn the multiple meanings
of a tag or understand which resources an as yet
unknown tag is used for. Summarizing and visualizing
document contents is a way of generating knowledge
in collaborative tagging systems that subsumes
the perspectives of many different users on a tag.
The study of content to assess tag semantics is not by
itself new. For example, Moxley et al. (2009) derive
semantics of tags assigned to Flickr pictures by an-
alyzing geographical coordinates of the depicted lo-
cations. However, we also account for the fact that
the meaning(s) associated with a tag may change over
time.
Besides AdaptivePLSA (Gohr et al., 2009), which extends
PLSA (Hofmann, 2001) to streaming document
collections and is used in this study, other approaches
(Mei and Zhai, 2005; Blei and Lafferty,
2006; Wang and McCallum, 2006) also model dynamic
document collections. Some allow words to
become obsolete and irrelevant while others emerge
(AlSumait et al., 2008; Chou and Chen, 2008). Capturing
terminological evolution is indispensable for
visualizing the semantic evolution of tags, because
that evolution is inevitably associated with the in-
creased importance of some words that were irrele-
vant or unknown in the past.
3 SUMMARIZING DOCUMENTS
The aim is to provide users of collaborative tagging
systems with a summary, in the form of document
prototypes, of the contents filed under a tag, so that users
who are in doubt about the meaning and usage of a certain tag
can inspect this summary to clarify its meaning.
3.1 Document Prototypes
The contents of a document collection $\vec{D}$ can be summarized
by prototypes of the documents in $\vec{D}$. Document
prototypes abstract from the documents and thereby
describe the whole set of documents in a condensed
way. Inspecting them thus gives an overview
of the contents of the documents. Because their
number is much smaller than the number of documents,
inspecting prototypes is more efficient than
reading single documents. We denote the collection
of documents by the vector $\vec{D}$ of document IDs to allow
for multiple occurrences of documents.
We use probabilistic topic modeling of documents
to derive document prototypes. Topic modeling of-
ten assumes topics to be represented by multinomial
distributions over words of the vocabulary (Hofmann,
2001; Blei et al., 2003). Topics capture patterns of
words that often co-occur in different documents.
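As a point of reference, the PLSA model of Hofmann (2001) can be written as a mixture of topic-specific word distributions. The symbols below ($w$ for a word, $d$ for a document, $z_k$ for the $k$-th of $K$ topics) are our notation for this illustration and may differ from the notation used elsewhere in the paper.

\[
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d),
\qquad \sum_{w} P(w \mid z_k) = 1 \ \text{ for each } k .
\]

Each $P(\cdot \mid z_k)$ is one multinomial distribution over the vocabulary, that is, one topic in the sense used above.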
Because topics are distributions over words, they
are by themselves less suitable for summarizing document collections.
However, being a word distribution, a topic allows words to be ranked
according to their probability. The top-ranked
words are those most strongly associated with that topic,
and inspecting them allows one to deduce the
topic’s meaning. Consequently, we define for each
learned topic a document prototype consisting of the
$N_{top}$ top-ranked words for that topic.
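The definition of a document prototype can be illustrated with a minimal sketch. It assumes that a learned topic is available as a mapping from words to probabilities. The function name `document_prototype`, the parameter `n_top`, and the toy topic are hypothetical and chosen only for this example.

```python
from typing import Dict, List


def document_prototype(topic: Dict[str, float], n_top: int) -> List[str]:
    """Return the n_top most probable words of a topic.

    `topic` maps each vocabulary word to its probability under the topic.
    Ties are broken alphabetically so that the result is deterministic
    (an assumption of this sketch, not something the text prescribes).
    """
    ranked = sorted(topic.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n_top]]


# Hypothetical toy topic; the probabilities are invented for illustration.
toy_topic = {
    "network": 0.30,
    "graph": 0.25,
    "node": 0.20,
    "protein": 0.15,
    "gene": 0.10,
}

print(document_prototype(toy_topic, n_top=3))  # ['network', 'graph', 'node']
```

A small `n_top` keeps the prototype quick to read, while a larger one conveys more of the topic at the cost of including less strongly associated words.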
Many collections change over time, because, for
example, new documents are added. As an exam-
ple of such a collection, consider the documents as-
sociated with a certain tag in a collaborative tagging
system. As users interact with such systems, they
contribute new documents over time and tag these
documents. To provide a summary of contents of
documents annotated with a tag, we determine document
prototypes over time. In contrast to summarizing
a static collection, we also have to derive
how these prototypes change over time to capture the
dynamic nature of these collections.
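One simple way to describe how a prototype changes is to compare the word lists of the same topic in successive periods. The sketch below assumes that the prototypes of one topic are already given as an ordered sequence of word lists, one per period. The function name `prototype_changes` and the toy data are hypothetical, and the comparison is only an illustration, not the visualization method of this paper.

```python
from typing import Dict, List, Sequence


def prototype_changes(prototypes: Sequence[Sequence[str]]) -> List[Dict[str, List[str]]]:
    """Compare each prototype with its predecessor.

    For every pair of successive periods, report the words that entered
    the prototype ("emerged") and the words that dropped out ("vanished").
    """
    changes = []
    for previous, current in zip(prototypes, prototypes[1:]):
        changes.append({
            "emerged": [w for w in current if w not in previous],
            "vanished": [w for w in previous if w not in current],
        })
    return changes


# Hypothetical prototypes of one topic in three successive periods.
over_time = [
    ["network", "graph", "node"],
    ["network", "graph", "community"],
    ["network", "community", "social"],
]

for step, change in enumerate(prototype_changes(over_time), start=1):
    print(step, change)
```

For the toy data, "community" replaces "node" in the second period and "social" replaces "graph" in the third, which is the kind of terminological shift a summary over time needs to expose.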
We adapt the approach of summarizing a static
document collection by examining the collection as
it evolves over time. To this end, we define a stream
of documents and learn topics for successive parts of
the stream using an extension of probabilistic latent
semantic analysis (Hofmann, 2001) described in Section 4.
From the topics over time we derive document
prototypes over time, which are visualized to study
how the contents of documents change through
time. First, however, we describe how collaborative
tagging systems manage annotations of
documents with tags. We then explain how we construct
a stream of documents under a tag to study how
its content changes over time.
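To make the idea of "successive parts of the stream" concrete, the following sketch partitions timestamped document IDs into consecutive periods of equal length. This is an assumption made purely for illustration: the representation of the stream as `(timestamp, doc_id)` pairs, the fixed period length, and the function name `successive_parts` are all hypothetical and do not reproduce the stream construction used in this paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def successive_parts(stream: Iterable[Tuple[int, str]], period: int) -> List[List[str]]:
    """Partition (timestamp, doc_id) pairs into consecutive periods.

    Period i holds all documents whose timestamp t satisfies
    t // period == i. Periods without documents are kept as empty lists
    so that the position in the result still reflects elapsed time.
    A document ID may occur several times, in one or in several periods.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, doc_id in sorted(stream):
        buckets[timestamp // period].append(doc_id)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


# Hypothetical stream: document "d1" is associated with the tag twice.
toy_stream = [(0, "d1"), (3, "d2"), (12, "d1"), (25, "d3")]

print(successive_parts(toy_stream, period=10))
# [['d1', 'd2'], ['d1'], ['d3']]
```

Topics, and from them document prototypes, would then be learned for each part in turn.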
3.2 Tagging Events in Collaborative Tagging Systems
Collaborative tagging systems for academic articles
manage bibliographic entries that are contributed by
users. Bibliographic entries contain author informa-