how to organize the article contents and how to provide links from Wikipedia articles to external (outside Wikipedia) resources. The goal of establishing structural and contextual editing guidelines is to help Wikipedia contributors supply content that is reliable, neutral, verifiable via cited sources, complete, and useful for the covered subjects. Above all, it is critical that articles communicate encyclopaedic knowledge that distinguishes Wikipedia, as an encyclopaedia, from other information sources.
However, Wikipedia's open nature and lack of coordinated editing make it susceptible to transforming the articles it hosts from information sources into information loops. This is mainly because editors might rely on common resources when editing articles on similar topics, or they might link articles to external resources that have borrowed content from Wikipedia. In the former situation, different articles might contain identical textual fragments which, if reproduced across several articles, would result in severe content duplication; this in turn would force users to read the same text for different topics, trapping them in information loops. Considering the findings of Buriol et al. (2006) that topically-relevant Wikipedia articles are densely linked, the above scenario is likely to occur as Wikipedia content and links proliferate. In the latter situation, an article might reproduce verbatim the body of its linked external sources and vice versa. In this case, readers visiting the referenced material to acquire supplemental information on the article topics would end up re-reading the same text (or pieces of it). Evidently, both situations hinder users from obtaining unique information in the contents of different articles and harm the overall quality of Wikipedia.
Driven by the desire to capture information uniqueness across Wikipedia articles, we carried out the present study. The aim of our work is to explore the information loops in Wikipedia articles in order to deduce the amount of unique information in their contents. We believe that the findings of our study will offer useful insights into the articles' quality and will help Wikipedia administrators determine effective article revision and maintenance policies. In the following paragraphs, we discuss the details of our approach for quantifying information uniqueness in Wikipedia articles.
3.2 Information Uniqueness in Article Contents
To capture information uniqueness in the contents of Wikipedia articles, we rely on articles dealing with related topics and compute the degree to which different articles duplicate the same informational extracts in their body. Our hypothesis is that the degree of the articles' information uniqueness is inversely proportional to the degree of their content duplication, in the sense that the more content two articles share in common, the less unique information they communicate.
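One way to make this inverse relationship concrete is the following illustrative formalization (an assumption for exposition, not a formula stated in the paper), in which an article's uniqueness is taken as the complement of its strongest pairwise duplication score:

U(a_i) = 1 - \max_{a_j \neq a_i} \mathrm{Dup}(a_i, a_j), \qquad \mathrm{Dup}(a_i, a_j) \in [0, 1]

Here \mathrm{Dup}(a_i, a_j) stands for any normalized duplication measure between two topically-related articles, such as the Containment score introduced below.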
The method we employ for estimating information uniqueness in the Wikipedia collection operates upon topically-related articles. We focus on topically-related articles because information duplication is most pronounced for documents dealing with common or similar subjects (Davison, 2000). Therefore, trying to capture content duplication across the entire Wikipedia collection would significantly increase the overhead and the computational complexity of our measurements without adding much value to the delivered results.
To identify topically-related articles, we rely on the Wikipedia categories to which every article has been assigned by its editors, and we deem articles to be topically-related if they share at least one common category. By deducing the articles' topical relatedness from their assigned categories, we ensure both consistency and accuracy in their topical descriptions, since the latter are collectively supplied by humans, and we obviate the need to re-categorize the articles.
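As a rough sketch of this grouping step (the input format, i.e. a mapping from article titles to their editor-assigned categories, is an assumption for illustration and is not specified in the paper), the clustering can be implemented as follows:

from collections import defaultdict

def build_topical_clusters(article_categories):
    # article_categories: dict mapping an article title to the set of
    # categories assigned to it by its editors (hypothetical input format).
    clusters = defaultdict(set)
    for article, categories in article_categories.items():
        for category in categories:
            clusters[category].add(article)
    # Keep only categories grouping two or more articles, since duplication
    # is measured between pairs of topically-related articles.
    return {cat: arts for cat, arts in clusters.items() if len(arts) > 1}

Because an article may carry several categories, it can legitimately appear in more than one topical cluster.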
Based on their assigned categories, we organize the articles into topical clusters and process the documents in every cluster in order to identify duplicate content in their body. Having organized the Wikipedia articles into topical clusters, we download the contents of every cluster, parse them to remove mark-up, and apply tokenization to identify word tokens in the articles' textual body. Based on the tokenized articles in every topic, we estimate the duplication of lexical and semantic elements across the articles, from which we subsequently derive the degree of the articles' information uniqueness.
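A minimal sketch of this preprocessing step might look as follows; the regular expressions are simplifications introduced here for illustration, whereas a production pipeline would use a dedicated wikitext parser, which the paper does not name:

import re

def strip_markup(wikitext):
    # Crude removal of common wiki and HTML mark-up, keeping plain text only.
    text = re.sub(r"\{\{[^{}]*\}\}", " ", wikitext)                # templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]] -> label
    text = re.sub(r"<[^>]+>", " ", text)                           # HTML tags
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quotes
    return text

def tokenize(text):
    # Lower-case the text and split it into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())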
To identify lexical content duplication across the articles' body, we estimate Containment within the articles' text. For our measurements, we first lexically analyze the articles' textual body into canonical sequences of tokens and represent them as contiguous sub-sequences of w tokens, called shingles, computed via the w-shingling technique (Broder et al., 1997). Then, we eliminate identical shingles from every article and compute the containment of an article's text in the body of the remaining articles clustered in the same topic. Containment between article pairs is determined as the ratio of an article's shingles that are contained in the shingles of another article, given by:
C(a_i, a_j) = \frac{|S(a_i) \cap S(a_j)|}{|S(a_i)|}

where S(a_i) denotes the shingles of article a_i and S(a_j) denotes the shingles of article a_j. Containment