In the course of our study, we built upon earlier
work (Kirtsis et al., 2010) in which we assessed the
extent to which Wikipedia external links add new
content to their corresponding articles' body, and we
make the following contributions:
We examine the statistical distribution of external
links in Wikipedia articles, identify popular domain
names among the articles' external resources, and
capture the correlation between the articles' length
and the fraction of their external links. This
examination helps assess the appropriateness of the
articles' external links since, according to Wikipedia
editing guidelines, external links should point to
information that is not yet part of the article.
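As an illustration of the domain-name analysis mentioned above, the following is a minimal sketch, not the procedure used in the study. It assumes the external links have already been extracted into a mapping from article title to a list of URLs; the function names `domain_of` and `popular_domains`, the stripping of a leading "www.", and the sample URLs are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased host name of a URL without a leading 'www.'.

    Returns an empty string when the URL has no recognisable host.
    """
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def popular_domains(links_per_article, top_n=3):
    """Count how often each domain occurs across all articles' external links.

    links_per_article: dict mapping an article title to a list of URLs
    (an assumed input format, chosen only for this sketch).
    """
    counts = Counter()
    for links in links_per_article.values():
        counts.update(domain_of(url) for url in links)
    counts.pop("", None)  # discard URLs whose host could not be parsed
    return counts.most_common(top_n)


# Hypothetical sample data, for illustration only.
sample = {
    "Article A": ["http://www.example.org/x", "https://example.org/y"],
    "Article B": ["http://data.example.net/z"],
}
print(popular_domains(sample, top_n=2))
```

Grouping by full host name is one possible choice; grouping by registered domain or by top-level domain would need a different `domain_of`.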
We quantify decay in Wikipedia external links in an
attempt to help editors and administrators determine
the amount of repair that the articles' linked sources
should undergo. To measure the decay of an article's
external pages, we use the proportion of dead links
to which the article points. A dead link points to a
page that either is no longer available or redirects
to a page that has nothing to do with the original
page. Measuring the decay of Wikipedia linked-to
resources indicates how appropriate the articles'
linked pages are for providing readers with reliable
and valid information about the subject of the article.
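The decay measure defined above (the proportion of an article's external links that are dead) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: each link check is assumed to be available as a pair of an HTTP status code (or `None` when no response was received) and a flag saying whether the link redirects to an unrelated page. Treating every status of 400 or above as "no longer available" is likewise an assumption, and the names `is_dead` and `decay` are hypothetical.

```python
def is_dead(status_code, redirected_off_topic=False):
    """Decide whether a link is dead.

    A link counts as dead if its page is unavailable (assumed here: no
    response at all, or an HTTP status of 400 or above) or if it redirects
    to a page unrelated to the original one. How the off-topic judgement
    is made is outside the scope of this sketch.
    """
    unavailable = status_code is None or status_code >= 400
    return unavailable or redirected_off_topic


def decay(link_checks):
    """Return the proportion of dead links among an article's external links.

    link_checks: list of (status_code, redirected_off_topic) pairs, one per
    external link. An article without external links gets a decay of 0.0.
    """
    if not link_checks:
        return 0.0
    dead = sum(1 for status, off_topic in link_checks if is_dead(status, off_topic))
    return dead / len(link_checks)


# Hypothetical checks for one article: one live link and three dead ones.
checks = [(200, False), (404, False), (200, True), (None, False)]
print(decay(checks))
```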
The remainder of the paper is organized as follows.
We start our discussion with an overview of related
work. In Section 3, we describe how external links
can serve as implicit indicators of Wikipedia's
quality. Specifically, we discuss how to capture and
interpret the distribution of external links in
Wikipedia articles and we address the problem of
identifying external links' decay. In Section 4, we
present an experiment in which we assessed the
usefulness of external links for communicating valid
and complete information about the contents of
Wikipedia articles. The discussion of the experimental
results sheds light on the external article features
to be considered in future Wikipedia quality
assessment efforts. We conclude the paper in Section
5, where we outline our plans for future work.
2 RELATED WORK
Related work falls into two main categories, which we
discuss in turn. First, there is related work on
studying the nature and the quality of Wikipedia. In
this respect, researchers have suggested a number of
article features that signify quality, e.g. the
articles' survival period (Cross, 2006), the number
and frequency of their edits (Wilkinson and Huberman,
2007), their revision history (Adler and de Alfaro,
2007), the number of outbound citations to scientific
publications (Nielsen, 2007), and the dedication and
expertise of their editors (Riehle, 2005). Our work
extends prior studies that infer the quality of
Wikipedia articles from an investigation of their
contents, in that we also examine the impact of
external links on the quality of their corresponding
articles. In this work, we significantly enrich our
earlier study (Kirtsis et al., 2010) by examining
additional features for assessing the articles' links
and by running large-scale evaluation experiments on
the distribution and properties of Wikipedia outlinks.
In a similar direction, Buriol et al. (2006) study
the evolution of the Wikipedia link graph over time
and observe an increasing link density in Wikipedia
as time goes by. This is also attested in the work of
Kamps and Koolen (2009), who contrast the link
structure of the Web and Wikipedia in an attempt to
assess the impact of global and local link topology on
web retrieval effectiveness. Although our work relates
to the above studies that examine the Wikipedia link
structure, it differs in that we investigate the
contribution of external resources in complementing
the article contents. Most importantly, we estimate
not only the distribution of external links but also
their decay. Although the encyclopaedic organization
of Wikipedia differs from the organization of Web
data in that it is steered by specific guidelines
about what and how to link,³ Wikipedia, like all large
websites, still suffers from the link rot phenomenon.
As of the November 2006 dump, nearly 10% of external
links were broken and, although repair and maintenance
actions have been taken ever since, the problem of
rotten external links persists.
Thus, our study also relates to existing works
that focus on the identification of dead links. The
majority of existing works concentrate on the decay of
web information sources and examine millions of
web pages to derive statistics about the fraction
of dead web links (Fetterly et al., 2003; Morishima
et al., 2008). Another signal of web decay, apart
from dead links, is soft-404 errors, which occur when
a URL leads to a page that returns an OK HTTP code
but whose content is entirely different from the one
requested (Bar-Yossef et al., 2004; Lee et al., 2009).
Moreover, researchers have observed that decayed pages
are characterized by outdated or abandoned content. A
number of studies try to capture abandoned pages by
estimating the age of their links and content (Jatowt
et al., 2007; Popitsch and Haslhofer, 2010). Our
³ http://en.wikipedia.org/wiki/Wikipedia:Linking
QUALITY ASSESSMENT OF WIKIPEDIA EXTERNAL LINKS