2 THE DATA SOURCES
Visits to Wikipedia are issued in the form of URLs
sent from users' browsers. These URLs are registered
by the Wikimedia Foundation Squid servers, which
write a log line for each request after serving the
corresponding content. Squid servers are web-caching
servers that the Wikimedia Foundation uses as the
first layer to manage the overall traffic directed to
all of its projects. Part of the information they
register is sent to universities and research centers
interested in studying it.
2.1 The Wikimedia Foundation Squid
Subsystem
As part of their job, Squid systems log information
about every request they serve, whether the
corresponding content is obtained from their caches
or is provided by the web servers placed behind them.
These log lines are sent in a UDP packet stream to
our facilities, where they are stored.
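The following minimal sketch illustrates how such a stream could be received and stored; the port number, the one-line-per-datagram framing, and the output file name are assumptions for illustration, not details of the actual Wikimedia setup.

import socket

# Minimal sketch of a receiver for the UDP log-line stream; the port
# number and one-line-per-datagram framing are assumptions.
LISTEN_PORT = 5000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", LISTEN_PORT))

with open("squid_sample.log", "a", encoding="utf-8") as out:
    while True:
        datagram, _addr = sock.recvfrom(65535)   # one log line per datagram
        out.write(datagram.decode("utf-8", "replace").rstrip("\n") + "\n")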
The Wikimedia Foundation Squid servers use a
customized format for their log lines. However,
mainly to preserve users' privacy, we do not receive
all of the registered information, but only a few
fields of the log format. The most important field
we receive is the URL of the submitted request. In
addition, the date of the request and a field
indicating whether it caused a write operation to
the database are also included.
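A parser for these three fields might look as follows; this is an illustrative sketch, since the tab-separated layout and the encoding of the write flag are assumptions, not the actual Wikimedia log format.

from datetime import datetime

# Illustrative parser for the three fields described above; the
# tab-separated layout and the flag encoding are assumptions, not
# the actual Wikimedia log format.
def parse_line(line):
    date_str, url, db_write = line.rstrip("\n").split("\t")
    return {
        "date": datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S"),
        "url": url,
        "db_write": db_write != "-",   # did the request write to the database?
    }

sample = "2009-04-01 12:30:05\thttp://en.wikipedia.org/wiki/Main_Page\t-"
print(parse_line(sample))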
2.2 Featured Articles
Featured articles are considered the best articles
in the whole Wikipedia. In order to be promoted to
this status, articles first have to be nominated and
included in a special page as featured article
candidates. Usually, prior to their nomination,
future candidates pass through a peer review process
in which reviewers make suggestions to improve their
quality.
Featured articles have to meet a set of criteria
beyond the requirements demanded of every Wikipedia
article. These criteria range from clear and
comprehensive writing to a proper structure and
organization. Other aspects, such as stability,
neutrality, length, and citation robustness, are
also considered.
As far as our research is concerned, we analyze the
impact of featured articles in two very different
ways. First, we consider the influence that promotion
to featured status has on an article's number of
visits. Second, we study the impact of presenting a
featured article as an example of high-quality
content on the main page of some Wikipedia editions.
In both cases, our main goal is to find a pattern
that can serve to model the traffic to an article
after it is featured.
3 METHODOLOGY OF THE
STUDY
The analysis presented here is based on a sample
of the Wikimedia Foundation Squid log lines
corresponding to two different periods, each
consisting of three months: March, April, and May in
one set, and September, October, and November in the
other. As we receive 1% of all the traffic directed
to the Wikimedia Foundation projects, this results
in more than 8,200 million log lines to process for
the considered months.
This analysis focuses only on the traffic directed
to the Wikipedia project and, to ensure that the
study involved mature and highly active language
editions, considers only the requests corresponding
to the six most visited editions.
Once the log lines from the Wikimedia Foundation
Squid systems have been received at our facilities,
they are ready to be analyzed by the tool developed
for this purpose: the WikiSquilter project. The
analysis consists of a parsing process that extracts
the relevant elements of information, followed by a
filtering process according to the study directives.
As a result of both processes, the data necessary to
conduct the characterization are obtained and stored
in a relational database for further analysis.
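The sketch below conveys the idea of this parse-filter-store step, taking a record as produced by the parser sketched earlier. It is hypothetical: WikiSquilter's actual implementation, URL pattern, and database schema are not reproduced here, and the set of top-six edition codes is likewise assumed.

import re
import sqlite3

# Hypothetical parse-filter-store step; pattern, schema, and the
# top-six edition codes are assumptions for illustration.
TOP_EDITIONS = {"en", "ja", "de", "es", "fr", "pt"}
URL_RE = re.compile(r"^https?://([a-z\-]+)\.wikipedia\.org/wiki/([^?#]+)")

def characterize(record, db):
    match = URL_RE.match(record["url"])
    if not match:
        return                         # not a Wikipedia article request
    edition, article = match.groups()
    if edition not in TOP_EDITIONS:
        return                         # filtered out by the study directives
    db.execute("INSERT INTO requests(edition, article, date) VALUES (?, ?, ?)",
               (edition, article, record["date"].isoformat()))

db = sqlite3.connect("wikisquilter.db")
db.execute("CREATE TABLE IF NOT EXISTS requests(edition TEXT, article TEXT, date TEXT)")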
By browsing the special pages that each Wikipedia
edition devotes to its featured content, we obtained
the featured articles promoted during April and
October 2009. Moreover, we extracted the featured
articles appearing on the main page during the same
months. Then, we queried the database resulting from
the processing of the Squid log lines to obtain the
number of visits to those articles during the
aforementioned months, as well as during the previous
and following ones, with the aim of finding out what
impact the two featuring mechanisms have on the
visits that articles receive. All the analysis shown
here was carried out with the GNU R statistical
package (R Development Core Team, 2009).
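A per-article, per-month visit count of this kind could be obtained as sketched below; the table and column names follow the hypothetical schema above, and the article title is a placeholder, not one of the articles actually studied.

import sqlite3

# Illustrative query for the per-month visit counts used in the
# analysis; schema and article name are hypothetical.
def monthly_visits(db, edition, article, months):
    counts = {}
    for month in months:               # e.g. "2009-03", "2009-04", "2009-05"
        row = db.execute(
            "SELECT COUNT(*) FROM requests "
            "WHERE edition = ? AND article = ? AND date LIKE ?",
            (edition, article, month + "%")).fetchone()
        counts[month] = row[0]
    return counts

db = sqlite3.connect("wikisquilter.db")
print(monthly_visits(db, "en", "Some_Featured_Article",
                     ["2009-03", "2009-04", "2009-05"]))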