ing a new connection from the browser to another domain, source/location information (the IP address) is transmitted and further protocol-specific information, such as the HTTP referrer, may be revealed. (Eckersley, 2010) shows how this kind of information can be used for passive web tracking. From today's point of view, disclosing the IP address must be classified as a transfer of personal data that can be used for tracking purposes. In this paper, tracking is defined as a connection to an external (third-party) host that is not part of the visited/requested website. It cannot be proven whether the third party actually uses this data for tracking purposes or not; however, it is clear that the data could be used to track.
The results are similar to those generated by the Firefox add-on Lightbeam (Lightbeam, 2015), which also provides a graphical overview of third-party connections. Because we need the ability to block connections not directed to archive.org, further development was necessary. In Section 3 we describe how a development framework (PyQt) was modified to obtain all external requests that occur during a web request. As soon as a website is fully parsed, all network requests are saved in a list and can be processed further.
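The distinction between connections to archive.org and connections leaving it can be expressed as a simple host check. The following sketch only illustrates that idea; the function name and host test are examples, not the implementation described in Section 3:

from urllib.parse import urlparse

def is_external(request_url):
    # True if the request leaves archive.org and is therefore a third-party candidate.
    host = urlparse(request_url).hostname or ""
    return not (host == "archive.org" or host.endswith(".archive.org"))

# The archived page itself is served by archive.org; an embedded tracker is not.
print(is_external("https://web.archive.org/web/2007/http://example.com/"))  # False
print(is_external("http://imagesrv.adition.com/js/srp.js"))                 # True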
2.3 Retrospective Analysis
Founded in 1996, archive.org became well known as an internet library that preserves the state of popular websites. If not disabled by the website owner, archive.org stores the current state of public websites several times a year (Day, 2006; Olston and Najork, 2010). Information about newly popular websites is contributed by the Alexa.org database. We use this information for a retrospective analysis of these websites with a focus on third-party connections.
We can now obtain all requests for a given archived website. We also need to restrict our analysis to a manageable set of websites and decided to use the 10,000 most popular websites according to the Alexa.org database (as of March 2015). Unfortunately, Alexa.org was not able to provide the most visited websites for the years before 2007, and other databases, such as Netcraft or archive.org, could not provide this information either. Therefore, our analysis is based on the 10,000 websites that are most popular today.
The archive.org JSON API (JSON API for archive.org services and metadata, https://archive.org/help/json.php) allows us to check how many snapshots are available and where they can be found. For each of the 10,000 websites and for each year between 2000 and 2015, we request a snapshot overview from archive.org. The result is a list of snapshots that can be processed further. This processing yields up to 16 lists of resources per website (one for each year) that the browser loads when visiting the archived site. Finally, we analyze which kinds of trackers were used historically and how tracker usage has changed over the years.
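To illustrate what such a snapshot overview looks like, the following sketch queries the publicly documented Wayback Machine CDX endpoint for one website and one year. It is a simplified stand-in for the queries we actually issue; the field handling is only an example:

import json
from urllib.request import urlopen

CDX = "http://web.archive.org/cdx/search/cdx"

def snapshots_for_year(site, year):
    # Ask the CDX index for all captures of `site` within the given year.
    query = "?url={0}&from={1}&to={1}&output=json".format(site, year)
    rows = json.loads(urlopen(CDX + query).read().decode("utf-8"))
    if not rows:
        return []
    header, entries = rows[0], rows[1:]   # the first row lists the field names
    return [dict(zip(header, row)) for row in entries]

for snap in snapshots_for_year("example.com", 2007):
    print(snap["timestamp"], snap["original"])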
As already stated in Section 2.2, an ideal analysis of web tracking cannot be performed because the active parts (web servers) sometimes no longer exist or no longer show the same behavior. Furthermore, redirections to content on other websites cannot be followed if that content is not preserved by archive.org. For example, an advertising spot sold by the website owner may have been filled with different content for each request, which would generate many more external requests when visited multiple times. For these reasons, the results of this analysis must be interpreted as a lower bound on tracking; the actual amount could be higher.
3 IMPLEMENTATION
For our analysis, it is necessary to identify external connections from an archived website. A static analysis, such as using regular expressions to find external resources in the HTML source code, has proven to be insufficient. One reason for this is code obfuscation like the following:
var src = (document.location.protocol === 'https:' ? 'https:/' : 'http:/')
    + '/imagesrv.adition.com/js/srp.js';
document.write('<scr' + 'ipt type="text/javascript" src="' + src
    + '" charset="utf-8"></scr' + 'ipt>');
In this code, the address of the tracker (adition.com) is obfuscated in a simple way, but well enough to defeat an automatic URL search in the source code.
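A small experiment makes this concrete: a naive scan for complete URLs never sees the tracker, because scheme and host only come together once the JavaScript string concatenation is executed. The regular expression below is our own illustrative choice, not the exact pattern of any particular tool:

import re

# The obfuscated snippet as it appears in the static HTML source.
html = """var src = (document.location.protocol === 'https:' ? 'https:/' : 'http:/')
    + '/imagesrv.adition.com/js/srp.js';
document.write('<scr' + 'ipt type="text/javascript" src="' + src
    + '" charset="utf-8"></scr' + 'ipt>');"""

# Static search for complete external script URLs in the source code.
pattern = re.compile(r"https?://[\w.-]+/\S*\.js")
print(pattern.findall(html))   # -> []  (scheme and host never appear as one literal string)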
Thus, a more dynamic analysis of websites is required. PyQt is a library that connects the Qt C++ cross-platform application framework with the interpreted language Python. Qt is a toolkit that includes a web browser widget which, according to its whitepaper (Riverbank, 2013), supports all modern web techniques (JavaScript, CSS, AJAX, etc.).
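As a minimal sketch of how such a dynamic analysis starts, the browser widget can simply be pointed at an archived page so that all embedded scripts actually execute; the URL is only an example, and how the resulting requests are intercepted is described below:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

app = QApplication(sys.argv)
view = QWebView()                     # full browser widget: runs JavaScript, loads CSS, images, etc.
view.loadFinished.connect(app.quit)   # leave the event loop once the page is fully parsed
view.load(QUrl("http://web.archive.org/web/2007/http://example.com/"))
app.exec_()                           # process events until loadFinished fires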
When this browser widget parses a website, there are various points where resources (images, scripts, etc.) are requested. We identified the PyQt4.QtNetwork.QNetworkAccessManager class as the place where all network-based requests come together. If a resource must be loaded, its createRequest method is called with the full address (URL) of the resource. We overrode this class so that: