similar to JavaScript code. Furthermore, we
tested opening the Java applet in an HTML file and
achieved the same result, because of the
aforementioned sandboxing limitations (Section
4.2). Yet, server-based web services where the
message contains only HTML code can still be
exploited towards this goal. In particular, once the
connection is opened, for instance to load an
embedded image, the server obtains information,
such as the IP address, and uses the time the
connection remains open as an indicator of whether
or not the message is read. Obviously, the latter
cannot prove whether the user has actually read the
message or has merely opened it without reading it.
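The connection-time indicator described above can be sketched as follows. This is a minimal illustration rather than the setup used in the study: it assumes the message embeds an image served by a plain Node.js HTTP server, and the names summarizeConnection, createTrackingServer and READ_THRESHOLD_MS, as well as the 5-second threshold, are hypothetical.

```javascript
const http = require('http');

// Hypothetical cut-off: a connection held open at least this long is
// treated as "probably read". The value is an assumption, not from the study.
const READ_THRESHOLD_MS = 5000;

// Turn the raw timestamps of one connection into a log record.
function summarizeConnection(ip, openedAt, closedAt) {
  const durationMs = closedAt - openedAt;
  return { ip, durationMs, probablyRead: durationMs >= READ_THRESHOLD_MS };
}

// Build (but do not start) an HTTP server for the embedded image URL.
// `onRecord` is a caller-supplied callback that receives each record.
function createTrackingServer(onRecord) {
  return http.createServer((req, res) => {
    const openedAt = Date.now();
    const ip = req.socket.remoteAddress;
    // Send the headers only and never call res.end(), so the response
    // stays open until the client drops the connection.
    res.writeHead(200, { 'Content-Type': 'image/gif', 'Cache-Control': 'no-store' });
    res.flushHeaders();
    res.on('close', () => {
      onRecord(summarizeConnection(ip, openedAt, Date.now()));
    });
  });
}
```

Calling createTrackingServer(console.log).listen(8080) would start the sketch on an arbitrary port. As noted above, a long-lived connection only suggests that the message stayed open, not that it was read.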
5.2 Web-based Approach
In contrast to the email-based approach, the
web-based approach is more promising for
information retrieval, since we are not restricted to
a limited set of services. One can freely host any
code on a server, the only restrictions being those
imposed by web browsers, especially sandbox-based
browsers. Since JavaScript has proved useful in
a web environment, we have implemented a
Node.js server. The server is written entirely in
JavaScript, as can be the pages it hosts. Once the user
opens the webpage, the script embedded in the
HTML code of the page extracts information about
the user and the device and sends it to the server
where it is saved for further use.
This approach extracts information such as the
type of operating system, the IP address, the web
browser and its version, the installed plugins and,
should the user agree to share their location, a
geographical map. The MAC address can be obtained
too, but it would require prompts for the user to
accept, which may reveal the server’s intentions.
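The consent-dependent location lookup mentioned above can be illustrated with the browser Geolocation API, which prompts the user before returning coordinates. The sketch below assumes that `geo` behaves like the browser's navigator.geolocation object; the function name requestLocation and the shape of the returned object are hypothetical.

```javascript
// `geo` is expected to behave like the browser's navigator.geolocation.
// `onResult` receives the coordinates, or null if the user declines,
// an error occurs, or the API is unavailable.
function requestLocation(geo, onResult) {
  if (!geo || typeof geo.getCurrentPosition !== 'function') {
    onResult(null);
    return;
  }
  geo.getCurrentPosition(
    // Success callback: the user accepted the browser's prompt.
    (pos) => onResult({
      latitude: pos.coords.latitude,
      longitude: pos.coords.longitude,
    }),
    // Error callback: permission denied, timeout, or position unavailable.
    () => onResult(null)
  );
}
```

In a page, requestLocation(navigator.geolocation, callback) would trigger the permission prompt. The coordinates could then be placed on a map, and a null result leaves the rest of the collected data unaffected.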
5.3 The Web Server
Following the web-based approach, we set up a
Node.js server with MongoDB as our database.
Because browser components vary between browsers,
the data we receive is unstructured, so storing it
in a MongoDB database seemed ideal. The core files
in the web server are app.js and getInfo.js.
Other components include the Node.js software
package and modules such as Express, a flexible
web application framework; Jade, a template format
for HTML files; Socket.io, which opens a web
socket between the user and the server; Forever,
which keeps the server running continuously; and
mongojs, which interfaces between the MongoDB
database and our Node.js server. The app.js script
contains the code that represents and runs the
hypertext transfer protocol (HTTP) server. Its main
function is to retrieve the data from the user who
accesses the website and to store the information in
the server’s database. A secondary function is to
compare the visiting user’s data with previous
records. If a match is found, we can identify the
user across multiple visits at different times.
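The store-and-match step performed by app.js can be sketched as follows. This is an illustration under stated assumptions, not the actual app.js: a JavaScript Map stands in for the MongoDB collection, the record fields (ip, os, browser, plugins) are hypothetical names, and exact matching on a SHA-256 digest of those fields is one possible matching rule among many.

```javascript
const crypto = require('crypto');

// Derive a stable key from the fields used for matching. Plugins are
// sorted so that their reported order does not change the key.
function fingerprint(record) {
  const stable = JSON.stringify([
    record.ip,
    record.os,
    record.browser,
    [...(record.plugins || [])].sort(),
  ]);
  return crypto.createHash('sha256').update(stable).digest('hex');
}

// Store the visit and report whether the same fingerprint was seen before.
// `store` is a Map standing in for the database (an assumption).
function recordVisit(store, record, now = Date.now()) {
  const key = fingerprint(record);
  const previousVisits = store.get(key) || [];
  store.set(key, [...previousVisits, now]);
  return { key, returning: previousVisits.length > 0, previousVisits };
}
```

A first call with a given record reports returning as false, and a later call with the same attributes reports true together with the earlier timestamps. Exact matching is brittle, since a single plugin update changes the key, so a real deployment would likely need a more tolerant comparison.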
The Node.js software provides a lightweight
alternative to other server models, well suited to
short-term experiments such as the one carried out
in this study. One drawback of this software is that
a secure shell session must stay open for the server
to keep running. To circumvent this, we can run the
server script as a daemon. We use Forever, a
command line interface that can run a script for an
indefinite amount of time. The script stops when we
issue the “stop” command, the connection is lost, or
the server crashes. Yet, Forever ensures that if the
script is terminated prematurely, it is executed
again.
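The restart-on-exit behaviour described above can be illustrated with a small supervisor built on Node.js's child_process module. This sketch is not Forever's implementation. It only mimics the idea of relaunching a script that exits before an explicit stop, and the names supervise, maxRestarts and spawnFn are hypothetical.

```javascript
const { spawn } = require('child_process');

// Launch `command` and relaunch it whenever it exits, until stop() is
// called or `maxRestarts` is reached. `spawnFn` can be replaced in tests.
function supervise(command, args, options = {}) {
  const { maxRestarts = Infinity, spawnFn = spawn } = options;
  let restarts = 0;
  let stopped = false;
  let child = null;

  function start() {
    child = spawnFn(command, args, { stdio: 'inherit' });
    child.on('exit', () => {
      // Relaunch only if the exit was not requested through stop().
      if (!stopped && restarts < maxRestarts) {
        restarts += 1;
        start();
      }
    });
  }

  start();
  return {
    stop() {
      stopped = true;
      if (child) child.kill();
    },
    get restarts() {
      return restarts;
    },
  };
}
```

For example, supervise('node', ['app.js']) would keep app.js running until stop() is called on the returned handle. Unlike a daemon, this supervisor itself still ends with the session that started it.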
The other core file, getInfo.js, runs on the
client side and retrieves as much information as
possible from the browser and the operating system
of the visiting user. The obtained information
includes the web browser used to access the webpage,
along with the language in use, the type of
operating system on the machine (e.g. Windows, Mac,
Linux), information about the plugins installed in
the web browser, and the IP address and port number
of the user’s connection.
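A client-side collection step of the kind attributed to getInfo.js might look like the sketch below. It is not the original script: it assumes the standard browser navigator properties userAgent, language, platform and plugins, and the function name collectClientInfo is hypothetical. The IP address and port are not read here, on the assumption that the server takes them from the incoming connection (in Node.js, req.socket.remoteAddress and req.socket.remotePort).

```javascript
// `nav` is expected to behave like the browser's `navigator` object.
function collectClientInfo(nav) {
  // navigator.plugins is array-like; keep only the names, in sorted order.
  const plugins = Array.from(nav.plugins || [], (p) => p.name).sort();
  return {
    userAgent: nav.userAgent || '', // browser name and version string
    language: nav.language || '',   // preferred language, e.g. "en-US"
    platform: nav.platform || '',   // operating system hint
    plugins,
  };
}
```

In a page, the result of collectClientInfo(navigator) would be serialized with JSON.stringify and sent to the server, for instance over the web socket mentioned earlier.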
The IP address can reliably identify a user who
is not behind a proxy. However, the IP address
alone may not identify an individual who is part of
a LAN. In that case, we can often tell users apart
by their operating system, browser and plugins, and
thereby single out a specific computer on a
network. The rationale for using plugins is that
users tend to visit various websites regularly and
most likely download applications that make
browsing easier and more efficient. The resulting
set is almost unique, meaning each user can have a
collection of diverse plugins that almost no other
user has. Hence, plugins act as quasi-identifiers.
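The narrowing effect of these quasi-identifiers can be illustrated by filtering stored records one attribute at a time and counting the remaining candidates. The sketch below is an illustration under assumptions: the record fields (ip, os, browser, plugins) and the function name narrowCandidates are hypothetical, and attributes are compared by exact equality.

```javascript
// Compare two plugin lists regardless of order.
function samePlugins(a, b) {
  return JSON.stringify([...a].sort()) === JSON.stringify([...b].sort());
}

// Filter `records` step by step and report how many candidates remain
// after each attribute is taken into account.
function narrowCandidates(records, target) {
  const steps = [
    ['ip', (r) => r.ip === target.ip],
    ['os', (r) => r.os === target.os],
    ['browser', (r) => r.browser === target.browser],
    ['plugins', (r) => samePlugins(r.plugins, target.plugins)],
  ];
  let candidates = records;
  const sizes = {};
  for (const [name, matches] of steps) {
    candidates = candidates.filter(matches);
    sizes[name] = candidates.length;
  }
  return { candidates, sizes };
}
```

When several machines on a LAN share one public IP address, the count after the ip step stays above one, and each further attribute can shrink it. A final count of one means the combination singles out one stored record.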
Furthermore, we can often infer behavioral
information about the user. In particular, we can
identify an individual if they access the page
repeatedly. This approach, along with other
information such as the IP address and browser
information, can uniquely identify most users with a
very high success rate. The implementation of the
server and client-side script also includes generating
DATA 2015 - 4th International Conference on Data Management Technologies and Applications