Semantic Blogging (Möller, 2005) uses desktop
data for the needs of blogging activities. With
Semantic Blogging a user can easily handle his own
desktop data (contacts, calendars etc.). Semantic
Blogging is the application most closely related to
the one presented here, but it again requires user
interaction in order to be used.
WikSAR (Aumueller, 2005) is a web application
that uses desktop data (such as addresses, calendars
etc.) in a wiki environment. Although this project
can use local data on the web, it does not process
the data in any way; it only presents it in a form
that is easier to browse.
Gnowsis Semantic Desktop (Sauermann, 2005)
translates desktop data into semantic data for major
operating systems. Although not web related, it
supplies an easy way to access desktop data based
on semantic analysis.
Automatic Bookmark Classification (Benz,
2006) has a lot in common with the project
presented here, as it also classifies a user’s
bookmarks. However, the classification result is not
used automatically; the user is prompted either to
accept the result or to enter the result he believes
suits best.
Personalized Search (Teevan, 2005) studies the
impact of web mining techniques on a search engine.
It uses logs from previous searches as well as
previously visited web sites to profile a user and
return more accurate results. Only web data is used;
local desktop data is not exploited.
3 THE APPLICATION
As a proof of concept for the aforementioned idea of
exploiting desktop data on the web, an application
has been developed in the Python programming
language. This application profiles a user by using
the bookmarks he has stored in his web browser. The
program has no user interface, and the results are
obtained without any user interaction.
This application belongs to the Web 3.0 family,
as it takes advantage of web services in order to
detect what the user is really interested in. The web
service used is a classification service
(URLclassifier, 2009). URLclassifier is a web
service that accepts a URL and returns, as a result,
the categories found in the corresponding web page.
After the application gathers the user’s bookmarks,
it uses this service to classify them into
categories. Those categories are in effect the user’s
fields of interest and can be used to profile the
individual whose bookmarks were processed.
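
A single call to such a classification service might be sketched as follows. The interface of URLclassifier is not described here, so the endpoint address, the `url` query parameter and the JSON response with a `categories` field are all assumptions made for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: the real address of the service is not given in the text.
SERVICE_ENDPOINT = "http://classifier.example/classify"


def build_request_url(page_url, endpoint=SERVICE_ENDPOINT):
    """Return the request URL that asks the service to classify page_url."""
    # Assumption: the service reads the page address from a "url" parameter.
    return endpoint + "?" + urlencode({"url": page_url})


def parse_categories(response_body):
    """Extract category names from the service response.

    Assumption: the response is JSON of the form {"categories": [...]}.
    """
    payload = json.loads(response_body)
    return [name.strip().lower() for name in payload.get("categories", [])]


def classify(page_url, endpoint=SERVICE_ENDPOINT, timeout=10):
    """Classify one bookmark by delegating all heavy work to the service."""
    with urlopen(build_request_url(page_url, endpoint), timeout=timeout) as resp:
        return parse_categories(resp.read().decode("utf-8"))
```

Only the request and the parsing of the answer run locally; the page itself is fetched and analysed on the server side.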
An on-line classification service is used so that
the local application does not consume a lot of
processing power. Thus, the application can be used
on mobile devices that do not have the processing
power needed for the execution of a classification
program. With the on-line service, the full text of
web pages is processed. Another way of creating a
light-weight classifier is to use only the URL of a
web page as input to classification (Kan, 2005;
Baykan, 2009).
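
To illustrate the URL-only alternative, the sketch below splits a URL into lower-case word tokens that a light-weight classifier could use as features. It is a generic tokenisation written for this illustration, not the exact feature extraction of the cited works, and the function name `url_tokens` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into lower-case alphanumeric tokens.

    Only the host, path and query are used; the scheme ("http") carries
    no topical information and is dropped.
    """
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]
```

For example, `url_tokens("http://www.example.com/sports/football-news.html")` yields tokens such as `sports`, `football` and `news`, which can hint at the topic of a page without downloading it.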
The application extracts the user’s bookmarks
and processes them with the help of the on-line
classifier. Each of the popular browsers has a
different way of storing its bookmarks. Safari stores
its bookmarks in a .plist file, while Firefox uses a
.html file and Opera a plain text file. For each of
the above browsers a different parser was developed,
enabling the program's first phase to gather
bookmarks from each of them. In a subsequent
phase, the application sends the gathered bookmarks
one by one as input to URLclassifier, in order to get
the necessary results.
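
The first phase can be sketched with one small parser per browser, using only the Python standard library. The parsers of the actual application are not shown in the text, so the file layouts below are assumptions: a Safari property list whose nodes carry `Children` and `URLString` keys, a Firefox HTML export with `<A HREF="...">` anchors, and an Opera text file with one `URL=` line per bookmark.

```python
import plistlib
from html.parser import HTMLParser


def safari_urls(plist_bytes):
    """Collect bookmark URLs from the contents of a Safari .plist file."""
    urls = []

    def walk(node):
        # Assumption: leaves hold "URLString", folders hold "Children".
        if "URLString" in node:
            urls.append(node["URLString"])
        for child in node.get("Children", []):
            walk(child)

    walk(plistlib.loads(plist_bytes))
    return urls


class _AnchorCollector(HTMLParser):
    """Remember the href of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def firefox_urls(html_text):
    """Collect bookmark URLs from the text of a Firefox .html file."""
    collector = _AnchorCollector()
    collector.feed(html_text)
    return collector.urls


def opera_urls(text):
    """Collect bookmark URLs from the text of an Opera bookmark file."""
    prefix = "URL="  # assumption: each bookmark entry has a "URL=" line
    return [line.strip()[len(prefix):]
            for line in text.splitlines()
            if line.strip().startswith(prefix)]
```

Each function returns a plain list of URLs, so the second phase does not need to know which browser a bookmark came from.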
After the classification, all the results are stored
in a list which can be consumed by a third-party
application or simply printed for debugging
purposes.
3.1 Application’s Accuracy
A number of computer science students were asked to
submit their bookmarks as input for the presented
application, in order to test its accuracy with
real-life samples. Some samples contained a large
number of bookmarks (more than 100), while others
contained only a few (fewer than 10). This was
advantageous to the experimentation because it made
it possible to examine whether the success
percentage of the application varies with the size
of the input.
After applying the classification process
described in the previous section, the outcome was
handed back to the users so that they could evaluate
the correctness of the classification results. From
their answers, a set of hit-miss counts was generated.
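
The hit-miss bookkeeping can be sketched as follows, assuming that each user answer is recorded as a boolean (True when the user judged a classification to be correct). The function names are hypothetical and no measured values are implied.

```python
def hit_miss_counts(judgements):
    """Return (hits, misses) for a sequence of boolean user judgements."""
    hits = sum(1 for correct in judgements if correct)
    return hits, len(judgements) - hits


def success_percentage(hits, misses):
    """Return the share of correct classifications as a percentage."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0
```

Computing the percentage per bookmark set makes it possible to compare small and large sets directly.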
Figure 2 shows the five results of the single set
with the largest number of bookmarks. This
particular set consists of 120 web sites and was the
largest of those tested. The left bars indicate the
sites that were categorized correctly, while the
right bars indicate the sites that were categorized
incorrectly. As can be seen for this set, the
classification is quite accurate. To support this