EXTRACTING PRECISE ACTIVITIES OF USERS FROM HTTP LOGS
Kiyotaka Takasuka¹, Kazutaka Maruyama², Minoru Terada³ and Yoshikatsu Tada⁴
¹ Graduate School of Information Systems, The University of Electro-Communications, 1–5–1 Chofugaoka, Chofu, Tokyo, Japan
² Information Technology Center, The University of Tokyo, 2–11–16 Yayoi, Bunkyo, Tokyo, Japan
³ Department of Information and Communication Engineering, The University of Electro-Communications, 1–5–1 Chofugaoka, Chofu, Tokyo, Japan
⁴ Graduate School of Information Systems, The University of Electro-Communications, 1–5–1 Chofugaoka, Chofu, Tokyo, Japan
Keywords:
HTTP request, Browsing history, User profile, Filtering.
Abstract:
Browsing histories are often used to build user profiles for browsing support and personalization. However,
the browsing history also contains HTTP requests generated concomitantly with user activity (concomitant
requests), which must be removed in order to build correct user profiles. Current filtering methods are based
on rather simple characteristics of requests, such as the file name extension or the reported content type.
We propose a more efficient filtering method based on other characteristics, such as the intervals between requests and
the referer relations among requests. In this paper, we analyze these characteristics in real web transactions and
evaluate their usefulness for filtering.
1 INTRODUCTION
As the number of Internet users grows, the amount
of information on the web has swelled explosively. This
phenomenon is well known as the information explosion.
Many researchers have been tackling the problem
from various angles, such as information retrieval,
recommendation and extraction. We focus on data
preprocessing for browsing support and
personalization. In such research, the browsing history
is often used to build the user profile. The
following can be used as sources of the browsing
history:
1. the browsing history itself,
2. the access log recorded by the web server,
3. the access log recorded by the proxy server, and
4. the access log recorded by the network sniffer.
Sources #2, #3, and #4 above collect many extra
HTTP requests in contrast to #1. The browser
records only the HTTP requests that are explicitly issued
by the user. We call these requests base requests. On
the other hand, the access log recorded by a web
server, proxy server or network sniffer includes extra
requests that are implicitly issued by the browser
after it loads the explicitly requested web page, such
as requests for images embedded in the page, icons, CSS files
and JavaScript files. We call these requests concomitant
requests. From the viewpoint of building the
user profile, the browser’s history is suitable because
it includes only the users’ explicit activities. If users
provided their own history in some way, such as through
a web browser extension, precise activities could be
collected.
The access log can also be used for this purpose
if the concomitant requests are eliminated. However,
this is difficult because the number of concomitant requests
is huge compared with the number of base requests.
The problem is especially difficult when the access log
recorded by a proxy server or a network sniffer is
used.
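As the abstract notes, current filtering methods rely on the file name extension and the reported content type. The following is a minimal sketch of such a filter, not the implementation used in this paper. The function name `looks_like_base_request`, the extension list and the content-type list are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical lists chosen for illustration; they are not taken from the paper.
CONCOMITANT_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".ico", ".css", ".js"}
BASE_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_base_request(url: str, content_type: str) -> bool:
    """Return True if a logged request passes the extension/content-type filter."""
    # Take the extension from the URL path only, ignoring any query string.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    if extension in CONCOMITANT_EXTENSIONS:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in BASE_CONTENT_TYPES
```

Under these assumptions, a request for `http://example.com/logo.png` is rejected, while a request such as `http://ads.example.com/serve?id=1` answered with `text/html` is accepted, because it has no file name extension and its content type matches that of an ordinary page.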
It is necessary to remove concomitant requests
when these access logs are used as the data source.
Traditional methods filter on the file name extension
in the URL and on the content type included
in the HTTP response. However, these methods are
not sufficient when requests for advertisements or XML
files are included in the access log. Many requests
for advertisements have no file name extension
in their URLs, and the content types of their responses are
the same as those of base requests. These remaining concomitant
requests affect the user profile. Therefore, the