
provide an efficient navigation to the visitor. When
the navigation matches a rule, the hypertext
organization of the document requested is
dynamically modified.
WhatNext (Z. Su, 2000) focuses on a path-based
prediction model inspired by the n-gram
prediction models commonly used in the speech
processing community. The algorithm builds an
n-gram prediction model based on occurrence
frequency. Each sub-string of length n is an n-gram.
The algorithm scans through all sub-strings exactly
once, recording the occurrence frequencies of the next
click immediately after the sub-string in all sessions.
The most frequently occurring request is used as the
prediction for the sub-string.
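The scan described above can be illustrated with a minimal sketch. It is not the WhatNext implementation; the function name `build_ngram_model`, the session representation (a list of page labels per session), and the sample data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def build_ngram_model(sessions, n):
    """Map each length-n click sub-string to its most frequent next click."""
    next_counts = defaultdict(Counter)
    for session in sessions:
        # Slide a window of length n over the session; the click that
        # immediately follows the window is the observed next click.
        for i in range(len(session) - n):
            prefix = tuple(session[i:i + n])
            next_counts[prefix][session[i + n]] += 1
    # Keep only the most frequent next click for every sub-string.
    return {prefix: counts.most_common(1)[0][0]
            for prefix, counts in next_counts.items()}

# Hypothetical sessions, one letter per requested page.
sessions = [list("ABCD"), list("ABCE"), list("XBCD")]
model = build_ngram_model(sessions, 2)
prediction = model.get(("B", "C"))  # "D" follows "BC" twice, "E" once
```

A sub-string that never occurs in the training sessions has no entry in the model, so a caller has to decide how to handle a missing prediction.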
In (R. R. Sarukkai, 2000), the authors proposed
to use Markov chains to dynamically model the
URL access patterns that are observed in navigation
logs based on the previous state. In (Xing Dongshan,
2002), a new Markov model is presented.
Another prediction system, proposed in (J.
Pitkow, 1999), is based on mining the
longest repeating subsequences to predict surfing. In
(Xin Chen, 2003), the authors use a popularity-based
prediction model for web pre-fetching.
2 PREPROCESSING
There are three main tasks in performing Web
Usage Mining or Web Usage Analysis:
Preprocessing, Pattern Discovery, and Pattern Analysis
(J.Srivasta, 2000). As Preprocessing is the first step
of our task and is very important, we discuss it in
this section.
Preprocessing consists of converting the usage,
content, and structure information into the data
abstractions necessary for pattern discovery.
Typically, only the portion of each user session that
is accessing a specific site can be used for analysis,
since the access information is not publicly available
from the vast majority of Web servers. There are
two ways (KhuResearch) to identify server sessions:
(1) IP address associated with a time range:
We assume that, within this time range, all
accesses from a specific machine (unique IP
address) belong to a specific user.
(2) Cookie associated with a time range:
After the server is reconfigured, each new
visitor is assigned a cookie.
When the visitor returns to the server, he/she can be
identified by the cookie.
A thirty-minute timeout is often used as the
default method of breaking a user’s click-stream into
sessions. Useless information in the Web logs
also needs to be cleared.
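As a rough illustration of the IP-based approach combined with the thirty-minute timeout, the sketch below groups log records by IP address and starts a new session whenever the gap between two consecutive requests exceeds the timeout. The record layout `(ip, timestamp_in_seconds, url)`, the function name `split_sessions`, and the choice to treat a gap strictly greater than the timeout as a session boundary are assumptions, not details taken from the text; log cleaning is not shown.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # thirty minutes, in seconds

def split_sessions(records, timeout=SESSION_TIMEOUT):
    """Split (ip, timestamp, url) records into per-user sessions of URLs."""
    by_ip = defaultdict(list)
    for ip, timestamp, url in records:
        by_ip[ip].append((timestamp, url))

    sessions = []
    for clicks in by_ip.values():
        # Order each user's clicks by ascending timestamp.
        clicks.sort(key=lambda click: click[0])
        current, last_ts = [], None
        for timestamp, url in clicks:
            # A gap longer than the timeout closes the current session.
            if last_ts is not None and timestamp - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = timestamp
        if current:
            sessions.append(current)
    return sessions

# Hypothetical records: (ip, seconds since start of log, requested URL).
records = [
    ("10.0.0.1", 0, "/a"),
    ("10.0.0.1", 600, "/b"),
    ("10.0.0.2", 100, "/x"),
    ("10.0.0.1", 2500, "/c"),  # 1900 s after "/b": exceeds the timeout
]
sessions = split_sessions(records)
```

With cookie-based identification, the same splitting logic would apply with the cookie value in place of the IP address as the grouping key.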
After preprocessing is applied to the original
Web log files, pieces of Web logs are obtained.
Each piece of Web log is a sequence of events from
one user or session in ascending timestamp order,
i.e. earlier events come before later
events. We use a single letter, or a single
letter with a number subscript, to denote one event,
and a sequence of events can be denoted as a
sequence of letters or letters with number subscripts.
For example, let E be a set of events. A Web log
piece or (Web) access sequence
$S = e_1 e_2 \cdots e_n$ ($e_i \in E$ for $1 \le i \le n$) is a sequence of events,
while n is called the length of the access sequence.
Because one can access the same web page more
than one time during a single access to the web site,
it is not necessary that
$e_i \neq e_j$ for $i \neq j$ in an
access sequence S.
3 CONSTRUCT THE WAP-TREE
Our algorithm is based on the compact web
access pattern tree (WAP-tree) structure (J.Pei, 2000),
so in this section we describe how to
construct a WAP-Tree, which was previously
presented by J.Pei et al. Because our purpose is different,
our WAP-Tree differs slightly: in order not to
lose information, we do not truncate the
WAP-Tree during construction.
The compactness of the WAP-Tree comes from
the fact that if two access sequences share a common
prefix P, the prefix P can be shared in the WAP-
Tree.
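To make the prefix sharing concrete, the following minimal sketch stores access sequences in a count-annotated prefix tree, so that sequences with a common prefix reuse the same nodes. The names `WapNode`, `insert_sequence`, `build_wap_tree`, and `count_nodes`, as well as the sample sequences, are hypothetical; the auxiliary node links of the full WAP-Tree are left out.

```python
class WapNode:
    """A tree node holding an event label, a count, and child nodes."""
    def __init__(self, label=None):
        self.label = label      # None marks the virtual root
        self.count = 0          # the root keeps count 0
        self.children = {}      # event label -> WapNode

def insert_sequence(root, sequence):
    """Insert one complete access sequence, sharing existing prefixes."""
    node = root
    for event in sequence:
        child = node.children.get(event)
        if child is None:
            # No child with this label yet: create it with count 0.
            child = WapNode(event)
            node.children[event] = child
        child.count += 1        # one more sequence passes through this prefix
        node = child

def build_wap_tree(sequences):
    root = WapNode()
    for sequence in sequences:
        insert_sequence(root, sequence)
    return root

def count_nodes(node):
    """Number of non-root nodes in the subtree below `node`."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Hypothetical access sequences, one letter per event.
tree = build_wap_tree(["abc", "abd", "ab"])
shared = tree.children["a"].children["b"]  # prefix "ab" is stored once
```

In this example the three sequences contain eight events in total but need only four nodes, because the prefix "ab" is shared by all of them.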
The WAP-Tree can be defined as follows (J.Pei,
2000). The only difference from (J.Pei, 2000) is that
we use the complete sequences.
1. Each node in a WAP-Tree registers two pieces
of information: label and count, denoted as label:
count. The root of the tree is a special virtual node
with an empty label and count 0. Every other node is
labeled by an event in the event set E, and is
associated with a count which registers the number
of occurrences of the corresponding prefix ending
with that event in the Web access sequence database.
2. The WAP-Tree is constructed as follows: each
access sequence in the database is inserted
into the WAP-Tree. The insertion of a sequence starts
from the root of the WAP-Tree. Considering the first
event, denoted as e, increment the count of the child
node with label e by 1 if one exists; otherwise
create a child labeled e and set its count to 1.
Then, recursively insert the rest of the sequence into
the subtree rooted at that child labeled e.
3. Auxiliary node linkage structures are
constructed to assist node traversal in a WAP-Tree
as follows. All the nodes in the tree with the same