2 RELATED WORKS
Various models have been proposed for modelling user navigation behaviour and predicting users' next requests. According to (Pierrakos, Paliouras, Papatheodorou and Spyropoulos, 2003), association rules, sequential pattern discovery, clustering, and classification are the most popular methods for web usage mining. Association rules (Agrawal, Mannila, Srikant, Toivonen and Verkamo, 1996) were originally proposed to capture the co-occurrence of items bought together in supermarket shopping; they identify groups of items that tend to appear together. Methods based on association rules can also be found in (Yang, Li and Wang, 2004).
The prediction scheme described in (Padmanabhan and Mogul, 1996) uses a dependency graph (DG) to model user navigation behaviour. The DG prediction algorithm constructs a dependency graph that describes the pattern of user page requests. Every page visited by a user is represented by a node in the graph. There is an arc from node A to node B if and only if, at some point in time, a client accessed B within w accesses after A, where w is the lookahead window size. The weight of the arc is the ratio of the number of accesses to B within a window after A to the number of accesses to A itself. A DG is effectively a first-order Markov model, and this method does not take the consecutiveness of requests into account.
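The DG construction above can be summarised in a few lines. The following is a minimal sketch, assuming requests arrive as a simple list of page identifiers; the function and variable names are illustrative, not taken from the original paper.

```python
from collections import defaultdict

def build_dependency_graph(requests, w=2):
    """Sketch of dependency-graph construction.

    An arc A -> B exists if B was accessed within w requests after A;
    its weight is count(B within window after A) / count(A).
    """
    arc_counts = defaultdict(int)   # (A, B) -> co-occurrence count
    node_counts = defaultdict(int)  # A -> total access count
    for i, page in enumerate(requests):
        node_counts[page] += 1
        # look back at the w requests preceding the current one
        for prev in requests[max(0, i - w):i]:
            if prev != page:
                arc_counts[(prev, page)] += 1
    # weight = accesses to B within window after A / accesses to A
    return {(a, b): c / node_counts[a] for (a, b), c in arc_counts.items()}

weights = build_dependency_graph(list("ABCAB"), w=2)
```

In this toy sequence, B follows A within the window after both occurrences of A, so the arc A → B has weight 1.0, while arcs seen only once out of two accesses get weight 0.5.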
Markov models contain precise information about users' navigation behaviour and are the most widely used technique in sequential pattern discovery for link prediction. Lower-order Markov models are not very accurate in predicting a user's browsing behaviour, since they do not look far enough into the past to discriminate correctly between different observed patterns. Higher-order Markov models give better predictive precision, but at a reduced hit rate. The All-kth-Order Markov model maintains Markov predictors of order i for all 1 ≤ i ≤ k. This model improves prediction coverage and accuracy, but its number of states grows exponentially as the order of the model increases. Improvements to the efficiency of PPM are examined in (Deshpande and Karypis, 2004), where three pruning criteria are proposed: (a) support pruning, (b) confidence pruning, and (c) error pruning. The main concern of (Deshpande and Karypis, 2004) is efficiency; the resulting model, called the selective Markov model, has low state complexity, but it is not online and cannot be updated incrementally.
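The All-kth-Order idea with support pruning can be sketched as follows. This is an illustrative toy implementation under our own simplifying assumptions (contexts are tuples of pages, prediction backs off from the longest matching context), not the authors' code.

```python
from collections import defaultdict

def train_all_kth_order(sessions, k=2, min_support=2):
    """Count next-page frequencies for all contexts of length 1..k,
    then support-prune contexts seen fewer than min_support times."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> next -> count
    for s in sessions:
        for i in range(1, len(s)):
            for order in range(1, k + 1):
                if i - order >= 0:
                    ctx = tuple(s[i - order:i])
                    counts[ctx][s[i]] += 1
    # support pruning: drop contexts whose total frequency is too low
    return {c: dict(n) for c, n in counts.items()
            if sum(n.values()) >= min_support}

def predict(model, history, k=2):
    """Back off from the longest matching context to shorter ones."""
    for order in range(k, 0, -1):
        if len(history) >= order:
            ctx = tuple(history[-order:])
            if ctx in model:
                return max(model[ctx], key=model[ctx].get)
    return None

sessions = [list("ABC"), list("ABD"), list("ABC")]
model = train_all_kth_order(sessions, k=2, min_support=2)
```

Keeping predictors of every order 1..k is what makes the state count blow up; the `min_support` filter is a crude stand-in for the support-pruning criterion described above.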
The Longest Repeating Subsequence PPM, LRS PPM (Pitkow and Pirolli, 1999), stores only the subset of paths that are frequently accessed and uses the longest repeated sequences to predict the next request. In this model, each stored path must occur more than some threshold T, where T typically equals one. In (Chen and Zhang, 2003), a popularity-based PPM is proposed, in which the tree is dynamically updated with a variable height in each set of branches: a popular URL can lead to a set of long branches, while a less popular URL leads to a set of short ones.
The study in (Ban, Gu and Jin, 2007) presents an online method for predicting the next user request, in which the entropy of a node is an important factor in prediction; however, the memory efficiency of the algorithm is not considered.
The techniques mentioned above work well for web sites that do not have a complex structure and do not generate web pages dynamically. The complex structure of a web site leads to a large number of states in a Markov model, and hence to large runtime requirements in terms of memory and computation power.
3 ONLINE PPM
PREDICTION MODEL
In this paper we apply our pruning methods to a prediction tree based on PPM. The PPM model has an upper bound on its context length, where the context is the sequence preceding the current symbol. A kth-order PPM model keeps contexts of length 0 to k. The predictor is represented by a tree, and the number on each edge records how many times the request sequence on the path from the root node to the end node of that edge occurs. The PPM prediction tree for the sequences ABCDE, ABCA, CACD, BCD and ADAB is displayed in Figure 1.
Figure 1: The PPM prediction tree for sequences ABCDE,
ABCA, CACD, BCD and ADAB.
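A tree of this kind is a trie with a counter per edge. The sketch below shows one common way to build it, inserting every context of length up to k+1 starting at each position of a session; the exact set of substrings inserted (and hence the counts) may differ from the construction behind Figure 1, and the class and function names are our own.

```python
class Node:
    def __init__(self):
        self.children = {}  # symbol -> child Node
        self.count = 0      # times the edge into this node is traversed

def insert(root, seq, k=3):
    """Insert all contexts of length up to k+1 from seq into the trie,
    incrementing the count on every edge along the way."""
    for start in range(len(seq)):
        node = root
        for sym in seq[start:start + k + 1]:
            node = node.children.setdefault(sym, Node())
            node.count += 1

root = Node()
for s in ["ABCDE", "ABCA", "CACD", "BCD", "ADAB"]:
    insert(root, s, k=3)
```

Because every position of every session starts a new path from the root, the count on a first-level edge equals the total number of occurrences of that symbol, and deeper edges count occurrences of the corresponding subsequence.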
We create the PPM prediction tree online. An online prediction method need not rely on time-