diminish their specificity. For each visited URL, we
computed the length of the generalized URL using the
following expression:
max{MinNSegment, (1 − α) · NSegments}    (1)
where NSegments is the number of segments separated
by '/' in the URL, MinNSegment is the minimum number
of segments, counted from the root, that a URL can
have after the generalization step, and α is the
fraction of the URL that is removed in the generalized
version. This generalization process allows us to work
with a more general structure of the site and avoids
the confusion that overly specific zones could
generate. For the NASA database we set MinNSegment
to 3 and evaluated the system with values of α ranging
from 0 to 0.75. The experiments showed that the
results saturated for values of α larger than 0.5;
consequently, we show results for the following values
of α: 0 (not generalized), 0.25 and 0.5. In addition,
the stages of the system in which generalization is
applied can also be varied. Thus, we evaluate the
effect of the value of the α parameter as well as the
effect of using generalization or not in the different
stages.
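As an illustration of expression (1), the following is a minimal Python sketch of the generalization step. The function name, the example path, the decision to ignore empty pieces produced by leading or trailing slashes, and the rounding down of a fractional length are all assumptions made for the example; the text does not specify them.

```python
import math


def generalize_url(url, alpha, min_n_segment=3):
    """Keep the first max(min_n_segment, (1 - alpha) * n_segments) segments."""
    # Assumption: empty pieces (leading or trailing '/') are not segments.
    segments = [s for s in url.split("/") if s]
    n_segments = len(segments)
    # Assumption: a fractional length is rounded down before the maximum.
    keep = max(min_n_segment, math.floor((1 - alpha) * n_segments))
    # Slicing past the end simply keeps every segment of a short URL.
    return "/" + "/".join(segments[:keep])


# Hypothetical path with 8 segments.
url = "/shuttle/missions/sts-71/images/launch/hi-res/pad/photo.gif"
print(generalize_url(url, 0.0))   # unchanged: all 8 segments are kept
print(generalize_url(url, 0.25))  # /shuttle/missions/sts-71/images/launch/hi-res
print(generalize_url(url, 0.5))   # /shuttle/missions/sts-71/images
print(generalize_url(url, 0.75))  # /shuttle/missions/sts-71
```

With α = 0.75 the term (1 − α) · NSegments drops to 2, so MinNSegment = 3 takes over as the lower bound.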
2.2 Pattern Discovery and Analysis
Unsupervised machine learning techniques have been
shown to be adequate for discovering user profiles
(Pierrakos et al., 2003) in the pattern discovery and
analysis stage. We used the PAM (Partitioning Around
Medoids) clustering algorithm (Kaufman and Rousseeuw,
1990) with a Sequence Alignment Method, Edit Distance
(Gusfield, 1997; Chordia and Adhiya, 2011), as the
metric to compare sequences and to group into the
same segment users that show similar navigation
patterns. PAM requires the K parameter to be estimated.
This parameter is related to the specificity of the
generated profiles: the greater its value, the more
specific the profiles will be. We did not have prior
knowledge of the structure of the data in the NASA
database, so we performed an analysis to find a value
of K that is sufficient to group the sessions with
common characteristics but does not force sessions
with dissimilar navigation patterns into the same
cluster. The outcome of the clustering process is a
set of groups of user sessions that show similar
behavior, but our aim is to generate profiles, that
is, to find the common click sequences appearing among
the sessions in a cluster.
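The clustering step can be sketched as follows, under stated assumptions. The edit distance uses unit costs for insertion, deletion and substitution of whole URLs, which the text does not specify. The medoid search is a simplified greedy swap started from the first K sessions, not the full BUILD and SWAP procedure of Kaufman and Rousseeuw. All function names and the toy sessions are hypothetical.

```python
from itertools import product


def edit_distance(a, b):
    """Unit-cost edit distance between two click sequences (lists of URLs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x by y
        prev = curr
    return prev[-1]


def total_cost(dist, medoids):
    """Sum of distances from every session to its closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids(dist, k):
    """Greedy swap search over a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # assumption: naive initialization
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(dist, trial)
            if cost < best:  # accept any swap that lowers the total cost
                medoids, best, improved = trial, cost, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels


# Toy sessions (hypothetical URLs).
sessions = [["/a", "/b", "/c"], ["/a", "/b", "/d"],
            ["/x", "/y"], ["/x", "/y", "/z"]]
dist = [[edit_distance(s, t) for t in sessions] for s in sessions]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)  # each label is the index of the session's medoid
```

Because only the distance matrix is needed, any other sequence metric could be substituted for the edit distance without changing the medoid search.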
To generate profiles, that is, to discover the
navigation patterns associated with each of the
discovered groups, we evaluated two strategies:
popularity and frequent pattern mining. The
popularity-based strategy selects the X most popular
URLs in each cluster as its profile. The number of
URLs to propose to the user, X, has to be decided, and
the system does not provide any evidence to guide this
decision. The frequent pattern mining algorithm we
used to build profiles is SPADE (Sequential PAttern
Discovery using Equivalence classes) (Zaki, 2001),
which provides for each cluster a set of URLs that are
likely to be visited in the sessions belonging to it.
The number of proposed URLs depends on parameters of
the SPADE algorithm such as the minimum support and
the maximum allowed number of sequences per cluster.
A fixed value of 0.5 for the minimum support proved to
be a good option. With this value the designed system
becomes self-regulating: it finds an adequate number
of URLs to propose and achieves a balance between
precision and recall.
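The difference between the two profiling strategies can be illustrated with a short sketch. The second function is not SPADE: it only counts single-URL support per session, which corresponds to the frequent sequences of length one, and is meant to show how a minimum support of 0.5 fixes the profile size without a user-chosen X. Counting every visit (including repeats within a session) as popularity is also an assumption, as are the function names and the toy data.

```python
from collections import Counter


def popularity_profile(sessions, x):
    """The x most visited URLs in a cluster (assumption: every visit counts)."""
    counts = Counter(url for session in sessions for url in session)
    return [url for url, _ in counts.most_common(x)]


def support_profile(sessions, min_support=0.5):
    """URLs that occur in at least min_support of the cluster's sessions."""
    support = Counter(url for session in sessions for url in set(session))
    threshold = min_support * len(sessions)
    return sorted(url for url, count in support.items() if count >= threshold)


# Toy cluster (hypothetical URLs); "/a" is reloaded often in a single session.
cluster = [["/a", "/a", "/a", "/a", "/b"], ["/b", "/c"], ["/c", "/d"], ["/c"]]
print(popularity_profile(cluster, 2))  # ['/a', '/c']
print(support_profile(cluster))        # ['/b', '/c']
```

In the toy cluster the popularity profile depends on the chosen X, whereas the support threshold alone determines how many URLs are returned.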
Although for the rest of the stages we experimented
with both generalized and non-generalized URLs, we
applied the SPADE algorithm to the original URLs
appearing in the user click sequences because,
otherwise, the system would require an extra stage.
2.3 Exploitation
In the exploitation stage, the only part that has to
be done in real time, we propose the use of k-Nearest
Neighbor (Dasarathy, 1991) to calculate the distance
from the click sequence of a new user to the clusters
generated in the previous phase (average linkage
distance based on Edit Distance (Gusfield, 1997)). The
distance can be calculated at any stage of the
navigation process, that is, from the first click of
the new user to more advanced navigation points. As a
consequence, the system proposes to the new user the
profile corresponding to the nearest cluster, that is,
the set of links that models the users in that cluster.
Those URLs are not generalized because, otherwise, the
system would be proposing zones of the web site and
would therefore require an extra stage in order to be
useful for the final user.
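A minimal sketch of this assignment step is given below. It assumes that the average linkage distance between a new click sequence and a cluster is the mean unit-cost edit distance to all sessions of that cluster; the function names, cluster identifiers and toy sessions are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two click sequences (lists of URLs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def average_linkage(clicks, cluster_sessions):
    """Mean edit distance from the new sequence to every session in a cluster."""
    total = sum(edit_distance(clicks, s) for s in cluster_sessions)
    return total / len(cluster_sessions)


def nearest_cluster(clicks, clusters):
    """clusters maps a cluster id to the list of sessions assigned to it."""
    return min(clusters, key=lambda c: average_linkage(clicks, clusters[c]))


# Toy clusters (hypothetical identifiers and URLs).
clusters = {
    "c1": [["/a", "/b", "/c"], ["/a", "/b", "/d"]],
    "c2": [["/x", "/y"], ["/x", "/y", "/z"]],
}
print(nearest_cluster(["/a"], clusters))        # c1, after a single click
print(nearest_cluster(["/x", "/y"], clusters))  # c2
```

The same call works after the first click or later in the session, since the edit distance accepts sequences of different lengths.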
At this point a question arises: will the behavior of
new users be identical to one of the generated
profiles, or will it share similarities with more than
one profile? That is, will diversification help when
generating link proposals? To answer this question we
analyzed two options: a 1-NN based approach, where
only the profile of the cluster nearest to the user is
used to make proposals, and a 2-NN based approach,
which combines the profiles of the user's two nearest
clusters.
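The two options can be compared with a small sketch. It assumes that the distances from the new user to each cluster are already available and that "combining" two profiles means taking their union, nearest cluster first, without duplicates; the text does not state how the profiles are merged. Names and data are hypothetical.

```python
def propose_links(cluster_distances, profiles, k=1):
    """Merge the profiles of the k clusters nearest to the new user."""
    nearest = sorted(cluster_distances, key=cluster_distances.get)[:k]
    proposal = []
    for cluster in nearest:
        for url in profiles[cluster]:
            if url not in proposal:  # assumption: duplicates are proposed once
                proposal.append(url)
    return proposal


# Hypothetical distances from a new user to three clusters, and their profiles.
distances = {"c1": 2.0, "c2": 2.5, "c3": 4.0}
profiles = {"c1": ["/a", "/b"], "c2": ["/b", "/c"], "c3": ["/z"]}
print(propose_links(distances, profiles, k=1))  # ['/a', '/b']
print(propose_links(distances, profiles, k=2))  # ['/a', '/b', '/c']
```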