Enhancing a Web Usage Mining based Tourism Website Adaptation with

Content Information

Olatz Arbelaitz, Ibai Gurrutxaga, Aizea Lojo, Javier Muguerza, Jesús M. Pérez and Iñigo Perona

Dept. of Computer Architecture and Technology, University of Basque Country UPV-EHU,

M. Lardizabal, 1, 20018 Donostia, Spain

Keywords:

Adaptive Web, Link Prediction, User Proﬁle, Collaborative Filtering, Machine Learning, Web Usage Mining,

Web Content Mining, Semantics.

Abstract:

Websites are important tools for tourism destinations. The adaptation of the websites to the users’ preferences

and requirements will turn the websites into more effective tools. Using machine learning techniques to build

user proﬁles allows us to take into account their real preferences. This paper presents the ﬁrst approach of

a system that, based on a collaborative ﬁltering approach, adapts a tourism website to improve the browsing

experience of the users: it automatically generates interesting links for new users. In this work we first build a

system based just on the usage information stored in web log ﬁles (common log format) and then combine it

with the web content information to improve the performance of the system. The use of content information

not only improves the results but it also offers very useful information about the users’ interests to travel

agents.

1 INTRODUCTION

Intelligent systems in the tourism sector have been studied recently (Gretzel, 2011). Intelligent systems

are next generation information systems that might

provide tourism consumers and service providers with

the most relevant information, more decision sup-

port, greater mobility, and ﬁnally, the most enjoyable

travel experiences. There is currently a wide range of

technologies related to them such as recommender

systems, context-aware systems, autonomous search

agents, web mining tools, etc. Creating these systems

requires a thorough knowledge of tourists’ psychol-

ogy, social structures where tourism is experienced,

the ratio of tourists that use technology, the structure

of the tourism industry, the language of tourism, etc.

Moreover, travel agents are among the service providers for whom Internet adoption could be the best marketing device for their business and a tool for competitive advantage (Abou-Shouk et al., 2012).

In this context the adaptation of tourism websites

to the user requirements becomes especially important. That is, web personalization becomes essential.

Web personalization (Pierrakos et al., 2003) can be

deﬁned as the set of actions that are useful to dynam-

ically adapt the presentation, the navigation scheme

and/or web content, based on preferences, abilities,

or user requirements. Web personalization in tourism

can positively affect both the feeling of the user and

the business.

This paper presents a preliminary approach to

adapt a tourism web page, www.bidasoaturismo.com,

according to the browsing preferences of the users.

The proposed adaptation is to automatically generate

links to the users while they are navigating so that

their objective is reached more easily. Furthermore,

the system will provide useful information about the

tourists to the service providers.

Adapting web environments to specific users at navigation time requires a previous phase of generating user profiles that contain the most important facts about the users, in this case their navigation preferences.

The most widely used method for obtaining informa-

tion about users is observing their actions (Schiafﬁno

and Amandi, 2009). In adaptive systems, the user pro-

ﬁle is used to behave differently for different users.

Our research is contextualized in the use of web

mining (Mobasher, 2006) to build user proﬁles and

then propose adaptations to the website based on the

obtained proﬁles and extract semantic information

from them. We could deﬁne web mining as the ap-

plication of machine learning techniques to data from

the Internet. This process requires a data acquisition

and pre-processing stage which is not easy because it

Arbelaitz O., Gurrutxaga I., Lojo A., Muguerza J., M. Pérez J. and Perona I..

Enhancing a Web Usage Mining based Tourism Website Adaptation with Content Information.

DOI: 10.5220/0004171002870292

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2012), pages 287-292

ISBN: 978-989-8565-29-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

requires several steps such as cleaning and merging data from multiple log files, user identification, session identification, path completion, etc. The machine learning techniques are mainly applied in the

pattern discovery and analysis phase to ﬁnd groups of

web users with common characteristics related to the

Internet and the corresponding patterns or user pro-

ﬁles. And ﬁnally, the patterns detected in the previ-

ous steps are used in the operational phase to adapt

the system and make navigation more comfortable for

new users. Our claim is that semantic information can

improve the quality of proﬁles based on web usage in-

formation.

We have built a system based on the collaborative

ﬁltering approach that takes the content of the web

and the minimum information stored in a web server

as input: server log ﬁles stored in web Common Log

Format (CLF) (W3C, 1995), i.e., the proﬁles are con-

structed without perturbing the user.

This paper builds on our previous experience in automatically generating links in a local tourism web page; it improves the behavior of that system and gives it added value by introducing semantic analysis of the content into the user profile. Although this is a preliminary work, we claim that the combination of usage and semantics can lead to more accurate and richer recommendations and, moreover, gives travel agents greater insight into the real interests of the tourists.

The article summarizes in Section 2 the data ac-

quisition environment. Section 3 is devoted to de-

scribing the characteristics of the system we have de-

veloped. Then, Section 4 presents some of the re-

sults obtained in the performed experiments. Finally,

we summarize in Section 5 the conclusions and future

work.

2 DATA

In this work we have used a database from our

environment: Bidasoa-Txingudi bay which is lo-

cated at the western tip of the Pyrenees mountains

and, straddling two countries, France and Spain,

links the Basque provinces of Gipuzkoa and La-

purdi. The Bidasoa River has had the effect of

linking socially and culturally the three towns sur-

rounding the bay: Hendaye, Hondarribia and Irun.

The area offers the opportunity for a wide range of tourism activities, and the Bidasoa Turismo website (BTw), www.bidasoaturismo.com, includes all sorts of practical tourist information for visiting the area: thematic tourism, professional tourism, gourmet tourism, agenda, suggestions, etc. Our work is intended to make the navigation of the user more pleasant and effective and, as a consequence, to contribute to a more enjoyable travel experience for the tourist. Moreover, the generated semantic profiles will be a good marketing device for service providers. We acquired nearly four

months of usage data of BTw: from January 9, 2012

to April 30, 2012. The information contained in this

database belongs to web server logs of requests (a to-

tal of 897,301) stored in common log format (W3C,

1995). Furthermore, we also use the content infor-

mation of the website, i.e., the text appearing in the

website.

3 PROPOSED SYSTEM

The work presented in this paper enhances the performance of a web usage mining (Srivastava et al., 2005) application by including a semantic analysis of the content information. Like every web usage mining process, it can be divided into three main steps: data acquisition and preprocessing (Cooley et al., 1999), pattern discovery and analysis, and exploitation.

3.1 Data Acquisition and Preprocessing

We acquired two types of data. On one hand, we ac-

quired usage information, and, on the other hand, we

acquired the content information for BTw.

For the first type, nearly 4 months of usage data were collected from BTw. First of all, we normalized the URLs so that every access to the same URL was identified in the same way, and we reduced all agenda accesses to a single agenda page. We then preprocessed the log files to obtain information about different users and sessions. Before identifying user sessions, we filtered out erroneous requests, image requests, etc., so that the only requests taken into account in our experiments are the ones related to user clicks (the number of requests was reduced by nearly 50%, to 470,402 requests).
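
The filtering step described above can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in this work: the regular expression, the list of resource extensions and the rule of keeping only successful GET requests are assumptions made for the example.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)

# Hypothetical list of extensions treated as embedded resources, not clicks.
RESOURCE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico")


def parse_clf_line(line):
    """Return a dict with the CLF fields, or None if the line is malformed."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def is_user_click(entry):
    """Keep successful GET requests that do not point at embedded resources."""
    if entry is None:
        return False
    # Assumption: only 2xx GET requests count as valid user clicks.
    if entry["method"] != "GET" or not entry["status"].startswith("2"):
        return False
    path = entry["url"].split("?", 1)[0].lower()
    return not path.endswith(RESOURCE_EXTENSIONS)


sample_log = [
    '10.0.0.1 - - [09/Jan/2012:10:00:00 +0100] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [09/Jan/2012:10:00:01 +0100] "GET /img/logo.png HTTP/1.1" 200 900',
    '10.0.0.2 - - [09/Jan/2012:10:05:00 +0100] "GET /missing.html HTTP/1.1" 404 0',
    'not a log line',
]
clicks = [e for e in map(parse_clf_line, sample_log) if is_user_click(e)]
print([c["url"] for c in clicks])
```

Of the four sample lines, only the request for the HTML page survives; the image request, the erroneous request and the malformed line are discarded.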

We performed the user identification process based on IP addresses and, as a heuristic to identify sessions within a user's activity, we fixed the expiry time of each session to 10 minutes of inactivity (He and Göker, 2000). Among the obtained sessions, we selected the most relevant ones, that is, the ones with a higher activity level (3 or more clicks); removed the outliers, i.e., the ones with more than 55 requests (beyond the 98th percentile); and grouped consecutive agenda requests into a single request, because we detected that many of them are automatically generated by the server. After the whole preprocessing phase the database contains 55,454 user requests divided into

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

288

9,549 sessions, with an average length of 5.8 requests,

where a total of 308 different URLs are visited.
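
A minimal sketch of the session identification heuristic is given below. The 10-minute timeout and the limits of 3 and 55 clicks come from the text; the agenda URL, the input tuple layout and the choice to collapse agenda requests before applying the length limits are assumptions of the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)  # inactivity threshold from the text
MIN_CLICKS = 3                           # minimum activity level from the text
MAX_CLICKS = 55                          # outlier cut-off from the text
AGENDA_URL = "/agenda"                   # hypothetical URL of the agenda page


def collapse_agenda(urls):
    """Merge runs of consecutive agenda requests into a single request."""
    collapsed = []
    for url in urls:
        if url == AGENDA_URL and collapsed and collapsed[-1] == AGENDA_URL:
            continue
        collapsed.append(url)
    return collapsed


def build_sessions(requests):
    """Group (ip, timestamp, url) tuples into sessions, one user per IP."""
    by_ip = defaultdict(list)
    for ip, timestamp, url in requests:
        by_ip[ip].append((timestamp, url))
    sessions = []
    for hits in by_ip.values():
        hits.sort(key=lambda hit: hit[0])
        current = [hits[0][1]]
        for (previous_time, _), (time, url) in zip(hits, hits[1:]):
            if time - previous_time > SESSION_TIMEOUT:
                sessions.append(current)  # inactivity gap closes the session
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def select_sessions(sessions):
    """Collapse agenda runs, then keep sessions within the activity limits."""
    cleaned = [collapse_agenda(s) for s in sessions]
    return [s for s in cleaned if MIN_CLICKS <= len(s) <= MAX_CLICKS]


start = datetime(2012, 1, 9, 10, 0)
requests = [
    ("10.0.0.1", start, "/home"),
    ("10.0.0.1", start + timedelta(minutes=2), "/agenda"),
    ("10.0.0.1", start + timedelta(minutes=3), "/agenda"),
    ("10.0.0.1", start + timedelta(minutes=5), "/hotels"),
    ("10.0.0.1", start + timedelta(minutes=40), "/home"),
    ("10.0.0.2", start, "/home"),
]
all_sessions = build_sessions(requests)
selected = select_sessions(all_sessions)
print(selected)
```

Identifying users by IP address is a coarse heuristic: several users behind one proxy are merged into one, which is one reason for removing very long sessions as outliers.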

To acquire content data we used the GNU Wget

(GNU, 1996) computer program to retrieve content

from the BTw web server. We downloaded the HTML

ﬁles of the whole web site using recursive download-

ing. We then applied an HTML parser to obtain the

content of each page and ﬁltered the menus of the web

pages so that in further steps we work only with the

real content. In order to limit our work, we only per-

formed the analysis of the static part of the website

having a total of 231 URLs. Note that there are some

URLs, mainly private URLs related to web adminis-

tration not accessible for a normal user, appearing in

the usage information that will not have their equiva-

lent in the content part.

3.2 Session Representation

Since the aim of this work is to detect sets of users with similar navigation patterns, to use them to make the navigation of future users easier, and to obtain semantic profiles, we represented the information corresponding to each session as a clickstream, i.e., the sequence of clicks performed on the URLs of BTw.

3.3 Pattern Discovery and Analysis

This is the stage that, taking the user click sequences as input, is in charge of modeling users and producing user profiles. Unsupervised machine learning techniques have been shown to be adequate for discovering user profiles (Pierrakos et al., 2003). We have used a crisp

clustering algorithm to group users that show similar

navigation patterns.

3.3.1 Clustering

We used the PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw, 1990) clustering algorithm with a Sequence Alignment Method, Edit Distance (Gusfield, 1997; Chordia and Adhiya, 2011), as the metric for comparing sequences, so that users who show similar navigation patterns are grouped into the same segment. Although further analysis should be done, as a first approach, and based on the analysis of the distribution of the different URLs in the sessions, we set the maximum number of clusters, the K parameter of the PAM algorithm, to 50.
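
The combination of edit distance and medoid-based clustering can be illustrated with the sketch below. It is a simplified stand-in, not the implementation used in this work: the medoids are initialised naively with the first k sessions (PAM proper uses a BUILD phase), the swap loop accepts the first improving swap, and the toy sessions are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two click sequences (lists of URLs)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]


def total_cost(dist, medoids):
    """Sum of the distances from every item to its closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids(dist, k):
    """Greedy medoid swapping over a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # assumption: naive initialisation
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for position in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:position] + [candidate] + medoids[position + 1:]
                cost = total_cost(dist, trial)
                if cost < best:  # keep any swap that lowers the total cost
                    medoids, best, improved = trial, cost, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels


sessions = [
    ["home", "hotels", "booking"],
    ["home", "hotels", "hostels"],
    ["home", "trails", "mountain"],
    ["home", "trails", "cycling"],
]
dist = [[edit_distance(a, b) for b in sessions] for a in sessions]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Because medoids are actual sessions, only pairwise distances are needed, which is what makes this family of algorithms usable with a sequence metric such as edit distance.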

3.3.2 Proﬁle Generation

The outcome of the clustering process is a set of groups of user sessions that show similar behavior. However, our goal is to model those users, that is, to discover the navigation pattern or profile associated with each of the discovered groups by finding the common click sequences among the sessions in a cluster. We used SPADE (Sequential PAttern Discovery using Equivalence classes) (Zaki, 2001), an efficient algorithm for mining frequent sequences, to extract the most common click sequences of each cluster. In order to build the profile of each cluster using SPADE, we mapped each user session to a SPADE sequence whose events each contain a single user click. The application of SPADE provides, for each cluster, a set of URLs that are likely to be visited in the sessions belonging to it. SPADE parameters such as the minimum support and the maximum allowed number of sequences per cluster regulate the system so that it finds an adequate number of URLs to propose and achieves a balance between precision and recall. We used the SPADE parameters to propose a similar number of URLs per cluster in the two approaches: the one without semantic information and the one using semantic information.
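
To make the role of the minimum support concrete, the sketch below counts frequent click subsequences by brute force. It is not SPADE, which relies on vertical id-lists and equivalence classes to avoid this enumeration; the function names, the length limit and the toy cluster are assumptions of the example.

```python
from collections import Counter
from itertools import combinations


def frequent_sequences(sessions, min_support=0.5, max_length=2):
    """Return {subsequence: support} for ordered subsequences of the sessions.

    The support of a subsequence is the fraction of sessions containing it,
    with gaps allowed between its clicks.
    """
    counts = Counter()
    for session in sessions:
        seen = set()
        for length in range(1, max_length + 1):
            # combinations() preserves the original click order.
            seen.update(combinations(session, length))
        counts.update(seen)  # each session counts at most once per pattern
    threshold = min_support * len(sessions)
    return {seq: n / len(sessions) for seq, n in counts.items() if n >= threshold}


def profile_urls(patterns):
    """Flatten the frequent patterns into the set of URLs to propose."""
    return sorted({url for seq in patterns for url in seq})


cluster = [
    ["home", "hotels", "booking"],
    ["home", "hotels", "hostels"],
    ["home", "agenda", "hotels"],
]
patterns = frequent_sequences(cluster, min_support=0.6)
print(patterns)
print(profile_urls(patterns))
```

Raising `min_support` shrinks the profile and favors precision, while lowering it adds less certain URLs and favors recall, which is the trade-off the parameters are used to balance.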

3.3.3 Enriching the Proﬁles with Semantics

The proﬁles have been generated up to this point using

only usage information but we propose to enrich the

proﬁles with semantic information to improve perfor-

mance. Obviously the navigation pattern of the users

depends on their interests, and, as a consequence,

URLs with similar or related content to the ones ap-

pearing in the user proﬁle will also be interesting for

the user. We have used two types of tools for finding similarity between URL contents: the MG4J (Boldi and Vigna, 2006) search engine and a keyword extraction based approach (KYWD). MG4J is a full-text indexer for large collections of documents, written in Java and developed at the University of Milan. We used this tool with the TF-IDF distance to obtain similarity values between every possible pair of URLs in the website. This gives us the chance to obtain, for each URL, a list of URLs ordered by semantic similarity.

On the other hand, we have used Yahoo Term Ex-

tractor tool (Yahoo!, 2011) to extract keywords and

compare semantic contents of different URLs. Key-

words and key phrases (multi-word units) are widely

used in large document collections. They describe the

content of single documents and provide a kind of se-

mantic meta data that is useful for a wide variety of

purposes. Once the keywords of each URL have been extracted, we have used the cosine similarity measure (Madylova and Öğüdücü, 2009) to compare URLs.

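$$\mathrm{similarity}(URL_j, URL_k) = \frac{\sum_{i=1}^{m} w_{i,j} \times w_{i,k}}{\sqrt{\sum_{i=1}^{m} (w_{i,j})^2}\;\sqrt{\sum_{i=1}^{m} (w_{i,k})^2}} \qquad (1)$$

where the sums run over all $m$ extracted keywords.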

EnhancingaWebUsageMiningbasedTourismWebsiteAdaptationwithContentInformation

289

To calculate the weight $w_{i,j}$ of keyword $i$ in each document ($URL_j$), we used:

$$w_{i,j} = \frac{tf}{tf_{max}} \times \log\frac{N}{n} \qquad (2)$$

where $tf$ is the frequency of keyword $i$ in $URL_j$; $tf_{max}$ is the maximum term frequency in $URL_j$; $N$ is the number of URLs and $n$ is the number of documents

containing keyword i. The URLs with larger similar-

ity value will be the semantically more similar ones.
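
Equations (1) and (2) can be computed as in the following sketch. The keyword lists are invented for the example, and the natural logarithm is assumed, since the base of the logarithm is not stated in the text.

```python
import math
from collections import Counter


def keyword_weights(documents):
    """Equation (2): weight = (tf / tf_max) * log(N / n) for each keyword.

    documents maps each URL to the list of keywords extracted from it.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for keywords in documents.values():
        doc_freq.update(set(keywords))  # number of URLs containing the keyword
    weights = {}
    for url, keywords in documents.items():
        if not keywords:
            weights[url] = {}
            continue
        tf = Counter(keywords)
        tf_max = max(tf.values())
        weights[url] = {
            k: (f / tf_max) * math.log(n_docs / doc_freq[k]) for k, f in tf.items()
        }
    return weights


def cosine_similarity(w_a, w_b):
    """Equation (1): cosine of the angle between two keyword weight vectors."""
    dot = sum(w * w_b.get(k, 0.0) for k, w in w_a.items())
    norm_a = math.sqrt(sum(w * w for w in w_a.values()))
    norm_b = math.sqrt(sum(w * w for w in w_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = {
    "/trails": ["mountain", "trails", "mountain"],
    "/paths": ["mountain", "paths"],
    "/hotels": ["hotel", "hostel"],
}
weights = keyword_weights(documents)
print(cosine_similarity(weights["/trails"], weights["/paths"]))
print(cosine_similarity(weights["/trails"], weights["/hotels"]))
```

Pages that share keywords obtain a positive similarity, pages with disjoint keyword sets obtain zero, and a keyword that appears in every URL receives weight zero because log(N/n) vanishes.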

The two previous approaches have been used in

the same way to enrich usage information based user

proﬁles: we added to the proﬁles generated using

SPADE two extra links, the most semantically similar

ones, for each proposed link. In the case those URLs

already appeared in the proﬁle we have not taken them

into account.
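
The enrichment rule can be sketched as follows. The similarity table is a hypothetical precomputed structure, and the code takes one reading of the rule: the two most similar URLs of each proposed link are considered, and those already in the profile are skipped rather than replaced by the next candidates.

```python
def enrich_profile(profile, similarity, extra_per_link=2):
    """Add to a usage-based profile the most similar URLs of each proposed link.

    profile is a list of URLs; similarity maps a URL to {other_url: score}.
    """
    enriched = list(profile)
    for url in profile:
        candidates = [
            (other, score)
            for other, score in similarity.get(url, {}).items()
            if other != url
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        for other, _score in candidates[:extra_per_link]:
            if other not in enriched:  # URLs already proposed are skipped
                enriched.append(other)
    return enriched


similarity = {
    "/trails": {"/paths": 0.9, "/cycling": 0.7, "/hotels": 0.1},
    "/hotels": {"/hostels": 0.8, "/trails": 0.6, "/paths": 0.2},
}
enriched = enrich_profile(["/trails", "/hotels"], similarity)
print(enriched)
```

In this toy case the second most similar URL of `/hotels` is `/trails`, which is already part of the profile, so only `/hostels` is added for that link.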

Furthermore, the KYWD approach gives us the option to extract semantic information from the obtained profiles. With this aim, we have analyzed which are the most important (frequent) keywords in each of the generated profiles.
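
A possible way of obtaining such a description is sketched below. The keyword lists are invented, and counting each keyword once per URL of the profile is an assumption about how frequency is measured.

```python
from collections import Counter


def profile_keywords(profile, keywords_by_url, top=2):
    """Return the keywords that occur in the most URLs of a profile."""
    counts = Counter()
    for url in profile:
        counts.update(set(keywords_by_url.get(url, [])))
    return [keyword for keyword, _count in counts.most_common(top)]


keywords_by_url = {
    "/trails": ["mountain paths", "cycle trails"],
    "/paths": ["mountain paths", "path cycle"],
    "/cycling": ["cycle trails", "mountain paths"],
}
description = profile_keywords(["/trails", "/paths", "/cycling"], keywords_by_url)
print(description)
```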

3.4 Exploitation

This is the part that needs to be done in real time. Up

to now, we have identiﬁed groups of users with sim-

ilar navigation patterns and we have generated user

proﬁles or most common paths for each of the groups.

At this point we need to use that information to au-

tomatically propose links to new users navigating in

the web. We propose the use of k-Nearest Neighbor

(Dasarathy, 1991) learning approach to calculate the

distance of the click sequence (average linkage dis-

tance based on Edit distance (Gusﬁeld, 1997)) of the

new users to the clusters generated in the previous

phase.

Our hypothesis is that the navigation pattern of

that user will be similar to the user proﬁle of its near-

est cluster. As a consequence the system will propose

to the new user the set of links that models the users

in the cluster.
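
The exploitation step can be sketched as a nearest-cluster search. This is an illustration under assumptions: the clusters, profiles and cluster names are toy data, and the partial sequence of the new user is compared with the complete training sessions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two click sequences (lists of URLs)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def average_linkage(clicks, cluster_sessions):
    """Mean edit distance from the new sequence to every session of a cluster."""
    total = sum(edit_distance(clicks, session) for session in cluster_sessions)
    return total / len(cluster_sessions)


def propose_links(clicks, clusters, profiles):
    """Select the nearest cluster and return its name and its profile."""
    nearest = min(clusters, key=lambda name: average_linkage(clicks, clusters[name]))
    return nearest, profiles[nearest]


clusters = {
    "accommodation": [["home", "hotels", "booking"], ["home", "hotels", "hostels"]],
    "mountain": [["home", "trails", "mountain"], ["home", "trails", "cycling"]],
}
profiles = {
    "accommodation": ["hotels", "hostels"],
    "mountain": ["trails", "cycling"],
}
nearest, links = propose_links(["home", "trails"], clusters, profiles)
print(nearest, links)
```

Only this step has to run while the user is navigating; the clusters and profiles are computed offline, so the online cost is one distance computation per stored session.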

4 EXPERIMENTS: RESULTS AND

ANALYSIS

4.1 Experimental Setup

In order to evaluate the performance of the whole pro-

cess, the best choice would be to carry out an exper-

iment in a real environment. This real experiment

should be carried out in a conversational manner with

users and this interaction should provide the feedback

required to validate our system following the guidelines of Conversational Case-Based Reasoning (CCBR) (Aha et al., 2001). In this work, however, as a first approach, we carried out a standard validation method.

In this case, we applied the hold-out method dividing

the database into two parts. One for generating the

clusters and extracting user proﬁles, and, another one

for testing or using it in exploitation. To simulate a

real situation we based the division of the database on

temporal criteria: we used the oldest examples (66%

of the database, 6366 user sessions) for training and

the latest ones (33%, 3183 user sessions), for testing.
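
The temporal hold-out can be sketched as below. The representation of a session as a (start time, clicks) pair and the rounding rule are assumptions; with 9,549 sessions and a training fraction of two thirds, the split reproduces the 6366 and 3183 sessions reported above.

```python
def temporal_holdout(sessions, train_fraction=2 / 3):
    """Split (start_time, clicks) pairs into the oldest and the latest part."""
    ordered = sorted(sessions, key=lambda session: session[0])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy data: 9,549 sessions given in reverse order, with integer time stamps.
sessions = [(t, ["home"]) for t in range(9549, 0, -1)]
train, test = temporal_holdout(sessions)
print(len(train), len(test))
```

Splitting by time rather than at random keeps the evaluation closer to deployment, where profiles built from past sessions are applied to future ones.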

We validated the system in two different situa-

tions: when no content information is used (SP) and

when proﬁles are enriched with semantic information

extracted from content. For the latter, we have used

two tools to compare URLs: MG4J and the Keyword

based comparison (KYWD).

We validated the system from two points of view: the user point of view and the service provider point

of view. For the ﬁrst part we used the test examples

as described in the exploitation section and then we

compared the automatically generated links with the

real click sequences of the users. For the latter, we

analyzed the description based on the most frequent

keywords for each cluster.

We performed the evaluation taking into account

that in real executions, when a user starts navigating,

only its ﬁrst few clicks will be available to be used

for deciding the corresponding proﬁle and proposing

new links according to it. We have simulated this real

situation using 10%, 25% and 50% of the user naviga-

tion sequence in the test examples to select the nearest

cluster or proﬁle.

We computed statistics based on results for each

one of the new users. We compared the number of

proposed links that are really used in the test exam-

ples (hits), the number of proposals that are not used

(misses), and the amount of links used by the test

users. An ideal system would maintain precision and

recall as high as possible. But, in order to compare

the two options, we keep the amount of proposed

links not too high (around 50% of the average se-

quence length) and focus on recall because it gives

us an idea of the achieved coverage, i.e., the amount

of links really used in the test examples that our sys-

tem proposes. In order to ensure that precision values

do not suffer a sudden drop we present values for two

statistics: recall (Re) and F-measure (Fm). Note that

the obtained values could be seen as a lower bound

because, although not appearing in the user naviga-

tion sequence, the proposed links could be useful and

interesting for her/him. Unfortunately, their usefulness

could only be evaluated in a controlled experiment us-

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

290

ing the user feedback. Besides, since some links ap-

pearing in the usage analysis (mainly those related to

web administration issues) do not appear in the stored

content information, recall and F-measure values are

limited (could never be 100%).

We calculated two values for the used statistics: an

upper bound (ReUp, FmUp) that takes into considera-

tion the whole test sequence, and the values calculated

using only the clicks in the test sequence that have not

been used to select the nearest proﬁle (Re, Fm); that

is, taking into account the remaining 90%, 75% or

50% (for the cases 10%, 25% and 50% respectively).
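
The two variants of each statistic can be computed as in this sketch. Treating the proposals and the visited links as sets, and rounding the number of clicks used for profile selection, are assumptions of the example; the test sequence and the proposals are invented.

```python
def precision_recall_f(proposed, used):
    """Precision, recall and F-measure of the proposed links."""
    proposed, used = set(proposed), set(used)
    hits = len(proposed & used)  # proposals that the user really visited
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(used) if used else 0.0
    if hits == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def evaluate(proposed, test_sequence, seen_fraction):
    """Return the upper-bound statistics and those on the unseen clicks."""
    n_seen = max(1, round(len(test_sequence) * seen_fraction))
    upper = precision_recall_f(proposed, test_sequence)         # ReUp, FmUp
    real = precision_recall_f(proposed, test_sequence[n_seen:])  # Re, Fm
    return upper, real


test_sequence = ["home", "trails", "mountain", "cycling",
                 "agenda", "hotels", "paths", "bay"]
proposed = ["trails", "cycling", "paths", "museum"]
upper, real = evaluate(proposed, test_sequence, seen_fraction=0.25)
print(upper)
print(real)
```

Here `trails` counts as a hit only for the upper bound, because it belongs to the first 25% of the sequence that was used to select the profile.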

4.2 Results and Analysis

A ﬁrst analysis of the results shows that most of the

users start navigating from the initial page of the web-

site and visit the agenda. Although this is interesting

information for the travel agent, the proposal of those

two URLs as part of the proﬁles inﬂates the values

of the calculated statistics, and, as a consequence we

have removed them from the generated profiles.

The ﬁrst conclusion we can draw from the results

is that even if the values of the measured metrics vary

depending on the selected option, all of them are able

to predict a certain percentage of the links a new user

will be visiting.

Figure 1: SP vs Semantics Enriching. Upper bound.

Figure 2: SP vs Semantics Enriching.

We present in Figures 1 and 2 the recall and F-measure values for the usage-based system (dashed lines) and the system combining usage and content information (solid lines). We adjusted the parameters so that both systems proposed a similar number of links to the new users: 3.5 on average for the SP option and 3.8 on average for the MG4J and KYWD options. Note that the larger the number of proposed links, the smaller the support of some of them, so the system takes more risk and, as a consequence, a drop in precision and hence in the F-measure value becomes very probable.

Table 1: Semantics of profiles.

Topic          Keywords
Mountain       cycle trails, mountain paths, path cycle
Sea            bay, river, ocean bay, river boat trips, natural treasures
Accommodation  hotel, accommodation, youth hostels
Cuisine        sugar, eggs, recipes typical product, innovative cuisine
History        walled city, borda, palace
Events         events activities, markets fairs, rural sports

The graphs in Figures 1 and 2 show that when the system is enriched with content information, the recall and F-measure values increase; that is, the system guesses more of the links actually used by the users (Re). The improvement is larger in the case of the upper bound, but even in the real case content information seems to be important. Moreover, this improvement is more evident at the early stages of the navigation, when the usage information is very limited. Those are the moments when the prediction can probably contribute most to making the user's navigation experience more satisfactory.

Finally, if we analyze the semantics of the gener-

ated proﬁles based on the extracted keywords, we re-

alize that most of the clusters group users with similar

interests. Table 1 shows an example of the semantics

of some of the clusters and some of its related key-

words. The names of the topics appearing in the left

hand column in Table 1 have been assigned manu-

ally whereas the keywords on the right hand column

have been obtained automatically. The results in the

table clearly show that the users clustered in different

groups based on their usage patterns, besides navigat-

ing in a different way, have clearly different interests

and our system is able to extract information about

them. By calculating statistics on the number of users in each cluster, the service providers could obtain very useful information about the main interests of the people accessing BTw and use it in the future for marketing campaigns or modifications to the website.

5 CONCLUSIONS

We designed a system that, without disturbing the users and based just on server log information, content information and machine learning techniques, identifies different groups of users, builds the corresponding

EnhancingaWebUsageMiningbasedTourismWebsiteAdaptationwithContentInformation

291

profiles, automatically generates useful link proposals for new users and, moreover, gives the tourism agents insight into the users' preferences. This work

has been done for Bidasoa Turismo, a tourism web-

site in our environment, but it could be extended to

any other environment since it uses the minimum in-

formation stored in any web server (in common log

format). We preprocessed the data and prepared it so that it could be used with machine learning algorithms; divided the database into two parts, training and test; and applied PAM to the training data to discover groups of users with similar navigation patterns and SPADE to discover the profiles associated with each of the clusters. We further enriched those profiles by adding semantic information using two options, MG4J and KYWD. In the exploitation phase we related each test example to just one of the built profiles (1-NN).

We evaluated, based on a hold-out strategy, different configurations of the system and how it performs at different stages of the user navigation: 10%,

25% and 50%. We calculated recall and F-measure

statistics and analyzed the semantic proﬁles.

Results showed that the use of the semantic knowledge extracted from the website content information improves the performance (recall and F-measure values) of the proposed system and, moreover, that this improvement is greater at the early stages of the navigation, so the system deals better with the zero-day or cold-start problem. Furthermore, using content

information gives the option to enrich the generated

proﬁles with semantic information that can be very

useful for service providers.

This work opens the door to many future tasks. A deeper analysis of the differences between the two options implemented for URL content comparison, MG4J and KYWD, should be done. Moreover, topic modeling could be another option for extracting semantic knowledge from the websites' content. More sophisticated strategies for building semantic profiles could also be explored.

ACKNOWLEDGEMENTS

This work was funded by the University of the Basque

Country, general funding for research groups, AL-

DAPA (GIU10/02); by the Science and Education

Department of the Spanish Government, ModelAccess (TIN2010-15549 project); by the Diputación

Foral de Gipuzkoa, Zer4You (DG10/5); and by the

Basque Government’s SAIOTEK program, Datacc

(S-PE11UN097).

REFERENCES

Abou-Shouk, M., Lim, W. M., and Megicks, P. (2012). In-

ternet adoption by travel agents: a case of egypt. Inter-

national Journal of Tourism Research, pages n/a–n/a.

Aha, D. W., Breslow, L., and Muñoz-Avila, H. (2001).

Conversational case-based reasoning. Appl. Intell.,

14(1):9–32.

Boldi, P. and Vigna, S. (2006). Mg4j at trec 2006. In

Voorhees, E. M. and Buckland, L. P., editors, TREC,

volume Special Publication 500-272. National Insti-

tute of Standards and Technology (NIST).

Chordia, B. S. and Adhiya, K. P. (2011). Grouping web

access sequences using sequence alignment method.

Indian Journal of Computer Science and Engineering

(IJCSE), 2(3):308–314.

Cooley, R., Mobasher, B., and Srivastava, J. (1999). Data

preparation for mining world wide web browsing pat-

terns. Knowledge and Information System, 1:5–32.

Dasarathy, S. (1991). Nearest neighbor (NN) norms : NN

pattern classiﬁcation techniques. IEEE Computer So-

ciety Press.

GNU (1996). Gnu wget.

Gretzel, U. (2011). Intelligent systems in tourism: A so-

cial science perspective. Annals of Tourism Research,

38(3):757–779.

Gusﬁeld, D. (1997). Algorithms on strings, trees, and se-

quences: computer science and computational biol-

ogy. Cambridge University Press, New York, NY,

USA.

He, D. and Göker, A. (2000). Detecting session boundaries

from web user logs. Proceedings of the 22nd Annual

Colloquium on Information Retrieval Research.

Kaufman, L. and Rousseeuw, P. (1990). Finding Groups

in Data An Introduction to Cluster Analysis. Wiley

Interscience, New York.

Madylova, A. and Öğüdücü, Ş. G. (2009). A taxonomy based semantic similarity of documents using the cosine measure. In ISCIS, pages 129–134. IEEE.

Mobasher, B. (2006). 12 web usage mining. Encyclopedia

of Data Warehousing and Data Mining Idea Group

Publishing, pages 449–483.

Pierrakos, D., Paliouras, G., Papatheodorou, C., and Spy-

ropoulos, C. D. (2003). Web usage mining as a tool for

personalization: A survey. User Modeling and User-

Adapted Interaction, 13(4):311–372.

Schiafﬁno, S. and Amandi, A. (2009). Artiﬁcial intelli-

gence. chapter Intelligent user proﬁling, pages 193–

216. Springer-Verlag, Berlin, Heidelberg.

Srivastava, T., Desikan, P., and Kumar, V. (2005). Web min-

ing – concepts, applications and research directions.

pages 275–307.

W3C (1995). The world wide web consortium: The com-

mon log format.

Yahoo! (June 15 2011). Term extraction documentation for

yahoo! search.

Zaki, J. M. (2001). Spade: An efﬁcient algorithm for min-

ing frequent sequences. Mach. Learn., 42(1-2):31–60.

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

292