A COLLABORATIVE FILTERING APPROACH COMBINING

CLUSTERING AND NAVIGATIONAL BASED CORRELATIONS

Ilham Esslimani, Armelle Brun and Anne Boyer

KIWI Team, Université Nancy 2, LORIA

615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France

Keywords:

Recommender systems, Collaborative filtering, Clustering, Usage analysis, Navigational patterns.

Abstract:

Recommender systems are widely used for automatic personalization of information on web sites and information retrieval systems. Collaborative Filtering (CF) is the most popular recommendation technique, but several CF systems still suffer from problems such as the availability of rating data and the dimensionality of the search space for neighborhood selection. In this paper, we present a new CF approach (PSN-CF) that uses usage traces to model users. These traces are used to estimate ratings, which are then employed to generate clusters. PSN-CF then evaluates navigational correlations between users within these clusters, and predictions are computed in a final step. The performance of PSN-CF is evaluated in terms of accuracy and processing time on a real usage dataset. We show that PSN-CF substantially improves the accuracy of predictions in terms of MAE. Moreover, the use of clustering and positive sequences before computing the navigational correlations contributes to a significant reduction in processing time.

1 INTRODUCTION

The development of the Internet has led to a massive proliferation of information resources. As a result, the need for tools that automatically personalize information has grown. Recommender systems are widely used for this purpose thanks to their ability to analyze users' behaviors and guide them towards relevant resources that suit their preferences.

Collaborative Filtering (CF) is a recommendation technique that identifies similarities between users, based on their ratings, in order to select neighbors and compute predictions for the active users.

Despite the success of recommender systems and collaborative filtering in many application areas, some research questions remain. Some of these questions concern the requirement of explicit rating data to compute correlations between users. As explicit rating data is not always available, one challenge for recommender systems is to take into consideration another type of data that could efficiently represent users' behaviors. In this context, usage traces can be a relevant source of data.

Another challenge for CF systems is the reduction of space dimensionality. Indeed, as the number of users and resources tends to increase, the time required for computing correlations and generating neighborhoods also increases. Employing clustering techniques is one way to reduce the time required for processing these correlations.

The research problem we are interested in is therefore the integration of usage traces in CF systems. Using these traces, how can we estimate implicit ratings, how can we select nearest neighbors and evaluate correlations between users, and finally how can we improve the accuracy of predictions?

In this paper, we attempt to explore these issues

and propose some solutions. We suggest a new CF ap-

proach called PSN-CF, that exploits navigational pat-

terns to model users. This approach integrates users

clustering and a new navigational based technique to

enhance the performance of CF.

This paper is organized as follows. The second part describes research studies related to clustering based recommender systems and to the analysis of usage traces. The third part presents the PSN-CF approach. The fourth part describes the experimentation. The results are put forward in the fifth part, and finally we discuss these results and present a conclusion.


Esslimani I., Brun A. and Boyer A.

A COLLABORATIVE FILTERING APPROACH COMBINING CLUSTERING AND NAVIGATIONAL BASED CORRELATIONS.

DOI: 10.5220/0001841303640369

In Proceedings of the Fifth International Conference on Web Information Systems and Technologies (WEBIST 2009), page

ISBN: 978-989-8111-81-4

2 RELATED WORK

2.1 Clustering based Recommender Systems

In the context of recommender systems, clustering algorithms are generally used to identify clusters of similar users sharing preferences. Clustering methods have been integrated in several CF based recommender systems in order to reduce dimensionality or to alleviate the sparsity and scalability problems. To overcome the sparsity problem, (Xue et al., 2005) use a CF system based on K-means clustering in order to smooth the unrated data for individual users according to the clusters. For the same issue, (Jiang et al., 2006) suggest a cluster-based collaborative filtering based on an iterative clustering method that exploits the inter-relationship between users and items. In this model, both users and items are clustered using the K-means algorithm, then a predicted rating is generated over user classes and item classes.

As regards the problem of scalability, (Conner and Herlocker, 1999) choose to partition items by experimenting with various clustering algorithms. Predictions are then computed independently within each partition. With the same perspective, (George and Merugu, 2005) use a collaborative filtering approach based on a weighted co-clustering algorithm that involves simultaneous clustering of users and items.

2.2 Analysis of Usage Traces

Several studies describe the impact of usage traces on the recommendation process in predictive systems. Analysis of usage traces is mainly related to the area of Web Usage Mining (WUM), which aims at observing users' behaviors while they interact with a system. This observation refers to direct traces such as explicit ratings and annotations, or indirect traces such as bookmarking, frequencies of visits, visited links, etc., from which users' preferences can be inferred.

Frequent pattern mining, the Longest Common Subsequence (LCS) technique and Markov models are some of the WUM approaches that harness navigational activities in order to analyze users' behaviors. (Gery and Haddad, 2003) describe the aim of frequent pattern mining as the discovery of time ordered sequences that have been followed by past users, in order to predict future resources.

The discovery of Longest Common Subsequences (LCS) is another technique that has been applied in the WUM domain in order to analyze the potential links between navigational paths and users' profiles. This technique is a dynamic programming method that identifies the longest common subsequence of two given sequences. (Jalali et al., 2008) suggest an LCS based architecture for classifying navigational patterns and generating predictions for users. In (Banerjee and Ghosh, 2001), an algorithm based on the LCS technique is proposed for clustering users by using their navigational data. This clustering approach uses the similarities between two navigational paths based on the LCS and the time spent on the resources contained in this LCS.

Another approach that uses sequential links for navigational activities is the Markov chain model. According to (Eirinaki et al., 2005), the sequential dependencies of users' navigational behaviors are modeled by Markov models: the conditional probability of one resource, given the users' navigational traces, is computed.

3 DESCRIPTION OF PROPOSED APPROACH

In this paper, we propose a new collaborative filtering model which, on the one hand, exploits clusters of users based on a similarity matrix of users; this step allows a reduction of dimensionality. On the other hand, it uses navigational patterns so as to refine the result of this clustering and produce recommendations.

Figure 1 describes the different layers of our model, called “Pam clustering on Similarities and Navigational based-CF” (PSN-CF), compared to the classical clustering based CF model. The classical clustering based CF (dashed lines) directly uses the rating matrix (User x Item) in order to generate clusters and compute predictions. In contrast, the PSN-CF model (solid lines) uses a similarity matrix (User x User), computed from the rating matrix, to create clusters of users. At the same time, PSN-CF selects “Positive Sequences” from the original users' sequences; these sequences contain the preferred accessed resources. Then PSN-CF computes the navigational correlations between users, based on Positive Sequences, within individual clusters. The Positive Sequences are used to reduce the processing time of the stage that computes navigational correlations. Finally, the newly generated neighbors are used to compute predictions.

The following sections present in detail the different mechanisms used by the PSN-CF model at each layer.


Figure 1: Global scheme comparing the PSN-CF approach to the classical clustering based CF. (Diagram not reproduced. It takes usage traces and users sequences as input and shows five layers: Layer 1, estimation of ratings, producing the rating matrix (U x I); Layer 2, computing correlations of ratings, producing the similarity matrix (U x U); Layer 3, generation of clusters; Layer 4, selection of neighbors by computing navigational correlations on Positive Sequences; Layer 5, generation of predictions.)

3.1 Estimating Ratings

As mentioned in section 2.2, implicit ratings can be inferred from users' navigational activities. In our approach, to estimate these ratings (Layer 1) we choose two implicit parameters: the frequency of visiting a resource and the duration of visiting a resource. Considering an active user $u$, the frequency of visiting an item $i$ is the ratio of the number of visits of $i$ and the average number of visits on all items $I$, as described in Equation (1).

$$\mathrm{Frequency}_{(u,i)} = \frac{\text{number of visits of } u \text{ on } i}{\text{average number of visits of } u \text{ on all items } I} \qquad (1)$$

As regards duration, it is computed as the ratio of the duration of visiting an item $i$ ($Drt_{(u,i)}$) and the total duration of visiting all items $I$ ($Drt_{(u,I)}$), as presented in Equation (2).

$$\mathrm{Duration}_{(u,i)} = \frac{Drt_{(u,i)}}{Drt_{(u,I)}} \qquad (2)$$

Once frequencies and durations are calculated, we use a formula suggested by (Castagnos, 2008) in order to compute our ratings and normalize them to the rating scale from 1 (bad) to 5 (excellent).
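As an illustration of Equations (1) and (2), the two implicit indicators could be computed from a user's visit log roughly as follows. This is only a sketch under stated assumptions: the function name `frequency_and_duration` and the `(item, seconds)` log format are hypothetical, the average number of visits is taken over the items the user actually visited, and the formula of (Castagnos, 2008) that combines both indicators into a rating is not reproduced here.

```python
from collections import defaultdict


def frequency_and_duration(visits):
    """Return {item: (frequency, duration)} for one user.

    `visits` is a hypothetical log format: a list of (item, seconds) pairs,
    one pair per visit of the user to an item.
    """
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for item, secs in visits:
        counts[item] += 1
        seconds[item] += secs
    if not counts:
        return {}
    # Assumption: the average is taken over the items the user visited.
    mean_visits = sum(counts.values()) / len(counts)
    total_seconds = sum(seconds.values())
    result = {}
    for item in counts:
        frequency = counts[item] / mean_visits              # Equation (1)
        duration = (seconds[item] / total_seconds           # Equation (2)
                    if total_seconds > 0 else 0.0)
        result[item] = (frequency, duration)
    return result


log = [("a", 30), ("a", 30), ("b", 20), ("c", 20)]
indicators = frequency_and_duration(log)
```

With this toy log, item `a` is visited twice while the user averages 4/3 visits per item, so its frequency is 1.5, and it accounts for 60 of the 100 seconds, so its duration is 0.6.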

3.2 Computing Clusters

In the context of recommender systems, clustering methods usually use “User x Item” matrices to create clusters. In our approach, we choose instead to employ a “User x User” matrix (containing similarity values between users) in order to generate clusters. Hence, the clusters are constructed based on similarities with other users instead of similarities on ratings. Moreover, in classical clustering approaches, the clusters are constructed based on co-rated items between users. In contrast, our approach ensures the exploitation of additional items by taking into account users' similarities.

3.2.1 Generation of the Similarity Matrix

In order to generate the “User x User” similarity matrix (Layer 2) required for the clustering step, we use the Pearson Correlation Coefficient (Herlocker et al., 1999) to compute the similarities between users based on the estimated ratings.
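The construction of the “User x User” matrix can be sketched as below. The code assumes that each user's estimated ratings are stored in a dictionary, that means are computed over co-rated items only (one common variant of the Pearson coefficient), and that pairs with fewer than two co-rated items get a similarity of 0; these choices and the names `pearson` and `similarity_matrix` are illustrative and are not specified in the paper.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users over their co-rated items."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # assumption: not enough evidence means no correlation
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = (sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
           * sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0


def similarity_matrix(ratings):
    """Return (users, matrix), where matrix[x][y] is the similarity of users x and y."""
    users = sorted(ratings)
    matrix = [[1.0 if a == b else pearson(ratings[a], ratings[b]) for b in users]
              for a in users]
    return users, matrix


ratings = {
    "u1": {"x": 1, "y": 2, "z": 3},
    "u2": {"x": 2, "y": 4, "z": 5},
    "u3": {"x": 3, "y": 2, "z": 1},
}
users, S = similarity_matrix(ratings)
```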

3.2.2 Users Clustering

In our approach, the goal of clustering data is to reduce the search space and the time required for computing navigational correlations, and to improve the selection of neighborhoods. We choose a partitioning algorithm called PAM (Partitioning Around Medoids). The advantage of the PAM method compared to other partitioning algorithms like K-means is its robustness (Kaufman and Rousseeuw, 1990). We choose to use a “User x User” matrix as input of the PAM algorithm. Thus, clusters are generated according to the similarities between users, as presented in Layer 3. In the following section, we present the technique used for evaluating navigational similarities between users.
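To make the clustering step concrete, the following is a compact sketch of PAM (greedy BUILD phase followed by a best-improvement SWAP phase) operating on a precomputed distance matrix. The experiments in this paper rely on R rather than on this code, and the paper does not state how the similarity matrix is turned into distances; here we assume, purely for illustration, that each row of the similarity matrix is treated as a feature vector and that users are compared with the Euclidean distance between their rows. The helper names are hypothetical.

```python
from math import dist


def row_distances(S):
    """Assumption: users are compared through the rows of the similarity matrix."""
    return [[dist(r1, r2) for r2 in S] for r1 in S]


def total_cost(D, medoids):
    """Sum of the distances from every point to its closest medoid."""
    return sum(min(D[p][m] for m in medoids) for p in range(len(D)))


def pam(D, k):
    """Return (medoids, labels) for a distance matrix D and k clusters."""
    n = len(D)
    # BUILD: repeatedly add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [p for p in range(n) if p not in medoids]
        medoids.append(min(candidates, key=lambda p: total_cost(D, medoids + [p])))
    # SWAP: apply the best improving (medoid, non-medoid) exchange until none is left.
    while True:
        best_cost, best_trial = total_cost(D, medoids), None
        for idx in range(k):
            for p in range(n):
                if p in medoids:
                    continue
                trial = medoids[:idx] + [p] + medoids[idx + 1:]
                cost = total_cost(D, trial)
                if cost < best_cost:
                    best_cost, best_trial = cost, trial
        if best_trial is None:
            break
        medoids = best_trial
    labels = [min(range(k), key=lambda c: D[p][medoids[c]]) for p in range(n)]
    return medoids, labels


# Toy similarity matrix with two obvious groups of users: {0, 1} and {2, 3}.
S = [
    [1.0, 0.9, -0.8, -0.7],
    [0.9, 1.0, -0.6, -0.9],
    [-0.8, -0.6, 1.0, 0.8],
    [-0.7, -0.9, 0.8, 1.0],
]
medoids, labels = pam(row_distances(S), k=2)
```

Because the total cost strictly decreases at every swap, the loop terminates. This naive version recomputes the cost from scratch for each candidate and is only meant for small examples.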

3.3 Navigational Similarities

We propose at this step the integration of navigational patterns in the process of recommendations based on a collaborative approach, so as to refine the clusters provided by the previous step (Layer 4).

In our model, we consider that two users $u_a$ and $u_b$ who share common sequential patterns are highly correlated. Therefore, our goal is to identify, for every pair of users $\langle u_a, u_b \rangle$, the maximum length $L_{Kmax(u_a,u_b)}$ of a pattern among their common patterns. As described in (Layer 3), we select from the original sequences only “Positive Sequences” ($P$) in order to reduce the processing time required to identify these common patterns among users' sessions. Thus, we retain in sequences only items with high ratings. Then, the similarity of navigation between two users is computed by using Equation (3).

This formula computes, for each pair of users $u_a$ and $u_b$, the correlation of navigation $SimNav_{(u_a,u_b)}$ as the ratio of the maximum length of a common frequent pattern $L_{Kmax(u_a,u_b)}$ and the minimum of the maximum sizes of the sessions of $u_a$ and $u_b$, denoted $SessMax_{(u_a)}$ and $SessMax_{(u_b)}$. We note that the common frequent pattern is intra-session.

$$SimNav_{(u_a,u_b)} = \frac{L_{Kmax(u_a,u_b)}}{\min\left(SessMax_{(u_a)}, SessMax_{(u_b)}\right)} \qquad (3)$$

This metric emphasizes the importance of the longest frequent patterns in evaluating similarities between users: the longer the common sequential pattern, the more the users are correlated.
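A rough sketch of Equation (3) is given below. It makes several simplifying assumptions that are not taken from the paper: the longest common pattern between two users is approximated by the longest common subsequence over all pairs of their sessions (a stand-in for a proper frequent sequential pattern miner), session sizes are measured on the Positive Sequences, and an item is considered positive when its estimated rating is at least 4. All function names are hypothetical.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of two item sequences."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def positive_sessions(sessions, ratings, threshold=4):
    """Keep only highly rated items in each session; drop sessions left empty.

    The threshold of 4 is an illustrative assumption.
    """
    kept = [[i for i in s if ratings.get(i, 0) >= threshold] for s in sessions]
    return [s for s in kept if s]


def sim_nav(sessions_a, sessions_b):
    """Equation (3), with the LCS as a stand-in for the longest common pattern."""
    if not sessions_a or not sessions_b:
        return 0.0
    longest = max(lcs_length(s, t) for s in sessions_a for t in sessions_b)
    sess_max_a = max(len(s) for s in sessions_a)
    sess_max_b = max(len(t) for t in sessions_b)
    return longest / min(sess_max_a, sess_max_b)


sessions_a = [["p1", "p2", "p3", "p4"], ["p5"]]
sessions_b = [["p2", "p9", "p3"], ["p1"]]
similarity = sim_nav(sessions_a, sessions_b)
```

In this toy example the longest shared pattern is (p2, p3), of length 2, and the smaller of the two maximum session sizes is 3, so the similarity is 2/3. The value always lies between 0 and 1, since a common subsequence cannot be longer than the shorter of the two sessions it comes from.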

3.4 Prediction Generation

Once the navigational similarities between the active

user and other users within a cluster are calculated, we

employ the weighted average prediction formula used

by classical CF (Herlocker et al., 1999) to compute

predictions. This step corresponds to the last layer of

the PSN-CF approach (Layer 5).
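For reference, the weighted average prediction commonly used in user-based CF adds to the active user's mean rating the weighted mean of the neighbors' deviations from their own means. The sketch below follows that usual form; the data layout, the function name `predict`, the use of absolute weights in the denominator and the fallback to the active user's mean when no neighbor rated the item are assumptions of this example.

```python
def predict(active_mean, neighbors, item):
    """Weighted average prediction for `item`.

    `neighbors` is a list of (weight, neighbor_mean, neighbor_ratings) tuples,
    where the weight is the similarity between the active user and the neighbor.
    """
    num = 0.0
    den = 0.0
    for weight, mean_u, ratings_u in neighbors:
        if item in ratings_u:
            num += weight * (ratings_u[item] - mean_u)
            den += abs(weight)
    if den == 0:
        return active_mean  # assumption: fall back to the user's mean rating
    return active_mean + num / den


neighbors = [
    (1.0, 3.0, {"x": 5}),   # strongly similar neighbor who liked x
    (0.5, 4.0, {"x": 4}),   # weaker neighbor with an average opinion of x
    (0.8, 2.0, {"y": 1}),   # did not rate x, so it is ignored
]
prediction = predict(3.0, neighbors, "x")
```

Here the prediction is 3.0 + (1.0 × 2 + 0.5 × 0) / 1.5, that is, about 4.33. In PSN-CF the weights would be the navigational similarities computed within the cluster of the active user.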

4 EXPERIMENTATION

4.1 Datasets

In order to evaluate the performance of PSN-CF, we

use real usage datasets extracted from the intranet of

Credit Agricole Banking Group, in particular the us-

age data relating to the Department of Strategies and

Technology Watch.

Thus, to train our model, we use the usage data that reflects the navigational activities of users. This data was collected over 24 months and stored in server log files. The selected dataset relates to 748 users and 3856 resources. It has been split into 80% and 20%, corresponding respectively to the training and test datasets. The tests have been performed on a Windows Server 2003 PC with 2 GB of RAM and a 3.4 GHz processor (Pentium IV).

As regards clustering, in order to create clusters from matrices, we used “R” (http://www.r-project.org), an environment for statistical computing and graphics. In our experiments, 10 clusters are created.

4.2 Evaluation

Different evaluation metrics can be used in the experimentation of CF systems. The most important criterion in recommender systems is precision, which measures the accuracy of recommendations compared to real votes. As a measure of precision, we used the Mean Absolute Error (MAE). This metric computes the mean of the absolute errors between the predicted ratings and the real ratings that are actually assigned by users.

Since items that have high prediction values are the ones that are recommended to users, we also use the HMAE (High MAE) metric (Baltrunas and Ricci, 2007) to evaluate the performance of the model. The HMAE is similar to MAE but it considers only items that are predicted with a value of 4 or 5. In our experimentation, we choose the HMAE metric to measure how well our system is able to recommend relevant items to active users.
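Both metrics can be written in a few lines. The sketch below takes a list of (predicted, real) rating pairs; treating "predicted with a value of 4 or 5" as "predicted value of at least 4" for real-valued predictions is an assumption of this example, as are the function names.

```python
def mae(pairs):
    """Mean absolute error over (predicted, real) rating pairs."""
    return sum(abs(p - r) for p, r in pairs) / len(pairs)


def hmae(pairs, threshold=4.0):
    """MAE restricted to the items whose predicted rating is high.

    Returns None when no prediction reaches the threshold.
    """
    high = [(p, r) for p, r in pairs if p >= threshold]
    return mae(high) if high else None


pairs = [(4.5, 5), (3.0, 2), (4.0, 4), (1.5, 1)]
overall_error = mae(pairs)
high_error = hmae(pairs)
```

On this toy data the MAE is (0.5 + 1 + 0 + 0.5) / 4 = 0.5, while the HMAE only looks at the two predictions of 4 or more and is (0.5 + 0) / 2 = 0.25.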

5 RESULTS

In order to analyze the performance of our approach, we evaluate the precision of predictions generated by the PSN-CF model in terms of MAE and HMAE. PSN-CF accuracy is compared to several variants of CF models so as to study the impact of clustering users, the influence of the nature of the matrix used for clustering users and the importance of using navigational patterns.

Additionally, we evaluate the impact of the use of Positive Sequences on the processing time of navigational based correlations.

We note that, before the computation of predic-

tions (Layer 5), we used at the same time (for all the

models) two criteria to select the nearest neighbors of

the active user: a threshold relating to the correlation

value between the active user and other users and a

minimum number of co-rated items between the ac-

tive user and other users.
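The two selection criteria can be combined as in the following sketch. The threshold values used in the experiments are not reported here, so `min_corr` and `min_corated` are placeholder values, and the data layout and the function name `select_neighbors` are hypothetical.

```python
def select_neighbors(active, candidates, correlations, ratings,
                     min_corr=0.3, min_corated=2):
    """Keep the candidates that satisfy both selection criteria.

    `correlations[(active, u)]` is the correlation between the active user and u;
    `ratings[u]` maps the items rated by u to their ratings.
    The default thresholds are placeholders, not values from the paper.
    """
    kept = []
    for u in candidates:
        if u == active:
            continue
        corated = len(set(ratings[active]) & set(ratings[u]))
        if correlations.get((active, u), 0.0) >= min_corr and corated >= min_corated:
            kept.append(u)
    return kept


ratings = {
    "a": {"x": 5, "y": 4, "z": 1},
    "b": {"x": 4, "y": 5},
    "c": {"x": 5},
    "d": {"y": 2, "z": 4},
}
correlations = {("a", "b"): 0.8, ("a", "c"): 0.9, ("a", "d"): 0.1}
neighbors = select_neighbors("a", ["a", "b", "c", "d"], correlations, ratings)
```

In this example only `b` is retained: `c` is highly correlated but shares a single co-rated item with the active user, and `d` shares two items but falls below the correlation threshold.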

5.1 MAE

Table 1 presents the MAE values related to CF models when either no clustering is performed or clustering is applied on a rating matrix. We can first notice that, when no clustering is performed, the accuracy decreases only slightly when only navigational data is used, compared to classical CF. This confirms the idea that navigational patterns are almost as informative as rating data and may contain information complementary to ratings.


Table 1: MAE values with and without clustering.

                 Classical CF    Navigational CF
No clustering    0.7631          0.7895
K-means          0.7826          0.7971
PAM              0.7998          0.8253

Table 2: MAE values when using a similarity matrix for clustering.

                 Navigational CF
K-means          0.7809
PAM              0.6747

In the case of classical CF, performing clustering

on ratings (referred to as Classical Clustering based

CF in Figure 1, Layers 1, 3 and 5) does not improve

the accuracy. PAM clustering leads to the lowest ac-

curacy (decrease of about 5%). Besides, when con-

sidering navigational data (in addition to ratings), we

can notice that, as in the case of classical CF, the use

of PAM clustering on a rating matrix leads to the low-

est accuracy.

The PSN-CF model we propose is based on clustering applied on a similarity matrix (Layers 1, 2, 3, 4 and 5). We thus present in Table 2 the corresponding MAE when either the PAM or the K-means algorithm is performed.

From Table 2, we observe that when performing PAM clustering on a similarity matrix of users, MAE is improved by about 15% compared to clustering on a rating matrix. This configuration corresponds to the PSN-CF model. This improvement can be explained by the fact that using a similarity matrix to perform clustering does not group users only according to the way they have similarly rated commonly seen items. Users are grouped in a cluster when they have comparable similarities to all the other users. Moreover, this approach has the advantage of considering not only commonly rated items, but all the items users have rated.

From Tables 1 and 2, we can notice that K-means clustering depends only slightly on the nature of the matrix: similar accuracy values are obtained for both matrices.

5.2 HMAE

Let us recall that only items with high prediction val-

ues are suggested by recommender systems to the ac-

tive user. Here, we are also interested in the HMAE

values when performing clustering, based on a rating

matrix or a similarity matrix. These values are pre-

sented in Tables 3 and 4.

Table 3: Comparison of HMAE values with and without clustering.

                 Classical CF    Navigational CF
No clustering    0.5415          0.5014
K-means          1.2851          1.2723
PAM              1.1689          1.1590

Table 4: HMAE values when using a similarity matrix for clustering.

                 Navigational CF
K-means          0.5874
PAM              0.6036

We can first notice that when no clustering is performed, navigational based CF outperforms classical CF by 7% in terms of HMAE, contrary to MAE.

When clustering is based on a rating matrix (Table 3), HMAE values increase sharply for both classical and navigational CF. However, in contrast to MAE (Table 1), the use of navigational patterns in addition to ratings leads to an improvement of HMAE in all the studied models.

Table 4 presents the HMAE values related to clustering based on a similarity matrix in the case of a navigational CF. When clustering is applied on a similarity matrix, HMAE decreases sharply for both K-means and PAM clustering compared to the results of Table 3. The lowest HMAE is obtained when K-means is used; however, a similar HMAE is observed when using PAM clustering (the PSN-CF model).

Even if no improvement is obtained in terms of

HMAE compared to no clustering, a large improve-

ment is obtained in terms of MAE and the computa-

tion time of neighbors is decreased. The following

section is dedicated to the study of this computation

time.

5.3 Time Processing

We are now interested in the time processing required

during the phase of computing the navigational cor-

relations within clusters, by either applying the selec-

tion of Positive Sequences or not.

Results show that the models that do not perform

clustering require on average a computation time 4

times higher for computing the navigational corre-

lations. We observe also that the selection of Posi-

tive Sequences contributes to an important reduction

of time processing. Indeed, the processing time de-

creases by about 8% when no clustering is used and

from 16% to 30% for clustering-based models. Note that the use of Positive Sequences by navigational based models decreases accuracy, in the worst case, by about 1.85% in terms of MAE and by 2.34% in terms of HMAE. When PAM clustering is used, accuracy either improves or remains stable.

6 CONCLUSIONS

In this paper, we presented a new Collaborative Fil-

tering approach, named PSN-CF, that exploits navi-

gational patterns. PSN-CF is structured in different

layers as described in Figure 1.

Unlike classical predictive systems based on us-

age patterns, PSN-CF is user-based and attempts to

identify behavioral correlations between users. The

originality of PSN-CF consists in the exploitation of

navigational patterns in the context of CF, thus no ex-

plicit preferences need to be provided by users. Ad-

ditionally, PSN-CF exploits the concept of Positive

Sequences so as to assess navigational correlations

based on users preferred resources with the objective

of reducing time processing required for computing

these correlations.

PSN-CF has been evaluated both in terms of MAE and HMAE and has been compared to other CF models. The experimentation shows the benefit of using PAM clustering based on a similarity matrix for the accuracy of the CF system in terms of MAE. Moreover, the use of clustering based on similarities is also beneficial in terms of HMAE. Experiments also showed the relevance of using both navigational patterns and estimated rating data for generating accurate high predictions. Last, the experiments showed the advantage of selecting Positive Sequences, which is a trade-off between reducing the processing time of navigational correlations and a reduction of accuracy.

As future work, we intend first to exploit additional methods that allow the reduction of dimensionality, and second to evaluate the impact of their combination with the navigational based CF on the accuracy of recommendations. Additionally, we plan to extend PSN-CF in the direction of social networks and examine the possibilities of modeling potential links between users in the context of behavioral networks.

ACKNOWLEDGEMENTS

We would like to thank Mr. Jean Philippe Blanchard and acknowledge the financial support to this project provided by the Credit Agricole Banking Group.

REFERENCES

Baltrunas, L. and Ricci, F. (2007). Dynamic item weighting and selection for collaborative filtering. In Web mining 2.0 Workshop, ECML-PKDD 2007. Springer-Verlag.

Banerjee, A. and Ghosh, J. (2001). Clickstream clustering using weighted longest common subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining.

Castagnos, S. (2008). Modélisation de comportements et apprentissage stochastique non supervisé de stratégies d'interactions sociales au sein de systèmes temps réel de recherche et d'accès à l'information. PhD thesis, Nancy 2 University, France.

Conner, M. and Herlocker, J. (1999). Clustering items for collaborative filtering. In Proceedings of the ACM SIGIR Workshop on Recommender Systems.

Eirinaki, M., Vazirgiannis, M., and Kapogiannis, D. (2005). Web path recommendations based on page ranking and markov models. In Proceedings of the 7th annual ACM international workshop on Web information and data management. ACM Press.

George, T. and Merugu, S. (2005). A scalable collaborative filtering framework based on co-clustering. In Proceedings of the Fifth IEEE International Conference on Data Mining. IEEE Computer Society.

Gery, M. and Haddad, H. (2003). Evaluation of web usage mining approaches for user's next request prediction. In Proceedings of the 5th ACM international workshop on Web information and data management. ACM Press.

Herlocker, J., Konstan, J., Borchers, A., and Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval.

Jalali, M., Mustapha, N., Sulaiman, N., and Mamat, A. (2008). A web usage mining approach based on lcs algorithm in online predicting recommendation systems. In Proceedings of 12th conference of information visualisation.

Jiang, X., Song, W., and Feng, W. (2006). Optimizing collaborative filtering by interpolating the individual and group behaviors. In APWeb.

Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York.

Xue, G., Lin, C., and Yang, Q. (2005). Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval.
