terms, it amounts to matching the active user with a set of
persons who have the same tastes, on the basis of
his/her preferences and past readings. This
system starts from the principle that users who have
appreciated the same documents share the same topics of
interest. Thus, it is possible to predict which pieces of data
are likely to live up to users’ expectations by taking
advantage of the experience of a similar population.
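As an illustration, the principle described above can be sketched in Python. This is a minimal sketch and not the method developed in this paper: it assumes votes stored as dictionaries, a Pearson correlation as the similarity measure, and a prediction computed as the active user's mean vote plus the weighted deviations of positively correlated neighbors. The names `pearson` and `predict` are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two users over the items both have voted on."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)
               * sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(active, others, item):
    """Predict the active user's vote for `item` from similar users.

    Assumption: only neighbors with a positive correlation contribute.
    """
    mean_active = sum(active.values()) / len(active)
    num = den = 0.0
    for votes in others:
        if item not in votes:
            continue
        weight = pearson(active, votes)
        if weight <= 0:
            continue  # ignore users with different or opposite tastes
        mean_other = sum(votes.values()) / len(votes)
        num += weight * (votes[item] - mean_other)
        den += weight
    return mean_active if den == 0 else mean_active + num / den


# Made-up votes on a 1-5 scale, for illustration only.
active = {"a": 5, "b": 1, "c": 4}
similar = {"a": 5, "b": 1, "c": 4, "d": 5}
opposite = {"a": 1, "b": 5, "c": 2, "d": 1}
prediction = predict(active, [similar, opposite], "d")
```

In this toy example, only the user with the same tastes influences the prediction for item "d", so the predicted vote is above the active user's mean vote.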
The common feature of most existing collaborative
filtering methods is that they are centralized. Even if
the search for nearest neighbors among a few thousand
candidates in real time is no longer a problem,
the transition to hundreds of thousands of users or
more remains an open issue. According to (Breese
et al., 1998), the bottleneck caused by a large
population of potential neighbors is problematic in
conventional collaborative filtering algorithms. (Sarwar
et al., 2001) have paved the way by proposing an
alternative: they suggest computing recommendations
by identifying items that are similar to other items the
user has liked. They assume that the relationships
between items are relatively static. Nevertheless, this
approach is unlikely to work in the context investigated
in this paper, since the number of users there is
far greater than the number of items. Moreover,
the bouquet (that is to say, the items it contains) can
change radically from one week to the next. Therefore,
we have chosen to explore ways to distribute
computations.
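For comparison, the item-based alternative mentioned above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the exact algorithm of (Sarwar et al., 2001): it assumes a cosine similarity between item vote vectors and a prediction computed as a similarity-weighted average of the user's own votes. All function names and the sample votes are hypothetical.

```python
from math import sqrt


def item_vectors(votes):
    """Turn {user: {item: vote}} into {item: {user: vote}}."""
    items = {}
    for user, user_votes in votes.items():
        for item, vote in user_votes.items():
            items.setdefault(item, {})[user] = vote
    return items


def cosine(u, v):
    """Cosine similarity between two sparse vote vectors."""
    common = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in common)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def item_similarities(votes):
    """Item-to-item similarities; this step can be computed offline
    as long as the relationships between items stay stable."""
    items = item_vectors(votes)
    return {(i, j): cosine(items[i], items[j])
            for i in items for j in items if i != j}


def predict_from_items(user_votes, sims, target):
    """Weighted average of the user's votes on items similar to `target`."""
    num = den = 0.0
    for item, vote in user_votes.items():
        weight = sims.get((item, target), 0.0)
        num += weight * vote
        den += abs(weight)
    return num / den if den else None


# Made-up votes on a 1-5 scale, for illustration only.
votes = {
    "u1": {"x": 5, "y": 5, "z": 1},
    "u2": {"x": 4, "y": 5, "z": 1},
    "u3": {"x": 1, "y": 1, "z": 5},
}
sims = item_similarities(votes)
prediction = predict_from_items({"x": 5, "z": 1}, sims, "y")
```

The similarity table has to be rebuilt whenever the set of items changes, which is why a bouquet that changes from week to week weakens the assumption of static item relationships.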
Furthermore, centralization of data is in contradiction
with the Council of Europe convention of 28 January 1981
and with the instructions of
the Commission Nationale de l’Informatique et des
Libertés⁴ (CNIL), unless users are handled
anonymously. As a matter of fact, the confidentiality
of any information related to users is a
European legal obligation. In France, the CNIL
is the organization responsible for the protection of
private life and for the preservation of personal data.
In order to distribute the model, we have thus
decided to split a clustering collaborative filtering
method into client and server parts. However, this is
not enough to solve all the problems ASTRA is
confronted with, since this filtering method requires
explicit numerical or boolean votes. For marketing
reasons⁵, this kind of vote is not suitable,
because it implies that users may rate some items
negatively. We will show, in Section 3, how to bypass
this difficulty with an assistance function to the votes.
We then present the clustering algorithm in Section 4.
Section 5 is dedicated to a discussion of the advantages
and drawbacks of the model. Finally, Section 6
presents our research perspectives. Beforehand,
we would like to familiarize the reader with the global
architecture of our information filtering system in the
following section.
⁴ http://www.cnil.fr
⁵ ASTRA does not want users to be able to explicitly reject
items for which companies have paid to be included in the
bouquet.
2 ARCHITECTURE
The architecture of our information filtering system
is shown in Figure 1. This model combines a user
profiling method based on the Chan formula (Chan,
1999) (cf. infra, 3 Assistance to votes, p. 3) and a new
version of the hierarchical clustering algorithm
called RecTree (Chee et al., 2001) (cf. infra, 4 Clustering
algorithm, p. 3). This new version has the
advantage of being distributed.
Figure 1: Architecture of information filtering module.
Web sites are sent via satellite from the
Casablanca server to the client. Moreover, users
who also have a standard internet connection can send
non-numerical votes (cf. infra, 3 Assistance to votes,
p. 3) and suggestions for new contents to the server.
This system interfaces with our information
filtering module through DLL files.
In order to distribute the system, the server part has
been separated from the client side. The assistance
function to the votes determines numerical votes for
the items according to the users’ actions. These
numerical votes are then sent to the server, together with
the non-numerical ones. The server thus has at its
disposal, as input parameters, the matrix of users’ votes
and a database including sites and descriptors. In this
way, the server has no information about the population
other than anonymous votes. Users’ preferences are
stored in the profile on the clients. Thus, the
confidentiality criterion is duly respected.
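The separation described above can be illustrated with a short Python sketch. It is a simplified sketch under explicit assumptions, not the actual implementation: `placeholder_vote` is a hypothetical stand-in for the assistance function (it is not the Chan formula), the viewing time used as a user action is an assumed signal, and the random row identifier is an assumed way of keeping the rows of the vote matrix anonymous.

```python
import uuid


def placeholder_vote(seconds_viewed):
    """Hypothetical stand-in for the assistance function (NOT the Chan
    formula): maps a viewing time to a numerical vote between 1 and 5."""
    return min(5, 1 + seconds_viewed // 60)


class Client:
    """Client side: the profile is stored locally and never transmitted."""

    def __init__(self):
        self.profile = {}               # item -> seconds viewed (stays on client)
        self.row_id = uuid.uuid4().hex  # assumed random token, not an identity

    def record_action(self, item, seconds_viewed):
        self.profile[item] = self.profile.get(item, 0) + seconds_viewed

    def votes_message(self):
        """Build the only data sent to the server: anonymous numerical votes."""
        votes = {item: placeholder_vote(s) for item, s in self.profile.items()}
        return {"row": self.row_id, "votes": votes}


class Server:
    """Server side: only sees a matrix of anonymous votes."""

    def __init__(self):
        self.vote_matrix = {}  # anonymous row id -> {item: vote}

    def receive(self, message):
        self.vote_matrix[message["row"]] = dict(message["votes"])


# Illustrative usage with made-up site names and viewing times.
client = Client()
client.record_action("site_a", 200)
client.record_action("site_b", 30)
message = client.votes_message()
server = Server()
server.receive(message)
```

In this sketch, the message contains only a random row identifier and the numerical votes, so the server can fill its matrix of votes without receiving the profile itself.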
The RecTree algorithm aims at reducing the quantity
of data that needs to be processed. The offline computations
of RecTree make it possible to build typical user pro-
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS