COLLABORATIVE FILTERING BASED ON
CONTENT ADDRESSING
Shlomo Berkovsky, Yaniv Eytani, Larry Manevitz
Computer Science Department, University of Haifa, Israel
Keywords: Collaborative Filtering, Recommender Systems, Content-Addressable Systems.
Abstract: Collaborative Filtering (CF) is one of the most popular recommendation techniques. It is based on the
assumption that users with similar tastes prefer similar items. One of the major drawbacks of the CF is its
limited scalability, as the complexity of the CF grows linearly both with the number of available users and
items. This work proposes a new fast variant of the CF employed over multi-dimensional content-
addressable space. Our approach heuristically decreases the computational effort required by the CF
algorithm by limiting the search process only to potentially similar users. Experimental results demonstrate
that our approach is capable of generating recommendations with high levels of accuracy, while significantly
improving performance in comparison with the traditional implementation of the CF.
1 INTRODUCTION
The quantity of available information grows rapidly
and exceeds our cognitive processing capabilities.
Thus, there is a pressing need for intelligent systems
providing services tailored to users' real needs and
interests. Recommender Systems (RSs) (Resnick &
Varian, 1997) are one of the commonly used
approaches to address this problem. These systems
assist users to select a suitable item among a set of
potential selectable items through applying statistical
and knowledge discovery techniques (Sarwar et al,
2000). RSs are used in different domains, such as
movies (Good et al, 1999), jokes (Goldberg et al,
2001), music (Aguzzoli et al, 2002), and many
others.
Collaborative Filtering (CF) (Herlocker et al,
1999) is probably one of the most familiar and most
widely-used techniques to generate
recommendations in RSs. It relies on the assumption
that people who agreed in the past will also agree in
the future (Shardanand & Maes, 1995). The input for
the CF algorithm is a matrix of users' ratings on a set
of items, where each row represents ratings of a
single user and each column represents ratings on a
single item. CF aggregates the ratings to recognize
similarities between users and generates a new
recommendation for an item by weighting the
ratings of similar users on the item.
The CF algorithm is typically partitioned into three
generic stages: (1) Similarity Computation:
weighting all the users with respect to their
similarity with the active user (i.e., the user whose
ratings are predicted), (2) Neighborhood Formation:
selecting the most similar users for the prediction
generation, and (3) Prediction Generation:
computing the prediction by weighting the ratings of
the selected users.
A major drawback of CF is its limited scalability.
The stages of Similarity Computation and
Neighborhood Formation require comparing an
active user with all the other users over all the
available ratings. Hence, the complexity of the CF
grows both with the number of users and items in
the ratings matrix. For a matrix containing ratings of
M users on N items, the computational complexity of the
above stages is O(MN). This poses a problem in
real-life systems, where the recommendations are
generated using millions of ratings on thousands of
items, e.g., Web-based RSs. Although previous
studies, such as (Breese et al, 1998), (Goldberg et al,
2001), and (Chee et al, 2001) tackle the issue of
reducing the computational effort required by the
CF, it remains one of the most important issues in
CF research.
In this work we develop a fast heuristic variant
of the CF algorithm that decreases the computational
effort required by the Similarity Computation and
the Neighborhood Formation stages.

Berkovsky S., Eytani Y. and Manevitz L. (2006). COLLABORATIVE FILTERING BASED ON CONTENT ADDRESSING. In Proceedings of the Eighth International Conference on Enterprise Information Systems - AIDSS, pages 91-98. DOI: 10.5220/0002454100910098. Copyright © SciTePress.

The basic conjecture of the heuristic algorithm is that losing
general completeness of the exhaustive search (1)
has a minor negative effect on the accuracy of the
predictions, but (2) significantly decreases the
required computational effort. Thus, it provides a
scalable approach, applicable to real-life scenarios
with a high number of users and items.
The proposed heuristic approach is based on a
notion of content-addressable data management
(Ratnasamy et al, 2001), providing an adaptive
topology for mapping of users' profiles to a multi-
dimensional space. This mapping implicitly clusters
similar users and limits the Similarity Computation
and the Neighborhood Formation stages to a
heuristic search among highly similar users only.
Experimental evaluation of the proposed
approach demonstrates high efficiency and good
accuracy of the proposed algorithm, in comparison
with the traditional (exhaustive) KNN search. The
algorithm is also highly scalable with the number of
nearest neighbors to be found.
The rest of the paper is organized as follows.
Section 2 surveys the works focusing on the CF and
the required computational effort reduction. Section
3 describes CAN, a Peer-to-Peer content-addressable
platform for decentralized data management. Section
4 describes the decentralized storage of users'
profiles over CAN platform and elaborates on the
proposed variant of the CF over CAN. Section 5
presents and analyzes the experimental results.
Finally, section 6 lists our conclusions and presents a
list of open questions for future research.
2 COLLABORATIVE FILTERING
Collaborative Filtering (CF) is probably one of the
most familiar and widely-used recommendation
techniques. The input for the CF is the so-called ratings
matrix, where each user is represented by a set of
ratings given on various items, and each item is
represented by a set of ratings given by the users.
CF requires a similarity metric between users to be
explicitly defined. The state-of-the-art CF systems
exploit three similarity metrics: Cosine Similarity,
Mean Squared Difference (MSD) (Pennock et al,
2000), and Pearson correlation (Sarwar et al, 2000).
This work focuses on the MSD, computing the
degree of similarity between users x and y by:
sim_{x,y} = \frac{1}{|xy|} \sum_{i=1}^{|xy|} (R_{x,i} - R_{y,i})^2    (1)

where |xy| denotes the number of items rated by both users, and R_{x,i} denotes the rating of user x on item i. In some sense, sim_{x,y} can be considered the dissimilarity of the users: the lower the result of the MSD computation, the greater the real similarity of the users.
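To make the metric concrete, equation (1) can be sketched in Python; the dict-based profile representation and the handling of users with no overlapping ratings are our own assumptions:

```python
def msd_similarity(ratings_x, ratings_y):
    """Mean Squared Difference (equation (1)) between two rating profiles.

    ratings_x, ratings_y: dicts mapping item id -> rating.
    Lower values mean more similar users, since MSD is a dissimilarity.
    """
    common = ratings_x.keys() & ratings_y.keys()  # items rated by both, |xy|
    if not common:
        return float("inf")  # no overlap: treat the users as maximally dissimilar
    return sum((ratings_x[i] - ratings_y[i]) ** 2 for i in common) / len(common)

# Example: two users sharing three rated items
u = {"item1": 4.0, "item2": 2.0, "item3": 5.0}
v = {"item1": 3.0, "item2": 2.0, "item3": 3.0}
print(msd_similarity(u, v))  # (1 + 0 + 4) / 3
```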
Prediction P_{a,j} for the rating of the user a on item j is computed as a weighted average of the ratings of his K most similar users, i.e., K nearest neighbors, by:

P_{a,j} = R'_a + \frac{\sum_{k=1}^{K} (R_{k,j} - R'_k) \cdot sim_{a,k}}{\sum_{k=1}^{K} |sim_{a,k}|}    (2)

where R_{x,i} denotes the rating of user x on item i, R'_z denotes the average rating of user z, and sim_{v,u} denotes the level of similarity between users v and u.
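Equation (2) can likewise be sketched in Python. Note that the formula assumes sim is a similarity weight (higher means more similar), so a raw MSD score would first have to be converted into such a weight; the tuple-based neighbor format and the fallback for an empty neighborhood are our own assumptions:

```python
def predict_rating(avg_a, neighbors):
    """Prediction P_{a,j} for the active user a on an item j (equation (2)).

    avg_a:     average rating R'_a of the active user.
    neighbors: (sim, rating_on_j, neighbor_avg) tuples for the K nearest
               neighbors that rated item j; sim is a similarity weight.
    """
    numerator = sum(sim * (r - avg_k) for sim, r, avg_k in neighbors)
    denominator = sum(abs(sim) for sim, _, _ in neighbors)
    if denominator == 0:
        return avg_a  # no usable neighbors: fall back to the user's own average
    return avg_a + numerator / denominator

# Two neighbors with similarity weights 0.8 and 0.4
print(predict_rating(3.0, [(0.8, 4.0, 3.5), (0.4, 2.0, 2.5)]))
```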
The Similarity Computation stage of the CF
requires comparing the active user with every other
user in the system. For a ratings matrix storing the
ratings of M users on N items, the computational
complexity of the Similarity Computation stage is
O(MN). This highlights the poor scalability of the
Similarity Computation stage, as the complexity
grows linearly with both the number of users and the
number of items in the matrix.
2.1 Reducing the Computational
Effort Required by the CF
Many works deal with decreasing the computational
effort required by the CF. In general, it is achieved
either by preprocessing the ratings matrix, or by
distributing the computationally intensive stages of
the CF among multiple machines.
Various preprocessing techniques for decreasing
the computational effort required by the CF (e.g.,
correlation coefficients, vector-based similarity, and
statistical Bayesian methods) are discussed and
analyzed in (Breese et al, 1998). Another technique,
exploiting pre-clustering of the ratings matrix, is
discussed in (Goldberg et al, 2001). There, principal
component analysis is used to find two
discriminative dimensions of the ratings matrix and
all the vectors are projected onto the resulting plane.
This inherently partitions the users to clusters or
neighborhoods, which are further used to generate
the recommendations. In (Chee et al, 2001), the
authors use a tree-like data structure and apply a
divide-and-conquer approach using an iterative K-
means clustering to group the users. This leads to
smaller and more homogeneous clustering of users
for the recommendations generation stage.
An alternative approach is to distribute the CF
and the required computational effort among the
users, such that every user independently computes
its similarity with the active user. This is initially
proposed in (Tveit, 2001) and elaborated in (Sarwar
et al, 2001). The latter also develops a detailed
taxonomy of the CF distribution approaches and
presents implementation frameworks for different
application domains. PocketLens project (Miller et
al, 2004) compares five decentralized distributed
architectures for the CF. The experiments show that
the performance of the decentralized mechanism is
similar to the performance of the centralized CF,
while it provides increased robustness and security.
Further improvements to the decentralized CF
are discussed in (Han et al, 2004), which proposes a
Peer-to-Peer platform for decentralized management
of user profiles. However, it approximates the
identification of the most similar users, and
therefore, the accuracy of the prediction is reduced.
This work is loosely based on the ideas of CAN (Ratnasamy et al, 2001), a content-addressable Peer-to-Peer platform. We implement a fast heuristic
variant of the CF, using CAN-like multi-dimensional
space for maintaining connectivity of similar users.
This allows us to significantly decrease the
computational effort required by the Similarity
Computation and Neighborhood Formation stages
through limiting the search process to a search
among highly similar users only.
3 CONTENT-ADDRESSABLE
DATA MANAGEMENT
This section presents the general architecture of CAN (Ratnasamy et al, 2001), a scalable decentralized data management platform. In CAN, the users are
represented by nodes in virtual N-dimensional
coordinate space. Each node maintains an N-
dimensional subspace, called a zone. For example,
consider a 2-dimensional space partitioned into 3
zones, maintained by the users A, B, and C (figure 1-
left).
Figure 1: 2-Dimensional CAN Space.
Two nodes are called neighbors if their
coordinate spans overlap along N-1 dimensions and
adjoin along one dimension (e.g., nodes A and B in
figure 1-left). To maintain connectivity, each node
maintains a table of pointers to its neighbors. CAN's
routing algorithm greedily forwards messages to the
nodes that are closer to the target node than the
current node (the distance metric exploited is the
discrepancy in the address space). Thus, the
messages are routed between any pair of CAN nodes
in a logarithmic number of hops.
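The greedy forwarding rule can be illustrated with a small Python sketch; the toy 1-dimensional topology, the use of zone centre points, and the squared-distance metric are our own simplifications of CAN's address-space discrepancy:

```python
def route(zones, links, start, target):
    """Greedy CAN-style routing sketch: repeatedly forward the message to
    the neighbor whose zone lies closest to the target coordinate."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    path = [start]
    current = start
    while True:
        # the neighbor closest to the target, if any neighbor exists
        best = min(links[current], key=lambda n: dist(zones[n], target),
                   default=None)
        if best is None or dist(zones[best], target) >= dist(zones[current], target):
            return path  # no neighbor is closer: the current zone owns the target
        current = best
        path.append(current)

# Three zones on a line, each node knowing only its adjacent neighbors
zones = {"A": (0.25,), "B": (0.5,), "C": (0.75,)}
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(route(zones, links, "A", (0.8,)))  # ['A', 'B', 'C']
```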
Also, CAN provides connectivity maintenance
algorithm, stable to sporadic joins and departures of
nodes. When a new node is inserted, it must be
given its own zone. This is done by splitting a zone
of one of the existing neighbors according to the
following steps: (1) the new node finds an existing
network node, (2) the new node is routed to the
target zone that will be split, and (3) the target zone
is split and the neighbors of the new zone are
updated to maintain connectivity and facilitate
routings. Note, that only a subset of neighbors of the
zone that was split is affected by the insertion of a
new node.
The issue of splitting the target zone (i.e., how to
split the existing zone, where the new node was
mapped to) is one of the important issues affecting
the performance of CAN. A number of policies are
proposed, analyzed and compared in (Ratnasamy et
al, 2001). The simplest policy for zone splitting is the so-called ordered splitting. According to this policy, the index of the dimension across which a zone is split iteratively increases from 1 to N.
For example, consider a node D joining CAN
space (figure 1-middle). Assuming that the zone of a
node C will be split, D is routed to C, and the zone is
split across the horizontal dimension (i.e., the next
split of the zones C or D will be performed across
the vertical dimension and so forth). Finally, D
notifies its neighbors, i.e., the nodes B and C, about
the new node and the neighbors' pointers are
updated. Note that in this case, only the zone that
was split (C), and part of its neighbors (only B) are
affected by the join of a node D, whereas other
nodes are not affected.
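The ordered splitting policy of the join procedure can be sketched as follows; representing a zone by its (mins, maxs) bounds and tracking the split depth explicitly are our own modelling choices:

```python
def split_zone(zone, depth, num_dims):
    """Halve a zone across dimension (depth % num_dims): the dimensions are
    used in a fixed cyclic order, as in ordered splitting.

    zone: (mins, maxs), two tuples bounding the zone in every dimension.
    Returns the two resulting half-zones; a joining node takes one of them.
    """
    mins, maxs = zone
    d = depth % num_dims
    mid = (mins[d] + maxs[d]) / 2
    lower = (mins, maxs[:d] + (mid,) + maxs[d + 1:])
    upper = (mins[:d] + (mid,) + mins[d + 1:], maxs)
    return lower, upper

# Splitting the unit square twice: across dimension 0, then dimension 1
whole = ((0.0, 0.0), (1.0, 1.0))
left, right = split_zone(whole, 0, 2)
print(left)    # ((0.0, 0.0), (0.5, 1.0))
bottom, top = split_zone(left, 1, 2)
print(bottom)  # ((0.0, 0.0), (0.5, 0.5))
```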
Disconnections of nodes are handled in a similar
manner. Disconnecting node finds a neighbor node
that will take the responsibility for its zone, and
updates other neighbors about the departure. For
example, consider node B disconnecting (figure 1-
right), and assume node D taking the responsibility
for the zone previously managed by B.
Thus, CAN provides a decentralized platform,
supporting (1) dynamic space partitioning and zones
allocation, (2) efficient routing algorithm, and (3)
connectivity maintenance algorithm over virtual N-dimensional coordinate space. Note that the distributed structure of CAN is not robust against sudden departures of nodes, as fault-tolerance is not one of the main goals of the platform. However, CAN facilitates a completely decentralized, self-manageable platform for content-addressable data management in distributed environments.
4 CF OVER CONTENT-
ADDRESSABLE SPACE
This work proposes a fast heuristic variant of the CF
algorithm. It uses content addressing architecture for
the purposes of optimizing traditional exhaustive
search to a search among highly similar users only.
Although our algorithm is heuristic by nature,
experimental results demonstrate that it facilitates
efficient search process without hampering the
accuracy of the generated recommendations.
4.1 Mapping User Profiles
The input for the CF algorithm is a matrix of users'
ratings on items, where each row (vector) represents
the ratings of a single user and each column
represents the ratings on a single item. The total
number of items (N) defines an N-dimensional
space, where the coordinate range in each dimension corresponds to the range of ratings on the respective item.
To handle the ratings matrix in content-
addressable manner, we map it to CAN-like space.
Each rating is projected using uniform injective
mapping onto the appropriate dimension, such that
the whole vector of length N is mapped to a single
point in N-dimensional space. Thus, every user is
represented in the space by a single node (whose
location corresponds to the set of ratings given by the
user), and the respective zone (storing a list of
neighbor zones). Users (through their ratings
vectors) can be dynamically inserted and removed,
since connectivity maintenance algorithm guarantees
that the structure remains connected regardless of
joins and disconnections of the nodes.
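A minimal sketch of such a uniform injective mapping, assuming a fixed per-item rating range (here [-10, 10], as in the Jester dataset used later) scaled into a unit cube:

```python
def map_profile(ratings, lo=-10.0, hi=10.0):
    """Map a complete ratings vector to a point in the unit N-cube.

    Each rating is scaled linearly from [lo, hi] to [0, 1] along its own
    dimension; the mapping is injective and preserves MSD-style closeness
    up to the constant scaling factor.
    """
    return tuple((r - lo) / (hi - lo) for r in ratings)

print(map_profile([-10.0, 0.0, 10.0]))  # (0.0, 0.5, 1.0)
```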
Deciding on the zones split policy affects the
evolving structure of the ratings vectors. In our
implementation, we used the above mentioned
ordered splitting policy. This policy may be sub-
optimal in terms of the number of neighbor zones,
resulting in a less efficient algorithm, i.e., more
comparisons or finding less similar neighbors.
However, our experimental results indicate that the
use of this simple policy considerably increases the
efficiency of KNN, in comparison with the traditional
exhaustive search. Evaluating other splitting policies
is beyond the scope of this work.
In addition to the guaranteed connectivity,
content-addressable space inherently clusters similar
users. Thus, the distance between two similar users
(in our case, according to the MSD similarity metric)
is lower than the distance between two arbitrary
users. This is achieved through the use of injective
mapping, preserving users' similarity while inserting
the ratings vectors into the content-addressable
space. The next subsection shows a use of the
inherent clustering for the purpose of developing a fast
heuristic variant of the CF algorithm.
4.2 Heuristic Neighbors Search
The Neighborhood Formation stage of the CF over
the evolving N-dimensional space can be
schematically described as heuristically expanding
breadth-first search. The algorithm for finding K
Nearest Neighbors (KNN) of a user x is briefly
explained by the following pseudo-code. It uses two
lists of size K: (1) CANDIDATES – list of candidates
for nearest neighbors, and (2) KNN – list of real
nearest neighbors. In principle, the algorithm needs only the CANDIDATES list, as the KNN list is only appended to and never consulted during the search. For the sake of clarity, we show an algorithm that uses both lists.
K_Nearest_Neighbors (user x)
(1) let KNN and CANDIDATES be lists of
size K, initially empty
(2) map(x) into the CAN space
(3) foreach u ∈ {map(x)} ∪ neighbors(map(x))
(4) compute distance(x,u)
(5) insert u into CANDIDATES, s.t.
CANDIDATES is sorted by distances
(6) for i=1 to K
(7) choose v from CANDIDATES,
s.t. distance(x,v) is smallest
(8) for each w in neighbors(v) s.t.
distance(x,w) is not computed yet
(9) compute distance(x,w)
(10) insert w into CANDIDATES, s.t.
it remains sorted by distances
(11) move v from CANDIDATES to KNN
(12)return KNN
Initially, the algorithm virtually maps the active
user x to its location in the N-dimensional space
(step 2). Next, the algorithm identifies the zone x is
mapped to, and its neighbors, i.e., users managing
the neighbor zones (step 3). For each of these zones,
the degree of similarity, i.e., the distance between x
and the relevant node, is computed. The neighbor
node is inserted into the CANDIDATES list such that
the candidate nodes are sorted according to their
distances from the active user x (steps 4 and 5).
Then the algorithm iteratively performs the
following operations:
- selects v, the nearest neighbor stored in the CANDIDATES list (step 7);
- identifies the neighbors of v that are not in the CANDIDATES list yet, computes their distances from x, and inserts them into CANDIDATES, while keeping the list sorted (steps 8, 9, and 10);
- removes v from the CANDIDATES list and inserts it into the KNN list (step 11).
Finally, the algorithm returns the resulting KNN list
(step 12).
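The pseudo-code above can be turned into runnable Python over an abstract neighbor graph; the graph and distance callbacks are our own stand-ins for the CAN zone structure, and a min-heap replaces the sorted CANDIDATES list:

```python
import heapq

def knn_search(x, graph, entry, distance, k):
    """Heuristic K-nearest-neighbor search mirroring the pseudo-code above.

    x:        the active user's profile (whatever `distance` accepts)
    graph:    dict mapping each node to the nodes managing its neighbor zones
    entry:    the node managing the zone the active user maps to
    distance: distance(x, node) -> float, e.g. an MSD computation
    """
    candidates = [(distance(x, entry), entry)]  # min-heap: the CANDIDATES list
    seen = {entry}
    knn = []
    while candidates and len(knn) < k:
        _, v = heapq.heappop(candidates)   # nearest remaining candidate
        knn.append(v)                      # move v from CANDIDATES to KNN
        for w in graph[v]:                 # expand v's neighbor zones
            if w not in seen:
                seen.add(w)
                heapq.heappush(candidates, (distance(x, w), w))
    return knn

# Toy example: four nodes on a line, linked as a chain of neighbor zones
positions = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
dist = lambda x, n: (positions[n] - x) ** 2
print(knn_search(2.1, graph, "b", dist, 2))  # ['b', 'c']
```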
Consider an example execution of the KNN
search as illustrated in figure 2. The initial structure
of 2-dimensional space is depicted in figure 2a.
Assume that the active user is mapped to the zone e.
Figure 2: Stages of KNN Search over 2-D CAN Space.
Thus, e and its neighbors, i.e., nodes managing
the zones c, d, f, and i, are the first candidates for being the nearest neighbors, and they are added to the
CANDIDATES list. Assume that the node managing
the zone e is the closest one. It is moved from the
CANDIDATES list to the KNN list (figure 2a). Since
all the neighbors of e are already known, the next
closest neighbor is chosen among its neighbors.
Assume that the next closest neighbor is the node
managing the zone f. It is moved from the
CANDIDATES list to the KNN list, and its only new
neighbor, node managing g, is added to the
CANDIDATES list (figure 2b). The next closest
neighbor is from the zone c, adding the node
managing the zone b to the CANDIDATES list
(figure 2c). Assume that the next closest neighbor is
the node managing g (not a neighbor of e). It adds
the node managing the zone h to the CANDIDATES
list (figure 2d). This process is iteratively repeated
until the KNN list contains K nearest neighbors.
The proposed algorithm reduces the
computational effort required by the Similarity
Computation and the Neighborhood Formation
stages, in comparison with the traditional CF
algorithm, where an active user is compared with all
the available users. Conversely, the proposed
heuristic algorithm compares the active user with
potentially similar users only.
Since every user in the N-dimensional space
continuously maintains an updated list of neighbors,
any neighbor of a given user can be accessed
through a single network hop. This is true regardless
of the distance between the neighbors. Thus, the
algorithm will also work in sparse spaces, where the
distance between neighbors in the underlying
network might be very high.
5 EXPERIMENTAL RESULTS
In the experimental part of our work we used the Jester dataset of jokes (Goldberg et al, 2001). Jester is a Web-based joke RS containing 4.1 million ratings (from -10.00 to +10.00) given by 73,421 users on 100 jokes. We chose a subset of 14,192 users that rated all 100 jokes, to get a dense matrix of complete vectors where every value corresponds to an actual rating. We implemented a centralized simulation of a 100-dimensional space (the space dimension equals the number of rated jokes) and inserted the above 14,192 users into the space. Insertions of the users were done using the ordered splitting policy.
5.1 Scalability
These experiments were designed to evaluate the
scalability of the proposed variant of KNN. The
efficiency of CAN-based KNN is measured by the
number of comparisons performed during the
Neighborhood Formation stage of the CF.
In the first experiment we measured the number of
comparisons performed during the search. We
gradually increased the number of users (M) in the
system from M=1,000 to M=14,000. For each M, we
computed the number of comparisons performed in
the traditional exhaustive KNN search and in CAN-
based variant of KNN. Both searches aimed to find
K=5 nearest neighbors. For each value of M, the
experiments were repeated 1,000 times for different
active users. The results are shown in Figure 3.
Figure 3: Average Number of Comparisons vs. M.
As expected, the number of comparisons in
CAN-based KNN is significantly lower than in
traditional KNN and it grows in a logarithmic-like
manner with the number of users. This is explained
by the fact that in CAN-based KNN the active user
is compared only with a subset of highly similar
users (located in a close vicinity in content-
addressable space), whereas in traditional KNN it is
exhaustively compared with all other users.
To achieve a better understanding of
comparison-based scalability of the proposed
approach, we computed the ratio between the
number of comparisons in CAN-based KNN and the
number of comparisons in traditional (exhaustive)
KNN. This ratio was computed for different values
of M. The results show that the ratio steadily
decreases with M. This allows us to conclude that
the proposed algorithm is applicable in large-scale
systems with a high number of users and items, e.g.,
on the Web.
The second experiment evaluated the scalability
of CAN-based KNN with the number of nearest
neighbors (K) to be found. We gradually increased
the value of K from K=1 to K=50. For each value of
K, we measured the number of comparisons needed
to find K nearest neighbors for M=1,000, 2,000,
4,000, 8,000, and 14,000 users. For each value of M
and K, the experiments were repeated 1,000 times
for different active users. The number of
comparisons as a function of K for the above values
of M is shown in figure 4.
As can be clearly seen, the number of
comparisons in CAN-based KNN remains roughly
unchanged when K increases. This is explained by
the observation that most of the KNN users are
located in a close vicinity to the active user (this
characterizes a real-life naturally clustered data).
Thus, the similar users are discovered in the early
stages of the KNN search, while further expansions
contribute very few new similar users.
Figure 4: Average Number of Comparisons vs. K.
Both experiments show good scalability of CAN-
based KNN with K. This means that practical RSs
can use higher values of K, form a larger and more
reliable neighborhood, and generate more accurate
predictions with only a very minor computational
overhead.
5.2 Accuracy
The following experiments were designed to
evaluate the accuracy of the results obtained by the
proposed heuristic variant of KNN. In the first
experiment we compared the sets of users, i.e., the
neighborhoods, found by the traditional (exhaustive)
KNN and by CAN-based variant of KNN.
Let us denote by KNN_e the set of users found by the traditional exhaustive KNN search and by KNN_h the set of users found by the CAN-based heuristic variant of KNN. Since CAN-based KNN is a heuristic approach, a sub-optimal structure of zones may lead to KNN_e ≠ KNN_h. As predictions are produced by aggregating the ratings of similar users, identifying the most similar users is critical for producing an accurate prediction. Thus, we define the accuracy of CAN-based search by:

accuracy = \frac{|KNN_e \cap KNN_h|}{|KNN_e|}    (3)

The cardinality of the KNN_e set was K=10, while the cardinality of the KNN_h set was gradually increased from K'=1 to K'=100. The accuracy was computed for M=1,000, 2,000, 4,000, 8,000 and 14,000 users. For each value of M and K', the experiments were repeated 1,000 times for different active users. Figure 5 shows the accuracy as a function of K' for the above values of M.
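The accuracy measure of equation (3), i.e., the fraction of the exhaustively found neighbors that the heuristic search also recovers, is straightforward to compute; a small set-based sketch (our own formulation):

```python
def neighborhood_accuracy(knn_exhaustive, knn_heuristic):
    """Fraction of the exhaustively-found nearest neighbors that the
    heuristic search also recovers (equation (3))."""
    e, h = set(knn_exhaustive), set(knn_heuristic)
    return len(e & h) / len(e)

# The heuristic recovered 2 of the 4 true nearest neighbors
print(neighborhood_accuracy(["u1", "u2", "u3", "u4"], ["u2", "u4", "u9"]))  # 0.5
```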
Figure 5: Precision of CAN-based KNN.
As can be clearly seen, the curves behave
similarly and the accuracy increases with K', such
that for K'>50 it is over 0.9 for all the given values
of M. Previous experiments show that the algorithm
is highly scalable with K. Thus, retrieving a larger
set of users (i.e., higher values of K') leads to a
minor increase in the computational overhead.
Hence, it is reasonable to increase the number of
neighbors found by CAN-based search in order to
achieve higher accuracy and to generate better
recommendations.
We evaluated the quality of the neighborhood
found by CAN-based search by computing the
average similarity between the nearest neighbors and
the active user. This was compared to the average
similarity of the neighborhood found in an exhaustive manner. In addition, we evaluated the accuracy of the recommendations through the well-known Mean Absolute Error (MAE) metric (Herlocker et al, 1999):

MAE = \frac{1}{N} \sum_{i=1}^{N} |p_i - r_i|    (4)

where N denotes the number of predicted items, p_i is the predicted rating, and r_i is the real rating on item i.
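The MAE of equation (4) in Python (the list-based helper is our own formulation):

```python
def mean_absolute_error(predicted, actual):
    """Mean Absolute Error (equation (4)) over parallel rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("expected two non-empty lists of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)

print(mean_absolute_error([3.5, 2.0, 4.0], [3.0, 2.0, 5.0]))  # (0.5 + 0 + 1) / 3
```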
We gradually increased the number of users from
M=1,000 to M=14,000. For each value of M, we
compared the average similarity of heuristically
found neighbors with the average similarity of
exhaustively found neighbors for K=K'=10. We
also generated the recommendations based on both
heuristically and exhaustively found neighborhoods.
For each value of M, the above experiments were
repeated 1,000 times for different active users. The
results of the average similarity comparison are
shown in figure 6, while the MAE computation results
are shown in figure 7.
Figure 6: Average Similarity vs. M.
Figure 7: MAE of the Recommendations vs. M.
Although both the similarity and MAE of CAN-
based search are higher (i.e., the neighbors are more
dissimilar and the accuracy is lower), the curves are
very close and the results are quite similar. Average
deviation of the similarities is 2.93% and of the
MAEs is only 0.38%. This allows us to conclude
that the proposed heuristic algorithm succeeds in
producing both accurate neighborhoods and
recommendations.
6 CONCLUSIONS AND FUTURE
RESEARCH
One of the major drawbacks of the state-of-the-art
CF implementations is their high computational
complexity, which grows linearly both with the
number of users and items in the system. In this
work we propose to heuristically decrease the
required computational effort through implementing
the CF over content-addressable CAN-like N-
dimensional space.
Experiments conducted over the Jester dataset of joke ratings show that the proposed heuristic algorithm outperforms the traditional exhaustive KNN search. Our
algorithm decreases the number of required
comparisons, while the ratio between the numbers of
comparisons steadily decreases with the number of
users. For example, for 14,000 users the number of
comparisons was decreased by almost an order of
magnitude (by 87%). Another experiment shows that
the number of comparisons roughly remains
unchanged when K increases. This allows us to
increase the number of nearest neighbors to be
retrieved (and to improve the accuracy of the
prediction) with a very minor computational
overhead.
In the accuracy experiments we compare the
neighborhoods and the predictions found by CAN-
based KNN and by the traditional KNN. The found
neighborhoods are similar and the recommendations
are very close, which indicates the high accuracy of
the proposed algorithm. In summary, comparing the
proposed heuristic KNN search with traditional
exhaustive search shows that our algorithm achieves
high accuracy (similar to the accuracy of the
traditional exhaustive search), while significantly
decreasing the required computational effort.
In this work, we assumed that users' ratings on
all the items are available. Thus, the mapping of the
ratings vectors to CAN space is straightforward.
However, this is unachievable in many real-life
scenarios, where an average user rates only a small
portion of the available items. In the future, we plan
to study CAN-based management of incomplete
vectors, where part of the ratings is missing. Using
statistical methods to complete the vectors through
predicting the missing ratings might be a promising
research direction.
In addition to decreasing the computational
effort, our algorithm can naturally be extended to
distribute it among multiple users. In traditional
implementations of the CF, the Similarity
Computation and the Neighborhood Formation
stages are performed in a single central location.
However, as the underlying CAN platform is
originally a distributed Peer-to-Peer platform, it
inherently allows distributed and fully decentralized
storage of the ratings matrix. In the future, we plan to
implement a distributed variant of the algorithm and
to investigate the distribution issues.
The current work is limited to the Mean Squared
Difference similarity metric, since the injective
mapping to a multi-dimensional CAN-like space
inherently supports it. However, for other metrics,
such as Cosine Similarity or Pearson correlation,
CAN space might be inappropriate and new types of
topologies and the respective mappings should be
developed. We plan to study other metrics and to
produce a general framework for efficient heuristic
Collaborative Filtering.
REFERENCES
Aguzzoli, S., Avesani, P., Massa, P., 2002, Collaborative
Case-Based Recommender System, in proceedings of
the ECCBR Conference.
Breese, J., Heckerman, D., Kadie, C., 1998, Empirical
Analysis of Predictive Algorithms for Collaborative
Filtering, in proceedings of the UAI Conference.
Chee, S.H.S., Han, J., Wang, K., 2001, RecTree: An
Efficient Collaborative Filtering Method, in
proceedings of the DaWaK Conference.
Goldberg, K., Roeder, T., Gupta, D., Perkins, C., 2001,
Eigentaste: A Constant Time Collaborative Filtering
Algorithm, in Information Retrieval Journal, vol.
4(2).
Good, N., Schafer, J.B., Konstan, J.A., Borchers, A.,
Sarwar, B., Herlocker, J., Riedl, J., 1999, Combining
Collaborative Filtering with Personal Agents for
Better Recommendations, in proceedings of the AAAI
Conference.
Han, P., Xie, B., Yang, F., Shen, R., 2004, A Scalable P2P
Recommender System Based on Distributed
Collaborative Filtering, in Expert Systems with
Applications Journal, vol. 27(2).
Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.,
1999, An Algorithmic Framework for Performing
Collaborative Filtering, in proceedings of the SIGIR
Conference.
Miller, B.N., Konstan, J.A., Riedl, J., 2004, PocketLens:
Toward a Personal Recommender System, in ACM
Transactions on Information Systems, vol. 22(3).
Pennock, D.M., Horvitz, E., Giles, C.L., 2000, Social
Choice Theory and Recommender Systems: Analysis
of the Axiomatic Foundations of Collaborative
Filtering, in proceedings of the AAAI Conference.
Ratnasamy, S., Francis, P., Handley, M., Karp, R.,
Shenker, S., 2001, A Scalable Content-Addressable
Network, in proceedings of the SIGCOMM
Conference.
Resnick, P., Varian, H.R., 1997, Recommender Systems, in
Communications of the ACM, vol. 40(3).
Sarwar, B., Karypis, G., Konstan, J., Riedl, J., 2000,
Analysis of Recommendation Algorithms for E-
Commerce, in proceedings of the EC Conference.
Sarwar, B.M., Konstan, J.A., Riedl, J., 2001, Distributed
Recommender Systems: New Opportunities for
Internet Commerce, in “Internet Commerce and
Software Agents: Cases, Technologies and
Opportunities”, Idea Group Publishers.
Shardanand, U., Maes, P., 1995, Social Information
Filtering: Algorithms for Automating "Word of Mouth",
in proceedings of the CHI Conference.
Tveit, A., 2001, Peer-to-Peer Based Recommendations
for Mobile Commerce, in proceedings of the WMC
Workshop.