3. to normalize user ratings by “deviation from the
mean”.
4. to use the TopN best neighbours (highest similar-
ity to the test user) for neighbourhood selection.
The potential best range of neighbours (N) was
found to be between 20 and 60.
5. to weight neighbour contributions when forming
predictions.
More recently, work has shown that a combination of a
mean-square difference metric and the Jaccard coefficient
between users outperforms the commonly-used approach on
the Movielens and Netflix datasets (Bobadilla et al., 2010).
Two parameters from Herlocker et al.’s work - similarity
between users and normalisation of user ratings - were
re-evaluated using the Movielens and Netflix datasets
(Howe and Forbes, 2008). It was found that Pearson
correlation is not necessarily the best similarity metric
to use and that different parameterisations work better
for different datasets.
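As an illustration of how a mean-square difference metric and the Jaccard coefficient might be combined, the sketch below multiplies the Jaccard coefficient of two users' rated-item sets by one minus the mean-square difference of their normalised co-ratings. The product form, the function name `jaccard_msd_similarity` and the 1 to 5 rating scale are assumptions made for this example; this section does not state the exact formulation used by Bobadilla et al.

```python
def jaccard_msd_similarity(ratings_a, ratings_b, r_min=1.0, r_max=5.0):
    """Illustrative combination: Jaccard * (1 - MSD).

    ratings_a, ratings_b: dicts mapping item id -> rating.
    r_min, r_max: assumed bounds of the rating scale, used to
    normalise rating differences into the range [0, 1].
    """
    common = ratings_a.keys() & ratings_b.keys()
    union = ratings_a.keys() | ratings_b.keys()
    if not common:
        return 0.0  # no co-rated items: treat the users as dissimilar

    # Proportion of all rated items that both users have rated.
    jaccard = len(common) / len(union)

    # Mean-square difference over co-rated items, normalised to [0, 1].
    span = r_max - r_min
    msd = sum(((ratings_a[i] - ratings_b[i]) / span) ** 2
              for i in common) / len(common)

    return jaccard * (1.0 - msd)
```

Under these assumptions the score lies in [0, 1]: it rewards users who have rated many of the same items (Jaccard) and who gave those items similar ratings (low mean-square difference).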
Some work has applied genetic algorithms in the
collaborative filtering domain. However, the approaches
which learn per user are very computationally expensive.
Hwang uses a genetic algorithm to learn, for each user, an
optimal weighting scheme for the collaborative filtering
system (Hwang, 2010). Both collaborative and inferred
content information are used. In comparison to a
traditional collaborative filtering approach, the genetic
algorithm approach showed improvements in precision,
recall and the F1 measure. Ko and Lee
first classify items into groups using a Bayesian clas-
sifier to reduce the dimensions of the space. A genetic
algorithm is used to cluster users in this new space
(Ko and Lee, 2002). Ujjin and Bentley use a genetic
algorithm to find the best “profile” that describes each
user in the dataset (Ujjin and Bentley, 2002). The
Movielens dataset is used, and 22 features drawn from the
movie ratings and the user and movie details are used to
create a profile for each user. The weights for each
feature are evolved, per user, using a genetic algorithm.
Similarity is then found between profiles.
3 METHODOLOGY AND TEST SETS
The collaborative filtering technique used is a standard
neighbourhood-based test approach where a portion of
users are chosen as the test users (10%) and a
portion of their items are withheld as test items (up
to 10%). The task is to generate predictions for the
withheld test items for the test users. Using a similar-
ity function, users similar to the test users are found
(their neighbours). Deviation from the mean is used
to normalise user ratings. Similarity scores between
users are “dampened” if the number of items co-rated
by two users is below a certain significance threshold.
Predictions for the test items are calculated using a
function of the neighbours’ ratings for the test items,
the neighbours’ similarity scores with the test user, the
neighbours’ mean ratings and the test user’s mean rating.
The accuracy of the predictions is calculated from the
predicted ratings produced by the system and the actual
ratings given to the test items in the withheld set, using
mean absolute error (MAE).
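The steps in this paragraph can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function names are hypothetical, the dampening is assumed to scale the similarity linearly by (co-rated items / threshold), and the prediction is assumed to take the common form of the test user's mean plus a similarity-weighted average of the neighbours' deviations from their own means.

```python
def mean_rating(ratings):
    """Mean of a user's ratings; ratings maps item id -> rating."""
    return sum(ratings.values()) / len(ratings)


def dampen(similarity, n_corated, sig_t):
    """Reduce a similarity score when few items are co-rated.

    Assumed form: scale by n_corated / sig_t below the threshold.
    With sig_t == 0 the condition never holds, so nothing is dampened.
    """
    if n_corated < sig_t:
        return similarity * n_corated / sig_t
    return similarity


def predict(test_user_mean, neighbours, item):
    """Predict the test user's rating for one withheld item.

    neighbours: list of (similarity, ratings dict) pairs.
    Each neighbour contributes its deviation from its own mean,
    weighted by its similarity to the test user.
    """
    numerator = 0.0
    denominator = 0.0
    for similarity, ratings in neighbours:
        if item not in ratings:
            continue  # this neighbour cannot contribute to this item
        numerator += similarity * (ratings[item] - mean_rating(ratings))
        denominator += abs(similarity)
    if denominator == 0.0:
        return test_user_mean  # fall back to the test user's mean
    return test_user_mean + numerator / denominator


def mean_absolute_error(predicted, actual):
    """MAE between predicted and actual ratings (equal-length lists)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

For example, a test user with mean rating 3.0 and two neighbours, one with similarity 0.8 who rated the item one point above their own mean and one with similarity 0.4 who rated it one point below, receives a prediction of 3.0 + (0.8 - 0.4) / 1.2, slightly above the test user's mean.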
For the genetic algorithm, the parameters chosen are based
on a subset of those tested by Herlocker et al. (2002).
The flow of control of the genetic algorithm experiment is
as follows:
For each of 20 generations:
1. Pick test users and test items. A new set of test
users and items is picked for each new generation
to avoid over-fitting.
2. Randomly generate a population of individuals, of
a fixed size (size is 50 in these experiments).
3. Calculate the fitness of each individual by set-
ting all of the collaborative filtering parameters to
the values indicated in the individual and running
the collaborative filtering component. The aver-
age MAE is calculated and returned as the fitness
score of the individual. The genetic algorithm for
this experiment is required to minimise the fitness
score.
4. Apply the genetic algorithm operators of crossover,
mutation and selection. The crossover operator used
is single point crossover and the crossover rate is
80%. The mutation rate is set at 5%. The selection
operator used is roulette wheel selection.
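The operators in step 4 can be sketched as below. This is an illustrative sketch under stated assumptions, not the experiment's code: the function names are hypothetical, only the two genes whose ranges are given in this section (sigT and sim) are included, and because the fitness (MAE) is minimised, the roulette wheel is assumed to weight each individual by the inverse of its fitness. The fitness values themselves would come from running the collaborative filtering component, which is not shown.

```python
import random

# Gene ranges for sigT (0..100) and sim (0..2); further genes omitted.
GENE_RANGES = [(0, 100), (0, 2)]
CROSSOVER_RATE = 0.8   # 80% crossover rate
MUTATION_RATE = 0.05   # 5% mutation rate, applied per gene (assumption)


def random_individual(rng):
    """One chromosome: an integer per gene, drawn within its range."""
    return [rng.randint(lo, hi) for lo, hi in GENE_RANGES]


def roulette_select(population, fitnesses, rng):
    """Roulette wheel selection for a minimisation problem.

    Assumption: selection weight is 1 / fitness, so a lower MAE
    gives a larger slice of the wheel.
    """
    weights = [1.0 / (f + 1e-9) for f in fitnesses]
    return rng.choices(population, weights=weights, k=1)[0]


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at one random cut point."""
    if len(parent_a) < 2 or rng.random() >= CROSSOVER_RATE:
        return parent_a[:], parent_b[:]  # no crossover: copy parents
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(individual, rng):
    """Redraw each gene within its range with probability MUTATION_RATE."""
    return [rng.randint(lo, hi) if rng.random() < MUTATION_RATE else gene
            for gene, (lo, hi) in zip(individual, GENE_RANGES)]


def next_population(population, fitnesses, rng):
    """Build a new population of the same size from the current one."""
    children = []
    while len(children) < len(population):
        a = roulette_select(population, fitnesses, rng)
        b = roulette_select(population, fitnesses, rng)
        for child in single_point_crossover(a, b, rng):
            children.append(mutate(child, rng))
    return children[:len(population)]
```

Passing a seeded `random.Random` instance as `rng` keeps a run reproducible, which helps when comparing parameter settings across experiments.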
The parameters tested per position in the chromo-
some are:
• sigT, the significance threshold, which is an inte-
ger in the range 0 to 100. This is used when cal-
culating the similarity between users to dampen
the similarity between two users if the number
of co-rated items between the users is less than
this threshold. The dampening used is that already
outlined from the work by Herlocker et al. (2002).
A limit of 100 was chosen as it does not
seem reasonable to dampen a similarity score if
the number of co-rated items is greater than 100.
• sim, the similarity option, which is an integer
value in the range 0 to 2. This indicates which