2 HADOOP FRAMEWORK
Hadoop MapReduce is a programming model and
software framework that supports data-intensive dis-
tributed applications. This Apache project is an open-
source framework for reliable, scalable, distributed
computing and data storage. It can rapidly process
vast amounts of data in parallel on large clusters of
computer nodes. Hadoop MapReduce was inspired
by Google’s MapReduce (Lämmel, 2007) and Google
File System (GFS) (Ghemawat et al., 2003) papers.
MapReduce is based on the observation that many
tasks have the same structure: a large number of
records (e.g., documents or database records) is se-
quentially processed, generating partial results which
are then aggregated to obtain the final outcome. Of
course, the per-record computation and aggregation
vary by task, but the fundamental structure remains
the same (Elsayed et al., 2008). MapReduce provides
an abstraction layer which simplifies the development
of these data-intensive applications by defining a map
and a reduce operation with the following signatures:
map : (k_x, v_x) → [k_y, v_y]    (1)
The map operation is applied to every input record,
which has the data structure of a key-value pair. This
mapper generates an arbitrary number of key-value
pairs as an intermediate result (indicated in equation
1 by the square brackets). Afterwards, these inter-
mediate results are grouped based on their key. The
reducer gets all values associated with the same in-
termediate key as an input and generates an arbitrary
number of key-value pairs.
reduce : (k_y, [v_y]) → [k_z, v_z]    (2)
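To make these signatures concrete, the following minimal Hadoop sketch implements the canonical word-count task; the class names, the whitespace tokenization, and the assumption that the input arrives as (byte offset, text line) pairs are illustrative choices of ours, not details taken from the framework description above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map : (k_x, v_x) -> [k_y, v_y], here (offset, line) -> [(word, 1)]
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // an arbitrary number of intermediate pairs
        }
    }
}

// reduce : (k_y, [v_y]) -> [k_z, v_z], here (word, [1, 1, ...]) -> (word, count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

Between the two phases, the framework groups and sorts all intermediate pairs by key, so each reducer call receives the complete list of values emitted for one word.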
3 CONTENT CHARACTERIZATION
Many CB recommendation algorithms are based on
relevant semantic metadata describing the content
items of the system. However, many online systems
do not have structured metadata at their disposal, forcing
them to rely on textual descriptions of the content. Therefore,
the proposed MapReduce operations, used for calculating
item similarities or recommendations, depend only
on such a set of textual documents describing
the content items of the system. To handle these
content descriptions, the documents are transformed
into characterizing terms and a vector of term weights
w_t, which indicate the relevance of each term t for the
item.
To identify these terms t and calculate the term
weights w_t, we adopted the Term Frequency - Inverse
Document Frequency (TFIDF) (Salton and McGill,
1983) weighting scheme. Although the ordering of
terms (i.e. phrases) is ignored in this model, it has
proven effective in the context of information retrieval
and text mining (Elsayed et al., 2008). The
TFIDF weight can be obtained by calculating the frequency
of each word in each document and the frequency of
each word in the document corpus. The frequency of
a word in a document is defined as the ratio of the
number of times the word appears in the document,
n, to the total number of words in the document,
N. The frequency of a word in the document corpus
is defined as the ratio of the number of documents that
contain the word, m, to the total number of documents
in the corpus, D.
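Combining these two frequencies yields the term weight. The exact combination is not spelled out in this section, but the standard TFIDF formulation (Salton and McGill, 1983), with the usual logarithm in the inverse document frequency, would read:

\mathrm{tf}(t,d) = \frac{n}{N}, \qquad
\mathrm{idf}(t) = \log\frac{D}{m}, \qquad
w_t = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)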
To calculate the term weights w_t of an item de-
scription as TFIDF, the following four MapReduce
jobs are executed. The first job calculates the num-
ber of times each word appears in a description, n.
Therefore, the map operation of this job takes the item
identifier (i.e. id) as input key and the content of the
description as input value. For every word in the de-
scription, a new key-value pair is produced as output:
the key consists of the combination of the word and
the item identifier; the value is just 1. Afterwards,
a reducer counts the number of appearances of each
word in a description by adding the values for each
word-id combination.
map : (id, content) → [((word, id), 1)]
reduce : ((word, id), [1]) → ((word, id), n)    (3)
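A possible Hadoop sketch of this first job is given below. It assumes the item descriptions reach the mapper as (id, content) Text pairs (e.g. via KeyValueTextInputFormat) and encodes the composite (word, id) key as a single tab-separated Text value; both are implementation choices of ours rather than details fixed by the paper.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map : (id, content) -> [((word, id), 1)]
class WordPerDocMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(Text id, Text content, Context context)
            throws IOException, InterruptedException {
        for (String word : content.toString().toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            compositeKey.set(word + "\t" + id.toString());  // (word, id) as one key
            context.write(compositeKey, ONE);
        }
    }
}

// reduce : ((word, id), [1]) -> ((word, id), n)
class WordPerDocReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text wordId, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable one : ones) {
            n += one.get();          // n: occurrences of word in this description
        }
        context.write(wordId, new IntWritable(n));
    }
}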
The mapper of the second job merely rearranges the
data of the records by moving the word from the key
to the value. In this way, the following reducer is
able to count the number of words in each document,
i.e. N.
map : ((word, id), n) → (id, (word, n))
reduce : (id, [(word, n)]) → [((word, id), (n, N))]    (4)
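The second job could be sketched as follows, assuming the jobs are chained through SequenceFile output/input so that the Text/IntWritable types of the first job are preserved; buffering the (word, n) pairs inside the reducer until N is known is our own implementation choice.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map : ((word, id), n) -> (id, (word, n))
class DocLengthMapper extends Mapper<Text, IntWritable, Text, Text> {
    @Override
    protected void map(Text wordId, IntWritable n, Context context)
            throws IOException, InterruptedException {
        String[] parts = wordId.toString().split("\t");   // [word, id]
        context.write(new Text(parts[1]), new Text(parts[0] + "\t" + n.get()));
    }
}

// reduce : (id, [(word, n)]) -> [((word, id), (n, N))]
class DocLengthReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String[]> wordCounts = new ArrayList<String[]>();
        long totalN = 0;                                   // N: words in this description
        for (Text value : values) {
            String[] wordAndN = value.toString().split("\t");
            wordCounts.add(wordAndN);
            totalN += Long.parseLong(wordAndN[1]);
        }
        for (String[] wc : wordCounts) {
            context.write(new Text(wc[0] + "\t" + id.toString()),   // (word, id)
                          new Text(wc[1] + "\t" + totalN));         // (n, N)
        }
    }
}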
The third job calculates the number of item descrip-
tions in the corpus that contain a particular word. The
mapper of this job rearranges the data and the re-
ducer outputs the number of descriptions containing
the word, i.e. m.
map : ((word, id), (n, N)) → (word, (id, n, N, 1))
reduce : (word, [(id, n, N, 1)]) → [((word, id), (n, N, m))]    (5)
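A sketch of the third job follows the same buffering pattern, again under our assumption that the previous job's ((word, id), (n, N)) records arrive as tab-separated Text pairs:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map : ((word, id), (n, N)) -> (word, (id, n, N, 1))
class CorpusFrequencyMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text wordId, Text nAndN, Context context)
            throws IOException, InterruptedException {
        String[] parts = wordId.toString().split("\t");    // [word, id]
        context.write(new Text(parts[0]),
                      new Text(parts[1] + "\t" + nAndN.toString() + "\t1"));
    }
}

// reduce : (word, [(id, n, N, 1)]) -> [((word, id), (n, N, m))]
class CorpusFrequencyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String[]> postings = new ArrayList<String[]>();
        int m = 0;                                         // m: descriptions containing word
        for (Text value : values) {
            postings.add(value.toString().split("\t"));    // [id, n, N, 1]
            m++;
        }
        for (String[] p : postings) {
            context.write(new Text(word.toString() + "\t" + p[0]),    // (word, id)
                          new Text(p[1] + "\t" + p[2] + "\t" + m));   // (n, N, m)
        }
    }
}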
The fourth job, which only consists of a mapper (i.e.
the reducer is the identity operation), produces the
TFIDF of each id-word pair. The total number of item
descriptions in the document corpus is calculated in
the file system and provided as an input variable of
this MapReduce job. Although it is possible to merge