CBF using the Vector Space Model (VSM) follows the same process as text classification, categorizing documents into predefined thematic categories. This typically uses supervised learning and is conducted in two main phases: document indexing and classifier learning (Sebastiani, 2002). In document indexing, a numeric representation of each document is created in two steps: first, a subset of terms is selected from all terms occurring in the whole collection; then, term weighting assigns a numeric value to each term, reflecting its contribution to each document, in order to build the document profiles. In classifier learning, a document classifier is developed by learning from the numeric representations of the documents.
In information retrieval and the VSM, term weighting is formulated as term frequency-inverse document frequency, known as tf-idf. The tf-idf is one of the most popular term-weighting techniques for CBF (Lops et al., 2011). For instance, 83% of content-based recommender systems in the domain of digital libraries use tf-idf (Beel et al., 2016). CBF using the VSM is often conducted in three phases:
1. Feature extraction: each product is represented by a subset of terms from all terms occurring in the item collection.
2. Term weighting: the items' features are weighted using the most common weighting method in the VSM, the term frequency-inverse document frequency (tf-idf) method. The tf gives a local view of a term, expressing the assumption that multiple appearances of a term in a document are no less important than a single appearance. The idf gives a global view of a term across the entire collection, assuming that rare terms are no less important than frequent ones. For more details on tf-idf for CBF, readers can refer to (Pazzani and Billsus, 2007; Lops et al., 2011). In tf-idf, the user profile is represented by a vector of weights, where each component denotes the importance of a term to the user.
3. k-nearest neighbor classification with a cosine similarity measure: from the two phases above, the user profile and the content of new items are represented as tf-idf vectors of term weights. The CBF system calculates the similarity between the documents previously seen and rated by the user and each new document. The prediction of the user's interest in a particular document is obtained by cosine similarity; as pointed out in (Lops et al., 2011), cosine similarity is the most widely used measure for determining the closeness between two documents. A standard formulation of both the weights and the similarity is given just after this list.
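As a concrete illustration, a classical formulation of the tf-idf weight and of the cosine similarity is the following (the exact normalization varies across systems, so this should be read as the textbook scheme rather than the specific variant used in any particular implementation):

\[ w_{k,j} = tf_{k,j} \cdot \log \frac{N}{n_k}, \]

where $tf_{k,j}$ is the frequency of term $t_k$ in document $d_j$, $N$ is the number of documents in the collection, and $n_k$ is the number of documents containing $t_k$. Given a user profile vector $\mathbf{u}$ and an item vector $\mathbf{d}$ of such weights, the cosine similarity is

\[ sim(\mathbf{u}, \mathbf{d}) = \frac{\sum_k u_k d_k}{\sqrt{\sum_k u_k^2} \, \sqrt{\sum_k d_k^2}}. \]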
The k-Nearest Neighbor (k-NN) is a classical method for recommender systems (Lops et al., 2011). k-NN is a basic machine learning algorithm used for classification problems. It compares a new item with all stored labeled items in the training set using the cosine similarity measure and determines the k nearest neighbors. The class label of a testing (new) item is then determined from the class labels of its nearest neighbors in the training set. Each item in the training set is represented by a weighted vector, in which each component j holds the tf-idf of the corresponding term t_j. For each item in the testing set, we calculate the tf-idf on the m terms selected from the training set. The training phase of the algorithm consists only of storing the attribute vectors with their class labels in memory. The k-NN algorithm compares all stored items to the query item using a cosine similarity function and determines the k nearest neighbors. A majority voting rule is applied to assign the query item to the class "LIKE" or "DISLIKE". The k-NN classifier is one of the successful techniques for CBF.
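To make the classification step concrete, the following is a minimal sketch in Python (the variable names, the toy vectors, and the choice of k are our illustrative assumptions, not values from the paper):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two tf-idf vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    # Compare the query item to every stored training item,
    # keep the k most similar, and apply majority voting.
    sims = [cosine_similarity(query_vec, v) for v in train_vecs]
    nearest = np.argsort(sims)[-k:]            # indices of the k nearest neighbors
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)    # "LIKE" or "DISLIKE"

# Example: three stored items with user ratings, one new query item.
X = [np.array([0.2, 0.0, 0.8]), np.array([0.9, 0.1, 0.0]), np.array([0.1, 0.0, 0.7])]
y = ["LIKE", "DISLIKE", "LIKE"]
print(knn_predict(np.array([0.15, 0.0, 0.75]), X, y, k=3))  # -> LIKE
```

Note that, as discussed next, all the work happens at query time: "training" is just storing X and y.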
Although the k-NN classifier has been successfully applied to some CBF applications, it suffers from several limitations: it requires high computation time, because the distance from each query item to all training items must be computed (it has no true training phase; the whole training set is used at query time); pre-processing, normalization, or a change of input domain is often required to bring all the input data to the same absolute scale; and the number k in k-NN is given a priori, so if one changes k, the assignment decision may also change. To address these issues, the k-CR was introduced. In the following section, we describe our CBF based on the k-CR classifier.
3 METHODOLOGY
Our methodology follows the same process as text classification, categorizing documents into predefined thematic categories. This typically uses supervised learning and is conducted in two main phases: document indexing and classifier learning (Sebastiani, 2002). In document indexing, a numeric representation of each document is created in two steps: first, a subset of terms is selected from all terms occurring in the whole collection; then, term weighting assigns a numeric value to each term, reflecting its contribution to each document, in order to build the document profiles. In classifier learning, a document classifier is developed by learning from the numeric representations of the documents.
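As a minimal sketch of the document indexing phase, assuming scikit-learn (the toy corpus and the max_features cutoff are illustrative assumptions, not values from the paper):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy item descriptions standing in for the document collection.
docs = ["red cotton shirt", "blue denim jeans", "cotton summer dress"]

# Step 1: term selection - keep at most m terms from the whole collection.
# Step 2: term weighting - assign each selected term its tf-idf value.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
profiles = vectorizer.fit_transform(docs)  # one tf-idf row vector per document

# `profiles` is the numeric representation passed to classifier learning.
```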