An example of content-based tag recommendation that uses a graph
representation is presented in (Lee and Chun, 2007). Their system
recommends tags extracted from the content of a blog using an artificial
neural network, which uses WordNet and word frequencies in the training
step.
The authors in (Tatu et al., 2008) utilize information from resource
content and the folksonomic structure of the graph. They use the graph to
create a set of tags related to the resource and a set of tags related to
the user. The system then enriches the vocabulary of each of these sets
with a WordNet-based search for words that represent the same concept, and
recommends tags to the user from the enriched sets. A method that creates
resource-related tags from the keywords found in the resource's title and
extends them with the tags that co-occur with these base tags in the
system is presented in (Lipczak et al., 2009). Existing tag recommendation
studies use previous tags that have been assigned to the resource by other
users. Thus, they become insufficient when a new resource appears. Our
recommendation model utilizes the content of the Web document; hence,
whether a resource is new or frequently tagged does not affect our
recommendation success.
3 PROPOSED METHOD
3.1 Analysis of Tagging Behavior
It can be assumed that Web pages can be represented by their text. In this
study, this text is separated into five different sections: (1) main
content, for long texts in the body part of the document (C); (2) page
title (P); (3) heading 1 (H1); (4) heading 2 (H2); and (5) the anchor text
in the links (A). HTML provides six heading levels, and H1 is the largest,
sitting at the top of the heading hierarchy. In the remaining part of this
paper, $d_i^x$ denotes one of these five sections of a document $d_i$. A
preprocessing step is performed, which includes stop word removal and
stemming of terms. The main content of a Web page is then represented by
the top-$k$ terms that have the highest frequency among the terms in the
body part of the document. The terms in a section of the document are
combined into a single vector:
\[ \vec{d_i^x} = (w_1^x, f_{i1}), (w_2^x, f_{i2}), \ldots, (w_n^x, f_{in}) \qquad (1) \]
where $w_1^x, w_2^x, \ldots, w_n^x$ are the terms that appear in the
corresponding section $d_i^x$ and $f_{i1}, f_{i2}, \ldots, f_{in}$ are the
frequencies of these terms. Thus, a Web document can be represented by
five term vectors. Instead of the commonly used TF-IDF (Term
Frequency/Inverse Document Frequency) weighting scheme, we used TF
weighting in the vector representations.
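As an illustration of Eq. (1), the construction of the five TF vectors can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this study: the stop-word list, the function names `term_vector` and `document_vectors`, and the sample page are hypothetical, the section texts are assumed to be already extracted from the HTML, and stemming is omitted to keep the example dependency-free.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def term_vector(text, top_k=None):
    """Return a TF vector as a list of (term, frequency) pairs.

    Stemming, which the preprocessing step also includes, is omitted here.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # most_common(None) returns every term; most_common(k) keeps the k most frequent.
    return counts.most_common(top_k)


def document_vectors(sections, top_k=10):
    """Build one TF vector per section; only the main content (C) is cut to top-k."""
    return {
        name: term_vector(text, top_k if name == "C" else None)
        for name, text in sections.items()
    }


# Hypothetical page, with its five sections already extracted.
page = {
    "C": "Python is a programming language. Python tutorials cover the language basics.",
    "P": "Python Tutorial",
    "H1": "Learn Python",
    "H2": "Language basics",
    "A": "download python docs",
}
vectors = document_vectors(page, top_k=3)
print(vectors["P"])  # [('python', 1), ('tutorial', 1)]
```

Raw counts are used as weights, in line with the TF weighting chosen above; no inverse document frequency is involved.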
The tags assigned to a Web document are combined into a single tag vector:
\[ \vec{t_i^t} = (t_1, f_{i1}), (t_2, f_{i2}), \ldots, (t_l, f_{il}) \qquad (2) \]
where $t_1, t_2, \ldots, t_l$ are the tags assigned by users to document
$d_i$ and $f_{i1}, f_{i2}, \ldots, f_{il}$ are the frequencies of the
corresponding tags in that document.
As stated earlier, the aim of this step is to find a relationship between
the terms that appear in the document and the tags assigned to it. For
this reason, the similarity between each term vector and the tag vector of
the document is computed using the cosine similarity measure:
\[ \mathrm{sim}(\vec{d_i^x}, \vec{t_i^t}) = \frac{\vec{d_i^x} \cdot \vec{t_i^t}}{\|\vec{d_i^x}\| \, \|\vec{t_i^t}\|} \qquad (3) \]
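The cosine measure of Eq. (3) can be sketched for sparse vectors as below. The sketch assumes that both vectors are stored as dictionaries mapping a word to its frequency, and that a term and a tag share a dimension when their (preprocessed) strings are identical; the function name and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(term_vec, tag_vec):
    """Cosine similarity between two sparse {word: frequency} vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(freq * tag_vec.get(word, 0) for word, freq in term_vec.items())
    norm_terms = math.sqrt(sum(f * f for f in term_vec.values()))
    norm_tags = math.sqrt(sum(f * f for f in tag_vec.values()))
    if norm_terms == 0 or norm_tags == 0:
        return 0.0  # convention chosen for this sketch: empty vectors have no similarity
    return dot / (norm_terms * norm_tags)


# Hypothetical page-title vector and tag vector for the same document.
title_vec = {"python": 1, "tutorial": 1}
tag_vec = {"python": 3, "programming": 1}
sim = cosine_similarity(title_vec, tag_vec)
print(round(sim, 4))  # 0.6708
```

Here the only shared word is "python", so the dot product is 3 and the norms are $\sqrt{2}$ and $\sqrt{10}$, giving $3/\sqrt{20} \approx 0.67$.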
The second step of tag analysis consists of determining the semantic
relationship between the scope of a document and the tags of this document
using WordNet. Each term in each term vector of a document is converted
into its hypernym and hyponym versions using WordNet. A term's hypernym is
a more general term, whereas a hyponym is a more specific one. The
frequency $f_{ij}$ of a term $t_j$ in a term vector of $d_i$ is mapped to
its hypernyms/hyponyms $\{h_1, \ldots, h_j, \ldots, h_r\}$. The
frequencies of synonym terms are determined in a similar way to the
hypernym/hyponym case. The similarity between each term vector and the
synonym tag vector is computed based on the cosine measure.
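One possible reading of this frequency mapping is sketched below. It rests on several assumptions that the text does not fix: each hypernym receives the full frequency of the term it comes from, frequencies are summed when several terms share a hypernym, and terms without an entry contribute nothing. The small hand-written table `TOY_HYPERNYMS` is a hypothetical stand-in for WordNet lookups and was not queried from WordNet; the same function would apply to a hyponym or synonym table.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: term -> list of hypernyms (hand-written).
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "wolf": ["canine"],
    "cat": ["feline"],
}


def expand_vector(term_vec, related):
    """Map each term's frequency onto its related words (e.g. hypernyms).

    Assumption: every related word inherits the term's full frequency, and
    frequencies from different terms are added together.
    """
    expanded = Counter()
    for term, freq in term_vec.items():
        for word in related.get(term, []):
            expanded[word] += freq
    return dict(expanded)


term_vec = {"dog": 2, "wolf": 1, "cat": 3, "blog": 4}
hypernym_vec = expand_vector(term_vec, TOY_HYPERNYMS)
print(hypernym_vec)  # {'canine': 3, 'domestic_animal': 2, 'feline': 3}
```

The expanded vector can then be compared with the tag vector using the same cosine measure as in Eq. (3).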
3.2 Personalized Tag Recommendations
We are given a set of users $U = \{u_1, u_2, \ldots, u_N\}$, a set of Web
pages $R = \{d_1, d_2, \ldots, d_K\}$ and a set of tags
$T = \{t_1, t_2, \ldots, t_M\}$. In this paper, we will use the following
notation:
• $\mathrm{tags}(u_i) \subseteq T$ is the set of tags used by user $u_i$.
• $\mathrm{tags}(u_i, d_j) \subseteq \mathrm{tags}(u_i)$ is the set of tags given by user $u_i$ to a Web page $d_j$.
• $\mathrm{tags}(d_j) \subseteq T$ is the set of tags given to Web page $d_j$.
• $\mathrm{tags}(d_j^x) \subseteq \mathrm{tags}(d_j)$ is the set of tags of Web page $d_j$ that appear in the $d_j^x$ part of that page. Note that $d^x$ can be any one of the five sections of the document: main content, page title, H1, H2 or anchor text.
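The four tag sets defined above can be derived from a list of tagging records, as in the following sketch. The record format (user, page, tag), the sample data, the `section_terms` table of preprocessed section terms and the helper `tags_in_section` are all hypothetical and serve only to make the notation concrete.

```python
from collections import defaultdict

# Hypothetical tagging records: (user, page, tag).
assignments = [
    ("u1", "d1", "python"),
    ("u1", "d1", "tutorial"),
    ("u1", "d2", "news"),
    ("u2", "d1", "python"),
    ("u2", "d1", "code"),
]

# Hypothetical preprocessed terms per (page, section); P = page title, H1 = heading 1.
section_terms = {
    ("d1", "P"): {"python", "tutorial"},
    ("d1", "H1"): {"learn", "python"},
}

tags_by_user = defaultdict(set)       # tags(u_i)
tags_by_user_page = defaultdict(set)  # tags(u_i, d_j)
tags_by_page = defaultdict(set)       # tags(d_j)
for user, page, tag in assignments:
    tags_by_user[user].add(tag)
    tags_by_user_page[(user, page)].add(tag)
    tags_by_page[page].add(tag)


def tags_in_section(page, section):
    """tags(d_j^x): tags of the page that also appear as terms in section x."""
    return tags_by_page[page] & section_terms.get((page, section), set())


print(sorted(tags_by_user["u1"]))          # ['news', 'python', 'tutorial']
print(sorted(tags_in_section("d1", "H1")))  # ['python']
```

By construction, each derived set is a subset of the set it refines, which mirrors the subset relations in the notation.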