A Keyphrase Extraction Approach for Social Tagging Systems
Felice Ferrara and Carlo Tasso
Artificial Intelligence Laboratory
Department of Mathematics and Computer Science, University of Udine, Udine, Italy
Keywords:
Keyphrase Extraction, Social Tagging, Keyphrase Recommendation.
Abstract:
Social tagging systems allow people to classify resources by using a set of freely chosen terms named tags.
However, by shifting the classification task from a set of experts to a larger, untrained set of people, the
resulting classifications are often inaccurate. The lack of control and guidelines generates noisy tags (i.e. tags
without a clear semantics) which deteriorate the precision of the user-generated classifications. To address
this limitation, several tools have been proposed in the literature for suggesting to the users tags which properly
describe a given resource. In this paper we propose to suggest n-grams (named keyphrases), following the
idea that sequences of two or three terms can better resolve potential ambiguities. More specifically, in this work
we identify a set of features which characterize n-grams able to describe meaningful aspects reported in Web
pages. By means of these features we developed a mechanism which supports people in manually classifying
Web pages by automatically suggesting meaningful keyphrases expressed in English.
1 INTRODUCTION
In this paper we propose an innovative content-based approach for automatic tag recommendation in social tagging systems. The keyphrase extraction mechanism proposed in this work opens many interesting perspectives for enhancing access to the knowledge stored in social tagging systems. The main one is related to the task of associating a tag with a specific semantic meaning: keyphrases extracted from a Web page can be used to identify concepts or entries defined in a semantic knowledge source such as WordNet or Wikipedia.
By enriching the semantic value of tags, the effectiveness of other applications can be improved as well. During the last ten years many recommender systems have been proposed to integrate tags in the process of modeling both the user interests and the resources available in social tagging systems. The main limitation of these approaches still depends on the fact that the meaning of a tag is usually inferred by taking into account only statistical information about the co-occurrences of tags. By disambiguating tags and enriching them with other semantic or ontological knowledge we can improve the accuracy of both collaborative filtering mechanisms and content-based approaches.
The paper is organized as follows: the proposed
approach to extract keyphrases from Web pages is il-
lustrated in Section 2; Section 3 describes the evalua-
tion settings and the results; final considerations con-
clude the paper in Section 4.
2 EXTRACTING KEYPHRASES
FROM WEB PAGES
By following the traditional schema adopted by sev-
eral keyphrase extraction mechanisms we split the de-
scription of the approach into two parts: the candidate
phrase extraction (Section 2.1) and the phrase selec-
tion phase (Section 2.2).
2.1 Candidate Phrase Identification
Given an HTML page, a format conversion step is
exploited for extracting the meaningful textual corpus
from the document, i.e. the textual parts which contain the relevant facts reported in the resource. More specifically, the format conversion includes (see the sketch after this list):
- the removal of irrelevant parts from the document by exploiting an open source Web service called Boilerpipe (http://code.google.com/p/boilerpipe/);
- the extraction of metadata included in the source of the page by means of HTML tags such as KEYWORDS, DESCRIPTION, and TITLE;
- the translation of the text into English by using a freely available API (we are currently using the Google Translate API).
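For illustration, the following Python fragment sketches how such a format conversion step could be assembled. The metadata extraction uses BeautifulSoup, while the Boilerpipe call and the translation call are hypothetical stand-in functions, not the code actually used in our system.

```python
# Minimal sketch of the format conversion step (not the code used in our
# system). Metadata is read with BeautifulSoup; the calls to Boilerpipe and
# to the translation API are represented by hypothetical stand-in functions.
from bs4 import BeautifulSoup

def extract_main_text(html):
    # Stand-in for the Boilerpipe Web service, which removes navigation menus,
    # advertisements and other irrelevant parts of the page.
    raise NotImplementedError("delegate to http://code.google.com/p/boilerpipe/")

def translate_to_english(text):
    # Stand-in for a translation API (the paper relies on Google Translate).
    return text

def format_conversion(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = {}
    for name in ("keywords", "description"):     # case handling omitted
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            meta[name] = tag["content"]
    main_text = extract_main_text(html)
    # Output order follows the text: title, then metadata, then the main content.
    text = "\n".join([title, meta.get("keywords", ""),
                      meta.get("description", ""), main_text])
    return translate_to_english(text)
```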
The output of the format conversion phase is a text in English consisting of the title of the Web page, followed by the metadata extracted from the HTML tags, and concluded by the text extracted by the Boilerpipe service.
This text is analyzed in the cleaning and sentence delimiting step in order to delimit sentences, following the assumption that a keyphrase cannot span two distinct sentences.
In the POS-tagging and n-gram extraction step we assign a POS tag (noun, adjective, verb, etc.) to each token in the cleaned text by using the Stanford log-linear part-of-speech tagger (http://nlp.stanford.edu/software/tagger.shtml), and then we extract all possible word subsequences of up to 3 words (uni-grams, bi-grams, and tri-grams).
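For illustration, the fragment below sketches this step using NLTK's tokenizer and POS tagger as a stand-in for the Stanford tagger (an assumption made only for the example).

```python
# Sketch of the POS-tagging and n-gram extraction step. NLTK is used here as
# a stand-in for the Stanford tagger; it requires the 'punkt' and
# 'averaged_perceptron_tagger' NLTK data packages.
import nltk

def extract_candidate_ngrams(sentence, max_n=3):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)              # list of (word, POS) pairs
    candidates = {n: [] for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            candidates[n].append(tuple(tagged[i:i + n]))
    return candidates

# N-grams are extracted within a single sentence, never across two sentences.
ngrams = extract_candidate_ngrams("Social tagging systems rely on freely chosen tags.")
```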
In order to discard keyphrases which do not have a significant meaning, a pruning process is performed in the subsequent stemming and stopword removal step, where: the phrases starting or ending with a stopword or a sentence delimiter are removed; plural and singular forms are collapsed by using the Porter stemmer algorithm (Porter, 1997); and a well-defined set of POS patterns is used to filter out, for example, uni-grams that are labeled as adjectives or verbs.
The output of all the previous steps consists of three lists containing, respectively, the resulting candidate uni-grams, bi-grams, and tri-grams.
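A possible implementation of the pruning step is sketched below; the POS-pattern filter shown is only an example, not the complete pattern set used in our system.

```python
# Sketch of the stemming and stopword removal step (illustrative only).
from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
DELIMITERS = {".", ",", ";", ":", "!", "?"}
stemmer = PorterStemmer()

def keep(tagged_ngram):
    words = [w.lower() for w, _ in tagged_ngram]
    tags = [t for _, t in tagged_ngram]
    # Discard phrases starting or ending with a stopword or a sentence delimiter.
    if words[0] in STOP | DELIMITERS or words[-1] in STOP | DELIMITERS:
        return False
    # Example POS-pattern filter: drop uni-grams labeled as adjectives or verbs.
    if len(tags) == 1 and (tags[0].startswith("JJ") or tags[0].startswith("VB")):
        return False
    return True

def normalize(tagged_ngram):
    # Collapse singular/plural forms via Porter stemming.
    return tuple(stemmer.stem(w.lower()) for w, _ in tagged_ngram)
```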
2.2 DIKpEW: Phrase Selection
As proposed in (Pudota et al., 2010), some characteristics of the candidate keyphrases are assessed in the feature calculation step in order to identify the most relevant keyphrases. The evaluated characteristics have been identified by taking into account how Web pages usually convey meaningful information. The considered features are qualitatively described below (a sketch of their computation follows the list):
1. Phrase Frequency: this feature is the classical term frequency (TF) metric, exploited in many state-of-the-art keyphrase extraction systems (Turney, 1999; Hulth, 2003; Hulth and Megyesi, 2006). In our work, the TF value is normalized and computed separately for each n-gram list.
2. POS Value: as observed in (Hulth, 2003; Barker and Cornacchia, 2000), most author-assigned keyphrases for a document turn out to be noun phrases. For this reason we increase the weight of candidate phrases containing more nouns.
3. Phrase Depth: following the idea that the main concepts and information are usually reported in the first part of the document, we compute the phrase depth value for each phrase as the number of words preceding the phrase's first occurrence.
4. Wikipedia: this feature is used to identify more coherent and recognized phrases, following the idea that keyphrases which are also entries of Wikipedia are more likely associated with well-defined concepts/meanings.
5. Title: it highlights keyphrases that are included in the title of the Web page (if known). We followed the hypothesis that the title summarizes meaningful concepts which are discussed more deeply in the rest of the text.
6. Description: authors of Web pages often add a short description of the main contents of the Web page by using the DESCRIPTION HTML tag. Following the idea that the summary provided by the author may contain very meaningful information, we compute this Boolean feature for each keyphrase: the feature is set to 1 if the keyphrase is in the description, 0 otherwise.
7. Keyword: even if authors of Web pages are not required to classify their published resources, they usually add some keywords so that the page is properly indexed by search engines. Since these terms are labels generated by the authors themselves, we consider them as meaningful keyphrases.
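The fragment below sketches how most of these features could be computed for a candidate phrase. The normalizations, the representation of the Wikipedia lookup as a precomputed set of entry titles, and the omission of the POS Value feature are simplifications made for illustration only.

```python
# Illustrative sketch of the feature calculation step (not the authors' exact
# formulas). `phrases` holds candidate n-grams as tuples of words, `tokens` is
# the ordered token list of the document, `wiki_titles` is a precomputed set of
# Wikipedia entry titles, and `keywords` is the content of the KEYWORDS tag.
from collections import Counter

def compute_features(phrases, tokens, title, description, keywords, wiki_titles):
    counts = Counter(phrases)
    if not counts:
        return {}
    max_count = max(counts.values())
    features = {}
    for phrase, count in counts.items():
        joined = " ".join(phrase)
        # Phrase depth: index of the first occurrence, rescaled here so that
        # phrases appearing earlier obtain a higher value (assumption).
        first = next((i for i in range(len(tokens))
                      if tuple(tokens[i:i + len(phrase)]) == phrase), len(tokens))
        features[phrase] = {
            "frequency": count / max_count,                      # normalized TF
            "depth": 1.0 - first / max(len(tokens), 1),
            "wikipedia": 1.0 if joined in wiki_titles else 0.0,
            "title": 1.0 if joined in title.lower() else 0.0,
            "description": 1.0 if joined in description.lower() else 0.0,
            "keyword": 1.0 if joined in keywords.lower() else 0.0,
        }
    return features
```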
In the scoring and ranking step we combine the values of the features in order to compute a score (named keyphraseness) for each candidate keyphrase. The keyphraseness is a weighted combination of the evaluated features, where the weights of the features were experimentally computed by using the opinions of a limited set of people. Finally, the keyphrases with the highest keyphraseness are selected and recommended in the final keyphrase filtering step.
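The combination can be sketched as follows; the weights shown are uniform placeholders, not the values obtained from our experiments.

```python
# Sketch of the scoring and ranking step: keyphraseness as a weighted sum of
# the feature values. The weights below are placeholders for illustration.
WEIGHTS = {"frequency": 1.0, "depth": 1.0, "wikipedia": 1.0,
           "title": 1.0, "description": 1.0, "keyword": 1.0}

def keyphraseness(feature_values, weights=WEIGHTS):
    return sum(weights.get(name, 0.0) * value
               for name, value in feature_values.items())

def top_keyphrases(features, k=10):
    # `features` is the phrase -> feature dictionary of the previous sketch.
    ranked = sorted(features, key=lambda p: keyphraseness(features[p]),
                    reverse=True)
    return ranked[:k]
```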
3 EVALUATION
Web pages are usually not classified with keyphrases by their authors, and this had a strong impact on our evaluation procedure. In fact, there are no freely available datasets which can be used to execute an automatic evaluation of the described mechanism. For this reason we decided to exploit a live evaluation
AKeyphraseExtractionApproachforSocialTaggingSystems
363
involving a set of volunteers who had the task of judging the accuracy of the results returned by our approach. Moreover, due to the lack of keyphrases associated with Web pages, we could not use KEA (Turney, 2000) for comparing our results to one of the state-of-the-art mechanisms: in fact, the KEA mechanism needs to be trained on a corpus of annotated documents. To address this issue, we decided to use as baseline approach a system where keyphrases are scored and ranked according to their frequencies. This choice seems reasonable since, like our approach, the baseline takes into account only the information available in a specific document (without considering the characteristics of the documents in a specific collection): the most frequent keyphrases obtain a higher score. Using this score, the baseline mechanism extracts the two top-scored uni-grams, the five top-scored bi-grams, and the three top-scored tri-grams. The final set of keyphrases is then made up of these 10 filtered keyphrases.
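For clarity, the baseline can be summarized by the following sketch (illustrative only).

```python
# Sketch of the frequency-based baseline: the two top-scored uni-grams, the
# five top-scored bi-grams, and the three top-scored tri-grams form the final
# set of 10 recommended keyphrases.
from collections import Counter

def baseline_keyphrases(unigrams, bigrams, trigrams):
    def top(phrases, k):
        return [phrase for phrase, _ in Counter(phrases).most_common(k)]
    return top(unigrams, 2) + top(bigrams, 5) + top(trigrams, 3)
```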
The results returned by both our mechanism and the baseline approach were evaluated by using a Web application where a set of volunteers judged the accuracy of the results. Since our approach is mainly aimed at supporting the users of social tagging systems, we created a Web-based application which simulates the interaction of a user with a social tagging system. By using this application, the volunteers could submit a URL and the evaluation framework returned a list of keyphrases for the specific Web page. The list of returned keyphrases was built from the results produced by both the proposed approach and the baseline mechanism. However, the two sets of keyphrases were presented to the evaluators mixed in a random order. By merging the keyphrases without a specific order we avoided biasing the human evaluators, since they were not able to recognize which of the two compared approaches had returned each keyphrase.
The evaluators had to rate each returned keyphrase on the following 5-point Likert scale:
- Excellent: the keyphrase is very meaningful. It reports relevant facts, people, topics or other elements which characterize the Web page;
- Good: the keyphrase is still significant for classifying the document but it is not the best. It reports facts, people, topics or other elements which characterize the Web page, but they are more weakly connected to the main content of the page;
- Neutral: you are not sure about the significance of the keyphrase for the document;
- Poor: the keyphrase does not properly describe the contents;
- Very Poor: the keyphrase does not make sense.
We involved 26 volunteers (20 men and 6 women)
who worked for two weeks. The volunteers were
students and workers. The oldest participant was 63
years old, the youngest was 22 years old and the av-
erage age was 37 years. The volunteers evaluated the
keyphrases generated for 209 Web pages written in
Italian and in English.
We used the Normalized Discounted Cumulative Gain (NDCG) metric to analyze the collected judgments. The NDCG metric is commonly used in the area of Information Retrieval to evaluate the accuracy of ranking mechanisms. This measure is specifically suited to scenarios where the ranked results are associated with different relevance levels, since it takes into account both the position and the usefulness (or gain) of the results to assign a score to the evaluated ranking mechanism. In particular, the NDCG metric is based on the assumption that an accurate ranking mechanism puts the most relevant results in the first positions of the generated ranking. This means that the NDCG metric assesses the accuracy of a ranking mechanism by combining information about the position of the items in the ranking with the relevance feedback provided by the users. Technically, the NDCG metric assigns a score to a ranking mechanism by taking into account a set of ranked lists of resources where each resource is associated with one specific grade of a graded relevance scale.
More formally, given a ranking mechanism and a ranked list of resources returned by the mechanism, where the resource (in our case the keyphrase) in position $i$ is associated with a relevance level $rel_i$ (in our case the position is defined by our algorithm and the relevance by the evaluators), the gain for this list is computed as follows:

$$DCG = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i}$$

where $n$ is the number of results in the ranked list (in our specific case $n$ is equal to 10). In our evaluation
the graded relevance scale is defined by the following
relevance levels: Excellent = 4; Good = 3; Neutral =
2; Poor = 1; Very poor = 0. The DCG is then used
to quantify the accuracy of a response generated by
a ranking mechanism according to both a fixed rele-
vance scale and the opinions of the evaluators.
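The computation can be sketched as follows; the normalization by the DCG of the ideal ordering is our assumption of a standard NDCG implementation.

```python
# Sketch of the DCG/NDCG computation described above. `relevances` holds the
# graded judgments (Excellent = 4 ... Very poor = 0) in the order in which the
# keyphrases were ranked by the evaluated mechanism.
import math

def dcg(relevances):
    return relevances[0] + sum(rel / math.log2(i)
                               for i, rel in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))   # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: the judgments collected for one ranked list of 10 keyphrases.
print(round(ndcg([4, 3, 0, 2, 1, 2, 0, 1, 0, 0]), 3))
```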
By computing the DCG over each evaluation provided by our evaluators, we obtained an assessment of the accuracy for each evaluated Web page. These DCG values are then normalized in [0, 1] to obtain the NDCG, and the accuracy of the mechanism is finally computed as the mean of these normalized values. If the evaluated keyphrase extraction mechanism returns only very relevant keyphrases, then the NDCG assumes the ideal value of 1.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
364
Table 1: Performance of DIKpEW compared to the baseline mechanism.

              NDCG@5    NDCG@10
Base Ita       0.484     0.437
DIKpEW Ita     0.558     0.614
Base Eng       0.485     0.576
DIKpEW Eng     0.523     0.686
Table 1 reports the 8 different NDCG values computed for evaluating and comparing the accuracy of the top 5 and top 10 keyphrases extracted by: (i) the baseline system from Web pages written in Italian (Base Ita); (ii) our approach from Web pages written in Italian (DIKpEW Ita); (iii) the baseline system from Web pages written in English (Base Eng); (iv) our approach from Web pages written in English (DIKpEW Eng).
According to the results shown in Table 1, our approach outperforms the baseline mechanism. Moreover, the accuracy of the results computed for the Web pages in Italian is comparable to the accuracy for the Web pages in English. This means that the noise introduced by the translation into English does not significantly lower the accuracy of the results. This can be justified in two ways: (i) the weight of a keyphrase depends on a set of statistical features which discard possible incorrect translations; (ii) the Wikipedia feature allows us to throw out (or at least to assign lower positions to) the bi-grams and tri-grams which do not have a clear meaning.
A final consideration concerns the NDCG metric: it is important to emphasize that we exploited it only for comparing our approach to a baseline reference. In fact, the choice of selecting only the top N keyphrases (where N=5 or N=10) does not account for the possibility of working with pages containing only 2 or 3 significant phrases. In this case, the NDCG@10 metric, for example, would be much lower than 1. Future work will also address this issue.
4 CONCLUSIONS
In this work we presented an approach aimed at supporting the users of social tagging systems in classifying Web pages. In particular, the presented approach identifies English n-grams in a Web document for suggesting meaningful labels for the specific resource. An experimental evaluation showed that the proposed approach is promising, and future analyses will investigate whether it can produce better results for specific topics or specific sets of Web pages (blogs, newspapers, etc.).
The proposed approach can only suggest keyphrases which already appear in the given document. Future work will focus on overcoming this limitation by navigating other knowledge sources such as Wikipedia and WordNet, thus producing meaningful tags made up of uni-grams, bi-grams, or tri-grams which are not contained in the text and which result from a domain reasoning activity. We also plan to integrate our approach into collaborative and content-based recommender systems, following the ideas proposed in (Ferrara and Tasso, 2011) and (Ferrara et al., 2011).
REFERENCES
Barker, K. and Cornacchia, N. (2000). Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 40–52, London, UK. Springer-Verlag.

Ferrara, F., Pudota, N., and Tasso, C. (2011). A keyphrase-based paper recommender system. In Agosti, M., Esposito, F., Meghini, C., and Orio, N., editors, Digital Libraries and Archives, volume 249 of Communications in Computer and Information Science, pages 14–25. Springer Berlin Heidelberg.

Ferrara, F. and Tasso, C. (2011). Extracting and exploiting topics of interests from social tagging systems. In Proceedings of the International Conference on Adaptive and Intelligent Systems, ICAIS'11, pages 285–296, Berlin, Heidelberg. Springer-Verlag.

Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 216–223, Morristown, NJ, USA. Association for Computational Linguistics.

Hulth, A. and Megyesi, B. B. (2006). A study on automatically extracted keywords in text categorization. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 537–544, Morristown, NJ, USA. ACL.

Porter, M. F. (1997). An algorithm for suffix stripping. Readings in Information Retrieval, pages 313–316.

Pudota, N., Dattolo, A., Baruzzo, A., Ferrara, F., and Tasso, C. (2010). Automatic keyphrase extraction and ontology mining for content-based tag recommendation. International Journal of Intelligent Systems, Special Issue: New Trends for Ontology-Based Knowledge Discovery, 25:1158–1186.

Turney, P. (1999). Learning to extract keyphrases from text. Technical Report ERB-1057, National Research Council, Institute for Information Technology.

Turney, P. D. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336.
AKeyphraseExtractionApproachforSocialTaggingSystems
365