Personalized Recommendation and Explanation by using Keyphrases Automatically extracted from Scientific Literature

Dario De Nart, Felice Ferrara and Carlo Tasso

Artiﬁcial Intelligence Laboratory,

Department of Mathematics and Computer Science, University of Udine, Udine, Italy

Keywords: Adaptive Personalization, Scientific Paper Recommendation, Concept-based Recommendation, User Modelling.

Abstract:

Recommender systems are commonly used for discovering potentially relevant papers in huge collections of

scientiﬁc documents. In this paper we propose a concept-based recommender system where relevant concepts

are automatically extracted from scientific resources in order to both model user interests and generate recommendations. Unlike other work in the literature, our concept-based recommender system does not depend on specific domain ontologies; instead, it is based on an unsupervised, domain-independent keyphrase extraction algorithm that identifies relevant concepts included in a scientific paper. This semantic-oriented approach allows users to easily inspect and modify their user model, and allows the system to effectively justify the proposed recommendations by showing the main concepts included in the suggested papers.

1 INTRODUCTION

Discovering relevant papers is an ordinary and time-consuming task for researchers, since they need to keep up with the most relevant scientific advances. In order to support researchers, several systems (such as CiteSeerX, Google Scholar, ResearchGate, CiteUlike, and Mendeley) provide facilities and tools, such as recommender systems, that simplify the task of accessing the knowledge available in huge collections of scientific papers.

Recommender systems can support scientists by

ﬁltering information according to the personal inter-

ests of the researchers. Collaborative Filtering (CF)

recommender systems, which ﬁlter resources accord-

ing to the opinions of people, have been used to

reach this goal. For example, in CiteUlike, two collaborative filtering mechanisms are exploited: (i) an item-based CF recommender system where the tags provided by the users are utilized for identifying the resources similar to those the active user previously liked and (ii) a user-based recommender system which recommends the resources liked by the users who share papers with the active user (Bogers and Van den Bosch, 2008). In this paper we refer to the user who is going to receive the recommendations as the active user. Content-based recommender systems can be used for identifying potentially relevant resources as well. These recommender

systems represent each resource by means of a set of

features (such as the metadata associated to the re-

sources or other terms extracted from the papers) and

the same set of features is also used for modelling the

user interests. Since resources and user interests are represented by means of the same set of features, the

relevance of a paper for a researcher is computed by

matching the user proﬁle against the representation of

the speciﬁc paper. Obviously, the precision of the rec-

ommendations strongly depends on the features ex-

ploited by the recommender.

In this work we propose to use more semantic fea-

tures by automatically extracting the most relevant

concepts from scientiﬁc papers. By using concepts as

features, we built a concept-based recommender that

suggests the papers related to the concepts of inter-

est for the active user. More speciﬁcally, concepts

are identiﬁed as keyphrases automatically extracted

from the scientiﬁc paper. A keyphrase (KP) is a

short phrase (typically constituted by up to three/four

words) that indicates one of the main ideas/concepts

included in a document. A keyphrase list is a short

list of keyphrases that reﬂects the content of a sin-

gle document, capturing the main topics discussed

and providing a brief summary of its content. The proposed recommender system builds a user profile mainly by means of relevance feedback, i.e. by exploiting the keyphrase lists extracted from the papers that the active user explicitly marks as relevant. Then, in order to compute the relevance of a new article, the user profile is matched against the keyphrase list extracted from that article.

De Nart D., Tasso C. and Ferrara F.
Personalized Recommendation and Explanation by using Keyphrases Automatically extracted from Scientific Literature.
DOI: 10.5220/0004539000960103
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing (KDIR-2013), pages 96-103
ISBN: 978-989-8565-75-4
© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

The domain-independent keyphrase extraction avoids

a manual classiﬁcation of papers and it still identiﬁes

a signiﬁcant set of concepts as we showed in (Ferrara

and Tasso, 2013). The use of more semantic features is motivated by two main goals. First, our concept-based recommender system can explain why a document was recommended by showing: (i) the keyphrases which are both in the user model and in the paper and (ii) other keyphrases found in the document which are not yet stored in the user model but can support the user in understanding and evaluating the new paper. Explaining recommendations by means of keyphrases can increase user satisfaction, since explanations save time: the user is not forced to read the entire document in order to grasp its main contents. Second, the system allows users to inspect the main concepts stored in the user model. In this way, a user can explicitly evaluate their interest in the various concepts, and can increase or decrease the interest level for specific concepts or even remove them from the profile. By allowing users to provide this additional feedback, the system can generate a more accurate user profile, thereby improving the accuracy of the recommendation process and, consequently, user satisfaction. In this paper we show that these two goals can be reached while still providing accurate recommendations.

The paper is organized as follows: Section 2 re-

views related work, a brief architectural overview of

the system is presented in Section 3, the proposed rec-

ommendation method is described in Section 4, the

evaluation performed so far is described in Section 5,

and Section 6 concludes the paper.

2 RELATED WORK

Several works in the literature deal with the problem

of ﬁnding relevant scientiﬁc literature, mostly from

an Information Retrieval perspective, such as in (Bol-

lacker et al., 2000), where CiteSeer is introduced.

However there are several authors who have taken

into account more personalization-based approaches

to the problem, leading to the creation of recom-

mender systems rather than search engines. Several

examples analyze the textual contents of scientiﬁc

papers in order to provide recommendations to re-

searchers. Some of them take into account speciﬁc

sections of the papers such as the bibliography which

can be used to build, navigate, and, moreover, mine

the citation graph (i.e. the directed graph in which

each vertex represents an academic publication and

each edge represents a citation from one publication

to another) in order to generate the recommendations.

For instance, the citation graph is browsed by the rec-

ommender system described in (Huynh et al., 2012),

where a set of liked papers is used as seed for navi-

gating the citation graph.

On the other hand, our work aims at extracting

from the papers the main ideas and concepts in order to describe the user interests from a more semantic perspective. Similarly, the feedback of the users of social systems, such as CiteUlike and BibSonomy, has also been used for identifying the concepts of interest to researchers. The authors of (Jiang et al.,

2012), for example, extract the tags provided by the

users of CiteUlike for generating a dictionary which

can be used for identifying relevant concepts in the

abstracts of scientiﬁc publications. In (Ferrara and

Tasso, 2011), the tags of the users of BibSonomy are

instead exploited for discovering if the user may be in-

terested in several distinct Topics of Interest (ToI). In

this case a clustering mechanism is utilized for joining

together tags with similar meanings where the simi-

larity depends on the number of times two tags have

been applied to the same resource. Such tag clusters allow papers to be organized into different collections, each one associated with a specific ToI for the single

user. Only opinions of users interested in a speciﬁc

ToI are then considered for computing recommenda-

tions. More speciﬁcally, resources labelled by tags

which are evaluated as more similar to the tags as-

sociated to a ToI are considered more relevant than

other resources, and resources bookmarked by users

more similar to the active user are more relevant than

others as well. The precision of these approaches de-

pends on the active participation of the users whereas

the content-based recommender system described in

this paper is solely based on the automatic extraction

of the main concepts from a scientiﬁc resource.

The textual content of scientiﬁc papers is also an-

alyzed in a concept-based recommender system pro-

posed in (Chandrasekaran et al., 2008), where authors

and papers are modeled by trees of concepts: using

the ACM Computing Classiﬁcation System (CCS),

the authors trained a vector space classiﬁer in order to

associate concepts of the CCS classification to documents. The hierarchical organization of the CCS allows the system to represent user interests and documents by trees of concepts. A user profile and a paper representation are then compared by a tree edit distance, which computes a similarity measure among trees. Our approach, on the other hand, does not need a training phase, nor does it depend on specific ontologies for identifying relevant concepts (i.e. keyphrases constituted by n-grams) in the papers.

Figure 1: System Architecture Overview.

In (Govindaraju and Ramanathan, 2012), the au-

thors propose a content-based ﬁltering system based

on a simple, unsupervised, keyphrase extraction tech-

nique to identify relevant concepts and entities. Such

keyphrases are then organized, for each document, in

a graph model, clustered, and matched against other

KP graphs in order to measure the degree of similarity between documents. However, their KP extraction technique does not take into account linguistic features (terms are extracted according to their frequency in the document), keyphrases are considered as atomic entities, and recommendation is based on the Jaccard similarity measure and metadata-driven criteria rather than on an actual comparison of the graph models.

3 SYSTEM OVERVIEW

In order to support our claims and to test our approach

we have developed a speciﬁc recommender system

for scientiﬁc publications, named Recommender and

Explanation System (RES), described in the following. The main goal of RES is to provide personalized access to documents retrieved from CiteSeerX. The architecture of the system, shown in Figure 1, includes a database called Scientific Paper Collection (SPC), a repository for user profiles and registry, and the following three main modules:

1) A Web User Interface that lets the user (i) create and manage profiles, (ii) specify one or more documents of interest, to be used as positive relevance feedback, either by browsing a list of articles within the SPC or by uploading new ones, (iii) query CiteSeerX, and (iv) request recommendations. These are presented as a ranked list of documents where the top items are those that best match the user profile. For each document two lists of keyphrases are shown: the first includes KPs representing concepts that actually match the user profile, while the second consists of relevant KPs extracted from the document but not matching the user profile. This information, shown in Figure 2, serves two goals: it briefly explains why a document was recommended by highlighting its main concepts, and it offers the user a way to provide relevance feedback. Users can adjust the weight of each KP in their profiles by checking the “more” or the “less” checkbox.

2) A Collection Manager Module, devoted to: (i) executing queries on CiteSeerX and crawling the results, (ii) pre-processing articles by extracting KPs from the full text, and (iii) storing their representations, as lists of KPs, in the SPC. This module has been developed using the Dikpe KP extraction algorithm described in (Ferrara et al., 2011), which has proven to perform significantly better than other known systems. The Dikpe KP extractor provides, as output, a list of KPs extracted from the document, where each KP has a weight called Keyphraseness that summarizes the several linguistic and statistical indicators exploited in the extraction process. The higher the Keyphraseness, the more relevant the KP is in the document.

3) A Recommendation Engine Module devoted to:

build and maintain individual user proﬁles; retrieve

from the SPC the set of documents returned by a

query, and then recommend the most promising pa-

pers.

The SPC is a crucial part of the system since

Keyphrase Extraction, being an advanced Informa-

tion Extraction task, takes time and processing a set of

hundreds of query results cannot be done in an inter-

active way. In order to address this issue, we decided

to let RES process retrieved documents only once, in

an asynchronous way, and save their representation

for later use. On the other hand, when the document

KPs are known, the recommendation algorithm pro-

posed is very efﬁcient and it is able to rank large sets

in a short time.

KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

Figure 2: Recommendation screenshot.

4 PROPOSED METHOD

In the RES system, both user proﬁle and document

content are represented by a network structure called

Context Graph (CG). For each document stored in

the SPC, a CG is built by processing its weighted KP

list. User proﬁles are represented by CGs built from

a pool of KPs belonging to one or more SPC docu-

ments marked by the user as interesting and, possibly,

enriched with other KPs gathered via relevance feed-

back, for example by providing a fragment of text or

a speciﬁc paper not previously included in the SPC,

or a speciﬁc list of KPs or keywords.

CGs are built by taking into account each single

term belonging to each KP; each term is stemmed and

then represented as a node of the graph; if two terms

belong to the same KP, their corresponding nodes are

connected by an arc. Both nodes and arcs are as-

signed a weight which is computed according to the

Keyphraseness values associated with each KP containing the corresponding terms. Figure 3 shows the small CG formed by the KP list [information filtering, adaptive web personalization, adaptive filtering, content based filtering, social web, web usage, collaborative filtering].
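The construction just described can be illustrated with a minimal Python sketch. It is not the RES implementation: `naive_stem` is a hypothetical stand-in for a real stemmer, the rule of summing Keyphraseness values into node and arc weights is an assumption (the text only says that weights are computed "according to" Keyphraseness), and the uniform weight of 1.0 given to every KP is made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def naive_stem(term):
    """Hypothetical stand-in for a real stemmer: lowercase and strip a few suffixes."""
    term = term.lower()
    for suffix in ("ing", "ion", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def build_context_graph(weighted_kps):
    """Build a Context Graph from (keyphrase, keyphraseness) pairs.

    Returns (nodes, arcs): nodes maps each stem to a weight, arcs maps each
    undirected pair of stems (stored in sorted order) to a weight.
    """
    nodes = defaultdict(float)
    arcs = defaultdict(float)
    for phrase, keyphraseness in weighted_kps:
        stems = sorted({naive_stem(term) for term in phrase.split()})
        for stem in stems:
            nodes[stem] += keyphraseness  # assumption: weights accumulate by summation
        for a, b in combinations(stems, 2):
            arcs[(a, b)] += keyphraseness  # terms of the same KP are linked by an arc
    return dict(nodes), dict(arcs)


# KP list of Figure 3, with an illustrative Keyphraseness of 1.0 for every KP.
kp_list = [
    ("information filtering", 1.0),
    ("adaptive web personalization", 1.0),
    ("adaptive filtering", 1.0),
    ("content based filtering", 1.0),
    ("social web", 1.0),
    ("web usage", 1.0),
    ("collaborative filtering", 1.0),
]
nodes, arcs = build_context_graph(kp_list)
```

In this toy graph the stem of "filtering" becomes the heaviest node because it occurs in four KPs, and it is linked to "adaptive" but not directly to "web".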

As new KPs are added into the CG, either by di-

rect article insertion or relevance feedback, both pro-

vided by the user, related concepts tend to link to-

gether, creating, in such a way, extensive networks of

terms. Consider for example the profile CG shown in Figure 4, which has been built from four articles dealing with 'Content-based Recommender Systems' and 'Information Extraction'. On the other hand, unrelated concepts form different, non-connected groups, as we can see in Figure 5, where two unrelated articles (the first dealing with Machine Learning, the second with Mechanical Engineering) are fed into a profile. If a user expresses multiple domains of interest in the profile, they will form different groups in the corresponding CG. This makes CGs expressive tools, able to model both short-term and long-term interests.

Figure 3: A simple Context Graph.
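The separate groups mentioned above correspond to the connected components of the graph. The following sketch shows one way to list them; the function name `interest_groups`, the graph representation (a collection of node names plus a collection of arcs given as pairs) and the example stems are assumptions made for illustration only.

```python
def interest_groups(nodes, arcs):
    """Return the connected components of a Context Graph as a list of sets."""
    adjacency = {node: set() for node in nodes}
    for a, b in arcs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first visit of everything reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


# Two unrelated articles: their terms never share a keyphrase, so two groups appear.
profile_nodes = ["machin", "learn", "gear", "shaft"]
profile_arcs = [("learn", "machin"), ("gear", "shaft")]
groups = interest_groups(profile_nodes, profile_arcs)
```

A single keyphrase bridging the two domains would add one arc between the groups and merge them into one component.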

CGs make it possible to create, for each term, a meaningful context of interest by simply checking its adjacency list. If, in two different documents, the same term is used in similar contexts (i.e. in the two respective CGs the same nodes are connected in the same or a similar way), it reasonably refers to the same concept, which indicates a certain degree of similarity between the two items. This mechanism also represents our solution to the problem of disambiguating polysemous terms.

When, as a result of a user-specified query, a set of documents is retrieved from CiteSeerX, RES extracts a list of KPs from each one of the retrieved articles, builds a CG for each KP list and generates a recommendation.

Recommendations are generated in three steps: Matching/Scoring, Ranking, and Presentation. In the first step every document (D) in the SPC is matched against the user profile (U) by calculating the following parameters: Coverage (C), Relevance (R) and Similarity (S).

Figure 4: A CG built from 4 articles dealing with related topics.

C represents the percentage of concepts in D

which are also of interest for the user, since they are

already included in the proﬁle U.

$$C(D, U) := \frac{\mathrm{sharedTerms}(D, U)}{\mathrm{totalTerms}(D)} \tag{1}$$

By default, if fewer than 10% of the document nodes match those in the user profile, the document is not ranked, since there are not enough shared nodes for a meaningful evaluation of the other two parameters.

R estimates the importance of the concepts shared

by the user proﬁle (U) and the document (D). It is

computed as the average tf-idf measure of the terms

corresponding to the shared nodes between the user

and the document CG with reference to the retrieved

document set.

$$R(D, U) := \frac{\sum_{i \in \mathrm{terms}(D) \cap \mathrm{terms}(U)} \text{tf-idf}(i, D)}{\mathrm{sharedTerms}(D, U)} \tag{2}$$

Finally, S is intended to assess the local overlap between the two CGs and to measure how relevant the shared arcs are, i.e. to determine how similar the contexts are in which shared terms are used: the stronger the shared association, the higher the score. S is computed by considering the sub-graph (ΠU) constituted by the shared nodes of the user CG; the parameter is evaluated as the sum of the weights (w) of the arcs in ΠU (E(ΠU)) which are also included in D (indicated as E(D)) divided by the overall weight of the arcs in ΠU.

$$S(D, U) := \begin{cases} 0 & \text{if } E(\Pi_U) = \emptyset \\[4pt] \dfrac{\sum_{i \in E(\Pi_U) \cap E(D)} w(i)}{\sum_{j \in E(\Pi_U)} w(j)} & \text{otherwise} \end{cases} \tag{3}$$
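As an illustration of Equations (1) to (3), the three parameters can be sketched in Python as follows. This is a reading of the formulas under stated assumptions, not the RES code: a CG is reduced to a set of stemmed terms plus its arcs, arcs are stored as sorted pairs of terms, the arc weights w are taken from the user CG, the tf-idf values of the document are assumed to be precomputed, and all function and variable names are hypothetical.

```python
def coverage(doc_terms, user_terms):
    """Equation (1): fraction of the document's terms that also occur in the profile."""
    if not doc_terms:
        return 0.0
    return len(doc_terms & user_terms) / len(doc_terms)


def relevance(doc_terms, user_terms, tfidf):
    """Equation (2): average tf-idf, within the document, of the shared terms."""
    shared = doc_terms & user_terms
    if not shared:
        return 0.0
    return sum(tfidf.get(term, 0.0) for term in shared) / len(shared)


def similarity(doc_terms, doc_arcs, user_terms, user_arcs):
    """Equation (3): weight of the arcs of Pi_U also present in D, over the weight of Pi_U."""
    shared = doc_terms & user_terms
    # E(Pi_U): arcs of the user CG whose two endpoints are both shared nodes.
    sub_arcs = {arc: w for arc, w in user_arcs.items()
                if arc[0] in shared and arc[1] in shared}
    total = sum(sub_arcs.values())
    if not sub_arcs or total == 0:
        return 0.0
    return sum(w for arc, w in sub_arcs.items() if arc in doc_arcs) / total


def score(doc_terms, doc_arcs, tfidf, user_terms, user_arcs, min_coverage=0.10):
    """Return (C, R, S), or None when coverage is below the default 10% threshold."""
    c = coverage(doc_terms, user_terms)
    if c < min_coverage:
        return None  # document is not ranked
    return (c,
            relevance(doc_terms, user_terms, tfidf),
            similarity(doc_terms, doc_arcs, user_terms, user_arcs))


# Toy data, invented for the example.
doc_terms = {"filter", "web", "adaptive", "usage"}
doc_arcs = {("adaptive", "filter"), ("usage", "web")}
doc_tfidf = {"filter": 0.1, "adaptive": 0.3, "web": 0.2, "usage": 0.5}
user_terms = {"filter", "adaptive", "web", "recommend"}
user_arcs = {("adaptive", "filter"): 3.0, ("filter", "web"): 1.0,
             ("filter", "recommend"): 2.0}

result = score(doc_terms, doc_arcs, doc_tfidf, user_terms, user_arcs)
```

Here three of the four document terms are shared, and only one of the two arcs of the shared sub-graph (the heavier one) also appears in the document, so both C and S come out at 0.75.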

S varies between 0 and 1. In this way, each document is considered a point in a 3-dimensional space where each dimension corresponds to one of the three above parameters. In the Ranking phase, the 3-dimensional space is subdivided into several subspaces according to the value ranges of the three parameters, thus identifying different regions in terms of potential interest for the user. High values for all three parameters identify an excellent potential interest, while values lower than specific thresholds decrease the potential interest. Five subspaces are identified, from excellent to not recommended, as shown in Figure 6, and each document is ranked according to where its three-dimensional representation is located. In the current experimental prototype, the interest threshold for each parameter can be adjusted at runtime, for fine tuning the matching algorithm. Finally, in the Presentation step, documents are sorted by descending ranking order and the top ones are suggested to the user; documents not ranked are put at the very bottom of the list. As shown in Figure 2, both matching and non-matching KPs are shown, and the user can provide relevance feedback for fine adjustments of the profile and for the inclusion of serendipitous concepts indicated by non-matching KPs.

Figure 5: A CG built from two articles dealing with non-related topics.

Figure 6: The five sub-spaces according to which items are ranked.
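The Ranking and Presentation steps can be sketched as below. The text does not spell out how the five sub-spaces are delimited, so the rule used here (counting how many of the three parameters reach a per-parameter threshold, with unranked documents forming the lowest class) is purely an assumption, as are the intermediate class labels, the threshold values and all names in the code.

```python
# Hypothetical class labels; only "excellent" and "not recommended" come from the text.
LABELS = ["not recommended", "poor", "fair", "good", "excellent"]


def rank_class(scores, thresholds=(0.3, 0.3, 0.3), min_coverage=0.10):
    """Map a (C, R, S) triple to a class index from 0 (worst) to 4 (best).

    Assumed rule: documents below the coverage minimum are not ranked (class 0);
    otherwise the class grows with the number of parameters reaching their threshold.
    """
    c, r, s = scores
    if c < min_coverage:
        return 0
    return 1 + sum(value >= limit for value, limit in zip((c, r, s), thresholds))


def present(scored_docs, top_n=10):
    """Sort documents by descending class; unranked documents end up at the bottom."""
    ordered = sorted(scored_docs.items(),
                     key=lambda item: rank_class(item[1]),
                     reverse=True)
    return [(doc_id, LABELS[rank_class(scores)]) for doc_id, scores in ordered[:top_n]]


# Toy (C, R, S) triples, invented for the example.
scored = {
    "a": (0.5, 0.4, 0.6),
    "b": (0.05, 0.9, 0.9),
    "c": (0.5, 0.1, 0.1),
}
recommendations = present(scored)
```

Because the thresholds are plain arguments, they can be changed between calls, which mirrors the runtime tuning mentioned above.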

5 EVALUATION

In the first development stage of the system, we have performed a limited number of offline formative tests, mainly aimed at experimenting with different system tunings. A set of over 300 scientific papers dealing with Recommender Systems and Adaptive Personalization was collected and classified by users, identifying 16 sub-topics. Later, 200 uncategorized documents dealing with several random ICT topics were added in order to create noise in the data set and the whole set was processed and stored in a test SPC. 250 user profiles were automatically generated for each one of the 16 topics using groups of 2, 4, 6, 10, and 15 relevant seed documents respectively; then, for each user profile, RES and an ad-hoc baseline reference system, based upon the well-known and established tf-idf metric, recommended ten items from the whole SPC. The baseline system produced its recommendations according to keyword vector models of both user interests and document contents, where keywords were extracted from texts according to their tf-idf, and recommendations were scored on the number of shared terms between the user and the document vector and their average weight (again, tf-idf). For each recommendation test run, every recommended item dealing with the same topic as the seed documents was considered a good recommendation. We have defined accuracy as the average fraction of good recommendations over the total number of recommended items. Results gathered so far are very promising, since RES outperforms the baseline mechanism by 0.12 to 0.15 in average accuracy for every tested number of seed documents, as shown in Table 1.

Table 1: Average accuracy results of comparative testing.

Seed documents   TF-IDF system   RES
2                0.42            0.57
4                0.53            0.66
6                0.55            0.70
10               0.60            0.72
15               0.60            0.72
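The accuracy measure defined above can be written down as a short Python sketch. The data layout (one pair per test run, holding the topic of the seed documents and the topics of the recommended items) and the topic names are assumptions made for the example; the numbers are invented and are not the experimental data.

```python
def accuracy(runs):
    """Average, over test runs, of the fraction of good recommendations.

    Each run is a pair (seed_topic, recommended_topics); a recommended item is
    good when its topic equals the topic of the seed documents.
    """
    per_run = [
        sum(topic == seed_topic for topic in recommended) / len(recommended)
        for seed_topic, recommended in runs
        if recommended  # skip runs that produced no recommendation
    ]
    return sum(per_run) / len(per_run) if per_run else 0.0


# Two illustrative runs of ten recommendations each.
runs = [
    ("collaborative filtering", ["collaborative filtering"] * 7 + ["noise"] * 3),
    ("keyphrase extraction", ["keyphrase extraction"] * 5 + ["noise"] * 5),
]
avg_accuracy = accuracy(runs)
```

With ten recommended items per run, as in the tests described above, this measure coincides with precision at 10 averaged over the runs.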

In particular, the first evaluation of RES highlights how the proposed method is able to discriminate among similar domains with very fine granularity. For example, in the test SPC we included a small set of documents dealing with 'segment injection attacks' (a particular kind of profile injection attack on collaborative recommenders, exploiting statistical market analysis to alter recommendations) together with several others dealing with various kinds of 'attack' on commercial recommender systems, such as 'random, average and bandwagon attacks'. When the two systems exploited in the evaluation phase were asked to recommend items similar to a limited number of articles extracted from that subset, the average RES accuracy was 0.59 while the average baseline accuracy was 0.15; Figure 7 shows the average accuracy results for this test. Such good results in this scenario may be a direct consequence of the high polysemy of terms such as 'attack' and 'segment', which RES handles by taking into account a significant and non-trivial context for each one of them.

Figure 7: Average accuracy of RES (dotted) and the baseline tf-idf system (solid) in the domain of 'segment injection attacks'.

Evaluation is ongoing and in the future it will ad-

dress the quality and the impact of the produced ex-

planations on user satisfaction.

6 CONCLUSIONS

Recommender systems can greatly facilitate the task of searching for scientific literature. However, by merely filtering collections of papers, state-of-the-art recommender systems still leave a heavy workload to researchers, who have to spend time and effort accessing the knowledge contained in scientific publications. In this paper we present a more semantic approach to the problem, aimed at the creation of a user model that is both based on actual concepts of interest and understandable. The presented RES system is still a testbed and evaluation is ongoing, but results gathered so far are encouraging, suggesting that our concept-based, human-understandable approach is able to generate accurate recommendations. Future work will be aimed at expanding our concept-based strategy by means of ontologies and, eventually, folksonomies, exploiting different sources of knowledge in order to identify synonymous terms and phrases, suggest to users new concepts related to the ones they consider interesting, and overcome the limitations of a pure content-based approach. Finally, we will also address the possible advantages of applying our ideas in other scenarios, such as the recommendation of news, patents or legal documents.

REFERENCES

Bogers, T. and Van den Bosch, A. (2008). Recommending scientific articles using citeulike. In Proceedings of the 2008 ACM conference on Recommender systems, pages 287–290, New York, NY, USA. ACM.

Bollacker, K. D., Lawrence, S., and Giles, C. L. (2000). Discovering relevant scientific literature on the web. Intelligent Systems and their Applications, IEEE, 15(2):42–47.

Chandrasekaran, K., Gauch, S., Lakkaraju, P., and Luong, H. P. (2008). Concept-based document recommendations for citeseer authors. In Proceedings of the 5th international conference on Adaptive Hypermedia and Adaptive Web-Based Systems, AH ’08, pages 83–92, Berlin, Heidelberg. Springer-Verlag.

Ferrara, F., Pudota, N., and Tasso, C. (2011). A keyphrase-based paper recommender system. In Agosti, M., Esposito, F., Meghini, C., and Orio, N., editors, Digital Libraries and Archives, volume 249 of Communications in Computer and Information Science, pages 14–25. Springer Berlin Heidelberg.

Ferrara, F. and Tasso, C. (2011). Extracting and exploiting topics of interests from social tagging systems. Adaptive and Intelligent Systems, pages 285–296.

Ferrara, F. and Tasso, C. (2013). Extracting keyphrases from web pages. In Agosti, M., Esposito, F., Ferilli, S., and Ferro, N., editors, Digital Libraries and Archives, volume 354 of Communications in Computer and Information Science, pages 93–104. Springer Berlin Heidelberg.

Govindaraju, V. and Ramanathan, K. (2012). Similar document search and recommendation. Journal of Emerging Technologies in Web Intelligence, 4(1):84–93.

Huynh, T., Hoang, K., Do, L., Tran, H., Luong, H. P., and Gauch, S. (2012). Scientific publication recommendations based on collaborative citation networks. In Smari, W. W. and Fox, G. C., editors, CTS, pages 316–321. IEEE.

Jiang, Y., Jia, A., Feng, Y., and Zhao, D. (2012). Recommending academic papers via users’ reading purposes. In Proceedings of the sixth ACM conference on Recommender systems, RecSys ’12, pages 241–244, New York, NY, USA. ACM.

PersonalizedRecommendationandExplanationbyusingKeyphrasesAutomaticallyextractedfromScientificLiterature

103