Informativeness-based Keyword Extraction from Short Documents

Mika Timonen (1,3), Timo Toivanen, Yue Teng, Chao Cheng and Liang He

VTT Technical Research Centre of Finland, PO Box 1000, 02044 Espoo, Finland

East China Normal University, Institute of Computer Applications, No.500 Dongchuan Road, 200241 Shanghai, China

Department of Computer Science, University of Helsinki, Helsinki, Finland

Keywords:

Keyword Extraction, Machine Learning, Short Documents, Term Weighting, Text Mining.

Abstract:

With the rise of user created content on the Internet, the focus of text mining has shifted. Twitter messages

and product descriptions are examples of new corpora available for text mining. Keyword extraction, user

modeling and text categorization are all areas that are focusing on utilizing this new data. However, as the

documents within these corpora are considerably shorter than in the traditional cases, such as news articles,

there are also new challenges. In this paper, we focus on keyword extraction from documents such as event and

product descriptions, and movie plot lines that often hold 30 to 60 words. We propose a novel unsupervised

keyword extraction approach called Informativeness-based Keyword Extraction (IKE) that uses clustering and

three levels of word evaluation to address the challenges of short documents. We evaluate the performance

of our approach by using manually tagged test sets and compare the results against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. We also evaluate the precision and effectiveness of the extracted keywords for user modeling and recommendation and report the results of all approaches. In all of the experiments IKE outperforms the competition.

1 INTRODUCTION

As there is more and more user-created content on the Internet, short documents have become an important corpus in several text mining areas. The most relevant sources of short documents currently are product descriptions, Twitter messages, consumer feedback and blogs. In most cases, these documents have fewer than 100 words and contain only a few sentences. For example, Twitter messages contain at most 140 characters (around 20 words).

Keyword extraction, also known as keyphrase extraction, is an area of text mining that aims to identify the most informative and important words and/or phrases, also called terms, of the document. It has uses in several different domains, including text summarization, text categorization, document tagging and recommendation systems.

The challenge with keyword extraction is similar to the challenge of feature weighting in text categorization, as both aim to assess the impact of the words in the document. In text categorization the weights are used when training the classifier, whereas keyword extraction uses the weights to find the most important words. Timonen (Timonen et al., 2011a; Timonen, 2012) identified differences between categorization of short and long documents. These differences are also relevant to keyword extraction. The most obvious difference comes from the number of words in each document and in the whole corpus. This results in a challenge identified by Timonen as the TF=1 challenge; i.e., each word occurs only once in a document. Because of this challenge, approaches that rely on the difference between term frequency and document frequency become less effective (Timonen, 2012).

Footnote: In this paper, we use the term keyword extraction to refer to the extraction of both keywords and keyphrases.

The traditional keyword extraction approaches of-

ten rely heavily on term frequency. For example,

Term Frequency - Inverse Document Frequency (TF-

IDF), KeyGraph from Ohsawa et al. (1998), and

a Chi-Squared based approach from Matsuo and

Ishizuka (2003) rely on word co-occurrence and word

frequencies. All of these studies have focused on

longer documents such as news articles or scientiﬁc

articles. For example, the traditional test set of news

articles is the Reuters news archive (http://www.daviddlewis.com/resources/testcollections/reuters21578/), which contains 160 words per document on average.


Timonen M., Toivanen T., Teng Y., Chen C. and He L..

Informativeness-based Keyword Extraction from Short Documents.

DOI: 10.5220/0004130704110421

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (SSTM-2012), pages 411-421

ISBN: 978-989-8565-29-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Extracting keywords from short documents is not a trivial challenge. The simplest approach may be to extract all the words, or only all the nouns, from the document. However, this is rarely beneficial as it results in extracting words that are significantly less informative than others. In Section 4 of this paper we show that these irrelevant words significantly reduce the precision of user models and recommendation. In addition, the reduction in time consumption in applications that use the extracted keywords is significant when there are fewer keywords. For example, when using an approach for domain modeling and query expansion that maps each of the co-occurring keywords together, each additional keyword will greatly increase the time consumption (Timonen et al., 2011b).

In this paper, we propose a novel unsupervised approach for keyword extraction from short documents. The aim of our work is to use the extracted keywords in user models and in a recommendation system. We have based our work on the CollabRank keyword extraction approach by Wan and Xiao (2008) and on feature weighting in short documents by Timonen et al. (2011a). The idea is to assess the importance (informativeness) of each word of a document on three levels: corpus level, cluster level and document level. In the corpus level analysis we want to find words that are informative when considering all the documents. To find words that are informative in a smaller set, we cluster the documents and evaluate the informativeness inside these clusters. The aim is to group similar documents together and find words that are common within the cluster and uncommon outside the cluster. Finally, the informative words within the document are extracted using the results of the two previous levels. We call this approach Informativeness-based Keyword Extraction (IKE).

We evaluate our approach using three different datasets: movie descriptions, event descriptions, and company descriptions. We use data in three different languages: English, Finnish and Chinese. We manually annotated approximately 300 documents from each dataset and compared the precision and recall of the keywords extracted by our approach against several other keyword extraction methods such as CollabRank, KeyGraph, TF-IDF and Matsuo's Chi-Squared. The problem with this evaluation was that the annotated keywords were subjective to the annotator, which resulted in an agreement rate of under 70 % among annotators. Therefore we conducted another experiment where we used the extracted keywords for user modeling and evaluated the recommendation precision with the models. This evaluation was objective and demonstrates the utilization of the extracted keywords. In all of the experiments our approach outperformed the competition.

This paper makes the following contributions: (1)

a novel keyword extraction approach for short docu-

ments based on three level word analysis, (2) a novel

approach for corpus level word informativeness as-

sessment for short documents, and (3) a comprehen-

sive evaluation of keyword extraction approaches us-

ing short documents as the corpus.

This paper is organized as follows: in Section 2 we

discuss related approaches. In Section 3 we describe

our approach. In Section 4 we perform experimental

evaluation and compare the results against other key-

word extraction approaches. We conclude the paper

in Section 5.

2 RELATED WORK

Several authors have presented keyword extraction

approaches in recent years. The methods often use

supervised learning. In these cases the idea is to use

a predeﬁned seed set of documents as a training set

and learn the features for keywords. The training set

is built by manually tagging the documents for key-

words.

One approach that uses supervised learning is called Kea (Frank et al., 1999; Witten et al., 1999). It uses Naive Bayes learning with Term Frequency - Inverse Document Frequency (TF-IDF) and normalized term positions as the features. The approach was further developed by Turney (2003), who included keyphrase cohesion as a new feature. One of the latest updates to Kea was made by Nguyen and Kan (2007), who included linguistic information such as section information as features.

Before working on the Kea approach, Turney experimented with two other approaches: the decision tree algorithm C4.5 and an algorithm called GenEx (Turney, 2000). GenEx is an algorithm that has two components: the hybrid genetic algorithm Genitor, and Extractor. The latter is the keyword extractor, which has twelve parameters that need to be tuned. Genitor is used for finding the optimal values of these parameters from the training data.

Hulth et al. (2001) describe a supervised approach that utilizes domain knowledge found in a thesaurus, and TF-IDF statistics. Later, Hulth included linguistic knowledge and different models to improve the performance of the extraction process (Hulth, 2003, 2004). The models use four different attributes: term frequency, collection frequency, relative position of the first occurrence, and Part-of-Speech tags.

Ercan and Cicekli (2007) describe a supervised learning approach that uses lexical chains for extraction. The idea is to find semantically similar terms, i.e., lexical chains, from text and utilize them for keyword extraction.

KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval

There are also approaches that do not use supervised learning but rely on term statistics instead. KeyGraph is an approach described by Ohsawa et al. (1998) that uses neither POS-tags, a large corpus, nor supervised learning. It is based on term co-occurrence, graph segmentation and clustering. The idea is to find important clusters from a document and assume that each cluster holds keywords. Matsuo and Ishizuka (2003) describe an approach that also uses a single document as its corpus. The idea is to use the co-occurrences of frequent terms to evaluate if a candidate keyword is important for a document. The evaluation is done using the Chi-squared ($\chi^2$) measure. All of these approaches are designed for longer documents and they rely on term frequencies.

There are some approaches developed that ex-

tract keywords from abstracts. These abstracts of-

ten contain 200-400 words making them considerably

longer than documents in our corpus. One approach,

which is presented by HaCohen-Kerner (2003), uses

term frequencies and importance of sentences for ex-

tracting the keywords. Later, HaCohen-Kerner et al.

(2005) continue the work and include other statistics

as well. Andrade and Valencia (1998) use Medline

abstracts for extracting protein functions and other bi-

ological keywords. Previously mentioned work done

by Ercan and Cicekli (2007) also uses abstracts as the

corpus. Finally, SemEval-2010 had a task that aimed at extracting keywords from scientific articles. Kim et al. (2010) present the findings of the task.

Keyphrase extraction is sometimes used as a synonym for keyword extraction, but it can also differ from it by aiming to extract n-grams, i.e., word groups that are in the form of phrases (e.g., "digital camera"). Yih et al. (2006) present a keyphrase extraction approach for finding keyphrases from web pages. Their approach is based on document structure and word locations. It is directed to keyphrases instead of just words, as it aims to identify word groups as well as single words. They use logistic regression for training a classifier for the extraction process. Tomokiyo and Hurst (2003) present a language model based approach for keyphrase extraction. They use pointwise Kullback-Leibler divergence between language models to assess the informativeness and "phraseness" of the keyphrases.

Wan and Xiao (2008) describe an unsupervised

approach called CollabRank that clusters the docu-

ments and extracts the keywords within each cluster.

The assumption is that documents with similar topics contain similar keywords. The keywords are extracted in two levels. First, the words are evaluated in the cluster level using a graph-based ranking algorithm similar to PageRank (Page et al., 1998). After this, the

words are scored on the document level by summing

the cluster level saliency scores. In the cluster level

evaluation POS-tags are used to identify suitable can-

didate keywords; the POS-tags are also used when as-

sessing if the candidate keyphrases are suitable. Wan

and Xiao use news articles as their corpus.

Assessing the term informativeness is an impor-

tant part of keyword extraction. Rennie and Jaakkola

(2005) have surveyed term informativeness measures

in the context of named entity recognition and con-

cluded that Residual IDF produced the best results for

their case. Residual IDF is based on the idea of comparing the word's observed Inverse Document Frequency (IDF) against its predicted IDF ($\widehat{IDF}$) (Clark and Gale, 1995). Predicted IDF is assessed using the term

frequency and assuming a random distribution of the

term in the documents (Poisson model). If the differ-

ence between the two IDF measures is large, the word

is informative.
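The Residual IDF idea described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited papers: the function name, the base-2 logarithm and the example counts are assumptions, and the predicted IDF uses the standard Poisson estimate of the probability that a document contains the term at least once.

```python
import math


def residual_idf(doc_freq: int, coll_freq: int, n_docs: int) -> float:
    """Observed IDF minus the IDF predicted by a Poisson model.

    doc_freq:  number of documents that contain the term
    coll_freq: total number of occurrences of the term in the corpus
    n_docs:    number of documents in the corpus
    """
    observed = -math.log2(doc_freq / n_docs)
    # With Poisson rate coll_freq / n_docs, a document contains the term
    # at least once with probability 1 - exp(-rate).
    predicted = -math.log2(1.0 - math.exp(-coll_freq / n_docs))
    return observed - predicted


# A term concentrated in few documents (100 occurrences in 10 documents)
bursty = residual_idf(doc_freq=10, coll_freq=100, n_docs=1000)
# A term that occurs once in each of its 10 documents (the TF=1 case)
flat = residual_idf(doc_freq=10, coll_freq=10, n_docs=1000)
```

In the second case the residual is close to zero, which illustrates why a measure of this kind carries little information when nearly every word occurs only once per document.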

Finally, Timonen et al. (2011a) have studied cate-

gorization of short documents. They conclude that the

existing approaches used for longer documents do not

perform as well with short documents. They propose

a feature weighting approach that is designed to produce better results with short documents. We will describe this approach in Section 3 as we base the cluster

level word evaluation on their work.

3 KEYWORD EXTRACTION FROM SHORT DOCUMENTS

We have based our approach on the previous works

done by Wan and Xiao (2008), and Timonen et al.

(2011a). From the former we use the idea of multi-

level word assessment through clustering, and from

the latter we use the term weighting approach. The

term weighting approach described by Timonen et al.

is designed for short documents which made it rele-

vant for this case also.

The extraction process has two steps: (1) prepro-

cessing that includes document clustering, and (2)

word informativeness evaluation. The latter is divided

into three levels of evaluation: corpus level, cluster

level and document level, and it aims to identify and

extract the most informative words of each level. The

input for the process is the set of documents (corpus).

The process produces a set of keywords for each doc-

ument as its output.

Informativeness-basedKeywordExtractionfromShortDocuments

413

3.1 Problem Description

We deﬁne a short document as a document that con-

tains no more than 100 words, which is equal to a very

short scientiﬁc abstract. Depending on the dataset, the

documents are often much shorter: for example, Timonen (2012) uses a Twitter dataset that holds 15 words

on average. In our work, we concentrate on event and

movie descriptions that have 30 to 60 words per de-

scription. These word counts are considerably less

than in corpora previously used in keyword extrac-

tion.

The descriptions contain information about events

or movies in a concise way. In the case of events, the

information consists of type of the event, possible per-

formers, and other information that may be relevant

for the reader. The movie descriptions hold informa-

tion about the movie such as plot, actors, director and

genre of the movie.

The aim is to extract the relevant information from the description. The following is an example of an actual event description: "International contemporary art exhibition from the collection of UPM-Kymmene. The collection focuses on German art from this century. It consists of paintings and drawings from several internationally noted artists such as Markus Lupertz, A.R. Penck and Sigmar." We want to extract the following words to capture the key information of this event: "contemporary", "German", "art", "paintings", "drawings", "Markus Lupertz", "A.R. Penck", "Sigmar".

Due to the large variation in content and the ardu-

ous task of building a comprehensive training set, we

focus our efforts on unsupervised approaches. The

biggest challenge with unsupervised learning is the

fact that most words occur only once per document.

In fact, the more often a word occurs in a document, the less informative it usually is. This makes the traditional approaches that rely on term frequency less effective.

During our initial evaluations we noticed that doc-

ument frequency alone does not ﬁnd important words.

Some informative words, such as the performer of the event, may occur only once in the whole corpus. However, some important words, such as the event type (e.g., "rock concert"), may occur often within the corpus. Both of these are important for the document,

and for the reader, but this information cannot be cap-

tured using only document frequency. In the follow-

ing sections we propose an approach that takes these

challenges and requirements into consideration when

extracting keywords from short documents.

3.2 Preprocessing and Clustering

The first task of the extraction is preprocessing. The text needs to be cleaned of noise, which usually consists of uninformative characters and stop words. For this we use a filter that removes words with fewer than 3 characters, and a stop word list. In addition, we need to identify important noun phrases such as names. That is, we group the first and last names together to form phrases. It should be noted that named entity recognition software is freely available for English, but for Finnish we had to implement our own. For this, we use a naive approach: if two or more consecutive words each start with a capital letter, the words are tagged as a proper noun group (e.g., "Jack White"). In addition, if there is a connecting word like "and" between two words that start with a capital letter, we also tag them as a group (e.g., "Rock and Roll"). For Chinese, we do not use noun phrase tagging. We do not use any other noun phrase identification.
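The naive proper noun grouping and the word filter described above could look roughly like the following sketch. The tokenizer, the tiny `STOP_WORDS` and `CONNECTORS` sets and the function names are illustrative assumptions and not the lists or code used in the experiments.

```python
import re

CONNECTORS = {"and"}                                # assumed connecting words
STOP_WORDS = {"the", "and", "for", "from", "with"}  # tiny illustrative list


def group_proper_nouns(tokens):
    """Merge runs of capitalized tokens, optionally joined by a connector."""
    groups, i = [], 0
    while i < len(tokens):
        if not tokens[i][:1].isupper():
            groups.append(tokens[i])
            i += 1
            continue
        run, j = [tokens[i]], i + 1
        while j < len(tokens):
            if tokens[j][:1].isupper():
                run.append(tokens[j])
                j += 1
            elif (tokens[j].lower() in CONNECTORS and j + 1 < len(tokens)
                  and tokens[j + 1][:1].isupper()):
                # connector between two capitalized words, e.g. "Rock and Roll"
                run.extend(tokens[j:j + 2])
                j += 2
            else:
                break
        groups.append(" ".join(run))
        i = j
    return groups


def preprocess(text):
    """Tokenize, group proper nouns, then drop short words and stop words."""
    terms = group_proper_nouns(re.findall(r"\w+", text))
    return [t for t in terms if len(t) >= 3 and t.lower() not in STOP_WORDS]


terms = preprocess("Jack White plays Rock and Roll in a club")
```

The grouping runs before the stop word filter in this sketch so that a connector inside a group such as "Rock and Roll" is not removed first. A capitalized word at the start of a sentence will also be merged with a following capitalized word, which is one of the limitations of such a naive rule.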

The term evaluation approach that we will describe in the next section requires clustering of the documents. We use Agglomerative (CompleteLink) clustering, which produced the best results for Wan and Xiao (2008). CompleteLink is a bottom-up clustering approach where, at the beginning, each document forms its own cluster. The most similar clusters are joined in each iteration until there are at most k clusters. The similarity between the clusters $c_i$ and $c_j$ is the minimum similarity between two documents $d_i$ ($d_i \in c_i$) and $d_j$ ($d_j \in c_j$):

$$sim(c_i, c_j) = \min_{d_i \in c_i,\, d_j \in c_j} sim(d_i, d_j), \quad (1)$$

where $sim(d_i, d_j)$ is the cosine similarity of the documents. We use a similar approach but have made a small modification: the most similar clusters are joined if the minimum similarity between documents in the two clusters is above a given threshold $t_c$. That is, we do not use a predefined number of clusters k but let the cluster count vary among datasets. The algorithm stops when there are no more clusters to be joined, i.e., when there are no clusters with similarity above $t_c$. We found that this approach performs better in our case. In addition, we use the Inverse Document Frequency to weight the words before the similarity is calculated, as this makes documents that have matching rare words more similar than documents with matching common words. We feel that this does not affect the overall result of the process, as it only benefits the clustering by decreasing the impact of more frequent words.

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

414

3.3 Word Informativeness Evaluation

We consider that short documents contain two different types of keywords: ones that are more abstract and therefore more common, and ones that are more expressive and therefore rarer. The more common keywords usually describe the text as a whole; for example, terms like 'Rock and Roll' and 'Action Movie' define the content of the document on a more abstract level. Words like 'Aerosmith' and 'Rambo' give a more detailed description. Both of these levels are important, as without them important information would be missing. This is true especially when the keywords are used by a computer in an automated system instead of being presented directly to humans.

We address this issue by evaluating the words on two different levels: the corpus level and the cluster level. Document level analysis uses the results from the other levels. On the corpus level, the informative words are the ones that have an optimal frequency, and on the cluster level the ones that appear often in a single cluster and rarely in other clusters.

3.3.1 Corpus Level Word Evaluation

The aim of the corpus level evaluation is to find words that define the document on a more abstract level. These words tend to be more common than the more expressive words but they should not be too common either. For example, we want to find terms like 'Rock and Roll' instead of just 'event' or 'music'. Our hypothesis is that the most informative words in the corpus level are those that are neither too common nor too rare in the corpus; however, an informative word will more likely be rare than common.

In order to find these types of words we rely on word frequency in the corpus level ($tf_c$). As in most cases when using a corpus of short documents, the term frequency within a document ($tf_d$) for each term is 1. Therefore, document frequency (df) is $df = tf_c$ and, for example, with Residual IDF, observed IDF approximately equals expected IDF. However, even if Residual IDF is not a good option with short documents, we use its basic idea: we want to find words that have an IDF close to the assumed optimal value. The greater the difference between the observed IDF and the assumed optimal IDF is, the less informative the word is in the corpus level. This is the inverse of the assumption used in Residual IDF. In addition, we substitute the expected IDF with the expected optimal IDF.

To get the corpus level score $s_{corpus}(t)$ we use an approach we call Frequency Weighted IDF ($IDF_{FW}$). It is based on the idea of updating the observed IDF using Frequency Weight (FW):

$$IDF_{FW}(t) = IDF(t) - FW(t), \quad (2)$$

where FW(t) is the assumed optimal IDF described below.

The intuition behind FW is to penalize words when the corpus level term frequency does not equal the estimated optimal frequency $n_o$. Equation 3 shows how FW is calculated: the penalty is calculated as IDF but we use $n_o$ as the document count |D| and $tf_c$ as df. This affects the IDF so that all the term frequencies below $n_o$ will get a positive value, if $tf_c = n_o$ no penalty will be given, and when $tf_c > n_o$ the value will be negative. To give a penalty in both cases, we need to take the absolute value of the penalty:

$$FW(t) = \alpha \times \left| \log \frac{n_o}{tf_c} \right|. \quad (3)$$

Even though FW will be larger with small term frequencies, IDF will also be larger. In fact, when $tf_c = df$ and $tf_c < n_o$, the result without α (i.e., with α = 1) is $IDF_{FW}(t_1) = IDF_{FW}(t_2)$ even if $tf_c(t_1) < tf_c(t_2)$, for all $tf_c < n_o$. We use α to overcome this issue and give a small penalty when $tf_c < n_o$; α = 1.1 is used in our experiments.
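The role of α can be checked with a short derivation. It assumes that the Frequency Weight has the form $FW(t) = \alpha \times |\log(n_o / tf_c)|$, where $n_o$ is the estimated optimal frequency and $tf_c$ the corpus level term frequency (the argument of the logarithm is reconstructed from the description in the text, so treat it as an assumption), and it considers the short-document case $tf_c = df < n_o$:

```latex
\begin{align*}
IDF_{FW}(t) &= \log\frac{|D|}{tf_c} - \alpha \log\frac{n_o}{tf_c} \\
            &= \log\frac{|D|}{n_o} - (\alpha - 1)\,\log\frac{n_o}{tf_c}.
\end{align*}
```

With α = 1 the second term vanishes, so every word rarer than $n_o$ receives the same score $\log(|D| / n_o)$. With α = 1.1 the second term subtracts a small amount that grows as $tf_c$ falls further below $n_o$, which is the small penalty mentioned above.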

An important part of the equation is the selection of $n_o$. We use a predefined fraction of the corpus size: $n_o = 0.03 \times |D|$. That is, we consider that a word is optimally important in the corpus level when it occurs in 3 % of documents. This number was decided after empirical evaluation and experimentation with the event dataset, and it has produced good results in all of our experiments. It may nevertheless be beneficial to change this value when using different datasets.

This approach has two useful features. First, it also considers df in the evaluation. That is, in the rare occasions when $tf_c < n_o$ and $df < tf_c$, these words are emphasized. Second, $IDF_{FW}$ is considerably smaller when $tf_c > n_o$ than when $tf_c < n_o$, which is preferred as we consider less frequent words more informative than more frequent words. That is, this emphasizes rare words over common words. As this approach has the functionality we require, and as we did not find any existing approaches that fulfill the given requirements, we consider this approach the most suitable option for the corpus level assessment.
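A minimal sketch of the corpus level score follows. It assumes a base-2 logarithm for both IDF and FW (the section does not state the base) and the reading of the Frequency Weight as α times the absolute log ratio of the optimal frequency to the corpus term frequency; the function and parameter names are hypothetical.

```python
import math


def idf_fw(tf_c, df, n_docs, alpha=1.1, optimal_fraction=0.03):
    """Frequency Weighted IDF: observed IDF minus the frequency penalty FW.

    tf_c: corpus level term frequency, df: document frequency,
    n_docs: corpus size |D|. The log base (2) is an assumption.
    """
    n_o = optimal_fraction * n_docs            # estimated optimal frequency
    idf = math.log2(n_docs / df)
    fw = alpha * abs(math.log2(n_o / tf_c))    # penalty for deviating from n_o
    return idf - fw


# Corpus of 1000 documents, so the optimal frequency is 30
at_optimum = idf_fw(tf_c=30, df=30, n_docs=1000)
rare = idf_fw(tf_c=3, df=3, n_docs=1000)
common = idf_fw(tf_c=300, df=300, n_docs=1000)
```

In this example the word at the optimal frequency scores highest, the rare word is penalized slightly, and the common word is penalized heavily, matching the preference for rare over common words described above.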

3.3.2 Cluster Level Word Evaluation

Next, the word's informativeness is assessed in the cluster level. The idea is to group similar documents together and find words that are important for the group: if the word w appears often in the cluster c and not at all in other clusters (C \ c), the word is informative in this cluster. The cluster level informativeness is assessed with an approach similar to the word informativeness assessment used by Timonen et al. (2011a), where the idea is to assess the word's Term-Corpus Relevance (TCoR) and Term-Category Relevance (TCaR) as defined below.

In order to assess TCoR and TCaR, each document is broken into smaller pieces called fragments. The text fragments are extracted from sentences using breaks such as question mark, comma, semicolon and other similar characters. In addition, words such as 'and', 'or', and 'both' are also used as breaks. For example, the sentence "Photo display and contemporary art exhibition at the central museum" consists of two fragments, "photo display" and "contemporary art exhibition at the central museum".
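Fragment extraction of this kind can be sketched with a single regular expression. The exact set of break characters and break words below is an assumption based on the examples in the text, and the function name is hypothetical.

```python
import re

BREAK_WORDS = ("and", "or", "both")  # assumed list, taken from the examples
_SPLIT = re.compile(
    r"[?!.,;:]|\b(?:" + "|".join(BREAK_WORDS) + r")\b", re.IGNORECASE
)


def fragments(text):
    """Split text into lower-cased fragments at punctuation and break words."""
    parts = (p.strip().lower() for p in _SPLIT.split(text))
    return [p for p in parts if p]


frags = fragments(
    "Photo display and contemporary art exhibition at the central museum"
)
```

Splitting on every full stop also breaks abbreviations such as "A.R." into pieces, so a practical version would probably group names before fragmenting.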

There are two features used for assessing TCoR: inverse average fragment length (fl) and inverse category count (ic). Average fragment length is based on the idea that if a word occurs in short text fragments it is more likely to be informative. The average fragment length is calculated simply by taking the average of the word counts of the fragments in which the word appears. Inverse category count is used to give more emphasis to words that appear in a single category. It is calculated by taking the inverse of the count of the categories (clusters in our case) in which the word appears. For example, if the word occurs in two clusters, ic is 0.5. Term-Corpus Relevance is the average of these two values:

$$TCoR(t) = \frac{1}{2} \left( \frac{1}{avg(l_f(t))} + \frac{1}{c_t} \right), \quad (4)$$

where $avg(l_f(t))$ is the average length of the text fragments in which the word t appears, and $c_t$ is the count of clusters in which the word t appears.

Term-Category Relevance, which in this case should be called Term-Cluster Relevance, evaluates the word's informativeness among clusters. The idea is to identify words that occur often within the cluster and rarely in other clusters. The more often the word occurs within the cluster, and the fewer clusters the word appears in, the higher the TCaR score. The score consists of two probabilities: $P(c|t)$, the probability that the word t occurs within the category c, and $P(d_t|c)$, the probability of the word t within the category c. The former, presented in Equation 5, takes the distribution of the word's occurrences among all the categories:

$$P(c|t) = \frac{|d_{t,c}|}{|D_t|}. \quad (5)$$

The probability is calculated simply by taking the number of documents $|d_{t,c}|$ in the cluster's document set $D_c$ that contain the word t and dividing it by the total number of documents $|D_t|$ in which the word t appears ($D_t$ is the set of documents containing the word t).

The probability of the word within the cluster, shown in Equation 6, takes the distribution of the word t within the cluster:

$$P(d_t|c) = \frac{|d_{t,c}|}{|D_c|}, \quad (6)$$

where $|d_{t,c}|$ is the number of documents with the word t within the cluster c and $|D_c|$ is the total number of documents within c. TCaR(t, c) for the word t in the cluster c is calculated as follows:

$$TCaR(t, c) = (P(c|t) + P(d_t|c)). \quad (7)$$

The two scores TCoR and TCaR are combined when the cluster level score is calculated:

$$s_{cluster}(t, c) = TCoR(t) \times TCaR(t, c). \quad (8)$$

The result of the cluster level evaluation is a score for each word and for each of the clusters it appears in. If the word appears only in a single cluster, the weight will be considerably higher than if it appeared in two or more clusters. Even though the cluster level word evaluation does some corpus level evaluation as well, we found that the results are often better when using both TCoR and TCaR instead of only TCaR. This is due to the fact that the approaches used here are complementary. More information and the intuition behind the metrics can be found in the paper by Timonen et al. (2011a).
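The cluster level scoring can be sketched as below. Two readings are assumptions here: TCoR is taken as the average of the inverse average fragment length and the inverse cluster count (as the text describes it), and TCaR is taken as the plain sum of the two probabilities. The input layout (one list of tokenized fragments per document plus a cluster label per document) and the function name are hypothetical.

```python
from collections import defaultdict


def cluster_scores(frag_docs, labels):
    """Return {(term, cluster): TCoR(term) * TCaR(term, cluster)}.

    frag_docs[i]: fragments of document i, each a list of tokens
    labels[i]:    cluster id of document i
    """
    frag_lengths = defaultdict(list)  # term -> lengths of fragments holding it
    docs_with = defaultdict(set)      # term -> ids of documents holding it
    cluster_size = defaultdict(int)   # cluster id -> number of documents
    for i, (frags, c) in enumerate(zip(frag_docs, labels)):
        cluster_size[c] += 1
        for frag in frags:
            for t in set(frag):
                frag_lengths[t].append(len(frag))
                docs_with[t].add(i)

    scores = {}
    for t, ids in docs_with.items():
        term_clusters = {labels[i] for i in ids}
        avg_len = sum(frag_lengths[t]) / len(frag_lengths[t])
        # assumed reading: average of the two inverse values
        tcor = 0.5 * (1 / avg_len + 1 / len(term_clusters))
        for c in term_clusters:
            d_tc = sum(1 for i in ids if labels[i] == c)
            # assumed reading: plain sum of P(c|t) and P(d_t|c)
            tcar = d_tc / len(ids) + d_tc / cluster_size[c]
            scores[(t, c)] = tcor * tcar
    return scores


frag_docs = [
    [["rock", "concert"], ["live", "band"]],
    [["rock", "concert"]],
    [["art", "exhibition"], ["rock", "sculptures"]],
]
scores = cluster_scores(frag_docs, labels=[0, 0, 1])
```

In this toy example "art" appears only in cluster 1 and therefore scores higher there than "rock", which is spread over both clusters.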

3.3.3 Document Level Word Evaluation

The ﬁnal step of the process is to extract the keywords

from the documents. For ﬁnding the most important

words of the document, we use the word scores from

the previous analysis. The idea is to extract the words

that are found informative on either the corpus level

or the cluster level; or preferably on both.

Before calculating the document level scores, the corpus level scores are normalized to the range [0,1]. This makes them comparable with the cluster level scores. The normalization is done by taking the maximum corpus level word score in the document and dividing each score by the maximum value. Equation 9 shows the normalization of $s_{corpus}$:

$$s_{n,corpus}(t) = \frac{IDF_{FW}(t)}{\max_{t_d \in d} IDF_{FW}(t_d)}. \quad (9)$$

After normalization, the word with the highest $IDF_{FW}(t)$ has the score 1.

The document level score $s_{doc}$ for word t in document d, which belongs to cluster c, is calculated by taking the weighted average of the cluster level score $s_{cluster}(t, c)$ and the normalized corpus level score $s_{n,corpus}(t)$. This is shown in Equation 10:

$$s_{doc}(t, d) = \beta \times s_{cluster}(t, c) + (1 - \beta) \times s_{n,corpus}(t), \quad (10)$$

where c is the cluster of the document d, and β indicates the weight that is used for giving more emphasis to either the cluster or the corpus level score. To give more emphasis to the cluster level scores, we should use β > 0.5, and vice versa. We use a weighted average for two reasons: we get scores that vary between [0,1], and the effect of a low cluster level or corpus level score is not as drastic as it would be, for example, when the two scores are multiplied. The latter is important as we also want words that score highly on either level and not just on both levels.

As we noticed that keywords often occur in the beginning of the document, we included a distance factor d(t) that is based on the same idea used in Kea (Frank et al., 1999). The distance indicates the location of the word from the beginning of the document. It is calculated by taking the number of words that precede the word's first occurrence in the document, dividing it by the length of the document, and subtracting the result from 1:

$$d(t) = 1 - \frac{i(t)}{|d|}, \quad (11)$$

where |d| is the number of words in the document and i(t) is the index of the word's first occurrence in the document. The index starts from 0.

As Part-of-Speech tags are often useful in keyword extraction we use them in the document level scoring as an option. Due to the fact that POS-taggers are not freely available for all the languages we include the POS-tags only as an option, i.e., our approach can easily be used without a POS-tagger. We use the POS-tags as another weighting option for words: different tags get a different POS-weight ($w_{POS}$) in the final score calculation.

The simplest approach is to give weight 1.0 to all tags that are accepted, such as NP and JJ (nouns and adjectives), and 0.0 to all others. To emphasize some tags over the others, $w_{POS}(tag_1) > w_{POS}(tag_2)$ can be used. If POS-tags are not available, $w_{POS} = 1.0$ is used for all words.

Finally, all words in the document d are scored by combining $s_{doc}(t, d)$, $w_{POS}$ and d(t):

$$s(t, d) = s_{doc}(t, d) \times d(t) \times w_{POS}(t), \quad (12)$$

where $w_{POS}(t)$ is the POS weight for the word t. If t has several POS-tags, the one with the largest weight is used.
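Equation 12 and the two POS rules (weight 1.0 when no tagger is available, largest weight when a word has several tags) can be sketched as follows. This is an illustrative sketch only: the function name, the tag names in the weight table and the input values are hypothetical, and tags missing from the table are assumed to have weight 0.

```python
def final_score(s_doc, distance, tags=None, weights=None):
    """Equation 12: s(t, d) = s_doc(t, d) * d(t) * w_POS(t)."""
    if weights is None:
        # No POS-tagger available: w_POS = 1.0 for every word.
        w_pos = 1.0
    else:
        # Several tags: use the largest weight. Unlisted tags count as 0
        # (an assumption of this sketch).
        w_pos = max((weights.get(tag, 0.0) for tag in (tags or ())), default=0.0)
    return s_doc * distance * w_pos


# Hypothetical weight table: nouns and adjectives accepted, everything else 0.
weights = {"N": 1.0, "JJ": 1.0}

noun_or_verb = final_score(0.75, 0.5, tags=["V", "N"], weights=weights)
verb_only = final_score(0.75, 0.5, tags=["V"], weights=weights)
no_tagger = final_score(0.75, 0.5)
```

Since the three factors are multiplied, a POS weight of 0 removes a word from consideration regardless of its other scores.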

Each word t in the document d now has a score s(t, d) that indicates its informativeness for the document. The top k most informative words are then assigned as keywords for the document. As the number of informative words varies from document to document, a score threshold can be used to select n (n ≤ k) informative words from the document that have a score above the threshold. The threshold is relative to the highest score in the document: it equals $r \times \max_{t \in d} s(t, d)$. For example, if r = 0.5, the keywords that have a score of at least 50 % of the highest score are accepted. However, if more than k keywords fulfill this condition, only the top k are selected. We have used k = 9 and r = 0.5 in our experiments.
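The selection rule can be sketched in a few lines of Python. The function name and the example scores are hypothetical; the defaults k = 9 and r = 0.5 are the values reported above, and a score equal to the threshold is accepted, following the "at least 50 %" wording.

```python
def select_keywords(scores, k=9, r=0.5):
    """Keep at most k words whose score is at least r times the top score."""
    if not scores:
        return []
    threshold = r * max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked[:k] if score >= threshold]


# Hypothetical final scores s(t, d) for one document.
scores = {"jazz": 0.8, "concert": 0.6, "helsinki": 0.5, "tonight": 0.3}

keywords = select_keywords(scores)          # threshold is 0.4
top_two = select_keywords(scores, k=2)      # the cap k applies first
```

Here "tonight" falls below the threshold, so three keywords are returned even though k would allow nine.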

4 EVALUATION

We evaluate our approach using three different datasets that we describe in Section 4.1. The test sets were built by manually selecting the keywords that were considered most relevant. We use the datasets in two different types of performance comparisons. We compare the results of IKE against the following methods: CollabRank, KeyGraph, Matsuo's χ² measure, TF-IDF and Chi-squared feature weighting. In the first experiment we use the manually picked keywords to see which approach performs the best. In the second experiment we use the extracted keywords to create user models and see which model produces the best recommendations. We consider the latter experiment the most indicative of performance as it is the most objective.

4.1 Data

To make the experimentation more versatile we use datasets in three completely different languages: Finnish, English and Chinese. Finnish is a complex language with many suffixes. Chinese has no prefixes or suffixes, which makes it simpler in that respect, but it is complex to process due to its different character set and writing system. English is the standard language in most systems.

For Finnish, we use a dataset that consists of approximately 5,000 events from the Helsinki Metropolitan area. The events were collected between 2007 and 2010 from several different data sources. The descriptions hold information about the type of the event and the performers in a concise form. After preprocessing, the documents hold 32 words on average. The average term frequency per document in this dataset was 1.04, i.e., almost all the words occur only once per document.

Informativeness-basedKeywordExtractionfromShortDocuments

417

For Chinese, we use the Velo dataset that contains 1,000 descriptions of companies and their products stored in the Velo databases. These descriptions are used in Velo coupon machines in China. The data was gathered in June 2010. The descriptions hold 80 words on average; even though they are longer than most of our data, they were short enough to be used in our experiments. One of the challenges with Chinese is tokenizing the text into words; for this, paodingjieniu, a Chinese word segmentation tool, was used to divide the descriptions into words separated by blank spaces.

For English, we use movie abstracts from Wikipedia. This source was selected due to its free and easy access. We downloaded approximately 7,000 Wikipedia pages that contain information about different movies. We use the MovieLens dataset when selecting the movies: if a movie is found in the MovieLens dataset, we download its Wikipedia page. We only use the abstracts found at the beginning of the Wikipedia page. If the abstract is longer than 100 words, we remove full sentences from its end until the document is within the limit. The average word frequency per document in this dataset is 1.07. The Wikipedia pages were retrieved in May 2010.
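The shortening of long abstracts can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, sentences are split naively at ".", "!" or "?" followed by whitespace, words are whitespace-separated tokens, and the limit is interpreted as "at most max_words words". A single sentence that alone exceeds the limit is kept as is.

```python
import re


def truncate_abstract(text, max_words=100):
    """Drop whole sentences from the end until at most max_words words remain."""
    # Naive sentence splitter (an assumption; abbreviations are not handled).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    while len(sentences) > 1 and sum(len(s.split()) for s in sentences) > max_words:
        sentences.pop()
    return " ".join(sentences)


# Tiny example with a limit of 7 words instead of 100.
shortened = truncate_abstract("A b c. D e f g. H i.", max_words=7)
```

In the example the text has nine words, so the last sentence is removed and the remaining seven words fit the limit.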

For the first experiment we created the test set by randomly selecting 300 documents and manually tagging them for keywords. The Event and Wikipedia data were tagged by two research scientists from VTT Technical Research Centre of Finland, and the Velo data was tagged by two students from East China Normal University. At most nine keywords were chosen per document. The agreement rate among annotators was 69 % for the Event data, 64 % for the Wikipedia data, and 70 % for the Velo data. The test set was updated after disagreements were resolved.

For POS-tagging in English and Chinese we use Stanford's Log-Linear Part-of-Speech tagger. For POS-tagging in Finnish we use LingSoft's commercial FinTWOL tagger.

4.2 Evaluation of Keyword Precision

We evaluate the feasibility of the extracted keywords

using a set of manually annotated keywords. We use

all three datasets for evaluation.

4.2.1 Evaluation Setup

Velo is a company based in Shanghai, China, that owns and maintains coupon machines.
Wikipedia: http://www.wikipedia.org/
MovieLens dataset: http://www.grouplens.org/node/12
Stanford Log-Linear Part-of-Speech tagger: http://nlp.stanford.edu/software/tagger.shtml

We implemented the CollabRank algorithm as described by Wan and Xiao (2008). Some parts were not clearly described, and we therefore made the following assumptions. First, we used a window size of 10, as described in the paper; however, the window was not extended over sentence breaks such as full stops and question marks. Second, the candidate words were selected after getting the word co-occurrences. That is, the words without an appropriate POS-tag were removed after the affinity weights were calculated. This is important as it affects the weights. We used these settings as they produced the best results for CollabRank.
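The two assumptions can be illustrated with a short sketch. It is not an implementation of CollabRank: it only counts word pairs that co-occur within a window that stops at sentence breaks, and then filters the pairs to a candidate set after counting. The function names, the naive sentence and word splitting, and the reading of "window size" as a span of `window` consecutive words are all assumptions of this sketch.

```python
import re
from collections import Counter


def cooccurrence_counts(text, window=10):
    """Count unordered word pairs within `window` words of the same sentence."""
    counts = Counter()
    # Sentence breaks end the window (naive split at ., ! and ?).
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"\w+", sentence)
        for i, left in enumerate(words):
            for right in words[i + 1 : i + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


def keep_candidates(counts, candidates):
    """Filter pairs to candidate words after the counts have been collected."""
    return Counter({
        pair: n for pair, n in counts.items()
        if pair[0] in candidates and pair[1] in candidates
    })


counts = cooccurrence_counts("Bond returns. Bond fights villains.")
filtered = keep_candidates(counts, {"bond", "villains"})
```

In the example, "returns" and "fights" never co-occur because they are in different sentences, and the candidate filter is applied only after all pairs, including those with non-candidate words, have been counted.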

We experimented with both clustering approaches (predefined cluster count and threshold similarity) and found that they produced similar results for CollabRank. However, when using a predefined cluster count with IKE, the results were poorer. Therefore, we use the score threshold clustering in all experiments that require clustering.

For Term Frequency - Inverse Document Frequency (TF-IDF) and Chi-squared we used the standard implementations of the approaches. For Ohsawa's KeyGraph (Ohsawa et al., 1998) and Matsuo's χ² keyword extraction approach (Matsuo and Ishizuka, 2003) we used Knime and its implementations of the two algorithms. We ran them using the default parameters that were described in the articles. To improve the results we used stop word lists and an N char filter (N = 3) to remove uninformative words and characters from the documents.

After empirical evaluation, we selected the following parameters for IKE: β = 0.3, and POS-tag weights $w_{POS}(N) = 1.0$, $w_{POS}(JJ) = 1.0$, $w_{POS}(V) = 0$, $w_{POS}(Others) = 0$. However, when we use the event data, we use $w_{POS}(N) = 3.0$ and β = 0.6. For all the approaches, at most nine keywords per document were extracted.

4.2.2 Results

The baseline result is the F-score received when all nouns and adjectives are extracted. We included adjectives as they are relevant in some domains; for example, the adjective explosive can be considered relevant in the description "explosive action movie". Therefore, the words with the following POS-tags are extracted: N, A, ADJ, AD, AD-A, -, JJ, NN, NNS, NNP, and NNPS. Some of these tags are used in FinTWOL and some in the Stanford POS-tagger. The tag "-" means that words without a tag, which are usually names not recognized by the tagger, are also extracted. Therefore, the tag "-" is treated as NP in our experiments. This produced the following baselines: Event data 0.36, Wikipedia data 0.22, and Velo data 0.20.

Konstanz Information Miner: http://www.knime.org/

Table 1: F-scores for each of the methods in the keyword precision experiment. Chi-squared is the traditional feature weighting approach, and χ² KE is the keyword extraction approach presented by Matsuo and Ishizuka (2003). Due to the Chinese character set, we were unable to evaluate KeyGraph and Matsuo's χ² keyword extraction approach on Velo data using Knime.

| Dataset   | IKE  | CollabRank | Chi-squared | TF-IDF | KeyGraph | χ² KE | Baseline |
|-----------|------|------------|-------------|--------|----------|-------|----------|
| Wikipedia | 0.57 | 0.29       | 0.35        | 0.35   | 0.22     | 0.21  | 0.22     |
| Events    | 0.56 | 0.46       | 0.49        | 0.49   | 0.36     | 0.35  | 0.39     |
| Velo      | 0.31 | 0.18       | 0.26        | 0.22   | -        | -     | 0.15     |

Table 1 shows the results of our experiments. The best results for both the Event and Wikipedia data were received using IKE. For the Chinese Velo data, the difference between IKE and both CollabRank and the baseline is large. However, we can see that TF-IDF and χ² perform almost as well in this case. After careful review of the results, we conclude that most of the keywords extracted by IKE are feasible even though they were not originally picked by humans. This is true for all of the datasets. The keywords picked by both χ² and TF-IDF were in most cases uncommon words, such as the name of the movie. Keywords extracted by IKE were both uncommon and common; for example, the name of the movie, the actors, and the genre were all extracted.

KeyGraph did not produce good results, which was expected. The F-score with the Wikipedia data was only 0.22, with precision of 0.19 and recall of 0.26. The Event data produced an F-score of 0.36, with precision of 0.31 and recall of 0.42. The reason for the poor performance in both cases is the same as with several other approaches: most words occur only once in the document. Because KeyGraph uses only a single document, there is not enough information within a short document to make this approach feasible.

Matsuo's χ² keyword extraction approach also performed poorly. The F-score for the Wikipedia data was 0.21, with precision of 0.18 and recall of 0.24. For Events the F-score was 0.35, with precision of 0.30 and recall of 0.41. The reason for the poor performance is the same as with KeyGraph: it is impossible to assess the important words of a short document without using the corpus.
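The F-score formula is not spelled out in this section. Assuming it is the balanced F-score, i.e., the harmonic mean of precision and recall, the precision and recall values quoted above for KeyGraph and Matsuo's approach reproduce the reported F-scores after rounding to two decimals. The sketch below performs this check; the function and variable names are hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as quoted in the text above.
reported = {
    ("KeyGraph", "Wikipedia"): (0.19, 0.26, 0.22),
    ("KeyGraph", "Events"): (0.31, 0.42, 0.36),
    ("Matsuo KE", "Wikipedia"): (0.18, 0.24, 0.21),
    ("Matsuo KE", "Events"): (0.30, 0.41, 0.35),
}

recomputed = {
    key: round(f_score(p, r), 2) for key, (p, r, _) in reported.items()
}
```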

4.3 Evaluation of Keyword Utilization for User Modeling

To overcome the subjectivity of the ﬁrst test set we

did a simple experiment where we compare the rec-

ommendation precision for each approach. The idea

was to see which approach extracts the most useful

words from the text, i.e., words that produce the best

recommendations. The recommendation precision is

assessed by recommending a top-n list of movies and

comparing how many of them the user has liked.

4.3.1 Evaluation Setup

We use Wikipedia data for keyword extraction and the

user ratings from MovieLens data for user modeling

and recommendation.

To test the recommendation precision we created a simple user model: first, we randomly selected 10,000 users from the MovieLens dataset. Then, for each user, all the movies they rated were retrieved. This set was divided into a training set and a test set with a 75 % / 25 % ratio. However, only movies with a positive rating (rating 4.5 or 5) were added to the test set.

For each of the movies in the training set, the keywords were extracted from the Wikipedia page. Each of the keywords was then used as a tag in the user model. The tags were weighted using the user's rating for the movie: for example, if the rating was 3, the tag was assigned a weight of 0; if the rating was 0, the weight was -1.0; and if the rating was 5, the weight was 1.0. If the same keyword is found in several movies, we use the user's average rating among those movies.
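A possible reading of the tag weighting is sketched below. Only the three anchor points (0 maps to -1.0, 3 to 0 and 5 to 1.0) come from the text; the piecewise linear interpolation between them, the order of operations (average the ratings first, then map to a weight), and all names and example data are assumptions of this sketch.

```python
from collections import defaultdict


def rating_to_weight(rating):
    """Map a 0-5 rating to [-1, 1] with 3 as the neutral point."""
    if rating >= 3:
        return (rating - 3) / 2  # 3 -> 0.0, 5 -> 1.0
    return (rating - 3) / 3      # 0 -> -1.0, 3 -> 0.0


def build_user_model(rated_movies, movie_keywords):
    """rated_movies: {movie: rating}; movie_keywords: {movie: [keyword, ...]}."""
    ratings_per_tag = defaultdict(list)
    for movie, rating in rated_movies.items():
        for keyword in set(movie_keywords.get(movie, ())):
            ratings_per_tag[keyword].append(rating)
    # A keyword shared by several movies uses the user's average rating.
    return {
        tag: rating_to_weight(sum(ratings) / len(ratings))
        for tag, ratings in ratings_per_tag.items()
    }


# Hypothetical training data for one user.
rated = {"m1": 5, "m2": 3, "m3": 0}
keywords = {"m1": ["spy", "action"], "m2": ["action"], "m3": ["musical"]}
model = build_user_model(rated, keywords)
```

In the example the tag "action" occurs in movies rated 5 and 3, so its average rating is 4 and its weight is 0.5.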

To evaluate the model's precision we take the test set and add k × 5 randomly chosen movies from the set of all movies to it, where k is the initial size of the test set. That is, if we have 5 movies in the test set, we randomly take 25 movies among all the movies the user has not seen to make the total size of the test set 30 movies. The recommendation is done by scoring each of the movies: we take the keywords of the movie and match them to the user's tags. The score of the movie is the sum of the weights of the matching tags; for each keyword found in the user model, the weight of the tag is added to the score. The movies are then put into descending order by score, and the n highest-scoring movies are selected as the top-n list. The precision of the user model is the proportion of movies in the top-n list that the user has rated positively. That is, if the top-n list consists solely of movies the user has seen and rated highly, the precision is 1.
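The scoring and the precision measure can be sketched as follows. The function names and the example data are hypothetical, and the user model is assumed to be a dictionary from tag to weight; the construction of the test set with randomly added movies is left out.

```python
def score_movie(keywords, user_model):
    """Sum the weights of the user's tags that match the movie's keywords."""
    return sum(user_model[k] for k in set(keywords) if k in user_model)


def top_n(candidates, movie_keywords, user_model, n=5):
    """Rank candidate movies by score, highest first, and keep the first n."""
    ranked = sorted(
        candidates,
        key=lambda m: score_movie(movie_keywords.get(m, ()), user_model),
        reverse=True,
    )
    return ranked[:n]


def precision_at_n(recommended, liked):
    """Proportion of recommended movies that the user rated positively."""
    if not recommended:
        return 0.0
    return sum(1 for m in recommended if m in liked) / len(recommended)


# Hypothetical user model and candidate movies.
user_model = {"spy": 1.0, "action": 0.5, "musical": -1.0}
movie_keywords = {
    "a": ["spy", "action"], "b": ["action"], "c": ["musical"], "d": ["drama"],
}
recommended = top_n(["a", "b", "c", "d"], movie_keywords, user_model, n=2)
precision = precision_at_n(recommended, liked={"a"})
```

Movie "a" scores 1.5 and movie "b" scores 0.5, so they form the top-2 list; only "a" is among the liked movies, which gives a precision of 0.5.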

4.3.2 Results

The baseline used here is the same as before, i.e., all nouns and adjectives are selected and used as keywords. The user model was created as described above. The recommendation precision was calculated by taking the ratio of correct movies in the top-n list. To assess the precision of the model, we skipped the movies that did not have any matching tags in the user model. This was done to simulate an actual recommendation system: when assessing the precision, we are only interested in movies that can be linked to the user model.

Table 2: Comparison of user models for recommendation when extracted keywords are used for user modeling.

| Measure     | IKE  | CollabRank | Chi-squared | TF-IDF | KeyGraph | χ² KE | Baseline |
|-------------|------|------------|-------------|--------|----------|-------|----------|
| Precision   | 0.55 | 0.41       | 0.84        | 0.86   | 0.30     | 0.33  | 0.39     |
| Coverage    | 0.75 | 0.59       | 0.27        | 0.29   | 0.86     | 0.85  | 0.89     |
| Total Score | 0.41 | 0.24       | 0.23        | 0.25   | 0.26     | 0.28  | 0.30     |

In some cases, such as with KeyGraph, there was a problem of overspecialization, as the approach produced models that were too specific. In these cases the model was able to make only a very limited number of recommendations. An example of this was James Bond movies: the model consisted solely of keywords like James Bond. Using this model, the recommendation precision for James Bond movies was high, but the model could not recommend any other movies. We wanted to emphasize broader models, so in addition to precision we include coverage in the assessment of performance. Coverage in this case measures the percentage of users who can receive recommendations. The score for the approach is then calculated as recommendation precision × recommendation coverage.
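Coverage and the combined score can be sketched as below. The function names and the per-user recommendation lists are hypothetical; a user is assumed to be covered when at least one recommendation can be made. The last line applies the product to the precision (0.55) and coverage (0.75) reported for IKE.

```python
def coverage(recommendations_per_user):
    """Share of users for whom at least one recommendation could be made."""
    if not recommendations_per_user:
        return 0.0
    served = sum(1 for recs in recommendations_per_user.values() if recs)
    return served / len(recommendations_per_user)


def total_score(precision, cov):
    """Score of an approach: recommendation precision times coverage."""
    return precision * cov


# Hypothetical recommendation lists: user u2 cannot be served.
recs = {"u1": ["a", "b"], "u2": [], "u3": ["c"], "u4": ["d"]}
cov = coverage(recs)

# Precision and coverage reported for IKE, rounded to two decimals.
ike_score = round(total_score(0.55, 0.75), 2)
```

The product penalizes overspecialized models: a model with very high precision but low coverage can still end up with a lower score than a broader model.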

Table 2 shows the results of our experiment. When all the nouns and adjectives are used in the user model, the average precision was 0.39, i.e., approximately 2 out of 5 movies in the top-5 list were correct. The recommendation coverage for the baseline, i.e., the percentage of users in the test set who get recommendations, was 89 %. This produced the score of 0.30. When using IKE, the precision was 0.55 with a recommendation coverage of 75 %, giving a score of 0.41. We consider this result better as the precision is considerably higher and the coverage is good. Finally, CollabRank produced a score of 0.24, TF-IDF 0.25, and χ² 0.23.

Even though the precision is excellent with Chi-

squared and TF-IDF, the poor coverage would make

them unusable in a real world setting. However, com-

bining IKE with Chi-squared and/or TF-IDF could be

beneﬁcial for user modeling.

These results show that by extracting only the informative words instead of all of them, the results are notably better. In addition, we can see that IKE can extract more useful words for recommendation than CollabRank, TF-IDF and χ². The difference in the final score can be credited to the fact that IKE extracts both common and uncommon keywords, whereas the others focus on only one of them.

5 CONCLUSIONS

In this paper, we have described the challenge of keyword extraction from short documents. We consider a document short when it contains at most 100 words, which is equal to a short abstract. We proposed the Informativeness-based Keyword Extraction (IKE) approach for extracting keywords from short documents. It is based on word evaluation that is done on three levels: corpus level, cluster level and document level. In order to do the evaluation on the cluster level, text clustering is used.

We compared the results against several other key-

word extraction approaches. In all of the experiments

our approach produced the best results. In addition,

we compared effectiveness of the extracted keywords

for user modeling and recommendations. In this ex-

periment, the user models created with the keywords

using IKE produced considerably better results than

any other approach. This is encouraging as it shows

the feasibility of Informativeness-based Keyword Ex-

traction for user modeling and recommendation.

In the future, more focus should be given to noun phrase identification, as we feel it would benefit summarization and user modeling by extracting more detailed entities from the text. Even though our approach has performed well, we believe that there is still room for improvement. We hope that our work can benefit future research in the field of keyword extraction and text mining from short documents.

ACKNOWLEDGEMENTS

The authors wish to thank the Finnish Funding Agency for Technology and Innovation (TEKES) for funding a part of this research. In addition, the authors wish to thank Prof. Hannu Toivonen and the anonymous reviewers for their valuable comments.

REFERENCES

Andrade, M. and Valencia, A. (1998). Automatic extraction of keywords from scientific text: Application to the knowledge domain of protein families. Bioinformatics, 14:600–607.

Clark, K. and Gale, W. (1995). Inverse document frequency (idf): A measure of deviation from poisson. In Third Workshop on Very Large Corpora, pages 121–130.

Ercan, G. and Cicekli, I. (2007). Using lexical chains for keyword extraction. Inf. Process. Manage., 43(6):1705–1714.

Frank, E., Paynter, G. W., Witten, I. H., Gutwin, C., and Nevill-Manning, C. G. (1999). Domain-specific keyphrase extraction. In Dean, T., editor, IJCAI'99, pages 668–673. Morgan Kaufmann.

HaCohen-Kerner, Y. (2003). Automatic extraction of keywords from abstracts. In Palade, V., Howlett, R. J., and Jain, L. C., editors, KES 2003, volume 2773 of Lecture Notes in Computer Science, pages 843–849. Springer.

HaCohen-Kerner, Y., Gross, Z., and Masa, A. (2005). Automatic extraction and learning of keyphrases from scientific articles. In Gelbukh, A. F., editor, CICLing 2005, volume 3406 of Lecture Notes in Computer Science, pages 657–669. Springer.

Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. In Conference on Empirical Methods in Natural Language Processing, pages 216–223.

Hulth, A. (2004). Enhancing linguistically oriented automatic keyword extraction. In North American Human language technology conference.

Hulth, A., Karlgren, J., Jonsson, A., Boström, H., and Asker, L. (2001). Automatic keyword extraction using domain knowledge. In Gelbukh, A. F., editor, CICLing'01, volume 2004 of Lecture Notes in Computer Science, pages 472–482. Springer.

Kim, S., Medelyan, O., Kan, M., and Baldwin, T. (2010). Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pages 21–26.

Matsuo, Y. and Ishizuka, M. (2003). Keyword extraction from a single document using word co-occurrence statistical information. In Russell, I. and Haller, S. M., editors, FLAIRS Conference, pages 392–396. AAAI Press.

Nguyen, T. D. and Kan, M.-Y. (2007). Keyphrase extraction in scientific publications. In Goh, D. H.-L., Cao, T. H., Sølvberg, I., and Rasmussen, E. M., editors, ICADL, volume 4822 of Lecture Notes in Computer Science, pages 317–326. Springer.

Ohsawa, Y., Benson, N. E., and Yachida, M. (1998). Keygraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In ADL'98, pages 12–18. IEEE Computer Society.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical report, Stanford.

Rennie, J. D. M. and Jaakkola, T. (2005). Using term informativeness for named entity detection. In Baeza-Yates, R. A., Ziviani, N., Marchionini, G., Moffat, A., and Tait, J., editors, SIGIR'05, pages 353–360. ACM.

Timonen, M. (2012). Categorization of very short documents. In In-press KDIR'12. SciTePress Digital Library.

Timonen, M., Silvonen, P., and Kasari, M. (2011a). Classification of short documents to categorize consumer opinions. In ADMA'11. Online proceedings.

Timonen, M., Silvonen, P., and Kasari, M. (2011b). Modelling a query space using associations. Frontiers in Artificial Intelligence and Applications, 255:77–96.

Tomokiyo, T. and Hurst, M. (2003). A language model approach to keyphrase extraction. In Proceedings of ACL Workshop on Multiword Expressions.

Turney, P. D. (2000). Learning algorithms for keyphrase extraction. Inf. Retr., 2(4):303–336.

Turney, P. D. (2003). Coherent keyphrase extraction via web mining. In Gottlob, G. and Walsh, T., editors, IJCAI'03, pages 434–442. Morgan Kaufmann.

Wan, X. and Xiao, J. (2008). Collabrank: Towards a collaborative approach to single-document keyphrase extraction. In Scott, D. and Uszkoreit, H., editors, COLING'08, pages 969–976.

Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., and Nevill-Manning, C. G. (1999). Kea: Practical automatic keyphrase extraction. CoRR, cs.DL/9902007.

Yih, W., Goodman, J., and Carvalho, V. R. (2006). Finding advertising keywords on web pages. In Carr, L., Roure, D. D., Iyengar, A., Goble, C. A., and Dahlin, M., editors, WWW'06, pages 213–222. ACM.

Informativeness-basedKeywordExtractionfromShortDocuments

421