A HYBRID APPROACH TOWARDS INFORMATION EXPANSION
BASED ON SHALLOW AND DEEP METADATA
Tudor Groza and Siegfried Handschuh
Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
Keywords:
Semantic Metadata, Information Expansion, Information Visualization, Knowledge Capturing.
Abstract:
The exponential growth of the World Wide Web in the last decade has brought an explosion of the information
space, with important consequences also in the area of scientific research. Lately, finding relevant work in a
particular field and exploring the links between relevant publications has become a cumbersome task. In this paper we
propose a hybrid approach to the automatic extraction of semantic metadata from scientific publications that can
help to alleviate, at least partially, the above-mentioned problem. We integrated the extraction mechanisms
in an application targeted to early stage researchers. The application harmoniously combines the metadata
extraction with information expansion and visualization for the seamless exploration of the space surrounding
scientific publications.
1 INTRODUCTION
The World Wide Web has been an essential medium
for the dissemination of scientific work in many
fields. The significant rate at which scientific research
outcomes are growing has inevitably led to substan-
tial increases in the amount of scientific work pub-
lished in journals, conferences, and workshops. As an
example, in the biomedical domain, the well-known
MedLine (http://medline.cos.com/) now hosts over 18 million articles, with
a growth rate of 0.5 million articles per year, which
represents around 1300 articles per day (Tsujii, 2009).
This makes the process of finding relevant work in a
particular field a cumbersome task, especially for an
early stage researcher.
In addition, the lack of uniformity and integration
of access to information can also be considered an issue
associated with the information overload. Each
event has its own online publishing means, and there
exists no centralized hub linking the information, not
even for communities sharing similar interests. These
issues have motivated a variety of efforts. The Semantic Web Dog Food initiative (Möller et al., 2007)
is a pioneering attempt in which an RDF-based repos-
itory was set up to host metadata about International
and European Semantic Web conferences (and other
Semantic Web events). This data is then served as
linked open data to the community. While the initia-
tive is getting more and more attention and partici-
pation from organizers of scientific events, it requires
a considerable amount of manual effort to derive the
metadata about the publications (shallow and deep,
such as title, authors, references, claims, positions or
arguments). Without any immediate reward or feed-
back, there is little incentive for authors to generate
the metadata themselves. It is our belief that the adoption of semantic technologies to enable linked open
data in the scientific publication domain could be much
greater if an automatic metadata extraction solution existed.
In this paper, we report on a hybrid approach to-
ward automatic extraction of both shallow and deep
metadata from scientific publications, which has been
developed and evaluated. Our proposal combines
document engineering with empirical and linguistic
processing. The result of the extraction is an ontolog-
ical representation of the publication, capturing the
linear and rhetorical structure of the discourse (prove-
nance and semantics), in addition to the usual Dublin-
Core metadata terms. The metadata can then be ex-
ported and used in repositories like the Semantic Web
Dog Food Server, or embedded into the publication
and used for a tighter integration of personal infor-
mation within the Social Semantic Desktop frame-
work (Bernardi et al., 2008). We wrapped all these
features into a highly modularized application (available at http://sclippy.semanticauthoring.org/) that
uses the extracted metadata for achieving information
expansion and visualization. In addition, it emulates
an information hub created on demand, which provides
early stage researchers with an integrated view on
multiple publication repositories.
The rest of the paper is structured as follows: we
start by enumerating the application’s requirements
in Sect. 2, then we detail our approach in Sect. 3.
Sect. 4 discusses the evaluation we carried out and,
before concluding in Sect. 6, we present similar research
approaches in Sect. 5.
2 REQUIREMENTS
In order to support researchers in their need to
find relevant scientific publications, we performed
a study that resulted in a list of requirements. The
study featured a series of online surveys, with the
broader scope of analyzing metadata usefulness and
general reading habits. For the current paper, only
parts of them are relevant (the complete surveys, including
the results, can be found at http://smile.deri.ie/surveys). In terms of results,
75 researchers answered the call, all acting
within the Semantic Web community. The list of extracted
generic requirements is summarized as follows:
Automatic Metadata Extraction. Although the vast
majority of researchers agree with the importance
and usefulness of metadata, almost none of them
would spend the time to create it manually. There-
fore, it is important to require as little effort as
possible from authors / readers, and find viable
ways to generate or extract the metadata automat-
ically. In addition, we also target the extraction of
the entire metadata space, including the abstract and
references, as well as deep metadata like claims,
positions or arguments. As an example, 90.8% of
the survey subjects consider the abstract important
or crucial to read, and 85.4% consider that deep
metadata (e.g. claims) improves the understanding
of a paper, which is also why 82.8% usually
mark such deep metadata manually while reading.
Metadata Persistence. Having the metadata ex-
tracted, we need to allow the author / reader to
make it persistent, thus providing the opportunity
for its reuse. Both persistence options are easily
achievable, i.e. (i) exporting the metadata for direct
usage in web repositories, or (ii) embedding
the metadata into the original publication.
Metadata Usage. Obviously, extracting and storing
the metadata would not be sufficient. We also
need to use it. Considering our target users and
their reading habits, we opted for using it to
achieve information expansion and visualization.
For example, 88.8% of the survey subjects would
look for other publications of the same author or
her co-authors.
3 IMPLEMENTATION
The analysis and refinement of the requirements list
resulted in a workflow that supports the design of
our application. In the following section we describe
this workflow together with its component parts.
3.1 Workflow
Figure 1: Application workflow.
The workflow, depicted in Fig. 1, can be split into
three main parts, according to the three dimensions
given by the requirements. The first part deals with
the automatic extraction of metadata. Here, we cur-
rently focus only on publications published as PDF
documents. The extraction module takes as input a
publication and creates a SALT (Semantically Annotated LaTeX) instance model. SALT (Groza et al.,
2007) represents a semantic authoring framework tar-
geting the enrichment of scientific publications with
semantic metadata. It introduces a layered model that
covers the linear structure of the discourse (the Doc-
ument Ontology), the rhetorical structure of the dis-
course (the Rhetorical Ontology) and additional an-
notations, e.g. the shallow metadata (the Annotation
Ontology).
In the second part of the workflow, the extracted
instance model can be exported to a separate file,
or embedded into the original publication (by means
of the same extraction module). The third and last
part uses the shallow metadata modeled by the SALT
instances to perform information expansion. This
is achieved based on different existing publication
repositories. The current implementation realizes the
expansion based on DBLP (http://informatik.uni-trier.de/~ley/db/). This represents only the
first step, as the design of the application allows dynamic
integration of other expansion modules,
thus transforming it into an information hub.
In addition, the expanded information can be used to
correct or improve the extracted metadata.
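To picture how the three parts fit together, the following minimal Python sketch chains them using placeholder functions; all names, classes and returned fields are hypothetical and only mirror the description above, not the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SaltModel:
    """Placeholder for the SALT instance model produced by the extraction."""
    title: str = ""
    authors: list = field(default_factory=list)
    rhetorical_items: list = field(default_factory=list)

def extract_salt_model(pdf_path):
    # Part 1: shallow metadata plus the linear and rhetorical discourse structure.
    return SaltModel(title="(extracted title)", authors=["(extracted author)"])

def persist(model, export_path=None, embed_into=None):
    # Part 2: export the model to a separate file and / or embed it
    # back into the original publication.
    pass

def expand(model, repository="DBLP"):
    # Part 3: use the shallow metadata to query a publication repository.
    return {"similar_publications": [], "coauthor_graph": {}}

if __name__ == "__main__":
    model = extract_salt_model("publication.pdf")
    persist(model, export_path="publication.rdf")
    print(expand(model))
```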
3.2 Extraction of Shallow Metadata and
the Linear Discourse Structure
The extraction of shallow metadata and linear dis-
course structure currently works only on PDF publi-
cations. We have developed a set of algorithms that
follow a low-level document engineering approach,
by combining mining and analysis of the publication
content based on its formatting style and font infor-
mation. Each algorithm in the set deals with one as-
pect of shallow metadata. Thus, there is an authors
extraction algorithm, one for extracting the abstract,
one for the references, and a last one for the linear discourse
structure.
Detailing the actual algorithms is outside the scope
of the current paper. Nevertheless, we will give an example
of how the authors extraction works (a simplified sketch follows the list below). There are
four main processing steps:
1. We merge the consecutive text chunks on the first
page that have the same font information and are
on the same line (i.e. the Y coordinate is the
same);
2. We select the text chunks between the title and the
abstract and consider them author candidates;
3. We linearize the author candidates based on the
variations of the Y axis;
4. We split the author candidates based on the variations
of the X axis.
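To make these steps more concrete, the following simplified Python sketch (our own illustration; the chunk representation, the merging threshold and the column grouping rule are assumptions of the example, not the actual implementation) reproduces the four operations on a toy two-column author block:

```python
# Illustrative sketch of the four processing steps; the chunk representation
# (text, font, x, y) and the exact grouping rules are simplifying assumptions.

def merge_same_line(chunks, max_gap=15):
    """Step 1: merge consecutive chunks that share the font and Y coordinate
    and are horizontally close to each other (max_gap is an assumed threshold)."""
    merged = []
    for chunk in chunks:
        if merged:
            prev = merged[-1]
            if prev["font"] == chunk["font"] and prev["y"] == chunk["y"] \
                    and chunk["x"] - prev["x"] <= max_gap:
                prev["text"] += " " + chunk["text"]
                continue
        merged.append(dict(chunk))
    return merged

def author_candidates(chunks, title_y, abstract_y):
    """Step 2: keep only the chunks placed between the title and the abstract."""
    return [c for c in chunks if title_y < c["y"] < abstract_y]

def linearize(candidates):
    """Step 3: order the candidates by the variation of the Y coordinate."""
    return sorted(candidates, key=lambda c: c["y"])

def split_columns(candidates):
    """Step 4: split the linearized candidates into columns based on the
    variation of the X coordinate (one author block per column)."""
    columns = {}
    for c in candidates:
        columns.setdefault(c["x"], []).append(c["text"])
    return list(columns.values())

if __name__ == "__main__":
    chunks = [
        {"text": "Tudor Groza", "font": "Times-10", "x": 60, "y": 120},
        {"text": "Siegfried Handschuh", "font": "Times-10", "x": 300, "y": 120},
        {"text": "DERI Galway", "font": "Times-9", "x": 60, "y": 135},
        {"text": "DERI Galway", "font": "Times-9", "x": 300, "y": 135},
    ]
    candidates = author_candidates(merge_same_line(chunks), title_y=100, abstract_y=200)
    print(split_columns(linearize(candidates)))
    # -> [['Tudor Groza', 'DERI Galway'], ['Siegfried Handschuh', 'DERI Galway']]
```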
Figure 2: Authors extraction algorithm example.

To have a better picture of how the algorithm
works, Fig. 2 depicts an example applied to a publication
that has the authors structured in several
columns. The figure shows the way in which the authors'
columns containing the names and affiliations
are linearized, based on the variation of the Y coordinate.
The arrows in the figure show the exact linearization
order. The variations on the X axis can be
represented in a similar manner.
3.3 Extraction of the Rhetorical
Discourse Structure
For the extraction of deep metadata we followed
a completely different approach. We started from
adopting, as foundational background, Rhetorical
Structure Theory (RST), which is also the foundation of SALT. RST was first introduced
in (Mann and Thompson, 1987), with the goal
of providing a descriptive theory for the organization
and coherence of natural text. The theory comprises a
series of elements, of which we mention the most
important ones, i.e. (i) Text spans, and (ii) Schemas.
Text spans represent uninterrupted linear intervals of
text that can have the roles of Nucleus or Satellite.
A Nucleus represents the core (main part / idea) of
a sentence or phrase, while the Satellite represents
a text span that complements the Nucleus with addi-
tional information. On the other hand, schemas de-
fine the structural constituency arrangements of text.
They mainly provide a set of conventions that are ei-
ther independent or inclusive of particular rhetorical
relations that connect different text spans. The theory
proposes a set of 23 rhetorical relations, having an almost
flat structure (e.g. Circumstance, Elaboration,
Antithesis, etc.). In SALT, as well as in our extraction
mechanism, we adopted only a subset of 11
relations (see http://salt.semanticauthoring.org/ for more details
on the SALT model).
The actual extraction process comprised two
phases: (i) the empirical analysis of a publication col-
lection and development of a knowledge acquisition
module, and (ii) an experiment for determining the
initial probabilities for text spans to represent knowl-
edge items (i.e. claims, positions, arguments), based
on the participation in a rhetorical relation of a certain
type and its block placement in the publication (i.e.
abstract, introduction, conclusion or related work).
In order to automatically identify text spans and
the rhetorical relations that hold among them, we re-
lied on the discourse function of cue phrases, i.e.
words such as however, although and but. An ex-
ploratory study of such cue phrases provided us with
an empirical grounding for the development of an ex-
traction algorithm. Having as inspiration the work
performed by Marcu (Marcu, 1997) we analyzed a
collection of around 130 publications from the Com-
puter Science field and identified 75 cue phrases that
signal the rhetorical relations mentioned above. For
each cue phrase we extracted a number of text frag-
ments, in order to identify two types of information:
(i) discourse related information (i.e. the rhetorical
relations that were marked by the cue phrases and the
roles of the related text spans), and (ii) algorithmic
information (i.e. position in the sentence, its position
according to the neighboring text spans and the sur-
rounding punctuation). This information constitutes
the empirical foundation of our algorithm that identi-
fies the elementary unit boundaries and discourse us-
ages of the cue phrases. The actual implementation
was embedded into a GATE (http://gate.ac.uk/) plugin.
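As an illustration of how cue phrases can be used to segment text spans and hypothesize the signalled rhetorical relations, the following simplified Python sketch (our own illustration, not the actual GATE plugin; the small cue phrase table and the boundary convention used here are assumptions of the example) marks a span boundary at each cue phrase and records the associated relation:

```python
import re

# A tiny excerpt of a cue phrase table; the real algorithm relies on 75 cue
# phrases together with positional and punctuation information.
CUE_PHRASES = {
    "however": "Antithesis",
    "although": "Antithesis",
    "but": "Antithesis",
    "because": "Cause",
    "for example": "Elaboration",
}

def segment_spans(sentence):
    """Split a sentence into elementary text spans at cue phrase boundaries
    and return (span, signalled_relation) pairs. The span before a cue phrase
    is left unlabelled and the cued span carries the signalled relation
    (a simplification of the actual boundary rules)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CUE_PHRASES) + r")\b",
        re.IGNORECASE)
    spans, last, relation = [], 0, None
    for match in pattern.finditer(sentence):
        text = sentence[last:match.start()].strip(" ,;")
        if text:
            spans.append((text, relation))
        relation = CUE_PHRASES[match.group(1).lower()]
        last = match.end()
    tail = sentence[last:].strip(" ,;.")
    if tail:
        spans.append((tail, relation))
    return spans

if __name__ == "__main__":
    print(segment_spans(
        "We propose a hybrid extraction approach, although no training data is used."))
    # -> [('We propose a hybrid extraction approach', None),
    #     ('no training data is used', 'Antithesis')]
```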
The second phase of the extraction process con-
sisted of an experiment. The annotation of epistemic
items (i.e. claims, positions, arguments) in a docu-
ment is a highly subjective task. Different people have
diverse mental representations of a given document,
depending on their domain or the depth of knowl-
edge of the document in question. Therefore, prob-
ably the most reliable annotator of a scientific publi-
cation would be its author. In order to capture the way
in which people find (and maybe interpret) epistemic
items, we ran an experiment. The goal of the experi-
ment was to allow us to compute initial values for the
probabilities of text spans to be epistemic items.
The setup of the experiment included ten re-
searchers (authors of scientific publications), two col-
lections and two tasks. The two tasks were essentially
the same, each performed on
a different collection. The first collection comprised a
set of ten publications chosen by us, while the second
collection had 20 publications provided by the annotators.
Each annotator provided us with two of her own
publications. For each publication, we extracted a list
of text spans (based on the presence of rhetorical relations)
and presented this list to the annotators. On
average, each list had around 110 items. The annotators'
task was to mark, in the given lists, the text
spans that they considered to be epistemic items. Although
the task was complex enough
by its nature, we chose not to run the experiment in a
controlled environment. Having collected the marked
lists from the annotators, we decoded the rhetorical
relations hidden behind each knowledge item in the
list and computed the proportional specific raw inter-
annotator agreement per publication. The result was
a list of proportions of specific (positive, negative and
overall) raw inter-annotator agreements. Each rhetor-
ical relation was then assigned the corresponding av-
erage of the positive agreement values, based on the
originating knowledge items.
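For two annotators and a given list of candidate spans, the proportions of specific agreement are commonly computed as

p_pos = 2a / (2a + b + c),   p_neg = 2d / (2d + b + c),   p_overall = (a + d) / (a + b + c + d)

where a counts the spans marked by both annotators, d the spans marked by neither, and b and c the spans marked by exactly one of them (shown here for the two-annotator case; the multi-annotator setting uses the corresponding generalized proportions). The per-relation values mentioned above are then the averages of the positive agreements over the knowledge items originating from each relation.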
The actual extraction implemented in the system
considers a fixed probability threshold for the rhetorical
relations and, based on the input text, provides
the list of items whose rhetorical relation probabilities
are above the threshold.
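A minimal sketch of this selection step could look as follows (Python; the probability values and the threshold are illustrative only, not the ones derived from our experiment):

```python
# Illustrative per-relation probabilities of signalling a knowledge item;
# the actual values come from the inter-annotator agreement experiment.
RELATION_PROBABILITY = {
    "Antithesis": 0.62,
    "Elaboration": 0.35,
    "Cause": 0.48,
}

THRESHOLD = 0.5  # fixed probability threshold, value chosen here for illustration

def knowledge_item_candidates(spans, threshold=THRESHOLD):
    """Keep the text spans whose signalling rhetorical relation has a
    probability above the fixed threshold."""
    return [(text, rel) for text, rel in spans
            if rel is not None and RELATION_PROBABILITY.get(rel, 0.0) > threshold]

if __name__ == "__main__":
    spans = [("no training data is used", "Antithesis"),
             ("for instance on ACM papers", "Elaboration")]
    print(knowledge_item_candidates(spans))
    # -> [('no training data is used', 'Antithesis')]
```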
3.4 Information Expansion
The previous two steps together represent the hybrid
mechanism for extracting an ontological instance
model from scientific publications. The model can
then be exported or embedded into the original publi-
cation, thus lifting the semantics captured in the docu-
ment, into a machine-processable format. In addition,
we used parts of the model to perform information
expansion. The design of the application allows one
to integrate multiple expansion modules, each con-
nected to a particular publication repository. Cur-
rently, we have implemented such a module based on
DBLP. On demand, the extracted shallow metadata
(i.e. the title and authors) is used to search the repos-
itory for the corresponding publication. Considering
that the extraction works on a best effort basis, the
final metadata might contain errors both in the title and
in the authors' list. The user has the means to correct
it manually or, if the publication is expanded
correctly, to have it corrected automatically.
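The pluggable design mentioned above can be pictured as a small module interface. The sketch below (Python, with hypothetical class and method names that do not mirror the actual code) shows how repository-specific expansion modules could be registered and invoked on demand:

```python
from abc import ABC, abstractmethod

class ExpansionModule(ABC):
    """Interface assumed for a repository-specific expansion module."""

    @abstractmethod
    def expand(self, title, authors):
        """Return similar publications and per-author publication lists."""

class DBLPExpansionModule(ExpansionModule):
    def expand(self, title, authors):
        # Here the real module queries DBLP using the extracted shallow
        # metadata; a static placeholder result is returned in this sketch.
        return {"similar_publications": [],
                "author_publications": {a: [] for a in authors}}

class InformationHub:
    """Aggregates the results of all registered expansion modules."""

    def __init__(self):
        self.modules = []

    def register(self, module):
        self.modules.append(module)

    def expand(self, title, authors):
        return [m.expand(title, authors) for m in self.modules]

if __name__ == "__main__":
    hub = InformationHub()
    hub.register(DBLPExpansionModule())
    print(hub.expand("A Hybrid Approach towards Information Expansion",
                     ["Tudor Groza", "Siegfried Handschuh"]))
```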
The first element used for searching the repository
is the title of the publication. In order to 'correct'
(or mask) possibly existing errors in the title, we
use string similarity measures to find the proper
publication. An empirical analysis led us to use a
combination of the Monge-Elkan and Soundex algorithms,
with fixed thresholds. The first analyzes
fine-grained sub-string details, while the second looks
at coarse-grained phonetic aspects. The publications
that pass the imposed thresholds are then checked
based on the existing authors. The best match is then
provided to the user as a candidate for correcting the
existing metadata (see the left part of Fig. 3, with the title
highlighted in blue). The same approach is also followed
for each author of the publication.
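The following sketch illustrates this kind of combination: a Monge-Elkan token matching with a standard-library measure as its internal similarity, plus a basic Soundex comparison. The secondary measure, the thresholds and the phonetic heuristic are assumptions made for the example, not the values used in the application.

```python
from difflib import SequenceMatcher

def soundex(word):
    """Basic American Soundex code (coarse-grained phonetic comparison)."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0]
    prev = next((d for k, d in codes.items() if word[0].lower() in k), "")
    for ch in word[1:].lower():
        digit = next((d for k, d in codes.items() if ch in k), "")
        if digit and digit != prev:
            encoded += digit
        if ch not in "hw":
            prev = digit
    return (encoded + "000")[:4]

def monge_elkan(tokens_a, tokens_b):
    """Monge-Elkan similarity: average of the best secondary-similarity match
    of each token of A against the tokens of B (SequenceMatcher is used here
    as the internal measure for fine-grained sub-string details)."""
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(SequenceMatcher(None, a, b).ratio() for b in tokens_b)
            for a in tokens_a]
    return sum(best) / len(best)

def titles_match(extracted, candidate, me_threshold=0.85):
    """Accept a candidate title when both the fine-grained and the phonetic
    checks pass (illustrative thresholds and heuristic)."""
    t1, t2 = extracted.lower().split(), candidate.lower().split()
    phonetic_ok = [soundex(a) for a in t1[:3]] == [soundex(b) for b in t2[:3]]
    return monge_elkan(t1, t2) >= me_threshold and phonetic_ok

if __name__ == "__main__":
    print(titles_match("A Hybird Approach towards Information Expansion",
                       "A Hybrid Approach towards Information Expansion"))
```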
Figure 3: Example of information expansion within our application.

The outcome of the expansion process features
two elements: (i) a list of similar publications to the
one given as input, each with its authors, and (ii) for
each author of the given publication found, her com-
plete list of publications existing in the respective
repository. This provides the user with the chance
both to analyze publications that might have similar
approaches and to inspect all the publications of a
particular author.
Based on the same repository, we have also im-
plemented the option of visualizing the graph of co-
authors starting from a particular author, together with
their associated publications. The right side of Fig. 3
shows an example of such an exploration.
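Such a co-author graph can be assembled directly from the expanded publication lists. The sketch below (plain Python, with an illustrative in-memory representation rather than the application's actual data model) builds the adjacency structure and explores it starting from a given author:

```python
from collections import defaultdict

def build_coauthor_graph(publications):
    """Each publication is (title, [authors]); edges link authors who share a
    publication, and every node keeps the list of its associated titles."""
    edges = defaultdict(set)
    papers = defaultdict(list)
    for title, authors in publications:
        for author in authors:
            papers[author].append(title)
            edges[author].update(a for a in authors if a != author)
    return edges, papers

def explore(edges, start, depth=2):
    """Breadth-first exploration of the co-author graph up to a given depth."""
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [n for node in frontier for n in edges[node] if n not in seen]
        seen.update(frontier)
    return seen

if __name__ == "__main__":
    pubs = [("SALT - Semantically Annotated LaTeX",
             ["Tudor Groza", "Siegfried Handschuh"]),
            ("The Social Semantic Desktop",
             ["Siegfried Handschuh", "Stefan Decker"])]
    edges, papers = build_coauthor_graph(pubs)
    print(explore(edges, "Tudor Groza"))
```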
4 EVALUATION
We evaluated each of the modules described in the
previous section, and performed a usability study of
the overall application.
Shallow Metadata Extraction. The shallow meta-
data extraction was evaluated by testing the algo-
rithms on a corpus comprising around 1200 publi-
cations, formatted with the ACM or Springer LNCS
styles. The documents forming the corpus were con-
sistent and uniform in terms of encoding and metadata
content, individually for each type of formatting. As
extraction from PDF documents depends on a hand-
ful of factors (e.g. encoding, encryption, etc), the
evaluation results presented here, considers only the
documents for which the actual extraction was valid
(i.e. the PDF parser was able to read the document).
Also, the results present the algorithms working on
a best effort basis (no additional information is pro-
vided about the publications). Nevertheless, our next
step will be to provide the means for optimizing the
algorithms for particular styles and formats.
Table 1: Performance measures of the shallow metadata extraction.

            Accuracy  Prec.  Rec.  F
Title       0.95      0.96   0.98  0.96
Authors     0.90      0.92   0.96  0.93
Abstract    0.96      0.99   0.96  0.97
Sections    0.92      0.97   0.93  0.94
References  0.91      0.96   0.93  0.94

Table 1 lists the evaluation results. Overall, the
title and abstract extraction algorithms performed the
best, with accuracies of 95% and 96%, respectively.
We observed that most of the cases in which the algo-
rithms failed to produce a result were documents that
the PDF parser managed to read but failed to actually
parse. If we eliminated this set of documents,
the accuracy would probably increase by an additional
2%. The 90% accuracy of the authors extraction
algorithm is mostly due to the lack of adherence
to the formatting style, or to the presence of special symbols
close to the authors' names. The sections extraction
algorithm performed extremely well, with a 92% accuracy,
meaning that in these cases it managed
to extract the complete tree of sections from the
paper. On the other hand, the references extraction algorithm
did not perform as well as we expected, and
had an accuracy of only 91%.
Deep Metadata Extraction. The deep metadata
extraction was tested in a preliminary evaluation,
with an emphasis on the extraction of
knowledge items. The setup of the evaluation was
similar to that of the experiment described in the
previous section. We used two corpora (with a total of
30 publications), one with the evaluators' own papers
and one containing a set of papers we chose. Each
evaluator was asked to mark those text spans in the
text that she considered to be knowledge items, both
in her own paper and in the one we assigned. In parallel,
we ran our tool on the same set of publications
and compiled the predicted list of candidates. At the
end we computed the usual performance measures,
i.e. precision, recall and f-measure. The evaluation
results are summarized in Table 2.

Table 2: Evaluation results.

Corpus         Prec.  Recall  F-Measure
I (own)        0.5    0.3     0.18
II (provided)  0.43   0.31    0.19
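For reference, the usual definitions of these measures, computed over the predicted candidate list and the annotators' marked spans, are

P = TP / (TP + FP),   R = TP / (TP + FN),   F = 2PR / (P + R)

where TP, FP and FN count the correctly predicted, spuriously predicted and missed knowledge items, respectively.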
One could interpret the performance measures
of the extraction results in different ways. On one
side, we see them as satisfactory, because they represent
the effect of merely the first step of a more
complex extraction mechanism we have envisioned.
At the same time, if we compare them, for example,
with the best precision reached by (Teufel and Moens,
2002), i.e. 70.5, we find our 0.5 precision to be encouraging.
This is mainly because in our case there
was no training involved, and we considered only two
parameters in the extraction process, i.e. the rhetorical
relations and the block placement, while Teufel
employed a very complex naïve Bayes classifier with
a pool of 16 parameters and 20 hours of training for
the empirical experiment. On the other hand, these
results clearly show that we also need to consider
other parameters, such as the presence of anaphora, a
proper distinction between the different types of epistemic
items, or the verb tense used, parameters which
are already part of our future plans.
Secondly, the formula we have used for computing
the final probabilities, within this preliminary
evaluation, has a very important influence on the extraction
results. Currently, we opted for a simple
formula that gives more weight to the probabilities
derived from the annotation of own papers. Such an
approach should be used when the automatic extraction
is performed by an author on her own papers, for
example, in real time while authoring them. This is
clearly reflected in the positive difference in precision
between the own corpus and the provided one. On the
other hand, if used for information retrieval purposes,
by readers and not by authors, the computation formula
should be changed, so that it gives more weight
to the probabilities derived from the annotation of
given papers. This practically translates into shaping
the extraction results in a form closer to what a reader
would expect.
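Although the exact formula is not reproduced here, the weighting scheme discussed above can be pictured as a convex combination (an illustration of the idea, not the implemented formula):

p(r) = w · p_own(r) + (1 - w) · p_given(r),   with w > 0.5 in the current, author-oriented setting

where p_own(r) and p_given(r) are the probabilities derived for a rhetorical relation r from the annotation of own and of given papers, respectively; a reader-oriented setting would shift the weight towards p_given(r).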
Usability Study. The usability study was per-
formed with 16 evaluators and included a series of
tasks to cover all the application's functionalities. Examples
of tasks included: extraction and manual cor-
rection of metadata from publications, expansion of
information starting from a publication or exploration
of the co-author graph. At the end, the evaluators
had to fill in a questionnaire, comprising 18 ques-
tions, with Likert scale or free form answers, focus-
ing on two aspects: (i) suitability and ease of use,
and (ii) design, layout and conformity to expectan-
cies. The complete results of the questionnaire can be
found at http://smile.deri.ie/sclippy-usabilitystudy.
Overall, the application scored very well in both
categories. For example, the vast majority of the eval-
uators (on average more than 90%) found the tool
both helpful and well suited for the extraction and ex-
ploration of shallow and deep metadata. In the other
category, the same proportion of evaluators considered the
application easy to learn and use, while finding the
design and layout both appealing and suited for the
task. We discovered possible issues in two cases. The
first one would be the self-descriptiveness of the ap-
plication’s interface, mainly due to the lack of visual
indicators and tooltips. The second case was related
to the suggested list of similar publications. Although
the application always proposed the exact publication
selected for expansion, the rest of the list created some
confusion. We believe that the cause of the confu-
sion is the fact that the similarity measures we have
adopted were not well suited.
This study led us to a series of directions for improvement.
First of all, we need a more
complex mechanism for suggesting similar publications.
This will depend to a large extent on the repository
used for expansion and on the information that it
provides. For example, in the case of the ACM Portal,
we will also consider the text of the abstract when computing
the candidate list. Secondly, we plan to augment the
expanded information with additional elements (e.g.
abstract, references, citation contexts), thus providing
a deeper insight into the publications and a richer experience
for the users. Lastly, we will integrate the
application with the Social Semantic Desktop platform.
This will lead to centralized data persistence
and deeper integration and linking of the metadata
into the more general context of personal information.
5 RELATED WORK
Our approach combines different directions for
achieving its goals. In the following, we provide an
overview of the related efforts corresponding
to each research direction. We will cover
mainly: (i) methods used for automatic extraction
of shallow metadata, (ii) models for structuring dis-
course, and (iii) information visualization for scien-
tific publications.
There have been several methods used for au-
tomatic extraction of shallow metadata, like regular
expressions, rule-based parsers or machine learning.
Regular expressions and rule-based systems have the
advantage that they do not require any training and
are straightforward to implement. Successful work
has been reported in this direction, with emphasis on
PostScript documents in (Shek and Yang, 2000), or
considering HTML documents and use of natural lan-
guage processing methods in (Yilmazel et al., 2004).
A different trend in the same category is given by machine
learning methods, which are more efficient, but
also more expensive, due to the need for training data.
Hidden Markov models (HMMs) are the most widely
used among these techniques. However, HMMs are
based on the assumption that the features of the model
they represent are independent of each other.
Thus, HMMs have difficulty exploiting the regularities
of a semi-structured real system. Maximum entropy
based Markov models (McCallum et al., 2000) and
conditional random fields (Lafferty et al., 2001) have
been introduced to deal with the problem of non-independent
features. In the same category, but following a
different approach, is the work performed by Han et
al. (Han et al., 2003), who use Support Vector Machines
(SVMs) for metadata extraction.
Some remarks are worth noting here regarding
a comparison between the above-mentioned methods
and ours. First of all, the comparison between the vi-
sual/spatial approaches (Shek’s and ours) and the ma-
chine learning ones is not really appropriate. This is
mainly because the latter can easily cope with general
formats, while the former are “static” methods, fo-
cused on a particular format. Nevertheless, the learn-
ing methods impose a high cost due to their need for
accurate training data, while the static methods have
no training associated. Secondly, due to the lack of
a common dataset, a direct comparison of efficiency
measures cannot be realized. The main reason is
that most of the machine learning methods work
on plain text already extracted by some other means,
while the spatial approaches work on the actual doc-
uments. Still, we consider Shek’s approach to be the
closest to ours, although targeting a different docu-
ment format. Based on a purely empirical comparison
we observe a higher accuracy for our title and authors
extraction method (around 5%), as well as a higher
accuracy for the linear structure extraction (around
15%), while also providing additional metadata (i.e.
abstract or references).
In the area of discourse structuring, we can find
a rich sphere of related work. One of the first mod-
els was introduced by (Teufel and Moens, 2002) and
tried to categorize phrases from scientific publica-
tions into seven categories based on their rhetorical
role. Later, the authors developed an automatic ex-
traction approach, following a similar method to ours,
starting from a corpus of manually annotated documents
and a set of probabilities derived from inter-annotator
agreement studies. Teufel did not make use
of any relations between the extracted items. (Shum
et al., 2006) were the first to describe one of the
most comprehensive models for argumentation in sci-
entific publications, using Cognitive Coherence Relations
as links between the epistemic items. They de-
veloped a series of tools for the annotation, search
and visualization of scientific publications based on
this model, which represent our main inspiration. The
automatic extraction approach they followed was the
same as the one developed by Teufel, i.e. by com-
piling and using a list of particular cue-phrases. Although
their model is richer than the previous one, due to
the presence of relations, they do not make use of the
placement of the item in the publication.
With respect to information visualization of sci-
entific publications, a number of methods and tools
have been reported in the literature. The 2004 InfoVis
challenge motivated the introduction of a number
of visualization tools highlighting different aspects of
a selected set of publications in the Information Vi-
sualization domain. (Faisal et al., 2007) reported on
using the InfoVis 2004 contest dataset to visualize ci-
tation networks via multiple coordinated views. Un-
like our work, these tools were based on the contents
of a single file, containing manually extracted meta-
data. As noted by the challenge chairs, it was a difficult
task to produce the metadata file (Plaisant et al.,
2008), and the considerable effort required
made it challenging for widespread use. In (Neirynck
and Borner, 2007), a small scale research manage-
ment tool was built to help visualize various rela-
tionships between lab members and their respective
publications. A co-authorship network visualization
was built from data entered by users in which nodes
represent researchers together with their publications
and links show their collaborations. A similar effort
to visualize domain knowledge was reported by (Murray
et al., 2006), with the data source being bibliographic
files obtained from distinguished researchers in the
“network science” area. While this work was also
concerned with cleansing data from noisy sources, the
metadata in use was not extracted from the publications
themselves and no further information available from
external sources such as DBLP was utilized.
6 CONCLUSIONS
This paper presented an application that integrates a
series of different approaches for achieving informa-
tion expansion and visualization based on extracted
shallow and deep metadata. The extraction process of
the shallow metadata followed a low-level document
engineering approach, combining font mining and
format information. On the other hand, the extraction
of deep metadata (i.e. the rhetorical discourse struc-
ture) was performed based on a combined linguistic
and empirical approach. The metadata was used for
information expansion and visualization based on dif-
ferent publication repositories. The evaluation results
of all of the application's components encourage us
to continue our efforts in the same direction, by increasing
the efficiency of the extraction mechanisms.
Future work will focus especially on improving
the extraction of deep metadata by considering word
co-occurrence, anaphora resolution and verb tense
analysis. These improvements will also be reflected
in a new iteration over the initial weights (probabilities)
assigned to the epistemic items that resulted from
this paper. At the application level, we will implement
additional expansion modules, thus integrating more
publication repositories. Also, we intend to release
the application's core source code as open source, so
that its adoption can be directly coupled with its further
development, including contributions from the
researchers who would like to use it.
ACKNOWLEDGEMENTS
The work presented in this paper has been funded
by Science Foundation Ireland under Grant No.
SFI/08/CE/I1380 (Lion-2). The authors would like
to thank VinhTuan Thai, Georgeta Bordea and Ioana
Hulpus for their support and help.
REFERENCES
Bernardi, A., Decker, S., van Elst, L., Grimnes, G., Groza,
T., Handschuh, S., Jazayeri, M., Mesnage, C., Möller,
K., Reif, G., and Sintek, M. (2008). The Social Se-
mantic Desktop: A New Paradigm Towards Deploying
the Semantic Web on the Desktop. IGI Global.
Faisal, S., Cairns, P. A., and Blandford, A. (2007). Building
for Users not for Experts: Designing a Visualization
of the Literature Domain. In Information Visualisation
2007, pages 707–712. IEEE Computer Society.
Groza, T., Handschuh, S., Möller, K., and Decker, S.
(2007). SALT - Semantically Annotated LaTeX for
Scientific Publications. In ESWC 2007, Innsbruck,
Austria.
Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z.,
and Fox, E. A. (2003). Automatic document metadata
extraction using support vector machines. In Proc.
of the 3rd ACM/IEEE-CS Joint Conf. on Digital li-
braries, pages 37–48.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Con-
ditional random fields: Probabilistic models for seg-
menting and labeling sequence data. In Proc. of the
18th Int. Conf. on Machine Learning, pages 282–289.
Mann, W. C. and Thompson, S. A. (1987). Rhetorical struc-
ture theory: A theory of text organization. Technical
Report RS-87-190, Information Science Institute.
Marcu, D. (1997). The Rhetorical Parsing, Summarization,
and Generation of Natural Language Texts. PhD the-
sis, University of Toronto.
McCallum, A., Freitag, D., and Pereira, F. (2000). Maxi-
mum entropy markov models for information extrac-
tion and segmentation. In Proc. of the 17th Int. Conf.
on Machine Learning, pages 591–598.
Möller, K., Heath, T., Handschuh, S., and Domingue, J.
(2007). Recipes for Semantic Web Dog Food - The
ESWC and ISWC Metadata Projects. In Proc. of the
6th Int. Semantic Web Conference.
Murray, C., Ke, W., and Borner, K. (2006). Mapping scien-
tific disciplines and author expertise based on personal
bibliography files. In Information Visualisation 2006,
pages 258–263. IEEE Computer Society.
Neirynck, T. and Borner, K. (2007). Representing, ana-
lyzing, and visualizing scholarly data in support of
research management. In Information Visualisation
2007, pages 124–129. IEEE Computer Society.
Plaisant, C., Fekete, J.-D., and Grinstein, G. (2008). Pro-
moting Insight-Based Evaluation of Visualizations:
From Contest to Benchmark Repository. IEEE Trans-
actions on Visualization and Computer Graphics,
14(1):120–134.
Shek, E. C. and Yang, J. (2000). Knowledge-Based Meta-
data Extraction from PostScript Files. In Proc. of the
5th ACM Conf. on Digital Libraries, pages 77–84.
Shum, S. J. B., Uren, V., Li, G., Sereno, B., and Mancini,
C. (2006). Modeling naturalistic argumentation in re-
search literatures: Representation and interaction de-
sign issues. Int. J. of Intelligent Systems, 22(1):17–47.
Teufel, S. and Moens, M. (2002). Summarizing scientific
articles: Experiments with relevance and rhetorical
status. Computational Linguistics, 28.
Tsujii, J. (2009). Refine and pathtext, which combines text
mining with pathways. Keynote at Semantic Enrich-
ment of the Scientific Literature 2009 (SESL 2009).
Yilmazel, O., Finneran, C. M., and Liddy, E. D. (2004).
Metaextract: an nlp system to automatically assign
metadata. In JCDL ’04: Proceedings of the 4th
ACM/IEEE-CS joint conference on Digital libraries,
pages 241–242.