underdetermined and unclear search goals that are reflected in the formulation of vague search queries. Such search queries result in a huge number of hits, and examining this amount of scientific literature is a time-consuming endeavor.
Each article is characterized by a title, authors, a short description (i.e., an abstract), a source (e.g., a book or a journal), a publication date (e.g., a year), and its text. These attributes can contain specific words, i.e., terms, that the information seeker can recognize as relevant and that trigger the formulation of refined search queries (Barry, 1994; Anderson, 2006).
Studies conducted by Anderson (2006) reported that it was difficult to find and specify appropriate terms to define more precise search queries, especially if an information seeker was unfamiliar with the terminology of the problem domain, or if this terminology changed over time.
3 INFORMATION EXTRACTION
Our idea for domain-independent term extraction is
based on the assumption that, regardless of the do-
main we are dealing with, the majority of the TTs in
a document are in nominal group positions. To ver-
ify this assumption, we manually annotated a set of
100 abstracts from the biology part of the Zeitschrift fuer Naturforschung (ZfN, http://www.znaturforsch.com/) archive, which contains
scientific papers published by the ZfN between 1997
and 2003. We found that 94% of the annotated terms
were in fact in noun group positions. The starting
point of our method for extracting terms is therefore
an algorithm to extract nominal groups from a text.
We then classify these nominal groups into TTs and
non-TTs using frequency counts retrieved from the
MSN search engine. For the extraction of term can-
didates, we use the nominal group (NG) chunker of
the GNR tool developed by Spurk (2006), which we
slightly adapted for our purposes. The advantage of
this chunker compared to other chunkers is that it is domain-independent: it is not trained on a particular corpus but relies on patterns based on closed-class words (e.g., prepositions, determiners, coordinators), which are available in all domains. Using lists of closed-class words, the NG chunker determines the left and right boundaries of a word group and defines all words in between as an NG.
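The following Python sketch illustrates this boundary-based idea; the word lists, function name, and example sentence are ours and purely illustrative, whereas the actual GNR chunker relies on far more comprehensive closed-class word lists and patterns.

    import re

    # Illustrative (deliberately small) closed-class word lists; the GNR
    # chunker uses much richer, domain-independent lists.
    CLOSED_CLASS = {
        "the", "a", "an", "this", "these", "those",                  # determiners
        "of", "in", "on", "for", "with", "by", "from", "to", "at",   # prepositions
        "and", "or", "but",                                          # coordinators
        "is", "are", "was", "were", "be", "been",                    # auxiliaries
    }

    def extract_nominal_groups(text):
        """Collect maximal runs of tokens bounded by closed-class words,
        punctuation, or the text edges; each run is a candidate NG."""
        tokens = re.findall(r"[A-Za-z][A-Za-z-]*|[.,;:()]", text)
        groups, current = [], []
        for tok in tokens:
            if tok.lower() in CLOSED_CLASS or not tok[0].isalpha():
                if current:                  # right boundary reached
                    groups.append(" ".join(current))
                    current = []
            else:
                current.append(tok)          # token belongs to the current NG
        if current:
            groups.append(" ".join(current))
        return groups

    # Yields candidates such as 'inhibition', 'photosynthesis', 'copper ions'
    # and 'green algae' (plus some verbal noise, e.g. 'studied', which the
    # real chunker's richer lists would exclude).
    print(extract_nominal_groups(
        "The inhibition of photosynthesis by copper ions was studied in green algae."))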
In order to find the TTs within the extracted NG chunks, we
use a frequency-based approach. Our assumption is that terms that occur in the mid-frequency range of a large corpus are the ones that are most associated with some topic and will often constitute technical terms. To
test our hypothesis, we retrieved frequency scores for
all NG chunks extracted from our corpus of abstracts
from the biology domain and calculated the ratio be-
tween TTs and non-TTs for particular maximum fre-
quency scores. To retrieve the frequency scores for
our chunks, we use the internet as a reference corpus, as it is general enough to cover a broad range of domains, and retrieve the scores using the Live Search API of the MSN search engine (http://dev.live.com/livesearch/). The results confirm
our hypothesis, showing that the ratio increases up to
an MSN score threshold of about 1.5 million and then
slowly declines. This means that chunks with a mid-frequency score are in fact more likely to be technical terms than chunks with a low or high score.
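A minimal sketch of this classification step is given below; the hit-count function is left as a parameter (e.g., a wrapper around a web search API), and the default boundary values are placeholders, since the actual boundaries are optimized as described next.

    from typing import Callable, Iterable, List

    def classify_term_candidates(
        chunks: Iterable[str],
        hit_count: Callable[[str], int],
        lower: int = 10_000,          # placeholder lower boundary
        upper: int = 6_050_000,       # placeholder upper boundary
    ) -> List[str]:
        """Keep the NG chunks whose web frequency lies in the mid-frequency
        band [lower, upper] and classify them as technical terms; very rare
        and very common chunks are discarded."""
        terms = []
        for chunk in chunks:
            if lower <= hit_count(chunk) <= upper:
                terms.append(chunk)
        return terms

    # Usage (hypothetical): terms = classify_term_candidates(ng_chunks, my_hit_count)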
To optimize the lower and upper boundaries that define 'mid-frequency', we maximized the F-measure achieved on our annotated biology corpus over different threshold settings.
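One way to implement this optimization, assuming gold TT annotations and per-chunk frequency scores are available, is a simple grid search over candidate boundaries; the helper names and grids below are ours.

    from itertools import product
    from typing import Dict, Iterable, Set, Tuple

    def f_measure(predicted: Set[str], gold: Set[str]) -> float:
        """Balanced F-measure of a predicted term set against the gold terms."""
        tp = len(predicted & gold)
        if tp == 0:
            return 0.0
        precision = tp / len(predicted)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    def optimize_boundaries(
        chunk_scores: Dict[str, int],   # NG chunk -> web frequency score
        gold_terms: Set[str],           # manually annotated TTs
        lower_grid: Iterable[int],
        upper_grid: Iterable[int],
    ) -> Tuple[float, int, int]:
        """Grid-search the (lower, upper) frequency boundaries that maximize
        the F-measure on the annotated corpus."""
        best = (0.0, 0, 0)
        for lower, upper in product(lower_grid, upper_grid):
            if lower >= upper:
                continue
            predicted = {c for c, s in chunk_scores.items() if lower <= s <= upper}
            score = f_measure(predicted, gold_terms)
            if score > best[0]:
                best = (score, lower, upper)
        return best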
Evaluating our algorithm on our annotated corpus of abstracts, we obtained the following results. From the biology corpus, our NG chunker
was able to extract 1264 (63.2%) of the 2001 anno-
tated TTs in NG position completely and 560 (28.0%)
partially. With the threshold optimized for the F-measure (6.05 million), we achieved a precision of 57.0% at a recall of 82.9% over the total matches. These results are comparable to results for GN learning, e.g., those reported by Yangarber et al. (2002) for extracting diseases from a medical corpus. We also evaluated our
approach on the GENIA corpus (http://www-tsujii.is.s.u-tokyo.ac.jp/genia/topics/Corpus/), a standard corpus for biology. Considering all GENIA terms with POS tags matching the regular expression JJ* NN* (NN|NNS)
as terms in NG position, we were able to evaluate our approach on 62.4% of all terms. On this data, we achieved 50.0% precision at a recall of 75.0%.
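The filter below sketches how such a POS-pattern check can be applied to a term's tag sequence; the space-joined representation is merely our convention.

    import re

    # Terms count as being in NG position if their POS tag sequence matches
    # JJ* NN* (NN|NNS): optional adjectives, optional nouns, ending in a noun.
    NG_POSITION = re.compile(r"^(JJ )*(NN )*(NN|NNS)$")

    def in_ng_position(pos_tags):
        """True if the space-joined POS tag sequence matches the pattern."""
        return NG_POSITION.match(" ".join(pos_tags)) is not None

    print(in_ng_position(["JJ", "NN", "NNS"]))  # True: adjective + noun + plural noun
    print(in_ng_position(["NN", "IN", "NN"]))   # False: contains a preposition (IN)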
A sample abstract from the ZfN data, with the automatically extracted TTs shaded, is shown in Figure 1. The key advantage of our approach over other approaches to
GN learning is that it extracts a broad range of differ-
ent TTs robustly and irrespective of the existence of
morphological or contextual patterns in a training cor-
pus. It works independently of the domain, the length of the input text, and the size of the corpus in which the input document appears. This makes it, in principle, applicable to documents from any digital library.