underdetermined and unclear search goals that are reflected in the formulation of vague search queries. Such search queries result in a huge number of hits, and examining this amount of scientific literature is a time-consuming endeavor.
Each article is characterized by a title, authors, a short description (i.e., an abstract), a source (e.g., a book or a journal), a publication date (e.g., a year), and its text. These attributes can contain specific words, i.e., terms, that the information seeker can recognize as relevant and that trigger the formulation of refined search queries (Barry, 1994; Anderson, 2006).
Studies conducted by Anderson (2006) reported that it was difficult to find and specify appropriate terms to define more precise search queries, especially if an information seeker was unfamiliar with the terminology of the problem domain, or if this terminology changed over time.
3 INFORMATION EXTRACTION
Our idea for domain-independent term extraction is
based on the assumption that, regardless of the do-
main we are dealing with, the majority of the TTs in
a document are in nominal group positions. To ver-
ify this assumption, we manually annotated a set of
100 abstracts from the biology part of the Zeitschrift fuer Naturforschung (ZfN, http://www.znaturforsch.com/) archive, which contains
scientific papers published by the ZfN between 1997
and 2003. We found that 94% of the annotated terms
were in fact in noun group positions. The starting
point of our method for extracting terms is therefore
an algorithm to extract nominal groups from a text.
We then classify these nominal groups into TTs and
non-TTs using frequency counts retrieved from the
MSN search engine. For the extraction of term can-
didates, we use the nominal group (NG) chunker of
the GNR tool developed by Spurk (2006), which we
slightly adapted for our purposes. The advantage of
this chunker compared to other chunkers is that it is domain-independent: it is not trained on a particular corpus but relies on patterns based on closed-class words (e.g., prepositions, determiners, coordinators), which are available in all domains. Using lists of closed-class words, the NG chunker determines the left and right boundaries of a word group and defines all words in between as an NG.
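The following Python sketch illustrates this boundary-based idea; the word lists, function name, and example sentence are ours and purely illustrative, whereas the actual GNR chunker relies on far more comprehensive closed-class word lists and patterns.

    import re

    # Illustrative (deliberately small) closed-class word lists; the GNR
    # chunker uses much richer, domain-independent lists.
    CLOSED_CLASS = {
        "the", "a", "an", "this", "these", "those",                  # determiners
        "of", "in", "on", "for", "with", "by", "from", "to", "at",   # prepositions
        "and", "or", "but",                                          # coordinators
        "is", "are", "was", "were", "be", "been",                    # auxiliaries
    }

    def extract_nominal_groups(text):
        """Collect maximal runs of tokens bounded by closed-class words,
        punctuation, or the text edges; each run is a candidate NG."""
        tokens = re.findall(r"[A-Za-z][A-Za-z-]*|[.,;:()]", text)
        groups, current = [], []
        for tok in tokens:
            if tok.lower() in CLOSED_CLASS or not tok[0].isalpha():
                if current:                  # right boundary reached
                    groups.append(" ".join(current))
                    current = []
            else:
                current.append(tok)          # token belongs to the current NG
        if current:
            groups.append(" ".join(current))
        return groups

    # Yields candidates such as 'inhibition', 'photosynthesis', 'copper ions'
    # and 'green algae' (plus some verbal noise, e.g. 'studied', which the
    # real chunker's richer lists would exclude).
    print(extract_nominal_groups(
        "The inhibition of photosynthesis by copper ions was studied in green algae."))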
In order to find the TTs within the extracted NG chunks, we
use a frequency-based approach. Our assumption is that terms that occur in the mid-frequency range of a large corpus are the ones that are most associated with some topic and will often constitute technical terms. To
test our hypothesis, we retrieved frequency scores for
all NG chunks extracted from our corpus of abstracts
from the biology domain and calculated the ratio be-
tween TTs and non-TTs for particular maximum fre-
quency scores. To retrieve the frequency scores for
our chunks, we use the internet as a reference corpus, as it is general enough to cover a broad range of domains, and retrieve the scores using the Live Search API of the MSN search engine (http://dev.live.com/livesearch/). The results confirm
our hypothesis, showing that the ratio increases up to
an MSN score threshold of about 1.5 million and then
slowly declines. This means that chunks with a mid-frequency score are in fact more likely to be technical terms than chunks with a low or high score.
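A minimal sketch of this classification step is given below; the hit-count function is left as a parameter (e.g., a wrapper around a web search API), and the default boundary values are placeholders, since the actual boundaries are optimized as described next.

    from typing import Callable, Iterable, List

    def classify_term_candidates(
        chunks: Iterable[str],
        hit_count: Callable[[str], int],
        lower: int = 10_000,          # placeholder lower boundary
        upper: int = 6_050_000,       # placeholder upper boundary
    ) -> List[str]:
        """Keep the NG chunks whose web frequency lies in the mid-frequency
        band [lower, upper] and classify them as technical terms; very rare
        and very common chunks are discarded."""
        terms = []
        for chunk in chunks:
            if lower <= hit_count(chunk) <= upper:
                terms.append(chunk)
        return terms

    # Usage (hypothetical): terms = classify_term_candidates(ng_chunks, my_hit_count)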
To optimize the lower and upper boundaries that define 'mid-frequency', we maximized the F-measure achieved on our annotated biology corpus over different threshold settings.
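One way to implement this optimization, assuming gold TT annotations and per-chunk frequency scores are available, is a simple grid search over candidate boundaries; the helper names and grids below are ours.

    from itertools import product
    from typing import Dict, Iterable, Set, Tuple

    def f_measure(predicted: Set[str], gold: Set[str]) -> float:
        """Balanced F-measure of a predicted term set against the gold terms."""
        tp = len(predicted & gold)
        if tp == 0:
            return 0.0
        precision = tp / len(predicted)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    def optimize_boundaries(
        chunk_scores: Dict[str, int],   # NG chunk -> web frequency score
        gold_terms: Set[str],           # manually annotated TTs
        lower_grid: Iterable[int],
        upper_grid: Iterable[int],
    ) -> Tuple[float, int, int]:
        """Grid-search the (lower, upper) frequency boundaries that maximize
        the F-measure on the annotated corpus."""
        best = (0.0, 0, 0)
        for lower, upper in product(lower_grid, upper_grid):
            if lower >= upper:
                continue
            predicted = {c for c, s in chunk_scores.items() if lower <= s <= upper}
            score = f_measure(predicted, gold_terms)
            if score > best[0]:
                best = (score, lower, upper)
        return best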
Evaluating our algorithm on our annotated corpus of abstracts, we obtained the following results. From the biology corpus, our NG chunker
was able to extract 1264 (63.2%) of the 2001 anno-
tated TTs in NG position completely and 560 (28.0%)
partially. With the threshold optimized for the F-measure (6.05 million), we achieved a precision of 57.0% at a recall of 82.9% over the total matches. These results are comparable to results for GN learning, e.g., those reported by Yangarber et al. (2002) for extracting diseases from a medical corpus. We also evaluated our
approach on the GENIA corpus (http://www-tsujii.is.s.u-tokyo.ac.jp/genia/topics/Corpus/), a standard corpus for biology. Considering all GENIA terms with POS tags matching the regular expression JJ* NN* (NN|NNS)
as terms in NG position, we were able to evaluate our approach on 62.4% of all terms. On this data, we achieved 50.0% precision at a recall of 75.0%.
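The filter below sketches how such a POS-pattern check can be applied to a term's tag sequence; the space-joined representation is merely our convention.

    import re

    # Terms count as being in NG position if their POS tag sequence matches
    # JJ* NN* (NN|NNS): optional adjectives, optional nouns, ending in a noun.
    NG_POSITION = re.compile(r"^(JJ )*(NN )*(NN|NNS)$")

    def in_ng_position(pos_tags):
        """True if the space-joined POS tag sequence matches the pattern."""
        return NG_POSITION.match(" ".join(pos_tags)) is not None

    print(in_ng_position(["JJ", "NN", "NNS"]))  # True: adjective + noun + plural noun
    print(in_ng_position(["NN", "IN", "NN"]))   # False: contains a preposition (IN)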
A sample abstract from the ZfN data, with the automatically extracted TTs shaded, is shown in Figure 1. The key advantage of our approach over other approaches to
GN learning is that it extracts a broad range of differ-
ent TTs robustly and irrespective of the existence of
morphological or contextual patterns in a training cor-
pus. It works independently of the domain, the length of the input text, and the size of the corpus in which the input document appears. This makes it, in principle, applicable to documents from any digital library.