6 CONCLUSIONS
This paper presented an application that integrates a
series of different approaches for achieving informa-
tion expansion and visualization based on extracted
shallow and deep metadata. The extraction process of
the shallow metadata followed a low-level document
engineering approach, combining font mining and
format information. On the other hand, the extraction
of deep metadata (i.e. the rhetorical discourse struc-
ture) was performed based on a combined linguistic
and empirical approach. The metadata was used for
information expansion and visualization based on different
publication repositories. The evaluation results
for all of the application’s components encourage us
to continue our efforts in the same direction, by
increasing the efficiency of the extraction mechanisms.
Future work will focus especially on improving
the extraction of deep metadata by considering word
co-occurrence, anaphora resolution and verb tense
analysis. These improvements will also feed into a new
iteration over the initial weights (probabilities)
assigned to the epistemic items that resulted from
this paper. At the application level, we will implement
additional expansion modules, thus integrating more
publication repositories. We also intend to release
the application’s core source code as open source, so
that its adoption can be directly coupled with its further
development, including contributions from the
researchers who would like to use it.
ACKNOWLEDGEMENTS
The work presented in this paper has been funded
by Science Foundation Ireland under Grant No.
SFI/08/CE/I1380 (Lion-2). The authors would like
to thank VinhTuan Thai, Georgeta Bordea and Ioana
Hulpus for their support and help.
KEOD 2009 - International Conference on Knowledge Engineering and Ontology Development