Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation

Nuno Moniz; Fátima Rodrigues

doi:10.5220/0004103501230131

2012

Abstract

This paper presents an approach for processing the text of PDF documents that have a well-defined layout structure. The approach aims to exploit the font structure of PDF documents using perceptual grouping. It consists of extracting text objects from the content stream of each document and grouping them according to a set criterion, while also using geometric-based regions to recover the correct reading order. The approach applies logical and structural rules to extract the entities present in the documents and returns an optimized XML representation of each PDF document that can be reused, for example, in text categorization. The system was trained and tested on Portuguese legislation PDF documents extracted from the electronic Republic's Diary. Evaluation results show that the approach performs well.
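As a rough illustration of the pipeline the abstract outlines (text objects, font-based grouping, region-based reading order, XML output), the following minimal Python sketch may help. It is not the authors' implementation: the TextObject fields, the two-column split at a fixed x coordinate, the grouping key (font name and size), and the XML element names are all assumptions made for this example, and the text objects are given by hand rather than parsed from a real content stream.

```python
from dataclasses import dataclass
from itertools import groupby
import xml.etree.ElementTree as ET


@dataclass
class TextObject:
    """Hypothetical stand-in for a text object taken from a PDF content stream."""
    text: str
    font: str
    size: float
    x: float  # left edge on the page
    y: float  # distance from the top of the page (assumed top-down)


def reading_order(objs, column_split):
    """Order objects by region: left column first, then top-to-bottom.

    The single vertical split is an assumed, simplified geometric region model.
    """
    return sorted(objs, key=lambda o: (o.x >= column_split, o.y, o.x))


def group_by_font(objs):
    """Merge consecutive objects that share the same font name and size."""
    groups = []
    for key, run in groupby(objs, key=lambda o: (o.font, o.size)):
        groups.append((key, " ".join(o.text for o in run)))
    return groups


def to_xml(groups):
    """Serialize the grouped text as a small XML document (element names assumed)."""
    root = ET.Element("document")
    for (font, size), text in groups:
        block = ET.SubElement(root, "block", font=font, size=str(size))
        block.text = text
    return ET.tostring(root, encoding="unicode")


# Hand-written sample objects; a real system would read these from the PDF.
objs = [
    TextObject("Artigo 2.º", "Bold", 10.0, 320.0, 100.0),
    TextObject("decreto-lei", "Regular", 9.0, 50.0, 127.0),
    TextObject("Artigo 1.º", "Bold", 10.0, 50.0, 100.0),
    TextObject("O presente", "Regular", 9.0, 50.0, 115.0),
]

groups = group_by_font(reading_order(objs, column_split=300.0))
xml_output = to_xml(groups)
print(xml_output)
```

In this sketch the ordering step runs before the grouping step, so that only text objects adjacent in reading order are merged; whether the paper applies the two steps in this order is not stated in the abstract.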

Paper Citation

in Harvard Style

Moniz N. and Rodrigues F. (2012). Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 123-131. DOI: 10.5220/0004103501230131

in Bibtex Style

@conference{kdir12,
author={Nuno Moniz and Fátima Rodrigues},
title={Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={123-131},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004103501230131},
isbn={978-989-8565-29-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation
SN - 978-989-8565-29-7
AU - Moniz N.
AU - Rodrigues F.
PY - 2012
SP - 123
EP - 131
DO - 10.5220/0004103501230131