DAF: Data Acquisition Framework to Support Information Extraction from Scientific Publications

Muhammad Suryani, Steffen Hahne, Christian Beth, Klaus Wallmann, Matthias Renz

2023

Abstract

Researchers encapsulate their findings in publications, generally available as PDFs, a format designed primarily for platform-independent viewing and printing that does not support editing or automatic data extraction. These documents are a rich source of information in any domain, with the information presented in text, tables and figures. Manual extraction of information from these components is tedious, which necessitates an automatic approach. An automatic extraction approach could provide valuable data to the research community while also helping to manage the increasing number of publications. Previously, many approaches focused on extracting individual components from scientific publications, i.e. metadata, text or tables, but failed to target these data components collectively. This paper proposes a Data Acquisition Framework (DAF), the most comprehensive framework to our knowledge. The DAF extracts enhanced metadata, segmented text, and the captions and content of tables and figures. Through rigorous evaluation on two distinct datasets from the marine science and chemical domains, we showcase the superior performance of the DAF compared to the baseline PDFDataExtractor. We also provide an illustrative example to underscore DAF’s adaptability in the realm of research data management.


Paper Citation


in Harvard Style

Suryani M., Hahne S., Beth C., Wallmann K. and Renz M. (2023). DAF: Data Acquisition Framework to Support Information Extraction from Scientific Publications. In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-671-2, SciTePress, pages 468-476. DOI: 10.5220/0012260300003598


in Bibtex Style

@conference{kdir23,
author={Muhammad Suryani and Steffen Hahne and Christian Beth and Klaus Wallmann and Matthias Renz},
title={DAF: Data Acquisition Framework to Support Information Extraction from Scientific Publications},
booktitle={Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2023},
pages={468-476},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012260300003598},
isbn={978-989-758-671-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - DAF: Data Acquisition Framework to Support Information Extraction from Scientific Publications
SN - 978-989-758-671-2
AU - Suryani M.
AU - Hahne S.
AU - Beth C.
AU - Wallmann K.
AU - Renz M.
PY - 2023
SP - 468
EP - 476
DO - 10.5220/0012260300003598
PB - SciTePress