
detection, and spatial research collaborations.
Furthermore, handling the diverse representations of
tabular data by transforming PDFs into LaTeX
expressions is a promising direction for future
research. Finally, the recent adoption of Large Language
Models (LLMs) for various Natural Language Processing
(NLP) tasks paves the way for their use in
information extraction.
ACKNOWLEDGEMENTS
This work was partially funded by the Helmholtz
Association (grant HIDSS-0005). EB-G received
support from the Cluster of Excellence ‘The Ocean Floor
– Earth’s Uncharted Interface’ (EXC 2077), funded by the
Deutsche Forschungsgemeinschaft (DFG), project
number 390741603, hosted by the Research Faculty
MARUM-Center for Marine Environmental Sciences,
University of Bremen, Germany. This work was also
partially supported by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation)
under the SmartER project (grant number 515537520). The
authors also acknowledge the sources used and the
efforts of all collaborators.
REFERENCES
(2022). Camelot. Last accessed 4 May 2023.
(2022). Tabula-py. Last accessed 4 May 2023.
Ceritli, T. and Williams, C. K. (2021). Identifying the
units of measurement in tabular data. arXiv preprint
arXiv:2111.11959.
Chuang, P.-C., Yang, T. F., Wallmann, K., Matsumoto, R.,
Hu, C.-Y., Chen, H.-W., Lin, S., Sun, C.-H., Li, H.-
C., Wang, Y., et al. (2019). Carbon isotope exchange
during anaerobic oxidation of methane (AOM) in sediments
of the northeastern South China Sea. Geochimica
et Cosmochimica Acta, 246:138–155.
Costa, K. M., McManus, J. F., and Anderson, R. F. (2018).
Paleoproductivity and stratification across the
subarctic Pacific over glacial-interglacial cycles.
Paleoceanography and Paleoclimatology, 33(9):914–933.
Du, N., Guo, J., Wu, C. Q., Hou, A., Zhao, Z., and
Gan, D. (2020). Recommendation of academic pa-
pers based on heterogeneous information networks.
In 2020 IEEE/ACS 17th International Conference on
Computer Systems and Applications (AICCSA), pages
1–6. IEEE.
Ducatteeuw, V. (2021). Developing an urban gazetteer: A
semantic web database for humanities data. In Pro-
ceedings of the 5th ACM SIGSPATIAL International
Workshop on Geospatial Humanities, pages 36–39.
Göpfert, J., Kuckertz, P., Weinand, J., Kotzur, L., and
Stolten, D. (2022). Measurement extraction with nat-
ural language processing: A review. Findings of the
Association for Computational Linguistics: EMNLP
2022, pages 2191–2215.
Hendricks, G., Tkaczyk, D., Lin, J., and Feeney, P. (2020).
Crossref: The sustainable source of community-
owned scholarly metadata. Quantitative Science Stud-
ies, 1(1):414–427.
Hübscher, L., Jiang, L., and Naumann, F. (2023).
ExtracTable: Extracting tables from raw data files. BTW
2023.
Liu, J., Shi, C., Yang, C., Lu, Z., and Yu, P. S. (2022). A
survey on heterogeneous information network based
recommender systems: Concepts, methods, applica-
tions and resources. AI Open, 3:40–57.
Lopez, P. (2009). GROBID: Combining automatic
bibliographic data recognition and term extraction for
scholarship publications. In Research and Advanced
Technology for Digital Libraries: 13th European
Conference, ECDL 2009, Corfu, Greece, September 27-
October 2, 2009. Proceedings 13, pages 473–474.
Springer.
Martinez-Rodriguez, J. L., Hogan, A., and Lopez-Arevalo,
I. (2020). Information extraction meets the semantic
web: a survey. Semantic Web, 11(2):255–335.
Moulin, T. C. and Amaral, O. B. (2020). Using collabo-
ration networks to identify authorship dependence in
meta-analysis results. Research Synthesis Methods,
11(5):655–668.
Petersen, T., Suryani, M. A., Beth, C., Patel, H., Wall-
mann, K., and Renz, M. (2021). Geo-quantities: A
framework for automatic extraction of measurements
and spatial context from scientific documents. In
17th International Symposium on Spatial and Tempo-
ral Databases, pages 166–169.
Suryani, M. A., Hahne, S., Beth, C., Wallmann, K., and
Renz, M. (2023). DAF: Data acquisition framework to
support information extraction from scientific
publications. In Proceedings of the 15th International Joint
Conference on Knowledge Discovery, Knowledge
Engineering and Knowledge Management - Volume 1:
KDIR, pages 468–476. INSTICC, SciTePress.
Suryani, M. A., Wölker, Y., Sharma, D., Beth, C.,
Wallmann, K., and Renz, M. (2022). A framework for
extracting scientific measurements and geo-spatial
information from scientific literature. In 2022 IEEE 18th
International Conference on e-Science (e-Science),
pages 236–245. IEEE.
Wahle, J. P., Ruas, T., Mohammad, S. M., and Gipp, B.
(2022). D3: A massive dataset of scholarly metadata
for analyzing the state of computer science research.
arXiv preprint arXiv:2204.13384.
Zhu, M. and Cole, J. M. (2022). PDFDataExtractor: A tool
for reading scientific text and interpreting metadata
from the typeset literature in the portable document
format. Journal of Chemical Information and
Modeling, 62(7):1633–1643.
ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods