ACKNOWLEDGEMENTS
This work was supported by the Helmholtz School for
Marine Data Science (MarDATA) partially funded by
the Helmholtz Association (grant HIDSS-0005).
REFERENCES
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J.
(1999). Optics: Ordering points to identify the clus-
tering structure. ACM Sigmod record, 28(2):49–60.
Appalaraju, S., Jasani, B., Kota, B. U., Xie, Y., and Man-
matha, R. (2021). Docformer: End-to-end transformer
for document understanding. In Proceedings of the
IEEE/CVF international conference on computer vi-
sion, pages 993–1003.
Camelot (2023). camelot. [Online; accessed: March 21,
2023].
Chamberlain, S. (2019). fulltext: Full text of ’scholarly’
articles across many data sources. R package version
1.4.0.
Chamberlain, S., Zhu, H., Jahn, N., Boettiger, C., and
Ram, K. (2023). rcrossref: Client for various
’crossref’ ’apis’. https://docs.ropensci.org/rcrossref/,
https://github.com/ropensci/rcrossref.
Contributors, G. Apache pdfbox - a java pdf library. https:
//pdfbox.apache.org/. [Online; accessed: February 20,
2023].
Contributors, G. Apache tika - a content analysis toolkit.
https://tika.apache.org/. [Online; accessed: February
20, 2023].
Entity-Fishing (2016–2023). entity-fishing. https://github.
com/kermitt2/entity-fishing.
Foppiano, L., Romary, L., Ishii, M., and Tanifuji, M.
(2019). Automatic identification and normalisation of
physical measurements in scientific literature. In Pro-
ceedings of the ACM Symposium on Document Engi-
neering 2019, pages 1–4.
Guerra, J., Quan, W., Li, K., Ahumada, L., Winston, F.,
and Desai, B. (2018). Scosy: A biomedical collabo-
ration recommendation system. In 2018 40th annual
international conference of the IEEE engineering in
medicine and biology society (EMBC), pages 3987–
3990. IEEE.
Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., and
Park, S. (2022). Bros: A pre-trained language model
focusing on text and layout for better key informa-
tion extraction from documents. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 36, pages 10767–10775.
Honnibal, M., Montani, I., Van Landeghem, S., and Boyd,
A. (2020). spaCy: Industrial-strength Natural Lan-
guage Processing in Python.
Inc., A. S. (2006). PDF Reference Version 1.7. 6 edition.
Jinha, A. E. (2010). Article 50 million: an estimate of the
number of scholarly articles in existence. Learned
publishing, 23(3):258–263.
Kessler, J., Reeburgh, W., Southon, J., Seifert, R.,
Michaelis, W., and Tyler, S. (2006). Basin-wide
estimates of the input of methane from seeps and
clathrates to the black sea. Earth and Planetary Sci-
ence Letters, 243(3-4):366–375.
Kreutz, C. K. and Schenkel, R. (2022). Scientific paper
recommendation systems: a literature review of re-
cent publications. International Journal on Digital
Libraries, 23(4):335–369.
Lopez, P. (2009). Grobid: Combining automatic biblio-
graphic data recognition and term extraction for schol-
arship publications. In Agosti, M., Borbinha, J., Kap-
idakis, S., Papatheodorou, C., and Tsakonas, G., ed-
itors, Research and Advanced Technology for Dig-
ital Libraries, pages 473–474, Berlin, Heidelberg.
Springer Berlin Heidelberg.
Martinez-Rodriguez, J. L., Hogan, A., and Lopez-Arevalo,
I. (2020). Information extraction meets the semantic
web: a survey. Semantic Web, 11(2):255–335.
Mehta, V. (2019). Comparison with other pdf table extrac-
tion libraries and tools.
PDFminer. pdfminer.six. https://github.com/pdfminer/
pdfminer.six. [Online; accessed: December 20, 2022].
Pe
˜
na, A., Morales, A., Fierrez, J., Ortega-Garcia, J.,
Grande, M., Puente, I., Cordova, J., and Cordova, G.
(2023). Document layout annotation: Database and
benchmark in the domain of public affairs. arXiv
preprint arXiv:2306.10046.
PyMuPDF (2023). Pymupdf. [Online; accessed: March 21,
2023].
Riedinger, N., Pfeifer, K., Kasten, S., Garming, J. F. L.,
Vogt, C., and Hensen, C. (2005). Diagenetic alter-
ation of magnetic signals by anaerobic oxidation of
methane related to a change in sedimentation rate.
Geochimica et Cosmochimica Acta, 69(16):4117–
4126.
Shinyama, Y. (2013). Programming with pdfminer.
Slate (2022). Slate. [Online; accessed: November 07,
2022].
Suryani, M. A., Wolker, Y., Sharma, D., Beth, C., Wall-
mann, K., and Renz, M. (2022). A framework for ex-
tracting scientific measurements and geo-spatial infor-
mation from scientific literature. In 2022 IEEE 18th
International Conference on e-Science (e-Science),
pages 236–245. IEEE.
Swain, M. C. and Cole, J. M. (2016). Chemdataextractor: a
toolkit for automated extraction of chemical informa-
tion from the scientific literature. Journal of chemical
information and modeling, 56(10):1894–1904.
Tabula (2023). Tabula. [Online; accessed: February 15,
2023].
Taylor-Sakyi, K. (2016). Big data: Understanding big data.
arXiv preprint arXiv:1601.04602.
Textract (2023). Textract. [Online; accessed: March 21,
2023].
Yadav, P., Remala, N., and Pervin, N. (2019). Reccite: A
hybrid approach to recommend potential papers. In
2019 IEEE international conference on big data (big
data), pages 2956–2964. IEEE.
Zhu, M. and Cole, J. M. (2022). Pdfdataextractor: A tool
for reading scientific text and interpreting metadata
from the typeset literature in the portable document
format. Journal of Chemical Information and Model-
ing, 62(7):1633–1643.
KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval
476