Extracting and Modeling Tabular Data from Marine Geology Publications into a Heterogeneous Information Network

Muhammad Asif Suryani (1, a), Ewa Burwicz-Galerne (2, b), Brigitte Mathiak (1, c), Klaus Wallmann (3, d) and Matthias Renz (4, e)

1 GESIS - Leibniz-Institute for the Social Sciences, 50667 Cologne, Germany
2 MARUM - Center for Marine Environmental Sciences, University of Bremen, 28359 Bremen, Germany
3 GEOMAR Helmholtz Centre for Ocean Research Kiel, 24148 Kiel, Germany
4 Institute of Informatik, Christian-Albrechts-Universität zu Kiel, 24118 Kiel, Germany

Keywords: Information Extraction, Tabular Data, Research Data Management, Marine Science Publication, Heterogeneous Information Network, Data Modeling.

Abstract: Scientific publications serve as a means of disseminating information across research communities, often containing diverse data elements such as plain text, tables, and figures. Tables in particular offer a structured presentation of essential research data, enabling efficient information access. Automatic extraction of tabular data alongside contextual information from scientific publications can significantly enhance research workflows and integrate more research data into the scholarly research cycle, particularly supporting Research Data Management (RDM). In marine geology, researchers conduct expeditions at oceanographic locations and accumulate substantial amounts of valuable data, such as Sedimentation Rate (SR) and Mass Accumulation Rate (MAR), alongside relevant contextual information, often enriched with spatio-temporal context, in the tables of publications. These expeditions are costly and time-intensive, underscoring the value of making such data more accessible and reusable. This paper introduces an end-to-end approach to extract and model heterogeneous tabular data from marine geology publications. Our approach extracts metadata and tabular content from publications and models them into a Heterogeneous Information Network (HIN). The network uncovers hidden relationships and patterns across multiple documents, offering new insights and facilitating enhanced data referencing. Experimental results and exploration on marine geology datasets demonstrate the effectiveness of our approach, showcasing its potential to support research data management and data-driven scientific exploration.

1 INTRODUCTION

Scientific publications are valuable sources of information that present data in plain text, tables, and figures. These publications are generally available in Portable Document Format (PDF), which provides access to the relevant information in a streamlined reading experience. However, due to the rapid increase in the number of publications, the manual extraction of relevant information is becoming difficult and time-consuming. Moreover, PDFs are primarily designed for platform-independent viewing and printing, and their complex internal structure makes information extraction from plain text, tables, and figures even more challenging (Petersen et al., 2021).

a https://orcid.org/0000-0003-1669-5524
b https://orcid.org/0000-0003-4551-5609
c https://orcid.org/0000-0003-1793-9615
d https://orcid.org/0000-0002-1795-376X
e https://orcid.org/0000-0002-2024-7700

To automatically extract the relevant information from these data components of publications, broader information extraction coverage is necessary. Researchers use plain text to elaborate context and images for graphical representation, whereas tables generally present a set of organized information that researchers want to emphasize in their studies. For example, numerical information is generally organized in tables and is relatively easily accessible to users, but the access also depends on the granularity of the information available in these tables (Göpfert et al., 2022).

Figure 1: Publications by Years (Hendricks et al., 2020).

Suryani, M. A., Burwicz-Galerne, E., Mathiak, B., Wallmann, K. and Renz, M.
Extracting and Modeling Tabular Data from Marine Geology Publications into a Heterogeneous Information Network.
DOI: 10.5220/0013389900003905
In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025), pages 453-460
ISBN: 978-989-758-730-6; ISSN: 2184-4313

Moreover, a broader extraction approach is becoming indispensable, one that supports the extraction of complex quantitative information from the tables of publications and replaces manual extraction procedures. Several factors need to be considered in this regard, as tables are structured quite diversely with respect to the number of (nested) rows and columns and the table orientation, and they depend on the publisher's template (Martinez-Rodriguez et al., 2020; Ceritli and Williams, 2021).

Marine geologists perform extensive experiments to record diverse scientific measurements at various oceanographic locations. These experiments are expensive in terms of effort and cost, as researchers have to spend weeks on expedition ships under challenging weather conditions to collect the samples needed to study the marine environment. Various parameters are of interest to marine geologists; the Sedimentation Rate (SR) and Mass Accumulation Rate (MAR) at the seafloor are excellent examples (Chuang et al., 2019). Marine geologists have recorded these measurements for over one hundred years and have typically reported them in tables of the respective scientific publications. A proper representation of such quantitative information would therefore be tremendously helpful both for readers and for the automatic extraction process. To further emphasize the importance and scale of this study, the number of publications by year is presented in Figure 1 (Petersen et al., 2021; Göpfert et al., 2022; Martinez-Rodriguez et al., 2020; Ceritli and Williams, 2021; Hendricks et al., 2020).

To further illustrate the tabular representation of the targeted measurements, i.e. Sedimentation Rate and Mass Accumulation Rate, it is essential to explain their relevant contextual information. The sample table in Figure 2 comprises various columns conveying contextual information. The most important features are as follows. Core specifies the name of the location where the experiment was performed. The Lat/Long columns provide the location coordinates of these measurements. Similarly, Water Depth provides the depth information. The Average Sedimentation Rate column contains the values measured at these specific locations. Lastly, Proxy indicates which method was adopted to record these measurements. The features are listed in order of their contextual importance. It is essential to note that the header representations for these features vary widely across tables, and the targeted measurements do not have standard representations in the International System of Units (SI), making the extraction process even more challenging (Costa et al., 2018).

Figure 2: Sample Table taken from PDF (Costa et al., 2018).

Extracting relevant information from these table instances can address a wide range of research questions, including:

1. Where were these measurements taken, and what are the identifiers for these locations?
2. What are the reported values and units of SR and MAR?
3. Which authors reported these values?
4. What are the connections between studies based on authors and locations?
5. What are the methods and depths associated with these measurements?

Access to such information alongside its contextual information would provide marine geologists with broader insights, give them convenient access to the information they need, and could accelerate the knowledge discovery process (Suryani et al., 2022). It is also worth highlighting that these targeted measurements and their corresponding spatial information have not yet been compiled in any repository but have been reported primarily in publications. This paper presents a framework that extracts tables from marine geology publications and parses them to obtain the relevant quantitative information. Finally, the extracted information, alongside the respective metadata, is used to populate a Heterogeneous Information Network comprising metadata and measurements from the respective publications.

2 RELATED WORK

Scientific publications play a crucial role in research by presenting detailed information about experiments and results. Access to such research data can therefore help communities define future research directions. In this problem setting, research data may provide specialized numerical and spatial information, which facilitates the acquisition of information about experiments or expeditions. The primary function of Research Data Management (RDM) is to provide essential information about conducted experiments, surfacing relevant and interconnected data from earlier work. This contributes to advancing the FAIR data principles and improving the delivery of research data across communities (Ducatteeuw, 2021).

Marine geologists mainly report scientific measurements such as Mass Accumulation Rate (MAR) and Sedimentation Rate (SR) in publications, where they appear in various data elements such as text and tables. These documents do not support fully automatic information extraction from these data elements. Numerous previous studies have addressed information extraction from such data elements, but only to a certain extent (Suryani et al., 2022).

Metadata extraction has evolved over time and various approaches are available, the most prominent being GROBID, which is used in numerous studies (Lopez, 2009). Other metadata-related studies have focused on collaboration networks, citation analysis, and publication trends (Moulin and Amaral, 2020; Wahle et al., 2022).

Tabula is an open-source library that allows users to extract tables from PDFs into CSV files (Tabula-py, 2022). Camelot is another open-source Python library that extracts tables from PDFs into CSV files. In addition, Camelot is modular, which makes it adaptable to any extraction pipeline (Camelot, 2022).

Recently, measurement extraction from tabular data was carried out using Quantulum3, a community-maintained Python library, with a focus on extracting volumetric units, e.g. liters and pints, from the tabular data. However, the extraction was performed only on spreadsheets rather than PDFs (Ceritli and Williams, 2021). ExtracTable is a recently proposed approach that extracts tabular data from text files and CSVs by detecting row patterns and separating the columns accordingly (Hübscher et al., 2023). However, this approach does not address table extraction from PDFs.

Recently, the authors proposed the Data Acquisition Framework (DAF), which extracts diverse data elements from scientific publications. The DAF can extract plain text, tables, images, and metadata from scientific publications with minimal template dependency. It exploits hybrid approaches to perform the extraction and performed better on chemical-domain and marine science publications (Suryani et al., 2023; Zhu and Cole, 2022).

Moreover, modeling diverse information into a network can be both challenging and interesting, and it has been addressed in numerous applications. A recent study proposed an approach based on a Heterogeneous Information Network (HIN) that targets potential relationships such as citation links, author collaborations, and research areas. By applying a random walk strategy over such a network to simulate natural sentences, the approach effectively discovers relevance between papers (Du et al., 2020).

With the rapid growth of digital publishing, efficiently visualizing scholarly data has become increasingly demanding. This data includes millions of raw data points such as authors, papers, citations, and scholarly networks. Various visualization techniques can be applied to better represent data structures and uncover hidden patterns. The cited study introduces the basic concepts and collection methods for scholarly data and provides a comprehensive overview of related visualization tools and techniques (Liu et al., 2022).

3 TABLE TO NETWORK

This section describes the framework and discusses its individual components. The framework consists of the Tabular Data Extraction, Information Parser, and Network Modeling modules, as depicted in Figure 3. Since PDF is the primary format in which scientific publications are available, the extraction of the various data elements focuses on the raw PDF documents.

Figure 3: Block Diagram.

3.1 Data Extraction

The Data Extraction module takes scientific publications as input and performs two crucial tasks: metadata extraction and table extraction. To extract these data components, we used the implementation of our previously proposed Data Acquisition Framework (DAF) (Suryani et al., 2023). The framework follows a hybrid approach to extract metadata features such as DOI, authors, and title. These features are generally available on the first page of a publication and follow certain patterns. The DAF also uses several routines to acquire metadata features and then provides a consolidated set of them. For tabular data, DAF breaks the task into two distinct steps: detection and extraction. For detection, it transforms the pages of the PDF document into coordinates and uses a heuristic based on table captions to find the bounding boxes of the prospective tabular regions. It then passes these coordinates to the underlying Camelot module (Camelot, 2022), which extracts the tabular data into comma-separated files. It is also important to mention that DAF has better extraction coverage of metadata features for domain-specific research publications than the approach of Zhu and Cole (2022) on chemical-domain publications. A sample of the extracted tabular data is shown in Figure 4.
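The caption-based detection step can be illustrated with a minimal sketch. It assumes that a page has already been converted into text lines with coordinates; the `TextLine` type, the `max_gap` threshold, and the rule that a table ends at the first large vertical gap are illustrative assumptions, not the DAF implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TextLine:
    """Hypothetical line record: text plus its box, with y growing downwards."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float

# Caption heuristic: a line that starts with "Table <number>:" or "Table <number>."
CAPTION = re.compile(r"^Table\s+\d+\s*[:.]", re.IGNORECASE)

def table_region(lines: List[TextLine],
                 max_gap: float = 15.0) -> Optional[Tuple[float, float, float, float]]:
    """Return (x0, top, x1, bottom) of the block below the first table caption."""
    ordered = sorted(lines, key=lambda ln: ln.top)
    for i, line in enumerate(ordered):
        if not CAPTION.match(line.text.strip()):
            continue
        block: List[TextLine] = []
        previous_bottom = line.bottom
        for candidate in ordered[i + 1:]:
            # Assumed stop rule: a large vertical gap ends the table body.
            if block and candidate.top - previous_bottom > max_gap:
                break
            block.append(candidate)
            previous_bottom = candidate.bottom
        if not block:
            return None
        return (min(ln.x0 for ln in block), min(ln.top for ln in block),
                max(ln.x1 for ln in block), max(ln.bottom for ln in block))
    return None
```

The returned box would then be handed to the table extraction library; how the coordinates must be converted for that library is left open here.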

3.2 Information Parser

The Information Parser module works on the tabular data produced by the Data Extraction module. It plays a crucial role in this problem setting by first performing table processing, which involves information detection, correction, and segregation tasks. The module is specifically designed to handle tables in marine geology publications and considers the following essential attributes:

1. Distinct identifiers associated with each oceanographic location, such as Core-ID.
2. Latitude and Longitude, expressed in either degrees or decimals.
3. Mass Accumulation Rate (MAR) and Sedimentation Rate (SR), expressed in Mass/Area/Time and Length/Time respectively.
4. Water Depth values, which give the vertical position of the respective identifiers and are generally expressed in meters.

Figure 4: Example of Extracted Table by DAF (Costa et al., 2018).
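The unit conventions for MAR (Mass/Area/Time) and SR (Length/Time) can be sketched as a small dimensional check on a unit string. The symbol table and the function name below are illustrative assumptions, not the unit dictionaries compiled in this study, and only slash notation (g/cm2/kyr) and signed-exponent notation (g cm-2 kyr-1) are covered.

```python
import re

# Illustrative unit symbols (assumed; not the dictionaries compiled in this study).
DIMENSION = {
    "mm": "length", "cm": "length", "m": "length",
    "mg": "mass", "g": "mass", "kg": "mass",
    "yr": "time", "a": "time", "kyr": "time", "ka": "time", "myr": "time", "ma": "time",
}

# Expected exponent per dimension for each measurement type.
SIGNATURES = {
    "SR": {"length": 1, "mass": 0, "time": -1},    # Length/Time
    "MAR": {"length": -2, "mass": 1, "time": -1},  # Mass/Area/Time
}

# Optional leading slash, a unit symbol, and an optional signed exponent.
TOKEN = re.compile(r"(/?)\s*([a-z]+)\s*(-?\d+)?")

def classify_unit(unit: str) -> str:
    """Return 'SR', 'MAR', or 'unknown' for a unit string."""
    exponents = {"length": 0, "mass": 0, "time": 0}
    for slash, symbol, power in TOKEN.findall(unit.lower()):
        if symbol not in DIMENSION:
            return "unknown"
        exponent = int(power) if power else 1
        if slash:
            # A symbol after "/" sits in the denominator.
            exponent = -abs(exponent)
        exponents[DIMENSION[symbol]] += exponent
    for name, signature in SIGNATURES.items():
        if exponents == signature:
            return name
    return "unknown"
```

Classifying by dimension rather than by exact string keeps cm/kyr, cm kyr-1, and mm/yr in the same class, which matters when headers and units vary from table to table.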

To support this segregation of information from tables, we studied over 100 relevant papers to obtain an overview of the possible representations of each targeted feature. We carefully compiled an extensive dictionary covering the potential variants of each relevant feature. In particular, we identified twelve distinct variants for Core-IDs, fifteen variations for location data, sixteen alternatives for MAR and SR headers, and seven representations for water depth measurements.
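A dictionary lookup of this kind can be sketched as follows. The variants listed here are hypothetical examples chosen for illustration; the dictionary compiled in this study is larger and is not reproduced.

```python
import re
from typing import Dict, List, Optional

# Hypothetical header variants per feature (illustrative only).
HEADER_VARIANTS: Dict[str, List[str]] = {
    "core_id": ["core", "core id", "station", "site"],
    "latitude": ["lat", "latitude"],
    "longitude": ["long", "lon", "longitude"],
    "water_depth": ["water depth", "depth"],
    "sr": ["sedimentation rate", "average sedimentation rate", "sr"],
    "mar": ["mass accumulation rate", "mar"],
}

def normalize(header: str) -> str:
    """Lowercase, drop bracketed units and punctuation, collapse whitespace."""
    text = re.sub(r"[\(\[].*?[\)\]]", " ", header.lower())
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())

def match_feature(header: str) -> Optional[str]:
    """Return the feature whose variant list contains the normalized header."""
    key = normalize(header)
    for feature, variants in HEADER_VARIANTS.items():
        if key in variants:
            return feature
    return None

def map_columns(headers: List[str]) -> Dict[str, int]:
    """Map each recognized feature to the index of its first matching column."""
    mapping: Dict[str, int] = {}
    for index, header in enumerate(headers):
        feature = match_feature(header)
        if feature is not None and feature not in mapping:
            mapping[feature] = index
    return mapping
```

Unrecognized headers are simply left out of the mapping, so a table without a depth column still yields the remaining features.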

Figure 4 shows a table extracted from one of the relevant publications, which indicates the efficacy of our DAF for the extraction of tabular data from publications. Each extracted table is then parsed to extract the relevant features while keeping their internal relationships intact. Core-IDs often lack a fixed pattern and may be alphanumeric strings, numbers, or arbitrary names. For Core-IDs, the parser therefore consults our dictionary to identify the relevant column. A similar approach is employed for identifying depth columns, though such columns are not commonly present. Parsing latitude and longitude information from extracted tables is the most challenging task. Oceanographic coordinates expressed in degrees and minutes encounter encoding issues, such as "°" becoming "8" or "0" and "’" turning into "1". Such cases are few in number but complex, and they are not handled in this study. A sample parsed table is shown in Figure 5. Additionally, measurement columns such as MAR and SR are cross-referenced with dedicated header and unit dictionaries to obtain the relevant measurements together with their corresponding location coordinates and Core-IDs.

Figure 5: Parsed Table Example (Costa et al., 2018).
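For coordinates that survive extraction intact, converting degrees and minutes to decimal degrees is straightforward. The sketch below assumes clean input such as 54°30'N or a plain decimal value; it returns None for anything else, so the encoding-damaged cases stay unparsed rather than being guessed. The function name and the accepted formats are assumptions.

```python
import re
from typing import Optional

# Degrees, optional minutes, optional hemisphere letter, e.g. 54°30'N.
DM_PATTERN = re.compile(
    r"^\s*(-?\d{1,3}(?:\.\d+)?)\s*°\s*(?:(\d{1,2}(?:\.\d+)?)\s*['′’])?\s*([NSEW])?\s*$",
    re.IGNORECASE,
)
# Plain decimal degrees with an optional hemisphere letter, e.g. 54.5 or 10.25W.
DECIMAL_PATTERN = re.compile(r"^\s*(-?\d{1,3}(?:\.\d+)?)\s*([NSEW])?\s*$", re.IGNORECASE)

def to_decimal_degrees(value: str) -> Optional[float]:
    """Convert a coordinate string to signed decimal degrees, or None if unparsable."""
    match = DM_PATTERN.match(value)
    if match:
        degrees_text, minutes_text, hemisphere = match.groups()
    else:
        match = DECIMAL_PATTERN.match(value)
        if not match:
            return None
        degrees_text, hemisphere = match.groups()
        minutes_text = None
    minutes = float(minutes_text) if minutes_text else 0.0
    decimal = abs(float(degrees_text)) + minutes / 60.0
    # Reject values that cannot be a latitude or longitude.
    if minutes >= 60 or decimal > 180:
        return None
    negative = degrees_text.startswith("-") or (hemisphere or "").upper() in ("S", "W")
    return -decimal if negative else decimal
```

Southern and western hemispheres are mapped to negative values, which is the usual convention for decimal coordinates.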

Figure 5 showcases the tabular information parsed by the information parser, which indicates its ability to reproduce the tabular content of marine geology publications (Costa et al., 2018). This underlines the importance of the approach and paves the way for making tabular research data from relevant scientific publications available across research communities.

3.3 Network Modeling

This section explains the proposed data model, which exploits the metadata features and the parsed tabular information from the publications. A Heterogeneous Information Network (HIN) is used to represent and analyze complex linked data comprising various types of entities and relationships. A HIN incorporates diverse node and edge types, capturing the multifaceted nature of diverse data. A HIN can be defined as a graph G = (V, E), where V denotes the set of nodes (or vertices), E represents the set of edges, and each node and edge is associated with a type. This way of representing and understanding interconnected data makes HINs particularly useful in various potential studies (Liu et al., 2022). In this problem setting, HINs are populated using the proposed data model, which integrates the relational data and metadata, and exploratory analyses are performed to showcase the broader spectrum of information from publications.

Figure 6: Proposed Data Model.
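To make the definition G = (V, E) concrete, a HIN can be held in a small typed graph in which every node and every edge carries a type label. The class below is a minimal sketch; the node types and relation names in the usage example are assumptions chosen to mirror the features discussed in this paper, and a graph library could be substituted.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

class HIN:
    """Graph G = (V, E) whose nodes and edges carry type labels."""

    def __init__(self) -> None:
        self.node_type: Dict[str, str] = {}
        self.edges: Set[Tuple[str, str, str]] = set()  # (source, relation, target)
        self.adjacent: Dict[str, Set[str]] = defaultdict(set)

    def add_node(self, node_id: str, node_type: str) -> None:
        existing = self.node_type.setdefault(node_id, node_type)
        if existing != node_type:
            raise ValueError(f"{node_id} already has type {existing}")

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("add both nodes before connecting them")
        self.edges.add((source, relation, target))
        # Edges are stored once but traversed in both directions.
        self.adjacent[source].add(target)
        self.adjacent[target].add(source)

    def neighbors(self, node_id: str, node_type: str) -> Set[str]:
        """Neighbors of node_id that have the requested node type."""
        return {n for n in self.adjacent.get(node_id, set())
                if self.node_type[n] == node_type}

    def is_heterogeneous(self) -> bool:
        """True if the graph has more than one node type or more than one edge type."""
        relations = {relation for _, relation, _ in self.edges}
        return len(set(self.node_type.values())) > 1 or len(relations) > 1


# Usage with placeholder identifiers (not taken from the dataset).
hin = HIN()
hin.add_node("doc-1", "DOI")
hin.add_node("A. Author", "Author")
hin.add_node("core-X", "Core")
hin.add_edge("doc-1", "written_by", "A. Author")
hin.add_edge("doc-1", "reports", "core-X")
```

Asking for `hin.neighbors("doc-1", "Author")` then returns only the author nodes attached to that document, which is the kind of typed traversal a heterogeneous network allows.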

The data model leverages the extracted metadata features alongside the parsed tabular data from the respective publications. Figure 6 illustrates the data model, which organizes heterogeneous features from marine geology publications. It structures both metadata and tabular data while keeping document-level relationships intact. Furthermore, the data model facilitates the generation of a heterogeneous information network, revealing potential relationships among targeted entities and providing a broader context across documents.
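One way to read the data model is as a mapping from a single document record to typed links. The sketch below assumes a record layout (`doi`, `title`, `authors`, `rows`) and relation names that are not specified in this paper; it only illustrates how metadata and parsed rows can share the document node as their common anchor.

```python
from typing import Any, Dict, Iterator, Tuple

Triple = Tuple[str, str, str]  # (source node, relation, target node)

def record_to_triples(record: Dict[str, Any]) -> Iterator[Triple]:
    """Yield typed links for one publication; node ids are prefixed with their type."""
    doc = f"doi:{record['doi']}"
    yield (doc, "has_title", f"title:{record['title']}")
    for author in record["authors"]:
        yield (doc, "has_author", f"author:{author}")
    for row in record["rows"]:
        core = f"core:{row['core_id']}"
        # The document node keeps the row tied to its source publication.
        yield (doc, "reports_core", core)
        for feature, value in row.items():
            if feature == "core_id" or value is None:
                continue
            yield (core, f"has_{feature}", f"{feature}:{value}")
```

Because every row is linked through the document node, two publications that report the same core share a node, which is what makes cross-document relationships visible.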

4 EXPERIMENTAL RESULTS

This section explains the experimental setup, followed by a brief discussion of the exploratory results.

4.1 Data Description

In this exploratory study, publications from marine geology were gathered from various publishers. A comprehensive exploration covering over 300 full articles was conducted, serving various purposes, such as collecting the diverse ways in which information is represented. The selection of these publications centered on one important requirement: each publication must carry location coordinates and relevant measurements in tables.

Table 1: Collected Results of Metadata Features from Publications.

Metadata Feature    P       R       F1-Score
DOI                 1.0     0.90    0.95
Title               0.95    1.0     0.97
Author              1.0     0.90    0.95

Moreover, it is important to mention that the problem addressed here has not been explored before, so searching for relevant sources requires considerable effort. An evaluation of the overall framework was undertaken on forty full papers from marine geology covering different publishers. Collecting data from various publishers is also important for testing the template coverage of the Data Acquisition Framework (Suryani et al., 2023).

4.2 Results and Discussion

The set of publications is initially processed by the Data Acquisition Framework (DAF) and the results are collected (Suryani et al., 2023). The first set of results comprises the targeted metadata features, i.e. DOIs, author information, and publication titles, which help answer questions about the origin of a research activity, such as which authors were involved in the study. Besides, they are helpful in studying the relationships across metadata features. These results are compiled in Table 1.
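The F1-scores in Table 1 follow from the reported precision (P) and recall (R) as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the table values; only the function and variable names are introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in Table 1.
table_1 = {"DOI": (1.0, 0.90), "Title": (0.95, 1.0), "Author": (1.0, 0.90)}

for feature, (p, r) in table_1.items():
    print(f"{feature}: F1 = {f1_score(p, r):.2f}")
```

Rounded to two decimals, the recomputed values agree with the F1-Score column of Table 1.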

There were a total of 111 table instances across the publications, of which DAF successfully extracted 106 (about 95%). The criterion for a correct table extraction was that all rows and columns remained intact. Among these 106 table instances, 46 were identified as relevant, carrying the desired information for our study. The information parser successfully parsed 40 of these relevant tables (about 87%), while the remaining 6 encountered issues where location coordinates were misaligned, as described in Section 3.2. The later steps involve modeling the extracted information into a heterogeneous information network. For the sake of simplicity, information modeling is performed at two levels: metadata and complete. Given the importance of metadata features, modeling is initially performed on the extracted metadata, showcasing the meta-information to the relevant research communities. Figure 7 illustrates a network of the metadata features of the publications. For better understanding, each node type has a distinct color: DOI instances are represented in red, author nodes in blue, and titles in green.

Figure 7: Network based on Metadata of Marine Geology Publications.

Finally, we constructed a comprehensive network that models all the information extracted from the publications, including metadata features, relevant measurements, and contextual information. To facilitate exploration, the features are represented using distinct colors, as shown in Figure 8.

In this figure, information associated with the Core-ID is presented in pink, while all relevant tabular information is displayed in purple. Moreover, the network in Figure 8 also illustrates author-level information and core-level connectivity, highlighting the diverse relationships that address various questions, including:

1. What are the identifiers for the targeted locations?
2. Who measured these values?
3. Where are SR and MAR values measured?
4. What is the water depth value corresponding to these measurements in specific regions?
5. Who are the potential collaborators targeting geographical regions?

Figure 8: Network Comprising Metadata and Parsed Tabular Data from Marine Geology Publications.

To enhance information delivery, a small subset of the complete network is presented in Figure 9, illustrating author-level relationships between two documents. Additionally, it highlights potential location-based research collaborations and identifies groups or individuals working at specific oceanographic locations.
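Location-based collaboration candidates of this kind can be derived by comparing documents over shared core identifiers. The sketch below assumes each document is reduced to a set of authors and a set of core IDs; the input layout and the rule "shared core, no shared author" are illustrative assumptions rather than the procedure used to draw Figure 9.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Document = Dict[str, Set[str]]  # {"authors": {...}, "cores": {...}}

def shared_location_pairs(documents: Dict[str, Document]) -> List[Tuple[str, str, Set[str]]]:
    """Pairs of documents with at least one common core and no common author."""
    pairs = []
    for doc_a, doc_b in combinations(sorted(documents), 2):
        a, b = documents[doc_a], documents[doc_b]
        common_cores = a["cores"] & b["cores"]
        # Groups that already share an author are not new collaboration candidates.
        if common_cores and not (a["authors"] & b["authors"]):
            pairs.append((doc_a, doc_b, common_cores))
    return pairs
```

Each returned pair names two publications whose author groups worked at the same location without a common author, together with the cores they share.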

Figure 9: Subset of the Network Showing Potential Relationships across Publications.

This exploratory study aims to highlight the diverse information available across research publications, thereby facilitating research data management and the open science initiative. One of the major aims is to integrate more research data from marine geology into the scholarly ecosystem at different granularities. Figure 7 provides author-level relationships within a set of publications, which are equally essential for an abstract overview.

5 CONCLUSION

The proposed framework addresses the challenging task of extracting tables from scientific publications to enhance the information discovery process by retrieving diverse quantitative data along with contextual information from tables in PDF files. This approach contributes to making research data more accessible and usable within the marine geology domain. Unlike prevalent approaches that primarily focus on textual data, our framework emphasizes tabular data, generating valuable insights that were previously available only to the researchers themselves. The outcome of the framework is a document-level Heterogeneous Information Network that leverages heterogeneous metadata and tabular content.

In addition, it is essential to discuss the challenges in this problem setting. The variance in the representation of tabular data across publications is a key aspect to consider, as it complicates table detection and extraction. Metadata plays a crucial role in facilitating information profiling; however, its extraction is equally challenging, primarily due to the variety of templates. Furthermore, the lack of standardized guidelines for representing such diverse data in marine geology publications is notable and needs to be addressed in future for a streamlined delivery of information.

Moreover, it is essential to highlight two key aspects: the availability of relevant publications and the extraction of data from these publications. Regarding availability, finding relevant publications while adhering to FAIR principles is challenging, despite the large number of publications available. In terms of data extraction, numerous historical documents exist that are not true PDF files, which calls for an enhanced data extraction approach capable of extracting relevant information from images.

Beyond marine geology, the proposed framework holds promise as a versatile solution applicable across various scientific domains. Additionally, integrating it with textual information extraction could enable a comprehensive automated approach for efficiently extracting heterogeneous data from all components of scientific publications. Future directions include leveraging the extracted information for applications such as recommender systems, research community detection, and spatial research collaborations. Furthermore, addressing the diverse representations of tabular data by transforming PDFs into LaTeX expressions represents an exciting future research direction. Finally, the recent adoption of Large Language Models (LLMs) for various Natural Language Processing (NLP) tasks paves the way for their potential adoption in information extraction, highlighting exciting prospects for future research.

ACKNOWLEDGEMENTS

This work was partially funded by the Helmholtz Association (grant HIDSS-0005). EB-G received support from the Cluster of Excellence ‘The Ocean Floor – Earth’s Uncharted Interface’ (EXC 2077) funded by Deutsche Forschungsgemeinschaft (DFG) - Project number 390741603 hosted by the Research Faculty MARUM - Center for Marine Environmental Sciences, University of Bremen, Germany. This work has been partially supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), under the SmartER project (Grant number 515537520). The authors also acknowledge the sources being used and the efforts of all collaborators.

REFERENCES

Camelot (2022). Last accessed 4 May 2023.

Tabula-py (2022). Last accessed 4 May 2023.

Ceritli, T. and Williams, C. K. (2021). Identifying the units of measurement in tabular data. arXiv preprint arXiv:2111.11959.

Chuang, P.-C., Yang, T. F., Wallmann, K., Matsumoto, R., Hu, C.-Y., Chen, H.-W., Lin, S., Sun, C.-H., Li, H.-C., Wang, Y., et al. (2019). Carbon isotope exchange during anaerobic oxidation of methane (AOM) in sediments of the northeastern South China Sea. Geochimica et Cosmochimica Acta, 246:138–155.

Costa, K. M., McManus, J. F., and Anderson, R. F. (2018). Paleoproductivity and stratification across the subarctic Pacific over glacial-interglacial cycles. Paleoceanography and Paleoclimatology, 33(9):914–933.

Du, N., Guo, J., Wu, C. Q., Hou, A., Zhao, Z., and Gan, D. (2020). Recommendation of academic papers based on heterogeneous information networks. In 2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), pages 1–6. IEEE.

Ducatteeuw, V. (2021). Developing an urban gazetteer: A semantic web database for humanities data. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on Geospatial Humanities, pages 36–39.

Göpfert, J., Kuckertz, P., Weinand, J., Kotzur, L., and Stolten, D. (2022). Measurement extraction with natural language processing: A review. Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2191–2215.

Hendricks, G., Tkaczyk, D., Lin, J., and Feeney, P. (2020). Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies, 1(1):414–427.

Hübscher, L., Jiang, L., and Naumann, F. (2023). ExtracTable: Extracting tables from raw data files. BTW 2023.

Liu, J., Shi, C., Yang, C., Lu, Z., and Yu, P. S. (2022). A survey on heterogeneous information network based recommender systems: Concepts, methods, applications and resources. AI Open, 3:40–57.

Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries: 13th European Conference, ECDL 2009, Corfu, Greece, September 27-October 2, 2009. Proceedings 13, pages 473–474. Springer.

Martinez-Rodriguez, J. L., Hogan, A., and Lopez-Arevalo, I. (2020). Information extraction meets the semantic web: a survey. Semantic Web, 11(2):255–335.

Moulin, T. C. and Amaral, O. B. (2020). Using collaboration networks to identify authorship dependence in meta-analysis results. Research Synthesis Methods, 11(5):655–668.

Petersen, T., Suryani, M. A., Beth, C., Patel, H., Wallmann, K., and Renz, M. (2021). Geo-quantities: A framework for automatic extraction of measurements and spatial context from scientific documents. In 17th International Symposium on Spatial and Temporal Databases, pages 166–169.

Suryani, M. A., Hahne, S., Beth, C., Wallmann, K., and Renz, M. (2023). DAF: Data acquisition framework to support information extraction from scientific publications. In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, pages 468–476. INSTICC, SciTePress.

Suryani, M. A., Wölker, Y., Sharma, D., Beth, C., Wallmann, K., and Renz, M. (2022). A framework for extracting scientific measurements and geo-spatial information from scientific literature. In 2022 IEEE 18th International Conference on e-Science (e-Science), pages 236–245. IEEE.

Wahle, J. P., Ruas, T., Mohammad, S. M., and Gipp, B. (2022). D3: A massive dataset of scholarly metadata for analyzing the state of computer science research. arXiv preprint arXiv:2204.13384.

Zhu, M. and Cole, J. M. (2022). PDFDataExtractor: A tool for reading scientific text and interpreting metadata from the typeset literature in the portable document format. Journal of Chemical Information and Modeling, 62(7):1633–1643.
