Towards Semantic Data Management Plans for Efficient Review Processing and Automation

Jana Martínková and Marek Suchánek

Faculty of Information Technology, CTU in Prague, Prague, Czech Republic

Keywords:

Data Management Plan, Semantic Annotation, Ontology, Machine-Actionable, Human-Readable, DMP Template.

Abstract:

In recent times, Data Management Planning has become increasingly crucial. Effective practices in data

management ensure more precise data collection, secure storage, proper handling, and utilization beyond the

primary project. However, existing DMPs often suffer from complex structures that impede accessibility for

humans and machines. This project aims to address these challenges by converting DMPs into formats that are

both machine-actionable and human-readable. Leveraging established DMP templates and relevant ontologies,

our methodology involves analyzing diverse approaches to achieve this dual functionality. We assess machine-

actionability through comparative evaluations using AI and NLP tools. Furthermore, we identify gaps in

ontologies, laying the groundwork for future enhancements in this critical area of research.

1 INTRODUCTION

The primary objective of this project is to propose a way of capturing data management plans (DMPs) that are both machine-actionable and human-readable. This goal aspires to bridge the gap between conventional review procedures and the potential for automation. Unfortunately, current DMPs often suffer from a convoluted composition, which impedes their accessibility to human readers, and their structure is not suited to machine processing. In recognition of this challenge, our proposal focuses on the development of a DMP template that combines human-readability and machine-actionability while building on existing work on funder templates and on the specification of machine-actionable DMPs.

To accomplish this, we set the following partial goals:

G1. find and review three suitable DMP templates

G2. manually annotate the parts/questions from each selected DMP template using ontologies and vocabularies related to DMPs or newly defined suitable terms

G3. use formats combining machine-actionability and human-readability to capture existing DMPs enriched with semantic annotations

G4. evaluate the usability and correctness of the proposed solutions

https://orcid.org/0000-0001-8575-6533

https://orcid.org/0000-0001-7525-9218

2 MACHINE-ACTIONABLE VS. HUMAN-READABLE DOCUMENTS

Machine-readable and human-readable documents

each serve distinct purposes, necessitating unique

characteristics, structures, and potential technolo-

gies for their creation and interpretation. Machine-

readable documents, designed primarily for auto-

mated processing by computers, prioritize structured

data, often employing formats like XML, JSON, or

CSV, enabling seamless data extraction and analy-

sis (Open Knowledge, 2015).

In contrast, human-readable documents cater to

human comprehension and are typically presented in

a visually appealing format with rich content, includ-

ing text, images, and multimedia elements.

On the other hand, machine-actionable data, as described in (Data Documentation Initiative, 2023), refers to structured data whose structure machines or computers can be programmed against. Moreover, according to (ELIXIR, Research Data Management Kit, 2021), machine-actionable data fosters semantic and syntactic data integration, particularly among datasets sharing similar experimental conditions or variables.

Martínková, J. and Suchánek, M.
Towards Semantic Data Management Plans for Efficient Review Processing and Automation.
DOI: 10.5220/0012837900003756
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 13th International Conference on Data Science, Technology and Applications (DATA 2024), pages 543-550
ISBN: 978-989-758-707-8; ISSN: 2184-285X

Some formats try to combine machine processing capabilities with human readability. These include, for example, RDFa and so-called microformats.

2.1 RDFa

Resource Description Framework in Attributes

(RDFa) (RDFa Working Group, 2013) is a W3C

recommendation that enables the representation of

structured data by utilizing attributes in Hyper Text

Markup Language (HTML) elements. By incorpo-

rating RDFa into web pages, it becomes possible to

embed semantic annotations while maintaining the

human-readable content of the web page.

2.2 Microformats

Microformats (Khare, 2006) are a collection of predefined HTML classes that enrich content with semantic meaning. These HTML classes allow for the inclusion of machine-actionable data directly in web page content.

2.3 Microdata

Microdata (Web Hypertext Application Technology Working Group, 2023) shares similarities with Microformats, but a key distinction lies in its generality. Unlike Microformats, Microdata is not tied to a predefined set of terms, providing greater flexibility by allowing the use of various ontologies and vocabularies according to the author's preferences. This makes Microdata well-suited for scenarios requiring the annotation of diverse data within a single document.

2.4 XSL Transformation

XML (eXtensible Markup Language) (Bray et al.,

2008) plays a pivotal role as a versatile format for

both machine-readable and human-readable docu-

ments. While primarily designed for machine read-

ability, XML’s intuitive markup is also accessible to

humans, especially when properly formatted.

XSL (eXtensible Stylesheet Language) (W3C,

2017) complements XML by enabling the transfor-

mation of machine-readable XML data into more

human-readable formats. XSLT (XSL Transforma-

tions) (Kaz, 2017) is a key component of XSL, pro-

viding a powerful mechanism for converting XML

documents into different output formats, such as

HTML, PDF, or plain text.

XSD (XML Schema Deﬁnition) (W3C, 2012)

plays a crucial role in ensuring the integrity and valid-

ity of XML documents. XSD provides a set of rules

and constraints that deﬁne the structure and data types

within an XML document. This schema validation

maintains data consistency and reliability, indirectly

enhancing human readability.

3 DATA MANAGEMENT PLANS

A DMP is a document that facilitates efficient data management throughout a project. It outlines the lifecycle of the data created or collected during the project, detailing how the data will be handled and providing information about their future usability and availability. Effective data management practices lead to more accurate data collection, secure storage, and proper handling, elevating the data's potential value and relevance in diverse research domains (Smale et al., 2018).

A DMP is commonly structured using a standard-

ized template to ensure all essential components are

covered, although certain sections may be adapted

based on the project, funding source, or organization.

In this work, the Horizon Europe, Science Europe,

and National Institutes of Health (NIH) DMP tem-

plates are described in more detail to fulﬁll the G1

goal. These templates were selected due to their ex-

tensive adoption on a global scale.

Science Europe Template (Science Europe, 2021)

includes essential details and a table that links differ-

ent DMP sections with individual Findable, Accessi-

ble, Interoperable, Reusable (FAIR) principles.

Horizon Europe Template (European Commission, 2020) covers all the necessary aspects of data management and includes questions explicitly aligned with the FAIR principles, organized according to the structure of those principles.

National Institutes of Health Data Management

and Sharing Plan Template (National Institutes of

Health, 2023) is very simple, which can lead to insuf-

ﬁcient information being ﬁlled in. It completely omits

the connection with the FAIR principles and further-

more does not cover the areas of legal requirements,

data storage security and allocated resources for data

management in the project. On the other hand, its

brevity should not discourage the data steward from

completing this plan at the beginning.

The National Institutes of Health Data Management and Sharing Plan Template (National Institutes of Health, 2023) is insufficient for obtaining all the necessary information. In contrast, the Science Europe Template (Science Europe, 2021) and the Horizon Europe Template (European Commission, 2020) cover all the crucial details required by the DMP. From the user's perspective, the Science Europe Template (Science Europe, 2021) provides a more pleasant experience, as its questions do not place strong emphasis on the FAIR principles. Instead, they prompt the data steward to consider how they approach various issues within the project, rather than immediately focusing on applying the specific FAIR principle in question.

4 RELATED WORK

According to (DataCite, 2021), machine-actionable data management plans (maDMPs) play a pivotal role in fostering the exchange of information by establishing connections between metadata and various sources, including repositories and institutions.

The primary objective of the DMP Common Stan-

dard (DCS) Working Group, under the purview of

Research Data Alliance (RDA), centres on the estab-

lishment of well-deﬁned processes for research data

management, a robust data management infrastruc-

ture, and, most importantly, a universally accepted

standard (Miksa et al., 2021) in the form of a data

model to represent DMP information. Its implemen-

tation ensures seamless interoperability between sys-

tems engaged in producing or consuming maDMP,

while concurrently permitting the assimilation of ad-

ditional information from diverse systems, such as

organizational or repository-related data. Within the

framework of the DCS Working Group’s endeavours,

a JSON serialization of this application proﬁle has

been generated, offering practical utility.

However, as noted by (Cardoso et al., 2022), sev-

eral aspects of this proﬁle present challenges. Fore-

most among them is the absence of direct, explicit

linkages to existing ontologies or vocabularies. Addi-

tionally, DCS covers only essential parts of the DMPs,

omitting crucial elements such as the provenance of

reused or generated data, project objectives, data ac-

cess embargo, or the access protocol.

While the proﬁle is designed to be extensible, a

discernible mechanism for segregating the founda-

tional speciﬁcation from its extensions is not yet de-

ﬁned. Pertinently, while the proﬁle aims for machine-

actionability, limitations arise due to certain elements

within the structure accommodating text ﬁelds.

The DMP Common Standard Ontology (DCSO) (Cardoso et al., 2022), an ontology grounded in the RDA's DCS standard specification, addresses these concerns. By consolidating terms into a comprehensive ontology intertwined with several pre-existing ontological constructs, the DCSO significantly enhances interoperability within the realm of the RDA DCS working group standard.

One of the assessments of the DCSO is conducted

through the maDMP Evaluation (Foidl and Burgger,

2021). This investigation involves the transformation

of openly available maDMP instances into DCSO in-

stances using the dcsojon tool, generating JavaScript

Object Notation for Linked Data (JSON-LD) repre-

sentations. By applying predeﬁned SPARQL Pro-

tocol and RDF Query Language (SPARQL) (Harris

and Seaborne, 2013) queries based on the evalua-

tion methodology outlined in the International Align-

ment of Research Data Management (Science Europe,

2021) DMP template, the study assesses the expres-

siveness of the SPARQL queries against the evalua-

tion rubric criteria.

The challenges in expressing certain criteria via

SPARQL queries often stem from the fact that the

concept is either not covered or only partially covered

by the fundamental DCSO, thus remaining absent in

the transformed DMP DCSO.

5 OUR APPROACH

To accomplish the deﬁned objectives of this study,

we initiated the annotation of existing DMP instances

with terms extracted from diverse ontologies and dic-

tionaries, as elaborated in Section 5.1, aligning with

the fulﬁlment of G2. After these annotations, we pro-

ceeded to implement them in various formats, thor-

oughly evaluating their advantages and drawbacks, as

detailed in Section 5.2, thereby achieving G3. Finally,

the manually implemented maDMP underwent com-

prehensive assessment through multiple approaches,

as outlined in Section 6, to meet the objectives of G4.

5.1 Annotations

During the development of a machine-actionable and

human-readable DMP, terms from various ontolo-

gies and dictionaries were gradually grouped to pro-

vide a sufﬁcient semantic description of the informa-

tion contained in the DMP. Throughout the annota-

tion process, approximately nine common ontologies

and vocabularies were employed. Even in this initial

phase, it became apparent that not all known existing

ontologies and dictionaries were sufﬁcient. For this

reason, problematic areas are described below, along

with proposals on how to address them. To hold the proposed annotation terms, a hypothetical ontology with the prefix "dmp:" was introduced. Terms with this prefix represent proposed terms that could be used for annotation; however, they are not yet formally defined.

Furthermore, existing DMPs were manually annotated. It was found that although the wording of questions in DMP templates may be unambiguous, the answers do not always contain, or cannot contain, the desired information. As a result, while the annotations

of individual existing DMPs are based on grouped

terms from the previous step, in some cases, they had

to be adjusted or supplemented to correspond to the

information in the existing DMPs, thus providing a

true semantic description.

5.1.1 Core Elements of a DMP

The core constituents of the DMP encompass the

DMP document itself, the relevant project, the ensu-

ing data, along with its metadata, and the designated

repository for data storage. Given that a signiﬁcant

portion of the information within the DMP is inter-

linked mainly with these components, it is crucial to

accurately deﬁne and delineate these elements.

The figure published online (Martínková and Suchánek, 2024) illustrates the annotations of the fundamental components of the DMP together with their types, indicating the class to which each individual or node belongs. The typeof relationship is represented by a dashed arrow, and the object property relationship is depicted by a solid arrow. Some core elements have more than one type (typeof) to meet the requirements of object or data properties and inheritance specifications in the term definitions.

The following section addresses the primary chal-

lenges and intriguing aspects associated with annotat-

ing the DMP.

5.1.2 Information About Reusing Datasets

Within the context of the DMP, it is vital to ascer-

tain whether any datasets are reused throughout the

research process. While the DCSO does not provide

an explicit solution, the dcso:Dataset can potentially

have a deﬁned creation date before the project’s in-

ception, which can indicate that the dataset is reused

in the current project. This solution is not very intuitive; nevertheless, the commonly used ontologies offer no more suitable one.

Several solutions were considered for this issue. The initial approach involved employing the object property prov:wasDerivedFrom to establish a link between the resulting dataset and the reused dataset. However, it is often necessary to explain the rationale for reuse and specify if certain datasets are ultimately unused. This information cannot be captured solely by the prov:wasDerivedFrom property.

https://doi.org/10.5281/zenodo.10893770

Our approach uses the dcat:qualiﬁedRelation

property to link a dataset to a dcat:Relationship in-

stance, which connects to the reused dataset via

dcat:relation. The dmp:reason property provides the

justiﬁcation for reuse, and an element can be added

for discarded reuse cases.

Furthermore, the data property dmp:reusingData

has been introduced. Although this may seem like

duplication of information capture, it serves for in-

stances where there are no reused datasets or the in-

formation is unknown. As a result, it captures only

the values Yes, No, or Unknown.
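The qualified-relation pattern described above can be sketched as a handful of triples. This is a minimal illustration written with plain Python tuples rather than an RDF library; the identifiers ex:resultingDataset and ex:reusedDataset and the blank-node label _:reuse1 are hypothetical, and dmp:reason is the proposed (not formally defined) term from this work.

```python
# Triples are (subject, predicate, object) tuples using prefixed names.
# The ex:* identifiers and the blank-node label are illustrative assumptions.
def reuse_triples(dataset, reused, reason):
    """Return triples linking `dataset` to `reused` via a dcat:Relationship."""
    rel = "_:reuse1"  # blank node standing for the qualified relationship
    return [
        (dataset, "dcat:qualifiedRelation", rel),
        (rel, "rdf:type", "dcat:Relationship"),
        (rel, "dcat:relation", reused),
        (rel, "dmp:reason", f'"{reason}"'),
    ]


def to_statement_lines(triples):
    """Render each triple as one Turtle-like statement line."""
    return [f"{s} {p} {o} ." for s, p, o in triples]


triples = reuse_triples(
    "ex:resultingDataset",
    "ex:reusedDataset",
    "Baseline measurements for comparison",
)
for line in to_statement_lines(triples):
    print(line)
```

Because the justification hangs on the relationship node rather than on either dataset, the same dataset could be reused for several reasons without the statements becoming ambiguous.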

5.1.3 Information About the Purpose of the

Data and Its Relation to Objectives of the

Project

The DMP usually captures the purpose of the resulting data together with its relation to the objectives of the project. However, the DCSO does not cover this aspect, and common ontologies provide no terms for capturing objectives and their relations.

Hence, additional terms were introduced into the

hypothetical dmp ontology. The object property

dmp:hasObjective with a range of rdfs:Resource was

established to express a project’s objective. To link a

dataset with the project objective, the object property

dmp:fulﬁllsObjective was also added.

5.1.4 Metadata Elements

When detailing the metadata associated with datasets within the DMP, it is crucial to explicitly specify the metadata schema used or the individual metadata elements employed. Well-known ontologies adequately cover the former, but expressing individual metadata elements is challenging. To overcome this limitation, the data property dmp:containsMetadataElement was introduced to express the individual metadata elements.

5.1.5 Information About the Trustworthiness of

the Data Repository

The DMP usually captures the trustworthiness of the data repository or repositories used. Unfortunately, prevalent ontologies lack suitable means to express trustworthiness. In practice, repositories establish trustworthiness by adhering to the TRUST principles (Lin et al., 2020) or obtaining dedicated certificates. However, DMP templates do not usually ask specifically for the reason for trustworthiness. If they did, the information could easily be annotated using an object property linking the repository to the trust certificate or the evaluation of TRUST principles. Instead, responses in the DMP commonly only state "Yes, the repository we use is trusted".

To address this, the data property dmp:isTrusted was introduced to describe the repository with values restricted to "Yes", "No", or "Unknown".
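The restriction to three values can be sketched as a small validation step. The function name and the normalisation of raw input are assumptions made for illustration; the source only states that the proposed property takes the values Yes, No, or Unknown.

```python
ALLOWED = ("Yes", "No", "Unknown")


def tri_state(value):
    """Normalise a raw value to one of the allowed literals, else raise."""
    cleaned = value.strip().capitalize()
    if cleaned not in ALLOWED:
        raise ValueError(f"expected one of {ALLOWED}, got {value!r}")
    return cleaned


print(tri_state(" yes "))    # Yes
print(tri_state("UNKNOWN"))  # Unknown
```

Rejecting anything outside the three literals keeps the annotation queryable with a simple equality test, at the cost of losing any free-text justification.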

5.1.6 Availability and Accessibility of Data and

Metadata

Within the DMP, multiple inquiries focus on the

availability and accessibility of both data and associ-

ated metadata. Upon examining various DMPs, it be-

comes apparent that researchers often provide similar

answers or similar key ideas to these questions. Some

questions cover multiple aspects, making it challeng-

ing to comprehensively address each point and result-

ing in insufﬁcient DMP outcomes.

In this critical aspect of DMPs, it would be ben-

eﬁcial to explore a more structured approach to ob-

tain this valuable information, perhaps even incorpo-

rating semi-controlled vocabulary options in certain

sections. To annotate this area, a whole new approach was designed, as shown in the figure published online (Martínková and Suchánek, 2024). It is important to bear in mind that this is just an initial proposal.

The primary goal of this work was not to create a new

ontology for DMP, but during this work, several deﬁ-

ciencies in the existing options were identiﬁed.

To facilitate annotation, we established classes

and properties within a hypothetical DMP ontology.

The design of this structure primarily revolves around

deﬁning two key concepts: availability and accessibil-

ity, in alignment with the Common DSW Knowledge

Model (DSW Team, 2018) used in the Data Steward-

ship Wizard (DSW) (Pergl et al., 2019).

5.1.7 Common Ontologies and Vocabularies

In comprehensive DMP templates, a question arises about the use of ontologies and vocabularies in the data context. Since no specific term for an ontology or vocabulary was found, a concept was created in the hypothetical dmp: ontology, as can be seen in the figure published online (Martínková and Suchánek, 2024).

5.1.8 Cost and Its Funding

The DMP typically addresses the resources necessary to meet FAIR principles. For quantifying the resources allocated to make data FAIR, we employed the object property schema:estimatedCost with a range of schema:MonetaryAmount, specifying value and currency. To signify that this amount pertains to enhancing data FAIRness, we utilized the sioc:Topic.

To annotate how these expenditures will be

funded, the object properties schema:funder and

foaf:fundedBy were employed.

5.2 Different Formats

In this study, three potential formats, as described in Section 2, were considered to capture annotations within human-readable DMPs. This section details their capabilities and assesses their suitability for the intended purpose.

5.2.1 XSL Transformations

Utilizing Extensible Markup Language (XML) for

annotation and subsequent transformation to human-

readable text using Extensible Stylesheet Language

Transformations (XSLT) is not entirely suitable for

our requirements. The main problem is with the or-

ganization of DMPs. Questions are often grouped by

topic rather than by core elements, resulting in scat-

tered references to datasets throughout the document.

Code Example 1: Example of XML annotations.

<dataset>
  This data set has following distributions:
  <title>Distribution A</title>
  has size
  ...
</dataset>
...
<dataset>
  <availability>Yes, all data will be made openly available.</availability>
</dataset>

However, XML is strictly hierarchical: everything stated about an entity has to be nested inside its element, and using the element again elsewhere in the document creates a second, separate element. In Example 1, there are two dataset elements, even though, in the context of the DMP, we are referring to only one dataset. This limitation led us to explore alternative approaches rather than continuing with this method.
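The problem can be reproduced with a short sketch using Python's standard-library XML parser. The element names dmp, section, dataset, title, and availability are illustrative assumptions, not a schema defined in this work.

```python
import xml.etree.ElementTree as ET

# One real-world dataset, mentioned in two topic-oriented sections of a DMP.
DOC = """
<dmp>
  <section name="Data summary">
    <dataset><title>Distribution A</title></dataset>
  </section>
  <section name="Accessibility">
    <dataset><availability>Yes, all data will be made openly available.</availability></dataset>
  </section>
</dmp>
"""

root = ET.fromstring(DOC)
datasets = root.findall(".//dataset")

# The parser sees two independent elements; nothing in the markup
# says that they describe the same dataset.
print(len(datasets))  # 2
for ds in datasets:
    print([child.tag for child in ds])
```

Linking the two elements would require an explicit identifier convention on top of XML, which is what the formats discussed next provide natively.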

5.2.2 Microformats and Microdata

Microformats prove to be unsuitable for our work due to the diverse nature of information within the DMP. In contrast, Microdata is highly suitable in this respect, offering the flexibility to use any ontology or vocabulary for semantic annotations. However, Microdata cannot create Resource Description Framework (RDF) blank nodes. Since various elements in the DMP lack properly defined identifiers but are referenced in multiple sections, blank nodes become valuable. They enable the aggregation of information about, for example, a reused dataset mentioned in different parts of the entire DMP. Even without explicit identifiers, the use of blank nodes allows the connection of information related to the same dataset.
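How a shared node label lets scattered statements be merged can be sketched as follows. The statements, the labels _:reused1 and _:other, and the property values are made up for illustration.

```python
from collections import defaultdict

# (subject, predicate, object) statements as they might be harvested from
# different sections of one DMP; all values here are hypothetical.
statements = [
    ("_:reused1", "rdf:type", "dcat:Dataset"),        # section on data reuse
    ("_:reused1", "dcterms:title", "Survey 2019"),    # section on data summary
    ("_:reused1", "dcterms:format", "CSV"),           # section on formats
    ("_:other", "rdf:type", "dcat:Dataset"),
]


def group_by_subject(triples):
    """Collect all predicate/object pairs that share the same subject node."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return dict(grouped)


nodes = group_by_subject(statements)
print(len(nodes["_:reused1"]))  # 3
```

The label only needs to be unique within the document, so no global identifier has to be minted for a dataset that the DMP author never named formally.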

5.2.3 RDFa

RDFa emerges as the optimal solution for our work,

offering the ﬂexibility to utilize any ontology or vo-

cabulary for text annotations. It also facilitates the

creation of blank nodes, allowing the connection of

information to these nodes via object or data prop-

erties throughout the document by assigning them a

node identiﬁer within the local resource.

Code Example 2: Example of RDFa annotations.

<b>Will you re-use any existing data?</b>
<span resource="#dataset" typeof="dcat:Dataset schema:Thing">
  data will be re-used.
</span>

In Code Example 2, the dataset defined as a named blank node allows for contextual connections, linking all information related to the dataset irrespective of its position within the document. This leads to the choice of RDFa as the preferred format for annotations in our work.
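To show that such markup stays readable for people while exposing structure to programs, the sketch below collects RDFa attributes with Python's standard-library HTML parser. It is not a conforming RDFa processor; it only lists a few attributes. The class name RDFaAttributeCollector and the exact text inside the span are assumptions for illustration.

```python
from html.parser import HTMLParser


class RDFaAttributeCollector(HTMLParser):
    """Record (tag, attribute, value) for a few RDFa attributes."""

    RDFA_ATTRS = {"about", "resource", "typeof", "property"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.RDFA_ATTRS:
                self.found.append((tag, name, value))


HTML = """
<b>Will you re-use any existing data?</b>
<span resource="#dataset" typeof="dcat:Dataset schema:Thing">
  Yes, data will be re-used.
</span>
"""

collector = RDFaAttributeCollector()
collector.feed(HTML)
for item in collector.found:
    print(item)
```

A browser renders the same snippet as an ordinary question and answer, which is the dual use that motivates the choice of RDFa.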

6 EVALUATION

The assessment in this study involved comparing

semantically annotated DMPs against Natural Lan-

guage Processing (NLP) and text-mining methods, in-

cluding ChatGPT, applied to human-readable (non-

annotated) DMPs. A foundational list of approxi-

mately 20 questions, covering all aspects of the DMP,

served as the benchmark for testing these approaches

on both annotated and non-annotated DMPs. The annotated DMPs and the list of evaluation questions are published online (Martínková and Suchánek, 2024).

The evaluation applied two distinct tools to non-annotated DMPs: the open-source NLP tool, the Hugging Face Transformer (Wolf et al., 2020), and ChatGPT (OpenAI, 2022). Each tool was tasked with querying the DMPs using questions in natural human language, and the provided responses were manually evaluated by comparing them with the information contained in the DMP. For semantically annotated DMPs, SPARQL queries aligned with the set of questions were utilized.


A comprehensive evaluation was conducted on a

total of 5 DMPs: 2 following the Horizon Europe

template (European Commission, 2020), 2 adhering

to the NIH Data Management and Sharing Plan (Na-

tional Institutes of Health, 2023), and just one DMP

aligning with the International Alignment of Research

Data Management (Science Europe, 2021). The se-

lection of only one DMP in the latter case stems from

the observed discrepancy between the DMPs avail-

able online and their original templates, rendering

them unsuitable for the evaluation process. The small

number of DMPs is due to the fact that performing de-

tailed annotation manually is a lengthy process; nev-

ertheless, we still managed to achieve results.

Table 1 displays the percentage of correctly answered questions for each method and each individual DMP. First, the proportion of the evaluation questions that could be answered for each DMP, i.e., those whose answer is actually present in the DMP, was determined (column "Answerable"). All results of the evaluation methods are calculated relative to this answerable subset, not to the full set of questions.
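The calculation can be illustrated with a small worked example. The counts below are assumptions: the source reports only percentages, and a total of 21 questions with 15 answerable and 13 correctly answered is one set of counts that reproduces the DMP1 values for "Answerable" and ChatGPT in Table 1.

```python
def percentage(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


total_questions = 21  # assumed; the text says "approximately 20"
answerable = 15       # assumed number of questions the DMP can answer
correct = 13          # assumed number of correct answers from one method

# Share of questions whose answer is present in the DMP at all.
print(percentage(answerable, total_questions))  # 71.43

# Method accuracy, measured against the answerable subset only.
print(percentage(correct, answerable))          # 86.67
```

Measuring against the answerable subset avoids penalising a method for information that the DMP simply does not contain.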

6.1 Evaluation Questions

A set of approximately 20 questions was devised to encompass aspects related to data reuse, resulting data, metadata, their availability and accessibility, resources for ensuring FAIRness, and legal and ethical considerations. To assess annotated DMPs, corresponding SPARQL queries were formulated. The full list of evaluation questions is published online (Martínková and Suchánek, 2024).

6.2 Hugging Face Transformer

The Hugging Face Transformer (Wolf et al., 2020) provides a framework and pre-trained models that simplify NLP tasks, in particular question answering, which is the method relevant to our work.

We used the DistilBERT model (Sanh et al., 2019) to analyze unannotated DMPs by posing questions from the list in natural human language. However, the tool's answers were highly inadequate, and it often generated entirely unreasonable responses. The column labeled "The Hugging Face Transformer" in Table 1 shows the percentages of cases where the answer closely aligned with the queried topic. One question that consistently received accurate responses pertained to the volume of resulting data.


Table 1: Percentage of accurately answered questions for each method and each individual DMP.

DMP    Template         Answerable   The Hugging Face Transformer   ChatGPT   SPARQL
DMP1   Horizon Europe   71.43%       33.33%                         86.67%    100.00%
DMP2   Horizon Europe   100.00%      9.52%                          90.48%    100.00%
DMP3   Science Europe   61.90%       15.38%                         76.92%    100.00%
DMP4   NIH              47.62%       20.00%                         70.00%    100.00%
DMP5   NIH              33.33%       42.86%                         71.43%    100.00%

6.3 ChatGPT

ChatGPT (OpenAI, 2022) is a language model designed for natural language understanding and generation. This model was utilized for the analysis of unannotated DMPs by formulating questions from the list in natural human language. The tool yields highly accurate and detailed responses, achieving very high correctness, and the answers can be serialized in various formats upon request. The precision of the responses to the evaluation questions is shown in Table 1 under the column labeled "ChatGPT".

However, there are drawbacks to this solution. First, it tends to provide lengthy responses, including surrounding context. While this can be seen as an advantage depending on specific requirements, for machine-actionable purposes, concise yet complete answers are preferable.

The second limitation is that the tool lacks the ca-

pability to count automatically. If the DMP includes

a list of reused datasets, the tool cannot provide the

exact number of reused datasets. It’s important to

note that while ChatGPT occasionally struggles with

poorly described and ambiguous text, this issue is typ-

ically less prominent in the context of the DMP.

6.4 SPARQL

SPARQL (Harris and Seaborne, 2013) is a standardized query language for retrieving and manipulating data in RDF structure. SPARQL queries were used for the analysis of annotated DMPs by formulating the evaluation questions as SPARQL queries. Not surprisingly, this approach was the most effective and provided the exact answers (if they were part of the DMP), as can be seen in Table 1 under the column "SPARQL".

The incorporation of pre-existing annotations helps prevent misunderstandings in complex, semi-structured, or intuitive texts where ChatGPT might otherwise encounter challenges. Unlike ChatGPT, SPARQL supports counting operations. Therefore, the number of reused datasets, the total volume of datasets in various units, or the sum of required resources can be conveniently obtained.
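The kind of aggregation meant here can be sketched in Python over a small list of triples. This is a stand-in for SPARQL aggregates such as COUNT and SUM, not the queries used in the evaluation, and the data (dataset identifiers, relationship nodes, byte sizes) is hypothetical.

```python
# Hypothetical triples harvested from one annotated DMP.
triples = [
    ("ex:ds1", "dcat:qualifiedRelation", "_:r1"),
    ("_:r1", "dcat:relation", "ex:reusedA"),
    ("ex:ds1", "dcat:qualifiedRelation", "_:r2"),
    ("_:r2", "dcat:relation", "ex:reusedB"),
    ("ex:ds1", "dcat:byteSize", 2_000_000),
    ("ex:ds2", "dcat:byteSize", 500_000),
]


def count_reused(data):
    """Count distinct datasets reached via dcat:relation (cf. COUNT(DISTINCT ?d))."""
    return len({o for _, p, o in data if p == "dcat:relation"})


def total_bytes(data):
    """Sum all dcat:byteSize values (cf. SUM(?size))."""
    return sum(o for _, p, o in data if p == "dcat:byteSize")


print(count_reused(triples))  # 2
print(total_bytes(triples))   # 2500000
```

Both results are computed from the structure alone, so they do not depend on how the corresponding sentences in the DMP happen to be worded.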

7 CONCLUSION

Three DMP templates were chosen and annotated based on the semantic meaning of their parts and questions, aligning with the objectives G1 and G2. Nine well-known ontologies, including DCSO, were utilized alongside the introduction of additional terms when needed. Over 20 terms were specifically defined.

The identiﬁed gaps among known ontologies

within the DMP domain present future opportunities

to enhance the proposed solution and provide more

comprehensive coverage. The analysis involved cap-

turing these annotations in different formats, with

RDFa chosen to represent all ﬁve annotated DMP ex-

amples, fulﬁlling objective G3.

To fulfill objective G4, a manual evaluation of usability and correctness took place, involving approximately 20 questions. The usability and correctness of the proposed solutions were assessed by testing unannotated DMP instances, i.e., without semantic annotations, using two approaches: The Hugging Face Transformer (Wolf et al., 2020) and ChatGPT (OpenAI, 2022). However, neither approach achieved as high a percentage of correctly answered questions as the approach using equivalent SPARQL queries on semantically annotated DMPs.

It is important to note that semantically annotated DMPs contain ontological terms matching the SPARQL version of the evaluation questions, because the information captured within the DMPs and the evaluation questions pertain to the same domain.

These results demonstrate the usability of this approach, strongly suggesting that combining manual annotation with NLP or Artificial Intelligence (AI) methods could streamline the process, making it an intriguing avenue for future exploration.


ACKNOWLEDGEMENTS

This work was supported by the Student Summer

Research Program 2023 of the FIT CTU in Prague

and by the Czech Technical University in Prague

grant: Advance Research In Software Engineering,

No. SGS23/206/OHK3/3T/18.

REFERENCES

Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E.,

and Yergeau, F. (2008). Extensible markup language

(xml) 1.0 (ﬁfth edition). [online]. [Accessed 2023-08-

13].

Cardoso, J., Castro, L. J., Ekaputra, F. J., Jacquemot, M. C., Suchánek, M., Miksa, T., and Borbinha, J. (2022). DCSO: towards an ontology for machine-actionable data management plans. Journal of Biomedical Semantics, 13(1):21.

Data Documentation Initiative (2023). Machine-actionable.

[Accessed 2023-07-17].

DataCite (2021). Introduction to machine actionable dmps

(madmps). [online]. [Accessed 2023-08-13].

DSW Team (2018). Common DSW Knowledge Model.

[online]. [Accessed 2023-03-19].

ELIXIR, Research Data Management Kit (2021). Machine-

actionability. [Accessed 2023-07-17].

European Commission (2020). Horizon 2020 dmp. [on-

line]. [Accessed 2023-12-15].

Foidl, R. and Burgger, L. S. (2021). Evaluation of maDMPs

using SPARQL.

Harris, S. and Seaborne, A. (2013). Sparql 1.1 query lan-

guage. [online]. [Accessed 2023-08-13].

Kaz, M. (2017). Xsl transformations (xslt) version 3.0. [on-

line]. [Accessed 2023-08-13].

Khare, R. (2006). Microformats: the next (small) thing

on the semantic web? IEEE Internet Computing,

10(1):68–75.

Lin, D., Crabtree, J., Dillo, I., Downs, R., Edmunds, R., Gi-

aretta, D., Giusti, M., L’Hours, H., Hugo, W., Jenkyns,

R., Khodiyar, V., Martone, M., Mokrane, M., Navale,

V., Petters, J., Sierman, B., Sokolova, D., Stockhause,

M., and Westbrook, J. (2020). The trust principles for

digital repositories. Scientiﬁc Data, 7.

Martínková, J. and Suchánek, M. (2024). Semantically annotated data management plans. [Accessed 2024-04-03].

Miksa, T., Walk, P., Neish, P., Oblasser, S., Holland, M.,

Renner, T., Jacquemot-Perbal, M.-C., Cardoso, J.,

Kvamme, T., Praetzellis, M., et al. (2021). Appli-

cation Proﬁle for Machine-Actionable Data Manage-

ment Plans.

National Institutes of Health (2023). Data management &

sharing plan. [online]. [Accessed 2023-08-13].

Open Knowledge (2015). Machine-readable. [Accessed

2023-07-17].

OpenAI (2022). ChatGPT (version 3.5). [Accessed 2023-

12-15].

Pergl, R., Hooft, R., Suchánek, M., Knaisl, V., and Slifka, J. (2019). "Data Stewardship Wizard": A Tool Bringing Together Researchers, Data Stewards, and Data Experts around Data Management Planning. Data Science Journal, 18:59.

RDFa Working Group (2013). RDF in Attributes (RDFa).

[Accessed 2023-07-17].

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).

Distilbert, a distilled version of bert: smaller, faster,

cheaper and lighter. arXiv preprint arXiv:1910.01108.

Science Europe (2021). Practical guide to the international

alignment of research data management-extended edi-

tion.

Smale, N., Unsworth, K., Denyer, G., and Barr, D. (2018).

The history, advocacy and efﬁcacy of data manage-

ment plans. bioRxiv.

W3C (2012). W3c xml schema deﬁnition language (xsd)

1.1. [online]. [Accessed 2023-08-15].

W3C (2017). The extensible stylesheet language family

(xsl). [online]. [Accessed 2023-08-15].

Web Hypertext Application Technology Working Group

(2023). Html - living standard. [online]. [Accessed

2023-12-15].

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C.,

Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M.,

and Brew, J. (2020). Transformers: State-of-the-art

natural language processing. [Accessed 2023-12-15].
