Developing a Reference OntoUML Conceptual Model for Data

Management Plans: Enhancing Consistency and Interoperability

Jana Martínková, Marek Suchánek and Robert Pergl

Faculty of Information Technology, CTU in Prague, Prague, Czech Republic

{jana.martinkova, marek.suchanek, robert.pergl}@cvut.cz

Keywords:

Data Management Plan, FAIR Principles, Ontology, Conceptual Model, OntoUML.

Abstract:

The growing signiﬁcance of Data Management Plans (DMPs) has highlighted the need for standardized and

accurate data management practices. Current DMPs often suffer from inconsistent terminology, leading to mis-

understandings and reducing their effectiveness. This study proposes the development of a DMP OntoUML

conceptual model to address these issues. The model aims to clearly deﬁne all relevant concepts and their

relationships, ensuring consistency and interoperability, particularly by connecting with the FAIR principles

OntoUML model. The research follows a structured approach: specifying necessary concepts using existing

templates and ontologies, deﬁning terms and their relationships within the OntoUML model, and verifying the

model’s syntax. The resulting conceptual model will standardize terminology, promote interoperability, and

support future DMP development and education.

1 INTRODUCTION

In recent times, the importance of developing a data

management plan (DMP) has grown signiﬁcantly. Ef-

fective data management practices ensure more accu-

rate data collection, secure storage, proper handling,

and utilization beyond the primary project scope.

However, existing DMPs and their templates often

employ varying terminology to describe the same

concept or use identical terms for different concepts.

This inconsistency can cause misunderstandings at

human and machine levels, lowering the value of

DMPs due to incomplete or incorrect information.

These misunderstandings can cause errors in

DMPs as data stewards may misinterpret terms, lead-

ing to incorrect data management. This fragmenta-

tion hinders collaboration, reducing research quality

and impact. Inconsistent terminology also compli-

cates training for new researchers and data managers,

making it harder to adopt best practices.

In recognition of this challenge, our proposal focuses on developing a DMP OntoUML conceptual model. This model will accurately describe all the concepts used within DMPs and establish clear relationships between them. Additionally, it will be connected to the existing OntoUML model of the Findable, Accessible, Interoperable, Reusable (FAIR) principles, showing how DMP concepts relate to the concepts of the FAIR principles. Moreover, the conceptual model will aid in standardizing terminology, promoting interoperability between systems working with DMPs, and ensuring scalability for future DMP development. Furthermore, it will serve as a valuable resource for training and education.

https://orcid.org/0000-0001-8575-6533

https://orcid.org/0000-0001-7525-9218

https://orcid.org/0000-0003-2980-4400

To accomplish this, we set the following partial goals:

G1. Specify the concepts that need to be covered using existing DMP templates, ontologies, and knowledge models related to DMPs;

G2. Define the terms and their relationships in the OntoUML conceptual model using existing ontologies and vocabularies related to DMPs;

G3. Verify the syntax of the OntoUML model and validate its content by using an example that ensures it comprehensively covers an existing DMP.

2 CONCEPTUAL MODELLING

Conceptual modelling is an activity aimed at developing a formal description of relevant aspects of reality, involving the domain, its concepts, and activities within it. The resulting output of this process, a conceptual model, is used in various contexts and domains, offering several advantages. As described by (Gonzalez-Perez, 2018), conceptual models serve as an excellent method for documenting knowledge with low ambiguity and high simplicity, making them easily understandable without additional explanations. As pointed out in (Robinson et al., 2015), they also bridge the gap between different mindsets and areas of knowledge.

Martínková, J., Suchánek, M. and Pergl, R.

Developing a Reference OntoUML Conceptual Model for Data Management Plans: Enhancing Consistency and Interoperability.

DOI: 10.5220/0012940000003838

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2024) - Volume 2: KEOD, pages 159-166

ISBN: 978-989-758-716-0; ISSN: 2184-3228

The beneﬁt also lies in the precise deﬁnition of the

scope and purpose of tools and techniques used for

work, and parts of the conceptual model can be reused

in entirely different contexts, as described in (Robin-

son et al., 2015). Having a conceptual model also

opens possibilities for comparing and connecting in-

formation from various sources with a higher cer-

tainty of understanding the authors’ intentions. As de-

scribed in (Gonzalez-Perez, 2018), conceptual mod-

elling enables the exploration of complex fragments

of the world that initially seem very tricky. Using

a modelling language helps overcome obstacles by

reducing complexity, allowing for a problem-solving

approach that addresses one issue at a time.

During the development phase of conceptual mod-

els, a language is imperative to accurately, unam-

biguously, and clearly represent knowledge. Domain-

speciﬁc languages such as DEMO (Dietz and Hooger-

vorst, 2015) or domain-agnostic ones like OntoUML

can be employed (Pergl, 2019).

OntoUML (Guizzardi, 2005a) is an ontologically well-founded language for ontology-driven conceptual modelling. It is based on the Unified Foundational Ontology (UFO) and is constructed as a Unified Modeling Language (UML) Profile using UML class diagram notation. The aim behind its creation was to establish a unified language for developing ontologically correct conceptual models (Guizzardi, 2005b).

UFO (Guizzardi et al., 2015) is an ontology that resulted from an analysis of conceptual modelling languages, carried out with the aim of developing an ontological foundation for these languages. The research was motivated by the notion that an explicit definition of fundamentals and adherence to a certain ontological commitment are crucial for conceptual modelling. Any attempt to develop foundations for conceptual modelling should take into account human knowledge and linguistic capabilities. In order to provide explicit definitions of entities and their relationships, which are crucial for obtaining a valid domain description, UFO is enriched with theories from philosophical formal ontologies, cognitive sciences, linguistic logic, and philosophical logic (Guizzardi et al., 2015).

3 DATA MANAGEMENT PLANS

A DMP is essential for ensuring effective data man-

agement throughout the lifespan of a project. It de-

scribes the lifecycle of the data generated or collected,

detailing how it will be managed and ensuring its fu-

ture usability and accessibility. Having such a plan

is essential for (research) data management (RDM),

offering important insights into the origins, usage,

and availability of data. Recently, the importance of

data management planning has increased, especially

among funding bodies and scientiﬁc institutions, to

ensure that data remains useful beyond the original

project. Good data management practices improve

data collection accuracy, secure storage, and proper

handling, enhancing data’s value and relevance across

various research ﬁelds (Smith, 1998).

Several resources are critical in deﬁning the ter-

minology used in a DMP, whether for human or

machine readability. These resources include exist-

ing DMP templates, knowledge models from various

DMP tools, and data management-related ontologies.

To achieve the G1 goal, it was essential to select and

thoroughly analyse these resources.

3.1 DMP Templates

A DMP is commonly structured using a template to

ensure all essential components are covered. How-

ever, certain sections may be adapted based on the

speciﬁc project, funding source, or organization.

For this work, the Horizon Europe Template (European Commission, 2020), the National Institutes of Health Data Management and Sharing Plan Template (National Institutes of Health, 2023), and, most importantly, the Science Europe Template (Science Europe, 2021) were selected due to their widespread adoption on a global scale. The concepts in these templates recur, although in different contexts, which made the analysis of additional templates unnecessary. The Science Europe Template (Science Europe, 2021) was the main source of knowledge because it aligns European DMP templates from various domains. This template covers core requirements that can be adjusted to specific needs, ensuring all essential components are addressed.

3.2 DSW Knowledge Models

One of the existing tools designed to help create and manage DMPs is the Data Stewardship Wizard (DSW) (Pergl et al., 2019), recommended as an interoperability resource by ELIXIR and several other institutions like UB-BOTT (Universities of Norway).

KEOD 2024 - 16th International Conference on Knowledge Engineering and Ontology Development


This tool uses questionnaires structured by so-called knowledge models, which define specific questions and their interconnections and thus offer a comprehensive perspective on the terms relevant to our work.

For the use of the DSW in the context of DMPs,

several predeﬁned knowledge models are available.

These models are based on a mind map developed

by Rob Hooft (Hooft, 2019). Although primarily fo-

cused on the natural sciences, the insights from this

mind map are applicable to other domains as well.

Among these models, the fundamental one derived

from Rob Hooft’s mind map is called the Common

DSW Knowledge Model (DSW Team, 2018).

3.3 DMP-Related Ontologies

Numerous ontologies related to DMPs were analysed in (Martínková and Suchánek, 2023b), where nine ontologies were examined to identify overlaps and interconnections. In this work, some of these ontologies were used as dictionaries to understand the usage and context of various terms, leveraging the provided overview model (Martínková and Suchánek, 2023a).

As highlighted in (Martínková and Suchánek, 2023b), different terms are often used to define the same concept even within DMP-related ontologies. However, in the case of ontologies, the terminology is a secondary concern as meaning is established through relationships with other classes.

4 FAIR PRINCIPLES

The FAIR principles (Wilkinson et al., 2016) were created in connection with the conference Jointly Designing a Data FAIRPORT. The result of this event was an agreement to create principles ensuring the Findability, Accessibility, Interoperability, and Reusability of data.

These principles do not prescribe specific implementation methods or technologies. Instead, they serve as a set of guidelines or approaches to achieve data reusability and accessibility. This openness in implementation can lead to inconsistencies in interpreting these principles. As noted in (Jacobsen et al., 2020), this can result in potentially incompatible implementations, which contradicts the original intent of the FAIR principles. To address this, the FAIR authors provided further explanations (Jacobsen et al., 2020) of the intended interpretations and implementation considerations for each principle.

In response to the ongoing controversy, an On-

toUML model (Bernasconi et al., 2023) was devel-

oped to address the issues surrounding the interpre-

tation of the FAIR principles. This model aims to

clarify any ambiguities and uncertainties within the

FAIR principles and provide guidelines for design-

ing a dataset’s FAIR classiﬁcation based on a detailed

analysis of these principles.

The model is divided into three parts, covering

Findability and Interoperability, Accessibility, and

Reusability. It addresses the overall concept of the

FAIR principles as well as each sub-principle. The

core of the model consists of Data, with their content

described as Data Items, and their Metadata. This is

supplemented with additional concepts to clarify and

describe each sub-principle. The model employs an

undeﬁned relation called externalDependence, which

does not align with OntoUML deﬁnitions. However,

as we understand it, this relation makes sense within

the context of the model. For our purposes, we re-

tained this relation in the original model, but we did

not use it in our extension due to our uncertainty

about its proper deﬁnition and usage. Additionally,

the model incorporates navigable relations, which are

not supported by standard OntoUML; therefore, we

have omitted them.

5 OUR APPROACH

To achieve the study’s objectives, we initiated the de-

velopment of the DMP OntoUML model by identi-

fying the parts and concepts within the scope of the

DMPs that the model must encompass. As described in Section 3, various resources play a crucial role in defining the terminology used in DMPs.

These resources were analysed to compile a compre-

hensive list of the necessary parts and concepts, de-

tailed in Section 5.1. After establishing what needs

to be included, each component was thoroughly anal-

ysed and incorporated into the OntoUML model. This

effort aimed to meet the G2 goal and link the DMP

components of the model to the existing FAIR model,

as described in Section 5.2. Finally, the model was

validated to ensure syntax correctness, as stated in the

G3 goal, conﬁrming its accuracy and completeness in

representing the DMP, as described in Section 5.3.

5.1 Resource Analysis and Concept

Identiﬁcation

In order to determine all of the concepts needed for the model, we analysed various resources, including DMP-related ontologies, the knowledge models that serve as a knowledge base in DSW, and DMP templates, as described in Section 3.

Using DMP templates and knowledge models

from DSW, we determined the primary components


of the domain. Below is a list of the main parts

that need to be addressed according to our analysis.

Our list includes all the essential details required by

the Science Europe Template (Science Europe, 2021),

which was created to align and cover requirements

from various European DMP templates used by dif-

ferent funders and institutions. The model captures

each item in detail, including concepts that might not

be explicitly listed here.

• Administrative Information includes the project's name, identifier, and start and end dates.

• Funding Information covers the involved funders, their identification, and the resources they offer.

• Research Process covers data creation, reuse (including relevant considerations), and preservation. The Common DSW Knowledge Model (DSW Team, 2018) also includes data processing and interpretation; however, these aspects were excluded from our model as they detail automated steps, their compute environment, and visualizations, which are beyond the scope of this model. Nevertheless, these elements could be documented within the general resource documentation.

• Data Preservation deals with long-term strategies for data preservation and ongoing accessibility beyond the project lifecycle. According to (DSW Team, 2018), data preservation is a part of the research process.

• Data and Their Roles in the FAIR Context includes descriptions of the data itself, datasets (including their format, value, purpose, identification, etc.), and their metadata.

• Personal Data covers related concepts such as informed consent and its potential reuse, accessibility to personal data, and the committee overseeing personal data.

• Distribution includes details about the data repository and the distribution itself.

• Access and Reuse Requirements includes authorization processes and tool requirements for accessing and reusing data.

• Cost includes all necessary resources during the project, especially concerning their availability, reusability, and preservation.

• Compliance focuses on adherence to legal and ethical standards, including GDPR and other relevant regulations.

• Members Engagement encompasses the involvement of various members in data management processes.

• Data Quality Assurance includes procedures and criteria for ensuring the accuracy, reliability, and validity of the data.

5.2 Implementation of the Model

As a foundational model for our development, we used the aforementioned FAIR model (Bernasconi et al., 2023) and extended it with concepts from the DMP domain. We aimed to align with the FAIR model by retaining all the core classes unchanged and distinguishing them in grey in the developed model (Martínková et al., 2024). The main part of the FAIR model comprises DATA, METADATA, and GROUND DATA (see Figure 1).

[Figure 1, class diagram: «collective» Data; «subkind» Ground Data; «subkind» Metadata; generalization set {disjoint, complete}; two «externalDependence» relations labelled "is metadata of".]

Figure 1: Data Representation within the FAIR model (Bernasconi et al., 2023).

In the FAIR model, DATA represents the dataset, while GROUND DATA represents data that cannot serve as metadata and therefore can be collected into a dataset. In DMPs, the terms datasets and data are often used ambiguously. For instance, the Horizon Europe Template (European Commission, 2020) includes the following question (in Section 3.1 of the template): "Will all data be made openly available? If certain datasets cannot be shared. . ." As seen, these two terms are used interchangeably within a single question, referring to the same concept with different terminology. According to (U.S. Geological Survey, nd), data and datasets are distinguished hierarchically: data, such as measurements or observations, can be organized into a structured collection, forming a dataset.

In the context of DMPs, the term dataset is usually preferred because a DMP typically refers to a structured collection of data that is currently being created or is reused from a previous creation process. While it is perfectly acceptable to use the term data, it should be used accurately, as this term is often overused. The absence of a dataset class in the FAIR model may seem confusing; however, the authors of the FAIR model have effectively addressed the mentioned nuances by distinguishing DATA and GROUND DATA.
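The distinction can be illustrated with a short Python sketch. It is only an approximation under our own assumptions: the class and attribute names are illustrative and not part of the FAIR model, the {disjoint, complete} partition is imitated by forbidding direct instances of `Data`, and the "is metadata of" relation is reduced to a plain list attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Data:
    """A dataset in the sense of DATA: a named collection of items."""
    title: str
    items: list = field(default_factory=list)

    def __post_init__(self):
        # 'complete': every Data instance must belong to one of the subkinds.
        if type(self) is Data:
            raise TypeError("instantiate GroundData or Metadata instead")


@dataclass
class GroundData(Data):
    """Data that does not serve as metadata."""


@dataclass
class Metadata(Data):
    """Data that describes other data ('is metadata of')."""
    describes: list = field(default_factory=list)  # the Data being described


measurements = GroundData("temperature readings", [21.5, 22.1])
description = Metadata("readings description", ["unit: degrees Celsius"],
                       describes=[measurements])
print(type(measurements).__name__, type(description).__name__)
```

Because each object has exactly one class, an instance cannot be both `GroundData` and `Metadata`, which mirrors the disjointness of the two subkinds.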

The following sections explore the most con-

tentious areas within the domain, focusing on the

use and interpretation of terminology. These areas

present challenges that require careful consideration

to achieve a consistent and accurate understanding.

5.2.1 Data Access Evaluation

Data accessibility in DMP templates typically in-

cludes a detailed overview of the requirements neces-

sary to access the data. This encompasses authoriza-

tion protocols, authentication mechanisms, and the

tools or instruments needed for data retrieval. Addi-

tionally, it outlines any potential restrictions that may

be applied to data accessibility. Furthermore, these

templates often describe the procedures for validating

access requests through an access committee that en-

sures compliance with policies and regulations.

As accessibility is one of the FAIR principles, it is

a core aspect covered by the FAIR model (Bernasconi

et al., 2023). As shown in Figure 2, data accessi-

bility requirements are supplemented by the neces-

sary instruments and, most importantly, by the ac-

cess request itself and its evaluation process. Typi-

cally, DMP templates inquire about the presence of a

data access committee responsible for validating ac-

cess requests based on the established data accessibil-

ity requirements. For instance, the Horizon Europe

Template (European Commission, 2020) includes the

question: Is there a need for a data access committee? Further details are more thoroughly elaborated in the DMP model (Martínková et al., 2024).
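The evaluation of an access request against the data accessibility requirements can be illustrated with a short Python sketch. All names (`AccessRequest`, `AccessibilityRequirement`, `evaluate`) and the two example requirements are hypothetical; they are not classes of the FAIR or DMP model, and a real data access committee would apply its own policies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AccessRequest:
    """A request to access a dataset (field names are illustrative)."""
    requester: str
    dataset: str
    authenticated: bool
    stated_purpose: str


@dataclass(frozen=True)
class AccessibilityRequirement:
    """One data accessibility requirement, expressed as a predicate."""
    description: str
    is_met_by: Callable[[AccessRequest], bool]


def evaluate(request, requirements):
    """Committee-style evaluation: list the requirements the request fails."""
    return [r.description for r in requirements if not r.is_met_by(request)]


requirements = [
    AccessibilityRequirement("requester is authenticated",
                             lambda q: q.authenticated),
    AccessibilityRequirement("purpose is stated",
                             lambda q: bool(q.stated_purpose.strip())),
]
request = AccessRequest("alice", "survey-2024",
                        authenticated=True, stated_purpose="")
unmet = evaluate(request, requirements)
print("approved" if not unmet else f"rejected: {unmet}")
```

Expressing each requirement as a predicate keeps the evaluation step independent of the concrete requirements, which differ between datasets and projects.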

5.2.2 Data Availability

One term that is often incorrectly interchanged with accessibility is availability, and the domain of DMPs is no exception. In the FAIR model, accessibility is accurately captured as a role of DATA, which is fitting given that accessibility is one of the core principles.

In (Cambridge University Press, nda), accessibility is defined as the ability to be reached or obtained easily, whereas availability is defined in (Cambridge University Press, ndb) as the fact that something can be reached. In the context of data, available means that the data can be reached, without specifying by whom or how, or even whether anyone currently has the ability to access them; they are simply reachable somewhere. Accessible data, on the other hand, means that we know how to reach them and who can access them, even if they are not fully open. In other words, the fact that data are available does not ensure that they are accessible to a certain type of user.

[Figure 2, class diagram: «relator» Data Accessibility Requirements; «subkind» Data Access Protocol; «role» Instrument Required for Access; «category» Resource; «kind» Protocol; «kind» Instrument; «collective» Data; «role» Available Data; «role» Accessible Data; «relator» Access Request; «relator» Evaluation of Access Request; «collective» Data Access Committee; «kind» Project; «role» Project member; «kind» Person. Relations: «mediation», «memberOf», and «externalDependence» (one labelled "requires").]

Figure 2: Part of the DMP model (Martínková et al., 2024) describing data access.

To accurately capture availability in the model as

a role of DATA, we need to determine the criteria that

establish this availability. What serves as the evidence

or proof of the data’s availability? According to the

DCAT ontology (Albertoni et al., 2020), the class for

distribution dcat:Distribution is deﬁned as an avail-

able dataset. Therefore, the presence of an existing

distribution acts as a strong indicator, or a “truth-

maker”, that the data is available. This means that

if a dataset has been distributed, it can be considered

available, as the distribution itself serves as a witness

to the dataset’s availability. The connection between

available data and their distribution is included in the

DMP model, see Figure 3.

To relate the roles or phases of data to one another, it is essential to identify any dependencies or hierarchical relationships between them. From the definitions of accessibility and availability, we have already deduced that there is a hierarchical relationship: data must be available before it can be considered accessible.

The relationship between reusable and available

data is similarly important. As noted in (Yoon et al.,

2017), data availability is a prerequisite for reuse. The

part of the model that captures these relationships is

illustrated in Figure 3.
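The dependencies described above (a distribution witnesses availability, and availability is a prerequisite for both accessibility and reuse) can be sketched in Python. This is a minimal illustration under our own assumptions, not part of the DMP model: the class `Dataset` and its attributes are hypothetical, accessibility is approximated by the presence of stated accessibility requirements, and reusability by the presence of a usage licence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """Illustrative stand-in for DATA; attribute names are assumptions."""
    name: str
    distributions: List[str] = field(default_factory=list)  # e.g. repository URLs
    accessibility_requirements: List[str] = field(default_factory=list)
    usage_licence: Optional[str] = None

    @property
    def is_available(self) -> bool:
        # An existing distribution acts as the "truth-maker" of availability.
        return len(self.distributions) > 0

    @property
    def is_accessible(self) -> bool:
        # Accessible data must be available and have known access conditions.
        return self.is_available and len(self.accessibility_requirements) > 0

    @property
    def is_reusable(self) -> bool:
        # Availability is a prerequisite for reuse; a licence is assumed as well.
        return self.is_available and self.usage_licence is not None


survey = Dataset("survey-2024", accessibility_requirements=["authenticated user"])
print(survey.is_available, survey.is_accessible)  # before any distribution exists
survey.distributions.append("https://repository.example.org/survey-2024")
print(survey.is_available, survey.is_accessible, survey.is_reusable)
```

Deriving the three roles from the state of the dataset, rather than storing them as flags, keeps the hierarchy consistent: a dataset without a distribution can never be reported as accessible or reusable.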

[Figure 3, class diagram: «relator» Added to Repository; «relator» Data Usage Licence; «relator» Data Accessibility Requirements; «collective» Data; «role» Accessible Data; «role» Reusable Data; «role» Available Data; «kind» Repository; «mode» Distribution. Relations: «mediation», «characterization».]

Figure 3: Part of the DMP model (Martínková et al., 2024) describing availability of data.

5.2.3 Project Research

Another crucial aspect that needed detailed descrip-

tion is the research phase of the project, which in-

volves the activities where data are collected. These

research activities are directly tied to the project’s ob-

jectives, as illustrated in Figure 4. An important part

of research is the reuse, or more broadly, the consid-

eration of reusing data. Questions about data consid-

eration for reuse and actual reuse are typically part of

DMP templates, with reuse consideration hierarchi-

cally above the actual reuse, as depicted in Figure 4.

Data creation is included, but reused data that require harmonization are a special case. When data are reused and need to be harmonized, the new data resulting from the harmonization can be considered as created. Therefore, CREATED DATA has a subrole HARMONIZED REUSED DATA.

According to (DSW Team, 2018), data preservation and ongoing accessibility beyond the project lifecycle are also a part of the research process. Therefore, one of the research activities is DATA PRESERVATION, which is divided into preservation during the project and preservation after the project.

5.3 Veriﬁcation and Validation

To verify the model's syntax, we used the OpenPonk platform, an open-source tool for conceptual modeling, diagram development, and simulations (Uhnák and Pergl, 2016). The platform offers numerous extensions and modules for standardized notations, such as OntoUML; its OntoUML extension includes a comprehensive framework for verifying OntoUML models (Bělohoubek, 2019). This extension ensures that all defined entities and relationships adhere to the specified rules of the OntoUML language. Additionally, it includes automatic detection of OntoUML anti-patterns (Bělohoubek, 2021), which identifies suspicious structures within the model that typically indicate errors. Using OpenPonk's OntoUML extension, we conducted a thorough verification of the model, thereby achieving the verification part of our goal G3. This involved systematically checking each entity and relationship to ensure they were correctly defined and aligned with OntoUML standards.
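To give an impression of the kind of rule such a verification checks, the following Python sketch tests two simplified constraints on a toy model fragment: a «subkind», «role», or «phase» should specialize exactly one identity-providing type, and a «mediation» should involve a «relator». This is not OpenPonk's implementation or API; the function names, the dictionary-based model encoding, and the chosen set of identity-providing stereotypes are our assumptions, and the fragment (including the deliberately invalid `Orphan Role`) is only loosely based on Figures 2 and 3.

```python
# Assumed set of stereotypes that supply an identity principle.
IDENTITY_PROVIDERS = {"kind", "collective", "quantity", "relator", "mode", "quality"}
SPECIALIZING_SORTALS = {"subkind", "role", "phase"}


def ancestors(name, parents):
    """All transitive supertypes of a class (parents: name -> supertypes)."""
    seen, stack = set(), list(parents.get(name, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def check_model(stereotypes, parents, mediations):
    """Return a list of human-readable rule violations."""
    problems = []
    for name, stereotype in stereotypes.items():
        if stereotype in SPECIALIZING_SORTALS:
            providers = [a for a in ancestors(name, parents)
                         if stereotypes.get(a) in IDENTITY_PROVIDERS]
            if len(providers) != 1:
                problems.append(f"{name}: expected exactly one identity "
                                f"provider, found {len(providers)}")
    for source, target in mediations:
        involved = ({source, target}
                    | ancestors(source, parents) | ancestors(target, parents))
        if not any(stereotypes.get(c) == "relator" for c in involved):
            problems.append(f"mediation {source} - {target}: no relator involved")
    return problems


stereotypes = {
    "Data": "collective",
    "Available Data": "role",
    "Accessible Data": "role",
    "Data Accessibility Requirements": "relator",
    "Orphan Role": "role",  # deliberately invalid: no supertype at all
}
parents = {"Available Data": ["Data"], "Accessible Data": ["Available Data"]}
mediations = [
    ("Data Accessibility Requirements", "Accessible Data"),
    ("Available Data", "Data"),  # deliberately invalid: no relator
]
problems = check_model(stereotypes, parents, mediations)
for problem in problems:
    print(problem)
```

A real verifier covers many more constraints and anti-patterns; the sketch only shows that such rules can be expressed as checks over stereotypes, generalizations, and relations.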

To ensure the model captures all required elements

of the DMP domain, we used the Science Europe Tem-

plate (Science Europe, 2021) as a base for our pro-

posal. This template consolidates requirements from

various European DMP templates, providing a com-

prehensive standard.

To further verify the completeness and accuracy of the proposed model, we created an instantiation model (Martínková et al., 2024) using an existing DMP that adheres to the Science Europe Template.

We began by identifying relevant concepts in the

DMP text, ensuring they aligned with the concepts in

our proposed model and that our model contained all

the required concepts. We then constructed an instan-

tiation model based on these identiﬁed concepts. This

approach allowed us to validate our model against a

real-world example.
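The comparison step of this validation, checking whether every concept identified in a DMP has a counterpart in the model, can be sketched as follows. The sketch does not describe how the instantiation model was actually built; the function names are hypothetical, the model concepts are a small sample taken from Figures 2 and 3, and the DMP terms are invented for illustration.

```python
def normalise(term):
    """Case- and whitespace-insensitive form of a concept name."""
    return " ".join(term.lower().split())


def check_coverage(dmp_terms, model_concepts):
    """Split the terms found in a DMP into covered and missing concepts."""
    known = {normalise(concept) for concept in model_concepts}
    covered = [t for t in dmp_terms if normalise(t) in known]
    missing = [t for t in dmp_terms if normalise(t) not in known]
    return covered, missing


# Sample of model concepts (from Figures 2 and 3) and invented DMP terms.
model_concepts = ["Data Access Committee", "Repository", "Distribution",
                  "Data Usage Licence", "Project"]
dmp_terms = ["repository", "data usage licence", "Backup schedule"]

covered, missing = check_coverage(dmp_terms, model_concepts)
print("covered:", covered)
print("missing:", missing)
```

Exact name matching is a strong simplification: as discussed earlier, the same concept often appears under different terms, so in practice each term has to be aligned with the model by its meaning rather than by its label.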

6 CONCLUSION

In this study, we addressed the critical issue of inconsistent terminology and incomplete information in DMPs by developing an OntoUML conceptual model. We analysed existing DMP templates, knowledge models used in the DSW tool, and ontologies related to DMPs in order to specify the concepts that need to be covered, in accordance with G1. To accomplish G2, we selected concepts with their relations and developed an OntoUML conceptual DMP model extending the existing FAIR model.

To ensure the model’s accuracy and completeness,

we conducted a thorough validation using the Open-


Diagram elements of the research part of the DMP model (extracted from the figure): «role» Created Data; «collective» Data; «role» Available Data; «role» Reusable Data; «role» Data Considered for Reuse; «role» Reused Data; «subkind» Data Reuse; «subkind» Data Reuse Consideration; «role» Harmonized Reused Data; «subkind» Harmonization of Reused Data; «subkind» Data Creation; «kind» Project; «kind» Objective; «relator» Research Activity; «subkind» Data Preservation; «subkind» Data Preservation after Project; «subkind» Data Preservation during Project;

«kind»