Trust the Data You Use: Scalability Assurance Forms (SAF) for a

Holistic Quality Assessment of Data Assets in Data Ecosystems

Maximilian St

abler

1 a

, Tobias M

uller

2 b

, Frank K

oster

and Chris Langdon

German Aerospace Center (DLR) - Institute for AI Safety and Security, Ulm, Germany

SAP SE, Walldorf, Germany

Drucker School of Business, Claremont Graduate University, Claremont, U.S.A.

Keywords:

Knowledge Graphs, Data Asset Quality, AI Systems Integration, Scalability Assurance Forms (SAF).

Abstract:

Companies generate terabytes of raw, unstructured data daily, which requires processing and organization to

become valuable data assets. In the era of data-driven decision-making, evaluating these data assets’ quality is

crucial for various data services, users, and ecosystems. This paper introduces ”Scalability Assurance Forms”

(SAF), a novel framework to assess the quality of data assets, including raw data and semantic descriptions,

with essential contextual information for cross-domain AI systems. The methodology includes a comprehen-

sive literature review on quality models for linked data and knowledge graphs, and previous research ﬁndings

on data quality. The SAF framework standardizes data asset quality assessments through 31 dimensions and

10 overarching groups derived from the literature. These dimensions enable a holistic assessment of data

set quality by grouping them according to individual user requirements. The modular approach of the SAF

framework ensures the maintenance of data asset quality across interconnected data sources, supporting reli-

able data-driven services and robust AI application development.The SAF framework addresses the need for

trust in systems where participants may not know or historically trust each other, promoting the quality and

reliability of data assets in diverse ecosystems.

1 INTRODUCTION

In the context of the exponential growth of Artiﬁcial

Intelligence (AI) and big data, the effective organiza-

tion and presentation of vast amounts of knowledge

have become crucial. Across various domains and

applications, the quality of data and its linked (meta-

) data descriptions are essential for making well-

informed, data-driven decisions. Different studies

unther et al., 2019; Loh et al., 2020; McCausland,

2021) highlight that due to diverse data processing

approaches, data quality and applicability cannot be

assumed to be uniform across different organizations

and applications. High-quality research and analysis

depend on reliable data (Arias et al., 2020), a concept

epitomized by the adage ”garbage in, garbage out”

(Kilkenny and Robinson, 2018). Although discus-

sions on Data Quality (DQ) appear relatively recent in

the literature, the concern with DQ is as longstanding

as the practice of data collection itself (Naroll et al.,

https://orcid.org/0000-0003-1311-3568

https://orcid.org/0000-0002-9088-5054

1961; Jensen et al., 1986).

To address these challenges, it is crucial to con-

sider a holistic assessment of data assets, encom-

passing both the structure provided by Knowledge

Graphs (KGs) and the quality of the raw data itself.

A Data Asset (DA) refers to any organized collection

of data used for business monitoring and decision-

making, distinguishing it from unorganized raw data

without immediate use (NIST, 2020). KGs, based

on Linked Data (LD) principles, promote the pub-

lication and linking of data in a machine-readable

format using web standards, enabling interoperabil-

ity and reuse across organizational silos (Radulovic

et al., 2017). However, focusing solely on the struc-

ture without considering the intrinsic quality of the

raw data can lead to misleading conclusions and sub-

optimal decision-making. Likewise, high-quality raw

data without a coherent structure lacks the context

necessary for comprehensive analysis.

Therefore, a balanced and integrated approach to

assessing data assets is necessary. This paper intro-

duces ”Scalability Assurance Forms” (SAF), a novel

framework designed to evaluate the quality of data as-

Stäbler, M., Müller, T., Köster, F. and Langdon, C.

Trust the Data You Use: Scalability Assurance Forms (SAF) for a Holistic Quality Assessment of Data Assets in Data Ecosystems.

DOI: 10.5220/0012915900003825

In Proceedings of the 20th International Conference on Web Information Systems and Technologies (WEBIST 2024), pages 199-208

ISBN: 978-989-758-718-4; ISSN: 2184-3252

199

sets, including raw data and their semantic descrip-

tions, with essential contextual information for cross-

domain AI systems. The SAF framework standard-

izes data asset quality assessments through 31 dimen-

sions and 10 overarching groups derived from the lit-

erature, enabling a holistic assessment that aligns with

individual user requirements. This approach ensures

the maintenance of data asset quality across intercon-

nected data sources, supporting reliable data-driven

services and robust AI application development. By

addressing the need for trust in systems with diverse

participants, the SAF framework promotes the quality

and reliability of data assets in varied ecosystems.

Research Questions (RQs). The goal of this re-

search is to analyze existing methods for assessing the

quality of structured data in order to identify needed

data in an opaque ecosystem. To achieve this goal, we

aim to answer how we can holistically evaluate data

by including DI and SI. Thereby, we formulated the

subsequent RQs:

• RQ1: What are the common quality dimensions

between raw data and Knowledge Graphs?

• RQ2: How can these dimensions be used to holis-

tically and individually assess existing data as-

sets?

By answering the formulated RQs, we formulate Scal-

ability Assurance Forms (SAF), a novel framework to

holistically assess the quality of data assets that in-

clude common data quality dimensions as formulated

by ISO 25012 and KG-speciﬁc quality dimensions.

Thereby, our contributions are four-fold:

• Introduction of SAF as a novel framework for or-

chestrating and assessing DAQ for raw data and

knowledge graphs.

• Development of a holistic evaluation approach for

DI and SI to ensure the quality and scalability of

AI systems.

• Facilitation of integration and analysis through

standardized DAQ assessments, reducing redun-

dancy and ensuring data integrity.

• Provision of customization to individual user re-

quirements, which is particularly important in in-

terconnected data ecosystems to support the relia-

bility of data-driven services.

In the following, we will ﬁrst provide information

on the required theoretical background (Chapter 2)

on data quality standards, data ecosystems, linked

data, and knowledge graphs. Subsequently, we de-

scribe our methodology (Chapter 3) and resulting

SAF (Chapter 4). We conclude in Chapter 5 by dis-

cussing and recapitulating our study.

2 THEORETICAL BACKGROUND

Building on this foundation, the subsequent sections

of this paper will elaborate on a holistic approach to

Data Asset Quality (DAQ) assessment, categorized

into Data Indicators (DI) and Semantic Indicators

(SI). These categories are devised to provide a com-

prehensive framework for evaluating the robustness

of datasets within KGs and across AI systems. The

DI and SI were derived based on the literature review

detailed in Section 3. Current research (Zaveri et al.,

2015; Wang et al., 2021; Radulovic et al., 2017) em-

phasizes the importance of this distinction, highlight-

ing that separate evaluation of intrinsic data quality

and semantic richness is critical for effective data uti-

lization in AI and linked data applications.

• Data Indicators (DI) focus on the intrinsic quality

of raw data, assessing aspects such as accuracy,

completeness, and consistency. For example, in a

healthcare dataset, a DI might evaluate the preci-

sion of diagnostic codes and the presence of com-

plete patient records. This ensures that founda-

tional data used in AI algorithms is reliable, miti-

gating risks associated with poor DQ.

• Semantic Indicators (SI) pertain to the semantic

descriptions of datasets, encompassing structured

interlinking and contextual relevance. These indi-

cators evaluate how effectively data is described

and linked, similar to metadata or Linked Data

(LD) standards, enhancing discoverability and us-

ability. For instance, in a scholarly database, SI

assesses the clarity and correctness of metadata,

inﬂuencing data integration and retrieval across

platforms.

DI would verify the accuracy, completeness, and syn-

chronicity of bus departure times in the public trans-

port domain, ensuring timestamps are precise and

consistently formatted. This reliability is critical for

AI systems in route optimization and predictive mod-

eling.

SI would examine the semantic richness of the

dataset, ensuring that each departure time is ade-

quately described with contextually relevant meta-

data. This may include Resource Description Frame-

work (RDF) annotations linking each timestamp to

corresponding route identiﬁers, bus capacities, acces-

sibility features, or integration with real-time trafﬁc

conditions. By embedding this semantic layer, the

dataset goes beyond simple planning to provide com-

prehensive information that can integrate seamlessly

with smart city infrastructures and deliver insightful,

actionable information to end users.

Together, these indicators form the backbone of

our methodology, addressing the dual aspects of DQ

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

200

(Kilkenny and Robinson, 2018) and semantic richness

(Zaveri et al., 2015; Wang et al., 2021) to enhance

the utility and reliability of data-asset-driven systems

(Radulovic et al., 2017). This integrated assessment

approach aligns with the strategic goals of semantic

interoperability and ensures that data and its contex-

tual framework are optimized for cross-domain appli-

cations.

Data Quality Standards and ISO Standard 25012.

Within the ISO Standard 25012

, dimensions are de-

ﬁned as distinct aspects of DQ that can be measured

and assessed independently. By differentiating these

aspects, the standard delineates a general DQ model

for data in a structured format within a data-driven

system, emphasizing quality dimensions for target

data used by humans and systems. It categorizes DQ

requirements and measures aligned with these dimen-

sions, enabling an evaluation process to analyze data

independently from other components of the com-

puter system. Our approach adopts these established

dimensions as a template to guide our investigation,

ensuring that our methodology aligns with recognized

standards and provides a robust basis for assessing

DQ in KGs and AI systems. Rather than diving

deeply into individual metrics, this strategic focus on

dimensions positions our research as a foundational

reference point, facilitating subsequent detailed stud-

ies to reﬁne these quality assessments. Building upon

the foundation of holistic DQ assessment through DI

and SI, it is crucial to note that quality in this con-

text is measured using speciﬁc dimensions qualiﬁed

through various metrics. Our work focuses on these

dimensions to lay the groundwork for future research,

as they are commonly deﬁned at the dimension level

in existing literature and ISO standards. Examining

the various metrics that can be employed to quantify

the different dimensions or to describe how to mea-

sure the different dimensions for different DAs is out-

side the scope of this study.

Existing DQ dimensions and standards, such as

ISO 25012 and ISO 8000-2, play a crucial role in

evaluating and assuring DQ in various contexts. ISO

25012, titled ”Data Quality Model,” provides a frame-

work for assessing data quality based on ﬁfteen key

dimensions, including accuracy, completeness, con-

sistency, and timeliness. These dimensions describe

various attributes of data that collectively determine

the overall quality. For instance, accuracy pertains

to the correctness of data, completeness refers to the

extent to which expected data is present, consistency

ensures data is accessible from contradictions, and

timeliness addresses the relevance of data at a given

https://www.iso.org/standard/35736.html

time. By differentiating these aspects, ISO 25012 pro-

vides a comprehensive framework for evaluating the

multifaceted nature of DQ within structured data sys-

tems. However, its limitations lie in its generality, as

it is not explicitly tailored to the complexities of KGs

or LD, which involve intricate relationships and se-

mantic structures. ISO 8000-2, known as the ”Data

Quality: Vocabulary” standard, focuses on deﬁning

terms and concepts related to DQ, aiming to create

a shared understanding and language for discussing

DQ issues. While it provides valuable terminolog-

ical clarity, it does not offer speciﬁc guidelines for

implementing quality assessments in dynamic and in-

terconnected data ecosystems. Both standards, while

foundational, do not fully address the unique chal-

lenges posed by the rapidly evolving ﬁelds of AI and

big data, where DQ needs to be evaluated in a holis-

tic and scalable manner, especially in federated and

distributed environments. To assess DQ, a data qual-

ity model (or framework) is typically established, de-

ﬁned by ISO 25012 as a ”deﬁned set of characteristics

which provides a framework for specifying data qual-

ity requirements and evaluating data quality.” These

characteristics (dimensions) encompass both quanti-

tative and qualitative assessments. ISO 25012 dis-

tinguishes between inherent DQ, which refers to the

intrinsic potential of data to meet quality needs, and

system-dependent DQ, which is inﬂuenced by the

technological environment. ISO 8000 deﬁnes three

meta-characteristics: syntactic quality, which pertains

to conformity to speciﬁed syntax; semantic quality,

which concerns the accurate representation of enti-

ties; and pragmatic quality, which relates to confor-

mance to usage-based requirements. These standards

provide a foundational basis for DQ assessment, yet

they fail to address the speciﬁc needs of emerging data

architectures (Zhang et al., 2021).

Data Ecosystems. An example of such distributed

environments are data ecosystems, a concept rapidly

materializing, particularly in Europe, embodying a

transformative approach to data management and use

(Otto et al., 2022). These ecosystems are designed to

give individuals and organizations greater sovereignty

over their data, embodying the principles of empow-

erment and control. Within these federated environ-

ments, data from multiple sources is brought together,

facilitating the creation of interoperable applications

that harness the collective power of shared informa-

tion. The anticipated value of such ecosystems lies

in their potential to streamline collaboration, drive

innovation, and improve the efﬁciency of services

across sectors (Theissen-Lipp et al., 2023). This new

paradigm aims to transcend traditional data silos and

Trust the Data You Use: Scalability Assurance Forms (SAF) for a Holistic Quality Assessment of Data Assets in Data Ecosystems

201

promote an open and dynamic exchange of data that

is securely accessible and usable within the broader

digital economy. As these ecosystems evolve, they

are expected to become key pillars in realizing a uni-

ﬁed digital marketplace, fostering economic growth

and digital autonomy (Otto et al., 2022). This re-

quires trust not only in the inherent quality of the data

but also in the descriptions, context, and semantics

accompanying the data (Theissen-Lipp et al., 2023).

Therefore, there is a growing need for a holistic ap-

proach to assessing the quality of data sets and data-

driven applications, particularly in the context of the

Semantic Web, where understanding the structure de-

pends on distinguishing between LD and KGs.

Linked Data and Knowledge Graphs. LD em-

ploys best practices using URIs and RDF for

machine-readable, interoperable data distribution on

the Web (Zaveri et al., 2015; Ji et al., 2022). KGs en-

hance LD by forming a graph-based knowledge base

with interconnected entities, enabling advanced ana-

lytics and AI applications (Ban et al., 2024; Pan et al.,

2017). KGs improve metadata quality, crucial for

accurate data descriptions and interoperable AI sys-

tems, thus enhancing reliability and trustworthiness

(Pan et al., 2024).

While quality models for KGs and LD exist, they

lack standardization and consensus on dimensions

and metrics (Zaveri et al., 2015; Radulovic et al.,

2017). The dynamic nature of LD requires innovative

assessment methods for scalable, high-quality data

exchange across systems (Zaveri et al., 2015).

3 METHODOLOGY

Our methodology is based on a Structured Literature

Review (SLR) and subsequent analysis of existing

frameworks. We derived new high-level DAQ dimen-

sions for DI and SI by anchoring our clustering pro-

cess to the ISO 25012 framework (ISO25012, 2008),

ensuring alignment with recognized standards.

Two scientists (RS1 and RS2) from different in-

stitutions independently conducted the SLR follow-

ing (Moher et al., 2010; Kitchenham, 2004) to mit-

igate bias. This approach identiﬁes open issues and

contributes to a common conceptualization. We sum-

marize established ISO 25012 dimensions and vari-

ous methods for evaluating KGs and LD, proposing

a framework for assessing data assets within a data

ecosystem.

Search Strategy. Following Kitchenham et al.

(2004), we deﬁned a search string based on keywords

Figure 1: Process of the systematic literature search.

from foundational literature (Stvilia et al., 2007; Ba-

tini and Scannapieco, 2006; Pernici and Scannapieco,

2003; Madnick et al., 2009). The search string used

was:

(”meta data” OR ”meta-data” OR ”meta-

data” OR ”knowledge graph” OR ”knowl-

edgegraph” OR ”knowledge-graph”) AND

(”quality model” OR ”quality framework” OR

”quality concept”)

Titles, abstracts, and full texts were ﬁltered using pre-

deﬁned inclusion and exclusion criteria, followed by

a backward search to identify further relevant studies.

This process, including the number of articles found

at each step, is detailed in Figure 1. We identiﬁed

395 papers, removing duplicates and including only

peer-reviewed, accessible, English or German studies

in computer science.

Subsequent steps focused on ﬁltering and reﬁn-

ing data assets. Both reviewers independently eval-

uated the titles of 127 articles and reviewed abstracts

to identify suitable studies, excluding those not fo-

cused on metadata or knowledge graphs or lacking a

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

202

Table 1: Presentation of the 31 articles identiﬁed as a result of the systematic literature search.

Title Source

A compendium and evaluation of taxonomy quality attributes (Unterkalmsteiner and Abdeen, 2024)

A comprehensive quality model for Linked Data (Radulovic et al., 2017)

A Data Quality Framework for Graph-Based Virtual Data In-

tegration Systems

(Li et al., 2022)

A Data Quality Scorecard to Assess a Data Source’s Fitness

for Use

(Grillo, 2018)

A Quality Framework for Data Integration (Wang, 2012)

A Quality Model for Linked Data Exploration (Cappiello et al., 2016)

A Quality Model for Mashups (Cappiello et al., 2011)

A Review on Data Quality Dimensions for Big Data (Ramasamy and Chowdhury, 2020)

A Semiotic Approach to Investigate Quality Issues of Open

Big Data Ecosystems

(Krogstie and Gao, 2015)

Architecture and quality in data warehouses (Jarke et al., 1999)

Big Data Quality Models: A Systematic Mapping Study (Montero et al., 2021)

Classiﬁcation of Knowledge Graph Completeness Measure-

ment Techniques

(Issa et al., 2021)

Data Infrastructures for Asset Management Viewed as Com-

plex Adaptive Systems

(Brous et al., 2014)

Data Quality Management in the Internet of Things (Zhang et al., 2021)

DQ Tags and Decision-Making (Price and Shanks, 2010)

EPIC: A Proposed Model for Approaching Metadata Im-

provement

(Tarver and Phillips, 2021)

Evolution of quality assessment in SPL: a systematic map-

ping

(Martins et al., 2020)

Exploiting Linked Data and Knowledge Graphs in Large Or-

ganisations

(Pan et al., 2017)

Information quality dimensions for the social web (Schaal et al., 2012)

KGMM - A Maturity Model for Scholarly Knowledge

Graphs Based on Intertwined Human-Machine Collaboration

(Hussein et al., 2022)

Knowledge Graph Quality Management: a Comprehensive

Survey

(Xue and Zou, 2022)

Knowledge Graphs: A Practical Review of the Research

Landscape

(Kejriwal, 2022)

Prioritization of data quality dimensions and skills require-

ments in genome annotation work

(Huang et al., 2012)

Quality assessment for Linked Data: A Survey: A systematic

literature review and conceptual framework

(Zaveri et al., 2015)

Quality Evaluation Model of AI-based Knowledge Graph

System

(Xu et al., 2021)

Quality factory and quality notiﬁcation service in data ware-

house

(Li and Osei-Bryson, 2010)

Quality model and metrics of ontology for semantic descrip-

tions of web services

(Zhu et al., 2017)

Rating quality in metadata harvesting (Kapidakis, 2015)

Towards a Critical Data Quality Analysis of Open Arrest

Record Datasets

(Wickett and Newman, 2024)

Towards a Data Quality Framework for Heterogeneous Data (Micic et al., 2017)

Towards a meta-model for data ecosystems (Iury et al., 2018)

Towards a Metadata Management System for Provenance,

Reproducibility and Accountability in Federated Machine

Learning

(Peregrina et al., 2022)

Trust the Data You Use: Scalability Assurance Forms (SAF) for a Holistic Quality Assessment of Data Assets in Data Ecosystems

203

methodology for quality assessment. Discrepancies

were resolved by consensus or detailed review, result-

ing in a ﬁnal list of 48 articles for RS1 and 43 for RS2.

A snowballing approach ensured comprehensive cov-

erage by checking references, using Google Scholar’s

”Cited by” feature, and searching for related articles.

This process identiﬁed 10 additional relevant articles.

Finally, 31 articles were selected, listed in Table 1.

4 SCALABILITY ASSURANCE

FORMS (SAF)

In this chapter, we present the results of our holistic

quality assessment framework. The sets of DAQ di-

mensions were derived from the extensive literature

review described in Section 3 and are shown in Fig-

ure 2. Throughout this process, we systematically

extracted and analyzed the quality dimensions men-

tioned in various papers. We consolidated these di-

mensions into the aforementioned groups through it-

erative clustering and synthesis, ensuring comprehen-

sive coverage and alignment with the dimensions de-

ﬁned in ISO 25012.

• Accessibility: The degree to which a DA is avail-

able and obtainable for use by authorized enti-

ties, ensuring that users can access the DA when

needed.

• Accuracy: The closeness of DA values to the true

values or accepted standard, reﬂecting the correct-

ness and precision of the data.

• Connectivity: The capability of a DA to be con-

nected and interlinked with other data sources, en-

hancing its usability and integration across sys-

tems.

• Integrity: The extent to which a DA is complete,

consistent, and free from unauthorized modiﬁca-

tion, ensuring its reliability and trustworthiness.

• Presentation: The clarity and interpretability of

a DA, including its format and structure, make it

comprehensible and usable by intended users.

• Relevance: The pertinence and applicability of a

DA to the context in which it is used, ensuring that

it meets the needs and requirements of users.

• Security: The protection of a DA against unau-

thorized access and breaches, ensuring conﬁden-

tiality, integrity, and availability of the data.

• Operational Efﬁciency: The degree to which a

DA supports effective and efﬁcient business op-

erations, including performance and process opti-

mization.

• Regulatory Compliance: The extent to which a

DA adheres to laws, regulations, and policies rel-

evant to its use and management, ensuring legal

and regulatory conformance.

• System Flexibility: The adaptability and main-

tainability of DA systems to accommodate

changes and evolving requirements, ensuring

long-term usability and scalability.

Each group contains different dimensions, which are

shown in different colours and shapes. The color dis-

tinguishes the dependency of the dimensions between

inherent, inherent and system-dependent.

• Inherent Quality refers to the inherent poten-

tial of a DA to satisfy both explicit and implicit

requirements under certain conditions, including

domain values, constraints, data-asset-value rela-

tionships, and metadata.

• System Dependent Quality depends on the tech-

nological capability of computer systems, includ-

ing hardware and software, to access a DA, main-

tain its accuracy, recover it, and facilitate its porta-

bility.

• Inherent and System Dependent Quality is a

hybrid dimension that recognizes the complexity

of DQ that arises both inherently and through sys-

tem interaction and requires a holistic approach to

assessment.

The form of the dimension distinguishes between

Data Indicators (DI), Semantic Indicators (SI), and

Hybrid Indicators (HI). The ﬁrst two dimensions were

introduced in Chapter 1. Hybrid indicators combine

DI and SI to assess the suitability of data for cross-

system use, applying to both data and semantic de-

scriptions.

Figure 2 shows groups with inherent quality di-

mensions (e.g., presentation) and system-dependent

dimensions (e.g., system ﬂexibility). Overall, these

groups align well with ISO 25012, except for ”system

ﬂexibility,” which lacks a corresponding group in the

ISO standard. We extracted 31 dimensions and 10 su-

perordinate groups from the literature, comprising 5

DI, 6 SI, and 20 HI, as well as 17 inherent, 9 system-

dependent, and 5 inherent and system-dependent di-

mensions. This selection allows users to choose di-

mensions relevant to their speciﬁc application. Clus-

tering quality dimensions and dependencies enables

more precise selection.

4.1 SAF Scores

The SAF scores are calculated to ensure comparabil-

ity between different DAs and to meet the individ-

ual needs of users and departments. These scores

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

204

Figure 2: Holistic Quality Assessment overview: The dimensions are classiﬁed under overarching groups, reﬂecting their

inherent and system-dependent qualities, and are further mapped onto the ISO 25012 standard.

allow users to prioritize whether SI or DI is more

important for their speciﬁc application, enabling cus-

tomized weighting.

The SAF scores are based on a systematic and

mathematically sound method, utilizing the dimen-

sional assignments from Figure 3. For each dimen-

sion, an appropriate metric is collected and calcu-

lated for the corresponding DA. The objective is to

combine the metrics for DI and SI such that SAF =

DI + SI. First, the mean score for each parent group

is calculated by averaging the scores of the underly-

ing features. Let c

be the score for the i

dimension

metric within a group and n be the total number of

dimension metrics in that group. The mean

C for the

group is given by

C =

∑

i=1

We then calculate the DI and SI values. For a dimen-

sion classiﬁed as a DI, labeled DI

and belonging to a

group with a mean value C, its calculated value V

= DI

·C

Similarly, for a dimension SI

identiﬁed as a semantic

indicator, the value V

is calculated using the same

formula. Each feature within the DI and SI groups is

subjected to this calculation, and the results are aggre-

gated to give the overall DI or SI score:

Total DI =

∑

Total SI =

∑

The SAF score is then the sum of the total DI and the

total SI.

To allow for the weighting of DI and SI values,

enabling users to prioritize dimensions according to

their importance, we introduce weight factors w

and w

for each dimension. The weighted values

and W

are calculated as follows:

= w

·V

= w

·V

The total weighted DI and SI scores are then:

Total Weighted DI =

∑

Total Weighted SI =

∑

Finally, the SAF score, incorporating the weights, is

calculated as the sum of the total weighted DI and the

total weighted SI:

SAF = Total Weighted DI + Total Weighted SI

Initially, the weight factors w

and w

are set to

1, ensuring balanced weighting when no user-deﬁned

weights are applied.

Figure 3 illustrates the methodological rigor of

the SAF framework with three scenarios: balanced,

semantics-centered, and data-centered evaluations.

Users can deﬁne the weighting to reﬂect the focus of

their application.

4.1.1 Example: Complex Data Asset Evaluation

Consider a complex DA distributed across multiple

sites, such as a healthcare data system integrating pa-

tient records from various hospitals. Each site collects

Trust the Data You Use: Scalability Assurance Forms (SAF) for a Holistic Quality Assessment of Data Assets in Data Ecosystems

205

Figure 3: The SAF assessment framework is a comprehensive approach for evaluating heterogeneous DA in three distinct

forms. The three diagrams illustrate the distribution of SAF levels based on the assessment focus: general (left), data-oriented

(center) and semantic-oriented (right). Furthermore, the framework is adaptable to the assessment priorities deﬁned by the

user and the granularity of the SAF grading. It is at the discretion of the user to determine the number of SAF levels and the

respective thresholds for these levels for DI and SI. Different DA are shown in green as examples; the X and Y values of the

DA are identical in all three diagrams.

data including patient diagnostics, treatment records,

and outcomes. The evaluation involves:

• Data Indicators (DI): Accuracy of diagnostic

codes, completeness of treatment records, consis-

tency of patient outcomes across sites.

• Semantic Indicators (SI): Clarity of metadata de-

scriptions, interlinking of patient records, contex-

tual relevance of treatment data.

For DI, metrics such as precision of diagnostic

codes (c

), completeness of records (c

), and consis-

tency (c

) are collected. Suppose C

for these met-

rics is calculated as:

+ c

)

Assuming DI

, DI

, and DI

are the weights for these

metrics, the total DI score is:

Total DI = DI

· c

+ DI

· c

+ DI

· c

For SI, metrics such as metadata clarity (c

), in-

terlinking (c

), and contextual relevance (c

) are col-

lected. Suppose C

for these metrics is:

+ c

)

Assuming SI

, SI

, and SI

are the weights for these

metrics, the total SI score is:

Total SI = SI

· c

+ SI

· c

+ SI

· c

Weight factors can be adjusted based on user pri-

orities. For instance, in a scenario prioritizing data ac-

curacy over metadata clarity, w

could be set higher

than w

. Finally, the SAF score, incorporating these

weights, is:

SAF =

∑

·V

∑

·V

5 DISCUSSION AND

CONCLUSION

In this paper, we developed the Scalability Assurance

Forms (SAF) framework, a comprehensive method

for assessing data asset quality in data ecosystems.

Grounded in ISO 25012, the SAF framework system-

atically integrates DI and SI to offer a holistic eval-

uation of data assets. This dual approach ensures

that both intrinsic DQ and contextual semantic rich-

ness are thoroughly addressed, which is essential for

the reliability and scalability of AI applications. The

SAF framework presents several advantages. It al-

lows users to prioritize dimensions according to their

importance through weight factors, offering a cus-

tomizable approach to DAQ assessment. This adapt-

ability is crucial for addressing the diverse needs of

different data-driven environments and ensures that

the quality assessments are both relevant and action-

able. Furthermore, by providing a structured method

for assessing data assets, the SAF framework supports

better decision-making and enhances the trustworthi-

ness of data used in various applications. The holis-

tic view offered by the SAF framework is crucial for

users, enabling them to make well-informed decisions

and select the most appropriate data assets from com-

plex data ecosystems.

However, there are limitations to the current

framework. One signiﬁcant challenge is the absence

of predeﬁned metrics for the various dimensions,

which often need to be individually deﬁned and tai-

lored to speciﬁc contexts. This process can be com-

plex and time-consuming, requiring extensive domain

expertise. Additionally, the ﬁeld of automated qual-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

206

ity assessment in data ecosystems is still in its early

stages, and further research is needed to develop ro-

bust methodologies and tools. Despite these limita-

tions, future research will focus on deﬁning speciﬁc

metrics for each dimension and developing a pro-

totype for automated quality assessment. This will

enhance the framework’s applicability and effective-

ness, providing users with more precise and action-

able quality assessments.

REFERENCES

Arias, V. B., Garrido, L. E., Jenaro, C., Mart

ınez-Molina,

A., and Arias, B. (2020). A little garbage in, lots

of garbage out: Assessing the impact of careless re-

sponding in personality survey data. Behavior Re-

search Methods, 52(6):2489–2505.

Ban, T., Wang, X., Chen, L., Wu, X., Chen, Q., and Chen,

H. (2024). Quality Evaluation of Triples in Knowl-

edge Graph by Incorporating Internal With External

Consistency. IEEE Transactions on Neural Networks

and Learning Systems, 35(2):1980–1992.

Batini, C. and Scannapieco, M. (2006). Data Quality: Con-

cepts, Methodologies and Techniques. Data-Centric

Systems and Applications. Springer, Berlin Heidel-

berg.

Brous, P., Overtoom, I., Herder, P., Versluis, A., and

Janssen, M. (2014). Data Infrastructures for Asset

Management Viewed as Complex Adaptive Systems.

Procedia Computer Science, 36:124–130.

Cappiello, C., Daniel, F., Koschmider, A., Matera, M., and

Picozzi, M. (2011). A Quality Model for Mashups.

In Auer, S., D

ıaz, O., and Papadopoulos, G. A., edi-

tors, Web Engineering, volume 6757, pages 137–151.

Springer Berlin Heidelberg, Berlin, Heidelberg.

Cappiello, C., Di Noia, T., Marcu, B. A., and Matera, M.

(2016). A Quality Model for Linked Data Exploration.

In Bozzon, A., Cudre-Maroux, P., and Pautasso, C.,

editors, Web Engineering, volume 9671, pages 397–

404. Springer International Publishing, Cham.

Grillo, A. (2018). Developing a Data Quality Scorecard that

Measures Data Quality in a Data Warehouse.

unther, L. C., Colangelo, E., Wiendahl, H.-H., and

Bauer, C. (2019). Data quality assessment for im-

proved decision-making: A methodology for small

and medium-sized enterprises. Procedia Manufactur-

ing, 29:583–591.

Huang, H., Stvilia, B., J

orgensen, C., and Bass, H. W.

(2012). Prioritization of data quality dimensions and

skills requirements in genome annotation work. Jour-

nal of the American Society for Information Science

and Technology, 63(1):195–207.

Hussein, H., Oelen, A., Karras, O., and Auer, S. (2022).

KGMM – A Maturity Model for Scholarly Knowl-

edge Graphs based on Intertwined Human-Machine

Collaboration.

ISO25012 (2008). ISO/IEC 25012:2008.

Issa, S., Adekunle, O., Hamdi, F., Cherﬁ, S. S.-S., Dumon-

tier, M., and Zaveri, A. (2021). Knowledge Graph

Completeness: A Systematic Literature Review. IEEE

Access, 9:31322–31339.

Iury, M., Oliveira, L., Ribeiro, M., and L

oscio, B. (2018).

Towards a Meta-Model for Data Ecosystems.

Jarke, M., Jeusfeld, M. A., Quix, C., and Vassiliadis, P.

(1999). Architecture and quality in data warehouses:

An extended repository approach. Information Sys-

tems, 24(3):229–253.

Jensen, D., Wilson, T., Statistics, U. S. B. o. J., and Group,

S. (1986). Data Quality Policies and Procedures: Pro-

ceedings of a BJS/SEARCH Conference : Papers. U.S.

Department of Justice, Bureau of Justice Statistics.

Ji, S., Pan, S., Cambria, E., Marttinen, P., and Yu, P. S.

(2022). A Survey on Knowledge Graphs: Represen-

tation, Acquisition, and Applications. IEEE Trans-

actions on Neural Networks and Learning Systems,

33(2):494–514.

Kapidakis, S. (2015). Rating Quality in Metadata Harvest-

ing.

Kejriwal, M. (2022). Knowledge Graphs: A Practical

Review of the Research Landscape. Information,

13(4):161.

Kilkenny, M. F. and Robinson, K. M. (2018). Data qual-

ity: “Garbage in – garbage out”. Health Information

Management Journal, 47(3):103–105.

Kitchenham, B. (2004). Procedures for Performing System-

atic Reviews.

Krogstie, J. and Gao, S. (2015). A semiotic approach to in-

vestigate quality issues of open big data ecosystems.

In Liu, K., Nakata, K., Li, W., and Galarreta, D.,

editors, Information and Knowledge Management in

Complex Systems, pages 41–50, Cham. Springer In-

ternational Publishing.

Li, Y., Nadal, S., and Romero, O. (2022). A data qual-

ity framework for graph-based virtual data integration

systems. In Chiusano, S., Cerquitelli, T., and Wrem-

bel, R., editors, Advances in Databases and Informa-

tion Systems, pages 104–117, Cham. Springer Inter-

national Publishing.

Li, Y. and Osei-Bryson, K.-M. (2010). Quality factory and

quality notiﬁcation service in data warehouse. In Pro-

ceedings of the 3rd Workshop on Ph.D. Students in

Information and Knowledge Management, PIKM ’10,

pages 25–32, New York, NY, USA. Association for

Computing Machinery.

Loh, W.-Y., Zhang, Q., Zhang, W., and Zhou, P. (2020).

Missing data, imputation and regression trees. Statis-

tica Sinica, 30(4):1697–1722.

Madnick, S. E., Wang, R. Y., Lee, Y. W., and Zhu, H.

(2009). Overview and Framework for Data and In-

formation Quality Research. Journal of Data and In-

formation Quality, 1(1):1–22.

Martins, L. A., Afonso J

unior, P., Freire, A. P., and Costa,

H. (2020). Evolution of quality assessment in SPL: A

systematic mapping. IET Software.

McCausland, T. (2021). The Bad Data Problem. Research-

Technology Management, 64(1):68–71.

Trust the Data You Use: Scalability Assurance Forms (SAF) for a Holistic Quality Assessment of Data Assets in Data Ecosystems

207

Micic, N., Neagu, D., Campean, F., and Habib Zadeh, E.

(2017). Towards a Data Quality Framework for Het-

erogeneous Data.

Moher, D., Liberati, A., Tetzlaff, J., and Altman, D. G.

(2010). Preferred reporting items for systematic re-

views and meta-analyses: The PRISMA statement.

International Journal of Surgery, 8(5):336–341.

Montero, O., Crespo, Y., and Piatini, M. (2021). Big

Data Quality Models: A Systematic Mapping Study.

In Paiva, A. C. R., Cavalli, A. R., Ventura Mar-

tins, P., and P

erez-Castillo, R., editors, Quality of In-

formation and Communications Technology, volume

1439, pages 416–430. Springer International Publish-

ing, Cham.

Naroll, F., Naroll, R., and Howard, F. H. (1961). Position of

women in childbirth. American Journal of Obstetrics

and Gynecology, 82(4):943–954.

NIST, C. C. (2020). Data asset - Glossary | CSRC.

https://csrc.nist.gov/glossary/term/data asset.

Otto, B., Ten Hompel, M., and Wrobel, S., editors (2022).

Designing Data Spaces: The Ecosystem Approach to

Competitive Advantage. Springer International Pub-

lishing, Cham.

Pan, J. Z., Vetere, G., Gomez-Perez, J. M., and Wu, H.,

editors (2017). Exploiting Linked Data and Knowl-

edge Graphs in Large Organisations. Springer Inter-

national Publishing, Cham.

Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., and Wu, X.

(2024). Unifying Large Language Models and Knowl-

edge Graphs: A Roadmap. IEEE Transactions on

Knowledge and Data Engineering, pages 1–20.

Peregrina, J. A., Ortiz, G., and Zirpins, C. (2022). To-

wards a Metadata Management System for Prove-

nance, Reproducibility and Accountability in Feder-

ated Machine Learning. In Zirpins, C., Ortiz, G.,

Nochta, Z., Waldhorst, O., Soldani, J., Villari, M., and

Tamburri, D., editors, Advances in Service-Oriented

and Cloud Computing, pages 5–18, Cham. Springer

Nature Switzerland.

Pernici, B. and Scannapieco, M. (2003). Data Quality in

Web Information Systems. In Goos, G., Hartmanis,

J., Van Leeuwen, J., Spaccapietra, S., March, S., and

Aberer, K., editors, Journal on Data Semantics I, vol-

ume 2800, pages 48–68. Springer Berlin Heidelberg,

Berlin, Heidelberg.

Price, R. and Shanks, G. (2010). DQ tags and decision-

making. In 2010 43rd Hawaii International Confer-

ence on System Sciences, pages 1–10.

Radulovic, F., Mihindukulasooriya, N., Garc

ıa-Castro, R.,

and G

omez-P

erez, A. (2017). A comprehensive qual-

ity model for Linked Data. Semantic Web, 9(1):3–24.

Ramasamy, A. and Chowdhury, S. (2020). Big Data Quality

Dimensions: A Systematic Literature Review. Journal

of Information Systems and Technology Management,

page e202017003.

Schaal, M., Smyth, B., Mueller, R. M., and MacLean, R.

(2012). Information quality dimensions for the so-

cial web. In Proceedings of the International Con-

ference on Management of Emergent Digital EcoSys-

tems, Medes ’12, pages 53–58, New York, NY, USA.

Association for Computing Machinery.

Stvilia, B., Gasser, L., Twidale, M. B., and Smith, L. C.

(2007). A framework for information quality assess-

ment. Journal of the American Society for Information

Science and Technology, 58(12):1720–1733.

Tarver, H. and Phillips, M. E. (2021). EPIC: A proposed

model for approaching metadata improvement. In

Garoufallou, E. and Ovalle-Perandones, M.-A., edi-

tors, Metadata and Semantic Research, pages 228–

233, Cham. Springer International Publishing.

Theissen-Lipp, J., Kocher, M., Lange, C., Decker, S.,

Paulus, A., Pomp, A., and Curry, E. (2023). Seman-

tics in Dataspaces: Origin and Future Directions. In

Companion Proceedings of the ACM Web Conference

2023, pages 1504–1507, Austin TX USA. ACM.

Unterkalmsteiner, M. and Abdeen, W. (2024). A com-

pendium and evaluation of taxonomy quality at-

tributes.

Wang, J. (2012). A Quality Framework for Data Integra-

tion. In MacKinnon, L. M., editor, Data Security and

Security Data, volume 6121, pages 131–134. Springer

Berlin Heidelberg, Berlin, Heidelberg.

Wang, X., Chen, L., Ban, T., Usman, M., Guan, Y., Liu,

S., Wu, T., and Chen, H. (2021). Knowledge graph

quality control: A survey. Fundamental Research,

1(5):607–626.

Wickett, K. M. and Newman, J. (2024). Towards a Crit-

ical Data Quality Analysis of Open Arrest Record

Datasets. In Sserwanga, I., Joho, H., Ma, J., Hansen,

P., Wu, D., Koizumi, M., and Gilliland, A. J., editors,

Wisdom, Well-Being, Win-Win, pages 311–318, Cham.

Springer Nature Switzerland.

Xu, Z., Gao, Y., and Yu, F. (2021). Quality Evaluation

Model of AI-based Knowledge Graph System. In

2021 3rd International Conference on Natural Lan-

guage Processing (ICNLP), pages 73–78, Beijing,

China. IEEE.

Xue, B. and Zou, L. (2022). Knowledge Graph Quality

Management: A Comprehensive Survey. IEEE Trans-

actions on Knowledge and Data Engineering, pages

1–1.

Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann,

J., and Auer, S. (2015). Quality assessment for Linked

Data: A Survey: A systematic literature review and

conceptual framework. Semantic Web, 7(1):63–93.

Zhang, L., Jeong, D., and Lee, S. (2021). Data Qual-

ity Management in the Internet of Things. Sensors,

21(17):5834.

Zhu, H., Liu, D., Bayley, I., Aldea, A., Yang, Y., and Chen,

Y. (2017). Quality model and metrics of ontology for

semantic descriptions of web services. Tsinghua Sci-

ence and Technology, 22(3):254–272.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

208