Quantifying Domain-Application Knowledge Mismatch in
Ontology-Guided Machine Learning
Pawel Bielski, Lena Witterauf, Sönke Jendral, Ralf Mikut and Jakob Bach
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
{pawel.bielski, ralf.mikut, jakob.bach}@kit.edu, lena.emma77@gmail.com, jendral@kth.se
Keywords:
Ontology Quality Evaluation, Knowledge-Guided Machine Learning, Application Ontology.
Abstract:
In this work, we study the critical issue of knowledge mismatch in ontology-guided machine learning
(OGML), specifically between domain ontologies and application ontologies. Such mismatches may arise
when OGML uses ontological knowledge that was originally created for different purposes. Even if onto-
logical knowledge improves the overall OGML performance, mismatches can lead to reduced performance
on specific data subsets compared to machine-learning models without ontological knowledge. We propose
a framework to quantify this mismatch and identify the specific parts of the ontology that contribute to it.
To demonstrate the framework’s effectiveness, we apply it to two common OGML application areas: im-
age classification and patient health prediction. Our findings reveal that domain-application mismatches are
widespread across various OGML approaches, machine-learning model architectures, datasets, and prediction
tasks, and can impact up to 40% of unique domain concepts in the datasets. We also explore the potential root
causes of these mismatches and discuss strategies to address them.
1 INTRODUCTION
Motivation. Ontologies formally represent domain
knowledge in a structured way. They use a set of con-
cepts and their relationships that is understandable by
both humans and machines (Min et al., 2017; Lour-
dusamy and John, 2018; Wilson et al., 2022). They
are increasingly important for intelligent, ontology-
informed applications in fields such as knowledge
management, data integration, decision support, rea-
soning, and machine learning (McDaniel and Storey,
2020; Min et al., 2017).
One application area that is gaining interest in
the machine-learning community is ontology-guided
machine learning (OGML). OGML is a subfield of
knowledge-guided machine learning (KGML) (von
Rueden et al., 2023; Willard et al., 2023) that sys-
tematically incorporates ontological domain knowl-
edge into machine-learning models. OGML aims
to improve prediction performance, especially for
rarely represented data objects, reduce training data
requirements, and generate more interpretable results.
OGML methods have shown significant success in
fields like computer vision (image classification, seg-
mentation, and retrieval) and medical data processing
(including text classification and patient health pre-
diction (Choi et al., 2017; Ma et al., 2018; Yin et al.,
2019)), where rich ontological background knowl-
edge is abundant (Min and Wojtusiak, 2012).
OGML methods generally outperform ontology-
uninformed machine-learning methods on average
(Dhall et al., 2020; Karthik et al., 2021; Silla and
Freitas, 2011). However, the underlying ontologi-
cal domain knowledge may not always have the op-
timal structure for a particular machine-learning task,
which may negatively impact particular subsets of the
data. In other words, OGML methods can suffer from
a mismatch between domain and application-specific
knowledge, which typically arises because ontolog-
ical domain knowledge is created for different pur-
poses than the specific OGML task. Existing liter-
ature on OGML methods often ignores this type of
low-quality domain knowledge and assumes that on-
tologies only positively impact predictions. It is cru-
cial to understand how such mismatches manifest,
how big their impact is, and how to address them.
The first step in tackling this challenge is to develop a
method for identifying and quantifying this mismatch
in OGML approaches.
Approach. In this work, we study the important is-
sue of knowledge mismatch between domain-specific
and application-specific ontologies in OGML, which
has been mostly overlooked in the literature. As a
result, both the existing theory-based and empirical
methods to evaluate ontology quality (Hlomani and
Stacey, 2014; Wilson et al., 2022) are inadequate
for detecting this mismatch. To address this gap,
we propose a new OGML-aware evaluation frame-
work based on the task-based framework (Porzel
and Malaka, 2004). Because the original frame-
work was not designed for OGML, we adapt it to
account for OGML-specific aspects, such as sepa-
ration of the ontology and data, different task and
ground truth definitions, and the stochastic nature of
machine-learning algorithms. We argue that domain-
application knowledge mismatch manifests as harm-
ful domain knowledge, negatively impacting the pre-
diction performance of OGML methods. Our frame-
work identifies such harmful parts of the ontology
for a specific task by comparing the performance
of the OGML method with an ontology-uninformed
method.
To demonstrate the effectiveness of our frame-
work, we apply it to two common OGML applica-
tion areas: image classification and patient health pre-
diction. For image classification, we quantify the
mismatch across three biological image datasets, us-
ing the Hierarchical Semantic Embedding OGML ap-
proach by (Chen et al., 2018). For patient health pre-
diction, we quantify the mismatch across three pre-
diction tasks within one medical dataset, using the
GRAM (Choi et al., 2017) OGML approach. Our
findings reveal that such mismatches are widespread
across various OGML approaches, machine-learning
model architectures, datasets, and prediction tasks.
We also explore the potential root causes of these mis-
matches based on the harmful parts of the ontology
identified by our framework. Furthermore, we discuss
strategies to address these issues, demonstrating that
our methodology shows promise as a generalizable
approach for ontology quality assessment, enabling
the identification of various ontological issues.
Contributions. To summarize, our contributions
are as follows:
1. We study the important but relatively over-
looked problem of domain-application knowledge
mismatch in ontology-guided machine learning
(OGML).
2. We introduce a quality evaluation framework to
quantify this mismatch and identify ontology
parts that negatively impact the task performance.
3. We apply our framework in two common OGML
application areas to demonstrate how to detect, in-
terpret, and address such mismatches.
4. We provide the code and experimental results (https://doi.org/10.35097/zv8zqgqd6ezm02vk).
Paper Outline. Section 2 discusses background
and related work. Section 3 introduces our approach.
Section 4 reports on experiments from two OGML
case studies. Section 5 concludes.
2 RELATED WORK
In this section, we review related work regarding
approaches for ontology quality evaluation in Sec-
tion 2.1, OGML in general in Section 2.2, and the
issue of low-quality domain knowledge in the context
of machine learning in Section 2.3.
2.1 Ontology Quality Evaluation
Creating and maintaining ontologies is a highly sub-
jective, labor-intensive process. This process is prone
to errors, as there is no standard method for creat-
ing ontologies (Capellades, 1999; Brewster, 2002;
Duque-Ramos et al., 2011). Additionally, ontolo-
gies are only approximations of domain knowledge,
and multiple valid ontologies can exist to represent
the same knowledge (Hlomani and Stacey, 2014; Mc-
Daniel and Storey, 2020). Thus, evaluating the qual-
ity of ontologies is essential for the broader adoption
of ontology-informed applications (Mc Gurk et al.,
2017). This process ensures that developed ontologies
are useful for specific tasks or domains and helps se-
lect the most suitable ontology for the given applica-
tion (Duque-Ramos et al., 2011). Evaluation methods
can significantly reduce the human effort needed to
create and maintain ontologies. In particular, they can
guide the construction process and enable the reuse
of existing ontologies instead of building them from
scratch (Capellades, 1999; Beydoun et al., 2011; Mc-
Daniel and Storey, 2020). Despite many proposed
approaches to ontology quality evaluation, no univer-
sal solution exists, as they address different quality aspects (McDaniel and Storey, 2020).
Existing methods can be grouped into two broad
categories: deductive (metrics-based) and inductive
(empirical) (Burton-Jones et al., 2005; Hlomani and
Stacey, 2014).
Deductive methods evaluate ontology quality with theory-based metrics that quantify
whether an ontology is correct according to structural
properties and description-logic axioms (Hlomani and
Stacey, 2014; Wilson et al., 2022). Often inspired
by software-engineering research on software quality,
these methods use heuristic quality criteria to identify
syntactic, semantic, and structural problems that are
independent of the application (McDaniel and Storey,
2020). However, because these deductive methods
rely on various subjective interpretations of ontology
quality, none of them has become standard (Brewster
et al., 2004). Additionally, verifying whether an on-
tology meets specific formal criteria does not guar-
antee optimal performance for a particular purpose
(Gómez-Pérez, 1999; McDaniel and Storey, 2020).
Inductive evaluation methods assess ontology
quality by empirically testing its fitness (i.e., useful-
ness for a specific application) rather than its syn-
tax, semantics, or structure (Burton-Jones et al., 2005;
Wilson et al., 2022). Fitness can be quantified in
terms of application fitness, which evaluates perfor-
mance on a specific task, or domain fitness, which
assesses performance across multiple tasks within a
domain. Ontology fitness is typically quantified for
the entire ontology (Porzel and Malaka, 2004; Clarke
et al., 2013), but it can also be quantified for spe-
cific parts of the ontology, which can help identify
improvement potentials. This process requires link-
ing specific parts of the ontology to application per-
formance, which is not trivial and thus often skipped
in practice (Pittet and Barthélemy, 2015).
(Porzel and Malaka, 2004), (Brank et al., 2005),
(Burton-Jones et al., 2005) and (Ohta et al., 2011) ar-
gue that inductive evaluation, particularly task-based
evaluation, offers an objective measure of ontology
quality by directly evaluating the ontology’s ability to
solve practical problems. Despite this, research in this
area is limited. Apart from the original paper intro-
ducing task-based ontology quality evaluation (Porzel
and Malaka, 2004) and a few adaptations (Clarke
et al., 2013; Pittet and Barthélemy, 2015), there is little research on assessing ontology quality based on
its utility for specific applications. Both (Ohta et al.,
2011) and (Wilson et al., 2022) have highlighted the
need for more research in this area. Specifically, eval-
uating ontology quality for OGML, which we address
in this work, has not been previously explored.
Recent research in confident learning and data-
centric AI (Wang et al., 2018; Northcutt et al., 2021;
Rigoni et al., 2023) shows that analyzing predic-
tions from traditional machine-learning methods can
uncover and address ontological issues in image la-
bel hierarchies, enhancing data quality and prediction
performance. Our work follows a similar direction
but focuses specifically on ontology-guided machine
learning.
2.2 Ontology-Guided Machine
Learning
Ontology-guided machine learning (OGML) is a sub-
field of knowledge-guided machine learning (KGML)
that leverages structured ontological domain knowl-
edge to enhance machine-learning models. This
is usually accomplished with custom loss functions
(Zeng et al., 2017; Ju et al., 2024), ontology-aware
embeddings (Vendrov et al., 2016; Nickel and Kiela,
2017; Chen et al., 2018; Dhall et al., 2020; Bertinetto
et al., 2020), or adapted model architectures (Brust
and Denzler, 2019a). OGML methods have demon-
strated significant success in domains rich in ontolog-
ical background knowledge, such as medical data pro-
cessing or computer vision.
In healthcare, abundant medical domain knowl-
edge has accumulated through years of medical re-
search, hospital administration, billing, and documen-
tation of medical procedures. This knowledge is often
organized into ontologies that group medical codes
into semantically meaningful categories using parent-
child relationships, e.g., the ICD-9 hierarchy of symp-
toms and diseases (see Section 4.2). OGML ap-
proaches leverage these ontologies for various auto-
mated medical data processing tasks, such as patient
health prediction (Choi et al., 2017; Yin et al., 2019;
Ma et al., 2019) or medical text classification (Arbabi
et al., 2019). These methods have been shown to im-
prove prediction performance, especially for rare dis-
eases that are often insufficiently represented in data.
In computer vision, domain knowledge is often
structured as taxonomies of labels, reflecting the hi-
erarchical nature of many real-world datasets, such
as those in biology (Silla and Freitas, 2011; Rezende
et al., 2022). Even non-hierarchical datasets can be
enriched with knowledge from literature or general
domain-independent ontologies (Chen et al., 2018;
Brust and Denzler, 2019a). OGML approaches in
computer vision have been applied to tasks such as
image classification (Deng et al., 2014; Goo et al.,
2016; Marino et al., 2017; Chen et al., 2018; Brust
and Denzler, 2019a; Bertinetto et al., 2020; Ju et al.,
2024), and image retrieval (Vendrov et al., 2016; Barz and Denzler, 2019).
OGML approaches typically use readily available
generic or domain ontologies (Burton-Jones et al.,
2005) rather than task-specific application ontologies.
While research often reports that ontological domain
knowledge improves average prediction performance
compared to models without it (Dhall et al., 2020;
Karthik et al., 2021; Silla and Freitas, 2011), there is
limited recognition that not all data subsets may ben-
efit equally in the context of a specific task.
2.3 Low-Quality Domain Knowledge
In the broader context of knowledge-guided machine
learning (KGML), both (Mitchell, 1997) and (Yu,
2007) recognize that domain knowledge can be im-
perfect due to difficulties in its collection, definition,
and representation. (Yu, 2007) also notes that do-
main knowledge is highly context-dependent, mean-
ing its usefulness can vary across different tasks.
The authors emphasize the importance of considering
the negative impact of imperfect domain knowledge
when applying KGML. (Mitchell, 1997) argues that
even imperfect knowledge can be beneficial as long as
the machine-learning algorithm tolerates some level
of error. While some recent KGML publications ex-
plicitly design or evaluate their approaches with this
in mind and quantify the impact of imperfect domain
knowledge (Bielski et al., 2024; Brust et al., 2021;
Deng et al., 2014), most existing KGML publications
do not explicitly address this issue.
In the specific context of OGML, no studies sim-
ilar to ours on the problem of domain-application
knowledge mismatch have been conducted. However,
several related observations have been made regard-
ing the low quality of domain knowledge. For exam-
ple, (Brust and Denzler, 2019b) investigated the dis-
crepancy between visual and semantic similarity in
OGML for image classification. They observed that
the overall prediction performance may decrease in
some situations compared to knowledge-uninformed
baselines. (Choi et al., 2017) showed that fully
randomized ontological domain knowledge can de-
crease the overall prediction performance in health-
care OGML applications. The above studies consid-
ered the overall negative effect on average prediction
performance and did not analyze the prediction per-
formance on subsets of the data. They also did not
consider identifying specific parts of ontological do-
main knowledge that might have caused the decrease
in the prediction performance. (Deng et al., 2014)
and (Brust et al., 2021) investigated the related prob-
lem of maximizing the utility of imprecise ontologies
in OGML but did not focus on identifying potential
quality issues within the ontologies themselves.
The most similar work to ours is (Marino et al.,
2017), where the authors analyzed the prediction per-
formance of their OGML approach for image clas-
sification across different data subsets. They found
that their OGML approach performed worse than
the baselines on certain subsets of the data, attribut-
ing this to missing relationships in the ontology.
While their study provided valuable insights, our
work builds upon this by offering a more comprehen-
sive framework that not only broadens the perspective
on the underlying issues but also systematically quan-
tifies and addresses them.
3 APPROACH
Section 3.1 outlines our adaptation of the orig-
inal task-based evaluation framework to OGML.
Next, Section 3.2 introduces the concept of domain-
application mismatch and explains how to quantify it.
3.1 Adapting Task-Based Ontology
Quality Evaluation to OGML
The original task-based evaluation framework for on-
tologies, proposed by (Porzel and Malaka, 2004), as-
sessed quality within ontology-informed applications
by comparing task results against human-generated
gold standards. While effective in its context, this
framework requires significant adaptation to OGML.
Separation of Ontology and Data. In the origi-
nal framework, the task is performed directly on the
ontology since data and ontology are the same. In
contrast, OGML distinguishes between ontology and
data. The ontology is used to improve the prediction
performance of a machine-learning task on the data.
Task Definition and Ground Truth Data. In the
original framework, tasks were specifically designed
to identify ontology issues, with the ground truth de-
fined by humans, leading to potential subjectivity er-
rors. In OGML, however, the machine-learning pro-
cess defines the task, and the ground truth is de-
rived directly from the data. This ensures that the
evaluation is more objective and less prone to errors.
However, it also necessitates linking the performance
of the OGML task to specific parts of the ontology,
which can be achieved by using refinement metrics,
as described in Section 3.2.
Stochastic Nature of ML Algorithms. OGML in-
troduces stochastic elements inherent in machine
learning, including retraining machine-learning mod-
els multiple times with different seed values, varying
train-test splits, or varying model sizes. These factors
must be considered to ensure the objectivity of results.
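To illustrate the last point, a minimal sketch of how repeated runs over different seeds and train-test splits can be aggregated; the training routine train_fn and the evaluation call are placeholders of our own, not part of any specific OGML implementation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def averaged_performance(train_fn, X, y, n_runs=5, test_size=0.2):
    """Train and evaluate a model n_runs times with different seeds and
    train-test splits; report mean and standard deviation of the test metric."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = train_fn(X_tr, y_tr, seed=seed)  # placeholder training routine
        scores.append(model.score(X_te, y_te))   # placeholder evaluation metric
    return float(np.mean(scores)), float(np.std(scores))
```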
3.2 Quantifying Domain-Application
Knowledge Mismatch in OGML
In OGML, a domain ontology represents a broad field
of knowledge, and an application ontology is tailored to a specific task.

Figure 1: Proposed framework to quantify the domain-application mismatch (diagram omitted; it contrasts OGML training on data and ontology with ontology-uninformed ML training on data alone, links their performance difference to individual ontology parts via a refinement metric, and shows the distribution of refinement scores per ontology part, with harmful domain knowledge below 0 and helpful domain knowledge above 0).

A mismatch between domain
knowledge and application knowledge occurs when
the provided ontological knowledge for a domain is
not optimally structured for the specific machine-
learning task at hand. This mismatch can exist even
if the domain knowledge is free of mistakes and thus
has high quality from the domain perspective.
Because most OGML approaches leverage do-
main ontologies instead of application ontologies, it
is often impossible to quantify the mismatch between
domain and application knowledge directly by com-
paring OGML models with both types of ontolo-
gies. We argue that such a mismatch manifests itself
through the existence of harmful parts of the ontolog-
ical domain knowledge, which may exist independently of the fitness of the entire ontology. That is why
we propose to approximate domain-application mis-
match by measuring harmful domain knowledge.
Definition 1. A particular part of domain knowl-
edge is harmful (helpful) for a particular super-
vised machine-learning task if it negatively (posi-
tively) affects the prediction performance compared
to a knowledge-uninformed baseline. The machine-
learning task comprises the datasets for training and
testing, prediction target, prediction model, and eval-
uation metric.
Measuring harmful and helpful domain knowl-
edge requires measuring the fitness of specific ontol-
ogy parts, which can be achieved with our proposed
framework (Figure 1). An ontology part is the sub-
set of nodes and edges of the ontology that is se-
mantically connected with a specific domain concept
(e.g., unique class label) from the dataset. A refine-
ment metric, which is a task-specific heuristic, links
these ontology parts to application performance. We
demonstrate examples of such refinement metrics in
Section 4. In general, a refinement metric assigns a
score to each part of the domain knowledge: 0 in-
dicates no impact on the prediction performance, a
score greater than 0 indicates a positive impact, and
a score less than 0 indicates a negative impact. The
distribution of the refinement scores can be plotted,
as shown in Figure 1. The parts of domain knowledge
with scores below zero are harmful.
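As a concrete illustration, the following sketch computes a simple accuracy-based refinement metric: for each domain concept (ontology part), it takes the difference between the OGML and baseline accuracy on the examples associated with that concept. The names are of our own choosing and not taken from any of the cited approaches:

```python
from collections import defaultdict

def refinement_scores(y_true, y_ogml, y_base, concept_of):
    """Accuracy-based refinement score per domain concept.

    y_true, y_ogml, y_base: true labels and predictions of the OGML model
        and the ontology-uninformed baseline, indexed by example.
    concept_of: maps an example index to the domain concept (ontology part)
        it is associated with, e.g., its class label.
    Returns a dict: concept -> OGML accuracy minus baseline accuracy.
    """
    hits = defaultdict(lambda: [0, 0, 0])  # concept -> [ogml correct, base correct, count]
    for i, t in enumerate(y_true):
        c = concept_of(i)
        hits[c][0] += int(y_ogml[i] == t)
        hits[c][1] += int(y_base[i] == t)
        hits[c][2] += 1
    return {c: (o - b) / n for c, (o, b, n) in hits.items()}
```

Concepts with a score below zero are the harmful ontology parts in the sense of Definition 1.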
As the representation of domain knowledge may
be of considerable size, e.g., for an ontology with
many nodes and edges, only some parts may be harm-
ful or helpful. Furthermore, the above definition is
closely tied to one particular machine-learning task.
In particular, some knowledge may be harmful for
one task but not another. We propose to leverage our
framework to quantify mismatch as follows:
Definition 2. The knowledge mismatch is the ratio
between the number of harmful domain-knowledge
parts (Definition 1) and the total number of domain-
knowledge parts.
For example, if the OGML approach outperforms
the ontology-uninformed baseline, but 25% of rele-
vant domain concepts perform worse than the base-
line, we consider there to be a 25% mismatch. Note
that different domain concepts may occur with differ-
ent frequencies in the data. For example, if 25% of the
domain concepts are harmful, 10% of the data for the
machine-learning task may be affected if the affected
concepts are relatively infrequent or 50% of the data
if they are relatively frequent.
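Building on the refinement scores computed above, the mismatch of Definition 2 and the share of affected data could be derived as follows (again with illustrative names):

```python
def knowledge_mismatch(scores, concept_counts):
    """Mismatch (Definition 2) and share of data affected by harmful concepts.

    scores: dict mapping each domain concept to its refinement score.
    concept_counts: dict mapping each concept to its number of examples.
    """
    harmful = [c for c, s in scores.items() if s < 0]
    mismatch = len(harmful) / len(scores)
    affected = sum(concept_counts[c] for c in harmful) / sum(concept_counts.values())
    return mismatch, affected
```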
Table 1: Comparison of the three datasets for image classification (negative refinement scores in red).

Dataset                        Butterflies   Birds    VegFru
Acc. Baseline [%]              84.78         85.23    86.31
Acc. OGML [%]                  85.82         88.09    88.77
Improvement [pp]               1.04          2.86     2.46
Mismatch [%]                   25.50         16.50    40.21
– Data affected [%]            22.00         17.00    40.67
Refinement score distribution  (plots omitted)
4 CASE STUDIES
In this section, we demonstrate how to quantify a
domain-application mismatch in two common OGML
application areas: image classification (Section 4.1)
and patient health prediction (Section 4.2). For each
area, we explain how to apply our proposed frame-
work to quantify the mismatch and present the results.
Additionally, for patient health prediction, we use our
framework to identify and qualitatively describe on-
tological issues arising from mismatches.
4.1 Use Case 1: Ontology-Guided
Image Classification
Scenario. In the first use case, we demonstrate how
to quantify domain-application mismatch in computer
vision. We apply our framework to the OGML ap-
proach for image classification proposed by (Chen
et al., 2018). This approach incorporates structured
information about parent-child relationships between
image categories and subcategories (i.e., a label tax-
onomy) into a deep learning model. The OGML
model employs a Hierarchical Semantic Embedding
framework to maintain consistency in classification
across different taxonomy levels.
Experimental Setup. We employ the same setup as
the original paper by (Chen et al., 2018). Specifi-
cally, we use the three pre-trained machine-learning
models made publicly available by the authors and
apply them to the corresponding hierarchical image
datasets: Butterflies, Birds, and VegFru (Vegetables
and Fruits). These datasets contain 200 unique classes
for Butterflies and Birds, and 292 classes for VegFru,
each organized into a taxonomy with four levels for
Butterflies and Birds, and two levels for VegFru.
Refinement Metric. To quantify the domain-
application mismatch of ontological domain knowl-
edge, we define the refinement metric as the per-class
performance improvements, while the original paper
assesses overall performance improvements. Our ap-
proach allows for a more detailed analysis of how the
ontology impacts the model’s performance, offering
insights into which specific classes benefit from the
ontological knowledge and which do not. We quan-
tify prediction performance with top-1 accuracy, as
in the original paper, on the test set. We calculate
the mismatch as the percentage of classes that show
a decrease in prediction performance compared to the
baseline. Additionally, since classes may vary in the
number of examples, we also report the proportion of
the test data affected by these classes.
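A minimal sketch of this per-class refinement metric, assuming the test-set labels and the top-1 predictions of both models are available as NumPy arrays (the array names are ours):

```python
import numpy as np

def per_class_refinement(y_true, y_ogml, y_base):
    """Per-class top-1 accuracy improvement of the OGML model over the
    baseline on the test set; negative values indicate harmful ontology parts."""
    deltas = {}
    for c in np.unique(y_true):
        mask = y_true == c
        deltas[c] = float(np.mean(y_ogml[mask] == c) - np.mean(y_base[mask] == c))
    return deltas
```

Classes with a negative delta count towards the mismatch; weighting them by their number of test images yields the share of data affected reported in Table 1.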
Results. As Table 1 shows, the OGML method
demonstrates overall improvements compared to the
knowledge-uninformed baseline across all three hi-
erarchical datasets. However, a substantial number
of classes does not benefit from the ontological do-
main knowledge (highlighted in red on the distribu-
tion plots of refinement scores). Since the classes are
relatively balanced in all datasets, we observe a simi-
lar percentage of data affected by the mismatch.
4.2 Use Case 2: Ontology-Guided
Sequential Health Prediction
Scenario. In the second use case, we demonstrate
how to quantify domain-application mismatch in
medical data processing. We apply our proposed
framework to an OGML approach for sequential pa-
tient health prediction, proposed by (Choi et al.,
2017). This approach incorporates structured infor-
mation about the hierarchical relationships of varying
depth between medical codes of symptoms and diseases defined by the ICD-9 classification system (http://www.icd9data.com/2015/Volume1/default.htm). It processes sequences of medical codes with a graph-based attention mechanism (GRAM) to generate semantic embeddings, considering not only individual medical codes but also their hierarchical ancestors.

Table 2: Comparison of model sizes for two variants of risk prediction (negative refinement scores in red).

                               Heart Disease Prediction    Diabetes Prediction
Architecture                   Small      Large            Small      Large
Acc. Baseline [%]              71.36      79.66            70.32      85.72
Acc. OGML [%]                  78.98      81.32            89.03      90.33
Improvement [pp]               7.62       1.66             18.71      4.61
Mismatch [%]                   15.07      20.07            4.73       11.47
– Data affected [%]            40.79      78.76            11.46      48.90
Refinement score distribution  (plots omitted)
Experimental Setup. We employ a similar setup
as the original paper by (Choi et al., 2017), utilizing
the publicly available MIMIC-III healthcare dataset
(Johnson et al., 2016). Different from the computer
vision use case, we train the OGML model ourselves,
varying the experiments across three prediction tasks:
two risk prediction tasks for heart diseases and di-
abetes and one next-visit prediction task. In the
dataset, each patient visit is represented by medical
codes corresponding to the diagnoses and symptoms
identified during that visit. For risk prediction, the
goal is to predict whether the patient’s next visit will
include a diagnosis of heart disease or diabetes, based
on their previous visits. For next-visit prediction, the
goal is to predict all the diagnoses and symptoms
recorded during the patient’s next visit, based on their
previous visits.
To address the stochastic nature of machine-
learning methods, we conduct five experiments for each combination of task and model size. In each experiment, we use a random 80-20 train-test split for risk prediction and a 90-10 split for next-visit prediction.
We then report the average performance on the test
sets. Each model comprises an embedding layer, an
RNN layer, and a final dense layer. The dense layer
uses a sigmoid activation function for risk prediction
and a softmax activation function for next-visit pre-
diction. The larger model has an attention dimension
of 100, an RNN dimension of 200, and an embedding
dimension of 300, and it is trained with a batch size of
128 for 100 epochs with early stopping. The smaller
model has an attention dimension of 16, an RNN di-
mension of 32, and an embedding dimension of 16,
and it is trained with a batch size of 32 for 50 epochs
with early stopping. For further details on the exper-
imental setup, please refer to the experimental code
provided along with this paper.
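For orientation, a simplified sketch of the model skeleton described above (embedding layer, RNN layer, and dense output layer), written in PyTorch with the dimensions of the smaller configuration. This is not the original GRAM implementation, which additionally derives the code embeddings via graph-based attention over the ICD-9 hierarchy:

```python
import torch
import torch.nn as nn

class SequentialHealthModel(nn.Module):
    """Simplified skeleton: code embeddings -> GRU over visits -> dense output.
    GRAM instead computes the embeddings via attention over a code's ICD-9
    ancestors; here they are plain learned embeddings."""

    def __init__(self, n_codes, emb_dim=16, rnn_dim=32, n_outputs=1, risk_task=True):
        super().__init__()
        self.embedding = nn.Embedding(n_codes, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, rnn_dim, batch_first=True)
        self.out = nn.Linear(rnn_dim, n_outputs)
        # Sigmoid for (binary) risk prediction, softmax for next-visit prediction,
        # as described in the experimental setup above.
        self.activation = nn.Sigmoid() if risk_task else nn.Softmax(dim=-1)

    def forward(self, visits):
        # visits: (batch, n_visits, max_codes_per_visit) of medical-code indices.
        emb = self.embedding(visits).sum(dim=2)  # one vector per visit
        _, h = self.rnn(emb)                     # last hidden state
        return self.activation(self.out(h[-1]))
```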
Refinement Metric. To quantify domain-
application mismatch in both tasks, we define
the refinement metric as the per-code performance
improvements based on all input-output pairs where
the input sequences (patient visits) include that
particular medical code. For risk prediction, the
improvement is measured using binary accuracy. For
next-visit prediction, we define two variants of the
refinement metric. The first one is accuracy-based,
similar to the risk prediction task, but using top-20
accuracy to measure the improvement, in line with
the evaluation metric from the original paper. The
second one measures the average rank improvement
between the baseline and OGML approach by
comparing the rank differences for medical codes
found in the ground-truth data for the respective
patient visits. In both tasks, input sequences may be
counted multiple times for different medical codes.
As in the computer vision use case, we calculate the
mismatch as the percentage of classes (codes) that
show a performance decrease, and we also report the
proportion of data affected by these classes (codes).
Given the smaller dataset sizes, varying train-test
splits, and a larger number of unique classes com-
pared to the computer vision use case, we report the
mismatch on the entire dataset instead of just the test
set. To handle the high number of unique medical
codes (1,823 for next-visit prediction and 2,426 for risk prediction), we filter out codes that appear fewer than three times in the dataset. This results in 38.2% of unique codes being filtered out for risk prediction and 2.1% for next-visit prediction.

Table 3: Comparison of model sizes and refinement metrics for next-visit prediction (negative refinement scores in red).

                               Next-Visit Prediction
Architecture                   Small                            Large
Acc. Baseline [%]              55.10                            66.19
Acc. OGML [%]                  73.87                            71.32
Improvement [pp]               18.77                            5.13
Refinement Metric              Accuracy-based  Ranking-based    Accuracy-based  Ranking-based
Mismatch [%]                   29.60           21.47            29.71           25.56
– Data affected [%]            83.50           59.61            82.67           60.78
Refinement score distribution  (plots omitted)
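The two refinement-metric variants for next-visit prediction could be sketched as follows, assuming the per-visit probability vectors of both models are available; the data layout and helper names are ours, not taken from the original implementation:

```python
import numpy as np

def top20_hit(probs, true_codes, k=20):
    """1 if any ground-truth code of the visit is among the top-k predictions."""
    top_k = np.argsort(probs)[::-1][:k]
    return int(any(c in top_k for c in true_codes))

def rank_improvement(probs_base, probs_ogml, true_codes):
    """Average rank improvement (baseline rank minus OGML rank, higher is
    better) over the ground-truth codes of one visit."""
    rank_base = np.argsort(np.argsort(-probs_base))  # rank 0 = highest probability
    rank_ogml = np.argsort(np.argsort(-probs_ogml))
    return float(np.mean([rank_base[c] - rank_ogml[c] for c in true_codes]))

def per_code_refinement(visits, scorer):
    """Aggregate a per-visit score into a per-code refinement score: each visit
    contributes to every medical code contained in its input sequence."""
    sums, counts = {}, {}
    for input_codes, probs_base, probs_ogml, true_codes in visits:
        s = scorer(probs_base, probs_ogml, true_codes)
        for c in set(input_codes):
            sums[c] = sums.get(c, 0.0) + s
            counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}
```

The accuracy-based variant then uses scorer = lambda b, o, t: top20_hit(o, t) - top20_hit(b, t), while the ranking-based variant uses scorer = rank_improvement.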
Results. The results, summarized in Tables 2 and 3,
reveal distinct patterns across the various models and
tasks. First, domain-application knowledge mismatch
is evident across all model sizes, prediction tasks, and
refinement metrics within the dataset. However, the
degree of mismatch varies, with the mismatch for dia-
betes risk prediction being two to three times smaller
than that for heart disease risk prediction using the
same model sizes. This variation is also reflected in
the differences in the distribution of refinement scores
shown in the bottom row of the table. Addition-
ally, models used for diabetes risk prediction bene-
fit more from domain knowledge than those used for
heart disease risk prediction. Further, ontological do-
main knowledge tends to improve the performance of
smaller models more than of larger models. For risk
prediction, smaller OGML models perform nearly as
well as their larger counterparts, while in next-visit
prediction, smaller OGML models even outperform
the larger ones. Additionally, smaller models gen-
erally exhibit less domain-application mismatch, i.e.,
they benefit more from domain knowledge than larger
models. Lastly, we observe a much higher percentage
of data affected by mismatches compared to the com-
puter vision use case, likely due to the presence of
multiple medical codes in each input sequence.
4.2.1 Identifying Ontological Issues
When examining the top ten medical categories with
the lowest accuracy-based refinement scores for next-
visit prediction, we found several potential ontologi-
cal issues (with the ICD-9 hierarchy) that could de-
crease the prediction performance of OGML.
Similar Concepts in Different Ontological Paths.
This issue arises when related ontological categories
are placed under different paths and lack a com-
mon semantic ancestor. As a result, the OGML
approach treats these categories as semantically in-
dependent, which can confuse the machine-learning
model and negatively impact the prediction perfor-
mance for these categories. For example, five of
the ten medical codes with the lowest refinement
scores concern drug-related symptoms or diseases. These five categories fall into three paths in the
ontology, without a shared common ancestor:
- 970.8 (Poisoning by other specified central nervous system stimulants) falls under the ontological category Injury and Poisoning 800-999.
- E950.0 (Suicide and self-inflicted poisoning by analgesics, antipyretics, and antirheumatics) and E950.4 (Suicide and self-inflicted poisoning by other specified drugs and medicinal substances) both fall under the ontological category Supplementary Classification of External Causes of Injury and Poisoning E000-E999.
- 304.23 (Cocaine dependence, in remission) and 304.21 (Cocaine dependence, continuous) both fall under the ontological category Mental Disorders 290-319.
Figure 2: Suboptimal parent order (left) and potential improved one (right) for categories related to drug dependence (tree diagrams omitted; nodes: Drug Dependence; Drug Type 1, Drug Type 2; Unspecified, Episodic, Continuous, Remission).
Figure 3: Suboptimal parent order (left) and potential improved one (right) for categories related to the gastrointestinal tract (tree diagrams omitted; nodes: Injury to gastrointestinal tract; Location A, Location B; with open wound, without open wound).
Irrelevant Categorization. This issue arises when
the categorization focuses on aspects that may be less
relevant to the machine-learning task. For instance,
consider the following codes:
- E956: Suicide and self-inflicted injury by cutting and piercing instrument
- E950.0: Suicide and self-inflicted poisoning by analgesics, antipyretics, and antirheumatics
- E950.4: Suicide and self-inflicted poisoning by other specified drugs and medicinal substances
While these codes categorize different types of in-
juries (cutting, poisoning by drugs, etc.), they are all
grouped under the broader category of Suicide and
Self-Inflicted Injury (E950-E959). This grouping does
not account for other causes of such injuries. For
next-visit prediction, the focus on whether an injury
is self-inflicted might be less relevant than the spe-
cific type of injury. A more effective approach could
be categorizing these codes based on the type of in-
jury (e.g., cutting, poisoning) rather than its origin, as
this may be more relevant to the prediction task.
Inaccurate or Overly Broad Categories. This is-
sue arises when a category is not specific enough or is
overly broad. Categories that include terms like ‘un-
specified’ or ‘other’, or have such terms in their parent
categories, are especially susceptible to this problem.
For example, the code 957.1 Injury to other specified
nerve(s) is classified under the broader category 957
Injury to other and unspecified nerves. This broad
classification can include various, potentially unre-
lated medical codes within the same category, which
may confuse the machine-learning model.
Suboptimal Ordering of Parent Categories. This
issue arises when parent categories are organized to
prioritize one aspect over another, which may not be
optimal for the specific task. For example, the ICD-9
ontology initially classifies drug dependence by drug
type and then by dependence type (Figure 2, left).
This structure leads the OGML approach to treat con-
tinuous use of different drugs as unrelated and con-
tinuous versus episodic use of the same drug as more
similar. Reordering to classify by dependence type
first (Figure 2, right) could better capture the nuances
of drug use. Similarly, Figure 3 shows that organiz-
ing wound information by location and type at the
same level may not be ideal. Depending on the task,
it might be more effective to classify injuries first by
location and then by type, or vice versa, or even to
provide both ordering paths.
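To make the effect of the ordering tangible, a small sketch compares how many ancestors two leaf codes share under both orderings of Figure 2; the node names are simplified illustrations, not actual ICD-9 labels:

```python
# Each code is described by its path of ancestors from the root to the leaf.
by_drug_type = {   # original ordering: drug type first, then dependence type
    "cocaine, continuous": ["Drug Dependence", "Cocaine", "Continuous"],
    "opioid, continuous":  ["Drug Dependence", "Opioid", "Continuous"],
}
by_dependence = {  # reordered: dependence type first, then drug type
    "cocaine, continuous": ["Drug Dependence", "Continuous", "Cocaine"],
    "opioid, continuous":  ["Drug Dependence", "Continuous", "Opioid"],
}

def shared_ancestors(paths, a, b):
    """Number of ancestors the two codes share, counted from the root downwards."""
    n = 0
    for x, y in zip(paths[a], paths[b]):
        if x != y:
            break
        n += 1
    return n

print(shared_ancestors(by_drug_type, "cocaine, continuous", "opioid, continuous"))   # 1
print(shared_ancestors(by_dependence, "cocaine, continuous", "opioid, continuous"))  # 2
```

Since approaches such as GRAM represent a code through its hierarchical ancestors, the reordered hierarchy (two shared ancestors instead of one) makes continuous dependence on different drugs more similar, matching the intuition described above.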
5 CONCLUSIONS
In this work, we addressed the critical and often over-
looked issue of domain-application knowledge mis-
match in ontology-guided machine learning (OGML).
We developed an OGML-aware framework to quan-
tify these mismatches and identify harmful ontol-
ogy parts, which negatively affect prediction perfor-
mance. Our framework offers a practical and gener-
alizable methodology for assessing ontology quality
in OGML contexts. Thus, it improves the integration
of ontological knowledge into machine-learning mod-
els, leading to more effective and reliable use of on-
tologies. Our case studies in image classification and
patient health prediction revealed that mismatches are
widespread across datasets, OGML approaches, and
machine-learning architectures. This highlights the
importance of aligning domain ontologies with spe-
cific application requirements in OGML contexts. Fu-
ture research could refine our framework and explore
its applicability across various domains and OGML
methods. For example, one could apply our frame-
work to multiple tasks in a single domain to evalu-
ate an ontology’s domain fitness. Another promis-
ing direction is automatically repairing ontologies for
a given OGML task, i.e., removing harmful domain
knowledge, restructuring the ontologies accordingly,
and re-training the OGML model.
ACKNOWLEDGEMENTS
This research has been partially funded by the Ger-
man Federal Ministry of Education and Research
(BMBF) under grant 01IS17042 as part of the Soft-
ware Campus project DomainML.
REFERENCES
Arbabi, A., Adams, D. R., Fidler, S., and Brudno, M.
(2019). Identifying Clinical Terms in Medical Text
Using Ontology-Guided Machine Learning. JMIR
Med. Inf., 7(2).
Barz, B. and Denzler, J. (2019). Hierarchy-Based Image
Embeddings for Semantic Image Retrieval. In Proc.
WACV, pages 638–647.
Bertinetto, L., Mueller, R., Tertikas, K., Samangooei, S.,
and Lord, N. A. (2020). Making Better Mistakes:
Leveraging Class Hierarchies With Deep Networks.
In Proc. CVPR, pages 12503–12512.
Beydoun, G., Lopez-Lorca, A. A., García-Sánchez, F., and Martínez-Béjar, R. (2011). How do we measure and
improve the quality of a hierarchical ontology? J.
Syst. Software, 84(12):2363–2373.
Bielski, P., Eismont, A., Bach, J., Leiser, F., Kottonau, D.,
and Böhm, K. (2024). Knowledge-guided learning of temporal dynamics and its application to gas turbines. In Proc. e-Energy, pages 279–290.
Brank, J., Grobelnik, M., and Mladenic, D. (2005). A
survey of ontology evaluation techniques. In Proc.
SiKDD, pages 166–170.
Brewster, C. (2002). Techniques for automated taxonomy
building: Towards ontologies for knowledge manage-
ment. In Proc. Annu. CLUK Res. Colloq.
Brewster, C., Alani, H., Dasmahapatra, S., and Wilks, Y.
(2004). Data driven ontology evaluation. In Proc.
LREC, pages 641–644.
Brust, C.-A., Barz, B., and Denzler, J. (2021). Making ev-
ery label count: Handling semantic imprecision by in-
tegrating domain knowledge. In Proc. ICPR, pages
6866–6873.
Brust, C.-A. and Denzler, J. (2019a). Integrating domain
knowledge: using hierarchies to improve deep classi-
fiers. In Proc. ACPR, pages 3–16.
Brust, C.-A. and Denzler, J. (2019b). Not just a matter of se-
mantics: The relationship between visual and seman-
tic similarity. In Proc. DAGM GCPR, pages 414–427.
Burton-Jones, A., Storey, V. C., Sugumaran, V., and
Ahluwalia, P. (2005). A semiotic metrics suite for as-
sessing the quality of ontologies. Data Knowl. Eng.,
55(1):84–102.
Capellades, M. A. (1999). Assessment of reusability of on-
tologies: a practical example. In Proc. AAAI Work-
shop Ontol. Manage., pages 74–79.
Chen, T., Wu, W., Gao, Y., Dong, L., Luo, X., and Lin,
L. (2018). Fine-Grained Representation Learning and
Recognition by Exploiting Hierarchical Semantic Em-
bedding. In Proc. ACM MM, pages 2023–2031.
Choi, E., Bahadori, M. T., Song, L., Stewart, W. F., and Sun,
J. (2017). GRAM: Graph-based Attention Model for
Healthcare Representation Learning. In Proc. KDD,
pages 787–795.
Clarke, E. L., Loguercio, S., Good, B. M., and Su, A. I.
(2013). A task-based approach for Gene Ontology
evaluation. J. Biomed. Semant., 4.
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Ben-
gio, S., Li, Y., Neven, H., and Adam, H. (2014).
Large-Scale Object Classification Using Label Rela-
tion Graphs. In Proc. ECCV, pages 48–64.
Dhall, A., Makarova, A., Ganea, O., Pavllo, D., Greeff, M.,
and Krause, A. (2020). Hierarchical Image Classifi-
cation using Entailment Cone Embeddings. In Proc.
CVPRW, pages 3649–3658.
Duque-Ramos, A., Fernández-Breis, J. T., Stevens, R., and
Aussenac-Gilles, N. (2011). OQuaRE: A SQuaRE-
based approach for evaluating the quality of ontolo-
gies. J. Res. Pract. Inf. Technol., 43(2):159–176.
Goo, W., Kim, J., Kim, G., and Hwang, S. J. (2016).
Taxonomy-Regularized Semantic Deep Convolutional
Neural Networks. In Proc. ECCV, pages 86–101.
Gómez-Pérez, A. (1999). Evaluation of taxonomic knowl-
edge in ontologies and knowledge bases. Technical
report, University of Calgary, Alberta, Canada.
Hlomani, H. and Stacey, D. (2014). Approaches, methods,
metrics, measures, and subjectivity in ontology evalu-
ation: A survey. Semant. Web J., 1(5):1–11.
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-
w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits,
P., Anthony Celi, L., and Mark, R. G. (2016). MIMIC-
III, a freely accessible critical care database. Sci.
Data, 3(1).
Ju, L., Yu, Z., Wang, L., Zhao, X., Wang, X., Bonnington,
P., and Ge, Z. (2024). Hierarchical Knowledge Guided
Learning for Real-World Retinal Disease Recogni-
tion. IEEE Trans. Med. Imaging, 43(1):335–350.
Karthik, S., Prabhu, A., Dokania, P. K., and Gandhi, V.
(2021). No Cost Likelihood Manipulation at Test
Time for Making Better Mistakes in Deep Networks.
arXiv:2104.00795 [cs].
Lourdusamy, R. and John, A. (2018). A review on metrics
for ontology evaluation. In Proc. ICISC, pages 1415–
1421.
Ma, F., Wang, Y., Xiao, H., Yuan, Y., Chitta, R., Zhou, J.,
and Gao, J. (2019). Incorporating medical code de-
scriptions for diagnosis prediction in healthcare. BMC
Med. Inf. Decis. Making, 19(6).
Ma, F., You, Q., Xiao, H., Chitta, R., Zhou, J., and Gao, J.
(2018). KAME: Knowledge-based Attention Model
for Diagnosis Prediction in Healthcare. In Proc.
CIKM, pages 743–752.
Marino, K., Salakhutdinov, R., and Gupta, A. (2017). The
More You Know: Using Knowledge Graphs for Image
Classification. arXiv:1612.04844 [cs].
Mc Gurk, S., Abela, C., and Debattista, J. (2017). Towards
ontology quality assessment. In Proc. LDQ2017,
pages 94–106.
McDaniel, M. and Storey, V. C. (2020). Evaluating Domain
Ontologies: Clarification, Classification, and Chal-
lenges. ACM Comput. Surv., 52(4).
Min, H., Mobahi, H., Irvin, K., Avramovic, S., and Woj-
tusiak, J. (2017). Predicting activities of daily living
for cancer patients using an ontology-guided machine
learning methodology. J. Biomed. Semant., 8(1).
Min, H. and Wojtusiak, J. (2012). Clinical data analysis
using ontology-guided rule learning. In Proc. MIXHS,
pages 17–22.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill
New York.
Nickel, M. and Kiela, D. (2017). Poincaré Embeddings
for Learning Hierarchical Representations. In Proc.
NIPS.
Northcutt, C., Jiang, L., and Chuang, I. (2021). Confident
learning: Estimating uncertainty in dataset labels. J.
Artif. Intell. Res., 70:1373–1411.
Ohta, M., Kozaki, K., and Mizoguchi, R. (2011). A Quality
Assurance Framework for Ontology Construction and
Refinement. In Proc. AWIC, pages 207–216.
Pittet, P. and Barthélemy, J. (2015). Exploiting Users' Feedbacks - Towards a Task-based Evaluation of Application Ontologies Throughout Their Lifecycle. In Proc.
IC3K, pages 263–268.
Porzel, R. and Malaka, R. (2004). A task-based approach
for ontology evaluation. In Proc. ECAI Workshop On-
tol. Learn. Popul.
Rezende, P. M., Xavier, J. S., Ascher, D. B., Fernandes,
G. R., and Pires, D. E. V. (2022). Evaluating hierarchi-
cal machine learning approaches to classify biological
databases. Briefings Bioinf., 23(4).
Rigoni, D., Elliott, D., and Frank, S. (2023). Cleaner Cat-
egories Improve Object Detection and Visual-Textual
Grounding. In Proc. SCIA, pages 412–442.
Silla, C. N. and Freitas, A. A. (2011). A survey of hierarchi-
cal classification across different application domains.
Data Min. Knowl. Discovery, 22(1-2):31–72.
Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R.
(2016). Order-Embeddings of Images and Language.
arXiv:1511.06361 [cs].
von Rueden, L., Mayer, S., Beckh, K., Georgiev, B., Gies-
selbach, S., Heese, R., Kirsch, B., Pfrommer, J., Pick,
A., Ramamurthy, R., Walczak, M., Garcke, J., Bauck-
hage, C., and Schuecker, J. (2023). Informed Machine
Learning - A Taxonomy and Survey of Integrating
Prior Knowledge into Learning Systems. IEEE Trans.
Knowl. Data Eng., 35(1):614–633.
Wang, H., Wang, H., and Xu, K. (2018). Categorizing con-
cepts with basic level for vision-to-language. In Proc.
CVPR, pages 4962–4970.
Willard, J., Jia, X., Xu, S., Steinbach, M., and Kumar, V.
(2023). Integrating Scientific Knowledge with Ma-
chine Learning for Engineering and Environmental
Systems. ACM Comput. Surv., 55(4).
Wilson, R. S. I., Goonetillake, J. S., Indika, W. A., and
Ginige, A. (2022). A conceptual model for ontology
quality assessment. Semant. Web, 14(6):1051–1097.
Yin, C., Zhao, R., Qian, B., Lv, X., and Zhang, P.
(2019). Domain Knowledge Guided Deep Learn-
ing with Electronic Health Records. In Proc. ICDM,
pages 738–747.
Yu, T. (2007). Incorporating Prior Domain Knowledge into
Inductive Machine Learning: its Implementation in
Contemporary Capital Markets. PhD thesis, Univer-
sity of Technology Sydney, Australia.
Zeng, C., Zhou, W., Li, T., Shwartz, L., and Grabarnik,
G. Y. (2017). Knowledge Guided Hierarchical Multi-
Label Classification Over Ticket Data. IEEE Trans.
Netw. Serv. Manage., 14(2):246–260.