Unveiling Breast Cancer Causes Through Knowledge Graph Analysis

and BioBERT-Based Factuality Prediction

Hasna El Haji

1 a

, Nada Sbihi

, Kaoutar El Handri

2,3 b

, Adil Bahaj

, Mohammed Elkasri

Amine Souadka

4 c

and Mounir Ghogho

1 d

TiCLab, International University of Rabat, Rabat, Morrocco

MedBiotech Laboratory, Faculty of Medicine and Pharmacy (FMPR), University Mohammed V, Rabat, Morocco

Aivancity School of AI & Data for Business & Society, Cachan, France

Surgical Oncology Department, National Institute of Oncology, University Mohammed V, Rabat, Morocco

Keywords:

Knowledge Graph, Breast Cancer, Bioinformatics, BioBERT, Factuality Prediction.

Abstract:

Worldwide, millions of women are affected by breast cancer, with the impact signiﬁcantly worsened in under-

served regions. The profound effect of breast cancer on women’s health has driven research into its causes,

with the aim of developing methods for the prevention, diagnosis, and treatment of the disease. The signiﬁcant

inﬂux of research on this subject is overwhelming and makes manual exploration arduous, which motivates

automated knowledge exploration approaches. Knowledge Graphs (KGs) are one of these approaches that

attracted signiﬁcant attention in the last few years for their ability to structure and present knowledge, making

it easier to explore and analyze. Current KGs that include causes of breast cancer are deﬁcient in contextual

information, highlighting the uncertainty of these causes (facts). In this work, we present a method for extract-

ing a sub-graph of breast cancer causes and ﬁne-tuning BioBERT to evaluate the uncertainty of these causes.

Our automated approach, which simulates human annotation, computes uncertainty scores based on textual

factuality and assesses cause reliability using a Closeness Score. We also create a web-based application for

easy exploration

https://bckg.datanets.org/

1 INTRODUCTION

Breast cancer is a complex and multifaceted disease,

affecting approximately 2.3 million women world-

wide each year, according to the “WHO” (World

Health Organization, 2024). With about one in eight

women diagnosed during their lifetime (American

Cancer Society, 2024), the disease signiﬁcantly im-

pacts quality of life, particularly in disadvantaged

countries. This has led to extensive research in var-

ious disciplines to explore prevention (El Haji et al.,

2024), diagnosis, and prognosis (El Haji et al., 2023).

One crucial research focus is identifying the lead-

ing causes of breast cancer, as addressing the root

cause can increase the success of treatments. How-

https://orcid.org/0009-0000-0437-2731

https://orcid.org/0000-0001-6732-2627

https://orcid.org/0000-0002-1091-9446

https://orcid.org/0000-0002-0055-7867

ever, abundant research papers on this topic can make

manually exploring ﬁndings challenging and require

multiple specialists’ involvement. Automated knowl-

edge extraction methods have been studied to answer

this need.

There has been signiﬁcant progress in automated

knowledge extraction from scientiﬁc literature, par-

ticularly in the biomedical ﬁeld (Hogan et al., 2021),

(Zheng et al., 2021), (Wang et al., 2023). The aim is

to organize and structure knowledge to facilitate ex-

ploration and discovery (Bahaj et al., 2022). In KGs,

facts are organized into triplets, where each triplet

comprises a head entity and a tail entity linked by a

semantic relation. Mathematically, a KG G can be

presented as a triplet G = (V , R , E), where V is the

set of entities, R is the set of relations and E is the set

of entities such that E = {(h, r,t)|h,t ∈ V , r ∈ R }.

KGs can be utilized for knowledge exploration us-

ing visualization tools, improving search engine re-

sults, question-answering systems, recommendation

El Haji, H., Sbihi, N., El Handri, K., Bahaj, A., Elkasri, M., Souadka, A. and Ghogho, M.

Unveiling Breast Cancer Causes Through Knowledge Graph Analysis and BioBERT-Based Factuality Prediction.

DOI: 10.5220/0013179700003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 2: HEALTHINF, pages 141-148

ISBN: 978-989-758-731-3; ISSN: 2184-4305

141

systems (El Handri and Idrissi, 2020), and tasks

such as node classiﬁcation and link prediction (Huang

et al., 2019), (Guo et al., 2020), (Rossi et al., 2021).

In biomedicine, research in biomedical KGs has seen

a surge in recent years. This is due to the inﬂux of new

biomedical publications in multiple disciplines on

various medical concepts. Existing biomedical KGs

in literature are either domain-speciﬁc or general.

COVID19KG (Kejriwal, 2020) is a domain-speciﬁc

KG with knowledge about COVID19. SemMedDB

(Kilicoglu et al., 2012) is a general-purpose biomedi-

cal KG that contains a wide range of biomedical con-

cepts. However, KGs assume the accuracy and relia-

bility of their facts. Nonetheless, uncertainty can im-

pact them (Sosa and Altman, 2022). This uncertainty

arises from various sources of noise in the source data

or in the extraction process. Uncertain knowledge

graphs (UKGs) have been developed, whereby a con-

ﬁdence score is assigned to each fact, quantifying its

validity (Chen et al., 2019). Formally, these UKGs

assign a conﬁdence score s

to each triple l = (h, r,t).

This conﬁdence score reﬂects the level of veracity of

that fact in the KG.

In this direction, language modeling has gained

considerable attention recently. The creation of the

transformer-based architecture and self-supervised

pretraining made considerable strides to have more

reliable models. BERT (Devlin et al., 2018) is one

such model. BERT uses a transformer-based architec-

ture relying on a masked language modeling training

process. BERT initiated a research direction, which

extended from general language modeling to more

domain-speciﬁc language models in law (Sun, 2023),

biomedicine (Chen et al., 2023), marketing (Li et al.,

2022) and other domains. These variants are gener-

ally ﬁne-tuned versions of BERT, where a domain-

speciﬁc corpus is curated and labeled, then used to

retrain BERT. Language model ﬁne-tuning is essen-

tial in modeling small datasets (Zhang et al., 2020)

and low-resource languages (Hangya et al., 2022).

One speciﬁc application of these language models

is factuality prediction, which predicts the veracity

of an event expressed in a sentence (Veyseh et al.,

2019). Factuality prediction is generally formulated

as a classiﬁcation problem, where the model is tasked

to predict the factuality class of a sentence (Kilicoglu

et al., 2017). Other approaches, such as formulat-

ing the problem as a regression problem, where a

factuality value that ranges between -3 and 3 is pro-

vided for each sentence. Negative and positive val-

ues signify negative and positive polarity, respectively

(Stanovsky et al., 2017). The main difﬁculty in fac-

tuality prediction stems from the nature of factuality

itself. Practically, factuality is expressed using certain

cue words, and corresponding to their dependence on

different sentence elements, the sentence factuality

gets affected (Jiang and de Marneffe, 2021). This syn-

tactic dependency referred to as text paradigmatics is

generally challenging to model since most models fo-

cus on modeling syntax with no regard to long-range

dependencies.

To address this gap, advancements in Natural

Language Processing (NLP) have been signiﬁcantly

driven by novel pre-training techniques, such as those

presented in XLNet, ERNIE, and ELECTRA. XLNet

(Yang et al., 2019), with its generalized autoregres-

sive pretraining, enhances language understanding by

capturing bidirectional context through permutation-

based training, which surpasses traditional models in

handling complex language tasks. ERNIE (Zhang

et al., 2019) further builds on this foundation by incor-

porating informative entities, offering a more context-

aware representation that is particularly effective for

tasks involving entity-level understanding. Mean-

while, ELECTRA (Clark et al., 2020) introduces a

discriminator-based pre-training approach, improving

the efﬁciency and effectiveness of language models

by focusing on distinguishing real from generated

tokens. In this evolving landscape, our study con-

tributes by ﬁne-tuning BioBERT (Lee et al., 2020)

speciﬁcally for factuality prediction of triplets, lever-

aging its domain-speciﬁc pre-training for biomedical

text.

Our work builds on these foundational approaches

by applying BioBERT for factuality prediction. Adi-

tionally, our work proposes a metric to evaluate the

reliability of triplets within our knowledge sub-graph.

This enables visibility and clarity for physicians and

scientists working on breast cancer. Furthermore, we

design a web application that serves as an interactive

exploration platform displaying the entire sub-graph.

Our approach uniquely advances existing method-

ologies by addressing key limitations in handling un-

certainty and factuality prediction within biomedi-

cal knowledge graphs. By integrating BioBERT—a

domain-speciﬁc language model—with the construc-

tion of a breast cancer causes knowledge sub-graph,

we introduce a novel framework that not only predicts

factuality but also quantiﬁes the reliability of triplets

through a Closeness Score. This dual-layered ap-

proach signiﬁcantly enhances the interpretability and

trustworthiness of the extracted causal relationships.

Unlike existing models that assume the veracity of

their facts, our methodology explicitly evaluates and

categorizes factuality, making it particularly impact-

ful for applications in biomedical research where un-

certainty can hinder effective decision-making.

The paper is structured as follows: section 2 de-

HEALTHINF 2025 - 18th International Conference on Health Informatics

142

tails our methodology for constructing the knowledge

sub-graph and evaluating the factuality of triplets.

Section 3 presents our experiments and results, while

sections 4 and 5 provide discussion and concluding

remarks on the added value of our approach.

2 METHODOLOGY

The framework of our method consists of sub-

graph construction (extraction of triplets), ﬁne-tuning

BioBERT (Lee et al., 2020) for factuality prediction

and triplet reliability evaluation based on Closeness

Score. We detail each part in the subsections 2.1, 2.2

and 2.3, respectively. The overall workﬂow is shown

in Figure 1.

2.1 Knowledge Sub-Graph

Construction

For constructing our knowledge sub-graph of breast

cancer causes, we utilize a generic biomedical litera-

ture database known as SemMedDB (Kilicoglu et al.,

2012). SemMedDB is a valuable resource for inte-

grating and analyzing information from biomedical

literature. It captures relationships between concepts

such as genes, diseases, drugs, and other biomedical

concepts using multiple semantic relations that ex-

press general association, cause, inﬂuence and other

relations. SemMedDB provides a rich network of in-

terconnected biomedical knowledge and it includes

contextual details like negation and modality and al-

lows tracing back to PubMed articles via unique iden-

tiﬁers, enabling researchers to explore relationships

and generate hypotheses grounded in biomedical lit-

erature.

Even though an API is available, we download all

the resources to enable ofﬂine work and ensure efﬁ-

cient local processing. We extract triplets related to

breast cancer causes for sub-graph construction and

source sentences for factuality prediction. Each triplet

represents a relationship between a cause and breast

cancer. We also retrieve the paper’s PubMed Identi-

ﬁer (PMID) and its citation count to further evaluate

the relevance of the extracted triplets.

We are interested in the causes of breast cancer,

thus we ﬁlter the extracted triplets related to breast

cancer by retaining only those with the relationship

pattern:

“Subject” < causes > “Breast cancer”.

Each “Subject” is identiﬁed by a Concept Unique

Identiﬁer (CUI), a unique code used to retrieve

information about entities across various medical

databases. In our context, the object “breast can-

cer” refers speciﬁcally to female breast cancer, which

can appear under different synonyms in the biomedi-

cal literature. We utilize UMLS to extract these syn-

onyms (Rossi et al., 2021).

2.2 Fine-Tuning BioBERT for

Factuality Prediction

Although the automatic construction of KGs offers

a more practical alternative to manual annotation, it

suffers multiple pitfalls. For example, according to

SemMedDB, secondary Malignant neoplasm of the

skin causes breast cancer. But, upon closer exami-

nation of the sentence: “ We present the rare clinical

entity of a breast cancer which was ﬁrst diagnosed due

to the skin metastasis away from the breast tumor”. It

is clear that the sentence does not allow us to conﬁrm

this relationship with certainty. Therefore, retrieving

the sentence corresponding to each triplet is essential

as it enriches our knowledge sub-graph.

Given the large number of triplets, it is difﬁcult

to manually assign a factuality score to each one. As

a result, we use a language model to predict the fac-

tuality score of a triplet using the source sentence as

input. To construct the factuality model, we use a fac-

tuality dataset (Kilicoglu et al., 2017). This dataset is

built using 500 PubMed abstracts. Sentences in these

abstracts were annotated to extract seven factuality

values: fact, probable, possible, doubtful, counter-

act, uncommitted, conditional. Although the dataset

contains 500 abstracts, the number of triples (sub-

ject, relation, object) extracted from them amounts

to 3,149 (Kilicoglu et al., 2017). The dataset is un-

balanced, with 87% of the sentences classiﬁed as

facts, 4.5% probable, 3.8% uncommitted, 2% pos-

sible, 1.8% counterfact, 0.25% Doubtful, and just 1

case conditional. We removed the last two classes due

to their minimal representation.

Sentences can act as explanatory variables in a

predictive model, with factuality as the target vari-

able. In this context, the objective is to leverage

the information within the sentences to predict the

factuality or outcome of interest. We use BioBERT

(Lee et al., 2020) as the basis for constructing our

predictive model since it is a common and effec-

tive approach, especially for tasks involving natural

language processing and understanding. We ﬁne-

tune the pre-trained BioBERT model to adapt it to

our speciﬁc NLP task: factuality classiﬁcation. We

use the “WeightedRandomSampler” sampling strat-

egy, assigning higher probabilities to samples from

underrepresented classes. This approach helps pre-

vent the model from being biased towards the major-

Unveiling Breast Cancer Causes Through Knowledge Graph Analysis and BioBERT-Based Factuality Prediction

143

Figure 1: Sub-graph construction and factuality prediction: Triplets are extracted from the SemMedDB knowledge graph,

ﬁltered to include only cases where the “Subject” causes “breast cancer,” followed by the extraction of sentences related to

each triplet. For each triplet y, ﬁne-tuned BioBERT is used to predict the factuality score for each sentence from which the

triplet was derived. The factuality scores are averaged across sentences corresponding to the same triplet, resulting in a unique

score assigned to each triplet (indicating the degree of factuality of a subject that causes breast cancer).

ity class and improve its ability to learn from the mi-

nority class samples. After testing different parame-

ter values, the following combination gives us the best

validation accuracy: a learning rate of 1e-5, bach size

of 8, and number of epochs of 10.

As proposed by (Kilicoglu et al., 2017), we con-

vert factuality classes to numeric scores as illustrated

in Table 1. Finally, to calculate the factuality of a

breast cancer cause, we compute the average of all

factuality scores predicted from sentences involving

that cause. This approach may be biased, as some

breast cancer causes are cited numerous times while

others are cited only once. Consequently, we propose

further evaluation in the next section to assess cause

reliability.

Table 1: Conversion of factuality classes to numeric scores

(Kilicoglu et al., 2017).

Class Factuality score

Fact 1

Probable 0.75

Possible 0.5

Uncommitted 0.25

Counterfact 0

2.3 Triplet Reliability Evaluation Based

on Closeness Score

The assessment of triplet reliability in knowledge

graphs is a multifaceted challenge that requires evalu-

ating both the impact and contextual relevance of ex-

tracted relations. Previous studies have highlighted

the role of citation counts as a proxy for signiﬁcance

in research evaluation, where more frequently cited

works are often considered more impactful (Waltman,

2016). On the other hand, contextualized language

models, such as BERT and its biomedical variants

like BioBERT, have been increasingly used to as-

sess the relevance of extracted triplets by capturing

semantic similarity within speciﬁc contexts (Soares

et al., 2019). Despite these advances, the integration

of citation-based metrics with contextual relevance

scores for triplet reliability assessment has not been

fully explored.

We evaluate triplet reliability based on Closeness

Score, which quantiﬁes the degree of proximity of the

triplet to the core objective of the publication. By

assessing the closeness between the triplet’s mention

and the publication title, we can determine how rel-

evant the triplet is to the study’s main focus. This

ensures that the triplet is not merely referenced as the

result of other research but is central to the study it-

self.

For each triplet T

, we identify the papers i that

have cited it and retrieve the titles Title

of these pa-

pers using the PubMed API. Then, using BioBERT

embeddings, we calculate the similarity score be-

tween each title and the sentence S(T

) from which

the triplet was extracted in that paper. The overall

triplet closeness score is computed as the mean of the

similarity scores across all citing papers, where N is

the number of citing papers (Equation 1).

Closeness(T

) =

∑

i=1

Similarity(Title

, S(T

)) (1)

For an effective visual representation of our ﬁnd-

ings, we design a web application using the Python

framework Dash.

HEALTHINF 2025 - 18th International Conference on Health Informatics

144

3 EXPERIMENTS AND RESULTS

As depicted in the framework in Figure 1, we rep-

resent all the potential causes of breast cancer as a

knowledge sub-graph and we assign a degree of fac-

tuality to each cause linked to breast cancer. After ﬁl-

tering the triplets, there are 1,039 subjects identiﬁed

as causing breast cancer. In terms of their types, the

most frequently detected are hazardous or poisonous

substances, genes or genomes, amino acids, peptides,

and proteins. Figure 2 illustrates the most commonly

identiﬁed types of causes in the literature; however,

this does not necessarily imply that these causes are

the most inﬂuential. Conversely, the ﬁgure highlights

also types that are less studied and need further re-

search and exploration by the scientiﬁc community.

Figure 2: Top 20 breast cancer cause types.

As for factuality prediction, Table 2 summarizes

the performance metrics of the ﬁne-tuned BioBERT

model for factuality prediction across various classes.

The model demonstrates strong overall accuracy

(0.927). It also showed very good performance in the

“Fact” class, with a precision of 0.970, recall of 0.918,

and an F1-score of 0.944, reﬂecting its high accuracy

in identifying factual statements.

For the “Probable” class, the model shows a pre-

cision of 0.526 and a recall of 0.833, resulting in an

F1-score of 0.645. This indicates that while the model

is fairly effective at recognizing probable statements,

there is room for improvement in its precision.

The “Possible” class has a precision of 0.600

and recall of 0.500, leading to an F1-score of 0.545.

Despite achieving an overall accuracy of 0.927, the

model’s performance in predicting possible state-

ments is less robust compared to the ”Fact” class, sug-

gesting that further reﬁnement is needed.

The “Uncommitted” class presents a precision of

0.412 and recall of 0.778, with an F1-score of 0.538.

This reﬂects moderate performance, indicating that

while the model can identify uncommitted statements

to some extent, its precision remains relatively low.

Finally, for the “Counterfact” class, the model ex-

hibits a precision of 0.286 and recall of 0.667, result-

ing in an F1-score of 0.400. This performance sug-

gests that the model has challenges in accurately iden-

tifying counterfactual statements, highlighting an area

for potential improvement.

Table 2: Performance metrics of the ﬁne-tuned BioBERT

for factuality prediction.

Class Precision Recall F1-Score Accuracy

Fact 0.970 0.918 0.944

Probable 0.526 0.833 0.645

Possible 0.600 0.500 0.545 0.927

Uncommitted 0.412 0.778 0.538

Counterfact 0.286 0.667 0.400

Our web interface serves as an interactive explo-

ration platform and consists of a dropdown list where

the user can choose “All” to display the entire knowl-

edge sub-graph or select a speciﬁc segment of the sub-

graph (Figure 3). If the user is interested in viewing

only the causes related to hormones (Figure 3a), they

should select the “Hormones” option. This will dis-

play the causes categorized as “Hormones” each with

the predicted factuality score. Similarly, Figure 3b

highlights factors related to hazardous or poisonous

substances that contribute to breast cancer. These

visualizations incorporate factuality scores, offering

valuable insights into the conﬁdence levels associated

with various causal relationships. These are just two

sub-graphs of breast cancer causes, with the remain-

ing causes accessible through our web application.

Finally, we notice that some breast cancer causes

are cited numerous times while others are cited only

once. Figure 4 depicts the most frequently cited

causes. As shown, some causes have been cited over

200 times. However, among the 1,039 causes we ex-

tracted, over 90% are cited fewer than ﬁve times.

Figure 5 presents the calculated Closeness Scores

for the extracted triplets. The x-axis shows the in-

dex of each triplet after sorting to provide a clear vi-

sual representation of how the scores change across

the most to least relevant or popular triplets. Figure

5 shows that the closeness score starts high, around

0.90, and gradually decreases as the triplet index in-

creases. This indicates that the ﬁrst few triplets are

more closely aligned with the main objective of the

publication. As the index increases, the triplets be-

come less relevant. Plotting all 1039 triplets could

lead to a dense or unreadable chart. Limiting the view

to the top 80 keeps the visualization clean and inter-

pretable.

Unveiling Breast Cancer Causes Through Knowledge Graph Analysis and BioBERT-Based Factuality Prediction

145

(a) Hormones (b) Hazardous or poisonous substances

Figure 3: Segments of the knowledge sub-graph of breast cancer causes with corresponding factuality scores, displayed as

screenshots from our web application.

Figure 4: Causes of breast cancer with the highest citation

frequency (frequency corresponds to the number of research

articles from which the cause and its associated sentence

were extracted).

Figure 5: Closeness Scores: Measurement of the alignment

between triplets and the primary focus of the scientiﬁc arti-

cle in which they are mentioned.

4 DISCUSSION

Our framework (Figure 1) aids in summarizing the

literature on the causes of breast cancer by provid-

ing a factuality score automatically computed using

BioBERT. This factuality is subsequently categorized

as reliable or unreliable based on the number of ci-

tations of the relationship and the alignment of sen-

tences revealing the triplets with those papers. Here,

we discuss the signiﬁcance of our ﬁndings and their

implications.

Figure 2 highlights the most frequently identiﬁed

types of breast cancer causes in the biomedical liter-

ature, such as hazardous substances and genetic fac-

tors. This distribution not only reﬂects existing re-

search priorities but also uncovers underexplored ar-

eas, such as less-studied molecular factors. These

ﬁndings emphasize the need for targeted investiga-

tions in less-documented categories, guiding future

research toward ﬁlling critical gaps in the literature.

The visual segments in Figure 3, focusing on hor-

mones and hazardous substances, demonstrate the

breadth of our knowledge sub-graph and its capacity

to categorize causal relationships with associated fac-

tuality scores. This categorization enables researchers

to focus on speciﬁc domains while maintaining a

comprehensive understanding of the underlying re-

lationships. The integration of factuality scores into

these segments provides clarity on the reliability of

each causal link, enhancing trust in automated knowl-

edge extraction systems.

Figure 4 underscores the uneven citation distribu-

tion among the identiﬁed breast cancer causes. While

some causes are supported by extensive literature,

others are based on minimal citations, indicating vari-

ability in research focus. This variability emphasizes

the importance of combining citation counts with con-

textual evaluation, such as the Closeness Score, to en-

sure a balanced representation of causes in the knowl-

HEALTHINF 2025 - 18th International Conference on Health Informatics

146

edge graph.

As depicted in Figure 5, the Closeness Score pro-

vides a quantitative measure of triplet relevance to the

core objective of each publication. Higher scores cor-

respond to triplets directly aligned with the study’s

primary focus, while lower scores represent periph-

eral relationships. This metric demonstrates the ef-

fectiveness of integrating contextual relevance with

citation-based signiﬁcance, addressing a key limita-

tion in existing knowledge graph methodologies.

As for BioBERT, while it performs well in identi-

fying factual statements, its performance varies across

different classes, particularly concerning precision

and F1-score. The results highlight areas where the

model’s predictive capabilities could be improved, es-

pecially for less frequently occurring classes.

Several aspects of our work are still under de-

velopment. For instance, the model struggles with

detecting negation. For example, in the sentence,

“These ﬁndings do not support a role for HAAs from

meat or NAT2 in the etiology of breast cancer,” the

algorithm incorrectly classiﬁes the factuality of meat

causing breast cancer as probable, disregarding the

negation.

Another challenge is illustrated in the relationship

between tea and breast cancer. In the sentence, “The

role of tea in the aetiology of breast cancer is contro-

versial,” the algorithm fails to interpret the term “con-

troversial” correctly. Although the relationship is a

fact, the algorithm misclassiﬁes it as probable due to

its misunderstanding of the term.

Incorporating crowd-sourcing could improve the

reliability of this knowledge sub-graph. Volunteers

could assist in classifying relationships and validating

the classes detected by the model, with these correc-

tions serving as training examples for further reﬁne-

ment.

5 CONCLUSION

Ultimately, this study showcases the potential of ar-

tiﬁcial intelligence to revolutionize our understand-

ing of breast cancer. We introduce a knowledge sub-

graph to illustrate the causes contributing to breast

cancer. By mapping potential causal relationships

and assigning factuality scores to these links, the sub-

graph provides a valuable resource for researchers and

clinicians. The interactive web application enhances

usability, allowing for customized data exploration.

While offering a comprehensive overview, the sub-

graph also highlights areas for further investigation,

including expanding the representation of viral fac-

tors and elucidating the underlying biological mech-

anisms. Furthermore, we plan to include gene inter-

actions and causal antecedents in our knowledge sub-

graph to better understand the causal pathways lead-

ing to breast cancer.

The results underscore the importance of collabo-

rative efforts across disciplines to address the chal-

lenges posed by breast cancer. By combining this

knowledge sub-graph with clinical data, researchers

can develop more accurate risk prediction models and

improve patient care. Future research could expand

the sub-graph’s scope, validate causal relationships

through experimental studies, and explore integrating

additional data from domain-speciﬁc sources.

Limitations and Future Directions

As a limitation, this study lacks a comprehensive

comparison with state-of-the-art models. To address

this, we are adapting ERNIE and XLNet to our fac-

tuality dataset for prediction, and we will compare

their performance with BioBERT to assess potential

improvements in factuality classiﬁcation.

Another limitation of the current study is the

reliance on a single biomedical knowledge graph

(SemMedDB) for triplet extraction. Future work will

involve integrating additional datasets from domain-

speciﬁc sources, such as clinical trial repositories, ge-

netic databases, and epidemiological studies. These

diverse datasets will enrich the knowledge sub-graph,

capturing a wider range of causal relationships and

improving the comprehensiveness of our approach.

Finally, the inclusion of real-world clinical data,

such as electronic health records (EHRs) and patient

registries, will provide context-speciﬁc insights into

breast cancer causes and their interactions. This inte-

gration will allow for personalized causal inference,

aligning our ﬁndings more closely with clinical prac-

tice.

REFERENCES

American Cancer Society (2024). American cancer society.

https://www.cancer.org/. Accessed: 2024-06-30.

Bahaj, A., Lhazmir, S., Ghogho, M., and Benbrahim, H.

(2022). Covid-19-related scientiﬁc literature explo-

ration: Short survey and comparative study. Biology,

11(8):1221.

Chen, Q., Du, J., Hu, Y., Keloth, V. K., Peng, X., Raja,

K., Zhang, R., Lu, Z., and Xu, H. (2023). Large lan-

guage models in biomedical natural language process-

ing: benchmarks, baselines, and recommendations.

arXiv preprint arXiv:2305.16326.

Chen, X., Chen, M., Shi, W., Sun, Y., and Zaniolo, C.

(2019). Embedding uncertain knowledge graphs. In

Unveiling Breast Cancer Causes Through Knowledge Graph Analysis and BioBERT-Based Factuality Prediction

147

Proceedings of the AAAI conference on artiﬁcial intel-

ligence, volume 33, pages 3363–3370.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D.

(2020). Electra: Pre-training text encoders as dis-

criminators rather than generators. arXiv preprint

arXiv:2003.10555.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.

(2018). Bert: Pre-training of deep bidirectional trans-

formers for language understanding. arXiv preprint

arXiv:1810.04805.

El Haji, H., Sbihi, N., Guermah, B., Souadka, A., and

Ghogho, M. (2024). Epidemiological breast cancer

prediction by country: A novel machine learning ap-

proach. PLOS ONE, 19(8):e0308905.

El Haji, H., Souadka, A., Patel, B. N., Sbihi, N., Ra-

masamy, G., Patel, B. K., Ghogho, M., and Banerjee,

I. (2023). Evolution of breast cancer recurrence risk

prediction: a systematic review of statistical and ma-

chine learning–based models. JCO Clinical Cancer

Informatics, 7:e2300049.

El Handri, K. and Idrissi, A. (2020). Parallelization of topk

algorithm through a new hybrid recommendation sys-

tem for big data in spark cloud computing framework.

IEEE Systems Journal, 15(4):4876–4886.

Guo, Q., Zhuang, F., Qin, C., Zhu, H., Xie, X., Xiong, H.,

and He, Q. (2020). A survey on knowledge graph-

based recommender systems. IEEE Transactions on

Knowledge and Data Engineering, 34(8):3549–3568.

Hangya, V., Saadi, H. S., and Fraser, A. (2022). Improv-

ing low-resource languages in pre-trained multilingual

language models. In Proceedings of the 2022 Confer-

ence on Empirical Methods in Natural Language Pro-

cessing, pages 11993–12006.

Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., Melo,

G. D., Gutierrez, C., Kirrane, S., Gayo, J. E. L.,

Navigli, R., Neumaier, S., et al. (2021). Knowledge

graphs. ACM Computing Surveys (Csur), 54(4):1–37.

Huang, X., Zhang, J., Li, D., and Li, P. (2019). Knowledge

graph embedding based question answering. In Pro-

ceedings of the twelfth ACM international conference

on web search and data mining, pages 105–113.

Jiang, N. and de Marneffe, M.-C. (2021). He thinks he

knows better than the doctors: Bert for event factu-

ality fails on pragmatics. Transactions of the Associa-

tion for Computational Linguistics, 9:1081–1097.

Kejriwal, M. (2020). Knowledge graphs and covid-19: op-

portunities, challenges, and implementation. Harv.

Data Sci. Rev, 11:300.

Kilicoglu, H., Rosemblat, G., and Rindﬂesch, T. C. (2017).

Assigning factuality values to semantic relations ex-

tracted from biomedical research literature. PloS one,

12(7):e0179926.

Kilicoglu, H., Shin, D., Fiszman, M., Rosemblat, G., and

Rindﬂesch, T. C. (2012). Semmeddb: a pubmed-

scale repository of biomedical semantic predications.

Bioinformatics, 28(23):3158–3160.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,

and Kang, J. (2020). Biobert: a pre-trained biomedi-

cal language representation model for biomedical text

mining. Bioinformatics, 36(4):1234–1240.

Li, P., Castelo, N., Katona, Z., and Sarvary, M. (2022). Lan-

guage models for automated market research: A new

way to generate perceptual maps. Available at SSRN

4241291.

Rossi, A., Barbosa, D., Firmani, D., Matinata, A., and Meri-

aldo, P. (2021). Knowledge graph embedding for link

prediction: A comparative analysis. ACM Transac-

tions on Knowledge Discovery from Data (TKDD),

15(2):1–49.

Soares, L. B., FitzGerald, N., Ling, J., and Kwiatkowski,

T. (2019). Matching the blanks: Distributional

similarity for relation learning. arXiv preprint

arXiv:1906.03158.

Sosa, D. N. and Altman, R. B. (2022). Contexts and contra-

dictions: a roadmap for computational drug repurpos-

ing with knowledge inference. Brieﬁngs in Bioinfor-

matics, 23(4):bbac268.

Stanovsky, G., Eckle-Kohler, J., Puzikov, Y., Dagan, I., and

Gurevych, I. (2017). Integrating deep linguistic fea-

tures in factuality prediction over uniﬁed datasets. In

Proceedings of the 55th Annual Meeting of the Associ-

ation for Computational Linguistics (Volume 2: Short

Papers), pages 352–357.

Sun, Z. (2023). A short survey of viewing large lan-

guage models in legal aspect. arXiv preprint

arXiv:2303.09136.

Veyseh, A. P. B., Nguyen, T. H., and Dou, D. (2019). Graph

based neural networks for event factuality prediction

using syntactic and semantic structures. In Proceed-

ings of the 57th Annual Meeting of the Association for

Computational Linguistics, pages 4393–4399.

Waltman, L. (2016). A review of the literature on citation

impact indicators. Journal of informetrics, 10(2):365–

391.

Wang, Z., Nie, H., Zheng, W., Wang, Y., and Li, X. (2023).

A novel tensor learning model for joint relational

triplet extraction. IEEE Transactions on Cybernetics,

54(4):2483–2494.

World Health Organization (2024). World health

organization. Accessed: 01:08:2024, url =

https://www.who.int/en,.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,

R. R., and Le, Q. V. (2019). Xlnet: Generalized au-

toregressive pretraining for language understanding.

Advances in neural information processing systems,

32.

Zhang, T., Wu, F., Katiyar, A., Weinberger, K. Q., and Artzi,

Y. (2020). Revisiting few-sample bert ﬁne-tuning.

arXiv preprint arXiv:2006.05987.

Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., and

Liu, Q. (2019). Ernie: Enhanced language repre-

sentation with informative entities. arXiv preprint

arXiv:1905.07129.

Zheng, W., Wang, Z., Yao, Q., and Li, X. (2021). Wrtre:

Weighted relative position transformer for joint entity

and relation extraction. Neurocomputing, 459:315–

326.

HEALTHINF 2025 - 18th International Conference on Health Informatics

148