Optimizing Retrieval-Augmented Generation of Medical Content
for Spaced Repetition Learning
Jeremi I. Kaczmarek 1,2,3, Jakub Pokrywka 2,3,a, Krzysztof Biedalak 4, Grzegorz Kurzyp 3 and Łukasz Grzybowski 2,5
1 Poznań University of Medical Sciences, Poland
2 Adam Mickiewicz University, Poland
3 Wydawnictwo Naukowe PWN, Poland
4 SuperMemo World, Poland
5 Association for Research and Applications of Artificial Intelligence, Poland
a https://orcid.org/0000-0003-3437-6076
Keywords:
Medical Education, Large Language Models, Spaced Repetition, Retrieval-Augmented Generation,
Information Retrieval.
Abstract:
Advances in Large Language Models have revolutionized medical education by enabling scalable and efficient learning solutions. This paper presents a pipeline that employs a Retrieval-Augmented Generation (RAG) system to generate commentary for questions from Poland's State Specialization Examination (PES) based on verified resources. The system integrates these generated comments and source documents with a spaced repetition learning algorithm to enhance knowledge retention while minimizing cognitive overload. By employing a refined retrieval system, a query rephraser, and an advanced reranker, our modified RAG solution prioritizes accuracy over efficiency. Rigorous evaluation by medical annotators demonstrates improvements in key metrics such as document relevance, credibility, and logical coherence of the generated content, as confirmed by a series of experiments presented in the paper. This study highlights the potential of RAG systems to provide scalable, high-quality, and individualized educational resources, particularly for non-English-speaking users.
1 INTRODUCTION
Recent technological developments have transformed
many sectors, including education, where innova-
tive tools now play a critical role in facilitating the
learning process. This paper presents the applica-
tion of Retrieval-Augmented Generation (RAG) sys-
tems in medical education, focusing on optimizing
learning for Polish medical specialists by integrating
spaced repetition methodologies. The proposed so-
lution merges advanced machine learning algorithms
with medical knowledge grounded in Evidence-Based
Medicine (EBM), ensuring accuracy and credibility.
Our contribution is a pipeline in which we acquire questions from the State Specialization Examination (PES, Państwowy Egzamin Specjaliza-
cyjny) and prepare an online course specifically tai-
lored to aspiring specialists. The course includes
RAG-based comments derived from sources within
our specialised search engine, which contains high-
quality medical content.
Firstly, the questions and sets of possible answers
are presented to a user. After a student submits their
response, they are shown the correct answer, accom-
panied by a Large Language Model (LLM)-based ex-
planation and references to medical content. Follow-
ing a learning session with multiple items, these ques-
tions are organized using a spaced repetition algo-
rithm for subsequent review sessions.
Our approach is a scalable, cost-effective sys-
tem that enhances knowledge retention and minimizes
cognitive overload. Our methodology prioritizes gen-
erated content verification by humans, leveraging cu-
rated materials and collaboration with medical spe-
cialists. By employing structured datasets, verified
sources, and an optimized pipeline, our system min-
imizes risks associated with generative models, such
as hallucinations (Huang et al., 2023), while ensuring
the traceability of information back to authoritative
medical sources.
The motivation for this work stems from the press-
ing need to provide healthcare professionals with
tools that facilitate efficient, reliable, and individu-
alized learning experiences. Current medical ques-
tion banks in Poland often rely on costly expert an-
notations, making them economically viable only
for the most popular exams. By utilizing a RAG-
based system, we demonstrate the potential to cre-
ate high-quality, scalable, and affordable resources
that address the specific needs of medical education.
This approach represents a significant advancement in
making specialized training more accessible and ef-
fective for future healthcare providers.
2 RELATED WORK
2.1 NLP in Medical Education
New technologies are becoming effective learning
tools for medical students. One prominent example is
virtual reality (VR) simulations, which enable learn-
ers to practice complex procedures and enhance their
technical skills in a risk-free environment. Several
studies demonstrate that VR-based training improves
surgical performance, reduces errors, and shortens
procedure times (Seymour et al., 2002; Ahlberg et al.,
2007; Colt et al., 2001). Similarly, virtual patient sim-
ulators provide interactive scenarios that help medi-
cal students develop diagnostic reasoning and clini-
cal decision-making skills (Mestre et al., 2022; Horst
et al., 2023).
Another example of a recent educational tool is
ClinicalKey AI (Elsevier, 2025), which integrates ad-
vanced information retrieval and artificial intelligence
to assist clinicians and students in accessing critical
medical knowledge. However, developing and im-
plementing comprehensive online learning resources
for medical education remains challenging due to in-
adequate infrastructure, limited faculty expertise, and
other barriers (O’Doherty et al., 2018). Furthermore,
ethical considerations are crucial; maintaining trans-
parency, fairness, and responsible technology use is
vital for building trust and ensuring patient safety (Weidener and Fischer, 2024).
In recent years, LLMs specialized in medicine
have shown tremendous potential in supporting med-
ical education (Saab et al., 2024; Labrak et al., 2024;
OpenMeditron, 2024; ProbeMedicalYonseiMAILab,
2024; Ankit Pal, 2024; johnsnowlabs, 2024). These
models can serve as virtual assistants, providing im-
mediate feedback on complex questions and gener-
ating tailored educational content. Their capabilities
are tested against established medical benchmarks,
where some have achieved performance levels com-
parable to or exceeding standard baselines (Singhal
et al., 2023).
2.2 Polish Medical NLP
LLMs have become a key focus in AI, demonstrating
strong performance in processing large datasets and
executing natural language processing (NLP) tasks
like text generation, translation, and question answer-
ing. These capabilities make them promising tools for
improving medical practice by enhancing diagnostic
accuracy, predicting disease progression, and support-
ing clinical decisions. By analyzing extensive medi-
cal data, LLMs can rapidly acquire expertise in fields
like radiology, pathology, and oncology. Fine-tuning
with domain-specific literature further enables them
to remain current and adapt to different languages
and contexts, potentially expanding global access to
medical knowledge. However, integrating LLMs into
healthcare presents challenges, including the com-
plexity of medical language and diverse clinical con-
texts that may hinder their ability to fully capture the
nuances of practice. Critically, ensuring model fair-
ness and protecting patient data privacy are essential
for responsible and equitable healthcare (Karabacak
and Margetis, 2023).
Several studies have explored the potential of
LLMs for Polish-language medical applications. No-
tably, research has investigated the performance of
ChatGPT on the PES across various specialties, in-
cluding dermatology (Lewandowski et al., 2023),
nephrology (Nicikowski et al., 2024), and periodon-
tology (Camlet et al., 2025). More extensive re-
search has assessed LLM performance across all
PES specialties (Pokrywka et al., 2024). Further
work has introduced a new dataset for cross-lingual
medical knowledge transfer assessment, comparing
various LLMs on the PES, Medical Final Examina-
tion (LEK), and Dental Final Examination (LDEK).
Additionally, this research has examined LLM re-
sponse discrepancies between Polish and English ver-
sions of general medical examination questions, us-
ing high-quality human translations as a benchmark
(Grzybowski et al., 2024). Moreover, a model for
automatically parametrizing Polish radiology reports
from free text using language models has been pro-
posed, aiming to leverage the advantages of both
structured reporting and natural language descriptions
(Obuchowski et al., 2023).
2.3 Spaced Repetition Algorithms
Spaced repetition is a widely recognized learning
technique designed to optimize memory retention
through systematic review scheduling. The methodol-
ogy predicts forgetting curves for individual learners
and specific pieces of information, prompting active
reviews at the optimal time to minimize the number of
repetitions while ensuring a high retention rate. This
approach has proven particularly effective for item-
ized knowledge domains, such as language acquisi-
tion, computer science, and medicine.
The concept of spaced repetition in computer-
aided learning has been extensively explored since the
1980s when early experiments led to the development
of the SM-2 algorithm (Woźniak, 1990). Succes-
sive advancements incorporated individualized learn-
ing metrics, such as the theory of memory compo-
nents, which was fully implemented in the SM-17
algorithm (Woźniak et al., 1995). More recently, research (Pokrywka et al., 2023a) demonstrated that applying a negative exponential function as the output forgetting curve, proposed by Woźniak et al. (2005), significantly enhances the performance of machine learning models such as Long Short-Term Memory (LSTM) networks.
In addition to algorithmic advancements, large-
scale machine-learning approaches have been inte-
grated into spaced repetition systems. For exam-
ple, Half-Life Regression introduced a method to
optimize repetition intervals using real-world learn-
ing data (Settles and Meeder, 2016). Similarly, a
Transformer-based model has been employed within
a Deep Reinforcement Learning framework to further
refine repetition scheduling (Xiao and Wang, 2024).
These advancements in spaced repetition method-
ologies and their application to digital platforms un-
derscore their critical role in high-volume learning
tasks, particularly in domains that require robust
knowledge retention, such as medical education.
3 PES EXAMINATION
3.1 Examinations for Physicians and
Dentists in Poland
To practice independently and attain specialist cer-
tification, physicians and dentists in Poland must,
among other requirements, pass specific examina-
tions. These include the LEK (Lekarski Egzamin Końcowy, Medical Final Examination), LDEK (Lekarsko-Dentystyczny Egzamin Końcowy, Dental Final Examination), LEW (Lekarski Egzamin Weryfikacyjny, Medical Verification Examination), LDEW (Lekarsko-Dentystyczny Egzamin Weryfikacyjny, Dental Verification Examination), and PES (Państwowy Egzamin Specjalizacyjny, State Specialization Examination; see https://www.cem.edu.pl/egzaminy_l.php).
LEK and LDEK serve as licensure examinations
for domestic graduates, while LEW and LDEW ap-
ply to individuals trained outside the European Union.
The PES, in contrast, is a certification exam required
to attain specialist status in a medical or dental field.
Typically, candidates taking the PES have already
gained licensure, completed a 12- or 13-month post-
graduate internship, and worked as resident doctors
in a specialist setting for a period of four to six years.
In addition to passing the specialization examination,
they must complete mandatory courses, internships,
and perform a required set of medical procedures rel-
evant to their discipline to meet all certification crite-
ria. This article focuses on the PES, particularly its
written component.
3.2 PES
The PES evaluates candidates’ knowledge and com-
petencies to ensure they meet the standards required
for specialized practice. It is conducted twice a year
and comprises two main components:
Written Examination. This part consists of 120
specialty-specific questions. Each question has
five possible answers, with only one correct op-
tion. To pass, candidates must achieve a minimum
score of 60%. Since 2022, those scoring 70% or
higher are exempted from the oral examination.
The confidentiality of examination questions prior
to the test is maintained to ensure the integrity of
the process. An example question translated into
English is presented in Figure 1.
Oral Examination. Candidates who do not meet
the exemption threshold in the written exam par-
ticipate in the oral component. The structure
of this section varies depending on the specialty
but typically involves case-based discussions that
evaluate clinical reasoning, diagnostic skills, and
decision-making abilities.
While the oral examination is an integral part of
the certification process, this article focuses on the
written component due to its relevance to the pre-
sented RAG solution. The PES is widely regarded
as the most extensive and demanding knowledge ver-
ification in the career of a medical professional in
Poland.
4 LEARNING PLATFORMS
OVERVIEW
In this section, we present the existing platforms into which we embedded our PES preparation courses.
Figure 1: PES question. This example is an English trans-
lation of question 10 from the Internal Medicine exam ad-
ministered during the Fall 2024 session. The translation was
prepared by the authors, with the original exam questions
written in Polish.
4.1 Medico PZWL
The Medical Knowledge Platform Medico PZWL (https://medico.pzwl.pl/) is a comprehensive digital resource for Polish doctors, supporting education, clinical practice, and decision-making. Owned by Polish Scientific Publishers PWN (https://pwn.pl/), which has over 80 years of experience in medical education, the platform provides access to an extensive medical knowledge base.
The knowledge base comprises over 120,000 doc-
uments, including exclusive content from PWN, as
well as materials from other publishers, medical so-
cieties, and research institutions. Resources include
textbooks, journal articles, clinical guidelines and rec-
ommendations, procedural schemes, case studies, sur-
gical records, podcasts, formularies, and legal analy-
ses.
At its core, Medico features an advanced search
engine designed to retrieve precise information on
specific clinical issues. Search results provide rel-
evant excerpts from various publications, ensuring
quick access to the most pertinent insights. This
keyword-driven search engine incorporates a rerank-
ing mechanism, refined through extensive research
and training on over 500,000 expert-annotated med-
ical cases from doctors, paramedics, and medical stu-
dents (Pokrywka et al., 2023b).
For the development of PES content, RAG queries
were initially supplemented with documents retrieved
from Medico’s production search engine. This re-
trieval system underwent multiple modifications and
enhancements based on feedback from evaluations of
the generated content. Since real-time RAG responses
were not required, computationally intensive search
and reranking improvements were feasible. The final
validated comments, along with test materials, were
integrated into the SuperMemo spaced repetition ap-
plication to create an optimized learning experience.
4.2 SuperMemo
SuperMemo (https://supermemo.com/) is a world pioneer in applying spaced
repetition to computer-aided learning. Its research has
been used directly by or inspired the development of
this method in other e-learning apps, including Anki,
Quizlet, or Duolingo (Woźniak, 2018).
The SuperMemo algorithm is based on predicting forgetting curves individually for each learner and for each piece of information they memorize. Repetitions (active reviews) are planned accordingly in order to minimize their number while reaching the desired level of the learner's knowledge retention. This is achieved by in-
voking a repetition of information when its estimated
recall probability falls to a required level, typically
90%. The learner’s recall of each piece of informa-
tion is graded on the scale of "I don’t know" :(, "I
am not sure or almost right" :|, "I know" :) (see Fig-
ure 3a). A full history of grades is recorded and used
to adapt a general memory model to the individual
characteristics of a learner. After each repetition, the model is updated and a new forgetting curve for the just-reviewed information is estimated. This allows
the algorithm to plan the next repetition date.
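The production algorithm is considerably more elaborate, but the scheduling principle can be illustrated with a minimal Python sketch. It assumes the negative exponential forgetting curve R(t) = exp(-t/S) mentioned in Section 2.3 and triggers the next repetition when predicted recall drops to the 90% threshold; the stability update rule and the function names are illustrative assumptions, not the published SuperMemo algorithm.

```python
import math

def days_until_review(stability_days: float, target_recall: float = 0.9) -> float:
    """Time until estimated recall falls to the target level, assuming R(t) = exp(-t / S)."""
    return -stability_days * math.log(target_recall)

def update_stability(stability_days: float, grade: str) -> float:
    """Toy update rule: successful recall strengthens the memory, failure resets it.
    Grades mirror the app's ":(", ":|", ":)" scale."""
    if grade == "know":        # ":)"
        return stability_days * 2.0
    if grade == "unsure":      # ":|"
        return stability_days * 1.2
    return 1.0                 # ":(" - relearn with a short interval

# Example: with an estimated stability of 10 days, the item becomes due after
# roughly -10 * ln(0.9), i.e. about 1.05 days; a correct recall then doubles its stability.
```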
Spaced repetition works particularly well for item-
ized knowledge in areas requiring high-volume learn-
ing like languages, computer science, or medicine.
The learning content in SuperMemo comprises cu-
rated courses as well as memocard (augmented flash-
card) collections authored and shared by users.
At the time of writing this article, the range of PES courses in SuperMemo covers actual questions from 4 years of past exams in 22 specializations, altogether around 18 thousand items (see Figure 2). Every question is accompanied by a comment generated by an LLM, augmented with a RAG setup supplying relevant source documents from the Medico database (see Figure 4). Source documents are quoted and linked in the comments (see Figures 3a and 3b).
5 EXAM ACQUISITION
Figure 2: PES courses in the SuperMemo app.
PES exams are publicly released after they are conducted, primarily published on the CEM (Centrum Egzaminów Medycznych, Medical Examination Centre) website (https://www.cem.edu.pl/) as HTML pages. We used the 2023 and 2024 editions for our learning system. However, the 2021 and 2022 exams were only available as PDFs without a text layer, published by NIL (Naczelna Izba Lekarska, Supreme Medical Chamber; https://nil.org.pl/).
We obtained permission from CEM to use PES
materials in our educational platform, ensuring full
legal compliance. Below, we describe the process of
acquiring and processing these exams.
Exams from the years 2021 and 2022 were converted using optical character recognition performed by GPT-4o in JSON mode. A predefined JSON schema structured the ex-
tracted data, including test numbers, questions, an-
swer choices, correct answers, and formatting.
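The exact prompt and schema are not reproduced in this paper; a minimal sketch of this extraction step, assuming the official OpenAI Python SDK and a simplified, hypothetical JSON schema (question number, question text, answer choices, correct answer), could look as follows.

```python
import base64
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCHEMA_HINT = (
    'Return JSON: {"questions": [{"number": int, "question": str, '
    '"answers": {"A": str, "B": str, "C": str, "D": str, "E": str}, "correct": str}]}'
)

def extract_exam_page(image_path: str) -> dict:
    """Ask GPT-4o to transcribe one scanned exam page into structured JSON."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},   # JSON mode
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all exam questions on this page. " + SCHEMA_HINT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```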
Tests from the years 2023 and 2024, along with
correct answers, were downloaded from the CEM
website as HTML quizzes using custom Python
scraping scripts.
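The scraping scripts depend on the CEM page markup, which is not described here; the sketch below only shows the general approach with requests and BeautifulSoup, and all CSS selectors and attribute names are hypothetical placeholders.

```python
import requests
from bs4 import BeautifulSoup  # assumes requests and beautifulsoup4 are installed

def scrape_exam(url: str) -> list[dict]:
    """Download one CEM exam quiz page and extract its questions.
    The selectors below are hypothetical; the real page structure must be inspected first."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.select("div.question"):           # hypothetical selector
        items.append({
            "question": block.select_one(".stem").get_text(strip=True),
            "answers": [a.get_text(strip=True) for a in block.select(".answer")],
            "correct": block.get("data-correct"),        # hypothetical attribute
        })
    return items
```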
We filtered out items that included visual content
like radiological images. Additionally, we removed
questions marked as inconsistent with modern med-
ical knowledge. After processing, our dataset com-
prised 17,843 questions from 149 exams across 46
medical specialties.
6 CONTENT GENERATION
PIPELINE OVERVIEW
Our system, based on the RAG paradigm (Lewis
et al., 2020), retrieves relevant documents for each
PES question and generates concise explanations. Ini-
tially, it comprised two core components: a retrieval
engine and an answer generation module. During de-
velopment, we introduced a Query Rephraser to en-
hance search effectiveness, an optional but highly
beneficial addition. Each component is detailed in
the following subsections, and Figure 5 illustrates
the pipeline. We implemented our system using the
Python 3 programming language without any RAG
frameworks such as LangChain.
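At a high level, the pipeline chains the three components described in the following subsections. The sketch below reflects that structure only; the data class and function names are ours, and each stage is passed in as a callable so that the skeleton stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExamItem:
    question: str
    answers: dict[str, str]   # e.g. {"A": "...", ..., "E": "..."}
    correct: str              # letter of the correct answer

def build_comment(item: ExamItem,
                  rephrase: Callable[[ExamItem], str],
                  retrieve: Callable[[str], list[str]],
                  rerank: Callable[[str, list[str]], list[str]],
                  generate: Callable[[ExamItem, list[str]], str]) -> dict:
    """Chain the stages described in this section; each callable is sketched below."""
    query = rephrase(item)                      # Section 6.1, Query Rephraser
    candidates = retrieve(query)                # Section 6.2, SOLR keyword search
    top_docs = rerank(query, candidates)[:10]   # Section 6.2, cross-encoder reranker
    return {"query": query,
            "documents": top_docs,
            "comment": generate(item, top_docs)}  # Section 6.3
```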
6.1 Query Rephraser
Our approach builds on an existing search engine used
in real-world medical applications. In production,
user queries are typically short (single keywords or a
single natural language question) and address a single
issue. In contrast, PES exam questions often consist of multiple sentences, include several possible answers, and cover multiple topics. Directly inputting
the full exam question into the search engine resulted
in suboptimal retrieval performance.
To enhance retrieval quality, we introduced a
Query Rephraser, an LLM-based module that trans-
forms exam questions into optimized search queries.
It processes the exam question, answer choices, and
the correct answer, generating a concise, targeted
query to enhance document retrieval. This query is
then processed by our Retrieval System, described in
the next subsection.
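The exact rephrasing prompt is not reproduced in this paper; the sketch below shows how such a module could be implemented with the OpenAI Python SDK, with the prompt wording being an illustrative assumption rather than our production prompt.

```python
from openai import OpenAI

client = OpenAI()

def rephrase_query(question: str, answers: dict[str, str], correct: str) -> str:
    """Turn a multi-sentence PES item into a short, targeted search query."""
    answer_block = "\n".join(f"{k}. {v}" for k, v in answers.items())
    prompt = (
        "You prepare search queries for a Polish medical search engine.\n"
        f"Exam question:\n{question}\n\nAnswer choices:\n{answer_block}\n\n"
        f"Correct answer: {correct}\n\n"
        "Write ONE concise Polish search query (a few keywords or a short question) "
        "that targets the knowledge needed to justify the correct answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```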
6.2 Retrieval System
Our search engine is a specialized tool for health-
care providers. Its knowledge base consists of au-
thoritative Polish sources authored by medical spe-
cialists and researchers, excluding patient-oriented
content. Built on Apache SOLR, the retrieval sys-
tem operates on a keyword-based approach enhanced
by an advanced text-analysis pipeline with thousands
of domain-specific medical synonyms. Addition-
ally, a cross-encoder-based reranker re-sorts the top
100–200 results. The authors of (Pokrywka et al., 2023b)
indicated that this approach outperforms a purely bi-
encoder-based retrieval pipeline in a medical setting
similar to ours.
In production, the search engine must respond
within one second, including retrieval and reranking.
However, efficiency is not a concern for PES prepara-
tory materials, as courses are generated once, making
retrieval quality the priority. To enhance relevance at
the cost of computational time, we made several mod-
ifications. First, we expanded the reranker’s scope to
consider up to 200 candidate documents. Second, we
adjusted the reranking context. In the production sys-
tem, each indexed document corresponds to a book
paragraph of approximately 500 words, but the user-
facing snippet is limited to about 140 characters. For
efficiency and snippet-level optimization, the produc-
tion reranker relies only on snippet text rather than
entire paragraphs. In contrast, for this PES-focused
material, we decided to rerank using full paragraphs.
(a) A PES multiple-choice test with a RAG-generated comment in the SuperMemo app. The correct answer (green background) and the LLM-generated explanation of the correct answer (blue background) are revealed after the student selects an answer from the A, B, C, D, and E choices. The links lead to verified medical documents.
(b) Example sources of LLM-generated information from medical books, articles, and certified medical websites.
Figure 3: A PES question in the SuperMemo app: (a) the multiple-choice test with a RAG-generated comment; (b) the source documents linked from that comment.
We also employed a more powerful reranker,
dadas/polish-reranker-large-ranknet (Dadas
and Grębowiec, 2024), which delivers substantially
higher quality but is not used in production due to ef-
ficiency constraints. These measures significantly im-
proved the relevance of the retrieved documents. Ta-
ble 2 refers to this refined component as the Refined
Reranker, whereas the standard production pipeline
is denoted as the Base RAG system. For imple-
mentation, we used the SentenceTransformers library
(sbert.net).
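A minimal sketch of this refined reranking stage, assuming the CrossEncoder interface of SentenceTransformers and the reranker named above (the exact Hugging Face Hub identifier may differ), is shown below; the SOLR retrieval that produces the candidate paragraphs is omitted.

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, paragraphs: list[str], top_k: int = 10) -> list[str]:
    """Re-sort up to 200 SOLR candidates by cross-encoder relevance to the query.
    Scoring full ~500-word paragraphs rather than 140-character snippets is the
    quality/latency trade-off described above."""
    # Reranker named in the text; treat the identifier as an assumption here.
    model = CrossEncoder("dadas/polish-reranker-large-ranknet", max_length=512)
    scores = model.predict([(query, p) for p in paragraphs])
    ranked = sorted(zip(paragraphs, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```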
6.3 Comment Generation
We chose to employ GPT-4o (GPT, 2024) via the OpenAI API as our large language model, following ini-
tial feasibility studies (Pokrywka et al., 2024; Grzy-
bowski et al., 2024) that highlighted its strong per-
formance on the PES exam task, albeit with some re-
maining inaccuracies. To mitigate this, each question
prompt provided the full exam question text, the set
of possible answers, and the known correct answer.
The model was then instructed to justify the correct-
ness of that answer by leveraging the top 10 retrieved
documents.
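A sketch of this generation step is shown below; it assumes the OpenAI Python SDK, and the prompt wording is an illustrative stand-in for the prompt developed in the pilot studies described next.

```python
from openai import OpenAI

client = OpenAI()

def generate_comment(question: str, answers: dict[str, str], correct: str,
                     documents: list[str]) -> str:
    """Generate an explanation of the correct answer grounded in the retrieved documents."""
    sources = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    answer_block = "\n".join(f"{k}. {v}" for k, v in answers.items())
    prompt = (
        f"Exam question:\n{question}\n\nAnswer choices:\n{answer_block}\n\n"
        f"Correct answer: {correct}\n\nSource documents:\n{sources}\n\n"
        "Write a concise commentary in Polish for a medical specialist explaining why "
        "the correct answer is right and the others are not. Cite sources as [n]; if a "
        "statement comes from your own knowledge, say so explicitly."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content
```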
Identifying an optimal prompt for the LLM re-
quired several pilot studies with contributions from
Figure 4: A single source document from Medico, opened from a link in an LLM-generated comment.
both a computer scientist and a medical expert. En-
suring the generated text was of appropriate length
and style for medical specialists was crucial. In par-
allel, we conducted experiments to determine the op-
timal number of documents to supply. Specifically,
we tested a scenario where the LLM relied solely on
external documents, disregarding its internal knowl-
edge, and determined the correct answer without prior
access to it. We observed that supplying exactly 10
documents yielded optimal results—fewer led to a
significant performance drop, while more offered no
substantial improvement. This informed our decision
to consistently retrieve 10 documents. Final result
evaluation, independent of document count, was con-
ducted by specialist annotators, as detailed in the fol-
lowing sections.
7 EVALUATION
Improving and validating any system demands reli-
able methods for assessing its performance. To sys-
tematically and objectively evaluate our solution, we
created a tailored assessment framework. This ap-
proach was consistently applied to successive itera-
tions of the RAG system during its development, en-
abling comparisons between versions and analyses of
how our modifications to the generation pipeline in-
fluenced its overall performance.
Polish medical exam preparatory courses often in-
clude question banks accompanied by expert-written
commentary, which served as our gold standard.
Figure 5: Pipeline overview of selecting relevant documents
and generating an explanation of the correct answer. The
example of the generated answer is given in Figure 3a and
the example sources are given in Figure 3b.
Through a thorough analysis of these resources and
extensive user consultations, we identified several
critical aspects that matter most to physicians. The
evaluation model for PES commentary was designed
to address the specific needs of the medical field while
considering the architecture of the RAG system, in-
cluding both its inputs and outputs. Below, we present
a detailed overview of the framework’s structure and
the rationale behind its components. A summary of
its key assumptions is provided in Table 1.
Table 1: Parameters utilised by the evaluation framework.
Parameter | Definition | Exemplary question | Rating
Sensitivity | The ability to identify actual difficulties in the question. | Does the model identify all real difficulties in the question? | Scale 1–4
Specificity | Ability to ignore elements of the question that do not constitute difficulties. | Does the model incorrectly classify elements of the question that are not difficulties as such? | Scale 1–4
Completely relevant documents | Documents that contain all the information necessary to answer the question correctly. | - | Number of documents out of 10
Partially relevant documents | Documents that contain some, but not all, of the information necessary to answer the question correctly. | - | Number of documents out of 10
Relevant documents (total) | Documents that contain information necessary to answer the question correctly. | - | Number of documents out of 10
Credibility | Consistency in citing appropriate sources when they are available. | For each statement, does the model reference all available relevant sources? Does the model acknowledge using internal knowledge when no suitable sources are available? | Scale 1–4
Accuracy | Authenticity of statements based on paraphrased sources and the model's knowledge. | Does the model correctly paraphrase statements from the sources it references? Are statements truthful? | Scale 1–4
Logic | Consistency of conclusions drawn in the context of the question with logic. | Are the conclusions consistent with the cited general statements and logical principles? | Scale 1–4
Completeness/Depth | Comprehensiveness and detail of the explanation. | Does the commentary address all identified difficulties? Does it sufficiently elaborate on them? | Scale 1–4
Conciseness | Brevity of the commentary. | Is the commentary overly long? Does it address unnecessary issues? | Scale 1–4
Communicativeness/Readability | Readability and structure of the commentary. | Is the commentary written in a clear and understandable manner? | Scale 1–4
Prioritization | Prioritization of higher-value sources in cases of inconsistencies. | Does the model disregard less valuable sources and justify its prioritization of one source over another? | Scale 1–4
Table 2: Evaluation results from three experiments on a sample of 200 PES questions across five medical specialties: internal
medicine, pediatrics, family medicine, psychiatry, and general surgery. A team of five medical students in their clinical years assessed the question reports during the development of the RAG pipeline. Metric definitions are provided in Table 1.
Parameter Base RAG +Query Rephraser +Refined Reranker
Sensitivity (1–4) 3.94 ± 0.23 3.61 ± 0.56 3.76 ± 0.53
Specificity (1–4) 3.56 ± 0.50 3.80 ± 0.44 3.96 ± 0.39
Completely relevant docs (/10) 2.13 ± 2.39 2.75 ± 2.67 3.48 ± 2.73
Partially relevant docs (/10) 2.46 ± 2.44 2.69 ± 2.36 3.35 ± 2.15
Total relevant docs (/10) 4.59 ± 3.18 5.44 ± 2.91 6.83 ± 2.70
Credibility (1–4) 2.91 ± 0.90 2.73 ± 0.84 3.23 ± 0.65
Accuracy (1–4) 3.46 ± 0.90 3.45 ± 0.83 3.56 ± 0.62
Logic (1–4) 3.67 ± 0.71 3.70 ± 0.56 3.86 ± 0.51
Completeness/Depth (1–4) 3.59 ± 0.67 3.73 ± 0.59 3.87 ± 0.37
Conciseness (1–4) 3.44 ± 0.69 3.43 ± 0.62 3.69 ± 0.56
Communicativeness/Readability (1–4) 3.78 ± 0.31 3.80 ± 0.45 3.93 ± 0.30
Prioritization (1–4) 3.50 ± 0.84 2.60 ± 1.30 3.43 ± 0.85
7.1 Evaluation Framework
The approach employs a 1–4 scale to assess various
parameters, each describing a certain desired quality.
A score of 1 indicates a complete failure to meet ex-
pectations, while 4 represents full alignment with cri-
teria, showing no significant shortcomings. Scores
of 2 and 3 cover intermediate performance levels,
with 2 reflecting predominantly negative aspects and
3 emphasizing mostly positive ones. Neutral scores
were excluded to ensure that the annotators adopted
a definitive stance. One parameter allowed the an-
notators to abstain when evaluation was not feasible.
Notably, this scale was not used to assess document
relevance, which is discussed separately.
7.1.1 Identification of Key Difficulties
The first area of the evaluation was the LLM's correctness
in identifying key difficulties of a question. This was
analyzed using two markers: sensitivity and speci-
ficity, named after the well-established metrics used
in medicine (to evaluate diagnostic test performance)
and machine learning (to measure detection effective-
ness). Note that unlike their traditional counterparts,
these parameters in the described framework are not
calculated based on the number of true positives, false
positives, true negatives, and false negatives. Instead,
they are assessed holistically on a 1–4 scale by hu-
man annotators, who evaluate how well the model’s
responses align with expectations.
Sensitivity: refers to the model’s ability to iden-
tify the true challenges within a question, focusing
on essential elements needed to understand and
answer it effectively.
Specificity: evaluates the model’s capacity to dis-
regard irrelevant or superficial details, ensuring it
emphasizes meaningful content.
There was a risk of neglecting critical aspects
or overemphasizing trivial points, undermining com-
mentary quality. Therefore our goal was to ensure that
the LLM tailored its output to the users’ level of ex-
pertise and underscored the key aspects of the ques-
tion in a concise comment.
In most iterations, the LLM was prompted to identify
key difficulties in a given question. Sensitivity and
specificity served as metrics to gauge its effectiveness.
Alongside other indicators, these metrics helped as-
sess the model’s comprehension of exam questions.
Additionally, identifying key difficulties served as
an enabler for the LLM to generate precise search
queries and retrieve the most relevant documents, en-
suring necessary information was included while ir-
relevant sources were not. Since the introduction
of the Query Rephraser module into the generation
pipeline, the generated query was evaluated for im-
portant and irrelevant elements. Based on this assess-
ment, annotators rated its sensitivity and specificity.
7.1.2 Document Relevance
Modern medicine is evidence-based, and consulted
physicians emphasized the importance of ensuring
that outputs do not contain false or unverified infor-
mation. To address this, GPT was prompted to gen-
erate responses based on source material, with doc-
ument relevance serving as a key metric for evaluat-
ing the quality of the provided literature. During the
evaluation process, we observed that overall appraisal
of the comments was most strongly correlated with
this parameter. Enhancements that directly increased
the number of relevant sources also led to improved
scores across all other assessed areas. Annotators
noted that when the number of inadequate sources
exceeded relevant ones, the comments tended to be
vague and included unrelated information, suggesting
that GPT prioritized referencing random sources over
providing no citations at all. We conclude that doc-
ument relevance is the most critical indicator of the
overall RAG output quality.
Annotators classified documents as:
Completely Relevant: Addressed all key diffi-
culties identified by annotators.
Partially Relevant: Covered some, but not all,
necessary aspects.
Irrelevant: Lacked relevance to the question.
PES questions often integrate knowledge from
multiple areas. For instance, a treatment-focused
question may omit a diagnosis, instead presenting
symptoms or test results. Answering such questions
requires deducing the diagnosis and applying corre-
sponding therapeutic principles. Since relevant infor-
mation is rarely confined to a single document, a com-
bination of partially relevant sources could prove to be
necessary to compile adequate responses.
7.1.3 Evaluation of Commentary Quality
The third component of the framework involved a
multi-criteria evaluation of commentary quality, as-
sessed using seven parameters rated on a 1–4 scale.
These parameters were derived from the expectations
of the medical community and underwent extensive
consultation processes.
For this evaluation, two types of statements within
the commentaries were distinguished: general rules
(or factual knowledge), corresponding to textbook
theoretical information, and conclusions derived in
the context of specific questions. For example, a rule
might assert that crushing chest pain is a symptom of
myocardial infarction or that changes in serum tro-
ponin levels indicate acute myocardial injury, such
as during an infarction. Conversely, a conclusion
might state that a patient presenting with chest pain
should have their troponin levels measured as an ele-
ment of the myocardial infarction diagnostic process.
This distinction was not always straightforward and
often depended on contextual factors, including the
question’s content, possible answers, and source doc-
uments.
A detailed description of commentary quality pa-
rameters is presented below.
Credibility: pertained to factual statements and in-
volved two dimensions. The first dimension was the
level of systematic referencing of available sources.
The system was expected to cite all documents cor-
roborating the provided information while avoiding
references to irrelevant or contradictory content. The
second dimension concerned the model’s acknowl-
edgement of the use of its internal knowledge. When
external sources were incomplete, it was desirable for
the system to rely on its internal knowledge while ex-
plicitly attributing the information to itself rather than
falsely citing external sources or omitting attribution.
Credibility scores were reduced when: 1) a statement
lacked references to all corroborating source docu-
ments, 2) a statement cited a document that was irrel-
evant or contradictory, or 3) the model failed to attribute
internally derived statements to itself.
Accuracy: referred to the authenticity of statements,
including paraphrases of source documents and the
model’s internal knowledge. To deem a statement
accurate, an annotator needed to identify a corrobo-
rating excerpt in at least one document provided to
the system or, when this was impossible, validate it
through an independent literature review.
Logic: was evaluated based on conclusions drawn
in the context of the question. This parameter
assessed whether the conclusions adhered to logi-
cal principles, aligning with the general information
cited in the commentary, the question’s content, and
with each other. Ratings were reduced if conclu-
sions contradicted the cited sources, the correct (non-
controversial) answer, or other statements.
Completeness/Depth: measured the thoroughness
and detail of the commentary. It evaluated whether all
significant difficulties of the question were addressed
and sufficiently detailed explanations were provided
to facilitate a full understanding of why one answer
was correct and others were not. Ratings were low-
ered when the commentary failed to address signifi-
cant issues or addressed them too superficially.
Conciseness: a marker complementary to complete-
ness/depth, assessed the appropriate brevity of the
commentary. The system was expected to avoid dis-
cussing irrelevant matters or providing excessive de-
tail. Ratings were reduced when commentary in-
cluded content unrelated to the key difficulties of a
question (e.g., irrelevant summaries of source docu-
ments) or when the level of detail was excessive from
the perspective of the question’s requirements.
Communicativeness/Readability: reflected the lin-
guistic quality and clarity of the commentary, serving
as an indicator of grammatical correctness and effec-
tive information delivery.
Prioritization: evaluated the system’s ability to pri-
oritize high-value sources over lower-value ones in
cases of conflicting information. Given the rapid
evolution of medical knowledge, with new publica-
tions rendering older sources obsolete, this parame-
ter aligned with the need for physicians to rely on
the most current and reliable evidence. The deter-
mination of source value considered factors such as
publication date and type of source, with synthesized
sources like guidelines, recommendations, and text-
books generally deemed more important than original
studies or single-case reports. Since source discrep-
ancies were relatively rare, this parameter was infre-
quently evaluated, as annotators could abstain from
scoring when no contradictions between sources were
identified.
7.2 Evaluation Process
7.2.1 System Development
During the evaluation of subsequent iterations of the
system, a team of five annotators (medical students in their clinical years) analyzed and assessed outputs using the established framework. For the evaluation
dataset, we selected 40 questions from five exams
covering core medical specialties: internal medicine,
pediatrics, family medicine, psychiatry, and general
surgery. These disciplines were chosen for their broad
representation of medical sciences and their relevance
to a large proportion of practitioners.
For each question, the assessed output report in-
cluded:
the content of the exam question along with five
possible answers,
information about the correct answer,
a statement by the model containing a list of identified difficulties, a single query to the search engine, or a list of queries to the search engine,
10 source documents,
a generated commentary on the question.
7.2.2 Validation
When output evaluations reached a satisfactory level,
we conducted an additional double verification us-
ing the same framework to ensure quality control.
The validation assessed whether expanding from five
core medical specialties to 22, including narrower
fields, affected commentary quality. This concern
arose because PES questions in narrow specialties
are often more detailed and nuanced, with less avail-
able relevant content. The validation also enabled
inter-annotator agreement comparison and provided
insights into the objectivity and reliability of the eval-
uation methodology.
Unlike during development, validation included
10 questions from each of 22 specialties (9 from
emergency medicine, as one question contained an
image unsuitable for automatic comment generation).
Additionally, a second team of six final-year medi-
cal students from a different university joined the an-
notation process. During validation, 219 comments
and 2,190 source documents were evaluated by one
member from each team, meaning that every question
report was independently reviewed by two unrelated
annotators.
Discrepancies were resolved by a third annotator,
who reviewed the question report for the first time
solely to settle disagreements. A discrepancy was de-
fined as:
One annotator marking a document as irrelevant,
while the other considered it partially or fully rel-
evant.
One annotator assigning a score of 1 or 2, while
the other assigned 3 or 4.
One annotator deeming prioritization assessable,
while the other did not.
Annotations following the same general tendency
were not considered discrepancies. If one annotator
assigned a score of 1 and the other 2, or one rated
a document fully relevant while the other deemed it
partially relevant, these cases were classified as par-
tial inter-annotator agreement (PIAA) and did not un-
dergo third-party resolution. The resolving annota-
tor assessed only the conflicting ratings, leaving to-
tal inter-annotator agreement (TIAA) and PIAA un-
changed, and was aware of prior disagreements.
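Under these definitions, classifying a pair of 1–4 ratings reduces to a simple rule, sketched below for illustration (document-relevance labels were handled analogously); the function name is ours.

```python
def classify_pair(score_a: int, score_b: int) -> str:
    """Classify two annotators' 1-4 ratings according to the rules above."""
    if score_a == score_b:
        return "TIAA"        # identical ratings
    # same general tendency: both negative (1-2) or both positive (3-4)
    if (score_a <= 2) == (score_b <= 2):
        return "PIAA"        # e.g. 1 vs 2, or 3 vs 4
    return "discrepancy"     # resolved by a third annotator

# Example: classify_pair(3, 4) -> "PIAA", classify_pair(2, 3) -> "discrepancy"
```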
Final validation results, based on TIAA, PIAA,
and discrepancies resolved by a third annotator, along
with inter-annotator agreement statistics, are pre-
sented in Table 3. Sensitivity and specificity were not
Table 3: Validation results and inter-annotator agree-
ment statistics. The table presents the mean scores (± standard deviation) for each evaluated parameter, along with the
percentage of total inter-annotator agreement (TIAA) and
partial inter-annotator agreement (PIAA). TIAA represents
instances where two independent annotators assigned iden-
tical ratings, while PIAA reflects cases where ratings fol-
lowed the same general tendency but were not identical.
Parameter Score TIAA PIAA
Relevant docs (/10) 6.11 ± 2.91 57% 18%
Credibility (1–4) 2.92 ± 0.72 38% 28%
Accuracy (1–4) 3.57 ± 0.66 57% 32%
Logic (1–4) 3.68 ± 0.46 58% 28%
Completeness/Depth (1–4) 3.64 ± 0.49 55% 32%
Conciseness (1–4) 3.63 ± 0.48 58% 31%
Communicativeness/Readability (1–4) 3.71 ± 0.36 58% 34%
Prioritization (1–4) 3.78 ± 0.63 90% 0%
evaluated. The total number of relevant documents
and credibility scores were noticeably lower than in
the final pipeline evaluation, possibly due to limited
relevant content in narrow medical fields. Other pa-
rameters retained their values from the end of devel-
opment. We suggest that the drop in credibility with
preserved accuracy could mean that the model com-
pensated for the lack of relevant sources with its in-
ternal knowledge. This aligns with annotators’ obser-
vations, as they did not detect increased factual errors
but noted a decline in proper source attribution, re-
flected in the credibility metric.
Most parameters had a TIAA above 50%, with
overall agreement (TIAA + PIAA) typically reaching
80–90%. Credibility showed lower inter-annotator
agreement, with a TIAA of 38% and a TIAA + PIAA
of 66%, indicating a need for more precise evaluation
guidelines. TIAA for prioritization was 90%, with the
remaining 10% of discrepancies solely related to as-
sessability. When both annotators deemed it evalu-
able, their ratings were identical.
Overall, annotators agreed in most cases, with
complete disagreements being clear but infrequent.
These results highlight the evaluation framework’s
potential as a universal tool for RAG development in
the medical domain. However, further refinements
and improved standardization are desirable to en-
hance objectivity and repeatability.
8 CONCLUSIONS
In this paper, we introduced a pipeline for generat-
ing LLM-based content tailored for medical special-
ists. The content is enriched with verified medical
documents and made accessible through our platform.
Exam questions are seamlessly integrated into a learn-
ing system that employs a spaced repetition algorithm
to optimize knowledge retention.
Our approach prioritizes content relevance over
efficiency, distinguishing it from typical RAG-based
systems. Key enhancements include a Query
Rephraser, an advanced retrieval system, and a re-
fined reranker. These improvements significantly
increased retrieval performance, notably raising the
number of total relevant documents from 4.59 to 6.83
out of 10.
To ensure quality and reliability, the output under-
went rigorous manual verification by medical special-
ists. The system is now in its final development stages
and will soon be deployed in production.
ACKNOWLEDGEMENTS
We acknowledge CEM’s courtesy in permitting the
use of past PES questions and extend our gratitude to
all medical annotators for their contributions to devel-
opment and validation. This article has been written
with the help of GPT-4o (GPT, 2024) and the Grammarly AI Assistant (Grammarly).
REFERENCES
(2024). GPT-4 technical report.
Ahlberg, G., Enochsson, L., Gallagher, A. G., Hedman,
L., Hogman, C., McClusky, D. A., Ramel, S., Smith,
C. D., and Arvidsson, D. (2007). Proficiency-based
virtual reality training significantly reduces the error
rate for residents during their first 10 laparoscopic
cholecystectomies. The American Journal of Surgery,
193(6):797–804.
Ankit Pal, M. S. (2024). Openbiollms: Advancing
open-source large language models for healthcare
and life sciences. https://huggingface.co/aaditya/
OpenBioLLM-Llama3-70B.
Camlet, A., Kusiak, A., and Świetlik, D. (2025). Applica-
tion of conversational ai models in decision making
for clinical periodontology: Analysis and predictive
modeling. AI, 6(1):3.
Colt, H. G., Crawford, S. W., and Galbraith, O. (2001). Vir-
tual reality bronchoscopy simulation: A revolution in
procedural training. Chest, 120(4):1333–1339.
Dadas, S. and Grębowiec, M. (2024). Assessing generaliza-
tion capability of text ranking models in polish.
Elsevier (2025). ClinicalKey AI. Accessed: 2025-01-22.
Grammarly. Grammarly - ai writing assistant. https://www.
grammarly.com. Accessed: 2025-01-22.
Grzybowski, Ł., Pokrywka, J., Ciesiółka, M., Kacz-
marek, J. I., and Kubis, M. (2024). Polish med-
ical exams: A new dataset for cross-lingual medi-
cal knowledge transfer assessment. arXiv preprint
arXiv:2412.00559.
Horst, R., Witsch, L.-M., Hazunga, R., Namuziya, N.,
Syakantu, G., Ahmed, Y., Cherkaoui, O., Andreadis,
P., Neuhann, F., and Barteit, S. (2023). Evaluating the
effectiveness of interactive virtual patients for medi-
cal education in zambia: Randomized controlled trial.
JMIR Med Educ, 9:e43699.
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H.,
Chen, Q., Peng, W., Feng, X., Qin, B., et al. (2023).
A survey on hallucination in large language models:
Principles, taxonomy, challenges, and open questions.
ACM Transactions on Information Systems.
johnsnowlabs (2024). Jsl-medllama-3-8b-
v2.0. https://huggingface.co/johnsnowlabs/
JSL-MedLlama-3-8B-v2.0. Accessed: 2024-11-
02.
Karabacak, M. and Margetis, K. (2023). Embracing large
language models for medical applications: opportuni-
ties and challenges. Cureus, 15(5).
Labrak, Y., Bazoge, A., Morin, E., Gourraud, P.-A., Rou-
vier, M., and Dufour, R. (2024). Biomistral: A collec-
tion of open-source pretrained large language models
for medical domains.
Lewandowski, M., Łukowicz, P., Świetlik, D., and Barańska-Rybak, W. (2023). ChatGPT-3.5 and ChatGPT-
4 dermatological knowledge level based on the spe-
cialty certificate examination in dermatology. Clin
Exp Dermatol, page llad255.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V.,
Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rock-
täschel, T., et al. (2020). Retrieval-augmented gen-
eration for knowledge-intensive nlp tasks. Advances
in Neural Information Processing Systems, 33:9459–
9474.
Mestre, A., Muster, M., El Adib, A. R., Egilsdottir, H., By-
ermoen, K., Padilha, M., Aguilar, T., Tabagari, N.,
Betts, L., Sales, L., Garcia, P., Ling, L., Café, H.,
Binnie, A., and Marreiros, A. (2022). The impact of
small-group virtual patient simulator training on per-
ceptions of individual learning process and curricular
integration: a multicentre cohort study of nursing and
medical students. BMC Medical Education, 22.
Nicikowski, J., Szczepański, M., Miedziaszczyk, M., and Kudliński, B. (2024). The potential of ChatGPT in
medicine: an example analysis of nephrology spe-
cialty exams in poland. Clinical Kidney Journal,
17(8):sfae193.
Obuchowski, A., Klaudel, B., and Jasik, P. (2023). Infor-
mation extraction from Polish radiology reports using
language models. In Piskorski, J., Marcińczuk, M., Nakov, P., Ogrodniczuk, M., Pollak, S., Přibáň, P.,
Rybak, P., Steinberger, J., and Yangarber, R., editors,
Proceedings of the 9th Workshop on Slavic Natural
Language Processing 2023 (SlavicNLP 2023), pages
113–122, Dubrovnik, Croatia. Association for Com-
putational Linguistics.
OpenMeditron (2024). Meditron3-70b. https://huggingface.
co/OpenMeditron/Meditron3-70B. Accessed: 2024-
11-02.
O’Doherty, D., Dromey, M., Lougheed, J., Hannigan, A.,
Last, J., and McGrath, D. (2018). Barriers and so-
lutions to online learning in medical education – an
integrative review. BMC Medical Education, 18.
Pokrywka, J., Biedalak, M., Gralinski, F., and Biedalak, K.
(2023a). Modeling spaced repetition with lstms. In
CSEDU (2), pages 88–95.
Pokrywka, J., Jassem, K., Wierzchoń, P., Badylak, P., and
Kurzyp, G. (2023b). Reranking for a polish medical
search engine. In 2023 18th Conference on Computer
Science and Intelligence Systems (FedCSIS), pages
297–302. IEEE.
Pokrywka, J., Kaczmarek, J., and Gorzelańczyk, E.
(2024). Gpt-4 passes most of the 297 written pol-
ish board certification examinations. arXiv preprint
arXiv:2405.01589.
ProbeMedicalYonseiMAILab (2024). medllama3-v20.
https://huggingface.co/ProbeMedicalYonseiMAILab/
medllama3-v20. Accessed: 2024-11-02.
Saab, K., Tu, T., Weng, W.-H., Tanno, R., Stutz, D., Wul-
czyn, E., Zhang, F., Strother, T., Park, C., Vedadi,
E., et al. (2024). Capabilities of gemini models in
medicine. arXiv preprint arXiv:2404.18416.
Settles, B. and Meeder, B. (2016). A trainable spaced rep-
etition model for language learning. In Proceedings
of the 54th annual meeting of the association for com-
putational linguistics (volume 1: long papers), pages
1848–1858.
Seymour, N., Gallagher, A., Roman, S., O’Brien, M.,
Bansal, V., Andersen, D., and Satava, R. (2002).
Virtual reality training improves operating room per-
formance: Results of a randomized, double-blinded
study. Annals of surgery, 236:458–63; discussion 463.
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J.,
Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis,
H., Pfohl, S., et al. (2023). Large language models
encode clinical knowledge. Nature, 620(7972):172–
180.
Weidener, L. and Fischer, M. (2024). Role of ethics in de-
veloping ai-based applications in medicine: Insights
from expert interviews and discussion of implications.
JMIR AI, 3:e51204.
Woźniak, P. (1990). Optimization of learning. Master's thesis, University of Technology in Poznań. See also https://www.supermemo.com/en/archives1990-2015/english/ol/sm2.
Woźniak, P. (2018). The true history of spaced repetition.
https://www.supermemo.com/en/articles/history. Ac-
cessed: 2025-01-22.
Woźniak, P. A., Gorzelańczyk, E. J., and Murakowski, J. A. (2005). Building memory stability through rehearsal.
https://www.supermemo.com/en/archives1990-2015/
articles/stability. Accessed: 2025-01-22.
Woźniak, P. A., Gorzelańczyk, E. J., and Murakowski, J. A. (1995). Two components of long-term memory. Acta Neurobiologiae Experimentalis, 55(4):301–305. See also
https://www.researchgate.net/publication/14489713_
Two_components_of_long-term_memory.
Xiao, Q. and Wang, J. (2024). Drl-srs: A deep reinforce-
ment learning approach for optimizing spaced repeti-
tion scheduling. Applied Sciences, 14(13).