Automatic Question Generation for the Japanese National Nursing Examination Using Large Language Models

Yûsei Kido1, Hiroaki Yamada1, Takenobu Tokunaga1, Rika Kimura2, Yuriko Miura2, Yumi Sakyo2 and Naoko Hayashi2
1 School of Computing, Tokyo Institute of Technology, Japan
2 Graduate School of Nursing Science, St. Luke’s International University, Japan
Keywords:
Large Language Models, National Nursing Examination, Distractor Generation, Multiple-Choice Questions,
Automatic Question Generation.
Abstract:
This paper introduces our ongoing research project that aims to generate multiple-choice questions for the
Japanese National Nursing Examination using large language models (LLMs). We report the progress and
prospects of our project. A preliminary experiment assessing the LLMs’ potential for question generation in
the nursing domain led us to focus on distractor generation, which is a difficult part of the entire question-
generation process. Therefore, our problem is generating distractors given a question stem and key (correct
choice). We prepare a question dataset from the past National Nursing Examination for the training and eval-
uation of LLMs. The generated distractors are evaluated by comparing them with the reference distractors in the test set. We propose reference-based evaluation metrics for distractor generation by extending recall and precision, which are popular in information retrieval. However, as the reference is not the only acceptable answer,
we also conduct human evaluation. We evaluate four LLMs: GPT-4 with few-shot learning, ChatGPT with
few-shot learning, ChatGPT with fine-tuning and JSLM with fine-tuning. Our future plan includes improving
the LLMs’ performance by integrating question writing guidelines into the prompts to LLMs and conducting
a large-scale administration of automatically generated questions.
1 INTRODUCTION
Automatic question generation (AQG) is one of the
applications of natural language processing (NLP)
that has been actively studied in the educational do-
main. It is expected to significantly reduce the bur-
den on question writers. There have been a series of
comprehensive surveys in this area (Alsubait et al.,
2015; Kurdi et al., 2020; Faraby et al., 2023). Al-
subait et al. (2015) covered 81 papers published up to
2014 and analysed them from nine aspects: purpose,
domain, knowledge source, additional input, ques-
tion generation method, distractor generation method,
question format, answer format and evaluation. Con-
cerning the domain, the number of studies targeting
specific domains is slightly larger than that for the general domain. Language learning is the dominant domain among the domain-specific studies, followed by mathematics and medicine, though studies in the latter two areas are few.
Kurdi et al. (2020) followed Alsubait et al. (2015), systematically collecting 93 papers on AQG published from 2015 to 2019 and analysing the trends. The domain distribution is similar to that of Alsubait et al. (2015), i.e. the numbers of studies on specific and non-specific domains are comparable, and
the language learning domain remains dominant in
the domain-specific studies. Studies for new domains,
such as sports, appeared in their survey.
The studies reviewed by Alsubait et al. (2015)
and Kurdi et al. (2020) utilised template-based, rule-
based (Liu et al., 2010) or statistical-based (Kumar
et al., 2015; Gao et al., 2019) approaches. In contrast,
Faraby et al. (2023) collected 224 neural network-
based AQG papers published between 2016 and early
2022. Since neural network approaches require a
large amount of training data, the availability of large
datasets for Question-Answering (QA) systems, such
as SQuAD (Rajpurkar et al., 2016) and NewsQA (https://www.microsoft.com/en-us/research/project/newsqa-dataset/), has
facilitated neural question generation (NQG). These
datasets were initially designed and constructed for
developing QA systems but are also usable for AQG.
Faraby et al. (2023) focused on NQG in the educa-
tional context and introduced several datasets usable
for NQG. Also, they mentioned the difficulty of eval-
uating AQG, which comes from multiple potentially
appropriate questions and a shortage of evaluation
datasets considering this characteristic.
At the end of 2022, ChatGPT (https://chat.openai.com) appeared, and
numerous large language models (LLMs) followed.
Their versatility and high performance for various
tasks without fine-tuning made a great impact on both
academia and industry (Liu et al., 2023b). LLMs have
already been applied to AQG. For instance, Perkoff
et al. (2023) compared three types of LLM archi-
tectures: T5, BART and GPT-2, in generating read-
ing comprehension questions and concluded that T5
was the most promising. Yuan et al. (2023) used GPT-3 to generate questions and to select better questions from the automatically generated candidates. Oh et al. (2023) utilised an LLM to paraphrase references in order to improve the evaluation metrics for AQG.
Shin and Lee (2023) conducted a human evalua-
tion of ChatGPT-generated multiple-choice questions
(MCQs) for language learners, in which 50 language
teachers evaluated a mixed set of human-made and
ChatGPT-made MCQs without knowing their origins.
They reported both types of MCQs had comparable
quality.
We follow this line of research by utilising LLMs
to generate questions. This research is a part of the
project funded by the Ministry of Health, Labour
and Welfare (MHLW) of Japan, which aims to au-
tomate administering the National Nursing Examina-
tion. Thus, our target domain is nursing; specifically,
we aim to generate questions for the Japanese Na-
tional Nursing Examination. This study assesses the
feasibility of using the LLM-generated questions in
the Japanese National Nursing Examination. Even if question generation cannot be fully automated, automating parts of the generation process can reduce the workload on question writers. Unlike in the prevalent language learning domain,
recruiting a large number of domain experts for evalu-
ation is difficult in our nursing domain. Therefore, the
challenge includes designing an effective and efficient
evaluation framework with fewer human resources.
In this paper, we report the progress of our on-
going research project that aims to generate MCQs
Stem: Which of the following was the leading cause of death in Japan in 2020?
Key:  1. malignant neoplasm
Distractors:
  2. pneumonia
  3. heart disease
  4. cerebrovascular disease

Figure 1: Example of a multiple-choice question (MCQ). The original is in Japanese; this is our translation. The same applies to the following examples.
for the Japanese National Nursing Examination us-
ing LLMs. An MCQ consists of three components:
a stem, a key and distractors, as shown in Figure 1.
AQG needs to generate these three components to
constitute a question. Particularly, distractor gener-
ation has been actively studied (Susanti et al., 2017;
Gao et al., 2019) since generating distractors is a bur-
densome task for question writers. Considering the result of the preliminary experiment described in section 3, we shift our focus to distractor generation by LLMs.
2 JAPANESE NATIONAL NURSING EXAMINATION

Table 1: Choice type distribution in the past ten-year examinations.

  Choice type             #questions
  noun phrase                    309
  sentence                        57
  numerics                        96
  figure & table                  22
  exceptional questions           16
  Total                          500
In Japan, one must pass the National Nursing Exam-
ination to be qualified as a registered nurse. Gradu-
ating from a college or university with a nursing cur-
riculum is a prerequisite to the examination. The ex-
amination evaluates the knowledge and skills required
to become a registered nurse. It covers a wide range
of subjects to confirm the knowledge about nursing
from various perspectives, including the structure and
function of the human body, the origin of disease and
promotion of recovery, health support and social se-
curity system, basic nursing, adult nursing, geronto-
logical nursing, pediatric nursing, maternal nursing,
psychiatric nursing, home health care nursing, and in-
tegration and practice of nursing. These subjects are
organised into a three-level hierarchical structure con-
sisting of 16 major categories, 49 middle categories
and 252 minor categories.
The examination questions are in the form of MCQs and are classified into three types: essential, general and situational. This study focuses on the essential
questions. The essential part consists of 50 questions
that assess particularly important basic knowledge. A
score of 80% or higher on the essential questions is
necessary to pass the examination. The questions are
intended to check the examinees’ knowledge of nurs-
ing and are not intended to select examinees for a cer-
tain quota. Figure 1 shows an example of an essential question (the actual questions are in Japanese; the translation is by the authors). A question consists of a stem (the question sentence), a key (the correct choice) and three, or rarely four, distractors (incorrect choices).
The choices can be words or phrases like in Fig-
ure 1, longer descriptions in clauses and sentences,
numerical values, graphs and tables. Table 1 shows
the distribution of the choice types in the essential
questions of the examinations over the past ten years,
which were provided by MHLW, the body respon-
sible for the National Nursing Examination. This
study considers only questions with choices of words,
phrases and sentences. Although some recent LLMs
can handle images as well as language, the number of
questions with figures and tables is small in the past
examinations. Questions with numerical choices are better suited to rule-based approaches: for instance, applying appropriate error offsets to the correct value yields reasonable distractors, as in the sketch below.
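The following is a minimal sketch of such a rule-based generator for numerical choices; the relative offsets and the example value are illustrative assumptions, not parameters used in the examination.

```python
# Hypothetical rule-based distractor generator for numerical choices:
# distractors are produced by applying fixed relative offsets to the key.

def numeric_distractors(key: float, offsets=(-0.2, 0.1, 0.25), ndigits: int = 1) -> list[float]:
    candidates = [round(key * (1.0 + off), ndigits) for off in offsets]
    unique = []
    for c in candidates:
        # Keep only values distinct from the key and from each other.
        if c != round(key, ndigits) and c not in unique:
            unique.append(c)
    return unique

# Example: a question whose key is an adult body-temperature value of 36.5.
print(numeric_distractors(36.5))   # three offset-based candidate distractors
```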
3 PRELIMINARY EXPERIMENT
To evaluate the potential of LLMs, we conduct a pre-
liminary study. Table 2 shows the topic, stem and
key of four questions used in the preliminary study.
We analysed the questions in the past examinations
regarding the correct answer rate, the discrimination
index and the distractor quality. These four questions were selected because our analysis of the past ten-year questions revealed that the past questions on these topics should be improved. (The second topic has two keys and can therefore yield two questions, one for each key; in the preliminary experiment, we used only “dehydration” as the key.)
To make LLMs work well on a task, appropri-
ate prompts specialised to the task should be pre-
pared (Liu et al., 2023a). LLMs need to create a stem,
key and distractors to compose an MCQ. We prepare
different prompts with varying information for LLMs.
Table 3 shows variations of the prompts, in which
“I” indicates the information given to LLMs in the
prompt and “O” the output from LLMs. The charac-
ters in the prompt names (T, S and K) stand for the
information given in the prompt. For example, the SK
prompt contains a stem and key for generating distrac-
tors. The NON prompt contains only a topic, as shown in Table 3, and is used to generate all MCQ components: a stem, a key and distractors.
Additionally, we put a paragraph on the corresponding topic from nursing textbooks into the prompt, indicated by the “textbook” column and by T in the prompt name, since textual information is the most popular input type for AQG systems (Kurdi et al., 2020).
Although ontology-type structured knowledge is a
plausible option, we used textbook paragraphs since
ontologies in our Japanese nursing domain are un-
available. We do not consider providing distractors
in the prompt because creating distractors has been a
crucial research theme in the past AQG research.
The eight prompt patterns in Table 3 and the
four topics in Table 2 make 32 questions in total.
We used ChatGPT (gpt-3.5-turbo-0301) through the Microsoft Azure API with the zero-shot approach, i.e. providing no exemplar. Instead of creating a question all at once, we took an interactive approach of creating each component in turn. For instance, an interaction with an SK prompt looks like Figure 2 (the actual prompt is in Japanese; we present our translation), where the USER part is our input, and the
ingly, ChatGPT generates four options, including the
key, by just being instructed to “memorise” the stem.
The explicit instruction to generate distractors in the
second turn results in the same distractors generated
in the first turn.
The 32 generated questions were assessed by four of the authors, who are faculty members of a graduate school of nursing science, i.e. domain experts; two of them have experience in writing questions for the National Nursing Examination. The assessment revealed that only one question, the one presented in Figure 2, was judged usable for the National Nursing Examination.
After showing the generated questions to the assessors and to other domain experts (non-authors) with experience in writing questions for the National Nursing Examination, and discussing them, we obtained the following findings:
- Question writers prefer creating question stems manually so that they can control the knowledge to be assessed. Creating stems is not so difficult for them in a setting like the National Nursing Examination, i.e. assessing clearly defined basic domain knowledge.
Table 2: Target questions in the preliminary experiment (Translation).

  Topic: hair-washing assistance
  Stem:  Which of the following is correct in assisting with hair washing?
  Key:   wash with the finger pads

  Topic: repeated vomit
  Stem:  Which of the following is caused by repeated vomiting?
  Key:   dehydration or alkalosis

  Topic: standard precautions
  Stem:  Which of the following is considered a source of infection in the standard precautions?
  Key:   saliva

  Topic: basic structure of the human body and bodily fluids
  Stem:  Which of the following is the highest percentage of adult body weight?
  Key:   intracellular fluids
Table 3: Prompt variations for the preliminary experiment.

  name   topic  textbook  stem  key
  NON      I               O     O
  T               I        O     O
  S                        I     O
  K                        O     I
  TS              I        I     O
  TK              I        O     I
  SK                       I     I
  TSK             I        I     I
USER: You will now create a question with the correct answer “intracellular fluids” from four options. Memorise the stem, which will be presented here.
Stem: “Which of the following is the highest percentage of fluids in an adult’s body weight?”

ASSISTANT: Choices:
A. blood
B. lymph fluids
C. intracellular fluids
D. extracellular fluids
The correct answer is C. “intracellular fluids”.

USER: What are three distractors?
Format:
Distractors:

ASSISTANT:
Distractors:
A. blood
B. lymph fluids
D. extracellular fluids

Figure 2: Interaction with an SK prompt (Translation).
- On the other hand, creating distractors is the most difficult part of question generation. It is hard to find distractors that seem plausible but are not true. A set of distractors generated by ChatGPT is far from acceptable as is, but some of them are usable and inspiring.
- Providing an excerpt from the textbook concerning the topic in the prompt does not impact the results positively.
Based on these findings, we decided to focus on gen-
erating distractor candidates by providing a question
stem and its key in the prompt.
4 AUTOMATIC EVALUATION METRICS
We propose automatic evaluation metrics to evaluate
the quality of generated MCQs, particularly focus-
ing on the quality of distractors. Automatic evalua-
tion metrics are indispensable for efficient system de-
velopment since human evaluation is expensive and
time-consuming. ROUGE (Lin and Hovy, 2003) is
a popular evaluation metric based on the overlapping
ratio of units between system outputs and the refer-
ences (correct outputs). ROUGE concerns the close-
ness of individual outputs to the references. However,
we are concerned about assessing a set of generated
distractors for a given stem and key, i.e. we need to
consider the closeness between generated and refer-
ence distractors on a set basis instead of an individual
basis. In this respect, set-theoretic metrics like recall
and precision are more appropriate.
The relation among the distractors in a set is also important. For instance, even if each generated distractor for a given stem and key is acceptable on its own, the set is unacceptable if the distractors are all the same or very similar.
We first consider recall and precision, well-known metrics in information retrieval. Given a question q_i in the test set Q_t, recall (R) and precision (P) are defined as in equations (1) and (2):

  R_i = \frac{|S_i \cap G_i|}{|G_i|},   (1)

  P_i = \frac{|S_i \cap G_i|}{|S_i|},   (2)

where S_i is the automatically generated distractor set for the stem and key of q_i, and G_i is the set of distractors of q_i, i.e. the reference distractor set. Recall indicates how well the system can replicate the reference distractors, while precision indicates how many of the generated distractors are acceptable with regard to
the reference. We can average these metrics over the test set to obtain the overall quality of the generated distractors, as in equations (3) and (4); they are called macro-averaged recall and precision:

  \hat{R} = \frac{\sum_{i=1}^{|Q_t|} R_i}{|Q_t|},   (3)

  \hat{P} = \frac{\sum_{i=1}^{|Q_t|} P_i}{|Q_t|}.   (4)
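To make the definitions concrete, the following is a minimal sketch (not the authors' implementation) of the exact-match recall and precision in equations (1)-(4), assuming each question is represented as a pair of a generated distractor set and a reference distractor set.

```python
# Sketch of exact-match recall/precision and their macro averages, eqs. (1)-(4).

def recall_precision(generated: set[str], reference: set[str]) -> tuple[float, float]:
    overlap = len(generated & reference)                         # |S_i ∩ G_i|
    return overlap / len(reference), overlap / len(generated)    # eqs. (1) and (2)

def macro_average(questions: list[tuple[set[str], set[str]]]) -> tuple[float, float]:
    scores = [recall_precision(s, g) for s, g in questions]      # one (R_i, P_i) per question
    r_hat = sum(r for r, _ in scores) / len(scores)              # eq. (3)
    p_hat = sum(p for _, p in scores) / len(scores)              # eq. (4)
    return r_hat, p_hat

# Example with a single question: one of three generated distractors
# matches the three reference distractors exactly.
print(macro_average([({"plasma", "urine", "sweat"},
                      {"plasma", "interstitial fluids", "lymph fluids"})]))
```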
Recall and precision are based on counting the number of generated distractors that exactly match one of the reference distractors. Noun-phrase distractors are likely to match the reference because they consist of only a word or a few words. However, generated distractors in sentence form, which contain more words, rarely match the reference exactly, meaning that we might judge generated distractors inappropriate even though they have the same meaning as the references. To remedy this problem, we introduce a similarity-based extension of recall and precision, i.e. similarity-recall (R^s) and similarity-precision (P^s). First, we assume a similarity metric sim(·, ·) that returns a similarity value between its two arguments, ranging from 0 to 1. The numerator of equations (1) and (2) can be written as

  |S_i \cap G_i| = \sum_{j=1}^{|G_i|} \mathbb{1}(g_j \in S_i) = \sum_{k=1}^{|S_i|} \mathbb{1}(s_k \in G_i),   (5)

where the indicator function \mathbb{1}(\cdot) returns 1 when the argument proposition is true and 0 otherwise. Rewriting the numerator in this way, a natural extension of recall and precision from counting exactly matching items to accumulating maximum similarity values leads to equations (6) and (7):

  R^s_i = \frac{\sum_{j=1}^{|G_i|} \mathrm{sim}(\operatorname{argmax}_{s \in S_i} \mathrm{sim}(s, g_j),\, g_j)}{|G_i|},   (6)

  P^s_i = \frac{\sum_{k=1}^{|S_i|} \mathrm{sim}(s_k,\, \operatorname{argmax}_{g \in G_i} \mathrm{sim}(s_k, g))}{|S_i|}.   (7)

By definition, these similarity-based metrics take values at least as large as the original recall and precision.
Our other concern is the relationship among distractors. The definitions of R^s and P^s do not consider the correspondence between generated and reference distractors. For instance, R^s cannot distinguish a situation where each generated distractor has its maximum similarity to a different reference distractor from a situation where all generated distractors have their maximum similarity to the same reference. To address this drawback, we further extend R^s and P^s to take pairs of generated and reference distractors into account. Our idea is to find a set of pairs of generated and reference distractors that maximises the sum of the pair similarities, and to use this similarity sum as the numerator when calculating recall and precision. Considering the distractor sets S_i and G_i as nodes and their correspondences as edges weighted by their similarity, we can formulate our problem as the maximum weight matching (MWM) problem on a weighted complete bipartite graph. Efficient algorithms, e.g. the Hungarian algorithm, are known to solve this problem. Supposing we have a function MWM(V_1, V_2) that returns the maximum weight sum of the matching over a bipartite graph consisting of node sets V_1 and V_2, we define combinatorial similarity-based recall (R^cs) and precision (P^cs) by equations (8) and (9):

  R^{cs}_i = \frac{\mathrm{MWM}(S_i, G_i)}{|G_i|},   (8)

  P^{cs}_i = \frac{\mathrm{MWM}(S_i, G_i)}{|S_i|}.   (9)
Like the original recall and precision, these extended
metrics will be macro averaged over all test set ques-
tions for evaluation.
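The following sketch illustrates R^s, P^s and the MWM-based R^cs, P^cs, assuming a function sim that maps a pair of strings to a similarity in [0, 1]; scipy's linear_sum_assignment (an implementation of the Hungarian algorithm) solves the maximum weight matching over the similarity matrix. This is an illustration, not the authors' code.

```python
# Sketch of similarity-based and combinatorial similarity-based
# recall/precision, equations (6)-(9).
import numpy as np
from scipy.optimize import linear_sum_assignment

def similarity_metrics(generated, reference, sim):
    # sim_matrix[k, j] = sim(s_k, g_j) for generated s_k and reference g_j.
    sim_matrix = np.array([[sim(s, g) for g in reference] for s in generated])
    r_s = sim_matrix.max(axis=0).sum() / len(reference)   # eq. (6): best match per reference
    p_s = sim_matrix.max(axis=1).sum() / len(generated)   # eq. (7): best match per generated
    # Maximum weight matching on the complete bipartite graph; the assignment
    # solver minimises cost, so the similarities are negated.
    rows, cols = linear_sum_assignment(-sim_matrix)
    mwm = sim_matrix[rows, cols].sum()
    return r_s, p_s, mwm / len(reference), mwm / len(generated)   # eqs. (8) and (9)
```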
We are also concerned about the relationship among the generated distractors within a question. A set of distractors representing similar concepts should be avoided. We propose a metric, distractor variation (DV), representing how much the distractors of a question differ from each other. Equation (10) defines the distractor variation for a question q_i; a larger DV indicates that the distractors of a question are more diverse:
  DV_i = 1 - \frac{\sum_{\{(s_j, s_k) \mid s_j, s_k \in S_i,\, j \neq k\}} \mathrm{sim}(s_j, s_k)}{|S_i|}   (10)
There are various ways to calculate the similarity between linguistic expressions, i.e. to implement sim(·, ·). The ROUGE score mentioned above is one of them. In the experiment below, we use a text embedding technique (Patil et al., 2023) to calculate the similarity between two distractors. Each distractor is transformed into an embedding (a dense real-valued vector), and the cosine similarity of the two embeddings, with negative values rounded up to zero, is used as the similarity of the distractor pair.
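As an illustration, the sketch below implements such an embedding-based sim(·, ·) and the DV metric with the sentence-transformers library; the Hugging Face model identifier is an assumption (the paper names the Multilingual-E5-Large model in section 5.4 but does not show loading code), and the normalisation in distractor_variation should be read as one possible rendering of equation (10).

```python
# Sketch of an embedding-based sim(., .) and the distractor variation (DV) metric.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")   # assumed model ID

def sim(a: str, b: str) -> float:
    emb = model.encode([a, b])                  # two dense embedding vectors
    cos = float(util.cos_sim(emb[0], emb[1]))   # cosine similarity
    return max(cos, 0.0)                        # negative values rounded up to zero

def distractor_variation(distractors: list[str]) -> float:
    # Summed similarity over unordered distractor pairs, normalised as in
    # equation (10); treat this normalisation as one possible reading.
    pair_sum = sum(sim(a, b) for a, b in combinations(distractors, 2))
    return 1.0 - pair_sum / len(distractors)
```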
5 GENERATING DISTRACTOR CANDIDATES BY LLM
5.1 Dataset
As mentioned in section 2, we obtained the essen-
tial questions of the National Nursing Examination
over the past ten years from MHLW. We target the questions whose choices are noun phrases or sentences, i.e. 309 noun-phrase-choice questions and 57 sentence-choice questions, 366 questions in total (cf. Table 1).
These questions consist of 347 four-choice questions
(95%) and 19 five-choice questions (5%). They are
divided into 193, 87 and 86 questions for training,
validation and test data while maintaining the distri-
bution of subjects.
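The subject-preserving split could be realised, for example, with a stratified split as sketched below; the data layout (a "subject" field per question) is an assumption for illustration.

```python
# Sketch of a subject-stratified 193/87/86 train/validation/test split.
# Stratification assumes every subject occurs often enough in each pool.
from sklearn.model_selection import train_test_split

def split_dataset(questions):
    subjects = [q["subject"] for q in questions]
    train, rest = train_test_split(
        questions, train_size=193, stratify=subjects, random_state=0)
    valid, test = train_test_split(
        rest, train_size=87, stratify=[q["subject"] for q in rest], random_state=0)
    return train, valid, test
```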
5.2 Models
We consider three language models: ChatGPT (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-0613) (Achiam et al., 2023), developed by OpenAI, and the Japanese Stable LM Instruct Alpha 7B-v2 (JSLM) model (Lee et al., 2023), developed by Stability AI Japan. We accessed ChatGPT and GPT-4 through the Microsoft Azure API and ran JSLM on our own GPU server. Since JSLM was trained on Japanese corpora, we expect it to perform well in processing Japanese text.
We fine-tune ChatGPT and JSLM using the 193 questions in the training data. Fine-tuning is not available for GPT-4 at the time of writing this paper. The fine-tuning of ChatGPT is performed for five epochs through the Microsoft Azure OpenAI API. We fine-tune JSLM with the SGD optimiser (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), a batch size of 1, a learning rate of 10^{-5} and 50 epochs. The model with the lowest loss on the validation data is adopted as the tuned model. We do not adopt any parameter-efficient approximation techniques such as LoRA (Hu et al., 2021); instead, we conduct full-parameter tuning.
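As a rough illustration of this JSLM setting (SGD, batch size 1, learning rate 10^{-5}, 50 epochs, full-parameter tuning), the following sketch shows a plain PyTorch training loop over prompt-response texts; the model identifier, the data layout and the omission of memory-saving details are assumptions, not the authors' code.

```python
# Sketch of full-parameter fine-tuning of a causal LM with SGD, batch size 1,
# learning rate 1e-5 and 50 epochs, as described in section 5.2.
import torch
from torch.optim import SGD
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stabilityai/japanese-stablelm-instruct-alpha-7b-v2"   # assumed HF model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True).cuda()
optimizer = SGD(model.parameters(), lr=1e-5)

def training_texts(questions):
    # Each training example: the distractor-generation instruction followed by
    # the reference distractors; field names are assumed for illustration.
    for q in questions:
        prompt = (f'Give us three distractors for the {q["n_choices"]}-choice question '
                  f'"{q["stem"]}" with the correct answer "{q["key"]}".\n')
        yield prompt + "Distractors:\n" + "\n".join(q["distractors"])

def fine_tune(train_questions, epochs=50):
    model.train()
    for _ in range(epochs):
        for text in training_texts(train_questions):          # batch size 1
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            loss = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    # Selecting the checkpoint with the lowest validation loss is omitted here.
```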
5.3 Prompting
Under the name of prompt engineering, various techniques have been developed to create better prompts that steer LLMs towards successful responses. In-context learning is the most popular technique, in which several exemplars, each consisting of an instruction and its corresponding appropriate response, are included after the instruction (Brown et al., 2020). It is also called one/few-shot learning, according to the number of exemplars. In-context learning is particularly useful when additional model training is difficult, as is the case for GPT-4. Our experiment adopts
four-shot learning for GPT-4, as it does not allow fine-
tuning; this model is named “gpt4”. We also apply
four-shot learning to ChatGPT (gpt-3.5-turbo-0613) to
see the difference in impact on performance between
fine-tuning and in-context learning. We name Chat-
GPT with in-context learning “gpt3.5-few” and that
with fine-tuning “gpt3.5-FT”. For JSLM, we use no
exemplar in the prompt (zero-shot).
For few-shot learning of gpt4 and gpt3.5-few, four
questions are randomly selected from 193 questions
in the training data for exemplars; they are used for
all test questions. Figure 3 shows a zero-shot prompt,
and Figure 4 shows a few-shot prompt, in which a stem and a key are embedded in the placeholders Q and A respectively, and N is four or five depending on the reference question (the prompts are in Japanese; the figures show our translation). In addition to filling in the stems and keys, the distractors are filled into the placeholders D_n in the few-shot prompt. Most past
questions have three distractors (four-choice ques-
tions), and the rest have four distractors. Since we are
generating distractor candidates, our prompts instruct
LLMs to generate five distractors for each question,
including one or two additional candidates.
USER: Give us five distractors for the N-choice question “Q” with the correct answer “A”.

Figure 3: Zero-shot prompt (Translation).

USER: Give us three distractors for the four-choice question “Q” with the correct answer “A”.
ASSISTANT: Distractors:
D_1
D_2
D_3
— three more exemplars here —
USER: Give us five distractors for the N-choice question “Q” with the correct answer “A”.

Figure 4: Few-shot prompt (Translation).
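To make this setup concrete, here is a small sketch of how the few-shot messages of Figure 4 could be assembled for a chat-completion style API; the helper name and the data fields are assumptions for illustration, and the resulting list would then be sent to the respective service.

```python
# Sketch of assembling the few-shot prompt of Figure 4 as chat messages.
# Field names such as "stem", "key", "distractors" and "n_choices" are assumed.

def distractor_request(stem: str, key: str, n_choices: int, n_distractors: int) -> str:
    return (f"Give us {n_distractors} distractors for the {n_choices}-choice question "
            f'"{stem}" with the correct answer "{key}".')

def build_few_shot_messages(exemplars, target):
    messages = []
    for ex in exemplars:                        # four exemplars were used in the experiments
        messages.append({"role": "user",
                         "content": distractor_request(ex["stem"], ex["key"],
                                                       ex["n_choices"],
                                                       len(ex["distractors"]))})
        messages.append({"role": "assistant",
                         "content": "Distractors:\n" + "\n".join(ex["distractors"])})
    # The final turn asks for five distractor candidates for the target question.
    messages.append({"role": "user",
                     "content": distractor_request(target["stem"], target["key"],
                                                   target["n_choices"], 5)})
    return messages
```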
5.4 Results
Table 4 shows the evaluation metric values of the
models for comparison. We used the Multilingual-
E5-Large model (Wang et al., 2022) to transform dis-
tractors into 1024-dimensional real vectors for cal-
culating similarity-based metrics. Table 4 is broken
down into two tables: Table 5 and Table 6, which
correspond to the questions with noun-phrase choices
and those with sentence choices, respectively. The
boldface indicates the best values across the com-
pared models. The distractor variation (DV) values
of the references are 0.126 for the entire set, 0.125 for
the noun-phrase choice set and 0.127 for the sentence-
choice set, respectively.
Table 4: Result of the entire test set (86 questions).

          gpt4   gpt3.5-few  gpt3.5-FT   JSLM
  R       0.147    0.155       0.178     0.101
  P       0.088    0.093       0.107     0.060
  R^s     0.903    0.906       0.909     0.892
  P^s     0.894    0.894       0.896     0.877
  R^cs    0.901    0.903       0.905     0.886
  P^cs    0.540    0.542       0.543     0.532
  DV      0.116    0.116       0.116     0.122
Table 5: Result of the noun-phrase choice set (75 questions).

          gpt4   gpt3.5-few  gpt3.5-FT   JSLM
  R       0.169    0.178       0.204     0.116
  P       0.101    0.107       0.123     0.069
  R^s     0.906    0.909       0.913     0.895
  P^s     0.895    0.896       0.899     0.880
  R^cs    0.903    0.906       0.910     0.889
  P^cs    0.542    0.544       0.546     0.534
  DV      0.116    0.117       0.120     0.124
We also conducted a manual evaluation by the domain-expert authors, the same as in the preliminary experiment. The topics in Table 2 are used to generate distractor sets for human evaluation. We give a pair of a stem and a key as input; the second topic yields two questions, one for each of its two keys (dehydration and alkalosis). Therefore, we have five distractor sets generated by each model. Table 8 shows how many generated distractors are deemed acceptable by the experts (“exp”) and how many of them are the same as the reference (“ref”).
6 DISCUSSION AND PROSPECTS
In-Context Learning vs Fine-Tuning
Comparing gpt3.5-few and gpt3.5-FT reveals that
fine-tuning is consistently more effective than in-
context learning throughout all recall/precision met-
rics. Furthermore, we can gain improvement by fine-
tuning with only 193 training instances. Gpt3.5-FT outperforms gpt4 on the question set with noun-phrase choices and on the entire set. We might achieve further
improvement with more training data through fine-
tuning. Collecting new data or applying data augmen-
tation techniques (Li et al., 2022) to the existing data
are possible research directions.
Impact of the Parameter Size
JSLM is consistently inferior to the GPT family in
the recall/precision metrics. The parameter size of
JSLM we used in this experiment is seven billion
Table 6: Result of the sentence choice set (11 questions).

          gpt4   gpt3.5-few  gpt3.5-FT   JSLM
  R       0.000    0.000       0.000     0.000
  P       0.000    0.000       0.000     0.000
  R^s     0.887    0.884       0.881     0.874
  P^s     0.881    0.880       0.877     0.856
  R^cs    0.885    0.882       0.877     0.862
  P^cs    0.531    0.529       0.526     0.517
  DV      0.115    0.111       0.088     0.114
Table 7: Number of different most similar references to a
generated distractor.
#ref gpt4 gpt3.5-few gpt3.5-FT JSLM
1 34 29 32 25
2 45 47 42 40
3 7 10 12 21
(7B), which is far smaller than those of the GPT models. The parameter sizes of GPT-3.5-turbo and GPT-4 are not officially published. Still, considering that the parameter size of their predecessor, GPT-3, is 175 billion, we can estimate that GPT-3.5-turbo and GPT-4 are one or more orders of magnitude larger than JSLM (7B). We should adopt a larger
JSLM model to see the impact of parameter sizes on
performance. However, interestingly, JSLM shows a
larger variation value (DV) for noun-phrase distrac-
tors than other models. The fact that JSLM has been
trained on Japanese corpora might be the reason.
Noun-Phrase Choices vs Sentence Choices
The original recall and precision metrics (R and P) do
not work for evaluating sentence distractors (Table 6).
In contrast, the proposed similarity-based recall and
precision metrics work for both noun-phrase and sen-
tence distractors. To investigate the effectiveness of the combinatorial similarity-based metrics (R^cs and P^cs), we counted, for each question, how many different reference distractors appear as the most similar reference to a generated distractor. The correspon-
dence is shown in Table 7. In more than half of the
cases in all models, multiple reference distractors are
most similar to the same generated distractor. Such a
situation is not favourable, particularly for measuring
recall. Our combinatorial extension should remedy
this problem. We presume that the larger number in
the second and third rows of JSLM in Table 7 partially
explains the larger drop from R^s to R^cs for JSLM than for the other models.
Human Evaluation vs Automatic Evaluation
Table 8 shows how many generated distractors are
judged acceptable by the experts (“exp”) and are the
same as the reference (“ref”). We can see that gpt3.5-
Table 8: Result of manual evaluation.

                       gpt4     gpt3.5-few  gpt3.5-FT    JSLM
  Topic               exp ref    exp ref     exp ref    exp ref
  hair-washing          1   0      1   0       0   0      0   0
  vomit/dehydration     3   1      3   1       5   2      2   1
  vomit/alkalosis       1   0      3   0       1   0      1   0
  std. precautions      3   2      1   1       1   1      0   0
  body fluids           5   1      3   1       5   3      2   1
  Total                13   4     11   3      12   6      5   2
FT reproduces the reference distractors most, which
can be considered as an effect of fine-tuning. Another
notable observation is that gpt4 generates many distractors deemed acceptable by the human experts, although it reproduces fewer reference distractors than gpt3.5-FT. Table 9 shows distractor examples generated by each model. We notice that gpt4 generates acceptable distractors that the other models do not produce. In this respect, gpt4 might be more creative. As Faraby et al.
(2023) pointed out, the reference distractor set is not
the only appropriate set, which is also supported by
our human evaluation. Gpt4 can assist the question
writers by suggesting inspiring distractors they may
not have considered. Reference-based automatic evaluation metrics cannot appropriately evaluate generated distractors in this respect. Similarity-based met-
rics might remedy this problem since they calculate
the similarity of generated distractors to the refer-
ences. We need to verify the effectiveness of our au-
tomatic evaluation metrics by comparing them with
human evaluation results on a large scale. Building
an evaluator model using reinforcement learning tech-
niques is another possible approach. Reinforcement
learning based on human feedback is a common technique for tuning LLMs (Christiano et al., 2023). The
evaluator model can be free from the reference, and
it can also be used to tune the distractor generation
model.
Open vs Proprietary LLMs
This study employed an open LLM (JSLM) and pro-
prietary LLMs (ChatGPT and GPT-4). Only a few re-
search institutes can train huge language models, such
as ChatGPT. The size of open LLMs currently avail-
able, such as JSLM, is still limited. It is unclear how closely the performance of open LLMs can approach that of huge proprietary LLMs. However, open LLMs have
advantages in their transparency, tuneability, repro-
ducibility and security. Security is particularly crucial
for our case, the national examination, considering the
confidentiality of information. We will continue to
work on both open and proprietary LLMs, balancing
Table 9: Example of generated distractors (Translation).

  Stem: Which of the following is the highest percentage of body fluids to adult body weight?
  Key:  intracellular fluids

  Model        Generated distractors
  gpt4         plasma, urine, sweat, bile, cerebrospinal fluids
               (0.929, 0.909, 0.925, 0.555, 0.136)
  gpt3.5-few   blood, adipose tissue, lymph fluids, digestive fluids, urine
               (0.940, 0.906, 0.931, 0.558, 0.137)
  gpt3.5-FT    plasma, interstitial fluids, lymph fluids, blood cell, platelet
               (1.000, 0.963, 1.000, 0.600, 0.130)
  JSLM         blood, somatic cell, extracellular fluids, plasma, interstitial fluids
               (0.953, 0.928, 0.950, 0.570, 0.127)
  Reference    plasma, interstitial fluids, lymph fluids

  Distractors in bold are the same as the reference; those in italics are deemed acceptable by the experts. The values in parentheses are the metrics (R^s, P^s, R^cs, P^cs, DV).
the aforementioned advantages of open LLMs and the
high performance of proprietary LLMs.
Future Plan
This feasibility study shows promising results in automatically generating distractors for the Japanese National Nursing Examination using LLMs.
To achieve further improvement in distractor gen-
eration, we are considering two approaches. The first
is posting a negated question stem to QA systems or
LLMs to obtain answers, which should be usable as
distractors of the original question stem. Logically,
this seems to work, but we must verify it through ex-
periments. The second is integrating question-writing guidelines into the prompts for LLMs. A different group in our project has analysed the past examination questions to compile the guidelines. There are several prohibitions in question writing for the National Nursing Examination, e.g. coined words should not be used as an option, opposing choices cannot coexist, and
so on. We observed non-existing words in the gener-
ated distractors of our experiments. The distractors
in breach of the prohibitions can be filtered out in post-processing, as sketched below; however, breach-free outputs from LLMs are preferable. The guidelines will also
include rules to make questions better. Those rules
can be incorporated into the prompts for LLMs.
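As an illustration of such post-processing, the following hypothetical filter drops generated distractor candidates that are not found in a reference vocabulary (a crude check against coined words) or that duplicate the key; the vocabulary source and the rules are assumptions, not the guidelines compiled in the project.

```python
# Hypothetical post-processing filter for generated distractor candidates.
# `vocabulary` is assumed to be a set of terms drawn from nursing textbooks;
# the project's actual guideline rules are not reproduced here.

def filter_candidates(candidates: list[str], key: str, vocabulary: set[str]) -> list[str]:
    kept = []
    for cand in candidates:
        if cand == key:                 # a distractor must not equal the key
            continue
        if cand in kept:                # drop duplicates
            continue
        if cand not in vocabulary:      # crude check against coined (non-existing) words
            continue
        kept.append(cand)
    return kept

# Example with hypothetical data.
vocab = {"plasma", "interstitial fluids", "lymph fluids", "blood"}
print(filter_candidates(["plasma", "plasma", "bloodoid", "intracellular fluids"],
                        key="intracellular fluids", vocabulary=vocab))   # -> ['plasma']
```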
After improving distractor generation, we plan to
administer a large-scale mock-up examination that
includes questions with automatically generated dis-
tractors. The number of participating test-takers is expected to be 1,000. We will conduct a human eval-
uation of a mixed set of human-made and machine-
made questions, following Susanti et al. (2017) and
Shin and Lee (2023). Also, we will compare the test-
taker responses to both types of questions on the same
topic.
REFERENCES
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I.,
Aleman, F. L., Almeida, D., Altenschmidt, J., Altman,
S., Anadkat, S., et al. (2023). GPT-4 technical report.
CoRR, abs/2303.08774.
Alsubait, T., Parsia, B., and Sattler, U. (2015). Ontology-
based multiple choice question generation. KI - Künstliche Intelligenz, 30.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.,
Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language models are few-shot
learners. In Larochelle, H., Ranzato, M., Hadsell, R.,
Balcan, M., and Lin, H., editors, Advances in Neu-
ral Information Processing Systems, volume 33, pages
1877–1901. Curran Associates, Inc.
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S.,
and Amodei, D. (2023). Deep reinforcement learning
from human preferences. CoRR, abs/1706.03741.
Faraby, S. A., Adiwijaya, A., and Romadhony, A. (2023).
Review on neural question generation for education
purposes. International Journal of Artificial Intelli-
gence in Education, pages 1–38.
Gao, Y., Bing, L., Chen, W., Lyu, M., and King, I. (2019).
Difficulty controllable generation of reading compre-
hension questions. pages 4968–4974.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. (2021). Lora: Low-
rank adaptation of large language models. CoRR,
abs/2106.09685.
Kumar, G., Banchs, R., and D’Haro, L. (2015). Au-
tomatic fill-the-blank question generator for student
self-assessment. pages 1–3.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., and Al-Emari, S.
(2020). A systematic review of automatic question
generation for educational purposes. International
Journal of Artificial Intelligence in Education, 30:121
– 204.
Lee, M., Nakamura, F., Shing, M., McCann, P., Akiba, T.,
and Orii, N. (2023). Japanese stablelm instruct alpha
7b v2.
Li, B., Hou, Y., and Che, W. (2022). Data augmentation
approaches in natural language processing: A survey.
AI Open, 3:71–90.
Lin, C.-Y. and Hovy, E. (2003). Automatic evaluation of
summaries using n-gram co-occurrence statistics. In
Proceedings of the 2003 Human Language Technol-
ogy Conference of the North American Chapter of the
Association for Computational Linguistics, pages 71–
78.
Liu, M., Calvo, R. A., and Rus, V. (2010). Automatic ques-
tion generation for literature review writing support.
In International Conference on Intelligent Tutoring
Systems.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig,
G. (2023a). Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural language
processing. ACM Comput. Surv., 55(9).
Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H.,
Li, A., He, M., Liu, Z., Wu, Z., Zhao, L., Zhu, D., Li,
X., Qiang, N., Shen, D., Liu, T., and Ge, B. (2023b).
Summary of chatgpt-related research and perspective
towards the future of large language models. Meta-
Radiology, 1(2):100017.
Oh, S., Go, H., Moon, H., Lee, Y., Jeong, M., Lee, H. S.,
and Choi, S. (2023). Evaluation of question gener-
ation needs more references. In Rogers, A., Boyd-
Graber, J., and Okazaki, N., editors, Findings of
the Association for Computational Linguistics: ACL
2023, pages 6358–6367, Toronto, Canada. Associa-
tion for Computational Linguistics.
Patil, R., Boit, S., Gudivada, V., and Nandigam, J. (2023).
A survey of text representation and embedding tech-
niques in nlp. IEEE Access, 11:36120–36146.
Perkoff, E. M., Bhattacharyya, A., Cai, J. Z., and Cao, J.
(2023). Comparing neural question generation archi-
tectures for reading comprehension. In Proceedings
of the 18th Workshop on Innovative Use of NLP for
Building Educational Applications (BEA 2023), pages
556–566.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016).
SQuAD: 100,000+ questions for machine comprehen-
sion of text. In Su, J., Duh, K., and Carreras, X., ed-
itors, Proceedings of the 2016 Conference on Empir-
ical Methods in Natural Language Processing, pages
2383–2392, Austin, Texas. Association for Computa-
tional Linguistics.
Shin, D. and Lee, J. H. (2023). Can chatgpt make reading
comprehension testing items on par with human ex-
perts? Language Learning & Technology, 27(3):27–
40.
Susanti, Y., Tokunaga, T., Nishikawa, H., and Obari, H.
(2017). Evaluation of automatically generated English
vocabulary questions. Research and Practice in Tech-
nology Enhanced Learning, 12(11):1–12.
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D.,
Majumder, R., and Wei, F. (2022). Text embeddings
by weakly-supervised contrastive pre-training. CoRR,
abs/2212.03533.
Yuan, X., Wang, T., Wang, Y.-H., Fine, E., Abdelghani,
R., Sauzéon, H., and Oudeyer, P.-Y. (2023). Selecting
better samples from pre-trained LLMs: A case study
on question generation. In Rogers, A., Boyd-Graber,
J., and Okazaki, N., editors, Findings of the Asso-
ciation for Computational Linguistics: ACL 2023,
pages 12952–12965, Toronto, Canada. Association
for Computational Linguistics.