the future for training machine learning algorithms, a few hundred FMs will definitely not be sufficient to achieve high prediction accuracy. For example, deep learning models used in NLP typically use training datasets containing millions of items (Bailly, 2022). In this section, we present a few interesting results. We have to note here that these were separate experiments and that none of these results has yet been implemented in i-SART.
Generative AI (GenAI) is a modern, powerful technology that can produce new, plausible media content from existing content, including text, images, and audio, trying to mimic human creativity. It originates in research done at Google in 2017 (Vaswani, 2017), which first analyzed a language to discover patterns in it and then turned this analysis into a prediction of which word or phrase should come next. Many GenAI algorithms exist, varying from the probabilistic Naïve Bayes networks and Markov models to all kinds of feature-based neural-network variations, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and the GPT-2, -3 and -4 series, where GPT stands for “Generative Pre-trained Transformer”.
Regardless of the algorithm, automated text generation works in basically the same way. In the beginning, all probabilities or adjustable weights in the neural network are unknown; we say that the model is not trained. However, the model can learn these parameters if provided with a huge number of training examples. Eventually, when the training is finished and one starts with a word or short phrase (also called a prompt), the model will be able to accurately predict the most likely next word.
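To make this concrete, the minimal Python sketch below illustrates the train-then-predict-the-next-word idea with a toy bigram (first-order Markov) model; the three-sentence corpus is invented for illustration and is not taken from our FM database.

```python
# Toy illustration of "learn from examples, then predict the next word".
from collections import Counter, defaultdict

corpus = [
    "patient position is not verified",
    "patient identity is not verified",
    "treatment plan is not approved",
]

# "Training": count how often each word follows another.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most likely next word after `word`, or None if unseen."""
    followers = counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("is"))   # -> 'not'
print(predict_next("not"))  # -> 'verified' (seen twice vs. 'approved' once)
```

In a neural model the counts are replaced by adjustable weights, but the principle of learning from examples and then predicting the most likely continuation is the same.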
Therefore, GenAI seemed a perfect approach for our purpose. We had a rather small collection of training text data (the FMs), and we wanted an AI algorithm to learn how to create additional, synthetic FMs. In line with these thoughts, we conducted two preliminary experiments that explored the performance of different GenAI algorithms.
The first experiment, carried out in the context of an MSc Computer Science graduation project (Haddou, 2022), used two different algorithms, Markov Chains and GPT-3, to learn how to create new FMs based on an existing collection. The training database was slightly different, containing around 600 FMs collected from the literature and a few RT departments in Europe. From these, eleven FMs generated with the Markov Chains algorithm were presented for validation to an RT expert. Out of these, six were found useful. There was at least one artificial FM with a high RPN, namely “Incorrect image data set associated with patient shifts determined”, which was interpreted by the RT expert as “patient shifts determined by incorrect image data set”. Another FM was very interesting because the RT expert had seen it many times before, namely “Patient head’s position is not ideal”. The RT expert noted that this FM would never have come spontaneously to her mind, yet it was clearly and correctly generated by the Markov Chain algorithm.
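For illustration only, the sketch below shows a generic Markov-chain text generator of the kind used in this experiment: it samples complete candidate sentences word by word from learned bigram counts. The corpus, the start/end markers and the sampling strategy are our own simplifications and not the implementation of (Haddou, 2022).

```python
# Generic Markov-chain (bigram) sentence generator on a toy corpus.
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # our own sentence-boundary convention

def train(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split() + [END]
        for word, nxt in zip([START] + words, words):
            counts[word][nxt] += 1
    return counts

def generate(counts, max_len=15):
    word, out = START, []
    while len(out) < max_len:
        followers = counts[word]
        # Sample the next word proportionally to how often it followed `word`.
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        if word == END:
            break
        out.append(word)
    return " ".join(out)

counts = train(["patient position is not verified",
                "incorrect image data set selected",
                "patient identity is not verified"])
print(generate(counts))  # e.g. "patient identity is not verified"
```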
The GPT-3 algorithm generated eleven FMs that were also presented to the same RT expert. Out of these, four were found useful. Especially interesting were the FMs “Patient or nurse falls” and “Patient falls down due to mobile phone dropping on the floor”. From this we could conclude that synthetic FMs have the potential to raise awareness of, or reveal, unpredicted hazards that might occur during any process, not necessarily RT-specific ones.
The second experiment used a Generative Adversarial Network (GAN) algorithm to generate artificial FMs (Brophy, 2023). As a training dataset we used our most recent FM database. GANs are a branch of GenAI algorithms that consist of two artificial neural networks, called the Generator and the Discriminator (Goodfellow, 2020). The Generator tries to generate new data as similar to the original data as possible, while the Discriminator’s role is to determine whether its input belongs to the real dataset or not. The optimization process can be characterized as a game in which the Generator learns to “fool” the Discriminator, to the point where the Discriminator can no longer distinguish between real and synthetic data. In particular, our experiment used the SeqGAN model (Yu, 2017).
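As an illustration of this adversarial game, the sketch below shows a generic GAN training loop on toy numeric data (PyTorch, with invented dimensions and placeholder data). The actual SeqGAN model is more involved, since it treats the Generator as a reinforcement-learning agent to cope with discrete text tokens; the sketch only conveys the alternating Generator/Discriminator updates.

```python
# Minimal, generic GAN training loop (illustrative sketch, not SeqGAN itself).
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 16, 32  # assumed toy dimensions

# Generator: maps random noise to a synthetic sample.
G = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
# Discriminator: outputs the probability that a sample is real.
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.randn(256, DATA_DIM) + 2.0  # placeholder for encoded real FMs

for step in range(1000):
    # Discriminator step: learn to separate real samples from generated ones.
    z = torch.randn(64, LATENT_DIM)
    fake = G(z).detach()
    real = real_data[torch.randint(0, real_data.size(0), (64,))]
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to "fool" the Discriminator into labeling fakes as real.
    z = torch.randn(64, LATENT_DIM)
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```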
The Bilingual Evaluation Understudy (BLEU)
score was one of the metrics used to measure the
quality of the FMs produced by the generative
algorithm. The basic idea of the BLEU score is
straightforward: the closer the synthetic FM is to the
human-generated target sentence, the better it is; a
score of 1 means a perfect match, and 0 means no
match. A BLEU score has different levels (n), depending on the size of the n-grams being compared. For example, if n=1, individual words from the original and synthetic text are compared, and if n=2, word pairs are matched.
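For example, the short sketch below shows how BLEU-1 and BLEU-2 scores can be computed with NLTK; the reference and candidate sentences are invented examples, not FMs from our database.

```python
# Computing BLEU-1 and BLEU-2 for a synthetic sentence against a reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["patient position is not verified before treatment".split()]
candidate = "patient position is not checked before treatment".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0),
                      smoothing_function=smooth)
bleu2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0),
                      smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.2f}, BLEU-2: {bleu2:.2f}")  # identical sentences give 1.0
```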
As training data we used all the 584 raw FMs initially collected as described in section 4.1, plus 1721 FMs taken from the headlines of incidents reported in IAEA SAFRON (SAFety in Radiation ONcology), an