The LLM's analysis performance can be enhanced by employing larger models or by integrating more task-specific training data. Closed models, such as those developed by OpenAI and Anthropic, which are expected to be larger and trained on superior data, may outperform open models in these tasks. The field of large language models, encompassing both open and closed models, is evolving at an unprecedented pace, with significant improvements each year. Future enhancements to our system could therefore be achieved by adopting newer, more advanced models.
As with Whisper, an improvement strategy for the LLM is fine-tuning on domain-specific training data, such as transcribed conversations paired with high-quality summaries or competency assessments. Even a few hours of high-quality data, such as the dataset generated for this paper, would likely already yield positive results. This approach can further refine the capabilities of an already trained large language model.
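To illustrate what such domain-specific fine-tuning data could look like, the sketch below packages transcript/summary pairs into chat-style JSONL records, a format commonly used for fine-tuning chat LLMs. The field names, the system prompt, and the instruction text are illustrative assumptions, not the format used in this paper.

```python
import json

# Illustrative system prompt; the actual instruction would mirror the
# prompts used in the AAD pipeline.
SYSTEM_PROMPT = "You summarise flight-debrief transcripts for pilot training."

def to_finetune_record(transcript: str, summary: str) -> dict:
    """One supervised example: debrief transcript in, reference summary out."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Summarise this debrief:\n{transcript}"},
            {"role": "assistant", "content": summary},
        ]
    }

def write_jsonl(pairs, path):
    """Write (transcript, summary) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for transcript, summary in pairs:
            f.write(json.dumps(to_finetune_record(transcript, summary)) + "\n")
```

A few hours of transcribed debriefs with expert-written summaries, exported this way, would form a small but targeted fine-tuning set.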
Our proposed future iteration of the AI-Assisted Debrief should incorporate user intervention at every stage of the process, enabling correction of the system's intermediate outputs. For instance, the system could automatically flag potentially misinterpreted words or incorrectly identified speakers, allowing users to manually rectify these errors. Similarly, users should be able to adjust summaries and PI identifications as needed.
These user corrections would not only improve the immediate output but also provide valuable data for fine-tuning AAD. This creates a dynamic system that progressively improves its performance and accuracy in executing its designated tasks. Through this iterative learning process, AAD would evolve into an increasingly reliable tool.
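One way to capture such corrections is to record, for each pipeline stage, the stage's input, the model's output, and the user's edit (if any); edited examples then become fine-tuning pairs. The sketch below is a minimal illustration under these assumptions; the stage names and field layout are hypothetical, not an implemented part of AAD.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageCorrection:
    """One intermediate result shown to the user, with their optional edit."""
    stage: str                              # e.g. "transcription", "summary"
    stage_input: str                        # what this stage received
    model_output: str                       # what the system produced
    user_correction: Optional[str] = None   # None means the user accepted it

    @property
    def accepted(self) -> bool:
        return self.user_correction is None

    def as_training_pair(self) -> tuple:
        """Input/target pair for later fine-tuning; falls back to the
        accepted output when no correction was made."""
        target = self.user_correction if self.user_correction else self.model_output
        return (self.stage_input, target)

def corrected_only(records):
    """Keep only the stages the user actually edited: the most valuable
    examples, since they show where the model was wrong."""
    return [r for r in records if not r.accepted]
```

Accumulated over many debriefs, the output of `corrected_only` would form exactly the kind of domain-specific dataset discussed above.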
Given the inherent limitations of current-genera-
tion LLMs, particularly their tendency to hallucinate,
we posit that the most effective application of these
technologies lies in such a human-in-the-loop frame-
work. This approach synergistically combines the
unique strengths of both LLMs and human expertise.
Human experts possess an unparalleled capacity for
critical thinking and the nuanced evaluation of com-
plex scenarios, which LLMs currently cannot match.
Conversely, LLMs excel in rapidly processing and
analysing vast quantities of data, a task that is time-
consuming and labour-intensive for humans.
ACKNOWLEDGEMENTS
We would like to thank Anneke Nabben for pitching
this project and facilitating it throughout; Simone
Caso for his help researching flight debriefings;
Jeroen van Rooij and Thomas Janssen for creating the
evaluation dataset; Asa Marjew for the illustrations;
Astrid de Blecourt for her leadership; Jelke van der
Pal for his helpful guidance and reviews; and finally Marjolein Lambregts and Jenny Eaglestone for their help.
REFERENCES
Bredin, H. (2023). pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. Proc. INTERSPEECH 2023.
Deng, J., & Lin, Y. (2022). The benefits and challenges of ChatGPT: An overview. Frontiers in Computing and Intelligent Systems, 81-83.
EASA. (2023, December 18). Post-flight debrief. Retrieved from https://www.easa.europa.eu/community/topics/post-flight-debrief
Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 681-694.
Achiam, J., et al. (2023). GPT-4 Technical Report. OpenAI.
Kasneci, E., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences.
Mavin, T. J. (2016). Models for and practice of continuous
professional development for airline pilots: What we can
learn from one regional airline. Supporting learning
across working life: Models, processes and practices,
169-188.
Mavin, T. J., Kikkawa, Y., & Billett, S. (2018). Key contrib-
uting factors to learning through debriefings: commercial
aviation pilots’ perspectives. International Journal of
Training Research, 122-144.
McDonnell, L. K., Jobe, K. K., & Dismukes, R. K. (1997).
Facilitating LOS debriefings: A training manual.
OpenAI. (n.d.). Whisper Github Repository. Retrieved from
https://github.com/openai/whisper
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog.
Roth, W.-M. (2015). Cultural Practices and Cognition in Debriefing: The Case of Aviation. Journal of Cognitive Engineering and Decision Making, 263–278.
SkyBrary. (2023, December 19). Evidence-based Training (EBT). Retrieved from https://skybrary.aero/articles/evidence-based-training-ebt
Tannenbaum, S., & Cerasoli, C. (2013). Do Team and Individual Debriefs Enhance Performance? A Meta-Analysis. Human Factors.
Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. Meta.
van Dorn, J. L. (2023). Applying Large-Scale Weakly Supervised Automatic Speech Recognition to Air Traffic Control. Retrieved from http://resolver.tudelft.nl/uuid:8aa780bf-47b6-4f81-b112-29e23bc06a7d