Mental Wellbeing at Sea: A Prototype to Collect Speech Data

in Maritime Settings

Pascal Hecker

1,2 a

, Monica Gonzalez-Machorro

1,3 b

, Hesam Sagha

1 c

, Saumya Dudeja

1 d

Matthias Kahlau

1 e

, Florian Eyben

1 f

, Bj

orn W. Schuller

1,3,4 g

and Bert Arnrich

2 h

audEERING GmbH, Gilching, Germany

Digital Health – Connected Healthcare, Hasso Plattner Institute, University of Potsdam, Germany

CHI – Chair of Health Informatics, MRI, Technische Universit

at M

unchen, Germany

GLAM – Group on Language, Audio, & Music, Imperial College, U.K.

{phecker, mgonzalez, hsagha, sdudeja, mkahlau, fe}@audeering.com, schuller@tum.de, bert.arnrich@hpi.de

Keywords:

Mental Wellbeing, Stress, Seafarer, Speech Data Recording, Speech Data Analysis.

Abstract:

The mental wellbeing of seafarers is particularly at risk due to isolation and demanding work conditions.

Speech as a modality has proven to be well-suited for assessing mental health associated with mental well-

being. In this work, we describe our deployment of a speech data collection platform in the noisy and iso-

lated environment of an oil tanker and highlight the associated challenges and our learnings. We collected

speech data consisting of 378 survey sessions from 25 seafarers over nine weeks. Our analysis shows that

self-reported mental wellbeing measures were correlated with speech-derived features and we present initial

modelling approaches. Furthermore, we demonstrate the effectiveness of audio-quality-based ﬁltering and de-

noising approaches in this uncontrolled environment. Our ﬁndings encourage a more ﬁne-grained monitoring

of mental wellbeing in the maritime setting and enable future research to develop targeted interventions to

improve seafarers’ mental health.

1 INTRODUCTION

The maritime industry, despite its vital role in global

trade, often overlooks the psychological toll on its

workforce. Seafarers, in particular, are at risk due

to isolation and demanding work conditions (Brooks

and Greenberg, 2022). The numerous occupation-

related factors include long times away from home

and friends and family, heterogeneous crews with sev-

eral cultural backgrounds clashing, and a high work-

load with no conventional leisure time on the week-

ends.

To support this vulnerable occupation group, the

ﬁrst step would be to understand the factors contribut-

ing to their wellbeing. Due to the long time abroad,

https://orcid.org/0000-0001-6604-1671

https://orcid.org/0009-0008-9188-058X

https://orcid.org/0000-0002-8644-9591

https://orcid.org/0009-0003-5397-1759

https://orcid.org/0009-0004-7017-541X

https://orcid.org/0009-0003-0330-8545

https://orcid.org/0000-0002-6478-8699

https://orcid.org/0000-0001-8380-7667

direct interaction and interventions are not practica-

ble. Therefore, an automated system could serve as

a powerful tool for enhancing wellbeing on a broad

scale. In particular, the impact of targeted interven-

tions to enhance wellbeing could be readily measured

objectively with such a system.

1.1 Related Work

Given its pronounced impact on wellbeing within

the seafaring occupation, stress serves as a ﬁtting

proxy for monitoring overall wellbeing in this en-

vironment (Brooks and Greenberg, 2022). Stress

can manifest itself in several physiological modali-

ties (Alberdi et al., 2016). In particular, its impact on

speech has been extensively described (Giddens et al.,

2013; Van Puyvelde et al., 2018; Baird et al., 2021).

Using speech to measure stress bears the promise to

impose little strain on the user, while it can be ob-

tained readily (Hecker et al., 2022). Further, speech

data collected via an automated telephone system was

shown to be used as a modality to assess mental health

associated to stress (Higuchi et al., 2020). To col-

Hecker, P., Gonzalez-Machorro, M., Sagha, H., Dudeja, S., Kahlau, M., Eyben, F., Schuller, B. and Arnrich, B.

Mental Wellbeing at Sea: A Prototype to Collect Speech Data in Maritime Settings.

DOI: 10.5220/0013093700003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 2: HEALTHINF, pages 29-40

ISBN: 978-989-758-731-3; ISSN: 2184-4305

lect speech data in regular settings, participants can

usually just use their smart devices or data can be

recorded in a controlled environment. In contrast, the

collection of speech data onboard a ship is a very

novel endeavour. Therefore, we set out to explore

how the mental well-being of seafarers through self-

reported stress could be measured by analysing their

speech data.

1.2 Contribution

In this publication, we describe our approach to mon-

itoring a crew’s wellbeing by recording speech data

on board an oil tanker. Collecting data on a ship has

many challenges to cope with, such as limited inter-

net connectivity, accessibility, reliability, and noise.

Therefore, we adapted our data collection platform

to overcome these issues. We perform feature anal-

ysis and basic classiﬁcation experiments to assess the

correlation of several self-reported mental wellbeing

measures of participants. Further, we explore the im-

pact of data ﬁltering and denoising on the results. We

showcase the successful implementation of data col-

lection efforts on board of an oil tanker and demon-

strate that the data collected from the participants can

serve as a valuable indicator of seafarer’s wellbeing.

To the best of our knowledge, this is the ﬁrst study to

successfully record and assess speech data in such a

setting.

2 DATA COLLECTION

2.1 Study Setup

This study was done within the scope of a proof-of-

concept (PoC) project sponsored by the “Safetytech

Accelerator Limited incorporated” (a non-proﬁt es-

tablished by the Lloyd’s Register Group Limited,

London, England). The study was conducted with

one of the oil tankers (“vessels”) from the ﬂeet of

the shipping company “TORM” (Copenhagen, Den-

mark). “HiLo Maritime Risk Management” (Milton

Keynes, England) provided general insights on the

journey of the selected oil tanker. Together with the

captain and with the assistance of the human factors

department from the Lloyd’s Register Group Limited

(London, England), we deployed the system on board

and reﬁned the speech data collection surveys to be

conducted. On December 5, 2022, the data collec-

tion was rolled out to the crew and was concluded on

February 8, 2023.

Each crew member was informed about the pro-

cessing procedures involving their data and had to opt

in by giving their explicit informed consent before en-

rolling and participating in the data collection. Data

handling was deﬁned by collaboration and data pro-

cessing agreements that enforced strict adherence to

the EU’s general data protection regulation (GDPR)

guidelines and data privacy protection. Emphasis was

put on the protection of the users’ data and therefore

the other project partners had never access to the raw,

recorded user data.

2.2 Data Collection Platform

On the open water, the internet connection is usually

only available through satellite uplink, which is slow

and costly. To address this major constraint within

this PoC, we concluded that a local deployment of

our AI SoundLab platform would be the sole solu-

tion. AI SoundLab is a web platform developed by

the audEERING GmbH to collect high-quality audio

data actively from users. Data collection is organised

in customisable surveys. Those surveys also allow

the collection of additional metadata in the form of

questionnaires with, for example, checkboxes, radio

buttons, and free text ﬁelds. The platform is capa-

ble of running multiple different surveys in a study

and offers the option to unlock surveys to users only

upon completion of previous surveys. We deployed

this platform on a laptop computer, functioning as a

server, on the selected vessel and integrated it into

the intranet on board. This way, the crew members

were able to access the platform from their private

devices when connected to the intranet. To secure the

collected data from unauthorised access, an identity

and access management (IAM) service was deployed

on the laptop to authenticate users. All requests be-

tween services were transport layer security (TLS)-

encrypted.

Figure 1 depicts the data ﬂow from user record-

ings on mobile devices to the network-attached stor-

age (NAS) on audEERING’s premises. Automati-

cally, on a daily basis, audio data were converted to

free lossless audio codec (FLAC) to reduce the size,

and along with other compressed data, they were en-

crypted and incrementally backed up to a folder on

the local server. Once internet connectivity was sta-

ble with a sufﬁciently high bandwidth (usually when

the vessel was docked in a port), the local incremen-

tal backup was transferred to our servers through an

encrypted connection. Based on interim updates of

the transmitted data, we provided the captain and the

crew with a transparent overview of the ongoing data

acquisition. The entire study was performed in En-

glish language.

HEALTHINF 2025 - 18th International Conference on Health Informatics

Figure 1: Data ﬂow within the deployment of the

AI SoundLab “platform” on board of the ship.

2.2.1 Audio Recording Library

A software module developed by audEERING was

used to record raw, uncompressed audio in the

web browser. It is integrated with the front end

of the AI SoundLab web platform. This so-called

“Recording Library” is implemented in JavaScript

and is based on the web standard Media Capture and

Streams API, which is related to WebRTC, and which

is generally implemented by all common browsers.

To provide extensive support and the best possible

recording quality, the library has an automatic ad-

justment of certain conﬁgurations based on the ex-

ecution environment (browser, platform) as well as

support for the explicit conﬁguration of various set-

Table 1: Recurring survey types and the self-reported men-

tal wellbeing measures addressed at different intervals.

Baseline

Daily

Weekly 1

Weekly 2

Final

Stress - now ✓ ✓ ✓ ✓ ✓

Stress - work ✓ ✓ ✓ ✓ ✓

WHO-5 ✓ ✓ ✓ ✓

PHQ-8 ✓ ✓ ✓ ✓

PSS-10 ✓ ✓ ✓

tings for audio recording and other features. In ad-

dition to audio recording, the library includes partic-

ular client-side, real-time audio assessment features,

such as sound activity and clipping detection, as well

as peak and root mean square (RMS) level meter-

ing. These features can be dynamically conﬁgured

and, e. g., prompt users to repeat a recording if cer-

tain quality issues are present. These features can be

used optionally to inform the user, to block the sub-

mission of the recording until it is corrected or for

a certain number of repetitions, if desired, and/or to

log information regarding audio quality and record-

ing issues. Furthermore, depending on the browser

support, the library supports the use of optional au-

dio enhancement features such as noise suppression,

echo cancellation, and auto gain control. In the case

of this PoC however, we decided to disable these gate-

keeper functionalities to prevent user frustration and

simplify recording in that unsupervised and noisy en-

vironment. Instead, we applied audio-quality-related

measures, such as denoising, to the collected data.

The browser-based, installation-free application,

the extensive support of various devices, and the au-

dio quality assessment and level metering features

make the Recording Library a strong solution for au-

dio recording in environments such as the vessel in

this PoC, where participants used their personal de-

vices for the surveys.

2.3 Surveys

We created several different surveys and asked partic-

ipants to complete longer wellbeing-related question-

naires on a weekly basis, as well as very short ques-

tionnaires that were prompted every one to two days.

On a daily basis, we implemented an visual analogue

scale (VAS) (Lesage et al., 2012; Barr

e et al., 2017)

to assess the participants’ stress level a) in the mo-

ment of taking the survey (“stress-now”), and b) dur-

ing their recent work tasks (“stress-work”). Partici-

pants saw a horizontal line on their device screen and

were asked to indicate how stressed they felt along

the axis of “not at all” and “very strongly” through

Mental Wellbeing at Sea: A Prototype to Collect Speech Data in Maritime Settings

a slider button. On a weekly basis, participants were

asked to ﬁll in more long-term questionnaires on their

mental health and wellbeing: the WHO-ﬁve well-

being index (WHO-5) (Topp et al., 2015) and the

patient health questionnaire - eight-item depression

scale (PHQ-8) (Kroenke and Spitzer, 2002). The ten-

item perceived stress scale (PSS-10) (Cohen et al.,

1983; Cohen, 1988) was ﬁlled in every two weeks.

Table 1 shows the “surveys”, bundling together

several questionnaires and speech recordings, that

were developed and integrated in AI SoundLab for

this PoC study:

1. Baseline (30-35min): This survey only had to be

completed once. The goal was to create a baseline

for our data analysis and familiarise the partici-

pants with the speech tasks of the following sur-

veys.

2. Daily (3-5min): Very short selection of speech

tasks, prompted every 1-2 days.

3. Weekly 1 (15-20min): Similar to “Daily” plus

some more long-term self-reported mental well-

being measures. Participants were asked to do

this survey and then “Weekly 2” in alteration ev-

ery week.

4. Weekly 2 (15-20min): Similar to “weekly 1” with

the addition of the PSS-10.

5. Final (20-25min): This survey was activated at

the end of the journey and the participants had to

complete it only once. The questions were similar

to “Weekly 2”.

Each survey administered to participants included

several distinct speech prompts. We aimed to cover a

broad variety of different speech elicitation prompts

while keeping the protocol short for the best usabil-

ity. Participants were asked to a) produce sponta-

neous speech by talking for one minute about their

latest work tasks (“spontaneous”), b) to produce the

sustained phonation of the vowel /a/ for as long as

was comfortably possible (“sustained /a/”), c) read

a sentence of pseudowords in a neutral (“read neu-

tral”), as well as “happy” tone (“read happy”), and

d) to perform a small cognitive challenge by counting

downwards from 60 to 40 as fast as possible (“count-

ing”). The read speech task was designed to be ro-

bust against the language it was spoken in since we

expected a diverse crew from several nationalities. It

was composed to cover many different vowels while

keeping a simple syllable structure with consonant-

vowel-chains, inspired by the approach in (Scherer

et al., 1991), and it reads “Nilago me bu lefﬁ, nulato

dupo sam.” To verify that this content-free sentence

of pseudowords was indeed suited, we piloted it with

ﬁve colleagues and veriﬁed its unambiguous pronun-

ciation. In the “baseline” survey, we asked the partici-

pants to practice that sentence three times to get famil-

iar with its pronunciation and discarded these record-

ings from further analysis.

After study completion, we asked users to ﬁll in

the user experience questionnaire (UEQ) (Laugwitz

et al., 2008). The UEQ is a standardised survey tool

used for assessing subjective aspects of user experi-

ence in human-computer interaction. It evaluates the

dimensions of perceived attractiveness, perspicuity,

efﬁciency, dependability, stimulation, and novelty of

a product. This provided us with data-driven insights

into the perceived interaction of the seafarers with our

platform.

3 DATA EXPLORATION

After the data collection concluded, we analysed the

full collected data set, which we unfortunately cannot

share with the community due to privacy constraints.

We ﬁltered out two participants, who recorded only

the “baseline” session, but no other session. The de-

mographic information of the included participants is

outlined in table 2.

Of particular notice is that participants completed

378 survey sessions, totalling 3:44 hours of speech

data for the regarded speech tasks. Only 15 ses-

sions (of which twelve were baseline surveys) were

aborted without completion. The baseline surveys

collected further wellbeing-related questionnaires and

demographic information. Their relatively high initial

time investment is likely the reason why participants

started, but did not complete them.

Figures 2 and 3 present the self-reported men-

tal wellbeing measure of each participant for each of

their survey sessions. The axis range is adjusted to the

score range of the respective measure. Figure 2 shows

the psychological assessment tools: WHO-5 (A),

PSS-10 (B), and PHQ-8 (C). The respective cut-off

points from the literature are marked within the plots.

Figure 3 on the other hand portrays the self-rated

stress level on a visual analogue scale (VAS) during

Table 2: Demographic information of the 25 participants.

Age & Gender Summary

Gender 2 Female, 23 Male

Age 37 ± 9.3 years old

Nationality

Filipino 18

Croatian 3

Thai 3

Indian 1

HEALTHINF 2025 - 18th International Conference on Health Informatics

b2bb

8606

5036

eab4

9f5e

6054

7624

4315

1247

c7b0

6457

3648

6a84

9bfd

1811

5899

16a6

50f7

438c

7c20

22d4

4c1b

9e5b

9f9a

f87b

Participant

100

Self-Rated WHO-5 Score

No Potential Depression

Potential Depression

WHO-5 Score Distribution per Participant

Across Sessions

7c20

9f9a

9e5b

5899

9bfd

1247

f87b

438c

4c1b

16a6

3648

6457

1811

6054

6a84

50f7

eab4

7624

4315

5036

9f5e

b2bb

22d4

c7b0

8606

Participant

Self-Rated PSS-10 Score

Low Stress

Moderate Stress

High Stress

PSS-10 Score Distribution per Participant

Across Sessions

1247

f87b

6457

9f9a

7c20

9e5b

9bfd

16a6

7624

5036

3648

5899

4c1b

438c

c7b0

8606

9f5e

6054

50f7

b2bb

eab4

1811

22d4

4315

6a84

Participant

Self-Rated PHQ-8 Score

No Depression

Mild Depression

Moderate Depression

Moderate-Severe Depression

Severe Depression

PHQ-8 Score Distribution per Participant

Across Sessions

Figure 2: Score distribution of the self-reported mental

wellbeing measures. The scores of every session per partic-

ipant are represented as box plots. The cut-off points from

the literature are marked within the plots. The plots here

focus on the psychological assessment tools A) WHO-ﬁve

well-being index (WHO-5); B) ten-item perceived stress

scale (PSS-10); C) patient health questionnaire - eight-item

depression scale (PHQ-8); details in section 2.3; the self-

reported stress measures are portrayed in ﬁgure 3.

The box of the box plots marks the ﬁrst quartile to the third

quartile of the data with a line at the median. Whiskers ex-

tend to 1.5 times of the interquartile range, and outliers are

marked as individual points.

16a6

1247

c7b0

6457

9f9a

f87b

22d4

8606

5899

3648

7c20

4c1b

6a84

50f7

9bfd

9f5e

7624

eab4

9e5b

1811

b2bb

6054

5036

4315

438c

Participant

100

Self-Rated Current Stress Level

Current Stress Level Distribution per Participant

Across Sessions

1247

9f9a

7c20

9f5e

4c1b

3648

c7b0

8606

6457

f87b

22d4

5899

16a6

9bfd

438c

eab4

9e5b

50f7

6a84

7624

1811

b2bb

6054

5036

4315

Participant

100

Self-Rated Work Task Stress Level

Work Tasks Stress Level Distribution per Participant

Across Sessions

Figure 3: Score distribution of the self-reported mental

wellbeing measures. The scores of every session per par-

ticipant are represented as box plots. D) refers to the stress

level perceived during the audio recording and E) refers to

the stress level perceived during the recent work tasks. The

self-reported psychological assessment tools are portrayed

in ﬁgure 2.

the recording (D) and during recent work tasks (E).

Some participants show a markedly higher range in

their responses, while for others it is more drawn to-

wards lower values.

We analysed the data from the incremental up-

loads in order to provide the crew feedback on their

participation and to assess the functionality of the de-

ployed system. After the completed observation pe-

riod, we thoroughly investigated the data at hand and

the resulting distribution.

3.1 Feature Analysis and Modelling

with Speech Data

To assess the information contained in the speech

recordings of the crew members, we implemented and

open-sourced a respective feature analysis and ma-

chine learning pipeline

. This publication focuses

on collecting audio data in the noisy environment

of a maritime vessel. Respectively, the analysis and

modelling efforts concentrate on handling these noisy

real-life audio data.

https://github.com/Pascal-H/mental-wellbeing-at-sea

Mental Wellbeing at Sea: A Prototype to Collect Speech Data in Maritime Settings

For pre-processing, audio ﬁles were then down-

sampled to 16 kHz and speech segments were ex-

tracted by applying a voice activity detection (VAD)

algorithm. That algorithm used was part of the

devAIce® framework and its underlying architec-

ture is based on the Speech & Music Interpreta-

tion by Large-space Extraction (OPENSMILE) in-

terface. To assess the prevalence of noise in the

data, we predicted the signal-to-noise ratio (SNR) val-

ues of the speech samples through the audio quality

module of the devAIce® framework. devAIce® is

audEERING’s modular audio AI technology frame-

work that enables a wide range of audio-related pro-

cessing tasks. The audio quality model therein con-

sists of a deep neural network (DNN) based on the

CNN10 architecture. It was inspired by (Reddy et al.,

2021), and was trained on clean speech that was being

mixed with seven audio tracks of background noise.

A synthetic impulse response was convolved with that

mix, and ﬁnally, speech distortion was applied, such

as metallic/robotic speech. The SNR values are in-

ferred as the mean squared error (MSE) difference be-

tween the clean speech sample and that sample after

adding background noise. It therefore is optimised to

predict the quality of speech itself and performs with

a concordance correlation coefﬁcient (CCC) of 0.94

when predicting the SNR level in the mix. To com-

bat noise in the data, we applied the “causal speech

enhancement model” (D

efossez et al., 2020) for our

modelling approaches.

We extracted acoustic speech features with the

OPENSMILE toolkit (Eyben et al., 2013) and utilised

the extended Geneva minimalistic acoustic param-

eter set (EGEMAPS) (Eyben et al., 2016) feature

set. These hand-crafted features are commonly used

in the area of speech analysis for healthcare (Low

et al., 2020; Hecker et al., 2022) and emotion recog-

nition (Christ et al., 2023).

For both the feature analysis and the modelling ap-

proaches, we slice the data based on the speech tasks:

spontaneous, sustained /a/, read-neutral, read-happy,

counting, and “all” tasks combined; see section 2.3).

To examine the relationship between the ex-

tracted EGEMAPS features and the self-reported

mental wellbeing measures (see section 2.3), we em-

ploy the Spearman’s rank-order correlation coefﬁ-

cient (ρ) since data were not normally distributed.

We account for type I errors by employing the Ben-

jamini–Hochberg procedure to correct resulting p-

values. No denoising approach was applied for the

statistical feature analysis.

Recently, transformer-based deep learning ap-

proaches were largely successful with automatic

speech recognition (ASR) tasks (Baevski et al., 2020).

Additionally, they also have shown great success in

speech emotion recognition (SER) applications (Wag-

ner et al., 2022) and are used as a baseline in the ﬁeld,

e. g., , the multimodal sentiment analysis (MUSE)

challenge (Amiriparian et al., 2024). Those mod-

els operate on the raw audio waveform, which is

processed by a stack of several convolutional layers

(“feature encoder”). The output of subsequent trans-

former blocks can be averaged through mean pooling

and capture more abstract and high-level features of

the input speech, resulting in so-called “embeddings”.

Since the initial models were trained on large amounts

of speech data, these high-level representations can

be used to perform several different speech tasks. We

respectively extracted the embeddings of the individ-

ual VAD segments and used them similarly to the

EGEMAPS features. We employed speciﬁc versions

of the wav2vec 2.0 architecture that were ﬁne-tuned

on several different datasets to increase their robust-

ness for noisy speech. These versions are W2V2-LR-

LIBRI (Hsu et al., 2021)

, W2V2-L-XLSR (Conneau

et al., 2020)

, and W2V2-LR-MSP (Wagner et al.,

2022)

The extracted EGEMAPS features, as well as the

wav2vec 2.0 embeddings, were used as input to a

static classiﬁcation pipeline based on the scikit-learn

Python package (Pedregosa et al., 2011). All self-

reported mental wellbeing measures were continu-

ous scales, therefore we employed regression mod-

els. The scales were normalised between 0 and 1, and

normalisation was done on the whole potential scale

range and not just the scale range covered by the par-

ticipants’ responses. Self-reported mental wellbeing

scores collected in weekly surveys were retrospec-

tively applied to all preceding sessions within the pre-

vious week. Given the size of our dataset, we trained

linear regression (LR), k-nearest neighbours regres-

sor (KNNr), support vector regressor (SVR), and ex-

treme gradient boosting regressor (XGBr) (Chen and

Guestrin, 2016) models.

To assess the performance of the models and the

approximation of the predicted values to the actual

values, we report the concordance correlation coefﬁ-

cient (CCC), which accounts for both the correlation

and the agreement between the predicted and actual

values. Model performance on this comparably small

data set was evaluated through a leave-one-speaker-

out (LOSO) cross-validation (CV) approach. This

https://huggingface.co/facebook/wav2vec2-large-rob

ust-ft-libri-960h

https://huggingface.co/facebook/wav2vec2-large-xls

r-53

https://huggingface.co/audeering/wav2vec2-large-rob

ust-12-ft-emotion-msp-dim

HEALTHINF 2025 - 18th International Conference on Health Informatics

way, we obtain a model prediction for every sam-

ple in the dataset. We used the following pre-set

model conﬁgurations: KNNr in its default conﬁgura-

tion with a leave size of 30 and a number of neigh-

bours of 5 was supplemented with variants with a

leave size of 20 and a number of neighbours of 3, and

a leave size of 40 and a number of neighbours of 7.

SVR in its default conﬁguration with a radial basis

function (RBF) as kernel and a regularization param-

eter (C) of 1was supplemented with variants with a

C of 0.1 and 3, and with linear, as well as 3

degree

polynomial kernel functions. XGBr in its default con-

ﬁguration with a learning rate of 0.3 and a maximum

depth of a tree of 6. Other variants with a learning rate

of 0.01 and a learning rate of 0.7 were trained too.

To assess model generalisation, we further per-

form bootstrapping and report the conﬁdence in-

tervals for sampling the model predictions of each

of the hold-out test speakers in the LOSO scheme.

The “conﬁdence intervals” Python package was

used (Ferrer and Riera, 2023) with the speakers set as

conditions, alpha set to ﬁve, and 1,000 as the number

of bootstrap sets.

To assess the impact of audio quality on model

performance, we train models with and without the

described denoising approach, as well as with and

without the ﬁltering of clipped and noisy samples.

Clipping was detected with the same heuristic in

AI SoundLab’s audio recording library, which checks

if the audio amplitude peak is higher than 0.99 for

more than 0.5 ms. Files that were clipped and, as well

as those that had an SNR value below 7 dB were omit-

ted in the audio quality ﬁlter condition.

4 RESULTS

4.1 Feature Analysis and Modelling

with Speech Data

For clarity, we ﬁlter the signiﬁcantly correlating fea-

tures and present only those, whose Spearman’s rank-

order correlation coefﬁcient (ρ) is larger or equal

than 0.4. For robustness, we only regard “noisy”

(non-denoised) samples from “all” survey types com-

bined. Table 3 presents the remaining features along

with the self-reported mental wellbeing measures. All

these features stem from the speech task “sustained

/a/”.

The notation of OPENSMILE composes the fea-

ture names as follows: “Loudness” refers to the psy-

choacoustic loudness, “equivalentSoundLevel dBp”

describes the mean frame level energy in dB, “spec-

tralFlux” the mean distance between spectra of ad-

jacent frames, and “F3freq” corresponds to the cen-

tre frequency of the third formant. The func-

tional aggregates can be described, where “sma3”

means moving average smoothing over three frames,

and “percentile200”, 500, and 800, the 20th, 50th,

and 80th percentile. For the centre frequency of

the third formant, “sma3nz” is the moving average

smoothing over three frames while omitting zero val-

ues, and “stddevNorm” refers to the coefﬁcient of

variation. Interestingly, most features are related

to the psychoacoustic loudness (“loudness”), most

prominently “loudness sma3 amean”, which is in-

dicative for the PSS-10, the current stress level, and

the stress level at work tasks.

Table 4 shows the results from modelling the

self-reported mental wellbeing measures across the

different survey types and with and without audio

quality measures (SNR-based ﬁltering and denois-

ing). The model performance is given in CCC, and

it is indicated whether the lower bounds of the con-

ﬁdence intervals are greater than zero and therefore

higher than the chance level. The speech task, model

type, and feature set of the best-performing mod-

els are provided. The best-performing model re-

ﬂects the WHO-ﬁve well-being index (WHO-5) score

and achieves a CCC of 0.361 with a conﬁdence in-

terval between 0.031 and 0.514. Figure 4 depicts

the corresponding regression plot. Overall, the best-

performing models are achieved using linear regres-

sion (LR) and support vector regressor (SVR) clas-

siﬁers and are based on wav2vec 2.0 embeddings as

Table 3: Signiﬁcantly correlating features with the high-

est Spearman’s rank-order correlation coefﬁcient (ρ), sorted

by self-reported mental wellbeing measures. All p-values

(corrected for multiple comparisons using Benjamini-

Hochberg) are < 0.001. Feature codes are according

to OPENSMILE and explained in section 4.1. Targets

as described in section 2.3, “stress-now” abbreviated as

“stress-n”, and “stress-work” as “stress-w”.

Target Feature (openSMILE) ρ

WHO-5 F3freq sma3nz stddevNorm 0.433

PSS-10 loudness sma3 amean 0.404

Stress-n loudness sma3 amean 0.409

Stress-n loudness sma3 percentile500 0.404

Stress-n loudness sma3 percentile800 0.404

Stress-n spectralFlux sma3 amean 0.404

Stress-n spectralFluxV sma3nz amean 0.408

Stress-w equivalentSoundLevel dBp 0.454

Stress-w loudness sma3 amean 0.463

Stress-w loudness sma3 percentile200 0.407

Stress-w loudness sma3 percentile500 0.468

Stress-w loudness sma3 percentile800 0.456

Stress-w spectralFlux sma3 amean 0.405

Mental Wellbeing at Sea: A Prototype to Collect Speech Data in Maritime Settings

Table 4: Best performing regression models for the prediction of the self-reported mental wellbeing measures. The best per-

forming model is highlighted in bold. Models were trained on the “noisy” data, which was not denoised, and the “denoised”

data, which underwent audio-quality-based ﬁltering and denoising. The performance of the regression models is given in

concordance correlation coefﬁcient (CCC), where the ﬁrst value indicates the performance on the full dataset, and the brack-

ets contain the lower and upper bounds of the conﬁdence interval; the asterisk (*) marks those models, whose lower bound

exceeds chance level (CCC above 0). The parameters of the best performing model of the respective selection are given:

the data slice of the various speech “tasks” (see section 2.3), the “model” type (linear regression (LR) and support vector

regressor (SVR)), as well as the features used to represent the data (see section 3.1).

Target Quality CCC Task Model Features

WHO-5 Denoised 0.354 (0.028 - 0.568)* Read-happy SVR eGeMAPS

WHO-5 Noisy 0.361 (0.031 - 0.514)* Read-happy SVR eGeMAPS

PSS-10 Denoised 0.163 (-0.160 - 0.337) Read-neutral LR W2V2-LR-LIBRI

PSS-10 Noisy 0.117 (0.021 - 0.236)* Spontaneous LR W2V2-LR-MSP

PHQ-8 Denoised 0.195 (-0.037 - 0.383) Read-happy LR W2V2-LR-LIBRI

PHQ-8 Noisy 0.213 (-0.003 - 0.377) Read-happy LR W2V2-LR-LIBRI

Stress-now Denoised 0.178 (0.033 - 0.313)* Read-happy LR W2V2-LR-MSP

Stress-now Noisy 0.194 (0.042 - 0.320)* Read-happy LR W2V2-LR-MSP

Stress-work Denoised 0.238 (0.040 - 0.373)* Sustained /a/ SVR eGeMAPS

Stress-work Noisy 0.227 (0.005 - 0.373)* Sustained /a/ SVR eGeMAPS

Figure 4: Regression plot of the best performing model, pre-

icting WHO-5 score with a CCC of 0.361 (0.031 - 0.514),

see table 4. The model was trained on the “read-happy”

speech task (content-neutral sentence read in a “happy”

tone) with noisy, unﬁltered data.

well as the EGEMAPS features. The read speech

task, acted in a “happy” tone, as well as the sustained

phonation of the vowel /a/ are most prominent. Mod-

els trained on denoised data outperform their counter-

parts trained on noisy data two times and are being

outperformed in three cases.

To assess the impact of the denoising approach,

we present the SNR distribution of all speech sam-

ples before and after denoising in ﬁgure 5. It can

−10 −5 0 5 10 15 20 25

SNR [dB]

100

200

300

400

500

Number of Speech Samples

SNR Distribution - Raw and Facebook Denoiser

Raw

Denoised

Figure 5: Distribution of the signal-to-noise ratio (SNR)

values (in dB) of the raw and denoised speech samples.

Figure 6: The dimensions of the user experience question-

naire (UEQ) in its provided benchmark.

be observed that the lower long tail of particularly

low SNR values is prominently reduced, and that

the upper tail of SNR values is growing through de-

noising, as expected. The mean and standard devia-

tion (SD) of the predicted SNR values changed from

11.72 dB (± 5.41 dB) to 13.15 dB (± 6.27 dB).

HEALTHINF 2025 - 18th International Conference on Health Informatics

4.2 User Experience Questionnaire

Figure 6 provides the results of the UEQ, which was

completed by 22 participants. Except for dependabil-

ity (which denotes if a user feels in control of in-

teraction or if they feel secure), the other measures

are above average compared to the publicly provided

benchmark of 21,000 participants (Laugwitz et al.,

2008). Interestingly, perspicuity received the high-

est rating, which indicates how easy it is to become

familiar with the software and learn how to use it.

15 participants provided an additional free-text an-

swer and, in a nutshell, were “happy” with the study

and the “feeling of being heard”, however, they had

some concerns about data security.

5 DISCUSSION

With this PoC study, we piloted the assessment of

the mental wellbeing of seafarers through established

self-reported mental wellbeing measures and speech

recordings. Through our data collection platform,

AI SoundLab, we were able to reliably record longi-

tudinal speech data and deploy the platform locally

on the ship with only very limited data transfer quo-

tas. Within our data collection effort, we registered

a notably high completion rate with 378 survey ses-

sions and 25 participating seafarers, and crew mem-

bers gave above-average feedback on the ease of use

of the data collection platform. An open-source de-

noising solution was employed to reduce background

noise and therefore increase the SNR distribution of

the speech samples, and noisy, as well as clipped au-

dio samples,s were ﬁltered out in data pre-processing.

Thanks to AI SoundLab, we were able to conduct

the study in a challenging setting with limited con-

nectivity. Participants were using their own mobile

devices in a real-life setting during their active work

period on board an oil tanker. The demonstration that

data were successfully collected in such a setup bears

promise for similar future endeavours in order to mea-

sure mental wellbeing in those challenging working

conditions.

The collected data, consisting of the self-reported

mental wellbeing measures and the speech record-

ings, were processed and analysed. Several acoustic

speech features were found to be moderately correlat-

ing with the self-reported mental wellbeing measures.

While not robustly predictive on their own, they still

provide a meaningful connection to the manifestation

of stress and wellbeing in speech patterns. The posi-

tively correlating features with the self-reported men-

tal wellbeing measures are strongly clustered around

loudness. In literature, jitter- and F0-related features

are most indicative of stress, further vocal loudness

along with an increased F0 might also be related

(Shukla et al., 2011; Van Puyvelde et al., 2018).

Assessing the self-reported wellbeing measures

alone already yields valuable insights. In ﬁg-

ure 2 A, one participant (ID “f87b”) is clearly be-

low the threshold for “potential depression” for all

the recorded sessions. That same participant is also

within the “moderate depression” category in ﬁg-

ure 2 C. Respectively, these psychological assessment

tools alone might already help to uncover these hints

towards crises of individual seafarers.

Model performance is limited, which might be

due to strong differences in the expression of stress in

individual participants, as reported in (Van Puyvelde

et al., 2018). Figure 3 indicates that some participants

cover a broad range of the self-rated “current” and

“work-task-related” stress level, while others declare

very little variation. This could be in part attributed to

the subjective nature of stress perception. Individu-

als experience stress differently, inﬂuenced by factors

such as personality traits, coping mechanisms, and

cultural background. This subjectivity is a primary

reason why self-reported stress evaluations can lack

accuracy (Weckesser et al., 2019; Sommerfeldt et al.,

2019).

Further, the divergence in a high and low variation

of the self-reported stress levels could reﬂect a report-

ing bias. Even though it was communicated vigor-

ously during the study that the employer and sponsor

of the study never had access to this sensitive infor-

mation, some participants might have mistrusted the

protection of their data. As outlined in the literature,

the disclosure of personal information is highly de-

pendent on the trust of users, and mistrust leads to

an unwillingness to report their information (Joinson

et al., 2010). The reported variability in self-reported

stress levels indicates that those measures are valuable

tools to monitor the participants’ wellbeing, but also

emphasise that the users’ trust in the system is crucial.

Strategies for audio quality control such as ﬁlter-

ing out noisy and clipped samples, as well as denois-

ing, show only a limited effect. Figure 5 shows that

there is a noticeable effect when applying denoising

to the SNR distribution, and the mean SNR values

and the SD improve from 11.72 dB (± 5.41 dB) to

13.15 dB (± 6.27 dB). However, denoising could po-

tentially also have a negative impact on the classiﬁ-

cation performance. During denoising, the “noisy”

frequency bands in the acoustic spectrum are being

removed. Since those frequency bands might contain

crucial information for the classiﬁcation approach at

hand, an overly aggressive denoising approach could

Mental Wellbeing at Sea: A Prototype to Collect Speech Data in Maritime Settings

do more harm than good and impair classiﬁcation per-

formance.

When regarding the classiﬁcation performance on

the different self-reported wellbeing measures, the

WHO-5 questionnaire outperforms the other mea-

sures and is being followed by the VAS-based assess-

ment of the perceived stress level during the recent

work tasks. The WHO-5 questionnaire was utilised to

reﬂect back on the time period of the last week. The

highest classiﬁcation performance achieved using the

WHO-5 scores might indicate that the measured un-

derlying long-term “wellbeing” is manifesting on this

time scale. Similarly, the self-reported stress level

during the recent work tasks might be more indica-

tive of the state and wellbeing of the participants

than the currently-perceived self-reported stress level.

These insights could guide future efforts in selecting

the most informative assessment measures, however,

due to the limited predictive strength, a generalisation

should be done only with caution.

Further, the “read-happy” task was most promi-

nent in the best-performing models (table 4). A possi-

ble explanation is the robustness against accents from

the various ethnicities on board due to the unambigu-

ous pronunciation across different languages present

on the ship (table 2). The aspect of the acted posi-

tive sentiment might be an additional catalyst to pre-

dict wellbeing through speech. The contrast between

the prompted positive emotion and the actual mood of

the participant could strengthen the expression in the

voice. Especially when being in a negative mood, it

might be an additional burden to sound positive. Fur-

ther, the sustained phonation of the vowel /a/ could be

particularly robust against the spoken language and

accent. LR- and SVR-based models are dominant,

with SVR models showing a slightly better perfor-

mance. SVR-based models seem to perform best with

the EGEMAPS feature set, while LR-based models

are prominently using wav2vec 2.0-embeddings as in-

put. This might indicate that certain non-linearities in

the data might be better captured by SVR-based clas-

siﬁers, while speech embeddings might not be as in-

formative to SVR-based approaches.

The design of the study protocol and the data col-

lection through AI SoundLab could be regarded as

the cornerstones for future approaches. The promi-

nence of the read speech task in a happy tone and the

sustained vowel /a/ could encourage the inclusion of

these into future data collection efforts.

In the wider context, implementing a system that

objectively monitors the wellbeing of a ship’s crew

through speech analysis can provide a valuable instru-

ment. In particular, such a system could be used to

design and evaluate targeted interventions to enhance

the crew’s welfare. This aligns also with employers’

interests, as a high level of crew wellbeing is likely

to lead to increased productivity (Brooks and Green-

berg, 2022). The privacy of individual users is of ut-

most importance in this context, and in our PoC study,

we contractually agreed only to report ship-wide ag-

gregates to the sponsor. In line and understandably,

participants reported concerns with data safety, al-

though we emphasised that the sponsor has no ac-

cess to their data by design of our data collection sys-

tem. To address this further, the high transparency

of a resulting large-scale system seems key, and the

participants themselves should ideally be able to see

the exact output measures that are aggregated over the

whole ship and conveyed to the sponsor.

Our major learning from this study is that the in-

volvement of the human factors department is key to

designing an understandable and frictionless data sur-

vey. Human factors were able to facilitate effective

communication with the crew and to provide an over-

all understanding of their situation and needs. The

deployment of a system in the maritime setting, such

as presented here, is particularly challenging, but it

is also highly needed. Mental wellbeing and stress

are highly subjective and individual aspects, and mod-

elling approaches for measuring these will have to be

highly personalised.

5.1 Limitations and Future Work

The data distribution of stress per participant (ﬁg-

ure 3) indicates a heterogeneous and “individual na-

ture” of mental wellbeing and stress. As we observed,

models struggle to generalise on previously unseen

speakers and overﬁtting is the consequence. Person-

alised models, such as those proposed by (Wu et al.,

2023), could be a promising next step. This ensures

that the subjective and individual perceived wellbeing

can be quantiﬁed reliably to evaluate the success of

potential interventions.

In addition, we propose a shift towards a more

passive data collection approach for future studies.

Slavich et al. highlighted the potential of speech anal-

ysis to quantify stress as an important indicator for

wellbeing (Slavich et al., 2019). In particular, they

portrayed the potential of passive monitoring solu-

tions with particular emphasis on individuals’ pri-

vacy. Similarly, Jiang et al. piloted a wrist-worn de-

vice that collects audio data and only saves high-level

features to preserve anonymity (Jiang et al., 2019).

Accordingly, we see great potential in adapting the

here-described active monitoring system to a pas-

sive, privacy-friendly one, that takes the friction of

required active interaction from the user.

HEALTHINF 2025 - 18th International Conference on Health Informatics

ACKNOWLEDGEMENTS

We would like to thank the AI SoundLab team for

their efforts to deploy this platform on the ship. In

addition, we thank Martin H

ammerle for his sup-

port through user experience design, who provided

clear and graspable information material for the ship’s

crew. Further, we thank all the participants who do-

nated their speech samples for this study, in particular

the two captains of the ship, who supported the data

collection greatly.

REFERENCES

Alberdi, A., Aztiria, A., and Basarab, A. (2016). Towards

an automatic early stress recognition system for ofﬁce

environments based on multimodal measurements: A

review. Journal of Biomedical Informatics, 59:49–75.

Amiriparian, S., Christ, L., Kathan, A., Gerczuk, M.,

uller, N., Klug, S., Stappen, L., K

onig, A., Cam-

bria, E., Schuller, B., et al. (2024). The muse

2024 multimodal sentiment analysis challenge: So-

cial perception and humor recognition. arXiv preprint

arXiv:2406.07753.

Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020).

wav2vec 2.0: a framework for self-supervised learn-

ing of speech representations. Advances in Neural In-

formation Processing Systems, 33:12449–12460.

Baird, A., Triantafyllopoulos, A., Z

ankert, S., Ottl, S.,

Christ, L., Stappen, L., Konzok, J., Sturmbauer,

S., Meßner, E.-M., Kudielka, B. M., Rohleder, N.,

Baumeister, H., and Schuller, B. W. (2021). An eval-

uation of speech-based recognition of emotional and

physiological markers of stress. Front. Comput. Sci.,

3:750284.

Barr

e, R., Brunel, G., Barthet, P., and Laurencin-Dalicieux,

S. (2017). The visual analogue scale: An easy and

reliable way of assessing perceived stress. Quality in

Primary Health Care, 1(1).

Brooks, S. K. and Greenberg, N. (2022). Mental health and

psychological wellbeing of maritime personnel: a sys-

tematic review. BMC Psychol, 10(1):139.

Chen, T. and Guestrin, C. (2016). Xgboost: A scalable

tree boosting system. In Proceedings of the 22nd acm

sigkdd international conference on knowledge discov-

ery and data mining, pages 785–794.

Christ, L., Amiriparian, S., Baird, A., Kathan, A., M

uller,

N., Klug, S., Gagne, C., Tzirakis, P., Stappen, L.,

Meßner, E.-M., K

onig, A., Cowen, A., Cambria, E.,

and Schuller, B. W. (2023). The muse 2023 multi-

modal sentiment analysis challenge: Mimicked emo-

tions, cross-cultural humour, and personalisation. In

Proceedings of the 4th on Multimodal Sentiment Anal-

ysis Challenge and Workshop: Mimicked Emotions,

Humour and Personalisation, page 1–10. Association

for Computing Machinery.

Cohen, S. (1988). Perceived stress in a probability sam-

ple of the united states. The social psychology of

health/Sage.

Cohen, S., Kamarck, T., and Mermelstein, R. (1983). A

global measure of perceived stress. Journal of health

and social behavior, pages 385–396.

Conneau, A., Baevski, A., Collobert, R., Mohamed, A.,

and Auli, M. (2020). Unsupervised cross-lingual rep-

resentation learning for speech recognition. arXiv

[preprint] arXiv:2006.13979.

efossez, A., Synnaeve, G., and Adi, Y. (2020). Real Time

Speech Enhancement in the Waveform Domain. In

Proc. Interspeech 2020, pages 3291–3295.

Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg,

J., Andre, E., Busso, C., Devillers, L. Y., Epps, J.,

Laukka, P., Narayanan, S. S., and Truong, K. P.

(2016). The geneva minimalistic acoustic parameter

set (GeMAPS) for voice research and affective com-

puting. IEEE Trans. Affective Comput., 7(2):190–202.

Eyben, F., Weninger, F., Gross, F., and Schuller, B. (2013).

Recent developments in opensmile, the munich open-

source multimedia feature extractor. In Proceedings

of the 21st ACM International Conference on Multi-

media, MM ’13, page 835–838, New York, NY, USA.

Association for Computing Machinery.

Ferrer, L. and Riera, P. (2023). Conﬁdence intervals for

evaluation in machine learning. Accessed: 2024-02-

27.

Giddens, C. L., Barron, K. W., Byrd-Craven, J., Clark, K. F.,

and Winter, A. S. (2013). Vocal indices of stress: A

review. Journal of Voice, 27(3):390.e21–390.e29.

Hecker, P., Steckhan, N., Eyben, F., Schuller, B. W., and

Arnrich, B. (2022). Voice analysis for neurologi-

cal disorder recognition–a systematic review and per-

spective on emerging trends. Front. Digit. Health,

4:842301.

Higuchi, M., Nakamura, M., Shinohara, S., Omiya, Y.,

Takano, T., Mitsuyoshi, S., and Tokuno, S. (2020). Ef-

fectiveness of a voice-based mental health evaluation

system for mobile devices: Prospective study. JMIR

Form Res, 4(7):e16455.

Hsu, W.-N., Sriram, A., Baevski, A., Likhomanenko, T.,

Xu, Q., Pratap, V., Kahn, J., Lee, A., Collobert, R.,

Synnaeve, G., et al. (2021). Robust wav2vec 2.0: An-

alyzing domain shift in self-supervised pre-training.

arXiv preprint arXiv:2104.01027.

Jiang, L., Gao, B., Gu, J., Chen, Y., Gao, Z., Ma, X.,

Kendrick, K. M., and Woo, W. L. (2019). Wearable

long-term social sensing for mental wellbeing. IEEE

Sensors Journal, 19(19):8532–8542.

Joinson, A. N., Reips, U.-D., Buchanan, T., and Schoﬁeld,

C. B. P. (2010). Privacy, trust, and self-disclosure on-

line. Human–Computer Interaction, 25(1):1–24.

Kroenke, K. and Spitzer, R. L. (2002). The phq-9: a new

depression diagnostic and severity measure.

Laugwitz, B., Held, T., and Schrepp, M. (2008). Construc-

tion and evaluation of a user experience questionnaire.

In Holzinger, A., editor, HCI and Usability for Educa-

tion and Work, volume 5298, pages 63–76. Springer

Berlin Heidelberg.

Mental Wellbeing at Sea: A Prototype to Collect Speech Data in Maritime Settings

Lesage, F.-X., Berjot, S., and Deschamps, F. (2012). Clin-

ical stress assessment using a visual analogue scale.

Occupational Medicine, 62(8):600–605.

Low, D. M., Bentley, K. H., and Ghosh, S. S. (2020).

Automated assessment of psychiatric disorders using

speech: A systematic review. Laryngoscope Inves-

tigative Otolaryngology, 5(1):96–116.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,

Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,

Cournapeau, D., Brucher, M., Perrot, M., and

Edouard

Duchesnay (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12(85):2825–2830.

Reddy, C. K., Gopal, V., and Cutler, R. (2021). Dnsmos: A

non-intrusive perceptual objective speech quality met-

ric to evaluate noise suppressors. In ICASSP 2021-

2021 IEEE International Conference on Acoustics,

Speech and Signal Processing (ICASSP), pages 6493–

6497. IEEE.

Scherer, K. R., Banse, R., Wallbott, H. G., and Goldbeck, T.

(1991). Vocal cues in emotion encoding and decoding.

Motiv Emot, 15(2):123–148.

Shukla, S., Dandapat, S., and Prasanna, S. R. M. (2011).

Spectral slope based analysis and classiﬁcation of

stressed speech. Int J Speech Technol, 14(3):245–258.

Slavich, G. M., Taylor, S., and Picard, R. W. (2019). Stress

measurement using speech: Recent advancements,

validation issues, and ethical and privacy considera-

tions. Stress, 22(4):408–413. PMID: 30945584.

Sommerfeldt, S. L., Schaefer, S. M., Brauer, M., Ryff,

C. D., and Davidson, R. J. (2019). Individual dif-

ferences in the association between subjective stress

and heart rate are related to psychological and phys-

ical well-being. Psychological science, 30(7):1016–

1029.

Topp, C. W., Østergaard, S. D., Søndergaard, S., and Bech,

P. (2015). The who-5 well-being index: a systematic

review of the literature. Psychotherapy and psychoso-

matics, 84(3):167–176.

Van Puyvelde, M., Neyt, X., McGlone, F., and Pattyn, N.

(2018). Voice stress analysis: A new framework for

voice and effort in human performance. Frontiers in

Psychology, 9:1994.

Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt,

M., Burkhardt, F., Eyben, F., and Schuller, B. W.

(2022). Dawn of the transformer era in speech emo-

tion recognition: closing the valence gap. arXiv

[preprint] arXiv:2203.07378.

Weckesser, L. J., Dietz, F., Schmidt, K., Grass, J.,

Kirschbaum, C., and Miller, R. (2019). The psycho-

metric properties and temporal dynamics of subjective

stress, retrospectively assessed by different informants

and questionnaires, and hair cortisol concentrations.

Scientiﬁc reports, 9(1):1098.

Wu, Z., Scheidwasser-Clow, N., Hajal, K. E., and Cer-

nak, M. (2023). Speaker embeddings as individuality

proxy for voice stress detection.

HEALTHINF 2025 - 18th International Conference on Health Informatics