6 LIMITATIONS AND FUTURE WORK
While our study provides insights into using a com-
bination of oculo-motor indices to evaluate Japanese
TTS, we are mindful of its limitations. Firstly, the
study was conducted on a relatively small sample (n
= 16) of predominantly male participants. Secondly,
the study was conducted in a lab environment using
computer-generated speech-shaped noise. Thus, evaluation
results may vary in other environments, such as
real-life noisy listening conditions or an anechoic chamber.
Thirdly, self-reported measures are subject to
inter-rater variability, which could affect the objective
assessment of speech. Finally, it should also be
noted that other factors beyond our control, such as
stress, could have affected the experimental outcome
as some participants were more concerned about pro-
viding correct responses.
Although the above limitations may affect the
generalisability of our research results, they also high-
light the variables that should be taken into account
in order to make evaluation of TTS more robust and
standardised. In future research, the issues of gender
and experimental design should be given more atten-
tion in order to ensure higher ecological validity of
evaluations. Firstly, whenever possible, participant
samples should be gender- and age-balanced, with
findings investigated separately for each gender and
age group. Secondly, similar consideration should be
given to the types of voices that are selected for synthesis.
Thirdly, calibration of sound pressure should
also be considered in evaluation: while the volume
of sound can have an impact on participants’ cognitive
workload, to the best of our knowledge, there are
currently no official guidelines on volume calibration.
Finally, it should be ensured that differences in par-
ticipants’ cognitive abilities are accounted for - this
could be addressed by administering a listening test
at the pre-experiment stage.
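As an illustration of the calibration point above, the following minimal sketch equalises the digital level of a noise signal relative to a speech signal before mixing. It is not the procedure used in this study: the function names, the toy sample values and the 5 dB signal-to-noise ratio are all hypothetical. It also controls only the digital (RMS) level; the physical sound pressure at the listener's ear additionally depends on the playback hardware and would need to be measured separately.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def scale_to_rms(signal, target_rms):
    """Apply a constant gain so that the signal has the requested RMS."""
    gain = target_rms / rms(signal)
    return [x * gain for x in signal]


def scale_noise_for_snr(speech, noise, snr_db):
    """Scale noise so that 20*log10(rms(speech) / rms(noise)) equals snr_db."""
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    return scale_to_rms(noise, target_noise_rms)


def mix_at_snr(speech, noise, snr_db):
    """Add the rescaled noise to the speech, sample by sample.

    zip() stops at the shorter of the two signals.
    """
    scaled_noise = scale_noise_for_snr(speech, noise, snr_db)
    return [s + n for s, n in zip(speech, scaled_noise)]


# Hypothetical toy signals; real stimuli would be full audio buffers.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, -0.1, -0.1, 0.1]

scaled_noise = scale_noise_for_snr(speech, noise, snr_db=5.0)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Reporting the achieved signal-to-noise ratio alongside a measured playback level would make listening conditions easier to compare across studies.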
7 CONCLUSIONS
This paper presented the results of an in-lab evalu-
ation experiment in which participants listened to a
series of Japanese speech stimuli mixed with noise.
In line with the findings of previous research on English
speech (Govender et al., 2019), we found that
synthetic speech led to a faster increase in pupil size
(a steeper curve), indicating higher cognitive load. This
result was supported by participants’ subjective ratings:
they rated synthetic speech as more cognitively taxing
than natural speech. On the other hand,
we found that participants’ pupil oscillations were
stronger at higher levels of noise for natural speech
at the retention phase, but weaker at the recall phase,
indicating the impact of external factors such as stress
or an excessive level of noise (ceiling effect).
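To make the notion of a "steeper curve" concrete, the sketch below shows one possible way to quantify how fast pupil size increases: subtract a pre-stimulus baseline and fit a least-squares line to the remaining samples. This is an illustrative assumption rather than the analysis pipeline of this study; the traces, the 10 Hz sampling rate and the two-sample baseline window are hypothetical values chosen only for the example.

```python
from statistics import mean


def baseline_correct(trace, n_baseline):
    """Subtract the mean of the first n_baseline samples from every sample."""
    baseline = mean(trace[:n_baseline])
    return [x - baseline for x in trace]


def dilation_slope(trace, sample_rate_hz):
    """Least-squares slope of pupil size over time (trace units per second)."""
    times = [i / sample_rate_hz for i in range(len(trace))]
    t_bar = mean(times)
    y_bar = mean(trace)
    num = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, trace))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


# Hypothetical pupil diameter traces in mm (not data from the experiment).
natural = [3.0, 3.0, 3.1, 3.2, 3.3, 3.4]
synthetic = [3.0, 3.0, 3.2, 3.4, 3.6, 3.8]

N_BASELINE = 2          # assumed number of pre-stimulus samples
SAMPLE_RATE_HZ = 10.0   # assumed eye-tracker sampling rate

# Fit the slope only to the samples that follow the baseline window.
slope_natural = dilation_slope(
    baseline_correct(natural, N_BASELINE)[N_BASELINE:], SAMPLE_RATE_HZ
)
slope_synthetic = dilation_slope(
    baseline_correct(synthetic, N_BASELINE)[N_BASELINE:], SAMPLE_RATE_HZ
)
```

With these toy traces the synthetic condition yields the larger slope, which corresponds to the faster increase in pupil size described above.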
Although our results are preliminary, we have
shown that pupil oscillations can provide additional
measurements for cognitive workload of synthetic
speech in noisy listening conditions, and established
a baseline for future experiments. In order to further
validate our findings, future work
should investigate whether our results can be replicated using
diverse participant samples, to account for gender-specific
hearing sensitivity (cf. McFadden (1998)).
We hope that our study will encourage discussion on
how other biological signals such as pupil oscillations
could expand TTS evaluation methods in future.
ACKNOWLEDGEMENTS
We would like to express our gratitude to Professor
Takao Kobayashi for his advice with selecting the
speech corpus for our study. We would also like to
thank all participants who took part in the experiment.
REFERENCES
Beatty, J. (1982). Task-evoked pupillary responses, process-
ing load, and the structure of processing resources.
Psychological Bulletin, 91(2):276.
Duchowski, A. T., Krejtz, K., Krejtz, I., Biele, C., Niedzielska,
A., Kiefer, P., Raubal, M., and Giannopoulos, I.
(2018). The index of pupillary activity: Measuring
cognitive load vis-à-vis task difficulty with pupil oscillation.
In Proceedings of the 2018 CHI Conference
on Human Factors in Computing Systems, pages 1–13.
Govender, A. and King, S. (2018a). Measuring the cog-
nitive load of synthetic speech using a dual task
paradigm. In Interspeech, pages 2843–2847.
Govender, A. and King, S. (2018b). Using pupillometry to
measure the cognitive load of synthetic speech. Sys-
tem, 50:100.
Govender, A., Wagner, A. E., and King, S. (2019). Using
pupil dilation to measure cognitive load when listening
to text-to-speech in quiet and in noise. In Interspeech,
pages 1551–1555.
HTS Working Group (2015). The Japanese TTS System
Open JTalk.
Kahneman, D. and Beatty, J. (1966). Pupil diameter and
load on memory. Science, 154(3756):1583–1585.
Kursawe, M. A. and Zimmer, H. D. (2015). Costs of storing
colour and complex shape in visual working memory:
Insights from pupil size and slow waves. Acta Psy-
chologica, 158:67–77.
Evaluating Synthetic Speech Workload with Oculo-motor Indices: Preliminary Observations for Japanese Speech