Authors:
Beatrice Biancardi (1); Yingjie Duan (2); Mathieu Chollet (3) and Chloé Clavel (2)
Affiliations:
(1) LINEACT CESI, Nanterre, France; (2) LTCI, Télécom Paris, IP Paris, Palaiseau, France; (3) School of Computing Science, University of Glasgow, Glasgow, U.K.
Keyword(s):
Affective Computing, Human Communication Dynamics, Social Signals, Public Speaking.
Abstract:
Most emerging public speaking training systems, while very promising, rely on temporally aggregated features, which do not take the structure of the speech into account. In this paper, we take a different perspective, testing whether well-known socio-cognitive theories, such as first impressions or the primacy and recency effects, apply in the distinct context of public speaking perception. We investigated the impact of the temporal location of speech slices (i.e., at the beginning, middle or end) on the perception of confidence and persuasiveness of speakers giving online movie reviews (the Persuasive Opinion Multimedia dataset). Results show that, when multiple modalities are considered, the middle part of the speech is usually the most informative. Additional findings also suggest the value of leveraging local interpretability (by computing SHAP values) to provide feedback directly, both at a specific time (which part of the speech?) and for a specific behaviour modality or feature (which behaviour?). This is a first step towards the design of more explainable and pedagogical interactive training systems. Such systems could be more efficient by focusing on improving the speaker’s most important behaviour during the most important moments of their performance, and by situating feedback at specific places within the total speech.