Authors:
Samuel Silva and António Teixeira
Affiliation:
University of Aveiro, Portugal
Keyword(s):
Audiovisual Speech, Articulatory Synthesis, European Portuguese.
Abstract:
In speech communication, the auditory and visual streams both play an important role, ensuring a certain
level of redundancy (e.g., lip movement) and the transmission of complementary information (e.g., to emphasize
a word). The currently prevailing approach to audiovisual speech synthesis, generally based on data-driven
methods, yields good results but relies on models controlled by parameters that do not relate to how humans
produce speech, making them hard to interpret and adding little to our understanding of the human speech production
apparatus. Modelling the actual system, by adopting an anthropomorphic perspective, would open a myriad of
novel research paths. This article proposes a conceptual framework to support research and development of
an articulatory-based audiovisual speech synthesis system. The core idea is that the speech production system
is modelled to produce articulatory parameters with anthropomorphic meaning (e.g., lip opening) driving
the synthesis of both the auditory and visual streams. A first instantiation of the framework for European
Portuguese illustrates its viability and constitutes an important tool for research in speech production and the
deployment of audiovisual speech synthesis in multimodal interaction scenarios, which are highly relevant to
current and future complex services and applications.