surprise) have been widely applied to the develop-
ment of expressive avatars. However, we argue that
this modeling lacks completeness when the objective is the implementation of interactive embodied conversational agents. To capture user empathy, for example, a virtual airport assistant informing the user of a flight cancellation should not express any of the “big six” plain emotions, but rather a set of more complex facial expressions capable of dealing with the user's frustration, indicating that the system recognizes the flight cancellation as an undesirable event. For this purpose, “appraisal” models, which take into consideration the evaluation process that leads to an emotional response, seem to provide a more embracing characterization of emotions. In particular, we adopt the model proposed by Ortony, Clore and Collins (the OCC model), since it presents a concise but comprehensive vocabulary of 22 emotions that arise as reactions to events, agents or objects (Ortony et al., 1988).
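To make the role of the OCC vocabulary in our agent more concrete, the following is a minimal, hypothetical sketch (not a complete implementation of the model): it groups a few of the 22 OCC categories by the kind of stimulus they react to and selects one through a deliberately simplified appraisal of an event.

# Simplified, illustrative sketch of OCC-style appraisal: emotions arise as
# reactions to events, agents or objects (Ortony et al., 1988). Only a subset
# of the 22 categories is listed, and the appraisal rule is deliberately naive.
OCC_SUBSET = {
    "event":  ["joy", "distress", "hope", "fear", "relief", "disappointment"],
    "agent":  ["pride", "shame", "admiration", "reproach"],
    "object": ["liking", "disliking"],
}

def appraise_event(desirable: bool, concerns_self: bool) -> str:
    """Naive appraisal of the consequences of an event."""
    if concerns_self:
        return "joy" if desirable else "distress"
    # Consequences for another person map to categories such as "happy-for" or "pity".
    return "happy-for" if desirable else "pity"

# A flight cancellation is undesirable for the user, so the assistant can
# appraise it as an event that is bad for someone else ("pity") and select an
# empathetic expression instead of one of the "big six".
print(appraise_event(desirable=False, concerns_self=False))  # -> "pity"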
Another important question is how to synthesize photorealistic appearances and reproduce the dynamics of speech combined with the expression of emotions. In other words, how to delude human observers: specialists trained since birth to detect the smallest variations in the signals conveyed by the voice, the face and the body. In this work we explore the image-based, or 2D, synthesis technique as a means to obtain inherently photorealistic expressive faces, avoiding the typical synthetic look of model-based (3D) facial animation systems.
4 STATE OF THE ART
Since the pioneering work of Parke (Parke, 1972),
many others have contributed with different ap-
proaches to improve the level of videorealism of syn-
thetic talking faces: (Bregler et al., 1997), (Ezzat and
Poggio, 1998), (Brand, 1999), (Cosatto and Graf,
2000), (Pasquariello and Pelachaud, 2002), (Ezzat
et al., 2002).
In the last decade, an emerging interest can be observed in adding to synthetic talking heads the capability of expressing emotions.
In (Chuang and Bregler, 2005), for example, the authors focus on the difficulty of editing motion capture data. In their proposal, they take an expressionless speech performance as input, analyze its content and modify the facial expression according to a statistical model. The expressive face is then retargeted onto a 3D character using blendshape animation. The paper presents the results of the methodology using as training data three short video sequences including three basic expressions: neutral, angry, and happy.
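As an aside on the blendshape retargeting step mentioned above, the following is a minimal sketch (not the authors' implementation) of how a facial pose can be composed as a weighted combination of blendshape targets; the mesh, target names and weights are hypothetical.

import numpy as np

def blend_shapes(neutral, targets, weights):
    """Compose a facial pose as the neutral face plus weighted blendshape offsets.

    neutral : (V, 3) array of vertex positions of the neutral face.
    targets : list of (V, 3) arrays, one per blendshape target (e.g. "happy").
    weights : list of floats, one per blendshape target.
    """
    pose = neutral.copy()
    for target, w in zip(targets, weights):
        pose += w * (target - neutral)   # add the scaled offset of each target
    return pose

# Hypothetical usage on a toy 4-vertex mesh: 60% "happy", 10% "angry".
neutral = np.zeros((4, 3))
happy, angry = np.random.rand(4, 3) * 0.01, np.random.rand(4, 3) * 0.01
pose = blend_shapes(neutral, [happy, angry], [0.6, 0.1])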
In (Beskow and Nordenberg, 2005), an expres-
sive 3D talking head is implemented using an MPEG-
4 compatible model. An amateur actor was recorded
portraying five different emotions (happy, sad, angry,
surprised, and neutral) and a Cohen-Massaro coartic-
ulation model was trained for each emotion.
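For reference, the Cohen-Massaro approach blends per-segment articulatory targets through exponential dominance functions; the sketch below is a simplified, generic version of that blending (the exact parameterization and the per-emotion training of (Beskow and Nordenberg, 2005) are not reproduced here).

import numpy as np

def coarticulate(times, targets, alphas, thetas, t_grid):
    """Simplified Cohen-Massaro-style blending of segment targets.

    Each segment s contributes a target value targets[s] for one articulatory
    parameter, centered at times[s], with dominance
    D_s(t) = alphas[s] * exp(-thetas[s] * |t - times[s]|).
    The trajectory is the dominance-weighted average of the targets.
    """
    t = np.asarray(t_grid, dtype=float)[:, None]    # (T, 1) time samples
    c = np.asarray(times, dtype=float)[None, :]     # (1, S) segment centers
    dom = np.asarray(alphas) * np.exp(-np.asarray(thetas) * np.abs(t - c))
    return (dom * np.asarray(targets)).sum(axis=1) / dom.sum(axis=1)

# Hypothetical usage: three visemes with different targets for, e.g., lip opening.
trajectory = coarticulate(times=[0.0, 0.15, 0.30],
                          targets=[0.2, 0.8, 0.1],
                          alphas=[1.0, 1.0, 1.0],
                          thetas=[20.0, 20.0, 20.0],
                          t_grid=np.linspace(0.0, 0.35, 50))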
A 3D expressive speech-driven facial animation
system is also presented in (Cao et al., 2005). In
this system the inputs are the speech to be animated
and a set of emotional tags. A high-fidelity motion
capture database was built with a professional actor
representing five emotions: frustrated, happy, neutral,
sad and angry. The motion capture data, together with
the timed phonetic transcript of the recorded utter-
ances, were used to construct what the authors call
an “anime” graph, where an “anime” corresponds to
a dynamic definition of a viseme. The synthesis consists of searching for the best path in this graph by minimizing a cost function that penalizes discontinuities and unnatural visual transitions.
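To illustrate the general idea of such a search (independently of the specific cost function of (Cao et al., 2005)), the sketch below performs a dynamic-programming search for a minimum-cost path through a lattice of candidate animation units, where one cost term scores each unit in place and another penalizes discontinuous transitions; the function names and cost terms are assumptions for illustration.

def best_path(candidates, unit_cost, transition_cost):
    """Minimum-cost path through a lattice of candidate units.

    candidates            : list over time steps; each entry is a list of units.
    unit_cost(u)          : cost of using unit u at its step (e.g. phonetic/emotional mismatch).
    transition_cost(a, b) : cost of moving from unit a to unit b (e.g. visual discontinuity).
    """
    # best[i][j] = (cost, backpointer) of the cheapest path ending at unit j of step i.
    best = [[(unit_cost(u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + transition_cost(candidates[i - 1][k], u)
                     for k in range(len(candidates[i - 1]))]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min] + unit_cost(u), k_min))
        best.append(row)
    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, 0, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.append(candidates[0][j])
    return list(reversed(path))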
Four emotions (neutral, happy, angry, sad) were
captured with a motion capture system in (Deng et al.,
2006). The resulting material was used to build a
coarticulation model and a visual appearance model
that are combined to generate a 3D facial animation
at synthesis time.
As a novel approach to model emotions in a facial
animation system, in (Jia et al., 2011) the authors
parameterize eleven emotions (neutral, relax, sub-
missiveness, surprise, happiness, disgust, contempt,
fear, sorrow, anxiety and anger) according to the PAD
(Pleasure, Arousal, Dominance) dimensional emotion
model. In this system, the acoustic features of the
speech are used to drive an MPEG-4 model.
More recently, (Anderson et al., 2013) present a 2D VTTS (visual text-to-speech) system which is capable of synthesizing a talking head given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM) built from a corpus containing six emotions: neutral, tender, angry, afraid, happy and sad.
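As a rough illustration of how continuous expression weights can be combined with an appearance-model representation, the sketch below linearly blends per-emotion AAM parameter vectors; the parameter layout and the blending scheme are assumptions for illustration, not the method of (Anderson et al., 2013).

import numpy as np

def blend_aam_parameters(emotion_params, weights):
    """Blend per-emotion AAM parameter vectors with continuous weights.

    emotion_params : dict mapping emotion name -> AAM parameter vector.
    weights        : dict mapping emotion name -> non-negative weight (renormalized).
    Returns the blended parameter vector, which an AAM would then render to an image.
    """
    names = list(weights)
    w = np.array([weights[n] for n in names], dtype=float)
    w /= w.sum()                                   # make the weights sum to 1
    stacked = np.stack([emotion_params[n] for n in names])
    return w @ stacked                             # weighted average of the vectors

# Hypothetical usage with 10-dimensional parameter vectors.
params = {e: np.random.randn(10) for e in ("neutral", "happy", "tender")}
blended = blend_aam_parameters(params, {"neutral": 0.2, "happy": 0.7, "tender": 0.1})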
These works illustrate, through a diverse range of
approaches, the challenges imposed by this research
problem.
5 METHODOLOGY
5.1 Corpus
In order to study different aspects of expressive speech, ten professional actors, all native speakers of Brazilian Portuguese, were divided between two types of experiments. The first experiment consisted of asking the