allows for only one pre-modelled facial pose at a
time, which is extremely limiting in sign synthesis.
Consider portraying a question involving a happy
person asking about a small cup of coffee. This has
three simultaneously occurring facial processes: the
question nonmanual, the small-size nonmanual, and
the happy affect. If the animator has modelled each
of these separately, then a morphing system is forced
to choose only one of them and ignore the other two,
resulting in a failure to communicate the intended
message.
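As a minimal sketch of this limitation (all names here are hypothetical, for illustration only, and do not correspond to any particular animation system), a single-target morphing renderer accepts exactly one pre-modelled pose per frame:

NEUTRAL = [(0.0, 0.0, 0.0)] * 4          # toy face: four vertices

POSES = {                                # pre-modelled morph targets
    "question-nonmanual": [(0.0, 1.0, 0.0)] * 4,  # e.g. brows raised
    "small-size-OO":      [(1.0, 0.0, 0.0)] * 4,  # e.g. lips rounded
    "happy-affect":       [(0.0, 0.0, 1.0)] * 4,  # e.g. smile
}

def apply_pose(pose_name):
    """Deform the neutral face with exactly ONE pre-modelled pose."""
    offsets = POSES[pose_name]
    return [(x + dx, y + dy, z + dz)
            for (x, y, z), (dx, dy, dz) in zip(NEUTRAL, offsets)]

# Three processes co-occur in the utterance, but only one can be rendered:
frame = apply_pose("question-nonmanual")  # small-size-OO and happy-affect are lost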
Attempting to mitigate this issue by pre-
combining poses lacks flexibility and is labour
intensive to the point of impracticality. There are six
basic facial poses for emotion (Ekman and Friesen,
1978) and at least fifty-three nonmanual signals
which can co-occur (Bridges and Metzger, 1996).
Trying to model all combinations would result in
hundreds of facial poses. In addition, the timing of
these combinations would suffer the same problems
of flexibility associated with using video recordings.
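A back-of-the-envelope count makes the blow-up concrete. Even under the conservative assumption that only one emotion combines with one nonmanual signal at a time,

    6 emotions × 53 nonmanual signals = 318 pre-combined poses

on top of the 59 individual poses, and allowing nonmanual signals to co-occur with one another multiplies the count further.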
Maskable morphing attempts to address the inflexibility problem by subdividing the face into regions such as Eyes, Eyebrows, Eyelids, Mouth, and Nose, allowing the animator to choose a distinct pose for each region. This is an improvement, but the "choose one only" problem now migrates to individual facial features, and thus it still does not
support simultaneous processes that affect the same
facial feature. For example, both the nonmanual OO
and the emotions of joy and anger influence the
mouth.
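The following sketch (again with hypothetical names, for illustration only) shows how maskable morphing relocates rather than removes the conflict: each region accepts at most one pose, so two processes that both target the Mouth region still collide:

REGIONS = ["Eyes", "Eyebrows", "Eyelids", "Mouth", "Nose"]

def compose_face(region_poses):
    """Assemble a frame from at most one pose per region; the rest stay neutral."""
    return {region: region_poses.get(region, "neutral") for region in REGIONS}

# The OO nonmanual and the joy affect both claim the Mouth region,
# so one of them must still be discarded:
frame = compose_face({"Eyebrows": "question-raised", "Mouth": "OO"})
# frame["Mouth"] is "OO"; it cannot simultaneously be "joy-smile".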
The technique of muscle-based animation more
closely simulates the interconnected properties of
facial anatomy by specifying how the movement of bones and muscles affects the skin (Magnenat-Thalmann, Primeau and Thalmann, 1987; Kalra, Mangili, Magnenat-Thalmann and Thalmann, 1991).
If two different expressions use the same muscle,
their combined effect will pull on the skin in a
natural way. However, managing and coordinating all of these muscle movements can quickly become overwhelming.
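As a rough illustration of why shared muscles combine naturally (a toy model with invented muscle names and a simple clamped sum, not the skin-deformation machinery of the systems cited above):

def combine(expressions):
    """Sum per-muscle activation levels across expressions, clamped to [0, 1]."""
    combined = {}
    for expr in expressions:
        for muscle, level in expr.items():
            combined[muscle] = min(1.0, combined.get(muscle, 0.0) + level)
    return combined

oo_nonmanual = {"orbicularis_oris": 0.8}                   # lip rounding
joy = {"zygomaticus_major": 0.7, "orbicularis_oris": 0.2}  # smile

print(combine([oo_nonmanual, joy]))
# {'orbicularis_oris': 1.0, 'zygomaticus_major': 0.7}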
Timing is the main problem. Co-occurring facial
linguistic processes will generally not have the same
start and end times. Some processes may be present
for a single word, others for a phrase, and others for
an entire sentence. Errors in timing can change the
meaning of the sentence. For example, both the
affect anger and the WH-question nonmanual
involve lowering the brows. If the timing is not
correct, the WH-question nonmanual can be
mistaken for anger (Weast, 2008). Errors in timing
can also cause an avatar to seem unnatural and
robotic, which can distract from the intended
communication. This is analogous to the way that
poor speech synthesis is distracting and requires more listening effort (Warner, Wolff and Hoffman,
2006).
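The scoping problem can be pictured as interval scheduling over the glosses of the sentence. In this hypothetical sketch (the glosses and spans are illustrative, not a linguistic analysis), each process carries its own extent, and any given gloss may require several processes at once:

SENTENCE = ["WHAT", "SIZE", "COFFEE", "YOU", "WANT"]

# (process, first gloss index, last gloss index) -- spans differ per process
PROCESSES = [
    ("wh-question brow lowering", 0, 4),  # entire sentence
    ("small-size OO",             1, 2),  # just the noun phrase
    ("happy affect",              0, 4),  # entire sentence
]

for i, gloss in enumerate(SENTENCE):
    active = [name for name, start, end in PROCESSES if start <= i <= end]
    print(f"{gloss}: {active}")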
4 RELATED WORK
Several active research efforts around the world
have a shared goal of building avatars to portray sign
language. Their intended applications include
tutoring deaf children, providing better accessibility
to government documents and broadcast media, and
facilitating transactions with service providers. This
section examines their approaches to generating
facial nonmanual signals.
Very early efforts focused exclusively on the manual aspects of the language (Lee and Kunii, 1993; Zhao et al., 2000; Grieve-Smith, 2002). Some
acknowledge the need for nonmanual signals but
have not yet implemented them for all facial features
(Karpouzis, Caridakis, Fotinea and Efthimiou,
2007). Others have incorporated facial expressions
as single morph targets. This has been done using
traditional key-frame animation (Huenerfauth, 2011)
and motion capture (Gibet, Courty, Duarte and Le
Naour, 2011).
The European Union has sponsored several
research efforts, starting with VisiCast in 2000,
continuing with eSIGN in 2002, and currently DictaSign (Elliott, Glauert and Kennaway, 2004;
Efthimiou et al., 2009). One of the results of these
efforts is the Signing Gesture Markup Language
(SiGML), an XML-compliant specification for sign-
language animation (Elliott et al., 2007). SiGML
relies on HamNoSys as the underlying
representation for manuals (Hanke, 2004), but
introduces a set of facial nonmanual specifications, including head orientation, eye gaze, brows, eyelids, nose, and mouth; its implementation uses the maskable morphing approach for synthesis.
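As a rough paraphrase (this is not actual SiGML markup, and the channel values are invented), the facial portion of such a specification amounts to one independently filled channel per feature:

# NOT actual SiGML -- an illustrative paraphrase of its per-feature
# facial nonmanual channels, each synthesized by maskable morphing.
nonmanual_channels = {
    "head_orientation": "tilt-forward",
    "eye_gaze": "toward-addressee",
    "brows": "raised",
    "eyelids": "wide",
    "nose": "neutral",
    "mouth": "OO",
}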
However, there is no consensus on how best to
specify facial nonmanual signals, particularly for the
mouth, and other research groups have either
developed their own custom specification
(Lombardo, Battaglino, Damiano and Nunnari, 2011)
or are using an earlier annotation system such as
SignWriting (Krnoul, 2010). Further, none of these
efforts have yet specified an approach to generating
co-occurring facial nonmanual signals.
Recent efforts have begun exploring alternatives
to morphs and maskable morphs by exploiting the
muscle-based approach (López-Colino and Colás, 2012). However, this work has not addressed
portraying co-occurring nonmanual signals.
There is consensus that animating the face is an
extremely difficult problem. Consider the sentence,
"What size coffee would you like?” signed happily.