characteristic in lines of consonants. In terms of
time, producing a consonant sound takes an
extremely short time. It ends so quickly that it
does not by itself change the lip shape. From the
above, we assumed that the lip shape changes
as the utterance moves from one vowel to the next,
and that the consonant between the two vowels
influences this change.
In this study, we treated a sequence of a vowel + a
consonant + a vowel (V + C + V) as one unit, and
we called this unit a phoneme context. Because
there are two vowels in the same phoneme context,
we call the former "the first vowel" and the latter
"the second vowel." The first vowel does not
represent the whole phoneme; it expresses only the
final shape of the lips at the completion of the
vowel's utterance. Because there is no first vowel at
the beginning of an utterance, we added a second
phoneme context, a consonant + a vowel (C + V).
A vowel can also directly follow a vowel. We
treated such an utterance as a special case in which
the consonant does not exist and included it in the
V + C + V phoneme context. We therefore have two
types of phoneme context, which are sufficient for
this study (see Figure 1).
[Figure 1 labels: Word; Syllable (x3); Phoneme (x6);
Phoneme context C + V; Phoneme context V + C + V]
Figure 1: Phoneme context in Japanese.
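As a rough illustration of the two phoneme contexts, the following Python sketch splits a romanized phoneme sequence into C + V and V + C + V units. The function name `phoneme_contexts`, the tuple layout, and the simplification that at most one consonant symbol sits between two vowels are assumptions made for this example and are not taken from the study. In particular, representing a word-initial vowel as a context with no first vowel and no consonant is a choice of this sketch.

```python
VOWELS = set("aiueo")  # the five Japanese vowels


def phoneme_contexts(phonemes):
    """Split a phoneme sequence into (first_vowel, consonant, second_vowel) units.

    first_vowel is None for the utterance-initial C + V context.
    consonant is None for the vowel + vowel special case.
    """
    contexts = []
    first_vowel = None
    consonant = None
    for p in phonemes:
        if p in VOWELS:
            # A vowel closes the current context and becomes the
            # first vowel of the next one.
            contexts.append((first_vowel, consonant, p))
            first_vowel, consonant = p, None
        else:
            # Assumption: one consonant symbol between vowels; a later
            # consonant overwrites an earlier one.
            consonant = p
    return contexts


print(phoneme_contexts(["s", "a", "k", "u", "r", "a"]))
# [(None, 's', 'a'), ('a', 'k', 'u'), ('u', 'r', 'a')]
```

Each vowel ends exactly one context, so consecutive contexts share a vowel: the second vowel of one unit is the first vowel of the next.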
2.2 Types of Phoneme Context
We presumed that there are 25 combinations of the
first vowel and the second vowel, which form a macro
taxonomy. Several kinds of consonants can be
sandwiched between the first vowel and the second
vowel. Within each class of the macro taxonomy, the
change of lip shape differs depending on the
consonant between the vowels. To reproduce the
change of lip shape for each phoneme context, we
prepared trace data of the characteristic points of
the lips. There are 116 types of phoneme context in
C + V and 565 types of phoneme context in V + C
+ V. As a consequence, almost all Japanese words
can be spoken by combining these phoneme
contexts.
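The macro taxonomy can be sketched in Python as the 5 x 5 ordered pairs of Japanese vowels, with the consonants grouped under each pair. The names `MACRO_CLASSES` and `group_by_macro_class` and the sample contexts are hypothetical and serve only to illustrate the grouping; they do not reproduce the 565 V + C + V types of the study.

```python
from collections import defaultdict
from itertools import product

VOWELS = ("a", "i", "u", "e", "o")

# Every ordered (first vowel, second vowel) pair: 5 x 5 = 25 classes.
MACRO_CLASSES = list(product(VOWELS, repeat=2))


def group_by_macro_class(contexts):
    """Group V + C + V contexts by their (first vowel, second vowel) pair."""
    groups = defaultdict(list)
    for first, consonant, second in contexts:
        groups[(first, second)].append(consonant)
    return dict(groups)


# Hypothetical sample; None stands for the vowel + vowel special case.
sample = [("a", "k", "u"), ("a", "s", "u"), ("u", "r", "a"), ("a", None, "u")]
groups = group_by_macro_class(sample)

print(len(MACRO_CLASSES))   # 25
print(groups[("a", "u")])   # ['k', 's', None]
```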
3 CHARACTERISTIC
EXTRACTION OF LIPS SHAPE
To develop a lips animation, we must extract the
characteristics of how the lip shape changes.
We therefore extracted these characteristics from
the images and the sound recorded while a word
was being spoken.
3.1 Recording both Image and Sound
of Speaking Word
A high-speed camera was used for recording. The
frame rate was set at 240 fps and the frame size was
256 x 256 pixels. In order to trace the
characteristic points of the lips precisely after the
recording, we marked eight characteristic points on
the edge of the subject's lips, as shown in Figure 2.
We recorded 400 words spoken by the subject.
All of the phoneme contexts were included in
these 400 words.
3.2 Image Extraction of Phoneme
Context
To divide the recording into phonemes, we analyzed
the sound recorded concurrently with the images, as
shown in Figure 3. Using the spectrogram as a clue,
we divided the lip images by phoneme. With sound
analysis software, we paid special attention to the
differences in the formant characteristics of the
phonemes and located each phoneme on the time
axis. The same time axis was then applied to the
frames of the lip images.
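Applying the audio time axis to the image frames amounts to converting phoneme boundary times into frame indices at the recording rate of 240 fps. The Python sketch below shows one way this could be done; the function names, the half-open frame ranges, and the boundary times in the example are hypothetical and are not measurements from the study.

```python
FPS = 240  # frame rate of the recording stated in Section 3.1


def time_to_frame(t_seconds, fps=FPS):
    """Return the index of the frame nearest to a time on the audio axis."""
    return round(t_seconds * fps)


def phoneme_frames(segments, fps=FPS):
    """Map (label, start_s, end_s) segments to (label, start_frame, end_frame).

    Frame ranges are half-open: start_frame is included, end_frame is not.
    """
    return [
        (label, time_to_frame(start, fps), time_to_frame(end, fps))
        for label, start, end in segments
    ]


# Hypothetical boundary times in seconds, as read off a spectrogram.
segments = [("s", 0.0, 0.05), ("a", 0.05, 0.175)]
print(phoneme_frames(segments))
# [('s', 0, 12), ('a', 12, 42)]
```

At 240 fps one frame spans about 4.2 ms, so a boundary placed on the spectrogram maps to the nearest frame with an error of at most about 2.1 ms.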
Figure 2: Lips of the subject and the basic shape of a two-
dimensional computer graphic model of lips.
One phoneme context extends from the endpoint of
the first vowel to the endpoint of the second vowel.
Using the sound analysis and the identified borders
between vowels and consonants as clues, we were
able to identify the part corresponding to each
phoneme context described in Section 2.2 from
TEXT DRIVEN LIPS ANIMATION SYNCHRONIZING WITH TEXT-TO-SPEECH FOR AUTOMATIC READING
SYSTEM WITH EMOTION