VISUAL SPEECH SYNTHESIS FROM 3D VIDEO
J. D. Edge and A. Hilton
Centre for Vision, Speech and Signal Processing
School of Electronic and Physical Sciences, University of Surrey, Guildford, GU2 7XH, UK
Keywords:
Facial Animation, Speech Synthesis, Virtual Humans.
Abstract:
Data-driven approaches to 2D facial animation from video have achieved highly realistic results. In this paper
we introduce a process for visual speech synthesis from 3D video capture to reproduce the dynamics of 3D
face shape and appearance. Animation from real speech is performed by path optimisation over a graph repre-
sentation of phonetically segmented captured 3D video. A novel similarity metric using a hierarchical wavelet
decomposition is presented to identify transitions between 3D video frames without visual artifacts in facial
shape, appearance or dynamics. Face synthesis is performed by playing back segments of the captured 3D
video to accurately reproduce facial dynamics. The framework allows visual speech synthesis from captured
3D video with minimal user intervention. Results are presented for synthesis from a database of 12 minutes
(18000 frames) of 3D video which demonstrate highly realistic facial animation.
1 INTRODUCTION
In recent years the use of data-driven techniques
to produce realistic human animations has become
more prevalent. The use of video and motion-capture
data allows us to improve the visual realism of a syn-
thetic character without the complexities involved in
physical simulation. Unfortunately, whilst video data
is lifelike, it is also constrained to the 2D image plane.
Furthermore motion-capture technologies are marker-
based and so prevent the recovery of skin texture dur-
ing data capture. 3D video capture technology uses
stereo-reconstruction techniques to capture both the
texture and geometry of the facial surface, thereby providing the best compromise for animating realistic human characters.
One of the most difficult problems in animation
is reproducing the visible aspects of speech. It is
the method of communication that we use every day,
and any viewer will instantly spot any disparity with
everyday reality. When we see someone speaking, the movements of the articulators (tongue, jaw, lips etc.) are actually creating the sounds that we hear, and
we know there is a high correlation between what is
seen and heard (e.g. lip-reading (Sumby and Pollack,
1954) and the McGurk effect (McGurk and Mac-
Donald, 1976).) Whereas traditional animation tech-
niques are adequate for creating cartoon-like speech,
we need more complex models to accurately synthe-
size realistic human speech movements. Data-driven
techniques are particularly appropriate to this prob-
lem because they remove the need to directly model
the dynamics of speech by retrieving this information
from a real speaker. In this paper we introduce a data-
driven technique which uses 3D video capture tech-
nology to animate realistic speech movements.
Our process of producing a talking head works on
a similar basis to Video-Textures (Schödl et al., 2000).
We capture sequences of 3D speech movements and
organise this data into a graph structure consisting of
nodes representing dynamic phonetic units with con-
necting arcs representing optimal transitions. Speech
synthesis is achieved by traversing this graph accord-
ing to the phonetic structure of an input utterance. An-
imations consist simply of playing back frames of the
original captured data with no blending/interpolation
required. Our process of creating a talking head has
been designed to as far as possible remove the neces-
sity for any manual intervention or reliance upon sen-
sitive computer vision algorithms.
2 BACKGROUND
Visual speech synthesis is complicated by the fact that
when producing sequences of speech sounds, the ac-
tion of the articulators in the vocal tract is greatly
affected by speech context. This context-dependency
of speech is often termed 'coarticulation' (Löfqvist, 1990). The phenomenon occurs bidirectionally; that is, articulation is affected by both preceding
and upcoming speech movements. Furthermore, cer-
tain aspects of speech are more important than others;
for example, the position of the tongue whilst producing a /T,D/ (as in thin or this) sound exhibits low variation and is therefore a highly dominant feature, whilst jaw rotation for vowel sounds exhibits high variation and has little effect on surrounding speech segments.
The simulation of coarticulation is highly important
for the naturalness of any speech synthesis technique
(both audible and visual.) Typically, coarticulation is
simulated using a form of spline blending of articula-
tory parameters (Ezzat et al., 2002; Cohen and Mas-
saro, 1993).
More recently, captured dynamics have been di-
rectly used in the synthesis of visual speech. This mir-
rors the most common method of audible speech syn-
thesis which takes captured audio data and concate-
nates small sections to form novel utterances (Taylor
et al., 1998). In Video-Rewrite (Bregler et al., 1997)
triphone sequences of video data are concatenated to
synthesize lip movements which are then pasted onto
a background sequence of head movement, improving
the naturalness of the output animation. In (Kalberer
and Van Gool, 2002) a model of speech movements
is captured using markers, then a viseme-space is
constructed using statistical methods and navigated
to synthesize new utterances. In (Cao et al., 2003;
Kshirsagar and Magnenat-Thalmann, 2003) motion-
captured facial movements are used to drive a 3D
model-based synthesis technique. These techniques
inherit from earlier work on motion graphs for body
motion capture data (Kovar et al., 2002).
The main difference between the audio and visual
concatenative methods is that it is more difficult to
capture facial movement than audio. Whilst facial dy-
namics have traditionally been captured using mark-
ers, recently surface capture techniques have been de-
veloped which provide greater spatial resolution in
3D (Wang et al., 2004; Ypsilos et al., 2004; Zhang
et al., 2004). Surface capture has the added benefit
that surface texture can be simultaneously captured -
providing fine detail of skin wrinkles and creases that
will always be invisible to marker-based technologies.
In our work surface capture technology is used to pro-
vide data for speech synthesis.
3 OUR APPROACH
The technique for visual speech synthesis introduced
here relies upon the concatenation of small phonetic
units representing the variation in the dynamics of
natural speech movements. Our approach can be sum-
marised into several phases:
Data Capture (Section 3.1) - A database of natural speech utterances is captured from an actor. These utterances contain the variation in articulation due to coarticulation. The speech data consists of 3D video and phonetically labelled audio of a news corpus.
Graph Construction (Section 3.3) - The speech
data is split into phonetic units, and transition
probabilities between all units in the captured cor-
pus are calculated. This leads to a graph structure,
subsets of which are traversed during synthesis.
Transition probabilities are related to the similar-
ity of frames in different phonetic units.
Unit Selection (Section 3.4) - For a given output
utterance, split into its phonetic constituents, ap-
propriate units (dynamic phonemes) are selected
from the constructed motion graph. Selection
is performed using a Viterbi algorithm which
maximises the probability of a sequence of units
which match the phonetic structure of the output
utterance.
Animation (Section 3.5) - Selected phonetic units are resampled and played back with the audio to animate the speech utterance. No interpolation or processing of the frames is performed; the final animation is simply a re-ordering of frames from the initial speech corpus.
Our system has been designed to minimise the
manual intervention required to create new talking
heads. All the above stages beyond data capture can
be performed automatically, and the only intervention
currently required is to correct the automatic phonetic
labelling of captured audio.
3.1 Data Capture
All data-driven techniques require a significant initial
data capture to represent the range of possibilities in
synthesis. In our system the data capture consists of
words and sentences from a news corpus spoken by
an actor. The actor is captured using a custom 3D
face capture rig (see (Ypsilos et al., 2004)), which re-
constructs facial geometry (200×200 scalar elliptical
depth maps) and texture (512 × 512 RGB images).
Audio is captured simultaneously and phonetically la-
belled to facilitate the construction of a motion graph.
In total 12 minutes of audio/video data has been
captured to drive our talking head. It is most im-
portant to capture a large number of consonant units,
as these units visually structure speech whereas vow-
els are usually transitional movements. It is also the
case that vowels may to a large degree be interchange-
able, although this is not currently implemented in
our system. Other than voiced/voiceless contrasts
we maintain all phonetic distinctions in the database,
mainly because a viseme reduction is less valid when
the units themselves are dynamic (i.e. the central
frame may be the same, but the articulatory movement
may exhibit a high degree of variation even within a
viseme class.)
3.2 Similarity Metric
In order to optimally construct a transition graph
of the 3D video data a similarity metric is required
to compare individual frames. Each frame consists
of both geometry and texture, which should both
be taken into account when determining similarity
(F_i = {G_i, T_i}). This section describes the creation of a similarity metric between 3D video frames, i.e. dist(F_i, F_j) ∈ ℝ.
In (Schödl et al., 2000) a simple L2 similarity metric is used to recover transitions between 2D video frames; however, we have found that the effect of noise (e.g. sensor noise, geometry reconstruction error) upon this metric is highly detrimental and leads to poor output transition graphs. This is due to the high dimensionality of each frame in the original data ((200 × 200) + (3 × 512 × 512) = 826432 scalar values per frame). In (Ezzat et al., 2002) principal components analysis (PCA) is used as a dimensionality reduction technique on 2D video; unfortunately, the size of the database used here (18000 frames × 826432 scalar values) makes traditional PCA techniques unviable. It is also important to note that traditional PCA applied by compressing each 2D image frame into a single vector does not optimally account for spatial redundancies in the data (Shashua and Levin, 2001). To address these limitations we
apply a hierarchical variant of PCA which accounts
well for both spatial and temporal redundancies in the
data and is more efficient to create with very large
databases. The projection into this low dimensional
space is used to provide a similarity metric between
3D video frames.
A wavelet decomposition of a signal (e.g. a
row/column of image data) is a frequency represen-
tation produced by projecting onto a wavelet basis
(e.g. Haar, Daubechies etc.) The decomposition con-
sists of a signal average V_0 and sets of wavelet coefficients W_m. Any level of the frequency hierarchy (V_0 ⊂ V_1 ⊂ V_2 ⊂ . . . ⊂ V_m ⊂ . . . ⊂ V_n) can be reconstructed using the W_m and scaled/shifted basis functions ψ_m. If a sequence of images is decomposed into wavelet coefficients, we can apply PCA to the W_m to recover a model of how different spatial frequencies change over time.

W_m = \mu_{W_m} + \sum_k b^m_k \cdot V^m_k    (1)

In (1), \mu_{W_m} is the mean of the W_m coefficients over time, and the V^m_k are the principal components representing the important variations of the W_m over time. The b^m_k give a projection onto the recovered principal components, and act as the parameterisation of the original image sequences. The proposed wavelet/PCA decomposition explicitly localises variation spatially (i.e. the wavelet transformation) and temporally (i.e. the principal components). This wavelet/PCA combination is demonstrated in fig. 1.

Figure 1: Each frame (both texture and geometry) in the database is converted to wavelet coefficients, and the principal components of these coefficients over time are used to define a similarity metric. This example shows only two levels of the wavelet decomposition.
In practice we decompose each frame (both tex-
ture and geometry) using a Haar wavelet basis in an
alternating row/column scheme (the non-standard ap-
proach described in (Stollnitz et al., 1995).) A 3×3×3
(rows × columns × frames) spatio-temporal median
filter is applied to the highest-frequency components
of the wavelet transform, which is used to attenuate
noise spikes in the data. From the projection onto the V^m_k (the principal components across all frequencies) 10 components are selected which account for over 95% of the variance in the data; this is the basis used for comparison. The similarity metric, dist, is defined as the L2 distance between frames projected onto the selected basis. We do not use the Mahalanobis distance, as used in (Ezzat et al., 2002), because this gives too much weight to low-variance components which are relatively insignificant when determining similarity.
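As an illustration, the following sketch shows one way such a wavelet/PCA similarity basis could be computed. It assumes the PyWavelets and scikit-learn libraries, applies PCA over the stacked coefficients of all bands rather than per frequency band, and omits the spatio-temporal median filter; it is a simplified approximation of the metric described above, not the authors' implementation.

import numpy as np
import pywt
from sklearn.decomposition import PCA

def frame_to_wavelet_vector(geometry, texture):
    """Haar-decompose the depth map and each texture channel; return one flat coefficient vector."""
    parts = []
    channels = [geometry] + [texture[..., c] for c in range(texture.shape[-1])]
    for channel in channels:
        coeffs = pywt.wavedec2(channel.astype(np.float64), 'haar')
        arr, _ = pywt.coeffs_to_array(coeffs)   # signal averages plus detail coefficients
        parts.append(arr.ravel())
    return np.concatenate(parts)

def build_similarity_basis(frames, n_components=10):
    """Fit PCA to the wavelet coefficients of all frames; return the per-frame projections."""
    X = np.stack([frame_to_wavelet_vector(g, t) for g, t in frames])
    pca = PCA(n_components=n_components)        # 10 components, covering most of the variance
    return pca.fit_transform(X)                 # the b coefficients for each frame

def dist(proj, i, j):
    """Plain L2 distance in the reduced space (Mahalanobis weighting is deliberately not used)."""
    return float(np.linalg.norm(proj[i] - proj[j]))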
The similarity of any two frames in the database
can now be determined using the selected basis. This
distance can now be converted into a frame-to-frame
transition probability.
P(F_i, F_j) = \exp(-\mathrm{dist}(F_i, F_j)/\sigma)    (2)

In (2), σ is a constant which scales the probability of transitioning between two frames (in our system set to half the average distance between frames, µ_dist). This allows the construction of a matrix containing transition probabilities between all frames in the stored data. In a final step we apply a small (3-frame window) diagonal filter to the matrix, similar to (Schödl et al., 2000); this incorporates a notion of derivative similarity into the transition probability. This transition matrix is used in the construction of a transition graph between dynamic phonetic units.
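A minimal sketch of equation (2) and the diagonal filtering step is given below, assuming `proj` holds the 10-dimensional projections from the previous sketch. The exact filter used by the authors may differ, and a full 18000×18000 matrix would in practice require a more memory-conscious implementation.

import numpy as np
from scipy.spatial.distance import cdist

def transition_matrix(proj):
    """Frame-to-frame transition probabilities P(F_i, F_j) = exp(-dist/sigma), as in eq. (2)."""
    d = cdist(proj, proj)                 # pairwise L2 distances in the reduced space
    sigma = 0.5 * d.mean()                # half the average inter-frame distance
    P = np.exp(-d / sigma)
    # 3-frame diagonal filter: average probabilities along short diagonal runs so a transition
    # is only favoured when neighbouring frames also match (a notion of derivative similarity)
    Pf = P.copy()
    Pf[1:-1, 1:-1] = (P[:-2, :-2] + P[1:-1, 1:-1] + P[2:, 2:]) / 3.0
    return Pf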
3.3 Graph Construction
The synthesis of speech movements is based upon
the traversal of a graph representing the phonetic
structure of the captured data. For these purposes
phonemes are considered to be sequences of frames
from the centre of the previous phoneme to the cen-
tre of the following phoneme. Furthermore, we merge
voiced/voiceless contrasts in our phoneme units (e.g.
/p/ and /b/ are considered to be the same unit), which
maximises the number of dynamic units available in the database. It is important to note that we only
merge voiced/voiceless consonant contrasts; often
nasals are also combined, e.g. /m/ is often merged
with the /p,b/ group, yet these are not dynamically the
same and so this is not advisable in a concatenative
system.
The units are similar to what are often considered
as triphones (e.g. in (Bregler et al., 1997)), however,
only the central phonetic label of the unit is used dur-
ing synthesis (i.e. no phonetic context is explicitly
taken into account) thus we refer to these as dynamic
phonemes. This is an important distinction because
triphone synthesis typically requires many more units
to be captured. By disregarding the previous and fol-
lowing phonetic labels, and matching only accord-
ing to similarity, we can maximise the use of smaller
speech databases.
Each phonetic unit consists of two periods, the
onset (between the start of the unit and its phonetic
centre) and the offset (between the phonetic centre
and the end of the unit.) Possible graph transitions
q_i → q_j between phoneme states q_i and q_j occur between frames in the offset of q_i and frames in the onset of q_j. There will only be one optimal 'stitching' point between the two sequences, the point at which the frame-to-frame transition probability (see Section 3.2) is highest; therefore we have a directed graph with transition probabilities between each discrete state.

Figure 2: A graph of nodes representing dynamic phonemes is constructed. Each node is connected to each of the following phonetic units in the sequence, and the Viterbi algorithm is used to find the least expensive path (shown in bold) matching an input utterance.
In the data set we have captured there are 1314
phonetic units, from a set of 38 British English
phonemes, giving 1314² possible transitions. By
thresholding out low probability transitions and prun-
ing the graph according to phonotactic rules (e.g. bi-
labial plosives cannot occur sequentially) and remov-
ing within-state transitions the search space to be tra-
versed during synthesis can be greatly reduced. The
number of phonemes represented in the graph could
potentially be reduced further, as the main differ-
ence between many speech sounds is the position of
the tongue which is virtually indistinguishable in our
data (except in some key examples such as /T,D/, e.g.
thing.) However, we did not follow this route be-
cause often the dynamic properties of phonemes are
distinct even when the central lip pose is very similar.
The size of our speech database allows us to represent
phonemes and not resort to a coarse viseme reduction.
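The sketch below illustrates how the dynamic-phoneme graph could be assembled from the frame transition matrix. Here `units` is assumed to hold (label, start, centre, end) frame indices from the phonetic segmentation; the probability threshold value and the omission of the phonotactic pruning rules are simplifications for illustration.

import numpy as np

def build_graph(units, P, threshold=0.1):
    """For each ordered pair of units, keep the single best stitching point between the
    offset of unit i and the onset of unit j, if its probability exceeds the threshold."""
    edges = []
    for i, (_, _, ci, ei) in enumerate(units):
        for j, (_, sj, cj, _) in enumerate(units):
            if i == j:                            # no within-state transitions
                continue
            block = P[ci:ei, sj:cj]               # offset frames of i vs. onset frames of j
            if block.size == 0:
                continue
            fi, fj = np.unravel_index(np.argmax(block), block.shape)
            p = block[fi, fj]
            if p >= threshold:                    # threshold out low-probability transitions
                edges.append((i, j, ci + fi, sj + fj, float(p)))
    return edges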
3.4 Unit Selection
The constructed phonetic graph structure allows syn-
thesis to be formulated as a Markov process, i.e.
the transition between phoneme states is independent
of previously traversed states (3). This is obvious,
given that the probability of transitioning between any
two phoneme states in the graph depends only upon
the similarity of frames (as defined in Section 3.2.)
Thus, given a sequence of phoneme states (e.g. ’good
evening’ /g/,/U/,/d/,/i:/,/v/,/n/,/I/,/N/) we must select
the best sequence of states from the constructed graph
(see fig. 2) which visually represent these speech
sounds. One of the most common methods for doing
this is the Viterbi algorithm.
P(q_1, \ldots, q_n) = \prod_{i=1}^{n} P(q_i \mid q_{i-1})    (3)

The Viterbi algorithm is a recursive solution to finding the best sequence of states matching a series of observations (i.e. the phonemes). Given the Markov assumption (3), we can calculate the best probability from any of the possible initial states q_0 optimally by determining the best partial paths to intermediate states q_i, which are reused in the calculation of the q_{i+1} transition. This is a fast method of finding the best n states to match the phonetic input.
For the previous example (’good evening’) each
of the phonemes is a state in the Viterbi search,
with m possibilities (where m is the number of
examples of that phoneme in the database.) So each
state may have a different number of alternatives,
as the database does not have a uniform sampling
of all phonemes. Therefore the probability of a
sequence, P(g_5, U_3, d_6), will be the product of the transition probabilities P(g_5 → U_3) × P(U_3 → d_6) (where P(x → y) is the scalar transition probability from x to y). The Viterbi algorithm simply reformulates this potentially exhaustive search into a series of subproblems, e.g. {g_best {U_best {d_best {i:_best {v_best {n_best {I_best, N_best}}}}}}}, where the innermost problem is tackled first and the inner problems recursively accumulate to construct the best path from all possible states.
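A hedged sketch of this unit-selection search is shown below. `candidates[t]` is assumed to list the database units whose merged phoneme label matches the t-th phoneme of the target utterance, and `trans[u][v]` the unit-to-unit transition probability from the graph; working in log probabilities is a standard Viterbi detail rather than something stated in the paper.

import math

def select_units(candidates, trans):
    """Viterbi search over unit indices; returns the maximum-probability unit sequence."""
    score = {u: 0.0 for u in candidates[0]}      # log-prob of the best path ending in unit u
    back = [{} for _ in candidates]              # back-pointers for path recovery
    for t in range(1, len(candidates)):
        new_score = {}
        for v in candidates[t]:
            best_u, best_s = None, -math.inf
            for u, s in score.items():
                p = trans.get(u, {}).get(v, 0.0)
                if p > 0.0 and s + math.log(p) > best_s:
                    best_u, best_s = u, s + math.log(p)
            if best_u is not None:
                new_score[v] = best_s
                back[t][v] = best_u
        score = new_score
    # trace back from the best final unit to recover the full path
    path = [max(score, key=score.get)]
    for t in range(len(candidates) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

For the 'good evening' example, candidates would contain eight lists (one per phoneme) and the returned path names one database unit per phoneme.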
Note that audio parameters (e.g. cepstral coefficients) are not used to optimize the chosen states (e.g. as seen in (Cosatto and Graf, 2000)); this could be added as a further term to allow animation directly from speech audio with no intermediate phonetic transcription. We find that optimising on the transition probabilities alone provides smooth transitions, which is the most important factor in creating high-quality animation. A further advantage is that where
sequences of phonemes in the target utterance are
present in the training data they will be selected as a
sequence (because the transition probability between
sequential units will be 1). The unit selection proce-
dure intentionally biases toward long contiguous se-
quences in the original captured data, which will max-
imize the quality by reducing the number of synthetic
jumps. Note that the Viterbi graph search is perform-
ing a similar role to the greedy tile-matching proce-
dure used in (Cao et al., 2003), but will not choose
long sequences if they have a high associated transi-
tion cost with surrounding units. Tile-matching biases toward longest units at the expense of possible poor transitions because boundary matching has no associated cost (except where there are two competing units of the same length). We would assert that the Viterbi algorithm is a better all-round solution.

Figure 3: Frames from a synthetic utterance.
3.5 Animation
Our system only requires a rudimentary treatment of
animation. Frames from a synthetic sequence can be
seen in fig. 3. Since our raw data is 3D video of an
actor's face, there is no necessity for complicated facial
models or deformation techniques. The output of unit
selection is a sequence of dynamic phonemes (i.e. se-
quences of frames from our initial data) and relative
timings with the input utterance (i.e. the disparities
between phoneme centres in the input utterance and
our dynamic phonemes.) We use simple linear scal-
ing to align selected units correctly with the utterance
audio, and where the stretching/squashing of the dy-
namic phonemes leads to a misalignment of frames
in the two sequences the closest frame is chosen (i.e.
no interpolation of frames.) The only intervention in
the final animation is to use a noise function to apply
random low-magnitude rotations to the head with re-
spect to the neck. This prevents the synthetic charac-
ter from appearing too static, as can be seen in many
2D video-based talking heads.
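The retiming step can be sketched as a simple nearest-frame resampling; the fixed output frame rate of 25 fps used here is illustrative, not a value taken from the paper.

import numpy as np

def retime_unit(unit_frames, target_duration, fps=25.0):
    """Linearly stretch/squash a selected unit to the target phoneme duration and pick the
    closest captured frame for each output frame (no interpolation of frames)."""
    n_out = max(1, int(round(target_duration * fps)))
    positions = np.linspace(0, len(unit_frames) - 1, n_out)   # fractional source positions
    return [unit_frames[int(round(p))] for p in positions]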
Animations created using our system show several
advantages over traditional mocap techniques (Cao
et al., 2003; Kshirsagar and Magnenat-Thalmann,
2003). By capturing the surface texture of the actor
simultaneously with the geometry we can recover and
display high resolution features which would be lost
using techniques such as motion-capture. It is also
noticeable that idiosyncrasies particular to our speaker are captured; an obvious example for our actor is the asymmetrical way that she opens her mouth.
The quality of animations generated using our synthesis technique is similar to that of animations produced from 2D video (e.g. (Ezzat et al., 2002; Brand, 1999; Bregler et al., 1997)), with the added advantage of full control over orientation.
4 CONCLUSIONS
A data-driven approach to 3D visual speech synthesis
based on captured 3D video of faces has been pre-
sented. Recent advances in 3D video capture have
achieved simultaneous video-rate acquisition of facial
shape and appearance. In this paper we have intro-
duced face synthesis based on a graph representation
of a phonetically segmented 3D video corpus. This
approach is analogous to previous work in face syn-
thesis by resampling 2D video (Bregler et al., 1997)
and 2D video textures (Schödl et al., 2000). Face
synthesis for novel speech utterances is achieved by
optimisation of the path through the graph and con-
catenation of segments of the captured 3D video. A
novel metric using a hierarchical wavelet decomposi-
tion is introduced to identify transitions between 3D
video frames with similar facial shape, appearance
and dynamics. This metric allows efficient compu-
tation of the similarity between 3D video frames for
a large corpus to produce transitions without visual
artifacts. Results are presented for facial synthesis
from a corpus of 12 minutes (18000 frames) of 3D
video. Visual speech synthesis of novel sentences
achieves a visual quality comparable to the captured
3D video allowing highly realistic synthesis without
post-processing. The data-driven approach to 3D face
synthesis requires minimal manual intervention be-
tween 3D video capture and facial animation from
speech. Future extensions to the system will introduce expression and secondary facial movements to create a more thoroughly engaging synthetic character.
REFERENCES
Brand, M. (1999). Voice puppetry. In Proceedings of
SIGGRAPH ’99, pages 21–28, New York, NY, USA.
ACM Press/Addison-Wesley Publishing Co.
Bregler, C., Covell, M., and Slaney, M. (1997). Video
rewrite: driving visual speech with audio. In Pro-
ceedings of SIGGRAPH ’97, pages 353–360, New
York, NY, USA. ACM Press/Addison-Wesley Pub-
lishing Co.
Cao, Y., Faloutsos, P., and Pighin, F. (2003). Unsupervised
learning for speech motion editing. In Eurograph-
ics/ACM SIGGRAPH Symposium on Computer Ani-
mation ’03, pages 225–231.
Cohen, M. and Massaro, D. (1993). Modeling coarticula-
tion in synthetic visual speech. In Computer Anima-
tion ’93, pages 139–156.
Cosatto, E. and Graf, H. (2000). Photo-realistic talking
heads from image samples. IEEE Transactions on
Multimedia, 2(3):152–163.
Ezzat, T., Geiger, G., and Poggio, T. (2002). Trainable vide-
orealistic speech animation. In Proceedings of SIG-
GRAPH ’02, pages 388–398, New York, NY, USA.
ACM Press.
Kalberer, G. and Van Gool, L. (2002). Realistic face ani-
mation for speech. Journal of Visualization and Com-
puter Animation, 13(2):97–106.
Kovar, L., Gleicher, M., and Pighin, F. (2002). Motion
graphs. In Proceedings of SIGGRAPH ’02, pages
473–482, New York, NY, USA. ACM Press.
Kshirsagar, S. and Magnenat-Thalmann, N. (2003). Visyl-
lable based speech animation. In Eurographics’03,
pages 632–640.
Löfqvist, A. (1990). Speech as audible gestures. In Hard-
castle, W. and Marchal, A., editors, Speech Produc-
tion and Speech Modeling, pages 289–322. Kluwer.
McGurk, H. and MacDonald, J. (1976). Hearing lips and
seeing voices. Nature, (264):746–748.
Schödl, A., Szeliski, R., Salesin, D. H., and Essa, I.
(2000). Video textures. In Proceedings of SIG-
GRAPH ’00, pages 489–498, New York, NY, USA.
ACM Press/Addison-Wesley Publishing Co.
Shashua, A. and Levin, A. (2001). Linear image coding
for regression and classification using the tensor-rank
principle. In Proceedings of the 2001 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition., pages 42–49.
Stollnitz, E., DeRose, T., and Salesin, D. (1995). Wavelets for computer graphics: A primer. IEEE Computer Graphics and Applications, 15:76–84.
Sumby, W. and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212–215.
Taylor, P., Black, A., and Caley, R. (1998). The architecture of the Festival speech synthesis system. In Third International Workshop on Speech Synthesis.
Wang, Y., Huang, X., Lee, C.-S., Zhang, S., Li, Z., Samaras,
D., Metaxas, D., Elgammal, A., and Huang, P. (2004).
High resolution acquisition, learning and transfer of
dynamic 3-d facial expressions. In Eurographics’04,
pages 677–686.
Ypsilos, I., Hilton, A., and Rowe, S. (2004). Video-rate
capture of dynamic face shape and appearance. In 6th IEEE International Conference on Automatic Face
and Gesture Recognition, pages 117–123.
Zhang, L., Snavely, N., Curless, B., and Seitz, S. (2004).
Spacetime faces: high resolution capture for model-
ing and animation. ACM Transactions on Graphics,
23(3):548–558.