PHONEME-TO-VISEME MAPPING FOR VISUAL SPEECH
RECOGNITION
Luca Cappelletta and Naomi Harte
Department of Electronic and Electrical Engineering, Trinity College Dublin, Dublin, Ireland
Keywords:
AVSR, Viseme, PCA, DCT, Optical flow.
Abstract:
Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes
are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or how
many to model on the visual side in audio-visual speech recognition systems. This paper compares the use
of 5 viseme maps in a continuous speech recognition task. The focus of the study is visual-only recognition
to examine the choice of viseme map. All the maps are based on the phoneme-to-viseme approach, created
either using a linguistic method or a data driven method. DCT, PCA and optical flow are used to derive the
visual features. The best visual-only recognition on the VidTIMIT database is achieved using a linguistically
motivated viseme set. These initial experiments demonstrate that the choice of visual unit requires more
careful attention in audio-visual speech recognition system development.
1 INTRODUCTION
Many authors have demonstrated that the incorpora-
tion of visual information into speech recognition sys-
tems can improve robustness, as shown in the review
paper of Potamianos et al. (Potamianos et al., 2003).
In terms of speech recognition as a pattern recognition
task, the most common solution is a Hidden Markov
Model (HMM)-based system. Phonemes are the typi-
cal model unit for continuous speech, and Mel-frequency
cepstral coefficients (MFCCs) are the typical features. On the vi-
sual side, there is less agreement as to the optimal
approach even for the most basic early integration
schemes.
While many efforts continue to examine visual
feature sets to best describe the mouth area, it is also
unclear what the optimal modelling units are in the
visual domain for continuous speech. At the high-
est level, the approach is to use visemes, but only a
generic definition is recognized. A viseme is defined
as a visually distinguishable unit, the equivalent in the
visual domain of the phoneme in the audio domain
(Potamianos et al., 2003). However there is no agree-
ment on what a viseme is in practice. The most com-
mon approach to deriving visemes is to use a hard
link between phonemes and their visual manifesta-
tion. This is most likely influenced by considering
the baseline HMM system to be audio based. Hence a
many-to-one phoneme-to-viseme map can be derived.
Many such maps are present in the literature, and there
is no agreement on which is the best one.
In this paper five maps created using different
methods are compared. All the maps have a vary-
ing number of visemes (from 11 to 15, plus a silence
viseme). In order to compare the performances of the
maps, a HMM recognition system is used. The sys-
tem is trained using different visual feature sets: PCA;
DCT; and optical flow. Since the focus of this work is
on the visual element of speech recognition initially,
visual-only cues were tested for this paper. No audio
cues were used. Ultimately, the overall recognition
combining audio and visual cues is of interest. This
work uses a basic visual HMM system however, in
order to focus the problem on the viseme set without
the interactions of integration schemes.
In investigating visemes, it is necessary to use a
continuous speech database rather than an isolated
word recognition task in order to get visemic coverage
in the dataset. The most attractive datasets, in terms
of number of speakers and sentences uttered, are AV-
TIMIT (Hazen et al., 2004) and IBM ViaVoice (Neti
et al., 2000). Currently, neither is publicly available,
so a smaller dataset was used in this work: VID-
TIMIT (Sanderson, 2008).
The paper is structured as follows: an overview
of viseme definitions is given first, along with details
of the five phoneme-to-viseme maps used; the fea-
ture extraction techniques are then presented; and finally
results of a HMM based recognition system are pre-
sented for the feature sets and viseme maps. Param-
eters for the DCT feature extraction scheme are opti-
mised in the experiments reported in this paper, while
those for the other feature sets are taken from previous
work by the authors.
2 VISEME MAPS
As previously stated, visemes have multiple interpre-
tations in the literature and there is no agreement on
a way to define them. Two practical definitions are
plausible:
Visemes can be thought of in terms of articulatory
gestures, such as lips closing together, jaw move-
ment, teeth exposure, etc.
Visemes are derived from groups of phonemes
having the same visual appearance.
The second definition is the most widely
used (Potamianos et al., 2003; Saenko, 2004; Neti
et al., 2000; Bozkurt et al., 2007), despite a lack
of evidence that it is better than the first definition
(Saenko, 2004). Using the second approach, visemes
and phonemes are strictly correlated, and visemes can
be obtained using a map of phonemes to viseme. This
map has to be a many-to-one map, because many
phonemes can not be distinguished using only visual
cues. This is the approach used in this work. Within
this approach, there are two possible ways to build a
map:
1. Linguistic. Viseme classes are defined through
linguistic knowledge and the intuition of which
phonemes might appear the same visually.
2. Data Driven. Viseme classes are formed per-
forming a phoneme clustering, based on features
extracted from the region of interest around the
mouth.
A data driven method has several advantages. Firstly,
since most viseme recognition systems use statistical
models trained on data, it might be beneficial to au-
tomatically learn natural classes from data. Secondly,
it can account for contextual variation and differences
between speakers (but only if a large database is avail-
able) (Saenko, 2004). This is particularly impor-
tant because the linguistic-based method is usually
performed with canonical phonemes in mind, while
recognition is done on continuous speech.
All five maps tested in this work have a relatively
low number of visemes (from 11 to 15, plus a silence
viseme), similar to the 14 classes present in the MPEG-4
viseme list (Pandzic and Forchheimer, 2003). In other
Table 1: Jeffers phoneme-to-viseme map (Jeffers and Barley, 1971). The last viseme, /S, is used for silence. The table shows the viseme visibility rank and occurrence rate in spoken English.

Viseme  Visibility Rank  Occurrence [%]  TIMIT Phonemes
/A      1                3.15            /f/ /v/
/B      2                15.49           /er/ /ow/ /r/ /q/ /w/ /uh/ /uw/ /axr/ /ux/
/C      3                5.88            /b/ /p/ /m/ /em/
/D      4                0.70            /aw/
/E      5                2.90            /dh/ /th/
/F      6                1.20            /ch/ /jh/ /sh/ /zh/
/G      7                1.81            /oy/ /ao/
/H      8                4.36            /s/ /z/
/I      9                31.46           /aa/ /ae/ /ah/ /ay/ /eh/ /ey/ /ih/ /iy/ /y/ /ao/ /ax-h/ /ax/ /ix/
/J      10               21.10           /d/ /l/ /n/ /t/ /el/ /nx/ /en/ /dx/
/K      11               4.84            /g/ /k/ /ng/ /eng/
/S      -                -               /sil/
maps, the viseme number is much higher, e.g. the Gold-
schen map contains 35 visemes (Goldschen et al.,
1994).
In the first map, Jeffers & Barley group 43
phonemes into 11 visemes in the English lan-
guage (Jeffers and Barley, 1971) for what they describe
as “usual viewing conditions”. The map link-
ing phonemes to visemes is shown in Table 1. In
this table visemes are labelled using a letter, from
/A to /K. To these 11, a silence viseme has been
added, labelled using /S. The last column is a sug-
gested phoneme to viseme mapping for the TIMIT
phoneme set. Two phonemes are not listed in the ta-
ble: /hh/ and /hv/. No specific viseme is linked to
them because, while the speaker is pronouncing /hh/
or /hv/, the lips are already in the position to produce
the following phoneme. Therefore /hh/ and /hv/ have
been merged with the following viseme. The table
shows the viseme visibility rank and occurrence rate
in spoken English (Jeffers and Barley, 1971). This
map is purely linguistic.
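
As a concrete illustration of how such a many-to-one map is applied, the sketch below encodes a fragment of the Jeffers map from Table 1 as a Python dictionary and converts a TIMIT phoneme sequence into a viseme sequence. This is a minimal sketch rather than code from the original system: only a subset of Table 1 is shown, and the helper name phonemes_to_visemes is hypothetical.

```python
# Fragment of the Jeffers phoneme-to-viseme map (Table 1); the full map
# covers all TIMIT phonemes listed there.
JEFFERS_MAP = {
    "f": "/A", "v": "/A",
    "er": "/B", "ow": "/B", "r": "/B", "q": "/B", "w": "/B",
    "b": "/C", "p": "/C", "m": "/C", "em": "/C",
    "dh": "/E", "th": "/E",
    "s": "/H", "z": "/H",
    "sil": "/S",
}

def phonemes_to_visemes(phonemes):
    """Map a TIMIT phoneme sequence to a viseme sequence (many-to-one).

    /hh/ and /hv/ carry no viseme of their own: they are dropped, so they
    merge with the viseme of the following phoneme, as described above.
    """
    visemes = []
    for ph in phonemes:
        if ph in ("hh", "hv"):
            continue  # lips already anticipate the next phoneme
        visemes.append(JEFFERS_MAP[ph])
    return visemes

print(phonemes_to_visemes(["sil", "hh", "er", "b", "s", "sil"]))
# ['/S', '/B', '/C', '/H', '/S']
```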
The second map analyzed is proposed by Neti et
al. (Neti et al., 2000). This map was created using
the IBM ViaVoice database and a decision tree, in
the same fashion as decision trees are used
to identify triphones. Thus, this map can be con-
sidered a mixture of a linguistic and data driven ap-
proach. Neti’s map is composed of 43 phonemes and
12 classes (plus a silence class). Details are shown in
Table 2.
Hazen et al. (Hazen et al., 2004) use a data
driven approach. They perform bottom-up clustering
using models created from phonetically labelled vi-
sual frames. The map obtained is “roughly” (Hazen
et al., 2004) based on this clustering technique. The
Table 2: Neti map (Neti et al., 2000).

Code  Viseme Class               Phonemes in Cluster
V1    Lip-rounding based vowels  /ao/ /ah/ /aa/ /er/ /oy/ /aw/ /hh/
V2    Lip-rounding based vowels  /uw/ /uh/ /ow/
V3    Lip-rounding based vowels  /ae/ /eh/ /ey/ /ay/
V4    Lip-rounding based vowels  /ih/ /iy/ /ax/
A     Alveolar-semivowels        /l/ /el/ /r/ /y/
B     Alveolar-fricatives        /s/ /z/
C     Alveolar                   /t/ /d/ /n/ /en/
D     Palato-alveolar            /sh/ /zh/ /ch/ /jh/
E     Bilabial                   /p/ /b/ /m/
F     Dental                     /th/ /dh/
G     Labio-dental               /f/ /v/
H     Velar                      /ng/ /k/ /g/ /w/
S     Silence                    /sil/ /sp/
Table 3: Hazen map (Hazen et al., 2004).
Viseme Class Phonemes Set
OV /ax/ /ih/ /iy/ /dx/
BV /ah/ /aa/
FV /ae/ /eh/ /ay/ /ey/ /hh/
RV /aw/ /uh/ /uw/ /ow/ /ao/ /w/ /oy/
L /el/ /l/
R /er/ /axr/ /r/
Y /y/
LB /b/ /p/
LCl /bcl/ /pcl/ /m/ /em/
AlCl /s/ /z/ /epi/ /tcl/ /dcl/ /n/ /en/
Pal /ch/ /jh/ /sh/ /zh/
SB /t/ /d/ /th/ /dh/ /g/ /k/
LFr /f/ /v/
VlCl /gcl/ /kcl/ /ng/
Sil /sil/
reason for this apparent inaccuracy is that the clus-
tering results vary considerably depending on the visual fea-
tures used. Hazen et al. group 52 phonemes into 14
visemes (plus a silence viseme). This is shown in Ta-
ble 3.
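
The following is a minimal sketch of the data-driven idea, not Hazen et al.'s exact procedure (which clusters models trained on phonetically labelled visual frames). It assumes a hypothetical dictionary of per-phoneme mean visual feature vectors and uses average-linkage agglomerative (bottom-up) clustering from SciPy as an illustrative choice.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_phonemes(phoneme_features, n_visemes=14):
    """Bottom-up clustering of phonemes into viseme classes.

    phoneme_features: dict mapping each phoneme label to the mean visual
    feature vector of its labelled frames (hypothetical input).
    Returns a dict: cluster id -> list of phoneme labels.
    """
    labels = sorted(phoneme_features)
    X = np.vstack([phoneme_features[p] for p in labels])
    # Average linkage with Euclidean distance is an illustrative choice,
    # not necessarily the configuration used by Hazen et al.
    Z = linkage(X, method="average", metric="euclidean")
    assignment = fcluster(Z, t=n_visemes, criterion="maxclust")
    viseme_map = {}
    for phoneme, cluster_id in zip(labels, assignment):
        viseme_map.setdefault(int(cluster_id), []).append(phoneme)
    return viseme_map
```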
Bozkurt et al. (Bozkurt et al., 2007) created a map
using the linguistic approach. The map is based on
Ezzat and Poggio’s work (Ezzat and Poggio, 1998), in
which they define the phoneme clustering as “done in
a subjective manner, by comparing the viseme images
visually to assess their similarity”. The Bozkurt map
comprises 15 visemes (plus a silence viseme) and 45
phonemes, detailed in Table 4.
In the final map, shown in Table 5, Lee and
Yook (Lee and Yook, 2002) identify 13 viseme classes
(plus a silence viseme) from 39 phonemes (plus
a silence phoneme and a pause phoneme). They do
not explain how the map has been derived, so it has
Table 4: Bozkurt Map (Bozkurt et al., 2007).
Viseme Class Phonemes Set
S sil
V2 ay, ah
V3 ey, eh, ae
V4 er
V5 ix, iy, ih, ax, axr, y
V6 uw, uh, w
V7 ao, aa, oy, ow
V8 aw
V9 g, hh, k, ng
V10 r
V11 l, d, n, en, el, t
V12 s, z
V13 ch, sh, jh, zh
V14 th, dh
V15 f, v
V16 m, em, b, p
Table 5: Lee Map (Lee and Yook, 2002).
Viseme Class Phonemes Set
P b p m
T d t s z th dh
K g k n ng l y hh
CH jh ch sh zh
F f v
W r w
IY iy ih
EH eh ey ae
AA aa aw ay ah
AH ah
AO ao oy ow
UH uh uw
ER er
S sil
been assumed that it is a linguistic map. Even though
they claim this is a many-to-one map, some phonemes
are mapped to two visemes, so the map is in fact many-to-
many. To remove this ambiguity, in such cases
phonemes are associated with the first viseme pro-
posed. This affects 5 vowel phonemes.
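
A short sketch of this disambiguation step is given below, under the assumption that the raw map is stored with each phoneme listing every viseme it appears under, in the order proposed by Lee and Yook; the data structure and names shown are hypothetical.

```python
# Hypothetical raw table: each phoneme lists every viseme it appears
# under, in the order proposed by the original map.
raw_lee_map = {"aa": ["AA"], "ah": ["AA", "AH"], "ay": ["AA"]}

# Collapse to a many-to-one map by keeping the first viseme proposed.
lee_map = {ph: visemes[0] for ph, visemes in raw_lee_map.items()}
# {'aa': 'AA', 'ah': 'AA', 'ay': 'AA'}
```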
It is not a simple task to compare these maps be-
cause the total viseme number and the total phoneme
number are different in the five maps. Table 6 sums
up the most relevant map properties. It is clear that
some similarities are present, particularly between the
Jeffers and Neti maps. In these two maps 5 con-
sonant classes are identical. Across all maps, the
consonant classes show similar class separation. All
the maps have a specific class for phoneme clusters
{/v/, /f/} and {/ch/, /jh/, /sh/, /zh/}. Jeffers, Neti,
Table 6: Map properties. Clustered phoneme number, num-
ber of visemes and number of vowel visemes. Silence
viseme and phonemes are not taken into consideration.
Map      Phonemes  Total Visemes  Vowel Visemes
Jeffers  43        11             4
Neti     42        12             4
Hazen    52        14             5
Bozkurt  45        15             7
Lee      39        13             7
Bozkurt and Lee have a specific class for {/b/, /m/,
/p/}. The group {/th/, /dh/} forms a viseme in Jeffers, Neti
and Bozkurt, while in Hazen and Lee it is merged with
other phonemes. Aside from this, the Hazen map (the
only data driven map) is significantly different from
the others, while Jeffers and Neti show a striking
consonant class correspondence.
In contrast, vowel visemes are quite different from
map to map. The number of vowel visemes varies
from 4 to 7, and a single class can contain from 1
up to 10 vowels. No consistent patterns are
evident across the maps.
A final difference within the maps is that the
phonemes {/pcl/, /tcl/, /kcl/, /bcl/, /dcl/, /gcl/, /epi/}
are not considered in the analysis by Jeffers, Neti,
Bozkurt and Lee, while they are spread across several
classes by Hazen.
3 FEATURE EXTRACTION
Feature extraction is performed in two consecutive
stages: a Region of Interest (ROI) is detected,
and then a feature extraction technique is applied to
that area. The ROI is found using a semi-automatic
technique (Cappelletta and Harte, 2010) based on two
stages: the speaker’s nostrils are tracked and then, us-
ing those positions, the mouth is detected. The first
stage succeeds on 74% of the database sentences,
so the remaining 26% has been manually tracked to
allow experimentation on the full dataset. The sec-
ond stage has a 100% success rate. Subsequently the
ROI is rotated according to the nostril alignment. At
this stage the ROI is a rectangle, but its size might
vary from frame to frame. Thus, ROIs are either stretched or
squeezed until they all have the same size. The final size
is the mode of all ROI sizes.
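
A minimal sketch of this normalisation step, assuming the ROIs are available as greyscale NumPy arrays and using OpenCV for the resizing; the function name normalise_rois is illustrative.

```python
from collections import Counter

import cv2  # OpenCV, assumed available

def normalise_rois(rois):
    """Resize every ROI to the most common (mode) ROI size.

    rois: list of greyscale ROI images as NumPy arrays (one per frame).
    Returns the list of ROIs, all stretched or squeezed to the mode size.
    """
    sizes = Counter((roi.shape[1], roi.shape[0]) for roi in rois)  # (w, h)
    mode_w, mode_h = sizes.most_common(1)[0][0]
    return [cv2.resize(roi, (mode_w, mode_h)) for roi in rois]
```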
Having defined the region of interest, a feature ex-
traction algorithm is applied to the ROI. Three differ-
ent appearance-based techniques were used: Optical
Flow; PCA (principal component analysis); and DCT
(discrete cosine transform).
Optical flow is the distribution of apparent veloci-
ties of movement of brightness patterns in an image.
The code used (Bouguet, 2002) implements the
Lucas-Kanade technique (Lucas and Kanade, 1981).
The output of this algorithm is a two dimensional
speed vector for each ROI point. A data reduction
stage, or downsampling, is required. The ROI is divided
into d_R × d_C blocks, and for each block the median
of the horizontal and vertical speed is calculated. In
this way d_R · d_C 2D speed vectors are obtained.
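
A sketch of this downsampling stage is shown below, assuming a dense flow field with one 2D velocity per ROI pixel (the paper uses Bouguet's pyramidal Lucas-Kanade implementation; the array layout here is an assumption for illustration).

```python
import numpy as np

def downsample_flow(flow, d_r, d_c):
    """Reduce a dense flow field to d_r x d_c median velocity vectors.

    flow: array of shape (H, W, 2) holding horizontal and vertical speed
    per ROI pixel. Returns an array of shape (d_r, d_c, 2).
    """
    h, w, _ = flow.shape
    out = np.empty((d_r, d_c, 2))
    row_edges = np.linspace(0, h, d_r + 1, dtype=int)
    col_edges = np.linspace(0, w, d_c + 1, dtype=int)
    for i in range(d_r):
        for j in range(d_c):
            block = flow[row_edges[i]:row_edges[i + 1],
                         col_edges[j]:col_edges[j + 1]]
            # Median of horizontal and vertical components over the block.
            out[i, j] = np.median(block.reshape(-1, 2), axis=0)
    return out

# Example: the 2 x 4 configuration used later in the paper gives 8 blocks,
# i.e. a 16-dimensional feature vector per frame:
# features = downsample_flow(flow, 2, 4).ravel()
```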
PCA (also known as eigenlips in AVSR applica-
tions (Bregler and Konig, 1994)) and DCT are similar
techniques. They both try to represent a video frame
using a set of coefficients obtained by projecting the
image onto an orthogonal basis. While the DCT basis
is defined a priori, the PCA basis depends on the data
used. The optimal number of coefficients N (the fea-
ture vector length) is a key parameter in HMM cre-
ation and training. A vector that is too short leads to
a low quality image reconstruction, while one that is too long
is difficult to model with an HMM. DCT
coefficients are extracted using the zigzag pattern and
the first coefficient is not used.
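
A sketch of the DCT feature extraction under these choices is shown below: a 2D DCT of the ROI, coefficients read in zigzag order, the first (DC) coefficient dropped and the next N kept. SciPy's dctn is used here, and the zigzag helper and function names are illustrative, not the original implementation.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_indices(h, w):
    """Return (row, col) pairs of an h x w array in zigzag order."""
    return sorted(((r, c) for r in range(h) for c in range(w)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def dct_features(roi, n_coeffs=20):
    """First n_coeffs zigzag-ordered DCT coefficients, skipping the DC term."""
    coeffs = dctn(roi.astype(float), norm="ortho")
    order = zigzag_indices(*coeffs.shape)[1:n_coeffs + 1]  # drop DC coefficient
    return np.array([coeffs[r, c] for r, c in order])
```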
Along with these features, first and second deriva-
tives are used, defined as follows:
Δ_k[i] = F_k[i+1] - F_k[i-1]
ΔΔ_k[i] = Δ_k[i+1] - Δ_k[i-1]     (1)
where i represents the frame number in the video, and
k ∈ [1..N] indexes the kth generic feature F.
Used with PCA and DCT coefficients, Δ and ΔΔ repre-
sent speed and acceleration in the feature evolution. Both
Δ and ΔΔ have been added to the PCA and DCT features.
Since optical flow already represents the speed of ROI elements,
only Δ has been tested with it.
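
A minimal sketch of equation (1), applied to a feature trajectory of shape (frames, N); how the edge frames are handled is not specified in the paper, so edge-padding is assumed here.

```python
import numpy as np

def deltas(F):
    """Delta per Eq. (1): Delta_k[i] = F_k[i+1] - F_k[i-1].

    F: array of shape (num_frames, N). Edge frames are edge-padded,
    an assumption not stated in the paper.
    """
    padded = np.pad(F, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]

def delta_deltas(F):
    """Delta-delta per Eq. (1), i.e. the delta of the delta trajectory."""
    return deltas(deltas(F))
```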
Optimal optical flow and PCA parameters have al-
ready been investigated and reported by the authors
for this particular dataset (Cappelletta and Harte,
2011). Results showed that increasing the PCA vec-
tor length beyond an optimal value of N = 15 does not
improve the recognition rate. The best perfor-
mance is obtained using the Δ and ΔΔ coefficients, with-
out the original PCA data. Similarly, the best perfor-
mance with optical flow was achieved using the original
features with Δ coefficients. In this case performance
is not affected by the downsampling configura-
tion. Thus, the 2 × 4 + Δ configuration will be used
for the experiments reported in this paper.
4 EXPERIMENT
4.1 VIDTIMIT Dataset
The VIDTIMIT dataset (Sanderson, 2008) is com-
prised of the video and corresponding audio record-
ings of 43 people (24 male and 19 female), reciting
10 short sentences each. The sentences were chosen
from the test section of the TIMIT corpus. The selec-
tion of sentences in VIDTIMIT has full viseme cov-
erage for all the maps used in this paper. The record-
ing was done in an office environment using a broad-
cast quality digital video camera at 25 fps. The video
of each person is stored as a numbered sequence of
JPEG images with a resolution of 512 × 384 pixels.
A 90% quality setting was used when creating
the JPEG images. For the results presented in this pa-
per, 410 videos have been used and they have been
split into a training group (297 sentences) and a test
group (113 sentences). The two groups are balanced
in gender and they have similar phoneme occurrence
rates. Training and test speakers did not overlap.
4.2 HMM Systems
HMMs were trained using PCA, DCT and optical flow
features. A visemic time transcription for VIDTIMIT
was generated using a forced alignment procedure
with monophone HMMs trained on the TIMIT au-
dio database. The system was implemented using
HTK. All visemes were modelled with a left-to-
right HMM, except silence, which used a fully ergodic
model. The number of mixtures per state was grad-
ually increased, with Viterbi recognition performed
after each increase to monitor system performance.
No language model was used in order to assess raw
feature performance. The feature vector rate was in-
creased to 20 ms using interpolation. Both 3- and
4-state HMMs were used.
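
As a sketch of the rate conversion: video features arrive at 25 fps (one vector every 40 ms) and are upsampled to a 20 ms frame period before HMM training. The paper only states that interpolation was used; linear interpolation is assumed here, and the function name is illustrative.

```python
import numpy as np

def resample_features(F, src_period_ms=40.0, dst_period_ms=20.0):
    """Linearly interpolate a feature trajectory to a finer frame rate.

    F: array of shape (num_frames, N) sampled every src_period_ms.
    Returns the features resampled every dst_period_ms over the same span.
    """
    t_src = np.arange(F.shape[0]) * src_period_ms
    t_dst = np.arange(0.0, t_src[-1] + 1e-9, dst_period_ms)
    return np.column_stack([np.interp(t_dst, t_src, F[:, k])
                            for k in range(F.shape[1])])
```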
The experiment was conducted in two stages. In
the first stage the Jeffers map was used. The HMM
and DCT feature parameters were varied in order to
find the optimal parameter configuration. It should
be noted that similar results were achieved using the
other maps but space limits the presentation of these
results to a single map. In particular, the recognition
rate is tested while varying the number of HMM mixtures. Re-
sults are compared with the PCA and optical flow feature
performance.
In the second stage of the experiments, the feature
set parameters were fixed (using the optimal configu-
rations in (Cappelletta and Harte, 2011) and those de-
termined for the DCT), in order to compare the results
from different maps. The optimal number of mixtures
for each individual viseme class was tracked. This
overcomes issues with different amounts of training
data in different classes. Thus HMMs used between 1
and 60 mixtures per state.
5 RESULTS
5.1 Feature Set Parameters
HMM results are assessed using the correctness esti-
mator, corr, defined as follows:
Corr = (T - D - S) / T × 100     (2)
where T is the total number of labels in the reference
transcriptions, D is the deletion error and S is the sub-
stitution error.
Figure 1: Basic DCT test, 3 states. N14, N20, etc. refer to the number of DCT features (14, 20, ...). (Plot of Corr [%] against mixture number M.)
Figure 2: Higher order DCT features. N = 20, 3 states. DCT denotes the 20 DCT features only, DCT+Δ denotes the addition of first order dynamics, and ΔΔ denotes the inclusion of both first and second order dynamics without the original DCT coefficients. (Plot of Corr [%] against mixture number M.)
Figures 1 and 2 show the correctness of the 3-state
HMM using DCT features and the Jeffers map. Results
for the 4-state HMM are not shown because no sig-
nificant improvement from the 3-state was achieved.
Figure 1 shows the results of the basic DCT coefficient
tests obtained by varying the feature vector length N
between 14 and 54. The best results are achieved with
vector lengths of 14 and 20, even though all the con-
figurations achieve very similar results, in agreement with
Heckmann et al. (Heckmann et al., 2002). Signifi-
cant improvement can be achieved using Δ and ΔΔ.
Figure 2 shows the performance of 20 DCT coeffi-
cients with first and second derivatives added. The
recognition rate is increased by at least 30%. This be-
haviour mirrors that of the PCA feature set. As might
be expected, no significant improvement is achieved
beyond 35 Gaussian mixtures.
5.2 Maps Comparison
In the second part of the experiment, all the maps
were tested. The PCA and DCT results are obtained
using the Δ and ΔΔ coefficients only, with N = 15 for
PCA and N = 20 for the DCT feature set. Optical flow
results are obtained using 2 × 4 downsampling with Δ
coefficients. Along with correctness, defined in equa-
tion 2, it is advisable to use the accuracy estimator to
give a better overall indication of performance. The
standard definition was used:
Acc = (T - D - S - I) / T × 100     (3)
where I is the number of insertions.
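
For reference, a trivial sketch of equations (2) and (3) as they would be computed from the recognition counts; the function names are illustrative.

```python
def correctness(T, D, S):
    """Eq. (2): percentage of reference labels neither deleted nor substituted."""
    return 100.0 * (T - D - S) / T

def accuracy(T, D, S, I):
    """Eq. (3): correctness additionally penalised by insertions."""
    return 100.0 * (T - D - S - I) / T

# Example: correctness(3523, 420, 955) for the Jeffers / optical flow
# counts reported in Figure 5.
```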
Figure 3 and Figure 4 show recognition results
for the five maps. When examining the figures, it is
important to realise that recognition results in a con-
tinuous speech task are expected to be relatively low
when compared to, for example, an isolated digit task.
It can certainly be argued that results will improve
significantly with use of a language model and when
combined with audio cues. However, this viseme set
exploration is seeking to study baseline viseme per-
formance initially.
It is apparent that the Jeffers map gives the best
results both in terms of correctness and accuracy. The
Neti map is the next best map, with little difference
in performance from the remaining maps. Examin-
ing the accuracy figures, it is clear that the insertion
level remains high overall. An insertion penalty was
investigated in an attempt to address this issue but a
suitable balance has not yet been found for the sys-
tem. The performance for the optical flow and PCA
features using 3-state HMMs was little better than a
guess rate.
It is possible to see a correlation between recogni-
tion rate and the number of viseme and vowel classes
listed in Table 6. The lower the number of viseme and vowel
classes, the better the recognition figure. Whilst
this is fully expected in a pattern recognition task, it is
still interesting to compare the Jeffers and Neti maps
because, even though many visemes encompass the
same phonemes (5 classes are identical), the results
are quite different. Results from the 3-state HMM with
optical flow features are used to demonstrate this, but
Figure 3: 3- and 4-state HMM correctness (Corr [%]) for each map, using all feature extraction techniques (PCA, DCT and optical flow). The Jeffers map gives the best performance in all tests, for both 3 and 4 states.
Figure 4: 3- and 4-state HMM accuracy (Acc [%]) for each map, using all feature extraction techniques (PCA, DCT and optical flow). While the Jeffers map still gives good results, some maps fall to the guessing rate level (different for each map, see Table 6).
other feature sets yield similar conclusions. Figure 5
and Figure 6 show the confusion matrices obtained
using the 3-state HMM optical flow tests. The total label
count, deletion count and substitution count are
also provided (see equation 2).
As expected, the 5 identical classes (Jeffers /H, /F,
/C, /E and /A, corresponding to Neti B, D, E, F
and G) obtain essentially the same
results. Thus, the Neti performance gap has to be
visemes. Considering the vowel classes, it is possible
to see that in terms of number of phonemes covered,
Jeffers has two big (/B and /I) and two very small (/D
and /G) vowel classes. In contrast, Neti has four quite
balanced vowel classes (V1-V4 contain almost the
same number of phonemes). Jeffers has an advantage
because misclassification is less probable if classes
are large (see /B and /I in Figure 5). Moreover, even
a complete misclassification in the two small classes
will have a minor impact on the overall recognition
Figure 5: Confusion matrix obtained with the 3-state HMM us-
ing optical flow features and the Jeffers map. /B, /D, /G and /I
are the vowel visemes. The sub column represents the substitu-
tion error for each viseme, while del represents the deletion
error for each viseme. T = 3523, D = 420, S = 955 (see eq.
2).
rate. Figure 5 shows that /D and /G are basically com-
pletely misclassified, mostly in favour of the other two
vowel classes /B and /I, but these classes have such a
low occurrence that this misclassification is negligi-
ble from a statistical point of view. In contrast, the
Neti vowel visemes are more frequently misclassified.
They contribute roughly 60% more classification er-
ror, in either substitution or deletion errors.
Similar behaviour is present in the remaining con-
sonant classes. The remaining consonant phonemes
are clustered into two visemes in the Jeffers map (/K and
/J) and in three visemes in the Neti map (A, H and C).
Once again, the smaller the number of classes, the better the
classification. The three Neti visemes contribute 40%
more error than the two Jeffers consonant visemes.
6 CONCLUSIONS AND FUTURE
WORK
This paper has presented a continuous speech recog-
nition system based purely on HMM modelling of
visemes. A continuous recognition task is signifi-
cantly more challenging than an isolated word recogni-
tion task such as digits. In terms of AVSR, it is a more
complete test of a system’s ability to capture pertinent
information from a visual stream, as the complete
set of visemes is present in a greater range of con-
texts. Five viseme maps have been tested, all based on
the phoneme-to-viseme map technique. These maps
were created using different approaches (linguistic,
data driven and mixed). A pure linguistic map (Jef-
Figure 6: Confusion matrix obtained with the 3-state HMM us-
ing optical flow features and the Neti map. V1 to V4 are the
vowel visemes. The sub column represents the substitution er-
ror for each viseme, while del represents the deletion error
for each viseme. T = 3662, D = 531, S = 1262 (see eq. 2).
fers) achieved the best recognition rates in all the per-
formed tests. Compared with the second best map
(Neti), this improvement in performance can be at-
tributed to better clustering in some consonant classes
and fewer vowel visemes (statistically, Jeffers visemes
/D and /G are negligible).
Work is ongoing to extend this system to include
other feature sets, including other optical flow im-
plementations and Active Appearance Model (AAM)
features, to provide a definitive baseline for visual
speech recognition. To validate whether the Jeffers
map is a better approach to viseme modelling in the
context of a full AVSR system, the maps are also be-
ing tested incorporating audio speech features. This will test
the hypothesis that better visual features should im-
prove the overall AVSR performance when the speech
quality is low.
Unfortunately, the phoneme-to-viseme map ap-
proach does not take into account audio-visual asyn-
chrony (Potamianos et al., 2003; Hazen, 2006), nor
the fact that some phonemes do not require the use
of the visual articulators, such as /k/ and /g/ (Hilder et al.,
2010). Thus, along with the tested maps, it is impor-
tant to include in the analysis viseme definitions that
do not assume a formal link between acoustic and vi-
sual speech cues. This will emphasize the dynamics
in human mouth movements, rather than the audio-
visual link only.
To this end, the limited availability of large continuous
speech AVSR datasets (as opposed to isolated word
tasks or databases containing a small number of sen-
tences) continues to be a hurdle in AVSR develop-
ment.
ACKNOWLEDGEMENTS
This publication has emanated from research con-
ducted with the financial support of Science Founda-
tion Ireland under Grant Number 09/RFP/ECE2196.
REFERENCES
Bouguet (2002). Pyramidal Implementation of Lucas
Kanade Feature Tracker. Description of the algorithm.
Bozkurt, Eroglu, Q., Erzin, Erdem, and Ozkan (2007).
Comparison of phoneme and viseme based acoustic
units for speech driven realistic lip animation. In
3DTV Conference, 2007, pages 1–4.
Bregler and Konig (1994). ‘Eigenlips’ for robust speech
recognition. In Acoustics, Speech, and Signal Pro-
cessing, 1994. ICASSP-94., 1994 IEEE International
Conference on, volume ii, pages II/669–II/672 vol.2.
Cappelletta and Harte (2010). Nostril detection for robust
mouth tracking. In Irish Signals and Systems Confer-
ence, pages 239 – 244, Cork.
Cappelletta, L. and Harte, N. (2011). Viseme defini-
tions comparison for visual-only speech recognition.
In Proceedings of 19th European Signal Processing
Conference (EUSIPCO), pages 2109–2113.
Ezzat and Poggio (1998). MikeTalk: a talking facial display
based on morphing visemes. In Computer Animation
98. Proceedings, pages 96–102.
Goldschen, A. J., Garcia, O. N., and Petajan, E. (1994).
Continuous optical automatic speech recognition by
lipreading. In Proceedings of the 28th Asilomar Con-
ference on Signals, Systems, and Computers, pages
572–577.
Hazen (2006). Visual model structures and synchrony con-
straints for audio-visual speech recognition. Audio,
Speech, and Language Processing, IEEE Transactions
on, 14(3):1082–1089.
Hazen, Saenko, La, and Glass (2004). A segment-based
audio-visual speech recognizer: data collection, de-
velopment, and initial experiments. In Proceedings of
the 6th international conference on Multimodal inter-
faces, pages 235–242, State College, PA, USA. ACM.
Heckmann, Kroschel, Savariaux, and Berthommier (2002).
DCT-Based Video Features for Audio-Visual Speech
Recognition. In International Conference on Spoken
Language Processing, volume 1, pages 1925–1928,
Denver, CO, USA.
Hilder, Theobald, and Harvey (2010). In pursuit of
visemes. In International Conference on Auditory-
visual Speech Processing.
Jeffers and Barley (1971). Speechreading (Lipreading).
Charles C Thomas Pub Ltd.
Lee and Yook (2002). Audio-to-Visual Conversion Using
Hidden Markov Models. In Proceedings of the 7th Pa-
cific Rim International Conference on Artificial Intel-
ligence: Trends in Artificial Intelligence, pages 563–
570. Springer-Verlag.
Lucas and Kanade (1981). An iterative image registration
technique with an application to stereo vision. In Pro-
ceedings of Imaging Understanding Workshop.
Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Si-
son, Mashari, and Zhou (2000). Audio-visual speech
recognition. Technical report, Center for Language
and Speech Processing, The Johns Hopkins Univer-
sity, Baltimore.
Pandzic, I. S. and Forchheimer, R. (2003). MPEG-4 Facial
Animation: The Standard, Implementation and Appli-
cations. John Wiley & Sons, Inc., New York, NY,
USA.
Potamianos, Neti, Gravier, Garg, and Senior (2003). Recent
advances in the automatic recognition of audio-visual
speech. Proceedings of the IEEE, 91(9):1306–1326.
Saenko, K. (2004). Articulatory Features for Robust Visual
Speech Recognition. Master's thesis, Massachusetts
Institute of Technology.
Sanderson (2008). Biometric Person Recognition: Face,
Speech and Fusion. VDM-Verlag.