Table 6: Map properties: number of clustered phonemes, number of visemes and number of vowel visemes. The silence viseme and silence phonemes are not taken into consideration.

Map       Phonemes   Total Visemes   Vowel Visemes
Jeffers   43         11              4
Neti      42         12              4
Hazen     52         14              5
Bozkurt   45         15              7
Lee       39         13              7
Bozkurt and Lee have a specific class for {/b/, /m/,
/p/}. The group {/th/, /dh/} forms a viseme in Jeffers, Neti
and Bozkurt, while in Hazen and Lee it is merged with
other phonemes. Aside from this, the Hazen map (the
only data-driven map) differs significantly from
the others, while Jeffers and Neti show a remarkable
correspondence between their consonant classes.
In contrast, vowel visemes differ considerably from
map to map. The number of vowel visemes varies
from 4 to 7, and a single class can contain from 1
up to 10 vowels. No consistent cross-map patterns
emerge among the vowel classes.
A final difference among the maps is that the
phonemes {/pcl/, /tcl/, /kcl/, /bcl/, /dcl/, /gcl/, /epi/}
are not considered in the analysis by Jeffers, Neti,
Bozkurt and Lee, while Hazen spreads them across
several classes.
3 FEATURE EXTRACTION
Feature extraction is performed in two consecutive
stages: first a Region of Interest (ROI) is detected,
and then a feature extraction technique is applied to
that area. The ROI is found using a semi-automatic
technique (Cappelletta and Harte, 2010) based on two
stages: the speaker’s nostrils are tracked and then, us-
ing those positions, the mouth is detected. The first
stage succeeds on 74% of the database sentences, so
the remaining 26% were manually tracked to allow
experimentation on the full dataset. The second stage
has a 100% success rate. Subsequently the
ROI is rotated according to the nostril alignment. At
this stage the ROI is a rectangle, but its size may
vary from frame to frame. Thus, ROIs are either
stretched or squeezed until they share a common size,
chosen as the mode of all ROI sizes.
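The size-normalisation step above can be sketched as follows. This is a minimal illustration: the function names, the use of Python's Counter for the mode, and the nearest-neighbour resampling are assumptions, since the paper does not specify the interpolation used.

```python
from collections import Counter
import numpy as np

def mode_size(roi_sizes):
    """Return the most frequent (height, width) among all per-frame ROI sizes."""
    return Counter(roi_sizes).most_common(1)[0][0]

def resize_nearest(roi, target_hw):
    """Stretch or squeeze an ROI to the target size via nearest-neighbour sampling."""
    h, w = roi.shape[:2]
    th, tw = target_hw
    rows = np.arange(th) * h // th   # source row for each target row
    cols = np.arange(tw) * w // tw   # source column for each target column
    return roi[rows[:, None], cols]

# Example: three per-frame ROI sizes; the mode is (32, 48),
# so the odd-sized frame is stretched to match.
sizes = [(32, 48), (32, 48), (30, 46)]
target = mode_size(sizes)
roi = np.zeros((30, 46))
print(resize_nearest(roi, target).shape)  # (32, 48)
```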
Having defined the region of interest, a feature ex-
traction algorithm is applied to the ROI. Three differ-
ent appearance-based techniques were used: Optical
Flow; PCA (principal component analysis); and DCT
(discrete cosine transform).
Optical flow is the distribution of apparent velocities
of movement of brightness patterns in an image.
The code used (Bouguet, 2002) implements the
Lucas-Kanade technique (Lucas and Kanade, 1981).
The output of this algorithm is a two dimensional
speed vector for each ROI point. A data reduction
stage, or downsampling, is required. The ROI is divided into d_R × d_C blocks, and for each block the median of the horizontal and vertical speed is calculated. In this way d_R · d_C 2D speed vectors are obtained.
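The block-median downsampling just described can be sketched as follows (the function name and the exact block splitting are illustrative assumptions; the flow field itself would come from the Lucas-Kanade tracker):

```python
import numpy as np

def downsample_flow(flow, d_r, d_c):
    """Reduce a dense flow field of shape (H, W, 2) to d_r * d_c 2D speed vectors.

    The ROI is split into a d_r x d_c grid of blocks; within each block the
    median horizontal and vertical speeds are taken, as in the paper's
    data-reduction stage.
    """
    h, w, _ = flow.shape
    row_blocks = np.array_split(np.arange(h), d_r)
    col_blocks = np.array_split(np.arange(w), d_c)
    out = np.empty((d_r, d_c, 2))
    for i, r in enumerate(row_blocks):
        for j, c in enumerate(col_blocks):
            block = flow[np.ix_(r, c)]               # (len(r), len(c), 2)
            out[i, j] = np.median(block.reshape(-1, 2), axis=0)
    return out.reshape(-1, 2)                        # d_r * d_c speed vectors

# Uniform horizontal motion over a 40x60 ROI, 2x4 grid -> 8 vectors of (1, 0).
flow = np.zeros((40, 60, 2))
flow[..., 0] = 1.0
feat = downsample_flow(flow, 2, 4)
print(feat.shape)  # (8, 2)
```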
PCA (also known as eigenlips in AVSR applica-
tions (Bregler and Konig, 1994)) and DCT are similar
techniques. They both represent a video frame using a
set of coefficients obtained by projecting the image
onto an orthogonal basis. While the DCT basis is
defined a priori, the PCA basis depends on the data
used. The optimal number of coefficients N (the feature
vector length) is a key parameter in HMM creation
and training. Too short a vector leads to a low-quality
image reconstruction, while too long a vector is
difficult to model with an HMM. DCT coefficients are
extracted following the zigzag pattern, and the first
(DC) coefficient is discarded.
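The zigzag selection of DCT coefficients can be sketched as below. The helper names are assumptions, and the stand-in matrix `C` takes the place of a frame's actual 2D DCT coefficients, which would be computed from the ROI.

```python
import numpy as np

def zigzag_indices(h, w):
    """Return (row, col) pairs in JPEG-style zigzag order for an h x w grid.

    Cells are grouped by anti-diagonal (r + c); odd diagonals run top-right
    to bottom-left (increasing r), even ones the opposite way (increasing c).
    """
    return sorted(((r, c) for r in range(h) for c in range(w)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def zigzag_features(coeffs, n):
    """Take the first n coefficients in zigzag order, skipping the DC term."""
    idx = zigzag_indices(*coeffs.shape)[1:n + 1]   # [1:] drops coeffs[0, 0]
    return np.array([coeffs[r, c] for r, c in idx])

# Stand-in for a frame's 4x4 DCT coefficient matrix.
C = np.arange(16).reshape(4, 4)
print(zigzag_features(C, 5))  # [1 4 8 5 2]
```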
Along with these features, first and second deriva-
tives are used, defined as follows:
∆_k[i] = F_k[i+1] − F_k[i−1]
∆∆_k[i] = ∆_k[i+1] − ∆_k[i−1]    (1)
where i represents the frame number in the video, and
k ∈ [1..N] indexes the N components of the generic feature vector F.
Used with PCA and DCT coefficients, ∆ and ∆∆ repre-
sent speed and acceleration in feature evolution. Both
∆ and ∆∆ have been added to PCA and DCT features.
Since optical flow already represents the speed of ROI
elements, only ∆ has been tested with it.
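Eq. (1) amounts to central differences along the frame axis, which can be written compactly as below. The edge handling (clamping the index at the first and last frame) is an assumption, since the paper does not state its border policy.

```python
import numpy as np

def deltas(F):
    """Compute Delta and DeltaDelta per Eq. (1) for features F of shape (frames, N)."""
    Fp = np.pad(F, ((1, 1), (0, 0)), mode="edge")   # clamp border frames (assumption)
    d = Fp[2:] - Fp[:-2]                            # Delta_k[i] = F_k[i+1] - F_k[i-1]
    dp = np.pad(d, ((1, 1), (0, 0)), mode="edge")
    dd = dp[2:] - dp[:-2]                           # DeltaDelta_k[i] = Delta_k[i+1] - Delta_k[i-1]
    return d, dd

# One feature rising linearly over 6 frames: interior Delta is a constant 2,
# so the interior DeltaDelta (acceleration) is 0.
F = np.arange(6, dtype=float).reshape(6, 1)
d, dd = deltas(F)
print(d.ravel())
```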
Optimal optical flow and PCA parameters have al-
ready been investigated and reported by the authors
for this particular dataset (Cappelletta and Harte,
2011). Results showed that increasing the PCA vector
length does not improve the recognition rate, with an
optimal value of N = 15. The best performance is
obtained using the ∆ and ∆∆ coefficients without
the original PCA data. Similarly, the best performance
with optical flow was achieved using the original
features with ∆ coefficients; in this case performance
is not affected by the downsampling configuration.
Thus, the 2 × 4 + ∆ configuration will be used
for experiments reported in this paper.
4 EXPERIMENT
4.1 VIDTIMIT Dataset
The VIDTIMIT dataset (Sanderson, 2008) is com-