the LPC coefficients are computed. The first formant is then extracted from these coefficients using the Burg algorithm described in (Childers, 1978).
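As a rough illustration of this step, the following sketch estimates the first formant (F1) of a single speech frame from LPC coefficients; the use of librosa (whose lpc routine implements Burg's method), the LPC order, and the helper name first_formant are illustrative assumptions, not a description of our exact implementation.

    import numpy as np
    import librosa

    def first_formant(frame, sr, order=8):
        # Estimate F1 of one speech frame. librosa.lpc implements
        # Burg's method; the order and library choice are assumptions.
        a = librosa.lpc(frame.astype(float), order=order)  # LPC coefficients
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]       # one root per conjugate pair
        freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
        return freqs[0]                         # lowest resonance taken as F1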
In previous work (Sebe et al., 2004), syllable rate was used as a prosody feature. In our work, however, the audio data consists of spoken words as well as non-speech sounds, e.g. exclamations, gasps or humming, which we want to model for automatic problem detection, and our speech recognizer had considerable difficulty computing an accurate syllable rate. Of the 219 utterances processed by the speech recognizer, 97 have an incorrect number of hypothesized vowel phones. On average, these incorrectly recognized utterances contain 2.73 more syllables than hypothesized.
4 MULTIMODAL DETECTION OF SYSTEM ERRORS
We explore different techniques to detect communication errors from the sequences of audio-visual features estimated in Section 3.2. First, we describe unimodal classification models, followed by the multimodal fusion strategies we tested.
4.1 Unimodal Classification Methods
We want to map an observation sequence $\mathbf{x}$ to class labels $y \in \mathcal{Y}$, where $\mathbf{x}$ is a vector of $t$ consecutive observations, $\mathbf{x} = \{x_1, x_2, \ldots, x_t\}$. In our case, the local observation $x_t$ can be an audio feature $A_f$ or a visual feature $V_f$.
To detect communication errors, it is important to learn the sequential dynamics of these observations. Hidden Markov Models (HMMs) (Rabiner, 1989) are well-known generative probabilistic sequence models that capture such dynamics; Hidden Conditional Random Fields (HCRFs) (Quattoni et al., 2004; Wang et al., 2006) are discriminative analogs that have recently been introduced for gesture recognition. We compare both techniques in our experiments below; classifiers taking a single observation as input performed poorly in preliminary experiments and were therefore excluded.
Hidden Markov Models (HMM) - We trained one HMM per communication state. During evaluation, test sequences were passed through each of these models, and the model with the highest likelihood was selected as the recognized communication state. The HMM is a generative, sequential model with hidden states; more details are given in (Rabiner, 1989).
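A minimal sketch of this per-class training and maximum-likelihood selection, assuming the hmmlearn library and Gaussian emissions (the state and mixture counts are tuned as described at the end of this subsection):

    import numpy as np
    from hmmlearn import hmm

    def train_hmms(sequences_by_state, n_components=3):
        # One HMM per communication state; hmmlearn and Gaussian
        # emissions are assumptions of this sketch.
        models = {}
        for state, seqs in sequences_by_state.items():
            X = np.vstack(seqs)               # stacked feature frames
            lengths = [len(s) for s in seqs]  # per-sequence lengths
            m = hmm.GaussianHMM(n_components=n_components,
                                covariance_type="diag")
            m.fit(X, lengths)
            models[state] = m
        return models

    def classify(models, seq):
        # Select the communication state whose model assigns the
        # highest log-likelihood to the test sequence.
        return max(models, key=lambda state: models[state].score(seq))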
Hidden Conditional Random Fields (HCRF) - The HCRF is a model that was recently introduced for the recognition of observation sequences (Quattoni et al., 2004). We briefly describe it here.
An HCRF models the conditional probability of a class label given an observation sequence by:
$$P(y \mid \mathbf{x}, \theta) = \sum_{\mathbf{s}} P(y, \mathbf{s} \mid \mathbf{x}, \theta) = \frac{\sum_{\mathbf{s}} e^{\Psi(y, \mathbf{s}, \mathbf{x}; \theta)}}{\sum_{y' \in \mathcal{Y},\, \mathbf{s} \in \mathcal{S}^m} e^{\Psi(y', \mathbf{s}, \mathbf{x}; \theta)}} \quad (4)$$
where $\mathbf{s} = \{s_1, s_2, \ldots, s_m\}$, each $s_i \in \mathcal{S}$ captures certain underlying structure of each class, and $\mathcal{S}$ is the set of hidden states in the model. If we assume that $\mathbf{s}$ is observed and that there is a single class label $y$, then the conditional probability of $\mathbf{s}$ given $\mathbf{x}$ becomes a regular CRF. The potential function $\Psi(y, \mathbf{s}, \mathbf{x}; \theta) \in \mathbb{R}$, parameterized by $\theta$, measures the compatibility between a label, the observation sequence, and the configuration of the hidden states.
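To make Eq. (4) concrete, the toy sketch below evaluates it by brute-force enumeration over hidden-state sequences, assuming a simple linear-chain potential $\Psi$ with hypothetical observation and transition weight tables ('obs', 'trans'); practical HCRF implementations instead use belief propagation over the chain.

    import itertools
    import numpy as np

    def hcrf_posterior(x, theta, n_labels, n_hidden):
        # Brute-force Eq. (4). x: (t, d) observation sequence.
        # theta['obs']:   (n_labels, n_hidden, d) observation weights
        # theta['trans']: (n_labels, n_hidden, n_hidden) transition weights
        # Both tables are hypothetical; Psi is a linear-chain potential.
        t = len(x)
        def psi(y, s):
            score = sum(theta['obs'][y, s[i]] @ x[i] for i in range(t))
            score += sum(theta['trans'][y, s[i - 1], s[i]]
                         for i in range(1, t))
            return score
        # Log-sum-exp over all hidden-state sequences, per class label.
        scores = np.array([
            np.logaddexp.reduce(
                [psi(y, s)
                 for s in itertools.product(range(n_hidden), repeat=t)])
            for y in range(n_labels)])
        scores -= np.logaddexp.reduce(scores)  # normalize over y'
        return np.exp(scores)                  # P(y | x, theta)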
In our paper, the local observations are the visual features, $V_f$, or the audio features, $A_f$. We trained a single two-class HCRF. Test sequences were run through this model, and the communication state class with the highest probability was selected as the recognized error state.
For the HMM, the number of Gaussian mixtures and hidden states was set by minimizing the error on the training data. For the HCRF, the number of hidden states was set in a similar fashion.
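A sketch of this selection procedure, reusing the hypothetical train_hmms and classify helpers from the HMM sketch above (the candidate range is an assumption):

    def select_n_components(sequences_by_state, candidates=(2, 3, 4, 5)):
        # Pick the state count that minimizes error on the training data.
        def train_error(n):
            models = train_hmms(sequences_by_state, n_components=n)
            pairs = [(state, seq)
                     for state, seqs in sequences_by_state.items()
                     for seq in seqs]
            return sum(classify(models, seq) != state
                       for state, seq in pairs)
        return min(candidates, key=train_error)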
4.2 Multimodal Fusion Strategies
We have a choice between early and late fusion when combining the audio and visual modalities. In early fusion, we model the audio and visual features in a single joint feature space and use the joint feature to train a single classifier. In late fusion, we train a classifier on each modality separately and merge the outputs of the classifiers. As illustrated in Figure 1, our communication error detection has two different modes: in b. we use visual features only for error detection, and in c. we use both audio and visual features. The visual-only mode in b. requires us to train a classifier using a single input stream. In addition, training classifiers on individual streams is a simpler process. As such, we choose late fusion, i.e. fusing the outputs of two classifiers. We use two common late-fusion strategies described in (Kittler et al., 1998).
Let the feature input to the $j$-th classifier, $j = 1, \ldots, R$, be $x_j$, and let the winning label be $h$. A uniform prior across all classes is assumed.
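Assuming the two strategies are the sum and product rules analyzed in (Kittler et al., 1998), a minimal sketch of this late fusion (with the uniform prior folded away):

    import numpy as np

    def late_fusion(posteriors, rule="sum"):
        # posteriors: (R, n_classes); row j is classifier j's P(y | x_j).
        # With a uniform class prior, the product and sum rules of
        # (Kittler et al., 1998) reduce to the forms below.
        posteriors = np.asarray(posteriors)
        if rule == "product":
            scores = posteriors.prod(axis=0)   # product rule
        elif rule == "sum":
            scores = posteriors.sum(axis=0)    # sum rule
        else:
            raise ValueError(rule)
        return int(scores.argmax())            # winning label h

    # e.g. audio vs. visual classifier outputs for two communication states:
    # late_fusion([[0.3, 0.7], [0.6, 0.4]], rule="sum") -> 1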