preprocessing (Wand et al., 2007), and advances in
acoustic modeling using articulatory features in com-
bination with phone models (Jou et al., 2006a). How-
ever, the described experiments were based on rela-
tively small amounts of data, and consequently were
limited to speaker-dependent modeling schemes. In (Maier-Hein et al., 2005), first results on EMG recognition across recording sessions were reported; however, these experiments were run on a small vocabulary of only 10 isolated words.
This paper reports, for the first time, EMG-based recognition results on continuously spoken speech, comparing speaker-dependent, speaker-adaptive, and speaker-independent acoustic models. We investigate different signal preprocessing methods and the potential of model adaptation. For this purpose, we first develop generic speaker-independent acoustic models based on a large amount of training data from many speakers, and then adapt these models using a small amount of speaker-specific data.
The baseline performance of the speaker-dependent EMG recognizer is 32% WER on a testing vocabulary of 108 words (Jou et al., 2006b). The training data of this baseline recognizer consisted of 380 phonetically balanced sentences from a single speaker, about 10 times more data than the training set we use for the speaker-dependent systems reported in this paper (see below for details on the training data).
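All recognition results in this paper are given as word error rates. For reference, the following is a minimal sketch of the standard WER metric, computed as the word-level Levenshtein distance between reference and hypothesis, normalized by the reference length; the function name and interface are ours for illustration and are not part of the recognizer described here.

```python
# Minimal sketch of the standard word error rate (WER) metric;
# naming and interface are illustrative assumptions, not from the paper.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```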
The paper is organized as follows: In section 2, we describe the data corpus and the method of data acquisition. In section 3, we explain the setup of the EMG recognizer, the feature extraction methods, and the different training and adaptation variants. In section 4, we present the recognition accuracies of the different methods, and section 5 concludes the paper.
2 DATA ACQUISITION
For data acquisition, 13 speakers were recorded. Each speaker recorded two sessions, separated by a break of about 60-90 minutes during which the electrodes were not removed. The recordings were collected as part of a psychobiological study investigating the effects of psychological stress on laryngeal function and voice in vocally normal participants (Dietrich, 2008; Dietrich and Abbott, 2007). The sentence recordings were obtained at the beginning and at the very end of the stress reactivity protocol. Participants were recruited at the University of Pittsburgh, Carnegie Mellon University, and Chatham University for a speech recognition study, but were also confronted with an impromptu public speaking task.
One session consisted of the recording of 100 sen-
tences, half of which were read audibly, as in normal
speech, while the other half were mouthed silently,
without producing any sound. In order to obtain com-
parable results to previous work, we report recogni-
tion results from the audibly spoken sentences only.
Each block of audible and mouthed utterances contained two kinds of sentences: 40 individual sentences, which were distinct across speakers, and 10 “base” sentences, which were identical for all speakers. We used the individual sentences for training and the “base” sentences as the test set.
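As an illustration of this partition, the following hypothetical sketch splits one audible block into training and test sets; the data structures and identifiers are our assumptions, not taken from the paper.

```python
# Hypothetical sketch of the per-session train/test split described above;
# sentence representation and labels are illustrative assumptions.
def split_session(sentences):
    """sentences: list of (sentence_id, kind) pairs, where kind is
    "individual" (distinct across speakers) or "base" (shared)."""
    train = [sid for sid, kind in sentences if kind == "individual"]
    test = [sid for sid, kind in sentences if kind == "base"]
    assert len(train) == 40 and len(test) == 10  # per audible block
    return train, test
```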
The corpus of audible utterances had the follow-
ing properties:
Speakers: 13 female speakers, aged 18-35 years, with no known voice disorders
Sessions: 2 sessions per speaker
Average length (total): 231 seconds per session
Average length (training set): 179 seconds
Average length (test set): 52 seconds
Domain: Broadcast News
Decoding vocabulary: 101 words
The total duration of all audible recordings was
approximately 100 minutes (77.5 minutes training set,
22.5 minutes test set).
Within each session, “base” and individual sentences were recorded in randomized order.
In order to compare our results with previous
work, we additionally use the data set reported in (Jou
et al., 2006b), which consists of a training set of 380
phonetically balanced sentences and a test set of 120
sentences with a duration of 45.9 and 10.6 minutes,
respectively.
This results in a corpus of 14 speakers, where
speaker 14 (with only one session) corresponds to the
speaker from (Jou et al., 2006b) described above and
is treated separately. In the results section, a result denoted, e.g., 3-2 refers to speaker 3 (out of 14), session 2 (out of 2).
The EMG signals were recorded with six pairs of Ag/AgCl electrodes attached to the speaker’s skin, capturing the signals of the articulatory muscles, namely the levator anguli oris, the zygomaticus major, the platysma, the orbicularis oris, the anterior belly of the digastric, and the tongue. The signal obtained from the orbicularis oris eventually proved unstable and was dropped from the final experiments. The EMG signals were sampled at 600 Hz and filtered with a 300 Hz low-pass and a 1 Hz high-pass filter.
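As a concrete illustration, the following is a minimal SciPy-based sketch of the stated preprocessing. The paper does not specify the filter order or implementation, so the fourth-order Butterworth design below is an assumption; since the 300 Hz low-pass coincides with the Nyquist limit at 600 Hz sampling, it is assumed here to be an analog anti-aliasing stage, and only the 1 Hz high-pass is applied digitally.

```python
# Hedged sketch of the stated EMG filtering, assuming SciPy; filter order
# and zero-phase filtering are our assumptions, not specified in the paper.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 600.0  # sampling rate in Hz, as stated above

def highpass_1hz(emg_channel, order=4):
    """Remove DC offset and baseline drift below 1 Hz (zero-phase)."""
    sos = butter(order, 1.0, btype="highpass", fs=FS, output="sos")
    return sosfiltfilt(sos, emg_channel)

# Example: filter a (channels x samples) recording channel by channel.
emg = np.random.randn(5, int(FS * 10))  # placeholder: 10 s, 5 channels
filtered = np.vstack([highpass_1hz(ch) for ch in emg])
```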