In the physiological speech production process, as we speak, neural control signals are transmitted to the articulatory muscles, and the articulatory muscles contract and relax accordingly to produce voice. The muscle activity alters the electric potential along the muscle fibers, and the EMG method can measure this kind of potential change. In other words, articulatory muscle activity results in electric potential changes, which can be picked up by EMG electrodes for further signal processing, e.g., speech recognition. The EMG method is inherently robust to ambient noise because the EMG electrodes contact the human tissue directly, without an air-transmission channel. In addition, the EMG method has broader applicability because it makes it possible to recognize silent speech, i.e., mouthing words without making any sound.
For silent speech recognition with EMG, Manabe et al. first showed that it is possible to recognize five Japanese vowels and ten Japanese isolated digits using surface EMG signals recorded with electrodes pressed on the facial skin (Manabe et al., 2003; Manabe and Zhang, 2004). EMG has been a useful analytic tool in speech research since the 1960s (Fromkin and Ladefoged, 1966), and the recent application of surface EMG signals to automatic speech recognition was proposed by Chan et al., who focused on recognizing voice commands from jet pilots in noisy environments and therefore demonstrated digit recognition on normal audible speech (Chan et al., 2002). Jorgensen et al. proposed sub-auditory speech recognition using two pairs of EMG electrodes attached to the throat; sub-vocal isolated word recognition was demonstrated with various feature extraction and classification methods (Jorgensen et al., 2003; Jorgensen and Binsted, 2005; Betts and Jorgensen, 2006). Maier-Hein et al. reported non-audible EMG speech recognition focusing on speaker and session independence issues (Maier-Hein et al., 2005).
However, these pioneering studies are limited to small vocabularies, ranging from five to around forty isolated words. The main reason for this limitation is that the classification unit is restricted to a whole utterance instead of a smaller and more flexible unit such as a phone. As a standard practice in large vocabulary continuous speech recognition (LVCSR), the phone is a natural unit grounded in linguistic knowledge. From a pattern recognition point of view, the phone as a smaller unit is preferred over a whole utterance because each phone receives more training data per classification unit, allowing more reliable statistical inference. The phone unit is also more flexible, since phones can be concatenated into the pronunciation of any word, which yields a theoretically unlimited vocabulary for speech recognition, as illustrated by the sketch below. With phone units, EMG speech recognition can be treated as a standard LVCSR task, and any advanced LVCSR algorithm can be applied to improve the EMG speech recognizer.
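
To make the flexibility argument concrete, the following minimal sketch (our illustration, not part of the recognition system; the lexicon entries and the HMM placeholder are assumptions) shows how a small phone inventory composes word models for an open vocabulary:

    # Minimal sketch (illustration only): a pronunciation lexicon maps
    # words to phone sequences, so word models are built by concatenating
    # per-phone models instead of training one model per whole word.
    LEXICON = {                      # hypothetical entries
        "speech": ["S", "P", "IY", "CH"],
        "speak":  ["S", "P", "IY", "K"],
        "cheap":  ["CH", "IY", "P"],
    }

    def word_model(word):
        # Concatenate per-phone models (e.g., phone HMMs) into a word model.
        return "+".join(f"HMM({p})" for p in LEXICON[word])

    # Three words share one small phone inventory {S, P, IY, CH, K},
    # and every phone occurs in more than one training word.
    for w in LEXICON:
        print(w, "->", word_model(w))
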
In this paper, we introduce such an EMG speech recognition system with the following research aspects. First, we analyze the phone-based EMG speech recognition system with articulatory features and their relationship with the signals of different EMG channels. Next, we demonstrate the challenges of EMG signal processing from the perspective of feature extraction for the speech recognition system. We then describe our novel EMG feature extraction methods, which make the phone-based system possible. Lastly, we integrate the novel EMG feature extraction methods and the articulatory feature classifiers into the phone-based EMG speech recognition system with a stream architecture. Note that the experiments described in this paper are conducted on normal audible speech, not silently mouthed speech.
2 RESEARCH APPROACH
2.1 Data Acquisition
In this paper, we report results on data collected from one male speaker in one recording session, which means the EMG electrode positions were stable and consistent throughout the session. In a quiet room, the speaker read English sentences in normal audible speech, which was simultaneously recorded with a parallel setup of an EMG recorder and a USB soundcard with a standard close-talking microphone attached to it. When the speaker pressed the push-to-record button, the recording software started to record both the EMG and the speech channels and generated a marker signal fed into both the EMG recorder and the USB soundcard. The marker signal was then used for synchronizing the EMG and the speech signals.
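
As a minimal illustration of this alignment step (our own sketch, not the actual recording software; the threshold values and the int16 sample layout are assumptions), the shared marker onset can be located in each channel and both streams trimmed to it:

    import numpy as np

    def marker_onset(signal, threshold):
        # Index of the first sample whose magnitude exceeds the marker
        # threshold (the marker pulse is assumed to sit well above baseline).
        return int(np.flatnonzero(np.abs(signal) > threshold)[0])

    def synchronize(emg, audio, emg_thr=1000, audio_thr=10000):
        # Trim both channels so time zero is the shared marker onset.
        # emg:   int16 samples at 600 Hz containing the marker
        # audio: int16 samples at 16 kHz containing the same marker
        return (emg[marker_onset(emg, emg_thr):],
                audio[marker_onset(audio, audio_thr):])
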
The speaker read a set of 38 phonetically balanced sentences 10 times and a set of 12 sentences from news articles 10 times. The 380 phonetically balanced utterances were used for training and the 120 news article utterances were used for testing. The total durations of the training and test sets are 45.9 and 10.6 minutes, respectively. We also recorded ten special silence utterances, each about five seconds long. The speech recording format is 16 kHz sampling rate, two bytes per sample, linear PCM, while the EMG recording format is 600 Hz sampling rate, two bytes per sample, linear PCM. The speech was recorded with a Sennheiser HMD 410 close-talking headset.
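
For concreteness, such headerless linear PCM recordings could be loaded as follows (a sketch; the file names and the assumption that the files carry no header are ours):

    import numpy as np

    FS_AUDIO = 16000  # speech: 16 kHz, 16-bit linear PCM
    FS_EMG = 600      # EMG:    600 Hz, 16-bit linear PCM

    def load_pcm(path):
        # Read headerless 16-bit little-endian linear PCM samples.
        return np.fromfile(path, dtype="<i2")

    audio = load_pcm("utt001_audio.pcm")  # hypothetical file names
    emg = load_pcm("utt001_emg.pcm")
    print(len(audio) / FS_AUDIO, "s of speech,",
          len(emg) / FS_EMG, "s of EMG")
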