matrix is computed session-wise on the training data,
resulting in a set of ICA components. We use the Infomax ICA algorithm (Bell and Sejnowski, 1995), as implemented in the Matlab EEGLAB toolbox (Delorme and Makeig, 2004), to compute the ICA decomposition. For a thorough introduction to the theory of Independent Component Analysis, we refer the reader to (Cardoso, 1998) and (Hyvärinen and Oja, 2000). For the subsequent artifact removal, (Wand et al., 2013a) introduced two methods:
• The direct method means that artifact components
are removed, and features are extracted on the re-
maining ICA components.
• The back-projection method consists of taking
the ICA decomposition, setting detected artifact
channels to zero, and then applying the inverse of
the ICA transformation. This “back-projects” the
signal representation into its original domain, but
suppresses the detected noise. Features are then
extracted on the back-projected data.
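The two variants above can be sketched as follows. This is a minimal illustration, assuming an ICA unmixing matrix W and multi-channel EMG data X of shape (channels, samples); the function names and the `artifacts` index list are illustrative, not part of the original Matlab/EEGLAB toolchain.

```python
import numpy as np

def direct_method(X, W, artifacts):
    """Drop detected artifact components; features are later
    extracted on the remaining ICA components."""
    S = W @ X                       # ICA components (components x samples)
    keep = [i for i in range(S.shape[0]) if i not in artifacts]
    return S[keep, :]               # remaining components only

def back_projection_method(X, W, artifacts):
    """Zero the detected artifact components, then apply the inverse
    of the ICA transformation to return to the channel domain."""
    S = W @ X
    S[artifacts, :] = 0.0           # suppress detected noise sources
    return np.linalg.inv(W) @ S     # back-project into the original domain
```

With an empty artifact list, the back-projection variant reproduces the original signal, which makes the "suppress noise, keep the domain" intent of the method explicit.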
We compare our results with two baseline systems: first, a system without any ICA application or artifact removal; second, a system in which we perform the ICA decomposition but do not remove any components. In all cases, features are extracted on each
channel or component separately. We use the time-
domain feature extraction proposed by (Jou et al.,
2006) and also used by (Wand et al., 2013a).
For any given frame $f$, $\bar{f}$ is its frame-based time-domain mean, $P_f$ is its frame-based power, and $z_f$ is its frame-based zero-crossing rate.
For an EMG signal with normalized mean x[n],
we obtain a low-pass filtered signal w[n] by using a
double nine-point moving average:
$$ w[n] = \frac{1}{9} \sum_{k=-4}^{4} v[n+k] \qquad (1) $$

where

$$ v[n] = \frac{1}{9} \sum_{k=-4}^{4} x[n+k]. \qquad (2) $$
The complementary high-frequency signal is p[n] =
x[n] − w[n], and the rectified high-frequency signal is
r[n] = |p[n]|.
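The double nine-point moving average and the derived signals can be sketched in a few lines. One assumption here: samples outside the signal are treated as zero by the convolution's border handling, which slightly affects the first and last eight samples.

```python
import numpy as np

def decompose(x):
    """Split a normalized-mean EMG signal x[n] into a low-pass part w[n]
    (double nine-point moving average, Eqs. (1) and (2)), the complementary
    high-frequency part p[n], and its rectified version r[n]."""
    kernel = np.ones(9) / 9.0
    v = np.convolve(x, kernel, mode="same")   # first nine-point average, Eq. (2)
    w = np.convolve(v, kernel, mode="same")   # second average -> w[n], Eq. (1)
    p = x - w                                 # complementary high-frequency signal
    r = np.abs(p)                             # rectified high-frequency signal
    return w, p, r
```

By construction, w + p reconstructs the input exactly, so the decomposition loses no information; only the feature extraction that follows is lossy.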
Let S(f, n) denote the stacking of adjacent frames of feature f with a context of 2n + 1 frames (from −n to n). The feature TDn, for one EMG channel or ICA component, is then defined as follows:
$$ \mathrm{TD}_n = S(\mathrm{TD}_0, n), \qquad (3) $$

where

$$ \mathrm{TD}_0 = [\bar{w},\, P_w,\, P_r,\, z_p,\, \bar{r}], \qquad (4) $$

i.e., a stacking of adjacent feature vectors with context width $2n + 1$ is performed, with varying $n$. Finally,
the combination of all channel-wise feature vectors
yields the TDn feature vector. Frame size and frame
shift are set to 27 ms and 10 ms, respectively.
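The framing, the per-frame TD0 features, and the context stacking S(·, n) can be sketched as below. Frame size and shift are given here in samples rather than milliseconds, since the sampling rate is not stated in this section; power is taken as the mean squared amplitude, and context stacking pads at the boundaries by repeating the edge frames. All of these are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, size, shift):
    """Cut a signal into overlapping frames (size/shift in samples)."""
    n = 1 + (len(x) - size) // shift
    return np.stack([x[i * shift : i * shift + size] for i in range(n)])

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with a sign change."""
    return np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))

def td0(w, p, r, size, shift):
    """TD0 = [w mean, w power, r power, p zero-crossing rate, r mean]
    per frame, as in Eq. (4)."""
    W, P, R = (frame_signal(s, size, shift) for s in (w, p, r))
    return np.column_stack([
        W.mean(axis=1),                                   # frame mean of w
        (W ** 2).mean(axis=1),                            # frame power of w
        (R ** 2).mean(axis=1),                            # frame power of r
        np.array([zero_crossing_rate(f) for f in P]),     # ZCR of p
        R.mean(axis=1),                                   # frame mean of r
    ])

def stack_context(F, n):
    """S(F, n): append the n preceding and n following frames, Eq. (3)."""
    padded = np.pad(F, ((n, n), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + len(F)] for i in range(2 * n + 1)])
```

For a TD0 matrix with 5 features per frame, stack_context with context n yields 5 · (2n + 1) coefficients per frame, which is the dimensionality the subsequent PCA step has to cope with.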
After this step, we apply Principal Component
Analysis (PCA) on the resulting extended feature vec-
tors, reducing their dimensionality to 700. This step
is followed by Linear Discriminant Analysis (LDA)
to obtain a final feature vector with 32 coefficients.
(Wand et al., 2013b) showed that the PCA step is
necessary in order to obtain robust results: For a
small amount of training data relative to the sam-
ple dimensionality, the LDA within-scatter matrix
becomes sparse (Qiao et al., 2009), which causes
the LDA computation to become inaccurate.¹ As
LDA is a supervised method, we need to assign
classes to every feature vector of the training set. An acoustic speech recognizer is used to align the most likely sequence of sub-phonemes to the audio recorded simultaneously with the EMG data; these sub-phonemes then serve as classes for the EMG training data, between which LDA maximizes discriminability. In total, 136 different classes are used.
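A minimal numpy sketch of the PCA-then-LDA chain is given below. The dimensions are scaled down from the paper's values (PCA to 700, LDA to 32 coefficients, 136 classes) so that toy data suffices, and the helper functions are illustrative rather than the original implementation.

```python
import numpy as np

def pca_reduce(X, k):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def lda_reduce(X, y, k):
    """Maximize w^T S_B w / (w^T S_W w); keep the top-k discriminant axes."""
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-scatter matrix
    Sb = np.zeros_like(Sw)                    # between-scatter matrix
    for c in np.unique(y):
        Xc = X[y == c]
        Xc_centered = Xc - Xc.mean(axis=0)
        Sw += Xc_centered.T @ Xc_centered
        d = Xc.mean(axis=0) - mu
        Sb += len(Xc) * np.outer(d, d)
    # Generalized eigenvalue problem via S_W^{-1} S_B
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:k]
    return X @ evecs[:, order].real

rng = np.random.default_rng(0)
y = rng.integers(0, 10, 500)             # paper: 136 sub-phoneme classes
X = rng.standard_normal((500, 50))       # paper: high-dimensional TDn vectors
X[:, 0] += 2.0 * y                       # inject class-dependent structure
Z = lda_reduce(pca_reduce(X, 20), y, 5)  # paper: PCA to 700, then LDA to 32
```

Running PCA first ensures that S_W in the LDA step is computed in a space whose dimensionality is small relative to the number of training samples, which is exactly the robustness argument made above.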
3.2 Training and Decoding
We perform EMG-based continuous speech recogni-
tion. For this purpose, models of words or utterances
must be constructed from smaller units. While in con-
ventional acoustic speech recognition, these units are
normally context-dependent subphones (Lee, 1989),
we follow (Schultz and Wand, 2010) and use Bundled Phonetic Features (BDPFs) as the foundation for
our modeling. Phonetic Features represent proper-
ties of phones, like the place or manner of articula-
tion. Phonetic feature bundling means that dependen-
cies between these features are taken into account.
Each such BDPF model is represented by a mixture of Gaussians. The knowledge from the different phonetic features is merged using a multi-stream model (Metze and Waibel, 2002; Jou et al., 2007).
Otherwise, our recognizer follows a standard pat-
tern. We use three-state left-to-right fully continuous
Hidden Markov Models (HMM), where the emission
¹ LDA essentially consists of the maximization problem $\frac{w^T S_B w}{w^T S_W w}$, where $S_W$ is the within-scatter matrix and $S_B$ is the between-scatter matrix. The optimization is performed by means of an eigenvalue analysis. Numerical instability arises when the denominator of the above fraction is singular, which happens if $S_W$ has zero eigenvalues. Note that for the PCA computation, this is not a problem since for PCA, one maximizes a single term $w^T C w$ ($C$ is the sample covariance matrix) instead of a fraction, and all samples are used for covariance estimation.
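The rank deficiency described in the footnote is easy to reproduce numerically: with fewer samples than dimensions, the within-scatter matrix necessarily has zero eigenvalues. The dimensions below are arbitrary toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_per_class, n_classes = 30, 5, 2    # 10 samples << 30 dimensions
X = rng.standard_normal((n_per_class * n_classes, dim))
y = np.repeat(np.arange(n_classes), n_per_class)

# Within-scatter matrix S_W: sum of per-class scatter matrices
Sw = np.zeros((dim, dim))
for c in np.unique(y):
    Xc = X[y == c] - X[y == c].mean(axis=0)
    Sw += Xc.T @ Xc

# Each centered class block contributes rank at most n_per_class - 1,
# so rank(S_W) <= n_samples - n_classes = 8 < 30: S_W is singular, and
# the LDA denominator w^T S_W w vanishes for some directions w.
rank = np.linalg.matrix_rank(Sw)
```

This is why the preceding PCA step matters: projecting to a dimensionality well below the number of training samples restores a full-rank S_W before LDA is applied.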
Spatial Artifact Detection for Multi-channel EMG-based Speech Recognition