other persons present in the same room, or due to the
variations of physical parameters (e.g. temperature).
In such a case, the adaptive processing has the
additional task of tracking the variation of the
propagation channel. In this respect, it has been
established that adaptive algorithms that exhibit good
convergence properties in stationary environments do
not necessarily provide good tracking performance
in a non-stationary environment, because the
convergence behavior of an adaptive filter is a transient
phenomenon, whereas the tracking behavior is a
steady-state property (Haykin, 2002; Triki, 2009).
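The distinction between convergence and tracking can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it uses a normalized LMS (NLMS) filter, which is not necessarily the algorithm evaluated in this paper, and the echo path, filter length, step size and random-walk drift model are all hypothetical choices. It compares the steady-state error for a fixed echo path with that for a slowly drifting one.

```python
import random

def nlms_steady_state_mse(drift_std, n=4000, mu=0.5, eps=1e-8, seed=0):
    """Run NLMS on a noise-free echo path and return the mean squared
    error over the last 1000 samples (a steady-state measure).

    drift_std is the per-sample standard deviation of a random walk
    applied to the echo path; 0.0 gives a stationary path.
    All numerical values here are hypothetical illustration choices.
    """
    rng = random.Random(seed)
    h = [0.8, -0.4, 0.2, 0.1]       # hypothetical initial echo path
    num_taps = len(h)
    w = [0.0] * num_taps            # adaptive filter coefficients
    buf = [0.0] * num_taps          # recent input samples, newest first
    sq_errors = []
    for _ in range(n):
        # The true echo path performs a random walk (time-varying channel).
        h = [hi + rng.gauss(0.0, drift_std) for hi in h]
        # White Gaussian far-end signal.
        buf = [rng.gauss(0.0, 1.0)] + buf[:-1]
        d = sum(hi * bi for hi, bi in zip(h, buf))   # echo at the microphone
        y = sum(wi * bi for wi, bi in zip(w, buf))   # echo estimate
        e = d - y                                    # a-priori error
        norm = eps + sum(b * b for b in buf)
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        sq_errors.append(e * e)
    tail = sq_errors[-1000:]
    return sum(tail) / len(tail)

static_mse = nlms_steady_state_mse(drift_std=0.0)
drifting_mse = nlms_steady_state_mse(drift_std=0.01)
print("stationary path, steady-state MSE:", static_mse)
print("drifting path, steady-state MSE:  ", drifting_mse)
```

With a stationary path the residual error decays to numerical precision once the filter has converged, whereas with a drifting path a lag error persists however long the filter runs. This persistent error is what makes tracking a steady-state property.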
In this paper, we address some issues related to
the performance evaluation of echo-cancellation in
time-varying environments. Generally, experimen-
tal evaluation should produce meaningful, consistent
and interpretable results. In time-varying environ-
ments, the assessment is particularly challenging and
the ‘experiment’ design is crucial. On the one hand,
mimicking the user experience (moving user or
capturing device) is difficult to reproduce and to
interpret. On the other hand, simulating the impulse
responses offers flexibility and reproducibility, but
gives results that are difficult to extrapolate to
real-world environments. As an alternative, we
reproduce the time-varying artifacts by altering the
surrounding acoustic environment (while keeping the
source and capturing devices fixed). These alterations
are introduced by a moving person or robot. The
movement characteristics (discrete/continuous) and
location (line-of-sight/background) represent degrees
of freedom that emphasize various room and algorithm
characteristics and provide deeper insight into the
system behavior.
The remainder of this paper is organized as fol-
lows. In section 2, the experimental setup used for
the data acquisition and performance analysis is de-
scribed. Acoustic echo cancellation and noise sup-
pression building blocks are investigated in sections 3
and 4 respectively. Finally, a discussion and conclud-
ing remarks are provided in section 5.
2 EXPERIMENTAL SETUP
Throughout this paper, we evaluate different speech
preprocessing schemes in order to isolate their per-
formance impact on the recognition rate and motivate
further refinements. In the following, we present the
experimental setup that supports this step-by-step
development of the speech preprocessing design.
Namely, we describe the data collection procedure and
specify the characteristics of our data recording space
(reproducing a living-room environment).
2.1 Data Collection
Defining a formal data collection process is necessary,
as it ensures that gathered data is both defined and ac-
curate and that subsequent findings and decisions are
valid. The aim of the present work is to investigate
the effect of the extrinsic variabilities (noise, rever-
beration, interference) on the recognition rate. Thus,
the data should be collected in such a way as to reduce
the effect of intrinsic variabilities (which may bias the
final conclusions). Specifically, particular attention
was paid to:
• Linguistic accent: we have chosen North Ameri-
can native speakers (American or Canadian). The
choice was motivated by the fact that our recog-
nition system (that we use for the evaluation) was
trained (optimized) for this particular accent.
• Speech rate changes: the variation of the speech
rate was mitigated with a two-step simulation
approach: first we collect the input data, then
the various tasks are reproduced using a dummy
head.
• Additive noise: the data collection was performed
in a noise-free and low-reverberant environment.
North American native speakers (4 male, 1 female)
were asked to participate in the data collection
process. Two dictionaries were defined:
• Controls dictionary, e.g., ‘switch on’, ‘is there any
sport program tonight’.
• Artist names dictionary, e.g., ‘Madonna’, ‘Tokio
Hotel’, ‘Laura Pausini’...
The recordings were performed in a noise-free and
low-reverberant room (see Figure 2). The speakers
were seated in a comfortable chair and read a list of
items aloud, one by one. The items were displayed
using a PowerPoint presentation at a constant rate
(12 items per minute). The speech signal was
captured at 48 kHz.
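The constant display rate and the sampling rate fix the length of each item slot: 60/12 = 5 s, i.e. 240,000 samples at 48 kHz. The sketch below turns this arithmetic into a helper that returns the sample range of a given item. It assumes that the recording starts exactly when the first item is displayed and that the slots are contiguous. This is an illustrative assumption, not the segmentation procedure actually used, and the function name is hypothetical.

```python
SAMPLE_RATE_HZ = 48_000    # capture rate stated in the text
ITEMS_PER_MINUTE = 12      # display rate stated in the text

def item_window(index):
    """Return (start, stop) sample indices of the slot in which item
    `index` (0-based) was displayed, assuming contiguous slots that
    begin at sample 0 (hypothetical alignment)."""
    samples_per_item = SAMPLE_RATE_HZ * 60 // ITEMS_PER_MINUTE
    start = index * samples_per_item
    return start, start + samples_per_item

print(item_window(0))
print(item_window(2))
```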
2.2 Data Recording
We have investigated the recognition accuracy in a
living-room environment. The recordings were carried
out in a four-by-six-meter demonstration room (see
Figure 3 and Figure 5 for a schematic representation).
The room reverberation time is $T_{60} \approx 300$ ms.
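The text does not state how the reverberation time was measured. As one common possibility, the sketch below estimates the reverberation time from an impulse response using Schroeder backward integration: a line is fitted to the energy decay curve between -5 dB and -25 dB and extrapolated to -60 dB. The fitting range, the function names and the synthetic impulse response (exponentially decaying Gaussian noise with a nominal 300 ms decay) are assumptions made for illustration. The synthetic response is not the response of the demonstration room.

```python
import math
import random

def schroeder_t60(h, fs, start_db=-5.0, stop_db=-25.0):
    """Estimate the reverberation time (s) of impulse response h sampled
    at fs Hz from the slope of the backward-integrated energy decay."""
    # Energy decay curve: energy remaining after each sample.
    edc = [0.0] * len(h)
    acc = 0.0
    for i in range(len(h) - 1, -1, -1):
        acc += h[i] * h[i]
        edc[i] = acc
    total = edc[0]
    # Keep the (time, level) points that fall inside the fitting range.
    pts = []
    for i, energy in enumerate(edc):
        if energy <= 0.0:
            break
        level = 10.0 * math.log10(energy / total)
        if stop_db <= level <= start_db:
            pts.append((i / fs, level))
    # Least-squares slope of level (dB) versus time (s).
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_l = sum(l for _, l in pts) / len(pts)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))
    return -60.0 / slope

def synthetic_rir(t60, fs, duration, seed=0):
    """Hypothetical impulse response: Gaussian noise whose amplitude
    falls by 60 dB after t60 seconds."""
    rng = random.Random(seed)
    decay = 3.0 * math.log(10.0) / t60
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * i / fs)
            for i in range(int(duration * fs))]

fs = 48_000
rir = synthetic_rir(t60=0.3, fs=fs, duration=1.0)
estimate = schroeder_t60(rir, fs)
print("estimated reverberation time (s):", estimate)
```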
In order to account for speech rate variabilities,
the control/search commands (recorded during the
data collection phase) were reproduced by a KEMAR
(Knowles Electronics Manikin for Acoustic Research).
The KEMAR was placed at a distance of 3 meters
from the TV set. The audio signal was captured