To train our proposed system, we use the clean
speech datasets included in the Edinburgh Datasets
(Botinhao, 2017), which comprise audio recordings
of 28 English speakers (14 men and 14 women),
sampled at 48 kHz. For the noise audio, we
used a subset of the noise environments available in
the DEMAND datasets (Thiemann et al., 2013).
The noise environments used for training were then
excluded from the test set. The DEMAND datasets include noise
recordings corresponding to six distinct acoustic
scenes (Domestic, Nature, Office, Public, Street and
Transportation), which are further subdivided in mul-
tiple more specific noise sources (Thiemann et al.,
2013). Note that while we used clean speech and noise
included in the Edinburgh Datasets, the samples used
for training the systems are not the noisy samples
found in the noisy speech subset of the Edinburgh Da-
tasets, but rather samples mixed using the method de-
scribed in (Valin, 2018).
Model generalization is achieved by data augmen-
tation. Since cepstral mean normalization is not
applied, the speech and noise signals are filtered inde-
pendently for each training example through a second-
order filter of the form
H(z) = (1 + r₁z⁻¹ + r₂z⁻²) / (1 + r₃z⁻¹ + r₄z⁻²)    (6)

where each of r₁, …, r₄ follows the uniform dis-
tribution in the [-3/8, 3/8] range. We vary the final
mixed signal level to make the system robust to the
level of the input noisy speech. The amount of noise
attenuation is also limited to obtain a better trade-off
between noise removal and speech distortion.
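As a minimal sketch of this augmentation step (the function and variable names below are our own illustration, not the authors' code; the signals are random stand-ins):

```python
import numpy as np
from scipy.signal import lfilter

def random_biquad(x, rng, r_max=3 / 8):
    # Second-order filter of Eq. (6):
    # H(z) = (1 + r1 z^-1 + r2 z^-2) / (1 + r3 z^-1 + r4 z^-2),
    # with each r_i drawn uniformly from [-r_max, r_max].
    r1, r2, r3, r4 = rng.uniform(-r_max, r_max, size=4)
    b = [1.0, r1, r2]  # numerator (feed-forward) coefficients
    a = [1.0, r3, r4]  # denominator (feedback) coefficients
    return lfilter(b, a, x)

rng = np.random.default_rng(0)
speech = rng.standard_normal(480)  # stand-in for a clean speech excerpt
noise = rng.standard_normal(480)   # stand-in for a noise excerpt
# Speech and noise are filtered independently for each training example.
aug_speech = random_biquad(speech, rng)
aug_noise = random_biquad(noise, rng)
```

With |r₃|, |r₄| ≤ 3/8 the filter poles stay inside the unit circle, so each randomly drawn filter is stable.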
The test set used is the one provided in the Edin-
burgh Datasets, which has been specifically created
for SE applications and consists of wide-band (48
kHz) clean and noisy speech audio tracks. The noisy
speech in the set contains four different SNR levels
(2.5 dB, 7.5 dB, 12.5 dB, 17.5 dB). The clean speech
tracks included in the set are recordings of two Eng-
lish language speakers, a male and a female. As for
the noise recordings that were used in the mixing of
the noisy speech tracks, those were selected from the
DEMAND database. More specifically, the noise pro-
files found in the testing set are:
• Office: noise from an office with keyboard typing and mouse clicking
• Living: noise inside a living room
• Cafe: noise from a cafe at a public square
• Bus: noise from a public bus
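Mixing clean speech with noise at a fixed SNR, as in the test-set conditions above, can be sketched as follows (this is an illustration with random stand-in signals, not the scripts used to build the Edinburgh test set):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that the speech-to-noise power ratio
    # equals snr_db, then add it to the clean speech.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(1)
speech = rng.standard_normal(48000)  # 1 s stand-in at 48 kHz
noise = rng.standard_normal(48000)
# The four SNR conditions found in the test set:
noisy = {snr: mix_at_snr(speech, noise, snr) for snr in (2.5, 7.5, 12.5, 17.5)}
```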
The choice of evaluation metrics is of great im-
portance for consistently evaluating a system. To
evaluate ours, we used a metric that focuses on the
intelligibility of the voice signal (STOI), which
ranges from 0 to 1, and a metric that focuses on
sound quality (PESQ), which ranges from -0.5 to 4.5,
with higher values corresponding to better quality.
4 EXPERIMENTAL RESULTS
AND DISCUSSIONS
We present our results as a comparison between
the reference RNNoise system and the proposed
method that makes use of the LTSD feature. As Figure
4 shows, when the two methods are compared with the
PESQ quality metric, the proposed method performs
better in most acoustic scenarios and at all SNR
levels, especially at lower SNRs. Our proposed
method outperforms the RNNoise algorithm by 0.12
MOS points on average. Similarly, the STOI intelli-
gibility measure, depicted in Figure 5, indicates that
the proposed method also achieves better intelligi-
bility.
We observe a noticeable performance improve-
ment over the RNNoise method. Having also com-
pared several spectrograms produced by both meth-
ods, we observed that in general the proposed method
does indeed remove more noise components.
Taking these results into consideration, more de-
tailed research is required in future work to reduce
speech distortion and improve the noise removal
level for different application scenarios.
Firstly, we find that adding more hidden layers can
indeed be beneficial for our proposed system. Given
that we provide the system with more and more di-
verse input information, the RNN might be able to
make better use of the proposed features with addi-
tional hidden layers.
Secondly, studying samples processed by our ex-
tended system, we speculate that the system could
benefit from tuning how aggressively the noise
suppression occurs. This can be achieved by fine-tun-
ing the value of the γ parameter in the loss function
(3), keeping in mind that smaller γ values lead to
more aggressive suppression. According to our exper-
iments, setting γ = 1/2 provides an optimal balance.
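The role of γ can be illustrated with a small sketch. We do not reproduce the exact loss (3) here; the form below, a squared error over γ-exponentiated band gains in the style of Valin (2018), is an assumption, and the gain values are hypothetical:

```python
import numpy as np

def gain_loss(g_true, g_pred, gamma=0.5):
    # Mean squared error between per-band gains raised to the power gamma
    # (assumed form, after Valin (2018); not the exact loss (3) of the text).
    # Smaller gamma magnifies gain differences near zero, pushing the
    # network toward more aggressive suppression of low-gain bands.
    return np.mean((g_true ** gamma - g_pred ** gamma) ** 2)

g_true = np.array([0.9, 0.1, 0.5])  # hypothetical ideal band gains
g_pred = np.array([0.8, 0.3, 0.5])  # hypothetical predicted band gains
loss_half = gain_loss(g_true, g_pred, gamma=0.5)  # the reported optimum
loss_one = gain_loss(g_true, g_pred, gamma=1.0)   # plain squared gain error
```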
Finally, we believe that further research can be
done on the performance of our proposed systems as
the training dataset increases in size and diversity.