5 CONCLUSIONS
In this work, keyboard sounds were dynamically suppressed in speech signals using artificial neural networks. The networks received small time frames of Mel spectrograms as input and were trained to predict a magnitude mask that removes the background noise from the input signal. In the first experiments, the background signal was only partially suppressed and remained in the output signal at a much lower volume. The best results were achieved with convolutional neural networks.
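To make the processing pipeline concrete, the following minimal Python sketch illustrates the described masking approach; the sample rate, the number of Mel bins, the frame width, and the use of librosa are illustrative assumptions rather than the exact configuration of our experiments.

    import numpy as np
    import librosa

    sr = 16000
    noisy = np.random.randn(sr * 3).astype(np.float32)  # stand-in for a noisy speech recording

    # Mel spectrogram of the noisy input (64 bins is an assumed value).
    mel = librosa.feature.melspectrogram(y=noisy, sr=sr, n_mels=64)

    # Slice the spectrogram into small time frames (e.g. 8 columns wide)
    # that are fed to the network one at a time.
    width = 8
    frames = [mel[:, i:i + width] for i in range(0, mel.shape[1] - width + 1, width)]

    # The network predicts a magnitude mask per frame; a dummy all-ones
    # mask stands in for the prediction here. Element-wise multiplication
    # attenuates the components marked as noise.
    mask = np.ones_like(frames[0])
    denoised_frame = frames[0] * mask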
Adapting the loss function to achieve stronger suppression of the keyboard noise led to a significantly better removal of the background noise. The improvement was particularly evident in the visual comparison of the output spectrograms with those from the initial experiments.
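The exact form of the adapted loss is given earlier in the paper; as a purely hypothetical illustration of the general idea, a mean squared error over the predicted mask can be weighted by a factor α so that residual noise is penalized more heavily than speech distortion (the 0.5 threshold below is an assumption for illustration only):

    import torch

    def adapted_mse(pred_mask, target_mask, alpha=2.0):
        # Squared error between predicted and ideal magnitude masks.
        err = (pred_mask - target_mask) ** 2
        # Over-weight positions where the target mask is small, i.e.
        # where noise dominates and should be fully suppressed.
        weight = torch.where(target_mask < 0.5,
                             torch.full_like(err, alpha),
                             torch.ones_like(err))
        return (weight * err).mean()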
Further, we explored whether the proposed neural networks can handle multiple different background noise types. A combined dataset was created in which keyboard and door-knocking sounds were mixed together. The networks trained on this combined training set showed only marginally higher errors than in the previous experiments.
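Such combined training examples can be assembled by overlaying both noise types onto clean speech; in the following minimal sketch, the fixed gain and the random stand-in signals are illustrative assumptions:

    import numpy as np

    def mix(speech, noises, noise_gain=0.3):
        # Overlay several noise recordings onto clean speech. All signals
        # are assumed to share one sample rate; the fixed gain is an
        # illustrative choice rather than a calibrated SNR.
        out = speech.copy()
        for n in noises:
            out += noise_gain * np.resize(n, speech.shape)  # loop/trim to length
        return out

    sr = 16000
    speech = np.random.randn(sr * 3).astype(np.float32)  # stand-in signals
    keyboard = np.random.randn(sr).astype(np.float32)
    knocking = np.random.randn(sr * 2).astype(np.float32)
    noisy = mix(speech, [keyboard, knocking])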
Finally, the results achieved in this work were compared to those of the open-source tool RNNoise. The convolutional neural network from our experiments outperformed RNNoise, which may be explained by the fact that RNNoise uses a much smaller network.
In conclusion, this topic is not new and has been researched for some time. The problem in academic research is that algorithms and procedures already established in commercial software are not freely accessible. For this reason, corresponding algorithms and procedures have to be researched and developed independently in academia. This paper presents a system for removing complex background noises from speech signals, using computer keyboard noise as an example. It thus becomes possible to develop, in a short time, free software adapted to one's own background noises. To assess the quality of our solution, further experiments with different background noises are necessary.
6 FUTURE WORK
Some of the network hyperparameters were not varied in the experiments. It could be further investigated whether the number of Mel frequency bins can be reduced without a significant impact on output quality. A lower number of bins per time window would also drastically decrease the network size and computation time. Further, the network architectures could even be learned automatically using neural architecture search (Elsken, Metzen and Hutter, 2019). It is expected that both the error values and the network size can be decreased by systematically exploring new architectures. Different loss functions could also be explored: the adapted loss from this work should be tested with more fine-grained values of α, and other loss functions may prove more suitable.
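As a rough illustration of the bin-count trade-off mentioned above, the following sketch compares the resulting input sizes for different numbers of Mel bins (the candidate values are arbitrary examples):

    import numpy as np
    import librosa

    sr = 16000
    y = np.random.randn(sr).astype(np.float32)  # stand-in for one second of speech

    for n_mels in (128, 64, 32):  # candidate bin counts, chosen for illustration
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        # Fewer bins yield smaller input frames and hence a smaller network.
        print(n_mels, mel.shape)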
For a practical application of such a system, more noise types must be taken into account. It was shown that two different types can be combined, but the training data could be extended with, for example, fan noise, baby screams, or traffic sounds. When speech signals are transferred over the Internet in real time, packet loss can occur; the proposed system could be trained to predict the missing time windows from the previous ones. As an extension of this work, runtime measurements of the presented method could be carried out, enabling comparisons with other NN-based and traditional methods.
REFERENCES
Bock, S., Weiß, M., 2019. A Proof of Local Convergence
for the Adam Optimizer. International Joint Conference
on Neural Networks (IJCNN), ISBN: 978-1-7281-
1985-4, pp. 1-8.
Elsken, T., Metzen, J. H., Hutter, F., 2019. Neural Architecture Search: A Survey. Journal of Machine Learning Research, vol. 20, pp. 55:1-55:21.
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D.,
Dahlgren, N., 1993. TIMIT Acoustic Phonetic
Continuous Speech Corpus. Linguistic Data Consortium.
Kathania, H., Shahnawazuddin, S., Ahmad, W., Adiga, N., 2019. On the Role of Linear, Mel and Inverse-Mel Filterbank in the Context of Automatic Speech Recognition. National Conference on Communications (NCC), ISBN: 978-3-658-23751-6, pp. 1-5.
Krisp, 2021. HD Voice with Echo and Noise Cancellation. Krisp.ai. [online]. Available at: https://krisp.ai/. Accessed: 22/12/2021.
Lafay, G., Benetos, E., Lagrange, M., 2017. Sound event
detection in synthetic audio: Analysis of the DCASE
2016 task results. IEEE Workshop on Applications of
Signal Processing to Audio and Acoustics (WASPAA),
ISBN: 978-1-5386-1632-1, pp. 11-15.
Loizou, P., 2013. Speech Enhancement: Theory and Practice. Second Edition. CRC Press.
Loizou, P., 2005. Speech Enhancement Based on Perceptually Motivated Bayesian Estimators of the Magnitude Spectrum. IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857-869.