– 20% did not understand how to get started, since they were only told that the game worked by voice commands.
• Regarding the wav2vec2 model:
– It uses the torch library, which was not compatible with Windows; the model could only be trained through virtual machines, other operating systems, or Google Colab.
– This model does a good job of isolating speech from noise, and it does not lose accuracy due to the length of the audio, unlike the TensorFlow model, which, for example, required the durations of the audio clips to be very close to each other (see the inference sketch after this list).
– The language of the input cannot be specified; it is detected internally by the model itself, which can cause precision problems for a syllable, word, or phrase that sounds the same in two different languages, for example “i” (Spanish) and “e” (English), or “ji” (Spanish) and “he” (English).
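To make the point about audio length concrete, the following sketch shows how a pretrained wav2vec2 checkpoint can transcribe recordings of arbitrary duration with torch and torchaudio. It is a minimal illustration only: the checkpoint name ("facebook/wav2vec2-base-960h") and the 16 kHz resampling step are assumptions and do not reproduce the exact configuration or fine-tuned weights used in this work.

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Assumed public checkpoint; the paper's own fine-tuned weights are not shown here.
MODEL_NAME = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()

def transcribe(path: str) -> str:
    """Transcribe a recording of arbitrary length with wav2vec2 + CTC decoding."""
    waveform, sample_rate = torchaudio.load(path)
    # wav2vec2 expects 16 kHz mono input.
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    inputs = processor(waveform.squeeze(0).numpy(),
                       sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]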
5 CONCLUSIONS
By implementing the wav2vec2 (W2V2) convolutional neural network model, which transforms speech to text, and by increasing the recognition accuracy of the trained words, the computer can effectively understand the spoken keywords and use them, in this case, as input for the video game.
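As an illustration of how the recognized keywords can be fed to the game, the sketch below maps transcribed tokens to input events. The command vocabulary and the game.dispatch hook are hypothetical placeholders, not the game's actual interface.

# Hypothetical keyword-to-action table; the real command set is defined by
# the trained vocabulary of the model, which is not shown here.
COMMANDS = {
    "UP": "move_up",
    "DOWN": "move_down",
    "LEFT": "move_left",
    "RIGHT": "move_right",
    "SELECT": "confirm",
}

def handle_utterance(transcript: str, game) -> None:
    """Forward the first recognized keyword in the transcript to the game."""
    for token in transcript.upper().split():
        action = COMMANDS.get(token)
        if action is not None:
            game.dispatch(action)  # assumed game-side input method
            break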
In the future, regarding the video game, we would like to expand the catalog of games, add difficulty levels, and improve some visual aspects such as the color palette or the level of detail of the on-screen instructions. On the other hand, since speech-to-text is such a versatile technique, similar approaches could be applied in other areas, such as voice-controlled office tools or cell phone applications that help people with other disabilities use the functionalities of their phone. As long as a good speech-to-text model is available, or one is implemented and trained long and accurately enough, any model is valid. Furthermore, the user experience can be greatly improved with animation models (Silva et al., 2022) or story creation (Fernández-Samillán et al., 2021).
REFERENCES
Anjos, I., Marques, N. C., Grilo, M., Guimarães, I., Magalhães, J., and Cavaco, S. (2020). Sibilant consonants classification comparison with multi- and single-class neural networks. Expert Syst. J. Knowl. Eng., 37(6).
Fernández-Samillán, D., Guizado-Díaz, C., and Ugarte, W. (2021). Story creation algorithm using Q-learning in a 2D action RPG video game. In IEEE FRUCT.
Garnica, C. C., Archundia-Sierra, E., and Beltrán, B. (2020). Prototype of a recommendation system of educative resources for students with visual and hearing disability. Res. Comput. Sci., 149(4):81–91.
Hersh, M. A. and Mouroutsou, S. (2019). Learning technol-
ogy and disability - overcoming barriers to inclusion:
Evidence from a multicountry study. Br. J. Educ. Tech-
nol., 50(6):3329–3344.
Hwang, I., Tsai, Y., Zeng, B., Lin, C., Shiue, H., and Chang,
G. (2021). Integration of eye tracking and lip motion
for hands-free computer access. Univers. Access Inf.
Soc., 20(2):405–416.
Jung, S., Son, M., Kim, C., Rew, J., and Hwang, E. (2019).
Video-based learning assistant scheme for sustainable
education. New Rev. Hypermedia Multim., 25(3):161–
181.
Kandemir, H. and Kose, H. (2022). Development of adap-
tive human-computer interaction games to evaluate at-
tention. Robotica, 40(1):56–76.
Laptev, A., Korostik, R., Svischev, A., Andrusenko, A.,
Medennikov, I., and Rybin, S. (2020). You do not need
more data: Improving end-to-end speech recognition
by text-to-speech data augmentation. In CISP-BMEI,
pages 439–444. IEEE.
Mavropoulos, T., Symeonidis, S., Tsanousa, A., Gian-
nakeris, P., Rousi, M., Kamateri, E., Meditskos, G.,
Ioannidis, K., Vrochidis, S., and Kompatsiaris, I.
(2021). Smart integration of sensors, computer vision
and knowledge representation for intelligent monitor-
ing and verbal human-computer interaction. J. Intell.
Inf. Syst., 57(2):321–345.
Nevarez-Toledo, M. and Cedeno-Panezo, M. (2019). Appli-
cation of neurosignals in the control of robotic pros-
thesis for the inclusion of people with physical dis-
abilities. In INCISCOS, pages 83–89.
O’Shaughnessy, D. D. (2008). Invited paper: Automatic
speech recognition: History, methods and challenges.
Pattern Recognit., 41(10):2965–2979.
O’Shea, K. and Nash, R. (2015). An introduction to convo-
lutional neural networks. CoRR, abs/1511.08458.
Silva, S., Sugahara, S., and Ugarte, W. (2022). Neuranima-
tion: Reactive character animations with deep neural
networks. In VISIGRAPP (1: GRAPP).
Song, I., Jung, M., and Cho, S. (2006). Automatic gen-
eration of funny cartoons diary for everyday mobile
life. In Australian Conference on Artificial Intelli-
gence, volume 4304 of Lecture Notes in Computer
Science, pages 443–452. Springer.
Sundaram, D., Sarode, A., and George, K. (2019). Vision-
based trainable robotic arm for individuals with motor
disability. In UEMCON, pages 312–315. IEEE.
Yi, C., Wang, J., Cheng, N., Zhou, S., and Xu, B. (2020).
Applying wav2vec2.0 to speech recognition in various
low-resource languages. CoRR, abs/2012.12121.