
in Spanish. When the audio to be transcribed comes from the context of a Peruvian convenience store, these models fail to recognize certain keywords, such as product names. They may return text that is transcribed better than that of the other original models, but it still contains errors and requires an extra post-processing step to transform the flawed transcription into the intended phrase.
Our model appears to produce better results than the others; however, this does not mean that it is inherently more precise. Its better results stem from the fine-tuning process for this specific context: we retrained the model with our data to obtain that accuracy in the setting of a Peruvian convenience store, whereas all the other models were trained with general Spanish-language data. Another observation is that, depending on the audio quality and the user's specific pronunciation, the model may not return a phrase that is 100% correctly interpreted.
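To make the retraining step concrete, the following is a minimal fine-tuning sketch. It assumes a wav2vec 2.0-style CTC model loaded through the Hugging Face transformers library; the checkpoint name, sampling rate, and the single (audio, transcript) pair are illustrative placeholders rather than the exact resources used in this work.

```python
# Minimal fine-tuning sketch, assuming a wav2vec 2.0-style CTC model loaded
# through Hugging Face transformers. The checkpoint name and the data are
# placeholders, not the exact setup used in this work.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CHECKPOINT = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"  # example Spanish base model

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)

# Freeze the convolutional feature encoder so that only the transformer layers
# and the CTC head adapt to the convenience-store vocabulary.
model.freeze_feature_encoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveform, transcript):
    """One gradient step on a single (16 kHz audio, transcript) pair."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In practice, the same step would be repeated over the full recorded dataset in mini-batches; the sketch only illustrates the idea of adapting a general Spanish checkpoint to the store-specific phrases.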
5 CONCLUSIONS
In this project, fine-tuning was performed on a pre-trained speech-to-text model to improve its performance in a specific context: the phrases used in convenience stores in Peru.
The goal was to deploy this improved model on an online platform, allowing users to interact with the platform using their voice. Our motivation for this project was to help the owners of these traditional businesses adopt technological solutions and to provide a first step toward adapting their businesses to these new tools.
Through data collection, we achieved positive results, as shown in the experiments section, specifically in the ’Training the Model’ subsection. Our model outperformed both the base models and the models pre-trained in Spanish. Through this training on the proposed data, we achieved a Word Error Rate of 14.3%, demonstrating its effectiveness compared to other models in this specific context.
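To illustrate how the reported metric is computed, the snippet below calculates WER with the open-source jiwer library on an invented reference/hypothesis pair; neither the sentences nor the library choice are taken from our actual evaluation pipeline.

```python
# Illustrative WER computation with the jiwer library. The sentence pair is
# invented for illustration and is not drawn from the evaluation data.
import jiwer

reference = "dos botellas de inca kola y un paquete de galletas"
hypothesis = "dos botellas de inka cola y un paquete de galletas"

# WER = (substitutions + deletions + insertions) / number of reference words
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.3f}")  # 2 substituted words out of 10 -> 0.200
```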
For future work, we aim to expand the product list to a more comprehensive set of items commonly sold in these stores. In this initial training, conducted as the project's prototype, we used a reduced list of products.
The objective is to deliver a high-quality online platform and to help the independent owners of these traditional businesses with tasks that become more effective when performed with technological tools.
Additionally, another goal is to achieve better accuracy by further reducing the Word Error Rate (WER) for the products already trained, and to reach a comparable WER for newly introduced products. Furthermore, other kinds of models might improve our metrics (Leon-Urbano and Ugarte, 2020; Ysique-Neciosup et al., 2022; Rodríguez et al., 2021).
REFERENCES
Aguirre-Peralta, J., Rivas-Zavala, M., and Ugarte, W. (2023). Speech to text recognition for videogame controlling with convolutional neural networks. In ICPRAM, pages 948–955. SCITEPRESS.
Androutsopoulou, A., Karacapilidis, N. I., Loukis, E. N., and Charalabidis, Y. (2019). Transforming the communication between citizens and government through ai-guided chatbots. Gov. Inf. Q., 36(2):358–367.
Apicella, A., Isgrò, F., Pollastro, A., and Prevete, R. (2023). Adaptive filters in graph convolutional neural networks. Pattern Recognit., 144:109867.
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS.
Errattahi, R., Hannani, A. E., and Ouahmane, H. (2015). Automatic speech recognition errors detection and correction: A review. In ICNLSP, volume 128 of Procedia Computer Science, pages 32–37. Elsevier.
Ghobakhloo, M., Asadi, S., Iranmanesh, M., Foroughi, B., Mubarak, M., and Yadegaridehkordi, E. (2023). Intelligent automation implementation and corporate sustainability performance: The enabling role of corporate social responsibility strategy. Technology in Society, 74:102301.
Leon-Urbano, C. and Ugarte, W. (2020). End-to-end electroencephalogram (EEG) motor imagery classification with long short-term. In SSCI, pages 2814–2820. IEEE.
Mallikarjuna Rao, G., Tripurari, V. S., Ayila, E., Kummam, R., and Peetala, D. S. (2022). Smart-bot assistant for college information system. In 2022 Second International Conference on Artificial Intelligence and Smart Energy (ICAIS), pages 693–697.
Peng, J. and Bao, L. (2023). Construction of enterprise business management analysis framework based on big data technology. Heliyon, 9(6):e17144.
Rodríguez, M., Pastor, F., and Ugarte, W. (2021). Classification of fruit ripeness grades using a convolutional neural network and data augmentation. In FRUCT, pages 374–380. IEEE.
Waqar, D. M., Gunawan, T. S., Kartiwi, M., and Ahmad, R. (2021). Real-time voice-controlled game interaction using convolutional neural networks. In 2021 IEEE 7th International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), pages 76–81.
Ysique-Neciosup, J., Chavez, N. M., and Ugarte, W. (2022). Deephistory: A convolutional neural network for automatic animation of museum paintings. Comput. Animat. Virtual Worlds, 33(5).