Speech Recognition for Inventory Management in Small Businesses
Bruno Tiglla-Arrascue, Junior Huerta-Pahuacho and Luis Canaval
Universidad Peruana de Ciencias Aplicadas, Lima, Peru
Keywords:
Speech-to-Text, Machine Learning, Deep Learning.
Abstract:
In recent years, we have seen an increase in independent businesses focused primarily on online sales,
where they offer products through ads and manage the business with electronic tools. This could leave behind
some traditional businesses, especially those managed by a single family, where the adaptation of new
technologies is slower than in newer businesses. That is why we want to give them a tool that is easy to use: a
virtual assistant with which they can manage their inventory even if they know nothing about databases. For this work,
we propose to create a speech-to-text platform with machine learning so that users who have difficulties
adapting to these new tools can use their voice to command the database and have a first contact with these new
technologies. Through a fine-tuning process applied to a pre-trained speech-to-text model in Spanish, we obtained
a word error rate of 14.3%, lower than that of the base model, which means that our model is more
accurate in the context of a Peruvian convenience store.
1 INTRODUCTION
After the lockdown, there was a considerable increase in the number of small businesses: more people
wanted to offer their products, whether virtually or in physical shops, and the commercial sector started to
grow. As a result, supply and demand gradually increased as the population began to leave their homes.
In recent years, technology has played a crucial role in the business sector, and now most retail
businesses turn to digital solutions to improve their sales and bring new tools to employees and customers.
However, there is still a sector where tech adop-
tion seems to have stopped. In these types of busi-
nesses, the implementation of technological solutions
tends to be a challenge, as many of these merchants
tend to have a more traditional approach (Peng and
Bao, 2023).
These traditional business practices are mainly
based on manual statistics, inefficient analysis, and
error-prone decision-making. Additionally, some in-
ternal data of these businesses tends to be fragmented
and decentralized, making it difficult to compile it
into a database.
Among the technology solutions whose development has been driven by digital transformation to streamline
processes, artificial intelligence (AI) solutions stand out.
Ghobakhloo et al. (Ghobakhloo et al., 2023) claim
that in the context of digital transformation, AI has
become a key term. Furthermore, they point out that AI
is revolutionary due to its unique features, such as the
ability to simulate human intelligence and interact in
real-time with people through voice recognition.
These features enable it to adapt to new busi-
ness circumstances and predict potential outcomes, as
mentioned in (Peng and Bao, 2023). In Peru, many
types of traditional businesses can be found in ev-
ery neighborhood, with the most common being those
that sell high-turnover products. Some of these estab-
lishments still follow traditional management meth-
ods and are often run by individuals who are not fa-
miliar with virtual tools.
However, due to the necessity brought about by
digital solutions such as virtual wallets or online com-
merce through social networks, they have had to learn
and use these technological tools, including the use of
mobile devices and specialized software.
In these businesses, where many different products are handled, it is common to face problems related
to inventory disorganization, which can result in delays in the registration and checkout processes of
products.
This is often the result of manual control work which, in addition to being slow, tends to fail,
increasing the risk of information loss by relying on physi-
cal reports. A technological solution for this type of
problem can be found in the use of databases. By
maintaining a virtual inventory through platforms and
thanks to their accessibility, it becomes easier to keep
track of product movements.
However, there is little information about how
these tools could be applied to their businesses. Many
of these inventory management solutions for tradi-
tional businesses are not focused on those who are
just learning to handle technological devices or are
unfamiliar with database terms.
To solve this problem, there are different methods to facilitate the use of these tools for new users:
for example, an application where the user can manage
the inventory. Our goal is to provide support to the
technological transformation movement by facilitat-
ing work performance in those businesses.
Likewise, with the heavy use and development of artificial intelligence in recent years, a solution
involving this technology could facilitate the use of technological tools within a store, especially
for those users who do not handle electronic devices. The objective of this project is to
provide essential support to small business owners by
offering them tools to create and utilize a database.
The idea is to provide an easy-to-use solution: for example, a mobile application where the user can
control the inventory of their business, with a speech-to-text function that lets the user manage the
application more easily.
By listing their products, the goal is to simplify
the creation, basic organization, and management of
a virtual inventory for their business. Through a web
platform, the intention is to support the user by pro-
viding them with more effective control over their inventory, enabling them to make more appropriate
and accurate decisions when acquiring new products and to stay informed about their stock.
Next, we review some related works on speech-to-text solutions and on solutions applied to areas where
technology is uncommon in Section 2. Then, we describe the terms and tools used in developing our
proposal in Section 3. Likewise, we review how our proposal using the speech-to-text function performs
in Section 4, along with the setup, experiments, and results. Finally, we present the conclusions and
discussion of our proposal in Section 5.
2 RELATED WORK
In this section, we briefly discuss the implementation of speech-to-text in video games, and how the
implementation of a technology solution such as speech-to-text can improve the efficiency of a specific area.
In (Aguirre-Peralta et al., 2023), the authors focus on the use of convolutional neural networks for
speech-to-text recognition as the control mechanism of a video game. The objective was to develop a
medical tool for individuals with upper limb motor disabilities.
The paper presents a convolutional neural network ar-
chitecture focused on speech-to-text recognition, and
three turn-based mini-games were developed within
the video game to process the data provided by the
convolutional neural network. Additionally, it pro-
vides an analysis of the most promising results that
demonstrate the project’s feasibility. It explains the
definitions of technologies related to convolutional
neural networks, the chosen architecture for the so-
lution, and terms such as NLP, CNN, speech-to-text
recognition, and gamification. Likewise, it details the
experiments, the types of data they used, and their re-
sults.
In (Waqar et al., 2021), the authors propose a
real-time voice command recognition system using
Convolutional Neural Networks (CNN) to control the
Snake game. The authors prepared a dataset for voice
commands: up, down, left, and right, for training, val-
idation, and testing. They proposed an optimal voice
command recognition system based on MFCC (Mel
Frequency Cepstral Coefficients) and CNN to recog-
nize the four voice commands. The proposed algo-
rithm achieved a high recognition accuracy of 96.5%
and successfully detected all four commands. Finally,
the proposed algorithm was integrated into a Python-
based Snake game.
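To make the front end of such systems concrete, the following minimal sketch extracts MFCC features from a voice command clip; the librosa library and the file name are our assumptions, not the cited authors' implementation.

import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    # Load the clip and resample to 16 kHz, a common rate for speech.
    signal, sr = librosa.load(path, sr=16000)
    # Compute the Mel Frequency Cepstral Coefficients (n_mfcc x frames).
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

features = extract_mfcc("up_command.wav")  # hypothetical file
print(features.shape)  # e.g. (13, number_of_frames)

The resulting coefficient matrix is the kind of input a CNN classifier of voice commands would consume.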
In (Mallikarjuna Rao et al., 2022), the authors fo-
cus on creating a chatbot for information queries in a
university context. It outlines the issue where students
and parents need to navigate the university’s website
or make phone inquiries to obtain information, and
how a chatbot can address this problem by provid-
ing automated responses. Different types of chatbots,
such as rule-based and machine learning-based ones,
are also mentioned, along with a discussion of the ad-
vantages and limitations of each approach. The text
provides details about the structure of the proposed
chatbot and its implementation using AIML and nat-
ural language processing.
2.1 Machine Learning
In recent years, technology has experienced exponen-
tial advancement.
In particular, chatbots and other types of artifi-
cial intelligence solutions, such as Machine Learning
algorithms and process automation, can significantly
reduce administrative burden (Androutsopoulou et al.,
2019).
Their ability to enable machines to learn and adapt
from data has revolutionized a wide range of applica-
tions in various areas.
Machine Learning is transforming our approach to
complex problems and opening up new possibilities,
such as empowering machines to interact with users
through voice.
Here, we will explore some fundamentals and ap-
plications for the field of speech-to-text recognition.
2.2 Convolutional Neural Networks
In the field of computer vision and deep learning,
Convolutional Neural Networks (CNN) have emerged
as a fundamental architecture, inspired by the natural
mechanism of visual perception in living beings.
In the context of these networks, their potential
lies in the ability to extract and process local informa-
tion through convolutions applied to input data, using
sets of filters with a fixed size (Apicella et al., 2023).
2.3 Speech-to-Text
The authors of (Aguirre-Peralta et al., 2023) mention that speech-to-text is the capacity of a machine
to recognize human speech and convert it into written text.
They also mention that it can be useful for people who do not know how to work with technological tools.
2.4 Wav2Vec 2.0
Wav2Vec 2.0 is a model for Automatic Speech Recog-
nition (ASR) with self-supervised training.
This model contains different stages through which the audio passes to be converted into a textual
representation.
First, it receives raw audio, from which an encoder based on a Convolutional Neural Network (CNN)
extracts features; this network is used to produce a more compact and meaningful representation of the audio.
After passing through the encoder, the audio fea-
tures are quantized using multiple codebooks.
This means that it reduces the amount of data nec-
essary to represent a signal while maintaining an ac-
ceptable level of fidelity.
The quantization is done using a function called
“Gumbel softmax”. This quantized representation of
the audio is used in the decoding process.
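As an illustrative sketch of this pipeline (assuming the Hugging Face transformers and torchaudio libraries; the model runs the encoder, quantizer, and Transformer internally in a single forward pass, and the audio file is hypothetical):

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-spanish"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load raw audio and resample to the 16 kHz rate the model expects.
waveform, sr = torchaudio.load("command.ogg")  # hypothetical file
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# The CNN encoder and Transformer run inside the forward pass; the
# output logits are decoded (CTC) into text by the processor.
inputs = processor(waveform.squeeze().numpy(),
                   sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])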
3 METHOD
In this section, we will explain some preliminary
phases for the development of the project, where the
speech-to-text recognition model that we plan to im-
plement will be mentioned.
Also, we will explain the main workflow of the
proposed application.
In addition, we mention the data used for retraining the model, as well as its preparation to receive
this new information.
Finally, we will explain the expected final product.
3.1 The ML Model Structure
As previously mentioned, the speech-to-text recognition process uses the Wav2Vec2 model for ASR.
In this model, the audio goes through several parts of the model to become a textual representation.
It begins by receiving raw audio signals, from which an encoder based on a convolutional neural network
(CNN) extracts audio features used to represent the audio in a more compact and meaningful way
(Baevski et al., 2020). The model is composed of a multi-layer convolutional feature encoder. It learns
the basic units of speech through a self-supervised task, enabling learning from unlabeled training data
to facilitate speech recognition systems for multiple languages (Baevski et al., 2020).
Within the quantization module, the probability of choosing the v-th codebook entry in group g is given
by the Gumbel softmax (Baevski et al., 2020):

\[ p_{g,v} = \frac{\exp(l_{g,v} + n_v)/\tau}{\sum_{k=1}^{V} \exp(l_{g,k} + n_k)/\tau} \]

where l are the logits, n is Gumbel noise, and \tau is a temperature.
Figure 1: The framework which jointly learns contextual-
ized speech representations and an inventory of discretized
speech units. (Baevski et al., 2020).
After quantization, the audio features are passed through a "Transformer" neural network. This network
captures relationships between the audio features and generates a textual representation.
This representation consists of sequences of code words that correspond to the sounds of the audio.
Finally, the model employs decoders situated in the last output layer, which perform the final
classifications. This layer is regarded as the "Decoder", as it decodes the context representations into
a text transcription. The model is depicted in Figure 1.
3.2 Workflow of the Project
Thanks to the previously mentioned model, we can
transform audio into text. This will serve as the re-
quired input to ensure that the proposed platform can
follow the intended workflow to assist in inventory
management.
However, the model in question is trained generically on a wide range of Spanish words, and its accuracy
can vary depending on how users pronounce them. Therefore, the decision was made to retrain the model
within the context of inventory management.
To undertake this task, new training data is
needed. As a result, new audio data will be collected
and divided into three groups.
The platform operates by allowing the user to ac-
cess various functionalities using their mobile phone’s
microphone.
The platform is to be hosted initially on our local computers, but for further research the plan is to
deploy it as an AWS cloud service, where both the web platform's logic and the pre-trained model will run.
The information captured by the front end will be
sent to the back end, where the model will interpret
these audio signals and automatically perform learned
actions based on the audio input.
The logic architecture is depicted in Figure 2.
3.3 Data Pre-Processing
To retrain the model, the proposal involves using a
new set of audio recordings focused on the inventory
management context.
From the collected audio data, it was planned to
divide them into three groups of Spanish words:
The first group comprises the keywords for the
platform’s workflow:
1. “agregar”, “buscar”, “actualizar”, “vender”, “desactivar”, “generar”, “producto”, “informe”.
The second group consists of product names that
we will focus on for the application’s recognition,
which are:
1. Sodas: “Inca Kola”, “Coca Cola”, and “Fanta”.
2. Cookies: “San Jorge”, “Margarita”, and “Rellenita”.
3. Water: “San Luis”, “San Mateo”, and “Cielo”.
The third group of data includes forty combinations of phrases used in the process of adding a product.
These phrases follow the “add” structure: “Keyword phrase” + “Quantity” + “Product name” +
“Specifications” + “Price”.
1. “agregar catorce rellenitas de cien gramos costo
dos soles”
2. “vender veintitres inca kola de cuatrocientos
mililitros costo dos soles cincuenta”
3. “compre setenta y nueve santos mateos de
quinientos litros salio dieciocho soles con
noventa”
From this last group of audio data, random but
equivalent samples are obtained from the proposed
fifty participants.
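To make this phrase structure concrete, the following minimal sketch (our illustration, not the platform's actual parser) maps a transcribed "agregar"/"vender" phrase onto the keyword, quantity, product, specification, and price fields; the regular expression and the small number-word map are simplifying assumptions.

import re

NUMBERS = {"catorce": 14, "veintitres": 23, "dos": 2}  # small sample map

def parse_add_phrase(text: str):
    # Expected shape: keyword + quantity + product + "de" + spec + "costo" + price
    m = re.match(r"(agregar|vender)\s+(\w+)\s+(\w+)\s+de\s+(.+?)\s+costo\s+(.+)", text)
    if not m:
        return None
    action, qty, product, spec, price = m.groups()
    return {
        "action": action,
        "quantity": NUMBERS.get(qty, qty),
        "product": product,
        "specification": spec,
        "price": price,
    }

print(parse_add_phrase("agregar catorce rellenitas de cien gramos costo dos soles"))
# {'action': 'agregar', 'quantity': 14, 'product': 'rellenitas',
#  'specification': 'cien gramos', 'price': 'dos soles'}

A production parser would need a full Spanish number-word grammar and variants such as "salio" instead of "costo".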
3.4 Data Preparation
To improve the model’s accuracy in the proposed con-
text, it is suggested to retrain the model with the
newly collected data.
This way, the received audio inputs will have im-
proved accuracy in recognizing the words within the
transcribed phrase.
This precise transcription is essential for the sub-
sequent stages of the platform’s workflow.
Furthermore, since the model is already partially trained to recognize a wide range of Spanish words,
the plan for expanding the data collection was to encompass variations in background noise, such as a
conversation in the background.
This involves a process of Data Augmentation for
our data collection.
It allows us to obtain more data from a smaller
number of users and optimize the model for specific
situations, types of noises, or user speech speed.
For this purpose, a sample of 50 users was used for data collection; this group was chosen
indiscriminately and is made up of Spanish speakers.
Each user contributed 37 audio tracks (8 from group 1, 9 from group 2, and 20 from group 3).
After compilation, we had a total of 1,850 audio recordings for retraining.
Figure 2: Logic Architecture Diagram of the solution.
By adding a background sound to these recordings, we obtained more than 3,700. However, we filtered out
some audio files that were of low quality and would not support training. In the same way, we split off
a data set for testing.
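The following is a minimal sketch of this noise-mixing augmentation, assuming mono recordings and the soundfile and numpy libraries; file names and the gain value are hypothetical.

import numpy as np
import soundfile as sf

def mix_background(speech_path: str, noise_path: str, out_path: str,
                   noise_gain: float = 0.1):
    speech, sr = sf.read(speech_path)
    noise, _ = sf.read(noise_path)
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]
    # Overlay an attenuated conversation/noise track on the command.
    sf.write(out_path, speech + noise_gain * noise, sr)

mix_background("agregar_01.ogg", "cafe_chatter.ogg", "agregar_01_noisy.ogg")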
3.5 Fine-Tuning Process Preparation
For this work, we are using the Lightning Flash library, which provides a function that helps us
retrain the model.
The structure is depicted in Figure 3.
Figure 3: The structure of the pre-model learning frame-
work of wav2vec2 applying the library with the freeze strat-
egy.
The ’Backbone’ hosts the Wav2Vec 2.0 model,
which was trained on its respective datasets.
This model encompasses a neural network de-
signed for Spanish speech recognition. The neural
network has already acquired general features from
the original dataset and will serve as the foundation
for the new model.
On the other hand, the 'Head' constitutes another neural network, typically smaller in size, trained on
the proposed dataset to learn to map the general features extracted by the 'Backbone' to the new outputs.
The model is depicted in Figure 4.
Figure 4: Fine-tuning applied to the pre-trained Wav2Vec2
model.
The ”Freeze” strategy in Lightning Flash is pri-
marily employed during the fine-tuning process of
models. For the Wav2Vec 2.0 model, this strategy is
used to ”freeze” the pre-trained model’s weights dur-
ing training.
This entails that the weights of the neural network in the "Backbone", which are the convolutional
neural networks, are not updated during the backpropagation process.
This is done to maintain the learned characteristics
of the original dataset.
The aim is to enable the model to adapt to new
data without completely forgetting the general fea-
tures.
This is particularly useful when the new dataset is
relatively small compared to the original one.
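Conceptually, the "freeze" strategy amounts to excluding the backbone's parameters from gradient updates; the following plain PyTorch sketch illustrates the idea (Lightning Flash handles this internally, and the head prefix is an assumption).

import torch

def freeze_backbone(model: torch.nn.Module, head_prefix: str = "head"):
    # Backbone weights keep the features learned from the original
    # dataset; only parameters belonging to the head remain trainable.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)

# The optimizer then only updates the still-trainable head parameters:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)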
3.6 Final Output
Finally, after fine-tuning in Google Colab, we hosted the model file in Google Drive; with this model,
we created an API with Flask.
The frontend, developed in React, will receive the user's audio and pass it to the model's API, which
converts the audio into a JSON response with the applied modifications and interpretations.
Subsequently, the backend, developed in Node.js, will carry out the necessary actions to perform the
CRUD operations on the platform.
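The following minimal sketch illustrates such a Flask service; the endpoint name, model path, and response shape are illustrative assumptions rather than the deployed code.

import io
import torch
import torchaudio
from flask import Flask, request, jsonify
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

app = Flask(__name__)
MODEL_PATH = "fine_tuned_wav2vec2"  # hypothetical path to the Drive-hosted model
processor = Wav2Vec2Processor.from_pretrained(MODEL_PATH)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_PATH)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # The React frontend posts the recorded audio as a file upload.
    audio_bytes = io.BytesIO(request.files["audio"].read())
    waveform, sr = torchaudio.load(audio_bytes)
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = processor(waveform.squeeze().numpy(),
                       sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    text = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    # The Node.js backend consumes this JSON to run the CRUD actions.
    return jsonify({"transcription": text})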
4 EXPERIMENTS
In this section, we discuss the experiments conducted for the project, the setup used, and the results
obtained during fine-tuning and testing.
4.1 Experimental Protocol
In this part, we describe the setup on which all the experiments were performed.
The re-training of the model used Google Colab Pro. This premium version offers a Tesla V100-SXM2 GPU
with 16,160 MiB of memory, using CUDA version 12.0 and Python version 3.10.12. The Colab setup for the
fine-tuning is hosted in Google Drive.
Also, we used 5% of the general audio data to
serve as test data. This entire process is designed to
improve the model’s accuracy.
4.2 Training the Model
Potential scenarios were devised where the user in-
puts several extended sentences into the platform.
Moreover, the system was trained with potential
words to act as specific commands for a particular
platform feature.
All this data was compiled in "ogg" format. Although the model accepts different audio formats, the
documentation (Baevski et al., 2020) suggests that training audio should have two characteristics:
".wav" format and a sampling frequency of 16 kHz. Therefore, the audio files had to be converted, even
knowing that this modification could cause a small amount of information loss.
However, since the data collected is not that ex-
tensive, that loss would not be as critical.
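The following is a minimal sketch of this conversion, assuming torchaudio; any tool that produces 16 kHz ".wav" files would serve, and the file names are hypothetical.

import torchaudio

def ogg_to_wav16k(src: str, dst: str):
    waveform, sr = torchaudio.load(src)
    # Resampling to 16 kHz may discard some high-frequency detail,
    # the small information loss mentioned above.
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    torchaudio.save(dst, waveform, 16000)

ogg_to_wav16k("agregar_01.ogg", "agregar_01.wav")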
Taking the above into account, the data set was created. It is made up of two columns: "file", which
represents the location of the audio, stored in our Google Drive for easier access from Colab, and
"text", which denotes the phrase transcribed from the audio.
We started by creating an instance of a pre-trained model for Spanish language recognition:
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-spanish".
Utilizing the Lightning Flash framework provided
a highly useful tool for fine-tuning, as elucidated in
the previous section.
The training used a GPU when available, ran over 10 epochs, and adopted the "freeze" strategy, as
detailed in the preceding section.
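A sketch of this setup, following the Lightning Flash speech-recognition API (exact argument names may vary across Flash versions, and the CSV file names are hypothetical):

import torch
import flash
from flash.audio import SpeechRecognition, SpeechRecognitionData

# The "file" and "text" columns match the dataset described above.
datamodule = SpeechRecognitionData.from_csv(
    "file",
    "text",
    train_file="train.csv",
    test_file="test.csv",
    batch_size=4,
)

model = SpeechRecognition(backbone="jonatasgrosman/wav2vec2-xls-r-1b-spanish")

trainer = flash.Trainer(max_epochs=10, gpus=torch.cuda.device_count())
trainer.finetune(model, datamodule=datamodule, strategy="freeze")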
4.3 Results of the Training
Following the training phase, we conducted an evaluation using various official versions of the
Wav2Vec 2.0 model. Additionally, models pre-trained by other users from the Hugging Face community were
selected; these built upon Facebook's models and improved them using their respective datasets.
In this instance, four additional models were cho-
sen:
Official Facebook Models:
facebook/wav2vec2-large-xlsr-53-spanish
(Model 2)
facebook/wav2vec2-base-10k-voxpopuli-ft-es
(Model 3)
Community Retrained Models:
jonatasgrosman/wav2vec2-large-xlsr-53-spanish
(Model 4)
jonatasgrosman/wav2vec2-xls-r-1b-spanish
(Model 5)
For this validation, an instance of each of the models mentioned above was created to test its accuracy.
In this test, an audio sample of 50 possible com-
mands that the models could receive was taken.
To equalize conditions, the audio samples were processed exactly as the Wav2Vec 2.0 architecture
requires (Baevski et al., 2020).
This means that it was ensured that the audio had
a sampling frequency of 16 kHz and a format accept-
able for the models.
For speech recognition models, there exist specific metrics that provide a more detailed perspective on
the performance of these models.
In particular, this evaluation employed metrics such as Word Error Rate (WER), Match Error Rate (MER),
and Word Information Loss (WIL) to comprehensively assess the quality of the obtained transcriptions
(Errattahi et al., 2015).
WER (Word Error Rate): It is the most popu-
lar metric for ASR evaluation, measuring the per-
centage of incorrect words (Substitutions (S), In-
sertions (I), Deletions (D)) relative to the total
number of words (Errattahi et al., 2015).
\[ \text{WER} = \frac{S + D + I}{N_1} = \frac{S + D + I}{H + S + D} \quad (1) \]
Where:
I = total number of insertions,
D = total number of deletions,
S = total number of substitutions,
H = total number of hits (correctly recognized words),
N_1 = total number of input words.
MER (Match Error Rate): Similar to WER, but it measures the proportion of erroneous word matches,
(S + D + I)/(H + S + D + I), and is therefore bounded between 0 and 1.
WIL (Word Information Loss): Assesses the amount of information lost or preserved in the
transcription process, measuring the loss in terms of omitted, added, or changed words compared to a
reference transcription. It serves as an approximation of RIL (Relative Information Lost); however,
unlike RIL, WIL is easy to apply because it relies solely on the H, S, D, and I counts and is
expressed as (Errattahi et al., 2015):
\[ \text{WIL} = 1 - \frac{H^2}{(H + S + D)(H + S + I)} \quad (2) \]
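As an illustration, these three metrics can be computed for a single reference/hypothesis pair, for example with the jiwer library (an assumption; the paper does not name its evaluation tooling):

import jiwer

reference = "agregar catorce rellenitas de cien gramos costo dos soles"
hypothesis = "agregar catorce rellenita de cien gramo costo dos soles"

# Each metric is a fraction in [0, 1]; closer to zero means fewer errors.
print("WER:", jiwer.wer(reference, hypothesis))
print("MER:", jiwer.mer(reference, hypothesis))
print("WIL:", jiwer.wil(reference, hypothesis))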
Table 1 shows decimal numbers that represent the error rate for each metric. This means that the value
closest to zero indicates the fewest errors in the speech-to-text transcription.
Table 1: Models Evaluation.
Model WER MER WIL
Our Model 0.143 0.143 0.012
Model 2 0.447 0.447 0.126
Model 3 0.534 0.516 0.043
Model 4 0.218 0.218 0.008
Model 5 0.256 0.256 0.029
4.4 Discussion
In this subsection, we discuss the results obtained in the previous section. While the obtained results
are favorable compared to other models, it is important to note that this model is still in its early
stages. This means that this first training used data covering only a small number of products, and with
that data we cannot satisfy the entire market. However, the model has demonstrated a good ability to
understand phrases in the context presented in this research.
4.5 Fine-Tuning Process
In the development of this project, various methods
for training a model were researched.
We chose to employ a slightly more flexible method, as mentioned earlier in Section 4. However, this
method has some limitations, such as limited customization of hyperparameter settings, both in terms of
freezing or unfreezing layers and in altering their weights.
Nevertheless, it serves as a good entry point into
the field of machine learning.
The interface provided by Lightning Flash suc-
cinctly encapsulates the basic knowledge of how to
train a model.
4.6 Comparison with Other Models
As we can see in Table 1, models 2 and 3 have higher Word Error Rates, which means their transcriptions
can fail and produce text that is not properly transcribed.
However, these Facebook models are the base of models 4 and 5; those models from the Hugging Face
community take the original Facebook wav2vec2 model and retrain it with more data.
Looking at Table 1, these two models have a lower Word Error Rate than models 2 and 3.
However, the data used for the fine-tuning process of those models were audios with generic voice lines
in Spanish. When we want to transcribe audio in the context of a Peruvian convenience store, those
models cannot recognize some keywords, such as the product names. They may return a text that is
transcribed better than the original models' output, but it will still contain errors, requiring an
extra process to transform the failed transcription into an interpretation of the correct phrase.
Our model appears to achieve better results than the others; however, this does not mean it has better
precision in general. It achieves better results because of the fine-tuning process for this specific
context: as described, we retrained the model with our data to gain accuracy in the context of a
Peruvian convenience store. All the other models were trained with general Spanish-language data.
Another fact is that, depending on the audio quality and the specific pronunciation of the user, the
model may not return a phrase that is 100% correctly interpreted.
5 CONCLUSIONS
For this project, fine-tuning was performed on a pre-trained speech-to-text model to improve it in a
specific context: the phrases used in convenience stores in Peru.
The goal was to implement this improved model on an online platform, allowing users to interact with
the platform using their voice. Our motivation for this project was to help the owners of these
traditional businesses adapt to technological solutions, as a first step toward adapting their
businesses to these new tools.
Through data collection, we achieved positive results, as depicted in the experiments section,
specifically in the 'Results of the Training' subsection. Our model outperformed the base models and
pre-trained models in Spanish. Through this training with the proposed data, we achieved a Word Error
Rate of 14.3%, demonstrating its effectiveness compared to other models for this specific context.
For future works, we aim to expand the product list to a more comprehensive one commonly used in these
stores. For this initial training, conducted as the project's prototype, we utilized a reduced list of
products.
The objective is to deliver a high-quality online platform and help the independent owners of these
traditional businesses with tasks that can be performed more effectively with technological tools.
Additionally, another goal is to achieve better accuracy and reduce the Word Error Rate (WER) for the
products already trained, while reaching a similar WER for newly introduced products. Furthermore,
other kinds of models might improve our metrics (Leon-Urbano and Ugarte, 2020; Ysique-Neciosup et al.,
2022; Rodríguez et al., 2021).
REFERENCES
Aguirre-Peralta, J., Rivas-Zavala, M., and Ugarte, W.
(2023). Speech to text recognition for videogame
controlling with convolutional neural networks. In
ICPRAM, pages 948–955. SCITEPRESS.
Androutsopoulou, A., Karacapilidis, N. I., Loukis, E. N.,
and Charalabidis, Y. (2019). Transforming the com-
munication between citizens and government through
ai-guided chatbots. Gov. Inf. Q., 36(2):358–367.
Apicella, A., Isgrò, F., Pollastro, A., and Prevete, R. (2023). Adaptive filters in graph convolutional
neural networks. Pattern Recognit., 144:109867.
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020).
wav2vec 2.0: A framework for self-supervised learn-
ing of speech representations. In NeurIPS.
Errattahi, R., Hannani, A. E., and Ouahmane, H. (2015).
Automatic speech recognition errors detection and
correction: A review. In ICNLSP, volume 128 of Pro-
cedia Computer Science, pages 32–37. Elsevier.
Ghobakhloo, M., Asadi, S., Iranmanesh, M., Foroughi, B.,
Mubarak, M., and Yadegaridehkordi, E. (2023). Intel-
ligent automation implementation and corporate sus-
tainability performance: The enabling role of corpo-
rate social responsibility strategy. Technology in Soci-
ety, 74:102301.
Leon-Urbano, C. and Ugarte, W. (2020). End-to-end elec-
troencephalogram (EEG) motor imagery classification
with long short-term. In SSCI, pages 2814–2820.
IEEE.
Mallikarjuna Rao, G., Tripurari, V. S., Ayila, E., Kummam,
R., and Peetala, D. S. (2022). Smart-bot assistant for
college information system. In 2022 Second Interna-
tional Conference on Artificial Intelligence and Smart
Energy (ICAIS), pages 693–697.
Peng, J. and Bao, L. (2023). Construction of enterprise busi-
ness management analysis framework based on big
data technology. Heliyon, 9(6):e17144.
Rodríguez, M., Pastor, F., and Ugarte, W. (2021). Classification of fruit ripeness grades using a
convolutional neural network and data augmentation. In FRUCT, pages 374–380. IEEE.
Waqar, D. M., Gunawan, T. S., Kartiwi, M., and Ahmad, R.
(2021). Real-time voice-controlled game interaction
using convolutional neural networks. In 2021 IEEE
7th International Conference on Smart Instrumenta-
tion, Measurement and Applications (ICSIMA), pages
76–81.
Ysique-Neciosup, J., Chavez, N. M., and Ugarte, W. (2022).
Deephistory: A convolutional neural network for au-
tomatic animation of museum paintings. Comput. An-
imat. Virtual Worlds, 33(5).