Enhancing Gesture Recognition for Sign Language Interpretation in Challenging Environment Conditions: A Deep Learning Approach
Domenico Amalfitano, Vincenzo D’Angelo, Antonio M. Rinaldi, Cristiano Russo and Cristian Tommasino
Department of Electrical Engineering and Information Technology, University of Naples Federico II,
Via Claudio, 21, Naples 80125, Italy
cristiano.russo@unina.it, cristian.tommasino@unina.it
ORCID: D. Amalfitano 0000-0002-4761-4443, A. M. Rinaldi 0000-0001-7003-4781, C. Russo 0000-0002-8732-1733, C. Tommasino 0000-0001-9763-8745
Keywords: Gesture Recognition, Sign Language, Deep Learning, Real-Time Translation, Accessibility.
Abstract: Gesture recognition systems have gained popularity as an effective means of communication, leveraging the simplicity and effectiveness of gestures. With the absence of a universal sign language due to regional variations and limited dissemination in schools and media, there is a need for real-time translation systems to bridge the communication gap. The proposed system translates, in real time, American Sign Language (ASL), the predominant sign language used by deaf communities in North America, West Africa, and Southeast Asia. The system utilizes the SSD MobileNet FPN architecture, known for its real-time performance on low-power devices, and leverages transfer learning techniques for efficient training. Data augmentation and preprocessing procedures are applied to improve the quality of the training data. The system’s detection capability is enhanced by applying color space conversions, such as RGB to YCbCr and HSV, to improve segmentation under varying lighting conditions. Experimental results demonstrate the system’s accessibility and non-invasiveness, achieving high accuracy in recognizing ASL signs.
1 INTRODUCTION
Sign language is a crucial mode of communication for the deaf and hard-of-hearing community. It enables these individuals to express their thoughts and engage in fruitful interactions, giving a complete knowledge representation system. However, there are significant challenges to effective communication between sign language users and those who do not understand sign language (Wadhawan and Kumar, 2021; Rastgoo et al., 2021). Gesture recognition technology, an essential field within computer vision and machine learning, has the potential to bridge this gap by precisely interpreting sign language gestures in real time. Gesture recognition in sign language involves developing sophisticated systems that can analyze and comprehend the complex hand movements, facial expressions, and body postures that constitute sign language. These systems use cutting-edge
algorithms and machine learning techniques to recognize and interpret the rich visual cues in sign language gestures. Gesture recognition technology can facilitate seamless communication between sign language users and the broader community by accurately capturing and understanding these gestures (Mitra and Acharya, 2007; Khan and Ibraheem, 2012).
However, precise recognition of sign language gestures can pose significant challenges, especially under varied lighting and exposure conditions. Variations in lighting, such as harsh sunlight or dim environments, can impact the quality and visibility of the gestures, making it difficult for recognition systems to interpret them accurately. Furthermore, differing exposure levels, such as overexposed or underexposed images, can add complexity to the recognition process, leading to possible errors or misinterpretations. It is crucial to address the impact of lighting and exposure conditions on sign language recognition to develop robust and reliable gesture recognition systems. Acknowledging these challenges will allow researchers and practitioners to create algorithms and techniques resilient to varying lighting conditions and exposure levels. This will ensure accurate recognition
regardless of the environmental constraints (Suarez and Murphy, 2012). Furthermore, variations in exposure and lighting conditions can significantly affect the contrast, color, and texture of sign language gestures. Shadows, reflections, and uneven illumination can introduce noise and distortions, making the extraction of meaningful features from the visual data challenging. These variations may also affect the recognition of subtle nuances and fine-grained movements that are essential for the accurate interpretation of sign language (Suarez and Murphy, 2012).
To tackle the challenges presented by varied lighting and exposure conditions, our solution focuses on examining different color spaces to enhance the accuracy of the gesture recognition system. Traditional color spaces, such as RGB (Red-Green-Blue), may not be robust enough to handle variations in lighting conditions effectively. As a result, we propose investigating alternative color spaces, such as HSV (Hue-Saturation-Value) or YCbCr (Luminance-Blue Chrominance-Red Chrominance), which offer distinct advantages in separating color information from variations in illumination. By utilizing alternative color spaces, we aim to boost the system’s ability to differentiate between sign language gestures and their backgrounds under different lighting conditions. These color space transformations can help mitigate the effects of lighting variations, enabling more accurate feature extraction and gesture recognition. Additionally, by exploring multiple color spaces, we can tailor the system to different environments and lighting scenarios, ensuring robust performance across diverse real-world settings.

In our experimental evaluation, we compare the performance of the gesture recognition system using different color spaces under varied exposure and lighting conditions. We assess metrics such as recognition accuracy, robustness to lighting variations, and computational efficiency. By thoroughly investigating the impact of color space transformations, we aim to provide insights into the most effective color space choices for improving gesture recognition accuracy in different environments. Through this research, we want to contribute to advancing gesture recognition systems for sign language, specifically tackling the challenges posed by varied exposure and lighting conditions. By investigating different color spaces, we hope to enhance the system’s accuracy, enabling effective communication between sign language users and non-signers across a variety of real-world scenarios. This research has the potential to significantly improve accessibility and inclusivity for the deaf and hard-of-hearing community, empowering them to engage more seamlessly in a wide range of environments and lighting conditions.

The remainder of the paper is organized as follows: Section 2 provides an overview of the related works, Section 3 introduces our proposed approach, Section 4 presents the results obtained from our experiments, and finally, Section 5 provides conclusions and discusses future works.
2 RELATED WORKS
Gesture recognition, particularly in the context of sign language, has been extensively researched in computer vision and machine learning. Researchers have explored various approaches and techniques to interpret sign language gestures accurately. This section provides an overview of the related works in gesture recognition for sign language, highlighting significant contributions and advancements in the field.

One approach focuses on utilizing 3D models for gesture recognition, which provides precise results but can be computationally expensive and less efficient for real-time systems. For example, in video surveillance, facial recognition based on 3D face modeling has shown a 40% improvement in performance when reconstructing 3D facial models from non-frontal frames (Park and Jain, 2007). However, the computational complexity of 3D modeling techniques makes them less suitable for real-time applications. Simplified versions of volumetric models rely on the representation of the human skeleton, analyzing the position and orientation of its constituent segments. These skeleton-based systems focus on key parameters, resulting in faster processing times while maintaining satisfactory recognition performance. Such approaches have found practical applications in various human-computer interaction interfaces. Bidimensional models, on the other hand, extract low-level features such as color, shape, and contour directly from images, making them suitable for gesture classification systems. Researchers have extensively employed these models to classify and interpret sign language gestures accurately. Another approach to gesture recognition involves electromyography (EMG)-based models, which classify gestures by analyzing the electrical impulses generated by muscles. This technique allows for a broader range of motion, enabling more natural and expressive gestures.

Segmentation is a crucial step in the gesture recognition pipeline, involving dividing images into relevant parts for analysis. Various methodologies have been proposed to address this process. Region-based segmentation methods have been explored, including region growing and region splitting. Region growing involves selecting seed pixels representing
distinct areas and expanding them until the entire image is covered, verifying the homogeneity of each section. Region splitting adopts a top-down approach, recursively dividing the image into sub-images until only homogeneous regions remain. Thresholding, a commonly used technique, categorizes pixels into “object” or “background” based on a predetermined threshold value, resulting in a binary image. Advanced thresholding techniques handle noise and improve segmentation accuracy under challenging lighting conditions. Clustering techniques, which group elements with similar characteristics within an image, have also been applied to gesture recognition; the level of similarity is determined through distance calculations.

Shape descriptors play a vital role in recognizing objects, including gesture recognition. These descriptors offer a collection of features that describe a specific shape and can be utilized for efficient image retrieval and comparison, even in the presence of transformations such as scaling, rotation, or translation. Various methodologies for shape description and representation, such as region-based or contour-based approaches, have been proposed. Edge direction histograms are essential tools for detecting objects when color information is unavailable, describing an image’s texture. The input image is divided into blocks, and variables representing the vertical, horizontal, diagonal, or isotropic nature of the edges are associated with each block. The Harris corner detector, an operator for corner detection, identifies important points for object description and can reduce the amount of data used in processing. However, its sensitivity to scale changes limits its applicability to images of different sizes. The Scale-Invariant Feature Transform (SIFT) descriptor extracts and describes many features from images, minimizing the influence of local variations on object detection. The angular partitioning-based approach (ARP) is conducted on grayscale images, where circular sections surround the edge to ensure scale invariance, and the generated angles are measured for information extraction.

Despite progress in gesture recognition technologies, limitations persist, particularly related to the equipment used and image noise. Factors such as camera distance, resolution, and lighting conditions can affect the quality of gesture detection. Additionally, user fatigue, known as “gorilla arm” fatigue, has been observed, particularly in mid-air gestures, where users experience arm fatigue when performing gestures over extended periods.
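To make the threshold-based segmentation just described concrete, the fragment below produces a binary object/background mask with OpenCV, letting Otsu’s method pick the threshold automatically. It is purely illustrative of the surveyed technique, not part of the proposed system, and the file names are hypothetical.

import cv2

# Threshold-based segmentation: each pixel is assigned to "object" or
# "background" by comparison with a single threshold value. Otsu's method
# chooses that value automatically from the grayscale histogram.
frame = cv2.imread("hand.png")  # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
_, binary_mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("hand_mask.png", binary_mask)  # white = object, black = background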
Gesture recognition has witnessed significant advancements with the application of artificial neural networks, which are computational models inspired by biological systems (Abiodun et al., 2018). Machine-based feature extraction has proven effective in several domains (Russo et al., 2020; Rinaldi and Russo, 2020; Rinaldi et al., 2021). In fact, different types of architectures and purposes exist in the field of neural networks, each catering to specific requirements. Convolutional neural networks (CNNs) have been widely used for image and pattern recognition tasks (Madani et al., 2023; Rinaldi et al., 2020). Their architecture comprises convolutional layers, pooling layers to reduce input parameters, and fully connected layers for classification based on information derived from previous layers.

Object detection algorithms can be categorized based on the approach employed (Girshick et al., 2014). Models like R-CNN and Fast R-CNN adopt a two-stage approach: the first stage identifies possible regions of interest, and the second stage employs CNNs to detect objects within those regions. Conversely, models like YOLO and SSD utilize a fully convolutional approach, enabling single-pass detection. The former achieves higher accuracy, while the latter exhibits superior speed, making it more suitable for real-time applications. Given the need for prompt response in gesture recognition systems, single-stage approaches are favored in their implementation. Region-based convolutional networks excel in object detection tasks, distinguishing foreground objects from the background based on a region of interest. These networks aim to produce bounding boxes containing objects and specify their categories. Earlier models utilized selective search algorithms to extract regions of interest (ROIs) and subsequent convolutional operations to classify objects within the identified regions, followed by support vector machines (SVMs) for object region classification and linear regressors for bounding box refinement. However, these architectures suffered from time-consuming training due to the large number of regions identified. Subsequent advancements in object detection have led to the evolution of these models, resulting in more efficient techniques. The Fast R-CNN architecture (Girshick, 2015) directly generates feature maps from the input image, eliminating the need for region proposal stages and improving speed. Faster R-CNN (Ren et al., 2015) introduces a Region Proposal Network (RPN) that efficiently and accurately identifies regions of interest, sharing convolutional features with downstream detection networks. Region-based Fully Convolutional Networks (R-FCN) (Dai et al., 2016) further enhance detection speed by sharing computations across all region proposals. Mask R-CNN (He et al., 2017) efficiently detects objects while simultaneously generating segmentation masks for each instance. This approach replaces RoIPooling with RoIAlign for more
accurate pixel-level segmentation. The Single Shot MultiBox Detector (SSD) combines object classification and bounding box prediction in a single pass, utilizing predefined bounding boxes of varying sizes. By evaluating object categories and adapting the boxes to their shape, SSD handles objects of different scales effectively. MobileNet SSD v2 (Sandler et al., 2018), designed for real-time applications on mobile devices, achieves high-speed processing. YOLO (You Only Look Once) (Redmon et al., 2016) proposes a novel object detection approach by recognizing image regions with high probabilities of containing objects, enabling single-pass evaluation. YOLO is a global reasoning network that reasons about the entire image and all objects within it, dividing the input image into an S×S grid. Each grid cell is responsible for detecting an object if its center falls within that cell. The approach is extremely fast, capable of processing 45 frames per second, with a faster version reaching 155 frames per second. However, YOLO may exhibit more localization errors compared to other detection systems. YOLOv7 (Wang et al., 2022), introduced in 2022, outperforms previous detection models in terms of both speed and accuracy. It requires significantly less expensive hardware compared to other neural networks and is trained solely on the MS COCO dataset without the use of pre-trained models. The cited study (Bharati and Pramanik, 2020), conducted in 2020, provides a detailed performance comparison of various object detection models. It is important to note that performance depends on factors such as input image resolution, dataset, and training configurations. Model accuracy is measured using mean average precision (mAP). From the reported data, it can be inferred that SSD and R-FCN are among the fastest models but do not match the precision and accuracy of Faster R-CNN. While SSD is less affected by the choice of feature extractors, it is less accurate in detecting small objects. YOLO remains the fastest architecture, with a speed of approximately 21-155 frames per second, while Mask R-CNN exhibits the highest accuracy, with an average precision of 47.3.
3 SIGN LANGUAGE TRANSLATION SYSTEM
Our approach targets the development of a real-time American Sign Language (ASL) translation system. ASL stands as the principal sign language for deaf communities in America, Canada, and various countries in West Africa and Southeast Asia. The system captures visual input from a camera, executes gesture detection and recognition, and then displays the corresponding textual translation along with a reliability score.

Figure 1 showcases the architecture of our system, which is composed of the following key modules:
Capture Module: This module represents the eyes of the system. It uses the webcam to capture a video frame, which becomes the initial raw data for the entire translation process. The ability to effectively capture frames in different lighting conditions and at varying distances underpins its functionality.

Preprocessing Module: This module transforms the raw video frame into a format that the system can analyze more efficiently. This includes image transformations such as gamma correction, which can adjust the brightness of the image and improve visibility. Noise reduction and normalization might also occur at this stage to improve the accuracy of subsequent detection and classification tasks. Additionally, the preprocessing pipeline resizes images to dimensions of 320x320 (a minimal sketch of this step is given after the list).

Detection Module: This is where the actual sign detection happens. This module analyzes the preprocessed frame and identifies the presence of any signs. The output of this module is a set of regions in the frame where a sign is likely present, often represented by bounding boxes.

Classification Module: The identified signs are then fed into this module. Here, a class is assigned to each detected sign based on the trained model. The Classification Module’s role is to translate the identified signs into their equivalent meanings in spoken or written language.

Visualization Module: The final module takes the classified signs and presents them to the user in an easy-to-understand format. This involves displaying the corresponding textual translation and the reliability score on the screen. This module provides critical feedback to users, allowing them to understand how the system interprets their signs.
Each module plays a critical role in the system, contributing to the overall goal of effective and efficient real-time ASL translation.
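As a minimal sketch of the preprocessing step, the fragment below applies gamma correction through a lookup table and resizes the frame to the detector input size. The gamma value is an illustrative assumption rather than a value prescribed by the system, while the 320x320 target matches the detector input mentioned above.

import cv2
import numpy as np

def preprocess_frame(frame_bgr, gamma=1.5, size=(320, 320)):
    """Gamma-correct and resize a BGR webcam frame for the detector.

    The gamma value is illustrative; in practice it can be tuned to the
    lighting conditions of the capture environment.
    """
    # Map every 8-bit intensity through the gamma curve via a lookup table.
    inv_gamma = 1.0 / gamma
    table = np.array([((i / 255.0) ** inv_gamma) * 255 for i in range(256)],
                     dtype=np.uint8)
    corrected = cv2.LUT(frame_bgr, table)
    # Resize to the 320x320 input expected by the detection model.
    return cv2.resize(corrected, size)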
3.1 Sign Detection Implementation
We implemented our system using Python 3.8, drawing on the power of the TensorFlow 2 Object Detection API and the OpenCV library. The training process embraces a supervised learning approach, leveraging labeled data that we created with the open-source software LabelImg.
Figure 1: System Architecture.

Once we load the trained model, we start the detection process. The system captures the video stream from the webcam with the help of the OpenCV library and starts displaying real-time detections on the screen. Each detection comes defined with bounding boxes, class labels, and confidence scores. We have set up the system to display a maximum of two detections simultaneously. We have also established a confidence threshold that dictates the minimum score for reliable predictions. This threshold essentially functions as a filter, discarding results that lack sufficient confidence; experiments in different visual conditions have set its value. To ensure robust performance, we put the system through rigorous testing using a variety of webcams, each with different resolutions. We experimented with several color spaces and altered the positioning distance between the webcam and the target objects. We ran these tests under an array of lighting conditions and with both simple and complex backgrounds to mimic real-world scenarios. In the following section, we delve into the results of these tests. Capitalizing on the trained model and real-time video input, our system strives to accurately detect and identify the specified signs across a multitude of situations. The bounding boxes and class labels serve as intuitive visual feedback, giving users insights into how the system interprets their signs. Confidence scores, on the other hand, offer a measure of the system’s prediction certainty. These scores empower users to evaluate the reliability of the sign interpretations.
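The following is a minimal sketch of this capture-and-display loop, assuming a detector exported with the TensorFlow 2 Object Detection API, whose saved models return 'detection_boxes', 'detection_classes', and 'detection_scores'. The model path, the label map, and the 0.5 confidence threshold are placeholders rather than the exact values used in our experiments.

import cv2
import numpy as np
import tensorflow as tf

MODEL_DIR = "exported_model/saved_model"  # hypothetical export path
LABELS = {1: "Yes", 2: "No", 3: "Hello", 4: "I Love You", 5: "Thank You"}
SCORE_THRESHOLD = 0.5   # illustrative; the real value was set experimentally
MAX_DETECTIONS = 2

detect_fn = tf.saved_model.load(MODEL_DIR)

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # The TF2 Object Detection API expects a batched uint8 RGB tensor.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
    detections = detect_fn(input_tensor)

    boxes = detections["detection_boxes"][0].numpy()
    classes = detections["detection_classes"][0].numpy().astype(int)
    scores = detections["detection_scores"][0].numpy()

    h, w = frame.shape[:2]
    # Keep at most the two most confident detections above the threshold.
    for i in range(min(MAX_DETECTIONS, len(scores))):
        if scores[i] < SCORE_THRESHOLD:
            continue
        ymin, xmin, ymax, xmax = boxes[i]
        p1, p2 = (int(xmin * w), int(ymin * h)), (int(xmax * w), int(ymax * h))
        cv2.rectangle(frame, p1, p2, (0, 255, 0), 2)
        label = f"{LABELS.get(classes[i], '?')}: {scores[i] * 100:.1f}%"
        cv2.putText(frame, label, (p1[0], max(p1[1] - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    cv2.imshow("ASL translation", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()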
4 EXPERIMENTAL RESULTS
In this section, we present our experimental strategy, report the results, and discuss them in depth.
To evaluate our approach, we constructed our own dataset, recognizing that the dataset creation phase plays a crucial role in achieving satisfactory results. In this study, we selected a restricted set of signs. Such a strategy initially allows the model to learn essential sign recognition before the training set is gradually expanded. This approach can be useful when dealing with complex sign language systems, providing a stepping stone to incorporate more signs into the model progressively.
The quality and volume of the input data directly steer the detector’s accuracy. We harnessed four different webcam models to capture images for each sign. To boost system robustness, we portrayed each sign with slight variations in hand poses and used either hand interchangeably across the images.
Following the data collection, we annotated each image using the LabelImg software (Tzutalin, 2015). During this annotation process, we matched each sign with its appropriate label and highlighted the region of interest with a bounding box. Consequently, we generated an XML file specifying the image information, the assigned label, and the bounding box coordinates.
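LabelImg writes each annotation as a Pascal VOC-style XML file; the sketch below shows how such a file can be read back into label and bounding-box pairs. The file name in the usage comment is hypothetical.

import xml.etree.ElementTree as ET

def read_labelimg_annotation(xml_path):
    """Parse a Pascal VOC-style XML file produced by LabelImg.

    Returns the image file name and a list of (label, (xmin, ymin, xmax, ymax)).
    """
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = tuple(int(float(bb.findtext(k)))
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((label, coords))
    return filename, boxes

# Example usage (hypothetical file):
# name, boxes = read_labelimg_annotation("annotations/hello_001.xml")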
Constructing the dataset required careful labeling to accurately depict the nuances in sign expressions. This variation in the dataset helps ensure that our model learns to recognize and generalize different instances of the same sign effectively. Additionally, we optimized the size of the dataset, striking a balance between the requirement for ample samples and the practical limitations of data collection.
We use the annotated dataset to train the sign language translation model. By exposing the model to a wide array of sign variations, we aim to enhance its generalization capabilities and its accuracy in recognizing signs in real-world scenarios. By constructing an effective dataset, we set the stage for training a resilient and dependable sign language translation system.
Figure 2: Loss trend over epochs.
4.1 Training
To augment the efficiency of the training procedure, we leveraged the well-established technique of transfer learning (Cook et al., 2013). This approach enables adapting a pre-trained artificial intelligence model to a task distinct from its initial training objective. Specifically, we extracted the lower layers from the pre-trained network and reused them within our newly designed network to recognize a set of five distinct signs.
We elected to use the SSD MobileNet V2 FPNLite model, with an input size of 320x320 pixels, pre-trained on the COCO 2017 dataset (Lin et al., 2014). The rationale for this selection is the neural network’s impressive processing speed of 22 ms and a COCO mAP of 22.2. These performance metrics, coupled with the simplicity of the network’s architecture, which facilitates its operation even on low-power devices, made it an apt choice for our application.
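The following Keras sketch illustrates the transfer-learning idea behind this setup: the pre-trained lower (feature-extraction) layers are frozen and only a new head for the five signs is trained. It is a simplified stand-in rather than the actual fine-tuning of the SSD MobileNet V2 FPNLite detector through the TensorFlow 2 Object Detection API, and its constant learning rate ignores the cosine decay schedule described next.

import tensorflow as tf

NUM_SIGNS = 5

# Reuse the lower layers of a network pre-trained on a large dataset and
# train only a new classification head for the five target signs.
base = tf.keras.applications.MobileNetV2(
    input_shape=(320, 320, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the reused lower layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_SIGNS, activation="softmax"),
])
# Constant learning rate for brevity; the actual training uses a cosine
# decay schedule with warm-up, as described in the text.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.08, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])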
In the training phase, we employed augmentation techniques, specifically random cropping and horizontal flipping, to enhance the network’s generalization capabilities. The model’s training leverages a momentum optimizer (Ruder, 2016) featuring a cosine decay learning rate strategy (Loshchilov and Hutter, 2016). The optimizer initiates with a base learning rate of 0.08, diminishing gradually per a cosine decay schedule, thereby promoting smoother convergence and limiting the risk of overshooting. A warm-up phase within the initial 1000 steps gradually escalates the learning rate from 0.026, a measure that mitigates the risk of premature divergence. A momentum value of 0.9 encourages acceleration in the correct direction and dampens oscillations, proving beneficial in complex optimization landscapes. We employ a loss function that is a linear combination of localization and classification losses, represented in Equation 1.
L = w_1 · L_loc + w_2 · L_class    (1)
In our configuration, we deemed it appropriate to assign equivalent importance to both components, the localization loss (L_loc) and the classification loss (L_class), thus setting w_1 = w_2 = 1.0. We specifically utilize Smooth L1 as the localization loss, whereas Focal Loss, with a gamma of 2 and an alpha of 0.25, serves as our classification loss. We conducted the training phase over 20,000 epochs.
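A minimal sketch of Equation 1 is given below, pairing a Smooth L1 (Huber) localization term with a focal classification term using gamma 2 and alpha 0.25. The delta of 1.0 and the unreduced tensor shapes are illustrative assumptions; this is not the Object Detection API’s internal implementation.

import tensorflow as tf

def smooth_l1(y_true, y_pred, delta=1.0):
    """Smooth L1 (Huber) localization loss over box offsets."""
    diff = tf.abs(y_true - y_pred)
    return tf.reduce_sum(tf.where(diff < delta,
                                  0.5 * tf.square(diff),
                                  delta * (diff - 0.5 * delta)))

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """Binary focal loss over per-anchor, per-class probabilities."""
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    return tf.reduce_sum(-alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t + 1e-8))

def total_loss(loc_true, loc_pred, cls_true, cls_pred, w1=1.0, w2=1.0):
    """Equation 1: L = w1 * L_loc + w2 * L_class, with w1 = w2 = 1."""
    return w1 * smooth_l1(loc_true, loc_pred) + w2 * focal_loss(cls_true, cls_pred)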
As illustrated in Figure 2, the loss decreases on both the training and validation data, signaling the system’s excellent adaptation to the new data.
4.2 Evaluation
One of the goals of this study is to obtain a sign recognition system that gives robust predictions under different environmental conditions, such as brightness variations and background complexity. In low-light conditions, the HSV (Hue, Saturation, Value) color space is often considered more advantageous than the YCbCr (Luma, Chroma Blue, Chroma Red) color space due to the distinct characteristics of their respective components. The Value component in HSV directly represents the brightness or intensity of a color, making it particularly suitable for object detection in low-light environments. Despite reduced overall illumination, the Value component still exhibits distinguishable variations in brightness, enabling effective differentiation of objects. Conversely, YCbCr separates the color information from the brightness information, with the Luma component representing the brightness. However, the Luma component may not provide sufficient contrast for robust object detection in low-light conditions. This discrepancy arises from the fact that the Luma component is less sensitive to variations in brightness under low-light conditions, potentially leading to decreased detection accuracy. Consequently, HSV, emphasizing the Value component, is generally favored over YCbCr for improved object detection performance in low-light conditions. During tests conducted in non-optimal environmental conditions with RGB input frames, we observed that the predicted classes were correct, but their probability scores were not predominant with respect to the other classes. For these reasons, we compared RGB against HSV under bright environment conditions and RGB against YCbCr under low-light environment conditions.
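A minimal sketch of the color space conversion applied to a captured BGR webcam frame is shown below, using OpenCV’s standard conversion codes and pairing the spaces with lighting conditions as in the comparison just described (HSV against RGB in bright conditions, YCbCr against RGB in low light). How the lighting condition itself is determined is left to the caller and is an assumption of this sketch.

import cv2

def convert_color_space(frame_bgr, lighting="bright"):
    """Convert a BGR webcam frame to the color space used for detection.

    The lighting condition is assumed to be supplied by the caller; its
    detection is outside the scope of this sketch.
    """
    if lighting == "bright":
        return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    if lighting == "low_light":
        return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)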
The results of the tests conducted for each sign are presented in Table 1 and Table 2 and summarized graphically in Figure 3 and Figure 4.
For example, in tests carried out in brightly lit environments, an improvement in confidence scores was observed when converting to the HSV color space, as demonstrated in the examples shown in Figure 5, where the confidence score for the “Hello” sign increases from 56.08% to 99.24%. Similarly, in tests conducted in low-light conditions, detection quality was improved by using the YCbCr color space, as shown in Figure 6, where the confidence score increases from 51.45% to 93.29% for the “No” sign.
Table 1: Comparison of Probability Scores for RGB and HSV in low-light environments.

Sign           RGB (%)   HSV (%)
“Yes”          46.46     99.39
“No”           50.01     96.02
“Hello”        54.05     99.42
“I Love You”   49.89     96.72
“Thank You”    68.39     97.14
Table 2: Comparison of Probability Scores for RGB and YCbCr in bright environments.

Sign           RGB (%)   YCbCr (%)
“Yes”          52.09     92.95
“No”           58.30     95.90
“Hello”        50.43     99.81
“I Love You”   56.70     99.01
“Thank You”    59.92     99.03
Figure 3: Accuracy comparison between the RGB and HSV color spaces for each sign.
Figure 4: Accuracy comparison between the RGB and YCbCr color spaces for each sign.
Figure 5: Sign recognition comparison between the RGB and HSV color spaces in a bright environment. HSV outperforms RGB in this condition.
Figure 6: Sign recognition comparison between the RGB and YCbCr color spaces in a low-light environment. YCbCr outperforms RGB in this condition.
5 CONCLUSIONS
Gestures provide a simple and effective method of communication, which is why gesture recognition systems are gaining popularity. Depending on the application domain, it is important to consider the choice of technologies and training architectures carefully. While more sophisticated sensors can provide more accurate detections, they are often less affordable and accessible.
Our proposed system is capable of real-time translation of five symbols from American Sign Language (ASL), which is the predominant language among deaf communities in America, Canada, West Africa, and Southeast Asia. By analyzing the data stream from the webcam, the system displays the detections on the screen, providing bounding boxes, classes, and confidence scores. It is based on the SSD MobileNet architecture, designed to deliver real-time performance on low-power devices while achieving high levels of accuracy under optimal conditions. In challenging lighting conditions, the detection quality has been significantly improved through color space conversions, enabling better image segmentation. The implemented technology demonstrates accessibility and non-invasiveness.
Overall, this research contributes to advancing gesture recognition systems, particularly for sign languages, by leveraging Deep Learning techniques. The proposed system shows promise in real-time translation and has the potential to facilitate communication between deaf individuals and the wider community. Future work may focus on expanding the vocabulary using other knowledge sources (Caldarola et al., 2015; Muscetti et al., 2022) and improving the system’s robustness under various environmental conditions, ultimately aiming to make gesture recognition more accurate, efficient, and inclusive.
ACKNOWLEDGEMENTS

We acknowledge financial support from the PNRR MUR project PE0000013-FAIR.
REFERENCES
Abiodun, O. I., Jantan, A., Omolara, A., Dada, K. V., Mohamed, N. A., and Arshad, H. (2018). State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11):e00938.
Bharati, P. and Pramanik, A. (2020). Deep learning techniques—R-CNN to Mask R-CNN: A survey. In Computational Intelligence in Pattern Recognition: Proceedings of CIPR 2019, pages 657–668.
Caldarola, E. G., Picariello, A., and Rinaldi, A. M. (2015). Big graph-based data visualization experiences: The WordNet case study. In IC3K 2015 - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pages 104–115.
Cook, D., Feuz, K. D., and Krishnan, N. C. (2013). Transfer learning for activity recognition: A survey. Knowledge and Information Systems, 36:537–556.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, volume 29.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969.
Khan, R. Z. and Ibraheem, N. A. (2012). Hand gesture recognition: A literature review. International Journal of Artificial Intelligence & Applications, 3(4):161.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ..., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer International Publishing.
Loshchilov, I. and Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Madani, K., Rinaldi, A. M., Russo, C., and Tommasino, C. (2023). A combined approach for improving humanoid robots autonomous cognitive capabilities. Knowledge and Information Systems, 65(8):3197–3221.
Mitra, S. and Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(3):311–324.
Muscetti, M., Rinaldi, A. M., Russo, C., and Tommasino, C. (2022). Multimedia ontology population through semantic analysis and hierarchical deep features extraction techniques. Knowledge and Information Systems, 64(5):1283–1303.
Park, U. and Jain, A. K. (2007). 3D model-based face recognition in video. In Advances in Biometrics: International Conference, ICB 2007, Seoul, Korea, August 27-29, 2007, Proceedings, pages 1085–1094. Springer Berlin Heidelberg.
Rastgoo, R., Kiani, K., and Escalera, S. (2021). Sign language recognition: A deep survey. Expert Systems with Applications, 164:113794.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, volume 28.
Rinaldi, A. M. and Russo, C. (2020). A content based image retrieval approach based on multiple multimedia features descriptors in e-health environment. In 2020 IEEE International Symposium on Medical Measurements and Applications (MeMeA), pages 1–6. IEEE.
Rinaldi, A. M., Russo, C., and Tommasino, C. (2020). A knowledge-driven multimedia retrieval system based on semantics and deep features. Future Internet, 12(11):183.
Rinaldi, A. M., Russo, C., and Tommasino, C. (2021). Visual query posing in multimedia web document retrieval. In 2021 IEEE 15th International Conference on Semantic Computing (ICSC), pages 415–420. IEEE.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Russo, C., Madani, K., and Rinaldi, A. M. (2020). An unsupervised approach for knowledge construction applied to personal robots. IEEE Transactions on Cognitive and Developmental Systems, 13(1):6–15.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520.
Suarez, J. and Murphy, R. R. (2012). Hand gesture recognition with depth images: A review. In 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, pages 411–417. IEEE.
Tzutalin (2015). LabelImg. https://github.com/tzutalin/labelImg.
Wadhawan, A. and Kumar, P. (2021). Sign language recognition systems: A decade systematic literature review. Archives of Computational Methods in Engineering, 28:785–813.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.