ing of signed videos, pose estimation, multi-person and hand detection, and other human-computer interaction applications (Newell et al., 2016).
Regarding deep learning applications to SLR, most works have combined a Convolutional Neural Network (CNN) with other deep architectures, such as the Recurrent Neural Network (RNN), to improve performance on video input compared with a CNN alone.
Although CNN and RNN models, as well as their combinations, were designed long ago, most researchers, as surveyed by Rastgoo et al. (2021), have continued to use them in SLR, with only minor changes in the modalities and datasets used. For example, Wadhawan and Kumar (2020) proposed recognising static signs with a CNN on RGB images, achieving high training accuracy on ISL. In addition, Ferreira et al. (2019) presented CNN-based multi-modal learning techniques for accurate SLR, combining three specific modalities: colour and depth from a Kinect sensor, and Leap Motion data.
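To make the CNN-RNN combination concrete, the following is a minimal sketch of a CNN-LSTM video classifier in Keras. The clip length, frame size, layer widths, and ten-class output are illustrative assumptions, not the exact architecture evaluated in this paper.

```python
# Minimal CNN-LSTM sketch for video classification (illustrative only).
# Assumes clips of 30 frames at 64x64 RGB and 10 output classes (digits).
from tensorflow.keras import layers, models

def build_cnn_lstm(frames=30, height=64, width=64, channels=3, num_classes=10):
    model = models.Sequential([
        # TimeDistributed applies the same small CNN to every frame.
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu"),
                               input_shape=(frames, height, width, channels)),
        layers.TimeDistributed(layers.MaxPooling2D(2)),
        layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D(2)),
        layers.TimeDistributed(layers.Flatten()),
        # The LSTM aggregates the per-frame features over time.
        layers.LSTM(128),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_lstm()
model.summary()
```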
Our contributions are twofold. First, we introduce a new video dataset for Thai sign language recognition of digits. Our dataset contains about 63 videos for each digit, performed by 21 signers. To our knowledge, this is the first video dataset for the Thai sign language research community.
Second, we conduct a substantive study on the design and development of deep learning systems based on our dataset. Specifically, we implement and investigate four systems: CNN-Mode, CNN-LSTM, VGG-Mode, and VGG-LSTM, and compare their performance under two scenarios: (1) whole-body poses with backgrounds, and (2) hand-cropped images only, obtained as a pre-processing step (a sketch of such cropping is given below). The paper is structured as follows. Section 2 describes related work on the Thai sign language (TSL) datasets that currently exist in the Thai research community. Next, we explain our dataset, methodology, and pre-processing steps in Section 3. Section 4 discusses the steps and results of our experiments. Finally, we conclude and discuss directions for future work in Section 5.
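As a hedged illustration of the hand-cropped scenario, the sketch below crops a hand region with MediaPipe Hands. The margin, single-hand assumption, and input file name are our own illustrative choices, not necessarily the exact pre-processing used in our experiments.

```python
# Hypothetical hand-cropping pre-processing sketch using MediaPipe Hands.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def crop_hand(image_bgr, margin=20):
    """Return a crop around the first detected hand, or None if no hand is found."""
    h, w = image_bgr.shape[:2]
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1,
                        min_detection_confidence=0.5) as hands:
        results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    lms = results.multi_hand_landmarks[0].landmark
    # Landmarks are normalised to [0, 1]; convert to pixel coordinates.
    xs = [int(lm.x * w) for lm in lms]
    ys = [int(lm.y * h) for lm in lms]
    x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, w)
    y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, h)
    return image_bgr[y0:y1, x0:x1]

frame = cv2.imread("signer_frame.png")  # hypothetical input frame
hand = crop_hand(frame)
```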
2 RELATED WORK
In this section, we briefly discuss some of the Thai sign language datasets that exist at present. According to statistics on persons with disabilities in Thailand (https://dep.go.th/th/), 393,027 people, or 18.69%, had a hearing impairment and communication disability as of December 2021, representing the second leading disability type among all 2,102,384 registered disabled people. This makes communication difficult between those who can hear and the deaf and hard-of-hearing community, who communicate with sign language, which is based largely on hand gestures. Although Thai Sign Language (TSL) was initially developed from American Sign Language (ASL), its hand gestures differ from those of other countries' sign languages owing to tradition, culture, and geography. The structure of TSL consists of five parts: hand shape, hand position, hand movement, orientation of the palms in relation to the body or each other, and the signer's facial expression.
Even though TSL is the only standard sign language in Thailand, it still lacks public datasets and available signers. As a result, most Thai researchers have to build datasets on their own without the involvement of experts (see Tables 2 and 3).
Furthermore, TSL can be split into two major directions: fingerspelling and natural sign language. Fingerspelling is used for specific names, such as places, people, and objects, that cannot be signed using gestures. Chansri and Srinonchat (2016) proposed investigating hand positions in real-time situations with Kinect sensors, avoiding environmental contexts such as skin colour and background. Pariwat and Seresangtakul (2017) presented a Thai fingerspelling recognition system using global and local features with a Support Vector Machine. At the same time, Nakjai and Katanyukul (2019) employed a Histogram of Oriented Gradients (HOG) with a CNN to deal with Thai fingerspelling.
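As a hedged illustration of the HOG step, the sketch below extracts a HOG descriptor from a hand image with scikit-image. The input file, resize dimensions, and cell/block parameters are illustrative assumptions, not the settings of Nakjai and Katanyukul (2019).

```python
# Minimal HOG feature extraction sketch (illustrative only).
from skimage import io, color, transform
from skimage.feature import hog

image = io.imread("hand_sign.png")          # hypothetical input file
gray = color.rgb2gray(image)
gray = transform.resize(gray, (128, 128))   # fixed size for a consistent descriptor length

features = hog(
    gray,
    orientations=9,            # 9 gradient-orientation bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(features.shape)  # 1D descriptor, e.g. fed to an SVM or a CNN head
```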
Despite the aforementioned, most deaf and hard-of-hearing people use natural Thai sign language to communicate with each other because it is easy and fast. However, a significant problem with natural signs is that very few Thai sign language datasets exist. For example, Chaikaew et al. (2021) prepared their own dataset of five gestures, shooting 100 videos per word for a total of 500 videos, each recorded at 50 FPS in H.264 format. These data were then used to train RNN-based models: LSTM, BiLSTM, and GRU. Although their results demonstrated greater than 90% accuracy, they presented only an in-sample evaluation, and in-sample accuracy is undoubtedly higher than out-of-sample accuracy. Next, Chaikaew (2022) applied the holistic landmark API of MediaPipe to extract features, consisting of face, hand, and body landmarks, from live video capture, and then trained and compared three models on these data (see the landmark-extraction sketch below). However, neither paper reported the number of signers. Generally, a good sign recognition model should be robust to inter-signer variations in the input data, such as signing pace and signer appearance, in order to generalise well to real-world scenarios.
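For concreteness, here is a minimal sketch of holistic landmark extraction with MediaPipe's Python API, in the spirit of Chaikaew (2022). The input video name and the choice to keep only the (x, y) coordinates are our own illustrative assumptions.

```python
# Minimal MediaPipe holistic landmark extraction sketch (illustrative only).
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_landmarks(video_path):
    """Return a list of per-frame feature vectors of pose/face/hand landmarks."""
    sequences = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV delivers BGR.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frame_features = []
            # A real pipeline would zero-pad missing landmark sets to a fixed length.
            for lm_set in (results.pose_landmarks, results.face_landmarks,
                           results.left_hand_landmarks, results.right_hand_landmarks):
                if lm_set:
                    frame_features.extend([c for lm in lm_set.landmark
                                           for c in (lm.x, lm.y)])
            sequences.append(frame_features)
    cap.release()
    return sequences

features = extract_landmarks("sign_clip.mp4")  # hypothetical input video
```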