fect on the spectrogram. Afterward, we use deep learning techniques to classify the hand gestures. We use several CNN models that differ in how the input is presented; all of them, however, take spectrograms as input. The first model takes the dual-channel audio as a single-channel spectrogram image (Basic CNN). The second takes two spectrogram inputs, one from the top microphone and one from the bottom microphone, concatenates them, and feeds the result into the model (Early Fusion). The third takes the top- and bottom-channel spectrograms as two separate inputs, processes each in its own branch, and merges the branches at the end of the model (Late Fusion).
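The difference between the two fusion strategies can be made concrete with a short sketch. The PyTorch code below is only illustrative: the layer sizes, the 128x128 spectrogram resolution, and the conv_block helper are assumptions for the example, not the exact architectures described in Section 5.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Small convolution + pooling unit shared by both sketches.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class EarlyFusionCNN(nn.Module):
    """Stacks the two spectrograms as channels before a single CNN."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(conv_block(2, 16), conv_block(16, 32))
        self.classifier = nn.Linear(32 * 32 * 32, n_classes)  # for 1x128x128 inputs

    def forward(self, top, bottom):
        x = torch.cat([top, bottom], dim=1)   # fuse at the input
        return self.classifier(self.features(x).flatten(1))

class LateFusionCNN(nn.Module):
    """Processes each spectrogram in its own branch; merges at the end."""
    def __init__(self, n_classes):
        super().__init__()
        self.top_branch = nn.Sequential(conv_block(1, 16), conv_block(16, 32))
        self.bottom_branch = nn.Sequential(conv_block(1, 16), conv_block(16, 32))
        self.classifier = nn.Linear(2 * 32 * 32 * 32, n_classes)

    def forward(self, top, bottom):
        t = self.top_branch(top).flatten(1)
        b = self.bottom_branch(bottom).flatten(1)
        return self.classifier(torch.cat([t, b], dim=1))  # fuse after the branches

The only structural difference between the two sketches is where the concatenation happens: at the input in the early-fusion model, and after the per-microphone branches in the late-fusion model.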
This paper is organized as follows. Section 2 in-
troduces the previous studies on gesture recognition
using non-audible frequencies and tracking finger po-
sitions. Section 3 discusses the signal generation.
Section 4 explains the data collection and augmentation procedures and describes the dataset components. Section 5 describes the proposed Deep Learning models. Section 6 presents the results of our experiments and the accuracies achieved. Finally, Section 7 concludes and highlights future work.
2 RELATED WORKS
Multiple studies have explored using a smartphone's microphone and speaker as a sonar system. (Kim et al., 2019) propose a system that uses two smartphones: one generates an ultrasonic wave at 20 kHz, and the other uses its microphone to pick up the signals transmitted by the first phone. The emitted sound is recorded while hand gestures are performed; owing to the Doppler effect, the recorded signal varies over time according to the hand's movement and position. The recorded signal is then converted into an image using the short-time Fourier transform and fed to a convolutional neural network (CNN) to classify the hand gestures. Kim et al. reached an accuracy of 87.75%. Whereas that work requires two smartphones, one acting as a microphone and the other as a speaker, our research concentrates on performing gesture recognition with a single smartphone. Another technique, UltraGesture (Ling et al., 2020), avoids the
Doppler Effect methodology in tracking slight finger
motions and hand gestures. Instead, it uses a Chan-
nel Impulse Response (CIR) as a gesture recognition
measurement with low pass and down conversion fil-
ters. In addition, they perform a differential operation
to obtain the differential of the CIR (dCIR), which is then used as the input data. Lastly, as in (Kim et al., 2019), the authors transform the data into an image used to train a CNN model. However, in the experimental phase, the authors attached an additional speaker kit to the smartphone, which boosted the accuracy to an average of 97% for 12 hand gestures.
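Both (Kim et al., 2019) and the models in this paper classify a time-frequency image of the recorded audio rather than the raw waveform. The snippet below is a minimal sketch of that spectrogram conversion using SciPy; the sampling rate, window length, and the 17-23 kHz band of interest are illustrative assumptions, not parameters taken from either paper.

import numpy as np
from scipy.signal import spectrogram

fs = 48_000                              # assumed sampling rate
audio = np.random.randn(fs * 2)          # placeholder for a 2 s recording

# Short-time Fourier transform: turns the recording into a
# time-frequency image (frequency bins x time frames).
f, t, sxx = spectrogram(audio, fs=fs, nperseg=1024, noverlap=768)

# Keep only the near-ultrasonic band where Doppler shifts appear,
# and convert to dB so the image has a usable dynamic range.
band = (f >= 17_000) & (f <= 23_000)
image = 10 * np.log10(sxx[band] + 1e-12)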
Another interesting study is AudioGest (Ruan
et al., 2016), which uses a smartphone's built-in microphone and speaker. However, it does not rely
on any deep learning model. Instead, it relies on fil-
ters and signal processing. It can recognize various
hand gestures, including directions of the hand move-
ment, by using an audio spectrogram to approximate the hand's in-air time through direct time-interval measurement. It can also calculate the average waving range of the hand using a range ratio, and the hand's movement speed using a speed ratio (Ruan et al., 2016). According to the authors, AudioGest's denoising operation makes it nearly unaffected by human noise and signal-drifting issues, which were tested in various scenarios such as on a bus, in an office, and in a café. In terms of results, AudioGest achieves slightly better gesture recognition than (Kim et al., 2019), including recognition in noisy environments.
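The authors do not spell out the time-interval measurement in the passage above, but the idea can be illustrated with a rough sketch: treat the hand as "in the air" while the Doppler-sideband energy around the carrier stays above a noise-relative threshold. The band limits, the threshold factor, and the median noise floor below are our own assumptions for illustration, not AudioGest's actual algorithm.

import numpy as np
from scipy.signal import spectrogram

def estimate_in_air_time(audio, fs=48_000, carrier=20_000, band=500, factor=3.0):
    # Sideband energy around the carrier, per time frame.
    f, t, sxx = spectrogram(audio, fs=fs, nperseg=2048, noverlap=1536)
    sidebands = (np.abs(f - carrier) > 25) & (np.abs(f - carrier) < band)
    energy = sxx[sidebands].sum(axis=0)
    # Frames whose sideband energy clearly exceeds the noise floor
    # are counted as part of the gesture.
    active = energy > factor * np.median(energy)
    if not active.any():
        return 0.0
    return float(t[active].max() - t[active].min())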
Sonar sensing can also be used as a finger tracking
system. FingerIO (Nandakumar et al., 2016) presents
a finger-tracking system that allows users to interact with their smartphone or smartwatch through their fingers, even if the finger is occluded or on a different plane from the interaction surface. The device's speakers emit inaudible sound waves (18-20 kHz); the signals are reflected from the finger and recorded through the device's microphones. For the sake of relevance, this discussion focuses on the smartphone case only. FingerIO's operation is based on two stages: transmitting the sound from the mobile device's speakers, then measuring the distance to the moving finger with the device's microphones. On the speaker side, it uses orthogonal frequency-division multiplexing (OFDM) to accurately identify the beginning of the echo and thereby improve finger-tracking accuracy. In terms of results, without occlusions FingerIO attains an average 2D tracking accuracy of 8 mm. The interactive surface works effectively (error within 1 cm) over an area of 0.5 m²; beyond that range, the margin of error increases to 3 cm. With occlusions, the average 2D tracking error rises to 1 cm. In addition, FingerIO requires an initial gesture before attempting any gesture to avoid false positives, which worked 95% of