the impact, quality and efficiency of online teaching.
Methods capable of quantifying the attention level
of a student based on video information from online
lectures can form the basis of a system that is able
to help instructors adapt their style and methodology
throughout a lecture (Robal et al., 2018).
1.1 Related Work
Attention tracking is a complex task that has been
extensively studied, with several successful methods
being developed in recent years (Massé et al., 2017;
Robal et al., 2018). Smith et al. (Smith et al., 2003)
proposed a system for analyzing human driver visual
attention based on global motion and color statistics.
The system computes eye blinking, occlusion and ro-
tation information to determine with reasonable suc-
cess the driver’s visual attention level.
Another interesting approach is introduced in
(Eriksson and Anna, 2015) where the authors tried
to detect if a student is attentive or not by using
two distinct face detection methods implemented in
OpenCV: the Viola-Jones method and the multi-block
local binary pattern. Both algorithms obtained similar
values for sensitivity and precision on images where
the subjects were required to only look towards the
front of the lecture hall. This limited their approach,
as it produced a high number of false positives when
subjects repeatedly tilted their faces downwards. The
system thus labels a subject as inattentive whenever
he or she looks down, which is not always the case,
since students in such situations may simply be
taking notes.
Robal et al. (Robal et al., 2018) proposed an eye
tracking system to detect the position of the subject’s
eyes in order to quantify attention. Two different
approaches were tested: a hardware eye tracker
(Tobii) and two software trackers, namely WebGazer.js
(Papoutsaki, 2015) and tracking.js (TJS) (Lundgren
et al., 2015). The hardware-based system achieved
the highest accuracy (68.2%), followed by TJS with a
recorded average performance of 58.6%.
Another approach was proposed in (Deng and Wu,
2018), where a comparison between different combi-
nations of machine learning algorithms, such as Prin-
cipal Component Analysis, Gabor feature extraction,
K-nearest neighbors, Naive Bayes and Support Vec-
tor Machine (SVM) is presented. The most accurate
combination was found to be Gabor feature extraction
and SVM, with an accuracy of 93.1%.
Deep learning is a class of machine learning
algorithms based on deep network topologies and used
to solve complex problems. Over the last decade it has
gained considerable attention and has been applied in
computer vision with a high degree of success (Pak and
Kim, 2017).
The most distinctive characteristic of a deep
learning approach is that it can automatically extract
the image features best suited for recognition directly
through training, without the need for domain expertise
or hard-coded feature extraction. Convolutional neural
networks (CNNs) are special types of artificial neu-
ral networks that satisfy the deep learning paradigm
and have won numerous image recognition competi-
tions in recent years (Krizhevsky et al., 2012; Szegedy
et al., 2015). Schulc et al. (Schulc et al., 2019) devel-
oped a deep learning approach that detects attention
and non-attention to commercials using webcam and
mobile devices. The model combined a CNN with
long short-term memory (Hochreiter and Schmidhu-
ber, 1997) and achieved an accuracy of 75%.
For the past few decades, banks of Gabor filters
have been widely used in computer vision for extract-
ing features in face recognition tasks. Each filter is
based on a sinusoidal carrier with a particular
frequency and orientation, modulated by a Gaussian
envelope, which allows it to extract information from
both the spatial and frequency domains of an image.
A few works have explored the integration of Gabor
filters and CNN with promising results (Alekseev and
Bobe, 2019).
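As a concrete illustration, a single real-valued Gabor kernel with a given wavelength and orientation can be generated with a few lines of NumPy; the parameter values below are arbitrary and purely illustrative, not those used in any of the cited works:

```python
import numpy as np

def gabor_kernel(size=21, wavelength=8.0, theta=0.0,
                 sigma=4.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter: a sinusoid of the given
    wavelength and orientation, windowed by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinate grid by the filter orientation theta
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + psi)
    return envelope * carrier

# A small bank covering four orientations, as is common for face features
bank = [gabor_kernel(theta=t)
        for t in np.linspace(0, np.pi, 4, endpoint=False)]
```

Convolving an image with each kernel in such a bank yields one response map per orientation, which is the feature-extraction role Gabor filters play in the works cited above.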
In this paper, we propose a system that uses a
CNN for the task of detecting attentive and
non-attentive states during online learning. The input
images are first passed through a Gabor filter, which
extracts intrinsic facial features; its output serves
as the input to the CNN. The last layer is an SVM that
predicts the label. Our contribution is as follows: (i)
we have built a dataset containing images from real
online lectures; (ii) we have shown that a convolutional
neural network can be effectively applied to the
task of quantifying attention; (iii) we have shown
that our method has significantly better performance
when compared to other approaches or well-known
convolutional neural network models such as AlexNet
(Krizhevsky et al., 2012) and GoogLeNet (Szegedy
et al., 2015).
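The overall flow of such a Gabor-then-CNN-then-SVM pipeline can be sketched in deliberately simplified form as follows; the random image, the single convolutional stage, and the untrained linear SVM weights are illustrative stand-ins, not the actual model described in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def gabor(size=7, wavelength=4.0, theta=0.0, sigma=2.0):
    """Minimal real Gabor kernel (Gaussian-windowed sinusoid)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)
    return (np.exp(-(x**2 + y**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * x_r / wavelength))

# 1. Gabor stage: filter the face image at several orientations
image = rng.random((32, 32))  # stand-in for a cropped face image
responses = [conv2d_valid(image, gabor(theta=t))
             for t in (0.0, np.pi / 4, np.pi / 2)]

# 2. CNN stage (illustrative): one conv filter + ReLU + global average pool
conv_filter = rng.standard_normal((3, 3))
features = np.array([conv2d_valid(r, conv_filter).clip(min=0).mean()
                     for r in responses])

# 3. SVM stage: a linear decision function sign(w . f + b)
w = rng.standard_normal(features.shape[0])  # placeholder weights
b = 0.0
label = "attentive" if features @ w + b > 0 else "not attentive"
```

In the actual system the convolutional stage is a trained multi-layer CNN and the SVM weights are learned, but the data flow from filtered image to feature vector to binary label follows this shape.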
The rest of the paper is organized as follows. In
the next section we discuss CNNs, Gabor filters and
the SVM. Section 3 describes our dataset, highlights
our proposed solution and reports the obtained results.
A discussion with some conclusions follows in the
last section.
ICAART 2022 - 14th International Conference on Agents and Artificial Intelligence