A Webcam Artificial Intelligence-Based Gaze-Tracking Algorithm

Saul Figueroa (1,a), Israel Pineda (2,b), Paulina Vizcaíno (3,c), Iván Reyes-Chacón (3,d) and Manuel Eugenio Morocho-Cayamcela (1,2,e)

1 Yachay Tech University, School of Mathematical and Computational Sciences, DeepARC Research Group, Hda. San José s/n y Proyecto Yachay, Urcuquí, 100119, Ecuador
2 Universidad San Francisco de Quito, College of Science and Engineering, Quito, 170901, Ecuador
3 Universidad Internacional del Ecuador, Faculty of Technical Sciences, School of Computer Science, Quito, 170411, Ecuador

a https://orcid.org/0000-0001-9395-5903
b https://orcid.org/0000-0002-3950-2169
c https://orcid.org/0000-0001-9575-3539
d https://orcid.org/0009-0002-2731-5531
e https://orcid.org/0000-0002-4705-7923
Keywords:
Gaze-Tracking, Artificial Intelligence, Human-Computer Interaction, Visual Attention, Gaze Prediction.

Abstract:
Technological advancements supporting human-computer interaction have had a considerable impact in recent years. However, most of these technologies are expensive, so a webcam-based gaze-tracking system represents a cost-effective alternative. Gaze tracking follows the direction of the gaze and estimates its coordinates on a computer screen in order to follow the user's visual attention. This research presents a gaze estimation approach that predicts the user's gaze direction with a webcam artificial intelligence-based gaze-tracking algorithm. The purpose of this paper is to train a convolutional neural network capable of predicting a 3D gaze vector and then estimating the 2D gaze position on a computer screen. Three steps are followed: 1) pre-process the input by cropping facial and eye images from the MPIIFaceGaze dataset; 2) train a customized network, based on a ResNet-50 pre-trained on ImageNet, to predict gaze vectors; and 3) convert the predicted 3D gaze vectors to 2D points of gaze on the screen. The results demonstrate that our model outperforms the state-of-the-art VGG-16 model on the same dataset by up to 33%. Data and source code are available at: https://github.com/SaulFigue/Gaze-tracking-pipeline.git.
1 INTRODUCTION
The study of human visual attention is one of the motivations for analyzing eye movements and estimating gaze with computer vision techniques. The gaze-tracking technique aims to track and estimate the gaze position from the facial and eye regions of an individual (Liu et al., 2022). Understanding the cognitive process behind people's attention and how they react to a specific task helps to comprehend visual attention. In addition, facial expressions are a dependent variable that improves the analysis of this computer vision task (Dilini et al., 2021).
Although eye-tracking and gaze-tracking techniques can work together in multiple applications, we decided to develop a gaze-tracking algorithm. One of the reasons that supports our selection is that most eye-tracking systems need specific hardware devices (Holmqvist and Andersson, 2017), which are highly expensive and invasive to acquire (Krafka et al., 2016). On the other hand, there exist gaze-tracking systems that use default laptop webcams (a few of them are described in Section 2), which represent a low-cost way to track and estimate gaze direction. Furthermore, gaze-tracking systems involve more features than the eye regions when estimating gaze. Incorporating head orientation into the gaze estimation model complements gaze predictions by considering head pose information such as the head declination (Cazzato et al., 2020).
Currently, new artificial intelligence methods have
been introduced into gaze-tracking systems to im-
prove the accuracy of gaze-estimation prediction and
its analysis. Deep learning algorithms are usually
found in the literature (de Lope and Graña, 2022),
and multiple variations of convolutional neural net-
work (CNN) methods are implemented for tracking
the gaze (Chen et al., 2020).
Gaze estimation is a computer vision task that involves converting gaze directions into screen positions. This technique is useful in real-time, webcam-based artificial intelligence approaches, where it predicts the human gaze direction with significant accuracy (Ou et al., 2021).
This paper proposes a deep learning-based gaze-tracking algorithm that estimates gaze direction using a ResNet-50 model pre-trained on ImageNet and the publicly available MPIIFaceGaze gaze-tracking dataset. Furthermore, our proposed model is compared against the benchmark model proposed in a study on the efficiency of real-time webcam gaze tracking (Gudi et al., 2020) to test the effectiveness of our gaze-estimation system.
2 RELATED WORKS
2.1 Human Gaze Estimation Approach
Gaze-tracking systems rely on many facial and eye features. The eye region is one of the main components that helps developers build gaze-tracking systems. There are two types of human gaze estimation approaches.
2.1.1 Model-Based Techniques
Model-based gaze estimation aims to track the gaze direction through 3D eye models built from geometric eye features. Gaze points are obtained by calculating the optical and visual axes of the eyes (Modi and Singh, 2021). Some researchers have obtained desirable results with this technique, for example, by estimating gaze direction from 3D eye models and geometric information (Kaur et al., 2022).
Modeling refraction at the pupil is a complex task. Therefore, a 3D gaze estimation method based on iris features is a potential alternative (Liu et al., 2020). To estimate the gaze direction, the 3D optical axis (OA) of the human eye is reconstructed from the iris and cornea centers. The visual axis (VA) is also needed to calculate the angle between the OA and VA, known as the kappa angle. This process allows drawing the line of sight using a single light source and a single camera.
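As a minimal sketch of this geometry (not the reconstruction procedure of Liu et al.), the kappa angle can be computed as the angle between the two unit direction vectors; the OA and VA values below are hypothetical examples in camera coordinates.

import numpy as np

def angle_between_deg(u, v):
    # Angle in degrees between two 3D direction vectors.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

# Hypothetical optical-axis (OA) and visual-axis (VA) directions.
optical_axis = np.array([0.05, -0.10, -0.99])
visual_axis = np.array([0.02, -0.07, -0.997])
kappa = angle_between_deg(optical_axis, visual_axis)   # about 2.5 degrees for these values
print(f"kappa angle: {kappa:.2f} degrees")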
2.1.2 Appearance-Based Techniques
Machine learning and deep learning are the two most commonly used techniques for building appearance-based gaze estimation systems (Modi and Singh, 2021).
Conventional Appearance-Based Methods.
Conventional appearance-based methods include a multilevel Histogram of Oriented Gradients (HOG) approach that infers gaze from extracted eye-appearance features followed by a personal calibration (Martinez et al., 2012). An Adaptive Linear Regression (ALR) method that maps high-dimensional eye image features to low-dimensional gaze positions is another approach, where a sub-pixel alignment method helps to solve the face-head motion problem. By this means, the computational cost can be reduced significantly. Additionally, using fixed, smaller training sample sets can help achieve high accuracy, addressing some conventional appearance-based challenges (Lu et al., 2014).
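As an illustration of the conventional appearance-based idea (a sketch only, not the ALR of Lu et al. nor the multilevel HOG of Martinez et al.), eye-appearance features can be extracted with scikit-image's HOG descriptor and mapped to on-screen gaze positions with a simple linear regressor; the eye patches and gaze targets below are random placeholders.

import numpy as np
from skimage.feature import hog
from sklearn.linear_model import Ridge

def eye_to_hog(eye_gray):
    # HOG descriptor for a grayscale eye patch (here 36 x 60 pixels).
    return hog(eye_gray, orientations=9, pixels_per_cell=(6, 6),
               cells_per_block=(2, 2), feature_vector=True)

rng = np.random.default_rng(0)
eyes = rng.random((50, 36, 60))               # stand-ins for real eye crops
gaze_xy = rng.random((50, 2)) * [1920, 1080]  # stand-ins for on-screen gaze points

X = np.stack([eye_to_hog(e) for e in eyes])
regressor = Ridge(alpha=1.0).fit(X, gaze_xy)  # appearance features -> gaze position
predicted_xy = regressor.predict(X[:1])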
Appearance-Based Methods with Deep Learning.
The Gaze-Net system, trained on MPIIGaze and re-trained by transfer learning on the ColumbiaGaze dataset, is a capsule network approach for appearance-based gaze estimation (Mahanama et al., 2020). This network receives eye images as input and returns two results: a reconstructed image and gaze estimation coordinates. The Gaze-Net architecture performs a classification training process with six classes of gaze direction. The transfer learning process re-trains the gaze estimator with accurate results, based on the finding that a single eye image is enough to estimate gaze reliably. The same conclusion was reached by a study on the efficiency of real-time webcam gaze tracking (Gudi et al., 2020). Their work aimed to obtain the best performance in terms of speed and effort for estimating gaze. A CNN is used to process face, eyes, and single-eye input images. In addition, the authors found that geometric regression calibration performs better than machine learning and hybrid geometric approaches. A VGG-16 neural network pre-trained on ImageNet is used to calculate gaze vectors, and the system was tested with the MPIIFaceGaze and EYEDIAP datasets. One key finding in Gudi et al.'s work is that using single-eye images (left or right) performs better than using both left and right eye images, since it improves computational speed and reduces calibration effort.
2.2 Gaze-Tracking Applications
2.2.1 Human-Computer Interaction (HCI)
One of the most attractive applications of gaze tracking is using gaze and head movement to control computer functions without using the hands. A proper implementation of this idea uses a face detection algorithm and the iBUG 300-W dataset as its core. Developing a simple mouse assistance system for motor-disabled people represents a significant advance in HCI (Anantha Prabha et al., 2022).
One step further is to combine mouse and keyboard functions to develop a real-time HCI gaze-tracking system. This approach can be handled in two principal steps: eye state (opened or closed) recognition and a CNN gaze estimation model that predicts the point of gaze (x, y) on the screen (Huang et al., 2021).
2.2.2 Visual Attention
During the last few years, interest in studying the visual attention behavior of people of different ages has increased. A related work investigated the influence of textual and graphical information in images on the visual attention of 30 participants divided into three age groups. A real-time, low-cost webcam gaze-tracking system helped to determine that younger users tend to pay more attention to graphical information, while older ones lean toward textual information (Sabab et al., 2022). On the other hand, the OWLET architecture is capable of estimating the visual attention of 127 7-month-old infants by following a face, eye, and pupil extraction process, ecologically grounded gaze direction, and a point-of-gaze estimation process from video recordings of the infant (Werchan et al., 2022).
2.2.3 Cognitive Processes
Multiple researchers have focused their studies on investigating cognitive processes in the educational field through eye- and gaze-tracking approaches. For example, eye tracking and artificial intelligence have been used to investigate students' motivation while watching Massive Open Online Course (MOOC) lectures (Sharma et al., 2020).
Looking forward, gaze-tracking systems can work as a cognitive analysis tool to examine the relation between students' engagement and facial emotion, in combination with eye- and head-movement information, to help teachers make the learning environment effective (Sharma et al., 2023).
3 SYSTEM MODEL AND
METHODOLOGY
This section provides the detailed implementation properties needed to execute the gaze-tracking system. Additionally, the pipeline of the proposed method is based on the benchmark paper to allow further comparisons. All the hyper-parameters and architectures used to perform this task are presented.
3.1 Dataset
We use the public MPIIFaceGaze (Zhang et al., 2017) dataset to train the model. It consists of 37,667 face images of 448 × 448 pixels from 15 participants. Figure 1 illustrates a sample of facial images from the MPIIFaceGaze dataset.
Figure 1: MPIIFaceGaze dataset sample images of the fa-
cial region for each participant.
For every participant, 2D and 3D gaze annotations are provided. The relevance of this dataset lies in the variety of illumination, skin color, eye accessories, and facial features across participant images. All of the images were taken with a laptop webcam, simulating real-world scenarios. Moreover, the MPIIFaceGaze dataset provides an annotation text file for each participant containing the gaze location in screen coordinates, facial landmark points, the 3D head pose and extrinsic parameters in camera coordinates, the gaze origin, and the 3D gaze target location, which are among the most used pieces of information in this work.
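The following sketch illustrates how one might load a participant's annotation file; the column layout (2D screen gaze point, six landmarks, head pose, 3D gaze origin and target, evaluated eye) is our reading of the dataset description, so the exact indices should be verified against the official MPIIFaceGaze readme.

from pathlib import Path
import numpy as np

def load_annotations(txt_path):
    # Parse one participant's annotation file into a list of samples.
    samples = []
    for line in Path(txt_path).read_text().splitlines():
        tokens = line.split()
        values = np.array(tokens[1:-1], dtype=float)   # last token marks the evaluated eye
        samples.append({
            "image": tokens[0],                        # relative image path
            "gaze_screen_px": values[0:2],             # 2D point of gaze on the screen
            "landmarks": values[2:14].reshape(6, 2),   # six facial landmark points
            "head_pose": values[14:20],                # rotation (3) + translation (3)
            "gaze_origin_3d": values[20:23],           # gaze origin in camera coordinates
            "gaze_target_3d": values[23:26],           # 3D gaze target in camera coordinates
            "eye": tokens[-1],
        })
    return samples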
3.2 Proposed Model
The pipeline proposed for the gaze estimation system
is presented in Figure 2. Three main building blocks
are needed to develop the gaze estimation task based
on CNN architectures.
Figure 2: Complete pipeline representation for the artificial intelligence-based gaze-tracking algorithm.
3.2.1 Pre-Processing Block
In this first building block, facial images are processed to obtain normalized and cropped images before moving to the CNN gaze predictor. Using the annotation information provided by the dataset, facial features and eye regions can be located easily. Since some of the images in the dataset present head-pose inclination, we perform an alignment process using the six landmark points and a 3D face model to build rotation and translation matrices. Moreover, before dividing the complete dataset into three subsets (training, validation, and testing), we remove erroneous samples whose gaze points fall outside the screen. We then split by participant: of the 15 participants, 13 are used for training, and the remaining two are used for validation and testing, respectively. These processes yield the final normalized face and eye crops that are fed into the CNN model.
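A minimal sketch of these two steps is shown below: head alignment via OpenCV's solvePnP with a generic six-point 3D face model (the model coordinates here are illustrative placeholders, not the ones used in our pipeline), and the leave-participants-out split of 13/1/1.

import cv2
import numpy as np

# Generic 3D positions (mm) of the six landmarks (eye and mouth corners); placeholder values.
FACE_MODEL_3D = np.array([
    [-45.0, -30.0, 30.0], [-20.0, -30.0, 30.0],   # right-eye corners
    [ 20.0, -30.0, 30.0], [ 45.0, -30.0, 30.0],   # left-eye corners
    [-25.0,  30.0, 30.0], [ 25.0,  30.0, 30.0],   # mouth corners
], dtype=np.float64)

def head_rotation_translation(landmarks_2d, camera_matrix):
    # Estimate head rotation and translation w.r.t. the camera from the 6 image landmarks.
    ok, rvec, tvec = cv2.solvePnP(FACE_MODEL_3D, landmarks_2d.astype(np.float64),
                                  camera_matrix, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec

# Participant-wise split: 13 for training, 1 for validation, 1 for testing.
participants = [f"p{i:02d}" for i in range(15)]
train_ids, val_ids, test_ids = participants[:13], [participants[13]], [participants[14]]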
3.2.2 CNN Gaze Predictor Model
In contrast with the benchmark paper, the pre-trained VGG-16 architecture is replaced by a ResNet-50 model pre-trained on ImageNet to predict pitch and yaw angles with respect to the camera from the pre-processed output images. For a fair comparison between the two models, the same pre-processed MPIIFaceGaze dataset is used for training. Figure 3 illustrates the proposed face CNN architecture; a similar architecture is used for the eyes CNN model.
In addition, the complete architecture is presented in Figure 4, where L-Eye CNN and R-Eye CNN represent the proposed CNN architecture for the eye images; Eye CNN is the concatenated model of the L- and R-Eye CNNs; FF Eye CNN is the feature-fusion process of the Eye CNN block; FC Eye is the fully connected layer for the FF Eye CNN block; Face CNN is the proposed architecture for facial images; FC Face is the fully connected layer for the Face CNN block; FC Concatenated is the combined block obtained after concatenating the face and eye fully connected layers; and FC Face-Eye is the final fully connected layer of the whole model.
Figure 3: Proposed CNN model for gaze direction prediction: facial images are fed to a pre-trained ResNet-50 model truncated after its first two Bottleneck stage blocks, followed by two customized convolutional blocks plus flatten and linear layers.
Figure 4: Proposed forward architecture for gaze estimation.
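A minimal PyTorch sketch of the Face CNN branch is given below, assuming the trunk keeps the ResNet-50 layers up to and including its second bottleneck stage (layer2); the widths of the two custom convolutional blocks and of the linear head are assumptions, and the eye branches and the fusion/concatenation blocks of Figure 4 are omitted for brevity.

import torch
import torch.nn as nn
from torchvision import models

class FaceGazeCNN(nn.Module):
    # Face branch: truncated pre-trained ResNet-50 + two custom conv blocks + regression head.
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # conv1/bn/relu/maxpool + the first two bottleneck stages (output: 512 channels).
        self.trunk = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                   backbone.maxpool, backbone.layer1, backbone.layer2)
        self.custom = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, 2))   # pitch and yaw

    def forward(self, face):          # face: (B, 3, 96, 96)
        return self.head(self.custom(self.trunk(face)))

# Sanity check: FaceGazeCNN()(torch.randn(2, 3, 96, 96)).shape -> torch.Size([2, 2])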
For the training process, the following parameters are considered: 96 × 96 pixel pre-processed facial images from the MPIIFaceGaze dataset; 64 × 96 pixel pre-processed eye images from the MPIIFaceGaze dataset; 50 epochs; a learning rate of 10^-5; and the Adam optimizer.
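Under the stated hyper-parameters, the optimizer and training loop might look like the sketch below; the loss function is not specified in the text, so a plain L1 loss on the pitch/yaw pair is assumed here.

import torch
import torch.nn as nn

model = FaceGazeCNN()                                  # face-branch sketch from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.L1Loss()                                # assumed; not stated in the paper

def train(loader, epochs=50, device="cpu"):
    model.to(device).train()
    for _ in range(epochs):
        for faces, gaze_pitchyaw in loader:            # faces: (B,3,96,96); targets: (B,2)
            faces, gaze_pitchyaw = faces.to(device), gaze_pitchyaw.to(device)
            optimizer.zero_grad()
            loss = criterion(model(faces), gaze_pitchyaw)
            loss.backward()
            optimizer.step()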
The ground-truth gaze vector is compared with the predicted one to analyze how accurate the model is. The mean angular error of the gaze vector for each participant is the principal performance metric to be evaluated. Let k be the number of sample images with randomly generated gaze target points; random calibration (RC) samples describe the randomly positioned targets over the sample images, determined by the value of k. Furthermore, comparing the benchmark and proposed model performances gives an idea of how accurate our proposed model is compared to previous literature.
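The mean angular error can be computed by converting each pitch/yaw pair to a unit 3D gaze vector and measuring the angle to the ground-truth vector; the pitch/yaw-to-vector convention below is the one commonly used for MPIIGaze-style normalized data and must match the dataset's convention.

import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    # Unit 3D gaze vector from pitch/yaw in radians (MPIIGaze-style convention assumed).
    return np.array([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)])

def mean_angular_error_deg(pred, true):
    # pred, true: arrays of shape (N, 2) holding (pitch, yaw) in radians.
    errors = []
    for (pp, py), (tp, ty) in zip(pred, true):
        u, v = pitchyaw_to_vector(pp, py), pitchyaw_to_vector(tp, ty)
        errors.append(np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))))
    return float(np.mean(errors))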
3.2.3 Gaze-to-Screen Mapping Method
To translate the gaze vector coordinates to the screen, a geometric relation is applied between the camera and screen coordinate systems. For this process, the predicted gaze vector in camera coordinates must be calibrated to gaze points in screen coordinates.
Screen Calibration. To perform this task, we follow the geometry-based calibration method used in the benchmark paper as a guide. First, the individual's eye location is estimated in the screen coordinate system (x, y, z) by having them fixate their gaze on a specific (x, y) point on the screen, with z taken as the distance between the individual and the screen. The subject is then asked to look at specific points on the screen while the CNN gaze predictor records eye movements and gaze directions. In addition, we assume that the camera's position is the default laptop webcam location (top-center). Once the pitch angle is obtained from this process, the rotation matrix R and translation vector T can be calculated with the following equations:
R = [ 1      0         0      ]
    [ 0   cos(ρ)   -sin(ρ)    ]     (1)
    [ 0   sin(ρ)    cos(ρ)    ]

T = e_scs - R · e_ccs     (2)

where ρ is the estimated pitch angle, e_scs is the eye location in the screen coordinate system, and e_ccs is the eye location in the camera coordinate system obtained from the eye landmark points. Next, any eye location in the screen coordinate system can be obtained by using the same rotation and translation:

ê_scs = R · ê_ccs + T     (3)

where ê_scs represents any eye location in the screen coordinate system and ê_ccs the corresponding eye location in the camera coordinate system. Finally, from ê_scs we can extract the mapped values x̂_scs and ŷ_scs.
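A small sketch of this mapping, following Equations (1)-(3) as written above, is given below; the pitch angle and eye locations are hypothetical calibration values in millimetres.

import numpy as np

def rotation_from_pitch(rho):
    # Rotation about the camera x-axis by the estimated pitch angle rho (Eq. 1).
    c, s = np.cos(rho), np.sin(rho)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def calibrate(rho, eye_scs, eye_ccs):
    # Translation between camera and screen coordinate systems (Eq. 2).
    R = rotation_from_pitch(rho)
    return R, eye_scs - R @ eye_ccs

def to_screen(eye_ccs, R, T):
    # Map any eye location from camera to screen coordinates (Eq. 3).
    return R @ eye_ccs + T

# Hypothetical calibration sample: the eye fixates a known on-screen point at roughly 55 cm.
R, T = calibrate(rho=0.12,
                 eye_scs=np.array([160.0, 95.0, 550.0]),    # screen coordinate system (mm)
                 eye_ccs=np.array([-10.0, 120.0, 540.0]))   # camera coordinate system (mm)
x_hat, y_hat, _ = to_screen(np.array([-8.0, 118.0, 545.0]), R, T)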
4 RESULTS AND DISCUSSIONS
4.1 Mean Angular Error Analysis
Table 1 presents the mean angular error for each one
of the participants evaluated on both models.
Table 1: Mean angular error (degrees) comparison between the benchmark network (VGG-16 architecture) and the proposed network (ResNet-50 architecture) with k = 9 and k = 128 random calibration (RC) samples.

Participant    Benchmark              Proposed
               RC k=9    RC k=128     RC k=9    RC k=128
p00 1.780 1.676 2.102 1.997
p01 1.899 1.777 2.994 2.845
p02 1.910 1.790 1.266 1.193
p03 2.924 2.729 1.879 1.767
p04 2.355 2.239 1.411 1.341
p05 1.836 1.720 1.358 1.284
p06 2.569 2.464 1.769 1.705
p07 3.823 3.599 1.994 1.883
p08 3.778 3.508 2.454 2.304
p09 2.695 2.528 1.313 1.234
p10 3.241 3.126 1.506 1.431
p11 2.668 2.535 1.434 1.365
p12 2.204 1.877 1.391 1.240
p13 2.914 2.753 1.650 1.547
p14 2.161 2.010 1.359 1.280
mean 2.584 2.422 1.725 1.628
The results demonstrate an improvement in terms of mean angular error achieved by the proposed architecture. For k = 9 RC samples, the benchmark result of 2.584 degrees was improved by up to 33.24%, down to 1.725 degrees. In the same way,
for k = 128 RC samples, the benchmark result of 2.422 degrees was improved by up to 32.78%, down to 1.628 degrees.
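These percentages follow from the relative reduction in mean angular error reported in Table 1:

(2.584 - 1.725) / 2.584 × 100% ≈ 33.24% for k = 9, and (2.422 - 1.628) / 2.422 × 100% ≈ 32.78% for k = 128.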
4.2 Pitch and Yaw Prediction Analysis
A comparison of the pitch and yaw values predicted by both models is presented in Figure 5.
Pitch and yaw predictions are key outputs of the gaze estimation system. To evaluate these results, scatter plots relating predictions to ground truth for both models are given for one participant as a sample test. Figure 5 plots the predicted pitch and yaw values against the ground truth for participant 13. The main difference between the two models is that the proposed model, based on the ResNet-50 architecture, performed better than the benchmark model based on the VGG-16 architecture. This difference is visible in the pitch plots of Figure 5a and Figure 5b, where the benchmark (Figure 5a) shows a weaker relation between the two axes, while the proposed model (Figure 5b) keeps a tighter relation. The same interpretation holds for the yaw plots in Figure 5c and Figure 5d, where our proposed model (Figure 5d) presents a sharper relation than the benchmark plot (Figure 5c). These results are supported by the fact that our customized network uses the pre-trained ResNet-50 blocks up to the second bottleneck stage, which contains more convolution layers than the benchmark architecture with only four convolution blocks of the pre-trained VGG-16 model. Moreover, the residual connections of the ResNet architecture preserve more feature information useful for learning. This model aims to extract more features from the input data, increasing the information quality and improving gradient backpropagation to avoid losing important information in the process.
4.3 Training Performance Analysis
Figure 6 illustrates the performance during training for the benchmark and proposed models.
Since gaze direction precision involves angular metrics, and to better compare against the benchmark gaze vector results, we evaluate the angular error by plotting it over each iteration of the training process. Additionally, a training-loss plot is generated to check that the loss keeps decreasing over the training process. Both results are visualized in Figure 6. Figure 6a shows that within the first 20K training iterations, the angular error remains above 3 degrees for both models, representing poor accuracy. However, with more iterations this angular error decreases. The steeper light-purple curve (bottom) illustrates this result and decreases faster than the benchmark training curve (light-blue). In the same way, the light-purple curve in Figure 6b follows the desired decreasing tendency of the loss values over the training iterations, compared with the benchmark curve.
5 CONCLUSIONS
In conclusion, this research proposes an artificial intelligence-based gaze-tracking algorithm that uses a built-in laptop webcam to predict the user's point of gaze on a computer screen. The proposed model involves a customized ResNet-50 network pre-trained on ImageNet, enhancing performance and accuracy compared to traditional models such as the one used in the benchmark. This architecture, trained on the MPIIFaceGaze dataset, allows us to build a 3D gaze vector predictor whose output is translated to 2D screen coordinates with the help of a calibration process.
The results indicate that the proposed system can accurately predict the user's gaze direction. Moreover, the proposed model outperforms the benchmark by up to 33% in terms of mean angular error, increasing our gaze-tracking system's reliability and significantly improving the human-computer interaction experience.
Undoubtedly, the potential applications of gaze-tracking systems help bridge the gap in how we interact with computers in a hands-free manner.
6 LIMITATIONS FOUND
Each built-in laptop webcam has its own intrinsic and extrinsic camera parameters. This makes the calibration process unique to, and dependent on, the webcam used in the implementation: the rotation matrix R and translation vector T depend on the camera's characteristics.
During the point-of-gaze estimation test, the distance between the laptop's camera and the user was fixed at 50–60 cm. However, the accuracy of the gaze estimation predictions at other distance ranges was not evaluated in this work.
7 FUTURE WORKS
We encourage readers and all practitioners in this research area to address the limitations identified for our system.
Figure 5: Pitch and yaw scatter plots relating predicted outputs to ground-truth values for participant 13. (a) and (c) show the accuracy between the predicted gaze vector and the ground-truth vector in degrees using the benchmark customized VGG-16 trained model; (b) and (d) show the same for the proposed trained model.
Figure 6: Training process for both models; the light-blue curve (upper) refers to the benchmark model and the light-purple curve (lower) to the proposed model. (a) Training angular error over 50K iterations (angular error on the y-axis); (b) training loss over 50K iterations.
Practitioners can replicate the proposed implementation using their own laptop webcams and evaluate the accuracy of their systems. Additionally, varying the distance between the camera and the user's eyes is an important factor to be tested during experiments. Furthermore, implementing a new customized architecture based on the methodology provided in this paper could be a proper approach in this field. An initial starting point for improving the results presented in this work would be to modify some hyper-parameters and conduct more in-depth tests on the proposed architecture.
Finally, we find gaze-tracking applications useful, and we will certainly continue developing some of those described in Section 2.2 as part of our future work. We also invite readers to explore a wide range of applications, such as an aim assistant for shooter games, a mouse assistant controller for disabled people to use when driving a vehicle, or even neuromarketing studies based on visual attention analysis for product or ad placement on websites such as e-commerce or social media pages.
REFERENCES
Anantha Prabha, P., Srinivash, K., Vigneshwar, S., and
Viswa, E. (2022). Mouse assistance for motor-
disabled people using computer vision. In Proceed-
ings of International Conference on Recent Trends in
Computing, pages 403–413. Springer.
Cazzato, D., Leo, M., Distante, C., and Voos, H. (2020).
When I look into your eyes: A survey on computer
vision contributions for human gaze estimation and
tracking. Sensors, 20(13):3739.
Chen, H.-H., Hwang, B.-J., Wu, J.-S., and Liu, P.-T. (2020).
The effect of different deep network architectures
upon cnn-based gaze tracking. Algorithms, 13(5):127.
de Lope, J. and Graña, M. (2022). Deep transfer learning-based gaze tracking for behavioral activity recognition. Neurocomputing, 500:518–527.
Dilini, N., Senaratne, A., Yasarathna, T., Warnajith, N.,
and Seneviratne, L. (2021). Cheating detection in
browser-based online exams through eye gaze track-
ing. In 2021 6th International Conference on Informa-
tion Technology Research (ICITR), pages 1–8. IEEE.
Gudi, A., Li, X., and van Gemert, J. (2020). Efficiency in
real-time webcam gaze tracking. In Computer Vision–
ECCV 2020 Workshops: Glasgow, UK, August 23–
28, 2020, Proceedings, Part I 16, pages 529–543.
Springer.
Holmqvist, K. and Andersson, R. (2017). Eye tracking: A comprehensive guide to methods, paradigms and measures.
Huang, J., Zhang, Z., Xie, G., and He, H. (2021). Real-time
precise human-computer interaction system based on
gaze estimation and tracking. Wireless Communica-
tions and Mobile Computing, 2021.
Kaur, H., Jindal, S., and Manduchi, R. (2022). Rethink-
ing model-based gaze estimation. Proceedings of
the ACM on computer graphics and interactive tech-
niques, 5(2).
Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhan-
darkar, S., Matusik, W., and Torralba, A. (2016). Eye
tracking for everyone. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 2176–2184.
Liu, J., Chi, J., Hu, W., and Wang, Z. (2020). 3d model-
based gaze tracking via iris features with a single cam-
era and a single light source. IEEE Transactions on
Human-Machine Systems, 51(2):75–86.
Liu, J., Chi, J., Yang, H., and Yin, X. (2022). In the eye
of the beholder: A survey of gaze tracking techniques.
Pattern Recognition, page 108944.
Lu, F., Sugano, Y., Okabe, T., and Sato, Y. (2014). Adap-
tive linear regression for appearance-based gaze esti-
mation. IEEE transactions on pattern analysis and
machine intelligence, 36(10):2033–2046.
Mahanama, B., Jayawardana, Y., and Jayarathna, S. (2020).
Gaze-net: Appearance-based gaze estimation using
capsule networks. In Proceedings of the 11th aug-
mented human international conference, pages 1–4.
Martinez, F., Carbone, A., and Pissaloux, E. (2012). Gaze
estimation using local features and non-linear regres-
sion. In 2012 19th IEEE International Conference on
Image Processing, pages 1961–1964. IEEE.
Modi, N. and Singh, J. (2021). A review of various state of
art eye gaze estimation techniques. Advances in Com-
putational Intelligence and Communication Technol-
ogy: Proceedings of CICT 2019, pages 501–510.
Ou, W.-L., Kuo, T.-L., Chang, C.-C., and Fan, C.-P. (2021).
Deep-learning-based pupil center detection and track-
ing technology for visible-light wearable gaze track-
ing devices. Applied Sciences, 11(2):851.
Sabab, S. A., Kabir, M. R., Hussain, S. R., Mahmud, H.,
Hasan, M., Rubaiyeat, H. A., et al. (2022). Vis-itrack:
Visual intention through gaze tracking using low-cost
webcam. arXiv preprint arXiv:2202.02587.
Sharma, K., Giannakos, M., and Dillenbourg, P. (2020).
Eye-tracking and artificial intelligence to enhance mo-
tivation and learning. Smart Learning Environments,
7(1):1–19.
Sharma, P., Joshi, S., Gautam, S., Maharjan, S., Khanal,
S. R., Reis, M. C., Barroso, J., and de Jesus Filipe,
V. M. (2023). Student engagement detection using
emotion analysis, eye tracking and head movement
with machine learning. In Technology and Innovation
in Learning, Teaching and Education: Third Interna-
tional Conference, TECH-EDU 2022, Lisbon, Portu-
gal, August 31–September 2, 2022, Revised Selected
Papers, pages 52–68. Springer.
Werchan, D. M., Thomason, M. E., and Brito, N. H. (2022).
Owlet: An automated, open-source method for infant
gaze tracking using smartphone and webcam record-
ings. Behavior Research Methods, pages 1–15.
Zhang, X., Sugano, Y., Fritz, M., and Bulling, A. (2017).
It’s written all over your face: Full-face appearance-
based gaze estimation. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion workshops, pages 51–60.