eye gazes and their gradient over time. Special geometric relationships of the features (Zhao et al., 2018) reduce the complexity of calibration and the image processing time. Chinsatit and Saitoh (Chinsatit and Saitoh, 2017) propose two CNNs for feature-point detection to recognise the pupil centres: the first CNN classifies the eye state and the second CNN estimates the position of the pupil centres. The authors of (Mukherjee and Robertson, 2015) use low-resolution multimodal RGB-D images and regression based on learned depth classifiers, combining the two models to approximate regression confidences. In (George and Routray, 2016), a CNN is used to classify the eye gaze and to estimate eye-accessing cues in real time. Lemley et al. (Lemley et al., 2018) evaluate different datasets for designing a CNN with minimal computational requirements for gaze estimation.
Recently, Deng and Zhu (Deng and Zhu, 2017) suggested a two-step training policy in which a CNN for head recognition and a CNN for eye recognition are trained separately and then jointly fine-tuned with a geometrically constrained gaze transform layer. In general, the robustness and reliability of current approaches for eye gaze tracking have improved. However, the accuracy and repeatability of current detection methods are not yet sufficient. In order to predict human intentions in different situations and in realistic work environments, a new tracking system based on cascaded CNNs for extreme conditions and face alignment is proposed. It is able to recognise the willingness to interact and measures human attention based on eye gaze and head orientation. It first recognises 2D facial features, maps them into 3D, and then applies a further CNN to track the head orientation. These models allow real-time tracking of the eye gaze independently of the head orientation. Due to the facial features and the applied facial symmetries, the eye gaze recognition is robust against face- and image-based occlusions. As a result, the eye gaze direction should be considered a key indicator of the ability to interact.
2 PROPOSED METHOD
Figure 1 shows the proposed eye gaze tracking pipeline. An RGB-D image is obtained from a low-cost RGB-D sensor and passed to the Multi-Task CNN (MTCNN). The MTCNN detects face regions and locates the facial landmarks of each individual face. The eye regions are extracted and the head pose is computed. If the yaw angle of the head is beyond a certain threshold τ, the eye region is replaced with the tracking result from the MTCNN; otherwise, the eye region is taken as the initialisation input for the Kernelized Correlation Filter (KCF). The two eye regions are then taken as input for another CNN. The model of this second CNN explicitly encodes coordinates in its layers and calculates the coordinates of the eye corners. Finally, the eye gaze of each individual eye is computed using different facial features.
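A minimal sketch of this pipeline logic is given below. The helper functions (detect_face_and_landmarks, estimate_head_pose, crop_eye_region, register_tracker, detect_eye_corners, compute_gaze) are hypothetical placeholders for the CNN stages described above, the yaw threshold value is illustrative only, and the KCF tracker is taken from opencv-contrib.

```python
import cv2  # the KCF tracker requires opencv-contrib-python

YAW_THRESHOLD_DEG = 30.0  # illustrative value for the threshold tau


def process_frame(frame_rgb, frame_depth, helpers):
    """One pass of the proposed pipeline (sketch with assumed helpers)."""
    # 1) MTCNN: face region and five rough facial landmarks
    face_box, landmarks = helpers.detect_face_and_landmarks(frame_rgb)
    if face_box is None:
        return None

    # 2) Head pose from the 2D landmarks mapped into 3D via the depth image
    yaw, pitch, roll = helpers.estimate_head_pose(landmarks, frame_depth)

    # 3) Eye regions around the rough eye-centre landmarks, as (x, y, w, h)
    eye_regions = [helpers.crop_eye_region(frame_rgb, landmarks[k])
                   for k in ("left_eye", "right_eye")]

    # 4) Large yaw: keep the MTCNN eye regions; otherwise initialise one
    #    KCF tracker per eye region for the following frames
    if abs(yaw) <= YAW_THRESHOLD_DEG:
        for region in eye_regions:
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame_rgb, tuple(int(v) for v in region))
            helpers.register_tracker(tracker)

    # 5) Second CNN with explicit coordinate encoding predicts the eye corners
    eye_corners = [helpers.detect_eye_corners(frame_rgb, r) for r in eye_regions]

    # 6) Gaze per eye from eye corners, landmarks and head orientation
    return [helpers.compute_gaze(corners, landmarks, (yaw, pitch, roll))
            for corners in eye_corners]
```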
2.1 Face Detection
Based on various human characteristics and different training datasets, we recognise all human faces in the field of view (FOV) and then determine the face alignment for each face. We introduce a cascaded structure that establishes the detection from coarse to fine, see Figure 1. The cascaded network contains three sub-networks: a Proposal Network (P-Net), a Refinement Network (R-Net) and an Output Network (O-Net). The P-Net is used to generate bounding boxes around detected faces. The trained P-Net only outputs N bounding boxes with four coordinates and their quality scores. The R-Net is used to reject the large number of regions in which no faces are detected. The input of the R-Net is the resulting bounding box of the preceding P-Net, resized to 24 × 24 pixels. The O-Net is similar to the R-Net with one exception: working on 48 × 48 pixel inputs, the O-Net additionally includes the task of landmark regression.
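The following sketch illustrates how this coarse-to-fine cascade could be chained. The stage networks (p_net, r_net, o_net), the non-maximum-suppression helper nms and their return values are assumed interfaces, not the actual implementation.

```python
import numpy as np
import cv2


def crop(image, box):
    # box = (x, y, width, height) with (x, y) the top-left corner
    x, y, w, h = [int(round(v)) for v in box]
    return image[y:y + h, x:x + w]


def run_cascade(image, p_net, r_net, o_net, nms):
    """Coarse-to-fine face detection (sketch with assumed stage interfaces)."""
    # Stage 1: P-Net proposes N candidate boxes, each with four
    # coordinates and a quality score.
    boxes, scores = p_net(image)
    boxes = nms(boxes, scores)

    # Stage 2: R-Net rejects regions without faces; its input patches
    # are resized to 24 x 24 pixels.
    patches_24 = np.stack([cv2.resize(crop(image, b), (24, 24)) for b in boxes])
    keep, refined_boxes = r_net(patches_24, boxes)
    boxes = refined_boxes[keep]

    # Stage 3: O-Net works on 48 x 48 pixel patches and additionally
    # regresses the five facial landmarks.
    patches_48 = np.stack([cv2.resize(crop(image, b), (48, 48)) for b in boxes])
    final_boxes, landmarks = o_net(patches_48, boxes)
    return final_boxes, landmarks
```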
The final output includes four values for the bounding box together with a score, namely the (x, y)-coordinate of the top-left point and the height and width of the bounding box, which can be described by a vector $y_i^{box} \in \mathbb{R}^4$. The values $\hat{y}_i^{lm}$ are the predicted facial landmark coordinates and the values $y_i^{lm}$ are the Ground Truth coordinates. Altogether, there are five facial landmarks as a vector $y_i^{lm} \in \mathbb{R}^5$, including the rough centres of the left and right eye, the nose tip, the left mouth corner and the right mouth corner.
The loss function of the face classification branch is given by $L_i^{det}$ in (1), which adopts the cross entropy
$$L_i^{det} = -\left(y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i)\right), \quad (1)$$
where $p_i$ is the probability produced by the network that indicates a sample being a face, and $y_i^{det} \in \{0, 1\}$ denotes the Ground Truth label. The second loss function is for the bounding box regression, $L_i^{box} = \| \hat{y}_i^{box} - y_i^{box} \|_2$, and the third is for the facial landmark regression. The difference $L_i^{lm} = \| \hat{y}_i^{lm} - y_i^{lm} \|_2$ is the Euclidean distance loss. Here, $\hat{y}_i^{box}$ describes the regression target and $y_i^{box}$ represents the Ground Truth coordinates.
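A compact sketch of these three per-sample losses is given below, written with NumPy and using the vector dimensions stated above. The example values at the end are made up for illustration, and how the three terms are weighted and combined is not specified in this section.

```python
import numpy as np


def detection_loss(p, y_det, eps=1e-12):
    """Cross-entropy loss of the face classification branch, Eq. (1)."""
    return -(y_det * np.log(p + eps) + (1.0 - y_det) * np.log(1.0 - p + eps))


def box_loss(y_box_pred, y_box_true):
    """Euclidean loss between predicted and Ground Truth box vectors (R^4)."""
    return np.linalg.norm(y_box_pred - y_box_true)


def landmark_loss(y_lm_pred, y_lm_true):
    """Euclidean loss between predicted and Ground Truth landmark vectors."""
    return np.linalg.norm(y_lm_pred - y_lm_true)


# Example with made-up numbers: a positive sample classified with p = 0.9
print(detection_loss(0.9, 1))                                   # ~0.105
print(box_loss(np.array([0.1, 0.2, 0.05, 0.0]), np.zeros(4)))   # ~0.229
print(landmark_loss(np.full(5, 0.1), np.zeros(5)))              # ~0.224
```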
We trained the MTCNN with several different datasets (Sun et al., 2013; Yang et al., 2016), which contain front view images and side view images. Moreover, we annotated these datasets to im-