eye gazes and their gradient over time. Special geometric relationships of the features (Zhao et al., 2018) reduce the complexity of calibration and the image processing time. Chinsatit and Saitoh (Chinsatit and Saitoh, 2017) propose two CNNs for feature-point detection to recognise the pupil centres: the first CNN classifies the eye state and the second CNN estimates the position of the pupil centres. The authors of (Mukherjee and Robertson, 2015) use low-resolution multimodal RGB-D images and regression based on learned depth classifiers, combining the two models to approximate regression confidences. In (George and Routray, 2016), a CNN is used to classify the eye gaze and to estimate eye-accessing cues in real time. Lemley et al. (Lemley et al., 2018) evaluate different datasets for designing a CNN with minimal computational requirements for gaze estimation.
Recently, Deng and Zhu (Deng and Zhu, 2017) suggested a two-step training policy in which a CNN for head recognition and a CNN for eye recognition are trained separately and then jointly fine-tuned with a geometrically constrained gaze transform layer. In general, the robustness and reliability of current approaches for eye gaze tracking have improved. However, the accuracy and repeatability of current detection methods are not yet sufficient. In order to predict human intentions in different situations and in realistic work environments, a new tracking system based on cascaded CNNs for extreme conditions and face alignment is proposed. It is able to recognise the willingness to interact and measures human attention based on eye gaze and head orientation. It first recognises 2D facial features, maps them into 3D, and then applies a further CNN to track the head orientation. These models allow real-time tracking of the eye gaze independently of the head orientation. Due to the facial features and the applied facial symmetries, the eye gaze recognition is robust against face- and image-based occlusions. As a result, the eye gaze direction should be considered a key indicator of the ability to interact.
2 PROPOSED METHOD
Figure 1 shows the proposed eye gaze tracking pipeline. An RGB-D image is obtained from a low-cost RGB-D sensor and passed to the Multi-Task CNN (MTCNN). The MTCNN detects face regions and locates the facial landmarks of each individual face. The eye regions are extracted and the head pose is computed. If the yaw angle of the head is beyond a certain threshold τ, the eye region is replaced with the tracking result from the MTCNN; otherwise, the eye region is taken as the initialisation input for the Kernelized Correlation Filter (KCF). The two eye regions are then taken as input for another CNN. The model of this second CNN explicitly encodes coordinates in its layers and calculates the coordinates of the eye corners. Finally, the eye gaze of each individual eye is computed using different facial features.
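A minimal sketch of this pipeline logic is given below. The helper functions (detect_face_and_landmarks, estimate_head_pose, crop_eye_region, register_tracker, detect_eye_corners, compute_gaze) are hypothetical placeholders for the CNN stages described above, the yaw threshold value is illustrative only, and the KCF tracker is taken from opencv-contrib.

```python
import cv2  # the KCF tracker requires opencv-contrib-python

YAW_THRESHOLD_DEG = 30.0  # illustrative value for the threshold tau


def process_frame(frame_rgb, frame_depth, helpers):
    """One pass of the proposed pipeline (sketch with assumed helpers)."""
    # 1) MTCNN: face region and five rough facial landmarks
    face_box, landmarks = helpers.detect_face_and_landmarks(frame_rgb)
    if face_box is None:
        return None

    # 2) Head pose from the 2D landmarks mapped into 3D via the depth image
    yaw, pitch, roll = helpers.estimate_head_pose(landmarks, frame_depth)

    # 3) Eye regions around the rough eye-centre landmarks, as (x, y, w, h)
    eye_regions = [helpers.crop_eye_region(frame_rgb, landmarks[k])
                   for k in ("left_eye", "right_eye")]

    # 4) Large yaw: keep the MTCNN eye regions; otherwise initialise one
    #    KCF tracker per eye region for the following frames
    if abs(yaw) <= YAW_THRESHOLD_DEG:
        for region in eye_regions:
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame_rgb, tuple(int(v) for v in region))
            helpers.register_tracker(tracker)

    # 5) Second CNN with explicit coordinate encoding predicts the eye corners
    eye_corners = [helpers.detect_eye_corners(frame_rgb, r) for r in eye_regions]

    # 6) Gaze per eye from eye corners, landmarks and head orientation
    return [helpers.compute_gaze(corners, landmarks, (yaw, pitch, roll))
            for corners in eye_corners]
```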
2.1 Face Detection
Based on various human characteristics and different training datasets, we recognise all human faces in the field of view (FOV) and then determine the face alignment for each face. We introduce a cascaded structure that establishes the detection from coarse to fine, see Figure 1. The cascaded network contains three sub-networks: a Proposal Network (P-Net), a Refinement Network (R-Net) and an Output Network (O-Net). The P-Net is used to generate bounding boxes around detected faces. The trained P-Net only outputs N bounding boxes with four coordinates and their quality scores. The R-Net is used to reject the large number of regions in which no faces are detected. The input of the R-Net is the resulting bounding box of the preceding P-Net, resized to 24 × 24 pixels. The O-Net is similar to the R-Net with one exception: working on 48 × 48 pixel inputs, the O-Net additionally includes the task of landmark regression.
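The following sketch illustrates how this coarse-to-fine cascade could be chained. The stage networks (p_net, r_net, o_net), the non-maximum-suppression helper nms and their return values are assumed interfaces, not the actual implementation.

```python
import numpy as np
import cv2


def crop(image, box):
    # box = (x, y, width, height) with (x, y) the top-left corner
    x, y, w, h = [int(round(v)) for v in box]
    return image[y:y + h, x:x + w]


def run_cascade(image, p_net, r_net, o_net, nms):
    """Coarse-to-fine face detection (sketch with assumed stage interfaces)."""
    # Stage 1: P-Net proposes N candidate boxes, each with four
    # coordinates and a quality score.
    boxes, scores = p_net(image)
    boxes = nms(boxes, scores)

    # Stage 2: R-Net rejects regions without faces; its input patches
    # are resized to 24 x 24 pixels.
    patches_24 = np.stack([cv2.resize(crop(image, b), (24, 24)) for b in boxes])
    keep, refined_boxes = r_net(patches_24, boxes)
    boxes = refined_boxes[keep]

    # Stage 3: O-Net works on 48 x 48 pixel patches and additionally
    # regresses the five facial landmarks.
    patches_48 = np.stack([cv2.resize(crop(image, b), (48, 48)) for b in boxes])
    final_boxes, landmarks = o_net(patches_48, boxes)
    return final_boxes, landmarks
```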
The final output includes four values for the bounding box together with a score, namely the (x, y)-coordinate of the top-left point and the height and width of the bounding box, which can be described by a vector $y_i^{box} \in \mathbb{R}^4$. The values $\hat{y}_i^{lm}$ are the predicted facial landmark coordinates and the values $y_i^{lm}$ are the Ground Truth coordinates. Altogether, there are five facial landmarks as a vector $y_i^{lm} \in \mathbb{R}^5$, including the rough centres of the left and right eye, the nose tip, the left mouth corner and the right mouth corner.
The loss function of the face classification branch is given by $L_i^{det}$ in (1), which adopts the cross entropy
$$L_i^{det} = -\left(y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i)\right), \quad (1)$$
where $p_i$ is the probability produced by the network that indicates a sample being a face, and $y_i^{det} \in \{0, 1\}$ denotes the Ground Truth label. The second loss function is for the bounding box regression, $L_i^{box} = \| \hat{y}_i^{box} - y_i^{box} \|_2$, and the third is for the facial landmark regression. The difference $L_i^{lm} = \| \hat{y}_i^{lm} - y_i^{lm} \|_2$ is the Euclidean distance loss. Here, $\hat{y}_i^{box}$ describes the regression target and $y_i^{box}$ represents the Ground Truth coordinates.
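A compact sketch of these three per-sample losses is given below, written with NumPy and using the vector dimensions stated above. The example values at the end are made up for illustration, and how the three terms are weighted and combined is not specified in this section.

```python
import numpy as np


def detection_loss(p, y_det, eps=1e-12):
    """Cross-entropy loss of the face classification branch, Eq. (1)."""
    return -(y_det * np.log(p + eps) + (1.0 - y_det) * np.log(1.0 - p + eps))


def box_loss(y_box_pred, y_box_true):
    """Euclidean loss between predicted and Ground Truth box vectors (R^4)."""
    return np.linalg.norm(y_box_pred - y_box_true)


def landmark_loss(y_lm_pred, y_lm_true):
    """Euclidean loss between predicted and Ground Truth landmark vectors."""
    return np.linalg.norm(y_lm_pred - y_lm_true)


# Example with made-up numbers: a positive sample classified with p = 0.9
print(detection_loss(0.9, 1))                                   # ~0.105
print(box_loss(np.array([0.1, 0.2, 0.05, 0.0]), np.zeros(4)))   # ~0.229
print(landmark_loss(np.full(5, 0.1), np.zeros(5)))              # ~0.224
```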
We trained the MTCNN with several different datasets (Sun et al., 2013; Yang et al., 2016), which contain front view images and side view images. Moreover, we annotated these datasets to im-