2 RELATED WORK
2.1 Face and Landmark Based Eye Detection
Eye-center or pupil-center detection methods have
been proposed for gaze tracking (Y.Cheng, 2021).
These methods first extract the face region using face
detection, resize the region to a fixed size, extract
single-eye regions using facial-landmark detection, and
then perform high-precision position estimation; alternatively,
they detect the single-eye region directly from the
face region. Therefore, the accuracy of these methods
depends on the accuracy of the underlying face region
and facial-landmark detection.
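The cascade described above can be sketched as follows. This is a minimal illustration, not any cited system: the detector and landmark functions are hypothetical placeholders standing in for real CNN models, and only the crop-and-resize logic reflects the pipeline itself.

```python
import numpy as np

def detect_face(image):
    # Hypothetical stand-in for a CNN face detector (e.g. RetinaFace);
    # returns a bounding box (x, y, w, h). Here: a fixed central box.
    h, w = image.shape[:2]
    return (w // 4, h // 4, w // 2, h // 2)

def resize_nearest(patch, size):
    # Nearest-neighbour resize to a fixed (size, size) input.
    h, w = patch.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return patch[np.ix_(ys, xs)]

def detect_eye_landmarks(face_patch):
    # Hypothetical stand-in for a facial-landmark model; returns eye
    # centres in the resized face's coordinate system.
    s = face_patch.shape[0]
    return {"left_eye": (s // 3, s // 3), "right_eye": (2 * s // 3, s // 3)}

def extract_eye_regions(image, half=16, size=128):
    # 1) face detection, 2) resize to a fixed size, 3) landmark
    # detection, 4) crop one square patch per eye.
    x, y, w, h = detect_face(image)
    face = resize_nearest(image[y:y + h, x:x + w], size)
    eyes = {}
    for name, (cx, cy) in detect_eye_landmarks(face).items():
        x0, y0 = max(cx - half, 0), max(cy - half, 0)
        eyes[name] = face[y0:y0 + 2 * half, x0:x0 + 2 * half]
    return eyes
```

Because every later stage operates on crops derived from the first stage, an error in the face box or landmarks propagates directly into the eye patches, which is the dependency noted above.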
Face detection methods predict the facial bound-
ing box. Early methods were mainly based on
classifiers using hand-crafted features extracted from
an image (P.Viola, 2001). After the breakthrough of
CNNs, CNN-based models were proposed, such
as Cascade-CNN, Faster-RCNN, and single-shot
detectors (H.Li, 2015; X.Sun, 2017; S.Zhang, 2017;
V.Bazarevsky, 2019; J.Deng, 2020). To improve
detection accuracy, several studies focused on the
loss function or multi-task learning (R.Ranjan, 2016;
J.Deng, 2020). Deng et al. (J.Deng, 2020) proposed
RetinaFace, which predicts the facial bounding box
by leveraging extra-supervised and self-supervised
multi-task learning, and showed a significant improvement
in accuracy. One of the challenges in face detection
is occlusion, i.e., the lack of facial information
due to obstacles or masks. Chen et al. (Y.Chen, 2018)
proposed the Adversarial Occlusion-aware Face Detector
(AOFD), which detects faces with few exposed facial
landmarks using an adversarial training strategy. The above
face-detection methods use annotated facial-image
datasets that include images captured under visible
light. Several visible-light face datasets are pub-
licly available. For example, WIDER FACE (S.Yang,
2016) includes more than 30,000 images and about
400,000 labeled faces. There are several other datasets
containing hundreds to tens of thousands of labeled
faces. The majority of images are wide-angle shots of
the face (V.Jain, 2010; Y.Junjie, 2014; B.Yang, 2015;
H.Nada, 2018; Q.Cao, 2018).
Several eye-detection and eye-center estimation
methods detect eyes from resized facial bounding
boxes. Some methods (N.Y.Ahmed, 2021; M.Leo,
2013) use statistical facial-landmark information for
cropping out single eyes before a real-time
eye-segmentation process. Another method (M.D.Putro,
2020) uses a face region resized to 128 × 128 pixels
before a bounding-box estimation of eyes.
Facial-landmark-detection methods detect key
points that represent facial landmarks from facial
bounding boxes. Early landmark-detection meth-
ods were mainly based on fitting a deformable face
mesh by using statistical methods (N.Wang, 2018).
Kazemi et al. proposed an ensemble of regression
trees based on gradient boosting initialized
with the mean shape of landmarks (V.Kazemi,
2014). They achieved high speed and high accuracy
in detecting 68 points from frontal-face images
with less occlusion. CNN-based landmark detectors
have also been proposed, showing significant improvement
in in-the-wild facial-landmark detection (Y.Sun,
2013; E.Zhou, 2013; P.Chandran, 2020; K.Zhang,
2016; Z.Feng, 2018). The models of these meth-
ods are typically evaluated with 68 points using an-
notated visible-light-image datasets. Several datasets
(C.Sagonas, 2013; M.Köstinger, 2011; A.Burgos,
2013; W.Wu, 2018) are available. Each dataset
includes several thousand annotated images. For example,
the 300W dataset (C.Sagonas, 2013) contains
4,437 images with 68 landmark annotations, AFLW
(M.Köstinger, 2011) contains 24,386 images with 21
landmark annotations, COFW (A.Burgos, 2013) contains
1,852 images with 29 landmark annotations, and
WFLW (W.Wu, 2018) contains 10,000 images with 98
landmark annotations.
Several iris-landmark-detection methods
(J.H.Choi, 2019; K.I.Lee, 2020; A.Ablavatski,
2020) use single-eye regions cropped from facial-
landmark-detection results. Choi et al. (J.H.Choi,
2019) proposed segmentation-based eye-center
estimation. They cut out a rectangular region using the
eye-socket and eye-corner landmarks from the 68-point
landmark set (V.Kazemi, 2014) before pupil
segmentation. Ablavatski et al. (A.Ablavatski, 2020)
detect 5 iris landmarks from a 64 × 64
single-eye region obtained from facial-landmark-detection
(V.Bazarevsky, 2019) results.
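As an illustration of this kind of landmark-driven cropping, the sketch below computes an eye rectangle from a 68-point landmark array. The index convention (iBUG 68-point scheme, where points 36–41 outline the left eye and 42–47 the right) is standard for such models, but the margin rule is our assumption, not the exact crop used by the cited methods.

```python
import numpy as np

# iBUG 68-point convention: 0-based indices 36-41 outline the
# left eye, 42-47 the right eye.
LEFT_EYE, RIGHT_EYE = slice(36, 42), slice(42, 48)

def eye_rect(landmarks, eye=LEFT_EYE, margin=0.5):
    """Axis-aligned crop rectangle around one eye's six landmarks.

    landmarks: (68, 2) array of (x, y) points. margin enlarges the
    tight bounding box by that fraction of its size on every side
    (an assumed heuristic, not a published crop rule).
    """
    pts = np.asarray(landmarks)[eye]
    (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    return (x0 - mx, y0 - my, x1 + mx, y1 + my)
```

The resulting rectangle would then be cropped from the face image and resized (e.g. to 64 × 64) before pupil or iris estimation.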
We assume NIR partial-face images captured under
intense NIR illumination, as in the CASIA-Iris-Distance
(Tan, 2012a) and CASIA-Iris-M1 (Tan, 2012b) datasets.
Since the modality of these images differs from that
of visible-light datasets, the pre-trained models of
the above detection methods do not provide sufficient
accuracy. In addition, there are currently hardly any
public NIR face-image datasets with annotations. Therefore,
it is necessary to add facial bounding-box and landmark
annotations to existing datasets. However, such
annotation is very costly for the task of eye
detection.
Fast Eye Detector Using Siamese Network for NIR Partial Face Images