The selection of a suitable input type also depends on the category of an approach. Three main categories have been defined to classify approaches: feature-based, 3D model registration, and appearance-based approaches (Fanelli et al., 2011; Meyer et al., 2015; Borghi et al., 2017). Feature-based approaches rely on predefined facial features, such as eye corners or mouth corners, which are localized in the frames to perform pose estimation. These approaches can work on 2D as well as 3D information. In (Barros et al., 2018), two feature-based cues, predefined facial landmarks and keypoints computed from motion, were combined to regress the head pose; the approach requires only 2D images.
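Landmark-based head pose estimation from a single 2D image is commonly posed as a Perspective-n-Point (PnP) problem. The following is a minimal sketch using OpenCV; the 3D model points, 2D landmark positions, and camera intrinsics are all hypothetical placeholder values, not taken from any of the cited works.

```python
import numpy as np
import cv2

# Hypothetical 3D reference points of a generic face model (in mm):
# nose tip, chin, eye corners, mouth corners.
model_points = np.array([
    [0.0,    0.0,   0.0],    # nose tip
    [0.0,  -63.6, -12.5],    # chin
    [-43.3, 32.7, -26.0],    # left eye outer corner
    [43.3,  32.7, -26.0],    # right eye outer corner
    [-28.9, -28.9, -24.1],   # left mouth corner
    [28.9,  -28.9, -24.1],   # right mouth corner
], dtype=np.float64)

# 2D landmark locations detected in the image (hypothetical pixels).
image_points = np.array([
    [359.0, 391.0], [399.0, 561.0], [337.0, 297.0],
    [513.0, 301.0], [345.0, 465.0], [453.0, 469.0],
], dtype=np.float64)

# Simple pinhole intrinsics approximated from the image size.
width, height = 640, 480
focal = width  # rough focal-length guess in pixels
camera_matrix = np.array([[focal, 0, width / 2],
                          [0, focal, height / 2],
                          [0, 0, 1]], dtype=np.float64)
dist_coeffs = np.zeros((4, 1))  # assume no lens distortion

# Solve PnP: rigid head rotation and translation w.r.t. the camera.
ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                              camera_matrix, dist_coeffs)
rotation_matrix, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation
```

The recovered rotation can then be converted to yaw, pitch, and roll angles in whatever head coordinate convention a dataset defines.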
3D model registration approaches fit a head model to the data and regress the head pose from the derived 3D information. This can be done based on 2D data, 3D data, or both. (Papazov et al., 2015) uses facial point clouds and matches them against possible poses.
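Registering a captured facial point cloud to a reference head model is typically done with a rigid registration method such as ICP; (Papazov et al., 2015) use their own surface-patch matching, so the sketch below is only a generic illustration of the registration idea, using Open3D with hypothetical file names.

```python
import open3d as o3d

# Load a reference head model and a captured facial point cloud
# (file names are hypothetical placeholders).
model = o3d.io.read_point_cloud("head_model.ply")
scan = o3d.io.read_point_cloud("face_scan.ply")

# Rigidly register the scan to the model with point-to-point ICP.
threshold = 10.0  # max correspondence distance, in the clouds' unit
result = o3d.pipelines.registration.registration_icp(
    scan, model, threshold,
    estimation_method=o3d.pipelines.registration.
        TransformationEstimationPointToPoint())

# The resulting 4x4 rigid transform encodes the head pose
# relative to the reference model.
print(result.transformation)
```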
Appearance-based approaches take the entire provided input into consideration to regress a pose; they are generally learning-based methods. The input can be a raw 2D image, a depth map, or both, as in the DriveAHead approach (Schwarz et al., 2017), which uses 2D IR images together with depth information to regress the pose. The POSEidon framework (Borghi et al., 2017; Borghi et al., 2018) uses depth information only, deriving further representations such as motion and grayscale images from it to regress the 3D orientation.
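The idea of deriving additional modalities from depth alone can be illustrated with a small NumPy sketch; the normalization scheme and the frame-difference motion cue below are generic choices for illustration, not necessarily those used by POSEidon.

```python
import numpy as np

def depth_to_grayscale(depth, max_depth=4500.0):
    """Map a raw depth map (e.g., millimeters) to a uint8 grayscale image."""
    d = np.clip(depth.astype(np.float32), 0, max_depth) / max_depth
    return (d * 255).astype(np.uint8)

def motion_image(depth_prev, depth_curr):
    """Crude motion cue: absolute frame difference between depth maps."""
    return np.abs(depth_curr.astype(np.int32)
                  - depth_prev.astype(np.int32)).astype(np.uint16)
```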
The baseline method we use in this paper is based on deep neural networks, which have proven to have high potential for the head pose estimation task, as shown in (Borghi et al., 2017; Borghi et al., 2018; Ahn et al., 2015; Ahn et al., 2018), but which require large amounts of training data.
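As an illustration of such a learning-based regressor, the PyTorch sketch below maps a single-channel input crop to three rotation angles. It is a generic toy architecture, not the baseline network used in this paper.

```python
import torch
import torch.nn as nn

class HeadPoseNet(nn.Module):
    """Toy CNN regressing (yaw, pitch, roll) from a 1-channel image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(128, 3)  # yaw, pitch, roll

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.regressor(x)

# Example: a batch of four 64x64 depth or IR crops.
net = HeadPoseNet()
angles = net(torch.randn(4, 1, 64, 64))  # shape: (4, 3)
```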
3 AutoPOSE DATASET
We introduce a new head pose and eye gaze dataset. We captured our images using two cameras placed at two different positions in the car simulator in our lab. One camera is an IR camera placed at the dashboard of the car and aimed at the driver. The second camera is a Kinect v2 placed at the location of the center mirror of the car, providing three image types: IR and depth (512x424 pixels), and RGB (1920x1080) images. The dataset consists of 21 sequences from 21 subjects, 10 females and 11 males. The dashboard IR camera ran at 60 fps, yielding 1,018,885 IR images in total. The Kinect ran at 30 fps, yielding 316,497 synchronized RGB, depth, and IR images.
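For a rough sense of scale, the frame counts and frame rates above translate into recording time as follows (simple arithmetic, not additional dataset metadata):

```python
# Approximate total recording time implied by the frame counts.
ir_hours = 1_018_885 / 60 / 3600      # dashboard IR at 60 fps -> ~4.7 h
kinect_hours = 316_497 / 30 / 3600    # Kinect at 30 fps -> ~2.9 h
print(f"{ir_hours:.1f} h IR, {kinect_hours:.1f} h Kinect")
```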
Figure 2: Driving simulator at our lab. The red circles highlight some of the motion capture system cameras.
It was not possible to capture the data using both cameras at the same time, because the strong IR light emitted by the Kinect interfered with the images captured by the camera located at the dashboard. Consequently, we decided to capture the data first with the dashboard IR camera and then with the Kinect. In each capturing sequence, the subject was asked to perform the tasks listed in Table 1. First, the subject was instructed about all the required tasks. The subject then performed pure rotations as far as possible, followed by free natural motion, with and without face occlusions using his/her hand. Finally, the subject performed the gaze tasks, which are described in detail in Subsection 4.5.
All tasks were first performed without any glasses on the subject's face. Afterwards, all tasks were repeated with clear glasses on, and then with sunglasses on. After acquiring the data with the dashboard camera, the whole experiment was repeated using the Kinect camera while the dashboard IR camera was turned off. It is noted that the subjects were faster when performing the tasks again for the Kinect sequence, leading to fewer frames in those sequences. In addition, four Kinect sequences were discarded due to technical issues that invalidated them. All tasks for all of our 21 subjects were manually annotated with the start/end frame, the task performed, and the presence and type of glasses.
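A hedged sketch of how such per-task annotations could be represented programmatically is given below; the field names and values are hypothetical placeholders, as the actual annotation file format is not specified here.

```python
from dataclasses import dataclass

@dataclass
class TaskAnnotation:
    """One manually annotated task segment (hypothetical schema)."""
    subject_id: int
    start_frame: int
    end_frame: int
    task: str      # e.g., "pure_rotation", "free_motion", "gaze"
    glasses: str   # "none", "clear", or "sunglasses"

# Example record for illustration only.
ann = TaskAnnotation(subject_id=3, start_frame=1200, end_frame=2350,
                     task="pure_rotation", glasses="clear")
```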
3.1 Head Coordinate System
As introduced in Section 2, existing datasets use different head coordinate system definitions. In other words, when treating the head as a rigid body, it is necessary to define the x, y, and z axes of the head as well as the head center. In our dataset, we decided to follow the head coordinate system definition proposed in (Schwarz et al., 2017), which adds more consistent data to the community. The definition requires