nose, can be determined. It is furthermore assumed that the head has already been detected in the image beforehand.
In contrast, Rehder et al. (2014) use monocular RGB images as input data, which need not be restricted to the head section but may also contain complete as well as partially occluded pedestrians. Within the algorithm, the head is localized via HOG/SVM and the part-based detector proposed by Felzenszwalb et al. (2010). Proceeding from this, four discrete classes are defined for the pose estimation, and for each of them a classifier is trained using logistic regression with LBP (Local Binary Pattern) features. By integrating the discrete orientation estimates using an HMM (Hidden Markov Model), they obtain continuous head poses. This approach is particularly interesting in that the head poses are checked for plausibility, and impossible poses are discarded with the help of the upper-body pose and the motion direction.
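To illustrate the idea, the following minimal sketch fuses per-frame class posteriors over time with an HMM forward pass and turns the filtered belief into a continuous orientation via a circular expectation; the four class centers and the transition matrix are illustrative assumptions, not the parameters of Rehder et al. (2014):

import numpy as np

# Four orientation classes; centers and transition matrix are assumed
# for illustration, not taken from Rehder et al. (2014).
CENTERS = np.deg2rad([0.0, 90.0, 180.0, 270.0])
TRANS = np.array([[0.90, 0.05, 0.00, 0.05],
                  [0.05, 0.90, 0.05, 0.00],
                  [0.00, 0.05, 0.90, 0.05],
                  [0.05, 0.00, 0.05, 0.90]])

def hmm_smooth(frame_posteriors):
    """Fuse per-frame class posteriors (T x 4), e.g. the outputs of the
    logistic-regression/LBP classifiers, into continuous yaw angles."""
    belief = np.full(len(CENTERS), 1.0 / len(CENTERS))
    angles = []
    for obs in frame_posteriors:
        belief = obs * (TRANS.T @ belief)      # predict, then update
        belief /= belief.sum()
        # The circular expectation of the class centers under the
        # filtered belief yields a continuous orientation estimate.
        yaw = np.arctan2(belief @ np.sin(CENTERS),
                         belief @ np.cos(CENTERS))
        angles.append(np.rad2deg(yaw) % 360.0)
    return np.array(angles)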
Chen et al. (2011) go one step further and estimate both the head and the body pose in pedestrian images. To this end, the orientation of the body is divided into eight discrete direction classes and multi-level HOG features are extracted. Furthermore, the yaw angle range of the head is divided into twelve classes and its pitch angle range into three classes. After localizing the head, texture and color features are extracted by another multi-level HOG descriptor and a histogram-based color descriptor. A particle filter framework is subsequently used to estimate the body and head poses, taking into account both the dependency between the two poses and their temporal relationship.
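To make this coupling concrete, the following minimal particle filter sketch jointly tracks a body and a head yaw angle; the motion noise, the body-head coupling prior, and the likelihood interface are illustrative assumptions and do not reproduce the exact model of Chen et al. (2011):

import numpy as np

rng = np.random.default_rng(0)
# Each particle is a joint hypothesis (body_yaw, head_yaw) in degrees.
particles = rng.uniform(0.0, 360.0, size=(500, 2))

def step(particles, body_lik, head_lik):
    # Propagate with a small random walk (assumed motion noise).
    particles = particles + rng.normal(0.0, 5.0, size=particles.shape)
    # Coupling prior: heads rarely turn far beyond the torso, so large
    # body-head differences are penalized (assumed sigma of 60 deg).
    diff = (particles[:, 1] - particles[:, 0] + 180.0) % 360.0 - 180.0
    w = np.exp(-0.5 * (diff / 60.0) ** 2)
    # Weight by observation likelihoods, e.g. classifier scores derived
    # from the multi-level HOG and color features.
    w *= np.array([body_lik(b % 360.0) for b in particles[:, 0]])
    w *= np.array([head_lik(h % 360.0) for h in particles[:, 1]])
    w /= w.sum()
    # Resampling keeps the particle set focused on likely poses.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]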
Another approach likewise estimates both the head and the body pose (Flohr et al., 2015). For both poses, eight orientation-specific detectors are trained, whose class centers are shifted by 45° each. To locate the exact body and head positions in the image, they make use of disparity information obtained from the stereo input data. Based on this, a DBN (Dynamic Bayesian Network) is used to infer the current orientation states, whereby the current head pose depends on the previous head pose as well as on the current body pose.
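In equation form, this dependency structure corresponds to the factorized transition model (our notation, not taken from Flohr et al. (2015)):

p(h_t, b_t | h_{t-1}, b_{t-1}) = p(h_t | h_{t-1}, b_t) · p(b_t | b_{t-1}),

where h_t and b_t denote the head and body orientation states at time t: the body orientation evolves on its own, while the head orientation is conditioned on its own history and on the current body orientation.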
Recently, (deep) neural networks have become increasingly important, and their application also aims at improving head pose estimation. Recent networks such as those presented in (Patacchiola and Cangelosi, 2017) or (Ruiz et al., 2017) predict yaw, pitch, and roll angles in a continuous manner and achieve high accuracy. Here, too, however, the input is only the head section, which must be available in relatively high resolution. If these approaches are to be used in the context of automated driving, pedestrians and their head positions must be recognized early, i.e., from a great distance, and the resulting low quality of the input data does not meet the requirements of the methods mentioned above.
The present work, therefore, presents a neural network that recognizes head poses from images of the quality delivered by cameras commonly used in vehicles. Not only the head but the entire image section of the pedestrian serves as input, since the head pose in relation to the upper body provides further important information. From this it can be deduced, for example, whether a pedestrian shows safety behaviour, which is a clear indication of the intention to cross the road.
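As a rough illustration of such a detector, the following sketch shows a CNN with a shared feature extractor and two regression outputs operating on a full pedestrian crop; the layer sizes, the 128×64 input resolution, and the (sin, cos) output encoding are our own illustrative assumptions, while the actual architecture is described in Chapter 3:

import torch
import torch.nn as nn

class HeadBodyPoseNet(nn.Module):
    """Sketch of a CNN with a shared backbone and two pose outputs.
    Layer sizes and input resolution (3 x 128 x 64 pedestrian crop)
    are illustrative assumptions; see Chapter 3 for the real design."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One regression output per pose, each as a (sin, cos) pair.
        self.head_yaw = nn.Linear(128, 2)
        self.body_yaw = nn.Linear(128, 2)

    def forward(self, x):
        z = self.backbone(x)
        return self.head_yaw(z), self.body_yaw(z)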
For the training of this head and upper-body pose detector, commonly available data sets for head poses and pedestrians in general, such as Human3.6m², PETA³, or INRIA⁴, cannot be used, because the reference to the upper-body alignment is missing.
In addition, most researchers only consider yaw angles in the range of −90° to 90°, i.e., the frontal view of the pedestrian heads. In the present project, however, it is just as important whether a passer-by perceives oncoming traffic or the automated driving vehicle.
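A common way to handle such a full 360° yaw range without the wrap-around discontinuity at ±180° is to regress a (sin, cos) pair instead of the raw angle; the following generic sketch illustrates this encoding and is not necessarily the representation chosen in this paper:

import numpy as np

def encode_yaw(yaw_deg):
    """Full-range yaw -> continuous 2D regression target."""
    rad = np.deg2rad(yaw_deg)
    return np.array([np.sin(rad), np.cos(rad)])

def decode_yaw(vec):
    """Prediction (sin, cos) -> angle in (-180, 180] degrees."""
    return np.rad2deg(np.arctan2(vec[0], vec[1]))

# The wrap-around at +/-180 deg causes no discontinuity:
assert abs(decode_yaw(encode_yaw(179.0)) - 179.0) < 1e-9
assert abs(decode_yaw(encode_yaw(-179.0)) + 179.0) < 1e-9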
Therefore, a framework for the generation of a "full-range" data set will be briefly presented here. Using a semi-supervised approach, a trained convolutional neural network (CNN) is extended: the comparatively small amount of self-generated annotated data is enriched by a large amount of unlabeled data from real test drives within the project, and the detector thus achieves more accurate results.
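Such a semi-supervised enrichment can be pictured as a self-training (pseudo-labeling) loop; the confidence threshold, the number of rounds, and the fit/predict interface below are illustrative assumptions rather than the exact procedure used in this work:

def self_training(model, labeled, unlabeled, rounds=3, thresh=0.9):
    """Generic self-training loop (a sketch under assumed interfaces).

    labeled:   (image, pose) pairs generated with the framework
    unlabeled: images from real test drives within the project
    """
    train_set = list(labeled)
    for _ in range(rounds):
        model.fit(train_set)
        confident, rest = [], []
        for img in unlabeled:
            pose, confidence = model.predict(img)
            if confidence >= thresh:
                confident.append((img, pose))   # accept as pseudo-label
            else:
                rest.append(img)
        if not confident:                       # nothing new to learn from
            break
        train_set += confident                  # enrich the labeled set
        unlabeled = rest
    return model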
The contributions of this work can be summarized in
the following key points:
• a framework for generating a data set with head
and upper body poses,
• training and evaluation of a network (CNN) using
the data set,
• enhancement of training data with real driving
data,
• evaluation of an approach to semi-supervised
learning and improvement of the network.
The paper is structured as follows: Chapter 2 presents the framework for data set generation, and Chapter 3 gives a detailed description of our detector design. The obtained results are illustrated in Chapter 4. Chapter 5 draws a conclusion and presents an outlook on future work.
² http://vision.imar.ro/human3.6m/description.php
³ http://mmlab.ie.cuhk.edu.hk/projects/PETA.html
⁴ http://pascal.inrialpes.fr/data/human