in all cases. The remaining sections of this paper are
organized as follows. Section 2 reviews related
research efforts. Section 3 offers the full details of
our proposed model. Section 4 shows the results of
the performance evaluation, while Section 5 provides
concluding remarks.
2 RELATED WORK
In this section we discuss the main recent
contributions in pedestrian intention prediction. The
authors of (Joon-Young Kwak, 2017) propose a dynamic
fuzzy automata (DFA) method for pedestrian intention
prediction and use low-level features with a
boosted-type random forest classifier for pedestrian
detection and tracking. To capture the pedestrian’s
characteristics, they use the pedestrian’s distance
from the curb, the pedestrian’s speed, and the
direction of his/her head.
Four pedestrian intention states are defined: two of
them indicate that the pedestrian is not passing, and
the other two indicate that the pedestrian is passing.
This approach runs at 9 FPS, which is not sufficient
for real-time applications. In another paper,
Sebastian Köhler et al. (S. Köhler and Dietmayer, 2012)
show that the sub-region of the image that covers the
pedestrian within its bounding
box is available as a time series, e.g. by the fusion
of LIDAR and video data and a HOG-based detection.
Their approach generates motion descriptors within
this box and classifies the motion. This approach was
tested under lab conditions, so its efficiency in
real-time applications has not yet been proven.
Gurkan Solmaz et al. (Solmaz
et al., 2019) propose the use of Internet of Things
(IoT) technology, where the pedestrian’s next location
is predicted based on his/her historical data and
current position. Both the pedestrian’s GPS position
and velocity are obtained using a mobile device and
are fed to a trajectory model to predict the
pedestrian’s next position. This approach assumes that
all pedestrians are using a 4G mobile device and that
pedestrians always walk in the same direction, which
is not the case. Christoph Schöller et al. (C. Schöller
and Knoll, 2020) used a simple constant velocity
model (CVM) to predict the pedestrian intention that
does not require any information besides the
pedestrian’s last relative motion. They denote the
position $(x_i^t, y_i^t)$ of pedestrian $i$ at
time-step $t$ as $p_i^t$. The goal of pedestrian
motion prediction is to predict the future trajectory
$T_i = (p_i^{t+1}, \ldots, p_i^{t+n})$ for pedestrian
$i$, taking into account his or her own motion history
$H_i = (p_i^0, \ldots, p_i^t)$. The constant velocity
model approach mispredicts the pedestrian intention if he/she
suddenly changes his/her walking direction. In
(Rehder et al., 2018), the authors propose a different
approach that relies on predicting pedestrian intention
using goal-directed planning. They use a mixture
density function for possible destinations and use
this set of destinations as the goal states of a
planning stage that predicts the motion of the
pedestrian based on common motion patterns that are already
known. Those patterns are learned by a fully
convolutional network operating on maps of the
environment. R. Quintero et al. (Quintero et al., 2014)
considered three-dimensional pedestrian body language
in order to perform path prediction in a probabilistic
framework. For this purpose, the different
body parts and joints are detected using stereo vision.
The body pose algorithm they use takes as input a
point cloud of one pedestrian that has previously been
extracted from the general point cloud provided by the
stereo image pair. Let $P = \{p_1, \ldots, p_N\}$
represent the pedestrian point cloud with $N$ points.
The recursive nature of the algorithm makes the
accuracy of each body part depend on the accuracy of
the previous part. If a part is incorrectly detected,
all following parts will be
affected. In our proposed approach, we overcome the
mentioned limitations by using multiple features to
make the prediction results more reliable, and we take
processing time into account so that our approach fits
in real-time applications.
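The constant velocity model discussed above can be illustrated with a minimal sketch. It assumes that positions are 2-D points sampled at a fixed time step and that the last relative motion, $p_i^t - p_i^{t-1}$, is simply repeated for each of the $n$ future steps. The function name `predict_constant_velocity` and the sample values are hypothetical and are not taken from the cited work.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def predict_constant_velocity(history: List[Point], n_steps: int) -> List[Point]:
    """Extrapolate a trajectory by repeating the last observed displacement.

    history: observed positions (p^0, ..., p^t), oldest first (at least two).
    n_steps: number of future positions to predict (n).
    """
    if len(history) < 2:
        raise ValueError("need at least two observed positions")
    (x_prev, y_prev), (x_last, y_last) = history[-2], history[-1]
    # Last relative motion, used as the constant per-step velocity.
    dx, dy = x_last - x_prev, y_last - y_prev
    return [(x_last + k * dx, y_last + k * dy) for k in range(1, n_steps + 1)]


# Hypothetical example: a pedestrian moving 0.5 units to the right per step.
observed = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
future = predict_constant_velocity(observed, n_steps=3)
print(future)  # [(1.5, 0.0), (2.0, 0.0), (2.5, 0.0)]
```

Because the prediction only extrapolates the last observed step, a change of direction after the final observation cannot be captured, which is the limitation noted above.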
3 INTENTION PREDICTION MODEL
Our system architecture consists of several stages.
First, a frame captured by the monocular camera acts
as input to our system. Then, the human detection
model and the head pose model use the captured
frame as an input. The output of the human detection
model is a bounding box for each pedestrian in the
frame, which is used as the pedestrian’s current
location, while the head pose model outputs the head
orientation for all the pedestrians. The frame with
bounding boxes, after pre-processing, acts as input for
the body pose estimation model and the constant
velocity model, while the bounding box points are used
to detect the pedestrian’s direction and the side from
which the pedestrian will pass. The pedestrian’s
position is also used to estimate their moving speed.
The output of the system is the person’s future
position, predicted using the constant velocity model,
and the person’s intention, classified as “passing” or
“not passing”.
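As a rough illustration of how bounding boxes can yield the location, moving direction, and speed described above, the sketch below uses the bottom-center of each box as the pedestrian's position and compares two consecutive frames. The `BoundingBox` type, the choice of reference point, the pixel units, and the frame rate are assumptions made for this example; the text above does not specify these details.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def position(self) -> Tuple[float, float]:
        # Assumption: the bottom-center of the box approximates the feet.
        return ((self.x_min + self.x_max) / 2.0, self.y_max)


def motion_between(prev: BoundingBox, curr: BoundingBox, fps: float) -> Tuple[float, str]:
    """Return (speed in pixels/second, horizontal direction) between two frames."""
    (x0, y0), (x1, y1) = prev.position(), curr.position()
    dx, dy = x1 - x0, y1 - y0
    # Displacement per frame multiplied by frames per second.
    speed = math.hypot(dx, dy) * fps
    if dx > 0:
        direction = "right"
    elif dx < 0:
        direction = "left"
    else:
        direction = "none"
    return speed, direction


# Hypothetical boxes for one pedestrian in two consecutive frames at 30 FPS.
prev_box = BoundingBox(100.0, 50.0, 140.0, 200.0)
curr_box = BoundingBox(103.0, 50.0, 143.0, 204.0)
speed, direction = motion_between(prev_box, curr_box, fps=30.0)
print(speed, direction)  # 150.0 right
```

Speeds obtained this way are in image space; converting them to metric units would require camera calibration, which is outside the scope of this sketch.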
Multi-feature and Modular Pedestrian Intention Prediction using a Monocular Camera