realistic situations, we have to pay careful attention to pedestrians who cross a street by suddenly starting to run, and such (sudden) running is regarded as a high-risk action that an AEB system should treat with early braking. Therefore, in this paper, we address the problem of predicting the (sudden) running action of pedestrians by detecting the sign that preindicates the running action beforehand. In addition, we employ an appearance-based approach using only static image features, even though motion features might seem more suitable for recognizing actions, since it is quite hard to extract reliable motion features from a moving on-board camera. A primary question is how early we can predict the running action, or more basically, whether such a sign (preindicator) exists at all. We empirically answer this question in the framework of feature selection and identify the effective preindicator from a quantitative viewpoint. Besides, we also give it a useful qualitative interpretation from a biomechanical viewpoint.
2 APPEARANCE-BASED ACTION PREDICTION
In this section, we detail the action prediction method that uses only static image features. This method is based on the assumption that the action preindicator can be sufficiently described by the distinctive pedestrian shape, not by motion itself.
2.1 Static Image Feature
To characterize the human shape in detail, we employ the gradient local auto-correlation (GLAC) method (Kobayashi and Otsu, 2008). The GLAC method extracts the co-occurrence of gradient orientations as second-order statistics, whereas HOG (Dalal and Triggs, 2005) is based only on first-order statistics of the occurrence of gradient orientations. Suppose the pedestrian is detected by an arbitrary method and the bounding box enclosing the pedestrian is provided as shown in fig. 1. As in common approaches such as HOG (Dalal and Triggs, 2005), the bounding box is spatially partitioned into a regular 3 × 3 grid, at each cell of which GLAC features are extracted, and then the final feature vector is constructed by concatenating those cell-wise feature vectors; the setting of 9 orientation bins for gradients and 4 spatial co-occurrence patterns produces GLAC features of 324 dimensionality per cell, and the final feature is formed as a 2916 (= 324 × 3 × 3) dimensional vector.
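To make this construction concrete, the following is a minimal sketch of a simplified GLAC-style extraction over a 3 × 3 grid, written in Python with NumPy; the hard orientation binning, the magnitude weighting, and the particular four displacement offsets are illustrative assumptions and may differ from the exact formulation of Kobayashi and Otsu (2008).

    import numpy as np

    def glac_cell(mag, ori_bin, n_bins=9,
                  offsets=((0, 1), (1, 0), (1, 1), (1, -1))):
        """Second-order GLAC-style statistics for one grid cell (simplified).

        mag     : (H, W) gradient magnitudes of the cell
        ori_bin : (H, W) integer orientation bins in [0, n_bins)
        Returns an n_bins * n_bins * len(offsets) vector (9 * 9 * 4 = 324 dims).
        """
        feat = np.zeros((len(offsets), n_bins, n_bins))
        H, W = mag.shape
        for k, (dy, dx) in enumerate(offsets):
            for y in range(max(0, -dy), min(H, H - dy)):
                for x in range(max(0, -dx), min(W, W - dx)):
                    b0, b1 = ori_bin[y, x], ori_bin[y + dy, x + dx]
                    # co-occurrence of orientation bins, weighted by magnitudes
                    feat[k, b0, b1] += mag[y, x] * mag[y + dy, x + dx]
        return feat.ravel()

    def pedestrian_feature(gray, n_bins=9, grid=(3, 3)):
        """Concatenate cell-wise features over a 3 x 3 grid -> 2916 dims."""
        gy, gx = np.gradient(gray.astype(float))
        mag = np.hypot(gx, gy)
        ori = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation
        ori_bin = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)
        H, W = gray.shape
        cells = []
        for i in range(grid[0]):
            for j in range(grid[1]):
                ys = slice(i * H // grid[0], (i + 1) * H // grid[0])
                xs = slice(j * W // grid[1], (j + 1) * W // grid[1])
                cells.append(glac_cell(mag[ys, xs], ori_bin[ys, xs], n_bins))
        return np.concatenate(cells)                   # 324 * 3 * 3 = 2916

Applied to a grayscale crop of the detected bounding box, this yields a 2916-dimensional static feature of the kind described above.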
The 3 × 3 spatial grid is much coarser than those of HOG-related methods. The GLAC method can characterize the human shape more discriminatively because it exploits co-occurrence, and thus even such a coarse grid is sufficient for static image features. In addition, the coarser grid renders robustness with respect to the spatial position of the human shape; that is, the features are stably extracted even for misaligned bounding boxes. On the other hand, a 3 × 3 grid is considered the coarsest one that can still capture the human shape: the head, torso, two arms, and two legs are roughly aligned to the respective spatial cells.
2.2 Action Prediction
Based on the time-series sequence of image features extracted from the bounding boxes, we predict the action that will occur in the near future.
We consider a subsequence of T frames, each represented by the image feature vector described in the previous subsection, and index its last (T-th) frame, at which the action may occur, as time 0. From this subsequence, we pick up D frames (feature vectors) in the interval [t − D + 1, t] to predict the action at time 0. Those D feature vectors are concatenated into a single feature vector of relatively high dimension (fig. 2), which is finally passed to a linear SVM classifier to predict whether running will occur at time 0 or not. The concatenated feature indirectly encodes motion information of the pedestrian over the D frames. Since we cannot know in advance which timing {t, D} produces better performance for predicting the running action, those parameters are determined empirically from the data from a quantitative viewpoint. Obviously, a smaller t is preferable since it provides an earlier prediction; on the contrary, t = 0 means on-time classification and gives no prediction at all.
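As a rough sketch of how the concatenated feature and the linear SVM might be wired together, the snippet below assumes the frame-wise static features are stored per frame index (with 0 being the frame at which running may start); the dictionary storage, the scikit-learn LinearSVC classifier, and the regularization value C = 1.0 are assumptions for illustration, not the authors' exact setup.

    import numpy as np
    from sklearn.svm import LinearSVC

    def temporal_feature(frame_feats, t, D):
        """Concatenate the D frame-wise feature vectors in [t - D + 1, t].

        frame_feats : dict mapping frame index (0 = frame of the possible
                      running onset, negative = earlier frames) to its
                      2916-dim static feature vector.
        """
        return np.concatenate([frame_feats[i] for i in range(t - D + 1, t + 1)])

    def train_predictor(sequences, t, D):
        """sequences: list of (frame_feats, label) pairs, where label = 1 if
        running occurs at time 0 and 0 otherwise (hypothetical format)."""
        X = np.stack([temporal_feature(f, t, D) for f, _ in sequences])
        y = np.array([label for _, label in sequences])
        clf = LinearSVC(C=1.0)   # linear SVM on the concatenated features
        clf.fit(X, y)
        return clf

At test time, the same concatenation is applied to an incoming sequence and the trained classifier outputs the running / not-running decision for time 0.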
3 EXPERIMENTS
This section describes the experimental procedure for determining the parameters {t, D} in the proposed method (section 2.2) as well as for evaluating it.
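As a rough illustration of what such a parameter search could look like, the sketch below scores candidate {t, D} pairs by cross-validated accuracy, reusing the hypothetical temporal_feature helper from the previous sketch; the candidate values, the number of folds, and the use of scikit-learn's cross_val_score are assumptions, not the authors' actual protocol.

    from itertools import product

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def select_t_and_D(sequences, t_candidates=(-30, -20, -10, 0),
                       D_candidates=(1, 3, 5, 10), n_folds=5):
        """Pick the {t, D} pair with the best cross-validated accuracy."""
        y = np.array([label for _, label in sequences])
        best, best_score = None, -np.inf
        for t, D in product(t_candidates, D_candidates):
            X = np.stack([temporal_feature(f, t, D) for f, _ in sequences])
            score = cross_val_score(LinearSVC(C=1.0), X, y, cv=n_folds).mean()
            if score > best_score:
                best, best_score = (t, D), score
        return best, best_score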
3.1 Dataset
The dataset that we use contains 57 video sequences of 12 children captured by a (fixed) video camera at 30 fps in a gymnasium (fig. 3).¹ Children behave unpredictably in context and are thus regarded as subjects that require careful attention. They first walk
¹ This experiment was approved by the Ethical Review Board of Mazda Motor Corporation, and the informed consent of all subjects was also obtained.