2 BACKGROUND WORK
A vision-based approach to this problem has been
proposed in the past (Lalonde et al., 2015). It relied
on the post-experiment analysis of video footage cap-
tured using a handheld camera to determine the sub-
ject’s movement from an offset point of view. The
subject’s feet were localized with respect to known
landmarks (markings painted on the ground) for spa-
tial referencing. Such an approach is convenient in
terms of data acquisition: an observer merely needs
to walk behind the subject with a camcorder, the
footage is easy to acquire and manage, and its reso-
lution is consistently high. However, many challenges made the
analysis phase difficult, most notably the large vari-
ations in illumination and ensuing cast shadows, as
well as the lack of robustness of the feet tracking al-
gorithm. In addition, the method was dependent on
the presence of several lines painted on the pavement,
and their location had to be precisely known a priori.
In this paper, we tackle the movement mapping
problem using a vision-based simultaneous localiza-
tion and mapping (SLAM) approach. The idea be-
hind this kind of approach is to use visual data to con-
currently build a model of the local environment (i.e.
a “map”) and estimate the state (or location) of the
camera within it. In our case, the map of the envi-
ronment is not our primary focus, as our specific ap-
plication only relies on odometry, and loop closure is
not needed (i.e. we analyze one-way street crossings).
Nonetheless, environment maps can be used to correct
scaling issues found in monocular camera setups (as
discussed further in Section 3).
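In abstract terms, this joint estimation can be written as a maximum a posteriori problem (our notation, following the standard formulation rather than any specific system):
\[
\hat{\mathbf{x}}, \hat{\mathbf{m}} = \operatorname*{arg\,max}_{\mathbf{x},\,\mathbf{m}} \; p(\mathbf{x}, \mathbf{m} \mid \mathbf{z}),
\]
where \(\mathbf{x}\) denotes the camera trajectory, \(\mathbf{m}\) the map parameters, and \(\mathbf{z}\) the visual observations; odometry-focused applications such as ours are mainly interested in \(\mathbf{x}\).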
SLAM methods can be separated into direct
and indirect approaches. Indirect SLAM methods
such as ORB-SLAM (Mur-Artal et al., 2015) and
PTAM (Klein and Murray, 2007) typically use key-
point detectors to extract unique landmarks from the
observed images, and then estimate scene geometry
and camera extrinsics using a probabilistic model.
This classic approach is efficient in practice due
to the sparse nature of visual keypoints, and it is quite
robust to noise in geometric observations. However,
these keypoint-based methods fail when the observed
images are composed mostly of uniformly-textured
regions. Direct SLAM methods such as DSO (Engel
et al., 2017) and LSD-SLAM (Engel et al., 2014) rely
on local image intensities instead of sparse keypoints
to represent observations in their model. The advan-
tage of this approach is that it can use and reconstruct
any observed surface with an intensity gradient. This
is a crucial requirement for our application, as most
street crosswalk surfaces show repetitive landmarks
and high-frequency or uniform textures, which would
hinder the performance of an indirect SLAM method.
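In simplified form (our notation; robust weighting, outlier rejection, and DSO's affine brightness parameters are omitted), the two families minimize different objectives:
\[
E_{\text{indirect}} = \sum_{i} \big\| \mathbf{u}_i - \pi(R\,\mathbf{X}_i + \mathbf{t}) \big\|^2,
\qquad
E_{\text{direct}} = \sum_{\mathbf{p}} \big( I_j[\pi(R\,\mathbf{X}_{\mathbf{p}} + \mathbf{t})] - I_i[\mathbf{p}] \big)^2,
\]
where \(\pi(\cdot)\) is the camera projection, \(\mathbf{u}_i\) are detected keypoint locations, \(\mathbf{X}_{\mathbf{p}}\) is the 3D point associated with pixel \(\mathbf{p}\), and \(I_i, I_j\) are image intensities. The former requires repeatable keypoints; the latter only requires surfaces with an intensity gradient.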
Note that self-localization using only a cam-
era has been studied extensively before, but mostly
for robots or vehicles in large-scale contexts (Se et al.,
2002; Pink et al., 2009; Brubaker et al., 2016). In our
case, a person’s gait directly affects the stability and
height of the camera, which can in turn hinder the per-
formance of traditional localization methods based on
landmarks or holonomic constraints.
For a more complete look at various SLAM
methodologies and algorithms, we refer the reader to
the recent survey by Cadena et al. (2016).
3 STRATEGY
In this work, we take advantage of the recent devel-
opments in robot vision and SLAM, and explore the
use of visual odometry techniques to localize a per-
son during a street crossing. Thus, instead of having
someone hold the camera behind the subject and try
to track both the subject and the environment (using
e.g. added markers on the ground for proper localiza-
tion), we equip the subject with a calibrated camera
facing the street. Localizing the subject then amounts
to tracking the camera pose throughout the crossing.
As noted before, SLAM using a single camera
setup (i.e. a monocular setup) entails that the abso-
lute scale of the environment is unknown — this is
a problem for us, as deviations need to be recorded
and registered in a fixed coordinate system. Some
SLAM extensions rely on GPS, IMUs, or altime-
ters to correct this issue via sensor fusion using Ex-
tended Kalman Filters (Lynen et al., 2013). Others
instead rely on assumptions about the camera height
above the ground plane (Song et al., 2016), or about
its movement in very constrained settings (Gutiérrez-Gómez et al., 2012; Scaramuzza et al., 2009).
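This ambiguity follows directly from the projection equation: homogeneous image coordinates are only defined up to scale, so multiplying all scene points and camera translations by any factor \(s > 0\) leaves every projection unchanged (our notation):
\[
\mathbf{u} \sim K(R\,\mathbf{X} + \mathbf{t}) \sim K\big(R\,(s\,\mathbf{X}) + s\,\mathbf{t}\big), \quad s > 0.
\]
Monocular observations alone therefore cannot determine \(s\); it must be recovered from external metric information.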
In our case, we obtain camera trajectories using the Di-
rect Sparse Optimization (DSO) method (Engel et al.,
2017), and then fix this scaling issue by solving a
camera Perspective-n-Point (PnP) problem using cal-
ibration boards placed around the crosswalk. Since
we know the exact dimensions and grid layouts of
these boards, we can determine their orientation and
distance to the camera in specific key frames of the
analyzed video sequences using the OpenCV calibra-
tion toolbox. These distances can then be averaged
and used to properly scale the “map” provided by the
SLAM algorithm. Furthermore, by fixing a calibra-
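As an illustration of this step, the per-key-frame board pose estimation could look as follows. This is a minimal sketch using standard OpenCV calls; the board layout (9x6 inner corners, 40 mm squares) and function names are illustrative assumptions, not our exact configuration:

```python
import cv2
import numpy as np

# Hypothetical board layout: 9x6 inner corners, 40 mm squares.
PATTERN = (9, 6)
SQUARE_SIZE = 0.04  # meters

# 3D corner coordinates in the board's own frame (board lies in Z = 0).
OBJ_PTS = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
OBJ_PTS[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)
OBJ_PTS *= SQUARE_SIZE

def board_distance(gray, K, dist_coeffs):
    """Metric camera-to-board distance in one key frame, or None."""
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        return None
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    ok, rvec, tvec = cv2.solvePnP(OBJ_PTS, corners, K, dist_coeffs)
    return float(np.linalg.norm(tvec)) if ok else None

# The map scale is then the ratio of the averaged metric distances to the
# corresponding (unit-less) distances measured in the SLAM reconstruction:
#   scale = np.mean(metric_dists) / np.mean(slam_dists)
```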
Furthermore, by fixing a calibration
board directly on the ground, a coordinate space
reference can be created, meaning all experiments can
be registered to the same coordinate system.
Finally, note that we could also use the length of the
crosswalk itself as an additional scale reference.