Localizing People in Crosswalks using Visual Odometry:

Preliminary Results

Marc Lalonde

, Pierre-Luc St-Charles

, D

elia Loupias

, Claude Chapdelaine

and Samuel Foucher

Vision and Imaging Team, Computer Research Institute of Montreal, 405 Ogilvy Ave. suite 101, Montreal, Quebec, Canada

epartement de G

enie

Electronique et Informatique, Institut National des Sciences Appliqu

ees, 135 Avenue de Rangueil,

31400 Toulouse, France

Keywords:

Visual Odometry, SLAM, Direct Sparse Odometry.

Abstract:

This paper describes a prototype for the localization of pedestrians carrying a video camera. The application

envisioned here is to analyze the trajectories of blind people going across long crosswalks when following an

accessible pedestrian signal (APS), in the context of signal optimization. Instead of relying on an observer

for manually logging the subjects’ position at regular time intervals with respect to the crosswalk, we propose

to equip the subjects with a wearable camera: a visual odometry algorithm then recovers the trajectory and

spatial analysis can then determine to which extent the subject remained within reasonable boundaries while

performing the crossing. Preliminary tests in conditions similar to a street crossing show that our results

qualitatively agree with the physical behavior of the subject.

1 INTRODUCTION

The accurate 2D localization of deformable objects

such as pedestrians without a top-down view or a pla-

nar scene assumption is a challenging task. In an un-

constrained setting where other objects might be si-

multaneously moving around a target of interest, and

where static visual references are few, most classic

vision-based solutions are prone to failure. In this

work, we propose an early prototype to measure the

trajectory of blind subjects crossing a street intersec-

tion with the aid of various accessible pedestrian sig-

nals (APS). Our goal is to determine which signals are

more adequate in terms of pitch, melody, etc. in guid-

ing a blind person to align themselves with the cross-

walk, and to remain within its boundaries throughout

the crossing. For this, we measure each subject’s de-

viation with respect to the center of the crosswalk,

varying the signal used in each experiment. Bet-

ter signals should, on average, minimize such devi-

ations. Previous data collection protocols usually re-

quired researchers to visually estimate the deviations

as the person walks in front of them, which is obvi-

ously inaccurate and labor-intensive (Laroche et al.,

2000). Since hundreds of crossings may also be re-

quired for the proper statistical analysis of deviations

in our problem, this approach is unsuitable. On the

other hand, hardware-based localization solutions are

not always adequate due to the spatial accuracy re-

quired for proper analysis (≈15 cm); for example, the

accuracy of consumer GPS devices is a few meters

in good conditions. Furthermore, the equipment used

should not disturb the subjects’ progress during cross-

ing, and measurements should be done inside a fairly

large volume since the intersection that is selected for

the experiment is a six-lane boulevard with a median

(total walking distance is 30m).

Recent advances in robot vision, most notably

in Simultaneous Localization and Mapping (SLAM)

techniques, may provide an elegant solution to our

problem: if the subject is wearing a camera while

performing the crossing, the tracking of the camera

pose would allow the recovery of the 3D trajectory of

the subject, hence its deviation with respect to some

reference points. This paper is organized as follows:

ﬁrst, Section 2 describes a previous approach to the

problem and also provides some background informa-

tion about visual odometry; in Section 3, the proposed

strategy is exposed; and ﬁnally, in Section 4 we report

on some preliminary results gathered during a short

experiment.

482

Lalonde, M., St-Charles, P-L., Loupias, D., Chapdelaine, C. and Foucher, S.

Localizing People in Crosswalks using Visual Odometry: Preliminary Results.

DOI: 10.5220/0006646904820487

In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 482-487

ISBN: 978-989-758-276-9

2 BACKGROUND WORK

A vision-based approach to this problem has been

proposed in the past (Lalonde et al., 2015). It relied

on the post-experiment analysis of video footage cap-

tured using a handheld camera to determine the sub-

ject’s movement from an offset point of view. The

subject’s feet were localized with respect to known

landmarks (markings painted on the ground) for spa-

tial referencing. Such an approach is convenient in

terms of data acquisition: an observer merely needs

to walk behind the subject with a camcorder, video

acquisition and management is easy, resolution is al-

ways good, etc. However, many challenges made the

analysis phase difﬁcult, most notably the large vari-

ations in illumination and ensuing cast shadows, as

well as the lack of robustness of the feet tracking al-

gorithm. In addition, the method was dependent on

the presence of several lines painted on the pavement,

and their location had to be precisely known a priori.

In this paper, we tackle the movement mapping

problem using a vision-based simultaneous localiza-

tion and mapping (SLAM) approach. The idea be-

hind this kind of approach is to use visual data to con-

currently build a model of the local environment (i.e.

a “map”) and estimate the state (or location) of the

camera within it. In our case, the map of the envi-

ronment is not our primary focus, as our speciﬁc ap-

plication only relies on odometry, and loop closure is

not needed (i.e. we analyze one-way street crossings).

Nonetheless, environment maps can be used to correct

scaling issues found in monocular camera setups (as

discussed further in Section 3).

SLAM methods can be separated into direct

and indirect approaches. Indirect SLAM methods

such as ORB-SLAM (Mur-Artal et al., 2015) and

PTAM (Klein and Murray, 2007) typically use key-

point detectors to extract unique landmarks from the

observed images, and then estimate scene geometry

and camera extrinsics using a probabilistic model.

This classic approach is quite efﬁcient in practice due

to the sparse nature of visual keypoints, and it is quite

robust to noise in geometric observations. However,

these keypoint-based methods fail when the observed

images are composed mostly of uniformly-textured

regions. Direct SLAM methods such as DSO (Engel

et al., 2017) and LSD-SLAM (Engel et al., 2014) rely

on local image intensities instead of sparse keypoints

to represent observations in their model. The advan-

tage of this approach is that it can use and reconstruct

any observed surface with an intensity gradient. This

is a crucial requirement for our application, as most

street crosswalk surfaces show repetitive landmarks

and high-frequency or uniform textures, which would

hinder the performance of an indirect SLAM method.

Besides, note that self-localization using only a cam-

era has been studied extensively before, but mostly

for robots or vehicles in large scale contexts (Se et al.,

2002; Pink et al., 2009; Brubaker et al., 2016). In our

case, a person’s gait directly affects the stability and

height of the camera, which can in turn hinder the per-

formance of traditional localization methods based on

landmarks or holonomic constraints.

For a more complete look at various SLAM

methodologies and algorithms, we refer the reader to

the recent survey of (Cadena et al., 2016).

3 STRATEGY

In this work, we take advantage of the recent devel-

opments in robot vision and SLAM, and explore the

use of visual odometry techniques to localize a per-

son during a street crossing. So, instead of having

someone hold the camera behind the subject and try

to track both the subject and the environment (using

e.g. added markers on the ground for proper localiza-

tion), we equip the subject with a calibrated camera

facing the street. Localizing the subject then amounts

to tracking the camera pose throughout the crossing.

As noted before, SLAM using a single camera

setup (i.e. a monocular setup) entails that the abso-

lute scale of the environment is unknown — this is

a problem for us, as deviations need to be recorded

and registered in a ﬁxed coordinate system. Some

SLAM extensions rely on GPS, IMUs, or altime-

ters to correct this issue via sensor fusion using Ex-

tended Kalman Filters (Lynen et al., 2013). Others

instead rely on assumptions about the camera height

above the ground plane (Song et al., 2016), or about

its movement in very constrained settings (Guti

errez-

omez et al., 2012; Scaramuzza et al., 2009). In

our case, we obtain camera trajectories using the Di-

rect Sparse Optimization (DSO) method (Engel et al.,

2017), and then ﬁx this scaling issue by solving a

camera Perspective-n-Point (PnP) problem using cal-

ibration boards placed around the crosswalk. Since

we know the exact dimensions and grid layouts of

these boards, we can determine their orientation and

distance to the camera in speciﬁc key frames of the

analyzed video sequences using the OpenCV calibra-

tion toolbox. These distances can then be averaged

and used to properly scale the “map” provided by the

SLAM algorithm. Furthermore, by ﬁxing a calibra-

tion board directly on the ground, a coordinate space

reference can be created, meaning all experiments can

be registered to the same coordinate system. Finally,

note that we could also use the length of the crosswalk

Localizing People in Crosswalks using Visual Odometry: Preliminary Results

483

Figure 1: Block diagram of the approach.

(which is known a priori) to roughly validate the scale

determined by solving the PnP problem. The strategy

is depicted in Fig. 1.

4 RESULTS AND DISCUSSION

4.1 Experimental Results

Our preliminary experiments were conducted in a

15m x 5m zone of an outdoor parking lot, so as to

simulate a street intersection (reduced by a scale of

1/2). The experimental setting is depicted in Fig. 2.

We laid white tape on the pavement so as to form

a 2m-wide corridor. A single subject was outﬁtted

with a chest mount body harness equipped with a

Hero3+ GoPro. The subject then simulated multiple

crossings inside the corridor while carrying the Go-

Pro, with varying trajectories with respect to the cen-

terline of the corridor. The GoPro camera was ori-

ented in portrait mode and slightly tilted toward the

ground, so that both the horizon (including neighbor-

ing buildings, parked cars, street furniture, etc.) and

the ground (pavement, line markings, etc.) were visi-

ble in the video frames. One key advantage with this

camera conﬁguration is the possibility of adding land-

marks such as chessboard patterns on the ground, vis-

ible at the beginning as well as the end of the simula-

tion. This allows the computation of the exact camera

pose (3D position and orientation) at those moments,

which corresponds to the initial/ﬁnal absolute anchor

points for the (relative) VO-computed trajectory, as

mentioned in Section 3.

An example of a simulated crossing is given in

Fig. 3, where we present some video frames as well as

the top view of the 3D trajectory provided by DSO

This top view representation is in line with the actual

path followed by the subject: the starting point is in

the middle of the corridor, there is a drift towards the

right up to the edge of the corridor (roughly at mid

point during displacement), and then a realignment

towards the center.

4.2 Discussion

Overall, video sequences of nine crossings were col-

lected. It however should be pointed out that only

seven of them have been processed successfully by

DSO: for the other sequences, the algorithm either

lost track of the camera position mid-crossing, or it

was unable to go beyond the initialization stage due

to strong orientation variations in early frames. In

that regard, an observation can be made about ini-

tialization. It seems that the ﬁrst frames of the video

sequence greatly inﬂuence DSO’s behavior: if visual

odometry starts as the person wearing the camera is

already in motion, the initial estimate for the camera

pose may be irreversibly biased, without any possi-

bility of recovery. We hypothesize that the cause is

due to the oscillatory nature of a person’s walking

pattern which induces an undesirable orientation of

The implementation used in this work has

been made available by (Engel et al., 2017) at

https://github.com/JakobEngel/dso

ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods

484

Figure 2: Sketch of the experimental crosswalk setup.

Figure 3: Rendering of a trajectory. Top: video frames at 0s, 9s, 10s and 12s of the 16s video clip. Bottom: top view of the

recovered trajectory. The strong deviation in the middle of the sequence is clearly visible.

Localizing People in Crosswalks using Visual Odometry: Preliminary Results

485

Figure 4: Additional results for seven crossings. Trajectories are shown in red, and the pair of blue lines represents the

15m-long crosswalk.

the camera pose from which the algorithm cannot re-

cover. This observation underlines the importance of

carefully designing the experimental protocol when

subjects will be involved in a real setting.

Another observation is about the high number

of feature points that DSO can track as the camera

moves. As opposed to many competing methods,

DSO does not search for visual keypoints, but in-

stead splits each video frame into blocks and retains a

number of candidate points with high image gradients

in each block. The strategy makes sure that points

are well distributed throughout the frame, and even

across frames. Block size and gradient threshold are

also dynamically set to ensure that the pool of can-

didate points is sufﬁciently rich for the camera pose

estimate. Consequently, this strategy enables DSO

to perform well in the current context, even though

a weakly textured surface (pavement) occupies a sig-

niﬁcant portion of the image. Other methods such as

ORB-SLAM would have failed to provide any reason-

able odometry results in a similar context.

The accuracy of the odometry can be assessed us-

ing two sequences where the subject was asked to

walk in the center of the corridor (see Fig. 4). Con-

sidering that both trajectories are about 10% off the

corridor centerline (20cm on the left-hand side) and

that the camera held by the subject was 8cm off the

body centerline (on the left-hand side as well) for me-

chanical reasons, a rough evaluation gives an error of

about 12cm for these two sequences. These encourag-

ing results justify the planning of a formal evaluation

involving blind persons in a real street intersection,

which will allow us to collect more accurate perfor-

mance measures. It will be interesting to evaluate the

stability of visual odometry in the presence of cast

shadows and moving objects such as pedestrians, bi-

cycles, etc.

5 CONCLUSION

This paper reported on preliminary tests involving vi-

sual odometry for localizing people in a street cross-

walk. Our objective is to measure the ability of a blind

person to engage in a crossing and stay on course

by listening to the accessible pedestrian signal. Pre-

liminary tests have shown that analyzing the video

footage from a wearable camera attached to a person

provides enough information to locate them in a street

crosswalk via camera pose estimation. Although the

focus of the paper is 3D positioning, visual odometry

may also allow for the monitoring of the orientation

of the person with respect to the crosswalk, for exam-

ple capturing hesitations as the crossing progresses.

ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods

486

For our live tests, the use of a wearable Inertial Mea-

surement Unit (IMU) may be considered to further

improve the accuracy of the odometry algorithm dur-

ing post-processing.

ACKNOWLEDGEMENTS

This work has been made possible through funding

from the Minist

ere de l’

Economie, de la Science et de

l’Innovation du Qubec (MESI) of Gouvernement du

ebec.

REFERENCES

Brubaker, M. A., Geiger, A., and Urtasun, R. (2016). Map-

based probabilistic visual self-localization. IEEE

Trans. Pattern Anal. Mach. Intell., 38(4):652–665.

Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza,

D., Neira, J., Reid, I., and Leonard, J. J. (2016). Past,

present, and future of simultaneous localization and

mapping: Toward the robust-perception age. IEEE

Trans. Robotics, 32(6):1309–1332.

Engel, J., Koltun, V., and Cremers, D. (2017). Direct sparse

odometry. IEEE Trans. Pattern Anal. Mach. Intell.,

PP(99):1–1.

Engel, J., Sch

ops, T., and Cremers, D. (2014). LSD-SLAM:

Large-scale direct monocular SLAM. In European

Conference on Computer Vision, pages 834–849.

Guti

errez-G

omez, D., Puig, L., and Guerrero, J. J. (2012).

Full scaled 3D visual odometry from a single wearable

omnidirectional camera. In Intelligent Robots and

Systems (IROS), 2012 IEEE/RSJ International Con-

ference on, pages 4276–4281.

Klein, G. and Murray, D. (2007). Parallel tracking and map-

ping for small AR workspaces. In International Sym-

posium on Mixed and Augmented Reality, pages 225–

234.

Lalonde, M., Chapdelaine, C., and Foucher, S. (2015). Lo-

calizing people in crosswalks with a moving handheld

camera: proof of concept. In Proc. SPIE 9405, Image

Processing: Machine Vision Applications VIII, vol-

ume 9405.

Laroche, C., Leroux, T., Giguere, C., and Poirier, P. (2000).

Field evaluation of audible trafﬁc signals for blind

pedestrians. In Triennial Congress of the International

Ergonomics Association.

Lynen, S., Achtelik, M. W., Weiss, S., Chli, M., and Sieg-

wart, R. (2013). A robust and modular multi-sensor

fusion approach applied to MAV navigation. In IEEE

International Conference on Intelligent Robots and

Systems, pages 3923–3929.

Mur-Artal, R., Montiel, J. M. M., and Tardos, J. D. (2015).

ORB-SLAM: a versatile and accurate monocular

SLAM system. IEEE Trans. Robotics, 31(5):1147–

1163.

Pink, O., Moosmann, F., and Bachmann, A. (2009). Vi-

sual features for vehicle localization and ego-motion

estimation. In IEEE Intelligent Vehicles Symposium,

pages 254–260.

Scaramuzza, D., Fraundorfer, F., Pollefeys, M., and Sieg-

wart, R. (2009). Absolute scale in structure from mo-

tion from a single vehicle mounted camera by exploit-

ing nonholonomic constraints. In International Con-

ference on Computer Vision, pages 1413–1419.

Se, S., Lowe, D., and Little, J. (2002). Mobile robot lo-

calization and mapping with uncertainty using scale-

invariant visual landmarks. Int. J. Robotics Research,

21(8):735–758.

Song, S., Chandraker, M., and Guest, C. C. (2016). High

accuracy monocular SfM and scale correction for au-

tonomous driving. IEEE Trans. Pattern Anal. Mach.

Intell., 38(4):730–743.

Localizing People in Crosswalks using Visual Odometry: Preliminary Results

487