GPU (32 GB), while the CPU-memory consumption was roughly 90 GB because the image dataset was loaded entirely into memory. The InfoSalGAIL framework is based on TensorFlow (version 1.15) and the Keras 2.0 library, and the connection to OpenDS has been realized over the Transmission Control Protocol (TCP) to guarantee no data loss.
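The wire format of this link is not specified here; the following is a minimal sketch of how such a TCP bridge could look on the Python side, assuming a hypothetical host, port, and length-prefixed framing (all three are illustrative assumptions, not part of the released code).

```python
import socket
import struct

# Illustrative TCP bridge to the simulator; host, port, and the
# length-prefixed framing are assumptions for this sketch only.
OPENDS_HOST = "127.0.0.1"   # hypothetical address of the OpenDS instance
OPENDS_PORT = 5678          # hypothetical port

def recv_exact(sock, n):
    """Read exactly n bytes; TCP guarantees delivery and order, but not message boundaries."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("simulator closed the connection")
        buf += chunk
    return buf

def receive_frame(sock):
    """Receive one observation (e.g. an encoded camera image) as a length-prefixed blob."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def send_action(sock, action):
    """Send a control action (e.g. heading and speed of the pedestrian avatar) back to the simulator."""
    payload = struct.pack("!ff", *action)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

if __name__ == "__main__":
    with socket.create_connection((OPENDS_HOST, OPENDS_PORT)) as sock:
        frame = receive_frame(sock)      # raw image bytes from the virtual scene
        send_action(sock, (0.1, 1.2))    # example (heading, speed) pair
```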
5 CONCLUSIONS
In this paper, we presented InfoSalGAIL, a novel approach to visual attention-empowered imitation learning of pedestrian behavior in critical traffic scenarios. It can handle large state spaces and their variations, e.g. on-street urban scenarios, and mimics the complex, human-like behavior of experts in a virtual environment. Moreover, we synthesized two classes of navigation (cf. Section 3), which makes InfoSalGAIL well suited to the challenge of critical traffic scenario generation. Our experiments revealed that InfoSalGAIL can significantly outperform the selected baseline, InfoGAIL, on the given objective due to the use of a saliency map and its direct influence on the policy generator when deriving the output vector of control actions. To support this research, the functionality of the OpenDS simulation software has been extended to allow pedestrian-centric control, resulting in a new dataset of more than 140K pairs of images and corresponding saliency maps generated from a virtual clone of the city of Saarbruecken (Germany).
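The exact architecture of the policy generator is described earlier in the paper; purely to illustrate how a saliency map can directly condition the derived control actions, the Keras sketch below fuses an RGB frame with its saliency map before the policy head. Layer sizes, the action dimensionality, and the concatenation-based fusion are assumptions of this sketch, not the authors' exact network.

```python
from tensorflow.keras import layers, Model, Input

# Illustrative saliency-conditioned policy head (sketch only; layer sizes,
# action dimensionality, and concatenation-based fusion are assumptions).
def build_policy(img_shape=(128, 128, 3), sal_shape=(128, 128, 1), n_actions=2):
    image = Input(shape=img_shape, name="camera_image")
    saliency = Input(shape=sal_shape, name="saliency_map")

    # Fuse the observation with its saliency map at the input level so that
    # salient regions directly influence the derived control actions.
    x = layers.Concatenate(axis=-1)([image, saliency])
    x = layers.Conv2D(32, 5, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)

    actions = layers.Dense(n_actions, activation="tanh", name="control_actions")(x)
    return Model(inputs=[image, saliency], outputs=actions)

policy = build_policy()
policy.summary()
```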
Future research will extend the saliency generator network by incorporating latent variables to further differentiate between the pedestrian-imitating avatars, e.g. in terms of age, average speed, and short-term interests. In this regard, the created benchmark will be extended with a new set of realistic scenarios based on the JAAD dataset⁵ to capture ground-truth data. In general, we hope that InfoSalGAIL attracts more attention to the topic of human-like behavior simulation for generating critical traffic scenarios for the virtual testing and validation of collision-free navigation methods of self-driving cars.
ACKNOWLEDGEMENTS
This research was funded by the German Federal
Ministry for Education and Research (BMBF) in the
project REACT under grant 01IW17003.
⁵ http://data.nvision2.eecs.yorku.ca/JAAD_dataset