Lange, S. and Riedmiller, M. (2010). Deep auto-encoder
neural networks in reinforcement learning. In The 2010
International Joint Conference on Neural Networks
(IJCNN), pages 1–8.
Lange, S., Riedmiller, M., and Voigtländer, A. (2012). Au-
tonomous reinforcement learning on raw visual input
data in a real world application. In The 2012 Interna-
tional Joint Conference on Neural Networks (IJCNN),
pages 1–8.
Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S.
(2019). Stochastic latent actor-critic: Deep reinforce-
ment learning with a latent variable model. CoRR,
abs/1907.00953.
Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representa-
tion learning with contrastive predictive coding. arXiv
preprint arXiv:1807.03748.
Prieur, L. (2017). Deep-Q learning using simple feedforward
neural network. GitHub Gist, https://goo.gl/VpDqSw.
Raffin, A., Hill, A., Traoré, K. R., Lesort, T., Díaz-Rodríguez,
N., and Filliat, D. (2019). Decoupling feature extrac-
tion from policy learning: assessing benefits of state
representation learning in goal based robotics. In SPiRL
Workshop at ICLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization algo-
rithms. CoRR, abs/1707.06347.
Smith, N. A. and Eisner, J. (2005). Contrastive estimation:
Training log-linear models on unlabeled data. In Pro-
ceedings of the 43rd Annual Meeting on Association
for Computational Linguistics, pages 354–362. Asso-
ciation for Computational Linguistics.
Srinivas, A., Laskin, M., and Abbeel, P. (2020). CURL: con-
trastive unsupervised representations for reinforcement
learning. CoRR, abs/2004.04136.
Stooke, A., Lee, K., Abbeel, P., and Laskin, M. (2020). De-
coupling representation learning from reinforcement
learning. arXiv:2004.14990.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing: An Introduction. MIT Press, Cambridge, MA,
USA, second edition.
Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S.,
and Lucic, M. (2019). On mutual information maxi-
mization for representation learning. In International
Conference on Learning Representations.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep
image prior. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
9446–9454.
Wahlström, N., Schön, T. B., and Deisenroth, M. P. (2015).
From pixels to torques: Policy learning with deep dy-
namical models. In ICML 2015 Workshop on Deep
Learning.
Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J.,
and Levine, S. (2018). SOLAR: deep structured latent
representations for model-based reinforcement learn-
ing. CoRR, abs/1808.09105.
APPENDIX: HYPERPARAMETERS
RL Agent
Model Architecture.
The input to the neural network is a 96×96×4 stack
of grayscale frames. There are a total of six convolutional
layers with ReLU activation functions. The first hidden
layer convolves 8 filters of 4×4 with stride 2 over the
input image. The second hidden layer convolves 16 filters
of 3×3 with stride 2. The third hidden layer convolves
32 filters of 3×3 with stride 2. The fourth hidden layer
convolves 64 filters of 3×3 with stride 2. The fifth hidden
layer convolves 128 filters of 3×3 with stride 1. The sixth
hidden layer convolves 64 filters of 3×3 with stride 1.
On top of these convolutional layers, the critic has a final
fully-connected hidden layer with 100 rectifier units,
followed by a fully-connected linear output layer with a
single output. The actor likewise has a fully-connected
layer with 100 rectifier units after the convolutional
layers. Its final layer consists of two parallel fully-connected
layers that map these 100 units to three outputs each,
one per valid action, with a tanh activation function.
From these two heads, we sample the continuous actions
using a Beta distribution.
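For concreteness, a PyTorch-style sketch of this actor-critic network is given below. Two details are our own assumptions rather than part of the description above: all convolutions use zero padding, and the two Beta parameters are made strictly positive with a softplus(·)+1 transform instead of the tanh mentioned, since the Beta distribution requires positive parameters. Output shapes are noted in the comments.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta

class ActorCritic(nn.Module):
    """Sketch of the described network; the zero padding and the
    Beta parameterization (softplus + 1) are our assumptions."""
    def __init__(self, n_actions=3):
        super().__init__()
        self.encoder = nn.Sequential(                    # input: 4 x 96 x 96
            nn.Conv2d(4, 8, 4, stride=2), nn.ReLU(),     # -> 8 x 47 x 47
            nn.Conv2d(8, 16, 3, stride=2), nn.ReLU(),    # -> 16 x 23 x 23
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),   # -> 32 x 11 x 11
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),   # -> 64 x 5 x 5
            nn.Conv2d(64, 128, 3, stride=1), nn.ReLU(),  # -> 128 x 3 x 3
            nn.Conv2d(128, 64, 3, stride=1), nn.ReLU(),  # -> 64 x 1 x 1
            nn.Flatten(),                                # -> 64 features
        )
        self.critic = nn.Sequential(nn.Linear(64, 100), nn.ReLU(), nn.Linear(100, 1))
        self.actor = nn.Sequential(nn.Linear(64, 100), nn.ReLU())
        self.alpha_head = nn.Linear(100, n_actions)      # two parallel heads,
        self.beta_head = nn.Linear(100, n_actions)       # one output per action

    def forward(self, obs):
        h = self.encoder(obs)
        value = self.critic(h)
        a = self.actor(h)
        alpha = F.softplus(self.alpha_head(a)) + 1       # assumption: keeps Beta params > 1
        beta = F.softplus(self.beta_head(a)) + 1
        return Beta(alpha, beta), value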
PPO’s Hyperparameters.
Frameskip = 8. Discount
Factor = 0.99. Value Loss Coefficient = 2. Image Stack
= 4. Buffer Size = 5000. Adam Learning Rate = 0.001.
Batch Size = 128. Entropy Term Coefficient = 0.0001.
Clip Parameter = 0.1. PPO Epoch = 10.
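For reference, the same settings collected as a plain Python dictionary (the key names are ours and do not correspond to any particular code base):

ppo_config = dict(
    frame_skip=8,
    discount_factor=0.99,
    value_loss_coef=2.0,
    image_stack=4,
    buffer_size=5000,
    adam_learning_rate=1e-3,
    batch_size=128,
    entropy_coef=1e-4,
    clip_param=0.1,
    ppo_epochs=10,
)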
State Representation Learning
Variational Autoencoder.
The VAE is trained using
the same encoder as the PPO agent described above.
We train for 2000 epochs using a batch size of 64, a
KL-divergence weight (beta) of 1.0 that is annealed
over the first 10 epochs, and the Adam optimizer
(Kingma and Ba, 2014) with a learning rate of 0.0003.
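The corresponding training objective can be sketched as follows; the linear warm-up schedule for beta (from 0 to 1 over the first 10 epochs) and the mean-squared-error reconstruction term are our assumptions, since the text does not fix them.

import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, logvar, epoch, beta_max=1.0, anneal_epochs=10):
    # KL weight annealed (here: linearly ramped) over the first `anneal_epochs` epochs
    beta = beta_max * min(1.0, epoch / anneal_epochs)
    recon_loss = F.mse_loss(recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon_loss + beta * kl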
Contrastive Learning.
For contrastive learning, we
also use the same encoder as the PPO agent described
above and the Adam optimizer with a learning rate of
0.001. We train the model for 5000 epochs with a batch
size of 1024, which provides a large number of negative
samples through batch-wise permutations; large batches
have been shown to be beneficial in previous work (Oord
et al., 2018; He et al., 2019).
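A minimal sketch of such an InfoNCE-style objective with in-batch negatives, in the spirit of Oord et al. (2018), is shown below; the temperature value and the use of a cosine-similarity score are assumptions, as the text does not specify the exact scoring function.

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    # anchors, positives: (B, D) embeddings of paired views of the same states;
    # every other element of the batch serves as a negative sample.
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)   # diagonal = positive pairs
    return F.cross_entropy(logits, labels)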