lem of understanding causality just as well as they can learn to decode the raw vision data fed to them in Atari games, as in Mnih et al., and turn it into winning strategies. Pearl (2000) argues that certain problems cannot be solved using correlation-based statistics alone; to progress to the second rung of his causation ladder, interventions need to be made. RL is a learning framework which naturally performs interventions on the environment it seeks to learn about, yet its tools do not automatically account for the mathematical implications of making interventions.
The dog barometer problem that I present here is deliberately simple, and the state space given to the learner is not ideal for a learning algorithm. RL methods do exist for settings with hidden variables, but I have not used them here. Expanding the state space to include previous state values and actions may solve the problem, as sketched below. However, I do think that the problems in RL raised by the dog barometer are not simply the result of a straw-man argument. The presence of hidden variables in real-life learning applications is almost certain, as is the existence of non-trivial causal structures whose effects may linger over arbitrarily long timescales, thereby negating the efficacy of adding more history. Every model of a real problem will be misspecified to some extent; it is important to understand when and why this matters. The fact that this cognitive error is sufficiently common in social science to be named Goodhart's Law is a good indicator that it is a policy failure case which is likely to appear again and again in real-life applications of RL. In defence of RL, the performance of the A2C algorithm, even in the face of such misspecification, is very promising and warrants further investigation to see whether this robustness is consistent or an artefact of the environment.
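To make the history-augmentation suggestion concrete, the following is a minimal sketch of how the state space could be expanded, written against the classic OpenAI Gym interface (reset() returning an observation, step() returning (obs, reward, done, info)). The class name, the history length k, and the assumption of a flat Box observation space with a Discrete action space are illustrative choices for this sketch, not part of the original experiments.

from collections import deque

import numpy as np
import gym


class HistoryWrapper(gym.Wrapper):
    """Concatenate the last k (observation, action) pairs onto the current
    observation, so that the policy can condition on recent history."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.obs_dim = env.observation_space.shape[0]
        self.n_actions = env.action_space.n
        slot = self.obs_dim + self.n_actions  # size of one history entry
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.obs_dim + k * slot,), dtype=np.float32)
        self.buffer = deque(maxlen=k)
        self.last_obs = None

    def _augment(self, obs):
        history = [np.concatenate([o, a]) for o, a in self.buffer]
        # Pad with zeros until k entries exist (e.g. just after a reset).
        while len(history) < self.k:
            history.append(np.zeros(self.obs_dim + self.n_actions))
        return np.concatenate([np.asarray(obs)] + history).astype(np.float32)

    def reset(self, **kwargs):
        self.buffer.clear()
        self.last_obs = self.env.reset(**kwargs)
        return self._augment(self.last_obs)

    def step(self, action):
        one_hot = np.zeros(self.n_actions, dtype=np.float32)
        one_hot[action] = 1.0
        self.buffer.append(
            (np.asarray(self.last_obs, dtype=np.float32), one_hot))
        obs, reward, done, info = self.env.step(action)
        self.last_obs = obs
        return self._augment(obs), reward, done, info

As noted above, such history stacking can only help when the effects of the hidden variables do not linger beyond the chosen window.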
Finally, I would like to continue to build an open library of causal problems against which new RL algorithms can be benchmarked. Such an approach using the OpenAI Gym interface has already benefited RL, and I think such a library will help widen the audience of Causal RL to general RL researchers. In parallel, it would be useful to begin building a taxonomy of the cognitive errors that AI suffers from, starting by investigating whether others, similar to Campbell-Goodhart, can be recreated with RL.
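As an indication of what an entry in such a library might look like, the sketch below follows the standard Gym environment interface; the environment name, dynamics and confounding structure are placeholders chosen for illustration and are not the dog barometer environment used in this paper.

import numpy as np
import gym


class HiddenConfounderEnv(gym.Env):
    """Toy causal benchmark: a hidden cause drives both an observable signal
    and the reward, so intervening on the signal looks attractive to a purely
    correlational learner but does not change the reward."""

    def __init__(self):
        super().__init__()
        # The agent only ever sees the signal; the cause stays hidden.
        self.observation_space = gym.spaces.Box(
            low=-2.0, high=2.0, shape=(1,), dtype=np.float32)
        # Action 0: do nothing.  Action 1: intervene on the signal.
        self.action_space = gym.spaces.Discrete(2)
        self.rng = np.random.default_rng()
        self.hidden = 0.0
        self.signal = 0.0

    def reset(self):
        self.hidden = self.rng.uniform(-1.0, 1.0)
        self.signal = self.hidden + 0.1 * self.rng.normal()
        return np.array([self.signal], dtype=np.float32)

    def step(self, action):
        # The hidden cause evolves regardless of the agent's behaviour.
        self.hidden = float(
            np.clip(self.hidden + 0.1 * self.rng.normal(), -1.0, 1.0))
        if action == 1:
            # Intervening pins the signal without touching its hidden cause,
            # breaking the observational correlation with the reward.
            self.signal = 1.0
        else:
            self.signal = self.hidden + 0.1 * self.rng.normal()
        reward = float(self.hidden > 0.0)  # reward depends only on the cause
        done = False
        return np.array([self.signal], dtype=np.float32), reward, done, {}

Packaging tasks of this kind behind the common Gym interface would let any RL algorithm that already targets Gym environments be evaluated on its causal behaviour without modification.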
ACKNOWLEDGEMENTS
This work is supported by an EPSRC PhD stu-
dentship.
REFERENCES
Bareinboim, E. (2020). Towards Causal Reinforcement Learning (CRL). In Thirty-seventh International Conference on Machine Learning (ICML 2020).
Bareinboim, E. and Pearl, J. (2016). Causal inference and
the data-fusion problem. Proceedings of the National
Academy of Sciences of the United States of America,
113(27):7345–7352.
Buesing, L., Weber, T., Zwols, Y., Racaniere, S., Guez, A.,
Lespiau, J.-B., and Heess, N. (2019). Woulda, Coulda,
Shoulda: Counterfactually-guided policy search. In
ICLR.
Campbell, D. T. (1979). Assessing the impact of planned
social change. Evaluation and Program Planning,
2(1):67–90.
Campolo, A. and Crawford, K. (2020). Enchanted Deter-
minism: Power without Responsibility in Artificial In-
telligence. Engaging Science, Technology, and Soci-
ety, 6:1.
Gershman, S. J. (2015). Reinforcement learning and causal
models. Oxford Handbook of Causal Reasoning,
pages 1–32.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Goodhart, C. A. E. (1984). Problems of Monetary Man-
agement: The UK Experience. Monetary Theory and
Practice, pages 91–121.
Guo, R., Cheng, L., Li, J., Hahn, P. R., and Liu, H. (2018).
A Survey of Learning Causality with Data: Prob-
lems and Methods. ACM Computing Surveys (CSUR),
53(4):1–37.
Ha, D. and Schmidhuber, J. (2018). World Models.
Lehman, J., Clune, J., and Misevic, D. (2020). The sur-
prising creativity of digital evolution: A collection
of anecdotes from the evolutionary computation and
artificial life research communities. Artificial Life,
26(2):274–306.
Manheim, D. and Garrabrant, S. (2018). Categorizing Vari-
ants of Goodhart’s Law.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. 33rd International Conference on Machine Learning, ICML 2016, 4:2850–2869.
Mnih, V., Kavukcuoglu, K., Silver, D., ..., and Hassabis, D.
(2015). Human-level control through deep reinforce-
ment learning. Nature, 518(7540):529–533.
Pearl, J. (2000). Causality: Models, reasoning and infer-
ence. Cambridge University Press.
Pearl, J. and Mackenzie, D. (2018). The Book of Why: The new science of cause and effect. Basic Books.
Rezende, D. J., Danihelka, I., Papamakarios, G., ..., and
Buesing, L. (2020). Causally Correct Partial Models
for Reinforcement Learning.
Rodamar, J. (2018). There ought to be a law! Campbell
versus Goodhart. Significance, 15(6):9.
Silver, D., Schrittwieser, J., Simonyan, K., ..., and Hassabis,
D. (2017). Mastering the game of Go without human
knowledge. Nature, 550(7676):354–359.