Multi-Agent Causal Reinforcement Learning
André Meyer-Vitali
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Saarbrücken, Germany
Keywords: Software Engineering, Artificial Intelligence, Trust, Transparency, Robustness, Causality, Agency.
It has become clear that mere correlations extracted from data through statistical processes are insufficient to
give insight into the causal relationships inherent in them. Causal models support the necessary understanding
of these relationships to make transparent and robust decisions. In a distributed setting, the causal models
that are shared between agents improve their coordination and collaboration. They learn individually and
from each other to optimise a system’s behaviour. We propose a combination of causal models and multi-
agent reinforcement learning to create reliable and trustworthy AI systems. This combination strengthens
the modelling and reasoning of agents that communicate and collaborate using shared causal insights. A
comprehensive method for applying and integrating these aspects is being developed.
The development of reliable and trustworthy AI sys-
tems requires new methods and metrics for their cre-
ation and verification. However, not everything needs
to be developed from scratch, because the experience
of decades of systems and software engineering, as
well as research in the foundations of AI systems, pro-
vides ample opportunities to reuse and adapt existing
methods. We outline some principles of AI Engineer-
ing in section 2. We dive deeper into two of the prin-
ciples, namely agency in section 3 and causality in
section 4. Consequently, in section 5, the combina-
tion of causality and multi-agent reinforcement learn-
ing (MARL) is investigated. A use case concerning
urban mobility and energy consumption is presented
in section 6. Finally, some conclusions and an outlook
are provided in section 7.
The design and development of complex systems has
a long-standing tradition in software architecture and
engineering (Gamma et al., 1994; Booch et al., 2005).
This experience can be used to build better AI sys-
tems, including the current wave of data-driven and
hybrid systems, that are reliable and worthy of trust.
Trust is defined as the willingness of a trustor to
be vulnerable to the actions of an actor (trustee) that
she cannot directly control. It requires the existence
of uncertainty and risks. The level of trust increases
with the trustor’s perceived benevolence, competence
and integrity of the trustee (Lewis and Weigert, 1985;
Rousseau et al., 1998; Mayer et al., 1995; Jacovi
et al., 2021; Lewis and Marsh, 2022; de Brito Duarte
et al., 2023). Trust calibration is the process of ad-
justing and aligning actual and perceived trustwor-
thiness, where actual trustworthiness is the degree to
which a system or actor complies with its required
or promised expectations and performance (Okamura
and Yamada, 2020; de Visser et al., 2020; Visser et al.,
2023; Chi and Malle, 2023).
In order to achieve trustworthiness and trust, an
AI system should include a number of characteristics,
such as those listed by the High-Level Expert Group
(HLEG) of the EU. Their Assessment List for Trustworthy
Artificial Intelligence, ALTAI (Directorate-General
for Communications Networks, Content and
Technology (European Commission), 2020), lists:
1. Human Agency and Oversight;
2. Technical Robustness and Safety;
3. Privacy and Data Governance;
4. Transparency;
5. Diversity, Non-discrimination and Fairness;
6. Societal and Environmental Well-being;
7. Accountability.
An analysis of AI engineering methods for trust
(Meyer-Vitali and Mulder, 2024a; Meyer-Vitali and
Mulder, 2024b) reveals the following four main principles:
Models & Explanations. Reliable predictions and
decisions about system behaviour for insightful
and plausible explanations and simulations with
generalised models from knowledge and training.
Causality & Grounding. Identification and predic-
tions of cause-effect relationships for informed
predictions and actions, as well as anchoring of
meaning in real-world context and phenomena.
Modularity & Compositionality. Design of com-
plex systems broken down into comprehensible
and manageable parts (functions and features), re-
liably composed in system architectures.
Human Agency & Oversight. Overview, final deci-
sion, and human responsibility for the actions of
AI systems, also when delegating tasks to au-
tonomous agents in hybrid collaborative teams.
An important aspect to consider is that we should not
aim to reinvent the wheel, time and time again. In-
stead, we should augment what is already established
knowledge and build on top of existing models and
methods (Schreiber et al., 1999; Tiddi et al., 2023).
Furthermore, it is not sufficient to determine statistical
patterns and correlations in data. What machine
“learning” achieves is just that, which is not learning
in the sense of a desire to understand. Beyond “learning”, we
should aim for understanding concepts and relationships.
AI can become a tool for scientific discovery if
developers and systems understand what they learned
and manage to convert experimental insights into hy-
potheses and theories that serve as stepping stones
for further experiments. This is the scientific method
that was successful for centuries and which should
not be given up for fancy trompe l’oeils (Goldstein
and Goldstein, 1978; Gower, 1996; Nola and Sankey,
2007; Griffin et al., 2024; Jamieson et al., 2024).
Agency is the level of control that an entity has over
itself and its behaviour (van der Vecht et al., 2007).
Agents are “systems that can decide for themselves
what they need to do in order to satisfy their de-
sign objectives” (Wooldridge in (Weiss, 2000)). Thus,
agents reason and take deliberate decisions on their
own behalf and initiative (Wooldridge and Jennings, 1995;
Wooldridge, 2009). Agents are fundamental build-
ing blocks of AI systems (Russell and Norvig, 2020;
OECD, 2022), which were investigated for a long pe-
riod and are currently of major interest, again, for the
future of AI (Larsen et al., 2024).
In technical terms, we are interested in agents as
autonomous and communicative entities (software or
robots) that learn, reason, plan and act to achieve
some goals. Each agent can communicate with its en-
vironment (physical or virtual) and with other agents,
including humans. Due to their autonomy, agents
can process data and knowledge efficiently in a distributed
fashion, which also benefits scalability and sovereignty.
Agents don’t come alone. In multi-agent systems
(MAS) (Weiss, 2000), agents take roles and interact
with each other to achieve their individual or collective
goals. They communicate using communicative
or speech acts (Searle, 1969) according to a variety
of interaction patterns and protocols, such as negotiations,
auctions or queries (Poslad and Charlton,
The benefits of multi-agent systems for building
reliable AI systems include:
- modularisation and compositionality, by defining
  specific roles and protocols for communicating
  among agents;
- separation of concerns, where agents have their
  individual roles, values and duties;
- agreements reached by negotiation, such that
  outcomes can be explained; and
- distribution of logic and processing for efficiency,
  scalability and sovereignty.
Causality refers to our ability to understand and predict
how things are connected through cause and effect:
specifically, what causes what, and why (Pearl
and Mackenzie, 2018). When artificial intelligence
systems can grasp these causal connections, they be-
come capable of making better predictions and solv-
ing complex problems. This represents a crucial shift
in focus: rather than simply identifying correlations
(when things happen together), we need to understand
causation (when one thing directly leads to another).
This transition from correlation-based to
causation-based thinking is becoming increas-
ingly crucial, particularly as we seek to understand
the reasoning behind AI predictions and decisions
(Pearl et al., 2016). Understanding the true causes
behind outcomes, rather than just their associations,
is essential for explaining why AI systems make
specific choices.
Causal inference considers how and when causal
conclusions can be drawn from data. Complex sys-
tems of interacting variables can be described with
a Structural Causal Model (SCM) (Pearl, 2009; Pe-
ters et al., 2017; Nogueira et al., 2022). Such a
model describes the causal mechanisms and assump-
tions present in an arbitrary system. The relationships
between (endogenous) variables can be visualised in
a directed acyclic graph (DAG), wherein nodes repre-
sent variables and directed paths represent causal in-
fluences between variables. These influences do not
need to be deterministic, but can be probabilistic, in-
clude external factors (exogenous variables) and are
valid for linear as well as non-linear functions, dis-
crete as well as continuous variables.
A structural causal model (SCM) is a triple
(Pearl, 2009)
\[ M = \langle U, V, F \rangle, \]
where
- $U$ is a set of background variables (also called
  exogenous) that are determined by factors outside
  the model;
- $V$ is a set $\{V_1, \dots, V_n\}$ of variables, called
  endogenous, that are determined by variables in the
  model;
- $F$ is a set of functions $\{f_1, f_2, \dots, f_n\}$ such that
  each $f_i$ is a mapping from $U_i \cup PA_i$ to $V_i$:
\[ v_i = f_i(pa_i, u_i), \quad i = 1, \dots, n. \]
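The definition above can be made concrete with a small sketch. The three variables, the linear structural functions and their coefficients are hypothetical and chosen only for illustration; the point is that each endogenous variable is computed from its parents and its own exogenous term, and that the parent sets define the graph.

```python
import random

def sample_scm(rng):
    """Draw one sample from a toy SCM M = <U, V, F> (all names are hypothetical)."""
    # U: exogenous background variables, determined outside the model.
    u = {"u_z": rng.gauss(0, 1), "u_x": rng.gauss(0, 1), "u_y": rng.gauss(0, 1)}
    # F: one structural function per endogenous variable, v_i = f_i(pa_i, u_i).
    z = u["u_z"]                      # Z has no endogenous parents
    x = 2.0 * z + u["u_x"]            # PA_X = {Z}
    y = 1.5 * x - 1.0 * z + u["u_y"]  # PA_Y = {X, Z}
    return {"Z": z, "X": x, "Y": y}

# Parent sets read off the structural functions; they define the graph G(M).
PARENTS = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}

def is_acyclic(parents):
    """Return True if the parent relation admits a topological order."""
    remaining = {v: set(ps) for v, ps in parents.items()}
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:
            return False  # every remaining node still has a parent: a cycle
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True

rng = random.Random(0)
sample = sample_scm(rng)
```

Sampling repeatedly from `sample_scm` yields the observational distribution of the model, which is the sense in which an SCM acts as a data generator.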
A causal model $M$ can be visualised as a directed
acyclic graph, $G(M)$, in which each node corresponds
to a variable and the directed edges point from members
of $PA_i$ and $U_i$ toward $V_i$. This graph is the causal
diagram associated with $M$. From a causal diagram,
it is easy to read dependencies between variables by
examining the three different types of connections
(chains, forks, and colliders) using d-separation.
Graphs also help to identify confounders: variables
that influence both the treatment (cause) and the outcome
(effect). Confounders, as common causes, create
a spurious association between treatment and outcome.
The Law of Conditional Independence (d-separation)
\[ (X \ \mathrm{sep}\ Y \mid Z)_{G(M)} \Rightarrow (X \perp\!\!\!\perp Y \mid Z)_{P} \]
describes the consequence of separation in the model
as independence in the distribution.
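A numerical illustration of this law for a fork may help. The sketch below assumes a hypothetical binary confounder Z with two children X and Y and no edge between them; the noise model and sample size are arbitrary choices. X and Y are associated marginally, but the association disappears once Z is held fixed.

```python
import random

def corr(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(1)
rows = []
for _ in range(20000):
    z = rng.choice([0, 1])     # common cause (confounder)
    x = z + rng.gauss(0, 1)    # X <- Z
    y = z + rng.gauss(0, 1)    # Y <- Z, no edge between X and Y
    rows.append((z, x, y))

# Spurious association when Z is ignored.
marginal = corr([r[1] for r in rows], [r[2] for r in rows])

# Conditioning on Z blocks the fork X <- Z -> Y.
within_z = {
    k: corr([r[1] for r in rows if r[0] == k],
            [r[2] for r in rows if r[0] == k])
    for k in (0, 1)
}
```

In this setup the marginal correlation is clearly positive, while the correlations within each stratum of Z stay close to zero, as d-separation predicts.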
(Notes: $PA_i$ denotes the parents of $V_i$; the d in
d-separation stands for directional.)
Interventions (or experiments) correspond to forcing
a variable X to take on value x, thereby removing
dependencies on parent variables to examine changes
in one variable (representing a state, action or event)
and whether they cause changes in another, in order
to distinguish between correlated and causal relationships
in data.
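An intervention $I$ in an SCM $M$ entails replacing
some set of structural assignments in $M$ with a new set
of structural assignments. Assume the replacement is
on $X_k$, given by the assignment
\[ X_k := \tilde{f}(\widetilde{PA}_k, \tilde{U}_k), \]
where $\widetilde{PA}_k$ are the parents in the new graph
and $\tilde{U}_k$ is the associated exogenous variable.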
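The difference between observing and intervening can be sketched in a few lines. The model below is a hypothetical linear SCM with a confounder Z; the coefficients, the seed and the helper names `draw` and `ols_slope` are assumptions made for illustration. Intervening replaces the structural assignment of X by a constant, which removes the edge from Z to X.

```python
import random

def draw(rng, do_x=None):
    """Sample (x, y); if do_x is given, the assignment for X is replaced."""
    z = rng.gauss(0, 1)                       # confounder Z
    if do_x is None:
        x = z + rng.gauss(0, 1)               # observational: X := f_X(Z, U_X)
    else:
        x = do_x                              # intervention: X := x, Z -> X removed
    y = 1.0 * x + 2.0 * z + rng.gauss(0, 1)   # true causal effect of X on Y is 1.0
    return x, y

def ols_slope(pairs):
    """Least-squares slope of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    return cov / var

rng = random.Random(2)
n = 20000
observed = [draw(rng) for _ in range(n)]
assoc_slope = ols_slope(observed)             # confounded association

def mean_y_under_do(x):
    return sum(draw(rng, do_x=x)[1] for _ in range(n)) / n

causal_effect = mean_y_under_do(1.0) - mean_y_under_do(0.0)
```

With these assumed coefficients the observational slope is about 2, whereas the interventional contrast recovers the structural coefficient of about 1.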
Counterfactuals refer to alternative choices that
could have been made in the past and the corresponding
effects that they might have caused. Therefore,
they allow for exploring possibilities to find alternative
outcomes according to a causal model, so that
policies can be changed accordingly in the future.
The Law of Counterfactuals
\[ Y_x(u) = Y_{M_x}(u) \]
states that a model $M$ generates and evaluates all
counterfactuals. $M_x$ is a modified structural submodel
of $M$, where the equation for $X$ is replaced by $X = x$.
This allows for generating answers to an enormous
number of hypothetical questions of the type “What
would Y be had X been x?” (Pearl et al., 2016).
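A counterfactual query of this kind can be evaluated by hand in a simple model. The sketch below assumes a hypothetical linear mechanism for Y and follows the usual three-step recipe of abduction, action and prediction, which is standard in the causal inference literature but is not spelled out in the text above.

```python
def f_y(x, u_y):
    """Hypothetical structural function for Y: Y := 2 * X + U_Y."""
    return 2.0 * x + u_y

def counterfactual_y(x_obs, y_obs, x_new):
    """Evaluate Y_x(u) for the unit u that produced (x_obs, y_obs)."""
    # Abduction: recover the exogenous term from the observed unit.
    u_y = y_obs - 2.0 * x_obs
    # Action: replace the equation for X by X = x_new (submodel M_x).
    # Prediction: evaluate Y in the modified model with the same u.
    return f_y(x_new, u_y)

# Observed: X = 1, Y = 3. What would Y have been had X been 0?
y_cf = counterfactual_y(x_obs=1.0, y_obs=3.0, x_new=0.0)
```

Here the observation implies an exogenous term of 1, so the counterfactual outcome is 1.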
An SCM can be interpreted as a probability distri-
bution P with density p over variables X in the causal
system. According to the Ladder of Causation (Pearl
and Mackenzie, 2018), three classes of reasoning exist:
1. Seeing (associations) encapsulates statistical reasoning.
2. Doing (interventions) contains randomised controlled
   trials (RCTs) and RL methods (cf. section 5):
   $P(y \mid do(x), z)$.
3. Imagining (counterfactuals) allows for reasoning
   about outcomes of alternative choices.
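For orientation, the three rungs are often summarised by one characteristic query each. Only the interventional expression appears in the text; the other two follow the notation commonly used for the Ladder of Causation and are included here as an assumption about the intended formulas.

```latex
\begin{align*}
\text{Seeing (association):}       \quad & P(y \mid x) \\
\text{Doing (intervention):}       \quad & P(y \mid do(x), z) \\
\text{Imagining (counterfactual):} \quad & P(y_x \mid x', y')
\end{align*}
```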
Consequently, this formalism allows for explicit reasoning
about each action an agent takes, can take, and
could have taken. Interestingly, causality provides
means to understand the story behind data, i.e., a
causal model can be seen as a data generator that re-
flects the causal relationships. Therefore, it can ex-
plain those relationships and is seen as a necessary
requirement for explainability (Carloni et al., 2023;
Rawal et al., 2023; Ganguly et al., 2023). An SCM
can be used for planning, such that causes are de-
termined to achieve desired effects (Meyer-Vitali and
Mulder, 2023).
Multi-Agent Causal Reinforcement Learning
Reinforcement Learning (RL) (Sutton and Barto,
2018) is a method for agents to learn how to map sit-
uations to actions from interacting with an environ-
ment. Agents act and receive feedback on the quality
or performance of their actions (rewards), which they
use for updating their strategy or policy.
Typically, the learning problem is formalised as a Markov
Decision Process (MDP). Markov decision processes
include three aspects: sensation, action, and goal.
MDPs consist of a finite set of states $S$, a finite set of
actions $A$, a reward function $R : S \times A \times S \to \mathbb{R}$, a state
transition probability function $T : S \times A \times S \to [0, 1]$
and an initial state distribution.
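As a minimal sketch of these components, the following toy MDP has two states and two actions and is solved by value iteration. The states, actions, probabilities and rewards are hypothetical, and value iteration is used here only as a simple planning stand-in for the learning methods discussed in the text.

```python
S = ["low", "high"]
A = ["wait", "charge"]

# T[(s, a)] maps successor states to probabilities (hypothetical numbers).
T = {
    ("low", "wait"): {"low": 1.0},
    ("low", "charge"): {"high": 0.8, "low": 0.2},
    ("high", "wait"): {"high": 0.9, "low": 0.1},
    ("high", "charge"): {"high": 1.0},
}

def R(s, a, s_next):
    """Reward for reaching 'high', minus a cost for charging."""
    return (1.0 if s_next == "high" else 0.0) - (0.2 if a == "charge" else 0.0)

def value_iteration(gamma=0.9, tol=1e-8):
    """Return the optimal state values and a greedy policy."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {
            (s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
            for s in S for a in A
        }
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
            return V_new, policy
        V = V_new

V, policy = value_iteration()
```

In this example the greedy policy charges in the low state and waits in the high state.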
More realistically, not all relevant information can
be sensed or observed from the environment, and the
sets of states and actions may be infinite. Therefore,
Partially Observable MDPs (POMDPs) are often used
(Kaelbling et al., 1998). In a POMDP, the agent does
not sense the states directly, but receives observations
which depend (in a probabilistic or deterministic way)
on the state of the environment. Agents need to take
into account the history of past observations to infer
the possible current state of the environment.
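One common way to summarise the history of observations is a belief state that is updated with Bayes' rule after every action and observation. The sketch below assumes a hypothetical two-state traffic example; the transition table `T`, the observation table `O` and all numbers are illustrative.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes filter: b'(s') is proportional to O(o | s') * sum_s T(s' | s, a) * b(s)."""
    unnormalised = {}
    for s_next in belief:
        predicted = sum(T[(s, action)].get(s_next, 0.0) * p
                        for s, p in belief.items())
        unnormalised[s_next] = O[s_next].get(observation, 0.0) * predicted
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalised.items()}

# Hypothetical model: the road is either free or jammed.
T = {
    ("free", "drive"): {"free": 0.8, "jam": 0.2},
    ("jam", "drive"): {"jam": 0.8, "free": 0.2},
}
O = {
    "free": {"fast": 0.8, "slow": 0.2},  # P(observation | state)
    "jam": {"fast": 0.1, "slow": 0.9},
}

belief = {"free": 0.5, "jam": 0.5}
posterior = belief_update(belief, "drive", "slow", T, O)
```

After observing slow traffic, most of the probability mass moves to the jammed state, although the state itself is never observed.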
In a distributed setting, Multi-Agent Reinforcement
Learning (MARL) concerns multiple agents
learning concurrently or collaboratively in the same
environment (Buşoniu et al., 2010; Nowé et al., 2012;
Zhang et al., 2021; Gronauer and Diepold, 2022;
Albrecht et al., 2024). For example, Partially Observable
Stochastic Games (POSGs) in which all agents have the
same reward function, joint observations and a joint
action set are also known as “Decentralized
POMDPs” (Dec-POMDPs) and are used in the area of
multi-agent planning (Oliehoek, 2012; Oliehoek and
Amato, 2016). However, in a generalised approach,
agents do not need to obey the same reward function
for distributed learning and control.
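A very small cooperative example is sketched below: two independent Q-learners repeatedly play a single-state game and receive the same reward, as in the shared-reward setting described above. The payoff function, learning rate, exploration rate and seed are hypothetical, and independent Q-learning is only one of many MARL schemes.

```python
import random

ACTIONS = [0, 1]

def shared_reward(a1, a2):
    """Hypothetical cooperative payoff: both agents receive the same reward."""
    return float(a1 + a2)

def train(episodes=5000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # One independent Q-table per agent; the game has a single state.
    q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]
    for _ in range(episodes):
        joint = []
        for qi in q:
            if rng.random() < epsilon:
                joint.append(rng.choice(ACTIONS))       # explore
            else:
                joint.append(max(ACTIONS, key=qi.get))  # exploit
        r = shared_reward(*joint)
        for qi, a in zip(q, joint):
            qi[a] += alpha * (r - qi[a])                # stateless Q-update
    return q

q_tables = train()
greedy = [max(ACTIONS, key=qi.get) for qi in q_tables]
```

Each agent learns only from its own action and the common reward, without observing the other agent's choice. Giving the agents different reward functions would turn this into the more general setting mentioned at the end of the paragraph.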
Finally, agents, causality and reinforcement learning
come together. The similarity of interventions and
counterfactuals in causal models to agents’ actions
and rewards in reinforcement learning provides a
useful synergy. An approach to this combination was
already presented by Maes et al. (2007), Grimbly et al.
(2021) and Jiao et al. (2024), where multi-agent causal
models are introduced. Agents share an environment and
have access to private and public variables of interest.
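A Multi-Agent Causal Model (MACM) consists
of n agents, each of which contains a semi-Markovian
model
\[ M_i = \langle V_{M_i}, G_{M_i}, P(V_{M_i}), K_i \rangle, \quad i \in \{1, \dots, n\}, \]
where
- $V_{M_i}$ is the subset of variables that agent $a_i$ can access;
- $G_{M_i}$ is the causal graph over variables $V_{M_i}$;
- $P(V_{M_i})$ is the joint probability distribution over $V_{M_i}$;
- $K_i$ stores the intersections $V_{M_i, M_j}$ with other
  agents $a_j$, $\{V_{M_i} \cap V_{M_j}\}$, assuming that the agents
  agree on the structure and distribution of their intersections.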
MACMs are useful as shared causal models in teams
of agents. Each agent reasons according to its indi-
vidual POMDP with endogenous and exogenous vari-
ables and corresponding observations of the environ-
mental state. Agents share histories of observations
and rewards (state-action trajectories), but take indi-
vidual coordinated actions based on their individual
rewards. An MACM can be seen as a generalisation
of Dec-POMDPs.
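The bookkeeping behind a MACM can be sketched as a small data structure. The class and function names, the two example agents and their edges are hypothetical; the probability distributions are omitted, so only the variable sets, graphs and intersections are represented.

```python
from dataclasses import dataclass, field

@dataclass
class AgentModel:
    name: str
    variables: set   # variables the agent can access
    edges: set       # causal graph as (cause, effect) pairs
    intersections: dict = field(default_factory=dict)  # shared variables per peer

def link(agents):
    """Record, for every agent, the variables it shares with each other agent."""
    for a in agents:
        for b in agents:
            if a is not b:
                shared = a.variables & b.variables
                if shared:
                    a.intersections[b.name] = shared

def agree_on_structure(a, b):
    """Check that both graphs contain the same edges among shared variables."""
    shared = a.variables & b.variables

    def restrict(model):
        return {(c, e) for c, e in model.edges if c in shared and e in shared}

    return restrict(a) == restrict(b)

mobility = AgentModel(
    "mobility", {"Weather", "Moving", "Energy"},
    {("Weather", "Moving"), ("Moving", "Energy")},
)
building = AgentModel(
    "building", {"Weather", "Temperature", "Energy", "Comfort"},
    {("Weather", "Temperature"), ("Temperature", "Energy"),
     ("Temperature", "Comfort")},
)
link([mobility, building])
```

The agreement check covers only the graph structure on shared variables; a fuller version would also compare the distributions over the intersections.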
Multi-Agent Causal Reinforcement Learning
(MACRL) allows for causal inference in a dynamic
learning environment with a multitude of agents
(Casini and Manzo, 2016; Pina et al., 2023a; Pina
et al., 2023b; Richens and Everitt, 2024). With fur-
ther elaboration of these basic ideas, MACRL could
evolve into several design patterns for neuro-symbolic
AI systems (van Bekkum et al., 2021), where neuro-
causal agents combine explicit prior knowledge in
the form of (hypothetical) causal models with dis-
covery (Pearl, 2019) and learning for continuous self-
Urban life has many peculiar characteristics (WBGU
German Advisory Council on Global Change, 2016;
Angelidou et al., 2022; Oliveira et al., 2020; Popelka
et al., 2023; Petrikovičová et al., 2022; Hashem et al.,
2023). Many different streams of information, activi-
ties and resources are intertwined and have conflicting
requirements and characteristics, such as energy, mo-
bility, food, water, waste, healthcare, commerce, and
many more (Nevejan et al., 2018). They all share the
same limited spatial and temporal constraints. The ur-
ban context requires and involves many interactions at
high speed with lots of people, high density and diver-
sity of the population, land use, displacements (geo-
graphical, social and occupational mobility), as well
as fluent and heterogeneous communities.
The interactions in an urban environment are di-
verse, complex and conflicting. Many interests of hy-
brid actors are related and depend on each other. In
the urban context, an overall goal for sustainable use
of resources could be the reduction of energy consumption.
[Figure 1: A causal model for living and working in the urban context. Node labels recoverable from the figure: Temperature, Comfort, Weather, Car, Energy, Working/Living.]
Some causal relationships in an urban context, fo-
cusing on a combination of energy consumption and
mobility, are shown in figure 1. In the urban causal
model, values, duties and the weather are exogenous
variables. They need to be accepted as they are and
are independent. Goals, working/living and moving
are causes that determine the behaviour of agents.
Energy and comfort are the effects that we want to
achieve and to optimise.
The consumption of various types of energy is af-
fected by the need and desire to move about the city
and to heat buildings at home and at work (and for
leisure, shopping, etc.). Values and duties are the
main sources that drive urban behaviour and exter-
nal factors, such as the weather, influence decision-
making. This causal model explains the relationships
among several important behavioural aspects, but it is
not deterministic. Individual behaviour is influenced
by exogenous variables and cooperative behaviour re-
sults in complex interactions.
Each of the variables is modelled using a quan-
titative metric. The transfer functions are to be de-
termined using causal machine learning, based on the
behaviour of actors.
A shared goal can be seen and modelled as an effect
that is caused by one or more interventions (actions
or events). Consequently, in order to decide and
plan which actions to take, it is necessary to under-
stand which actions or events cause the intended ef-
fects. For example, your goal can be to arrive at a
destination at a given time (work, home, leisure, etc.).
By reasoning back which actions are required to get
you there, piece by piece, a connected causal path can
be constructed to determine the departure time and
modes of transport along the route. Due to shared
intentions and causal models, humans and agents can
mutually trust each other regarding their actions and
After looking at agency, causality and reinforcement
learning separately, the combination in Multi-Agent
Causal Reinforcement Learning (MACRL) provides
a path towards more robust, transparent and explain-
able AI systems, such that they become more trust-
worthy (concerning some of the characteristics of the
ALTAI). Model-based software engineering for such
AI systems promotes reliability and trust, because the
actions and interventions that agents take are clearly
understandable. By investigating various ways of im-
plementing MACRL, a more robust software engi-
neering method may emerge. This paper gives some
insights and directions for such a method; more re-
search is needed and should be applied in real-world
use cases.
