Online Learning of non-Markovian Reward Models
Gavin Rens, Jean-François Raskin, Raphaël Reynouard, Giuseppe Marra
2021
Abstract
There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks, that is, rewards are non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine. In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves and a Mealy machine synchronized with this MDP to formalize the non-Markovian reward function. While the MDP is known by the agent, the reward function is unknown to the agent and must be learned. Our approach to overcome this challenge is to use Angluin’s L∗ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine (MRM). Formal methods are used to determine the optimal strategy for answering so-called membership queries posed by L∗. Moreover, we prove that the expected reward achieved will eventually be at least as much as a given, reasonable value provided by a domain expert. We evaluate our framework on two problems. The results show that using L∗ to learn an MRM in a non-Markovian reward decision process is effective.
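To make the idea of a history-dependent reward concrete, the following is a minimal sketch of a Mealy machine used as a reward function. It is an illustration only, not the authors' implementation: the class name `MealyRewardMachine`, the labels `key`, `door`, and `empty`, and the reward values are all hypothetical, and the sketch assumes that each MDP step emits a single label that the machine reads.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, List, Tuple

State = Hashable
Label = str


@dataclass
class MealyRewardMachine:
    """Hypothetical Mealy machine whose outputs are rewards."""

    initial: State
    # Transition table: (state, label) -> (next state, reward emitted).
    delta: Dict[Tuple[State, Label], Tuple[State, float]]
    state: State = field(init=False)

    def __post_init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        """Return to the initial state, e.g. at the start of an episode."""
        self.state = self.initial

    def step(self, label: Label) -> float:
        """Read one label, move to the next state, and return the reward.

        Assumption for this sketch: undefined (state, label) pairs leave
        the state unchanged and give reward 0.
        """
        next_state, reward = self.delta.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward

    def rewards(self, trace: Iterable[Label]) -> List[float]:
        """Reward sequence produced by a whole trace, from the initial state."""
        self.reset()
        return [self.step(label) for label in trace]


# Toy task: reaching the door pays off only if the key was collected first.
machine = MealyRewardMachine(
    initial="start",
    delta={
        ("start", "key"): ("has_key", 0.0),
        ("has_key", "door"): ("start", 1.0),
    },
)

print(machine.rewards(["door", "key", "empty", "door"]))  # [0.0, 0.0, 0.0, 1.0]
```

The same label `door` yields reward 0 on the first step and reward 1 on the last, which is exactly what a reward depending only on the current MDP state could not express. In a Mealy-machine learning setting, a membership query asks for the output word produced by a given input word; `rewards` plays that role here for a machine that is already known, whereas in the paper's setting the agent has to obtain those answers by acting in the environment.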
Paper Citation
in Harvard Style
Rens G., Raskin J., Reynouard R. and Marra G. (2021). Online Learning of non-Markovian Reward Models. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-484-8, pages 74-86. DOI: 10.5220/0010212000740086
in Bibtex Style
@conference{icaart21,
author={Gavin Rens and Jean-François Raskin and Raphaël Reynouard and Giuseppe Marra},
title={Online Learning of non-Markovian Reward Models},
booktitle={Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2021},
pages={74-86},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010212000740086},
isbn={978-989-758-484-8},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Online Learning of non-Markovian Reward Models
SN - 978-989-758-484-8
AU - Rens G.
AU - Raskin J.
AU - Reynouard R.
AU - Marra G.
PY - 2021
SP - 74
EP - 86
DO - 10.5220/0010212000740086