ACKNOWLEDGEMENTS
This work is partially supported by the ARC project Non-Zero Sum Game Graphs: Applications to Reactive Synthesis and Beyond (Fédération Wallonie-Bruxelles), the EOS project (No. 30992574) Verifying Learning Artificial Intelligence Systems (F.R.S.-FNRS & FWO), and the COST Action 16228 GAMENET (European Cooperation in Science and Technology).