definition could be possible, where MQs can be tagged as 'impossible' if they are not answered within a reasonable time (Gaon and Brafman, 2020; Xu et al., 2021). Traditionally, actions in MDPs are stochastic, so a feasible MQ may be tagged as negative. However, since a large number of MQs are answered, the probability of including false negatives is low.
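A minimal sketch of this tagging scheme, for illustration only: the helper try_to_realise is a hypothetical routine (not from the cited works) that runs one episode and returns the observed reward sequence if the episode produced exactly the queried label trace, and None otherwise.

def answer_membership_query(trace, try_to_realise, max_attempts=1000):
    """Answer a membership query (MQ) for a label trace.

    Returns the reward sequence observed for `trace`, or None if the trace
    could not be realised within `max_attempts` episodes, in which case the
    MQ is tagged as 'impossible'. Because actions are stochastic, a feasible
    trace may occasionally be missed, yielding a false negative.
    """
    for _ in range(max_attempts):
        rewards = try_to_realise(trace)
        if rewards is not None:
            return rewards  # MQ answered positively
    return None  # MQ tagged 'impossible' (possibly a false negative)

Raising max_attempts lowers the probability of a false negative at the cost of more interaction with the environment.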
Since the labeling function in this work is always deterministic, we could transform POMDPs into (fully observable) MDPs, as sketched below. If observations were stochastic (i.e., noisy), this would not be possible. Unfortunately, we are dealing with non-stationarity, which makes traditional techniques for solving POMDPs inapplicable. Literature on non-stationary POMDPs with stochastic observation functions does exist (Peshkin et al., 1999; Shani et al., 2005; Jaulmes et al., 2005; Chatzis and Kosmopoulos, 2014). One could also consider the literature on partial observability, which proposes the use of stochastic action plans (Meuleau et al., 1999) and stochastic policies (Sutton and Barto, 2018).
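For illustration only, the transformation alluded to above can be made precise with a product construction in the style of reward machines (cf. Toro Icarte et al., 2019); the notation below is ours and not taken from the present work. Because the labeling function is deterministic, the automaton state is a function of the observable history, so the product of the environment MDP with the reward automaton is itself a fully observable MDP with a Markovian reward:
\[
  M = \langle S, A, T, L \rangle, \qquad
  \mathcal{A} = \langle U, u_0, \delta, \rho \rangle,
\]
\[
  M \otimes \mathcal{A} = \langle S \times U,\; A,\; T^{\otimes},\; R^{\otimes} \rangle,
\]
\[
  T^{\otimes}\big((s,u), a, (s',u')\big) =
  \begin{cases}
    T(s, a, s') & \text{if } u' = \delta\big(u, L(s, a, s')\big),\\
    0 & \text{otherwise,}
  \end{cases}
\]
\[
  R^{\otimes}\big((s,u), a, (s',u')\big) = \rho\big(u, L(s, a, s')\big).
\]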
ACKNOWLEDGEMENTS
The second author was supported by the EOS project
(No. 30992574).
REFERENCES
Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106.
Bacchus, F., Boutilier, C., and Grove, A. (1996). Rewarding
behaviors. In Proceedings of the National Conference
on Artificial Intelligence, pages 1160–1167.
Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684.
Bertsekas, D. (2012). Dynamic Programming and Optimal Control, Volume I. Athena Scientific.
Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In AAAI, volume 94, pages 1023–1028.
Chatzis, S. P. and Kosmopoulos, D. (2014). A non-stationary partially-observable Markov decision process for planning in volatile environments. In OPT-i 2014 - 1st International Conference on Engineering and Applied Sciences Optimization, Proceedings, pages 3020–3025.
Gaon, M. and Brafman, R. (2020). Reinforcement learning with non-Markovian rewards. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3980–3987.
Jaulmes, R., Pineau, J., and Precup, D. (2005). Learning in non-stationary partially observable Markov decision processes. In ECML Workshop on Reinforcement Learning in Non-Stationary Environments, volume 25, pages 26–32.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134.
Kolobov, A. (2012). Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–210.
Meuleau, N., Peshkin, L., Kim, K.-E., and Kaelbling, L. P.
(1999). Learning finite-state controllers for partially
observable environments. In Proceedings of the Fif-
teenth Conference on Uncertainty in Artificial Intelli-
gence (UAI).
Peshkin, L., Meuleau, N., and Kaelbling, L. P. (1999).
Learning policies with external memory. In ICML.
Rens, G., Raskin, J.-F., Reynouard, R., and Marra, G. (2021). Online learning of non-Markovian reward models. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, pages 74–86. INSTICC, SciTePress.
Ross, S. M. (2014). Introduction to Stochastic Dynamic Programming. Academic Press.
Shani, G., Brafman, R., and Shimony, S. (2005). Adaptation for changing stochastic environments through online POMDP policy learning. In Proc. Eur. Conf. on Machine Learning, pages 61–70. Citeseer.
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning Proceedings 1994, pages 284–292. Elsevier.
Sutton, R. and Barto, A. (2018). Reinforcement Learning:
An Introduction. MIT Press, Cambridge and London,
2nd edition.
Thiébaux, S., Gretton, C., Slaney, J., Price, D., and Kabanza, F. (2006). Decision-theoretic planning with non-Markovian rewards. Journal of Artificial Intelligence Research, 25:17–74.
Toro Icarte, R., Waldie, E., Klassen, T., Valenzano, R., Cas-
tro, M., and McIlraith, S. (2019). Learning reward ma-
chines for partially observable reinforcement learning.
Advances in Neural Information Processing Systems,
32:15523–15534.
White III, C. C. and White, D. J. (1989). Markov deci-
sion processes. European Journal of Operational Re-
search, 39(1):1–16.
Xu, Z., Wu, B., Ojha, A., Neider, D., and Topcu, U. (2021).
Active finite reward automaton inference and rein-
forcement learning using queries and counterexam-
ples. In International Cross-Domain Conference for
Machine Learning and Knowledge Extraction, pages
115–135. Springer.