ronment and compared it with the existing algorithms
and the worst-case baseline. We have concluded that
the new algorithm is effective at solving the teaching
In future work, it would be interesting to study
such a teacher-learner interaction in more complex
environments. For example, an environment could
have more states and a non-linear reward function
possibly represented as a neural network. Another
open question is whether the proposed algorithms
have convergence guarantees. It would also be
interesting to check whether the MT module of the
algorithm could be improved by accounting for the
uncertainty of the estimated learner policy. Another
possible direction of research is finding more
sophisticated ways of weighting the learner’s older
trajectories. For example, if the environment consists
of several isolated regions and each feature is
confined to a single region, then sending a teaching
demonstration in one region might not change the
learner’s behavior in the others, so the learner’s
previous trajectories from those other regions might
not need to be down-weighted.
This work was partially supported by national funds
through Fundação para a Ciência e a Tecnologia,
under project UIDB/50021/2020 (INESC-ID multi-
annual funding) and the RELEvaNT project, with ref-
erence PTDC/CCI-COM/5060/2021.
Rustam Zayanov would also like to thank Open
Philanthropy for their scholarship, which facilitated
his dedicated involvement in this project.
