headed in the wrong direction, a teacher tends to step
in and guide them with more explicit and frequent
instruction to help them accomplish the task. The
teacher’s teaching style changes depending on what
is needed from them at the time.
Ideally, when teaching an agent to do a task, the
teacher would have all the necessary and relevant in-
formation needed to complete the task. They would
know what to do with the information at hand to ac-
complish the goal. Domains with this property are
known as fully observable domains. Realistically, however, many domains lack this property. Humans often face situations in which they have only partial information about the environment; such settings are known as partially observable domains. They must still make decisions that satisfy their goals, which they often do by filling in knowledge gaps as they proceed with the task (Klein and O'Brien, 2018).
When teaching agents, we want to mimic real world
situations in which the human teacher has limited
knowledge and must give advice to the agent based
on their current understanding of the environment, as
more and more information about the environment is
revealed. Real-world environments also impose constraints in the form of direct consequences for actions.
Previous work has investigated how different interaction methods affect the human experience of teaching ML agents in a fully observable environment with penalties (Krening and Feigh, 2019). The penalties result from hazards in the environment that simulate real-world constraints. That work found that the teacher's level of frustration, and how intelligent the teacher perceives the agent to be, are strongly affected by the method used to interact with the agent. Specifically, in a fully observable domain with penalties, quick agent response times and adherence to the advice given were highly correlated with the teacher's satisfaction.
This paper expands upon the work of Krening and Feigh (2019) to investigate which interaction features affect user experience when teaching ML agents in partially observable domains with and without penalties. We hypothesize that the findings from the fully observable domain will not hold in a partially observable domain, and that the introduction of penalties will cause the human teacher to be more conservative in their advice.
2 METHOD
In this study, we conducted two repeated-measures, within-subject experiments investigating the effect of four different interaction methods on the participant's experience of teaching the agent. The experiments took place in person and collected data from 24 and 30 participants, respectively. All participants were ML novices, and the ordering of the trials was randomized according to a Latin square design. We made a concerted effort to recruit individuals from the general population.
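For illustration, a Latin square of order four can vary the order in which the four interaction methods are presented across participants, so that each method appears in each position equally often. The sketch below shows one such cyclic construction; the method labels and the mapping of rows to participants are placeholders, not the study's actual assignment.

    # Illustrative sketch of a Latin-square ordering for four interaction
    # methods; labels and row-to-participant mapping are placeholders only.
    METHODS = ["A", "B", "C", "D"]

    def latin_square_orders(conditions):
        """Cyclic Latin square: each condition appears once per row and column."""
        n = len(conditions)
        return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

    for participant, order in enumerate(latin_square_orders(METHODS), start=1):
        print(f"participant {participant} (mod 4): {order}")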
The participants were required to teach each agent to navigate a maze built in the Malmo Minecraft platform. The game contains two players: the agent and a non-playable character. The agent's goal is to navigate through the maze to find and approach the non-playable character, which is situated near the end of the maze. In the first experiment, there were no penalties. In the second, there were penalties associated with a water hazard: if the agent entered the water, it failed the task and was penalized with a negative reward. The participant could see the maze in two ways: an isometric view, in which part of the maze is obscured, and the agent's point of view from within the maze (Figure 1), both of which provide only partial observability.
Figure 1: Partially Observable Maze in Minecraft. Left win-
dow shows isometric view. Right window shows agent’s
POV from within the maze.
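As a concrete, hypothetical picture of the penalty condition, the sketch below shows one plausible reward structure for a maze episode. The reward magnitudes, tile name, and per-step cost are illustrative assumptions, not values taken from the study's Malmo mission configuration.

    # Hypothetical reward function for the maze task; magnitudes and tile names
    # are illustrative assumptions, not the study's actual Malmo settings.
    def step_reward(tile, reached_npc, penalties_enabled):
        if penalties_enabled and tile == "water":
            # Entering the water hazard ends the episode with a negative reward.
            return -100.0, True      # (reward, episode_done)
        if reached_npc:
            # Finding and approaching the non-playable character completes the task.
            return 100.0, True
        return -1.0, False           # assumed small per-step cost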
The teachers provided advice to the agent by pressing arrow keys on the keyboard; for all of the interaction methods, these key presses were sent to an interaction algorithm. The interaction algorithm collaborated with the reinforcement learning agent to select which action to take. All agents shared the same action-selection process (see Section 2.2), and so all were equally capable of completing the task.
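To make this collaboration concrete, the sketch below shows one plausible way keyboard advice could bias an epsilon-greedy action selection. The interaction algorithms actually used are described in Section 2.2; the function, action set, and probabilities here are assumptions for illustration only.

    import random

    # Hypothetical advice-guided action selection; every name and constant
    # below is an illustrative assumption, not the study's implementation.
    ACTIONS = ["move_forward", "turn_left", "turn_right", "move_back"]

    def select_action(q_values, advised_action=None, follow_prob=0.9, epsilon=0.1):
        """Choose an action, optionally biased toward the teacher's advice.

        q_values: dict mapping action name -> estimated value in the current state.
        advised_action: action implied by the latest arrow-key press, or None.
        follow_prob, epsilon: assumed advice-following and exploration rates.
        """
        if advised_action is not None and random.random() < follow_prob:
            return advised_action                 # defer to the teacher's advice
        if random.random() < epsilon:
            return random.choice(ACTIONS)         # explore
        return max(q_values, key=q_values.get)    # exploit learned values

    # Example: the teacher pressed the "up" arrow, mapped to "move_forward".
    q = {a: 0.0 for a in ACTIONS}
    print(select_action(q, advised_action="move_forward"))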
Participants were told to repeat the task for as many training episodes as they felt were necessary to achieve satisfactory performance from the agent.
They were also told that they could stop training if
they were too frustrated to continue, or for any other
reason. As a result, the training time per episode var-
ied for every participant and interaction algorithm.
After each agent was trained, the participant was
asked to complete a questionnaire about their experi-
ence. At the end of the experiment, participants were