intercept the learning agent before it makes its action
selection.
Prior work on computer teachers limits their inter-
actions with the students in a number of ways. First,
there is an overall budget that restricts how much ad-
vice the teacher can provide the student throughout
training (Torrey and Taylor, 2013). Second, some
form of teacher-student interaction occurs only in
“important states,” where the definition of an “im-
portant” state is chosen by the system designers, and
could be modified if other definitions of importance
appear. Prior work primarily differs in how it provides
the action advice in these “important states.” For ex-
ample, (Torrey and Taylor, 2013) proposed both Ad-
vise Important, where the computer teacher provides
an action whenever the student’s state is “important”,
and Mistake Correcting, which operates like Advise
Important but has the additional constraint that the ad-
vice is provided only if the student would have cho-
sen the wrong action. (Amir et al., 2016) expanded
on Torrey and Taylor’s work by creating a teaching
framework titled apprenticeship learning. In appren-
ticeship learning, the student queries the teacher for
advice when it believes it is in an important state.
The teacher then checks whether the state is important
according to its own model and provides an action
only if it is. This
method performed similarly to the Torrey and Tay-
lor strategies; however, it required significantly less
attention from the teacher because, as the student im-
proved at its task, it relied less on the teacher. Ideally,
once the student learned the proper action from the
teacher in the important states, it no longer needed
to query the teacher.
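To make these advising strategies concrete, the sketch below shows one possible implementation of a budgeted teacher. The class layout, the dictionary of teacher Q-values, and the max-minus-min importance measure are our illustrative assumptions, not code from the cited papers.

import numpy as np

class BudgetedTeacher:
    """Minimal sketch of a budgeted advising teacher (illustrative only)."""

    def __init__(self, teacher_q, budget, importance_threshold):
        self.teacher_q = teacher_q   # dict: state -> np.ndarray of action values
        self.budget = budget         # total pieces of advice the teacher may give
        self.threshold = importance_threshold

    def importance(self, state):
        # Importance of a state: gap between its best and worst action values.
        q = self.teacher_q[state]
        return float(np.max(q) - np.min(q))

    def _best_action(self, state):
        return int(np.argmax(self.teacher_q[state]))

    def advise_important(self, state):
        # Advise Important: advise in every important state while budget remains.
        if self.budget > 0 and self.importance(state) > self.threshold:
            self.budget -= 1
            return self._best_action(state)
        return None

    def mistake_correcting(self, state, intended_action):
        # Mistake Correcting: additionally require that the student's intended
        # action differs from the teacher's best action before spending budget.
        if (self.budget > 0 and self.importance(state) > self.threshold
                and intended_action != self._best_action(state)):
            self.budget -= 1
            return self._best_action(state)
        return None

    def answer_query(self, state):
        # Apprenticeship-style interaction: the student asks first, and the
        # teacher answers only if the state is important under its own model.
        if self.budget > 0 and self.importance(state) > self.threshold:
            self.budget -= 1
            return self._best_action(state)
        return None

In a training loop, the student would execute the returned action when it is not None and otherwise fall back to its own choice; answer_query corresponds to the student-initiated setting of (Amir et al., 2016).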
There has not been a great deal of prior work
in which human teachers offer action advice, per-
haps due to the difficulty of supporting a human
teacher in providing timely action advice. One exam-
ple we found is the work of (Maclin and Shavlik,
1996), who circumvented the issue by asking the hu-
man teachers to predict critical states and their corre-
sponding correct actions ahead of time. In particular,
they used a customized scripting language to provide
the advice. Prior to training, the human would define,
in the scripting language, the states that they believed
were important and then provide the correct action.
This system worked well, but it is difficult to scale to
larger domains or to extend to humans with no prior
programming or RL experience.
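As a rough illustration of the idea only (the rule format, state keys, and action names below are hypothetical, and Maclin and Shavlik's actual scripting language was considerably richer), such pre-specified advice can be thought of as a list of condition-action rules consulted before the agent acts:

# Pre-specified advice rules defined before training (hypothetical example).
advice_rules = [
    # (condition over the observed state, advised action)
    (lambda s: s["opponent_distance"] < 2.0, "retreat"),
    (lambda s: s["goal_visible"] and s["has_ball"], "shoot"),
]

def pre_specified_advice(state):
    """Return the first advised action whose condition matches, else None."""
    for condition, action in advice_rules:
        if condition(state):
            return action
    return None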
3 TIME WARP HUMAN GUIDANCE INTERFACE
While prior work showed that pre-trained computer
agents can successfully assist in training new agents,
there are valid reasons to prefer human teachers. First,
human teachers have a broader view of the problem
than computer teachers and can naturally switch be-
tween seeing the “big picture” and focusing on the
specific details. Second, there are situations in which
a computer teacher is simply unavailable because no
solution from a standard RL approach exists.¹
3.1 Problem Formulation
We argue that human teachers may be best suited to
offering action advice instead of providing a rein-
forcement signal or presenting scenarios. While re-
ward feedback may be appropriate for tasks in which
the environment does not naturally provide rewards
on its own (Knox and Stone, 2012), as Thomaz found,
humans naturally want to provide guidance rather
than a reward signal, resulting in an improperly shaped
signal. While not yet thoroughly explored, we suspect
that in order to craft highly useful scenarios, the hu-
man teacher must be an expert in both the domain and
RL methods.
In this paper, we propose a framework that facil-
itates assisted RL with action advice from a human
teacher. This framework serves as a testbed that al-
lows us to address two questions:
1. Can an RL agent assisted by action advice from
human teachers produce a better performing pol-
icy than an RL agent with no assistance? Can our
student agents converge on a policy in less time?
2. How do human teachers compare to computer
teachers when both are giving action advice? For
example, would the human teacher need to teach
as often or as long as the computer teachers?
Inspired by the work of (Torrey and Taylor, 2013)
and (Amir et al., 2016), our framework imposes con-
straints on our human teacher similar to those placed
on their computer teachers. We record two main statistics from our
learning agents. First, we record the total reward re-
ceived by an agent in an exploitation run to determine
the average episode reward for the agents at various
points in training. We also record a statistic used by
Amir et al. (2016), cumulative attention, which measures
the amount of advice the teacher provided in a given
period of time.
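As a sketch of how these two statistics could be computed (the function names and the per-episode bookkeeping are our assumptions rather than a specification of our implementation):

import numpy as np

def average_episode_reward(exploitation_returns):
    """Mean total reward over a batch of exploitation (greedy) episodes
    run at a fixed point in training; the batching is our assumption."""
    return float(np.mean(exploitation_returns))

def cumulative_attention(advice_counts_per_episode):
    """Running total of advice given by the teacher, episode by episode,
    in the spirit of the cumulative attention statistic of Amir et al. (2016)."""
    return np.cumsum(advice_counts_per_episode)

# Example: advice counts of [3, 5, 0, 1] yield cumulative attention [3, 8, 8, 9].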
¹ Since a policy must have already been developed to create the computer teacher, creating a new student agent from scratch is arguably unnecessary.