in advance.
During recognition, the stationary robot observes
a number of individuals interacting with one another
and with stationary objects. It tracks those individu-
als using the visual capabilities described above, and
takes the perspective of the agents it is observing.
Based on its perspective-taking and its prior under-
standing of the activities it has been trained to under-
stand, the robot infers the intention of each agent in
the scene. It does this using maximum likelihood esti-
mation, calculating the most probable intention given
the observation sequence that it has recorded up to the
current time for each pair of interacting agents.
5.2 Adding Context
Our system uses contextual information to infer inten-
tions. This information is linguistic in nature, and in
section 3 we show how lexical information represent-
ing objects and affordances can be learned and stored
automatically. In this subsection, we outline how that
lexical information can be converted to probabilities
for use in intent recognition.
Context and Intentions. In general, the context for
an activity may be any piece of information. For our
work, we focused on two kinds of information: the
location of the event being observed, and the identi-
ties of any objects being interacted with by an agent.
Context of the first kind was useful for basic experiments comparing the performance of our system against a system that uses no contextual information, but it did not involve lexical digraphs at all; contexts and intentions were defined entirely by hand. Our other source of
context, object identities, relied entirely on lexical di-
graphs. In experiments using this source of informa-
tion, objects become the context and their affordances
– represented by verbs in the digraph – become the
intentions. As explained below, if s is an intention
and c is a piece of contextual information, our sys-
tem requires the probability p(s | c), or in other words
the probability of an affordance given an object iden-
tity. This is exactly what is provided by our digraphs.
If “water” appears as a direct object of “drink” four
times in the robot’s linguistic experience, then we can
obtain a proper probability of “drink” given “water”
by dividing four by the sum of the weights of all edges
which have “water” as the direct object of some word.
In general, we may use this process to obtain a table of
probabilities of affordances or intentions for every ob-
ject in which our system might be interested, as long
as the relevant words appear in the corpus. Note that
this may be done without human intervention.
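The normalization step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the edge-list representation, and the toy weights are all hypothetical, but the computation matches the paper's description (divide an edge weight by the total weight of all verb edges taking that noun as direct object).

```python
from collections import defaultdict

def affordance_probabilities(edges):
    """Convert weighted digraph edges (verb, direct_object, weight)
    into conditional probabilities p(verb | object) by normalizing
    each weight by the total weight of all edges sharing the object."""
    totals = defaultdict(float)
    for verb, obj, weight in edges:
        totals[obj] += weight
    return {(verb, obj): weight / totals[obj] for verb, obj, weight in edges}

# Toy digraph mirroring the paper's example: "water" occurs as the
# direct object of "drink" four times, and (hypothetically) of "boil" once.
edges = [("drink", "water", 4.0), ("boil", "water", 1.0)]
probs = affordance_probabilities(edges)
# p("drink" | "water") = 4 / (4 + 1) = 0.8
```

Because the counts come straight from the parsed corpus, the whole table can be built without human intervention, as noted above.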
Inference Algorithm. Suppose that we have an ac-
tivity model (i.e. an HMM) denoted by w. Let s
denote an intention, let c denote a context, and let v
denote a sequence of visible states from the activity
model w. If we are given a context and a sequence of observations, we would like to find the intention that is maximally likely. Mathematically, we would like to find
\[
\operatorname{argmax}_s \; p(s \mid v, c),
\]
where the probability structure is determined by the
activity model w.
To find the correct s, we start by observing that by
Bayes’ rule we have
\[
\max_s \, p(s \mid v, c) = \max_s \frac{p(v \mid s, c)\, p(s \mid c)}{p(v \mid c)}. \tag{1}
\]
We can further simplify matters by noting that the denominator is independent of our choice of s. Moreover, we make the simplifying assumption that the observable symbols are independent of the current context given the intention, so that p(v | s, c) = p(v | s). Based on these observations, we can write
\[
\max_s \, p(s \mid v, c) \approx \max_s \, p(v \mid s)\, p(s \mid c). \tag{2}
\]
This approximation suggests an algorithm for deter-
mining the most likely intention given a series of ob-
servations and a context: for each possible intention
s for which p(s | c) > 0, we compute the probabil-
ity p(v | s)p(s | c) and choose as our intention that s
whose probability is greatest. The probability p(s | c)
is available, either by assumption or from our linguis-
tic model, and if the HMM w represents the activ-
ity model associated with intention s, then we assume
that p(v | s) = p(v | w). This assumption may be made
in the case of location-based context for simplicity,
or in the case of object affordances because we focus
on simple activities such as reaching, where the same
HMM w is used for multiple intentions s. Of course, a perfectly general system would have to choose an appropriate HMM dynamically given the context; we leave the design of such a system as future work, and focus instead on dynamically deciding which context to use, based on the digraph information.
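The inference algorithm described above reduces to a small loop: evaluate p(v | s) p(s | c) for each intention with nonzero contextual prior and take the argmax. The sketch below assumes `likelihood(v, s)` stands in for the HMM forward probability p(v | w) of the observation sequence; the function name, data layout, and all numbers are hypothetical.

```python
def most_likely_intention(likelihood, prior, v, c):
    """Return argmax_s p(v | s) * p(s | c), restricted to intentions s
    with p(s | c) > 0 under the contextual prior.

    likelihood(v, s) -- p(v | s), e.g. the forward probability of the
                        observation sequence v under the HMM for s
    prior            -- dict mapping each context c to {intention: p(s | c)}
    """
    candidates = {s: p for s, p in prior[c].items() if p > 0}
    return max(candidates, key=lambda s: likelihood(v, s) * candidates[s])

# Toy example: both intentions explain the observations equally well,
# so the context-derived prior (e.g. from the digraph table) breaks the tie.
lik = lambda v, s: {"drink": 0.3, "pour": 0.3}[s]
prior = {"water": {"drink": 0.8, "pour": 0.2}}
best = most_likely_intention(lik, prior, ["reach", "grasp"], "water")
# -> "drink", since 0.3 * 0.8 > 0.3 * 0.2
```

Restricting the search to intentions with p(s | c) > 0 is what lets the context prune implausible hypotheses before any HMM scoring is done.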
5.3 Intention-based Control
In robotics applications, simply determining an ob-
served agent’s intentions may not be enough. Once a
robot knows what another’s intentions are, the robot
should be able to act on its knowledge to achieve
a goal. With this in mind, we developed a simple
method to allow a robot to dispatch a behavior based
on its intent recognition capabilities. The robot first
ICINCO 2010 - 7th International Conference on Informatics in Control, Automation and Robotics