The results are analyzed with respect to (a) the structure of dialogue that facilitated
versus interfered with effective coordination, and (b) the content and form of utterances
that present challenges to successful comprehension in a remote communicative situ-
ation. We conclude with a brief discussion of the implications of our results for NLP
architectures for embodied situated agents.
2 Background
Clark (1996) views human language use as a joint project consisting of four hierarchical
levels of speaker-addressee coordinated actions, which he refers to as an “action lad-
der”. Consider the case of a speaker asking an addressee, “What is the current time?”.
At the first level, the speaker executes a communicative behavior, which consists of
producing the sounds of the utterance. The addressee, in turn, attends to the behav-
ior (speech). At the second level, the speaker presents words and phrases, which are
identified as such by the addressee. At the third level, the speaker signals an intended
meaning (a request for the current time), and the addressee understands the meaning. At
the fourth level, the speaker proposes a joint project, namely that the addressee inform
him of the current time, and the addressee considers accepting the proposal. There are
two essential properties of this hierarchy of actions. The first is upward causality: The
actions at a lower level cause the actions at the next level up. The second property is
downward evidence: Evidence of successful completion of the actions at a higher level
constitutes evidence of successful completion of the actions at all levels below it.
As Clark (1996, p. 222) states, “A fundamental principle of any intentional action
is that people look for evidence that they have done what they intended to do.” Fur-
thermore, people strive to provide evidence that is sufficient for current purposes, in a
timely manner, and with the least effort. In the example above, valid, timely, and suf-
ficient evidence comes from the addressee responding with the current time soon after
the end of the speaker’s utterance. In doing so, the addressee provides positive evidence
of her acceptance of the speaker’s proposed joint project at level 4, as well as posi-
tive evidence of her understanding the meaning of the speaker’s utterance (level 3), her
identification of the speaker’s words (level 2), and her attending to the speaker’s speech
(level 1). In other words, the evidence allows both the speaker and addressee to reach
the mutual belief of success at all four levels well enough for current purposes, which
is the process of grounding.
Often, a joint project may be extended across a sequence of utterances, as in the case
of telling a story, or providing a complex response to a question, or giving a complex
direction, where complexity refers to the number of propositions or informational units. In
these situations, each utterance is an iteration through the first 3 levels, and positive
evidence at level 3 (understanding) is provided by addressees in the form of acknowl-
edgments, which may be verbal (e.g., yes, uh huh, mkay, okay) or nonverbal (head
nods). Acknowledgments may occur on a separate turn or they may overlap with the
speaker’s utterance (i.e., Yngve’s (1970) backchannels).
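The downward-evidence property of the action ladder, and the difference between a full response and an acknowledgment, can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the names Level and grounded_levels are hypothetical, not part of Clark's account or of any system described here, and the sketch treats evidence as all-or-nothing, ignoring the "sufficient for current purposes" criterion.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of the action ladder, numbered from the bottom.

    The member names are hypothetical labels chosen for this sketch.
    """
    ATTEND = 1      # speaker executes a behavior; addressee attends to it
    IDENTIFY = 2    # speaker presents words; addressee identifies them
    UNDERSTAND = 3  # speaker signals a meaning; addressee understands it
    CONSIDER = 4    # speaker proposes a joint project; addressee considers it


def grounded_levels(highest_evidenced: Level) -> set[Level]:
    """Downward evidence: positive evidence at one level counts as
    evidence for that level and for every level below it."""
    return {level for level in Level if level <= highest_evidenced}


# Answering "What is the current time?" with the time is level-4 evidence,
# so all four levels are covered.
answer = grounded_levels(Level.CONSIDER)

# An acknowledgment such as "uh huh" is evidence only up to level 3.
acknowledgment = grounded_levels(Level.UNDERSTAND)
```

Here answer contains all four levels, whereas acknowledgment contains levels 1 through 3 and leaves uptake of the joint project (level 4) unevidenced.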
Most psycholinguistic research as well as research in developing artificial natural
language processing systems has focused on the processes involved at the first 3 lev-
els of action (i.e., producing and perceiving a speech signal, identifying words and the