Utterance Behavior of Users While Playing Basketball with a Virtual Teammate

Divesh Lala 1,2, Yuanchao Li and Tatsuya Kawahara

Graduate School of Informatics, Kyoto University, Kyoto, Japan

Japan Society for the Promotion of Science, Tokyo, Japan

Keywords:

Human-Agent Interaction, Joint Actions, Virtual Basketball, Wizard-of-Oz, Conversation Analysis.

Abstract:

Research on human-agent interaction has focused mainly on domains which are conversational in nature, but

little work has been done on examining the behavior of interactive agents in domains such as team sports.

This paper analyzes utterance behavior in this domain, speciﬁcally a virtual basketball game with an agent

teammate. The main motivation is to assess the nature of utterances during the course of a game. We use a

Wizard-of-Oz system which allows a hidden operator to appropriately respond to user utterances. Utterances

are analyzed by annotating and categorizing according to Searle’s illocutionary speech acts. We ﬁnd that

there is evidence to support the process of the user beginning with basic utterances needed to play the game,

conﬁrming that the agent can understand them, and then moving to more complex utterances. We also ﬁnd

that non-task utterances are used and their proportion increases as the game progresses.

1 INTRODUCTION

Embodied conversational agents (ECAs) have been a

major focus for interaction research because face-to-

face conversation provides a rich source of phenom-

ena where speech, eye gaze and facial expression can

be measured and analyzed. However, scenarios such

as sports where parties interact by navigating in an

open space and use full body movements to engage

in collaborative actions cannot be handled by ECAs.

Virtual agents which can function in these environ-

ments have been identiﬁed in previous work (Lala

et al., 2014). Aside from sports, other related sce-

narios include a human-agent team assisting victims

across a disaster area or even a human and agent lift-

ing furniture around a house. Autonomous agents

which can function in such virtual environments are

more feasible than real world robots which do not yet

have navigation ability which is on par with humans.

These types of interactions are also of a different

nature to conversation. Unlike one-on-one conversa-

tion, interactions are relatively infrequent, often re-

peated, and are used to achieve a shared goal. An

example of this is basketball, with the interactions be-

ing passing. In terms of utterance and dialog analysis,

such interactions have received relatively little focus.

The lack of research for this type of agent motivates

our work. We wish to create an agent which can rec-

ognize speech from the user and interact with them in

a natural manner. In order to do this, we require not

only a speech corpus but data on the type of utterances

used so we can create a dialog model.

The domain of our study is a virtual basketball

game with an agent acting as a teammate. The user

is able to play the game using only their bodies and

without hand-held devices. Our methodology in this

work is to conduct a Wizard-of-Oz (WOZ) experi-

ment, annotate and categorize all utterances, then dis-

cover patterns during gameplay. Unlike conversa-

tion we are able to analyze temporal patterns because

of the repetitiveness of collaborative actions such as

passing.

Understanding temporal behavior of the human is

important for agents because they may be better able

to infer the human’s internal state. For example, at

the beginning of an interaction a human may be unfamiliar

with how to behave to communicate effectively.

As familiarity with the agent increases, the observ-

able human behavior also changes. Such information

is crucial for any virtual agent system, not only bas-

ketball. Therefore, being able to estimate this sense

of familiarity has several implications for the design

of agents.

We propose that utterance behavior changes over

the course of an interaction. More speciﬁcally, we hy-

pothesize that the changes over time are related to the

Lala D., Li Y. and Kawahara T.

Utterance Behavior of Users While Playing Basketball with a Virtual Teammate.

DOI: 10.5220/0006119400280038

In Proceedings of the 9th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2017), pages 28-38

ISBN: 978-989-758-219-6

types of utterances used. At the beginning of the game

the human is unsure of the capabilities of the basket-

ball agent in terms of what speech it understands, so

will conﬁrm that it can understand commands such

as passing. We consider such utterances which con-

tribute directly towards the achieving of the goal of

basketball to be task utterances. As the interaction

progresses, if the agent can prove that it can effec-

tively understand the human, the human starts to ex-

periment with more complex task utterances.

H1 Over the course of an interaction user utterances

become more complex, from co-ordinating basic

to more complex tasks.

Our second hypothesis concerns utterances which

are not task utterances. These can include praising or

apologizing to a teammate, which provide evidence

that agents are considered as social partners rather

than machines. We propose that the proportion of

task to non-task utterances decreases over time as the

language of users becomes more social towards the

agent.

H2 The ratio of task to non-task utterances decreases

over the course of an interaction.

We also propose that the utterance behavior of the

user has some relationship to the subjective percep-

tion of the agent. This utterance behavior is measured

in terms of the number of task, non-task, and total

utterances from the user, and their perception of the

agent is measured through a standard questionnaire,

with the dependent measures being intelligence, ani-

macy and likeability. Such a relationship would have

implications for human-agent research. If the perception

of the agent can be estimated through the utterance

behavior of the user, then we have a useful

measure of user attitude which can be estimated in

real-time.

H3 There is a relationship between the frequency of

task, non-task and total number of user utterances

and the perception of the agent in terms of intelligence,

animacy and likeability.

Our approach to answering these questions is to

conduct Wizard-of-Oz experiments using the virtual

basketball system described in Section 3. We then

describe how we analyze utterances during the exper-

iment by categorizing them according to Searle’s illo-

cutionary speech acts. The motivation for this catego-

rization is described in Section 4. From this data we

use frequency analyses and questionnaires to address

the above research hypotheses.

The contribution of this work is an analysis into

the nature of user utterances over time when interact-

ing with a basketball agent. This work can provide

guidelines for designing agents which can act appro-

priately with the user in terms of speech behavior, by

knowing what kind of utterances are suitable at par-

ticular moments.

2 RELATED WORK

Much research into embodied agents has been related

to ECAs. Sophisticated techniques for multi-modal

interaction have been able to create ECAs which exist

in many specialized and real-world domains such as

counseling, job interviews and museum guides (De-

Vault et al., 2014; Baur et al., 2013; Bickmore et al.,

2011) as well as those that partake in more general

conversation such as Greta and sensitive artiﬁcial lis-

teners (Schroder et al., 2012; Niewiadomski et al.,

2009). The purpose of these agents is to engage the

user in social interactions primarily through conversa-

tion, by using social signals to regulate their behavior.

On the other hand, embodied agents have been de-

veloped which engage in a shared virtual task with

the user, the earliest being Steve (Rickel and John-

son, 1999; Rickel and Johnson, 2000). These types

of agents also communicate with the user through

multiple modalities and are often used as training

systems. In Steve’s case the speech acts were well

structured. The focus in our work is on unstructured

speech where the user is free to say anything. Joint actions

as a basis for communication in teams have been

implemented in other work, although this focused on

robots or agents which were not humanoid (Li et al.,

2015; Bradshaw et al., 2009).

Many studies have analyzed spoken dialog behav-

ior of humans towards virtual agents (Campano et al.,

2014; Langlet and Clavel, 2014; Veletsianos, 2012;

Robinson et al., 2008; Kopp et al., 2005). These

dialogs have been social in nature and any task is

largely achieved through conversational means. Task-

based systems requiring teamwork arguably contain

more command-based language (“Go there”, “Pick

that up”). Several studies have also investigated social

dialog by an ECA in a task-based setting (Veletsianos,

2012; Bickmore and Cassell, 2005; Gulz, 2005), with

no clear consensus. It would appear that the value of

social dialog in these environments is user-dependent.

Furthermore, we could not identify any studies which

examine the change in utterance behavior during a

single session, which is a main focus of this work.

Real world communication in team sports, includ-

ing basketball, has also been studied (Poizat et al.,

2012; Travassos et al., 2011), but there is limited work

on interactive virtual teammates in a sporting domain.

Naturally there are basketball video games but communication

is done through peripherals rather than

human body interactions. Furthermore, in a video

game the user actually controls all the players, so joint

actions are not required. Ideally we would use a robot

which could play basketball but this currently does

not exist.

Figure 1: The virtual basketball environment. A screenshot of the game is shown in the left figure while in the right figure the user is shown interacting inside the immersive display environment. A Kinect sensor and pressure pad are used for interaction and navigation purposes.

3 VIRTUAL BASKETBALL

SYSTEM

In this section we describe the virtual basketball en-

vironment and the design of the Wizard-of-Oz agent

used in the experiment.

3.1 Basketball Environment

Our system is designed so that the user is able to

play basketball without the use of keyboard, mouse or

hand-held peripherals. This system was also used in

previous research to analyze non-verbal signals (Lala

et al., 2014). Our aim is not to implement a realistic

simulation of actual basketball. This would require

overcoming several technical issues which are outside

the scope of our work. We concede that the realism

of the game can inﬂuence the types of utterances used

and the results of our study, and this limitation is dis-

cussed later in this paper. For now our focus is only

on the interactions between a human player and agent

teammate, so body movement and speech recognition

are required as a means to facilitate natural communication.

The actual physics of the game need not be

accurately modeled.

The user stands in the middle of an immersive en-

vironment, with eight surrounding displays project-

ing the basketball game. They are represented by a

semi-transparent avatar which they can see in a third-

person view. The body movements of the user are

tracked using a Kinect sensor located in front of them,

so gesture recognition of passing, shooting and drib-

bling can be achieved. To navigate in the environment

the user walks in place on top of a foot pressure sen-

sor which recognizes their walking motion and moves

their character forward. Due to the limitations of the

Kinect sensor, in order to turn in the environment the

user does not turn their body but rotates their view-

point by standing on the extreme edges of the pres-

sure sensor. Although the user must generally be fac-

ing towards the Kinect sensor, the immersive displays

allow them to perceive the whole of the environment,

which is necessary in a dynamic game such as bas-

ketball. Screen shots of the game environment are

shown in Figure 1. The game is simpliﬁed to 2 vs. 2

pickup basketball to stimulate communication. In this

version of basketball, each team takes turns at trying

to score in one goal only. Opponent agents have the

same physical properties as the teammate agent. They

will attempt to block the path of the human and ﬁnd

space to shoot goals.

3.2 Wizard-of-Oz Agent Design

Our aim is to eventually create a fully autonomous

agent which will recognize human speech. For this

reason, we should also use our experiment as a

method to collect a number of utterances to create an

appropriate speech corpus, which will be in Japanese.

We can then use this corpus as a knowledge base for

an autonomous agent, and use techniques such as key-

word spotting to associate human utterances with be-

haviors and intentions.

Table 1: Categorizations of basketball utterances based on Searle’s taxonomy (Searle, 1975). Categories marked with an asterisk (*) are defined as task utterances.

Illocutionary act   Utterance category
Assertive           describing the state of the game
Directive           calling for a pass*, ordering (strategy)*, ordering (shoot)*
Commissive          throwing a pass*, statement of intention*
Expressive          acknowledgment*, apology, celebration, disappointment, encouragement, praise, thanking
Declarative         -
Unclassified        small talk, other

One method of collecting corpus data is to simply

observe real basketball matches. However, real basketball

differs largely from virtual basketball in terms

of the richness of communication channels. Real

life behaviors make use of facial expression, subtle

hand movements and eye gaze which are not recognized

in our system. Another approach would be to

analyze a multi-player basketball game. This also

has drawbacks because as humans we can assume

many capabilities of each other, including the ability

to recognize complex speech. It is likely that most

humans will assume their human teammate under-

stands this speech and so use utterances which coordi-

nate human-human activities rather than human-agent

play. Research suggests that the type of communi-

cation partner (agent or avatar) affects behavior (Fox

et al., 2015; Aharoni and Fridlund, 2007).

Due to these issues, we opted to use a Wizard-

of-Oz (WOZ) agent. The advantage is that the user

assumes that their teammate is artiﬁcial while we can

provide it with intelligent behavior. The design of the

WOZ agent is important because it should not reveal

that it is being controlled by a human operator. For

this reason the WOZ agent is controlled by keyboard,

with triggers for gestures and utterances rather than

real-time motion capture and synthesis of a human

voice. The utterances are created using OpenJTalk,

a Japanese language speech synthesis program (Open

JTalk, 2015). This program allows us to create speech

for the agent in the form of pre-recorded sound files,

which are then played during the game at appropriate

moments. In total, only 18 sound recordings

were used.

The initial utterance categories of the agent were

calling for a pass, celebration, disappointment, encouragement

and acknowledgment. After the first three experiments

we found that these did not cover much of the observed

behavior, so in subsequent experiments we added the

categories of throwing a pass, apologizing and stating an

intention to move, which also helped grow our speech corpus.

The choice of which individual utterances

in the same category to use was random. Speech was

used to both instigate and respond to the human team-

mate. For example, the agent could use encourage-

ment if the human was struggling or call for a pass if

in free space.

The WOZ operator had knowledge of the goal of

the experiment, but their decisions were made to try

and simulate those of an average, rational player who

aimed to collaborate with their human teammate. Although

the game is extremely easy to win using a keyboard,

the WOZ operator did not fully exploit this capability

in order to make the game more balanced.

4 ANNOTATION OF

UTTERANCES

In this section we describe the methodology used to

annotate and categorize the utterances used by users

in the basketball game. Categorization of dialog in

human-agent interaction has been addressed in pre-

vious research which argued for categorizing dialog

based on speech act theory (Traum, 1999; Traum,

2000). However, the majority of this work was in the

domain of conversation or conversation as a means of

gathering information. The domain of our system is

more speciﬁc. It is dialog which occurs while a team

sport is being played. From our experiments we ob-

served that the type of dialog differed greatly. Utter-

ances tended to be short (two or three words), much

like the interactions (one utterance per party), and of-

ten repeated at various stages during play. Conver-

sational dialog often involves elaboration, explaining

and question-answering as well as facilitation mecha-

nisms such as turn-taking and backchannelling.

It would appear that basketball as a domain is sim-

pler than conversation in terms of the length and type

of utterances used. The richness of signaling therefore

comes from the context of the game and other modal-

ities. If a player with the ball says “Here!” while

turning towards their partner, the partner can infer that

this is a signal to receive a pass. Such domains have

rarely been examined in real or virtual settings, al-

though research on dialog for online teamwork has

been conducted (Taylor, 2012). Therefore we have no

domain-speciﬁc categorization which we can apply.

Several standardized taxonomies for utterance

classiﬁcation exist and one of the most well-known is

the labeling of utterances as illocutionary speech acts

as described by Searle (Searle, 1975). Others have

also been devised such as DAMSL (Core and Allen,

1997) and DIT++ (Bunt, 2009) which label utterances

as dialog acts. These taxonomies address some draw-

backs of Searle’s categorizations by allowing multi-

ple labels of an utterance and providing a hierarchi-

cal structure for categorizations. The dialog acts have

been used as the basis for other coding schemes which

either reﬁne the tags (Jurafsky et al., 1997) or relate

them to speciﬁc domains such as meetings (Shriberg

et al., 2004).

However, a problem with using dialog

acts for virtual basketball is that there are many la-

bels to choose from which are applied to human-

human conversation rather than basketball-type inter-

action, as described above. For this reason, we opt to

use Searle’s speech act categorizations. Although the

number of labels is smaller, they better represent the

more limited range of utterances used in basketball.

Furthermore, annotating and classifying the types of

utterances is more clear-cut under the categories de-

ﬁned by Searle as opposed to multi-dimensional or

hierarchical labeling. Table 1 displays the categoriza-

tions of speciﬁc basketball activities under Searle’s

taxonomy.

We now clarify some of the more ambiguous cat-

egories. Describing the state of the game is an ut-

terance from the user which explains the current sit-

uation but does not make any subjective assessment,

such as “haittenai ([the ball] didn’t go in)”. Order-

ing (strategy) is an utterance detailing steps the agent

should take in the game, such as moving to a particu-

lar location. This excludes passing or shooting com-

mands. Passing (calling and throwing) and order-

ing (shoot) were designated as speciﬁc categories due

to them being the major task behaviors in basketball.

Encouragement is a general category containing ut-

terances which are used to give the agent support. We

include utterances used when the agent is attempting

to perform a task or expressing regret for a mistake.

Examples include “ganbare (do your best)!” and “ii

yo (it’s OK)”.

Previously we described task utterances as those

being directly related to the achieving of a shared

goal. We can therefore also label all commissive and

directive speech acts in addition to acknowledgment

as task utterances because these are said in order to

win the game.
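The labeling rule above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the category names are taken from Table 1, while the `Act` enum, the `CATEGORY_TO_ACT` mapping and the `is_task_utterance` function are hypothetical names introduced here.

```python
from enum import Enum


class Act(Enum):
    """Searle's illocutionary acts, plus the extra 'unclassified' label of Table 1."""
    ASSERTIVE = "assertive"
    DIRECTIVE = "directive"
    COMMISSIVE = "commissive"
    EXPRESSIVE = "expressive"
    DECLARATIVE = "declarative"
    UNCLASSIFIED = "unclassified"


# Utterance category -> illocutionary act, following Table 1.
CATEGORY_TO_ACT = {
    "describing the state of the game": Act.ASSERTIVE,
    "calling for a pass": Act.DIRECTIVE,
    "ordering (strategy)": Act.DIRECTIVE,
    "ordering (shoot)": Act.DIRECTIVE,
    "throwing a pass": Act.COMMISSIVE,
    "statement of intention": Act.COMMISSIVE,
    "acknowledgment": Act.EXPRESSIVE,
    "apology": Act.EXPRESSIVE,
    "celebration": Act.EXPRESSIVE,
    "disappointment": Act.EXPRESSIVE,
    "encouragement": Act.EXPRESSIVE,
    "praise": Act.EXPRESSIVE,
    "thanking": Act.EXPRESSIVE,
    "small talk": Act.UNCLASSIFIED,
    "other": Act.UNCLASSIFIED,
}


def is_task_utterance(category: str) -> bool:
    """Task utterances: all directives and commissives, plus acknowledgment."""
    act = CATEGORY_TO_ACT[category]
    return act in (Act.DIRECTIVE, Act.COMMISSIVE) or category == "acknowledgment"
```

Keeping the act mapping separate from the task rule means acknowledgment can remain an expressive act while still counting as a task utterance.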

5 EXPERIMENT

We conducted experiments with 15 Japanese speakers

who played the basketball game with a Wizard-of-Oz

operator. Prior to the game they were shown an in-

struction video and given a training session to famil-

iarize themselves with the game. During this train-

ing session the agent would also take part and engage

in a greeting with the participant. This was to en-

sure that the participants were aware that the agent

had the ability to understand speech. We did not pro-

vide details as to what speech the participants should

use during the experiment. They were free to speak

and interact with the agent however they liked. Each

game lasted 15 minutes. All speech data and game

data (positions of the game objects, players, and their

body poses) were recorded so that we could go back

and watch the games. Participants were also asked to

submit questionnaires which gave subjective evalua-

tions of the perceived intelligence, animacy and like-

ability of their teammate (Bartneck et al., 2009).

Each recording of the basketball game was ob-

served and all user utterances were transcribed, both

lexical and non-lexical. We used the following pro-

cess to annotate an utterance:

1. If the utterance is not a communicative act toward

the agent, ignore it. This removes self-directed

speech. This information is maintained for the

corpus but is not part of our analysis at this stage.

2. Label the utterance according to the categories in

Section 4. This is subjective but when observing

the games the appropriate categorization is gener-

ally clear, particularly compared to conversation.

3. Note if the categorization is the ﬁrst of its kind

during the basketball game. For example, if the

participant says “Thanks” and a thanking utter-

ance has not been used in the game then this con-

stitutes a new category.

4. Note if the utterance is the ﬁrst of its kind within

the same category. We consider similar utterances

in different categories to be distinct. For example,

“Pass” can be used when either calling for a pass

or throwing a pass. Variations of the root word

constitute the same utterance. In Japanese we con-

sider the utterance “Pasu shiro” to be the same as

“Pasu shite”, the common root being “Pasu”. Al-

though this is not entirely accurate because of nu-

ance, it is satisfactory for this analysis. We also

combine repeated utterances into one utterance,

defining an utterance as a repetition if it is spoken within 500

milliseconds of the previous one with no interruption by the agent.
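The four annotation steps can be expressed as a short procedure. The sketch below is only one possible reading and is not the authors' tooling: the `Utterance` record, its field names and the `annotate` function are hypothetical, the category and root word are assumed to have been assigned by a human annotator, and the 500 ms gap is assumed to be measured from the end of one utterance to the start of the next.

```python
from dataclasses import dataclass

REPEAT_GAP_MS = 500  # repetition threshold stated in step 4


@dataclass
class Utterance:
    start_ms: int
    end_ms: int
    speaker: str           # "user" or "agent"
    root: str              # root word, e.g. "pasu" for "pasu shiro" / "pasu shite"
    category: str          # label assigned by the annotator (step 2)
    to_agent: bool = True  # False for self-directed speech (step 1)


def annotate(utterances):
    """Return (utterance, new_category, new_utterance) for each analyzed user utterance."""
    seen_categories = set()
    seen_pairs = set()  # (category, root): same root in another category is distinct
    previous = None     # last user or agent utterance in the stream
    annotated = []
    for u in sorted(utterances, key=lambda x: x.start_ms):
        if u.speaker == "agent":
            previous = u  # agent speech interrupts any repetition
            continue
        if not u.to_agent:
            continue      # step 1: skip self-directed speech
        is_repeat = (
            previous is not None
            and previous.speaker == "user"
            and (previous.category, previous.root) == (u.category, u.root)
            and u.start_ms - previous.end_ms <= REPEAT_GAP_MS
        )
        previous = u
        if is_repeat:
            continue      # merged into the preceding utterance
        new_category = u.category not in seen_categories         # step 3
        new_utterance = (u.category, u.root) not in seen_pairs   # step 4
        seen_categories.add(u.category)
        seen_pairs.add((u.category, u.root))
        annotated.append((u, new_category, new_utterance))
    return annotated
```

The output is a time-ordered list in which each retained user utterance carries two flags, which is the information the temporal analysis needs.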

Interpretation of the meanings of utterances was

not difﬁcult due to the context of the utterance being

apparent in basketball. Nevertheless, the annotations

were also checked by a native Japanese speaker and

inter-observer reliability was around 95%. The end

result of this is a script consisting of time-stamped ut-

terances by both user and agent, and their associated

categories. This provides us with the necessary tem-

poral information for our analysis.

Figure 2: The left ﬁgure shows the ratio of new utterances to total utterances divided into 15 1-minute blocks. The right ﬁgure

displays the ratio of new categories to total utterances.

Table 2: User utterances for all games (abbreviated). Task utterances are marked with an asterisk (*).

Utterance category     % total
Call for pass*         18.2
Praise                 13.3
Throw pass*            9.7
Ordering (strategy)*   9.0
Acknowledgment*        8.4
Celebration            6.3
Encouragement          6.3
Apology                5.7
Statement of intent*   4.1

Table 3: Distribution of utterances under Searle’s illocutionary acts.

Illocutionary act   % total
Assertive           3.5
Directive           31.0
Commissive          13.8
Expressive          47.8
Unclassified        3.9

6 RESULTS

We ﬁrst provide some general statistics on the utter-

ances. The 15 participants spoke a total of 934 cat-

egorized utterances, of which 153 were unique. We

identiﬁed one outlier, a participant who did not use

any utterances during their interaction with the WOZ

agent. Table 2 displays the distribution of the utter-

ances for the top ten categories. We can see that utter-

ances are fairly equally spread between task and non-

task, although for speciﬁc categories, task utterances

dominate.

An analysis of the basketball utterances according

to Searle’s illocutionary speech acts is displayed in

Table 3. We see that almost half of all utterances are

expressive in nature, while almost a third are directive.

This would indicate that communication which

expressed emotion or the participant’s internal state

to the agent was more heavily used than command-based

language. Around 14% of utterances were commissive,

with the user informing the agent about what

they were going to do.

Table 4: Median order of utterance categories.

Utterance category                  Order
Call for pass / Throw pass          2
Acknowledgment                      4
Celebration / Praise / Thank        5.5
Ordering (shoot) / Disappointment   6
Ordering (strategy)                 6.5
Encourage                           7
Apology                             7.5
Statement of intent                 8.5

To analyze temporal behavior we divided the time

periods for all participants into 1-minute blocks. The

distribution of utterances per block was approximately

uniform. There was a mean of 4.2

utterances per participant per minute. We calculated

the proportion of new utterances to total utterances

in each block. Results are shown in the left diagram

of Figure 2. Until the sixth minute (where there is a

peak), the majority of utterances are new. The rate of

new utterances then drops after this time and remains

fairly steady. We performed the same analysis for the

proportion of utterances in new categories, as shown

in the right diagram of Figure 2. Similarly, the drop

over time is gradual before leveling off from around

the eighth minute.
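The block-wise proportions plotted in Figure 2 can be computed as in the following sketch. It is an illustration under assumptions rather than the analysis code used in the study: the function name `new_ratio_per_block` and the input format (pairs of timestamp in seconds and a flag marking a first-time utterance or category, pooled over participants) are hypothetical.

```python
BLOCK_SECONDS = 60  # 1-minute blocks


def new_ratio_per_block(events, n_blocks=15):
    """Proportion of new utterances among all utterances in each time block.

    events: iterable of (time_in_seconds, is_new) pairs.
    Returns one ratio per block, or None for a block with no utterances.
    """
    totals = [0] * n_blocks
    new_counts = [0] * n_blocks
    for time_s, is_new in events:
        # Clamp so that an utterance at exactly the final second stays in the last block.
        block = min(int(time_s // BLOCK_SECONDS), n_blocks - 1)
        totals[block] += 1
        if is_new:
            new_counts[block] += 1
    return [
        new_counts[b] / totals[b] if totals[b] else None
        for b in range(n_blocks)
    ]
```

The same function serves both panels of Figure 2: pass the new-utterance flag for the left panel and the new-category flag for the right.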

A general overview of the data shows that both

task and non-task utterances were used. It would also

appear that even after 15 minutes, users would sporadically

try utterances and dialog with communicative

intent which they had not previously used

in the game.

6.1 Task Utterance Complexity

H1 states that users will attempt basic task utterances

before complex ones, so we are interested in the order

in which new utterance categories are spoken. We an-

alyzed the order of new utterance categories and only

considered those which were present in a majority of

games, of which there were 12. For example, if “Call

for pass” was the ﬁrst category uttered in a game we

recorded its order as 1. We took the median of the

orders for all games to determine which types of ut-

terance were likely to be spoken before others. The

results are shown in Table 4.
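The ordering analysis can be sketched as follows. This is an illustrative reconstruction and not the authors' code: the function names are hypothetical, each game is assumed to be given as the time-ordered list of category labels of the user's utterances, and "a majority of games" is read as strictly more than half.

```python
from statistics import median


def first_appearance_order(categories):
    """Map each category to the rank (1, 2, ...) at which it first appears in a game."""
    order = {}
    for category in categories:
        if category not in order:
            order[category] = len(order) + 1
    return order


def median_orders(games):
    """Median first-appearance rank of each category present in a majority of games."""
    per_game = [first_appearance_order(game) for game in games]
    all_categories = set().union(*per_game)
    result = {}
    for category in all_categories:
        ranks = [order[category] for order in per_game if category in order]
        if len(ranks) > len(games) / 2:
            result[category] = median(ranks)
    return result
```

For example, with three games in which "call for pass" is first, second and first, its median order is 1.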

What does utterance complexity mean in the con-

text of basketball? Complexity could mean the choice

of words used, but as we have stated most utterances

were only a few words at most. If we take complexity

as the type of action, then the most basic of basketball

collaborative actions are to do with passing - asking

the agent to receive a pass and signaling that a pass

is to be thrown. These can involve both speech and

gesture. More complex utterances may be directives

which order the agent to perform a particular action or

strategy. Results in Table 4 appear to support H1. The

ﬁrst utterances from the user are basic passing actions

and acknowledgements. Strategic ordering utterances

are used later in the game. In terms of Searle’s classi-

ﬁcation, there does not appear to be any deﬁnite pat-

tern of ordering.

6.2 Task Utterance Ratio

H2 states that the proportion of task utterances from

the user decreases during the game. We defined “during

the game” as relative to the number of user utterances

to account for individual differences in user

behavior. To be specific, we divide the total number

of utterances of a user in half and compare how the

proportion of task to non-task utterances changes over

the second half of the interaction. We take a moving

average with a large window that can encapsulate a

general trend. We ﬁrst normalized by total utterances,

n. For the $m$-th utterance $u_m$, we calculate the proportion of the previous $m - (n/2)$ utterances which are task utterances, $k(u_m)$. This creates a simple moving average for the second half of utterances.

$$k(u_m) = \frac{\sum_{l=m-(n/2)}^{m} \mathbf{1}[u_l = TU]}{m - (n/2)}, \quad m > n/2$$

with $TU$ indicating whether the utterance is a task utterance.
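A small sketch may help make the moving average concrete. It is one reading of the description, stated as an assumption: for the m-th utterance (m > n/2) the window is taken to hold the m - n/2 utterances spoken after the midpoint, up to and including the m-th, so that the window size matches the denominator. The function name and the boolean input format are hypothetical.

```python
def second_half_task_proportion(is_task):
    """Task-utterance proportion k(u_m) for each utterance in the second half.

    is_task: list of booleans in spoken order, True for a task utterance.
    Returns a list with one value for each m = n/2 + 1, ..., n.
    """
    n = len(is_task)
    half = n // 2  # assumption: integer division when n is odd
    proportions = []
    for m in range(half + 1, n + 1):
        window = is_task[half:m]  # utterances half+1 .. m (1-based), m - half items
        proportions.append(sum(window) / len(window))
    return proportions
```

Under this reading the first value depends on a single utterance, so early values are noisy and the trend is more meaningful towards the end of the sequence.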

After normalizing for all users, we then calculated

the mean percentage change in task utterance proportion

during the second half of utterances. Figure 3

displays this trend. It can be seen that the second half

of utterances exhibits a decrease in task utterance proportion,

with approximately 6% fewer task utterances

than the first half. The decrease is not gradual but fluctuates.

This gives some support to our hypothesis H2.

Figure 3: Mean percentage change of task utterance proportion for second half of utterances.

6.3 Perceptions of Agent

We analyzed the results of the Godspeed question-

naire by summing the items for perceived intelli-

gence, animacy and likeability. Cronbach’s alpha was

above 0.8 for all these measures so we could treat

each measure as a single variable. Results in Table

5 show that for all three measures the average score

was middling, indicating that users did not generally

have strong opinions about the agent. We also found

a positive correlation (R-squared value 0.57) between

perceived intelligence and likeability.
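For readers unfamiliar with the reliability check, Cronbach's alpha for one measure can be computed from the per-item ratings as sketched below. This uses the standard formula with sample variances. The function name and input layout (one list of item ratings per participant) are assumptions made for illustration, not a description of the authors' analysis scripts.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one questionnaire measure.

    item_scores: list of respondents, each a list of k item ratings (k >= 2).
    """
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # ratings grouped by item
    item_variance_sum = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)
```

A value close to 1 indicates that the items vary together, which is what justifies summing them into a single score per measure.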

Table 5: Results of Godspeed questionnaire on perception of agent teammate.

Measure        Max possible   Mean   Std. dev.
Intelligence   25             17.9   3.0
Animacy        15             8.2    2.2
Likeability    25             18.5   2.6

However, we could not ﬁnd any evidence to sup-

port H3, which was to assess if there were any corre-

lations between user utterance behavior and their per-

ception of the agent. The number of total utterances,

task utterances and non-task utterances had no signif-

icant correlation with perceived intelligence, animacy

or likeability. From the questionnaires we also found

no evidence that users could tell that the agent was ac-

tually controlled by a hidden operator. We did not ask

the participants about this in the questionnaire itself

(to prevent alerting them about the true nature of the

experiment), but instead spoke with them casually af-

ter the experiment. Several participants remarked on

how they were surprised that the agent could under-

stand their utterances.
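The kind of check behind H3 can be illustrated with a plain Pearson correlation between a per-participant utterance count and a questionnaire score. The sketch below is an assumption about the form of the test, since the text does not name the exact statistic, and `pearson_r` is a hypothetical helper written here to avoid depending on any particular library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Squaring the returned coefficient gives the R-squared value of a simple linear fit between the two variables, the quantity quoted for perceived intelligence and likeability.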

7 DISCUSSION

Our research assessed utterance behavior during in-

teraction with agents which engage in repeated joint

actions with humans. Through a WOZ experiment we

were able to produce an agent which could understand

the human and react appropriately to a wide range

of utterances. Our analysis involved annotating and

categorizing utterances then assessing changes over

time.

We showed that users tended to begin with utter-

ances which conﬁrmed the agent’s understanding of

basic passing tasks, before moving on to more com-

plex utterances such as strategic commands (H1). We

found that subjects used both task and non-task dia-

log and found some evidence that the proportion of

non-task dialog increased during the second half of

the interaction (H2). There was no evidence that ut-

terance behavior of the user was indicative of their

perception of the agent (H3). We now discuss lim-

itations of this study and then further discuss these

results in a broader context and their implications for

future research.
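The comparison behind H2 can be illustrated with a small sketch that splits an annotated session at its midpoint and computes the proportion of non-task utterances in each half. This is only an assumed representation: the `(time, is_task)` pairs, the function name and the example session are hypothetical and are not taken from the experiment's corpus.

```python
def non_task_proportion_by_half(utterances, session_length):
    """Return (first_half, second_half) proportions of non-task utterances.

    utterances: iterable of (time_in_seconds, is_task) pairs.
    A half with no utterances yields None rather than a misleading 0.
    """
    midpoint = session_length / 2
    counts = {"first": [0, 0], "second": [0, 0]}  # [non_task, total]
    for time, is_task in utterances:
        half = "first" if time < midpoint else "second"
        counts[half][1] += 1
        if not is_task:
            counts[half][0] += 1

    def proportion(non_task, total):
        return non_task / total if total else None

    return proportion(*counts["first"]), proportion(*counts["second"])


# Hypothetical annotated session of 900 seconds; NOT study data.
session = [
    (30, True), (95, True), (210, False), (400, True),
    (470, True), (520, False), (610, True), (700, False), (850, False),
]
first, second = non_task_proportion_by_half(session, session_length=900)
print(first, second)
```

In this made-up session the second half contains proportionally more non-task utterances than the first, which is the kind of pattern H2 looks for.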

Aside from our hypotheses, we also found that the

language of users was varied in terms of the categories

of utterances and Searle’s taxonomy. This is encour-

aging because it shows that users did not treat the

agent as a simple machine which interacted through

commands. In fact, according to Searle’s taxonomy,

expressive utterances were the most common, with

language indicating praise, disappointment, encour-

agement and apologizing often used. As with real

basketball, socially expressive language seems to hold

just as much importance in virtual basketball as task-based

language. Our hope is that this type of result can be

replicated with an autonomous agent.

7.1 Limitations

There were several limitations in this work. The

biggest limitation is that the experiment had a small

sample size, so our results are only indicative in nature.

Although we found evidence of correlations,

these need to be reproduced to claim any substantial

pattern of behavior. In future work, we plan to more

robustly test these ﬁndings by using more participants

and increasing the range of utterances of the agent to

accommodate more complex behavior, such as strate-

gic ordering. Furthermore, as we are also aiming to

create a Japanese speech corpus, this experiment was

performed using Japanese-speaking participants. Cultural

or linguistic differences could produce different

results in other settings.

One other major limitation is that the game is not

exactly the same as real-life basketball. This is not

only restricted to physical realism, but also the ﬁdelity

of the agent in terms of gaze behavior and body move-

ments. Clearly basketball uses multimodal interaction

rather than speech alone. We did not account for these

non-verbal features in our analysis, although anecdo-

tally we did observe that users often used non-verbal

signals together with speech, particularly when call-

ing for a pass. The agent itself could only utter a

very limited set of phrases. This meant that the user

could only communicate with it in a limited manner,

mainly giving commands and receiving acknowledg-

ments. A more sophisticated agent would need to be

able to accommodate small talk behaviors. Addition-

ally, an agent which sounded human-like rather than

using a synthesized voice as in this study could have

produced better results.

We acknowledge that results of this study could

change if a more realistic game was used. How-

ever, we also believe that the general hypothesis of

communicative behavior shifting from simple to com-

plex meanings would still hold. The difference is the

form that this behavior would take, given the ability

to smoothly combine speech, gesture and gaze as op-

posed to reducing signals to speech alone. The chal-

lenge is to infer the intended message of the user

from a wide range of modalities. Clearly this requires

more effort than our study, where only verbal utter-

ances were analyzed. Another challenge for an au-

tonomous agent is to recognize complex multimodal

signals, which is more difﬁcult than one modality (in

our case, speech recognition).

From the perspective of the user, there was some

variation in the ability to play the game smoothly,

which may have hindered their motivation to interact.

A few users had trouble using the system to pass and

shoot which made interaction with the agent trouble-

some. For these users the focus was on getting the

system to work rather than collaborative actions.

7.2 Implications for Agent Design

The long-term goal of our work is to produce an au-

tonomous basketball agent which can interact natu-

rally with the user. However, this does not mean the

results cannot be generalized to other domains. Pre-

viously we stated that basketball is part of a set of

domains which utilize open navigation and full body

movement as communicative tools. Another example

is helping out victims in a disaster area. We argue

that the basic ideas presented in this work still apply

to these domains, in that users start by testing basic

capabilities of the agent. For a task-based agent these

are functions which contribute to the accomplishment

of the shared task. Once these have been satisﬁed, the

user is likely to test other capabilities of the agent by

engaging in more social language and complex behaviors.

We have shown in our experiment that the order

of such behaviors can be somewhat estimated. When

designing an agent which uses repeated joint actions,

we should ensure that we facilitate the user’s process

of capability testing by creating situations where the

agent can prove itself.

This initial experiment can provide a useful base-

line for comparison with a fully autonomous agent.

We now have a substantially larger corpus from which

utterances can be generated, so this should provide a

more interesting ground for comparison. Using the

corpus we can create an autonomous agent which uses

speech recognition. We can then deﬁne the utterances

the agent uses for speciﬁc game actions. From our

ﬁndings, the agent model should regulate its utter-

ances according to the amount of time spent interact-

ing with the user.

We propose a conceptual agent model based on

our ﬁndings. During the initial interaction, the agent

should show that it can express and understand signals

related to simple passing behaviors by actively trying

to engage the user in these joint actions. This can be

achieved by initiating the joint action through speech

and proving to the user it understands this behavior.

Several repetitions of these joint actions can be per-

formed. Once this capability has been established, the

agent moves to non-task and complex task behaviors

using a similar process, gradually building up com-

mon ground between itself and the user. With more

sophisticated technology, agents and robots which en-

gage in repeated joint actions such as in basketball

will become more viable, so we propose analyzing

behavior in this environment as a potential research

direction.
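The conceptual model above can be sketched as a two-phase state machine. This is an illustrative sketch only, not an implementation from this work: the class and behavior names are hypothetical, and the threshold of three successful passing interactions is an assumed placeholder, since the number of repetitions needed is not specified.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    BASIC_PASSING = auto()  # establish simple passing joint actions
    EXTENDED = auto()       # non-task and complex task behaviors


# Hypothetical behavior labels for each phase.
BASIC_BEHAVIORS = ["call_for_pass", "offer_pass", "acknowledge"]
EXTENDED_BEHAVIORS = BASIC_BEHAVIORS + ["praise", "apologize", "suggest_strategy"]


@dataclass
class StagedAgent:
    required_successes: int = 3  # assumed threshold, not from the paper
    successes: int = 0
    phase: Phase = Phase.BASIC_PASSING

    def allowed_behaviors(self):
        """Behaviors the agent may initiate in its current phase."""
        if self.phase is Phase.BASIC_PASSING:
            return list(BASIC_BEHAVIORS)
        return list(EXTENDED_BEHAVIORS)

    def record_joint_action(self, succeeded: bool) -> None:
        """Count successful passing interactions and advance when enough occur."""
        if self.phase is Phase.BASIC_PASSING and succeeded:
            self.successes += 1
            if self.successes >= self.required_successes:
                self.phase = Phase.EXTENDED


agent = StagedAgent()
for outcome in [True, False, True, True]:
    agent.record_joint_action(outcome)
print(agent.phase, agent.allowed_behaviors())
```

A fuller model would likely repeat the same establish-then-extend process within the extended phase, so that each new non-task or strategic behavior is itself confirmed before further ones are introduced.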

From our experiment it would appear that user utterance

behavior is not correlated with the agent's

perceived intelligence, animacy or

likeability. This means that we cannot use real-time

utterance analysis in this environment as a means to

gauge user enjoyment or satisfaction. It is likely that

an analysis of prosodic features of speech such as vol-

ume and pitch would produce a correlation with these

perception measures, but this requires a more sophis-

ticated recording system than we used for this work.

It is also likely that there are non-speech features of

the agent which inﬂuence user perception.

7.3 Changes in Human Behavior

This work examined behavioral changes over one ses-

sion of play. In other longitudinal studies with agents,

multiple sessions are often used to gauge changes in

communicative behavior. We argue that both can be

useful, particularly in the case where the same type of

interactions occur repeatedly. Although in this work

the changes were not drastic even after 15 minutes, we

would like to ﬁnd some underlying state of the user

which can be inferred from their behavior. What we

could not identify was what causes humans to try new

utterances. This information would be extremely use-

ful for agent design because we could use it to speed

up the process of the human understanding the capa-

bilities of the agent. The context of our agent makes

this crucial because the human must interact with the

agent with no prior knowledge of its capabilities.

7.4 Integrating Task and Non-task Dialog

One important result of the experiment is that most

participants used dialog which was not solely related

to playing basketball. A basketball agent has

a particular shared task with a human, as opposed to

conversation where the goal might just be to interact

socially. The question then arises of how and when

to reliably transition from task to social dialog. In

the case of basketball, the situations to use both are

clearly deﬁned. Task dialog is used during play, while

stoppages in play or reactions to an event are precur-

sors to non-task dialog. Social dialog can be consid-

ered as a subset of non-task dialog, and is completely

unrelated to basketball. For example, during the game

an agent may ask if the human plays other sports. We

did not ﬁnd any examples of such utterances in our

experiment, but this may need to be considered in the

future. After all, many situations such as basketball

are essentially tasks which often require social lan-

guage.
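The rule described in this section, task dialog during play and non-task dialog at stoppages or in reaction to events, can be written as a simple selector. This is a minimal sketch under stated assumptions: the game states, the event names and the function `select_dialog_mode` are hypothetical and do not describe the system used in the experiment.

```python
from enum import Enum
from typing import Optional


class GameState(Enum):
    IN_PLAY = "in_play"
    STOPPAGE = "stoppage"


# Hypothetical events that invite a reaction (praise, apology, encouragement).
REACTION_EVENTS = {"basket_scored", "shot_missed", "pass_dropped"}


def select_dialog_mode(state: GameState, event: Optional[str] = None) -> str:
    """Return "non_task" at stoppages or after notable events, else "task"."""
    if state is GameState.STOPPAGE or event in REACTION_EVENTS:
        return "non_task"
    return "task"


print(select_dialog_mode(GameState.IN_PLAY))
print(select_dialog_mode(GameState.IN_PLAY, event="basket_scored"))
print(select_dialog_mode(GameState.STOPPAGE))
```

Social dialog unrelated to basketball would be a further subset of the non-task mode; when an agent should initiate it is left open here, as no such utterances were observed.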

7.5 Comparison with Previous Findings

Our previous work had analyzed body communica-

tion during the basketball game. We did not conduct

a thorough investigation of body movement in this

work, but from casual observations, previous results

generally held. Explicit body signals were made mainly

with the arms and were mostly related to passing interactions.

Similarly, both task and non-task communicative

signals were used. However, in this experiment

observable non-task signals such as apologizing

and celebrating were done through speech only. One

explanation could be that oral communication can express

non-task signals much more clearly. For example,

apologizing without speaking may be unintuitive to

humans without detailed recognition of facial expres-

sions. Examination of passing also showed similari-

ties in terms of the initiator and role of the interaction.

Participants tended to use speech the most when calling

for a pass from the agent, while they were less likely to

use speech in the opposite situation. In any case, the

combination of speech and gesture should be more

thoroughly addressed in future work.

8 CONCLUSION

In this paper we analyzed human utterance behav-

ior during interaction with an embodied basketball

teammate controlled by a Wizard-of-Oz operator. We

found evidence that the utterances from humans to-

ward the agent progressed from coordinating basic

tasks to more complex tasks. We also found that hu-

mans used both task and non-task utterances, with an

increase in the proportion of non-task utterances in

the latter half of the interaction. There was no corre-

lation between utterance behavior and the perception

of the agent. These results suggest that humans ﬁrst

test if the agent can understand basic speech related

to the game before experimenting with more complex

joint actions. Non-task dialog should also be consid-

ered and be used as the user becomes familiar with

the agent. Since we have gathered many utterances

for a speech corpus, our next step is to create a fully

autonomous basketball agent.

REFERENCES

Aharoni, E. and Fridlund, A. J. (2007). Social reac-

tions toward people vs. computers: How mere lables

shape interactions. Computers in human behavior,

23(5):2175–2189.

Bartneck, C., Kulić, D., Croft, E., and Zoghbi, S. (2009).

Measurement instruments for the anthropomorphism,

animacy, likeability, perceived intelligence, and per-

ceived safety of robots. International journal of social

robotics, 1(1):71–81.

Baur, T., Damian, I., Gebhard, P., Porayska-Pomsta, K., and

Andre, E. (2013). A job interview simulation: So-

cial cue-based interaction with a virtual character. In

2013 International Conference on Social Computing

(SocialCom), pages 220–227.

Bickmore, T. and Cassell, J. (2005). Social dialogue

with embodied conversational agents. In Advances

in natural multimodal dialogue systems, pages 23–54.

Springer.

Bickmore, T., Pfeifer, L., and Schulman, D. (2011). Re-

lational agents improve engagement and learning in

science museum visitors. In Vilhjálmsson, H., Kopp,

S., Marsella, S., and Thórisson, K., editors, Intelligent

Virtual Agents, volume 6895 of Lecture Notes in

Computer Science, pages 55–67. Springer Berlin Hei-

delberg.

Bradshaw, J. M., Feltovich, P., Johnson, M., Breedy, M.,

Bunch, L., Eskridge, T., Jung, H., Lott, J., Uszok, A.,

and van Diggelen, J. (2009). From tools to teammates:

Joint activity in human-agent-robot teams. In Human

Centered Design, pages 935–944. Springer.

Bunt, H. (2009). The DIT++ taxonomy for functional di-

alogue markup. In AAMAS 2009 Workshop, Towards

a Standard Markup Language for Embodied Dialogue

Acts, pages 13–24.

Campano, S., Durand, J., and Clavel, C. (2014). Compara-

tive analysis of verbal alignment in human-human and

human-agent interactions. In Proceedings of the 9th

International Conference on Language Resources and

Evaluation (LREC 2014).

Core, M. G. and Allen, J. (1997). Coding dialogs with the

DAMSL annotation scheme. In AAAI fall symposium

on communicative action in humans and machines,

volume 56.

DeVault, D., Artstein, R., Benn, G., Dey, T., Fast, E.,

Gainer, A., Georgila, K., Gratch, J., Hartholt, A.,

Lhommet, M., et al. (2014). Simsensei kiosk: A vir-

tual human interviewer for healthcare decision sup-

port. In Proceedings of the 2014 international con-

ference on Autonomous agents and multi-agent sys-

tems, pages 1061–1068. International Foundation for

Autonomous Agents and Multiagent Systems.

Fox, J., Ahn, S. J. G., Janssen, J. H., Yeykelis, L., Segovia,

K. Y., and Bailenson, J. N. (2015). Avatars ver-

sus agents: A meta-analysis quantifying the effect of

agency on social inﬂuence. Human-Computer Inter-

action, 30(5):401–432.

Gulz, A. (2005). Social enrichment by virtual characters

differential beneﬁts. Journal of Computer Assisted

Learning, 21(6):405–418.

Jurafsky, D., Shriberg, E., and Biasca, D. (1997). Switch-

board SWBD-DAMSL shallow-discourse-function

annotation coders manual. Institute of Cognitive Sci-

ence Technical Report, pages 97–102.

Kopp, S., Gesellensetter, L., Krämer, N. C., and

Wachsmuth, I. (2005). A conversational agent as mu-

seum guide–design and evaluation of a real-world ap-

plication. In Panayiotopoulos, T., Gratch, J., Aylett,

R., Ballin, D., Olivier, P., and Rist, T., editors, Intelli-

gent Virtual Agents, pages 329–343. Springer.

Lala, D., Nishida, T., and Mohammad, Y. (2014). A joint

activity theory analysis of body interactions in multi-

player virtual basketball. In Proceedings of the 28th

International BCS Human Computer Interaction Con-

ference on HCI 2014 - Sand, Sea and Sky - Holiday

HCI, BCS-HCI ’14, pages 62–71. BCS.

Langlet, C. and Clavel, C. (2014). Modelling users atti-

tudinal reactions to the agent utterances: focus on the

verbal content. In 5th International Workshop on Cor-

pora for Research on Emotion, Sentiment & Social

Signals (ES3 2014), Reykjavik, Iceland.

Li, S., Sun, W., and Miller, T. (2015). Communication

in human-agent teams for tasks with joint action. In

COIN 2015: The XIX International Workshop on Co-

ordination, Organizations, Institutions and Norms in

Multiagent Systems, pages 111–126.

Niewiadomski, R., Bevacqua, E., Mancini, M., and

Pelachaud, C. (2009). Greta: An interactive expres-

sive ECA system. In Proceedings of The 8th Inter-

national Conference on Autonomous Agents and Mul-

tiagent Systems-Volume 2, pages 1399–1400. Interna-

tional Foundation for Autonomous Agents and Multi-

agent Systems.

Open JTalk (2015). Open JTalk - HMM-based Text-to-

Speech System. http://open-jtalk.sp.nitech.ac.jp/.

Poizat, G., Bourbousson, J., Saury, J., and Sève, C. (2012).

Understanding team coordination in doubles table ten-

nis: Joint analysis of ﬁrst- and third-person data. Psy-

chology of Sport and Exercise, 13(5):630 – 639. A

Sport Psychology Perspective on Olympians and the

Olympic Games.

Rickel, J. and Johnson, W. L. (1999). Virtual humans for

team training in virtual reality. In Proceedings of the

Ninth International Conference on Artiﬁcial Intelli-

gence in Education.

Rickel, J. and Johnson, W. L. (2000). Task-oriented collab-

oration with embodied agents in virtual worlds. Em-

bodied conversational agents, pages 95–122.

Robinson, S., Traum, D., Ittycheriah, M., and Henderer, J.

(2008). What would you ask a Conversational Agent?

Observations of Human-Agent Dialogues in a Mu-

seum Setting. In International Conference on Lan-

guage Resources and Evaluation (LREC).

Schroder, M., Bevacqua, E., Cowie, R., Eyben, F., Gunes,

H., Heylen, D., ter Maat, M., McKeown, G., Pammi,

S., Pantic, M., Pelachaud, C., Schuller, B., de Sevin,

E., Valstar, M., and Wollmer, M. (2012). Building

autonomous sensitive artiﬁcial listeners. IEEE Trans-

actions on Affective Computing, 3(2):165–183.

Searle, J. R. (1975). A taxonomy of illocutionary acts. In

Gunderson, K., editor, Language, Mind and Knowl-

edge, pages 344–369. University of Minnesota Press.

Shriberg, E., Dhillon, R., Bhagat, S., Ang, J., and Car-

vey, H. (2004). The ICSI meeting recorder dialog act

(MRDA) corpus. Technical report, DTIC Document.

Taylor, N. (2012). A Silent Team is a Dead Team: Communicative

norms in competitive FPS play. In Voorhees,

G. A., Call, J., and Whitlock, K., editors, Guns,

Grenades, and Grunts: First-person Shooter Games,

pages 251–275. Bloomsbury Publishing USA.

Traum, D. R. (1999). Speech acts for dialogue agents.

In Foundations of rational agency, pages 169–201.

Springer.

Traum, D. R. (2000). 20 questions on dialogue act tax-

onomies. Journal of Semantics, 17(1):7–30.

Travassos, B., Araújo, D., Vilar, L., and McGarry, T. (2011).

Interpersonal coordination and ball dynamics in fut-

sal (indoor football). Human Movement Science,

30(6):1245–1259.

Veletsianos, G. (2012). How do learners respond to ped-

agogical agents that deliver social-oriented non-task

messages? Impact on student learning, perceptions,

and experiences. Computers in Human Behavior,

28(1):275–283.
