Constructing a Non-task-oriented Dialogue Agent using Statistical Response Method and Gamification

Michimasa Inaba^1, Naoyuki Iwata^2, Fujio Toriumi^3, Takatsugu Hirayama^2, Yu Enokibori^2, Kenichi Takahashi^1 and Kenji Mase^2

^1 Graduate School of Information Sciences, Hiroshima City University, 3-4-1 Ozukahigashi, Asaminami-ku, Hiroshima, Japan
^2 Graduate School of Information Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan
^3 School of Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
Keywords: Non-task-oriented, Dialogue Agent, Crowdsourcing, Gamification.
Abstract: This paper provides a novel method for building non-task-oriented dialogue agents such as chatbots. A dialogue agent constructed using our method automatically selects a suitable utterance, depending on the context, from a set of candidate utterances prepared in advance. To realize automatic utterance selection, we rank the candidate utterances in order of suitability by applying a machine learning algorithm. We employ both right and wrong dialogue data to learn the relative suitability used to rank the utterances. Additionally, we provide a low-cost, quality-assured learning data acquisition environment using crowdsourcing and gamification. The results of an experiment using learning data obtained via this environment demonstrate that an appropriate utterance is ranked at the top in 82.6% of cases and within the top 3 in 95.0% of cases. The results also show that using context information, which most existing agents do not use, is necessary for appropriate responses.
1 INTRODUCTION
A great demand exists for computerized dialogue agents, and they are increasingly used in many different areas. Dialogue agents are categorizable into two types according to their task perspective: task-oriented dialogue agents and non-task-oriented dialogue agents (Isomura et al., 2009). Task-oriented dialogue agents are used to accomplish particular tasks such as reservation services (Zue et al., 1994), supplying specific information (Chu-Carroll and Nickerson, 2000), etc. Non-task-oriented dialogue agents have no such tasks and simply chat with us.
Non-task-oriented dialogues play a critical role in human society because they are an important tool for building relationships. Robots and other anthropomorphic agents are expected to participate increasingly in our daily lives. Therefore, much more investigation is needed of how non-task-oriented dialogue agents can be designed so that they can develop good relationships with people.
Even a task-oriented dialogue agent can accomplish a task more efficiently using non-task-oriented dialogues. For example, a study by Bickmore and Cassell showed that when dialogue agents that supported the buying and selling of real estate initially chatted about subjects not pertinent to real estate, such as the weather, people were much more motivated to buy real estate through them than through agents that did not engage in non-task-oriented dialogues (Bickmore and Cassell, 2001).
In this paper, we propose a construction method for non-task-oriented dialogue agents based on a statistical response method. Two major response methods exist for non-task-oriented dialogue agents.
The first comprises rule-based methods, which produce utterances in accordance with response rules. Well-known dialogue agents that use this strategy are ELIZA (Weizenbaum, 1966) and A.L.I.C.E. (Wallace, 2009). Mitsuku (Worswick, 2013), the 2013 winner of the Loebner Prize contest^1 (a non-task-oriented dialogue agent competition), also uses this strategy. The problem with this strategy is its substantial cost, because the rules are developed by hand.

^1 http://www.loebner.net/Prizef/loebner-prize.html
The other is the example-based method (Murao et al., 2003; Banchs and Li, 2012). A dialogue agent employing this strategy searches a large database of dialogues with the user input (the user's utterance) using cosine similarity, and selects the utterance that follows the most similar one as its response. The problem is how to acquire a large quantity of good-quality dialogues efficiently, because performance depends on the quality of the dialogues in the database. A response method based on statistical machine translation (Ritter et al., 2011) has also been proposed; it treats the user's last utterance as an input sentence and translates it into a response utterance. This method is categorized as an example-based method and has the same problem.
A problem common to both approaches is that they cannot use the context (the sequence of utterances), only the user's last utterance. With the rule-based method, the necessary rules, and the cost of creating them, increase enormously when contexts must be covered. With example-based methods, if the database is searched by context, then in many cases no similar context can be found because of the diversity of non-task-oriented dialogues; when no similar context is found, the method has no choice but to select an utterance at random.
Our statistical response method belongs to the category of example-based methods because it uses dialogue data. However, our method, which uses statistical machine learning instead of cosine similarity, is able to use contexts. Our method prepares candidate utterances in advance and learns from the data which utterances are suitable for a context. Therefore, a dialogue agent constructed using our method automatically selects a suitable utterance from the candidate utterances depending on the context. Additionally, we provide a low-cost, quality-assured method of learning data acquisition using crowdsourcing and gamification.
2 STATISTICAL RESPONSE METHOD
2.1 Selection of Candidate Utterances
In this paper, we define an "utterance" as a one-time statement and a "context" as an ordered set consisting of the utterances from the conversation's beginning to a specific point in time. Here, "an utterance is suitable to a context" means the utterance is a humanly and semantically appropriate answer to the context.
First, we define the state at a point in time in a dialogue as a context $c = \{u_1, u_2, \ldots, u_l\}$. Each $u_i$ $(i = 1, 2, \ldots, l)$ denotes an utterance appearing in the context, and $l$ denotes the number of utterances. Here, $u_1$ is the last utterance and $u_l$ is the first utterance in context $c$. As a matter of practical convenience, $u_0$ represents a response utterance to context $c$.

Second, we define a candidate utterance set $A^c = \{a^c_1, a^c_2, \ldots, a^c_{|A^c|}\}$, where $a^c_i$ $(i = 1, 2, \ldots, |A^c|)$ denotes a candidate utterance and $|A^c|$ is the number of candidate utterances. Here, $A^c$ contains both suitable and unsuitable utterances for context $c$. We also define the correct utterance set $R^c = \{r^c_1, r^c_2, \ldots, r^c_{|R^c|}\} \subseteq A^c$, where $r^c_i$ $(i = 1, 2, \ldots, |R^c|)$ denotes a correct utterance and $|R^c|$ is the number of correct utterances for context $c$. Utterance selection means acquiring the correct utterance set $R^c$ from the candidate utterance set $A^c$, given a context $c$. Here, we assume that $c$ and $A^c$ fulfill the following requirements:

- $A^c$ can be generated for any context $c$.
- $A^c$ has at least one correct utterance $r^c_i$ for context $c$.

Table 1: Example of context c.

No.   Speaker   Utterance
u_4   Agent     Are you good at English?
u_3   Human     No I am not. I love Japanese.
u_2   Agent     It is said that experience is important to enhance English communication skills.
u_1   Human     I see! It might be a good idea to travel abroad during summer vacation.
u_0   (Agent)   (Select an utterance from Table 2)

Table 2: Example of candidate utterance set A^c (utterances in the correct set R^c are marked with [r^c_i]).

No.              Utterance
a^c_1            Are you good at English?
a^c_2 [r^c_1]    Where do you want to go?
a^c_3            I think dogs are trustworthy and intelligent animals.
a^c_4 [r^c_2]    That would be nice.
...
a^c_20 [r^c_3]   Travel can make a person richer inside.
...
a^c_130          A link exists between mental and physical health.
Tables 1 and 2 present examples of $c$, $A^c$, and $R^c$ (members of $R^c$ are marked in brackets in Table 2). In this example, a suitable utterance for the context $c$ shown in Table 1 is selected from the candidate utterance set $A^c$ shown in Table 2. In this case, the utterance should be selected from the correct utterance set $R^c = \{a^c_2, a^c_4, a^c_{20}\}$.
ConstructingaNon-task-orientedDialogueAgentusingStatisticalResponseMethodandGamification
15
2.2 Ranking Candidate Utterances
We describe the method used in our study to select
candidate utterances automatically.
By processing $c$ and $a^c_i$ $(\in A^c)$, we generate an $n$-dimensional feature vector $\Phi(c, a^c_i) = \langle x_1(c, a^c_i), x_2(c, a^c_i), \ldots, x_n(c, a^c_i) \rangle$ that represents relations between the context and the candidate utterance. Each $x_j(c, a^c_i)$ $(j = 1, 2, \ldots, n)$ is a feature taking a binary value. For instance, when particularly addressing the last utterance $u_1$ in $c$ and the candidate $a^c_i$, a feature $x_j(c, a^c_i)$ represents whether they contain a specific word, a word class, or a combination of the two.
We then define $f$ as a function that returns the evaluated value of a feature vector. In the following, we express the feature vector $\Phi(c, a^c_i)$ as $\Phi_i$. Here, $f$ can be denoted using a linear function, expressed as follows:

$$f(\Phi_i) = \sum_{j=1}^{n} w_j x_j(c, a^c_i) \qquad (1)$$

Therein, $w_j$ is a parameter representing the weight of $x_j(c, a^c_i)$.

Using the evaluation function above, the optimum utterance $\hat{a}$ in response to the context is obtainable with the following equation:

$$\hat{a} = \operatorname*{arg\,max}_{a_i \in A^c} f(\Phi_i) \qquad (2)$$

Therefore, the candidate utterances can be ranked by sorting them by the value of this evaluation function.
To estimate the parameter $\mathbf{w} = (w_1, w_2, \ldots, w_n)$ in the evaluation function $f$, we use the learning-to-rank algorithm ListNet (Cao et al., 2007).
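As a minimal sketch of Eqs. (1) and (2), the scoring and ranking step can be written as follows. Feature extraction is left abstract (the `phi` argument), and the function names are ours, not from the paper.

```python
import numpy as np

def f(phi_i: np.ndarray, w: np.ndarray) -> float:
    """Eq. (1): linear evaluation f(Phi_i) = sum_j w_j * x_j(c, a_i)."""
    return float(np.dot(w, phi_i))

def rank_candidates(context, candidates, w, phi):
    """Rank A^c by the evaluation function; Eq. (2) is the head of this list.

    phi(context, candidate) must return the n-dimensional binary feature
    vector Phi(c, a^c_i) as a numpy array.
    """
    scored = [(f(phi(context, a), w), a) for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored  # [(score, utterance), ...], best first
```

The agent would then reply with `rank_candidates(...)[0][1]`, the top-ranked utterance.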
2.3 Parameter Estimation
ListNet is constructed for ranking objects. It uses probability distributions to represent ranking lists of objects; then, by minimizing the distance between the learning data and the distribution of the model, it learns suitable parameters for ranking.
We define $Y^c = \{y^c_1, y^c_2, \ldots, y^c_{|A^c|}\}$ as a score list for the candidate utterance set $A^c = \{a^c_1, a^c_2, \ldots, a^c_{|A^c|}\}$. Each score $y^c_i$ $(i = 1, 2, \ldots, |A^c|)$ denotes the score of candidate utterance $a^c_i$ with respect to context $c$. Score $y^c_i$ represents the degree of correctness of $a^c_i$ for $c$ and is an evaluated value given by humans. For instance, if a candidate utterance is a suitable response to a context, the score is 10; alternatively, if an utterance is unsuitable, the score is 1.
The ListNet parameter estimation algorithm uses, as correctly ranked learning data, pairs of $X^c = \langle \Phi_1, \Phi_2, \ldots, \Phi_{|A^c|} \rangle$, a list of feature vectors, and $Y^c$.

For the list of feature vectors $X^c$, using function $f$, we obtain a list of scores $Z^c = \langle f(\Phi_1), f(\Phi_2), \ldots, f(\Phi_{|A^c|}) \rangle$. The objective of learning is to minimize the difference between $Y^c$ and $Z^c$ with respect to their rankings. We formalize this using a loss function:
$$G(C) = \sum_{c \in C} L(Y^c, Z^c) \qquad (3)$$

Therein, $C$ denotes all contexts in the learning data and $L$ is the loss function. In ListNet, the cross entropy is used as the loss function:

$$H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (4)$$
In that equation, $p(x)$ and $q(x)$ are probability distributions. When $p(x)$ and $q(x)$ are equal distributions, the cross entropy $H(p, q)$ takes its minimum value.

Therefore, the score lists $Y^c$ and $Z^c$ are converted into probability distributions using the Plackett-Luce model (Plackett, 1975; Luce, 1959). The top-one distribution of $Y^c$ under the Plackett-Luce model is expressed as follows:
$$P_{Y^c}(\Phi_i) = \frac{\mathrm{pow}(\alpha, y^c_i)}{\sum_{j=1}^{|A^c|} \mathrm{pow}(\alpha, y^c_j)} \qquad (5)$$

In that equation, $\mathrm{pow}(\alpha, y)$ denotes $\alpha$ to the power of $y$. This equation represents the probability of a candidate utterance being ranked at the top: the higher the candidate utterance's score, the higher the probability. For instance, when a list of feature vectors $X^c$ is $(\Phi_1, \Phi_2, \Phi_3)$ and the list of scores $Y^c$ is $(1, 0, 3)$, the probability of $\Phi_3$ being ranked at the top is calculated as follows (with $\alpha = 2$):
$$P_{Y^c}(\Phi_3) = \frac{\mathrm{pow}(2, y^c_3)}{\mathrm{pow}(2, y^c_1) + \mathrm{pow}(2, y^c_2) + \mathrm{pow}(2, y^c_3)} = \frac{\mathrm{pow}(2, 3)}{\mathrm{pow}(2, 1) + \mathrm{pow}(2, 0) + \mathrm{pow}(2, 3)} = 0.727 \qquad (6)$$

By the same calculation, the probability of $\Phi_1$ being ranked at the top is 0.182, and that of $\Phi_2$ is 0.091, the lowest.
Similarly, the score list $Z^c$ can be converted into a probability distribution as follows:

$$P_{Z^c}(\Phi_i) = \frac{\mathrm{pow}(\alpha, f(\Phi_i))}{\sum_{j=1}^{|A^c|} \mathrm{pow}(\alpha, f(\Phi_j))} \qquad (7)$$
Using Eqs. (4), (5), and (7), the loss function $L(Y^c, Z^c)$ becomes

$$L(Y^c, Z^c) = -\sum_{i=1}^{|A^c|} P_{Y^c}(\Phi_i) \log P_{Z^c}(\Phi_i) \qquad (8)$$

The optimum parameter $\mathbf{w}$ is obtainable using gradient descent.
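For illustration, the estimation procedure above can be sketched with numpy as follows. This is our own minimal rendering of ListNet's top-one version (Eqs. (3)-(8)) under the paper's pow(α, ·) formulation, not the authors' code; the learning rate, epoch count, and function names are our assumptions.

```python
import numpy as np

def top_one_prob(scores: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Plackett-Luce top-one probabilities, Eqs. (5)/(7):
    pow(alpha, score_i) / sum_j pow(alpha, score_j)."""
    p = np.power(alpha, scores - scores.max())  # shift for numerical stability
    return p / p.sum()

def listnet_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray,
                     alpha: float = 2.0) -> np.ndarray:
    """Gradient of the cross-entropy loss L(Y^c, Z^c) of Eq. (8) for one context.
    X: |A^c| x n matrix of feature vectors Phi_i; y: human scores y^c_i."""
    p_y = top_one_prob(y, alpha)      # target distribution from Y^c
    p_z = top_one_prob(X @ w, alpha)  # model distribution from Z^c = f(Phi_i)
    # For a base-alpha softmax, dL/dw = ln(alpha) * X^T (p_z - p_y).
    return np.log(alpha) * (X.T @ (p_z - p_y))

def estimate_w(training_data, n_features: int, lr: float = 0.1,
               epochs: int = 100, alpha: float = 2.0) -> np.ndarray:
    """Plain gradient descent on G(C) of Eq. (3);
    training_data is a list of (X^c, Y^c) pairs, one per context."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for X, y in training_data:
            w -= lr * listnet_gradient(X, np.asarray(y, dtype=float), w, alpha)
    return w
```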
ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence
16
Figure 1: Context and candidate utterances on our crowdsourcing website.
3 DATA ACQUISITION
3.1 Crowdsourcing
Human work is needed to acquire the learning data. To acquire the data, we used crowdsourcing and opened a website for it.
The crowdsourcing website shows participants a context $c$ and five candidate utterances (six options), as shown in Figure 1. Participants select the candidate utterances suitable for the context, or "(There is no suitable utterance)". We can then acquire the pair of $c$ and a selected utterance as correct data, or $c$ and the five utterances as incorrect data for learning. If a participant selects the option "Having strong communication skills is paramount if you want to be successful." as shown in Figure 1, then the pair of the context and that utterance is acquired as correct data.
When participants select "(There is no suitable utterance)", they must write a suitable utterance manually in a text box. In this way, we can acquire new candidate utterances. However, we do not use these utterances in the crowdsourcing and subsequent experiments in this paper because they entail some problems, such as spelling errors and phraseology. In the future, we will use this function to collect new utterances continuously and produce a dialogue agent that can handle even the newest topics.
3.2 Confidence Estimation
When we use crowdsourcing, quality control of the acquired data is necessary: because we offer the task to the general public, quality gaps are unavoidable. In this study, we prepared several evaluated questions, each comprising a context $c$, unsuitable utterances, and one or more suitable utterances. The suitability and unsuitability were judged in advance by four evaluators; we adopted as evaluated questions only those utterances on which the evaluators reached a consensus about suitability or unsuitability.
The website measures a degree of confidence $p$ using these evaluated questions. The degree of confidence $p$ is calculated by counting how many times a suitable utterance is selected within $N_p$ trials. Consequently, the range of $p$ is $0 \le p \le N_p$.
Our crowdsourcing website presents 10 questions in a row: five questions for data acquisition and five for measuring the degree of confidence ($N_p = 5$). We decide whether the acquired data are usable according to the degree of confidence $p$ because, if $p$ is small, the participant possibly did not work seriously. To make participants answer all questions seriously, the website does not tell them which questions are intended for data acquisition.
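A small sketch of this confidence check follows; the paper specifies only the counting rule and $N_p = 5$, so the data structures and the later filtering threshold (p > 3.0, Section 4.3.2) are wired in here as our own reading.

```python
def degree_of_confidence(selected_ids, check_questions):
    """p = number of the N_p check questions (here N_p = 5) for which the
    participant picked an utterance pre-judged suitable by the evaluators.

    selected_ids: {question_id: chosen_utterance_id}
    check_questions: list of (question_id, set_of_suitable_utterance_ids)
    """
    return sum(1 for qid, suitable in check_questions
               if selected_ids.get(qid) in suitable)

def data_is_usable(p: float, threshold: float = 3.0) -> bool:
    """Keep a participant's acquisition answers only when p > threshold."""
    return p > threshold
```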
3.3 Gamification
One of the most important considerations with crowdsourcing is the reward for participants. If we set high rewards, then we can gather many participants and acquire much data. To construct a better non-task-oriented dialogue agent that can accommodate topics of many kinds, it is desirable to acquire new data continuously. Although the agent requires much new data, setting high rewards increases the cost of construction and attracts unserious users who do not address the task properly.

In this study, we bring game mechanics to data acquisition to gather participants without rewards. With game mechanics, participants enjoy the task like game play. Such a method of bringing game mechanics into a non-game objective is called "gamification" (Von Ahn and Dabbish, 2004; Deterding et al., 2011).
3.4 Gamified Data Acquisition Environment
We opened a website, "The diagnosis game of dialogue skills"^2 (Japanese text only), as a gamified crowdsourcing data acquisition environment. At this site, participants answer 10 questions and finally obtain a score for their dialogue skills. The score goes up to 100 points and becomes higher when a participant selects candidate utterances that many other participants also selected. At the same time, the website shows a graph of the score distribution for comparison with other participants. Figure 2 portrays an example of a game result.

^2 http://beta.cm.info.hiroshima-cu.ac.jp/DialogCheck/

Figure 2: Diagnosis game result.
Scoring the selection results and comparing them with those of other participants stimulates participants' motivation to retry in order to obtain a higher score. Additionally, when participants post their scores on social networking services or microblogs, we can expect an advertising effect on other people (the website has a button for tweeting scores easily).
4 EXPERIMENTS
4.1 Experimental Methodology
To verify the effectiveness of the statistical response method trained on data acquired through the gamified data acquisition environment, we checked the rankings of suitable utterances that were estimated automatically.
For comparison, we used a classification method, the support vector machine (SVM). In general, an SVM provides binary classification results and no direct means of obtaining scores or probabilities for ranking. Nevertheless, Platt proposed transforming SVM predictions into posterior probabilities by passing them through a sigmoid (Platt et al., 1999). We therefore classified candidate utterances with an SVM, selected those classified as suitable, and ranked them by the posterior probabilities from the sigmoid method. We used this method as a baseline that does not use learning to rank.
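The paper does not name an SVM implementation. As one concrete reading of this baseline, scikit-learn's `SVC` with `probability=True` (which fits Platt's sigmoid internally) could be used as follows; the function name and the handling of the kept set are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def svm_baseline_ranking(X_train, y_train, X_candidates):
    """Baseline: classify candidate utterances as suitable/unsuitable with an
    SVM, keep those classified suitable, and rank them by the Platt-scaled
    posterior probability of the suitable class."""
    clf = SVC(kernel="linear", probability=True)  # probability=True fits Platt's sigmoid
    clf.fit(X_train, y_train)                     # y_train: 1 = suitable, 0 = unsuitable
    pred = clf.predict(X_candidates)              # binary classification step
    proba = clf.predict_proba(X_candidates)[:, 1] # P(suitable) per candidate
    kept = np.flatnonzero(pred == 1)              # candidates classified as suitable
    return kept[np.argsort(-proba[kept])]         # candidate indices, best first
```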
4.2 Features
To rank the utterances, we converted pairs of a context and a candidate utterance into feature vectors.

Table 3: Feature vector generation (noun feature).

No.   Speaker   Utterance
u_4   Agent     Enjoy this season fully because it's long-awaited summer vacation.
u_3   Human     Yes, I will.
u_2   Agent     Do you plan to travel?
u_1   Human     No. However, I would like to go.
u_0   (Agent)   Why don't you go on a trip overseas?

Noun pair                         Vector value
u_1: travel & u_0: Europe         0
u_1: summer & u_0: trip           0
u_2: part-timer & u_0: overseas   0
u_2: travel & u_0: trip           1
u_2: travel & u_0: overseas       1
u_3: friends & u_0: trip          0
u_4: summer & u_0: overseas       1
u_4: vacation & u_0: trip         1
We used features of 11 types to represent relations between a context and an utterance. Here, we describe one of them, the noun feature, as the most basic one.

The noun feature uses combinations of a noun in the context and a noun in the candidate utterance. With this feature, we expect that a candidate utterance including words related to words in the context ranks higher. We use only $u_1$, $u_2$, $u_3$, and $u_4$ of the context for this feature, because the semantic relation between older utterances in a context and suitable candidate utterances is often small. The usage range of utterances in the context differs according to the type of feature. For the noun feature, whether a particular noun pair occurs between the utterances gives a binary feature value. We use noun pairs that appear three or more times in the learning data.
Table 3 shows an example. The upper part shows an example context and candidate utterance; the lower part shows part of the feature vector generated from them. As the table shows, we distinguish noun pairs by the position of the utterance in the context. For instance, the vector value of "u_2: travel & u_0: overseas" is 1 because $u_2$ includes the word "travel" and $u_0$ includes "overseas". Similarly, the vector value of "u_1: summer & u_0: trip" is 0 because $u_0$ includes "trip" but $u_1$ does not include "summer".
The features should be designed to represent various aspects of relations between contexts and utterances, such as sentence structures, discourse structures, semantics, and topics.
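A simplified sketch of the noun feature follows. Noun extraction is stubbed out (the original data are Japanese and would need a morphological analyzer), the vocabulary of noun pairs seen at least three times is assumed to be built beforehand, and all names are ours.

```python
from itertools import product

def extract_nouns(utterance: str) -> set:
    # Placeholder: returns every token; a real system would run a
    # morphological analyzer / POS tagger and keep only the nouns.
    return {w.strip(".,!?").lower() for w in utterance.split()}

def noun_pair_features(context, candidate, vocabulary):
    """Binary noun-pair features. A feature (k, noun_in_u_k, noun_in_u_0)
    is 1 when both nouns occur; only u_1..u_4 of the context are used,
    and pairs are distinguished by the utterance position k."""
    cand_nouns = extract_nouns(candidate)
    active = set()
    for k, utt in enumerate(context[:4], start=1):  # u_1 .. u_4, most recent first
        for pair in product(extract_nouns(utt), cand_nouns):
            active.add((k,) + pair)
    # vocabulary: ordered list of noun-pair keys appearing >= 3 times in the
    # learning data; pairs outside it get no dimension.
    return [1 if key in active else 0 for key in vocabulary]
```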
ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence
18
Table 4: Data acquisition result.

Number of participants           460
Number of evaluated contexts     320
Number of evaluated utterances   4694
Average confidence p             4.215
4.3 Data Set
4.3.1 Candidate Utterances
We made 980 utterances by hand for the crowdsourcing and the experiment. The topics of the utterances were selected to interest as many people as possible, such as healthcare, marriage, travel, and sport. We also produced versatile utterances such as "I think so." and "It's wonderful!".
4.3.2 Learning Data
To acquire the data, we opened the gamified website for crowdsourcing. Table 4 shows the results of data acquisition. We used the 4520 evaluated utterances with confidence $p > 3.0$ for the experiment.
Additionally, to compensate for the data deficiency, we used other data produced by 50 part-time participants. The mode of producing these data was about the same, except that the game mechanics were not used. As a result, we obtained 239,897 evaluated utterances for 14,900 contexts. We used all of these data together as learning data.

The scores of utterances are given depending on the evaluation: if an utterance is suitable for a context, the score is 30; if unsuitable, the score is 1. The score values were decided on an empirical basis.
4.3.3 Test Data
We prepared 500 contexts as test data. The utterances ranked by the proposed method and by the SVM were evaluated manually. Each utterance was evaluated by three evaluators, who judged whether it was semantically suitable or unsuitable for the context. The final judgment was decided by majority: when two evaluators judged an utterance as suitable and one judged it as unsuitable, the utterance was determined to be suitable.
4.4 Results
Figure 3 shows the experimental results, with 95% confidence intervals, obtained using the proposed method and the SVM.
Figure 3: Rate of appropriate candidate utterance.
The x-axis represents the rank of the first appearance of a suitable utterance. The y-axis shows the cumulative frequency. In other words, the figure shows the rate of contexts that include at least one appropriate utterance within each rank.

As the figure shows, the proposed method ranked a suitable utterance at the top in 82.6% of cases, within the top 3 in 95.0%, and within the top 10 in 98.6%. In contrast, the SVM ranked a suitable utterance at the top in 58.4% of cases, within the top 3 in 82.4%, and within the top 10 in 95.4%. The proposed method thus outperformed the SVM overall, which shows that it is effective for the selection of candidate utterances.
When we implement the proposed method in dialogue agents, the rate of replying with a suitable utterance (82.6%) is inadequate for smooth communication. Note that the set of candidate utterances has at least one correct utterance for each context in the test data; this may not always be the case, and the rate may drop when the agent actually talks to a human. However, the proposed method produced rankings within the top 3 in over 90% of cases. By using new effective features and improving the ranking algorithm, it seems that we can improve the performance of the statistical response method further.
4.5 Discussion
A great benefit of the proposed method is that it can use contexts for responses. To demonstrate this effectiveness, we created feature vectors using only the user's last utterance ($u_1$) and conducted an experiment.
Figure 4 portrays the results. The top-1 rate was 69.2%, 13.4 points lower than when using contexts (Fig. 3), and all results in the figure are lower by at least 1.6 points. This is a natural result, because a context offers more hints than a single utterance for selecting a suitable utterance.
ConstructingaNon-task-orientedDialogueAgentusingStatisticalResponseMethodandGamification
19
Figure 4: Rate of appropriate candidate utterance without use of contexts.
However, as described at the beginning of this paper, existing response methods cannot use contexts for response generation. Various problems arise from this information loss: for instance, a dialogue agent may broach a topic that was discussed previously or make comments that contradict what it said before. This experimental result indicates that using not only the last utterance but also the context is necessary for realizing superior non-task-oriented dialogue agents. Therefore, in terms of the availability of contexts, the effectiveness of the statistical response method was clarified.
5 CONCLUSIONS
In this paper, we proposed a statistical response method that automatically ranks previously prepared candidate utterances in order of suitability to the context by applying a machine learning algorithm. Non-task-oriented dialogue agents that apply the method use the top utterance from the ranking result to carry out their dialogues. To collect learning data for ranking, we used crowdsourcing and gamification: we opened a gamified crowdsourcing website and collected learning data through it, thereby achieving low-cost and continuous learning data acquisition. To prove the performance of the proposed method, we checked the ranked utterances against contexts and conclude that the method is effective, because a suitable utterance is ranked at the top in 82.6% of cases and within the top 10 in 98.6% of cases.
Non-task-oriented dialogue agents are basically evaluated by hand, a task that requires a tremendous amount of time and effort. Using the proposed gamified crowdsourcing platform, we can evaluate the performance of non-task-oriented dialogue agents at low cost. We prepare several types of agents that we want to evaluate, and each agent generates a response to a given context. The platform shows the context and the generated responses to participants in the same way as our website. The responses generated by a high-performance agent should be selected more often than the others.
The candidate utterances are currently created manually; future work includes automatic candidate utterance generation. Our crowdsourcing website has a function that collects new utterances. However, these utterances present some problems, such as spelling errors and phraseology, because users write them in a free description format, and we need to fix them before use. As an alternative utterance generation method, using microblog data is promising: with microblog data, we can expect to generate a new utterance set that covers numerous and current topics.
We also intend to improve the feature vector. It is important to devise new effective features because the performance of our method depends heavily on them. The features used in the experiment (not illustrated in detail here) did not deeply consider the semantics of contexts and utterances. Realizing appropriate responses requires semantic features, and we are now deliberating on such features.
REFERENCES
Banchs, R. E. and Li, H. (2012). IRIS: a chat-oriented dialogue system based on the vector space model. In Proceedings of the ACL 2012 System Demonstrations, pages 37-42. Association for Computational Linguistics.

Bickmore, T. and Cassell, J. (2001). Relational agents: a model and implementation of building user trust. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 396-403.

Cao, Z., Qin, T., Liu, T., Tsai, M., and Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129-136.

Chu-Carroll, J. and Nickerson, J. (2000). Evaluating automatic dialogue strategy adaptation for a spoken dialogue system. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 202-209.

Deterding, S., Sicart, M., Nacke, L., O'Hara, K., and Dixon, D. (2011). Gamification: using game-design elements in non-gaming contexts. In Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems, pages 2425-2428. ACM.

Isomura, N., Toriumi, F., and Ishii, K. (2009). Statistical utterance selection using word co-occurrence for a dialogue agent. Lecture Notes in Computer Science, 5925/2009:68-79.
ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence
20
Luce, R. (1959). Individual Choice Behavior: A Theoretical Analysis. New York: Wiley.

Murao, H., Kawaguchi, N., Matsubara, S., Yamaguchi, Y., and Inagaki, Y. (2003). Example-based spoken dialogue system using WOZ system log. In SIGdial Workshop on Discourse and Dialogue, pages 140-148.

Plackett, R. (1975). The analysis of permutations. Applied Statistics, pages 193-202.

Platt, J. et al. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74.

Ritter, A., Cherry, C., and Dolan, W. B. (2011). Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583-593. Association for Computational Linguistics.

Von Ahn, L. and Dabbish, L. (2004). Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319-326. ACM.

Wallace, R. (2009). The anatomy of ALICE. Parsing the Turing Test, pages 181-210.

Weizenbaum, J. (1966). ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36-45.

Worswick, S. (2013). Mitsuku Chatbot. http://www.mitsuku.com/.

Zue, V., Seneff, S., Polifroni, J., Phillips, M., Pao, C., Goodine, D., Goddeau, D., and Glass, J. (1994). Pegasus: A spoken dialogue interface for on-line air travel planning. Speech Communication, 15(3-4):331-340.
ConstructingaNon-task-orientedDialogueAgentusingStatisticalResponseMethodandGamification
21