Exploring the Context of Recurrent Neural Network based
Conversational Agents
Raffaele Piccini and Gerasimos Spanakis
Department of Data Science and Knowledge Engineering, Maastricht University, The Netherlands
Keywords:
Conversational Agents, Recurrent Neural Networks, Hierarchical Recurrent Encoder Decoder.
Abstract:
Conversational agents have begun to rise both in the academic (in terms of research) and commercial (in terms
of applications) world. This paper investigates the task of building a non-goal driven conversational agent,
using neural network generative models and analyzes how the conversation context is handled. It compares
a simpler Encoder-Decoder with a Hierarchical Recurrent Encoder-Decoder architecture, which includes an
additional module to model the context of the conversation using information from previous utterances. We found
that the hierarchical model was able to extract relevant context information and include it in the generation
of the output. However, it performed worse (35-40%) than the simple Encoder-Decoder model regarding
both grammatical correctness and meaningfulness of its responses. Despite these results, experiments demonstrate
how conversations about similar topics appear close to each other in the context space due to the increased
frequency of specific topic-related words, thus suggesting promising directions for future research on how the
context of a conversation can be exploited.
1 INTRODUCTION
Interactive conversational agents, also known as Dia-
logue Systems or, more informally, Chatbots, have
become increasingly popular in recent years among
researchers and companies. The reason for this rise
in popularity is the wide set of possible applications,
ranging from customer services and technical support
to video-games and other forms of entertainment. Ad-
ditionally, the potential for conversational agents has
significantly improved in recent years. Advancements
in the field of Machine Learning and the introduction
of new Deep Recurrent Neural Networks have made
possible the creation of dialogue systems capable of
more meaningful interactions with users, such as (Serban et al., 2016).
Conversational agents can be divided (among
many other ways) into two categories: goal-driven
systems, where the agent has a specific predetermined
task to fulfill, such as customer service, and non-goal-driven
systems. This paper deals with the latter category,
trying to generate meaningful conversations with no
goal to fulfill, since goal-driven systems would require
large corpora of task-specific simulated conversations.
Nevertheless, the model might eventually prove useful
for goal-driven tasks as well.
This paper explores the application of different
Encoder-Decoder architectures for generating answers.
We compare a simple Encoder-Decoder architecture
with a Hierarchical Recurrent Encoder-Decoder
(HRED) and examine whether the quality of the
answers improves. Additionally, we investigate how the
HRED network encodes entire conversations in the
context latent space and whether this encoding can
provide more coherent answers. Finally, we visualize
the context space in order to confirm where
conversations about similar topics lie.
2 RELATED WORK
The traditional design of Conversational Agents
follows a modular approach, dividing the process into
three modules: a Natural Language Understanding
(NLU) unit, a Dialogue Manager and a Natural Lan-
guage Generation module (NLG). The NLU mod-
ule will process the input and extract useful informa-
tion. This information is then used by the Dialogue
Manager to update internal states, send a query to a
knowledge-based system, or simply follow precoded
instructions. Finally, the NLG will use the informa-
tion from the Dialogue Manager to generate the out-
put sentence. The simplest technique used for NLU
is to spot certain keywords in the input, often work-
ing together with a script-based Dialogue Manager.
However, throughout the years there have been many
attempts to improve the NLU unit to better extract
text information, using techniques including statis-
tical modeling of language (Manning et al., 1999),
skip-gram models (Mikolov et al., 2013b) and, more
recently, deep neural networks (Collobert and We-
ston, 2008). Eventually, with the rise of Deep Learn-
ing in recent years, Dialogue Systems research has
mainly focused on end-to-end models, capable of including
all three modules in a single deep neural network
trained on a large dataset. One end-to-end RNN architecture
that has proved particularly successful in recent years
is the Encoder-Decoder.
The use of Encoder-Decoder architectures for nat-
ural language processing was first proposed as a solution
for text translation by (Cho et al., 2014).
From then on, the architecture has been ap-
plied to many other tasks, including conversational
agents (Serban et al., 2017). However, generating re-
sponses was found to be considerably more difficult
than translating between languages, probably due to a
broader range of possible correct answers to any given
input. A limitation of Encoder-Decoder models in
producing meaningful conversations is the fact that any
output is only influenced by the latest question. Thus,
important factors are ignored, such as the context of
the conversation, the speaker, and information pro-
vided in previous inputs. In 2015, Sordoni et al. proposed
an updated version of the Encoder-Decoder architecture,
called the Hierarchical Recurrent Encoder-Decoder
(HRED) (Sordoni et al., 2015), originally used
for query suggestions. In their paper, they demon-
strate that the architecture is capable of using context
information extracted by previous queries to generate
more appropriate query suggestions. This paper
attempts to apply this architecture to a dialogue system.
3 DESCRIPTION OF MODELS
3.1 Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a neural network
which works on variable-length input sequences
$x = (x_1, x_2, \ldots, x_l)$. Its output consists of a hidden
state $h$, which is updated recursively at each time step
by

$$h(t) = f(h(t-1), x_t) \qquad (1)$$
where f can be any non-linear activation function. For
the scope of this paper f was chosen to be a Long
Short Term Memory (LSTM) unit. The reason why
RNNs are particularly suited for the task of generat-
ing conversations is their ability to receive variable
length inputs by encoding them in a fixed length vec-
tor. Moreover, they are also able to produce variable
length outputs, which is essential when working with
sentences which are likely going to be of different
lengths. When receiving an input sentence
$s = (w_1, w_2, \ldots, w_l)$, where $l$ is its length, the RNN
updates its hidden state $h$ recursively for every element
of the input sequence, in such a way that at any step
$t$, $h(t)$ is a fixed-length vector representing the entire
sub-sequence $(w_1, w_2, \ldots, w_t)$ up to time-step $t$. After
receiving the last input $w_l$, the hidden state thus
represents the entire sequence.
Another interesting property of RNNs is their abil-
ity to learn the probability distribution of a sequence.
In this specific case, it would learn the probability of
a word $w_t$ being the next one, given the previous words
in the sentence:

$$P(w_t \mid w_1, w_2, \ldots, w_{t-1}) \qquad (2)$$
This can be useful to represent syntactical and gram-
matical rules and produce believable answers.
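
To make this concrete, below is a minimal sketch (our illustration, not the paper's code) of an LSTM encoder in Keras implementing the recursion of Eq. (1), compressing a zero-padded, variable-length sequence into a fixed-length hidden state; the dimensions are assumptions borrowed from Section 4.3.

```python
# Minimal sketch of Eq. (1): an LSTM updates a hidden state recursively
# over a variable-length sequence and returns a fixed-length vector.
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Masking
from tensorflow.keras.models import Model

EMBED_DIM = 100   # assumed word-vector size (the paper tests 100 and 300)
HIDDEN_DIM = 300  # assumed hidden-state size (fixed to 300 in Section 4.3)

inputs = Input(shape=(None, EMBED_DIM))       # variable-length sequence
masked = Masking(mask_value=0.0)(inputs)      # skip zero-padded time steps
_, h, _ = LSTM(HIDDEN_DIM, return_state=True)(masked)
encoder = Model(inputs, h)                    # h(l) represents the sequence

sentence = np.random.rand(1, 7, EMBED_DIM).astype("float32")  # toy input
fixed_vector = encoder.predict(sentence)      # shape: (1, HIDDEN_DIM)
```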
3.2 Word Embeddings
RNNs cannot accept strings as input, thus a way to
encode each word into a numerical vector is required.
For this paper, word2vec (using Python’s Gensim Li-
brary) was used (Mikolov et al., 2013a). Word2vec
uses a neural network to map each word of the vocab-
ulary to a vector space, in which words with similar
meaning are closer together. This approach should
make it simpler for the chatbot to extract informa-
tion from the input. The word2vec model was used
with two different approaches. The first approach was
to use a pre-trained word2vec model. This method
fixes the embedding matrix, and it does not allow it
to be trained with the rest of the model. When using
this approach, the output layer uses a linear activation
function to output a vector directly in the embedding
space. The output vector is then compared to the vec-
tors of the words in the vocabulary, and the closest
one is chosen as the output word. The second ap-
proach was to use the pre-trained word2vec model to
initialise an embedding layer, connected to the input
of the Encoder. The embedding weights would then
train with the rest of the network, giving it the possi-
bility to adapt to the specific task. When using this ap-
proach, the output layer has the same dimension as the
vocabulary, and it uses the softmax activation function
to produce a valid probability distribution. The output
word is then chosen according to the output probabil-
ity vector. The differences in performance between
the two approaches are explored in Section 4.
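
As an illustration of the first approach, the sketch below (assumed details, gensim >= 4 naming) seeds a trainable Keras Embedding layer with word2vec vectors and defines a softmax output layer over the vocabulary.

```python
# Sketch of approach 1: word2vec initialises a trainable embedding layer;
# the output layer is a softmax distribution over the whole vocabulary.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.layers import Embedding, Dense

corpus = [["hello", "how", "are", "you"], ["fine", "thanks"]]  # toy corpus
w2v = Word2Vec(corpus, vector_size=100, min_count=1)

vocab = w2v.wv.index_to_key                       # word list, index = id
weights = np.vstack([w2v.wv[w] for w in vocab])   # (vocab_size, 100)

embedding = Embedding(input_dim=len(vocab), output_dim=100,
                      weights=[weights], trainable=True)
to_vocab = Dense(len(vocab), activation="softmax")  # probability over words
```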
3.3 Encoder-Decoder
An Encoder-Decoder is an architecture (see Figure 1)
that uses two RNNs to learn to encode a variable-
length sequence into a fixed-length vector and to de-
code the fixed-length vector back into a variable-
length sequence.
Figure 1: Computational graph of the Encoder-Decoder ar-
chitecture.
This architecture can be used to learn the condi-
tional probability distribution over a variable-length
sequence, conditioned on a different variable-length
sequence.
$$P(w_1, w_2, \ldots, w_{l_1} \mid v_1, v_2, \ldots, v_{l_2}) \qquad (3)$$
The Encoder is a Recurrent Neural Network that
reads each word of the input sentence sequentially.
After each word, the hidden state of the RNN will be
updated according to Eq.(1), until it reads the end-
of-sentence token (<EOS>), after which it becomes
a vector representation of the entire input sentence.
The Decoder is a RNN trained to generate the output
sentence by predicting the next word in the sentence,
conditioned not only on all the previously generated
words, but also on the vector representation of the in-
put sentence. When generating the first word of a sen-
tence, the input will be the begin-of-sequence token
(<BOS>). After that, the input will include every
word previously predicted by the Decoder. Further-
more, to condition the answer on the input sequence,
the RNN is initialised with the hidden state of the en-
coder when generating each new element.
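
A compact Keras sketch of such an Encoder-Decoder follows (our illustration; the loss and dimensions are assumptions, and the output layer follows the embedding-space variant of Section 3.2).

```python
# Sketch of the Encoder-Decoder: the decoder LSTM is initialised with the
# encoder's final state, conditioning the answer on the input sentence.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

EMBED_DIM, HIDDEN_DIM = 100, 300  # assumed sizes

enc_in = Input(shape=(None, EMBED_DIM))           # embedded input sentence
_, enc_h, enc_c = LSTM(HIDDEN_DIM, return_state=True)(enc_in)

dec_in = Input(shape=(None, EMBED_DIM))           # previously generated words
dec_seq = LSTM(HIDDEN_DIM, return_sequences=True)(
    dec_in, initial_state=[enc_h, enc_c])         # conditioned on the input
pred = Dense(EMBED_DIM, activation="linear")(dec_seq)  # embedding-space output

model = Model([enc_in, dec_in], pred)
model.compile(optimizer="rmsprop",                # RMSProp as in Section 4.3
              loss="cosine_similarity")           # assumed loss for approach 2
```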
3.4 Hierarchical Recurrent Encoder
Decoder
The Hierarchical Recurrent Encoder Decoder
(HRED) model is an extension of the simpler
Encoder-Decoder architecture (see Figure 2). The
HRED attempts to overcome the limitation of the
Encoder-Decoder model of generating output based
only on the latest input received. The HRED model
assumes that the data is structured in a two-level
hierarchy: sequences of sub-sequences, and sub-
sequences of tokens. In the case of conversational
data, each conversation can be modelled as a se-
quence of sentences, and each sentence as a sequence
of words.
The HRED model consists of three different
RNNs: an Encoder, a Decoder, and a context mod-
ule. The encoder will handle the data at the lower
hierarchical level by processing each word in a sen-
tence, to encode the sentence in a fixed length vector.
The sentence vector representation is then given as
input to the context module. The context RNN han-
dles the data at the higher hierarchical level by itera-
tively processing all the sentence vectors of a conver-
sation and updating its hidden state after every sen-
tence. By doing so, the context vector represents the
entire conversation up to the last sentence received.
The output of the Context RNN is then used to ini-
tialize the hidden state of the Decoder RNN (simi-
larly to the encoder-decoder, where the hidden state of
the encoder was used to initialise the decoder), which
will generate the output sentence. The initialization of
the Decoder with a vector representing the entire con-
versation (compared to using the vector of the input
sentence) allows HRED to use context information to
generate a more appropriate answer. The output is not
only conditioned on the input sentence, but also on all
the input sentences received during the current con-
versation session. HRED is said to be hierarchical as
the Encoder and Context RNNs work to encode infor-
mation at the two hierarchical levels of the conversa-
tion. The encoder will try to find short-term correla-
tion within the input sentence, and the context RNN
will work at a higher level finding long-term correla-
tions, across different sentences within the same con-
versation.
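
The three RNNs could be wired roughly as in the sketch below (illustrative shapes; the nested TimeDistributed wrapping is our assumption about how to express the hierarchy in Keras, not the authors' implementation).

```python
# Rough sketch of HRED: sentence encoder (lower level), context RNN
# (higher level), and a decoder initialised with the context state.
from tensorflow.keras.layers import Input, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

EMBED_DIM, HIDDEN_DIM = 100, 300  # assumed sizes

# Lower level: encode each sentence into one fixed-length vector.
sent_in = Input(shape=(None, EMBED_DIM))
_, sent_h, _ = LSTM(HIDDEN_DIM, return_state=True)(sent_in)
sentence_encoder = Model(sent_in, sent_h)

# Higher level: the context RNN consumes the sequence of sentence vectors,
# updating its state after every sentence of the conversation.
conv_in = Input(shape=(None, None, EMBED_DIM))    # (sentences, words, dims)
sent_vectors = TimeDistributed(sentence_encoder)(conv_in)
_, ctx_h, ctx_c = LSTM(HIDDEN_DIM, return_state=True)(sent_vectors)

# Decoder: initialised with the context state instead of the encoder state,
# so the reply is conditioned on the whole conversation so far.
dec_in = Input(shape=(None, EMBED_DIM))
dec_seq = LSTM(HIDDEN_DIM, return_sequences=True)(
    dec_in, initial_state=[ctx_h, ctx_c])
pred = Dense(EMBED_DIM, activation="linear")(dec_seq)

hred = Model([conv_in, dec_in], pred)
```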
Figure 2: Computational graph of the HRED architecture for a conversation composed of 2 turns.
4 EXPERIMENTS
In this section we describe the dataset used for exper-
imentation followed by the evaluation process of the
different models. Finally, all experiments with their
results are presented.
4.1 Dataset
In order to test the extraction of context information,
a dataset following the hierarchical structure assumed
by HRED is needed. The dataset needs to be com-
posed of multiple dialogues, each of which should be
a sequence of sentences, as a simple list of questions
and answers would not have any context information.
The dataset DailyDialogue is a high-quality multi-
turn dialogue dataset, which contains conversations
about everyday life (Li et al., 2017) and is therefore
ideal for the task of non-goal-driven conversational
agents. Some basic pre-processing was performed
on the dataset, although punctuation was left unchanged,
in order for the chatbot to learn to use it as
well. The DailyDialogue dataset also includes topic
labels for each conversation. These labels can be used
to compare the context representations of the conver-
sations. For the experiments, only the 5 most common
topics were used. In Section 4.5, experiments will
explore whether conversations labelled with the same
topics are close together in the context latent space.
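
For concreteness, a loading sketch is shown below; the file names and the __eou__ utterance separator follow the public DailyDialog release, while the filtering step is our assumption about the described pre-processing.

```python
# Sketch: load dialogues and topic labels, keep the 5 most common topics.
from collections import Counter

with open("dialogues_text.txt", encoding="utf-8") as f:
    dialogues = [[u.strip() for u in line.split("__eou__") if u.strip()]
                 for line in f]
with open("dialogues_topic.txt", encoding="utf-8") as f:
    topics = [line.strip() for line in f]

top5 = {t for t, _ in Counter(topics).most_common(5)}
data = [(d, t) for d, t in zip(dialogues, topics) if t in top5]
```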
4.2 Evaluation Method
In order to determine the consistency of the output
quality, each sentence is evaluated by a user according
to two measures:
The first is a binary metric indicating whether
the output is syntactically correct. This will include
the use of punctuation, the use of the end-of-sequence
token, and also the correct formulation of a sentence
with a valid grammatical structure (including correct
use of verbs, adverbs, subjects, names etc.).
The second metric evaluates how appropriate an
answer is to the input sentence, and in the case of the
hierarchical architecture, also how appropriate it is in
the context of the whole conversation. This second
evaluation is the most important, especially in com-
paring the Encoder-Decoder with the HRED architec-
ture. The evaluator will decide whether the response
was “good”, “decent“, or “wrong”. An output sen-
tence is considered of “good” quality if it is a valid
answer to the input sentence, and if it is pertinent to
previous sentences in the conversation. The quality is
considered “decent” when the answer is a meaningful
sentence but applies to the input only partially.
A “wrong” output is one where the sentence
does not make sense. Some examples of responses
with their evaluations are shown in Table 1.
Human evaluation was performed by volunteers
who were instructed to have a conversation with the
bot about a given topic. A total of 19 volunteers were
interviewed, and they performed on average 7 turns
of conversation for each one of the 5 topics given.

Table 1: Some examples of the evaluation of different outputs for the same input, according to the two metrics described above.
4.3 Experimental Setup
The code for this paper was written using the Python
implementation of Keras. The neural network’s pa-
rameters are learned by maximising the training log-
likelihood using the RMSProp optimiser. For all ex-
periments, the embedding size was tested with a di-
mension of 100 and 300. Because of GPU memory
limitations, the hidden state was fixed to a dimension
of 300, and the depth of all the modules was kept to
2.
Initially, the training was performed by trying to
predict at each step an entire sentence. This approach
resulted in very poor quality outputs, often consisting
of the same word repeated or of the end-of-sentence
token as the first output word. It appears that
training the model by predicting single words instead
gives better results, although this approach will make
training significantly slower as the weights are up-
dated after every word and not after every sentence.
Keras requires the input of its models to be rectangular
NumPy arrays; this requires all conversations in a
batch to be of the same length. Similarly, at step i,
the i-th sentences of all conversations in a batch need
to have the same length so that the input is a rectangular
matrix. To this end, the dataset is sorted by
conversation length in order to reduce the padding
necessary to make all conversations in a batch the
same length. In the case of a shorter conversation,
a new empty sentence is added (an empty sentence is
composed of a begin-of-sentence token immediately
followed by an end-of-sentence token).
The data also had to be adjusted with zero padding
at the level of the individual sentences. Because of
memory limitations, the batch size was set between 70
and 100.
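
A sketch of this padding scheme is given below (the helper and token ids are hypothetical, not the paper's code).

```python
# Sketch: pad every conversation in a batch to the same number of
# sentences with empty (<BOS><EOS>) sentences, then zero-pad each
# sentence to the same number of words.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

BOS, EOS = 1, 2  # hypothetical token ids

def pad_batch(conversations, max_sents, max_words):
    """conversations: list of conversations, each a list of token-id lists."""
    batch = []
    for conv in conversations:
        conv = conv + [[BOS, EOS]] * (max_sents - len(conv))  # empty sentences
        batch.append(pad_sequences(conv, maxlen=max_words,
                                   padding="post", value=0))
    return np.stack(batch)  # shape: (batch, sentences, words)
```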
4.4 Models Comparison
Embedding. The goal of the first experiment is to
compare the two different ways the word embedding
is used to represent the input and outputs. The first
approach consists of using the word2vec model to ini-
tialise an embedding layer used as input. The output
layer will have the dimension of the vocabulary, and it
uses a softmax activation function, to generate a valid
probability distribution over every word in the dictio-
nary. The output is then chosen according to the prob-
ability vector. The second approach consists of training
a word2vec model separately from the rest of the
system, and using it to convert both inputs and outputs
into n-dimensional vectors (where n is the embedding
dimension). The output layer has a linear activation
function so that the model predicts vectors in the em-
bedding space. The vector is then compared to all
the vectors in the dictionary using cosine similarity to
find the closest match.
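
The decoding step of this second approach could look like the sketch below (gensim >= 4 API; the file path is hypothetical).

```python
# Sketch: match the predicted embedding-space vector to the nearest
# vocabulary word by cosine similarity.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec.kv")  # hypothetical path to trained vectors

def decode_word(predicted_vector):
    # similar_by_vector ranks vocabulary words by cosine similarity.
    word, _score = wv.similar_by_vector(predicted_vector, topn=1)[0]
    return word
```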
Both approaches were tested on a simple encoder-
decoder architecture and only the best performing
version is used for the experiments with the HRED
version. The training for both models was performed
on a GeForce GTX 960M GPU for about 32 hours.
Some examples of responses generated by both mod-
els are shown in Table 2.
The results of the experiment show that the second
approach learns to generate shorter sentences, which
are often extremely vague and mostly grammatically
correct. Because of their vagueness, they often apply
to the input sentence, thus giving the impression that
the answer is somehow appropriate to the input. For
this reason, most of the testing volunteers graded the
answers generated by this approach higher than the al-
ternative. On the other hand, the first approach gener-
ated sentences that are less often correct from a gram-
matical point of view, making mistakes including re-
peating the same word twice when not necessary, and
wrong use of punctuation. The responses generated by
this approach are often longer and more complex
sentences; although they mostly have a correct structure,
they often use words incorrectly, making the response
less meaningful.
The result of the questionnaire is firmly in favour
of the second approach. Nearly all the volunteers
(95%) judged the second chatbot as the better of the
two, as it was able to generate a valid answer more
often than the first, both grammatically and semanti-
cally, thanks to the fact that those answers were usu-
ally incredibly vague and could have applied to a sub-
stantial number of different inputs. Although the sec-
ond model appears to perform better, a larger dataset
and longer training might result in a different conclu-
sion, as neither of the networks seems to have been
trained long enough (and on a deep enough network)
to learn the semantic relation between words and use
them according to their meaning. The second model
appears better only because it learned to generate very
vague responses.

Table 2: Some example responses of the two different approaches. The answer labelled “Model 1” is generated by the embedding approach with a probability vector generated by a softmax activation. The one labelled “Model 2” uses the representation of input and output directly through vectors in the embedding space.

Input: Hey! Do you want to have lunch with me tomorrow after the game?
  Model 1: I’ll meet you, but I need to drive to the ties <NUMBER>, I guess I can, but I won’t
  Model 2: You bet, go for it!

Input: Hello, how are you today? It’s such a sunny day, I feel like going for a walk
  Model 1: Perhaps you should leave it to another contract. You never know what
  Model 2: Alright!

Input: We should meet your brother, I heard he is a funny guy. Is he going to be at the party?
  Model 1: What about the other? It is more incredibly enjoyable
  Model 2: Well, I’m afraid so. But don’t worry about it!

Input: Hi, how are you?
  Model 1: I’ am sorry and you?
  Model 2: Don’t ask me!

Input: Can you help me find a jacket of my size?
  Model 1: Well, you know it! Consider it done!
  Model 2: How can I help you!
Table 3: Comparison of the accuracy metrics described in subsection 4.2.

Metric             Encoder-Decoder 1   Encoder-Decoder 2
Metric 1                28%                 58%
Metric 2 Good           12%                 30%
Metric 2 Decent         20%                 31%
Metric 2 Wrong          67%                 38%
Hierarchical Model. The second experiment aims
at comparing the Hierarchical Recurrent Encoder-Decoder
architecture to its simpler version, the
Encoder-Decoder model. For this test, both of the net-
works will use a separately trained word2vec model
to convert inputs and outputs into vectors, as this was
the method that gave the better results of the two
embedding techniques (see Experiment 1). This com-
parison will reveal whether the hierarchical structure of
the HRED helps it generate more appropriate an-
swers given the context of the conversation. These
models were trained on a GeForce GTX 960M GPU
for a total of about 45 hours.
Table 4 displays some examples of outputs gen-
erated given an input and a context. The Encoder-Decoder
model is the same one used in experiment one;
the only difference is the amount of training, as for
the second experiment the model was trained for a
longer time, although the performance does not seem
to change much. The Encoder-Decoder produces very
vague answers to any input, so that the output is often
a valid response. On the other hand, the HRED tries
to produce more specific outputs, also using the high-
level information extracted with the context layer. It
is clear from the answers generated that the network
is able to extract some context information, as there is
an increase in the use of words associated with other
words in previous sentences. When the context con-
tains a lot of work-related words, the model is more
likely to generate outputs with words such as “work”,
“colleague”, “contract”, “office” etc. Similarly, when
using words associated with vacations and tourism,
some common outputs are names of countries, and
words such as “visiting”, “Museum”, “Hotel”, etc.
Unfortunately, although the network is able to extract
some context information, it does not seem to be able
to use it to form meaningful sentences. Therefore, the
result is often a complex response (more complicated
than the simplistic and vague answers of the Encoder-
Decoder), usually containing quite a lot of grammat-
ical mistakes. In addition, although the words it uses
are often appropriate to the context, it fails in us-
ing them in a meaningful way, thus generating sen-
tences that often do not make much sense. Because
of these reasons the results of the questionnaire are
strongly in favour of the simpler Encoder-Decoder
model (88% of the interviewed people preferred the
Encoder-Decoder).
4.5 Context Representation
This subsection includes two experiments aimed at
investigating the encoding of the sentences in the
context space by the context module.
Topics Clustering. The goal of this experiment is
to compare vector representations of different con-
versations (in the context space) in order to deter-
mine whether conversations about similar topics are
close to each other. After the HRED model is trained,
all conversations of the dataset are encoded by the context
module into a vector. Once all conversations are
mapped into vectors in the context space, a technique
called “t-distributed stochastic neighbor embedding”
Table 4: Some example responses for the encoder-decoder model and HRED model.

Context: Did you buy your books for school? Where is the math notebook? I don’t want to go to school, I don’t like math, it’s boring
Input: We have to go, pack your bag
  Enc-Dec: I told you so!
  HRED: Please don’t worry about it!

Context: Where are we going for holidays this summer? What about we go in Europe? We can visit museums and go to the beach.
Input: We have to go, pack your bag
  Enc-Dec: I told you so!
  HRED: Oh, what do you want to see? I don’t there there is anything good German out!

Context: We have a new colleague at work, he is the new operations manager Is he good at his job? No. He doesn’t know what he is doing, he was hired because he is friend with the CEO
Input: Are you going to do something about it?
  Enc-Dec: Well, I don’t know what to do about it
  HRED: I work before Bron. Yankees play tonight.

Context: How is it going with your boyfriend? Six months ago he asked me to marry him and how he cheated on me with another woman, I can’t just let it go
Input: Are you going to do something about it?
  Enc-Dec: Well, I don’t know what to do about it
  HRED: I will get through it
Table 5: Comparison of the accuracy metrics described in subsection 4.2.

Metric             HRED    Encoder-Decoder
Metric 1            38%         58%
Metric 2 Good       16%         30%
Metric 2 Decent     23%         32%
Metric 2 Wrong      60%         31%
(Maaten and Hinton, 2008) (t-SNE) is used to visual-
ize the context space in only two dimensions, by find-
ing a 2D representation in which vectors that were
close in the original space remain close, and vectors
that were distant remain far apart.
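
A sketch of this step with scikit-learn follows (parameters are assumptions; random data stands in for the real encodings).

```python
# Sketch: t-SNE maps high-dimensional context vectors to 2D for plotting.
import numpy as np
from sklearn.manifold import TSNE

context_vectors = np.random.rand(500, 100)  # one encoding per conversation

points_2d = TSNE(n_components=2, perplexity=30).fit_transform(context_vectors)
# points_2d has shape (500, 2); scatter it, coloured by topic label.
```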
Figure 3 shows the result of applying t-SNE to
visualise the 100-dimensional vectors in 2D. One
can see in the image that, as we would expect, vectors
of conversations about the same topic are loosely
clustered together. This is probably because conversations about
the same topics often use the same “topic-specific”
words, thus making the final conversation vectors
similar. The four topic categories “School”, “Work”,
“Relationship” and “Tourism” are roughly clustered
in the four quadrants of the graph, while the “Ordinary
Life” (blue) conversations are much more spread
across the entire graph, with a noticeable concentra-
tion in the centre-right of the graph. This is likely
due to the fact that conversations labelled as “Ordinary
Life” are much more varied, and lack a “topic-specific”
vocabulary compared to conversations of the
other, more specific topics.
Context Vector Changes during a Conversation.
This experiment investigates the way a conversation
vector changes in the context space during a conver-
sation, according to the input given. The average
coordinates of the conversations of each topic are
calculated, to better visualise the position of a vector
with respect to the clusters of each topic. A conversation is
processed by the model, and after every sentence, the
context vector is mapped to the 2D representation of
the context space. In this way, one can track the context
vector throughout a conversation and see how different
inputs affect its position in the context space.
Figure 4 shows the example of a conversation.
One can notice how vague and non-specific sentences
tend to push the context vector closer to the centre of
the graph, e.g. the last sentence “I hope so.” is a good
example of this. Also in the example, we can notice
how naming a nationality (“Chinese” in sentence 4)
pushes the vector towards the bottom left corner
of the graph, close to the average coordinate of the
Tourism topic. This effect will occur most of the time
when the input is a city or a country name.
In order to assess the statistical significance of the
aforementioned observations, an additional experi-
ment is performed. The conversation vectors of 150
different conversations, randomly selected from the
entire dataset, are mapped in the context vector space.
Then, the same input sentence is given to all of the
conversations. To establish the effect that the last in-
put had on the context vector, the distance from the
centroid of the different topics is calculated both be-
fore and after the last input is given. In this way, we can
calculate the distance reduction that the last input
sentence generated from all of the topic centroids.

Figure 3: 2D visualisation of conversation encodings in the context space. Colours represent conversations about different topics.

Figure 4: The black X represents the context vector after every sentence. The coloured + represent the average coordinate of every topic.

Figure 5a illustrates how much the distance of the
conversations’ context vectors from all 5 topic average
coordinates is reduced after receiving the sentence
“I don’t know” as input. Because the sentence
is general and vague, we expect to see the distance
to the “ordinary life” average coordinate decreasing
more compared to the distance from the average co-
ordinates of other topics.
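
The measurement could be implemented as in the sketch below (hypothetical helper, not the authors' code).

```python
# Sketch: per-conversation distance reduction towards each topic centroid
# after one extra input sentence is fed to the model.
import numpy as np

def distance_reduction(ctx_before, ctx_after, centroids):
    """ctx_before, ctx_after: (n_conversations, 2) mapped coordinates;
    centroids: dict mapping topic name -> (2,) average coordinate."""
    reductions = {}
    for topic, c in centroids.items():
        before = np.linalg.norm(ctx_before - c, axis=1)
        after = np.linalg.norm(ctx_after - c, axis=1)
        reductions[topic] = before - after  # positive means "moved closer"
    return reductions
```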
One can clearly see how, on average, the context
vector moves significantly closer to the “Ordinary
Life” topic (moving on average 13.5 units closer),
compared to all others (with averages ranging from
-4.5 to 3). The same test is performed with the sentence
“Did you book the flight to China for this
summer? I already reserved the hotel”. As expected,
Figure 5b shows a much stronger average decrease in
distance from the topic “Tourism” (with a mean
distance reduction of 17.1) compared to the other topics
(with means ranging from -4.4 to 3.6), likely because
of the intense use of tourism-related words.
In order to ensure that the differences shown in
the box plots are indeed statistically significant, a
Wilcoxon test is performed. The Wilcoxon test was
chosen because, being a non-parametric test, it does
not assume any specific distribution of the data. Table
6 shows the results of the tests.
Table 6: p-values from the Wilcoxon test on the distance reduction from the 5 topic centroids, after the input of the sentence “I don’t know” (upper triangle) and after the input of the tourism-related sentence (lower triangle).

                Ordinary Life  School  Work   Relationship  Tourism
Ordinary Life        1         0.009   0.007     0.03        0.001
School              0.17        1      0.39      0.18        0.63
Work                0.71       0.19     1        0.44        0.27
Relationship        0.51       0.26    0.37       1          0.09
Tourism             0.01       0.005   0.02      0.009        1
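
The pairwise tests behind Table 6 could be computed as sketched below (scipy's signed-rank test; the pairing of per-conversation reductions is our assumption).

```python
# Sketch: pairwise Wilcoxon signed-rank tests on the per-conversation
# distance reductions towards two topic centroids.
from scipy.stats import wilcoxon

def pairwise_wilcoxon(reductions):
    """reductions: dict topic -> array of per-conversation reductions,
    e.g. the output of the distance_reduction helper sketched above."""
    topics = list(reductions)
    pvalues = {}
    for i, a in enumerate(topics):
        for b in topics[i + 1:]:
            _stat, p = wilcoxon(reductions[a], reductions[b])
            pvalues[(a, b)] = p
    return pvalues
```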
In the upper triangle of Table 6 it can be noticed that
the p-values for all pairs that include the “Ordinary
Life” topic are below 0.05, and the corresponding
differences are therefore significant, contrary to all
other pairs, which are not significant (p-value above
0.05). Similarly, in the lower triangle the p-values for
all the pairs involving the “Tourism” topic are significant,
while all other
pairs are not.

Figure 5: The boxplots show the distance reduction of the conversation vectors from the centroids of the 5 topics, after the input “I don’t know” (boxplot a) and the tourism-related sentence (boxplot b).
The results of this experiment suggest that “topic-
specific” words will strongly influence the position of
the context representation in the context latent space.
5 CONCLUSION & FUTURE
WORK
This paper explored the application of a Hierarchical
Encoder-Decoder architecture for the task of a Dia-
logue System. The model attempts to overcome the
limitation that the basic Encoder-Decoder model has:
the inability to extract and use context information
in the generation of a response. The model that
performed best in the human evaluation experiments
is the Encoder-Decoder that uses a separately trained
word2vec model to transform words into vectors.
The reason the Encoder-Decoder outperformed
the other models in the experiments is that it
learned to generate simpler and vaguer answers that,
due to their simplicity, are more likely to appear as a
valid response to the input. The HRED model is still
able to extract some contextual information, includ-
ing the most common words used in all the topics,
as demonstrated in experiments 3 and 4. However,
it seems that HRED has difficulty learning how to
apply such context information to generate meaning-
ful and coherent sentences, resulting in a percentage
of grammatically correct sentences generated below
40%, against almost 60% for the Encoder-Decoder.
Even though HRED was not able to improve the
quality of the outputs, it was capable of using the
context extracted from previous sentences in the gen-
eration of the output, resulting in increased usage
of words associated with other words that previously
appeared in the conversation. HRED has been shown to
be able to generate a context space with some level of
structure, in which conversations about similar topics
are indeed close together. The result of experiment 4
suggests that this effect is probably caused by the fact
that conversations labelled with the same topic often
use similar words.
Future work will study how a significant increase
in the depth of the HRED modules would affect the
performance of the model, and what the effect of a
much bigger dataset would be on the final result. Ad-
ditionally, we are planning to explore the application
of the architecture to different tasks, as HRED could
in principle be applied to any sequence-generating
task that exhibits a hierarchical structure. In particular,
it would be interesting to experiment with hierarchical
structures composed of more than 2 levels, by
adding one additional context module for every addi-
tional level of the hierarchy.
REFERENCES
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D.,
Bougares, F., Schwenk, H., and Bengio, Y. (2014).
Learning phrase representations using rnn encoder-
decoder for statistical machine translation. arXiv
preprint arXiv:1406.1078.
Collobert, R. and Weston, J. (2008). A unified architec-
ture for natural language processing: Deep neural net-
works with multitask learning. In Proceedings of the
25th international conference on Machine learning,
pages 160–167. ACM.
Li, Y., Su, H., Shen, X., Li, W., Cao, Z., and Niu, S. (2017).
Dailydialog: A manually labelled multi-turn dialogue
dataset. arXiv preprint arXiv:1710.03957.
Maaten, L. v. d. and Hinton, G. (2008). Visualizing data
using t-sne. Journal of machine learning research,
9(Nov):2579–2605.
Manning, C. D. and Schütze, H. (1999).
Foundations of statistical natural language process-
ing. MIT press.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013b). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C.,
and Pineau, J. (2016). Building end-to-end dialogue
systems using generative hierarchical neural network
models. In AAAI, volume 16, pages 3776–3784.
Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J.,
Courville, A. C., and Bengio, Y. (2017). A hierarchi-
cal latent variable encoder-decoder model for generat-
ing dialogues. In AAAI, pages 3295–3301.
Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Si-
monsen, J., and Nie, J.-Y. (2015). A hierarchical re-
current encoder-decoder for generative context-aware
query suggestion. In Proceedings of the 24th ACM In-
ternational on Conference on Information and Knowl-
edge Management, pages 553–562. ACM.