tion from the Dialogue Manager to generate the out-
put sentence. The simplest technique used for NLU
is to spot certain keywords in the input, often work-
ing together with a script-based Dialogue Manager.
However, throughout the years there have been many
attempts to improve the NLU unit to better extract
text information, using techniques including statis-
tical modeling of language (Manning et al., 1999),
skip-gram models (Mikolov et al., 2013b) and, more
recently, deep neural networks (Collobert and Weston, 2008). Eventually, with the rise of Deep Learning in recent years, Dialogue Systems research has mainly focused on end-to-end models, capable of including all three modules in a single deep neural network trained on a large dataset. One end-to-end RNN architecture that has proved particularly successful in recent years is the Encoder-Decoder.
The use of Encoder-Decoder architectures for natural language processing was first proposed as a solution for machine translation by (Cho et al., 2014). Since then, the architecture has been applied to many other tasks, including conversational agents (Serban et al., 2017). However, generating responses was found to be considerably more difficult than translating between languages, probably because any given input admits a broader range of correct answers. A limitation of Encoder-Decoder models in producing meaningful conversations is that each output is influenced only by the latest input. Important factors are therefore ignored, such as the context of the conversation, the speaker, and information provided in previous inputs. In 2015, Sordoni et al. proposed an extended version of the Encoder-Decoder architecture, the Hierarchical Recurrent Encoder-Decoder (HRED) (Sordoni et al., 2015), originally applied to query suggestion. In their paper, they demonstrate that the architecture can use context information extracted from previous queries to generate more appropriate query suggestions. This paper attempts to apply this architecture to a dialogue system.
3 DESCRIPTION OF MODELS
3.1 Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a neural network which operates on variable-length input sequences x = (x_1, x_2, ..., x_l). Its output consists of a hidden state h, which is updated recursively at each time step by

h(t) = f(h(t-1), x_t)    (1)
where f can be any non-linear activation function. For the purposes of this paper, f was chosen to be a Long Short-Term Memory (LSTM) unit. RNNs are particularly suited to the task of generating conversations because they can receive variable-length inputs and encode them in a fixed-length vector. Moreover, they can also produce variable-length outputs, which is essential when working with sentences, which are likely to be of different lengths. When receiving an input sentence s = (w_1, w_2, ..., w_l), where l is its length, the RNN updates its hidden state h recursively for every element of the input sequence, in such a way that at any step t, h(t) is a fixed-length vector representing the entire subsequence (w_1, w_2, ..., w_t) up to time step t. After receiving the last input w_l, the hidden state therefore represents the entire sequence.
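The recurrence above can be sketched in a few lines. Note this is only an illustrative sketch: a plain tanh cell stands in for the LSTM used in this paper, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 4, 8
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights

def encode(sequence):
    """Fold a variable-length sequence into a fixed-length hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        h = np.tanh(W_h @ h + W_x @ x_t)  # h(t) = f(h(t-1), x_t)
    return h

# Two inputs of different lengths map to vectors of the same fixed size.
seq_a = [rng.normal(size=input_dim) for _ in range(3)]
seq_b = [rng.normal(size=input_dim) for _ in range(7)]
print(encode(seq_a).shape, encode(seq_b).shape)  # both (8,)
```

Both sequences, despite their different lengths, are folded into hidden states of the same fixed size, which is the property exploited by the Encoder.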
Another interesting property of RNNs is their ability to learn the probability distribution of a sequence. In this specific case, the network would learn the probability of a word being the next in the sentence, given the previous words:

P(w_t | w_1, w_2, ..., w_{t-1}).    (2)

This can be useful for capturing syntactic and grammatical rules and producing believable answers.
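A common way to realise Equation (2) in practice, sketched below under illustrative assumptions: the hidden state summarising the previous words is projected to vocabulary size and normalised with a softmax. The vocabulary and projection weights here are placeholders, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["i", "like", "cats", "dogs", "<eos>"]  # toy vocabulary (hypothetical)
hidden_dim = 8

# Hidden state summarising the words seen so far (placeholder values).
h = rng.normal(size=hidden_dim)

# Linear projection to one score per vocabulary word, then softmax.
W_out = rng.normal(size=(len(vocab), hidden_dim)) * 0.1
scores = W_out @ h
probs = np.exp(scores - scores.max())  # subtract max for numerical stability
probs /= probs.sum()

# probs[i] plays the role of P(w_t = vocab[i] | w_1, ..., w_{t-1}).
print({w: round(float(p), 3) for w, p in zip(vocab, probs)})
```

The softmax guarantees a valid probability distribution over the vocabulary, which is the form used by the second embedding approach described in the next section.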
3.2 Word Embeddings
RNNs cannot accept strings as input, thus a way to
encode each word into a numerical vector is required.
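As a toy illustration of such an encoding (the vectors below are hand-made placeholders, not trained word2vec embeddings): each word maps to a dense vector, semantic closeness can be measured with cosine similarity, and the closest vocabulary word to a given vector can be looked up, as done later in this section.

```python
import numpy as np

# Hand-made placeholder embeddings; a trained word2vec model would
# supply these. Similar words are given nearby vectors on purpose.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vector):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    return max(embeddings, key=lambda w: cosine(embeddings[w], vector))

print(cosine(embeddings["cat"], embeddings["dog"]))  # high: similar meaning
print(cosine(embeddings["cat"], embeddings["car"]))  # low: unrelated
print(nearest(np.array([0.9, 0.12, 0.0])))           # -> "cat"
```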
For this paper, word2vec (using Python’s Gensim Li-
brary) was used (Mikolov et al., 2013a). Word2vec
uses a neural network to map each word of the vocab-
ulary to a vector space, in which words with similar
meaning are closer together. This approach should
make it simpler for the chatbot to extract informa-
tion from the input. The word2vec model was used
with two different approaches. The first approach was
to use a pre-trained word2vec model. This method fixes the embedding matrix, which is therefore not trained with the rest of the model. When using
this approach, the output layer uses a linear activation
function to output a vector directly in the embedding
space. The output vector is then compared to the vec-
tors of the words in the vocabulary, and the closest
one is chosen as the output word. The second ap-
proach was to use the pre-trained word2vec model to
initialise an embedding layer, connected to the input
of the Encoder. The embedding weights are then trained with the rest of the network, allowing them to adapt to the specific task. When using this approach, the output layer has the same dimension as the
vocabulary, and it uses the softmax activation function
to produce a valid probability distribution. The output
ICAART 2019 - 11th International Conference on Agents and Artificial Intelligence