word is compared with each word from the other text. Then the distances between the most similar pairs are summed to obtain the total distance.
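A minimal sketch of this matching scheme, assuming cosine distance over some word-vector lookup `vec` (both are illustrative choices; the distance measure is not specified here):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two word vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def matching_distance(words_a, words_b, vec):
    # For each word in text A, find the most similar word in text B
    # and accumulate the distances between these best-matching pairs.
    return sum(
        min(cosine_distance(vec[wa], vec[wb]) for wb in words_b)
        for wa in words_a
    )
```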
3 METHODOLOGY
3.1 Datasets
Two datasets obtained from two channels have been used in this study. The first comes from Webchat, a system our company offers customers for solving their problems by contacting a human agent. Hence, the Webchat dataset is a collection of dialogues consisting of messages written between customers and human agents. It covers the years 2014 to 2016, contains more than a million dialogues, and the total number of sentences across all dialogues is over 13 million.
The second dataset is obtained from Facebook Messenger. It is similar to Webchat, but here customers interact with a primitive troubleshooting chatbot that can reply to basic customer queries. This basic chatbot was previously developed as a quick-win solution, a decision tree based on keywords. The Facebook Messenger dataset contains only messages written by customers. Moreover, the average message length is observed to be shorter than in the Webchat dataset. One possible reason is that customers, aware that responses come from an automated system rather than a real human, shorten their queries. Messages in the Facebook Messenger dataset therefore tend to be a summary of the problem customers want to ask about.
Both datasets consist of free-form text messages from users and contain a high number of typos, misspellings, and grammatically incorrect sentences. Since the texts are noisy, preprocessing and the use of word vectors for calculating similarity between words are very important.
3.2 Preprocessing
Messages in both datasets introduced in Section 3.1 contain Turkish and English letters. Therefore, Turkish letters are converted to their closest English counterparts for consistency, since most of the letters are shared with English: ğ, ç, ş, ü, ö, and ı are converted to g, c, s, u, o, and i, respectively. In addition, a spell checker is used to find the correct forms of words. For this purpose, Zemberek (Akın and Akın, 2007), a natural language processing tool for the Turkish language, has been utilized.
Before calculating the similarity between two questions, they are preprocessed as follows. First, all punctuation marks are removed. Then, each letter is converted to lowercase. Finally, Turkish characters are converted to their corresponding English letters. The last step is mainly done to ensure consistency, because inspection of the datasets shows that people write with either a Turkish or an English keyboard and often replace Turkish characters with their English counterparts.
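A minimal sketch of these three steps in Python (the function name and the example message are illustrative; spell correction with Zemberek would follow as a separate step):

```python
import string

def preprocess(text):
    # 1) remove all punctuation marks
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2) lowercase; map dotted capital 'İ' first, since Python's lower()
    #    is not locale-aware and would leave a combining dot behind
    text = text.replace("İ", "i").lower()
    # 3) convert Turkish letters to their closest English counterparts
    return text.translate(str.maketrans("çğışöü", "cgisou"))

print(preprocess("Kredi kartı başvurusu yapamıyorum!"))
# -> kredi karti basvurusu yapamiyorum
```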
3.3 Word Embeddings
In order to calculate semantic similarities among words, word embeddings are used. Word embeddings are dense vector representations of words; they capture semantic similarities between synonymous words or between the same word carrying different suffixes. While a bag-of-words approach would treat inflectional forms and synonyms of a word as entirely unrelated, a well-trained word embedding should capture the similarity between such words. Furthermore, word embeddings assign frequent misspellings of a word vectors that are very close to the vector of the original word. For example, the closest vector to the word 'kredi' ('credit' in English) belongs to 'kreid', a common typo of that word in our datasets. We used two algorithms to train word vectors, mainly due to their proven success in capturing semantic relations between words: Word2Vec and Fasttext.
Word embeddings with both methods have been trained on the Webchat dataset, with all messages concatenated into a single document. Since the Webchat dataset is domain specific, training word vectors on it provided meaningful semantic representations of banking-related terms. The dimension of the trained vectors is set to 300, since 300-dimensional word vectors are shown to perform better in (Sen and Erdogan, 2014).
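As an illustration, both models can be trained with the gensim library as follows (the corpus file name and tokenization are assumptions; only the 300-dimensional setting comes from the text):

```python
from gensim.models import FastText, Word2Vec

# One preprocessed Webchat message per line, whitespace-tokenized
# (file name is hypothetical).
with open("webchat_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

w2v = Word2Vec(sentences, vector_size=300)
ft = FastText(sentences, vector_size=300)

# Nearest neighbours should include inflected forms, synonyms, and
# frequent misspellings such as 'kreid' for 'kredi'.
print(w2v.wv.most_similar("kredi", topn=5))
```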
3.3.1 Word2Vec
Word2Vec is a method for training word embeddings using neural networks (Mikolov et al., 2013). Word2Vec offers two different ways to train embeddings: skip-gram and continuous bag-of-words (CBOW). Both versions of the method slide a window through the words. In the CBOW model, each word is predicted from its previous and future words within the window. This model is called bag-of-words because the order of the words inside the window is not taken into consideration, and the context is trained from neighboring information. The skip-gram model works similarly to the CBOW model, with a few key differences: in the skip-gram model, each word is used to predict its neighboring words within the window.
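In gensim's implementation, for example, the two training modes are selected with a single flag (a sketch; the text does not state which implementation is used):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; min_count=1 so no word is filtered out.
sentences = [["kredi", "karti", "basvurusu"],
             ["kredi", "karti", "limiti"]]

# CBOW (sg=0, the default): the centre word is predicted from the
# surrounding words inside the window.
cbow = Word2Vec(sentences, vector_size=300, window=5, sg=0, min_count=1)

# Skip-gram (sg=1): the surrounding words are predicted from the
# centre word.
skipgram = Word2Vec(sentences, vector_size=300, window=5, sg=1, min_count=1)
```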