Authors:
Miraç Tuğcu; Tolga Çekiç; Begüm Erdinç; Seher Akay and Onur Deniz
Affiliation:
Natural Language Processing Department, Yapı Kredi Teknoloji, Istanbul, Turkey
Keyword(s):
QA Classification, Data-Centric AI, Clustering, Language Models, Deep Learning, NLP, BERT.
Abstract:
Questionnaires with open-ended questions are used across industries to collect insights from respondents. Because the questions can be complex, the answers are prone to labelling errors. Cleaning this noise manually may not be feasible in low-resource scenarios. Here, we propose an end-to-end solution that treats questionnaire-style data as a text classification problem. To mitigate labelling errors, we use a data-centric approach to group inconsistent examples in a Turkish banking customer questionnaire dataset. For the model architecture, a BiLSTM is used to capture long-term dependencies between the contextualized word embeddings of BERT. On the binary questionnaire classification task, we obtained up to 81.9% recall and a 79.8% F1 score when the dataset was cleaned with the clustering method, and we report how this cleaning affects overall model performance on both the original and clean versions of the data.