Classification of Questionnaires with Open-Ended Questions
Miraç Tuğcu, Tolga Çekiç, Begüm Çıtamak Erdinç, Seher Can Akay and Onur Deniz
Natural Language Processing Department, Yapı Kredi Teknoloji, Istanbul, Turkey
Keywords:
QA Classification, Data-Centric AI, Clustering, Language Models, Deep Learning, NLP, BERT.
Abstract:
Questionnaires with open-ended questions are used across industries to collect insights from respondents.
The answers to these questions may lead to labelling errors because the questions are complex. However, handling this noise in the data with manual labour might not be feasible in low-resource scenarios. Here, we
propose an end-to-end solution to handle questionnaire-style data as a text classification problem. In order
to mitigate labelling errors, we use a data-centric approach to group inconsistent examples from the banking
customer questionnaire dataset in Turkish. For the model architecture, BiLSTM is preferred to capture long-
term dependencies between contextualized word embeddings of BERT. We achieved significant results on the
binary questionnaire classification task. We obtained results of up to 81.9% recall and 79.8% F1 score with the clustering method used to clean the dataset, and we present how it impacts overall model performance on both the original and clean versions of the data.
1 INTRODUCTION
Classification is one of the core tasks that has largely
been studied in machine learning and by extension,
text classification is a common area of research for
natural language processing (NLP). Text classifica-
tion can be broadly defined as categorizing a text of
arbitrary length and composition into two or more
predefined classes. It has been used for sentiment
analysis (Tejwani, 2014), spam detection (Bhowmick
and Hazarika, 2016), intent classification (Larson and
Leach, 2022), and so on. In earlier research, sparse
tf-idf vectors have been used with methods such as
Support Vector Machines to classify various types of
textual data. With the introduction of dense vecto-
rial representations of tokens in text such as word2vec
(Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), which are shown to retain the semantic information of words successfully, deep learning based text classification models have gradually surpassed the success of
earlier models. Recent advances in contextual embed-
dings by transformer networks have further improved
on the shortcomings of word vectors, namely the semantic relation between words far apart in a text.
Considering these advances, we have approached the
problem of analyzing questionnaire data as a purely
text classification problem. Rather than trying to extract information from each answer one by one, we introduce the questionnaire as a complete text to a contextual embedding model and train a classification model.
Classifying multiple open-ended questions & an-
swers requires an understanding of different aspects
of creative responses in contrast to close-ended ques-
tions. This nature of open-ended questions allows
elaboration from respondents, thus making them im-
portant in questionnaires and surveys. On the other
hand, human analysis of text responses is time-
consuming and domain knowledge is needed for non-
trivial questions. Surveys are widely used in educa-
tion, research, and industry domains to get feedback
or information from a targeted group of people. Nev-
ertheless, open-ended questions may cause noisy data
while increasing the variability of answers. As a re-
sult, cleaning the data or annotation can be a cumber-
some process even for domain experts.
In this work, we approach classifying question-
naires as a pure text classification problem and in-
troduce our results. We also apply a data-centric ap-
proach to reduce the labelling error in the dataset. De-
tails of the literature review are mentioned in Section
2. The preparation process and properties of the Turk-
ish customer questionnaire dataset on the banking do-
main are explained in Section 3. The proposed model
architecture is described in Section 4. The clustering
approach to mitigate the noise in the dataset is de-
tailed in Section 5. The experiment setup and results
of the mentioned methods are discussed in Section 6.
Finally, the conclusions we reached and the details of
our future research are shared in Section 7.
2 RELATED WORK
Contemporary methods of text classification with
BERT (Devlin et al., 2019) involve adding a classifier layer that leverages transfer learning and fine-tuning BERT to create robust models for a specific task.
A recent method of multi-class sentiment classification shows that a simple model architecture with a dropout layer (Srivastava et al., 2014) and a softmax classifier layer is able to produce satisfactory results (Munikar et al., 2019). The same architecture is
applied alongside our architecture for questionnaire
classification to observe if BERT with a simple clas-
sifier network can capture the bidirectional dependen-
cies of question-answer pairs in a text and be robust
against labelling errors. FakeBERT (Kaliyar et al., 2021) uses BERT embeddings to perform binary classification of fake news with a CNN network as the classifier and compares the results against GloVe (Pennington et al., 2014) embeddings, which are context-independent and unidirectional. For our problem, multiple question-answer
pairs could be dependent on each other. Hence, we
choose a BiLSTM (Graves and Schmidhuber, 2005) model in our architecture to capture sequential depen-
dencies from BERT embeddings which are contextu-
alized and bidirectional. To the best of our knowl-
edge, this is the first work to approach the open-ended
questionnaire classification problem as a text classifi-
cation problem.
3 QUESTIONNAIRE DATASET
In order to create a dataset, Turkish customer ques-
tionnaire data in the banking domain is collected from
Yapı Kredi. There are different question categories
for customer types - Turkish citizens, foreign cus-
tomers, underage customers, and so on. In this work,
questionnaire data of the Turkish citizens’ category is
used due to its large proportion compared to the other categories. The raw data was in the for-
mat of email texts, and answers to the questions were
in a separate reply email. Thus, the first challenge of
creating the dataset was parsing the question-answer
pairs from emails to a structured format.
3.1 Data Parsing
A rule-based parser is developed to extract question-
answer pairs from the reply patterns of respondents.
These patterns are about where answers are lo-
cated in a reply because the questions are sent in a
default format in the first email. Two of the most
frequent reply patterns are either appending answers
next to questions in the reply section or copying the
questions to add answers next to them. So we defined
rules to check questions in replies and answers next
to or beneath them. Specific keyword and length con-
trols are also used to ensure there are no mistakes in
the parsing process.
This extraction approach covered 80% of the email data. We experimented with using the email contents in raw format, but the parsing approach helped us generalize the data and save it in a semi-structured format.
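The exact rules are specific to our reply patterns; the sketch below only illustrates the general idea, under the assumption that the default questions are known in advance and answers follow each quoted question in the reply. The function name, question texts, and length threshold are illustrative, not the production parser.

```python
KNOWN_QUESTIONS = [
    "What is the purpose of your account?",   # hypothetical default questions
    "What is your monthly income range?",
]

def parse_reply(reply_text, questions=KNOWN_QUESTIONS, max_answer_len=500):
    """Extract (question, answer) pairs from a reply email, assuming the common
    pattern where each question is copied into the reply and the answer appears
    next to or beneath it."""
    pairs = []
    for question in questions:
        start = reply_text.find(question)
        if start == -1:
            pairs.append((question, ""))  # question not quoted in the reply
            continue
        answer_start = start + len(question)
        # The answer ends where the next known question begins (or at the end).
        next_positions = [p for p in (reply_text.find(q, answer_start) for q in questions)
                          if p != -1]
        answer_end = min(next_positions) if next_positions else len(reply_text)
        answer = reply_text[answer_start:answer_end].strip(" :\n\t>-")
        # Simple length control to catch parsing mistakes.
        pairs.append((question, answer[:max_answer_len]))
    return pairs
```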
The task is a binary classification problem, and the classes indicate whether an issue is found with the answers or not. The surveyors decide that there is an issue when an unexpected answer arrives for a question.
The data was self-labelled because the surveyors of the questionnaires sent a separate reply email with extra questions if they decided there was any issue with the answers.
After parsing emails, the dataset is set for binary classification. Given an arbitrary question text $q$ and a corresponding answer text $a$ for a question-answer pair $p = (q, a)$, an example from the dataset includes a series of question-answer pairs $\{p_1, \ldots, p_n\}$. The task is to predict the class $y \in \{\text{no issue}, \text{issue found}\}$ for each example. The dataset has 19,006 questionnaire examples with a total of 186,092 question-answer pairs.
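For illustration, a single parsed example could be stored in a semi-structured form like the following; the field names and contents are hypothetical, not the actual schema of the dataset.

```python
# One questionnaire example: a list of question-answer pairs plus a binary label
# derived from whether the surveyor sent a follow-up email (issue found) or not.
example = {
    "pairs": [
        {"question": "What is the purpose of your account?", "answer": "Salary payments"},
        {"question": "What is your monthly income range?", "answer": ""},  # empty answer
    ],
    "label": 1,  # 0 = no issue, 1 = issue found
}
```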
Figure 1: Venn diagram for class confusion and percent-
ages of class distribution in the dataset. Inconsistent sam-
ples were detected empirically, but it is impossible to find
out how many of them there are.
3.2 Data Inconsistency
There were two reasons for inconsistency in our data: misleading email replies and missing features. Mis-
leading replies are caused by the puzzling nature of
the open-ended questions. Even domain experts we
consulted have trouble identifying if there is an is-
sue with the answers in this situation. Thus, label errors occur in the data because similar questionnaires overlap across different labels, as shown in Figure 1. There are also scenarios
where another channel other than emails (calls, short
messages) is used to decide if there is an issue with
the customer. However, this problem can hardly be detected from the emails alone, since there is no indication of whether another communication channel was used.
With a sufficiently large dataset, state-of-the-art
deep learning models are able to iron out inconsisten-
cies. Because our data was not large enough, we focused on a data-centric approach to handle the noisy data and tackle the inconsistency problem.
3.3 Data Preprocessing
After the questionnaire dataset is created, the order
of question-answers is shuffled for each example in
the dataset to apply regularization and reduce over-
fitting during the training. Punctuation characters
are removed, and lowercase characters are used be-
cause of the improper usage of punctuation and up-
percase characters in replies. A special token [SEP]
is added after each question-answer pair when tok-
enizing to separate question-answer pairs in the in-
put. For empty answers or any answer shorter than one non-whitespace character, the [UNK] token is used. These two special tokens were already in
the dictionary of the pre-trained BERT model’s tok-
enizer that is used. The details of the model will be
further explained in Section 4.
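A minimal sketch of this preprocessing, assuming plain Python string handling before tokenization; the function name and the exact order of operations are assumptions rather than the production pipeline.

```python
import random
import string

def build_input_text(pairs, seed=None):
    """Shuffle question-answer pairs, lowercase them, strip punctuation, and
    append [SEP] after each pair; empty answers are replaced with [UNK]."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)  # regularization: the pair order carries no meaning

    table = str.maketrans("", "", string.punctuation)
    chunks = []
    for question, answer in pairs:
        question = question.lower().translate(table).strip()
        answer = answer.lower().translate(table).strip()
        if len(answer) < 1:          # empty or whitespace-only answer
            answer = "[UNK]"
        chunks.append(f"{question} {answer} [SEP]")
    return " ".join(chunks)
```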
4 MODEL ARCHITECTURE
Sequence representations of concatenated and tok-
enized question-answer pairs are used for binary clas-
sification to find if there is an issue or not with the
given questionnaire, depending on the answers to the
questions. The architecture of the model is shown in
Figure 2. The model is expected to generalize what
an issue could be without further auxiliary features
about the issue itself. To achieve this, representations
of an attention-based model like BERT (Devlin et al.,
2019) are used for classifying, and the pre-trained model is further explained in Subsection 4.1. The classifier layer is explained in Subsection 4.2.
Figure 2: Model architecture for binary classification of questionnaires.
4.1 The Pre-Trained Language Model
Using contextualized word embeddings instead of static word embeddings is necessary to represent the context. Different question-answer pairs and answers in the input may be related to each other. Therefore,
using question-answer pairs in a text requires repre-
senting the context between word tokens across the
question-answer pairs. To achieve this, the BERT
model is chosen. The word embeddings of the
BERT model are able to represent the context between
question-answer pairs by using a bidirectional self-
attention mechanism. Thus, the word embeddings can
represent the context for both the left and right or-
dering of tokens and capture different dependencies
from both sides for each token. The BERT model we
used in this work is the BERTurk model (Schweter,
2020), which is a model trained on Turkish corpora.
The version with a 32K dictionary size and base ar-
chitecture with a hidden size of 768 is used. Ad-
ditional pre-training of BERT model on the domain
of the task can improve the text classification perfor-
mance (Sun et al., 2019). Therefore, the BERTurk
model is pre-trained on the Masked Language Modelling task with banking documents and old questionnaire
emails to further adapt it to the banking domain.
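A condensed sketch of this domain-adaptive pre-training step, assuming the HuggingFace transformers and datasets libraries, the public BERTurk checkpoint name, and a plain-text file of in-domain documents; the file path and hyperparameters are placeholders, not the values used in our experiments.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "dbmdz/bert-base-turkish-cased"  # public BERTurk base checkpoint (32k vocab)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# banking_corpus.txt is a placeholder for the banking documents and old emails.
dataset = load_dataset("text", data_files={"train": "banking_corpus.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="berturk-banking-mlm",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # further pre-training with masked language modelling
```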
4.2 Classifier Head
During the experiments, using a BiLSTM layer be-
fore the linear layer performed better at generalizing
the questionnaire data compared to using only a lin-
ear neural network inside the classifier layer. While
increasing the complexity of the model, the BiLSTM
layer also helps to capture higher-level representa-
tions of the BERT model by modelling contextual
information for both directions. BiLSTM is able to
enhance the word embeddings of BERT by using se-
quential dependencies. For regularization, we also ap-
plied a dropout layer before and after the BiLSTM
during the training phase. Input and output sizes of
the BiLSTM layer are the same as BERT’s hidden size
(i.e., 768). The input of the linear layer is the concate-
nated output of BiLSTM's last step. The output size of the linear layer is the number of classes, which is two. None of BERT's layers are frozen during training. Therefore, BERT is fine-tuned while training the classifier layer, which helps adapt the higher-level representations of BERT to the task at hand.
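A PyTorch sketch of this architecture, assuming the HuggingFace transformers library; the dropout rate, the per-direction BiLSTM size, and the way the "last step" of the BiLSTM is read out are our assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class QuestionnaireClassifier(nn.Module):
    def __init__(self, model_name="dbmdz/bert-base-turkish-cased",
                 hidden_size=768, num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)  # not frozen: fine-tuned jointly
        self.dropout_in = nn.Dropout(dropout)
        # BiLSTM over BERT's token embeddings; per-direction hidden size set to
        # BERT's hidden size here (one reading of "input and output sizes are 768").
        self.bilstm = nn.LSTM(hidden_size, hidden_size,
                              batch_first=True, bidirectional=True)
        self.dropout_out = nn.Dropout(dropout)
        # Concatenated final states of both directions -> two output neurons.
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        token_embeddings = self.bert(input_ids=input_ids,
                                     attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.bilstm(self.dropout_in(token_embeddings))
        # h_n holds the final hidden state of each direction; their concatenation
        # is the "last step" representation fed to the linear output layer.
        features = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.classifier(self.dropout_out(features))  # raw logits, no softmax
```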
4.3 Loss Function
The cross-entropy loss function ($L_{CE}$) is used to calculate the loss at the end of the training pipeline. There is no softmax layer for the outputs in the model architecture, but the loss function $L_{CE}$ applies the softmax function internally, as shown in Equation 1, where $N$ is the number of classes (i.e., output neurons) and $x_{target}$ is the value of the target output neuron.

$$L_{CE} = l(x, x_{target}) = -\log\left(\frac{\exp(x_{target})}{\sum_{j=1}^{N}\exp(x_j)}\right) \tag{1}$$
$L_{CE}$ is used instead of a binary cross-entropy (BCE) loss from a sigmoid output to train the neural network. While it is common and more efficient to use the BCE loss for a binary classification problem, we observed that the loss function $L_{CE}$ contributes more to handling the class imbalance in the dataset, as shown in Table 1. The output of the loss functions $L_{CE}$ and BCE must be the same if the inputs to the functions are also the same for binary classification. Thus, the outcome of the experiment differs due to the increased complexity of the neural network when using $L_{CE}$ with two output neurons. Also, using $L_{CE}$ enables the model to be used for non-binary classification tasks as well in case of need.
Correctly predicting all the customers with issues is the main problem we are trying to optimize in this classification; hence, the significant increase in the recall score observed for Class 1 (as shown in Table 1) contributes toward the desired solution.
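A small sketch of the two loss setups compared in Table 1, assuming PyTorch; the logits and labels below are dummy values for illustration.

```python
import torch
import torch.nn as nn

batch_labels = torch.tensor([0, 1, 1, 0])

# Setup used in this work: two output neurons + CrossEntropyLoss, which applies
# log-softmax internally (Equation 1), so the model emits raw logits.
two_neuron_logits = torch.randn(4, 2)
ce_loss = nn.CrossEntropyLoss()(two_neuron_logits, batch_labels)

# Alternative setup: a single output neuron with a sigmoid, via BCEWithLogitsLoss.
one_neuron_logits = torch.randn(4, 1)
bce_loss = nn.BCEWithLogitsLoss()(one_neuron_logits.squeeze(1), batch_labels.float())

print(ce_loss.item(), bce_loss.item())
```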
5 CLUSTERING APPROACH
Figure 3: Clustering pipeline to obtain follow-up question
clusters.
The dataset is created without annotation, which
causes noisy labels due to the aforementioned data
properties in Subsection 3.2. This situation is similar
to weak supervision noise because email replies are
used directly to get labels rather than manual anno-
tation, which is time-consuming and not always fea-
sible. A BERT-based classifier is not robust to weak supervision noise, which can significantly degrade its performance (Zhu et al., 2022). For this reason, a clustering-based approach is used to solve the noise-related data inconsistency problem, reduce manual review time, and boost text classifier performance.
The questionnaires with any issues in the training
dataset are clustered using follow-up questions in 3
steps as shown in Figure 3. Respondents are asked
follow-up questions if a surveyor decides there is an
issue with the replies. However, some of these ques-
tions are mistakenly asked or unrelated to the answers
to the questionnaire. Also, surveyors may differ when deciding whether there is an issue with a puzzling questionnaire, because more than one surveyor is responsible for reviewing the questionnaires.
Follow-up questions of questionnaires from the
issue-found class are collected. Contextual word em-
beddings of BERT can be used in a way that seman-
tically similar text embeddings are grouped close to
each other in vector space. While this property of
Table 1: Classification results of two different loss function setups of the same model architecture in Figure 2. The model with BCE has only one output activated by a Sigmoid function and uses the Binary Cross Entropy loss function.

Loss Function   Class   Prec.   Recall   F1
BCE             0       0.885   0.921    0.903
                1       0.753   0.668    0.708
CE              0       0.904   0.903    0.904
                1       0.733   0.732    0.733
computed word representations is useful for classifi-
cation and other downstream tasks in NLP, it also pro-
vides the basis for clustering. Thus, questions are represented in vector space with the Sentence-BERT (Reimers and Gurevych, 2019) framework, using the BERTurk model with mean pooling.
The questions are clustered after reducing the di-
mensionality of their representations in vector space
by using the principal component analysis technique
(Jolliffe, 1986). A hierarchical clustering algorithm,
agglomerative clustering, is used. The algorithm recursively merges the closest pairs of vectors until all representations are assigned to a cluster. We
used the algorithm with a distance threshold with-
out specifying a cluster size. Cosine distance is used
for obtaining clusters that contextually represent the
questions.
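A sketch of this clustering pipeline (Figure 3), assuming the sentence-transformers and scikit-learn (>= 1.2) libraries; the model checkpoint, the number of principal components, the distance threshold, and the linkage criterion are placeholders rather than the values used in the experiments.

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

def cluster_follow_up_questions(follow_up_questions,
                                model_name="dbmdz/bert-base-turkish-cased",
                                n_components=50, distance_threshold=0.3):
    """Cluster follow-up questions collected from the issue-found class."""
    # Sentence embeddings; sentence-transformers adds mean pooling on top of the
    # plain BERTurk encoder when given a vanilla transformer checkpoint.
    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(follow_up_questions)

    # Dimensionality reduction with PCA before clustering.
    reduced = PCA(n_components=n_components).fit_transform(embeddings)

    # Agglomerative clustering with a cosine-distance threshold instead of a
    # fixed number of clusters.
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold,
                                         metric="cosine", linkage="average")
    return clustering.fit_predict(reduced)
```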
The clusters that included the mentioned type of
follow-up questions are flagged after the clustering
step. We assumed that the inconsistent examples that
are not clustered can be found by predicting the clus-
ter of their follow-up questions and checking if the
clusters are flagged. However, this approach can only
be used if follow-up questions exist for a question-
naire. Thus, inconsistencies in questionnaires without
any issue could not be detected with this approach. To
handle this problem, the Cleanlab framework (North-
cutt et al., 2021) is used to find labelling errors in data.
However, there was no empirical improvement in the
model due to the aforementioned ambiguous answers
in our dataset, which can even be puzzling for a human expert. Thus, the proposed methods might still
struggle with the inconsistencies even though they are
reduced by the clustering approach and some noisy la-
bels could be overlooked.
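A minimal sketch of the Cleanlab step, assuming out-of-sample predicted probabilities from the classifier are already available (e.g., from cross-validation); the arrays below are dummy values for illustration.

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: the noisy self-derived labels (0 = no issue, 1 = issue found).
# pred_probs: out-of-sample predicted class probabilities from the classifier.
labels = np.array([0, 1, 0, 1])
pred_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.4, 0.6],   # labelled 0 but the model leans towards 1
                       [0.7, 0.3]])  # labelled 1 but the model leans towards 0

issue_indices = find_label_issues(labels=labels, pred_probs=pred_probs,
                                  return_indices_ranked_by="self_confidence")
print(issue_indices)  # indices of examples whose labels look inconsistent
```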
6 EXPERIMENTS & RESULTS
6.1 Experiment Setup
Experiments are performed to observe the results of
handling label errors with a clustering approach on
the dataset. The only dataset used in the experiments
is the Turkish questionnaire dataset on the banking
domain from Yapı Kredi, mentioned in Section 3. The
setting of the dataset we used involves using only
open-ended questions and their answers from a re-
spondent to classify ambiguous issues via binary clas-
sification. Because of the unique properties of the
problem and the dataset we used, there is no available open benchmark dataset for our work.
From the point of view of model architecture, two
output neurons are used. However, it is more common
to use the BCE loss function to calculate the loss by using the
output of a single neuron after a Sigmoid activation
function. We observed that using the cross-entropy loss function $L_{CE}$ as in a multi-class classification setting improved the overall performance and the recall value for Class 1, also known as the issue-found class, as shown in Table 1. Normally, there is no difference between the two loss functions in a binary setting except for the efficiency of using BCE. However, the added complexity of using two output neurons instead of one helps the model parameters converge better and obtain higher recall scores. Because recall is more important than the other evaluation metrics for this work, the model architecture with the cross-entropy loss $L_{CE}$ is used. The evaluation metrics used in the experiments are the common metrics for binary classification, such as precision, recall, F1 score, and AUC score. Macro-averaged metrics are used because the dataset is imbalanced. A cleaned test dataset could not be prepared because manual labelling is not feasible due to the ambiguity of the answers. As a result, the
original test data is used. Also, the test data that is
cleaned by the clustering approach is used to show
how the results of the trained models differ for both
test datasets.
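The evaluation itself can be reproduced with standard scikit-learn utilities; a sketch assuming arrays of gold labels, predicted labels, and predicted probabilities for the issue-found class (dummy values shown).

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0])                 # gold labels
y_pred = np.array([0, 1, 0, 0, 1, 1])                 # model predictions
y_score = np.array([0.1, 0.8, 0.4, 0.2, 0.9, 0.6])    # P(issue found)

# Macro-averaged precision, recall, and F1, used because the classes are imbalanced.
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                           average="macro")
auc = roc_auc_score(y_true, y_score)                   # area under the ROC curve
print(precision, recall, f1, auc)
```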
Various language models are utilized for classifi-
cation in a fine-tuning setting by using a linear layer
as a classifier layer to benchmark different language
models on our dataset. mBERT, DistilBERT (Sanh
et al., 2019), ELECTRA (Clark et al., 2020), Con-
vBERT (Jiang et al., 2020), and BERTurk (Schweter,
2020) models from the BERTurk repository are used
for this experiment. A parameter-free classification
method that uses a compressor (Jiang et al., 2023)
is chosen to compare its result with pre-trained lan-
guage models. This approach is denoted as gzip with
respect to the compression application that is used.
gzip utilizes the k-nearest neighbors algorithm where
k = 3. The pre-trained language model that is used in
the model architecture of this work is pre-trained on the banking domain before fine-tuning for the clas-
sification task as mentioned in Subsection 4.1. This
model will be denoted as BERT in this section for
convenience. For this experiment only, the models
are trained on the dataset where the examples with empty answers are removed. Empty answers can sometimes be a reason for asking follow-up questions, and a question that has no answer can be linked to other questions. Thus, the dataset where examples with empty answers are removed is easier to classify compared to the original dataset.
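The gzip baseline follows the compressor-based method of Jiang et al. (2023); below is a compact sketch of one plausible implementation using the normalized compression distance and a k-nearest-neighbour vote with k = 3. The function names are illustrative.

```python
import gzip
from collections import Counter

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts using gzip."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def gzip_knn_predict(test_text, train_texts, train_labels, k=3):
    """Predict the label of test_text by majority vote of its k nearest
    training texts under NCD."""
    nearest = sorted(range(len(train_texts)),
                     key=lambda i: ncd(test_text, train_texts[i]))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```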
Two different model architectures are chosen to experiment with the clustering approach. The first one is a
BERT with a classifier head that has only one linear
network layer (i.e., output layer) denoted as BERT+L.
The other model is the proposed model architecture
where a BiLSTM layer is used as a hidden layer be-
fore the output layer and denoted by + BiLSTM. The
models trained on the cleaned version of train data
are marked with an asterisk character in the tables and in the following subsection.
6.2 Classifying Results
Table 2: Classification results of using different pre-trained
language models and a parameter-free approach.
                     ------ Macro Avg. ------
Model         Acc.   Prec.   Recall   F1
ConvBERT      0.784  0.777   0.758    0.764
ELECTRA       0.782  0.773   0.758    0.764
DistilBERT    0.755  0.741   0.740    0.741
mBERT         0.784  0.775   0.763    0.768
gzip          0.672  0.656   0.614    0.612
BERTurk       0.782  0.775   0.755    0.762
BERT          0.791  0.791   0.760    0.769
+ BiLSTM      0.792  0.793   0.760    0.769
Results of using different pre-trained language models are similar, except for DistilBERT, which shows a minor difference from the other models, as shown in Table 2. This is expected due to the smaller parameter size of the DistilBERT model. There is a significant difference between the parameter-free gzip approach and the pre-trained language models. This is anticipated due to the complexity of the task, yet the approach is proven to be successful on less complex text-classification tasks (Jiang et al., 2023) and shows promising results given that it requires no training phase and no GPU. The best result is yielded by the models that use the BERT model pre-trained on the banking domain. Removing the examples with
empty answers from the dataset helps models to per-
form slightly better at classification compared to re-
sults in Table 3.
The + BiLSTM* slightly improves the recall value,
as shown in Table 3. Results of AUC scores of each
model especially show the classification abilities of
the models. While BERT models without a BiLSTM
Table 3: Model results on the original test data. Models with * are trained on the cleaned train data.

                     ------ Macro Avg. ------
Model         Acc.   Prec.   Recall   F1
BERT+L        0.743  0.741   0.718    0.723
BERT+L*       0.737  0.732   0.714    0.719
+ BiLSTM      0.743  0.744   0.715    0.721
+ BiLSTM*     0.737  0.729   0.728    0.728
Table 4: Model results on the cleaned test data.
                     ------ Macro Avg. ------
Model         Acc.   Prec.   Recall   F1
BERT+L        0.857  0.825   0.792    0.806
BERT+L*       0.846  0.807   0.789    0.797
+ BiLSTM      0.852  0.815   0.796    0.804
+ BiLSTM*     0.833  0.786   0.819    0.798
layer in their classifier heads show poorer results,
cleaning the data increases the classification ability
on the original test data, as shown in Figure 4. The
same models are also tested on the cleaned test data. The BERT+L model has higher metric scores compared to the other models except for recall on the cleaned test data, as shown in Table 4. However, the ROC curve analysis of the BERT+L model shows that the model fails to differentiate the classes, as its AUC score is below 0.5. It can be deduced that + BiLSTM* outperforms the other models by generalizing the given data better and being more confident than the other models when deciding whether there is an issue with a questionnaire.
Figure 4: Area under the ROC Curve score of the models
on original test data.
7 CONCLUSIONS & FUTURE
WORK
By collecting Turkish questionnaire data from respon-
dents’ emails and extracting question-answer pairs,
we created a customer questionnaire dataset to train
a model on the text classification setting. A novel ap-
proach is proposed for questionnaire classification by
using concatenated question-answer pairs to perform
text classification rather than separately analysing the
pairs. A pre-trained BERT is used in the training
to get contextual and bidirectional word embeddings
to capture the correlation between the pairs in the
whole text. A BiLSTM layer on top of BERT is
used to represent the sequential dependencies of word
embeddings for further improvement. We utilized a
data-centric approach, using clustering to group in-
consistent data to mitigate the effects of noise caused
by open-ended questions that provide deeper insights
into a questionnaire and affect the annotation process.
The model architecture we proposed for questionnaire
classification performed better than a simple text clas-
sification architecture. Also, we have observed mean-
ingful improvement in the classification performance
with models trained on the data where the clustering
approach is applied.
The proposed novel approach for classification
can be used in a dataset in a similar setting that has
multiple question-answer pairs with the task of clas-
sifying these pairs as a single unit and not as sepa-
rate parts. The method we used does not involve any domain-centric or language-centric technique; thus, one can assume the methods are applicable to simi-
lar data in other contexts or languages. Our work fo-
cuses on Turkish data in the banking domain due to
not having any public data available. However, the results show that the classification is successful on a noisy dataset that is labelled without supervision.
For future research, we intend to experiment with
semi-supervised methods like self-learning to lessen
the impact of incorrect labels. This will help us to
cover the examples in our dataset that our approach
could not affect. We also believe data-centric ap-
proaches will improve NLP applications, especially
for low-resource languages like Turkish. Moreover, using a data-centric approach to handle inconsistent data will further help in situations where manual labour is not
feasible. For further work, we aim to develop our
method using Explainable AI approaches to under-
stand which question-answer pair contributed most to the outcome.
REFERENCES
Bhowmick, A. and Hazarika, S. M. (2016). Machine learn-
ing for e-mail spam filtering: Review, techniques and
trends.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
(2017). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computa-
tional Linguistics, 5:135–146.
Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D.
(2020). ELECTRA: Pre-training text encoders as dis-
criminators rather than generators. arXiv preprint
arXiv:2003.10555.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Graves, A. and Schmidhuber, J. (2005). Framewise
phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks,
18(5-6):602–610.
Jiang, Z., Yang, M., Tsirlin, M., Tang, R., Dai, Y., and
Lin, J. (2023). “low-resource” text classification: A
parameter-free classification method with compres-
sors. In Findings of the Association for Computational
Linguistics: ACL 2023, pages 6810–6828.
Jiang, Z.-H., Yu, W., Zhou, D., Chen, Y., Feng, J., and Yan,
S. (2020). ConvBERT: Improving BERT with span-based
dynamic convolution. Advances in Neural Informa-
tion Processing Systems, 33:12837–12848.
Jolliffe, I. T. (1986). Principal Component Analysis.
Springer-Verlag, Berlin; New York.
Kaliyar, R. K., Goswami, A., and Narang, P. (2021). FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimedia Tools and Applications, 80(8):11765–11788.
Larson, S. and Leach, K. (2022). A survey of intent classi-
fication and slot-filling datasets for task-oriented dia-
log.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space.
Munikar, M., Shakya, S., and Shrestha, A. (2019). Fine-
grained sentiment classification using BERT. 2019 Arti-
ficial Intelligence for Transforming Business and So-
ciety (AITB), 1:1–5.
Northcutt, C., Jiang, L., and Chuang, I. (2021). Confident
learning: Estimating uncertainty in dataset labels. J.
Artif. Int. Res., 70:1373–1411.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe:
Global vectors for word representation. In Proceed-
ings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1532–
1543, Doha, Qatar. Association for Computational
Linguistics.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing. Associa-
tion for Computational Linguistics.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).
DistilBERT, a distilled version of BERT: smaller, faster,
cheaper and lighter. arXiv preprint arXiv:1910.01108.
Schweter, S. (2020). BERTurk - BERT models for Turkish.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: a simple way
to prevent neural networks from overfitting. Journal
of Machine Learning Research, 15(1):1929–1958.
Sun, C., Qiu, X., Xu, Y., and Huang, X. (2019). How
to fine-tune bert for text classification? In Sun,
M., Huang, X., Ji, H., Liu, Z., and Liu, Y., editors,
Chinese Computational Linguistics, pages 194–206,
Cham. Springer International Publishing.
Tejwani, R. (2014). Sentiment analysis: A survey.
Zhu, D., Hedderich, M. A., Zhai, F., Adelani, D. I., and
Klakow, D. (2022). Is BERT robust to label noise? A
study on learning with noisy labels in text classifica-
tion. Insights 2022, page 62.