ones. This suggests that smaller, more problem-specific approaches such as feature-based transfer learning may be a promising direction for advancing research on automatic coding.
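To make this suggestion concrete, the following is a minimal sketch of such a feature-based approach, assuming frozen pre-trained word vectors (e.g., fastText) averaged into fixed features for a one-vs-rest classifier. The vectors, texts, and labels below are toy stand-ins for illustration, not our implementation:

# Hedged sketch of feature-based transfer learning for multi-label coding:
# the pre-trained word vectors stay frozen and only a light classifier is
# fitted. All data below are toy stand-ins; in practice the vectors would
# come from pre-trained fastText embeddings and the texts from responses.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["economy", "war", "health", "taxes"]}

def embed(texts, dim=50):
    """Average the pre-trained vectors of known tokens: one fixed feature vector per text."""
    feats = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        hits = [vectors[t] for t in text.lower().split() if t in vectors]
        if hits:
            feats[i] = np.mean(hits, axis=0)
    return feats

texts = ["the economy and taxes", "war abroad", "health and economy", "taxes"]
labels = np.array([[1, 0, 1],   # multi-label: each response may carry
                   [0, 1, 0],   # several codes at once
                   [1, 0, 0],
                   [0, 0, 1]])

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(embed(texts), labels)
print(clf.predict(embed(["economy and war"])))

Because only the classifier head is trained, such a pipeline is cheap to fit on small, domain-specific data sets where fine-tuning large pre-trained models may overfit.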
7 CONCLUSION
In this work, we extended the collection of benchmark data sets commonly used for evaluating transfer learning models in NLP. The full-text data set covers a different task than most existing benchmarks and thus broadens the opportunities for carefully evaluating pre-trained models on a new kind of challenge. Furthermore, we propose a unified preprocessing of the data set together with a fixed train-test split, enabling a valid comparison against our baselines. We evaluated the performance of state-of-the-art transfer learning models on the ANES 2008 data set and compared them to a simple baseline model. Our comparison illustrates that, despite the strong performance of these models on binary, multi-class, and previous multi-label classification tasks, there is still considerable room for improvement on challenging multi-label classification tasks with small to mid-sized data sets.
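As an aside, the following is a minimal sketch of how such a fixed train-test split can be pinned down and persisted so that all models are compared on identical data, assuming pandas and scikit-learn; the file names are hypothetical, not our released artifacts:

# Hedged sketch of fixing and persisting a train-test split so every model
# is evaluated on identical data; file names here are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("anes2008_preprocessed.csv")  # hypothetical preprocessed export
train, test = train_test_split(df, test_size=0.2, random_state=42)  # fixed seed
train.to_csv("train.csv", index=False)  # persist once,
test.to_csv("test.csv", index=False)    # reuse for all baselines

Persisting the split once, rather than re-drawing it with a fixed seed in each experiment, guards against library version changes silently altering the partition.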
ACKNOWLEDGEMENTS
We want to express our sincere gratitude to Christian
Heumann for his guidance and support during the pro-
cess of this research project. We would like to thank
Jon Krosnick and Matt Berent for their insightful ex-
planations via e-mail regarding the Open Ended Cod-
ing Project. This helped us develop a better understanding of the initial data format. A special thanks
also goes to Dallas Card for his explanations regard-
ing the data splits from Card and Smith (2015).
REFERENCES
Adhikari, A., Ram, A., Tang, R., and Lin, J. (2019). DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.
Bird, S., Klein, E., and Loper, E. (2009). Natural Lan-
guage Processing with Python. Safari Books Online.
O’Reilly Media Inc, Sebastopol.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
(2017). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computa-
tional Linguistics, 5:135–146.
Card, D. and Smith, N. A. (2015). Automated coding of
open-ended survey responses.
CESSDA Training Team (2020). CESSDA data management expert guide.
Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., and Dhillon, I. (2019). X-BERT: Extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv:1905.02331.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Gibaja, E. and Ventura, S. (2014). Multi-label learning: a
review of the state of the art and ongoing research.
Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 4(6):411–444.
Gibaja, E. and Ventura, S. (2015). A tutorial on multilabel
learning. ACM Computing Surveys (CSUR), 47(3):1–
38.
Herrera, F., Charte, F., Rivera, A. J., and Del Jesus, M. J.
(2016). Multilabel classification. In Multilabel Clas-
sification, pages 17–31. Springer.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hoyle, L., Vardigan, M., Greenfield, J., Hume, S., Ionescu, S., Iverson, J., Kunze, J., Radler, B., Thomas, W., Weibel, S., and Witt, M. (2016). DDI and enhanced data citation. IASSIST Quarterly, 39(3):30.
Inter-university Consortium for Political and Social Research (ICPSR) (2012). Guide to social science data preparation and archiving: Best practice throughout the data life cycle.
Krosnick, J. A., Lupia, A., and Berent, M. K. (2012). 2008
open ended coding project.
Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. (2017). RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Lee, J.-S. and Hsiang, J. (2019). PatentBERT: Patent classification with fine-tuning a pre-trained BERT model. arXiv preprint arXiv:1906.02124.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Lupia, A. (2018a). Coding open responses. In Vannette,
D. L. and Krosnick, J. A., editors, The Palgrave Hand-
book of Survey Research, pages 473–487. Springer In-
ternational Publishing, Cham.
Lupia, A. (2018b). How to improve coding for open-ended survey data: Lessons from the ANES. In The Palgrave Handbook of Survey Research, pages 121–127. Springer.
Mencia, E. L. and Fürnkranz, J. (2008). Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer.