Table 1: Various results for English, German, and Spanish. A longer window of seven sentences seems to yield better results. The unigram-based method outperforms bigrams in Spanish.

Kernel size | N-gram range | Metric      |  Eng  |  Ger  | Span
7           | 1            | accuracy    | 0.485 | 0.438 | 0.461
            |              | correlation | 0.876 | 0.891 | 0.617
7           | 2            | accuracy    | 0.484 | 0.437 | 0.453
            |              | correlation | 0.875 | 0.890 | 0.601
5           | 1            | accuracy    | 0.471 | 0.425 | 0.468
            |              | correlation | 0.863 | 0.890 | 0.690
5           | 2            | accuracy    | 0.481 | 0.431 | 0.450
            |              | correlation | 0.878 | 0.892 | 0.605
3           | 1            | accuracy    | 0.468 | 0.409 | 0.446
            |              | correlation | 0.878 | 0.892 | 0.636
3           | 2            | accuracy    | 0.468 | 0.408 | 0.438
            |              | correlation | 0.878 | 0.891 | 0.616
57-dimensional vectors. We also experiment with different sizes of n-grams (uni- and bigrams). In this way, one can hope to capture more information about the context by treating sequences of more than one word as a single meaningful unit.
Finally, we train a machine learning algorithm using the obtained matrix as input and the labels as the target.
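As a rough illustration of this step (a sketch, not the authors' exact code), the snippet below builds uni- and bigram count features and fits an off-the-shelf classifier. Plain counts stand in for the tf-icf weighting, the toy sentences and labels are invented, and LogisticRegression is an illustrative choice of learner that the text does not commit to here.

    # Sketch of the pipeline: count uni- and bigrams, then fit a classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    sentences = [
        "we will lower taxes for working families",
        "we will invest in public healthcare",
        "taxes on small businesses must fall",
        "the national health service needs funding",
    ]
    labels = ["economy", "welfare", "economy", "welfare"]  # toy Manifesto-style codes

    # ngram_range=(1, 2) counts both unigrams and bigrams, matching the
    # "N-gram range" parameter varied in Tables 1 and 2.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(sentences)  # (n_sentences, n_ngrams) counts

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    print(clf.predict(vectorizer.transform(["funding for healthcare"])))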
3.2 Reproducing Experiments
Merz et al. (2016) also use the supervised version of tf-icf vectorization. In their experiments, the authors fit the final tf-icf matrix on the whole dataset, including both the train and the test parts. They then train the ML algorithm on the train part of the dataset and benchmark it on the test set. Here we first reproduce that experiment with various parameters.
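For concreteness, the sketch below shows one common formulation of the icf weight: a term is scaled by log(C / cf(t)), where C is the number of categories and cf(t) the number of categories in which the term occurs. This is an assumption for illustration; the exact variant used by Merz et al. (2016) may differ.

    # One common tf-icf formulation (a sketch, not necessarily the exact
    # variant of Merz et al. (2016)). Note that fitting these weights on
    # train + test, as in the reproduced experiment, leaks test-set
    # vocabulary into the features.
    import math
    from collections import defaultdict

    def icf_weights(docs, labels):
        term_cats = defaultdict(set)        # term -> categories it appears in
        for text, cat in zip(docs, labels):
            for term in text.split():
                term_cats[term].add(cat)
        C = len(set(labels))                # total number of categories
        # Terms concentrated in few categories get large weights;
        # terms spread over all categories get weight log(C / C) = 0.
        return {t: math.log(C / len(cats)) for t, cats in term_cats.items()}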
Since the data in the Manifesto dataset is historical, it makes sense to train the algorithm on older documents and test the resulting quality on newer ones. Here we use the documents of the most recent year in the dataset as a test set: the year 2017 for German, and 2016 for English and Spanish.
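A minimal sketch of this temporal split, assuming the corpus sits in a pandas DataFrame; the file name and the 'year' column are hypothetical:

    # Hold out the most recent year as the test set.
    import pandas as pd

    df = pd.read_csv("manifesto_sentences.csv")  # hypothetical file

    test_year = df["year"].max()   # e.g. 2017 for German, 2016 otherwise
    train = df[df["year"] < test_year]
    test = df[df["year"] == test_year]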
We use accuracy as a quality metric for our experiments. It is analogous to the agreement level between human coders and makes it possible to compare the classification quality to human annotation. As another quality metric, we use the document-wise Pearson correlation between human-annotated categories and algorithm-annotated ones, proposed in Merz et al. (2016). This metric helps to estimate the similarity of code assignment at the aggregate level.
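The sketch below shows one way to compute this metric. The text does not spell out the aggregation, so pooling per-document category frequencies before correlating is our reading, and all names are illustrative.

    # Document-wise Pearson correlation between human and predicted codes:
    # for each document, compute the relative frequency of every category
    # under both codings, pool the pairs, and correlate.
    import numpy as np
    from scipy.stats import pearsonr

    def category_frequencies(codes, categories):
        counts = np.array([np.sum(codes == c) for c in categories], dtype=float)
        return counts / counts.sum()

    def documentwise_correlation(docs_human, docs_pred, categories):
        human, pred = [], []
        for h, p in zip(docs_human, docs_pred):
            human.extend(category_frequencies(np.asarray(h), categories))
            pred.extend(category_frequencies(np.asarray(p), categories))
        r, _ = pearsonr(human, pred)
        return r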
The results of the experiments are shown in Table 1.
Figure 1 shows scatter plots of the frequencies of all manually assigned categories versus the automatically assigned ones. The plots are drawn for the best-performing models in English, German, and Spanish, respectively.
Table 2: Various results for English, German, and Spanish without out-of-vocabulary words. Bigrams with a longer window kernel demonstrate higher accuracy across all languages.

Kernel size | N-gram range | Metric      |  Eng  |  Ger  | Span
7           | 1            | accuracy    | 0.430 | 0.368 | 0.434
            |              | correlation | 0.866 | 0.878 | 0.604
7           | 2            | accuracy    | 0.430 | 0.368 | 0.435
            |              | correlation | 0.866 | 0.877 | 0.606
5           | 1            | accuracy    | 0.427 | 0.364 | 0.430
            |              | correlation | 0.866 | 0.880 | 0.611
5           | 2            | accuracy    | 0.427 | 0.364 | 0.430
            |              | correlation | 0.867 | 0.880 | 0.611
3           | 1            | accuracy    | 0.416 | 0.345 | 0.418
            |              | correlation | 0.867 | 0.878 | 0.638
3           | 2            | accuracy    | 0.416 | 0.354 | 0.418
            |              | correlation | 0.867 | 0.878 | 0.643
For German and English, the highest agreement with human annotators is achieved when bigrams are included in the tf-icf vocabulary. The accuracy and the correlation score for German texts outperform the state-of-the-art ones (0.42 and 0.88; Merz et al. (2016)). The accuracies for English and Spanish are comparable to the state-of-the-art models.
3.3 Out-of-Vocabulary Words
Due to the supervised nature of the tf-icf algorithm, it is fair to say that in real-life conditions one does not have annotations for new data: it has to be classified as it arrives. That means that the method described above can only be partially reproduced: one cannot build a complete tf-icf matrix that includes every word in the new data, since some of the words may not occur in the training dataset. These out-of-vocabulary words constitute a significant portion of the vocabulary and cannot be ignored. If we use the latest years as the test sets, there are 3485, 3266, and 8018 out-of-vocabulary (O-o-V) words for the German, English, and Spanish datasets, respectively. Table 2 shows that initializing O-o-V words with zeros drastically reduces the quality of the classification.
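The sketch below illustrates, under naive whitespace tokenization, how the O-o-V set can be counted and why zero initialization arises: a vectorizer fitted only on the training data simply ignores unseen tokens, so their feature contributions are zero.

    # Count test-set terms never seen in training (naive tokenization).
    def oov_terms(train_texts, test_texts):
        train_vocab = {t for doc in train_texts for t in doc.split()}
        test_vocab = {t for doc in test_texts for t in doc.split()}
        return test_vocab - train_vocab

    # Zero initialization in practice: a CountVectorizer fitted on the
    # training texts drops any token outside its vocabulary when calling
    # .transform() on test texts, leaving those features at zero.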
One should notice here that without any information on the out-of-vocabulary words, the best accuracy is achieved with a bigger kernel and bigrams. This stands to reason: in the absence of information on new words that were not observed in the training set, the model needs to rely on a broader context to achieve higher accuracy. Table 3 compares the accuracy for the model with a full tf-icf matrix (with