Table 1: Various results for English, German, and Spanish. A longer window of seven sentences seems to yield better results. The unigram-based method outperforms bigrams in Spanish.

Kernel size | N-gram range | Metric      |  Eng  |  Ger  | Span
7           | 1            | accuracy    | 0.485 | 0.438 | 0.461
            |              | correlation | 0.876 | 0.891 | 0.617
7           | 2            | accuracy    | 0.484 | 0.437 | 0.453
            |              | correlation | 0.875 | 0.890 | 0.601
5           | 1            | accuracy    | 0.471 | 0.425 | 0.468
            |              | correlation | 0.863 | 0.890 | 0.690
5           | 2            | accuracy    | 0.481 | 0.431 | 0.450
            |              | correlation | 0.878 | 0.892 | 0.605
3           | 1            | accuracy    | 0.468 | 0.409 | 0.446
            |              | correlation | 0.878 | 0.892 | 0.636
3           | 2            | accuracy    | 0.468 | 0.408 | 0.438
            |              | correlation | 0.878 | 0.891 | 0.616
57-dimensional vectors. We also experiment with different sizes of n-grams (uni- and bigrams). In this way, one can hope to capture more information about the context by treating sequences of more than one word as a single meaningful unit.
Finally, we train a machine learning algorithm using the obtained matrix as input and the labels as the target.
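As a rough illustration of this step (a sketch, not the authors' exact code), the snippet below builds uni- and bigram count features and fits an off-the-shelf classifier. Plain counts stand in for the tf-icf weighting, the toy sentences and labels are invented, and LogisticRegression is an illustrative choice of learner that the text does not commit to here.

    # Sketch of the pipeline: count uni- and bigrams, then fit a classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    sentences = [
        "we will lower taxes for working families",
        "we will invest in public healthcare",
        "taxes on small businesses must fall",
        "the national health service needs funding",
    ]
    labels = ["economy", "welfare", "economy", "welfare"]  # toy Manifesto-style codes

    # ngram_range=(1, 2) counts both unigrams and bigrams, matching the
    # "N-gram range" parameter varied in Tables 1 and 2.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(sentences)  # (n_sentences, n_ngrams) counts

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    print(clf.predict(vectorizer.transform(["funding for healthcare"])))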
3.2 Reproducing Experiments
Merz et al. (2016) also use the supervised version of tf-icf vectorization. In their experiments, the authors fit the final tf-icf matrix on the whole dataset, including both the train and the test parts. They then train the ML algorithm on the train part of the dataset and benchmark it on the test set. Here we first reproduce that experiment with various parameters.
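For concreteness, the sketch below shows one common formulation of the icf weight: a term is scaled by log(C / cf(t)), where C is the number of categories and cf(t) the number of categories in which the term occurs. This is an assumption for illustration; the exact variant used by Merz et al. (2016) may differ.

    # One common tf-icf formulation (a sketch, not necessarily the exact
    # variant of Merz et al. (2016)). Note that fitting these weights on
    # train + test, as in the reproduced experiment, leaks test-set
    # vocabulary into the features.
    import math
    from collections import defaultdict

    def icf_weights(docs, labels):
        term_cats = defaultdict(set)        # term -> categories it appears in
        for text, cat in zip(docs, labels):
            for term in text.split():
                term_cats[term].add(cat)
        C = len(set(labels))                # total number of categories
        # Terms concentrated in few categories get large weights;
        # terms spread over all categories get weight log(C / C) = 0.
        return {t: math.log(C / len(cats)) for t, cats in term_cats.items()}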
Since the data in the Manifesto dataset is historical, it makes sense to train the algorithm on older documents and test the resulting quality on newer ones. Here we use the documents of the most recent year in the dataset as a test set: the year 2017 for German, and 2016 for English and Spanish.
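A minimal sketch of this temporal split, assuming the corpus sits in a pandas DataFrame; the file name and the 'year' column are hypothetical:

    # Hold out the most recent year as the test set.
    import pandas as pd

    df = pd.read_csv("manifesto_sentences.csv")  # hypothetical file

    test_year = df["year"].max()   # e.g. 2017 for German, 2016 otherwise
    train = df[df["year"] < test_year]
    test = df[df["year"] == test_year]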
We use accuracy as a quality metric for our experiments. It is analogous to the agreement level between human coders and makes it possible to compare the classification quality to human annotation. As another quality metric, we use the document-wise Pearson correlation between human-annotated categories and algorithm-annotated ones, proposed in Merz et al. (2016). This metric helps to estimate the similarity of code assignment at the aggregate level.
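The sketch below shows one way to compute this metric. The text does not spell out the aggregation, so pooling per-document category frequencies before correlating is our reading, and all names are illustrative.

    # Document-wise Pearson correlation between human and predicted codes:
    # for each document, compute the relative frequency of every category
    # under both codings, pool the pairs, and correlate.
    import numpy as np
    from scipy.stats import pearsonr

    def category_frequencies(codes, categories):
        counts = np.array([np.sum(codes == c) for c in categories], dtype=float)
        return counts / counts.sum()

    def documentwise_correlation(docs_human, docs_pred, categories):
        human, pred = [], []
        for h, p in zip(docs_human, docs_pred):
            human.extend(category_frequencies(np.asarray(h), categories))
            pred.extend(category_frequencies(np.asarray(p), categories))
        r, _ = pearsonr(human, pred)
        return r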
The results of the experiments are shown in Table 1.
Figure 1 shows scatter plots of the frequencies of all manually assigned categories versus the automatically assigned ones. The plots are drawn for the best-performing models in English, German, and Spanish, respectively.
Table 2: Various results for English, German, and Spanish without out-of-vocabulary words. Bigrams with a longer window kernel demonstrate higher accuracy across all languages.

Kernel size | N-gram range | Metric      |  Eng  |  Ger  | Span
7           | 1            | accuracy    | 0.430 | 0.368 | 0.434
            |              | correlation | 0.866 | 0.878 | 0.604
7           | 2            | accuracy    | 0.430 | 0.368 | 0.435
            |              | correlation | 0.866 | 0.877 | 0.606
5           | 1            | accuracy    | 0.427 | 0.364 | 0.430
            |              | correlation | 0.866 | 0.880 | 0.611
5           | 2            | accuracy    | 0.427 | 0.364 | 0.430
            |              | correlation | 0.867 | 0.880 | 0.611
3           | 1            | accuracy    | 0.416 | 0.345 | 0.418
            |              | correlation | 0.867 | 0.878 | 0.638
3           | 2            | accuracy    | 0.416 | 0.354 | 0.418
            |              | correlation | 0.867 | 0.878 | 0.643
For German and English, the highest agreement with human annotators is achieved when bigrams are included in the tf-icf vocabulary. The accuracy and the correlation score for German texts outperform the state-of-the-art ones (0.42 and 0.88; Merz et al. (2016)). The accuracies for English and Spanish are comparable to the state-of-the-art models.
3.3 Out-of-Vocabulary Words
Due to the supervised nature of the tf-icf algorithm, it is fair to say that in real-life conditions one does not have annotations for new data: it has to be classified as it arrives. That means that the method described above can only be partially reproduced: one cannot build a complete tf-icf matrix that includes every word in the new data, since some of the words may not occur in the training dataset. These out-of-vocabulary words constitute a significant portion of the vocabulary and cannot be ignored. If we use the latest years as the test sets, there are 3485, 3266, and 8018 out-of-vocabulary (O-o-V) words for the German, English, and Spanish datasets, respectively. Table 2 shows that initializing O-o-V words with zeros drastically reduces the quality of the classification.
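The sketch below illustrates, under naive whitespace tokenization, how the O-o-V set can be counted and why zero initialization arises: a vectorizer fitted only on the training data simply ignores unseen tokens, so their feature contributions are zero.

    # Count test-set terms never seen in training (naive tokenization).
    def oov_terms(train_texts, test_texts):
        train_vocab = {t for doc in train_texts for t in doc.split()}
        test_vocab = {t for doc in test_texts for t in doc.split()}
        return test_vocab - train_vocab

    # Zero initialization in practice: a CountVectorizer fitted on the
    # training texts drops any token outside its vocabulary when calling
    # .transform() on test texts, leaving those features at zero.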
One should notice here that without any information on the out-of-vocabulary words, the best accuracy is achieved with a bigger kernel and bigrams. This stands to reason: in the absence of information on new words that were not observed in the training set, the model needs to rely on a broader context to achieve higher accuracy. Table 3 compares the accuracy for the model with a full tf-icf matrix (with