Using this method, we scored better than the deep learning model of the challenge's winning team.
We performed a feature analysis by removing one feature at a time, as well as groups of features. Every removal led to a moderate performance drop. The most significant drop occurred when the BoW feature was removed. This feature contains uni-grams and bi-grams extracted from the article head and article tail. As discussed, both parts either introduce or summarize arguments and are therefore likely to capture what is said in the headline. Overall, every feature plays a role in the classification. We showed that some features contribute to the first step (distinguishing between related and unrelated pairs), while others help discriminate between the agree, disagree and discuss classes.
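The leave-one-group-out ablation described above can be sketched as follows. This is a minimal illustration using scikit-learn with toy data and hypothetical feature-group names ("BoW", "overlap", "polarity"); the paper's actual experiments use WEKA-based classifiers and its own feature set.

```python
# Leave-one-group-out feature ablation: a minimal sketch (toy data,
# hypothetical feature groups; not the paper's actual WEKA pipeline).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy data: 200 instances, feature groups laid out as column ranges.
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)  # agree / disagree / discuss / unrelated
groups = {"BoW": slice(0, 4), "overlap": slice(4, 7), "polarity": slice(7, 10)}

def score(X, y):
    """Mean 5-fold cross-validated accuracy of a random forest."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()

baseline = score(X, y)
for name, cols in groups.items():
    # Drop the columns belonging to one feature group and re-score.
    keep = [i for i in range(X.shape[1])
            if i not in range(*cols.indices(X.shape[1]))]
    ablated = score(X[:, keep], y)
    print(f"without {name}: {ablated:.3f} (delta {ablated - baseline:+.3f})")
```

A large negative delta for a group indicates that the classifier depends heavily on it, which is how the importance of the BoW feature was identified.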
Our immediate future work will be to use stance to perform judgments about fake news. We will investigate how stance can be integrated into fake news classification.
The Fake News Challenge: Stance Detection using Traditional Machine Learning Approaches