addressed. In addition, since our method strongly depends on the document filter, filters for other datasets must be developed before the method can be applied universally. We leave these studies as future work.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Numbers 20H04295, 20K20406, and 20K20625.