Mixed Reference Interpretation in Multi-turn Conversation

Nanase Otake, Shoya Matsumori, Yosuke Fukuchi, Yusuke Takimoto and Michita Imai

Faculty of Science and Technology, Keio University, Japan

Keywords:

Conversation Context, Demonstrative Word Interpretation, Contextual Reference, Situational Reference,

Mixed Reference Interpretation.

Abstract:

Contextual reference refers to the mention of matters or topics that appear in the conversation, and situational

reference to the mention of objects or events that exist around the conversation participants. In conventional

utterance processing, the system deals with either contextual or situational reference in a dialogue. However, in

order to achieve meaningful communication between people and the system in the real world, the system needs

to consider Mixed Reference Interpretation (MRI) problem, that is, handling both types of reference in an

integrated manner. In this paper, we propose DICONS, a method that sequentially estimates an interpretation

of utterances from interpretation candidates derived from both contextual reference and situational reference

in a dialogue. In an experiment in which DICONS handled this task with both contextual and situational

references, we found that it could properly judge which type of reference had occurred. We also found that

the referent of the demonstrative word in each context and situation could be properly estimated.

1 INTRODUCTION

In human conversation, both contextual reference and

situational reference are intermingled. Contextual ref-

erence is to refer matters or topics that has mentioned

in the conversation. Situational reference, on the other

hand, is to mention objects or events that exist around

the conversation participants. These two types of ref-

erence are illustrated in Fig. 1, which shows exam-

ple scenarios in which a person and a robot are shop-

ping at a supermarket. In the scene on the left, where

the person asks “Which one is better for dinner, ﬁsh

or meat?” and the robot answers “Fish”, contextual

reference occurs because the robot selects the word

“ﬁsh” as the answer to the question based on the con-

tent of the human utterance. In the scene on the right,

where ﬁsh and meat are displayed on the shelf and

the person standing in front of it asks “Which one is

better?” and the robot answers “Fish”, situational ref-

erence occurs because the robot answers the question

based on the information about objects in the envi-

ronment. As shown in this example, in real-world di-

alogue, both contextual reference and situational ref-

erence may occur in the same situation.

A common problem with previous methods is that

they deal with either contextual or situational refer-

ence in dialogue, not both. In other words, the system

is designed assuming a situation where only context

reference occurs, as represented by a chatbot, or only

Figure 1: Example of contextual reference and situational

reference. We call the task in which speakers need to deal

with both references Mixed Reference Interpretation (MRI).

situational reference, as represented by an instruction

task to the robot. However, in order to deal with more

realistic situations in the real world, a system needs

to handle the Mixed Reference Interpretation (MRI)

problem, that is, both types of reference coexist, and

we cannot determine in advance which one will occur.

Another problem is that previous methods do not deal

with the “sequentiality of dialogue”, where the con-

text gradually develops according to the utterance of

each speaker. In order to enable a dialog system and a

person to dynamically understand each other’s utter-

Otake, N., Matsumori, S., Fukuchi, Y., Takimoto, Y. and Imai, M.

Mixed Reference Interpretation in Multi-turn Conversation.

DOI: 10.5220/0010260103210328

In Proceedings of the 13th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2021) - Volume 1, pages 321-328

ISBN: 978-989-758-484-8

321

ances, it is necessary, in ambiguous states in which

contextual reference or situational reference occur,

to decide which one it is by estimating sequentially

which reference the current utterance corresponds to.

In this paper, we focus on the MRI problem in

multi-turn conversation processing. We propose Dy-

namic and Incremental Interpretation of Contextual

and Situational References in Conversational Dia-

logues (DICONS), a method that sequentially esti-

mates an interpretation of utterances from interpreta-

tion candidates derived from contextual reference and

situational reference in a dialogue. DICONS simulta-

neously performs a probabilistic search of interpreta-

tion candidates considering the contextual reference

and considering the situational reference in a multi-

turn conversation, compares each interpretation, and

then estimates which type of reference it is. In this

way, even in a conversation where the reference type

is not clear, DICONS can estimate the reference type

and the referent.

This paper is structured as follows. In Chapter 2,

we provide an overview of previous works on contex-

tual and situational references and describe SCAIN

(Takimoto et al., 2020), which is the base algorithm

of DICONS. Chapter 3 presents DICONS in detail.

In Chapter 4, we explain the experiments conducted

in this study and report the results. Chapter 5 de-

scribes the future directions of DICONS. We con-

clude in Chapter 6 with a brief summary.

2 RELATED WORK

2.1 Contextual Reference

In the ﬁeld of natural language processing, corefer-

ence resolution is one of the tasks that deal with con-

textual reference. This task is the process of esti-

mating the target of pronouns or demonstratives and

complementing a zero pronoun, which is the omitted

noun phrase. There are two basic methods to deal

with coreference resolution: a rule-based method that

extracts coreference relations based on the heuristic

rules (Lee et al., 2011), and a method that uses ma-

chine learning (Clark et al., 2019; Joshi et al., 2019;

Lee et al., 2018; Joshi et al., 2020).

The major problem with previous methods is that

they do not consider situational reference. That is,

they cannot handle a case in which the object that a

pronoun refers to exists in visual information. Fur-

thermore, whole sentences are required to resolve

coreferences, and the sequentiality of dialogue is not

considered. As stated earlier, the term “sequentiality

of dialogue” refers to how the context gradually de-

velops according to the utterance of each speaker. In

order for the system to understand the utterance and

generate a reply in a multi-turn conversation, it needs

to dynamically consider possible reference candidates

from the history of dialogue. Estimating the reference

candidate will inevitably involve uncertainty. Further-

more, a person might make an additional utterance

after the estimation has started, which is also an im-

portant hint for estimating the reference candidates,

so the system has to continuously look back on the

dialogue and reinterpret the reference candidate to re-

duce the uncertainty. In order to achieve meaningful

communication between people and a system in the

real world, it is necessary for the system to be able to

consider the sequentiality of dialogue.

2.2 Situational Reference

A referring expression (RE) is any noun phrase, or

surrogate for a noun phrase, whose function in dis-

course is to identify some individual object. In hu-

man conversations, humans interpret referring expres-

sions using language, gesture, and context, fusing in-

formation from multiple modalities over time. Inter-

preting Multimodal Referring Expressions (Whitney

et al., 2016) and Interactive Picking System (Hatori

et al., 2018) deal with the RE problem, which is a

task to identify objects in an image that correspond to

ambiguous instructions.

Visual dialog (VisDial) (Das et al., 2017) is a task

which requires a dialog agent to answer a series of

questions grounded in an image. Attention-based ap-

proaches were primarily proposed to address these

challenges, including Dual Attention Networks (Kang

et al., 2019) and Light-weight Transformer for Many

Inputs (Nguyen et al., 2020). VisDial dataset is a stan-

dard dataset for evaluating methods dealing with vi-

sual dialog. This dataset has two components: image

and dialogue history about the image. A dialogue his-

tory is a set of successive question-answer pairs.

This paper focuses on more realistic dialogues

than previous works. The setting in RE problem as-

sumes that the referent must exist in the given im-

age. In other words, the setting considers only situa-

tional references and excludes contextual references.

VisDial dataset has also several disadvantages. First,

the dialogue history consists of only questions about

an image and answers for them, but realistic dia-

logue does not necessarily consist of such question-

answering pairs. For example, one may give his/her

thoughts about surrounding objects or opinions on

others’ utterance, so it is not natural to ask questions

only. Second, all the questions in a dialogue history

ICAART 2021 - 13th International Conference on Agents and Artiﬁcial Intelligence

322

Figure 2: Flow chart of DICONS.

are about the things in a corresponding image. This

means that no contextual references are taken into ac-

count. In a realistic situation, the situation where ei-

ther contextual reference or situational reference oc-

curs is limited. Since it is usually unknown which

reference will occur, it is desirable that one system

can handle both references.

Therefore, this paper deals with situations that

include both contextual and situational references,

which are closer to realistic situations. Since there

was no dataset to deal with this setup, we prepared a

few samples to evaluate our system on a trial basis.

2.3 SCAIN

SCAIN (Takimoto et al., 2020) is an algorithm that

dynamically estimates the context and interpretations

of words in a conversation. SCAIN achieves paral-

lel context estimation while being able to obtain new

information throughout the sequential input of a con-

versation text. As such, SCAIN can deal with the se-

quentiality of dialogue in dynamic resolution of coref-

erence.

SCAIN evolved from FastSLAM (Montemerlo

et al., 2002), an algorithm for a mobile robot that

statistically resolves the interdependence between the

robot’s self-position and a map. SCAIN replaces self-

position with a context and the map with a word inter-

pretation space to apply FastSLAM to the interdepen-

dence between context and interpretation in a multi-

turn conversation. The Kalman ﬁlter and a particle

ﬁlter are the key mechanisms brought over from Fast-

SLAM.

In SCAIN, the following processes are performed

to sequentially interpret a dialogue and estimate the

context. The correspondences between the SLAM

and SCAIN variables are listed in Table 1.

SCAIN consists of 3 steps. First is update self-

position, second is update landmark points, and third

is resampling of particles. In step 1, SCAIN applies

Table 1: Correspondence of random variables between con-

ventional SLAM and SCAIN.

Random variable Conventional SLAM SCAIN

x Self-position Context

m Environment map Interpretation domain

u Control Own utterance

z Observation Other’s utterance

the interpreter’s own utterance u to the context x in

each particle. In step 2, SCAIN interprets the other’s

utterance z on the interpretation domain m and up-

dates m. In step 3, SCAIN reﬂects the relationship

between z and x. Then, SCAIN leaves valid interpre-

tations by resampling the particles.

3 DICONS

We propose Dynamic and Incremental Interpretation

of Contextual and Situational References in Conver-

sational Dialogues (DICONS) as a method for dy-

namically estimating context and word interpretation

in a multi-turn conversation, considering both contex-

tual information and visual information. With DI-

CONS, it is possible to sequentially solve the MRI

problem; it can determine whether the reference type

in the dialogue is a contextual reference or a situa-

tional reference and estimate the referent of a demon-

strative word.

At the timing when the demonstrative word ap-

pears in the conversation, DICONS starts to inter-

pret the demonstrative word based on visual and con-

textual information. DICONS handles two types of

particle: one for contextual reference and the other

for situational reference, where both types of parti-

cle can handle conversational sentences sequentially.

The difference between the two is that particles for

contextual reference handle only utterance informa-

tion whereas particles for situational reference handle

both utterance information and images showing visual

Mixed Reference Interpretation in Multi-turn Conversation

323

information of the space in which the conversation is

occurring. The input image is used to obtain the can-

didates of a demonstrative word.

Candidates of a demonstrative word include two

types: one that can be obtained from a conversation

history Y

text

= {y

1(text)

, y

2(text)

, ...} and one that can

be obtained from an image Y

image

= {y

1(image)

2(image)

, ...}. Here, each element of Y

text

and Y

image

is a distributed representation (Mikolov et al., 2013)

of each interpretation candidate. The particles for

contextual reference treat the words extracted from

the past conversation history as the interpretation can-

didates. The candidates from situational reference are

assumed to appear in the input image, and multiple

labels extracted by object detection (Redmon et al.,

2016) are treated as the candidates from the situa-

tional reference.

In a conversational sentence, the utterances after

the appearance of the demonstrative word are input to

each particle as the other’s utterance. Following the

steps 1–3 of SCAIN, the context and the distributed

representations of the words, which are included in

the interpretation domain, are updated. In detail, the

update method of context is different from SCAIN.

In step 1 of SCAIN, the number of particles is in-

creased by randomly dividing the context of one par-

ticle into multiple directions, and it achieves an efﬁ-

cient context search in the distributed representation

space. The update expression for the context x

k·n+i

t+1

the k-th particle at time t is represented, for i in the

range from 1 to n of Y, as

k·n+i

t+1

= (1 − λ

+ λ

+ σ

, (1)

where n is the number of interpretation candidates in

Y, Y = (y

, . . . , y

), Y = (Y

text

or Y

image

is a hyper-parameter, σ

is random Gaussian noise,

and v

is the weighted average of the distributed rep-

resentation of the words composing the utterance u.

However, in DICONS, the number of particles is

increased by dividing the context into the directions of

the distributed representations of the candidates han-

dled by the particles. This is represented as

k·n+i

t+1

= (1 − λ

+ λ

+ αy

, (2)

where α is a hyper-parameter.

We do this so as to achieve a more efﬁcient search

in the distributed representation space than the ran-

dom method can by assuming that the context is

related to the candidates, as the candidates of the

demonstrative word are limited in the task of estimat-

ing the referent. Then, as a result of the resampling

in step 3 of SCAIN, the weight of each particle w

is calculated based on the context and the interpreta-

tion domain, and the particle with the largest weight

is selected as the optimum interpretation. The system

can judge whether the contextual information or the

visual information is dominant based on whether the

best particle is derived from the context or the situa-

tion.

Next, the system decides the referent of the

demonstrative word. This is done by comparing the

cosine similarity between the target T GT of the best

particle obtained by resampling and each candidate,

and the candidate E with the maximum cosine sim-

ilarity is selected as the referent. Here, T GT is the

tracker of the possible interpretation candidates of a

demonstrative word included in interpretation domain

m. The above process can be represented by Eq. 3 be-

low.

E = argmax

∈{Y

text

or Y

image

}

{cosine similarity(y

, TGT)}

(3)

Note that cosine similarity(x, y) =

|x|

|y|

, where x, y

are vectors.

4 CASE STUDY

4.1 Evaluation Method

To evaluate whether the proposed DICONS can solve

the MRI problem in a sequential way, we conducted a

case study of estimating the referent of demonstrative

words. The purpose of the experiment is to consider

whether it is possible to 1) determine the reference

type of the demonstrative word and 2) estimate the

referent.

In this experiment, two types of conversation exam-

ples with different difﬁculties were designed. Con-

versation example 1 is (Tables 2 and 3) an example in

which the candidates for contextual reference and the

candidates for situational reference are quite different

such as fruits and birds. Conversation example 2 (Ta-

bles 6 and 7) is an example in which the candidates

for contextual reference and the candidates for situa-

tional reference are similar such as fruits and fruits, so

it is assumed to be more difﬁcult to distinguish than

conversation example 1.

Conversation examples are dialogues between two

speakers: “A” and “B”. Here, “A” plays a role of pro-

moting the utterance, while “B” provides information

useful for estimating the referent. In our case study,

our system observes the utterances of “B” and esti-

mates the referent. Interpretation candidates and the

correct referent of a demonstrative word are shown in

Tables 4 and 5 for conversation example 1 and Ta-

bles 8 and 9 for conversation example 2. The demon-

ICAART 2021 - 13th International Conference on Agents and Artiﬁcial Intelligence

324

Table 2: Conversation example 1.1. The ground truth of the

demonstrative word is as follows. Reference type: Contex-

tual. Referent: Mango.

Speaker Utterance content

“Bananas and pineapples are grown

here. This is the ﬁrst time

I’ve actually seen any of them.

Oh, mangoes are eaten by birds.

I like mangoes.”

“I like them, too.”

(∗timing to get interpretation

candidates)

“Really?”

B (input tx1a)

“They are sweet and delicious,

aren’t they?”

“Yes, they are.”

B (input tx2a)

“There are various desserts made

from them. I like eating them

as jelly.”

Table 3: Conversation example 1.2. The ground truth of the

demonstrative word is as follows. Reference type: Situa-

tional. Referent: Crow.

Speaker Utterance content

“Bananas, pineapples, and

mangoes are grown here.

This is the ﬁrst time I’ve

actually seen any of them.”

“This botanical garden also has

many kinds of birds.

There are birds everywhere.”

“I like that bird.”

(∗timing to get interpretation

candidates)

B (input vis1a)

“Are you talking about the black

one?”

“Yes.”

B (input vis2a)

“Well, I think it’s cool in shape

and color, but I don’t like it

because it digs in the trash

in the city.”

strative words in the conversation examples are inter-

preted following the algorithm in Sec. 3.

4.2 Conversation Example 1:

Easy-to-Identify Examples

4.2.1 Evaluation Results

In each conversation example, the change in cosine

similarity between the target and each interpretation

candidate in the best particle and the change of ref-

erence type of the best particle are shown in Tables

4 and 5. Input tx1a, tx2a in Table 4 and input vis1a,

vis2a in Table 5 correspond to the sentences shown in

the Tables 2 and 3. The cosine similarity in Tables 4

Table 4: Cosine similarity between each candidate and tar-

get of best particle (excluding initial position) and estima-

tion result in conversation example 1.1. ∗ indicates the

ground truth.

Candidates / Input Initial position input tx1a input tx2a

Cosine similarity of contextual candidates

mango∗ 0.5000 0.9973 0.9722

banana 0.5000 0.7082 0.8478

pineapple 0.5000 0.8583 0.8439

Cosine similarity of situational candidates

crow 0.5000 – –

pigeon 0.5000 – –

sparrow 0.5000 – –

Estimation result

Reference type – contextual contextual

Estimation – mango mango

Table 5: Cosine similarity between each candidate and tar-

get of best particle (excluding initial position) and estima-

tion result in conversation example 1.2. ∗ indicates the

ground truth.

Candidates / Input Initial position input vis1a input vis2a

Cosine similarity of contextual candidates

mango 0.5000 – –

banana 0.5000 – –

pineapple 0.5000 – –

Cosine similarity of situational candidates

crow∗ 0.5000 0.9771 0.9256

pigeon 0.5000 0.4601 0.4334

sparrow 0.5000 0.4112 0.7026

Estimation result

Reference type – situational situational

Estimation – crow crow

and 5 refers to the initial position and the best particle

calculated after each input. Here, regarding the ini-

tial position, since there is no difference in the weight

between particles when there is no input, the conﬁ-

dence is the same for all interpretation candidates.

The cosine similarity is regarded as equal to the conﬁ-

dence, and 0.5 is set as the initial position value. Note

that 0.5 indicates the ambiguous state where it is un-

known if each interpretation candidate is a referent or

not. The best particle is one particle with the highest

weight. Cosine similarities were calculated for either

contextual or situational interpretation candidates be-

cause one particle considers only either ones. There-

fore, there are some empty spaces (–) in Tables 4 and

In addition, in each conversation example, in order

to analyze whether the reference types are correctly

distinguished, we visualized the ratio of the reference

types in the top-30 particles, which is 10% of the total

number of particles, and the distribution of the parti-

cle weights. The change in the ratio of the reference

type (context or image) in the top-30 particles in de-

scending order of weight is shown in Figs. 3 and 4.

The change in the distribution of particle weights is

Mixed Reference Interpretation in Multi-turn Conversation

325

Figure 3: Change in ratio of reference type in top-30 parti-

cles in conversation example 1.1. No particles derive from

the situational reference in the top-30 particles.

Figure 4: Change in ratio of reference type in top-30 parti-

cles in conversation example 1.2. No particles derive from

the contextual reference in the top-30 particles.

shown in Figs. 5 and 6.

4.2.2 Discussion

In conversation example 1.1, the column of input tx2a

in Table 4 has the highest cosine similarity. In con-

versation example 1.2, the column of input vis2a in

Table 5 has the highest cosine similarity. Therefore,

the DICONS estimation of conversation example 1.1

is “mango”, which comes from the context, and in

conversation example 1.2 it is “crow”, which comes

from the input image. In both cases, the referent of

the demonstrative word was correctly obtained.

From Figs. 3 and 4, there are only particles with

the correct reference type in the top-30 particles. In

both conversation examples, from Figs. 5 and 6, the

weight distributions of the particles were clearly sep-

arated by the reference type in the progress of the di-

alogue. Therefore, our system can clearly distinguish

the reference types.

4.3 Conversation Example 2:

Difﬁcult-to-Identify Examples

4.3.1 Evaluation Results

In each conversation example, the change in cosine

similarity between the target and each interpretation

candidate in the best particle and the change of refer-

ence type of the best particle are shown in Tables 8

and 9. Input tx1b, tx2b in Table 8 and input vis1b,

vis2b in Table 9 correspond to the sentences shown in

the Tables 6 and 7.

Figure 5: Change in distribution of particle weights in con-

versation example 1.1. The vertical line shows the fre-

quency (that is, the number of particles per weight) and the

horizontal one the weight of particles.

Figure 6: Change in distribution of particle weights in con-

versation example 1.2.

In addition, in each conversation example, the change

in the ratio of the reference type (context or image) in

the top-30 particles in descending order of weight is

shown in Figs. 7 and 8. The change in the distribution

of particle weights is shown in Figs. 9 and 10.

4.3.2 Discussion

The DICONS estimation of conversation example 2.1

is “lemon”, which comes from the context, and in

conversation example 2.2 it is “banana”, which comes

from the input image. In both cases, the referent of the

demonstrative word was correctly obtained.

From Figs. 7 and 8, the correct reference type be-

come dominant due to the sequential update. In the

conversation example 2.1, from Fig. 9, the weight

distribution of the situational reference particles is

smaller and that of the contextual reference parti-

cles is larger in the progress of the dialogue from

input tx1b to input tx2b. In the conversation exam-

ple 2.2, from Fig. 10, the weight distribution of the

contextual reference particles moves toward a small

direction in the progress of the dialogue from in-

put vis1b to input vis2b. Therefore, the estimation

correctly changes even when focusing on all particles.

5 FUTURE WORK

This paper has some future works. For example, use

of other visual information contained in the image

added to the labels, evaluation on dataset or interac-

ICAART 2021 - 13th International Conference on Agents and Artiﬁcial Intelligence

326

Table 6: Conversation example 2.1. The ground truth of the

demonstrative word is as follows. Reference type: Contex-

tual. Referent: Lemon.

Speaker Utterance content

“I have a cold. Do you know any

fruit that is good for colds?”

“How about kiwi or pineapple?”

“Well, anything else?”

“How about lemon?”

“Oh, it is certainly effective.”

(∗timing to get interpretation

candidates)

B (input tx1b)

“Yeah, it is rich in vitamin C and

citric acid, so I think it is effective.”

“OK, I must buy it.”

B (input tx2b)

“Yeah, but, It is sour and hard to

eat, so I recommend eating it

with other foods.”

Table 7: Conversation example 2.2. The ground truth of the

demonstrative word is as follows. Reference type: Situa-

tional. Referent: Banana.

Speaker Utterance content

“I have a cold. Do you know any

fruit that is good for colds?”

“How about lemon, kiwi, or

pineapple?”

“Good. What do you think of

that fruit?”

(∗timing to get interpretation

candidates)

B (input vis1b)

“It has a lot of sugar and changes

quickly to energy.”

“So, do you think it is effective

for colds?”

B (input vis2b)

“Yes. Better with bread or cereal.”

tion experiments with humans, and the well-designed

method to obtain the contextual reference candidates.

6 CONCLUSIONS

The biggest problem with previous methods for

demonstrative word interpretation is that they cannot

deal with MRI problem. Our experimental results

demonstrate that DICONS can handle the task with

MRI problem in a multi-turn conversation.

ACKNOWLEDGEMENTS

This work was supported by JST CREST Grant Num-

ber JPMJCR19A1, Japan.

Table 8: Cosine similarity between each candidate and tar-

get of best particle (excluding initial position) and estima-

tion result in conversation example 2.1. ∗ indicates the

ground truth.

Candidates / Input Initial position input tx1b input tx2b

Cosine similarity of contextual candidates

lemon∗ 0.5000 0.9931 0.9559

kiwi 0.5000 0.4078 0.6015

pineapple 0.5000 0.6717 0.6891

Cosine similarity of situational candidates

banana 0.5000 – –

apple 0.5000 – –

orange 0.5000 – –

Estimation result

Reference type – contextual contextual

Estimation – lemon lemon

Table 9: Cosine similarity between each candidate and tar-

get of best particle (excluding initial position) and estima-

tion result in conversation example 2.2. ∗ indicates the

ground truth.

Candidates / Input Initial position input vis1b input vis2b

Cosine similarity of contextual candidates

lemon 0.5000 – –

kiwi 0.5000 – –

pineapple 0.5000 – –

Cosine similarity of situational candidates

banana∗ 0.5000 0.9760 0.9385

apple 0.5000 0.5311 0.6099

orange 0.5000 0.5999 0.7408

Estimation result

Reference type – situational situational

Estimation – banana banana

Figure 7: Change in ratio of reference type in top-30 parti-

cles in conversation example 2.1.

Figure 8: Change in ratio of reference type in top-30 parti-

cles in conversation example 2.2.

Mixed Reference Interpretation in Multi-turn Conversation

327

Figure 9: Change in distribution of particle weights in con-

versation example 2.1.

Figure 10: Change in distribution of particle weights in con-

versation example 2.2.

REFERENCES

Clark, K., Khandelwal, U., Levy, O., and Manning, C. D.

(2019). What does BERT look at? an analysis of

BERT’s attention. arXiv preprint arXiv:1906.04341.

Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura,

J. M., Parikh, D., and Batra, D. (2017). Visual dialog.

In Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, pages 326–335.

Hatori, J., Kikuchi, Y., Kobayashi, S., Takahashi, K.,

Tsuboi, Y., Unno, Y., Ko, W., and Tan, J. (2018).

Interactively picking real-world objects with uncon-

strained spoken language instructions. In 2018 IEEE

International Conference on Robotics and Automation

(ICRA), pages 3774–3781. IEEE.

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer,

L., and Levy, O. (2020). SpanBERT: Improving pre-

training by representing and predicting spans. Trans-

actions of the Association for Computational Linguis-

tics, 8:64–77.

Joshi, M., Levy, O., Weld, D. S., and Zettlemoyer, L.

(2019). BERT for coreference resolution: Baselines

and analysis. arXiv preprint arXiv:1908.09091.

Kang, G.-C., Lim, J., and Zhang, B.-T. (2019). Dual at-

tention networks for visual reference resolution in vi-

sual dialog. In Proceedings of the 2019 Conference on

Empirical Methods in Natural Language Processing

and the 9th International Joint Conference on Natural

Language Processing, pages 2024–2033.

Lee, H., Peirsman, Y., Chang, A., Chambers, N., Sur-

deanu, M., and Jurafsky, D. (2011). Stanford’s multi-

pass sieve coreference resolution system at the conll-

2011 shared task. In Proceedings of the ﬁfteenth con-

ference on computational natural language learning:

Shared task, pages 28–34. Association for Computa-

tional Linguistics.

Lee, K., He, L., and Zettlemoyer, L. (2018). Higher-order

coreference resolution with coarse-to-ﬁne inference.

arXiv preprint arXiv:1804.05392.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and

Dean, J. (2013). Distributed representations of words

and phrases and their compositionality. In Advances in

neural information processing systems, pages 3111–

3119.

Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., et al.

(2002). FastSLAM: A factored solution to the simulta-

neous localization and mapping problem. AAAI/IAAI,

593598.

Nguyen, V.-Q., Suganuma, M., and Okatani, T. (2020). Ef-

ﬁcient attention mechanism for visual dialog that can

handle all the interactions between multiple inputs.

In Proceedings of the 16th European Conference on

Computer Vision.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.

(2016). You only look once: Uniﬁed, real-time object

detection. In Proceedings of the IEEE conference on

computer vision and pattern recognition, pages 779–

788.

Takimoto, Y., Fukuchi, Y., Matsumori, S., and Imai, M.

(2020). SLAM-inspired simultaneous contextualiza-

tion and interpreting for incremental conversation sen-

tences. arXiv preprint arXiv:2005.14662.

Whitney, D., Eldon, M., Oberlin, J., and Tellex, S. (2016).

Interpreting multimodal referring expressions in real

time. In 2016 IEEE International Conference on

Robotics and Automation (ICRA), pages 3331–3338.

IEEE.

ICAART 2021 - 13th International Conference on Agents and Artiﬁcial Intelligence

328