ances, it is necessary, in ambiguous states where either
a contextual or a situational reference may occur,
to decide which one applies by sequentially estimating
which kind of reference the current utterance makes.
In this paper, we focus on the MRI problem in
multi-turn conversation processing. We propose
Dynamic and Incremental Interpretation of Contextual
and Situational References in Conversational
Dialogues (DICONS), a method that sequentially
estimates the interpretation of an utterance from
interpretation candidates derived from contextual
reference and situational reference in a dialogue.
DICONS simultaneously performs a probabilistic search
of interpretation candidates under contextual
reference and another under situational reference in
a multi-turn conversation, compares the resulting
interpretations, and then estimates which type of
reference the utterance uses. In this
way, even in a conversation where the reference type
is not clear, DICONS can estimate the reference type
and the referent.
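The comparison step described above can be illustrated with a minimal sketch. It is not the DICONS algorithm itself: the names `Interpretation`, `best_candidate`, and `choose_interpretation` are hypothetical, and the scores are made-up numbers standing in for whatever probabilistic search produces them. The sketch only shows the idea of searching under both reference types and keeping the better-scoring interpretation.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    reference_type: str  # "contextual" or "situational"
    referent: str        # best referent found under that assumption
    score: float         # hypothetical probability-like score


def best_candidate(reference_type, candidates):
    """Pick the highest-scoring referent under one reference-type assumption."""
    referent, score = max(candidates.items(), key=lambda kv: kv[1])
    return Interpretation(reference_type, referent, score)


def choose_interpretation(contextual_candidates, situational_candidates):
    """Search both candidate sets, then keep the better-scoring interpretation."""
    contextual = best_candidate("contextual", contextual_candidates)
    situational = best_candidate("situational", situational_candidates)
    return max((contextual, situational), key=lambda i: i.score)


# Toy scores for an utterance such as "Can you pass me that?" (assumed values).
contextual_scores = {"the book mentioned earlier": 0.35,
                     "the pen mentioned earlier": 0.10}
situational_scores = {"cup on the table": 0.55,
                      "phone on the desk": 0.20}

result = choose_interpretation(contextual_scores, situational_scores)
print(result.reference_type, "->", result.referent)
```

In this toy case the situational candidate scores highest, so both the reference type and the referent are read off the same winning interpretation.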
This paper is structured as follows. In Section 2,
we review previous work on contextual and situational
references and describe SCAIN (Takimoto et al., 2020),
which is the base algorithm of DICONS. Section 3
presents DICONS in detail. In Section 4, we explain
the experiments conducted in this study and report
the results. Section 5 describes future directions
for DICONS. We conclude in Section 6 with a brief
summary.
2 RELATED WORK
2.1 Contextual Reference
In the field of natural language processing,
coreference resolution is one of the tasks that deal
with contextual reference. The task consists of
estimating the referents of pronouns or demonstratives
and recovering zero pronouns, that is, omitted noun
phrases. There are two basic approaches to coreference
resolution: a rule-based method that extracts
coreference relations using heuristic rules (Lee et
al., 2011), and methods that use machine learning
(Clark et al., 2019; Joshi et al., 2019; Lee et al.,
2018; Joshi et al., 2020).
The major problem with previous methods is that
they do not consider situational reference. That is,
they cannot handle cases in which the object that a
pronoun refers to is present in visual information.
Furthermore, they require whole sentences to resolve
coreferences and do not consider the sequentiality of
dialogue. As stated earlier, the term “sequentiality
of dialogue” refers to how the context gradually
develops with the utterances of each speaker. In
order for the system to understand the utterance and
generate a reply in a multi-turn conversation, it needs
to dynamically consider possible reference candidates
from the dialogue history. Estimating the reference
candidate inevitably involves uncertainty. Furthermore,
a person might make an additional utterance after the
estimation has started, and that utterance is also an
important hint for estimating the reference candidates.
The system therefore has to continuously look back on
the dialogue and reinterpret the reference candidate to
reduce the uncertainty. In order to achieve meaningful
communication between people and a system in the
real world, it is necessary for the system to be able to
consider the sequentiality of dialogue.
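One way to picture this incremental reinterpretation is as a belief over reference candidates that is revised after every new utterance. The sketch below is an assumption-laden illustration, not a method taken from the works cited here: it uses a simple Bayesian-style update, and the candidate names, the likelihood values, and the helpers `normalize` and `update_belief` are all hypothetical.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def update_belief(belief, likelihood):
    """Multiply the current belief by the likelihood of the new utterance
    under each candidate, then renormalize."""
    return normalize({c: belief[c] * likelihood.get(c, 1e-6) for c in belief})


# Start with a uniform belief over three hypothetical reference candidates.
belief = normalize({"book": 1.0, "cup": 1.0, "phone": 1.0})

# Assumed likelihoods of two successive utterances under each candidate.
utterance_likelihoods = [
    {"book": 0.5, "cup": 0.3, "phone": 0.2},  # first, ambiguous utterance
    {"book": 0.1, "cup": 0.8, "phone": 0.1},  # later utterance adds a hint
]

history = []  # best candidate after each utterance
for likelihood in utterance_likelihoods:
    belief = update_belief(belief, likelihood)
    history.append(max(belief, key=belief.get))

print(history)
```

With these toy numbers the best candidate changes from the first utterance to the second, which is the kind of revision that a whole-sentence, one-shot resolver cannot make.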
2.2 Situational Reference
A referring expression (RE) is any noun phrase, or
surrogate for a noun phrase, whose function in
discourse is to identify some individual object. In
conversation, humans interpret referring expressions
using language, gesture, and context, fusing
information from multiple modalities over time.
Interpreting Multimodal Referring Expressions (Whitney
et al., 2016) and Interactive Picking System (Hatori
et al., 2018) deal with the RE problem, which is the
task of identifying objects in an image that correspond
to ambiguous instructions.
Visual dialog (VisDial) (Das et al., 2017) is a task
that requires a dialog agent to answer a series of
questions grounded in an image. Attention-based
approaches have mainly been proposed for this task,
including Dual Attention Networks (Kang et al., 2019)
and Light-weight Transformer for Many Inputs (Nguyen
et al., 2020). The VisDial dataset is a standard
dataset for evaluating methods dealing with visual
dialog. This dataset has two components: an image and
a dialogue history about the image. A dialogue history
is a set of successive question-answer pairs.
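The two-component structure described above can be sketched as a small data type. This is only an illustration of "an image plus a history of question-answer pairs"; the class `VisualDialogExample`, its field names, and the image path are hypothetical and do not reproduce the actual VisDial file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VisualDialogExample:
    image_path: str  # hypothetical path to the image
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)

    def add_turn(self, question: str, answer: str) -> None:
        """Append one question-answer pair to the dialogue history."""
        self.history.append((question, answer))

    def context_text(self) -> str:
        """Flatten the history into a single string, in turn order."""
        return " ".join(f"Q: {q} A: {a}" for q, a in self.history)


example = VisualDialogExample(image_path="images/kitchen.jpg")
example.add_turn("Is there a cup on the table?", "Yes, one white cup.")
example.add_turn("Is it full?", "It looks empty.")
print(example.context_text())
```

Note that every turn in such a record is a question followed by an answer; there is no slot for a free comment or opinion.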
This paper focuses on more realistic dialogues
than previous work does. The RE problem setting
assumes that the referent must exist in the given
image. In other words, the setting considers only
situational references and excludes contextual
references. The VisDial dataset also has several
disadvantages. First, the dialogue history consists
only of questions about an image and their answers,
but realistic dialogue does not necessarily consist of
such question-answer pairs. For example, a speaker may
share thoughts about surrounding objects or opinions
on others’ utterances, so a dialogue made up only of
questions is not natural. Second, all the questions in
a dialogue history
ICAART 2021 - 13th International Conference on Agents and Artificial Intelligence