Interest-Point-Based Landmark Computation for Agents’ Spatial
Description Coordination
J. I. Olszewska
School of Computing and Technology, University of Gloucestershire, The Park, Cheltenham, GL50 2RH, U.K.
Keywords:
Qualitative Spatial Reasoning, Object Detection, Local Feature Descriptors, Feature Extraction, Visual Scene
Understanding, Automated Image Annotation, Robotics, Autonomic Agents.
Abstract:
In applications involving multiple conversational agents, each of these agents has its own view of a visual
scene, and thus all the agents must establish common visual landmarks in order to coordinate their space
understanding and to coherently share the generated spatial descriptions of this scene. Whereas natural language
processing approaches contribute to defining this common ground through dialogues between these agents, we
propose in this paper a computer-vision system that determines the object of reference for both agents efficiently
and automatically. Our approach consists in processing each agent’s view by computing the related visual
interest points and then matching them in order to extract the salient and meaningful landmark. Our
approach has been successfully tested on real-world data, and its performance and design allow its use for
embedded robotic system communication.
1 INTRODUCTION
Communication between agents about space is of
prime importance for actions requiring spatial coor-
dination of intelligent agents such as robots operating
in rescue operations or in assistive care.
In these joint actions, the conversational agents,
i.e. the speaker and the hearer, should reach a shared
understanding of the scene they observe and/or evolve
in. For this purpose, they usually acquire knowledge
about this scene as well as its objects (Olszewska,
2011), and generate qualitative spatial descriptions of
the scene, by using semantic concepts such as “to the
right”, “at two o’clock” (Olszewska and McCluskey,
2011) or “above” (Olszewska, 2013). However, no-
tions involving spatial relations require the definition
of a reference object in the scene. Hence, the agents
must adopt a common ground in order to coordi-
nate their spatial descriptions of the scene (Ma et al.,
2012).
Such a common reference could be of different natures (Anacta et al., 2014) in a discourse describing and interpreting visual scenes such as the one presented in Fig. 1. Indeed, the reference could be defined by some third object, leading to a relative reference (e.g. the palm tree), distinct from the reference and related objects (e.g. the speaker and the hearer, respectively). The reference could also be the object itself, i.e. an intrinsic reference (e.g. the coffee table). On the other hand, the reference could refer to some global reference point, set as an extrinsic reference (e.g. the North). Hence, the scene could be grounded by inferred or explicit reference to these specific objects (Levinson, 2003). For example, the sofa may be “to the right of the coffee table”, “behind the coffee table”, or “North of the coffee table”. This leads to a situated conversation, as the agents now have a visual common ground, while objects of the scene may be described from different perspectives corresponding to each agent’s view of the 3D scene (Olszewska, 2015a).
In Natural Language Processing, methods usually set visual landmarks through dialogue between agents (Jurafsky and Martin, 2000). However, this dialogue step limits the effective coordination of spatial descriptions between dialogue partners when one agent is human and the other is a non-human agent, as in applications involving robots (Summers-Stay et al., 2014).
Cognitive approaches such as (Zhang et al., 2014) focus on the analysis of the visual scene in order to identify objects which attract human agents’ attention, leading to a fast definition of reference objects. This quantitative study highlights that selecting a salient object visible in all agents’ views as the reference object can positively improve the effectiveness of the landmark definition.
Figure 1: CANDELA dataset images of the studied visual scene observed from (a) a speaker’s viewpoint; (b) a hearer’s viewpoint.
Works such as (Watson et al., 2004) further demonstrate that intrinsic references are the most widely adopted type of reference among agents.
On the other hand, in Computer Vision, methods such as those presented in (Alsuqayhi and Olszewska, 2013), (Bhat and Olszewska, 2014), and (Olszewska, 2015b) have proven to be efficient for automatic annotation of visual scenes and for automated scene understanding.
Hence, in this work, we propose to use a
computer-vision approach to automatically identify
common visual landmarks in order to allow the au-
tomated coordination of spatial descriptions between agents of any type, i.e. human or non-human.
Our method does not impose constraints on the ge-
ometrical properties of visual landmarks (see Fig. 3).
The contribution of this paper thus consists in the
automatic definition of visual landmarks by process-
ing the visual feature descriptors of agents’ different
views of the same scene in order to extract common-
ground reference objects between these agents.
The paper is structured as follows. In Section
2, we present our approach to compute visual land-
marks in an automated way. The performance of our
computer-vision approach successfully tested for dif-
ferent types of agents in real-world, indoor situations
is reported and discussed in Section 3. Conclusions are drawn in Section 4.
2 PROPOSED APPROACH
To automatically extract the common landmark
for coordinating spatial descriptions of multi-view
scenes, we propose the following approach involving
computer-vision-based techniques (see Fig. 2), as explained in Sections 2.1 and 2.2.
2.1 Detecting Objects of Interest
Firstly, each view is processed separately in order
to extract the visual information. Hence, an interest
point detector (Alqaisi et al., 2012) is applied to the
speaker’s and hearer’s views, such as those shown in Figs. 1(a) and 1(b), respectively.
Once the interest points have been detected in each view of the scene, a candidate landmark object (e.g. the coffee table) is matched with each view to determine the potential objects of reference, i.e. the objects of interest, as illustrated in Fig. 3(a) and Fig. 3(b), respectively. This process involves the automatic labeling method described in (Olszewska, 2012).
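As an illustration, this detection step could be sketched in Python as follows, with OpenCV’s SIFT implementation standing in for the interest point detector of (Alqaisi et al., 2012); the image file names are hypothetical.

import cv2

def detect_descriptors(image_path):
    # Detect interest points and compute their local descriptors in one view.
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()  # stand-in detector/descriptor
    keypoints, descriptors = sift.detectAndCompute(image, None)
    return keypoints, descriptors

# One descriptor set per agent's view, plus one for the candidate landmark object.
_, speaker_desc = detect_descriptors("speaker_view.png")
_, hearer_desc = detect_descriptors("hearer_view.png")
_, landmark_desc = detect_descriptors("candidate_landmark.png")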
2.2 Computing Visual Landmarks
Next, the objects which have been detected in the dif-
ferent views are matched with each other in order to
set the common landmark, as displayed in Fig. 3(c).
All the matching computations we performed rely on the computation of the Hausdorff distance $d_H(A,B)$, defined as follows:

$$d_H(A,B) = \max\left(d_h(A,B),\, d_h(B,A)\right), \quad (1)$$

where $d_h(A,B)$ is the directed Hausdorff distance from $A$ to $B$, defined as

$$d_h(A,B) = \max_{a \in A}\, \min_{b \in B}\, d_P(a,b), \quad (2)$$

with $d_P(a,b)$ the Minkowski-form distance based on the $L_P$ norm, defined as

$$d_P(a,b) = \left( \sum_k |a_k - b_k|^P \right)^{1/P}, \quad (3)$$
and involve the embedded matching algorithm (Algorithm 1) (Alqaisi et al., 2012), where A and B are the two finite sets of SIFT local descriptors detected in each view, and M is the doubly matched feature set.
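For clarity, Eqs. (1)–(3) can be sketched in a few lines of Python; the following is an illustrative NumPy version, not the authors’ implementation, assuming each descriptor set is an array with one descriptor per row.

import numpy as np

def minkowski(a, b, p=2):
    # Eq. (3): Minkowski-form distance between two descriptors (p=2 gives the L2 norm).
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

def directed_hausdorff(A, B, p=2):
    # Eq. (2): directed Hausdorff distance from descriptor set A to descriptor set B.
    return max(min(minkowski(a, b, p) for b in B) for a in A)

def hausdorff(A, B, p=2):
    # Eq. (1): symmetric Hausdorff distance between descriptor sets A and B.
    return max(directed_hausdorff(A, B, p), directed_hausdorff(B, A, p))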
Figure 2: Architecture of the proposed automated computation of visual landmarks for coordinating spatial descriptions in
discourse between agents.
Algorithm 1: Embedded Matching.
Given A′ = A, B′ = B, M = ∅
for all a_i ∈ A′ do
  for all b_j ∈ B′ do
    repeat
      if d_P(a_i, b_j) = min_{b ∈ B} d_P(a_i, b)
         and d_P(b_j, a_i) = min_{a ∈ A} d_P(b_j, a)
         and d_P(a_i, b_j) ≤ d_H(A, B)
         and d_P(b_j, a_i) ≤ d_H(A, B)
      then
        (a_i, b_j) → M
        A′ = A \ {a_i}, B′ = B \ {b_j}
      end if
    until A′ ≠ ∅ and B′ ≠ ∅
  end for
end for
return M
The object similarity measure $d_S(A,B)$ is then defined as follows:

$$d_S(A,B) = \frac{\#M}{\frac{\#A + \#B}{2}}. \quad (4)$$
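A possible reading of Algorithm 1 together with Eq. (4) is sketched below; it reuses the minkowski() and hausdorff() helpers from the previous sketch, keeps the descriptor pairs that are mutual nearest neighbours within the Hausdorff distance of the two sets, and scores the object similarity. It is an interpretation of the published pseudocode, not the reference implementation.

def embedded_matching(A, B, p=2):
    # Keep pairs (i, j) of mutual nearest neighbours whose distance does not
    # exceed the Hausdorff distance between the two descriptor sets.
    h = hausdorff(A, B, p)
    M = []
    for i, a in enumerate(A):
        j = min(range(len(B)), key=lambda k: minkowski(a, B[k], p))          # nearest b_j for a_i
        i_back = min(range(len(A)), key=lambda k: minkowski(B[j], A[k], p))  # nearest a for b_j
        if i_back == i and minkowski(a, B[j], p) <= h:
            M.append((i, j))
    return M

def similarity(A, B, M):
    # Eq. (4): number of doubly matched features over the mean set size.
    return len(M) / ((len(A) + len(B)) / 2.0)

The common landmark is then the candidate object whose descriptor set yields the highest similarity score across both agents’ views.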
3 EXPERIMENTS
We carried out two types of experiments to vali-
date our system. In the first experiment, we used
a publicly-available dataset called CANDELA which
contains images showing the same indoor scene cap-
tured from different points of view (Fig. 1) by two
different cameras. This configuration maps to a set-up consisting of two cameras, each connected to a MatLab-equipped PC, with one modeling the speaker agent and the other the hearer agent. Processing the image sent by the speaker and the one acquired by the hearer leads to the computation of common landmarks, based on matching the interest points detected in each view with a candidate object of reference and with each other. To test our approach in this case, matching has
been performed repeatedly on different views of the
dataset and for different objects of interest (Fig. 3).
It is worth noting that changes in landmark objects’
poses due to the different views have a major impact
on this matching process. Experimental results have
been compared with ground-truth data obtained from two human agents, each looking at a different view.
Our automatic system shows excellent performance,
achieving 94% accuracy, and is thus a promising candidate for embedding in an autonomous process.
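For completeness, the reported accuracy could be obtained along the following lines, where each trial pairs the landmark selected automatically (the candidate with the highest similarity score) with the landmark agreed upon by the two human annotators; the trial data shown are purely hypothetical.

def evaluate(trials):
    # trials: iterable of (predicted_landmark, ground_truth_landmark) pairs.
    trials = list(trials)
    correct = sum(1 for predicted, truth in trials if predicted == truth)
    return correct / len(trials)

# e.g. evaluate([("coffee table", "coffee table"), ("sofa", "coffee table"), ...])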
In the second experiment, the set-up is composed of a speaker agent modeled by an Arducam camera acquiring view 1 and pinned to an Arduino board connected to a Bluetooth TTL transceiver module. The hearer agent consists of a webcam recording view 2 and linked to a PC running MatLab software, which processes the images of the different views and is connected to a Bluetooth master module in order to communicate with agent 1. As long as the
Bluetooth connection is operating properly, the accu-
racy of the system setting the common landmark is 92%, as assessed by comparing the matching results with those of two human agents, each looking at one of the views recorded by the different cameras. These results are excellent in comparison with other approaches such as (Summers-Stay et al., 2014), and could be further improved by adding image pre-processing techniques to cope with lighting variations in the captured images, especially in those acquired by the mobile speaker agent compared to the static hearer agent.
Figure 3: Computing the visual landmark: (1st column) detecting interest points in the first view; (2nd column) detecting interest points in the second view; (3rd column) matching both views’ interest points.
4 CONCLUSIONS
This paper presents an automatic and accurate method
to objectively define common visual landmarks in or-
der to coordinate spatial descriptions generated by
different agents, each with a different view of the
same scene. Hence, the common ground is computed
by detecting interest points in all the agents’ views
and by applying the Hausdorff-enhanced matching of
these points in order to extract the common salient
object visible in both agents’ views. Our approach is a new application of computer-vision local feature descriptor computation in the context of agent communication systems. As demonstrated, this new automated process could be successfully integrated into robotic applications.
REFERENCES
Alqaisi, T., Gledhill, D., and Olszewska, J. I. (2012). Em-
bedded double matching of local descriptors for a fast
automatic recognition of real-world objects. In Pro-
ceedings of the IEEE International Conference on Im-
age Processing (ICIP’12), pages 2385–2388.
Alsuqayhi, A. and Olszewska, J. I. (2013). Efficient opti-
cal character recognition system for automatic soccer
player’s identification. In Proceedings of the IAPR In-
ternational Conference on Computer Analysis of Im-
ages and Patterns Workshop (CAIP’13), pages 139–
150.
Anacta, V. J. A., Schwering, A., and Li, R. (2014). Deter-
mining hierarchy of landmarks in spatial descriptions.
In Proceedings of the International Conference on Ge-
ographic Information Science (GIScience’14).
Bhat, M. and Olszewska, J. I. (2014). DALES: Auto-
mated tool for detection, annotation, labelling and
segmentation of multiple objects in multi-camera
video streams. In Proceedings of the ACL Inter-
national Conference on Computational Linguistics
(COLING’14), pages 87–94.
Jurafsky, D. and Martin, J. H. (2000). Dialogue and conver-
sational agents, chapter 19, pages 719–761. Prentice
Hall.
Levinson, S. C. (2003). Space in Language and Cognition:
Explorations in Cognitive Diversity, chapter 5. Cambridge University Press.
Ma, Y., Raux, A., Ramachandran, D., and Gupta, R. (2012).
Landmark-based location belief tracking in a spoken
dialog system. In Proceedings of the Annual Meeting
of the Special Interest Group on Discourse and Dia-
logue (SIGDIAL’12), pages 169–178.
Olszewska, J. I. (2011). Spatio-Temporal Visual Ontology.
In Proceedings of the 1st EPSRC/BMVA Workshop on
Vision and Language (VL’11).
Olszewska, J. I. (2012). A new approach for automatic ob-
ject labeling. In Proceedings of the 2nd EPSRC/BMVA
Workshop on Vision and Language (VL’12).
Olszewska, J. I. (2013). Clock-modeled ternary spatial re-
lations for visual scene analysis. In Proceedings of
the ACL International Conference on Computational
Semantics Workshop, pages 20–30.
Olszewska, J. I. (2015a). 3D Spatial reasoning using the
clock model. Research and Development in Intelligent
Systems XXXII, Springer, pages 147–154.
Olszewska, J. I. (2015b). “Where is my cup?” - Fully auto-
matic detection and recognition of textureless objects
in real-world images. Lecture Notes in Computer Sci-
ence, Springer, 9256:501–512.
Olszewska, J. I. and McCluskey, T. L. (2011). Ontology-
coupled active contours for dynamic video scene un-
derstanding. In Proceedings of the IEEE International
Conference on Intelligent Engineering Systems, pages
369–374.
Summers-Stay, D., Cassidy, T., and Voss, C. R. (2014).
Joint navigation in commander/robot teams: Dia-
log and task performance when vision is bandwidth-
limited. In Proceedings of the ACL International Con-
ference on Computational Linguistics, pages 9–16.
Watson, M. E., Pickering, M. J., and Branigan, H. P. (2004).
Alignment of reference frames in dialogue. In Pro-
ceedings of the Annual Conference of the Cognitive
Science Society.
Zhang, X., Li, Q.-Q., Fang, Z.-X., Lu, S.-W., and Shaw,
S.-L. (2014). An assessment method for landmark
recognition time in real scenes. Journal of Environ-
mental Psychology, 40:206–217.