such as in “Pick up an orange which is behind the
book and right of red cup”.
All techniques use Google speech for speech-to-text
conversion. We solve all four test cases in this paper
and achieve 98% accuracy for PPO with every technique,
but the tree-structure and Euclidean-distance-based
algorithms perform better for subject-object relation
extraction. The overall architecture is illustrated in
Figure 1.
Figure 1: Illustration of the proposed overall architecture.
Verbal utterances given by the human are analysed with our
proposed language modelling techniques for further
mapping to visual scenes.
The structure of the paper is as follows: Section 2
gives an overview of previous work on relation
extraction and language grounding. Section 3 explains
all four language modelling techniques. Experimental
results and analysis are discussed in Sections 4 and 5.
Section 6 concludes the paper with respect to the
different language modelling techniques.
2 BACKGROUND
Most previous work addresses named-entity-based
relation extraction or language grounding. Golland
et al. (2010) proposed a game-theoretic model for
language grounding in which they tried to identify the
spatial relation between objects. It is a “speak and
tell” approach where a speaker generates an utterance
referring to an actual object and the listener tries to
guess the object. If the two objects match, the round is
a success; otherwise it is a failure. They evaluated
their approach under some constraints and achieved 78%
accuracy. These constraints were relaxed by
Guadarrama et al. (2013) using a probabilistic
approach, in which a primary object and its spatial
relationships with other objects are extracted from a
visual scene. This information is combined with
semantic parsing of sentences using template matching
and a probabilistic approach. They achieve an accuracy
of about 84%. Olszewska (2017)
built human/robot dialogues based on semantically
meaningful instructions like the directional spatial
relations represented by the clock model. Explainable
AI is used for language grounding (Hendricks et al.,
2018). In this method, features extracted from the
visual and language modules are fed to an LSTM, and a
2-layer neural network is applied to obtain the final
grounding score. In Alomari et al. (2016, 2017),
grounding is performed using a robot control language
(RCL) tree, where visual learning relies on the color,
shape and location features obtained from an object.
The direction and distance between a pair of objects
are extracted as relation features, and finally the
action performed by the robot is extracted from the
video clip. These features are clustered by category
and mapped to words using the RCL tree. Preprocessing
and word extraction from sentences are done using the
NLTK toolkit (Bird and Loper, 2004). Named-entity-based
approaches try to extract the relation between, for
example, person and organization or organization and
city. An open relation extraction approach is proposed
by Banko and Etzioni (2008) and Banko et al. (2007),
where lexico-syntactic patterns are used to build a
relation-independent model. A conditional random field
(CRF) is used for classifying relational tokens.
Very little work has been done on extracting primary
entities and their relations with other objects for
everyday objects such as a knife, mango or bottle.
In this paper we propose and analyse different
language modelling techniques for extracting the
above-mentioned components from sentences. This
work will be extended to ground primary objects and
their relations in visual scenes, paving the way for
effective human-robot interaction in which the human
commands a robot that acts as a helper in daily
life.
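To make the notion of a relation feature concrete, the direction
and distance between a pair of objects mentioned above can be
sketched as follows. This is only an illustration and not the
implementation of any cited work: it assumes each object is
reduced to a hypothetical 2D centroid (x, y) in a shared
coordinate frame, and the function name relation_features is
ours.

```python
import math

def relation_features(a, b):
    """Return (distance, direction_in_degrees) from object a to object b.

    a and b are assumed to be (x, y) centroids in the same 2D frame.
    The direction is measured counter-clockwise from the positive
    x-axis and normalised to the range [0, 360).
    """
    dx = b[0] - a[0]
    dy = b[1] - a[1]
    distance = math.hypot(dx, dy)                       # Euclidean distance
    direction = math.degrees(math.atan2(dy, dx)) % 360.0
    return distance, direction
```

Features of this kind could then be clustered or thresholded
into symbolic relations such as "left of" or "behind", depending
on how the coordinate frame is defined.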
3 LANGUAGE MODELLING TECHNIQUES
The proposed language modelling techniques consist
of two modules, namely speech-to-text conversion
(explained in Section 3.1 below) and the algorithmic
part of language modelling (explained in Sections 3.2
to 3.5 below). The algorithmic part of language
modelling uses regular expressions, Euclidean
distance, an extended Hobbs' algorithm based on
dependency parses and Stanford phrase structure
parses. All four techniques share a few common steps
such as word tokenization, preprocessing, POS tagging
and chunking, but they differ in the extraction part.
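The common steps named above (word tokenization, preprocessing,
POS tagging and chunking) can be sketched as a minimal pipeline.
This is a toy illustration under stated assumptions and not the
implementation used in this paper: the tagger is a hypothetical
hand-written lexicon standing in for a trained POS tagger, and
the chunker only groups simple noun phrases of the form
determiner, adjectives, nouns. All function names are ours.

```python
import re

# Hypothetical mini-lexicon standing in for a trained POS tagger.
LEXICON = {
    "pick": "VB", "up": "RP", "an": "DT", "a": "DT", "the": "DT",
    "orange": "NN", "book": "NN", "cup": "NN", "red": "JJ",
    "which": "WDT", "is": "VBZ", "behind": "IN", "right": "RB",
    "of": "IN", "and": "CC",
}

def tokenize(sentence):
    """Split a sentence into alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", sentence)

def preprocess(tokens):
    """Lowercase tokens (a stand-in for fuller preprocessing)."""
    return [t.lower() for t in tokens]

def pos_tag(tokens):
    """Tag tokens from the lexicon; unknown words default to NN."""
    return [(t, LEXICON.get(t, "NN")) for t in tokens]

def chunk_noun_phrases(tagged):
    """Group runs of the form DT? JJ* NN+ into noun-phrase strings."""
    chunks, current, has_noun = [], [], False

    def flush():
        nonlocal current, has_noun
        if has_noun:                 # keep only runs that contain a noun
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag == "NN":
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ"):
            if has_noun:             # a noun run just ended
                flush()
            if tag == "DT":          # a determiner starts a fresh chunk
                current = []
            current.append(word)
        else:
            flush()
    flush()
    return chunks

def run_pipeline(sentence):
    """Apply the four common steps and return the noun-phrase chunks."""
    return chunk_noun_phrases(pos_tag(preprocess(tokenize(sentence))))
```

For the example utterance "Pick up an orange which is behind the
book and right of red cup", run_pipeline returns the three
candidate object phrases "an orange", "the book" and "red cup".
The extraction step, which decides which phrase is the primary
object and how the others relate to it, is where the four
techniques differ.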
Extracting Primary Objects and Spatial Relations from Sentences