set of spatio-temporal relations (semantics) to keep
the new framework even simpler.
2 RELATED WORKS
In this paper, we apply spatial reasoning between objects to represent manipulation actions performed by humans. This type of reasoning has previously been applied in numerous other domains, including robot planning and navigation (Crockett et al., 2009), interpretation of visual inputs (Park et al., 2006), computer-aided design (Contero et al., 2006), and natural language understanding (Wei et al., 2009).
To represent manipulation actions semantically, various methodologies have been proposed. (Qi et al., 2019) used an attentive semantic recurrent neural network to understand individual actions and group activities in videos. To encode interactions between objects, (Sridhar et al., 2008) extracted functional object categories from spatio-temporal patterns. Beyond representing actions, intelligent systems must also be able to recognize them. Recently, (Khan et al., 2020) combined deep neural networks with Convolutional Neural Network features and multiview features to recognize human actions. Other studies classified human actions from RGB-D data using a Bag-of-Visual-Words model (Avola et al., 2019; Fei-Fei and Perona, 2005), a multi-class Support Vector Machine classifier, and a Naive Bayes Combination method (Kuncheva, 2004).
Among the existing methods, approaches that adopt a semantic perspective are widely used, owing to their perceptual simplicity and their similarity to the human cognitive system. In this regard, (Aksoy et al., 2011) introduced the semantic event chain (SEC), which considers the sequence of transitions between touching and non-touching relations among manipulated objects to represent and recognize actions. We further improved this method with a computational model named the enriched Semantic Event Chain (eSEC) (Ziaeetabar et al., 2017), which incorporates static (e.g., top, bottom) and dynamic spatial relations (e.g., moving apart, getting closer) between objects in an action scene. This led to high accuracy in the recognition and prediction of manipulation actions (Ziaeetabar et al., 2018). The predictive power of humans and of the eSEC framework were compared in (Wörgötter et al., 2020). Here, we intend to upgrade the current eSEC framework to cover further new and important applications of manipulation actions in everyday life.
This paper is organized as follows: first, we introduce the eSEC framework and its enhanced version (e²SEC) in Section 3.1. Then, the similarity measurement algorithm is proposed in Section 3.2. Next, the importance of rows in an eSEC matrix is computed in Section 3.3, and the updated semantics are presented in Section 3.4. The results are discussed in Section 4, and finally, the paper closes with a conclusion and an outlook to future work.
3 METHODS
3.1 eSEC
The eSEC framework has been introduced in detail in our previous papers (Ziaeetabar et al., 2018; Ziaeetabar et al., 2017; Wörgötter et al., 2020). Here, we only summarize its basics.
The enriched SEC framework is inspired by the original Semantic Event Chain (SEC) approach (Aksoy et al., 2011), which checks touching (T) and not-touching (N) relations between each pair of objects in all frames of a manipulation scene and focuses on transitions (changes) of these relations. The extracted sequence of relational changes, represented in the form of a matrix, is then used for manipulation action recognition. In the enriched SEC framework, the wealth of relations described below is embedded into a similar matrix-form representation, showing how the set of relations changes throughout the action.
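To make this representation concrete, the following minimal sketch shows how an SEC-like matrix could be assembled from frame-wise touching/not-touching relations, keeping only the columns at which at least one pairwise relation changes. This is an illustrative example, not the published implementation; all function and variable names are our own assumptions.

# Minimal sketch of building a SEC-like matrix from frame-wise
# touching (T) / not-touching (N) relations; illustrative only.
from itertools import combinations

def build_event_chain(frame_relations, objects):
    # One row per object pair, one column per relational event
    # (i.e., per frame at which at least one relation changes).
    pairs = list(combinations(objects, 2))
    columns = []
    previous = None
    for rel in frame_relations:
        column = tuple(rel.get(pair, 'N') for pair in pairs)
        if column != previous:  # keep relational transitions only
            columns.append(column)
            previous = column
    matrix = [[col[i] for col in columns] for i in range(len(pairs))]
    return pairs, matrix

# Toy example: a hand grasps a cup standing on a table and lifts it.
objects = ['hand', 'cup', 'table']
frames = [
    {('hand', 'cup'): 'N', ('hand', 'table'): 'N', ('cup', 'table'): 'T'},
    {('hand', 'cup'): 'T', ('hand', 'table'): 'N', ('cup', 'table'): 'T'},
    {('hand', 'cup'): 'T', ('hand', 'table'): 'N', ('cup', 'table'): 'N'},
]
pairs, sec = build_event_chain(frames, objects)
# sec == [['N', 'T', 'T'], ['N', 'N', 'N'], ['T', 'T', 'N']]

In this toy example, the resulting matrix has one row per object pair (hand-cup, hand-table, cup-table) and one column per relational event; the enriched framework extends exactly this matrix form with the additional static and dynamic relations defined below.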
A practical application would be human-robot interaction, where a human performs an action while a robot observes it and produces a suitable response as soon as possible (Ziaeetabar et al., 2018).
3.1.1 Spatial Relations
The details on how to calculate static and dynamic spatial relations have been provided in (Wörgötter et al., 2020). Here we only define these relations.
• Touching and non-touching relations (TNR)
between two objects are defined according to col-
lision or no-collision between their corresponding
point clouds.
• Static spatial relations (SSR) include: “Above”
(Ab), “Below” (Be), “Right” (R), “Left” (L),
“Front” (F), “Back” (Ba), “Inside” (In), “Sur-
round” (Sa). Since “Right”, “Left”, “Front” and
“Back” depend on the viewpoint and directions