set of spatio-temporal relations (semantics) to keep
the new framework even simpler.
2 RELATED WORKS
In this paper, we apply spatial reasoning between objects to represent manipulation actions performed by humans. This type of reasoning has previously been applied in numerous other domains, including robot planning and navigation (Crockett et al., 2009), interpretation of visual inputs (Park et al., 2006), computer-aided design (Contero et al., 2006), and natural language understanding (Wei et al., 2009).
To represent manipulation actions semantically, various methodologies have been proposed. (Qi et al., 2019) used an attentive semantic recurrent neural network to understand individual actions and group activities in videos. To encode interactions between objects, (Sridhar et al., 2008) extracted functional object categories from spatio-temporal patterns. Beyond representing actions, intelligent systems must also be able to recognize them. Recently, (Khan et al., 2020) combined deep neural networks with Convolutional Neural Network features and multiview features to recognize human actions. Other studies classified human actions from RGB-D data using a Bag-of-Visual-Words model (Avola et al., 2019; Fei-Fei and Perona, 2005), a multi-class Support Vector Machine classifier, and a Naive Bayes Combination method (Kuncheva, 2004).
Among the existing methods, approaches that adopt a semantic perspective are widely used, owing to their perceptual simplicity and their similarity to the human cognitive system. In this regard, (Aksoy et al., 2011) introduced the semantic event chain (SEC), which considers the sequence of transitions between touching and non-touching relations among manipulated objects to represent and recognize actions. We further improved this method with a computational model named the enriched Semantic Event Chain (eSEC) (Ziaeetabar et al., 2017), which incorporates static (e.g., top, bottom) and dynamic spatial relations (e.g., moving apart, getting closer) between objects in an action scene. This led to high accuracy in the recognition and prediction of manipulation actions (Ziaeetabar et al., 2018). The predictive power of humans and of the eSEC framework were compared in (Wörgötter et al., 2020). Here, we intend to upgrade the current eSEC framework to cover further new and important applications of manipulation actions in everyday life.
This paper is organized as follows: first, we introduce the eSEC framework and its enhanced version (e²SEC) in Section 3.1. Then, the similarity measurement algorithm is proposed in Section 3.2. Next, the importance of rows in an eSEC matrix is computed in Section 3.3, and the updated semantics are presented in Section 3.4. The results are discussed in Section 4, and finally, the paper closes with a conclusion and an outlook to future work.
3 METHODS
3.1 eSEC
The eSEC framework has been introduced in detail in our previous papers (Ziaeetabar et al., 2018; Ziaeetabar et al., 2017; Wörgötter et al., 2020). Here, we only summarize its basics.
The enriched SEC framework is inspired by the original Semantic Event Chain (SEC) approach (Aksoy et al., 2011), which checks touching (T) and not-touching (N) relations between each pair of objects in all frames of a manipulation scene and focuses on transitions (changes) of these relations. The extracted sequence of relational changes, represented in the form of a matrix, is then used for manipulation action recognition. In the enriched SEC framework, the wealth of relations described below is embedded into a similar matrix-form representation, showing how the set of relations changes throughout the action.
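To make this representation concrete, the following minimal sketch shows how an SEC-like matrix could be assembled from frame-wise touching/not-touching relations, keeping only the columns at which at least one pairwise relation changes. This is an illustrative example, not the published implementation; all function and variable names are our own assumptions.

# Minimal sketch of building a SEC-like matrix from frame-wise
# touching (T) / not-touching (N) relations; illustrative only.
from itertools import combinations

def build_event_chain(frame_relations, objects):
    # One row per object pair, one column per relational event
    # (i.e., per frame at which at least one relation changes).
    pairs = list(combinations(objects, 2))
    columns = []
    previous = None
    for rel in frame_relations:
        column = tuple(rel.get(pair, 'N') for pair in pairs)
        if column != previous:  # keep relational transitions only
            columns.append(column)
            previous = column
    matrix = [[col[i] for col in columns] for i in range(len(pairs))]
    return pairs, matrix

# Toy example: a hand grasps a cup standing on a table and lifts it.
objects = ['hand', 'cup', 'table']
frames = [
    {('hand', 'cup'): 'N', ('hand', 'table'): 'N', ('cup', 'table'): 'T'},
    {('hand', 'cup'): 'T', ('hand', 'table'): 'N', ('cup', 'table'): 'T'},
    {('hand', 'cup'): 'T', ('hand', 'table'): 'N', ('cup', 'table'): 'N'},
]
pairs, sec = build_event_chain(frames, objects)
# sec == [['N', 'T', 'T'], ['N', 'N', 'N'], ['T', 'T', 'N']]

In this toy example, the resulting matrix has one row per object pair (hand-cup, hand-table, cup-table) and one column per relational event; the enriched framework extends exactly this matrix form with the additional static and dynamic relations defined below.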
A practical application would be human-robot interaction, where a human performs an action while a robot observes it and produces a suitable response as soon as possible (Ziaeetabar et al., 2018).
3.1.1 Spatial Relations
The details on how to calculate static and dynamic spatial relations have been provided in (Wörgötter et al., 2020). Here we only define these relations.
• Touching and non-touching relations (TNR)
between two objects are defined according to col-
lision or no-collision between their corresponding
point clouds.
• Static spatial relations (SSR) include: “Above”
(Ab), “Below” (Be), “Right” (R), “Left” (L),
“Front” (F), “Back” (Ba), “Inside” (In), “Sur-
round” (Sa). Since “Right”, “Left”, “Front” and
“Back” depend on the viewpoint and directions