of an activity within competing plans, and the higher
importance of compound nodes that incorporate sev-
eral different activities. We initialize the weights of the sum edges as $\frac{1}{|\text{subactivities}(n_m)|}$. The weighting of the sum edge of an activity is increased relative to the other sum edges within the method if the activity is comparatively unique to the respective plan. The weighting of sum edges connecting compound nodes to method nodes is amplified relative to that of simple operator nodes. Moreover, we normalize the weights of the sum edges, which ensures comparability between activities independent of the hierarchical level.
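As an illustration of this weighting scheme, the following Python sketch initializes the sum edge weights of one method with $\frac{1}{|\text{subactivities}(n_m)|}$, amplifies them with a uniqueness factor and a compound-node boost, and normalizes the result. The attribute names (subactivities, is_compound) and the concrete factors are assumptions for illustration, not the exact formulas used in the ASN.

```python
def init_sum_edge_weights(method, uniqueness, compound_boost=1.5):
    """Initialize and normalize the sum edge weights of one method node.

    `method.subactivities` is assumed to hold the child nodes of the method;
    `uniqueness[n]` is an illustrative per-activity factor that is larger for
    activities occurring in few competing plans; `compound_boost` amplifies
    edges from compound nodes relative to simple operator nodes.
    """
    base = 1.0 / len(method.subactivities)      # initial weight 1/|subactivities(n_m)|
    weights = {}
    for node in method.subactivities:
        w = base * uniqueness.get(node, 1.0)    # relative increase for unique activities
        if node.is_compound:                    # compound nodes weighted more than operators
            w *= compound_boost
        weights[node] = w
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}   # normalize within the method
```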
As another extension, we introduce state effects that represent the validity of a certain state upon completing a subgoal within the ASN. This is important
for recognizing activity sequences that are likely to
be repeated several times. A state effect is valid until
another state effect is introduced.
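A minimal sketch of how such state effects can be tracked, assuming that only the most recently introduced state effect is considered valid (the class and method names are illustrative):

```python
class StateEffectTracker:
    """Tracks the currently valid state effect within the ASN (illustrative sketch)."""

    def __init__(self):
        self.current = None            # no state effect introduced yet

    def introduce(self, effect):
        # a new state effect invalidates the previously introduced one
        self.current = effect

    def is_valid(self, effect):
        # a state effect remains valid until another one is introduced
        return self.current == effect
```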
Furthermore, we introduce a backpropagation procedure that enables predictions about future activities. First, we determine the most likely goal as the one with the highest activation value and then consider the child nodes of the method that constructs this compound node. We iteratively traverse the hierarchy towards the lowest levels of the plan constituting the currently assessed goal. In the case of compound nodes, we iterate through the child nodes of their methods. On each hierarchical level, the validity of the preconditions is checked, as it serves as an indicator of possible next activities. When predicting future activities, we consider those whose sum edges are activated due to fulfilled preconditions and that directly follow previously activated activities. Lastly, the ASN recovers from misclassifications and missed activities by setting the activation value of an activity to 1 if it has been missed or misclassified but serves as a precondition for two subsequent, successfully recognized activities.
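The prediction step can be sketched as a top-down traversal as follows; the node attributes (activation, is_compound, method, subactivities) and the preconditions_valid callback are assumptions for illustration rather than the exact interface of the ASN.

```python
def predict_next_activities(goals, preconditions_valid):
    """Pick the goal with the highest activation value and descend through its
    methods, collecting activities whose preconditions hold and which directly
    follow already activated activities (illustrative sketch)."""
    goal = max(goals, key=lambda g: g.activation)
    candidates = []
    frontier = [goal]
    while frontier:
        node = frontier.pop()
        if not node.is_compound:
            continue
        children = node.method.subactivities
        for prev, nxt in zip(children, children[1:]):
            if prev.activation == 1 and preconditions_valid(nxt):
                candidates.append(nxt)          # directly subsequent, preconditions fulfilled
        frontier.extend(children)               # continue towards the lowest levels
    return candidates
```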
3.1.2 Activation Spreading Process
The activation value propagation process is initiated by the recognition of a new activity. When the activation value of a newly recognized activity is updated to 1, we iterate from the lowest-level methods to the highest-level ones in order to ensure correct activation value propagation within the hierarchy.
All preconditions of an activity have to be valid in order for the activity to spread its activation value. If all preconditions and state effect preconditions are fulfilled, the respective sum edge of the considered activity is activated, and the value of ActSumEdges is updated from 0 to 1 for the relevant activity. After all activities of a method have been considered, the activation value of the method is updated by summing the weighted activation values of its activities. Once a method reaches an activation value of 1, the activation values of the activities involved in that method are reset to 0, while the compound node maintains its activation value of 1. The compound node itself is reset once the method node it contributes to reaches an activation value of 1.
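A simplified sketch of one such spreading step, assuming that method nodes are processed from the lowest to the highest hierarchical level; the attribute names (subactivities, weights, sum_edge_active, compound_node) are placeholders, not the ASN's actual data structures.

```python
def spread_activation(methods_bottom_up, recognized, preconditions_valid):
    """Propagate the activation of a newly recognized activity upwards
    through the hierarchy (illustrative sketch)."""
    recognized.activation = 1.0
    for method in methods_bottom_up:
        for act in method.subactivities:
            if act.activation == 1.0 and preconditions_valid(act):
                act.sum_edge_active = 1.0       # ActSumEdges updated from 0 to 1
        # method activation = weighted sum over its activities' activations
        method.activation = sum(
            method.weights[a] * a.activation * a.sum_edge_active
            for a in method.subactivities
        )
        if method.activation >= 1.0:
            for a in method.subactivities:      # reset contributing activities
                a.activation = 0.0
            method.compound_node.activation = 1.0   # compound node keeps activation 1
```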
3.2 Structural Recurrent Neural
Network
In this section we explain the action recognition based
on the action-affordance S-RNN proposed by Jain
et al. (2016) and our action S-RNN.
3.2.1 Feature Preprocessing
In order for the S-RNN to perform action and affordance recognition, we first introduce the feature preprocessing steps, which are inspired by Koppula et al. (2013). The features are computed based
on skeleton and object tracking performed on sta-
tionary video data. The object node features de-
pend on spatial object information within the seg-
ment, whereas the human node features rely on the
spatial information of the upper body joints. The
edge features are defined for object-object edges and
human-object edges within one segment of the spatio-
temporal graph. The temporal object and human fea-
tures are defined based on the relations between ad-
jacent temporal segments. Similar to Koppula et al. (2013), the continuous feature values are discretized using cumulative binning into 10 bins, yielding a discrete distribution over feature values. The resulting dimension of the feature vector is thus (number of features) × 10. As a result, we obtain a histogram distribution over the feature values, which is especially useful when adding object features.
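A possible realization of this cumulative binning is sketched below, assuming per-feature bin edges (e.g. training-set quantiles, which is our assumption rather than a detail taken from Koppula et al. (2013)):

```python
import numpy as np

def cumulative_binning(values, thresholds):
    """Discretize continuous feature values with cumulative binning.

    `values` holds one continuous value per feature (shape (F,));
    `thresholds` holds 10 bin edges per feature (shape (F, 10)).
    Each feature becomes a 10-dimensional cumulative indicator vector,
    so F features yield an F x 10 binary feature vector.
    """
    values = np.asarray(values)
    thresholds = np.asarray(thresholds)
    binned = (values[:, None] >= thresholds).astype(int)   # 1 for every bin edge passed
    return binned.reshape(-1)                              # flattened (F * 10,) vector

# example: 2 features with bin edges at the deciles 0.1 ... 1.0
edges = np.tile(np.linspace(0.1, 1.0, 10), (2, 1))
print(cumulative_binning([0.35, 0.8], edges))
```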
The spatio-temporal graph depicted in Figure 2 provides a concise representation of the relations between the human and the objects within and between temporal segments. In order for the spatio-temporal graph to model meaningful transitions, the video is
Figure 2: Exemplary spatio-temporal graph with one human and two object nodes within three temporal segments.