classifier with a χ² kernel (Zhang et al., 2007)
serving as a detector for each action. For the random
forest we use Breiman and Cutler's implementation
(Breiman, 2001), with the M-parameter equal to the
total number of feature values. For the SVM we use
the libSVM implementation (Chang and Lin, 2001),
where the χ² kernel is normalized by the mean distance across the full training set (Zhang et al., 2007), and the SVM's slack parameter is left at its default, C=1.
The weight of the positive class is set to (#pos+#neg)/#pos and the weight of the negative class to (#pos+#neg)/#neg, where #pos is the size of the positive class and #neg the size of the negative class (van de Sande, 2010).
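A minimal sketch of this setup in Python, assuming precomputed bag-of-features histograms; the function names and array layout are illustrative, not those of the original implementation:

import numpy as np

def chi2_distances(X, Y):
    # Pairwise chi-square distances between the histogram rows of X and Y.
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        D[i] = 0.5 * np.sum((x - Y) ** 2 / (x + Y + 1e-10), axis=1)
    return D

def chi2_kernel(X_train, X_other=None):
    # Chi-square kernel, normalized by the mean distance across the training set.
    D_train = chi2_distances(X_train, X_train)
    A = D_train.mean()
    D = D_train if X_other is None else chi2_distances(X_other, X_train)
    return np.exp(-D / A)

def class_weights(y):
    # Weights (#pos+#neg)/#pos and (#pos+#neg)/#neg for labels +1 and -1.
    n_pos, n_neg = int(np.sum(y == 1)), int(np.sum(y == -1))
    return {1: (n_pos + n_neg) / n_pos, -1: (n_pos + n_neg) / n_neg}

With libSVM, such a kernel is supplied as a precomputed kernel, C is kept at 1, and the two weights map onto its per-class weight options.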
The novelties with respect to the above pipeline
are: (1) We have improved the selection of negative
examples during training (Burghouts and Schutte,
2012). The rationale is to select negatives that are
semantically similar to the positive class. This gives
an average improvement of approx. 20%. (2) We
have improved the detection of each action by fusion of all actions in a second-stage classification. For each action, we create a second-stage SVM classifier that takes the first-stage classifiers' outputs, i.e. the posterior probability of each action detector, as a new feature vector (Burghouts and Schutte, 2012). The improvement is approx. 40%. The combination of both improvements yields an overall improvement of approx. 50% for the detection of the 48 human actions.
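As a sketch of this second-stage fusion (a form of stacking): the posteriors of all first-stage detectors form the feature vector on which one second-stage classifier per action is trained. The function names and the predict_proba/fit interface below are assumptions for illustration, not the original implementation:

import numpy as np

def stack_posteriors(detectors, X):
    # One row per sample, one column per action: the first-stage posterior.
    return np.column_stack([d.predict_proba(X)[:, 1] for d in detectors])

def train_fusion(detectors, X_train, labels_per_action, make_classifier):
    # Train one second-stage classifier per action on the stacked posteriors.
    Z = stack_posteriors(detectors, X_train)      # shape: (n_samples, n_actions)
    fused = {}
    for action, y in labels_per_action.items():
        clf = make_classifier()
        clf.fit(Z, y)
        fused[action] = clf
    return fused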
The recognizer’s performance is measured by the
Matthews Correlation Coefficient, MCC = ( TP·TN
- FP·FN ) / sqrt( (TP+FP) · (TP+FN) · (TN+FP) ·
(TN+FN)), where T=true, F=false, P=positive and
N=negative. This performance measure is
independent of the sizes of the positive and negative
classes. This is important for our evaluation purpose,
as there are over 1,000 positive samples for “move”, compared to 61 samples for “bury”. The actions that are detected well
(MCC > 0.2) are: Dig, Hold, Throw, Receive, Carry,
Bounce, Raise, Replace, Exchange, Bury, Lift,
Hand, Open, Haul. Fair performance (0.1 ≤ MCC ≤ 0.2) is achieved for: Touch, Give, Kick, Take, Pickup, Fly, Drop, Snatch. Actions that are detected poorly (MCC < 0.1) are: Hit, Catch, Putdown, Push,
Attach, Close. The average MCC = 0.23.
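A direct transcription of the MCC formula above (the counts in the example are illustrative only):

import math

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient; taken as 0 when a marginal count is empty.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den > 0 else 0.0

print(mcc(tp=50, tn=900, fp=40, fn=10))   # approx. 0.66 for an imbalanced detector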
7 DESCRIPTION
Based on the actions classified by the action
recognizer, a textual description of the scene is
generated. The description is generated by a Rule
Based System (RBS). The RBS (Hanckmann et al.,
2012) encodes world knowledge about the actions as rules. There are 73 rules
describing 48 actions. The rules specify a set of
conditions. The conditions are based on the
properties and relations as generated by the Event
Properties (see 5.2).
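As an illustration of what such a rule could look like, a hypothetical rule for “catch”; the condition checks, event-property keys and dictionary layout are assumptions made for this sketch, not the rule format of Hanckmann et al. (2012):

# Hypothetical rule: each condition is a check against the event properties
# (Section 5.2) for a candidate binding of subject and object.
rule_catch = {
    "action": "catch",
    "conditions": [
        lambda h, ep: ep["is_actor"][h["subject"]],                      # property
        lambda h, ep: ep["moves_towards"][(h["object"], h["subject"])],  # relation
        lambda h, ep: ep["stops_after"][(h["object"], h["subject"])],    # temporal
    ],
}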
The RBS connects the action with the entity or
entities involved in the action. It determines which
actor is the sentence subject and, if present, which
objects or actors are involved as direct or indirect objects. Subsequently, the description sentence is constructed from the action, the subjects and objects, and a number of templates. A sentence is required to contain at least a subject and an action.
7.1 RBS Algorithm
Based on the rules, a multi-hypothesis tree is constructed. Each hypothesis is a combination of possible entities and/or objects connected with the action. The hypothesis score is higher when more conditions are met.
There are three condition types: entity/object properties (event properties that are expected to be valid for an entity/object in combination with an action), entity/object relations (event properties that are expected to be valid for the relation between two entities/objects), and temporal ordering (temporal properties of the previous two condition types, e.g. the order of actions in time).
The description covers the actions with the highest probabilities (at most seven actions, each with a minimum probability of 0.7). From these
actions, the hypothesis with the highest score is
selected and used in the sentence construction.
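A simplified sketch of this selection: a hypothesis is one binding of subject and object, its score counts the satisfied conditions, and the best-scoring binding is kept. The rule representation matches the hypothetical rule sketched in Section 7 and is an assumption for illustration:

from itertools import product

def score(conditions, hypothesis, event_properties):
    # Number of rule conditions satisfied by this subject/object binding.
    satisfied = 0
    for c in conditions:
        try:
            satisfied += bool(c(hypothesis, event_properties))
        except KeyError:
            pass  # condition not applicable to this binding
    return satisfied

def best_hypothesis(conditions, actors, objects, event_properties):
    # Enumerate subject/object bindings (the hypothesis tree) and keep the best one.
    candidates = [{"subject": s, "object": o}
                  for s, o in product(actors, objects + [None])]
    return max(candidates, key=lambda h: score(conditions, h, event_properties))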
7.2 Description Performance
The description generator is evaluated on 241 short
videos (visint.org) with ground truth. The ground
truth consists of 10 sentences per video, written by 10
different people. Per video the number of different
annotated actions is approximately 5.
For each ground truth (GT) sentence we extract,
using The Stanford Parser (see last reference), the
action, subject, and object(s) and compare these with
the system response (SR) of the RBS.
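A sketch of this comparison and of the clip-level union score defined below, assuming the parser output has already been reduced to dictionaries with "action", "subject" and "objects" keys; the agreement test (matching action and subject) is an illustrative simplification:

def agrees(gt, sr):
    # Illustrative agreement test between one GT tuple and the SR tuple.
    return gt["action"] == sr["action"] and gt["subject"] == sr["subject"]

def union_score(clips):
    # Percentage of clips with at least one GT sentence agreeing with the SR.
    hits = sum(any(agrees(gt, clip["sr"]) for gt in clip["gt"]) for clip in clips)
    return 100.0 * hits / len(clips)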
We calculate two scores: a union score and a percentage score. The clip's union score is the best match over all sentence pairs (the percentage of clips with at least one agreement between GT and SR); its percentage score is the mean match, corrected for the minimum of the number of ground truth sentences and