on Siamese Neural Networks (SNN). The second
strategy consists of a statistical modeling approach with
Hidden Markov Models (HMM), as in (Pylvänäinen,
2005), in order to model correlations between temporal
data samples. Finally, the last strategy applies machine
learning methods to model class features, such as
Support Vector Machines (SVM) (Wu et al., 2009),
Bayesian Networks (Cho et al., 2006) or Recurrent
Neural Networks (Lefebvre et al., 2015).
Using visual feature data, these three main strategies
are still relevant. Firstly, (Zhou and De la Torre Frade,
2012) present a Generalized Time Warping (GTW)
algorithm, an extension of the DTW algorithm that
temporally aligns multi-modal sequences from multiple
subjects performing similar activities. Secondly, (Xia
et al., 2012) present an approach for human action
recognition with histograms of 3D joint locations.
These features are projected using Linear Discriminant
Analysis (LDA) and clustered into several posture
visual words. The temporal evolutions of those visual
words are then modeled by a discrete HMM. Thirdly,
a study by (Vemulapalli et al., 2014) uses an SVM
classifier to build an action recognition system. Their
approach is based on a skeletal representation modeling
the 3D geometric relationships between body parts
using rotations and translations in 3D space. Since 3D
rigid body motions are members of the Special
Euclidean group SE(3), human actions can be modeled
as curves in this Lie group.
These previous strategies focus on one sensor and
one classifier to increase action recognition rates. This
can be viewed as classifier selection. Few studies take
into account multi-modal sources and several
classifiers to build a more robust system. (Chen et al.,
2015) present a two-level fusion approach based on
two sensor modalities, a depth camera and an inertial
body sensor. In the feature-level fusion, features
generated from the two differing modality sensors are
merged before classification. In the decision-level
fusion, outcomes from two classifiers are combined
with decision fusion (in their article, using
Dempster-Shafer Theory (DST)).
Inspired by this recent study, we propose to fuse
a posteriori decisions taken by classifiers, on the one
hand from individual data, and on the other hand from
a priori combined data.
2.2 Decision Fusion
In the following (see Figure 1), we assume that the
final classification should be made between c_i classes,
with i ∈ {1, . . . , I}. We have available C_j classifiers,
with j ∈ {1, . . . , J}, each giving a decision
x^j_{i,k} ∈ [0, 1] for a class c_i about a gesture instance
G_k, k ∈ {1, . . . , K}. A decision x^j_{i,k} closer to 1
indicates a high confidence that the instance belongs to
the class c_i, whereas a decision closer to 0 indicates a
low confidence. The final decision taken by the
decision fusion method is denoted as c_d.
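As a minimal sketch of this setting (the dimensions and decision values below are illustrative, not taken from the paper's experiments), the decisions for one gesture instance G_k form a J × I table:

```python
import random

# Hypothetical decision table for one gesture instance G_k:
# x[j][i] in [0, 1] is the decision of classifier C_j for class c_i.
random.seed(0)
J, I = 3, 4  # illustrative numbers of classifiers and classes
x = [[random.random() for _ in range(I)] for _ in range(J)]

for j, row in enumerate(x):
    # The class each classifier C_j is most confident in.
    best = max(range(I), key=lambda i: row[i])
    print(f"C_{j + 1} is most confident in class c_{best + 1}")
```

The fusion methods below all consume such a table and output a single class c_d.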
[Diagram: a gesture instance G_k is fed to classifiers C_1, C_2, . . . , C_J; their decisions x^j_{1,k}, . . . , x^j_{I,k} enter the decision fusion block, which outputs c_d.]
Figure 1: The decision fusion process.
2.2.1 Voting Methods
Voting methods are based on the following principle:
each classifier C_j adds a vote V^j_i for each class c_i.
The class decided by the fuser is then the one that
collects the most votes. In case of a tie, a class is
chosen at random from those with the most votes.
Voting by Majority (VM) C_j adds a vote of 1 for the
class it has the most confidence in (i.e. the class c_i for
which x^j_{i,k} is closest to 1), and a vote of 0 for all
other classes.
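A minimal sketch of VM, including the random tie-break described above (the decision values in `x` are made up for illustration):

```python
import random
from collections import Counter

def vote_majority(x, rng=random):
    """Voting by Majority (VM): each classifier casts a vote of 1 for its
    most confident class and 0 elsewhere; ties are broken at random."""
    I = len(x[0])
    votes = Counter(max(range(I), key=lambda i: row[i]) for row in x)
    top = max(votes.values())
    winners = [c for c, n in votes.items() if n == top]
    return rng.choice(winners)  # random tie-break, as described above

# Made-up decisions x[j][i] for three classifiers and three classes.
x = [[0.9, 0.1, 0.2],
     [0.8, 0.3, 0.1],
     [0.2, 0.7, 0.4]]
print(vote_majority(x))  # two of three classifiers favour class 0 -> prints 0
```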
Voting by Borda Count (VBC) C_j adds a vote of
P ∈ {1, . . . , I} for the class it has the most confidence
in, P − 1 for the second most confident class, . . . , 1 for
the P-th most confident class, and 0 for all remaining
classes for which it has less confidence. If P = 1, this
VBC method is identical to VM.
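The Borda count above can be sketched as follows (again with made-up decision values; the tie-break is simplified to lowest index rather than the random choice used in the text):

```python
def vote_borda(x, P):
    """Voting by Borda Count (VBC): each classifier gives P points to its
    most confident class, P - 1 to the next, ..., 1 to its P-th choice,
    and 0 to all remaining classes.  P = 1 reduces to plain VM."""
    I = len(x[0])
    scores = [0] * I
    for row in x:
        ranked = sorted(range(I), key=lambda i: row[i], reverse=True)
        for rank, cls in enumerate(ranked[:P]):
            scores[cls] += P - rank
    # Ties broken by lowest index here for simplicity.
    return max(range(I), key=lambda i: scores[i])

x = [[0.9, 0.5, 0.2],
     [0.1, 0.8, 0.6],
     [0.3, 0.7, 0.5]]
print(vote_borda(x, P=2))  # class 1 collects 5 points and wins -> prints 1
```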
Weighted Votes (VW) C_j adds a vote of
V^j_i = ϕ(x^j_{i,k}), with ϕ a weighting function taking
into account the decision value. As in the Borda Count
voting method, only the P ∈ {1, . . . , I} classes the
classifier has the most confidence in can vote, while the
remaining classes receive a vote of 0.
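A sketch of VW under an assumed weighting function ϕ(d) = d² (the excerpt leaves ϕ generic, so this choice is purely illustrative, as are the decision values):

```python
def vote_weighted(x, P, phi=lambda d: d * d):
    """Weighted Votes (VW): a classifier's vote for a class is phi(x[j][i]),
    a weighting of the decision value itself; only its P most confident
    classes may vote, the rest contribute 0.  phi(d) = d*d is just one
    plausible weighting function, not one prescribed by the text."""
    I = len(x[0])
    scores = [0.0] * I
    for row in x:
        ranked = sorted(range(I), key=lambda i: row[i], reverse=True)
        for cls in ranked[:P]:
            scores[cls] += phi(row[cls])
    return max(range(I), key=lambda i: scores[i])

x = [[0.9, 0.5, 0.2],
     [0.1, 0.8, 0.6],
     [0.3, 0.7, 0.5]]
print(vote_weighted(x, P=2))  # class 1 accumulates the largest score -> prints 1
```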
Voting by Kumar and Raj (VK) A weighted vote
approach is also presented by (Kumar and Raj, 2015).
We define the positive set X^+_i (i.e. instances
belonging to a class c_i), the negative set X^−_i (i.e.
instances belonging to all the other classes) and β ∈ R
a regularization parameter. x^j_{i,k} is set to 0 if c_i is
not the best decided class for C_j (meaning only the
best class for
VISAPP 2016 - International Conference on Computer Vision Theory and Applications
494