Figure 5: Example of Feature Dimensions with Different Distributions.
4.2 Test Pipeline
In our experiment, classifiers have to be trained that predict the class label of an input vector with high confidence. Support Vector Machines (SVMs) were chosen because similarity measures can be applied directly as kernels, and because they are fairly efficient with respect to computational cost. To extend the SVM concept to a multi-class problem, the one-vs-one approach (Milgram et al., 2006) was implemented.
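To make this setup concrete, the following sketch shows how a similarity measure can be plugged into a multi-class SVM as a precomputed kernel. scikit-learn is used only for illustration, and dpm_similarity is a hypothetical stand-in for the actual DPM measure, which is not restated here.

import numpy as np
from sklearn.svm import SVC

def dpm_similarity(x, y):
    # Hypothetical stand-in for the DPM similarity between two feature
    # vectors; the actual per-dimension counting/measuring combination
    # is not reproduced here.
    return float(np.exp(-np.sum(np.abs(x - y))))

def gram_matrix(A, B, sim):
    # Pairwise similarity (Gram) matrix between the rows of A and B.
    return np.array([[sim(a, b) for b in B] for a in A])

def train_and_predict(X_train, y_train, X_test, sim=dpm_similarity):
    # SVC with a precomputed kernel; scikit-learn handles the multi-class
    # case internally with the one-vs-one scheme.
    K_train = gram_matrix(X_train, X_train, sim)
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y_train)
    K_test = gram_matrix(X_test, X_train, sim)  # test rows vs. training columns
    return clf.predict(K_test)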
Usually, one would reduce the dimensionality, e.g. by means of PCA or SVD, but for our experiment we decided to omit this step in order to retain full diagnosticity for each dimension.
The experimental setting was as follows:
1. Input data (observation vectors and ground truth) is read in and randomly divided into training and test sets (using a holdout of 50%).
2. From the training set, for each feature dimension in random order, the measures $m_1$, $m_2$, $m_3$ and $m_4$ are tried out by the greedy algorithm: at each step, a set of SVMs is trained one-by-one, using DPM similarity as kernel. The resulting classifier is cross-validated on the test set. From the four possibilities, the measure which yields the best classification result (maximal global $F_1$ score) is kept for this dimension (see the sketch after this list).
3. In the end, a DPM similarity kernel is obtained
that is approximately optimal.
4. The performance of this DPM kernel is compared to single kernels (linear, quadratic, polynomial, Radial Basis Function) in terms of precision, recall, fallout and $F_1$ score, per class and averaged.
Evaluation was performed in terms of precision, recall, fallout and $F_1$ score, per class and globally. For global performance estimation of a model, macro-averaging of the per-class results was used, as proposed by Yang and Liu, albeit in the context of text categorisation (Yang and Liu, 1999):
$$F_1^{\mathrm{global}} = \frac{\sum_{i=1}^{k} F_1(i)}{k}, \quad \text{where } k \text{ is the number of classes} \qquad (6)$$
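As an illustration of Equation (6), per-class precision, recall, fallout and $F_1$ and their macro average can be computed as in the following sketch (plain NumPy; class labels are assumed to be integers):

import numpy as np

def per_class_scores(y_true, y_pred, classes):
    # Precision, recall, fallout and F1 per class (one-vs-rest counts).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        tn = np.sum((y_pred != c) & (y_true != c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        fallout = fp / (fp + tn) if fp + tn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = {"precision": precision, "recall": recall,
                     "fallout": fallout, "f1": f1}
    return scores

def global_f1(scores):
    # Equation (6): unweighted (macro) mean of the per-class F1 scores.
    return sum(s["f1"] for s in scores.values()) / len(scores)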
4.3 Results
Figure 6 shows the precision-recall curves for each
class, collected from ten test runs. One figure is given
per kernel type. As can be seen, the DPM kernel and
the linear kernel come closest to the optimal recall and
precision (upper right corner of the figures).
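A sketch of how such per-class precision-recall points can be collected over repeated test runs is given below; run_classifier is a hypothetical function that performs one random 50% holdout split, trains one classifier and returns the test-set ground truth and predictions.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support

def plot_pr_points(run_classifier, classes, n_runs=10):
    # Scatter per-class (recall, precision) points over repeated test runs.
    points = {c: [] for c in classes}
    for _ in range(n_runs):
        y_true, y_pred = run_classifier()
        prec, rec, _, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=classes, zero_division=0)
        for c, p, r in zip(classes, prec, rec):
            points[c].append((r, p))
    for c in classes:
        recalls, precisions = zip(*points[c])
        plt.scatter(recalls, precisions, label=f"class {c}")
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.legend()
    plt.show()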
Table 2 contains the performance indicators, macro-averaged over all classes, hence equally penalising classification error rates among classes of different size.
Table 2: Averaged Global Performance.

                  linear    quadratic  polynomial  RBF      DPM
avg. precision    0.4598    0.4132     0.3043      0.1509   0.5823
avg. recall       0.3914    0.2688     0.2173      0.2000   0.4446
avg. fallout      0.1370    0.1704     0.1933      0.2000   0.1112
avg. F1 score     0.4131    0.2829     0.2058      0.1720   0.4836
4.4 Discussion
Our experiments demonstrate that, for our data set of MPEG-7 features, the Dual Process Model as trained by the greedy algorithm always performs better than the best single kernel (which was the linear kernel in all cases). In numbers, the global average precision of the DPM kernel is 26.66% higher than for the linear kernel, recall is 13.58% higher, fallout is 12.41% lower, and the $F_1$ score is 17.07% higher.
Quadratic and polynomial kernels performed slightly worse than the linear one, and RBF had the problem of always assigning all instances to the same class, namely the largest one. This is an interesting finding per se. Arguably, RBF works well by design in classification problems where the class sizes are of the same order of magnitude, but cannot cope with very unequally sized classes.
In terms of classes, it is evident that class 1 (with only one class member) failed in all models, whereas class 5 (the largest one) performed best, with precision, recall and $F_1$ values close to 1.0. Regarding the smaller classes 2 to 4, the discriminative power of the DPM kernel comes into play: here, the DPM has always performed better in terms of recall, precision and fallout than all other kernels under consideration.