tion process is not based on the feature measurements
of individual object samples but rather on a suitable
dissimilarity measure among the individual samples.
Therefore, in this experiment, after measuring the dis-
similarity among paired samples with the Euclidean
distance, the classifications were performed on the
constructed dissimilarity matrix. In the interest of
compactness, the details of DBC are omitted here, but
can be found in (Pekalska and Duin, 2005).
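The dissimilarity representation used here can be sketched as follows. This is a minimal NumPy illustration of building a Euclidean dissimilarity matrix, not the authors' implementation; the prototype set R and the function name are hypothetical choices for the example.

```python
import numpy as np

def dissimilarity_matrix(X, R):
    """Pairwise Euclidean distances between samples X and a representation
    set R. In dissimilarity-based classification (DBC), each sample is
    described by its distances to prototypes rather than by raw features."""
    # (n, 1, d) - (1, m, d) broadcasts to an (n, m) array of differences.
    diff = X[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy example: 3 samples represented against 2 prototypes.
X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
R = np.array([[0.0, 0.0], [3.0, 4.0]])
D = dissimilarity_matrix(X, R)   # shape (3, 2); a classifier is then
                                 # trained on the rows of D instead of X
```

A standard (feature-based) classifier can then be trained directly on the rows of D, which is what distinguishes DBC from FBC.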
Conventional SSMB and the newly proposed SSMB (referred to as SSMB-original and SSMB-improved, respectively) were performed with the number of weak learners ranging from 10 to 50 in increments of 5. Each experiment was repeated ten times.
The scalar decreasing function employed for the margin ρ was c(x) = e^{-x}. In particular, the step-length α_t = (1/4) ln((1 − ε_t)/ε_t), where ε_t = Σ_i w_t(i) δ(y_i g_t(x_i), −1), was commonly used for both SSMBs. For all of
the boosting algorithms, a decision-tree classifier
was used as the weak learner and implemented with
Prtools (Duin and Tax, 2004).
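The step-length computation above can be sketched as follows. This is a minimal NumPy illustration under the stated definitions; the indicator δ(y_i g_t(x_i), −1) equals 1 exactly when the weak learner g_t misclassifies x_i.

```python
import numpy as np

def step_length(w, y, g_x):
    """Step-length alpha_t = (1/4) * ln((1 - eps_t) / eps_t), where
    eps_t = sum_i w_t(i) * delta(y_i * g_t(x_i), -1) is the weighted
    error of the weak learner g_t over normalized weights w."""
    eps = np.sum(w * (g_x != y))      # indicator is 1 on misclassification
    return 0.25 * np.log((1.0 - eps) / eps)

# Toy check: uniform weights, one of four samples misclassified.
w = np.full(4, 0.25)
y = np.array([1, -1, 1, 1])
g = np.array([1, -1, 1, -1])          # last prediction wrong -> eps = 0.25
alpha = step_length(w, y, g)          # (1/4) * ln(3)
```

Note that ε_t must lie strictly in (0, 1) for the logarithm to be defined; boosting implementations typically stop or clip when the weak learner is perfect or worse than chance.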
First, the experimental results obtained with the
two classifying approaches, FBC and DBC, for the
Nist389, mfeat-fac, and mfeat-kar databases were in-
vestigated. Fig. 2 shows a comparison of the error
rates (and standard deviations) of SemiBoost, SSMB-
original, and SSMB-improved, obtained with the two
classifying approaches for Nist389. Here, in the inter-
est of brevity, the other results are omitted. Also, to
reduce the computational complexity, the dimensionality of all of the data sets was reduced to 10 using principal component analysis (PCA).
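The PCA reduction step might look as follows. This is a generic SVD-based sketch, not the code used in the experiments, and the data matrix here is randomly generated for illustration only.

```python
import numpy as np

def pca_reduce(X, n_components=10):
    """Project X onto its top principal components (used here to cut
    computational cost before classification)."""
    Xc = X - X.mean(axis=0)            # center the data
    # SVD of the centered data; rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 64))      # e.g., 50 samples of 64-dim features
X10 = pca_reduce(X, n_components=10)   # reduced to 10 dimensions
```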
From Fig. 2, it can be observed that, in general, the classification accuracies of SSMB, estimated with FBC and DBC, are improved. This is clearly shown in the error rates of the
ensemble classifiers, as represented by the red lines
(dashed and solid lines with the ♦ marker) and the
black and blue lines (dashed and solid lines with their respective markers). For all three data sets
and for each repetition, the ranking of the error rates is always the same: SSMB-improved achieves the lowest, followed by SSMB-original and then SemiBoost. That is, the winner is always SSMB-improved. In addition, it should be pointed out that the improvements obtained with DBC and FBC were similar: across the repetitions, the increases and decreases in the error rates of the two approaches were consistent.
Furthermore, the following is an interesting is-
sue to investigate: Is the classification accuracy of
the improved SSMB algorithm better (or more robust)
than those of conventional schemes when changing
the amount of the selected strong data? To answer
this question, for the data sets, we repeated the ex-
periment with four different strong data sizes (i.e.,
T_u^5, T_u^10, T_u^15, T_u^20) and ten repetitions, as was done
previously under the same experimental conditions.
The experimental results in this case showed that the
error rates obtained with the four differently sized in-
stances of strong unlabeled data are similar.
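A minimal sketch of selecting "strong" unlabeled data by confidence follows. Interpreting the superscripts of T_u^5 through T_u^20 as selection fractions of the unlabeled set is an assumption made for this illustration, and `select_strong_unlabeled` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def select_strong_unlabeled(confidences, fraction):
    """Return the indices of the most confidently pseudo-labeled unlabeled
    samples ("strong" data). `fraction` plays the role of the varied
    strong-data sizes (e.g., 0.05, 0.10, 0.15, 0.20 of the unlabeled set)."""
    k = max(1, int(len(confidences) * fraction))
    # argsort is ascending, so the last k indices are the most confident.
    return np.argsort(confidences)[-k:]

# Toy confidences for 10 unlabeled samples; take the top 20% (2 samples).
conf = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.95, 0.3, 0.6, 0.7, 0.05])
strong_idx = select_strong_unlabeled(conf, 0.20)
```

The selected samples would then be added, with their pseudo-labels, to the training pool for the next incremental round.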
Table 1 shows a numerical comparison of the error
rates obtained with AdaBoost, MarginBoost, Semi-
Boost, and the original and improved SSMB algo-
rithms for the three data sets. Here, two supervised
boosting algorithms, AdaBoost and MarginBoost,
were employed as a reference for comparison. These
supervised algorithms were trained with only 20%
of the labeled training data and were evaluated with
10% of the labeled test data, while the three semi-
supervised algorithms, SemiBoost, SSMB-original,
and SSMB-improved, were trained with 70% of the
unlabeled training data as well as 20% of the labeled
data. They were also evaluated with 10% of
the labeled test data. For all of the boosting algo-
rithms, the number of weak classifiers was identical,
at t_1 = 50. In the table, the estimated error rates that
increase and/or decrease more than the sum of the
standard deviations are underlined.
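The 20%/70%/10% split described above might be produced as follows. This is a generic sketch of the protocol, with `split_indices` a hypothetical helper rather than the authors' code.

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle sample indices and split them into 20% labeled training,
    70% unlabeled training, and 10% labeled test data, matching the
    evaluation protocol described for the semi-supervised algorithms."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_lab = int(0.2 * n)
    n_unlab = int(0.7 * n)
    labeled = idx[:n_lab]
    unlabeled = idx[n_lab:n_lab + n_unlab]
    test = idx[n_lab + n_unlab:]
    return labeled, unlabeled, test

lab, unlab, test = split_indices(100)
```

Under this protocol, the supervised baselines (AdaBoost, MarginBoost) see only the labeled 20%, while the semi-supervised methods additionally use the unlabeled 70%; all are scored on the same held-out 10%.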
To investigate the advantage of incrementally us-
ing strong unlabeled data further and especially to de-
termine which types of significant data sets are more
suitable for the scheme, we repeated the experiment
with a few UCI data sets. From the results obtained,
as in Table 1, it should be noted again that the classi-
fication accuracy of the SSMB algorithm can be gen-
erally improved when utilizing the unlabeled data in
an incremental learning fashion. However, the pro-
posed scheme does not work satisfactorily with low-
dimensional data sets. That is, for high-dimensional
data sets, the difference in the error rates of SSMB-
original and SSMB-improved schemes is relatively
large, whereas the difference in the error rates for low-
dimensional data sets is marginal.
4 CONCLUSIONS
In an effort to improve the classification performance
of SSMB, in this paper we used an incremental learn-
ing strategy with which the SSMB can be imple-
mented efficiently. We first computed the similarity
matrix of labeled and unlabeled data and, in turn, se-
lected a small amount of strong unlabeled samples
based on their confidence levels. We then trained a
classification model using the selected unlabeled sam-
ples as well as labeled samples in an incremental fash-
ion. The proposed strategy was evaluated with well-