tion process is not based on the feature measurements
of individual object samples but rather on a suitable
dissimilarity measure among the individual samples.
Therefore, in this experiment, after measuring the dis-
similarity among paired samples with the Euclidean
distance, the classifications were performed on the
constructed dissimilarity matrix. In the interest of
compactness, the details of DBC are omitted here, but
can be found in (Pekalska and Duin, 2005).
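The dissimilarity representation used here can be sketched as follows. This is a minimal NumPy illustration of building a Euclidean dissimilarity matrix, not the authors' implementation; the prototype set R and the function name are hypothetical choices for the example.

```python
import numpy as np

def dissimilarity_matrix(X, R):
    """Pairwise Euclidean distances between samples X and a representation
    set R. In dissimilarity-based classification (DBC), each sample is
    described by its distances to prototypes rather than by raw features."""
    # (n, 1, d) - (1, m, d) broadcasts to an (n, m) array of differences.
    diff = X[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy example: 3 samples represented against 2 prototypes.
X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
R = np.array([[0.0, 0.0], [3.0, 4.0]])
D = dissimilarity_matrix(X, R)   # shape (3, 2); a classifier is then
                                 # trained on the rows of D instead of X
```

A standard (feature-based) classifier can then be trained directly on the rows of D, which is what distinguishes DBC from FBC.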
Conventional SSMB and the newly proposed SSMB (referred to as SSMB-original and SSMB-improved, respectively) were performed with the number of weak learners ranging from 10 to 50 in increments of 5. Each experiment was repeated ten times.
The scalar decreasing function employed for the margin ρ was c(x) = e^{-x}. In particular, the step-length α_t = (1/4) ln((1 − ε_t)/ε_t), where ε_t = Σ_i w_t(i) δ(y_i g_t(x_i), −1), was commonly used for both SSMBs. For all of
the boosting algorithms, a decision-tree classifier
was used as the weak learner and implemented with
Prtools (Duin and Tax, 2004).
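The step-length computation above can be sketched as follows. This is a minimal NumPy illustration under the stated definitions; the indicator δ(y_i g_t(x_i), −1) equals 1 exactly when the weak learner g_t misclassifies x_i.

```python
import numpy as np

def step_length(w, y, g_x):
    """Step-length alpha_t = (1/4) * ln((1 - eps_t) / eps_t), where
    eps_t = sum_i w_t(i) * delta(y_i * g_t(x_i), -1) is the weighted
    error of the weak learner g_t over normalized weights w."""
    eps = np.sum(w * (g_x != y))      # indicator is 1 on misclassification
    return 0.25 * np.log((1.0 - eps) / eps)

# Toy check: uniform weights, one of four samples misclassified.
w = np.full(4, 0.25)
y = np.array([1, -1, 1, 1])
g = np.array([1, -1, 1, -1])          # last prediction wrong -> eps = 0.25
alpha = step_length(w, y, g)          # (1/4) * ln(3)
```

Note that ε_t must lie strictly in (0, 1) for the logarithm to be defined; boosting implementations typically stop or clip when the weak learner is perfect or worse than chance.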
First, the experimental results obtained with the
two classifying approaches, FBC and DBC, for the
Nist389, mfeat-fac, and mfeat-kar databases were in-
vestigated. Fig. 2 shows a comparison of the error
rates (and standard deviations) of SemiBoost, SSMB-
original, and SSMB-improved, obtained with the two
classifying approaches for Nist389. Here, in the inter-
est of brevity, the other results are omitted. Also, to
reduce the computational complexity, the dimensionality of all of the data sets was reduced to 10 using principal component analysis (PCA).
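The PCA reduction step might look as follows. This is a generic SVD-based sketch, not the code used in the experiments, and the data matrix here is randomly generated for illustration only.

```python
import numpy as np

def pca_reduce(X, n_components=10):
    """Project X onto its top principal components (used here to cut
    computational cost before classification)."""
    Xc = X - X.mean(axis=0)            # center the data
    # SVD of the centered data; rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 64))      # e.g., 50 samples of 64-dim features
X10 = pca_reduce(X, n_components=10)   # reduced to 10 dimensions
```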
From Fig. 2, it can be observed that, in general, the classification accuracies of SSMB, estimated with FBC and DBC, are improved. This is clearly shown in the error rates of the
ensemble classifiers, as represented by the red lines
(dashed and solid lines with the ♦ marker) and the
black and blue lines (dashed and solid lines with their respective markers). For all three data sets
and for each repetition, the ranking of the error rates is always the same: SSMB-improved achieves the lowest, followed by SSMB-original and then SemiBoost. That is, the winner is always SSMB-improved. In addition, it should be pointed out that the improvements obtained with DBC and FBC were similar: across the repetitions, the increases and decreases in the error rates of the two approaches were consistent.
Furthermore, the following is an interesting is-
sue to investigate: Is the classification accuracy of
the improved SSMB algorithm better (or more robust)
than those of conventional schemes when changing
the amount of the selected strong data? To answer
this question, for the data sets, we repeated the ex-
periment with four different strong data sizes (i.e.,
T_u^5, T_u^10, T_u^15, T_u^20) and ten repetitions, as was done
previously under the same experimental conditions.
The experimental results in this case showed that the
error rates obtained with the four differently sized in-
stances of strong unlabeled data are similar.
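A minimal sketch of selecting "strong" unlabeled data by confidence follows. Interpreting the superscripts of T_u^5 through T_u^20 as selection fractions of the unlabeled set is an assumption made for this illustration, and `select_strong_unlabeled` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def select_strong_unlabeled(confidences, fraction):
    """Return the indices of the most confidently pseudo-labeled unlabeled
    samples ("strong" data). `fraction` plays the role of the varied
    strong-data sizes (e.g., 0.05, 0.10, 0.15, 0.20 of the unlabeled set)."""
    k = max(1, int(len(confidences) * fraction))
    # argsort is ascending, so the last k indices are the most confident.
    return np.argsort(confidences)[-k:]

# Toy confidences for 10 unlabeled samples; take the top 20% (2 samples).
conf = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.95, 0.3, 0.6, 0.7, 0.05])
strong_idx = select_strong_unlabeled(conf, 0.20)
```

The selected samples would then be added, with their pseudo-labels, to the training pool for the next incremental round.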
Table 1 shows a numerical comparison of the error
rates obtained with AdaBoost, MarginBoost, Semi-
Boost, and the original and improved SSMB algo-
rithms for the three data sets. Here, two supervised
boosting algorithms, AdaBoost and MarginBoost,
were employed as a reference for comparison. These
supervised algorithms were trained with only 20%
of the labeled training data and were evaluated with
10% of the labeled test data, while the three semi-
supervised algorithms, SemiBoost, SSMB-original,
and SSMB-improved, were trained with 70% of the
unlabeled training data as well as 20% of the labeled
data. They were also evaluated with 10% of
the labeled test data. For all of the boosting algo-
rithms, the number of weak classifiers was identical,
at t_1 = 50. In the table, the estimated error rates that
increase and/or decrease more than the sum of the
standard deviations are underlined.
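The 20%/70%/10% split described above might be produced as follows. This is a generic sketch of the protocol, with `split_indices` a hypothetical helper rather than the authors' code.

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle sample indices and split them into 20% labeled training,
    70% unlabeled training, and 10% labeled test data, matching the
    evaluation protocol described for the semi-supervised algorithms."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_lab = int(0.2 * n)
    n_unlab = int(0.7 * n)
    labeled = idx[:n_lab]
    unlabeled = idx[n_lab:n_lab + n_unlab]
    test = idx[n_lab + n_unlab:]
    return labeled, unlabeled, test

lab, unlab, test = split_indices(100)
```

Under this protocol, the supervised baselines (AdaBoost, MarginBoost) see only the labeled 20%, while the semi-supervised methods additionally use the unlabeled 70%; all are scored on the same held-out 10%.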
To investigate the advantage of incrementally us-
ing strong unlabeled data further and especially to de-
termine which types of significant data sets are more
suitable for the scheme, we repeated the experiment
with a few UCI data sets. From the results obtained,
as in Table 1, it should be noted again that the classi-
fication accuracy of the SSMB algorithm can be gen-
erally improved when utilizing the unlabeled data in
an incremental learning fashion. However, the pro-
posed scheme does not work satisfactorily with low-
dimensional data sets. That is, for high-dimensional
data sets, the difference in the error rates of SSMB-
original and SSMB-improved schemes is relatively
large, whereas the difference in the error rates for low-
dimensional data sets is marginal.
4 CONCLUSIONS
In an effort to improve the classification performance
of SSMB, in this paper we used an incremental learn-
ing strategy with which the SSMB can be imple-
mented efficiently. We first computed the similarity
matrix of labeled and unlabeled data and, in turn, se-
lected a small amount of strong unlabeled samples
based on their confidence levels. We then trained a
classification model using the selected unlabeled sam-
ples as well as labeled samples in an incremental fash-
ion. The proposed strategy was evaluated with well-