Discriminative Prior Bias Learning for Pattern Classification
Takumi Kobayashi and Kenji Nishida
National Institute of Advanced Industrial Science and Technology,
1-1-1 Umezono, Tsukuba, Japan
Keywords:
Pattern Classification, Discriminative Learning, Bias, SVM.
Abstract:
Prior information has been effectively exploited mainly using probabilistic models. In this paper, by focusing on the bias embedded in the classifier, we propose a novel method to discriminatively learn the prior bias based on extra prior information assigned to the samples other than the class category, e.g., the 2-D position where the local image feature is extracted. The proposed method is formulated in the framework of maximum margin to adaptively optimize the biases, improving the classification performance. We also present a computationally efficient optimization approach that makes the method even faster than the standard SVM of the same size. Experimental results on patch labeling in on-board camera images demonstrate the favorable performance of the proposed method in terms of both classification accuracy and computation time.
1 INTRODUCTION
Prior information has been effectively exploited in the fields of computer vision and machine learning, such as for shape matching (Jiang et al., 2009), image segmentation (El-Baz and Gimel’farb, 2009), graph inference (Cremers and Grady, 2006), transfer learning (Jie et al., 2011) and multi-task learning (Yuan et al., 2013). Learning the prior has so far been addressed mainly in the probabilistic framework on the assumption that the prior is defined by a certain type of generative probabilistic model (Wang et al., 2010; Kapoor et al., 2009); in particular, the non-parametric Bayesian approach further considers hyper-priors of the probabilistic models (Ghosh and Ramamoorthi, 2003).
In this paper, we focus on the classifier, y = w · x + b, and especially on its bias term, the so-called 'b term' (Poggio et al., 2001) [1], while some transfer learning methods are instead built upon a prior on the weight w for effectively transferring knowledge to novel class categories (Jie et al., 2011; Gao et al., 2012), and such a prior on w also induces a regularization on w. The bias is regarded as rendering the prior information on the class probabilities (Bishop, 1995; Van Gestel et al., 2002), and we aim to learn the unstructured prior bias b without assuming any specific model. While the bias b is generally set as a constant across samples, depending only on the class category, in this study we define it adaptively based on extra prior information other than the class category, as follows.

[1] In this paper, we describe the classifier in this linear form for simplicity, but the proposed method also works on kernel-based classifiers by simply replacing the feature x with the kernel feature φ(x) in the reproducing kernel Hilbert space.

Figure 1: Patch labeling. The task is to predict the class labels c of the patches, each of which consists of the appearance feature vector x and the prior position p; there are P positions in total.
Suppose samples are associated with extra prior information p ∈ {1,..,P} as well as the class category c ∈ {1,..,C}, where P and C indicate the total number of prior types and class categories, respectively. For instance, in the task of labeling patches in on-board camera images, each patch (sample) is assigned the appearance feature x, the class category c and the position (extra prior information) p, as shown in Fig. 1. Not only the feature x but also the prior position p where the feature is extracted is useful for predicting the class category of the patch; the patches in an upper region
probably belong to the sky while those in a lower region would be the road, even though the patches extracted from those two regions are both less textured and thus yield similar features.
The probabilistic structure that we assume in this study is shown in Fig. 2b, in comparison to the simple model in Fig. 2a. By using a generalized linear model (Bishop, 2006), the standard classifier (Fig. 2a) is formulated to estimate the posterior on the class category c as [2]

log p(c|x) ∝ log p(x|c) + log p(c) = w_c · x + b_c,   (1)

where b_c = log p(c) indicates the class-dependent bias. On the other hand, the proposed model (Fig. 2b) using the prior p induces the following classifier:

log p(c|x, p) ∝ log p(x|c) + log p(p|c) + log p(c) = w_c · x + b_c^[p],   (2)
where the bias b_c^[p] = log p(p|c) + log p(c) depends on both the class category c and the prior information p. Thus, if the bias could be properly determined, the classification performance would be improved compared to the standard classification model (1). One might also consider the full-connected model shown in Fig. 2c, whose classifier is formulated by

log p(c|x, p) ∝ log p(x|c, p) + log p(p|c) + log p(c) = w_c^[p] · x + b_c^[p],   (3)

where the classifier weight w_c^[p] relies on the prior p as the bias b_c^[p] does. This model is more complicated and consumes a large amount of memory since a classifier model {w_c^[p], b_c^[p]} is prepared for each prior p = 1,..,P. Moreover, due to the high degree of freedom (D.O.F.) of this model, it would be vulnerable to over-learning. These models are summarized in Table 1 and will be discussed again later.
In this paper, we propose a novel method for discriminatively learning the prior biases b_c^[p] in (2) so as to improve the classification performance. The proposed method is formulated as an optimization problem under the maximum margin criterion (Smola et al., 2000). We also propose a computationally efficient approach for the optimization, which involves a large number of samples drawn from all the priors p ∈ {1,..,P}. Thereby, the proposed method is even faster than the standard SVM (Vapnik, 1998) of the same size, while providing a high-performance classifier that exploits the prior information.
[2] '∝' in (1) denotes equality up to the irrelevant constant term log p(x), or log p(x, p) in (2) and (3).
Figure 2: Graphical models depicting the probabilistic dependencies. The notations c, x and p denote the class category, the (appearance) feature vector and the extra prior information, respectively, and the arrows show the probabilistic dependencies. (a) In the simple model, the feature x is simply drawn from the class category c. (b) The proposed model incorporates the extra prior information p, which is connected to x via c. (c) The three variables are fully connected in the full-connected model.
Table 1: Classification models for the c-th class category. The dimensionality of the feature vector is denoted by D, x ∈ R^D, and the number of prior types is P.

Method            Model                            D.O.F.
simple            y_c = w_c · x + b_c              D + 1
proposed          y_c = w_c · x + b_c^[p]          D + P
full-connected    y_c = w_c^[p] · x + b_c^[p]      PD + P
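To make the comparison in Table 1 concrete, the following minimal sketch (illustrative only; the variable names and array shapes are our own assumptions, not from the paper) contrasts the three decision functions for a single class category.

    import numpy as np

    # Assumed shapes: x is a D-dimensional feature, p an integer prior index in {0,..,P-1}.
    def score_simple(w, b, x):
        # simple model: one weight vector and one scalar bias; the prior is ignored.
        return w @ x + b                     # w: (D,), b: scalar

    def score_proposed(w, b, x, p):
        # proposed model: a shared weight vector with a prior-dependent bias b[p].
        return w @ x + b[p]                  # w: (D,), b: (P,)

    def score_full_connected(W, b, x, p):
        # full-connected model: both the weight and the bias depend on the prior p.
        return W[p] @ x + b[p]               # W: (P, D), b: (P,)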
2 BIAS LEARNING
We detail the proposed method by first defining the formulation for learning the biases and then presenting a computationally efficient approach to optimize them. As we proceed to describe a general form regarding the prior biases, it might be helpful to keep in mind the task of labeling patches in on-board camera images shown in Fig. 1; a sample is represented by the appearance feature x and the prior position p ∈ {1,..,P}.
2.1 Formulation
We consider a binary class problem for simplicity and take a one-vs-rest approach for multi-class tasks. Suppose we have P types of prior information, and let x_i^[p] ∈ R^D denote the D-dimensional feature vector of the i-th sample (i = 1,..,n^[p]) drawn from the p-th type of prior. As described in Sec. 1, we deal with the classification defined by

y = w · x^[p] + b^[p],   (4)
where y denotes the classifier output, which is subsequently thresholded at zero to perform binary classification, and w and b^[p] are the classifier weight vector and the bias, respectively. Note again that
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
68
the bias b^[p] depends on the p-th type of prior, p ∈ {1,..,P}. The classifier (4) can be learned via the following optimization formulation in the framework of maximum margin (Smola et al., 2000):

min_{w, {b^[p]}_p}  (1/2)||w||^2 + C Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} ξ_i^[p]   (5)
s.t.  ∀p ∈ {1,..,P}, ∀i ∈ {1,..,n^[p]},
      y_i^[p] (w · x_i^[p] + b^[p]) ≥ 1 − ξ_i^[p],   ξ_i^[p] ≥ 0,
where C is the cost parameter. This problem is obviously convex and its Lagrangian is written as
L = (1/2)||w||^2 + C Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} ξ_i^[p] − Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} β_i^[p] ξ_i^[p]
    − Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} α_i^[p] { y_i^[p] (w · x_i^[p] + b^[p]) − 1 + ξ_i^[p] },   (6)
where we introduce the Lagrange multipliers α_i^[p] ≥ 0 and β_i^[p] ≥ 0. The derivatives of the Lagrangian are
∂L/∂w = w − Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} α_i^[p] y_i^[p] x_i^[p] = 0   ⇒   w = Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} α_i^[p] y_i^[p] x_i^[p],   (7)
∂L/∂ξ_i^[p] = C − α_i^[p] − β_i^[p] = 0   ⇒   0 ≤ α_i^[p] ≤ C,   (8)
∂L/∂b^[p] = Σ_{i=1}^{n^[p]} α_i^[p] y_i^[p] = 0.   (9)
Thereby, the dual is finally obtained as
min_{{α_i^[p]}_{i,p}}  (1/2) Σ_{p,q=1}^{P} Σ_{i=1}^{n^[p]} Σ_{j=1}^{n^[q]} α_i^[p] α_j^[q] y_i^[p] y_j^[q] x_i^[p] · x_j^[q]  −  Σ_{p=1}^{P} Σ_{i=1}^{n^[p]} α_i^[p]   (10)
s.t.  ∀p,  Σ_{i=1}^{n^[p]} α_i^[p] y_i^[p] = 0,    ∀i,p,  0 ≤ α_i^[p] ≤ C.
This is a quadratic programming (QP) problem analogous to the dual of the SVM (Vapnik, 1998), except that there exist P linear equality constraints with respect to α^[p]. A standard QP solver is applicable to (10), though it requires a substantial computation cost. For the QP of the SVM dual, the SMO method (Platt, 1999) is successfully applied, but in this case we cannot employ it directly due to the multiple equality constraints. In what follows, we present a computationally efficient approach to optimize (10).
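For reference, (10) can also be handed to a generic QP solver; the following sketch (a baseline under our own assumptions about the data layout, using the cvxopt package, and not part of the proposed method) shows how the P block-wise equality constraints enter such a solver.

    import numpy as np
    from cvxopt import matrix, solvers

    def solve_dual_generic(X, y, prior, C):
        # Baseline solver for (10). X: (n, D) features, y: (n,) labels in {+1, -1},
        # prior: (n,) prior index per sample, C: cost parameter.
        n = X.shape[0]
        priors = np.unique(prior)
        Yx = y[:, None] * X
        Q = Yx @ Yx.T                                  # Q_ij = y_i y_j x_i . x_j
        q = -np.ones(n)                                # linear term of (10)
        G = np.vstack([-np.eye(n), np.eye(n)])         # box constraints 0 <= alpha <= C
        h = np.hstack([np.zeros(n), C * np.ones(n)])
        A = np.zeros((len(priors), n))                 # one equality constraint per prior p
        for k, p in enumerate(priors):
            A[k, prior == p] = y[prior == p]
        b = np.zeros(len(priors))
        sol = solvers.qp(matrix(Q), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b))
        return np.array(sol['x']).ravel()              # the optimal alpha

Such a dense solver scales poorly in n, which is exactly the motivation for the block-wise procedure developed next.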
2.2 Optimization
A large number of variables {α_i^[p]}_{i,p} in the QP (10) are inherently partitioned into block-wise variables with respect to the prior p; we obtain the P blocks α^[p] = {α_i^[p]}_{i=1,..,n^[p]} ∈ R^{n^[p]}, p = 1,..,P. According to these block-wise variables, (10) is decomposed into the following sub-problem as well:
min_{α^[p]}  (1/2) Σ_{i,j=1}^{n^[p]} α_i^[p] α_j^[p] y_i^[p] y_j^[p] x_i^[p] · x_j^[p]
           −  Σ_{i=1}^{n^[p]} α_i^[p] ( 1 − y_i^[p] Σ_{q≠p}^{P} Σ_{j=1}^{n^[q]} α_j^[q] y_j^[q] x_i^[p] · x_j^[q] )   (11)
s.t.  Σ_{i=1}^{n^[p]} α_i^[p] y_i^[p] = 0,    ∀i,  0 ≤ α_i^[p] ≤ C.
This is again a quadratic programming problem which resembles the SVM dual except for the linear term with respect to α^[p], and it is thus effectively optimized by SMO (Platt, 1999). Therefore, the whole procedure for optimizing (10) consists of iteratively optimizing the sub-problem (11) with respect to the prior p by means of SMO, as shown in Algorithm 1.
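The only difference from a standard SVM dual is the linear coefficient on α^[p] in (11), which collects the contribution of the currently fixed blocks q ≠ p. A small sketch of how that coefficient could be assembled is given below (our own illustrative code; the SMO solver applied to the resulting sub-problem is assumed to be available, e.g. from an existing SVM implementation).

    import numpy as np

    def linear_term(p, X_blocks, y_blocks, alpha_blocks):
        # Coefficient of alpha^[p] in (11):
        #   c_i = 1 - y_i^[p] * sum_{q != p} sum_j alpha_j^[q] y_j^[q] (x_i^[p] . x_j^[q]).
        Xp, yp = X_blocks[p], y_blocks[p]
        c = np.ones(len(yp))
        for q in range(len(X_blocks)):
            if q == p:
                continue
            s = X_blocks[q].T @ (alpha_blocks[q] * y_blocks[q])   # sum_j alpha_j y_j x_j, shape (D,)
            c -= yp * (Xp @ s)
        return c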
In order to discuss the convergence of the iterative optimization, we refer to the KKT conditions of (10) (Fan et al., 2005). The optimizer α_i^[p] satisfies the following conditions:

G_{i,p}(α) + b^[p] y_i^[p] = λ_i^[p] − μ_i^[p],   (12)
λ_i^[p] α_i^[p] = 0,   μ_i^[p] (C − α_i^[p]) = 0,   λ_i^[p] ≥ 0,   μ_i^[p] ≥ 0,
where G_{i,p}(α) = y_i^[p] x_i^[p] · Σ_{q=1}^{P} Σ_{j=1}^{n^[q]} α_j^[q] y_j^[q] x_j^[q] − 1 is the derivative of the objective function in (10) with respect to α_i^[p]. This is rewritten as
α_i^[p] < C :   G_{i,p}(α) + b^[p] y_i^[p] ≥ 0,   (13)
α_i^[p] > 0 :   G_{i,p}(α) + b^[p] y_i^[p] ≤ 0,   (14)

and since y_i^[p] ∈ {+1, −1}, the above conditions result in
−y_i^[p] G_{i,p}(α) ≤ b^[p]   for i ∈ I_+^[p],
−y_i^[p] G_{i,p}(α) ≥ b^[p]   for i ∈ I_−^[p],   (15)

where

I_+^[p] = { i | (α_i^[p] < C ∧ y_i^[p] = 1) ∨ (α_i^[p] > 0 ∧ y_i^[p] = −1) },   (16)
I_−^[p] = { i | (α_i^[p] < C ∧ y_i^[p] = −1) ∨ (α_i^[p] > 0 ∧ y_i^[p] = 1) }.   (17)
DiscriminativePriorBiasLearningforPatternClassification
69
Therefore, we can conclude that α_i^[p] is a stationary point if and only if

δ^[p] ≜ max_{i∈I_+^[p]} { −y_i^[p] G_{i,p}(α) }  −  min_{i∈I_−^[p]} { −y_i^[p] G_{i,p}(α) }  ≤ 0.   (18)
On the basis of this measure, we can stop the iteration when max_p δ^[p] < ε with a small tolerance ε > 0. The measure δ^[p] also provides a clue for effectively selecting the prior p to be optimized via (11); that is, we perform the (sub-)optimization (11) at p* = argmax_p δ^[p] so as to effectively decrease max_p δ^[p]. This approach will be empirically validated in the experiments. At the optimum, the bias b^[p] is retrieved by
b^[p] = (1 / |I_0^[p]|) Σ_{i∈I_0^[p]} { −y_i^[p] G_{i,p}(α) },   (19)
where  I_0^[p] = { i | 0 < α_i^[p] < C },   (20)

since the right-hand side of (12) equals zero for i ∈ I_0^[p].
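As a concrete illustration of (15)-(20), the following sketch (our own, assuming the per-sample gradients G_{i,p}(α) of one block are already available as an array) computes the violation measure δ^[p] and the bias b^[p] for a single prior.

    import numpy as np

    def kkt_gap_and_bias(alpha, y, G, C):
        # alpha, y, G: arrays over the samples of one prior p, with G[i] = G_{i,p}(alpha).
        # Assumes both index sets below are non-empty.
        v = -y * G                                        # -y_i^[p] G_{i,p}(alpha)
        I_plus  = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))   # (16)
        I_minus = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))   # (17)
        delta = v[I_plus].max() - v[I_minus].min()        # (18): <= 0 at a stationary point
        free = (alpha > 0) & (alpha < C)                  # I_0^[p] in (20)
        if free.any():
            b = v[free].mean()                            # (19)
        else:
            # fallback when there is no free support vector (a common SVM heuristic,
            # not prescribed by the paper): midpoint of the two bounds in (15).
            b = 0.5 * (v[I_plus].max() + v[I_minus].min())
        return delta, b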
Finally, we describe a technical tip for further reducing the computational cost of the optimization. From a practical viewpoint, the samples of the two class categories are not equally distributed across the priors p = 1,..,P but are localized in a limited number of priors. For instance, in the case of on-board camera images, the road never appears in upper image regions where the sky usually dominates. That is, we occasionally encounter the following sub-problem:
min_{α^[p]}  (1/2) Σ_{i,j=1}^{n^[p]} α_i^[p] α_j^[p] y_i^[p] y_j^[p] x_i^[p] · x_j^[p]
           −  Σ_{i=1}^{n^[p]} α_i^[p] ( 1 − y_i^[p] Σ_{q≠p}^{P} Σ_{j=1}^{n^[q]} α_j^[q] y_j^[q] x_i^[p] · x_j^[q] )   (21)
s.t.  Σ_{i=1}^{n^[p]} α_i^[p] y_i^[p] = 0,    ∀i,  0 ≤ α_i^[p] ≤ C,   (22)
      ∀i, y_i^[p] = 1    (or  ∀i, y_i^[p] = −1).   (23)
The above QP is trivially optimized by α^[p] = 0 due to the equality constraint in (22) together with (23), and the bias b^[p] can be determined as

b^[p] = +∞   if ∀i, y_i^[p] = 1,
        −∞   if ∀i, y_i^[p] = −1,   (24)
which means that the samples from such a prior are definitely classified as positive (or negative) no matter what their appearance features are. In this case, the class category depends solely on the prior information via the bias b^[p] ∈ {+∞, −∞}. This setting (24) might be too extreme and a milder one would be preferable for classification; this is left as future work. By eliminating such trivial types of prior, we can reduce the computational burden of the whole procedure to optimize (10). As a result, the proposed optimization procedure is summarized in Algorithm 1.

Algorithm 1: Bias Learning.
Input: {x_i^[p], y_i^[p]}: feature vector and class label of the i-th training sample from the p-th type of prior, p = 1,..,P, i = 1,..,n^[p]; ε > 0: small tolerance for terminating the iteration.
1: 𝒫 = { p | ∃i, y_i^[p] = +1  ∧  ∃i, y_i^[p] = −1 }
2: Initialization: ∀p ∈ {1,..,P}, α^[p] = 0
3: Randomly pick p ∈ 𝒫
4: repeat
5:    Set α^[p] to the optimizer of (11)
6:    Compute δ^[p] in (18), ∀p ∈ 𝒫
7:    p ← argmax_{p∈𝒫} δ^[p]
8: until max_{p∈𝒫} δ^[p] < ε
Output: w computed by (7) and {b^[p]}_{p=1,..,P} computed by (19) for p ∈ 𝒫 and by (24) for p ∉ 𝒫, using the optimizers {α^[p]}_p.
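A compact sketch of Algorithm 1 follows (our own pseudo-implementation, reusing the kkt_gap_and_bias helper sketched above; solve_subproblem stands for an SMO solver of the sub-problem (11) and is assumed rather than provided here).

    import numpy as np

    def gradient(p, X_blocks, y_blocks, alpha):
        # G_{i,p}(alpha) = y_i^[p] (x_i^[p] . sum_q sum_j alpha_j^[q] y_j^[q] x_j^[q]) - 1
        s = sum(X_blocks[q].T @ (alpha[q] * y_blocks[q]) for q in range(len(X_blocks)))
        return y_blocks[p] * (X_blocks[p] @ s) - 1.0

    def learn_prior_biases(X_blocks, y_blocks, C, eps, solve_subproblem, max_iter=10000):
        # X_blocks[p]: (n_p, D) features, y_blocks[p]: (n_p,) labels in {+1, -1} for prior p.
        P = len(X_blocks)
        alpha = [np.zeros(len(y)) for y in y_blocks]
        # line 1: non-trivial priors; trivial ones keep b^[p] = +/- inf by (24),
        # active ones are overwritten via (19) below.
        active = [p for p in range(P)
                  if (y_blocks[p] == 1).any() and (y_blocks[p] == -1).any()]
        b = np.array([np.inf if (y_blocks[p] == 1).all() else -np.inf for p in range(P)])
        p = int(np.random.choice(active))                                  # line 3
        for _ in range(max_iter):                                          # safeguard cap
            alpha[p] = solve_subproblem(p, X_blocks, y_blocks, alpha, C)   # line 5 (SMO)
            # lines 6-8: per-prior KKT gap (18); pick the worst violator next.
            deltas = {q: kkt_gap_and_bias(alpha[q], y_blocks[q],
                                          gradient(q, X_blocks, y_blocks, alpha), C)[0]
                      for q in active}
            p = max(deltas, key=deltas.get)                                # line 7
            if deltas[p] < eps:                                            # line 8
                break
        w = sum((alpha[q] * y_blocks[q]) @ X_blocks[q] for q in range(P))  # (7)
        for q in active:                                                   # (19)
            b[q] = kkt_gap_and_bias(alpha[q], y_blocks[q],
                                    gradient(q, X_blocks, y_blocks, alpha), C)[1]
        return w, b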
2.3 Discussion
In the proposed method, all samples across all types of priors are leveraged to train the classifier, which improves the generalization performance. In contrast, the full-connected method (Table 1) treats the samples separately with respect to the priors, and thus the p-th classifier is learnt by using only the small number of samples belonging to the p-th type of prior, which might degrade the performance. On the other hand, the simple method, which learns the classifier from the whole set of samples, is less discriminative since it does not utilize the prior information associated with the samples. The proposed method effectively introduces the priors into the classifier via the biases, which are discriminatively optimized.

The proposed method is somewhat close to cross-modal learning (Kan et al., 2012; Sharma and Jacobs, 2011). The samples belonging to different priors are separated as if they were in different modalities, though the feature representations are the same in this case. The proposed method deals with them in a unified manner via the adaptive prior biases. Actually, the proposed method is applicable even when the samples are distributed differently across the priors; suppose the sample
distribution is shifted (translated) as x^[q] = x^[p] + e. The prior bias can adapt to this shift by b^[q] = b^[p] − w · e, since

y^[p] = w · x^[p] + b^[p],
y^[q] = w · x^[q] + b^[q] = w · x^[p] + (b^[q] + w · e) = y^[p].

Therefore, the samples of the different priors are effectively transferred into the optimization to improve the classification performance.
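The invariance claimed above is easy to check numerically; the following sketch (our own toy example with random vectors, not an experiment from the paper) verifies that the adapted bias reproduces the same classifier output under a translated feature distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    w, e, x_p = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)
    b_p = 0.7

    x_q = x_p + e                 # sample shifted between priors p and q
    b_q = b_p - w @ e             # adapted prior bias

    # identical outputs: w . x^[p] + b^[p] == w . x^[q] + b^[q]
    print(np.isclose(w @ x_p + b_p, w @ x_q + b_q))   # True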
3 EXPERIMENTAL RESULTS
We evaluated the proposed method on patch labeling in on-board camera images by using the CamVid dataset (Brostow et al., 2008). This patch labeling contributes to understanding the scene surrounding the car.
3.1 Setting
The CamVid dataset (Brostow et al., 2008) contains several sequences composed of fully labeled image frames, as shown in Fig. 3: each pixel is assigned one of 32 class labels including 'void'. The labeled images are captured at 10 Hz. In this experiment, we employ the 11 major labels frequently seen in the image frames (road, building, sky, tree, sidewalk, car, column pole, sign symbol, fence, pedestrian and bicyclist) to form an 11-class classification task.

Figure 3: CamVid dataset (Brostow et al., 2008): an on-board image and the corresponding label image.
We extracted the GLAC image feature (Kobayashi and Otsu, 2008) from local image patches of 20 × 40 pixels, sliding at every 10 pixels over the image resized to 480 × 360. In this case, the feature vector x ∈ R^2112 is associated with the 2-D position of the patch as the extra prior information; the total number of prior types (grid points) is P = 1551. Thus, the task is to categorize the patch feature vectors extracted at the 1551 positions into the above-mentioned 11 classes.
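As a quick sanity check on the number of grid points above (our own arithmetic, assuming the patch grid is aligned to the image border), the positions can be enumerated as follows, which indeed yields P = 1551.

    # 480 x 360 image, 20 x 40 patches, 10-pixel stride (border alignment assumed).
    img_w, img_h = 480, 360
    patch_w, patch_h = 20, 40
    stride = 10

    xs = range(0, img_w - patch_w + 1, stride)    # 47 horizontal positions
    ys = range(0, img_h - patch_h + 1, stride)    # 33 vertical positions
    positions = [(x, y) for y in ys for x in xs]  # each grid point is one prior type p
    print(len(positions))                         # 1551 = 47 * 33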
We used three sequences in the CamVid dataset and partitioned each sequence into three sub-sequences along the time axis, one of which was used for training and the others for testing. This cross validation was repeated three times and the averaged classification accuracy is reported.
For comparison, we applied the methods mentioned in Sec. 1, i.e., the simple and full-connected methods listed in Table 1. The simple method is a standard classification using the weight w with the bias b, without relying on the prior information p. The full-connected method applies the classifiers comprising w^[p] and b^[p] at the respective priors p = 1,..,P. This method requires a tremendous amount of memory for those P classifiers; in this experiment, 2112-dimensional weight vectors w for the 11 class categories are stored at each of the 1551 positions. On the other hand, in the proposed method, the feature vectors are classified by using the identical weight w across the priors, together with the adaptively optimized bias b^[p] depending on the prior p.
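To give a rough sense of the storage gap (our own back-of-the-envelope figures derived from D = 2112, P = 1551 and the 11 classes, assuming 8-byte double precision; these numbers are not reported in the paper), the parameter counts of the three models are:

    D, P, K = 2112, 1551, 11              # feature dim., prior positions, class categories

    simple_params         = K * (D + 1)       #     23,243 parameters (~0.2 MB in doubles)
    proposed_params       = K * (D + P)       #     40,293 parameters (~0.3 MB)
    full_connected_params = K * (P * D + P)   # 36,049,893 parameters (~288 MB)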
3.2 Computation Cost
We evaluated the proposed method in terms of computation cost.

The first issue is related to the way of selecting p in the iterative optimization; the proposed procedure selects p deterministically by p* = argmax_p δ^[p] (line 7 in Algorithm 1). As alternatives to this selection, two other schemes are conceivable: sequential and random selection. In the sequential selection, the target prior p is simply selected in raster-scan order over the image frame. In the random selection, the target p is randomly picked from the whole set {1,..,P}. Fig. 4 shows the comparison results with respect to the objective cost in (10) and the gap max_p δ^[p] in (18) measuring violation of the KKT conditions, both of which should decrease toward convergence. The optimization converges quickly with the proposed selection, while with the other schemes the optimization takes a larger number of iterations until convergence; in particular, the sequential selection requires more than 10,000 iterations. These results reveal the importance of how the prior p to be optimized is selected, and show that the proposed scheme quickly decreases both the cost and the gap, leading to fast convergence.
The second issue is the scalability of the proposed method. The method trains the classifier by using all the samples across the priors, the scale of which is as large as in the simple method. Fig. 5a shows the computation time in comparison to the simple method for various sizes of training sets. Both methods are implemented in MATLAB using libsvm (Chang and Lin, 2001) on a Xeon 3.33 GHz PC [3]. The proposed method is significantly faster than the simple method. The time complexity of the simple method, which solves the standard SVM dual, has been empirically shown to be O(n^2.1) (Joachims, 1999).

[3] In this experiment, the feature vectors are actually converted into the form of the kernel Gram matrix, to which the QP solver in libsvm is directly applied, for a fair comparison of the QP problems in the proposed and simple methods.
DiscriminativePriorBiasLearningforPatternClassification
71
Figure 4: Comparison of the ways of selecting the target prior p (sequential, random and proposed) in terms of (a) the objective cost in (10) and (b) the gap max_p δ^[p] in (18), which measures violation of the KKT conditions.
The proposed optimization approach iteratively works on the block-wise subsets into which the whole training set is decomposed (Sec. 2.2). Each subset is regarded as a working set, whose size is an important factor for fast QP computation (Fan et al., 2005). It is an advantage of the proposed method that working sets of adequate size are inherently defined according to the priors. Thus, roughly speaking, the time complexity of the proposed method results in O(M (n/M)^2.1) = O(n^2.1 / M^1.1), where M denotes the number of block-wise subsets. In particular, the computation time essentially depends on the (resultant) number of support vectors (SVs); Fig. 5b shows the number of support vectors produced by the two methods. The proposed method produces a smaller number of support vectors, which significantly contributes to reducing the computation time. As a result, the proposed optimization approach works quite well together with the working-set (prior p) selection discussed in the previous experiment (Fig. 4). These results show the favorable scalability of the proposed method, especially compared to the standard simple method.

Figure 5: Comparison of the simple and proposed methods in terms of (a) computation time and (b) number of support vectors (SVs), for 5,000 to 40,000 training samples.
3.3 Classification Performance
We then compared the classification performance of the three methods: simple, full-connected and proposed (Table 1). Table 2 shows the overall performance, demonstrating that the proposed method outperforms the others. It should be noted that the full-connected method individually applies the classifier specific to each prior p ∈ {1,..,P}, requiring plenty of memory and consequently a long classification time due to loading the enormous model. The proposed method renders classification as fast as the simple method since it enlarges only the bias. By discriminatively optimizing the biases for the respective priors, the performance is significantly improved in comparison to the simple method; the improvement is especially found for the categories of car, pedestrian and bicyclist, which are composed of patch parts similar to those of other categories but are associated with distinct prior positions.

Figure 6: Maps of the biases learnt by the proposed method for the respective categories (road, building, sky, tree, sidewalk, car, column pole, sign symbol, fence, pedestrian and bicyclist). The magnitudes of the biases are shown by pseudo colors from (dark) blue to (dark) red. This figure is best viewed in color.
Finally, we show in Fig. 6 the biases learnt by the proposed method; the biases {b^[p]}_p are folded into the form of an image frame according to the x-y positions. These maps of the biases reflect the prior probability over the locations where the target category appears, and they seem quite reasonable from the viewpoint of the traffic rules that the car obeys; since the CamVid dataset was collected in the city of Cambridge (Brostow et al., 2008), the traffic rules in this case are those of the United Kingdom. The high biases for the sky are distributed above the horizon line, while those for the road are high in the lower part. Pedestrians mostly walk on the sidewalk, shown mainly on the left side. The oncoming cars run on the right-hand road, and the rows of buildings are found on the roadside. These biases are adaptively learnt from the CamVid dataset, and they would be different if we used other datasets collected under different traffic rules.
DiscriminativePriorBiasLearningforPatternClassification
73
Table 2: Classification accuracy (%).

class          simple   full-connected   proposed
road            93.10            93.80      94.92
building        75.90            72.96      78.70
sky             90.52            82.21      90.25
tree            70.49            77.59      79.95
sidewalk        77.06            78.43      81.36
car             53.84            58.64      65.16
column pole      9.53            16.15      12.85
sign symbol      1.73             1.62       1.70
fence            5.23            11.09      13.48
pedestrian      17.26            30.69      31.52
bicyclist       17.09            18.49      24.88
avg.            46.52            49.24      52.25
4 CONCLUSIONS
We have proposed a method to discriminatively learn the prior biases in classification. In the proposed method, all samples are utilized to train the classifier and an input sample is adequately classified based on the prior information via the learnt biases, which improves the classification performance. The proposed method is formulated in the maximum-margin framework, resulting in an optimization problem of QP form similar to the SVM dual. We also presented a computationally efficient approach to optimize the resultant QP along the lines of SMO. The experimental results on patch labeling in on-board camera images demonstrated that the proposed method is superior in terms of both classification accuracy and computation cost. In particular, the proposed classifier operates as fast as the standard (linear) classifier, and moreover the computation time for training the classifier is even smaller than that of an SVM of the same size.
REFERENCES
Bishop, C. M. (1995). Neural Networks for Pattern Recog-
nition. Oxford University Press, New York, NY.
Bishop, C. M. (2006). Pattern Recognition and Machine
Learning. Springer, Berlin, Germany.
Brostow, G. J., Shotton, J., Fauqueur, J., and Cipolla, R.
(2008). Segmentation and recognition using structure
from motion point clouds. In ECCV’08, the 10th Eu-
ropean Conference on Computer Vision, pages 44–57.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cremers, D. and Grady, L. (2006). Statistical priors for ef-
ficient combinatorial optimization via graph cuts. In
ECCV’06, the 9th European Conference on Computer
Vision, pages 263–274.
El-Baz, A. and Gimel’farb, G. (2009). Robust image segmentation using learned priors. In ICCV’09, the 12th International Conference on Computer Vision, pages 857–864.
Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set
selection using second order information for training
support vector machines. Journal of Machine Learn-
ing Research, 6:1889–1918.
Gao, T., Stark, M., and Koller, D. (2012). What makes a good detector? - structured priors for learning from few examples. In ECCV’12, the 12th European Conference on Computer Vision, pages 354–367.
Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonpara-
metrics. Springer, Berlin, Germany.
Jiang, T., Jurie, F., and Schmid, C. (2009). Learning shape
prior models for object matching. In CVPR’09, the
22nd IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 848–855.
Jie, L., Tommasi, T., and Caputo, B. (2011). Multi-
class transfer learning from unconstrained priors. In
ICCV’11, the 13th International Conference on Com-
puter Vision, pages 1863–1870.
Joachims, T. (1999). Making large-scale svm learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, USA.
Kan, M., Shan, S., Zhang, H., Lao, S., and Chen, X. (2012). Multi-view discriminant analysis. In ECCV’12, the 12th European Conference on Computer Vision, pages 808–821.
Kapoor, A., Hua, G., Akbarzadeh, A., and Baker, S. (2009). Which faces to tag: Adding prior constraints into active learning. In ICCV’09, the 12th International Conference on Computer Vision, pages 1058–1065.
Kobayashi, T. and Otsu, N. (2008). Image feature extraction
using gradient local auto-correlations. In ECCV’08,
the 10th European Conference on Computer Vision,
pages 346–358.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208. MIT Press, Cambridge, MA, USA.
Poggio, T., Mukherjee, S., Rifkin, R., Rakhlin, A., and
Verri, A. (2001). b. Technical Report CBCL Paper
#198/AI Memo #2001-011, Massachusetts Institute of
Technology, Cambridge, MA, USA.
Sharma, A. and Jacobs, D. (2011). Bypassing synthesis:
Pls for face recognition with pose, low-resolution and
sketch. In CVPR’11, the 24th IEEE Conference on
Computer Vision and Pattern Recognition (CVPR),
pages 593–600.
Smola, A. J., Bartlett, P., Schölkopf, B., and Schuurmans, D. (2000). Advances in Large-Margin Classifiers. MIT Press, Cambridge, MA, USA.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
74
Van Gestel, T., Suykens, J., Lanckriet, G., Lambrechts, A., De Moor, B., and Vandewalle, J. (2002). Bayesian framework for least squares support vector machine classifiers, gaussian processes and kernel fisher discriminant analysis. Neural Computation, 15(5):1115–1148.
Vapnik, V. (1998). Statistical Learning Theory. Wiley, New
York, NY, USA.
Wang, C., Liao, X., Carin, L., and Dunson, D. (2010). Clas-
sification with incomplete data using dirichlet process
priors. The Journal of Machine Learning Research,
11:3269–3311.
Yuan, C., Hu, W., Tian, G., Yang, S., and Wang, H. (2013).
Multi-task sparse learning with beta process prior for
action recognition. In CVPR’13, the 26th IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 423–430.
DiscriminativePriorBiasLearningforPatternClassification
75