sponding label C_{k+1}, and runs M rounds of the algorithm shown in Figure 1.
The idea is to add the new class to the system by taking advantage of the shared feature space previously defined in the first step, when the known tasks were classified. Provided an optimal binary grouping at each step, trained on a sufficiently large number of samples and classes, we assign the new class samples to the positive or negative cluster that minimizes the error criterion. The algorithm is iterated the same fixed number of times, M. Notice that the method allows the inclusion of many new tasks, since the same process can be repeated iteratively, adding a new class each time. This approach is computationally fast, as it avoids the most expensive step: finding the optimal binary subgroup.
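A minimal sketch of this class-addition step follows; the round objects and their predict/positive_classes interface are hypothetical placeholders for the shared weak classifiers learned in the first step, not an interface defined in the paper.

```python
import numpy as np

def add_class(rounds, X_new, new_label):
    """For each of the M previously trained shared-boosting rounds,
    assign the new class C_{k+1} to the positive or negative cluster,
    whichever yields the lower error on the new class samples. No weak
    learner is retrained, so the expensive search for the optimal
    binary grouping is skipped entirely.

    rounds    : M round objects; r.predict(X) returns +1/-1 weak
                responses, r.positive_classes is the positive cluster.
    X_new     : samples of the new class, already projected onto the
                shared feature space.
    new_label : label C_{k+1} assigned to the new class.
    """
    for r in rounds:
        h = r.predict(X_new)          # weak responses on the new samples
        err_pos = np.mean(h != +1)    # error if the class joins the positive cluster
        err_neg = np.mean(h != -1)    # error if it joins the negative cluster
        if err_pos <= err_neg:
            r.positive_classes.add(new_label)
    return rounds
```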
3 EXPERIMENTS
The experiments have been performed using two different face databases: the Face Recognition Grand Challenge (Phillips et al., 2005) and the AR Face database (Martinez and Benavente, 1998). The goal of the experimental section is to show how the performance of our proposal evolves as new classes are added to the system. We compare our proposal with a variation of the classic eigenface approach (Turk and Pentland, 1991) followed by an NN classification rule. We use PCA to extract 500 features from each data set, and then a discriminant analysis step is performed to obtain the 200 final features for each example. The NDA algorithm has been used for this purpose, since it has been shown to improve on other classic discriminant analysis techniques under the NN rule (Bressan and Vitria, 2003). New classes are added by projecting their training vectors onto the reduced space and using these projected features as a model for the new classification task.
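A hedged sketch of this baseline pipeline in scikit-learn is given below. Two assumptions: scikit-learn ships no NDA implementation, so LinearDiscriminantAnalysis stands in for the NDA step (it yields at most n_classes − 1 components, not the 200 NDA features used here), and PCA's n_components cannot exceed the number of training samples.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

baseline = make_pipeline(
    # First projection; 500 is the paper's value, capped in practice
    # by min(n_samples, n_features) of the training set.
    PCA(n_components=500),
    LinearDiscriminantAnalysis(),         # stand-in for the NDA step
    KNeighborsClassifier(n_neighbors=1),  # NN classification rule
)
# baseline.fit(X_train, y_train); adding a class then amounts to
# projecting its training vectors and extending the neighbor gallery.
```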
Images from both data sets were first converted from the original RGB space to gray scale. Then we perform a geometric normalization using the center coordinates of each eye: images are rotated and scaled according to the inter-eye distance, so that the eye centers coincide across all images. The samples were then cropped to a 37×33 thumbnail, so that only the internal region of the face is preserved. The final sample from each image is encoded as a 1221-dimensional feature vector. Figure 2 shows some examples from both databases. From each data set, we have used only images from subjects with at least 20 samples (10 for training and the rest for testing).
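A minimal sketch of this normalization, assuming OpenCV and known eye coordinates; the exact eye placement inside the 37×33 crop is not specified in the paper, so the values below are illustrative.

```python
import cv2
import numpy as np

def normalize_face(img_rgb, left_eye, right_eye, size=(33, 37), eye_dist=16.0):
    """Rotate, scale, and crop a face image using the eye centers.
    size is (width, height); eye_dist is the assumed target inter-eye
    distance in pixels."""
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))    # rotate so the eyes are horizontal
    scale = eye_dist / np.hypot(dx, dy)       # scale by the inter-eye distance
    mid = ((left_eye[0] + right_eye[0]) / 2.0,
           (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(mid, angle, scale)
    # Translate so the eye midpoint lands at a fixed (assumed) position.
    M[0, 2] += size[0] / 2.0 - mid[0]
    M[1, 2] += size[1] / 3.0 - mid[1]
    thumb = cv2.warpAffine(gray, M, size)     # 37x33 internal face region
    return thumb.reshape(-1)                  # 1221-dimensional feature vector
```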
3.1 Results
The experiments have been repeated 10 times; the results shown in Table 1 are the mean accuracies of each method, with the 95% confidence intervals given next to each value. The experimental protocol follows these steps for each database: (1) we randomly take 25 classes (people) from the data set; (2) we learn a classifier using 10 training samples from each of the 25 people; (3) we progressively add new classes without retraining the system. The remaining samples from each class are used for testing the resulting classifier.
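Once the 10 per-repetition accuracies are collected under this protocol, the reported values follow from a standard t-based interval; a runnable sketch (the accuracy values are made-up placeholders, not results from the paper):

```python
import numpy as np
from scipy import stats

acc = np.array([0.98, 0.97, 0.98, 0.99, 0.98,
                0.97, 0.98, 0.98, 0.97, 0.99])   # placeholder run accuracies
mean = acc.mean()
half = stats.t.ppf(0.975, df=len(acc) - 1) * acc.std(ddof=1) / np.sqrt(len(acc))
print(f"{mean:.3f} +/- {half:.3f}")              # mean with 95% half-width
```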
The results on the FRGC database show an accuracy close to 98% for our boosted approach on the initial 25-class problem, while the application of feature extraction methods with the NN classifier obtains an initial 92%. This experiment suggests that, for a perfectly acquired and normalized set, shared boosting is the best option for multi-class face problems. Figure 3 shows the accuracies as a function of the number of classes added. The accuracy over the first 25 steps remains constant, given that the classifier is initially trained on this subset. Notice that from that point on the accuracy decreases, as expected, when new classes are added to the system. This is due to two reasons: first, accuracy usually decreases as the number of classes in a classification problem grows; and second, when new samples are added to the system there is an implicit error, given that the classifier has not been retrained. Nevertheless, the accuracy does not decrease drastically, even when we increase the number of classes by 800%.
On the other hand, we also show the absolute and relative loss of accuracy as new classes are added (see Table 1). For each data set we add classes up to the maximum number available (160 for FRGC and 86 for AR Face) and take the resulting accuracy. The absolute decrease is computed as the accuracy using 25 classes minus the accuracy using the maximum number of classes. The relative decrease is the absolute decrease divided by the initial accuracy on the 25 classes. Using our approach the accuracy decreases less, especially in the case of the AR Face data set, yielding a classification rule that is more robust to occlusions and strong illumination changes.
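In symbols, with K the maximum number of classes (160 for FRGC, 86 for AR Face), the two quantities reported in Table 1 are:

$$\Delta_{\mathrm{abs}} = \mathrm{Acc}_{25} - \mathrm{Acc}_{K}, \qquad \Delta_{\mathrm{rel}} = \frac{\mathrm{Acc}_{25} - \mathrm{Acc}_{K}}{\mathrm{Acc}_{25}}$$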
The main advantage of our class-adding approach is the reduction in computational cost. It has been shown experimentally that joint boosting achieves high accuracies in face classification; nevertheless, its computational cost makes the method infeasible when the problem has too many classes. The clustering step to build binary problems