MULTITASK LEARNING
An Application to Incremental Face Recognition
David Masip
Universitat Oberta de Catalunya, Rambla del Poblenou 156, 08018 Barcelona, Spain
Àgata Lapedriza
Computer Vision Center, Computer Science Depart., Universitat Autònoma de Barcelona
Edifici O Bellaterra, Barcelona 08193, Spain
Jordi Vitrià
Department of Appl. Math. and Analysis, University of Barcelona, Edifici Històric
Gran Via de les Corts 585, Barcelona 08007, Spain
Keywords:
Face Classification, Incremental Learning, Boosting.
Abstract:
Face classification applications usually suffer from two important problems: the number of training samples from each class is small, and the final system must often be extended to incorporate new people to recognize. In this paper we introduce a face recognition method that extends a previously trained boosting-based classifier by adding new classes, avoiding the need to retrain the system each time a new person joins it. The classifier is trained using the multitask learning principle, and multiple verification tasks are trained together sharing the same feature space. New classes are added by taking advantage of the previously learned structure, so that adding them is not computationally demanding. Our experiments with two different data sets show that the performance does not decrease drastically even when the number of classes of the base problem is multiplied by a factor of 8.
1 INTRODUCTION
The face recognition problem can be stated as a machine learning process in which we receive as input a high-dimensional data vector $X \in \mathbb{R}^D$ (considering an $n_1 \times n_2 = D$ face image) and must provide the identity or class membership $C \in \{C_1, \ldots, C_K\}$ of the subject. In real-world applications the number of classes is large, which makes most classic machine learning methods unsuitable for the face recognition task. In addition, the number of available samples from each class is usually limited, making the estimation of the classifier parameters more difficult. Furthermore, in real-world applications the number of subjects to identify varies over time: for example, an automatic face recognition system for presence checking should take into account that new subjects can be added to the system without retraining the whole learning scheme.
Most of the face recognition algorithms found in the literature focus on the problem of classification in high-dimensional subspaces. Usually a feature extraction step is performed in order to reduce the problem complexity, and then a classifier is applied on the reduced space. Many unsupervised feature extraction methods have been applied to face recognition; the seminal paper in this field is the “eigenfaces” approach by Turk and Pentland (Turk and Pentland, 1991), which uses Principal Component Analysis (PCA) to find the optimal subspace under the reconstruction error criterion. On the other hand, supervised feature extraction techniques also take into account the labels of the data. Fisher Linear Discriminant Analysis (LDA) (Fisher, 1936) is the best-known technique, and different extensions of the algorithm have been developed to relax some of the original assumptions; examples are Nonparametric Discriminant Analysis (Fukunaga and Mantock, 1983) or, more recently, boosted feature extraction (Masip et al., 2005). The main drawback of these techniques is that we usually have few training samples from each class. Moreover,
the system is difficult to scale when new classes join.
In this paper we introduce a face recognition scheme that deals with the above-mentioned difficulties: robustness against the small training set problem, and scalability to add new classes without a costly additional training step. For this purpose we consider the multitask learning (MTL) paradigm for the face recognition problem. The term MTL was first introduced by Caruana (Caruana, 1997), showing that simultaneously learning related tasks in an environment can achieve important improvements at two different levels: the number $N$ of training samples needed to learn each classification task decreases as more related tasks are learned (parallel knowledge transfer), and it has been proved that, under some theoretical conditions, a classifier trained on sufficiently related tasks is likely to find good solutions to novel related tasks (sequential knowledge transfer). Baxter (Baxter, 2000) proved that the number of training samples required to train each task decreases as the number of tasks increases, on the order of $O(\frac{1}{N}\log(O(K)))$.
Torralba et al. (Torralba et al., 2004) extended the MTL principle to the field of ensemble classifiers, introducing a new algorithm, called Joint Boosting, that combines an ensemble of simple classifiers (using the GentleBoost algorithm (Friedman et al., 2000)) with a feature sharing strategy that implements a structural approach to multitask learning.
In this work we use the Joint Boosting algorithm
to build a robust classifier for face recognition using a
fixed number of classes K. Then we extend the algo-
rithm in order to incorporate new, unseen classes. The main goal of our proposal is to incorporate new persons to recognize while avoiding the computationally expensive retraining of the whole system. Our experimental results show that the performance on the extended class problem does not degrade drastically.
In the next section we describe our proposed method, which is based on the GentleBoost algorithm (Friedman et al., 2000). We focus our work on extending this approach to the addition of new tasks without retraining the whole system, at minimum computational cost. Section 3 describes the experiments performed using two standard face databases, focusing on evaluating the scalability with respect to the addition of new subjects to the system. Finally, Section 4 concludes this work.
2 GENTLEBOOST APPLIED TO
FACE RECOGNITION
The original AdaBoost algorithm (Freund and Schapire, 1996) iteratively builds a binary classifier such that the final classification rule is a linear combination of weak classifiers. At each boosting step a new classifier is generated, and the training samples are reweighted according to the classification results. The weights are used to generate the classifier of the next step. In the literature there are multiple implementations of the boosting scheme. Our approach is based on the “GentleBoost” algorithm proposed by Friedman et al. (Friedman et al., 2000), which has been shown to be more robust and appropriate for face classification tasks (Yokono and Poggio, 2006).
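For readers unfamiliar with GentleBoost, the following minimal sketch (our own illustration, not code from any of the cited papers) shows the binary GentleBoost loop with regression stumps fitted by weighted least squares; all names (fit_stump, n_rounds, etc.) are ours.

import numpy as np

def fit_stump(X, z, w):
    """Fit a regression stump h(x) = a * (x[:, j] > theta) + b by weighted
    least squares, searching every feature j and every observed threshold."""
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            mask = X[:, j] > theta
            wp, wn = w[mask].sum(), w[~mask].sum()
            if wp == 0 or wn == 0:
                continue
            b = np.sum(w[~mask] * z[~mask]) / wn       # output when x_j <= theta
            a = np.sum(w[mask] * z[mask]) / wp - b     # increment when x_j > theta
            err = np.sum(w * (z - (a * mask + b)) ** 2)
            if best is None or err < best[0]:
                best = (err, j, theta, a, b)
    return best[1:]

def gentleboost(X, y, n_rounds=50):
    """y in {-1, +1}. Returns the list of stumps forming the additive model."""
    w = np.ones(len(y)) / len(y)
    stumps = []
    for _ in range(n_rounds):
        j, theta, a, b = fit_stump(X, y, w)
        h = a * (X[:, j] > theta) + b
        w *= np.exp(-y * h)                            # GentleBoost weight update
        w /= w.sum()
        stumps.append((j, theta, a, b))
    return stumps

def predict(stumps, X):
    H = np.zeros(len(X))
    for j, theta, a, b in stumps:
        H += a * (X[:, j] > theta) + b
    return np.sign(H)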
To extend the binary AdaBoost classifier to the multiclass case, two different approaches can be found in the recent literature: extending the classification rule to the multiclass case (Zhu et al., 2006), or combining different binary classifiers using error-correcting output codes (Schapire, 1997). Torralba et al. (Torralba et al., 2004) introduced the knowledge transfer concept into GentleBoost. The main idea is to view the multiclass classification problem as multiple binary classification tasks. They experimentally show that the resulting multiclass classifier needs fewer training examples and also fewer distinct features.
In this paper we use the shared feature boosting
approach to build a global scheme where new classes
can be added to the system.
2.1 Training the Original Joint Boosting
Algorithm
The algorithm takes as input the $N$ training samples $X = \{X_i = (x_i^1, \ldots, x_i^d)\}$ and the corresponding labels $C \in \{C_1, \ldots, C_K\}$. A predefined number $M$ of boosting rounds is performed¹. At each boosting step,
the multiclass classification problem is converted to a
binary problem by grouping the classes in a positive
and a negative cluster. A decision stumps classifier is
trained on the new binary problem. The parameters
of the weak learner are computed as:
$$\rho = \frac{\sum_{C \in \mathrm{Positive}(n)} \sum_i W_i^C\, b_i^C\, \delta(x_i^j \le \theta)}{\sum_{C \in \mathrm{Positive}(n)} \sum_i W_i^C\, \delta(x_i^j \le \theta)}, \quad (3)$$

$$\alpha + \rho = \frac{\sum_{C \in \mathrm{Positive}(n)} \sum_i W_i^C\, b_i^C\, \delta(x_i^j > \theta)}{\sum_{C \in \mathrm{Positive}(n)} \sum_i W_i^C\, \delta(x_i^j > \theta)}, \quad (4)$$
¹ We assume that the reader is familiar with the Joint Boosting algorithm; for a detailed description see (Torralba et al., 2004).
1. Given the matrix $X_{1,\ldots,N+Q}$ containing data samples $x_i$, and the vector $c$ with the corresponding labels $c_i \in \{C_1, \ldots, C_K, C_{K+1}\}$ ($i = 1, \ldots, N+Q$).
2. Initialize a set of weights: $W_i^c(1) = 1$, $i = 1, \ldots, N+Q$.
3. For $t = 1, \ldots, M$:
(a) Assign the new samples to the Positive cluster, according to the optimal class grouping selected at step $t$ of the previously trained joint boosting algorithm.
(b) Classify the training data $X$ using the decision stumps generated at step $t$ of the previous joint boosting algorithm:
$$h_t^n(x_i, c) = \begin{cases} \alpha\,\delta(x_i^j > \theta) + \rho, & \text{when } C_i \in \mathrm{Positive}(n) \\ k^c, & \text{when } C_i \notin \mathrm{Positive}(n) \end{cases} \quad (1)$$
(c) Compute the weighted error for the class grouping as:
$$\mathrm{Err}_p = \sum_{c=1}^{K+1} \sum_{i=1}^{N+Q} W_i^c \left( b_i^c - h_t^m(x_i, c) \right)^2, \quad (2)$$
where $b_i^c \in \{-1, +1\}$ is the label assigned to $C_i$ in the $m$ optimal binary grouping.
(d) Assign the new samples to the Negative cluster, according to the optimal class grouping selected at step $t$ of the previously trained joint boosting algorithm, and compute the error $\mathrm{Err}_n$ as in (2).
4. Assign the new class to the clustering with minimum error: $m = \min(\mathrm{Err}_p, \mathrm{Err}_n)$.
5. Update the data weights: $W_i^c(t+1) = W_i^c(t)\exp\!\left(-b_i^c\, h_t^m(x_i, c)\right)$, $i = 1, \ldots, N$.
6. Update the estimation for each class: $H(x_i, c) = H_i^c(t) + h_t^m(x_i, c)$.
7. Output the estimation of each sample for each possible class.
Figure 1: Algorithm to add new samples to a previously trained joint boosting algorithm.
$$k^C = \frac{\sum_i W_i^C\, b_i^C}{\sum_i W_i^C}, \quad \text{if } C \notin \mathrm{Positive}(n), \quad (5)$$
where $k^C$ acts as a constant that prevents the effects of unbalanced training sets on the class selection, and $\{W_i^C\}$ is the set of weights.
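As an illustration of equations (3)-(5), the following sketch (ours, not code from the paper; the array shapes and the helper name shared_stump_params are assumptions) computes $\rho$, $\alpha$ and the constants $k^c$ for a candidate feature $j$, threshold $\theta$ and Positive cluster, given per-class weights and binary targets $b_i^c \in \{-1,+1\}$.

import numpy as np

def shared_stump_params(X, W, B, positive, j, theta):
    """Compute the shared regression stump parameters of equations (3)-(5).
    X: (N, D) samples; W, B: (N, K) per-class weights and +/-1 targets;
    positive: list of class indices forming the Positive cluster."""
    gt = X[:, j] > theta                 # delta(x_i^j > theta)
    le = ~gt                             # delta(x_i^j <= theta)
    Wp, Bp = W[:, positive], B[:, positive]
    rho = np.sum(Wp * Bp * le[:, None]) / np.sum(Wp * le[:, None])           # eq. (3)
    alpha = np.sum(Wp * Bp * gt[:, None]) / np.sum(Wp * gt[:, None]) - rho   # eq. (4)
    k = {c: np.sum(W[:, c] * B[:, c]) / np.sum(W[:, c])
         for c in range(W.shape[1]) if c not in positive}                    # eq. (5)
    return rho, alpha, k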
In a first attempt, all the possible groupings could be evaluated, at a cost of $O(2^K)$. Nevertheless, when the number of classes is large this approach is not feasible. Torralba et al. (Torralba et al., 2004) followed a best-first search approximation ($O(K^2)$), where the grouping is performed as follows:
1. Train a weak learner using a single class as positive. For each feature a decision stumps classifier is trained, keeping the one that minimizes the error criterion.
2. Select the class with minimum weighted classification error as the initial Positive cluster.
3. For each of the remaining $K-1$ classes:
- Train a classifier using the previous Positive cluster but adding another class from the Negative cluster.
- Add to the previous Positive cluster the class from the Negative cluster only if the joint selection improves the previous classification error.
The class grouping with minimum error is selected, and at each boosting step the set of weights $W_i^c$ is adjusted according to the partial classification results.
Note that the optimal grouping is different at each
step, given that the error criterion is computed tak-
ing into account the weights that focus the cluster se-
lection on the most difficult samples. The grouping
step allows the transfer of knowledge among several
recognition tasks.
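A sketch of this greedy grouping follows (our own illustration; the callback stump_error(positive), which trains the best shared stump for a candidate Positive cluster and returns its weighted error, is hypothetical and stands in for the stump fitting described above).

def best_first_grouping(classes, stump_error):
    """Greedy O(K^2) approximation of the optimal binary class grouping.
    stump_error(positive) -> weighted error of the best shared stump
    when `positive` is used as the Positive cluster."""
    # Start from the single class whose one-vs-rest stump has minimum error.
    best_class = min(classes, key=lambda c: stump_error([c]))
    positive = [best_class]
    best_err = stump_error(positive)
    # Greedily try to move each remaining class into the Positive cluster.
    for c in classes:
        if c in positive:
            continue
        err = stump_error(positive + [c])
        if err < best_err:            # keep the class only if the error improves
            positive.append(c)
            best_err = err
    return positive, best_err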
2.2 Adding New Classes to the System
Once the joint boosting algorithm is trained, it can be used to classify a K-class problem. In a face recognition environment this means that it can only be used to recognize K people. When subject K+1 is admitted into the system, the whole learning process must be rerun. We propose to take advantage of the class grouping performed in the feature sharing step in order to incorporate new classes online, avoiding the expensive relearning step.
The training algorithm can be divided into two steps: in the first one, the joint boosting algorithm is run. The second step takes as input the $Q$ samples of the new training class, $X_{N+1,\ldots,N+Q}$, and the corresponding
label $C_{K+1}$, and runs $M$ rounds of the algorithm shown in Figure 1.
The idea is to add the new class to the system taking advantage of the shared feature space previously defined in the classification of the known tasks in the first step. Provided an optimal binary grouping at each step, trained on a sufficiently large number of samples and classes, we assign the new class samples to the positive or negative cluster that minimizes the error criterion. The algorithm is iterated the same fixed number of times $M$. Notice that the method allows the inclusion of many new tasks, given that the same process can be repeated iteratively, adding a new class each time. This approach is computationally fast, since it avoids the most computationally expensive step of finding the optimal binary subgroup.
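A minimal sketch of this class-addition procedure follows (our own rendering of Figure 1, not the authors' code). The model structure, which stores for every round the frozen stump parameters (j, theta, alpha, rho), the per-class constants k and the Positive cluster chosen during the original training, is an assumption about how the trained classifier is kept; in particular, the entry of k for the new class would have to be filled via equation (5).

import numpy as np

def add_new_class(model, X, B, W):
    """Extend a trained joint boosting model with a new class (index K),
    without retraining, following Figure 1.
    X: (N+Q, D) samples; B, W: (N+Q, K+1) +/-1 targets and weights."""
    K = B.shape[1] - 1                                  # index of the new class
    for rnd in model:
        gt = X[:, rnd['j']] > rnd['theta']              # delta(x_i^j > theta)
        def grouping_error(positive):
            # steps 3b-3c: classify with the frozen stump and compute eq. (2)
            H = np.empty_like(B, dtype=float)
            for c in range(K + 1):
                H[:, c] = rnd['alpha'] * gt + rnd['rho'] if c in positive else rnd['k'][c]
            return np.sum(W * (B - H) ** 2), H
        err_p, H_p = grouping_error(set(rnd['positive']) | {K})   # step 3a
        err_n, H_n = grouping_error(set(rnd['positive']))         # step 3d
        if err_p < err_n:                               # step 4: keep the better cluster
            rnd['positive'] = list(rnd['positive']) + [K]
            H = H_p
        else:
            H = H_n
        W *= np.exp(-B * H)                             # step 5: reweight the samples
    return model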
3 EXPERIMENTS
The experiments have been performed using two dif-
ferent face databases: the Face Recognition Grand
Challenge (Phillips et al., 2005), and the AR Face
database (Martinez and Benavente, 1998). The idea
of the experimental section is to show the evolution
of the performance of our proposal as new classes are
added to the system. We compare our proposal with a variation of the classic eigenface approach (Turk and Pentland, 1991) followed by an NN classification rule. We use PCA to extract 500 features from each data set, and then a discriminant analysis step is performed to obtain the 200 final features for each example. The NDA algorithm has been used for this purpose, since it has been shown to improve the performance of other classic discriminant analysis techniques (Bressan and Vitria, 2003) under the NN rule. The new classes are added by projecting the training vectors onto the reduced space and using these projected features as a model for the new classification task.
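A minimal sketch of this baseline (using scikit-learn and toy random data; the paper's NDA step is omitted here for brevity, and the PCA dimensionality is reduced to fit the toy sample size, so this is only an approximation of the actual pipeline):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for the 1221-dimensional face vectors (10 samples x 25 people).
X_train = rng.normal(size=(250, 1221))
y_train = np.repeat(np.arange(25), 10)

# Unsupervised reduction; the paper uses 500 PCA features followed by an
# NDA projection to 200 features, which we omit in this sketch.
pca = PCA(n_components=100).fit(X_train)
knn = KNeighborsClassifier(n_neighbors=1).fit(pca.transform(X_train), y_train)

# A new person is added by projecting its samples into the *fixed* subspace
# and refitting only the NN model; the subspace itself is never retrained.
X_new, y_new = rng.normal(size=(10, 1221)), np.full(10, 25)
knn = KNeighborsClassifier(n_neighbors=1).fit(
    np.vstack([pca.transform(X_train), pca.transform(X_new)]),
    np.concatenate([y_train, y_new]))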
Images from both data sets have been previously
converted from the original RGB space to gray scale.
Then we perform a geometric normalization using the
center coordinates of each eye. Images have been ro-
tated and scaled according to the inter-eye distance,
in such a way that the center pixel of each eye coin-
cides in all of them. The samples were then cropped to obtain a 37 × 33 thumbnail, so that only the internal region of the faces is preserved. The
final sample from each image is encoded as a 1221
feature vector. In Figure 2 some examples from both
databases are shown. From each data set, we have
used only images from subjects that contain at least
20 samples (10 for training and the rest for testing).
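This normalization can be sketched as follows (our own illustration using OpenCV; the destination eye coordinates inside the 37 × 33 crop are assumed values, not taken from the paper):

import cv2
import numpy as np

def normalize_face(img_bgr, left_eye, right_eye,
                   dst_left=(8.0, 12.0), dst_right=(25.0, 12.0), size=(33, 37)):
    """Rotate and scale a face so that the eye centers land on fixed positions,
    then crop the internal region to a 37x33 thumbnail (1221 pixels).
    The destination eye coordinates are assumed values."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    src_v = np.subtract(right_eye, left_eye)          # vector between the eyes
    dst_v = np.subtract(dst_right, dst_left)
    scale = np.linalg.norm(dst_v) / np.linalg.norm(src_v)
    angle = np.arctan2(src_v[1], src_v[0]) - np.arctan2(dst_v[1], dst_v[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    # 2x3 similarity transform mapping left_eye -> dst_left and right_eye -> dst_right.
    A = np.array([[c, s], [-s, c]], dtype=np.float32)
    t = np.array(dst_left, dtype=np.float32) - A @ np.array(left_eye, dtype=np.float32)
    M = np.hstack([A, t.reshape(2, 1)])
    thumb = cv2.warpAffine(gray, M, size)             # size = (width=33, height=37)
    return thumb.reshape(-1)                          # 37*33 = 1221-feature vector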
3.1 Results
The experiments have been repeated 10 times; the results shown in Table 1 are the mean accuracies of each method. The 95% confidence intervals are also shown next to each value. The experimental protocol follows
these steps for each database: (1) We randomly take
25 classes (people) from the data set. (2) We learn a
classifier using 10 training samples from the 25 peo-
ple. (3) We progressively add a new class without
retraining the system. The remaining samples from
each class are used for testing the resulting classifier.
The results with the FRGC database show an ac-
curacy close to 98% using our boosted approach for
the initial problem with 25 classes, while the applica-
tion of feature extraction methods with the NN classi-
fier obtains an initial 92%. This experiment suggests
that for a perfectly acquired and normalized set, the
use of shared boosting is the best option for multi-
class face problems. Figure 3 shows the accuracies as
a function of the number of classes added. The ac-
curacy on the first 25 steps remains constant given
that the classifier is initially trained on this subset.
Notice that from that point the accuracy decreases,
as expected, when new classes are added to the sys-
tem. This fact is due to two reasons: first, the accuracy of a classification problem usually decreases as the number of classes grows; and second, when new samples are added to the system there is an implicit error, given that the classifier has not been retrained. Nevertheless, the accuracy does not decrease drastically, even when we increase the number of classes by 800%.
On the other hand, we also show the absolute and
relative loss of accuracy as new classes are added (see
Table 1). For each data set we add up to the max-
imum number of classes (160 and 86 for the FRGC
and AR Face respectively) and take the resulting ac-
curacy. The absolute decrease is computed as the ac-
curacy using 25 classes minus the accuracy using the
maximum number of classes. The relative decrease
is computed as the absolute decrease divided by the
initial accuracy considering the 25 classes. Using our approach the accuracy decreases less, especially in the case of the AR Face data set, yielding a classification rule that is more robust in the presence of occlusions and strong changes in illumination.
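As a worked example of this computation, taking the FRGC row of Table 1 for our approach and assuming the reported accuracy corresponds to the largest extended problem:
$$\mathrm{Acc}_{25} \approx 0.9214 + 0.0567 = 0.9781, \qquad \text{relative decrease} = \frac{0.0567}{0.9781} \approx 5.8\%.$$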
The main advantage of our class-adding approach is the reduction in computational requirements. It has been shown experimentally that the use of joint boosting achieves high accuracies in face classification. Nevertheless, the computational cost makes the method unfeasible when the problem has too many classes.
(a) Some faces from the FRGC. (b) Some faces from the AR Face.
Figure 2: Examples of faces from the FRGC and AR Face databases.
Table 1: Mean accuracy of the discriminant method and of our proposal on the two face databases. Only 25 classes are used for training; a total of 135 extra classes have been added in the FRGC case, and 59 in the AR Face case. We show the decrease (absolute and relative percentage) in the mean accuracy between the first experiment with only 25 classes and the largest extended problem.

Data set   Acc. FE           Decrease   Relative   Acc. Adding-Shared   Decrease   Relative
FRGC       0.8554 ± 0.0193   0.0548     6.0%       0.9214 ± 0.0024      0.0567     5.8%
AR Face    0.6013 ± 0.0020   0.2105     25.9%      0.7515 ± 0.0034      0.1062     12.4%
The clustering step to build the binary problems at each boosting round is $O(K^2)$ using the best-first search approach. This computational complexity is avoided when we use our proposal of adding new classes to the system.
system. Typically, training the shared boosting algo-
rithm using an initial set of 25 classes takes 8 hours on
a Pentium IV computer (using the Matlab software).
Training the same algorithm using 80 classes can take
weeks, while extending the previous 25-class problem to the new 80-class problem using our approach takes a few minutes.
4 CONCLUSIONS
We propose a method to add new classes online to the joint boosting classifier in order to solve real-world face recognition problems. We incrementally add a new class to the system by extending the classifier to take into account the new binary classification task. The multiclass problem is seen as a set of multiple binary classification tasks that are trained sharing the feature space. The resulting classifier can be extended by adding a new class, avoiding the computationally expensive retraining process, which could be unfeasible in a large-class problem. Nevertheless, it can be trained using fewer classes in the first instance, and then extended using our approach to solve the real, larger problem.
We have experimentally validated our proposal using two different face databases: the FRGC database, acquired in a controlled environment, and the AR Face database, which contains important artifacts due to strong changes in illumination and occlusions. The results show that the classification accuracy decreases less drastically than with the classic NN rule used in online learning methods when the original sets are extended to large-class problems (up to 8 times the original class set size).
As future work, we plan to analyze the importance of the classes chosen for the originally trained algorithm: a diverse choice of classes should provide a more general basis for extending the classifier. The use of a training set plus an extra validation set could slightly improve the accuracies. The initial number of classes used to train the system also influences the performance of the extended classifier.
ACKNOWLEDGEMENTS
This work is supported by the TIN2006-15308-C02-01 project, Ministerio de Educación y Ciencia, Spain.
REFERENCES
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198.
Bressan, M. and Vitria, J. (2003). Nonparametric discrimi-
nant analysis and nearest neighbor classification. Pat-
tern Recognition Letters, 24(15):2743–2749.
Caruana, R. (1997). Multitask learning. Machine Learning,
28(1):41–75.
Fisher, R. (1936). The use of multiple measurements in
taxonomic problems. Ann. Eugenics, 7:179–188.
Freund, Y. and Schapire, R. E. (1996). Experiments with a
new boosting algorithm. In International Conference
on Machine Learning, pages 148–156.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337–374.
(a) Accuracy using the FRGC data set. (b) Accuracy using the AR Face data set.
Figure 3: Accuracy as a function of the number of classes. Each plot compares the traditional feature extraction approach ("Traditional FE") with the shared boosting approach with added classes ("Shared Boosted Adding classes").
Fukunaga, K. and Mantock, J. (1983). Nonparametric dis-
criminant analysis. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 5(6):671–678.
Zhu, J., Rosset, S., Zou, H., and Hastie, T. (2006). Multi-class AdaBoost. Technical report, Stanford University.
Martinez, A. and Benavente, R. (1998). The AR Face
database. Technical Report 24, Computer Vision Cen-
ter.
Masip, D., Kuncheva, L. I., and Vitria, J. (2005). An
ensemble-based method for linear feature extraction
for two-class problems. Pattern Analysis and Appli-
cations, 8:227–237.
Phillips, P. J., Flynn, P. J., Scruggs, T., Bowyer, K. W.,
Chang, J., Hoffman, K., Marques, J., Min, J., and
Worek, W. (2005). The 2005 IEEE workshop on face
recognition grand challenge experiments. In CVPR
’05: Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recogni-
tion (CVPR’05) - Workshops, page .45, Washington,
DC, USA. IEEE Computer Society.
Schapire, R. E. (1997). Using output codes to boost multi-
class learning problems. In Proc. 14th International
Conference on Machine Learning, pages 313–321.
Morgan Kaufmann.
Torralba, A., Murphy, K., and Freeman, W. (2004). Sharing
features: efficient boosting procedures for multiclass
object detection. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition.
Turk, M. and Pentland, A. (1991). Eigenfaces for recogni-
tion. Journal of Cognitive Neuroscience, 3(1):71–86.
Yokono, J. J. and Poggio, T. (2006). A multiview face iden-
tification model with no geometric constraints. Tech-
nical report, Sony Intelligence Dynamics Laborato-
ries.