FACE VERIFICATION BY SHARING KNOWLEDGE FROM DIFFERENT SUBJECTS

David Masip

Agata Lapedriza

and Jordi Vitrià

Department of Applied Mathematics and Analysis (MAiA), University of Barcelona (UB)

Edifici Històric, Gran Via de les Corts Catalanes 585, Barcelona, Spain

Computer Vision Center, Department of Computer Science

Universitat Autònoma de Barcelona, Edifici O, Bellaterra 08193, Spain

Keywords:

Face verification, Computer Vision, Logistic Regression Model, Multi-task Learning.

Abstract:

In face verification problems the number of training samples available for each class is usually small, which makes the classifier parameters difficult to estimate. In this paper we propose a new method for face verification in which different face verification tasks are trained simultaneously, sharing the model parameter space. We use a multi-task extension of the logistic regression classifier to perform the classification. Our approach allows information to be shared among the different classification tasks (knowledge transfer), mitigating the effects of the small sample size problem. Our experiments, performed on the publicly available AR Face Database, show lower error rates when multiple tasks are jointly trained sharing information, which is consistent with the theoretical results in the related literature.

1 INTRODUCTION

Face verification can be defined as a binary classification problem in which we receive as input a high-dimensional data vector $x \in \mathbb{R}^D$ and a claimed identity for the individual. The goal is then to verify the correctness of the claimed identity using a model of the person to be verified.

Several factors make face verification a problem that remains unsolved. On the one hand, the dimensionality of face data is large, which makes the estimation of the classifier parameters more difficult. On the other hand, samples from a single person are frequently scarce, which produces classifiers that are not robust enough.

Most face verification algorithms found in the literature focus on the problem of classification in high-dimensional spaces. The procedure in those cases is usually divided into two steps: first, a feature extraction step is performed to reduce the data complexity; second, a classifier is trained in the new reduced space. Thus, the problem of classifying high-dimensional data vectors has been addressed with two complementary methodologies: feature extraction techniques and classification algorithms.

One of the most widespread unsupervised feature extraction techniques is PCA, where the goal is to find the linear projection that minimizes the mean squared error criterion, obtaining a decorrelated representation of the data. More recently, more sophisticated techniques have appeared that impose extra restrictions on the extracted feature space, such as sparsity (Lee and Seung, 2000) or independence (Hyvarinen, 1999), and these have been successfully applied to face classification in the presence of occlusions and strong changes in illumination. On the other hand, supervised feature extraction techniques use the data labels in the dimensionality reduction process, finding the subspace that maximizes some separability criterion on the training data. Linear Discriminant Analysis (LDA) (Fisher, 1936) is the most popular supervised linear technique, and it seems to outperform the PCA "eigenfaces" approach (Moghaddam et al., 1998; Belhumeur et al., 1997). Nevertheless, LDA uses the class scatter matrices of the training data as a separability measure, so it is blind beyond second-order statistics. Methods such as Nonparametric Discriminant Analysis (NDA) (Fukunaga and Mantock, 1983) or Boosted Discriminant Projections (Masip and Vitrià, 2006; Masip et al., 2005) have been shown to outperform the classic approach.

Masip D., Lapedriza À. and Vitrià J. (2007). FACE VERIFICATION BY SHARING KNOWLEDGE FROM DIFFERENT SUBJECTS. In Proceedings of the Second International Conference on Computer Vision Theory and Applications - IU/MTSV, pages 286-289. SciTePress.

In this paper we focus our attention on the shortage of training samples in the face verification field. We propose a face verification scheme in which multiple face verification tasks are learned simultaneously.

The idea of sharing knowledge by training related classification tasks was proposed by Caruana (Caruana, 1997). He introduced the term multi-task learning to describe a technique that learns a neural network classifier from a set of related tasks, thereby improving the generalization error. This behavior is justified by the fact that the bias learned in an environment of multiple related tasks is likely to be less specific than in a single-task problem. It has been shown that, under a multiple related tasks learning scheme, the number of samples needed decreases with the number of tasks, while better generalization results are also achieved (Thrun and Pratt, 1997).

The multi-task learning paradigm has recently been applied to different classifiers, such as SVM (Evgeniou et al., 2005), AdaBoost (Torralba et al., 2004), or probabilistic frameworks (Ando and Zhang, 2005). Nevertheless, to our knowledge it has not yet been applied to face classification tasks.

There is no formal definition of "related tasks" in the literature. Here we consider each verification problem as a task and assume that these tasks are related to one another.

The paper is organized as follows: the next section presents our method, which is based on applying the knowledge sharing paradigm to multiple related logistic regression classifiers; section 3 describes the experiments performed on a publicly available data set; and section 4 concludes this work.

2 SHARED LOGISTIC REGRESSION MODEL FOR CLASSIFICATION

A binary classification task is the problem of assigning to a sample $x \in \mathbb{R}^D$ its corresponding label $L \in \{-1, 1\}$. A common procedure to solve such a task is to assume that the data follow a specific statistical model and to estimate the parameters of the model from a set of known samples, called the training data set.

The classic logistic regression model is a statistical approach for solving a binary classification task. It assigns to a sample $x \in \mathbb{R}^D$ a label $L \in \{-1, 1\}$ as

$$ L = \arg\max_{c \in \{-1, 1\}} P(L = c \mid x) \tag{1} $$

where

$$ P(L = 1 \mid x) = \frac{1}{1 + e^{-\beta x^{T}}} \tag{2} $$

and $\beta \in \mathbb{R}^D$ is the parameter vector, which is usually estimated by the maximum likelihood criterion. Notice that $P(L = -1 \mid x, \beta) = 1 - P(L = 1 \mid x, \beta)$, given that $P$ is a probability distribution.
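The decision rule of the logistic regression model can be illustrated with a few lines of Python. This is only a minimal sketch: the parameter vector and the sample are hypothetical values, and ties at probability 0.5 are assigned to the positive label by assumption.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def prob_positive(beta, x):
    """P(L = 1 | x): logistic function of the inner product of beta and x."""
    score = sum(b * v for b, v in zip(beta, x))
    return sigmoid(score)


def predict_label(beta, x):
    """Return the label in {-1, 1} with the larger posterior probability."""
    # Assumption: a tie (probability exactly 0.5) is resolved as +1.
    return 1 if prob_positive(beta, x) >= 0.5 else -1


beta = [0.8, -0.4, 0.1]  # hypothetical parameter vector
x = [1.0, 0.5, 2.0]      # hypothetical feature vector
print(prob_positive(beta, x), predict_label(beta, x))
```

Since $P(L = -1 \mid x) = 1 - P(L = 1 \mid x)$, comparing the positive posterior with 0.5 is equivalent to taking the arg max over both labels.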

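Our proposal is to extend this logistic regression approach to a multi-task learning framework, where we have multiple related binary tasks $T_1, \dots, T_M$. Let $Z_i = \{(x^{(i)}_1, L^{(i)}_1), \dots, (x^{(i)}_{n_i}, L^{(i)}_{n_i})\}$, $i = 1, \dots, M$, be the training data for each one of the $M$ tasks, and $Z = \{Z_i\}_{i=1,\dots,M}$ the entire set of training samples. Suppose that we model each task using the logistic regression explained above, which yields a parameter matrix $B$

$$ B = \begin{pmatrix} \beta^{(1)}_1 & \dots & \beta^{(1)}_D \\ \vdots & & \vdots \\ \beta^{(M)}_1 & \dots & \beta^{(M)}_D \end{pmatrix} $$

where each $\beta^{(i)} = (\beta^{(i)}_1, \dots, \beta^{(i)}_D)$ is the parameter vector for the $i$-th task. Thus, to assign the label $L$ to an input $x \in \mathbb{R}^D$ according to the $i$-th task, we should follow the criterion

$$ L = \arg\max_{c \in \{-1, 1\}} P^{(i)}(L = c \mid x), \qquad P^{(i)}(L = 1 \mid x) = \frac{1}{1 + e^{-\beta^{(i)} x^{T}}} \tag{3} $$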
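As an illustration of the per-task criterion, the following sketch stores the parameter matrix as a list of rows, one row per task, and verifies the same input against every task. The matrix values, the input and the function names are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def verify(B, task, x):
    """Label x in {-1, 1} using the parameter row of the given task."""
    score = sum(b * v for b, v in zip(B[task], x))
    return 1 if sigmoid(score) >= 0.5 else -1


# Hypothetical parameter matrix with M = 3 tasks and D = 2 features.
B = [[1.0, -1.0],
     [-0.5, 2.0],
     [0.2, 0.2]]
x = [1.0, 0.2]
labels = [verify(B, i, x) for i in range(len(B))]
print(labels)
```

Each row of the matrix acts as an independent verifier at prediction time; the tasks only interact during training, through the regularization described next.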

In that situation, the parameters are normally estimated by minimizing the negated log-likelihood $N(Z, B)$ plus a regularization term, usually $\|B\|^2$, which is added to avoid an overly complex probability distribution over the parameter set.
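A minimal sketch of this regularized criterion is given below. It relies on the standard identity that, for labels in $\{-1, 1\}$, the negated log-likelihood of one sample is $\log(1 + e^{-L \beta x^{T}})$. The weight `lam` on the squared norm and all function names are assumptions made for the example.

```python
import math


def log1p_exp(t):
    """Stable evaluation of log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))


def neg_log_likelihood(Z, B):
    """N(Z, B): sum over tasks and samples of log(1 + exp(-label * score))."""
    total = 0.0
    for task_samples, beta in zip(Z, B):
        for x, label in task_samples:
            score = sum(b * v for b, v in zip(beta, x))
            total += log1p_exp(-label * score)
    return total


def ridge_objective(Z, B, lam=1.0):
    """N(Z, B) plus lam times the squared norm of B (lam is hypothetical)."""
    sq_norm = sum(b * b for beta in B for b in beta)
    return neg_log_likelihood(Z, B) + lam * sq_norm


# Hypothetical training data: Z[i] lists the (sample, label) pairs of task i.
Z = [[([1.0, 0.0], 1), ([0.0, 1.0], -1)],
     [([1.0, 1.0], 1)]]
B = [[1.0, 0.0], [0.0, 0.0]]
print(ridge_objective(Z, B, lam=0.5))
```

With this penalty every task is regularized towards zero independently, so no information flows between tasks.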

Nevertheless, our goal is to force the different tasks to share some information, given that we are assuming that they are related. To this end, we hierarchically impose a prior distribution on each row of the matrix $B$.

Let us define the mean vector $\bar{\beta} = (\bar{\beta}_1, \dots, \bar{\beta}_D)$ as

$$ \bar{\beta}_j = \frac{1}{M} \sum_{i=1}^{M} \beta^{(i)}_j \tag{4} $$

and impose a zero-centered Gaussian prior on the mean vector $\bar{\beta}$. We want each row of the matrix $B$ to be Gaussian distributed with mean $\bar{\beta}$. The resulting optimization function is then $G(B) = N(Z, B) + R(B)$, where

$$ R(B) = \frac{1}{\sigma_1^2} \|\bar{\beta}\|^2 + \frac{1}{\sigma_2^2} \sum_{i=1}^{M} \|\beta^{(i)} - \bar{\beta}\|^2 \tag{5} $$

and $(\sigma_1^2, \sigma_2^2)$ are the corresponding variances of the imposed priors.
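The regularizer can be sketched in Python as follows, assuming it is the sum of a squared-norm penalty on the mean vector (weighted by the inverse of the first variance) and the squared distances of every task vector to that mean (weighted by the inverse of the second variance). The function names and the numeric values are hypothetical.

```python
def arithmetic_mean(B):
    """Column-wise mean of the rows of B (one row per task)."""
    M = len(B)
    return [sum(row[j] for row in B) / M for j in range(len(B[0]))]


def shared_regularizer(B, beta_bar, var1, var2):
    """R(B) = ||beta_bar||^2 / var1 + sum_i ||beta_i - beta_bar||^2 / var2."""
    mean_term = sum(m * m for m in beta_bar) / var1
    spread_term = sum((b - m) ** 2
                      for row in B
                      for b, m in zip(row, beta_bar)) / var2
    return mean_term + spread_term


B = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical parameters, M = 2 and D = 2
beta_bar = arithmetic_mean(B)
print(beta_bar, shared_regularizer(B, beta_bar, var1=2.0, var2=0.5))
```

A small second variance pulls the task vectors towards their common mean, while a small first variance shrinks that mean towards zero.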

In this work we optimize the criterion $G(B)$ using the gradient descent algorithm, given that this loss function is differentiable and we can compute all the partial derivatives

$$ \frac{\partial G(B)}{\partial \beta^{(s)}_k} = \frac{\partial N(Z, B)}{\partial \beta^{(s)}_k} + \frac{\partial R(B)}{\partial \beta^{(s)}_k} \tag{6} $$
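A generic gradient descent loop over a parameter matrix can be sketched as below. The step size, the number of iterations and the toy quadratic objective are assumptions chosen for illustration; the paper does not specify these settings.

```python
def gradient_descent(grad_fn, B0, step=0.1, iters=200):
    """Repeatedly move B against the gradient returned by grad_fn."""
    B = [row[:] for row in B0]  # work on a copy of the initial matrix
    for _ in range(iters):
        G = grad_fn(B)
        B = [[b - step * g for b, g in zip(row, grad_row)]
             for row, grad_row in zip(B, G)]
    return B


def toy_grad(B):
    """Gradient of the toy objective sum((b - 3)^2) over all entries of B."""
    return [[2.0 * (b - 3.0) for b in row] for row in B]


B_opt = gradient_descent(toy_grad, [[0.0, 10.0]])
print(B_opt)
```

In the shared model, `grad_fn` would return the sum of the gradient of the negated log-likelihood and the gradient of the regularizer for every entry of the matrix.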

The first term $N(Z, B)$ depends only on the parameter matrix $B$, and its derivative can be obtained directly by differentiating the negated log-likelihood of the task set $Z$.
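For labels in $\{-1, 1\}$, differentiating $\log(1 + e^{-L \beta x^{T}})$ with respect to one parameter gives $-L\,x_k$ times the logistic function evaluated at $-L \beta x^{T}$. The sketch below accumulates this quantity per task; the data layout and the function names are assumptions.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def nll_grad(Z, B):
    """Gradient of the negated log-likelihood for every entry of B."""
    grad = []
    for task_samples, beta in zip(Z, B):
        g = [0.0] * len(beta)
        for x, label in task_samples:
            score = sum(b * v for b, v in zip(beta, x))
            weight = sigmoid(-label * score)  # probability of the wrong label
            for k, v in enumerate(x):
                g[k] -= label * v * weight
        grad.append(g)
    return grad


# Hypothetical single task with two training samples.
Z = [[([1.0, 2.0], 1), ([0.5, -1.0], -1)]]
print(nll_grad(Z, [[0.0, 0.0]]))
```

Each row of the gradient depends only on the samples of its own task, which is why this term alone cannot transfer information between tasks.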

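The second term $R(B)$ depends on $B$ and $\bar{\beta}$, and can be rewritten as follows

$$ R(B) = \sum_{j=1}^{D} \left[ \frac{1}{\sigma_1^2} \bar{\beta}_j^2 + \frac{1}{\sigma_2^2} \sum_{i=1}^{M} \left( \beta^{(i)}_j - \bar{\beta}_j \right)^2 \right] \tag{7} $$

Given that the second term of the sum also depends on $\bar{\beta}$, we need to express it as a function of $B$. Notice that we can obtain an expression for each $\bar{\beta}_j$ that depends only on $B$, written $\bar{\beta}_j(B)$, since we want to minimize $R(B)$. Thus, we have

$$ \bar{\beta}_j(B) = \arg\min_{b} \left( \frac{1}{\sigma_1^2} b^2 + \frac{1}{\sigma_2^2} \sum_{i=1}^{M} \left( \beta^{(i)}_j - b \right)^2 \right) \tag{8} $$

and this expression has a global minimum at

$$ \bar{\beta}_j(B) = \frac{\sigma_1^2 \sum_{i=1}^{M} \beta^{(i)}_j}{\sigma_2^2 + M \sigma_1^2} \tag{9} $$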
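The closed-form minimizer can be checked by setting the derivative of the one-dimensional objective to zero. The following worked derivation is a sketch that assumes the two quadratic terms are weighted by $1/\sigma_1^2$ and $1/\sigma_2^2$ respectively:

```latex
\begin{align*}
f(b) &= \frac{b^2}{\sigma_1^2}
        + \frac{1}{\sigma_2^2}\sum_{i=1}^{M}\bigl(\beta^{(i)}_j - b\bigr)^2 \\
f'(b) &= \frac{2b}{\sigma_1^2}
        - \frac{2}{\sigma_2^2}\sum_{i=1}^{M}\bigl(\beta^{(i)}_j - b\bigr) = 0 \\
b\left(\frac{1}{\sigma_1^2} + \frac{M}{\sigma_2^2}\right)
      &= \frac{1}{\sigma_2^2}\sum_{i=1}^{M}\beta^{(i)}_j \\
b &= \frac{\sigma_1^2 \sum_{i=1}^{M}\beta^{(i)}_j}{\sigma_2^2 + M\sigma_1^2}
\end{align*}
```

Since $f''(b) = 2/\sigma_1^2 + 2M/\sigma_2^2 > 0$, the objective is strictly convex in $b$ and the stationary point is its global minimum.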

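Given that $\dfrac{\partial \bar{\beta}_j(B)}{\partial \beta^{(s)}_k} = 0$ if $j \neq k$, we obtain:

$$ \frac{\partial \bar{\beta}_k(B)}{\partial \beta^{(s)}_k} = \frac{\sigma_1^2}{\sigma_2^2 + M \sigma_1^2} \tag{10} $$

Then, we get a final expression for the derivatives of the term $R(B)$ by substituting into

$$ \frac{\partial R(B)}{\partial \beta^{(s)}_k} = \frac{2}{\sigma_1^2}\, \bar{\beta}_k\, \frac{\partial \bar{\beta}_k}{\partial \beta^{(s)}_k} + \frac{2}{\sigma_2^2} \sum_{i=1}^{M} \left[ \left( \beta^{(i)}_k - \bar{\beta}_k \right) \frac{\partial \left( \beta^{(i)}_k - \bar{\beta}_k \right)}{\partial \beta^{(s)}_k} \right] \tag{11} $$

the expressions in equations 9 and 10.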
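The derivative of the regularizer can be sketched and checked numerically as follows. The code assumes that the shared vector is the minimizer of the regularizer (the first-variance-weighted sum of the task parameters divided by the second variance plus M times the first variance) and applies the chain rule term by term. All names and numeric values are hypothetical.

```python
def shared_mean(B, var1, var2):
    """Minimizer of the regularizer with respect to the shared vector."""
    M = len(B)
    return [var1 * sum(row[j] for row in B) / (var2 + M * var1)
            for j in range(len(B[0]))]


def regularizer(B, var1, var2):
    """R(B) evaluated at the shared vector returned by shared_mean."""
    bar = shared_mean(B, var1, var2)
    mean_term = sum(m * m for m in bar) / var1
    spread = sum((b - m) ** 2 for row in B for b, m in zip(row, bar)) / var2
    return mean_term + spread


def regularizer_grad(B, var1, var2):
    """Partial derivatives of R(B) with respect to every entry of B."""
    M = len(B)
    bar = shared_mean(B, var1, var2)
    dbar = var1 / (var2 + M * var1)  # derivative of bar[k] w.r.t. B[s][k]
    grad = []
    for s in range(M):
        row = []
        for k in range(len(bar)):
            g = 2.0 * bar[k] * dbar / var1
            for i in range(M):
                d_inner = (1.0 if i == s else 0.0) - dbar
                g += 2.0 * (B[i][k] - bar[k]) * d_inner / var2
            row.append(g)
        grad.append(row)
    return grad


B = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.1]]  # hypothetical, M = 3 and D = 2
print(regularizer_grad(B, var1=2.0, var2=0.5))
```

Because the shared vector minimizes the regularizer, the terms that involve its derivative cancel in this sketch, and each partial derivative reduces to twice the difference between the task parameter and the shared value, divided by the second variance.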

3 EXPERIMENTS

To test the proposed shared logistic regression model in the face classification field, we performed several subject verification experiments using images from the public AR Face Database (Martinez and Benavente, 1998).

In the experiments we used only the internal part of each face image, and these fragments were resized to 16 × 16 pixels. All the images used in the training and test steps are aligned by the center pixel of each eye. Some examples of these images are shown in Figure 1.

Figure 1: The corresponding processed training and test images: resized fragments of the original images including the internal part of the face.

The experiments were repeated 10 times, and the subject identifiers, the training images and the test images were randomly selected each time. Given that shared models are especially appropriate when the training set is small, we trained each verification task with only 2 positive samples (images of the subject to be verified) and 4 negative samples (images of other subjects). To test the system we used 20 positive images and 40 negative images in each case.
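The sampling protocol for one verification task can be sketched as follows. The dictionary of image identifiers, the number of images per subject and the function name are hypothetical; only the sample counts (2 and 4 for training, 20 and 40 for testing) come from the description above.

```python
import random


def sample_verification_split(images_by_subject, target, rng,
                              n_train_pos=2, n_train_neg=4,
                              n_test_pos=20, n_test_neg=40):
    """Randomly pick disjoint training and test sets for one target subject."""
    pos = list(images_by_subject[target])
    neg = [img for subj, imgs in images_by_subject.items()
           if subj != target for img in imgs]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = ([(img, 1) for img in pos[:n_train_pos]] +
             [(img, -1) for img in neg[:n_train_neg]])
    test = ([(img, 1) for img in pos[n_train_pos:n_train_pos + n_test_pos]] +
            [(img, -1) for img in neg[n_train_neg:n_train_neg + n_test_neg]])
    return train, test


# Hypothetical data set: 10 subjects with 26 image identifiers each.
data = {s: [(s, k) for k in range(26)] for s in range(10)}
train, test = sample_verification_split(data, target=3, rng=random.Random(0))
print(len(train), len(test))
```

Repeating this draw with different random seeds and different target subjects gives the kind of repeated random selection described above.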

We considered task groups containing from 1 to 10 tasks, and trained them with both the classical single-task logistic regression model and the proposed shared model, which trains all the tasks of a group at the same time. The results are shown in Table 1, where a correct verification is the correct classification of a subject with its corresponding positive or negative label.

The subject classification results show that when more than 4 tasks are trained simultaneously sharing information, the overall accuracy increases. The performance of our proposal improves progressively as new tasks are added to the system. However, when only a few tasks are available to train the shared logistic regression model, the optimization method fails to obtain proper model parameters.

Table 1: Mean accuracies (%) and 95% confidence intervals of the logistic regression method trained separately (first row) and following our shared logistic approach (second row), for groups of 1 to 10 tasks. When more than 4 verification tasks are simultaneously trained, the error rates of the shared approach become lower.

Number of tasks    1            2            3            4            5
Logistic           68.1 ± 8.2   65.5 ± 6.4   69.5 ± 5.3   68.2 ± 4.2   69.8 ± 3.9
Shared Logistic    -            59.4 ± 4.2   64.2 ± 5.4   67.9 ± 5.2   71.3 ± 5.2

Number of tasks    6            7            8            9            10
Logistic           68.2 ± 3.6   68.6 ± 3.3   70.4 ± 3.3   69.8 ± 3.0   70.4 ± 2.9
Shared Logistic    72.8 ± 4.2   76.4 ± 3.1   78.2 ± 2.8   82.5 ± 2.4   84.6 ± 2.3

4 CONCLUSIONS

In this paper we introduce a shared logistic regression classifier applied to a face verification problem. We show that the theoretical benefits of inductive knowledge transfer in the machine learning process, stated in the recent literature, can be applied in practice to the probabilistic modelling of a real-life problem. The results obtained in the face verification application show that training the different subject classification tasks while sharing the model information yields improved accuracies. Moreover, the improvement becomes more significant as the number of jointly trained verification tasks increases, reaching about 14 percentage points (84.6% versus 70.4%) in the case of 10 simultaneous verifications.
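The accuracy gain of the shared model over the single-task model can be recomputed from the mean accuracies reported in Table 1. The sketch below assumes that the four shared values in the first block of the table correspond to groups of 2 to 5 tasks, since no shared value is listed for a single task.

```python
# Mean accuracies (%) from Table 1, indexed by the number of tasks.
logistic = {1: 68.1, 2: 65.5, 3: 69.5, 4: 68.2, 5: 69.8,
            6: 68.2, 7: 68.6, 8: 70.4, 9: 69.8, 10: 70.4}
# Assumption: the shared model has no entry for a single task.
shared = {2: 59.4, 3: 64.2, 4: 67.9, 5: 71.3,
          6: 72.8, 7: 76.4, 8: 78.2, 9: 82.5, 10: 84.6}

# Gain in percentage points for every group size with both results.
gains = {n: round(shared[n] - logistic[n], 1) for n in sorted(shared)}

# Smallest group size at which the shared model is more accurate.
first_win = min(n for n, gain in gains.items() if gain > 0)

for n, gain in gains.items():
    print(n, gain)
print("shared model first wins at", first_win, "tasks")
```

These differences compare mean accuracies only and do not take the reported confidence intervals into account.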

The probabilistic modelling presented in this paper suggests new lines of future research. In our first formulation, the knowledge sharing property is imposed by constraining the parameter space of the classifiers across the multiple tasks. Other approaches could be followed, such as a more complex modelling based on a hidden model that generates the parameter space.

Moreover, the addition of extra related tasks from different domains could be studied, for example a gender or ethnicity recognition problem. Enlarging the task pool should increase the amount of information shared between the related tasks, mitigating the effects of the small sample size problem in face verification.

ACKNOWLEDGEMENTS

This work is supported by MEC grant TIN2006-15308-C02-01, Ministerio de Ciencia y Tecnologia, Spain.

REFERENCES

Ando, R. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.

Belhumeur, P., Hespanha, J., and Kriegman, D. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720.

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41–75.

Evgeniou, T., Micchelli, C., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179–188.

Fukunaga, K. and Mantock, J. (1983). Nonparametric discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(6):671–678.

Hyvarinen, A. (1999). The fixed-point algorithm and maximum likelihood estimation for independent component analysis. Neural Process. Lett., 10(1):1–5.

Lee, D. and Seung, S. (2000). Algorithms for non-negative matrix factorization. In NIPS, pages 556–562.

Martinez, A. and Benavente, R. (1998). The AR Face database. Technical Report 24, Computer Vision Center.

Masip, D., Kuncheva, L. I., and Vitrià, J. (2005). An ensemble-based method for linear feature extraction for two-class problems. Pattern Analysis and Applications, 8:227–237.

Masip, D. and Vitrià, J. (2006). Boosted discriminant projections for nearest neighbor classification. Pattern Recognition, 39(2):164–170.

Moghaddam, B., Wahid, W., and Pentland, A. (1998). Beyond eigenfaces: Probabilistic matching for face recognition. In Proc. of Int'l Conf. on Automatic Face and Gesture Recognition (FG'98), pages 30–35, Nara, Japan.

Thrun, S. and Pratt, L. (1997). Learning to Learn. Kluwer Academic.

Torralba, A., Murphy, K., and Freeman, W. (2004). Sharing features: efficient boosting procedures for multiclass object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.