nel difference. We show with computer experiments
on the face discrimination problem that our model performs better than an SVM, and gives results comparable to those of specifically tuned face detectors, despite the smaller training data size.
2 OUTLINE OF THE MODEL
Data manifolds tend to be separated flexibly and linearly by a deep architecture; in an SVM or a multilayer neural network, the data representations are indeed linearly separated, but the data manifolds themselves are not necessarily separated. In order to obtain a linearly separable representation, the feature extractor should be designed to have a potential that takes smaller values for instances of the object class than for out-of-class instances. For example, in an Auto-Encoder or in Sparse Coding, the difference between the input vectors and the reconstructed vectors is designed to be minimal, and thus serves as a potential on the data. In a deep architecture such feature extractors are stacked, which makes it possible to regularize the potential (Erhan et al. 2010), so that the data manifolds become linearly separated.
Another idea for obtaining a regularized feature potential comes from generative models. The Fisher kernel reflects this idea and is defined by the inner product of Fisher scores, although it involves a problem of computational efficiency. Given a model P(x|θ), the Fisher score is defined by
\lambda_\theta(x) = \frac{\partial \log P(x|\theta)}{\partial \theta} \qquad (1)
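To make Eq.(1) concrete, for a univariate Gaussian model P(x|θ) = N(x; µ, σ²) the score with respect to the mean is (x − µ)/σ². A minimal numerical sketch (the Gaussian serves only as an illustration here, not as the model adopted later):

```python
import numpy as np

def gaussian_fisher_score_mu(x, mu, sigma):
    """Fisher score of eq.(1) w.r.t. the mean of a univariate Gaussian:
    d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2."""
    return (x - mu) / sigma ** 2

# Samples near the model mean give scores near 0; outliers give
# large-magnitude scores.
print(gaussian_fisher_score_mu(np.array([0.1, -0.2, 5.0]), mu=0.0, sigma=1.0))
```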
The score vector of Eq.(1) takes a value near 0 for an input vector x of the class, if the parameter θ is properly trained as a model of the class. This implies that the Fisher kernel is an over-transformed representation, in the sense that a kernel value of x_1, x_2 near 0 does not necessarily mean that both x_1 and x_2 belong to the class. For this reason we propose a method that directly utilizes the Fisher score with an MRF, unlike Fisher kernel methods such as (Jaakkola & Haussler 1998).
An MRF models objects with auto-correlative units called cliques, and MRFs have been applied to a wide range of signal processing and pattern analysis problems. If we adopt an MRF as the basic generative model for calculating the Fisher score, however, we suffer from a combinatorially huge number of model parameters, corresponding to the number of cliques.
A compact expression of the features can be obtained from the kernel representation of the random field. In order to take the autocorrelations of cliques into the kernel, we will define a feature vector consisting of cliques in section 3, and the kernel is defined by the inner product of feature vectors. Note that our definition of the autocorrelation kernel does not include the second or higher orders of single variables. This reduces the computational complexity of the kernel to linear order.
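As a minimal sketch of this construction (the precise clique set is fixed in section 3; here we simply assume neighbouring pairs on a 1-D chain as the cliques), the kernel is the inner product of clique-product features and costs only linear time in the input dimension:

```python
import numpy as np

def clique_features(x):
    """Second-order autocorrelation features over assumed neighbouring-pair
    cliques of a 1-D chain; powers of a single variable are excluded."""
    return x[:-1] * x[1:]

def autocorr_kernel(x, y):
    """Kernel as the inner product of clique feature vectors, O(n)."""
    return float(np.dot(clique_features(x), clique_features(y)))

x = np.array([1.0, 1.0, 1.0, 0.0])
y = np.array([1.0, 1.0, 0.0, 1.0])
print(autocorr_kernel(x, y))  # 1.0: only the first pair clique fires in both
```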
The higher-order autocorrelation kernel introduced in section 3 is able to reflect differences in higher-order autocorrelations, while the popular Gaussian or polynomial kernels depend only on the difference of the input vector values. A higher-order autocorrelation kernel was examined in (Horikawa 2004), in which the direct inner product of feature vectors of higher-order autocorrelations was used, and higher-order moments of single variables were included in that setting; thus it requires a huge computational effort.
If we use our definition of the higher-order autocorrelation kernel, we can define a Kernel Random Field (KRF) for n-dimensional discrete states x as follows:
P(x|\mu) = \frac{1}{Z}\exp\left(-\sum_{\ell=1}^{m} \mu_\ell\, K(\xi_\ell, x)\right) \qquad (2)
which is equivalent to an MRF, where Z is the partition function, ξ_ℓ are the m training examples, and µ_ℓ are the model parameters. In practical situations, computation with the KRF of eq.(2) is hard because of the partition function Z.
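A sketch of the model structure of eq.(2), reusing the autocorrelation kernel sketched above; only the unnormalized density is computed, since Z itself is what makes the exact model intractable:

```python
import numpy as np

def krf_energy(x, examples, mu, kernel):
    """Energy of eq.(2): sum_l mu_l * K(xi_l, x) over the m training examples."""
    return sum(m_l * kernel(xi, x) for m_l, xi in zip(mu, examples))

def krf_unnormalized(x, examples, mu, kernel):
    """Unnormalized probability exp(-energy); the partition function Z
    is left out, which is why a mean field approximation is needed."""
    return np.exp(-krf_energy(x, examples, mu, kernel))
```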
We apply the mean field approximation to derive the
mean field Fisher score expression in section 5:
\lambda_{\ell'}(\xi_\ell) = K(\xi_{\ell'}, \xi_\ell) - K(\xi_{\ell'}, \bar{x}) \qquad (3)
where x̄, ξ_ℓ′, and ξ_ℓ are the mean of the states in the mean field, the ℓ′th in-class training instance, and the ℓth training instance, respectively. On one hand, for an in-class instance ξ_ℓ the first and the second terms on the right-hand side of eq.(3) take similar (comparatively large) values, and the subtraction results in a value near 0. On the other hand, if ξ_ℓ is an out-of-class instance, the first term on the right-hand side of eq.(3) takes a small value, and the subtraction results in a negatively large value. As a result, the features of eq.(3) become linearly separable, because the problem is reduced to majority voting over these negative continuous values.
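A minimal sketch of how the feature vector of eq.(3) could be assembled, assuming the mean-field state mean x̄ has already been estimated from the trained KRF (its estimation is given in section 5) and reusing the kernel sketched above:

```python
import numpy as np

def mean_field_fisher_features(x, inclass_examples, x_bar, kernel):
    """One component per in-class example xi_l': K(xi_l', x) - K(xi_l', x_bar).
    Near 0 for in-class x, negatively large for out-of-class x."""
    return np.array([kernel(xi, x) - kernel(xi, x_bar)
                     for xi in inclass_examples])
```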
We propose a learning scheme that uses a linear SVM to discriminate the Fisher score features given in eq.(3). The training process is then divided into two steps: in the first step, the mean field KRF is trained on the class data, and in the second step, the linear SVM is trained on the features of eq.(3) using all the training data. We present computer experiments on the face detection problem in section 6 and show that the proposed scheme works well, giving far better results than SVMs. The results are comparable to those of a state-of-the-art face detection system using SURF, a cascade, and AdaBoost (Li & Zhang 2013).
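The two-step scheme could be organized as in the following sketch; fit_mean_field_krf is a hypothetical stub standing in for the mean field KRF training of section 5, and the second step uses an off-the-shelf linear SVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_mean_field_krf(class_data):
    """Hypothetical stub: the actual mean field KRF training is given in
    section 5.  Here we only assume it yields the mean-field state mean
    x_bar and keeps the in-class examples as kernel centres."""
    x_bar = class_data.mean(axis=0)  # placeholder for the mean-field mean
    return class_data, x_bar

def train(class_data, all_data, all_labels, kernel):
    centres, x_bar = fit_mean_field_krf(class_data)         # step 1
    feats = np.array([[kernel(c, x) - kernel(c, x_bar)      # features of eq.(3)
                       for c in centres] for x in all_data])
    clf = LinearSVC().fit(feats, all_labels)                 # step 2
    return clf, centres, x_bar
```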