GENERATIVE EMBEDDINGS BASED ON RICIAN MIXTURES
Application to Kernel-based Discriminative Classification of Magnetic Resonance Images

Anna C. Carli (1), Mário A. T. Figueiredo (2), Manuele Bicego (1,3) and Vittorio Murino (1,3)

(1) Dipartimento di Informatica, Università di Verona, Verona, Italy
(2) Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
(3) Istituto Italiano di Tecnologia (IIT), Genova, Italy

We acknowledge financial support from the FET programme within EU FP7, under the SIMBAD project (contract 213250).
Keywords:
Discriminative learning, Magnetic resonance images, Generative embedding, Information theory, Kernels,
Rice distributions, Finite mixtures, EM algorithm.
Abstract:
Most approaches to classifier learning for structured objects (such as images or sequences) are based on proba-
bilistic generative models. On the other hand, state-of-the-art classifiers for vectorial data are learned discrim-
inatively. In recent years, these two dual paradigms have been combined via the use of generative embeddings
(of which the Fisher kernel is arguably the best known example); these embeddings are mappings from the
object space into a fixed dimensional score space, induced by a generative model learned from data, on which
a (possibly kernel-based) discriminative approach can then be used.
This paper proposes a new semi-parametric approach to build generative embeddings for classification of mag-
netic resonance images (MRI). Based on the fact that MRI data is well described by Rice distributions, we
propose to use Rician mixtures as the underlying generative model, based on which several different generative
embeddings are built. These embeddings yield vectorial representations on which kernel-based support vector
machines (SVM) can be trained for classification. Concerning the choice of kernel, we adopt the recently
proposed nonextensive information theoretic kernels.
The methodology proposed was tested on a challenging classification task, which consists of classifying MR images as belonging to schizophrenic or non-schizophrenic human subjects. The classification is based on
a set of regions of interest (ROIs) in each image, with the classifiers corresponding to each ROI being com-
bined via boosting. The experimental results show that the proposed methodology outperforms the previous
state-of-the-art methods on the same dataset.
1 INTRODUCTION
Most approaches to learning classifiers belong to one
of two paradigms: generative and discriminative (Ng
and Jordan, 2002; Rubinstein and Hastie, 1997). Gen-
erative approaches are based on probabilistic class
models and a priori class probabilities, learnt from
training data and combined via Bayes law to yield
posterior probability estimates. Discriminative learn-
ing methods aim at learning class boundaries or pos-
terior class probabilities directly from data, without
relying on generative class models.
In the past decade, several hybrid generative-discriminative approaches have been proposed with the goal of taking advantage of the best of both paradigms (Jaakkola and Haussler, 1999; Lasserre et al., 2006). In this context, the so-called generative
score space methods (or generative embeddings) have
stimulated significant interest. The key idea is to ex-
ploit a generative model to map the objects to be clas-
sified into a feature space, where discriminative tech-
niques, namely kernel-based ones, can be used. This is particularly suitable for dealing with non-vectorial data (strings, trees, images), since it maps objects (possibly of different dimensions) into a fixed-dimension space.
The seminal work on generative embeddings is
arguably the Fisher kernel (Jaakkola and Haussler,
1999). In that work, the features of a given object
are the derivatives of the log-likelihood under the as-
sumed generative model, with respect to the model
parameters, computed at that object. Other examples
of generative embeddings have been more recently
proposed (Bosch et al., 2006; Perina et al., 2009).
In this paper, we exploit generative embeddings
to tackle a challenging classification task: based on a
set of regions of interest (ROIs) of a magnetic res-
onance image (MRI), classify the patient as suffer-
ing, or not, from schizophrenia (Cheng et al., 2009a).
We build on the knowledge of the fact that MRI data
is well modeled by Rician distributions (Gudbjarts-
son and Patz, 1994), and propose several generative
embeddings based on Rician mixture models. Con-
cerning the kernels used in the obtained feature space,
we adopt the nonextensive information theoretic kernels recently proposed in (Martins et al., 2009). An
SVM classifier is learnt for each ROI. Finally, an op-
timal combination of these SVM classifiers is learnt
via the AdaBoost algorithm (Freund and Schapire,
1997). The experimental results reported show that
the proposed methodology outperforms the previous
state-of-the-art on the same dataset.
The paper is organized as follows. Section 2 ad-
dresses the problem of estimating Rician finite mix-
tures using the expectation-maximization (EM) algo-
rithm. In Section 3, we propose several generative
embeddings based on the Rician mixture model. Sec-
tion 4 briefly reviews the information theoretic kernels proposed in (Martins et al., 2009), while Section 5 describes SVM combination via boosting. Finally,
Section 6 reports the experimental results on the mag-
netic resonance (MR) image categorization problem.
2 RICIAN MIXTURE FITTING
VIA THE EM ALGORITHM
2.1 The EM Algorithm
The expectation-maximization (EM) algorithm
(Dempster et al., 1977) is the most common approach
for computing the maximum likelihood estimate
(MLE) of the parameters of a finite mixture. In
this section, we briefly review how EM is used to
estimate a mixture of Rician distributions. A Rician
probability density function (Rice, 1944) has the
form
f_R(y; v, \sigma^2) = \frac{y}{\sigma^2}\, e^{-\frac{y^2 + v^2}{2\sigma^2}}\, I_0\!\left(\frac{y\, v}{\sigma^2}\right),   (1)

for y > 0, and zero for y \leq 0, where v is the magnitude parameter, \sigma is the noise parameter, and I_0(z) denotes the 0-th order modified Bessel function of the first kind (Abramowitz and Stegun, 1972),

I_0(z) = \frac{1}{2\pi} \int_0^{2\pi} e^{z \cos\phi}\, d\phi.   (2)
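For concreteness, the density (1) can be evaluated in a numerically stable way through its logarithm, using the exponentially scaled Bessel function available in SciPy. The following Python sketch is purely illustrative; the function name and library choices are ours, not part of the original implementation.

import numpy as np
from scipy.special import i0e   # exponentially scaled I_0: i0e(x) = exp(-|x|) I_0(x)

def rician_logpdf(y, v, sigma2):
    """Log of the Rician density (1), for y > 0, magnitude v and noise parameter sigma2."""
    y = np.asarray(y, dtype=float)
    u = y * v / sigma2                                   # argument of I_0 in (1)
    log_bessel = np.log(i0e(u)) + u                      # log I_0(u), without overflow
    return np.log(y) - np.log(sigma2) - (y**2 + v**2) / (2.0 * sigma2) + log_bessel

# example: density values at a few positive intensities
print(np.exp(rician_logpdf([0.5, 1.0, 2.0], v=1.0, sigma2=0.25)))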
A finite mixture of Rician distributions, with g components, is thus

f(y; \Psi) = \sum_{i=1}^{g} \pi_i\, f_R(y; \nu_i, \sigma_i^2),   (3)

where the \pi_i, i = 1, \ldots, g, are nonnegative quantities that sum to one (the so-called mixing proportions or weights), \theta_i = (\nu_i, \sigma_i^2) is the pair of parameters of component i, and \Psi = (\pi_1, \ldots, \pi_{g-1}, \theta_1, \ldots, \theta_g) is the vector of all the parameters of the mixture model.
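To make the model (3) concrete: since a Rician variable is the magnitude of a bivariate Gaussian, a sample from the mixture can be drawn by first picking a component and then taking that magnitude. The sketch below is ours (illustrative naming, NumPy assumed), not code from the paper.

import numpy as np

def sample_rician_mixture(n, pi, v, sigma2, seed=None):
    """Draw n samples from (3): pick component i with probability pi_i, then
    y = sqrt((v_i + sigma_i Z1)^2 + (sigma_i Z2)^2) with Z1, Z2 ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(pi), size=n, p=pi)             # hidden labels z_j
    sigma = np.sqrt(np.asarray(sigma2, dtype=float)[comp])
    real = np.asarray(v, dtype=float)[comp] + sigma * rng.standard_normal(n)
    imag = sigma * rng.standard_normal(n)
    return np.hypot(real, imag), comp

y, z = sample_rician_mixture(1000, pi=[0.3, 0.7], v=[1.0, 3.0], sigma2=[0.2, 0.5], seed=0)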
Let Y = \{y_1, \ldots, y_n\} be a random sample of size n, assumed to have been generated independently by a mixture of the form (3), and consider the goal of obtaining an MLE of \Psi, that is, \hat{\Psi} = \arg\max_{\Psi} L(\Psi), where

L(\Psi, Y) = \sum_{j=1}^{n} \log f(y_j; \Psi) = \sum_{j=1}^{n} \log \sum_{i=1}^{g} \pi_i\, f_R(y_j; \nu_i, \sigma_i^2).   (4)
As is common in EM, let z_j \in \{0, 1\}^g be a g-dimensional hidden/missing binary label vector associated to observation y_j, such that z_{ji} = 1 if and only if y_j was generated by the i-th mixture component. The so-called complete data is \{(y_1, z_1), \ldots, (y_n, z_n)\}, and the corresponding complete loglikelihood for \Psi is

L_c(\Psi, Y, Z) = \sum_{j=1}^{n} \sum_{i=1}^{g} z_{ji} \left[ \log \pi_i + \log f_R(y_j; \theta_i) \right],   (5)

where Z = \{z_1, \ldots, z_n\}.
The EM algorithm proceeds iteratively in two steps. The E-step computes the conditional expectation (with respect to the missing labels Z) of the complete loglikelihood, given the observed data Y and the current parameter estimate \hat{\Psi}^{(k)},

Q(\Psi; \Psi^{(k)}) := E_Z\!\left[ L_c(\Psi, Y, Z) \mid Y, \hat{\Psi}^{(k)} \right].   (6)

Since the complete-data loglikelihood is linear in the unobserved variables z_{ji} (as is clear in (5)), this reduces to computing the conditional expectations of these variables and plugging them into the complete loglikelihood. These conditional expectations are well known and equal to the posterior probability that the j-th sample was generated by the i-th component of the mixture; denoting this quantity as w_{ji}, we have

w_{ji} = \frac{\pi_i^{(k)}\, f(y_j; \theta_i^{(k)})}{\sum_{h=1}^{g} \pi_h^{(k)}\, f(y_j; \theta_h^{(k)})},   (7)

for i = 1, \ldots, g and j = 1, \ldots, n. It follows that the conditional expectation of the complete loglikelihood (6) becomes

Q(\Psi; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} w_{ji} \left[ \log \pi_i + \log f(y_j; \theta_i) \right].   (8)
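A minimal sketch of the E-step (7) in Python, normalizing per-component log-densities with a log-sum-exp; the helper restates the Rician log-density of (1) so the snippet is self-contained, and all names are ours.

import numpy as np
from scipy.special import i0e, logsumexp

def rician_logpdf(y, v, sigma2):
    u = y * v / sigma2
    return np.log(y) - np.log(sigma2) - (y**2 + v**2) / (2.0 * sigma2) + np.log(i0e(u)) + u

def e_step(y, pi, v, sigma2):
    """Responsibilities w_{ji} of (7): one row per sample, one column per component."""
    y = np.asarray(y, dtype=float)
    log_joint = np.stack([np.log(pi[i]) + rician_logpdf(y, v[i], sigma2[i])
                          for i in range(len(pi))], axis=1)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))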
The M-step obtains an updated parameter estimate \Psi^{(k+1)} by maximizing Q(\Psi; \Psi^{(k)}) with respect to \Psi over the parameter space. The updated estimates of the mixing proportions \pi_i^{(k+1)} are well known to be given by

\pi_i^{(k+1)} = \frac{1}{n} \sum_{j=1}^{n} w_{ji}.   (9)
2.2 Updating the Parameters of the
Rician Components
Updating the estimate of \theta_i = (\nu_i, \sigma_i^2) requires solving

\sum_{i=1}^{g} \sum_{j=1}^{n} w_{ji}\, \nabla_{\theta} \log f_R(y_j; \theta) = 0,   (10)

where \nabla_{\theta} denotes the gradient with respect to \theta. In the following proposition (proved in the appendix), we provide an explicit solution of (10) for the Rician mixture.
Proposition 2.1. The updated estimate \hat{\theta}_i^{(k+1)} = (\hat{v}_i^{(k+1)}, (\hat{\sigma}_i^2)^{(k+1)}), that is, the solution of (10), is

\hat{v}_i^{(k+1)} = \frac{1}{\sum_{j=1}^{n} w_{ji}} \sum_{j=1}^{n} w_{ji}\, y_j\, \phi\!\left(\frac{y_j\, v_i^{(k)}}{(\sigma_i^2)^{(k)}}\right)   (11)

and

(\hat{\sigma}_i^2)^{(k+1)} = \frac{0.5}{\sum_{j=1}^{n} w_{ji}} \sum_{j=1}^{n} w_{ji} \left( y_j^2 + (v_i^{(k+1)})^2 - 2\, y_j\, v_i^{(k+1)}\, \phi\!\left(\frac{y_j\, v_i^{(k)}}{(\sigma_i^2)^{(k)}}\right) \right),   (12)

where

\phi(u) = \frac{I_1(u)}{I_0(u)}.   (13)
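The corresponding M-step, implementing (9), (11) and (12), can be sketched as follows; \phi of (13) is computed with exponentially scaled Bessel functions so that the ratio stays finite for large arguments. Variable and function names are our own.

import numpy as np
from scipy.special import i0e, i1e

def phi(u):
    """phi(u) = I_1(u) / I_0(u) as in (13); the exp(-|u|) scaling factors cancel."""
    return i1e(u) / i0e(u)

def m_step(y, w, v_old, sigma2_old):
    """One M-step given the (n, g) responsibility matrix w from the E-step (7)."""
    y = np.asarray(y, dtype=float)[:, None]              # shape (n, 1)
    sum_w = w.sum(axis=0)                                 # shape (g,)
    pi_new = sum_w / w.shape[0]                           # mixing proportions, eq. (9)
    r = phi(y * np.asarray(v_old) / np.asarray(sigma2_old))   # phi(y_j v_i / sigma_i^2)
    v_new = (w * y * r).sum(axis=0) / sum_w               # eq. (11)
    sigma2_new = 0.5 * (w * (y**2 + v_new**2 - 2.0 * y * v_new * r)).sum(axis=0) / sum_w   # eq. (12)
    return pi_new, v_new, sigma2_new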
3 GENERATIVE EMBEDDINGS
BASED ON RICIAN MIXTURES
This section introduces several generative embed-
dings for images based on the Rician mixture model.
Let X_s = \{y_1^s, \ldots, y_{N_s}^s\}, for s = 1, \ldots, S, be a set of images, each belonging to one of R classes. Each image X_s is modeled simply as a bag of N_s strictly positive pixels y_j^s \in \mathbb{R}_{++}, for j = 1, \ldots, N_s. Each image is mapped into a finite-dimensional Hilbert space (the so-called feature space) using the Rician mixture generative model, as explained next.
Based on a K-component Rician mixture with parameters \Psi, the posterior probability that y_j^s (the j-th pixel of the s-th image) belongs to the i-th component of the mixture is

w_i(y_j^s; \Psi) = \frac{\pi_i\, f(y_j^s; \theta_i)}{\sum_{k=1}^{K} \pi_k\, f(y_j^s; \theta_k)},   (14)

as used in the E-step (7). Based on (14), different generative embeddings can be defined, as shown in Definitions 3.1, 3.2, and 3.3.
Definition 3.1. If a single Rician mixture \Psi is estimated for the S images, the embedding of an image X = \{y_1, \ldots, y_N\} is the K-dimensional vector given by

\tilde{e}_{\mathrm{single}}(X; \Psi) = \frac{1}{N} \left[ \sum_{j=1}^{N} w_1(y_j; \Psi), \ldots, \sum_{j=1}^{N} w_K(y_j; \Psi) \right]^T.   (15)
Definition 3.2. If a set of R Rician mixtures (one per class) is estimated, \{\Psi_1, \ldots, \Psi_R\}, each with K components, the embedding of an image X = \{y_1, \ldots, y_N\} is the (KR)-dimensional vector given by

\tilde{e}(X; \Psi_1, \ldots, \Psi_R) = \left[ \tilde{e}_{\mathrm{single}}(X; \Psi_1)^T, \ldots, \tilde{e}_{\mathrm{single}}(X; \Psi_R)^T \right]^T.   (16)
Other possible embeddings and their generalizations
are introduced in the following definition.
Definition 3.3. We will also consider the following two K-dimensional embeddings, defined for an arbitrary image X = \{y_1, \ldots, y_N\} as

\bar{e}_{\mathrm{single}}(X; \Psi) = \frac{1}{N} \sum_{j=1}^{N} \left[ \pi_1 f(y_j; \theta_1), \ldots, \pi_K f(y_j; \theta_K) \right]^T

and

\hat{e}_{\mathrm{single}}(X; \Psi) = \frac{1}{N} \sum_{j=1}^{N} \left[ f(y_j; \theta_1), \ldots, f(y_j; \theta_K) \right]^T,

as well as their (KR)-dimensional generalizations to the case in which a Rician mixture is estimated for each of the R classes,

\bar{e}(X; \Psi_1, \ldots, \Psi_R) = \left[ \bar{e}_{\mathrm{single}}(X; \Psi_1)^T, \ldots, \bar{e}_{\mathrm{single}}(X; \Psi_R)^T \right]^T

and

\hat{e}(X; \Psi_1, \ldots, \Psi_R) = \left[ \hat{e}_{\mathrm{single}}(X; \Psi_1)^T, \ldots, \hat{e}_{\mathrm{single}}(X; \Psi_R)^T \right]^T.
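As an illustration of Definitions 3.1 and 3.3, the three single-mixture embeddings of one image can be obtained from the matrix of per-pixel component densities; the (KR)-dimensional versions of Definition 3.2 are plain concatenations of the per-class vectors. The sketch below (names ours) assumes that matrix has already been computed, e.g. with the Rician log-density of Section 2.

import numpy as np

def embeddings_single(dens, pi):
    """Embeddings of one image from dens[j, i] = f(y_j; theta_i), shape (N, K), and weights pi.
    Returns (e_tilde, e_bar, e_hat) as in (15) and Definition 3.3."""
    weighted = dens * pi                                   # pi_i f(y_j; theta_i)
    post = weighted / weighted.sum(axis=1, keepdims=True)  # w_i(y_j; Psi), eq. (14)
    return post.mean(axis=0), weighted.mean(axis=0), dens.mean(axis=0)

# (KR)-dimensional versions (Definition 3.2 and its analogues): concatenate per-class vectors, e.g.
# e_tilde = np.concatenate([embeddings_single(dens_r, pi_r)[0] for dens_r, pi_r in per_class_models])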
4 NONEXTENSIVE
INFORMATION THEORETIC
KERNELS ON MEASURES
This section briefly reviews the information theoretic
kernels proposed in (Martins et al., 2009), introducing
notation which will be useful later on.
4.1 Suyari’s Entropies
Begin by recalling that both the Shannon-Boltzmann-Gibbs (SBG) and the Tsallis entropies are particular cases of functions S_{q,\phi} following Suyari's axioms (Suyari, 2004). Let \Delta^{n-1} be the standard probability simplex and q \geq 0 a fixed scalar (the entropic index). The function S_{q,\phi}: \Delta^{n-1} \to \mathbb{R} has the form

S_{q,\phi}(p_1, \ldots, p_n) = \begin{cases} \dfrac{k}{\phi(q)} \left( 1 - \sum_{i=1}^{n} p_i^q \right) & \text{if } q \neq 1, \\[1ex] -k \displaystyle\sum_{i=1}^{n} p_i \ln p_i & \text{if } q = 1, \end{cases}   (17)

where \phi: \mathbb{R}_+ \to \mathbb{R} is a continuous function with the properties stated in (Suyari, 2004), and k > 0 is an arbitrary constant, henceforth set to k = 1. For q = 1, we recover the SBG entropy,

S_{1,\phi}(p_1, \ldots, p_n) = H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i,

while setting \phi(q) = q - 1 yields the Tsallis entropy

S_q(p_1, \ldots, p_n) = \frac{1}{q - 1} \left( 1 - \sum_{i=1}^{n} p_i^q \right) = -\sum_{i=1}^{n} p_i^q \ln_q p_i,

where \ln_q(x) = \frac{x^{1-q} - 1}{1 - q} is the q-logarithmic function.
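A small sketch of the q-logarithm and of the Tsallis entropy S_q (with k = 1), reducing to the SBG entropy at q = 1; naming is ours.

import numpy as np

def log_q(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q), with ln_1 = ln."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if q == 1.0 else (x**(1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """S_q(p) of (17) with phi(q) = q - 1 and k = 1 (SBG entropy for q = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                           # convention 0 log 0 = 0
    if q == 1.0:
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p**q)) / (q - 1.0)

print(tsallis_entropy([0.5, 0.5], q=1.0))                  # ln 2, about 0.693
print(tsallis_entropy([0.5, 0.5], q=2.0))                  # 1 - 1/2 = 0.5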
4.2 Jensen-Shannon (JS) Divergence
Consider two measure spaces (\mathcal{X}, \mathcal{M}, \nu) and (\mathcal{T}, \mathcal{J}, \tau), where the second is used to index the first. Let H denote the SBG entropy, and consider the random variables T \in \mathcal{T} and X \in \mathcal{X}, with densities \pi(t) and p(x) \triangleq \int_{\mathcal{T}} p(x|t)\, \pi(t)\, dt. The Jensen divergence (Martins et al., 2009) is defined as

J^{\pi}(p) \triangleq J^{\pi}_H(p) = H(E[p]) - E[H(p)].   (18)

When \mathcal{X} and \mathcal{T} are finite with |\mathcal{T}| = m, J^{\pi}_H(p_1, \ldots, p_m) is called the Jensen-Shannon (JS) divergence of p_1, \ldots, p_m, with weights \pi_1, \ldots, \pi_m (Burbea and Rao, 1982; Lin, 1991). In particular, if |\mathcal{T}| = 2 and \pi = (1/2, 1/2), p may be seen as a random distribution whose value on \{p_1, p_2\} is chosen by tossing a fair coin. In this case, J^{(1/2,1/2)} = JS(p_1, p_2), where

JS(p_1, p_2) \triangleq H\!\left(\frac{p_1 + p_2}{2}\right) - \frac{H(p_1) + H(p_2)}{2},
which will be used in Section 4.4 to define JS kernels.
4.3 Jensen-Tsallis (JT) q–Differences
Notice that Tsallis’ entropy can be written as
S
q
(X) = E
q
[ln
q
p(X)],
where E
q
denotes the unnormalized q–expectation,
which, for a discrete random variable X X with
probability mass function p : X R, is defined as
E
q
[X] ,
xX
x p(x)
q
;
(of course, E
1
[X] is the standard expectation).
As in Section 4.2, consider two random variables
T T and X X , with densities π(t) and p(x) ,
R
T
p(x|t)π(t). The Jensen q-difference (nonextensive
analogue of (18)) (Martins et al., 2009) is
T
π
q
(p) = S
q
(E[p]) E
q
[S
q
(p)].
If X and T are finite with |T | = m, T
π
q
(p
1
,··· , p
m
)
is called the Jensen-Tsallis (JT) q-difference of
p
1
,··· , p
m
, with weights π
1
,··· ,π
m
. In particular, if
|T | = 2 and π = (1/2,1/2), define T
q
= T
1/2,1/2
q
T
q
(p
1
, p
2
) = S
q
p
1
+ p
2
2
S
q
(p
1
) + s
q
(p
2
)
2
,
which will be used in Section 4.4 to define JT kernels.
Naturally, T
1
coincides with the JS divergence.
4.4 Jensen-Shannon and Tsallis Kernels
The JS and JT differences underlie the kernels pro-
posed in (Martins et al., 2009), which can be defined
for normalized or unnormalized measures.
Definition 4.1 (Weighted Jensen-Tsallis kernels). Let \mu_1 and \mu_2 be two (not necessarily probability) measures; the kernel \tilde{k}_q is defined as

\tilde{k}_q(\mu_1, \mu_2) \triangleq \left( S_q(\pi) - T^{\pi}_q(p_1, p_2) \right) (\omega_1 + \omega_2)^q,

where p_1 = \mu_1 / \omega_1 and p_2 = \mu_2 / \omega_2 are the normalized counterparts of \mu_1 and \mu_2, with corresponding total masses \omega_1 and \omega_2, and \pi = (\omega_1 + \omega_2)^{-1} [\omega_1, \omega_2]. The kernel k_q is defined as

k_q(\mu_1, \mu_2) \triangleq S_q(\pi) - T^{\pi}_q(p_1, p_2).

Notice that if \omega_1 = \omega_2, \tilde{k}_q and k_q coincide up to a scale factor. For q = 1, k_q is the so-called Jensen-Shannon kernel, k_{JS}(p_1, p_2) = \ln 2 - JS(p_1, p_2).
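To illustrate Definition 4.1, the two kernels can be computed for a pair of nonnegative feature vectors (for instance, the generative embeddings of Section 3) as sketched below. The sketch is self-contained and follows the definitions as reproduced above; the names, and the exact handling of the total-mass factor, reflect our reading rather than the authors' code.

import numpy as np

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if q == 1.0 else (1.0 - np.sum(p**q)) / (q - 1.0)

def weighted_jt_kernels(mu1, mu2, q):
    """Kernels of Definition 4.1 for nonnegative vectors mu1, mu2: returns (k_tilde_q, k_q)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    w1, w2 = mu1.sum(), mu2.sum()                          # total masses omega_1, omega_2
    p1, p2 = mu1 / w1, mu2 / w2                            # normalized counterparts
    pi = np.array([w1, w2]) / (w1 + w2)
    # Jensen-Tsallis q-difference T_q^pi(p1, p2) = S_q(pi_1 p1 + pi_2 p2) - sum_t pi_t^q S_q(p_t)
    jt = tsallis_entropy(pi[0] * p1 + pi[1] * p2, q) \
         - (pi[0]**q * tsallis_entropy(p1, q) + pi[1]**q * tsallis_entropy(p2, q))
    k_q = tsallis_entropy(pi, q) - jt
    return k_q * (w1 + w2)**q, k_q

# for probability vectors and q = 1, k_q is the JS kernel: ln 2 - JS(p1, p2)
print(weighted_jt_kernels([0.4, 0.6], [0.4, 0.6], q=1.0)[1])   # about 0.693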
The following proposition characterizes these ker-
nels in terms of positive definiteness, a crucial aspect
for their use in support vector machines (SVM).
Proposition 4.1. The kernel \tilde{k}_q is positive definite (pd) for q \in [0, 2]. The kernel k_q is pd for q \in [0, 1]. The kernel k_{JS} is pd.
5 COMBINING SVM
CLASSIFIERS VIA BOOSTING
The final building block of our approach to MR image classification is a way to combine the classifiers working on each of the several regions of interest (ROI). To that end, we adopt the AdaBoost algorithm (Freund and Schapire, 1997), which we now briefly review. In the description of AdaBoost in Algorithm 5.1, each (weak) classifier G_m(x), m = 1, \ldots, M, corresponds to one of the M regions.
Algorithm 5.1: AdaBoost (Freund and Schapire, 1997).

1. Initialize weights p_i = 1/S, i = 1, \ldots, S.
2. For m = 1 to M:
   (a) Learn classifier G_m(x) with the current weights.
   (b) Compute the weighted error rate: err_m = \frac{\sum_{i=1}^{S} p_i\, \mathbb{1}(y_i \neq G_m(x_i))}{\sum_{i=1}^{S} p_i}.
   (c) Compute \gamma_m = \log(1 - err_m) - \log(err_m).
   (d) Update p_i \leftarrow p_i \cdot \exp\!\left(\gamma_m\, \mathbb{1}(y_i \neq G_m(x_i))\right), i = 1, \ldots, S.
3. Output G(x) = \mathrm{sign}\!\left( \sum_{m=1}^{M} \gamma_m\, G_m(x) \right).
Each boosting step requires learning a classifier by minimizing a weighted criterion, that is, with weights p_1, \ldots, p_S corresponding to the training observations (y_i, X_i), i = 1, \ldots, S. In our case, the classifier G_m is a weighted version of the SVM classifier corresponding to the m-th ROI, i.e., the SVM classifier whose kernel function is built on the Rician mixture estimated for that ROI. To take these weights into account, the optimization problem solved by the SVM learning algorithm requires a modification: the penalty on the slack variable \xi_i corresponding to the example X_i is set to be proportional to the weight p_i. The corresponding modified 1-norm SVM optimization problem (Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002) is

\min_{\xi, \beta, \beta_0} \; \langle \beta, \beta \rangle + C \sum_{i=1}^{S} p_i\, \xi_i   (19)

s.t. \; y_i (\langle \beta, \phi(X_i) \rangle + \beta_0) \geq 1 - \xi_i, \; i = 1, \ldots, S,
\xi_i \geq 0, \; i = 1, \ldots, S.
The Lagrangian for problem (19) is

L_p(\beta, \beta_0, \xi, \alpha, \mu) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{S} p_i\, \xi_i - \sum_{i=1}^{S} \alpha_i \left[ y_i (\langle \phi(X_i), \beta \rangle + \beta_0) - (1 - \xi_i) \right] - \sum_{i=1}^{S} \mu_i\, \xi_i,   (20)
with \alpha_i \geq 0 and \mu_i \geq 0. Minimizing L_p with respect to \beta, \beta_0, and \xi_i, i = 1, \ldots, S, yields the Lagrange dual problem

\max_{\alpha} \; \sum_{i=1}^{S} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{S} \alpha_i \alpha_j\, y_i y_j\, k(X_i, X_j)   (21)

s.t. \; 0 \leq \alpha_i \leq p_i\, C, \quad \sum_{i=1}^{S} \alpha_i y_i = 0.
Notice that each \alpha_i is constrained to be less than or equal to p_i C rather than C, while the objective function in (21) is the same as in the original 1-norm dual problem (Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002). As a consequence, if p_i is close to zero, so is \alpha_i, which thus contributes very weakly to the definition of the optimal hyperplane, still given by

f(X, \alpha^*, \beta_0^*) = \sum_{i=1}^{S} y_i\, \alpha_i^*\, k(X_i, X) + \beta_0^*.   (22)
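A possible implementation of this scheme combines Algorithm 5.1 with an SVM whose per-sample penalty is rescaled as in (19); for instance, scikit-learn's SVC accepts a precomputed kernel matrix and per-sample weights that rescale C. This sketch is ours (the paper's experiments use LIBSVM directly, see Section 6), with illustrative names.

import numpy as np
from sklearn.svm import SVC

def adaboost_svms(kernels, y, C=1.0):
    """Algorithm 5.1 with one weighted SVM per ROI.
    kernels: list of M precomputed (S, S) training kernel matrices; y: labels in {-1, +1}."""
    y = np.asarray(y)
    p = np.full(len(y), 1.0 / len(y))                      # step 1: uniform weights
    models, gammas = [], []
    for K in kernels:                                      # step 2, one round per ROI
        clf = SVC(kernel='precomputed', C=C)
        clf.fit(K, y, sample_weight=p)                     # weights rescale C, as in (19)
        miss = clf.predict(K) != y
        err = np.clip(np.sum(p * miss) / np.sum(p), 1e-10, 1 - 1e-10)   # step 2(b)
        gamma = np.log(1 - err) - np.log(err)              # step 2(c)
        p = p * np.exp(gamma * miss)                       # step 2(d)
        models.append(clf)
        gammas.append(gamma)
    return models, gammas

def boosted_predict(models, gammas, test_kernels):
    """Step 3: sign of the gamma-weighted sum of the per-ROI SVM outputs."""
    votes = sum(g * clf.predict(K) for clf, g, K in zip(models, gammas, test_kernels))
    return np.sign(votes)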
6 EXPERIMENTS
Let us begin this section with a summary of the proposed approach. The training data consists of a set of images, each containing a set of M regions of interest (ROI) and labeled as belonging to a schizophrenic or non-schizophrenic patient. For each ROI of the set of training images, either a single Rician mixture or two Rician mixtures (one per class) are estimated and used to embed the data in a Hilbert space, as described in Section 3. On the Hilbert space of each ROI, one of the information theoretic kernels described in Section 4 is used. Finally, the set of M SVM classifiers (one per ROI) is combined via the AdaBoost algorithm described in Section 5; the final classifier is the one produced at the last step of Algorithm 5.1.
The baselines against which we compare the pro-
posed approach are SVM classifiers with linear ker-
nels (LK) and Gaussian radial basis function kernels
(GRBFK) built on the same generative embeddings.
SVM training is carried out using the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm). The
underlying Rician mixtures were estimated using the
EM algorithm described in Section 2, with K (the
number of components) selected using the criterion
proposed in (Figueiredo and Jain, 2002); this leads to
numbers in the [4, 6] range. We tested the generative embeddings \tilde{e}, \bar{e}, and \hat{e} proposed in Section 3, both in the single-mixture and R-mixture versions.
The dataset contains 124 images (64 patients and
60 controls), each with the following 14 ROIs (7
pairs): Amygdala (1-Left, 2-Right), Dorso-lateral
PreFrontal Cortex (3-Left, 4-Right), Entorhinal Cor-
tex (5-Left, 6-Right), Heschl’s Gyrus (7-Left, 8-
Right), Hippocampus (9-Left, 10-Right), Superior
Temporal Gyrus (11-Left, 12-Right), Thalamus (13-Left, 14-Right). To evaluate the classifiers, the dataset was split 50%-50% into training and test subsets and 10 runs were performed.

Table 1: Mean accuracy for the best values of q and C for the SVM classifiers learnt on ROIs 2, 4 and 6, using one Rician mixture per class with K = 4, 5, 6 components and embeddings \tilde{e}, \bar{e} and \hat{e}.

ROI                        |         2         |         4         |         6
No. of components          |   4     5     6   |   4     5     6   |   4     5     6
Embedding \tilde{e}
  Linear                   | 54.84 53.06 53.39 | 60.16 60    60    | 57.26 58.23 58.23
  RBF                      | 59.52 60.16 62.26 | 60.81 60.81 61.13 | 65.32 65.16 64.48
  Jensen-Shannon           | 58.87 58.39 59.84 | 60.81 58.55 60.32 | 67.42 66.61 65.48
  Jensen-Tsallis           | 59.35 60    60.97 | 62.42 59.84 62.42 | 67.58 67.42 65.97
  Weighted JT \tilde{k}_q  | 59.35 59.84 60.97 | 61.13 60.32 61.94 | 67.74 67.26 66.29
  Weighted JT k_q          | 59.35 59.19 59.84 | 62.42 59.84 62.42 | 67.58 66.94 65.97
Embedding \bar{e}
  Linear                   | 53.06 51.94 51.94 | 58.87 58.23 57.74 | 56.45 58.55 57.74
  RBF                      | 61.94 62.26 63.39 | 59.84 60.48 60.97 | 64.03 63.39 63.55
  Jensen-Shannon           | 60    61.45 60.32 | 57.74 57.74 57.26 | 64.84 65.48 65.81
  Jensen-Tsallis           | 61.45 61.45 62.9  | 60.48 60.16 60    | 67.1  67.58 66.61
  Weighted JT \tilde{k}_q  | 62.58 62.26 62.1  | 57.9  58.06 58.87 | 66.13 65.97 65
  Weighted JT k_q          | 61.77 61.45 63.23 | 56.94 58.06 57.09 | 66.45 66.94 67.74
Embedding \hat{e}
  Linear                   | 52.74 53.55 55.65 | 58.39 58.06 58.55 | 57.1  57.26 57.1
  RBF                      | 61.94 62.1  63.39 | 60.32 60.65 60.32 | 65.81 64.84 65.16
  Jensen-Shannon           | 60.48 60.32 60.97 | 57.74 57.74 57.9  | 65    66.45 65.97
  Jensen-Tsallis           | 60.97 61.13 63.39 | 59.52 60.16 59.52 | 66.76 68.06 66.29
  Weighted JT \tilde{k}_q  | 62.1  62.58 62.42 | 58.39 57.9  58.55 | 64.08 65    65.65
  Weighted JT k_q          | 61.45 61.45 62.42 | 57.74 57.74 59.84 | 65.32 66.13 67.74
SVM classifiers were trained for each individual ROI (without the boosting-based combination), and the conclusion was that ROI 10 leads to the best accuracy (see Tables 1, 2, 3). The accuracy is robust with respect to the number of mixture components. The best performances over q and C are reported; for the GRBFK, the best performance over the width parameter and over C is reported. Mean accuracies are plotted in Figure 1 as a function of q for the best value of C, and as a function of C for the best value of q, for the generative embeddings \tilde{e}, \bar{e} and \hat{e}, with 2 Rician mixtures (one per class), each with 4 components. The results with a single mixture are very similar and are thus omitted. For q > 1, the results shown for the weighted JT kernel k_q (which is positive definite only for q \in [0, 1]) correspond to q = 1. These results show that the proposed generative embeddings lead to comparable performances. The information theoretic kernels outperform the LK and GRBFK; in particular, the best performances are obtained with the JT and weighted JT kernels, for all ROIs. The standard error of the mean is less than 0.006.
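The "best values of q and C" reported in the tables can be reproduced, in spirit, by a simple grid search with a precomputed kernel matrix on a held-out split; the grids and names below are illustrative choices of ours, not the ones used in the experiments.

import numpy as np
from sklearn.svm import SVC

def select_q_and_C(emb_train, y_train, emb_val, y_val, kernel_fn,
                   q_grid=np.linspace(0.1, 2.0, 20),
                   C_grid=10.0 ** np.arange(-1, 8)):
    """Pick (q, C) maximizing validation accuracy of an SVM with a precomputed
    information theoretic kernel; kernel_fn(a, b, q) returns a scalar kernel value."""
    best = (None, None, -np.inf)
    for q in q_grid:
        K_train = np.array([[kernel_fn(a, b, q) for b in emb_train] for a in emb_train])
        K_val = np.array([[kernel_fn(a, b, q) for b in emb_train] for a in emb_val])
        for C in C_grid:
            clf = SVC(kernel='precomputed', C=C).fit(K_train, y_train)
            acc = np.mean(clf.predict(K_val) == y_val)
            if acc > best[2]:
                best = (q, C, acc)
    return best   # (best q, best C, validation accuracy)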
Results obtained by combining the SVM classifiers with the AdaBoost algorithm are shown in Table 4 for the generative embeddings \tilde{e}, \bar{e} and \hat{e}. These results show that the proposed approach outperforms state-of-the-art methods based on ROI intensity histograms for this dataset; see (Cheng et al., 2009a), (Cheng et al., 2009b), (Ulas et al., 2010), (Ulas et al., 2011).
Figure 1: Mean accuracy on 10 runs as a function of q (best C) and as a function of C (best q) for the SVM classifier learnt on ROI 10 using one Rician mixture per class with K = 4 components and embeddings \tilde{e} ((a), (b)), \bar{e} ((c), (d)) and \hat{e} ((e), (f)). Each panel compares the linear, RBF, Jensen-Shannon, Jensen-Tsallis and weighted Jensen-Tsallis (\tilde{k}_q and k_q) kernels.

Table 2: Mean accuracy for the best values of q and C for the SVM classifiers learnt on ROIs 8, 12 and 14, using one Rician mixture per class with K = 4, 5, 6 components and embeddings \tilde{e}, \bar{e} and \hat{e}.

ROI                        |         8         |        12         |        14
No. of components          |   4     5     6   |   4     5     6   |   4     5     6
Embedding \tilde{e}
  Linear                   | 62.58 60.32 59.52 | 58.39 60.65 59.35 | 55.32 55    55.48
  RBF                      | 65.48 65.32 64.03 | 65.97 65.32 63.71 | 61.94 62.74 61.13
  Jensen-Shannon           | 65.32 65    64.84 | 64.35 64.84 64.52 | 62.42 61.45 60.16
  Jensen-Tsallis           | 66.45 66.13 65.65 | 66.13 66.94 64.68 | 62.58 62.1  61.45
  Weighted JT \tilde{k}_q  | 67.26 66.77 65.65 | 66.13 66.29 65    | 62.74 61.94 61.45
  Weighted JT k_q          | 66.45 65.65 65.65 | 66.13 66.94 64.68 | 62.58 62.1  61.45
Embedding \bar{e}
  Linear                   | 59.35 60.16 59.19 | 58.23 59.03 57.26 | 55    54.84 54.84
  RBF                      | 63.71 64.68 63.23 | 62.42 62.9  62.9  | 62.1  63.55 63.06
  Jensen-Shannon           | 63.71 64.68 63.23 | 60.65 61.94 62.1  | 66.61 65.98 65.32
  Jensen-Tsallis           | 64.68 64.84 64.68 | 62.58 64.84 63.71 | 67.9  66.61 66.29
  Weighted JT \tilde{k}_q  | 65.16 64.19 63.23 | 63.87 64.19 62.9  | 65.48 64.84 63.87
  Weighted JT k_q          | 64.84 64.03 64.03 | 64.03 63.87 63.23 | 65    64.19 63.71
Embedding \hat{e}
  Linear                   | 59.19 60.48 58.87 | 60.48 60.16 60    | 55.65 55.65 56.13
  RBF                      | 64.03 63.87 63.06 | 64.03 64.52 62.74 | 63.23 63.55 63.06
  Jensen-Shannon           | 63.39 64.84 63.71 | 60.97 62.74 62.26 | 66.61 66.13 64.35
  Jensen-Tsallis           | 64.68 64.84 64.03 | 62.74 62.74 63.87 | 68.06 67.1  65.48
  Weighted JT \tilde{k}_q  | 64.84 64.03 63.39 | 64.35 63.55 63.39 | 65.48 65    63.87
  Weighted JT k_q          | 64.52 64.35 63.87 | 64.35 65.65 63.06 | 64.68 64.68 63.23

Table 3: Mean accuracy for the best values of q and C for the SVM classifier learnt on ROI 10 using one Rician mixture per class with K = 4, 5, 6 components and embeddings \tilde{e}, \bar{e} and \hat{e}.

ROI 10, No. of components  |   4     5     6
Embedding \tilde{e}
  Linear                   | 58.39 58.23 57.42
  RBF                      | 66.13 67.26 67.42
  Jensen-Shannon           | 69.68 68.71 68.06
  Jensen-Tsallis           | 71.13 70.32 68.87
  Weighted JT \tilde{k}_q  | 70.65 70.97 69.19
  Weighted JT k_q          | 71.13 70.32 68.87
Embedding \bar{e}
  Linear                   | 56.29 56.13 55.81
  RBF                      | 65.65 67.42 67.26
  Jensen-Shannon           | 68.06 68.55 69.68
  Jensen-Tsallis           | 69.03 69.68 70.48
  Weighted JT \tilde{k}_q  | 67.1  67.58 68.39
  Weighted JT k_q          | 67.26 67.26 69.19
Embedding \hat{e}
  Linear                   | 56.94 57.1  57.9
  RBF                      | 67.9  66.94 67.42
  Jensen-Shannon           | 68.55 68.39 69.52
  Jensen-Tsallis           | 69.84 70    70.48
  Weighted JT \tilde{k}_q  | 66.94 67.26 68.55
  Weighted JT k_q          | 67.9  67.26 69.03

Table 4: Mean accuracy for the best values of q and C for the set of SVM classifiers obtained by the boosting algorithm, using one Rician mixture per class with K = 4, 5, 6 components and embeddings \tilde{e}, \bar{e} and \hat{e}. Results of state-of-the-art methods based on ROI intensity histograms, using leave-one-out, are also reported.

Boosting, No. of components |   4     5     6
Embedding \tilde{e}
  Jensen-Shannon            | 78.55 78.23 77.74
  Jensen-Tsallis            | 79.68 80.16 79.03
  Weighted JT \tilde{k}_q   | 80    79.03 78.39
  Weighted JT k_q           | 79.68 80.16 79.03
Embedding \bar{e}
  Jensen-Shannon            | 75    75.97 77.42
  Jensen-Tsallis            | 78.71 78.06 79.84
  Weighted JT \tilde{k}_q   | 78.23 78.06 77.58
  Weighted JT k_q           | 78.71 78.39 78.55
Embedding \hat{e}
  Jensen-Shannon            | 77.90 76.94 76.61
  Jensen-Tsallis            | 79.35 78.39 78.39
  Weighted JT \tilde{k}_q   | 81.77 78.39 78.06
  Weighted JT k_q           | 80.48 77.90 78.39

State-of-the-art methods                                              | Accuracy
SVM, best single ROI:
  (Cheng et al., 2009a)                                               | 73.4
  Dissimilarity representations (Ulas et al., 2011)                   | 78.07
SVM, multiple ROIs:
  Constellation probab. model + Fisher kernel (Cheng et al., 2009b)   | 80.65
  Combined dissimilarity representations (Ulas et al., 2010)          | 79
  Dissimilarity representations (Ulas et al., 2011)                   | 76.32

7 CONCLUSIONS

In this paper, we have proposed a new approach for building generative embeddings for kernel-based classification of magnetic resonance images (MRI), by exploiting the Rician distribution that characterizes MR images. Using generative embeddings, the images to be classified are mapped onto a Hilbert space, where kernel-based techniques can be used. Concerning the choice of kernel, we have adopted the recently proposed nonextensive information theoretic kernels. The proposed approach was tested on a challenging classification task: classifying subjects as suffering, or not, from schizophrenia on the basis of a set of regions of interest (ROIs) in each image. To this end, an SVM classifier is learnt for each ROI. Finally, we propose to combine the SVM classifiers via
a boosting algorithm. The experimental results show
that the proposed methodology outperforms the pre-
vious state-of-the-art methods on the same dataset.
REFERENCES
Abramowitz, M. and Stegun, I. (1972). Handbook of Math-
ematical Functions. Dover, New York.
Bosch, A., Zisserman, A., and Munoz, X. (2006). Scene classification via pLSA. In Proc. of ECCV.
Burbea, J. and Rao, C. (1982). On the convexity of some di-
vergence measures based on entropy functions. IEEE
Trans. on Information Theory, 28(3):489–495.
Cheng, D., Bicego, M., Castellani, U., Cerutti, S., Bel-
lani, M., Rambaldelli, G., Atzori, M., Brambilla, P.,
and Murino, V. (2009a). Schizophrenia classification
using regions of interest in brain MRI. In IDAMAP
Workshop.
Cheng, D., Bicego, M., Castellani, U., Cristani, M., Cerruti,
S., Bellani, M., Rambaldelli, G., Aztori, M., Bram-
billa, P., and Murino, V. (2009b). A hybrid gener-
ative/discriminative method for classification of re-
gions of interest in schizophrenia brain MRI. In MIC-
CAI09 Workshop on Probabilistic Models for Medical
Image Analysis.
Cristianini, N. and Shawe-Taylor, J. (2000). An introduction
to Support Vector Machines and other kernel-based
learning methods. Cambridge University Press.
Dempster, A., Laird, N., and Rubin, D. (1977). Maxi-
mum likelihood from incomplete data via the EM al-
gorithm. Jour. of the Royal Statistical Soc. (B), 39:1–38.
Figueiredo, M. and Jain, A. K. (2002). Unsupervised learn-
ing of finite mixture models. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 24:381–396.
Freund, Y. and Schapire, R. (1997). A decision-theoretic
generalization of online learning and an application to
boosting. Jour. Comp. System Sciences, 55:119–139.
Gudbjartsson, H. and Patz, S. (1994). The Rician distribution
of noisy MRI data. Magnetic Resonance in Medicine,
34:910–914.
Jaakkola, T. and Haussler, D. (1999). Exploiting generative
models in discriminative classifiers. In Neural Infor-
mation Processing Systems – NIPS.
Lasserre, J., Bishop, C., and Minka, T. (2006). Principled
hybrids of generative and discriminative models. In
Proc. Conf. Computer Vision and Patt. Rec. – CVPR.
Lin, J. (1991). Divergence measures based on Shannon en-
tropy. IEEE Trans. Information Theory, 37:145–151.
Martins, A. F., Smith, N. A., Aguiar, P. M., and Figueiredo,
M. A. T. (2009). Nonextensive information theoretic
kernels on measures. Journal of Machine Learning
Research, 10:935–975.
Ng, A. and Jordan, M. (2002). On discriminative vs gener-
ative classifiers: A comparison of logistic regression
and naive Bayes. In Neural Information Processing
Systems – NIPS.
Perina, A., Cristani, M., Castellani, U., Murino, V., and
Jojic, N. (2009). A hybrid generative/discriminative
classification framework based on free-energy terms.
In Proc. Int. Conf. Computer Vision – ICCV, Kyoto.
Rice, S. O. (1944). Mathematical analysis of random noise.
Bell Systems Tech. J., 23:282–332.
Rubinstein, Y. and Hastie, T. (1997). Discriminative vs in-
formative learning. In Proc. 3rd Int. Conf. Knowledge
Discovery and Data Mining, Newport Beach.
Schölkopf, B. and Smola, A. J. (2002). Learning with Ker-
nels. MIT Press.
Suyari, H. (2004). Generalization of Shannon-Khinchin ax-
ioms to nonextensive systems and the uniqueness the-
orem for the nonextensive entropy. IEEE Trans. on
Information Theory, 50(8):1783–1787.
Ulas, A., Duin, R., Castellani, U., Loog, M., Bicego, M.,
Murino, V., Bellani, M., Cerruti, S., Tansella, M., and
Brambilla, P. (2010). Dissimilarity-based detection of
schizophrenia. In ICPR 2010 workshop on Pattern
Recognition Challenges in fMRI Neuroimaging.
Ulas, A., Duin, R., Castellani, U., Loog, M., Mirtuono,
P., Bicego, M., Murino, V., Bellani, M., Cerruti, S.,
Tansella, M., and Brambilla, P. (2011). Dissimilarity-
based detection of schizophrenia. Int. Journal of
Imaging Systems and Technology.
APPENDIX
Proof of Proposition 2.1
Proof. First of all, note that f(y_j; \theta_i) can be written in factorized form as

f(y_j; \theta_i) = A(y_j; \theta_i) \cdot B(y_j; \theta_i),   (23)

where

A(y_j; \theta_i) = \frac{y_j}{\sigma_i^2}\, e^{-\frac{y_j^2 + v_i^2}{2\sigma_i^2}}   (24)

and

B(y_j; \theta_i) = I_0\!\left(\frac{y_j\, v_i}{\sigma_i^2}\right).   (25)
It follows that the partial derivatives of the loglikelihood with respect to v_i and \sigma_i^2 are

\frac{\partial \log f(y_j; \theta_i)}{\partial v_i} = \frac{1}{f(y_j; \theta_i)} \cdot \frac{\partial f(y_j; \theta_i)}{\partial v_i} = \frac{1}{A \cdot B} \left( \frac{\partial A}{\partial v_i} \cdot B + A \cdot \frac{\partial B}{\partial v_i} \right) = \frac{1}{A} \cdot \frac{\partial A}{\partial v_i} + \frac{1}{B} \cdot \frac{\partial B}{\partial v_i},   (26)

\frac{\partial \log f(y_j; \theta_i)}{\partial \sigma_i^2} = \frac{1}{A} \cdot \frac{\partial A}{\partial \sigma_i^2} + \frac{1}{B} \cdot \frac{\partial B}{\partial \sigma_i^2}.   (27)
The partial derivative of A(y_j; \theta_i) with respect to v_i is

\frac{\partial A(y_j; \theta_i)}{\partial v_i} = -\frac{y_j}{\sigma_i^2}\, e^{-\frac{y_j^2 + v_i^2}{2\sigma_i^2}} \cdot \frac{1}{2\sigma_i^2} \cdot 2 v_i.   (28)
Moreover, recall that the higher order modified Bessel functions I_n(z), defined by the contour integral

I_n(z) = \frac{1}{2\pi i} \oint e^{\left(\frac{z}{2}\right)\left(t + \frac{1}{t}\right)}\, t^{-n-1}\, dt,   (29)

where the contour encloses the origin and is traversed in a counterclockwise direction, can be expressed in terms of I_0(z) through the derivative identity (Abramowitz and Stegun, 1972)

I_n(z) = T_n\!\left(\frac{d}{dz}\right) I_0(z),   (30)

where T_n(z) is a Chebyshev polynomial of the first kind (Abramowitz and Stegun, 1972),

T_n(z) = \frac{1}{4\pi i} \oint \frac{(1 - t^2)\, t^{-n-1}}{1 - 2tz + t^2}\, dt,   (31)

with the contour enclosing the origin and traversed in a counterclockwise direction. In particular, since T_1(z) = z, the partial derivative of B results in

\frac{\partial B(y_j; \theta_i)}{\partial v_i} = \frac{\partial I_0\!\left(\frac{y_j v_i}{\sigma_i^2}\right)}{\partial v_i} = I_1\!\left(\frac{y_j v_i}{\sigma_i^2}\right) \cdot \frac{y_j}{\sigma_i^2}.   (32)
Substituting (28) and (32) in (26), we get

\frac{\partial \log f(y_j; \theta_i)}{\partial v_i} = -\frac{v_i}{\sigma_i^2} + \frac{I_1\!\left(\frac{y_j v_i}{\sigma_i^2}\right)}{I_0\!\left(\frac{y_j v_i}{\sigma_i^2}\right)} \cdot \frac{y_j}{\sigma_i^2},   (33)

which, substituted in (10), yields (11).
The same considerations hold for the partial derivatives with respect to \sigma_i^2, yielding the following expressions for the partial derivatives of A and B (with respect to \sigma_i^2):

\frac{\partial A(y_j; \theta_i)}{\partial \sigma_i^2} = -\frac{y_j}{\sigma_i^4}\, e^{-\frac{y_j^2 + v_i^2}{2\sigma_i^2}} + \frac{y_j}{\sigma_i^2}\, e^{-\frac{y_j^2 + v_i^2}{2\sigma_i^2}}\, \frac{y_j^2 + v_i^2}{2\sigma_i^4},   (34)

\frac{\partial B(y_j; \theta_i)}{\partial \sigma_i^2} = -I_1\!\left(\frac{y_j v_i}{\sigma_i^2}\right) \cdot \frac{y_j\, v_i}{\sigma_i^4}.   (35)
Substituting (34) and (35) in (27), the partial derivative of \log f(y_j; \theta_i) with respect to \sigma_i^2 results in

\frac{\partial \log f(y_j; \theta_i)}{\partial \sigma_i^2} = \frac{1}{\sigma_i^2} \left( \frac{y_j^2 + v_i^2}{2\sigma_i^2} - 1 \right) - \frac{I_1\!\left(\frac{y_j v_i}{\sigma_i^2}\right)}{I_0\!\left(\frac{y_j v_i}{\sigma_i^2}\right)} \cdot \frac{y_j\, v_i}{\sigma_i^4},   (36)

which, plugged into (10), yields (12).
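As a quick numerical sanity check of (33) (not part of the original paper), the analytic derivative can be compared against a central finite difference of the Rician log-density; the snippet restates that log-density so it runs on its own, and naming is ours.

import numpy as np
from scipy.special import i0e, i1e

def rician_logpdf(y, v, sigma2):
    u = y * v / sigma2
    return np.log(y) - np.log(sigma2) - (y**2 + v**2) / (2.0 * sigma2) + np.log(i0e(u)) + u

def dlogf_dv(y, v, sigma2):
    """Right-hand side of (33): -v/sigma^2 + (I_1/I_0)(y v / sigma^2) * y / sigma^2."""
    u = y * v / sigma2
    return -v / sigma2 + (i1e(u) / i0e(u)) * y / sigma2

y, v, sigma2, h = 1.3, 0.8, 0.4, 1e-6
numeric = (rician_logpdf(y, v + h, sigma2) - rician_logpdf(y, v - h, sigma2)) / (2 * h)
print(np.allclose(dlogf_dv(y, v, sigma2), numeric, atol=1e-5))    # expected: True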