classification. Observe that with a Gaussian kernel
we could achieve similar classification performance
in this particular problem (assuming some tuning of
the parameters is done); however, the exponential ker-
nel hides the simplicity of the solution found by our
algorithm (a quadratic mapping).
2 PREVIOUS WORK
Since the introduction of kernel machines in the 1990s, there has been a growing literature on metric and kernel learning. It is beyond the scope of this paper to review all previous work along these lines, but a good review on metric learning can be found in (Yang, 2006), and for kernel selection methods see (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004).
A common approach to learning a kernel matrix uses semi-definite programming (SDP) (Lanckriet et al., 2004; Cristianini et al., 2001) to maximize some form of alignment with respect to an ideal kernel. A major limitation of SDP is its computational complexity, since it scales as $O(n^6)$, where $n$ is the number of samples (Boyd and Vandenberghe, 2004). This limitation has restricted its application to small-scale problems.
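As a concrete illustration of the alignment criterion used in this line of work, the sketch below computes the empirical alignment between a Gram matrix $\mathbf{K}$ and the ideal kernel $\mathbf{y}\mathbf{y}^T$ for labels $\mathbf{y} \in \{-1,+1\}^n$; the function name, data and kernel choice are ours, not taken from the cited papers.

```python
import numpy as np

def kernel_alignment(K, y):
    """Empirical alignment <K, yy^T>_F / (||K||_F ||yy^T||_F)."""
    Y = np.outer(y, y)                 # ideal kernel built from the labels
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

# toy example: linear kernel on random data with +/-1 labels
X = np.random.randn(20, 5)
y = np.where(np.random.rand(20) > 0.5, 1.0, -1.0)
K = X @ X.T
print(kernel_alignment(K, y))
```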
Recently, (Kim et al., 2006) posed the kernel selection for kernel linear discriminant analysis as a convex optimization problem. To optimize over a positive combination of known kernels, they use interior-point methods with a computational cost of $O(d^3 + n^3)$, where $d$ is the dimension of the samples.
Along these lines, (Weinberger et al., 2006) learn a Mahalanobis distance metric in the kNN classification setting by SDP. The learned distance metric enforces that the k nearest neighbors of each sample belong to the same class, while examples from different classes are separated by a large margin.
In the metric learning literature, Goldberger et al. (Goldberger et al., 2004) have proposed Neighbourhood Components Analysis, which computes the Mahalanobis distance that minimizes an approximation of the classification error. Similarly, (Shental et al., 2002) optimize the linear discriminant analysis (LDA) criterion in a semi-supervised manner to learn the metric.
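These methods all learn the matrix of a Mahalanobis distance; purely as a point of reference, a minimal sketch of evaluating such a distance for a given positive semi-definite matrix (all names and values here are illustrative) could be:

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    diff = x - y
    return diff @ M @ diff

# parameterizing M = L^T L keeps it positive semi-definite while learning L
L = np.random.randn(5, 5)
M = L.T @ L
x, y = np.random.randn(5), np.random.randn(5)
print(mahalanobis_sq(x, y, M))
```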
In previous work, a parameterized family of linear or non-linear kernels (e.g. Gaussian, polynomial) is typically chosen and the kernel parameters are tuned with some form of cross-validation.
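To make that standard practice concrete, the sketch below tunes the width and regularization of a Gaussian (RBF) kernel SVM by grid-search cross-validation; it assumes scikit-learn is available, and the data set and parameter grid are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy data standing in for the problem at hand
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# fix a kernel family (Gaussian) and cross-validate its parameters
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```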
In this paper, we consider the more generic problem of finding a functional mapping of the data.
3 PARAMETERIZING THE
KERNEL
Many visual classification tasks (e.g. object recognition) are highly complex, and non-linear kernels are needed to model changes such as illumination, viewpoint or internal object variability. Learning a non-linear kernel is a relatively difficult problem; for instance, proving that a function is a kernel is a challenging mathematical task. A given function is a kernel if and only if the value it produces for two vectors corresponds to a dot product in some feature Hilbert space. This is the well-known Mercer theorem: "Every positive definite, symmetric function is a kernel. For every kernel $k$, there is a function $\varphi(\mathbf{x})$ such that $k(\mathbf{d}_1,\mathbf{d}_2) = \langle \varphi(\mathbf{d}_1), \varphi(\mathbf{d}_2) \rangle$", where $\langle \cdot,\cdot \rangle$ denotes the dot product. To avoid the problem of proving that a similarity function is a kernel, it is common to parameterize the kernel as a positive combination of existing kernels (e.g. Gaussian, polynomial).
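As a small numerical sanity check of this idea, the sketch below forms a Gram matrix as a positive combination of a Gaussian and a polynomial kernel and verifies that it remains positive semi-definite (positive combinations of valid kernels are again valid kernels); the kernel parameters and mixing weights are arbitrary placeholders.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def poly_gram(X, degree=2, c=1.0):
    """Gram matrix of the polynomial kernel (x_i^T x_j + c)^degree."""
    return (X @ X.T + c) ** degree

X = np.random.randn(30, 4)
K = 0.7 * gaussian_gram(X) + 0.3 * poly_gram(X)   # positive combination of kernels

# all eigenvalues should be non-negative up to numerical tolerance
print(np.linalg.eigvalsh(K).min() >= -1e-8)
```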
In this paper, we propose to learn a kernel as a pos-
itive combination of normalized kernels as follows:
$$
\begin{aligned}
\mathbf{T} &= \mathbf{D}^T\mathbf{A}\mathbf{D} \qquad & \hat{\mathbf{T}} &= dm(\mathbf{T})^{-\frac{1}{2}}\,\mathbf{T}\,dm(\mathbf{T})^{-\frac{1}{2}}\\
\mathbf{T}_t &= \mathbf{D}^T\mathbf{A}_t\mathbf{D} \qquad & \hat{\mathbf{T}}_t &= dm(\mathbf{T}_t)^{-\frac{1}{2}}\,\mathbf{T}_t\,dm(\mathbf{T}_t)^{-\frac{1}{2}}\\
K_1(\mathbf{A},\alpha) &= \sum_{t=0}^{p}\alpha_t\,\hat{\mathbf{T}}^t \qquad & K_2(\mathbf{A}_1,\cdots,\mathbf{A}_p,\alpha) &= \sum_{t=0}^{p}\alpha_t\,\hat{\mathbf{T}}_t^t
\end{aligned}
\tag{1}
$$
where $\alpha_t \geq 0\ \forall t$, the columns of $\mathbf{D} \in \Re^{d \times n}$ (see notation¹) contain the original data points, $d$ denotes the dimension of the data, $n$ the number of samples and $p$ the degree of the polynomial. Each element $ij$ of the matrix $\mathbf{T}$, $t_{ij} = \mathbf{d}_i^T\mathbf{A}\mathbf{d}_j$, contains the weighted dot product between samples $i$ and $j$. Each element $ij$ of the matrix $\hat{\mathbf{T}}$ represents the cosine of the angle between samples $i$ and $j$ (i.e. $\hat{t}_{ij} = \frac{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j}{\sqrt{\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j\,\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}}$). $\hat{\mathbf{T}}^k$ exponentiates each of the entries in $\hat{\mathbf{T}}$. $K_1$ is a positive combination of $\hat{\mathbf{T}}^k$, and if $\mathbf{A}$ is positive definite, $K_1$ will be a valid kernel because of the closure
¹ Bold capital letters denote a matrix $\mathbf{D}$, bold lower-case letters a column vector $\mathbf{d}$. $\mathbf{d}_j$ represents the $j$-th column of the matrix $\mathbf{D}$. $d_{ij}$ denotes the scalar in row $i$ and column $j$ of the matrix $\mathbf{D}$, and the $i$-th element of a column vector $\mathbf{d}_j$. All non-bold letters represent scalar variables. $diag$ is an operator that transforms a vector into a diagonal matrix or takes the diagonal of a matrix into a vector. $dm(\mathbf{A})$ is a matrix that contains just the diagonal elements of $\mathbf{A}$. $\circ$ denotes the Hadamard or point-wise product. $\mathbf{1}_k \in \Re^{k \times 1}$ is a vector of ones. $\mathbf{I}_k \in \Re^{k \times k}$ is the identity matrix. $tr(\mathbf{A}) = \sum_i a_{ii}$ is the trace of the matrix $\mathbf{A}$ and $|\mathbf{A}|$ denotes its determinant. $||\mathbf{A}||_F^2 = tr(\mathbf{A}^T\mathbf{A})$ designates the squared Frobenius norm of a matrix. $\mathbf{A}^k$ denotes the point-wise power, i.e. $a_{ij}^k\ \forall i, j$.
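Purely as an illustration of Eq. (1), the sketch below evaluates $K_1(\mathbf{A}, \alpha)$ for given $\mathbf{D}$, $\mathbf{A}$ and $\alpha$, with element-wise powers following the notation above; the variable names and the toy values of $\mathbf{A}$ and $\alpha$ are ours, not prescribed by the paper.

```python
import numpy as np

def k1_kernel(D, A, alpha):
    """K_1(A, alpha) = sum_t alpha_t * That**t, where That is the
    cosine-normalized weighted Gram matrix and ** is element-wise power."""
    T = D.T @ A @ D                          # t_ij = d_i^T A d_j
    s = 1.0 / np.sqrt(np.diag(T))            # dm(T)^(-1/2)
    That = T * np.outer(s, s)                # that_ij = t_ij / sqrt(t_ii t_jj)
    return sum(a * That**t for t, a in enumerate(alpha))

# toy example: d = 3, n = 5, polynomial degree p = 2 (alpha has p + 1 entries)
D = np.random.randn(3, 5)
A = np.eye(3)                                # any positive definite matrix
alpha = np.array([0.1, 0.5, 0.4])            # alpha_t >= 0
K = k1_kernel(D, A, alpha)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-8)
```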