3 A CONCEPTUAL STUDY
We now study the problem of model selection based
on the criterion proposed by Vapnik (Vapnik, 1999;
Hastie et al., 2001) under the name of Structural
Risk Minimization (SRM). Here the objective is
to minimize the expected value of a loss function
L(y, f(x)) that states how much penalty is assigned
when our class estimate f(x) differs from the true class y.
A typical loss function is the zero-one loss
L(y, f(x)) = I(y ≠ f(x)), where I(·) is an indicator
function. We define the risk in adopting a family of
models parameterized by θ as the expected loss:
R(θ) = E[L(y, f(x|θ))]    (3)
which cannot be estimated precisely because y(x) is
unknown. One can compute instead the empirical
risk:
R̂(θ) = (1/N) ∑_{i=1}^{N} L(y_i, f(x_i|θ))    (4)
where N is the size of the training set.
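For concreteness, the following Python sketch (our own illustration; the function name and the NumPy dependency are not from the paper) computes this quantity under the zero-one loss:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """Empirical risk of Eq. (4) under the zero-one loss L(y, f(x)) = I(y != f(x)):
    the fraction of the N training examples that the model misclassifies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Example: one error out of four predictions gives an empirical risk of 0.25.
print(empirical_risk([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
```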
Using a measure of model-family complexity h, known as the VC-dimension (Vapnik, 1999), the idea is now to provide an upper bound on the true risk using the empirical risk plus a function that penalizes complex models through the VC dimension:
SRM:    R(θ) ≤ R̂(θ) + √{ [h(ln(2N/h) + 1) − ln(η/4)] / N }    (5)
where the inequality holds with probability at least 1 − η over the choice of the training set. The goal is to find the family of models that minimizes the right-hand side of inequality 5.
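To make the trade-off in inequality 5 tangible, the sketch below (again our own illustration, with assumed values for η and N) evaluates the penalty term and the resulting bound:

```python
import math

def vc_penalty(h, N, eta=0.05):
    """Penalty term of inequality (5): sqrt((h(ln(2N/h) + 1) - ln(eta/4)) / N),
    valid with probability at least 1 - eta over the choice of training set."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

def srm_bound(emp_risk, h, N, eta=0.05):
    """Upper bound on the true risk R(theta): empirical risk plus VC penalty."""
    return emp_risk + vc_penalty(h, N, eta)

# Example: same empirical risk, growing VC-dimension -> growing bound.
for h in (5, 50, 500):
    print(h, round(srm_bound(0.10, h, N=1000), 3))
```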
The ideas just described are typical of model selection techniques. Since our training set T comprises a limited number of examples and we do not know the form of the true target distribution, the problem we face is referred to as the bias-variance dilemma in statistical inference (Geman et al., 1992; Hastie et al., 2001). Specifically, simple classifiers exhibit limited flexibility in their decision boundaries; their small repertoire of functions produces high bias (since the best approximating function may lie far from the target function) but low variance (since there is little dependence on local irregularities in the data). In such cases, it is common to see high values for the empirical risk but low values for the penalty term. On the other hand, complex models encompass a large class of approximating functions; they exhibit flexible decision boundaries (low bias) but are sensitive to small variations in the data (high variance). Here, in contrast, we commonly find low values for the empirical risk but high values for the penalty term. The goal in SRM is to minimize the right-hand side of inequality 5 by finding a balance between empirical error and model complexity, where the VC dimension h becomes a controllable variable.
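This balancing act can be made concrete with a small selection loop; reusing the srm_bound sketch above, we score a few hypothetical model families, each summarized by an invented (empirical risk, VC-dimension) pair, and keep the one with the smallest bound:

```python
# Hypothetical candidates: family name -> (empirical risk, VC-dimension h).
# A simple family has high empirical risk but a small penalty; a complex
# family the reverse; SRM picks whichever bound in (5) is smallest.
candidates = {
    "simple":  (0.20, 5),
    "medium":  (0.10, 50),
    "complex": (0.02, 500),
}
N = 1000
best = min(candidates, key=lambda name: srm_bound(*candidates[name], N))
print(best, round(srm_bound(*candidates[best], N), 3))
```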
3.1 Multiple Models vs One Model
We now provide an analysis of the conditions under which combining multiple local models is expected to be beneficial. In essence, we wish to compare a composite model M_c to a basic global model M_b, where M_c is the combination of multiple models. We assume M_b has VC-dimension h_b and M_c has VC-dimension h_c, which comes from the combination of k models, each of VC-dimension at most h, where we assume h < h_b.
The question we address is the following: how many models of VC-dimension at most h can M_c comprise and still improve on generalization accuracy over M_b, assuming both models have the same empirical error? The question refers to the maximum value of k that still gives M_c an advantage over M_b. To proceed, we look at the VC-dimension h_c, which in essence is the VC-dimension of k-fold unions or intersections. It is an open problem to determine the exact VC-dimension of a family of k-fold unions (Reyzin, 2006; Blumer et al., 1989; Eisenstat and Angluin, 2007); recent work, however, shows that such a family of models has a lower bound of (8/5)kh and an upper bound of 2kh log_2(3k) (it has been shown that O(hk log_2 k) is a tight bound (Eisenstat and Angluin, 2007)).
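To get a feel for the gap between these two bounds, the short sketch below (our own illustration, with an assumed per-model VC-dimension) evaluates both for a few values of k:

```python
import math

h = 5  # assumed per-model VC-dimension, for illustration only
for k in (2, 4, 8, 16):
    lower = (8 / 5) * k * h               # lower bound (8/5)kh
    upper = 2 * k * h * math.log2(3 * k)  # upper bound 2kh*log2(3k)
    print(k, lower, round(upper, 1))
```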
We begin our study with the optimistic lower bound, and assume the VC-dimension h_c to be (8/5)kh. To solve the question above, we equate the right-hand side of inequality 5 for both M_c and M_b:
√{ [(8/5)kh (ln(2N/((8/5)kh)) + 1) − ln(η/4)] / N } = √{ [h_b (ln(2N/h_b) + 1) − ln(η/4)] / N }    (6)
where our goal is now simply to solve for k. Squaring both sides, multiplying by N, and cancelling the common −ln(η/4) term leaves (8/5)kh (ln 2N + 1 − ln((8/5)kh)) = h_b (ln(2N/h_b) + 1); dividing through by (8/5)h and expanding ln((8/5)kh) as ln((8/5)h) + ln k, we get the following:

c_1 k − k ln k = c_2    (7)
where c_1 and c_2 are constants:

c_1 = ln 2N + 1 − ln((8/5)h)    (8)

c_2 = (5h_b/(8h)) (ln(2N/h_b) + 1)    (9)
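Equation 7 has no closed-form solution in k, but the maximum k is easy to find numerically; the sketch below (with illustrative values of N, h, and h_b that are not taken from the paper) brackets the root of c_1 k − k ln k − c_2 and bisects:

```python
import math

# Illustrative (assumed) values: sample size and VC-dimensions with h < h_b.
N, h, h_b = 10_000, 5, 50

c1 = math.log(2 * N) + 1 - math.log(8 * h / 5)          # Eq. (8)
c2 = (5 * h_b / (8 * h)) * (math.log(2 * N / h_b) + 1)  # Eq. (9)

def f(k):
    # f(k) < 0 means the composite bound is still below the basic one.
    return c1 * k - k * math.log(k) - c2

# Bracket the root (assumes f(1) < 0, i.e. c2 > c1), then bisect.
lo, hi = 1.0, 2.0
while f(hi) < 0:
    hi *= 2
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)

print(f"maximum k ~ {lo:.2f}")  # largest k with no loss in the bound
```

For these particular illustrative values the break-even point falls near k ≈ 6; beyond that, the composite bound overtakes the basic one.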