$= \{w_0, w_j, \boldsymbol{w}_j,\ j = 1, \cdots, J\}$, $z_j \equiv g(\boldsymbol{w}_j^{\mathrm{T}} \boldsymbol{x})$, and $g(h)$ is an activation function. Given data $\{(\boldsymbol{x}^{\mu}, y^{\mu}),\ \mu = 1, \cdots, N\}$, we try to find the MLP($J$) that minimizes an error function. We also consider MLP($J{-}1$) with $\theta_{J-1} = \{u_0, u_j, \boldsymbol{u}_j,\ j = 2, \cdots, J\}$. The output is $f_{J-1}(\boldsymbol{x}; \theta_{J-1}) = u_0 + \sum_{j=2}^{J} u_j v_j$, where $v_j \equiv g(\boldsymbol{u}_j^{\mathrm{T}} \boldsymbol{x})$.
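As a concrete reference for the notation above, the following sketch computes the MLP($J$) output $f_J(\boldsymbol{x}; \theta_J) = w_0 + \sum_{j=1}^{J} w_j z_j$; the choice of $\tanh$ as $g$ and the convention of a leading bias component in $\boldsymbol{x}$ are illustrative assumptions, not specifics taken from the paper.

```python
import numpy as np

def mlp_output(x, w0, w, W, g=np.tanh):
    """Output of MLP(J): f_J(x; theta) = w0 + sum_j w[j] * g(W[j] @ x).

    x  : bias-augmented input, x = [1, x_1, ..., x_K]
    w0 : output-layer bias
    w  : (J,) output weights w_j
    W  : (J, K+1) hidden weight vectors; row j is w_j, W[j, 0] the hidden bias
    """
    z = g(W @ x)        # hidden-unit outputs z_j = g(w_j^T x)
    return w0 + w @ z
```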
Now consider the following reducibility mappings $\alpha$, $\beta$, and $\gamma$. Then apply $\alpha$, $\beta$, and $\gamma$ to the optimum $\widehat{\theta}_{J-1}$ to get the regions $\widehat{\Theta}^{\alpha}_{J}$, $\widehat{\Theta}^{\beta}_{J}$, and $\widehat{\Theta}^{\gamma}_{J}$ respectively:
$$\widehat{\theta}_{J-1} \overset{\alpha}{\longrightarrow} \widehat{\Theta}^{\alpha}_{J}, \qquad \widehat{\theta}_{J-1} \overset{\beta}{\longrightarrow} \widehat{\Theta}^{\beta}_{J}, \qquad \widehat{\theta}_{J-1} \overset{\gamma}{\longrightarrow} \widehat{\Theta}^{\gamma}_{J}$$
$$\widehat{\Theta}^{\alpha}_{J} \equiv \{\theta_J \mid w_0 = \widehat{u}_0,\ w_1 = 0,\ w_j = \widehat{u}_j,\ \boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j,\ j = 2, \cdots, J\}$$
$$\widehat{\Theta}^{\beta}_{J} \equiv \{\theta_J \mid w_0 + w_1 g(w_{10}) = \widehat{u}_0,\ \boldsymbol{w}_1 = [w_{10}, 0, \ldots, 0]^{\mathrm{T}},\ w_j = \widehat{u}_j,\ \boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j,\ j = 2, \ldots, J\}$$
$$\widehat{\Theta}^{\gamma}_{J} \equiv \{\theta_J \mid w_0 = \widehat{u}_0,\ w_1 + w_m = \widehat{u}_m,\ \boldsymbol{w}_1 = \boldsymbol{w}_m = \widehat{\boldsymbol{u}}_m,\ w_j = \widehat{u}_j,\ \boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j,\ j \in \{2, \ldots, J\} \setminus \{m\}\}$$
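To make the three mappings concrete, the sketch below builds one representative point of each region from an assumed optimum $(\widehat{u}_0, \widehat{u}_j, \widehat{\boldsymbol{u}}_j)$. The free quantities, the weight vector for $\alpha$, `w10` and `w1` for $\beta$, and the split ratio `q` for $\gamma$ (only the sum $w_1 + w_m$ is constrained), are arbitrary illustrative choices.

```python
import numpy as np

def embed_alpha(u0, u, U):
    """alpha: new unit 1 gets zero output weight; its weight vector is free (zeros here)."""
    return u0, np.concatenate(([0.0], u)), np.vstack([np.zeros(U.shape[1]), U])

def embed_beta(u0, u, U, w10=0.5, w1=1.0, g=np.tanh):
    """beta: unit 1 sees only the bias input, w_1 = [w10, 0, ..., 0]^T, and its
    constant output is absorbed into the bias: w0 + w1 * g(w10) = u0."""
    row1 = np.zeros(U.shape[1]); row1[0] = w10
    return u0 - w1 * g(w10), np.concatenate(([w1], u)), np.vstack([row1, U])

def embed_gamma(u0, u, U, m=0, q=0.3):
    """gamma: duplicate hidden unit m (0-based index into u), splitting its
    output weight so that w_1 + w_m = u_m."""
    w = np.concatenate(([q * u[m]], u))
    w[m + 1] = (1.0 - q) * u[m]
    return u0, w, np.vstack([U[m], U])
```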
Now two singular regions can be formed. One is $\widehat{\Theta}^{\alpha\beta}_{J}$, the intersection of $\widehat{\Theta}^{\alpha}_{J}$ and $\widehat{\Theta}^{\beta}_{J}$. Its parameters are as follows, where only $w_{10}$ is free: $w_0 = \widehat{u}_0$, $w_1 = 0$, $\boldsymbol{w}_1 = [w_{10}, 0, \cdots, 0]^{\mathrm{T}}$, $w_j = \widehat{u}_j$, $\boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j$, $j = 2, \cdots, J$. The other is $\widehat{\Theta}^{\gamma}_{J}$, which has the restriction $w_1 + w_m = \widehat{u}_m$.
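A quick numerical check, reusing the hypothetical `mlp_output` and `embed_beta` helpers above, illustrates why these regions are singular: every point of $\widehat{\Theta}^{\alpha\beta}_{J}$ realizes exactly the same input-output map as $\widehat{\theta}_{J-1}$, no matter how the free parameter $w_{10}$ is chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
u0, u, U = 0.3, rng.normal(size=3), rng.normal(size=(3, K + 1))
x = np.concatenate(([1.0], rng.uniform(size=K)))      # bias-augmented input

f_small = mlp_output(x, u0, u, U)                     # f_{J-1}(x; theta_hat)
for w10 in (-2.0, 0.0, 3.7):                          # move along the free direction
    # w1 = 0 puts the beta embedding inside the alpha-beta intersection
    w0, w, W = embed_beta(u0, u, U, w10=w10, w1=0.0)
    assert np.isclose(mlp_output(x, w0, w, W), f_small)
```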
SSF starts its search from MLP($J{=}1$) and then gradually increases $J$ one by one up to $J_{\max}$. When starting from a singular region, the method employs eigenvector descent (Satoh and Nakano, 2012), which finds descending directions, and from then on employs BPQ (Saito and Nakano, 1997), a quasi-Newton method. SSF finds excellent solutions of MLP($J$) one after another for $J = 1, \cdots, J_{\max}$. Thus, SSF guarantees that training error decreases monotonically as $J$ gets larger, which is quite preferable for model selection.
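The overall control flow of SSF can be summarized as follows. This is a structural sketch only: `eigenvector_descent` and `bpq` stand in for the cited procedures (Satoh and Nakano, 2012; Saito and Nakano, 1997), whose internals are not reproduced here, and `build_singular_regions` denotes starting points drawn from $\widehat{\Theta}^{\alpha\beta}_{J}$ and $\widehat{\Theta}^{\gamma}_{J}$.

```python
def ssf(data, J_max, train_mlp1, build_singular_regions,
        eigenvector_descent, bpq):
    """Structural sketch of SSF: grow J from 1 to J_max, restarting each
    search from singular regions built out of the previous optimum."""
    best = {1: train_mlp1(data)}                      # ordinary training for MLP(1)
    for J in range(2, J_max + 1):
        candidates = []
        for theta in build_singular_regions(best[J - 1], J):
            theta = eigenvector_descent(theta, data)  # find a descending direction
            candidates.append(bpq(theta, data))       # quasi-Newton refinement
        # training error cannot increase with J: every starting point already
        # reproduces the MLP(J-1) optimum exactly
        best[J] = min(candidates, key=lambda th: th.train_error)
    return best
```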
4 EXPERIMENTS
Experimental Conditions:
We used artificial data since they are easy to control and their true nature is known. The structure of an MLP is defined as follows: the numbers of input, hidden, and output units are $K$, $J$, and $I$ respectively, and both the input and hidden layers have a bias unit. Input values were randomly drawn from the range $[0, 1]$. Artificial data 1 and data 2 were generated using MLP($K{=}5$, $J{=}20$, $I{=}1$) and MLP($K{=}10$, $J{=}20$, $I{=}1$) respectively. Weights between the input and hidden layers were integers randomly selected from the range $[-10, +10]$, whereas weights between the hidden and output layers were integers randomly selected from $[-20, +20]$. Small Gaussian noise with mean zero and standard deviation 0.02 was added to each MLP output. The training data size was set to $N = 800$, and the test data size to 1,000.
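Under the stated settings, artificial data 1 can be reproduced along the following lines; the random seed and the $\tanh$ activation of the teacher MLP are assumptions, not specified above.

```python
import numpy as np

rng = np.random.default_rng(0)                  # assumed seed
K, J = 5, 20                                    # teacher of artificial data 1
W = rng.integers(-10, 11, size=(J, K + 1)).astype(float)  # input-to-hidden (incl. bias)
w = rng.integers(-20, 21, size=J + 1).astype(float)       # hidden-to-output, w[0] = bias

def sample(n):
    """Draw n examples from the teacher MLP with Gaussian output noise."""
    X = rng.uniform(0.0, 1.0, size=(n, K))
    Xb = np.hstack([np.ones((n, 1)), X])        # prepend the bias input
    y = w[0] + np.tanh(Xb @ W.T) @ w[1:] + rng.normal(0.0, 0.02, size=n)
    return X, y

X_train, y_train = sample(800)
X_test, y_test = sample(1000)
```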
WAIC and WBIC were compared with AIC and BIC. The empirical approach needs a sampling method; however, usual MCMC (Markov chain Monte Carlo) methods such as the Metropolis algorithm will not work at all (Neal, 1996) since the MLP search space is quite hard to explore. Thus, we employed the powerful learning methods BPQ and SSF as sampling methods. For AIC and BIC a learning method runs without any regularizer, whereas WAIC and WBIC need a weight decay regularizer whose regularization coefficient $\lambda$ depends on the temperature $T$. The temperature $T$ was set as suggested in (Watanabe, 2010; Watanabe, 2013): $T = 1$ for WAIC and $T = \log(N)$ for WBIC. Accordingly, the regularization coefficient $\lambda$ for WAIC is smaller than that for WBIC. WAIC and WBIC were calculated using a set of weights $\{\boldsymbol{w}_t\}$ approximating a posterior distribution. Test error was calculated using the test data.
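A minimal sketch of how the two criteria are obtained from the sampled weights is given below. It assumes a precomputed array of pointwise log-likelihoods $\log p(y^{\mu} \mid \boldsymbol{x}^{\mu}, \boldsymbol{w}_t)$ and follows a per-example convention; the samples for WAIC come from the posterior at $T = 1$ and those for WBIC from the tempered posterior at $T = \log(N)$.

```python
import numpy as np

def waic(logp):
    """WAIC from an (S, N) array logp[t, i] = log p(y_i | x_i, w_t),
    where {w_t} are S samples from the posterior at T = 1."""
    m = logp.max(axis=0)                             # stabilize the exponentials
    lppd = np.sum(m + np.log(np.mean(np.exp(logp - m), axis=0)))
    p_waic = np.sum(np.var(logp, axis=0, ddof=1))    # functional variance penalty
    return (-lppd + p_waic) / logp.shape[1]

def wbic(logp):
    """WBIC from an (S, N) array of the same form, where {w_t} are sampled
    from the tempered posterior at T = log(N): the posterior average of
    the negative log-likelihood."""
    return np.mean(-np.sum(logp, axis=1))
```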
Our previous experiments have repeatedly shown that BPQ (Saito and Nakano, 1997) finds much better solutions than BP (back propagation) does, mainly because BPQ is a quasi-Newton, i.e., second-order, method. Thus, we employed BPQ as the conventional learning method, running it independently 100 times with different initial weights for each $J$. Moreover, we also employed the newly proposed learning method SSF. For SSF, the maximum number of search routes was set to 100 for each $J$; $J$ was increased from 1 to 24. Each run of a learning method was terminated when the number of sweeps exceeded 10,000 or the step length became smaller than $10^{-16}$.
Experimental Results:
Figures 1 to 6 show the results for artificial data 1. Figure 1 shows the minimum training error obtained by each learning method for each $J$. Although SSF guarantees the monotonic decrease of minimum training error, BPQ does not in general; however, BPQ did show a monotonic decrease for these data. Figure 2 shows the test error for $\widehat{\boldsymbol{w}}$ of the best model obtained by each learning method for each $J$. BPQ with $\lambda = 0$, BPQ with $\lambda$ for WAIC, and BPQ with $\lambda$ for WBIC attained the minimum test error at $J = 20$, 24, and 24 respectively. SSF with $\lambda = 0$, SSF with $\lambda$ for WAIC, and SSF with $\lambda$ for WBIC found the minimum test error at $J = 18$, 19, and 20 respectively.
Figure 3 shows AIC values obtained by each