$= \{w_0, w_j, \boldsymbol{w}_j,\ j = 1, \cdots, J\}$, $z_j \equiv g(\boldsymbol{w}_j^{\mathrm{T}} \boldsymbol{x})$, and $g(h)$ is an activation function. Given data $\{(\boldsymbol{x}^{\mu}, y^{\mu}),\ \mu = 1, \cdots, N\}$, we try to find the MLP($J$) that minimizes an error function. We also consider MLP($J{-}1$) with $\theta_{J-1} = \{u_0, u_j, \boldsymbol{u}_j,\ j = 2, \cdots, J\}$. The output is $f_{J-1}(\boldsymbol{x}; \theta_{J-1}) = u_0 + \sum_{j=2}^{J} u_j v_j$, where $v_j \equiv g(\boldsymbol{u}_j^{\mathrm{T}} \boldsymbol{x})$.
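As a concrete reference for the notation above, the following sketch computes the MLP($J$) output $f_J(\boldsymbol{x}; \theta_J) = w_0 + \sum_{j=1}^{J} w_j z_j$; the choice of $\tanh$ as $g$ and the convention of a leading bias component in $\boldsymbol{x}$ are illustrative assumptions, not specifics taken from the paper.

```python
import numpy as np

def mlp_output(x, w0, w, W, g=np.tanh):
    """Output of MLP(J): f_J(x; theta) = w0 + sum_j w[j] * g(W[j] @ x).

    x  : bias-augmented input, x = [1, x_1, ..., x_K]
    w0 : output-layer bias
    w  : (J,) output weights w_j
    W  : (J, K+1) hidden weight vectors; row j is w_j, W[j, 0] the hidden bias
    """
    z = g(W @ x)        # hidden-unit outputs z_j = g(w_j^T x)
    return w0 + w @ z
```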
Now consider the following reducibility mappings $\alpha$, $\beta$, and $\gamma$. Then apply $\alpha$, $\beta$, and $\gamma$ to the optimum $\widehat{\theta}_{J-1}$ to get the regions $\widehat{\Theta}^{\alpha}_{J}$, $\widehat{\Theta}^{\beta}_{J}$, and $\widehat{\Theta}^{\gamma}_{J}$ respectively:
$$\widehat{\theta}_{J-1} \overset{\alpha}{\longrightarrow} \widehat{\Theta}^{\alpha}_{J}, \qquad \widehat{\theta}_{J-1} \overset{\beta}{\longrightarrow} \widehat{\Theta}^{\beta}_{J}, \qquad \widehat{\theta}_{J-1} \overset{\gamma}{\longrightarrow} \widehat{\Theta}^{\gamma}_{J}$$
$$\widehat{\Theta}^{\alpha}_{J} \equiv \{\theta_J \mid w_0 = \widehat{u}_0,\ w_1 = 0,\ w_j = \widehat{u}_j,\ \boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j,\ j = 2, \cdots, J\}$$
$$\widehat{\Theta}^{\beta}_{J} \equiv \{\theta_J \mid w_0 + w_1 g(w_{10}) = \widehat{u}_0,\ \boldsymbol{w}_1 = [w_{10}, 0, \ldots, 0]^{\mathrm{T}},\ w_j = \widehat{u}_j,\ \boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j,\ j = 2, \ldots, J\}$$
$$\widehat{\Theta}^{\gamma}_{J} \equiv \{\theta_J \mid w_0 = \widehat{u}_0,\ w_1 + w_m = \widehat{u}_m,\ \boldsymbol{w}_1 = \boldsymbol{w}_m = \widehat{\boldsymbol{u}}_m,\ w_j = \widehat{u}_j,\ \boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j,\ j \in \{2, \ldots, J\} \setminus \{m\}\}$$
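To make the three mappings concrete, the sketch below builds one representative point of each region from an assumed optimum $(\widehat{u}_0, \widehat{u}_j, \widehat{\boldsymbol{u}}_j)$. The free quantities, the weight vector for $\alpha$, `w10` and `w1` for $\beta$, and the split ratio `q` for $\gamma$ (only the sum $w_1 + w_m$ is constrained), are arbitrary illustrative choices.

```python
import numpy as np

def embed_alpha(u0, u, U):
    """alpha: new unit 1 gets zero output weight; its weight vector is free (zeros here)."""
    return u0, np.concatenate(([0.0], u)), np.vstack([np.zeros(U.shape[1]), U])

def embed_beta(u0, u, U, w10=0.5, w1=1.0, g=np.tanh):
    """beta: unit 1 sees only the bias input, w_1 = [w10, 0, ..., 0]^T, and its
    constant output is absorbed into the bias: w0 + w1 * g(w10) = u0."""
    row1 = np.zeros(U.shape[1]); row1[0] = w10
    return u0 - w1 * g(w10), np.concatenate(([w1], u)), np.vstack([row1, U])

def embed_gamma(u0, u, U, m=0, q=0.3):
    """gamma: duplicate hidden unit m (0-based index into u), splitting its
    output weight so that w_1 + w_m = u_m."""
    w = np.concatenate(([q * u[m]], u))
    w[m + 1] = (1.0 - q) * u[m]
    return u0, w, np.vstack([U[m], U])
```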
Now two singular regions can be formed. One is $\widehat{\Theta}^{\alpha\beta}_{J}$, the intersection of $\widehat{\Theta}^{\alpha}_{J}$ and $\widehat{\Theta}^{\beta}_{J}$. Its parameters are as follows, where only $w_{10}$ is free: $w_0 = \widehat{u}_0$, $w_1 = 0$, $\boldsymbol{w}_1 = [w_{10}, 0, \cdots, 0]^{\mathrm{T}}$, $w_j = \widehat{u}_j$, $\boldsymbol{w}_j = \widehat{\boldsymbol{u}}_j$, $j = 2, \cdots, J$. The other is $\widehat{\Theta}^{\gamma}_{J}$, which has the restriction $w_1 + w_m = \widehat{u}_m$.
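A quick numerical check, reusing the hypothetical `mlp_output` and `embed_beta` helpers above, illustrates why these regions are singular: every point of $\widehat{\Theta}^{\alpha\beta}_{J}$ realizes exactly the same input-output map as $\widehat{\theta}_{J-1}$, no matter how the free parameter $w_{10}$ is chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
u0, u, U = 0.3, rng.normal(size=3), rng.normal(size=(3, K + 1))
x = np.concatenate(([1.0], rng.uniform(size=K)))      # bias-augmented input

f_small = mlp_output(x, u0, u, U)                     # f_{J-1}(x; theta_hat)
for w10 in (-2.0, 0.0, 3.7):                          # move along the free direction
    # w1 = 0 puts the beta embedding inside the alpha-beta intersection
    w0, w, W = embed_beta(u0, u, U, w10=w10, w1=0.0)
    assert np.isclose(mlp_output(x, w0, w, W), f_small)
```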
SSF starts its search from MLP($J{=}1$) and then gradually increases $J$ one by one up to $J_{\max}$. When starting from a singular region, the method employs eigenvector descent (Satoh and Nakano, 2012), which finds descending directions, and from then on employs BPQ (Saito and Nakano, 1997), a quasi-Newton method. SSF finds excellent solutions of MLP($J$) one after another for $J = 1, \cdots, J_{\max}$. Thus, SSF guarantees that training error decreases monotonically as $J$ gets larger, which is quite preferable for model selection.
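The overall control flow of SSF can be summarized as follows. This is a structural sketch only: `eigenvector_descent` and `bpq` stand in for the cited procedures (Satoh and Nakano, 2012; Saito and Nakano, 1997), whose internals are not reproduced here, and `build_singular_regions` denotes starting points drawn from $\widehat{\Theta}^{\alpha\beta}_{J}$ and $\widehat{\Theta}^{\gamma}_{J}$.

```python
def ssf(data, J_max, train_mlp1, build_singular_regions,
        eigenvector_descent, bpq):
    """Structural sketch of SSF: grow J from 1 to J_max, restarting each
    search from singular regions built out of the previous optimum."""
    best = {1: train_mlp1(data)}                      # ordinary training for MLP(1)
    for J in range(2, J_max + 1):
        candidates = []
        for theta in build_singular_regions(best[J - 1], J):
            theta = eigenvector_descent(theta, data)  # find a descending direction
            candidates.append(bpq(theta, data))       # quasi-Newton refinement
        # training error cannot increase with J: every starting point already
        # reproduces the MLP(J-1) optimum exactly
        best[J] = min(candidates, key=lambda th: th.train_error)
    return best
```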
4 EXPERIMENTS
Experimental Conditions:
We used artificial data since they are easy to control and their true nature is known. The structure of an MLP is defined as follows: the numbers of input, hidden, and output units are $K$, $J$, and $I$ respectively, and both the input and hidden layers have a bias unit. Input values were randomly drawn from the range $[0, 1]$. Artificial data 1 and data 2 were generated using MLP($K{=}5$, $J{=}20$, $I{=}1$) and MLP($K{=}10$, $J{=}20$, $I{=}1$) respectively. Weights between the input and hidden layers were integers randomly selected from the range $[-10, +10]$, whereas weights between the hidden and output layers were integers randomly selected from $[-20, +20]$. Small Gaussian noise with mean zero and standard deviation 0.02 was added to each MLP output. The training data size was set to $N = 800$, and the test data size to 1,000.
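Under the stated settings, artificial data 1 can be reproduced along the following lines; the random seed and the $\tanh$ activation of the teacher MLP are assumptions, not specified above.

```python
import numpy as np

rng = np.random.default_rng(0)                  # assumed seed
K, J = 5, 20                                    # teacher of artificial data 1
W = rng.integers(-10, 11, size=(J, K + 1)).astype(float)  # input-to-hidden (incl. bias)
w = rng.integers(-20, 21, size=J + 1).astype(float)       # hidden-to-output, w[0] = bias

def sample(n):
    """Draw n examples from the teacher MLP with Gaussian output noise."""
    X = rng.uniform(0.0, 1.0, size=(n, K))
    Xb = np.hstack([np.ones((n, 1)), X])        # prepend the bias input
    y = w[0] + np.tanh(Xb @ W.T) @ w[1:] + rng.normal(0.0, 0.02, size=n)
    return X, y

X_train, y_train = sample(800)
X_test, y_test = sample(1000)
```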
WAIC and WBIC were compared with AIC and BIC. The empirical approach needs a sampling method; however, usual MCMC (Markov chain Monte Carlo) methods such as the Metropolis algorithm will not work at all (Neal, 1996) since the MLP search space is quite hard to explore. Thus, we employed the powerful learning methods BPQ and SSF as sampling methods. For AIC and BIC a learning method runs without any regularizer, whereas WAIC and WBIC need a weight decay regularizer whose regularization coefficient $\lambda$ depends on the temperature $T$. The temperature $T$ was set as suggested in (Watanabe, 2010; Watanabe, 2013): $T = 1$ for WAIC and $T = \log(N)$ for WBIC. Accordingly, the regularization coefficient $\lambda$ for WAIC is smaller than that for WBIC. WAIC and WBIC were calculated using a set of weights $\{\boldsymbol{w}_t\}$ approximating a posterior distribution. Test error was calculated using the test data.
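A minimal sketch of how the two criteria are obtained from the sampled weights is given below. It assumes a precomputed array of pointwise log-likelihoods $\log p(y^{\mu} \mid \boldsymbol{x}^{\mu}, \boldsymbol{w}_t)$ and follows a per-example convention; the samples for WAIC come from the posterior at $T = 1$ and those for WBIC from the tempered posterior at $T = \log(N)$.

```python
import numpy as np

def waic(logp):
    """WAIC from an (S, N) array logp[t, i] = log p(y_i | x_i, w_t),
    where {w_t} are S samples from the posterior at T = 1."""
    m = logp.max(axis=0)                             # stabilize the exponentials
    lppd = np.sum(m + np.log(np.mean(np.exp(logp - m), axis=0)))
    p_waic = np.sum(np.var(logp, axis=0, ddof=1))    # functional variance penalty
    return (-lppd + p_waic) / logp.shape[1]

def wbic(logp):
    """WBIC from an (S, N) array of the same form, where {w_t} are sampled
    from the tempered posterior at T = log(N): the posterior average of
    the negative log-likelihood."""
    return np.mean(-np.sum(logp, axis=1))
```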
Our previous experiments have repeatedly shown that BPQ (Saito and Nakano, 1997) finds much better solutions than BP (back propagation) does, mainly because BPQ is a quasi-Newton, i.e., second-order, method. Thus, we employed BPQ as the conventional learning method, running it independently 100 times with different initial weights for each $J$. Moreover, we also employed the newly proposed learning method SSF. For SSF, the maximum number of search routes was set to 100 for each $J$; $J$ was increased from 1 to 24. Each run of a learning method was terminated when the number of sweeps exceeded 10,000 or the step length became smaller than $10^{-16}$.
Experimental Results:
Figures 1 to 6 show the results for artificial data 1. Figure 1 shows the minimum training error obtained by each learning method for each $J$. Although SSF guarantees the monotonic decrease of minimum training error, BPQ does not in general; however, BPQ did show a monotonic decrease for these data. Figure 2 shows the test error for $\widehat{\boldsymbol{w}}$ of the best model obtained by each learning method for each $J$. BPQ with $\lambda = 0$, BPQ with $\lambda$ for WAIC, and BPQ with $\lambda$ for WBIC attained the minimum test error at $J = 20$, 24, and 24 respectively. SSF with $\lambda = 0$, SSF with $\lambda$ for WAIC, and SSF with $\lambda$ for WBIC found the minimum test error at $J = 18$, 19, and 20 respectively.
Figure 3 shows AIC values obtained by each