models with different complexity to tune them to the
desired complexity on demand.
4.2 MNIST and CIFAR-10:
Experimental Settings
The main goal of these experiments is to demonstrate
the ability of hypernetworks to generate the deep
learning model parameters conditioned on the
complexity value λ. Once we obtain the parameters
for the desired model, we prune it to check how many
informative parameters each model has depending
on the complexity value λ. This experiment allows
us to compare the properties of models whose parameters
were obtained from a hypernetwork with the properties of
directly optimized ones.
For both experiments we trained our models for
50 epochs. The minibatch size was set to 256. The fol-
lowing implementations were compared:
(a) variational neural network (9);
(b) network with covariance reparametrization (10);
(c) base network (11);
(d) variational linear hypernetwork (12);
(e) network with covariance reparametrization (10)
with linear hypernetwork (12);
(f) base network (11) (Lorraine and Duvenaud, 2018)
with linear hypernetwork (12);
(g) variational piecewise-linear hypernetwork (13),
N = 5;
(h) network with covariance reparametrization (10)
with piecewise-linear hypernetwork (13), N = 5;
(i) base network (11) (Lorraine and Duvenaud, 2018)
with piecewise-linear hypernetwork (13), N = 5.
We launched the neural network training for dif-
ferent values of the complexity value λ ∈ Λ. The pa-
rameters of each model were pruned after the opti-
mization using the g_var criterion (17). For the imple-
mentations (c), (f), (i) we used the simplified criterion
g_simple (18).
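As a rough illustration of the pruning step, the sketch below zeroes out the least informative fraction of parameters according to a per-parameter relevance score. The score array here is a stand-in for the g_var (17) or g_simple (18) criteria, whose exact form is not reproduced in this excerpt.

```python
import numpy as np

def prune_by_criterion(weights, scores, fraction):
    """Zero out the given fraction of weights with the lowest relevance.

    `scores` is any per-parameter relevance criterion (higher means more
    informative); its definition is a placeholder for the paper's g_var
    or g_simple criteria.
    """
    flat_w = weights.ravel().copy()
    flat_s = scores.ravel()
    k = int(fraction * flat_w.size)      # number of parameters to remove
    if k > 0:
        drop = np.argsort(flat_s)[:k]    # indices of least informative weights
        flat_w[drop] = 0.0
    return flat_w.reshape(weights.shape)

# Toy check: pruning 50% removes exactly the low-score half.
w = np.array([[1.0, 2.0], [3.0, 4.0]])
s = np.array([[0.1, 0.9], [0.2, 0.8]])
pruned = prune_by_criterion(w, s, 0.5)
```

In the experiments the pruned fraction is swept from 0 to 1 and the classification accuracy is recorded at each step, which produces the accuracy-vs-pruning curves in Figs. 4 and 6.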
4.3 MNIST Experiment Results
For the MNIST dataset we used a neural network con-
sisting of two layers with 50 and 10 neurons, where
the second layer is followed by the softmax function. Pa-
rameters L, R for the uniform distribution were set to −3
and 3 respectively.
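A minimal forward-pass sketch of this two-layer MNIST network is given below; the ReLU hidden activation is our assumption, since the excerpt only specifies the layer sizes and the final softmax.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    """Two-layer net for MNIST: 784 -> 50 (ReLU, assumed) -> 10 (softmax)."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return softmax(h @ W2 + b2)

# Shape check with random weights (in the paper these come either from
# direct optimization or from a hypernetwork conditioned on λ).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 784))
W1, b1 = rng.normal(size=(784, 50)), np.zeros(50)
W2, b2 = rng.normal(size=(50, 10)), np.zeros(10)
probs = forward(x, W1, b1, W2, b2)
```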
Fig. 4a shows how the accuracy changes when
parameters are pruned for the variational neural net-
work (9). The graph shows that the variational
method allows removing ≈ 60% of parameters for λ ∈
{10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹} and ≈ 80% of parameters
for λ = 10² without significant loss of classification
accuracy. If we delete more parameters, the accuracy
decreases for all values. For large values λ > 10²
we obtain an oversimplified model. It contains a small
number of informative parameters. Thus, removing
them for a given value of λ has little effect on the
classification accuracy. However, the initial accuracy
is low.
Fig. 4d shows how the classification accuracy
changes for the model with covariance reparametriza-
tion (10). Fig. 4g shows how the classification accu-
racy changes for the base network (11). The classifi-
cation accuracy of these two models hardly changed,
but the networks with the variational approach were
more robust to parameter deletion.
Fig. 4b, e, h show how the classification accuracy
changes when parameters are removed by the speci-
fied method for models with the linear hypernetworks.
As can be seen from the graphs, the average classi-
fication accuracy increased for all values of λ ∈ Λ.
The deviation from the mean also increased for
large fractions of deleted parameters. At the same time,
for all values of λ ∈ Λ, more stable models were
obtained: the classification accuracy depends less on
the removal of parameters.
Fig. 4c, f, i show how the classification accuracy
changes when parameters are removed by the speci-
fied method for a model with the piecewise-linear ap-
proximation. During pruning, models with the piecewise-
linear hypernetwork showed behaviour similar to that
of directly trained models. Moreover, for
all values of λ ∈ Λ, more stable models were ob-
tained. All results are presented in Table 1 and in
Fig. 5, where the results for all λ were averaged.
4.4 CIFAR-10 Experiment Results
For the CIFAR-10 dataset, we used a CNN-based archi-
tecture with convolutional layers of size (3, 48), (48,
96), (96, 192), (192, 256), ReLU activations, and a feed-
forward layer at the end. Parameters L, R for the uniform
distribution were set to −2 and 0 respectively.
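For a sense of scale of the pruning experiments, the snippet below counts the parameters in the convolutional stack; the 3×3 kernel size is an assumption, as the excerpt only gives the channel dimensions.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one conv layer, assuming k x k kernels
    (the kernel size is not stated in the excerpt)."""
    return c_in * c_out * k * k + c_out

# Channel pairs quoted in the text: (3,48), (48,96), (96,192), (192,256).
channels = [(3, 48), (48, 96), (96, 192), (192, 256)]
per_layer = [conv_params(ci, co) for ci, co in channels]
total_conv = sum(per_layer)
```

Under the 3×3 assumption the last layer alone holds roughly two thirds of the convolutional parameters, so most of the pruning budget is concentrated there.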
Fig. 6a shows that the varia-
tional method also allowed removing ≈ 60% of parame-
ters for λ = 0.01, 0.1, in contrast to the base model
(Fig. 6d), where the classification accuracy dropped
significantly when 40% of the parameters were
removed.
The network with covariance reparametriza-
tion (10) showed poor results for CIFAR-10. They
are presented in Fig. 8. The poor results can be
mainly explained by the specialty of (5) for the mod-