3.3.2 Selection
In the selection phase, we fix the weights of the
main network and optimize the weights of the aux
network. Meanwhile, the decision of which aux data
to use is jointly optimized. To this end, we introduce
real-valued gate parameters {θ_i} that control which
of the N-dimensional aux data are active during
training; see Fig. 4b. The goal of the selection phase
is therefore to learn both sets of parameters by
minimizing the training loss L(W, θ), which is
determined by both the network weights W and the
gate parameters θ.
Besides the performance, the correlation between
the main and aux data is an important objective;
otherwise, the model can overfit to the aux data
(i.e., the model might select aux data that produce
percolative features that are difficult to represent
during the percolating phase). To avoid this, we
introduce a simple penalty term that captures the
correlation between main and aux data. Specifically,
given main and aux data x^main, x^aux, our loss
function penalizes the difference between the
network's percolative features f_perc computed with
and without the percolation procedure. As such, we
obtain the expected correlation between the main and
aux data as:
ℓ_corr = (1/|T|) ∑_{j=1}^{|T|} ‖ f_perc(x_j^main, ρ(x_j^aux, p_perc)) − f_perc(x_j^main, x_j^aux) ‖₂²
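As a concrete illustration, the correlation penalty above can be sketched as follows; here f_perc is replaced by a toy stand-in feature map, and ρ(·) is implemented as element-wise dropout with probability p_perc (both are assumptions for illustration, not the paper's actual network):

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(x_aux, p_perc, rng):
    # Percolation procedure: stochastically drop (zero out) each aux
    # entry with probability p_perc.
    mask = rng.random(x_aux.shape) >= p_perc
    return x_aux * mask

def corr_penalty(f_perc, x_main, x_aux, p_perc, rng):
    # Squared L2 distance between percolative features computed with
    # and without the percolation procedure, averaged over the batch T.
    diff = f_perc(x_main, rho(x_aux, p_perc, rng)) - f_perc(x_main, x_aux)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def f_perc(x_main, x_aux):
    # Toy stand-in for the percolative feature extractor.
    return np.tanh(x_main + x_aux)

x_main = rng.standard_normal((4, 8))
x_aux = rng.standard_normal((4, 8))
print(corr_penalty(f_perc, x_main, x_aux, 0.0, rng))  # 0.0: nothing dropped
print(corr_penalty(f_perc, x_main, x_aux, 0.5, rng) > 0.0)
```

With p_perc = 0 the procedure ρ(·) is the identity, so the penalty vanishes; as p_perc grows, the penalty measures how much the features depend on the dropped aux data.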
where ρ(·) denotes our modified percolation procedure
and p_perc the percolating probability. In the
procedure ρ(·), the aux data x^aux is stochastically
dropped out with a probability p_perc that is linearly
increased during the selection phase. We empirically
found that this penalty helps to improve
generalization and avoid overfitting. Thus, the total
loss L_sel used for aux data selection can be written
as:
L_sel = ℓ_CE(x^main, x^aux) + ℓ_corr(x^main, x^aux)
where ℓ_CE is the cross-entropy between the model
predictions and the training labels.
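A minimal sketch of the combined selection loss, assuming the model outputs class probabilities and the correlation term has already been computed (the function names here are illustrative):

```python
import numpy as np

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class.
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def selection_loss(probs, labels, l_corr):
    # L_sel = l_CE + l_corr, with equal weighting as in the text.
    return cross_entropy(probs, labels) + l_corr

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
l_corr = 0.05  # assumed precomputed correlation penalty
print(selection_loss(probs, labels, l_corr))
```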
Unlike the network weight parameters, the gate
parameters θ cannot be updated by standard gradient
descent. Therefore, we approximate the gradients with
respect to the gate parameters θ to optimize them
directly. In this work, we present two implementations
of aux data selection, although other optimization
techniques could also be employed. The first, a
simple gradient based method, approximately estimates
the gradients with respect to the gate parameters.
The second, a natural gradient based method,
formulates the optimization task in a probabilistic
manner. After the selection phase is completed, we
deterministically select the aux data based on θ
(i.e., argmax_g p(g|θ)).
Simple Gradient based Method: In this approach,
we determine which aux data to use stochastically. As
with the optimization in (Courbariaux et al., 2015),
we constrain the parameters to either 0 or 1 to
determine whether or not to use aux data for
percolative learning. To be specific, the gate
parameters {θ_i} are stochastically transformed into
binarized weights {g_i}:
g_i = 1 with probability p_i = σ(θ_i),
      0 with probability 1 − p_i,
where σ denotes the hard sigmoid function:

σ(x) = clip(x, 0, 1)
Although other functions could also be employed, for
simplicity we use this hard sigmoid function. Since
the gradients ∂L/∂θ_i cannot be calculated through
backpropagation, we simply update the gate parameters
θ using ∂L/∂g_i instead of ∂L/∂θ_i. To this end, we
compute the "masked aux data" and use them as inputs
to the aux network to ensure the binarized weights g
are involved in the computational graph:
x′_aux = g x_aux
The gradients ∂L/∂g_i can then be computed via
backpropagation, and thus we can analogously learn the
gate parameters θ.
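A minimal numpy sketch of this simple gradient based method; the loss and its gradient ∂L/∂g are toy stand-ins, and a single update step is shown (using ∂L/∂g in place of ∂L/∂θ, as described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def hard_sigmoid(x):
    # sigma(x) = clip(x, 0, 1)
    return np.clip(x, 0.0, 1.0)

N = 5                     # number of aux data dimensions (illustrative)
theta = np.full(N, 0.5)   # real-valued gate parameters

def sample_gates(theta, rng):
    # g_i = 1 with probability sigma(theta_i), else 0.
    p = hard_sigmoid(theta)
    return (rng.random(p.shape) < p).astype(float)

# One illustrative update step.
x_aux = rng.standard_normal(N)
g = sample_gates(theta, rng)
x_aux_masked = g * x_aux            # masked aux data x'_aux = g * x_aux

# Toy loss L = sum(x'_aux ** 2), so dL/dg_i = 2 * x_aux_i * x'_aux_i.
grad_g = 2.0 * x_aux * x_aux_masked
lr = 0.1
theta = theta - lr * grad_g         # update theta with dL/dg, not dL/dtheta

print(theta)
```

Note that gates sampled as 0 receive zero gradient through the mask, so their θ_i are unchanged in this step; this is why the stochastic sampling is repeated every iteration.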
Natural Gradient based Method: As with the
optimization method proposed in (Shirakawa et al.,
2018; Saito et al., 2018), we consider the gate
parameter g that determines which aux data to use for
percolating. The gate parameter g is sampled from the
probability distribution p(g|θ), which is
parameterized by a distribution parameter θ ∈ Θ.
Under the Bernoulli distribution
p(g|θ) = ∏_{i=1}^{N} θ_i^{g_i} (1 − θ_i)^{1−g_i},
we minimize the following loss function:
G(W, θ) = ∫ L(W, g) p(g|θ) dg
We optimize both W and θ by computing the gradient
and the natural gradient with respect to W and θ,
respectively:

∇_W G(W, θ) = ∫ ∇_W L(W, g) p(g|θ) dg

∇̃_θ G(W, θ) = ∫ L(W, g) ∇̃_θ ln p(g|θ) dg
where ∇̃ denotes the natural gradient (Amari, 1998),
which can be computed as the product of the inverse
Fisher information matrix and the gradient,
F(θ)^{−1} ∇_θ. We follow (Shirakawa et al., 2018) and
approximate these gradients by Monte-Carlo estimation
with λ samples drawn from p(g|θ). Specifically, we
use the analytical natural gradient of the
log-likelihood ∇̃_θ ln p(g|θ) = g − θ, which yields
the estimated natural gradient:
∇̃_θ G(W, θ) ≈ (1/λ) ∑_{j=1}^{λ} L(W, g_j)(g_j − θ)
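This Monte-Carlo natural gradient update can be sketched as follows; the loss below is a toy stand-in whose minimizer keeps the first two aux inputs and drops the last two, and all names and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

N, lam = 4, 16            # N aux inputs, lambda Monte-Carlo samples
theta = np.full(N, 0.5)   # Bernoulli distribution parameters
eta = 0.1                 # learning rate

def loss(g):
    # Toy loss: lowest when aux inputs 0 and 1 are kept, 2 and 3 dropped.
    target = np.array([1.0, 1.0, 0.0, 0.0])
    return float(np.sum((g - target) ** 2))

for _ in range(300):
    gs = (rng.random((lam, N)) < theta).astype(float)   # g ~ p(g|theta)
    losses = np.array([loss(g) for g in gs])
    # Monte-Carlo estimate of the natural gradient:
    #   (1/lambda) * sum_j L(W, g_j) * (g_j - theta)
    nat_grad = np.mean(losses[:, None] * (gs - theta), axis=0)
    # Descend, keeping theta inside the open interval (0, 1).
    theta = np.clip(theta - eta * nat_grad, 1e-3, 1.0 - 1e-3)

# Deterministic selection after training: argmax_g p(g|theta)
g_star = (theta > 0.5).astype(int)
print(g_star)
```

With this toy loss, θ is expected to drift toward the target gate pattern, after which the deterministic argmax selection recovers which aux inputs to keep.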
ICAART 2022 - 14th International Conference on Agents and Artificial Intelligence