Figure 4: Residual module structure.
are available as options. Of these, the zero-padding
approach is preferable because it adds no parameters,
but projection, which is easy to implement, is often
used in practice.
In deep networks, updating the parameters of one
layer causes an internal covariate shift: the distri-
bution of inputs to the next layer changes signifi-
cantly from batch to batch, making learning ineffi-
cient. Batch normalization (Ioffe and Szegedy, 2015)
stabilizes and speeds up learning by normalizing away
this internal covariate shift so that each layer can
learn as independently as possible. ResNet achieves
efficient training of deep networks by incorporating
batch normalization into the residual module, and
batch normalization has become standard in models
after ResNet.
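As a minimal sketch of the normalization step described above (scalar scale and shift parameters are used here for simplicity; in practice they are learned per channel):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch per feature, then scale and shift.

    x: array of shape (batch, features). gamma and beta are the
    learnable scale and shift parameters (scalars here for brevity).
    """
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

# Two features on very different scales are brought to a common scale,
# which is what keeps the input distribution of the next layer stable.
batch = np.array([[1.0, 200.0],
                  [3.0, 400.0]])
out = batch_norm(batch)
```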
3 PROPOSED APPROACH
CNNs are reported to identify subtle differences with
high accuracy, and attempts have been made to iden-
tify gender differences using Grad-CAM, which uses
the weights of a trained CNN to highlight the regions
that contribute to discrimination (Jiang et al., 2020).
However, no meaningful regions could be identified,
and the basis of the identification has hardly been ex-
plained.
One way to know what shapes or patterns a CNN
attends to is to examine what features its filters re-
spond to. However, a CNN learns to extract com-
plex features in its deep layers by combining simple
features extracted in its shallow layers, so to under-
stand what the CNN as a whole is looking at, one
must identify the multiple simple features involved
and work out what complex feature they express to-
gether. Moreover, since each layer of a CNN contains
many filters, this analysis would have to be repeated
for many filters across many layers, so explaining the
identification process from the filters is not realistic.
In addition, explaining the difference between images
requires knowing not only the relevant region but also
the differences in shape and pattern within it; spec-
ifying the region alone is not enough. Therefore, to
explain fake masks and real masks, a method is
needed that can both identify the regions involved in
their identification and obtain the differences in shape
and pattern within those regions.
Therefore, in this study, we propose a method us-
ing a generative adversarial network (GAN) to learn
the shapes and patterns relevant to the CNN's identifi-
cation from its result rather than from its process.
To analyze fake masks and real masks with a GAN,
a model that can learn the difference between the
two is required. CycleGAN, a model built on GAN,
can learn such a difference. Since CycleGAN learns
mutual conversion between datasets by unsupervised
learning, no paired data between the datasets is
needed, and it can learn transformations that have no
ground-truth solution in reality, such as mutual con-
version between fake masks and real masks. Specif-
ically, we use the MaskedFace-Net dataset as the
"fake mask" domain and the MAsked FAces dataset
(MAFA) (Ge et al., 2017) as the "real mask" domain,
and train a CycleGAN to perform mask transforma-
tion between the domains. Fig. 5 shows a schematic
diagram.
Given training data with the "fake mask" domain as
X and the "real mask" domain as Y, this amounts to
optimizing G_Y : X → Y and G_X : Y → X with
CycleGAN. Moreover, since each domain serves as
a teacher for the other, the training can be regarded
as exerting a probabilistic supervisory effect on a
data group such as MaskedFace-Net, for which no
explicit teacher labels are given. However, since
CycleGAN converts the entire image, the region re-
lated to the mask cannot be identified from the dif-
ference alone.
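The mutual conversion described above is held together by CycleGAN's cycle-consistency objective, which is what removes the need for paired data. The following is a minimal sketch with toy invertible affine maps standing in for the generator CNNs (our own illustration, not the paper's implementation):

```python
import numpy as np

# Toy stand-ins for the two generators: G_Y maps domain X -> Y and
# G_X maps domain Y -> X. Real CycleGAN generators are CNNs; these
# affine maps only illustrate the objective.
def G_Y(x):
    return 2.0 * x + 1.0

def G_X(y):
    return (y - 1.0) / 2.0

def cycle_consistency_loss(x_batch, y_batch):
    """L1 cycle loss: x -> G_Y(x) -> G_X(G_Y(x)) should return to x,
    and symmetrically for y. No paired (x, y) samples are needed."""
    loss_x = np.abs(G_X(G_Y(x_batch)) - x_batch).mean()
    loss_y = np.abs(G_Y(G_X(y_batch)) - y_batch).mean()
    return loss_x + loss_y

x = np.array([0.0, 1.0, 2.0])        # "fake mask" domain samples
y = np.array([1.0, 3.0, 5.0])        # "real mask" domain samples
loss = cycle_consistency_loss(x, y)  # 0.0 here: the toy maps are exact inverses
```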
Therefore, in this study, we introduce the follow-
ing additional loss function into CycleGAN to restrict
the region that is converted.
L_identity(G_X, G_Y) = E_{y∼p_data(y)}[||G_Y(y) − y||_1] + E_{x∼p_data(x)}[||G_X(x) − x||_1]  (5)
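The identity loss in Eq. (5) can be sketched directly from its two terms; the generator below is a hypothetical stand-in for the CNN, used only to exercise the computation:

```python
import numpy as np

def identity_loss(G_X, G_Y, x_batch, y_batch):
    """Equation (5): L1 penalty encouraging each generator to act as
    the identity when fed an image already from its target domain."""
    term_y = np.abs(G_Y(y_batch) - y_batch).mean()  # G_Y should leave y unchanged
    term_x = np.abs(G_X(x_batch) - x_batch).mean()  # G_X should leave x unchanged
    return term_y + term_x

# Toy generator standing in for the CNNs (hypothetical, for illustration):
near_identity = lambda img: img + 0.1   # shifts every pixel slightly
x = np.zeros((2, 4, 4))                 # two 4x4 "fake mask" images
y = np.ones((2, 4, 4))                  # two 4x4 "real mask" images
loss = identity_loss(near_identity, near_identity, x, y)  # ~0.1 + ~0.1 ≈ 0.2
```

A generator that perfectly preserves its inputs would score zero, which is exactly the pressure that keeps regions outside the mask unchanged.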
That is, the L1 norm is used as a loss so that the
distribution of the generated image stays close to that
of the input. By adding this "identity loss" while the
GAN learns the conversion between the domains, we
expected that the color and style of the region we
want to convert would change while the color and
style of the region we do not want to convert would
be preserved. In addition, the following three mea-
sures were taken to improve the quality of the images
generated by CycleGAN.