ing the weights in the classification layer and freezing the weights in the other layers. Finally, an additive
noise is found by minimizing the following objective
function:
$$
r^{*} = \operatorname*{arg\,min}_{r}\; \psi\big(\mathrm{loss}(X + r),\, c,\, k\big) + \lambda \lVert r \rVert^{2} \qquad (3)
$$

$$
\psi(L, c, k) =
\begin{cases}
\beta \times L[c] & \text{if } \arg\max L = c\\
L[k] - L[c] & \text{otherwise}
\end{cases} \qquad (4)
$$
where c is the actual class label, k is the predicted class label, λ is the regularization weight, and loss(X + r) returns the loss vector of the degraded image X + r computed over all classes. The multiplier β penalizes those values of r that do not degrade the image enough for it to be misclassified by the ConvNet.
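As a rough illustration, the following PyTorch sketch minimizes Eqs. (3) and (4) by gradient descent on r. The model, the optimizer, and all hyper-parameter values are assumptions made for this sketch, not the exact setup used in our experiments.

```python
# A minimal sketch of Eqs. (3)-(4); `model`, the SGD optimizer, and the
# hyper-parameter defaults are illustrative assumptions.
import torch

def find_additive_noise(model, X, c, k, beta=5.0, lam=1e-4, lr=0.1, steps=200):
    """Gradient descent on the additive noise r of Eq. (3).

    X: input image tensor of shape (1, C, H, W)
    c: actual class label, k: predicted class label (as in Eq. (4))
    """
    r = torch.zeros_like(X, requires_grad=True)
    opt = torch.optim.SGD([r], lr=lr)
    for _ in range(steps):
        L = model(X + r).squeeze(0)       # "loss" vector over all classes
        if int(L.argmax()) == c:          # first branch of Eq. (4)
            psi = beta * L[c]
        else:                             # second branch of Eq. (4)
            psi = L[k] - L[c]
        obj = psi + lam * r.norm() ** 2   # objective of Eq. (3)
        opt.zero_grad()
        obj.backward()
        opt.step()
    return r.detach()
```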
We minimized the above objective function on a sample image from the Caltech101 dataset. Figure 2 illustrates the frequency response of r along with the frequency responses of the first 7 filters in the first layer of Googlenet (Szegedy et al., 2014). Note that the maximum and minimum values of the noise are very small; we have normalized their intensity for visualization purposes.
First, we observe that the noise affects almost all frequencies (note that in the chart, only the points colored blue have a magnitude near zero). Second, the frequency responses of the filters reveal that they pass not only low and mid frequencies but also very high frequencies. If the response of each filter is multiplied by the response of the noise (i.e., convolution in the spatial domain), the result is another noisy image in which the effect of some frequencies is slightly reduced. In other words, the output of the first convolution layer in Googlenet is a multi-channel noisy image, since the filters are not able to effectively suppress the additive noise.
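The frequency-response computation behind this analysis can be sketched as follows. Here `noise` and `kernel` are random stand-ins for the optimized r and a first-layer Googlenet filter, which are not reproduced in this sketch.

```python
# Frequency-response sketch; `noise` and `kernel` are stand-ins for the
# optimized r and a first-layer conv filter.
import numpy as np

def magnitude_response(x, size=64):
    """Zero-padded 2-D FFT magnitude with the DC component centered."""
    return np.abs(np.fft.fftshift(np.fft.fft2(x, s=(size, size))))

noise = np.random.randn(64, 64)   # stand-in for one channel of r
kernel = np.random.randn(7, 7)    # stand-in for a 7x7 conv filter

# Convolution in the spatial domain is multiplication in the frequency
# domain, so the filtered noise keeps energy wherever both magnitude
# responses are non-zero.
filtered_mag = magnitude_response(kernel) * magnitude_response(noise)
```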
When this noisy multi-channel image is passed through a max-pooling layer, it may produce another noisy image in which the magnitude of high frequencies may increase. Analyzing several ConvNets in the frequency domain (illustrated in the supplementary document) shows that they tend to learn filters which respond to most of the frequencies in the image. For this reason, the noise propagates through the network and also appears in the last convolution layer, where it may alter the output of the ConvNet.
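The max-pooling claim can be probed with a small numeric check such as the one below; this is an illustrative measurement for this sketch, not an experiment from the paper.

```python
# Illustrative check of whether 2x2 max-pooling raises the share of
# high-frequency energy in a noisy map; not an experiment from the paper.
import numpy as np

def max_pool2x2(x):
    """2x2 max-pooling with stride 2."""
    h, w = (x.shape[0] // 2) * 2, (x.shape[1] // 2) * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def high_freq_share(x):
    """Fraction of spectral energy beyond half the normalized Nyquist radius."""
    m = np.abs(np.fft.fftshift(np.fft.fft2(x))) ** 2
    cy, cx = m.shape[0] / 2, m.shape[1] / 2
    yy, xx = np.ogrid[:m.shape[0], :m.shape[1]]
    dist = np.hypot((yy - cy) / cy, (xx - cx) / cx)  # normalized radius
    return m[dist > 0.5].sum() / m.sum()

rng = np.random.default_rng(0)
noisy = rng.normal(size=(64, 64))
print(high_freq_share(noisy), high_freq_share(max_pool2x2(noisy)))
```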
It should be noted that an additive noise can affect all frequencies. This means that removing the effect of only certain frequencies (for example, high frequencies) will not increase the stability of ConvNets. In addition, high frequencies are as important as low frequencies, and removing their response can reduce the classification accuracy. As a result, we cannot judge a filter by only studying its response at different frequencies.
From the frequency-domain perspective, it is not trivial to suppress the additive noise r during the convolution process. This is due to the fact that r has positive magnitude at nearly all frequencies. Hence, even discarding the effect of the noise at some frequencies will not effectively solve the problem, since the noise will be passed to the next layers through the remaining frequencies. However, as we show in the next section, by learning filters which are more localized in the frequency domain, the stability of the network may increase while its accuracy remains the same.
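One hypothetical way to quantify how localized a filter is in the frequency domain is sketched below; this metric is an assumption for illustration and is not the measure used in the paper.

```python
# Hypothetical localization score (not from the paper): the share of a
# filter's spectral energy inside its single strongest 8x8 block of the
# 64x64 padded response; localized filters score close to 1.
import numpy as np

def frequency_localization(kernel, size=64, block=8):
    m = np.abs(np.fft.fft2(kernel, s=(size, size))) ** 2
    blocks = m.reshape(size // block, block, size // block, block)
    return blocks.sum(axis=(1, 3)).max() / m.sum()
```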
3 EXPERIMENTS
In this section, we study the stability of ConvNets empirically and in the frequency domain. To this end, we utilize ConvNets with different architectures trained on various datasets. Specifically, we use the architecture in (Jia et al., 2014) for training a ConvNet on the CIFAR10 dataset (Krizhevsky, 2009). We also use the pre-trained models of Alexnet (Krizhevsky et al., 2012) and Googlenet (Szegedy et al., 2014) and fine-tune them on the Caltech101 dataset (Fergus and Perona, 2004). Finally, we train the architectures from (Ciresan et al., 2012) and [will cite our paper] on the GTSRB dataset (Stallkamp et al., 2012). Table 1 shows the accuracy of each ConvNet trained on the original datasets. It is clear that all the ConvNets have achieved state-of-the-art results.
3.1 Stability of ConvNets
To empirically study the stability of the ConvNets against noise, the following procedure is conducted. First, we pick the test images from the original datasets which are correctly classified by the ConvNets. Then, 100 noisy images are generated for each σ ∈ {1, 2, 4, 8, 10, 15, 20, 25, 30, 35, 40}. In other words, 1100 noisy images are generated for each correctly classified test image from the original datasets. The same procedure is repeated on every dataset, and the accuracy of the ConvNets is computed on the noisy test sets. Table 2 shows the accuracy of the ConvNets for each value of σ.
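A minimal sketch of this degradation protocol is given below; dataset loading and the classification step are omitted, and `image` is assumed to be an 8-bit array.

```python
# Sketch of the noisy-test-set protocol: 100 Gaussian degradations per
# sigma (1100 in total) for each correctly classified test image.
import numpy as np

SIGMAS = [1, 2, 4, 8, 10, 15, 20, 25, 30, 35, 40]

def noisy_versions(image, n_per_sigma=100, seed=0):
    """Yield degraded copies of a correctly classified 8-bit test image."""
    rng = np.random.default_rng(seed)
    for sigma in SIGMAS:
        for _ in range(n_per_sigma):
            noisy = image + rng.normal(0.0, sigma, image.shape)
            yield sigma, np.clip(noisy, 0, 255).astype(np.uint8)
```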
First, we observe that, except for IRCV and Alexnet, the ConvNets misclassify a few of the correctly classified test images when they are degraded by a Gaussian noise with σ = 1. Note that when σ = 1, it is highly improbable that a pixel is degraded by more than ±4 intensity levels in each channel (for a zero-mean Gaussian, the probability of exceeding 4σ is about 6 × 10⁻⁵). However,