extract semantic features for image classification and segmentation (Huang et al., 2017; He et al., 2016; Szegedy et al., 2016; Krizhevsky et al., 2012). A CNN is a bio-inspired network that uses the concept of receptive fields to exploit spatial correlations in the image, so that it is capable of transforming and reducing image information, thus obtaining a meaningful representation of its content.
Any CNN has its main structure defined by three types of layers: convolutional, pooling, and fully connected. The convolutional layer uses the convolution operation to emulate a receptive field and its response to a visual stimulus. It is the primary layer of a CNN, since it acts as a feature detector in the image. Pooling layers help to reduce the dimensionality of the image as well as the CNN's sensitivity to image distortions and shifts. A few fully connected layers, followed by a softmax layer (Krizhevsky et al., 2012), are located at the end of the CNN and are used for classification, outputting the most probable class for a given image.
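The three layer types above can be sketched in a minimal, framework-free form. This is an illustrative NumPy sketch of one channel of each operation (a real CNN stacks many such layers with learned kernels); the function names are ours, not from the paper:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel (a receptive field) over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsamples the feature map and adds
    tolerance to small shifts and distortions."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(logits):
    """Softmax over class scores, as applied after the fully connected layers."""
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

For instance, convolving a 4 × 4 image with a 2 × 2 kernel yields a 3 × 3 feature map, which a 2 × 2 max pool then halves in each dimension.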
Due to the difficulty of modeling, training, and testing different network models, large computing companies (such as Google and Microsoft) have developed CNN models and trained them on large image datasets. These pre-trained networks can be used to learn generic features of new datasets, while we fine-tune the output of the network to the new problem. In this work, we used InceptionV3, ResNet, SqueezeNet, and DenseNet models pre-trained on the 2012 ImageNet dataset, which contains 1000 classes:
• InceptionV3: Google’s research team proposed this network model, which introduces the inception module as an approach to reduce the computational load of CNNs while maintaining their performance (Szegedy et al., 2016). The inception module is based on factorized and asymmetric convolutions, whose main goal is to reduce the number of connections/parameters of the network without decreasing its efficiency. InceptionV3 is a large CNN containing 23.8 million parameters.
• ResNet: Microsoft Research (He et al., 2016) proposed this network model, which uses residual learning to improve network accuracy. ResNet learns residuals by using skip connections to propagate information across layers. This scheme enables the creation of deeper networks, as it mitigates the problem of vanishing gradients. Depending on its structure, ResNet can range from 18 to 152 layers and up to 100.11 million parameters.
• DenseNet: unlike other CNN models, DenseNet (Huang et al., 2017) is considered a small network, having 8 million parameters. Similar to ResNet, it uses the concept of residual connections as building blocks of its model. However, DenseNet concatenates the outputs of previous layers instead of summing them. Additionally, DenseNet presents more group connections than other networks, so that the feature maps of all predecessor layers are used as input to all subsequent layers.
• SqueezeNet: this model (Iandola et al., 2016) was designed to be a smaller network model, yet still capable of achieving results similar to those of bigger models. It introduces the concept of Fire modules, with squeeze convolution layers (only 1 × 1 filters) that feed into an expand layer. This design results in a network with 50× fewer parameters than AlexNet, making it ideal for hardware with limited memory.
Table 1 shows the number of parameters, input size, and number of convolutions of each network model evaluated.
Table 1: CNN models specifications.

CNN model    | # of parameters | Input size | # of conv.
-------------|-----------------|------------|-----------
DenseNet     | 8.0M            | 224 × 224  | 120
ResNet       | 25.6M           | 224 × 224  | 104
InceptionV3  | 23.8M           | 299 × 299  | 197
SqueezeNet   | 1.2M            | 224 × 224  | 24
3 IMAGE DATASET
To study the variability of CNNs when subjected to a cross-validation scheme, we opted to use an image dataset presenting high inter-class and low intra-class similarity, as it poses an additional challenge for classification tasks. To accomplish that, we used the virus dataset available at www.cb.uu.se/~gustaf/virustexture/. Further details on how the images were obtained can be found in (Kylberg et al., 2012).
This dataset contains 1500 Transmission Electron Microscopy (TEM) images representing 15 different types of viruses: Adenovirus, Astrovirus, CCHF, Cowpox, Dengue, Ebola, Influenza, Lassa, Marburg, Norovirus, Orf, Papilloma, Rift Valley, Rotavirus, and West Nile. Each virus type is represented by 100 images of 41 × 41 pixels. Although this database is available in 8-bit and 16-bit formats, we used only the 8-bit format in our experiments to avoid normalization problems during the training of the CNNs.
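Since the 41 × 41 8-bit images are much smaller than the network input sizes in Table 1, they must be rescaled and replicated to three channels before being fed to the CNNs. This is an illustrative NumPy sketch of such preprocessing (the paper does not specify its interpolation method; nearest-neighbour sampling is our assumption here):

```python
import numpy as np

def prepare(img_u8, target=224):
    """Illustrative preprocessing for a square 8-bit grayscale virus image:
    scale intensities to [0, 1], resize to the CNN input size by
    nearest-neighbour sampling, and replicate the gray channel to 3
    channels (the actual interpolation used in the paper is not stated)."""
    img = img_u8.astype(np.float32) / 255.0              # 8-bit -> [0, 1]
    idx = (np.arange(target) * img.shape[0] / target).astype(int)
    resized = img[idx][:, idx]                           # nearest-neighbour resize
    return np.stack([resized] * 3, axis=0)               # (3, H, W) tensor layout
```

For InceptionV3 the same function would be called with `target=299`, matching its larger input size.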
Variability Evaluation of CNNs using Cross-validation on Viruses Images