large filters (kernels) to cover broader areas of an image during the
convolution operation. In the SFC layer, we create pyramids of stacked
filters of different sizes. The stacking of filters creates a conical
structure that we refer to as a filter pyramid. Input image data passing
through the SFC layer are convolved with each filter, resulting in
feature maps of different scales. These feature maps are then upscaled,
used as global features, and passed to the next layer of the CNN. In
this work, we refer to the integration of the SFC layer into an existing
CNN as the Stacked Filter CNN (SFCNN) model.
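The layer described above might be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the kernel sizes (3, 5, 7), the nearest-neighbour upscaling, and the random stand-in weights are all assumptions made for the sketch.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive valid convolution of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def upscale_nearest(fmap, target_shape):
    """Nearest-neighbour upscaling of a feature map to a target spatial size."""
    rows = np.linspace(0, fmap.shape[0] - 1, target_shape[0]).round().astype(int)
    cols = np.linspace(0, fmap.shape[1] - 1, target_shape[1]).round().astype(int)
    return fmap[np.ix_(rows, cols)]

def sfc_layer(image, kernel_sizes=(3, 5, 7)):
    """One hypothetical 'filter pyramid': stacked filters of different sizes
    convolve the same input; the resulting multi-scale feature maps are
    upscaled back to a common size and stacked as global features."""
    rng = np.random.default_rng(0)
    maps = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k)) / k  # stand-in for learned weights
        fmap = conv2d_valid(image, kernel)        # larger k -> smaller map
        maps.append(upscale_nearest(fmap, image.shape))
    return np.stack(maps)  # shape: (len(kernel_sizes), H, W)

image = np.random.default_rng(1).standard_normal((28, 28))
features = sfc_layer(image)
print(features.shape)  # (3, 28, 28)
```

The essential property the sketch captures is that each filter in the pyramid sees a different spatial extent of the same input, and the upscaling step restores a common resolution so the stacked maps can feed the next layer.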
We conduct extensive experiments to evaluate the scale-invariance
performance of SFCNN, using the LeNet5 CNN (LeCun et al., 1998) as our
benchmark model. First, LeNet5 is trained on the datasets to establish
benchmark results for comparison. Then we combine LeNet5 with the SFC
layer by placing it as the first layer of the LeNet5 feature-extraction
pipeline. This placement enables the SFC layer to pass features
extracted from spatially broader areas of the image (global-first) into
the rest of the network for further processing. We train the combined
SFCNN on the same datasets. We study the performance of SFCNN in
classifying image samples at specific scale categories, as well as on
individual classes, where images at various scales per class are
evaluated. For consistency, we use the same test samples for all models
developed. In all our case studies, the performance of SFCNN is compared
with our benchmarks. Our results show that SFCNN outperforms the
traditional LeNet5 CNN in classifying colour images across the majority
of scale categories. In addition, we report promising results on
SFCNN's ability to classify images at various scale levels for each
dataset class, particularly for colour images.
The main contributions of this paper are to improve CNNs for the
classification of scaled images by showing the effectiveness of a)
processing spatially broader areas of an input image in the initial
stages of a CNN feature-extraction pipeline, and b) enhancing the
extracted features by upscaling feature maps.
The rest of the paper is organised as follows: Sec-
tion 2 reviews related work while Section 3 introduces
our model. Section 4 describes our experiment design
and results are presented in Section 5. We summarise
and point to future directions in Section 6.
2 BACKGROUND
Use of Global Features in CNNs. While local features are effectively
extracted in CNNs using small filters that perform a patch-wise
operation on the target image, extracting global features requires
examining the whole image or spatially larger areas of it. Here, lines
(edges) and curves are classified as local features, while shape,
colour, and shape contours are labelled as global features. In some
studies, global features have been applied in CNNs, but these are
limited to feature descriptors such as the histogram of oriented
gradients (HOG) (Zhang et al., 2016) and SIFT (Zheng et al., 2017).
However, such approaches have not been tested for the network's ability
to be spatially invariant, and feature extractors such as HOG and SIFT
are non-trainable.
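To make the non-trainability point concrete, a minimal HOG-like global descriptor can be written in a few lines; unlike a convolutional filter, it contains no parameters that gradient descent could adjust. This is a simplified sketch (whole-image histogram, no cells or block normalisation), not the full HOG pipeline of the cited works.

```python
import numpy as np

def orientation_histogram(image, n_bins=9):
    """A minimal, HOG-like global descriptor: a histogram of gradient
    orientations over the whole image, weighted by gradient magnitude.
    Every quantity below is computed from the input alone -- there are
    no learnable weights, which is what 'non-trainable' means here."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

desc = orientation_histogram(np.random.default_rng(0).standard_normal((32, 32)))
print(desc.shape)  # (9,)
```

Because the descriptor is fixed, it cannot adapt to a dataset during training, in contrast with the stacked filters discussed in this paper.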
Use of Large Kernels in CNNs. The use of large kernels to extract
features from spatially broader areas of the target image has been
studied in prior work. In the area of semantic segmentation, (Peng
et al., 2017) proposed a Global Convolutional Network in which they
studied the use of large kernels. Instead of directly applying large
kernels as normal convolutions, they used a combination of vector-type
kernels of sizes 1 × k and k × 1 to connect with a large k × k region in
the feature map. They conducted their experiments on the PASCAL VOC
dataset and concluded that large kernels play an important role in both
classification and localisation tasks. In their design, they did not use
any non-linearity after the convolution layers, contrary to the practice
in standard CNN models. In other work, (Park and Lee, 2016) note that
extracting information from a large area surrounding the target pixel is
essential for capturing, for example, texture information.
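The factorised large-kernel idea of (Peng et al., 2017) can be sketched as follows: a (1 × k then k × 1) path plus a (k × 1 then 1 × k) path, summed, gives an effective k × k receptive field with 4k weights per filter rather than k². The random weights are stand-ins for learned ones, and, as in the cited design, no non-linearity follows the convolutions; the single-channel NumPy formulation is a simplification of the multi-channel original.

```python
import numpy as np

def conv_rows(x, k1d):
    """Convolve every row of a 2-D map with a 1-D kernel (same padding)."""
    return np.apply_along_axis(lambda r: np.convolve(r, k1d, mode='same'), 1, x)

def conv_cols(x, k1d):
    """Convolve every column of a 2-D map with a 1-D kernel (same padding)."""
    return np.apply_along_axis(lambda c: np.convolve(c, k1d, mode='same'), 0, x)

def large_kernel_block(fmap, k=7):
    """Factorised k x k convolution: two separable paths are summed so each
    output location aggregates a full k x k region of the input map."""
    rng = np.random.default_rng(0)
    a_row, a_col = rng.standard_normal(k), rng.standard_normal(k)
    b_col, b_row = rng.standard_normal(k), rng.standard_normal(k)
    path_a = conv_cols(conv_rows(fmap, a_row), a_col)  # 1 x k then k x 1
    path_b = conv_rows(conv_cols(fmap, b_col), b_row)  # k x 1 then 1 x k
    return path_a + path_b                             # no non-linearity

out = large_kernel_block(np.random.default_rng(1).standard_normal((16, 16)))
print(out.shape)  # (16, 16)
```

For k = 7, the two paths use 28 weights where a dense kernel would need 49, and the gap widens quadratically with k, which is why the factorisation makes very large receptive fields affordable.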
Pyramid-based Methods in CNNs for Scale-Invariant Classification.
Pyramid-based methods have been used to address scale invariance in
CNNs to some extent, but have been limited to generating either image
pyramids or feature-map pyramids. For example, (Kanazawa et al., 2014)
describe work in which they first create an image pyramid by scaling the
target image, and then convolve every scaled input with the same filter.
The feature maps generated are normalised to the same spatial dimensions
and then pooled to obtain a locally scale-invariant representation.
However, in their implementation, scaling the target image is similar to
applying scale augmentation; in our work, we apply no augmentation to
the input images. In another work, (Xu et al., 2014) propose a
scale-invariant CNN (SiCNN) that applies a similar process of convolving
a filter over different image scales. (Lin et al., 2017) exploit the
pyramidal hierarchy of feature maps in deep convolutional networks by
developing lateral connections from each
VISAPP 2021 - 16th International Conference on Computer Vision Theory and Applications