5.2 Size of Traffic Sign Images
Another serious issue we noticed is the size of the real-time traffic sign images. Even when the aspect ratio is acceptable, the image itself may be too small. The dataset contains images of widely varying sizes, from smaller than 10 × 10 pixels to larger than 100 × 100 pixels, with varying aspect ratios. The histogram of the heights and widths of all images in the 29-class triangular sign dataset is plotted in Fig. 7. From the histogram, it is observed that most of the images are very small and concentrated in the lower bins (0-15). Such tiny images do not contain enough useful information, and resizing them to a larger size cannot recover it. For example, an image of size 1 × 4 contains only 4 pixels, which is not enough for classification. Moreover, when very small images are resized to 64 × 64, the input size required by our CNN model, the pixels are interpolated, so the resized image still lacks the information needed for classification. Images whose width and height are both below 15 pixels are therefore ignored; an image is included as long as at least one of its dimensions is greater than 15. The minimum dimension works out to 11 or 12 pixels, because any image with an aspect ratio below 0.77 or above 1.38 is considered an outlier. Thus, the smallest image now possible in the dataset is 11 × 15 pixels. After removing all the tiny images from the dataset, the pre-trained network is evaluated again. Accuracy improves to 96.1% for the flat classifier and 96.9% for the hierarchical classifier, compared to 93.3% earlier. Thus, removing small images increases the accuracy of the classifier.
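The filtering rule above can be written as a simple predicate. The following is a minimal sketch; the function name is ours, and the orientation of the aspect ratio (width over height) is an assumption, since the text leaves it ambiguous:

```python
def keep_image(width, height, min_side=15, ar_min=0.77, ar_max=1.38):
    """Sketch of the filtering rule described above: drop aspect-ratio
    outliers, and drop images whose width AND height are both below
    min_side pixels."""
    aspect_ratio = width / height  # assumed orientation: width over height
    if aspect_ratio < ar_min or aspect_ratio > ar_max:
        return False  # aspect-ratio outlier
    # keep the image if at least one dimension reaches min_side
    return width >= min_side or height >= min_side
```

For example, a 15 × 11 crop passes (ratio ≈ 1.36, one side at 15), while a 10 × 10 crop is dropped despite its square ratio, and a 40 × 10 crop is dropped as an aspect-ratio outlier.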
Another problem is that resizing an image from a small size such as 11 × 15 pixels to 64 × 64 pixels may also introduce artifacts. A possible solution is spatial pyramid matching (Gupta et al., 2018), which eliminates the need for a fixed-size input image; with this method, images can be used at their original size. The fixed-size constraint on the input layer of a CNN comes from the fully connected layers at the end, not from the convolution layers, and this concept yields a fixed-length representation from variable-sized feature maps. Table 3 shows the feature map sizes for a general M × N × 3 input image to our building-block CNN. It is observed that for at least a 1 × 1 feature map to remain at the last layer, the minimum image size must be 22 × 22 × 3. However, in our case the smallest image can be 11 × 15, so this concept cannot be used here.
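Although it cannot be applied to our smallest images, the fixed-length idea itself is easy to illustrate. The following NumPy sketch (our own illustrative code, not the authors' implementation) max-pools a C × H × W feature map over 1 × 1, 2 × 2, and 4 × 4 grids, producing a vector of length 21·C for any H and W:

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map over pyramid grids so that the
    output length depends only on C, not on H or W."""
    C, H, W = fmap.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # bin boundaries: floor for the start, ceil for the end,
                # so every bin is non-empty even when H or W < n
                h0, h1 = (i * H) // n, -((-(i + 1) * H) // n)
                w0, w1 = (j * W) // n, -((-(j + 1) * W) // n)
                pooled.append(fmap[:, h0:h1, w0:w1].max(axis=(1, 2)))
    return np.concatenate(pooled)  # length = C * (1 + 4 + 16)
```

Feature maps of shape 3 × 7 × 5 and 3 × 13 × 9 both pool to 63-dimensional vectors, which is what allows the fully connected layers to accept variable-sized inputs.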
5.3 Results after Removing both
Outliers and Small-sized Images
In this section, results are obtained after removing both the outliers and the small-sized images. After this step, the number of training images falls to 14097, less than half the original count, which means that about half of the images in the dataset were small-sized. The validation count likewise falls from 5948 to 2474, and the pre-trained network is now evaluated on these 2474 images. The flat classifier achieves 96.7% accuracy and the hierarchical classifier 97.5%, which is better than all the cases mentioned above.
It is observed that 62 of the 2474 examples are misclassified by the hierarchical classifier. Around 50 of these 62 examples cannot be correctly classified even by a human; the causes of misclassification are complete or partial occlusion, motion blur, scale variation, illumination variation, and label noise. Thus, the hierarchical classifier's accuracy of 97.5% is nearly justified.
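The reported figure can be checked directly from the counts above:

```python
total, misclassified = 2474, 62
accuracy = (total - misclassified) / total
print(f"{accuracy:.1%}")  # 97.5%, matching the hierarchical classifier's result
```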
5.4 Super-resolution
Resizing by interpolation blurs sharp edges and reduces contrast. When upscaling an image, it is desirable to recover finer texture details, and super-resolution techniques upscale an image while preserving such perceptual details. There are many approaches to super-resolution, including prediction-based, image-statistics-based, edge-preserving, and example-pair-based methods. Here, single-image super-resolution is performed using a Super-Resolution Generative Adversarial Network (SRGAN) (Ledig et al., 2017). Image scaling is required because the building-block CNN takes a fixed-size input of 64 × 64 × 3. The super-resolution technique is applied to the 29-class triangular sign dataset to upscale small images by a factor of 4. The SRGAN network is not trained from scratch; instead, a network pre-trained on the DIV2K dataset is used. It is observed that images upscaled by bicubic interpolation are blurred, whereas the super-resolved images are perceptually satisfying and preserve texture details. To train our network from scratch on super-resolved images, the dataset of 29-class triangular sign images has to be restructured: after removing small images and outliers, one of the classes has no example left in the validation set, and only 14097 training images remain.
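For reference, the bicubic baseline against which the super-resolved outputs are compared amounts to a plain 4× interpolated resize; a minimal sketch with Pillow (the helper name is ours, and Pillow is assumed to be available):

```python
from PIL import Image

def upscale_bicubic(img, factor=4):
    """4x bicubic upscaling: the blurry baseline that the SRGAN
    outputs are compared against in the text."""
    w, h = img.size
    return img.resize((w * factor, h * factor), Image.BICUBIC)

# e.g. an 11 x 15 sign crop becomes 44 x 60, still short of 64 x 64
```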
Heavy augmentation is therefore performed, and the validation images now contain different tracks of signs. In this case