Traffic Sign Classification using Hybrid HOG-SURF Features and Convolutional Neural Networks

Rishabh Madan∗, Deepank Agrawal∗, Shreyas Kowshik∗, Harsh Maheshwari∗, Siddhant Agarwal∗ and Debashish Chakravarty

Autonomous Ground Vehicle Team, Indian Institute of Technology, Kharagpur, India

Keywords: Convolutional Neural Network, Feature Descriptors, Traffic Sign Classification, Histogram of Oriented Gradients, Speeded Up Robust Features.

Abstract: Traffic signs play an important role in the safety of drivers and the regulation of traffic. Traffic sign classification is thus an important problem to solve for the advent of autonomous vehicles. There have been several works that focus on traffic sign classification using various machine learning techniques. While works involving the use of convolutional neural networks with RGB images have shown remarkable results, they require a large amount of training time, and some of these models occupy a large amount of memory. Earlier works like HOG-SVM make use of local feature descriptors for classification, but at the expense of reduced performance. This paper explores the use of hybrid features by combining HOG features and SURF with a CNN classifier for traffic sign classification. We propose a unique branching based CNN classifier which achieves an accuracy of 98.48% on the GTSRB test set using just 1.5M trainable parameters.

1 INTRODUCTION

Trafﬁc sign classiﬁcation holds an essential place in

visually guided driving assistance and autonomous

driving systems and several other trafﬁc-related utili-

ties. Trafﬁc signs are utilised as a method of warning

and guiding drivers, helping to regulate the ﬂow of

trafﬁc among vehicles, pedestrians, and others who

travel the streets, highways and other roadways. The

development of trafﬁc sign classiﬁcation is dedicated

to reducing the number of fatalities and the severity of

road accidents and is an important and active research

area. In general, trafﬁc signs have unique and distinc-

tive features like simple shape priors in the form of

circles and triangles combined with uniform colour-

ing patterns, which makes them easily recognizable

and thus also a restricted classiﬁcation task. Regard-

less, classifying these signs without any human super-

vision is still a challenging task considering the dif-

ferent kinds of problems like occlusions, disoriented

poles, lighting changes and poor quality signs that are

encountered during real-world execution.

For many years, local feature descriptors have dominated all domains of computer vision. The considerable progress that has been visible in classification and detection is mostly due to them. The Histogram of Oriented Gradients (HOG), after its introduction, outperformed all previous methods for human detection (Dalal and Triggs, 2005) by a considerable margin. It has since been used to solve a variety of classification and detection problems. (Schmitt and McCoy, 2011) used Speeded Up Robust Features (SURF) (Bay et al., 2006) feature descriptors along with a Support Vector Machine (SVM) classifier for classification on a subset of the CalTech-101 database (Fei-Fei et al., 2004).

∗ These authors have contributed equally.

In recent years, convolutional neural networks

(CNN) have become pervasive in classiﬁcation tasks

after AlexNet (Krizhevsky et al., 2012) which popu-

larized deep convolutional neural networks for classi-

ﬁcation. The general trend has been to make deeper

and more complicated networks to extract high-level

features in order to achieve higher accuracy (Szegedy

et al., 2014),(He et al., 2015). Many approaches in-

volving deep learning have achieved exemplary per-

formance in traditional road environments. However,

an increase in depth of network leads to exponen-

tial increase in the number of parameters. A deep

convolutional neural network involving Spatial Transformer networks (García et al., 2018) achieved the highest accuracy on the German Traffic Sign Recognition

Benchmark (GTSRB) (Stallkamp et al., 2012) dataset

Madan, R., Agrawal, D., Kowshik, S., Maheshwari, H., Agarwal, S. and Chakravarty, D.

Trafﬁc Sign Classiﬁcation using Hybrid HOG-SURF Features and Convolutional Neural Networks.

DOI: 10.5220/0007392506130620

In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 613-620

ISBN: 978-989-758-351-3


but used 14M model parameters. These advances to

improve accuracy are making networks inefﬁcient in

terms of memory use and inference speed. In the real

world, trafﬁc sign classiﬁcation might be carried out

on a memory constrained platform but would still re-

quire highly accurate prediction.

In this work, we propose the use of hybrid fea-

ture descriptors along with a CNN based classiﬁer for

traffic sign classification, thereby providing a computationally efficient method with fewer model parameters and reduced training time. The proposed hybrid feature descriptors comprise HOG feature descriptors and a Bag of Words (BoW) variant of SURF feature descriptors. HOG descriptors tend

to capture the global features of the object and are

lighting invariant, thus beneﬁcial in the case of trafﬁc

signs, due to the diverse shapes and high-level geo-

metric priors for different signs. But HOG descrip-

tors highly vary with change in object orientation,

which could become a problem when these signs are

viewed from multiple viewpoints. On the other hand,

SURF descriptors are rotation invariant, but when us-

ing a BoW representation, the spatial and geometric

relationship information between descriptors is lost.

This motivates the use of a combination of HOG de-

scriptors with SURF BoW descriptors. The presented

approach leverages the use of these hand-crafted de-

scriptors by using a CNN to learn a better represen-

tation for ﬁner feature extraction and observe per-

formance comparable to recent deep CNN architec-

tures using RGB images for classiﬁcation. The use

of this combination of descriptors to classify trafﬁc

signs using a basic CNN classiﬁer has been demon-

strated. The purpose of using CNN, in comparison to

SVM or other learning-based classiﬁers in this partic-

ular case is to learn a better feature representation of

this combination of two different descriptors and thus

make use of relatively high-dimensional features for

classiﬁcation. We extend the experiments by using a

branched CNN architecture to reduce the number of

trainable model parameters by a large factor and also

observe incremental improvements in terms of classi-

ﬁcation accuracy.

Contributions. The main contribution of this paper is the use of a hybrid combination of HOG and SURF feature descriptors as input to a CNN classifier for accurate classification of traffic signs. We perform an extensive comparative study of the combined use of these two very different feature descriptors and of how they complement each other to improve prediction results by a substantial margin while keeping the number of trainable parameters small.

2 RELATED WORK

HOG descriptors with Support Vector Machine

(SVM) classiﬁer have been used for classiﬁcation pur-

poses earlier (Blauth et al., ). Even though these fea-

tures show better performance in characterizing ob-

ject shape and appearance, there is a drawback of this

approach that it is restricted to binary classiﬁcation as

the SVM determines the optimal hyperplane, separat-

ing two classes in the dataset. The use of SVM in

a multinomial classiﬁcation problem thus becomes a

case of one-versus-all, in which the positive class is

the class with the highest score whereas the rest rep-

resent the negative class. In other words, we need

to train N-SVMs for an N-class classiﬁer, whereas an

N-class classiﬁer neural network can be trained in one

go. Also, the neural network can generalize in a bet-

ter manner as it acts like one whole system whereas

SVMs are isolated systems.

SURF descriptors with CNN have been previously

used by (Elmoogy et al., 2018) for solving the prob-

lem of indoor localization. CNNs are capable of ex-

tracting high-level features but require high dimen-

sional optimization procedure due to which the train-

ing time is signiﬁcantly long. On the other hand,

image descriptors obtain features from the image

through deterministic means which have much higher

speed than CNN. The disadvantage of the descriptor

is that the output feature size is generally large com-

pared to CNN. In this approach, they combined both

the feature descriptor and the CNN, by ﬁrst using an

image descriptor to extract features from the image

and then using the CNN to reduce the dimension of

the feature. SURF features alone are not able to rep-

resent the geometric property of the image. This com-

plication is dealt with using combined HOG features

with SURF.

(Abedin et al., 2016) have taken a similar ap-

proach of using hybrid features for trafﬁc sign clas-

siﬁcation. They have used SURF and HOG feature

descriptors with Artiﬁcial Neural Network (ANN) for

classiﬁcation but have not shown their methodology

regarding how they are combining both descriptors.

They do not provide any valid reasons for the use of

this combination and do not explain why using indi-

vidual HOG, or SURF descriptors would not work.

Also, they have tested their approach for classifying

just 4 different trafﬁc signs on a very small dataset

which makes their approach unreliable in case of large

and complex datasets. In contrast, we establish proper

reasoning and show detailed experiments on 43 dif-

ferent trafﬁc signs, concerning the use of the hybrid

feature. Moreover, a CNN is used instead of a Multi-

layer Perceptron (MLP). CNN works particularly well

ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods


on data having a spatial relationship and is thus ideal

for the task of trafﬁc sign classiﬁcation.

3 METHOD

In this paper, a method involving the use of multiple

feature descriptors with CNN for image classiﬁcation

is introduced. This method is applied to the problem

of traffic sign classification. The procedure begins by generating the HOG descriptor for every image in the dataset, followed by extraction of SURF features that have high saliency, and finally feeding both sets of features to a convolutional neural network for prediction.

3.1 Feature Extraction with HOG

HOG is a dense feature extraction method. It captures the shape of structures through the distribution of intensity gradients or edge directions. HOG builds histograms of gradient directions over small cells of pixels and concatenates them to form the descriptor.

HOG features are based on magnitude distribu-

tions and gradient angles. Due to this, they have a nat-

ural invariance to changes in lighting conditions and

colour variation. These make them robust in visual

data. It involves ﬁrst computing the gradients of all

pixels. For an image $I$, the gradient estimation filters are $H_x = [-1, 0, 1]$ and $H_y = [-1, 0, 1]^T$. Let $G_x$ and $G_y$ be the gradient matrices generated by

$$G_x = I * H_x \quad \text{and} \quad G_y = I * H_y \tag{1}$$

where $*$ is the convolution. The gradient value at each pixel can be calculated as

$$g(i, j) = \sqrt{G_x(i, j)^2 + G_y(i, j)^2} \tag{2}$$

and the dominant gradient direction at each pixel can be estimated by

$$\theta(i, j) = \tan^{-1}\frac{G_y(i, j)}{G_x(i, j)} \tag{3}$$
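The three equations above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper (which relies on OpenCV): the function name `hog_gradients`, the list-of-rows image format and the edge-replicating border handling are all assumptions made for the example.

```python
import math

def hog_gradients(image):
    """Return per-pixel gradient magnitude and unsigned orientation in degrees.

    `image` is a list of rows of grayscale intensities. Border pixels are
    handled by replicating the edge value (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    magnitude = [[0.0] * cols for _ in range(rows)]
    angle = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Centred difference with the [-1, 0, 1] filter along x and y.
            # The sign convention (correlation vs. convolution) changes
            # neither the magnitude nor the unsigned orientation.
            gx = image[i][min(j + 1, cols - 1)] - image[i][max(j - 1, 0)]
            gy = image[min(i + 1, rows - 1)][j] - image[max(i - 1, 0)][j]
            magnitude[i][j] = math.hypot(gx, gy)                      # Eq. (2)
            angle[i][j] = math.degrees(math.atan2(gy, gx)) % 180.0    # Eq. (3), folded into [0, 180)
    return magnitude, angle
```

Using `atan2` instead of a plain arctangent of the ratio avoids a division by zero when $G_x(i, j) = 0$; folding the result into $[0°, 180°)$ matches the unsigned orientation range used for the histogram channels.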

This is followed by creating cell histograms. Each pixel in a cell casts a weighted vote for an orientation-based histogram channel, with the weight given by the gradient value computed earlier. The channels are spread evenly over 0° to 180°. The cells are grouped into larger, spatially connected blocks and normalized locally, providing invariance to changes in illumination and contrast. The normalized cell histograms are concatenated to form the HOG feature vector.
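The voting and normalization steps can be illustrated as follows. This is a simplified sketch under stated assumptions: votes are assigned to a single bin (library implementations typically interpolate between neighbouring bins), the block normalization is a plain L2 norm, and the helper names `cell_histogram` and `l2_normalize` are hypothetical.

```python
import math

def cell_histogram(magnitudes, angles, num_bins=9):
    """Accumulate magnitude-weighted votes into unsigned-orientation bins.

    `magnitudes` and `angles` are flat lists covering the pixels of one cell,
    with angles in degrees in the range [0, 180].
    """
    bin_width = 180.0 / num_bins
    hist = [0.0] * num_bins
    for mag, ang in zip(magnitudes, angles):
        # Hard assignment to one bin; 180 degrees wraps around to bin 0.
        hist[int(ang // bin_width) % num_bins] += mag
    return hist

def l2_normalize(block, eps=1e-6):
    """L2-normalise the concatenated cell histograms of one block."""
    norm = math.sqrt(sum(v * v for v in block) + eps ** 2)
    return [v / norm for v in block]
```

Because each block is divided by its own norm, multiplying all gradient magnitudes in the block by a constant (a contrast change) leaves the normalized histogram essentially unchanged, which is the invariance the paragraph above refers to.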

Figure 1: Input images in column (a) and their corresponding SURF and HOG features in columns (b) and (c).

3.2 Feature Extraction with SURF

SURF is an interest point detector and descriptor. It

is scale and rotation invariant and thus is more reli-

able for practical purposes since camera feed mostly

provides tilted and scaled trafﬁc signs.

To calculate the orientation of a point, SURF uses wavelet responses in the horizontal and vertical directions within a neighbourhood of size 6s, where s is the scale of the interest point. The sum of all responses within a sliding orientation window of angle 60° is calculated, which gives the dominant orientation. The detector is based on the determinant of the Hessian matrix.

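Given a point $\mathbf{x} = (x, y)$ in an image $I$, the Hessian matrix $\mathcal{H}(\mathbf{x}, \sigma)$ in $\mathbf{x}$ at scale $\sigma$ is defined as follows

$$\mathcal{H}(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \tag{4}$$

where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the second order derivative of the Gaussian with the image $I$ in point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$.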
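As a short worked step, the response that a Hessian-based detector thresholds can be written out from Eq. (4). The first line follows directly from the matrix. The second line is the box-filter form given in the SURF paper (Bay et al., 2006); the symbols $D_{xx}$, $D_{yy}$, $D_{xy}$ (box-filter approximations of the Gaussian derivatives) and the balancing weight $w$ do not appear elsewhere in this text and are introduced here only for illustration.

```latex
\begin{align}
\det \mathcal{H}(\mathbf{x}, \sigma)
  &= L_{xx}(\mathbf{x}, \sigma)\, L_{yy}(\mathbf{x}, \sigma) - L_{xy}(\mathbf{x}, \sigma)^{2} \\
\det \mathcal{H}_{\text{approx}}
  &= D_{xx}\, D_{yy} - \left( w\, D_{xy} \right)^{2}, \qquad w \approx 0.9
\end{align}
```

Interest points are kept where this response exceeds a threshold, which is the quantity called the Hessian threshold in Algorithm 1.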

SURF uses box filters as an approximation to the Laplacian of Gaussian (LoG) for computational advantages. The descriptor, on the other hand, describes a distribution of Haar-wavelet responses within the interest point neighbourhood. SURF is sensitive to lighting conditions and image distortions, so the number of feature vectors it produces varies across the dataset. To ensure a fixed dimensionality of the SURF feature vectors, the descriptors were clustered using the K-Means algorithm.

Traffic signs comprise simple geometric patterns. Using only SURF BoW descriptors for classification will not work, as SURF BoW does not store the information related to geometric patterns. HOG features, on the other hand, identify these simple geometric patterns very well. But there is one potential problem with using only HOG features: they are highly invariant to changes in lighting conditions but are strongly affected by orientation changes of the sign and do not capture local features. In the real world, images of traffic signs come from different viewpoints and may also be slightly distorted, so HOG alone cannot accurately classify traffic signs. For this purpose, SURF BoW descriptors are used, which are rotation invariant and capture local features. HOG detects the basic geometry of the sign, while the SURF BoW descriptors complement HOG by making the representation more robust to changes in orientation and by capturing local features.

Algorithm 1: High saliency SURF feature extraction.

    procedure SURF_GENERATE(image, num_clusters)
        H ← 0
        while True do
            SURF(H)
            m = f(image)
            if num_feature < ε_min then
                skip the image
            if num_feature > ε_max then
                H ← H + α
            else
                break
        SURF(H)
        m = f(image)
        KMEANS(m, num_clusters)    ▷ Cluster m along x axis
        return cluster centers

Here H represents the Hessian threshold for SURF features and α is the step size with which H is increased. ε_min and ε_max represent the minimum and maximum thresholds on the number of features along the x-axis; once the feature count falls between them, the loop breaks and clustering is done.
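The threshold search in Algorithm 1 can be sketched in Python as below. To keep the example self-contained, the SURF detector is abstracted into a hypothetical callable `count_features(threshold)` that returns the number of keypoints found at a given Hessian threshold; the function name, the `max_steps` safety bound and the use of `None` to signal a skipped image are assumptions of this sketch, not details from the paper.

```python
def select_hessian_threshold(count_features, eps_min, eps_max, alpha, max_steps=1000):
    """Raise the Hessian threshold until the feature count is at most eps_max.

    `count_features(threshold)` is a hypothetical stand-in for "run SURF at
    this threshold and count the keypoints". Returns the chosen threshold, or
    None when the image yields fewer than eps_min features and is skipped.
    """
    threshold = 0.0
    for _ in range(max_steps):          # bounded stand-in for `while True`
        n = count_features(threshold)
        if n < eps_min:
            return None                 # too few salient points: skip the image
        if n > eps_max:
            threshold += alpha          # stricter threshold keeps stronger responses only
        else:
            return threshold            # count lies within [eps_min, eps_max]
    raise RuntimeError("threshold search did not converge")
```

A call such as `select_hessian_threshold(lambda t: max(0, 100 - int(t / 10)), 5, 20, 50.0)` uses a toy detector whose feature count falls as the threshold rises. In a real pipeline the callable would wrap the SURF detector, and the descriptors computed at the returned threshold would then be passed to the clustering step.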

3.3 Branched Pipeline

Initially, a generic CNN architecture is used as a clas-

siﬁer. This classiﬁer was directly fed HOG features

appended with SURF. While it achieves good results

on the GTSRB test set, it has a large number of train-

able parameters and takes a lot of time for training. To

solve this problem, a unique branched CNN architec-

ture is proposed. This model uses HOG features and

SURF as input to two different branches that consist

of convolutional blocks similar to the generic CNN.

The embeddings obtained from these two branches

are concatenated and further passed through fully

connected layers. This reduces the model parame-

ters, thus reducing computational cost. Additionally,

improvement is also observed in the accuracy on the

test set. This can be attributed to the fact that differ-

ent convolutional ﬁlters are used for the two different

kinds of feature descriptors.

For a detailed analysis of the results of the descriptors,

refer to the Analysis Section 5.

4 EXPERIMENTS

4.1 Dataset

The German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2011) has been used to perform all the experiments. It consists of 39,209 training images spread across 43 classes. The distribution of classes is highly skewed, ranging from around 200 images in one class up to 2000 in another. The dataset has been created by extracting image frames from 1-second video sequences. A single sequence of 30 images usually contains images of the same traffic sign at increasing size. It is therefore important to employ a proper strategy to create a meaningful training dataset. The technique used by (Sermanet and LeCun, 2011) is used to tackle this issue.

To bring in uniformity, all classes are augmented by applying a random brightness change, a random rotation and a random distortion to each image, and images of classes that are invariant to horizontal, vertical or 180° flips are additionally flipped. This creates a dataset of over 130k images in which the distribution across classes is still skewed. To tackle this problem, further augmentation is done in selected classes with fewer than 1000 images by applying a random translation to them. The number of augmented images in the other classes was reduced to around 1200. After applying HOG and SURF to the images, this creates a comparatively uniform distribution (Figure 2) with 53,238 images in the dataset.
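The flip-based part of the augmentation can be illustrated with a small sketch. The text does not list which classes are invariant to which flips, so the mapping is left to the caller; the helper names and the list-of-rows image format are assumptions made for this example.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return image[::-1]

def rotate_180(image):
    """Rotate by 180 degrees, which equals applying both mirrors."""
    return flip_vertical(flip_horizontal(image))

def flip_augment(image, label, invariant):
    """Return extra (image, label) pairs for the flips allowed for this class.

    `invariant` maps a class label to the flip functions under which signs of
    that class keep their meaning (a hypothetical, caller-supplied mapping).
    """
    return [(flip(image), label) for flip in invariant.get(label, [])]
```

Only label-preserving flips should be listed for a class: a sign whose meaning depends on direction would otherwise receive a wrong label after mirroring.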

Figure 2: No. of training images of each class.

The HOG and SURF vectors are computed on each image (refer to the Appendix for function parameter values). The 64-dimensional SURF descriptors of each image are clustered into 8 clusters using KMeans. This gives us a 1x1764 sized HOG vector and a 1x512 sized SURF vector.
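The fixed 512-dimensional SURF vector follows from 8 cluster centres of 64 dimensions each. The sketch below illustrates this with a plain Lloyd's k-means written from scratch; the paper only states that KMeans was used, so the initialisation, iteration count and the helper names `kmeans` and `surf_bow_vector` are assumptions.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centres as lists of floats."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each descriptor to its nearest centre (squared Euclidean distance).
            nearest = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres

def surf_bow_vector(descriptors, num_clusters=8):
    """Flatten the cluster centres into one fixed-length feature vector."""
    return [v for centre in kmeans(descriptors, num_clusters) for v in centre]
```

Whatever the number of keypoints found in an image, the output length is `num_clusters` times the descriptor dimension, which is what allows the SURF branch to receive an input of constant shape 8x64.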

4.2 Network Architectures

We broadly describe the architectures used for performing all the experiments. For complete implementation details, refer to the Appendix.

4.2.1 Basic CNN Architecture

Figure 6 (in the Appendix) shows the preliminary CNN architecture that is used as a classifier with the extracted features as input. The network comprises two convolutional blocks, where each block consists of a batch normalization layer, a ReLU activation and a max pooling layer. The embedding obtained after passing the input features through these blocks is flattened and fed to fully connected layers, and the predicted probabilities for all 43 classes are returned by the final layer.

4.2.2 Branched CNN Architecture

The basic CNN model suffers from a large number of model parameters. To improve its performance on computationally limited resources and reduce the model parameters, a branched CNN architecture (Figure 7) is used, where the input HOG and SURF features are fed separately into two different branches, each consisting of two convolutional blocks similar to the ones used in the basic architecture. The embeddings obtained from these two branches are flattened, concatenated and passed through fully connected layers to output the individual probabilities for each class. The HOG branch is fed input HOG features of shape 7x252, and the SURF branch is fed input of shape 8x64. This method of branching allows us to apply convolutions to the two different types of feature maps separately. As can be seen in Table 1, the number of model parameters is reduced by a factor of about 5.4 when using the branched CNN model.

Table 1: Comparison of number of model parameters.

| Model Architecture | No. of Parameters |
|---|---|
| Basic CNN | 8,543,467 |
| Branched CNN | 1,583,947 |
| 3-Conv-No-STN | 7,303,883 |
| 3-Conv-3-STN | 14,629,801 |
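The relative model sizes implied by Table 1 can be checked with a few lines of arithmetic. The parameter counts are copied from the table; the dictionary and function names are only for illustration.

```python
# Parameter counts as listed in Table 1.
params = {
    "Basic CNN": 8_543_467,
    "Branched CNN": 1_583_947,
    "3-Conv-No-STN": 7_303_883,
    "3-Conv-3-STN": 14_629_801,
}

def reduction_factor(model, baseline="Branched CNN"):
    """How many times more parameters `model` has than the baseline."""
    return params[model] / params[baseline]

for name in params:
    print(f"{name}: {reduction_factor(name):.2f}x the branched model")
```

The branched model is therefore several times smaller than the basic CNN and roughly an order of magnitude smaller than the 3-Conv-3-STN model.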

Figure 3: Comparison of subset accuracies.

Figure 4: Comparison of test accuracy for the experiments.

5 RESULTS

To validate the use of multiple feature descriptors with a CNN based architecture, the basic CNN architecture is used first. Upon experimenting with a branched CNN architecture, better accuracy is observed than with the basic architecture. Figure 4 shows the test accuracies of the different models during training, and Table 3 shows the accuracy obtained on the test dataset. Figure 3 compares the subset accuracies for the different experimental models. The subsets consist of “Danger” signs, “Speed” signs, “End-of” signs, “Blue” signs, “Red round” signs, “Red other” signs and “Spezial” signs, as defined in the GTSRB dataset (Stallkamp et al., 2011). The branched CNN architecture with multiple features outperforms the HOG-SVM based approach. The model performance is compared with (García et al., 2018), which reports state of the art results on the GTSRB dataset. Their base model is denoted 3-Conv-No-STN, i.e. a CNN without spatial transformer networks (STN), and their best model 3-Conv-3-STN, i.e. a CNN with three STNs. Table 1 and Table 3 show that, even though the state of the art method achieves slightly better accuracy, its large number of trainable parameters makes it considerably more computationally intensive at training and test time than the proposed method.


Table 2: Analysis of false positive (FP) counts for some classes with high FP counts. The “Class Most Confused With” column is the class that is predicted with the maximum FP count. The class indices correspond to the GTSRB dataset.

| Class Index | Total FP (HOG) | Total FP (HOG + SURF) | Class Most Confused With (HOG) | Corresponding FP (HOG) | Corresponding FP (HOG + SURF) |
|---|---|---|---|---|---|
| Pedestrians Possible | 40 | 5 | Danger Point | 18 | 0 |
| Road narrows on the right | 72 | 1 | Danger Point | 45 | 0 |
| End of (truck) | 45 | 2 | End of (car) | 44 | 1 |

Table 3: Test set accuracy for different classifiers and inputs.

| Classifier | Input Data | Accuracy (%) |
|---|---|---|
| Basic CNN | HOG | 91.09 |
| Basic CNN | SURF | 77.41 |
| SVM | HOG | 96.93 |
| Basic CNN | HOG + SURF | 98.07 |
| Branched CNN | HOG + SURF | 98.48 |
| 3-Conv-No-STN | RGB Image | 98.81 |
| 3-Conv-3-STN | RGB Image | 99.49 |

Using only HOG features as input yields an accuracy close to 90% (91.09% for the basic CNN in Table 3). After checking the model outputs on classes having low test accuracy, it is observed that the model confuses them with traffic signs that have a similar shape, because it does not distinguish well enough between the drawings placed inside the shape. This supports the hypothesis that HOG features lack information about local features. Experiments are also performed by feeding only the SURF features to the basic model, and a poor accuracy of 77.41% is observed on the test set. Using both HOG and SURF with the branched CNN model outperforms all the baselines that use feature descriptors as input and achieves accuracy comparable to the state of the art method.

Analysis. Refer to Table 2 for the following analysis. It corresponds to the classes represented by the images in Figure 5. With reference to this figure, it is evident that HOG features capture the overall shape of an object very well but miss the intricate details embedded in the shape. This property causes the HOG-only model to confuse the images in the left column with the images in the right column, as is evident in Table 2. SURF, on the other hand, describes localized key-points embedded in the shapes. Upon using HOG and SURF together, the SURF features captured what sets the “Danger Point” sign apart (Figure 5) and brought the total FP for “Pedestrians Possible” down from 40 to 5. Similar improvements were observed for “Road narrows on the right” (row 2, Figure 5). SURF also made it possible to distinguish between the two different “End of” signs shown in Figure 5, bringing the corresponding FP count down from 44 to just 1. As can be seen, the two classes are very similar in appearance and only a few unique key points differentiate them, which SURF captured successfully.

Figure 5: Left - Classes with maximum false positives with only HOG features (Ground Truth Images). Right - Predicted class with only HOG features.

6 CONCLUSION

We proposed a method that concurrently uses the HOG and SURF features of an image with a CNN based architecture for the classification of traffic signs in the GTSRB traffic sign dataset. The proposed pipeline using the basic architecture achieved an accuracy of 98.075%. The performance of the approach is further enhanced by using a branched CNN architecture, which achieves an accuracy of 98.48% on the test set. The advantages of using two different feature descriptors, HOG and SURF, together are examined against previous works that use either an RGB image or the above-mentioned features individually. The experiments show that using these pre-computed hybrid features with a CNN achieves slightly lower accuracy than the state of the art method but requires far fewer parameters, which substantially reduces computational resource usage. In future work, we intend to combine the proposed technique of traffic sign classification with existing region proposal networks (RPN) to perform efficient real-time detection of traffic signs. We also intend to fine tune this method for classifying Indian traffic signs.

REFERENCES

Abedin, M. Z., Dhar, P., and Deb, K. (2016). Trafﬁc sign

recognition using hybrid features descriptor and ar-

tiﬁcial neural network classiﬁer. In 2016 19th In-

ternational Conference on Computer and Information

Technology (ICCIT), pages 457–462.

Bay, H., Tuytelaars, T., and Van Gool, L. (2006). Surf:

Speeded up robust features. In Leonardis, A., Bischof,

H., and Pinz, A., editors, Computer Vision – ECCV

2006, pages 404–417, Berlin, Heidelberg. Springer

Berlin Heidelberg.

Blauth, M., Kraft, E., Hirschenberger, F., and Böhm, M. Large-scale traffic sign recognition based on local features and color segmentation.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-

dients for human detection. In Computer Vision and

Pattern Recognition, 2005. CVPR 2005. IEEE Com-

puter Society Conference on, volume 1, pages 886–

893. IEEE.

Elmoogy, A. M., Dong, X., and Lu, T. (2018). Cnn : a

descriptor enhanced convolutional neural network.

Fei-Fei, L., Fergus, R., and Perona, P. (2004). Learning gen-

erative visual models from few training examples: An

incremental bayesian approach tested on 101 object

categories. In 2004 Conference on Computer Vision

and Pattern Recognition Workshop, pages 178–178.

García, Á. A., Álvarez, J. A., and Soria-Morillo, L. M. (2018). Deep neural network for traffic sign recognition systems: An analysis of spatial transformers and stochastic optimisation methods. Neural Networks: the official journal of the International Neural Network Society, 99:158–165.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep

residual learning for image recognition. CoRR,

abs/1512.03385.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems, pages 1097–1105.

Schmitt, D. and McCoy, N. (2011). Object classiﬁcation

and localization using surf descriptors.

Sermanet, P. and LeCun, Y. (2011). Trafﬁc sign recogni-

tion with multi-scale convolutional networks. In Neu-

ral Networks (IJCNN), The 2011 International Joint

Conference on, pages 2809–2813. IEEE.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C.

(2011). The german trafﬁc sign recognition bench-

mark: a multi-class classiﬁcation competition. In Neu-

ral Networks (IJCNN), The 2011 International Joint

Conference on, pages 1453–1460. IEEE.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C.

(2012). Man vs. computer: Benchmarking machine

learning algorithms for trafﬁc sign recognition. Neu-

ral Networks, (0):–.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E.,

Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-

novich, A. (2014). Going deeper with convolutions.

CoRR, abs/1409.4842.

APPENDIX

HOG features and SURF are computed using the

OpenCV library. The parameters used during HOG

computation are window size = (32,32), block size =

(8,8), block stride = (4,4), cell size = (4,4), number of

bins = 9. All experiments were performed using Ten-

sorﬂow. Dropout was used in all convolutional layers

with a probability of 0.6. Adam Optimizer, with an

initial learning rate of 1e-4 and default parameters,

was used.
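The HOG parameters listed above determine the descriptor length. The short calculation below, with a hypothetical helper name and assuming square windows, blocks and cells, shows how they lead to the 1764-dimensional HOG vector and its 7x252 reshaping used as network input.

```python
def hog_descriptor_length(window, block, stride, cell, bins):
    """Length of a HOG descriptor for square window, block, stride and cell sizes."""
    blocks_per_side = (window - block) // stride + 1   # block positions along one axis
    cells_per_block = (block // cell) ** 2             # cells inside one block
    return blocks_per_side ** 2 * cells_per_block * bins

# Parameters from the Appendix: 32x32 window, 8x8 block, 4x4 stride, 4x4 cell, 9 bins.
length = hog_descriptor_length(window=32, block=8, stride=4, cell=4, bins=9)
```

With these values there are 7 block positions per axis, 4 cells per block and 9 bins per cell, so the descriptor has 7 × 7 × 4 × 9 = 1764 entries, which can be arranged as 7 rows of 252 values.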

Basic Architecture

The basic architecture was implemented entirely in

Tensorﬂow. With reference to Figure 6, Conv C K S

refers to a convolution layer with ’C’ output channels,

’KxK’ kernel size and an ’SxS’ strided convolution.

All Pool layers are Max-Pool layers with a kernel of

2x2 and stride of 2x2. This reduces the input size by a

factor of 2 at each stage. A ReLU activation is applied

after each block. We implement Batch Normalization

after each convolutional block with the default scale

and shifting parameters in tensorﬂow. Batch Normal-

ization is not applied to fully connected layers. The

output of the network is the Softmax activation prob-

abilities over the 43 classes of the GTSRB dataset.

The loss function used is as follows:

$$L(y, y') = -\sum_i y_i \log(y'_i) \tag{5}$$

where $y$ are the labels for classification and $y'$ are the predictions made (the softmax outputs computed from the logits). This represents the cross-entropy loss for multiple classes.
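The loss can be illustrated for a single sample as follows. This is a plain Python sketch, not the TensorFlow code used for the experiments; it assumes one-hot labels and applies a softmax to the logits before taking the logarithm, and the function names are chosen for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)                       # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, logits):
    """Cross-entropy -sum_i y_i * log(y'_i), with y' = softmax(logits)."""
    probs = softmax(logits)
    # Terms with y_i = 0 contribute nothing, so they are skipped.
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)
```

For a one-hot label the sum reduces to the negative log-probability assigned to the correct class, so the loss approaches zero as that probability approaches one.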

Branched CNN Architecture

The Branched Architecture was implemented in a

similar fashion to the Basic Architecture with similar

default parameters. Batch Normalization was applied

only after each convolutional block and not on any

fully connected layer.


Figure 6: Basic CNN Architecture.

Figure 7: Branched CNN Architecture.
