Evaluating the Effects of Convolutional Neural Network Committees
Fran Jurišić, Ivan Filković and Zoran Kalafatić
University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia
Keywords:
Convolutional Neural Network, Committee, Ensemble.
Abstract:
Many high-performing deep learning models for image classification put their base models in a committee as
a final step to gain a competitive edge. In this paper we focus on that aspect, analyzing how committee size
and the makeup of models trained with different preprocessing methods impact final performance. Working with
two datasets that represent rigid and non-rigid object classification, the German Traffic Sign Recognition
Benchmark (GTSRB) and CIFAR-10, and two preprocessing methods in addition to the original images, we report
performance improvements and compare them. Our experiments cover committees trained on just one dataset
variation as well as hybrid ones, the unreliability of small committees of low-error models, and performance
metrics specific to the way committees are built. We point out some guidelines for predicting committee behavior
and good approaches to analyzing their impact and limitations.
1 INTRODUCTION
Convolutional neural networks (CNNs) have become
one of the most used approaches for various computer
vision problems, with especially notable results in im-
age classification (Krizhevsky et al., 2012; Simonyan
and Zisserman, 2014; Szegedy et al., 2015). Chal-
lenges related to object detection, classification and
segmentation frequently receive many CNN submis-
sions, and it is not uncommon for such approaches to
hold state-of-the-art results, more so for large scale
problems with vast amounts of data (Russakovsky
et al., 2014). To achieve highly competitive re-
sults, just coming up with new model architectures
is no longer enough. Today’s models push the lim-
its of hardware capacity, can take weeks to train, and
are carefully fine-tuned for that last push to achieve
a state-of-the-art result. While deep models distinguish
themselves by being able to learn high-level abstract
representations from data alone, they come with many
minute design parameters. Those parameters can be
set manually with reasonable effort for decent results,
but must be carefully tuned to push the model to its limits.
To get more out of deep models some top scoring
results additionally use image preprocessing methods
and organize multiple trained models into committees
or ensembles (Ciresan et al., 2012; Jin et al., 2014).
The rationale is that committees smooth out the decision
function, giving a boost to correct classifications by
eliminating outliers from individual trained models,
while different preprocessing methods can emphasize
distinguishing object features.
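As a minimal sketch of this averaging scheme (the array names and shapes below are illustrative, not an implementation taken from the cited works), a committee prediction can be formed by averaging the per-class scores of the individual models and taking the arg max:

```python
import numpy as np

def committee_predict(score_sets):
    """Average per-class scores of several trained models and return the
    predicted label for each example. `score_sets` is a list of arrays of
    shape (n_examples, n_classes), one per model (illustrative layout)."""
    avg_scores = np.mean(np.stack(score_sets, axis=0), axis=0)
    return np.argmax(avg_scores, axis=1)
```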
In this paper we focus on evaluating those as-
pects and how they impact baseline results. We work
with two datasets: the German Traffic Sign Recognition
Benchmark (GTSRB) (Stallkamp et al., 2011) and
CIFAR-10 (Krizhevsky, 2009), which present different
challenges, such as rigid and non-rigid object classes,
and we use well-known models for each. We evaluate
models trained on original images and two prepro-
cessing methods, combined into homogeneous and
hybrid committees. Our results show some fine de-
tails about the work of committees, and point out
good practices and possible pitfalls. To help better
understand the performance impact committees have,
we introduce novel metrics (to the best of our knowledge),
specific to the ways committees are assembled,
and distinguish misclassifications that exemplify
committee limitations.
The remainder of this paper is organized as fol-
lows. In Section 2 an overview of previous work
is given. The publicly available traffic sign classification
dataset GTSRB and the general visual object classification
dataset CIFAR-10 are outlined in Subsection 3.1, and the
CNN-based models we use are presented in the same section.
Committee experiments are described in Section 4, together
with a presentation of our performance metrics. Finally,
we discuss the experimental results and provide a conclusion (Section 5).
2 RELATED WORK
Work by (Krizhevsky et al., 2012) and (Szegedy et al.,
2015) on the ImageNet dataset (Russakovsky et al., 2014)
exemplified the true power of CNNs for general visual
object classification. While shallow learning is based
on the extraction of hand-crafted features and involves a
lot of painstaking work and human insight into the nature
of the data, CNNs automatically extract multi-scale
features that are most discriminative for the given problem.
One of the first uses of committees involving
convolutional neural networks paired them with a
multi-layer perceptron trained on HOG/HAAR fea-
tures (Ciresan et al., 2011) for the purpose of traffic
sign classification in the German Traffic Sign Recognition
Benchmark (Stallkamp et al., 2011). The same
paper also experiments with several preprocessing
methods to help with the sometimes low quality of
source images. Their two-model MLP/CNN
committee had a 99.15% recognition rate compared to
98.73% for the best single CNN. In a continuation of
this work, committees consisting purely of CNNs with
various preprocessing methods were used on several
datasets (Ciresan et al., 2012). In the case of the afore-
mentioned GTSRB dataset, the final model consisted
of five trained models each for the original images and four
preprocessing methods, resulting in a committee of
25 CNNs and a 99.46% recognition rate.
A further application of neural network committees
(Jin et al., 2014) on GTSRB improved the recognition
rate to 99.65% by training, with hinge loss, five models
each for three preprocessing methods and the original
images, and putting them in an ensemble of 20
CNNs, wherein the individual networks had average
recognition rates of 98.96 ± 0.20%.
On the large-scale ImageNet dataset (Russakovsky
et al., 2014), most of the high-scoring models em-
ploy an ensemble of trained models in a last attempt
to further reduce error rates and push state-of-the-
art results. The main source of deep learning fame,
AlexNet (Krizhevsky et al., 2012) swept the ILSVRC-
2012 challenge with its single CNN achieving 18.2%
top-5 error rate, while averaging five such models
brought the error rate down to 16.4%. Adding two
more CNNs pre-trained on Fall 2011 release of the
ImageNet dataset to the five CNN ensemble further
reduced the top-5 error to 15.3%.
The ILSVRC-2013 challenge attracted many
more deep learning submissions. Again the winning
approach of classification challenge used averaging of
several CNN models (Zeiler and Fergus, 2014). In the
final submission a single CNN model achieved top-
5 error rate of 12.15% while an average of multiple
models brought it down to 11.7%. In their paper they
also give error rates for models trained only on 2012
dataset as 16.5% (a) and 16.0% (b) for single CNNs,
and 14.8% for a 5×(a) & 1×(b) ensemble.
At the ILSVRC-2014 challenge, deep learning gave
even better error rates with even deeper models.
GoogleNet (Szegedy et al., 2015) won the classifica-
tion challenge with 6.67% top-5 error using an en-
semble of seven trained single CNNs and a large num-
ber of crops per image, where crops are sub-windows
sampled from the original source image.
The single CNN/single crop model had the base
10.07% top-5 error, while single CNN/144 crops
had 7.89% and seven CNN/single crop ensemble had
8.09% top-5 errors. In the same competition, the
VGG model (Simonyan and Zisserman, 2014), which
won the localization challenge, also achieved impres-
sive classification results with 7.5% top-5 error rate
using an ensemble of seven different deep architec-
tures. They successfully brought it down to 6.8%
post-submission by averaging two models and uti-
lizing multi-crop and dense evaluation similar to the
GoogleNet submission. Their best single CNN model
had a 7.5% top-5 validation error. From all these results
we can see that final score-optimizing methods
such as model ensembles, preprocessing and
dense cropping do not give enough of an improvement to
match the baseline results of newer-generation methods,
but they do make the main difference between top scorers,
where model resource efficiency is heavily traded for
a small but important boost in recognition rates.
3 METHODOLOGY
3.1 Datasets and Models
3.1.1 CIFAR-10
CIFAR-10 (Krizhevsky, 2009) is a dataset consisting
of color images of resolution 32 × 32 pixels labeled
into 10 significantly distinct categories. The training
set has 50000 and the test set has 10000 images. The main
challenges this dataset presents are the small image resolution
and the non-rigid but visually distinct categories.
The model we use (Krizhevsky et al., 2012) has four
convolutional layers, each followed by a ReLU acti-
vation, pooling and local response normalization in
feature maps. The architecture is shown in Figure 1
and its definition is also readily available in frame-
works such as cuda-convnet2 (Krizhevsky, 2014)
and Caffe (Jia et al., 2014).
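As an illustrative sketch only (the exact layer configuration is given in Figure 1 and in the framework definitions; the filter counts, kernel sizes and the use of PyTorch below are assumptions made for brevity, since the actual training here used cuda-convnet2/Caffe), the described pattern of four convolution–ReLU–pooling–normalization blocks can be written as:

```python
import torch.nn as nn

class CifarNetSketch(nn.Module):
    """Illustrative only: four conv blocks (conv -> ReLU -> pool -> LRN)
    followed by a linear classifier, mirroring the description in the text.
    Filter counts and kernel sizes are assumptions, not the exact
    configuration of Figure 1."""
    def __init__(self, num_classes=10):
        super().__init__()
        channels = [3, 32, 32, 64, 64]  # assumed feature map counts
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=5, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
                nn.LocalResponseNorm(size=5),
            ]
        self.features = nn.Sequential(*blocks)
        # with 32x32 inputs and four stride-2 poolings, the maps end up 2x2
        self.classifier = nn.Linear(64 * 2 * 2, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))
```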
Figure 1: CifarNet (Krizhevsky et al., 2012) model archi-
tecture and our variant of the model used by (Ciresan et al.,
2012).
3.1.2 GTSRB
The German Traffic Sign Recognition Bench-
mark (Stallkamp et al., 2011) introduced a classifi-
cation dataset of German traffic signs extracted from
annotated videos. It has 43 classes and presents a
rigid object classification problem with fine-grained
classes, as many traffic signs are very similar at small
resolutions. The class distribution is imbalanced,
in accordance with the occurrence of signs in the real world.
The dataset from the final phase of the competition has a
total of 1728 different physical traffic signs organized
in tracks of 30 images of increasing resolution per
physical sign, resulting in 39210 training and 12630
testing images. On this dataset we use a slightly modi-
fied model (Figure 1) from (Ciresan et al., 2012), with
added ReLU activations and dropout during training
after the first fully connected layer. Additionally,
during training we enlarge the original annotations to
53 × 53 and then take random crops of 48 × 48, which
are the input dimensions of the network. GTSRB contains
quite a few images where the annotated box is not
tightly fitted to the traffic sign, so the additional
translational and scale variance helps.
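A minimal sketch of this augmentation step, assuming OpenCV for resizing and leaving the interpolation mode and random-number details as illustrative choices:

```python
import numpy as np
import cv2

def random_crop_48(sign_image, rng=np.random):
    """Enlarge a cropped traffic sign annotation to 53x53, then take a
    random 48x48 crop, as described in the text. The interpolation
    choice is an assumption."""
    enlarged = cv2.resize(sign_image, (53, 53), interpolation=cv2.INTER_LINEAR)
    max_offset = 53 - 48                  # 5 pixels of translation jitter
    dx = rng.randint(0, max_offset + 1)
    dy = rng.randint(0, max_offset + 1)
    return enlarged[dy:dy + 48, dx:dx + 48]
```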
3.2 Preprocessing Methods
Figure 2: Examples of images from CIFAR-10 and GTSRB
in order: original, pixel intensity equalization, non-local
means denoising.
In addition to working with original images, we
also use histogram equalization of pixel intensities
and non-local means denoising to pre-process images,
with examples for both datasets shown in Figure 2.
Both methods were chosen for being fairly simple and
available in most image processing libraries. Finding
optimal image preprocessing methods is out of
the scope of this paper; rather, we aim to evaluate
how the additional varied information they provide
impacts the performance of a committee.
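A minimal sketch of the two methods using OpenCV follows; the denoising strengths and the choice to equalize only the luminance channel of color images are assumptions, not necessarily the exact settings used here:

```python
import cv2

def equalize_intensity(img_bgr):
    """Histogram equalization of pixel intensities. For color images we
    equalize the luminance channel only; this color handling is an
    assumption rather than a documented choice."""
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

def denoise_nl_means(img_bgr):
    """Non-local means denoising; the filter strengths are assumed values."""
    return cv2.fastNlMeansDenoisingColored(img_bgr, None, 10, 10, 7, 21)
```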
4 EXPERIMENTS
To get committee scores, we use 30 trained models
for each considered dataset variation and simulate the 'building'
of a committee by randomly adding individual
CNNs one by one and averaging their scores. This
averaging method is the most common for deep models, and we do not
use others in order to avoid the separate problem of committee
forming. Each such run gives specific recognition
rates dependent on the ordering of individual models,
so for metrics tied to committee size alone we calculate
the mean and standard deviation over 1000 runs.
Table 1: Individual model and committee correct recognition rates (%) for GTSRB and CIFAR-10 datasets. Hybrid commit-
tees have equal numbers of constituting models.
Preprocessing 1 CNN 6 CNN 30 CNN
GTSRB final phase test set
Originals (a) 98.52 ± 0.23 98.89 ± 0.13 98.98
HistEq (b) 98.51 ± 0.18 98.80 ± 0.10 98.87
NlMeansDenoising (c) 96.39 ± 0.21 96.80 ± 0.14 96.87
(a) and (b) - 98.98 ± 0.13 99.15
(b) and (c) - 98.78 ± 0.13 98.90
(a) and (c) - 98.55 ± 0.13 98.66
(a), (b) and (c) - 98.87 ± 0.14 99.05
CIFAR-10
Originals (a) 81.16 ± 0.27 83.16 ± 0.13 83.25
HistEq (b) 77.96 ± 0.33 80.31 ± 0.16 80.75
NlMeansDenoising (c) 76.45 ± 0.30 78.37 ± 0.16 78.67
(a) and (b) - 82.72 ± 0.16 83.28
(b) and (c) - 81.41 ± 0.10 81.81
(a) and (c) - 82.11 ± 0.14 82.56
(a), (b) and (c) - 82.54 ± 0.15 82.92
Table 2: Correct classification rate (CCR) is defined as the number of examples assigned the correct label divided by the number
of all examples. The base correct classification rate covers examples that are correctly classified by all single models, while
inconsistent ones receive both correct and incorrect classifications. The base error covers examples that are wrongly classified by all
CNNs. True improvement (Equation 1) is evaluated as the increase in CCR once the base CCR is deducted.
Data Base CCR Inconsistent Base err. Avg. CNN 30 CNN Impro.
GTSRB original 95.86% 4.04% 0.10% 98.52% 98.98% 117.67%
GTSRB HistEq 96.27% 3.56% 0.17% 98.51% 98.87% 115.90%
GTSRB original & HistEq 95.17% 4.74% 0.09% 98.51% 99.15% 119.29%
CIFAR10 original 58.58% 36.51% 4.91% 81.16% 83.25% 109.66%
CIFAR10 HistEq 53.28% 40.38% 6.34% 77.96% 80.75% 111.30%
CIFAR10 original & HistEq 52.15% 43.46% 4.39% 80.34% 83.28% 110.43%
For committees consisting of models learned with different
preprocessing methods, we add one model of each variant
to the growing committee at a time, in order to keep the
makeup balanced.
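A minimal sketch of this simulation, assuming per-model class scores are already computed and stored as arrays (the names and shapes are illustrative):

```python
import numpy as np

def simulate_committee_runs(model_scores, labels, n_runs=1000, rng=None):
    """model_scores: array (n_models, n_examples, n_classes) of per-model
    class scores; labels: array (n_examples,) of ground-truth labels.
    Returns the per-committee-size mean and std of recognition rate over
    n_runs random orderings."""
    rng = rng or np.random.default_rng()
    n_models = model_scores.shape[0]
    rates = np.empty((n_runs, n_models))
    for r in range(n_runs):
        order = rng.permutation(n_models)
        running_sum = np.zeros_like(model_scores[0])
        for size, idx in enumerate(order, start=1):
            running_sum += model_scores[idx]        # grow the committee by one CNN
            preds = np.argmax(running_sum, axis=1)  # averaging = argmax of the sum
            rates[r, size - 1] = np.mean(preds == labels)
    return rates.mean(axis=0), rates.std(axis=0)
```

For hybrid committees the ordering would instead interleave one model per preprocessing variant at each step, as described above.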
In Table 1 we compare recognition rates for indi-
vidual models and committees of several sizes. Visu-
alizations for select models are presented in Figure 3,
where we show average performance dependent on
committee size as well as sample single runs.
Committees based upon a single preprocessing
method show similar increases in performance, with
the final result more tied to the performance of the individual
models. Hybrid committee performance is not as easily
anticipated, as it can give slight boosts or reductions
depending on whether the combined preprocessing methods prove
compatible for that dataset and model.
It is important to observe that, as opposed to
smooth increases in average recognition rates as
committee size grows, in some cases adding models
to an existing committee does not consistently in-
crease recognition rate, but can in fact be noticeably
detrimental, as shown in Figure 3. The effect is much
more present in the GTSRB dataset than in CIFAR-10.
Our reasoning is that this is due to the much higher
base recognition rate on GTSRB, leaving less room
for smooth improvement and a greater chance for
individual models to make matching errors. The
opposite effect can also happen when a smaller com-
mittee performs above the convergent value which
is achieved as committee size grows. We believe
that a good indicator of this problem is the increase
in recognition rate achieved by putting models in
a committee. An improvement of about 2% appears
in related work (Krizhevsky et al., 2012; Szegedy
et al., 2015) and in our CIFAR-10 results, even with
a smaller number of individual models, while the
improvement of about 0.5% (Ciresan et al., 2012; Jin
et al., 2014) present in our GTSRB results seems
to require a larger number of individual models to
be achieved reliably. To give a performance metric
unique to committees, we break down classifications
of individual models, as a committee is not able to
impact any labels that are consistently assigned by all
single models. We give a special metric (Equation 1)
Figure 3: Committee performance dependent on the number of constituting CNNs trained on original images from GTSRB
(left) and CIFAR-10 (right). The top plots show average and standard deviation over 1000 committee constructing runs, while
center and bottom show sample single runs.
Figure 4: Images from the GTSRB test set to which all 30 models
trained on original images assign the same wrong label.
for committee performance: the base CCR is the correct
classification rate shared by all constituting models, and the
improvement measures the gain of a 30 CNN committee over the
average single model on the inconsistently classified examples.
Average recognition rates for single models and
committees come from Table 1.

\[
\text{Improvement} = \frac{\text{CCR}_{\text{committee}} - \text{CCR}_{\text{base}}}{\text{CCR}_{\text{avg CNN}} - \text{CCR}_{\text{base}}} \qquad (1)
\]
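A minimal sketch of how these quantities and the improvement of Equation 1 can be computed from per-model predictions (the array names are illustrative):

```python
import numpy as np

def committee_breakdown(model_preds, committee_preds, labels):
    """model_preds: (n_models, n_examples) labels from each single CNN;
    committee_preds: (n_examples,) labels from the full committee;
    labels: (n_examples,) ground truth. Returns the quantities of
    Table 2 and the improvement of Equation 1."""
    correct = model_preds == labels               # (n_models, n_examples)
    base_ccr = np.mean(correct.all(axis=0))       # right for every single model
    base_err = np.mean((~correct).all(axis=0))    # wrong for every single model
    inconsistent = 1.0 - base_ccr - base_err      # mixed right/wrong
    avg_cnn_ccr = correct.mean()                  # average single-model CCR
    committee_ccr = np.mean(committee_preds == labels)
    improvement = (committee_ccr - base_ccr) / (avg_cnn_ccr - base_ccr)
    return base_ccr, inconsistent, base_err, improvement
```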
Table 2 shows that on GTSRB individual models have
much more consistent classifications, which could largely
be due to traffic signs being rigid, in contrast to the
visually varying CIFAR-10 classes. It also shows that
on GTSRB committees prove relatively more effective
than on CIFAR-10, as they give more of an
improvement on the disputable examples. This
also brings up misclassifications consistent across
all individual models, which are interesting as they
showcase model, training method or dataset limita-
tions. We show all such examples for models trained
on original GTSRB images in Figure 4.
These experiments provide valuable insight into
how committees boost model performance scores and
help us with assumptions on what they can and cannot
do.
5 CONCLUSION
In this paper and work of others we observe some
consistencies in results achieved using committees
of base deep models. For the considered problems that
are not saturated, even smaller committees improve
recognition rates by a value close to 2%. However,
when the room for improvement is much smaller, committees
need to be much larger or built with greater
care to be reliable, as smaller committees could have
a significant number of wrongly classified examples
when individual models make similar errors. We
show statistics for committees of various sizes on two
datasets, trained on original or preprocessed images,
as well as hybrid committees. When using a single
preprocessing method to build committees, the in-
crease achieved is similar and the final recognition
rate depends largely on average performance of in-
dividual models. Hybrid committees prove more of
a challenge, since the right choice of preprocessing
method combinations can boost or reduce results de-
pending on whether the preprocessing methods prove
compatible for that dataset and model.
We also looked into performance metrics specific
to committees, since they can only improve results on
examples that individual models do not consistently
classify. Defining the base correct classification rate over
the examples that all individual models classify correctly,
we calculated the true improvement as the increase of
correct classifications above the base. Results of this
metric showed a 117% improvement on GTSRB and a
110% improvement on CIFAR-10, giving much better
insight into how much committees help than the raw
increase in recognition rate alone. Overall, we brought
to light the intricacies of a widely used but rarely elaborated
approach to boosting final model performance.
ACKNOWLEDGEMENT
This work has been supported by the Croatian Science
Foundation under the project UIP-11-2013-1544.
REFERENCES
Ciresan, D. C., Meier, U., Masci, J., and Schmidhuber,
J. (2011). A committee of neural networks for traf-
fic sign classification. In IJCNN, pages 1918–1921.
IEEE.
Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012).
Multi-column deep neural networks for image classi-
fication. In CVPR, pages 3642–3649. IEEE.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,
J., Girshick, R. B., Guadarrama, S., and Darrell, T.
(2014). Caffe: Convolutional architecture for fast fea-
ture embedding. CoRR, abs/1408.5093.
Jin, J., Fu, K., and Zhang, C. (2014). Traffic sign recog-
nition with hinge loss trained convolutional neural
networks. Intelligent Transportation Systems, IEEE
Transactions on, PP(99):1–10.
Krizhevsky, A. (2009). Learning multiple layers of features
from tiny images.
Krizhevsky, A. (2014). One weird trick for parallelizing
convolutional neural networks. CoRR, abs/1404.5997.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in Neural Information Pro-
cessing Systems 25, pages 1097–1105. Curran Asso-
ciates, Inc.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M. S., Berg, A. C., and Fei-Fei, L. (2014). Ima-
genet large scale visual recognition challenge. CoRR,
abs/1409.0575.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C.
(2011). The German Traffic Sign Recognition Bench-
mark: A multi-class classification competition. In
IEEE International Joint Conference on Neural Net-
works, pages 1453–1460.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Zeiler, M. D. and Fergus, R. (2014). Visualizing and under-
standing convolutional networks. In Computer Vision
- ECCV 2014 - 13th European Conference, Zurich,
pages 818–833.