Increasing the Stability of CNNs using a Denoising Layer Regularized by

Local Lipschitz Constant in Road Understanding Problems

Hamed H. Aghdam, Elnaz J. Heravi and Domenec Puig

Computer Engineering and Mathematics Department, Rovira i Virgili University, Tarragona, Spain

{hamed.habibi, elnaz.jahani, domenec.puig}@urv.cat

Keywords:

Image Restoration, Image Denoising, Road Understanding, Trafﬁc Sign Classiﬁcation, Pedestrian Detection.

Abstract:

One of the challenges in problems related to road understanding is to deal with noisy images. Especially,

recent studies have revealed that ConvNets are sensitive to small perturbations in the input. One solution for

dealing with this problem is to generate many noisy images during training a ConvNet. However, this approach

is very costly and it is not a certain solution. In this paper, we propose an objective function regularized by the

local Lipschitz constant and train a ReLU layer for restoring noisy images. Our experiments on the GTSRB

and the Caltech-Pedestrian datasets show that this lightweight approach not only increases the accuracy of

the classiﬁcation ConvNets on the clean datasets but it also increases the stability of the ConvNets against

noise. Comparing our method with similar approaches shows that it produces more stable ConvNets while it

is computationally similar or more efﬁcient than these methods.

1 INTRODUCTION

Understanding road is crucial for autonomous cars.

Lane segmentation, pedestrian detection, trafﬁc sign

recognition and car detection are some of the well

known problems in this ﬁeld. Convolutional Neural

Networks (ConvNets) have been successfully applied

on these problems. Ciresan et.al. (Cirean et al., 2012)

and Sermanet et.al. (Sermanet and Lecun, 2011) pro-

posed ConvNets that beat a human driver in clas-

siﬁcation of trafﬁc signs on a challenging dataset

called GTSRB (Stallkamp et al., 2012). Aghdam

et.al. (Aghdam et al., 2015) also proposed a more

accurate ConvNet with much less parameters. Simi-

larly, Angelova et.al. (Angelova et al., 2015) detected

pedestrians using a cascade of lightweight and com-

plex ConvNets. Besides, Levi et.al. (Levi et al., 2015)

and Bittel et.al. (Bittel et al., 2015) have proposed

ConvNets for segmenting lane in an image.

In real world applications, road understanding

faces some practical challenges. For examples, if the

weather is rainy or foggy, the camera mounted on the

car may not acquire clean images. This may cause

some artifacts on the image. In addition, based on the

shutter speed, the image might be degraded by a mo-

tion if the car is being driven on an uneven route. Sim-

ilarly, engine and other parts may affect the transmit-

ted signal from the camera which in turn may cause

some irregularities on the image. All of these situa-

tions can degrade the image. Consequently, the input

of road understanding module might be noisy.

ConvNets have considerably advanced compared

with AlextNet (Krizhevsky et al., 2012) which won

the ImageNet competition in 2012. In particular,

depth of ConvNets have greatly increased last years.

Szegedy et al. (Szegedy et al., 2014a) created a net-

work consisting of multiple Inception modules. Be-

sides, Simonyan and Zisserman (Simonyan and Zis-

serman, 2015) proposed a network with 19 layers.

The idea behind this ConvNet is to increase the depth

rather than its width. Srivastava et al. (Srivastava

et al., 2015) showed how to train very deep networks

by directly ﬂowing information from previous layers

to next layers through a gate function. Recently, He

et.al. (He et al., 2015) trained a 152-layer ConvNet

and won the ImageNet competition. Their ConvNet is

similar to (Srivastava et al., 2015) in the sense that in-

formation ﬂows directly to the following layers. How-

ever, the gate function in (Srivastava et al., 2015) has

been replaced with an identity mapping function.

Despite the impressive results obtained by Con-

vNets, Szegedy et al. (Szegedy et al., 2014b) showed

that small perturbation,so called adversarial exam-

ples, of input images can alter their classiﬁcation re-

sult. The difference between the image and its adver-

sarial samples is not sometimes even recognizable to

human eye. They study the reason by computing the

upper bound of the Lipschitz constant for each layer.

218

Aghdam H., Heravi E. and Puig D.

Increasing the Stability of CNNs using a Denoising Layer Regularized by Local Lipschitz Constant in Road Understanding Problems.

DOI: 10.5220/0006123602180225

In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017), pages 218-225

ISBN: 978-989-758-226-4

Copyright

c

2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

The results suggest that instability of ConvNets might

be due to the fact that they are highly non-linear func-

tions. Hence, a small change in the input may con-

siderably change the output. Aghdam et.al.(Aghdam

et al., 2016) empirically studied various ConvNets

trained on different datasets. In this work, they gener-

ated 1200 noisy images for each sample in the test sets

using a Gaussian noise with σ ∈ [1 . . . 40]. The results

showed that all the ConvNets in our experiments were

unstable to image degradation even when the samples

were degraded using the Gaussian noise with σ = 1.

Moreover, the ConvNets were unstable regardless of

the class of the object.

Contribution. One of the crucial requirements of

road understanding modules is that they must be exe-

cuted in real-time and they must use resources such

as CPU, GPU and memory as few as possible. In

this paper, we show that ﬁne tuning a ConvNet with

generic approaches such as blurring, median ﬁlter-

ing and bilateral ﬁltering is an effective and afford-

able way to increase the stability of a classiﬁcation

ConvNet against different kinds of noise. More im-

portantly, we train channel-wise ﬁlters for restoring

images. Our objective function tries to locally reduce

the nonlinearity of the restoration module. To be more

speciﬁc, we train a convolution layer with 3 ﬁlters to

restore an image as accurate as possible but also it

generates nearly identical outputs for all perturbations

in small neighborhood of an image. Our experiments

on pedestrian detection and trafﬁc sign classiﬁcation

datasets show that this lightweight restoration layer is

able to effectively tackle with noisy images compared

with other methods.

2 RELATED WORK

Szegedy et al. (Szegedy et al., 2014b) discovered that

ConvNets are sensitive to small variations of the in-

put. They found the additive noise ν which was able

to reduce the score of the true class close to zero.

They also studied the non-linearity of ConvNets us-

ing the Lipschitz theorem. Similarly, Papernot et

al. (Papernot et al., 2015) produced adversarial sam-

ples which were incorrectly classiﬁed by the Con-

vNet. They produced these samples by modifying

4.02% of the input features. Aghdam et.al.(Aghdam

et al., 2016) also proposed an objective function to

ﬁnd the additive noise ν in the closest distance to the

decision boundary in which x + ν falls into the wrong

class. Goodfellow et al. (Goodfellow et al., 2015) ar-

gued that the instability of ConvNets to adversarial

examples is due to linear nature of ConvNets. Based

on this idea, they proposed a method for quickly gen-

erating adversarial examples. They used these exam-

ples to reduce the test error.

Gu and Rigazio (Gu and Rigazio, 2014) stacked

a denoising autoencoder (DAE) to their ConvNet and

preprocessed the adversarial examples using the DAE

before feeding them to the ConvNet. They mentioned

that the resulting network can be still attacked by new

adversarial examples. Inspired by contractive autoen-

coders, they added a smoothness penalty to the objec-

tive function and trained a more stable network.

Instead of minimizing the classiﬁcation score,

Sabour et al. (Sabour et al., 2015) tried to ﬁnd a de-

graded image closest to the original image that its rep-

resentation mimics those produced by natural images.

Fawzi et al. (Alhussein Fawzi et al., 2015) provided

a theoretical framework for explaining the adversarial

examples. Their framework suggests that the instabil-

ity to noise is due to low ﬂexibility of classiﬁers.

3 PROPOSED METHOD

Denoting the softmax layer of a ConvNet (i.e. the last

layer in a classiﬁcation ConvNet) by L

θ

(x), the gen-

eral idea is to ﬁnd a parameter vector θ such that:

∀

kνk≤ε

L

θ

(x + ν) = L

θ

(x) (1)

where ν is a noise vector whose magnitude is less

than threshold T . Solving the instability of ConvNets

against noise using the above formulation may require

to add new terms to the loss function or generate thou-

sands of noisy samples for each training sample. In-

stead, we propose a modular approach consisting of

two ConvNets. The ﬁrst ConvNet is a denoising layer

that we are going to mention in this section. The sec-

ond ConvNet is the one that is originally trained on

training samples. In our approach, we connect the

denoising network to the classiﬁcation network and

feed the images to the denoising network. Our aim is

to train a denoising ConvNet that is able to restore the

original image as accurate as possible and it produces

identical results for all the samples located within ra-

dius r from the current sample. Formally, we are look-

ing for two sets of parameters θ

1

and θ

2

such that:

∀

kνk≤T

L

θ

1

(F

θ

2

(x + ν)) = L

θ

1

(F

θ

2

(x)). (2)

where θ

1

indicates the parameters of the classiﬁcation

ConvNet and θ

2

denotes the parameters of the denois-

ing ConvNet. The parameters θ

1

is already available

by training the classiﬁcation ConvNet on the train-

ing samples. Then, our goal is to ﬁnd a function

F : R

n

→ R

n

that is able to map all points around

x ∈ R

n

to the same point. If we can ﬁnd such a func-

tion, the sample x and all its adversarial examples will

Increasing the Stability of CNNs using a Denoising Layer Regularized by Local Lipschitz Constant in Road Understanding Problems

219

be mapped to the same point. Then, the classiﬁcation

ConvNet will be able to produce the same output for

all adversarial examples.

In contrast to (Jain and Seung, 2009), we do not

restrict F to Gaussian noise. Furthermore, contrary to

(Burger et al., ) and (Hradi, 2015) that model F using

16M and 4.5Mparameters, our approaches requires

determining only 75 parameters. From one perspec-

tive, F can be seen as an associative memory that is

able to memorize patterns X = {x

1

. . . x

i

. . .

M

}, x

i

∈ R

N

in our dataset and map every sample {x

i

+ν|kνk ≤ ε}

to x

i

. Here, x

i

is an image patch and X is the set of

all possible image patches collected from all classes

of objects in our dataset. Figure 1 illustrates our ap-

proach. Our approach can be considered as a layer

which is later connected to the input of a classiﬁca-

tion layer and its aim is to reduce the effect of noise.

The two layers shown in this ﬁgure have identi-

cal architectures and they share all their parameters.

Furthermore, as we will discuss shortly, we need the

two layers during the training phase and we will only

use one of them in the test phase. The layer consists

of 3 convolution ﬁlters of size 5 × 5 which are sepa-

rately applied on the red, green and blue channels of

the noisy image. Also, the result of convolutions are

passed through a ReLU activation function and they

are concatenated in order to create the ﬁnal image.

It should be noted that the noise generation mod-

ule in Figure 1 is only used during the training phase.

In the test phase, the noise generation module is omit-

ted. In this paper, we have only concentrated on ad-

ditive noise. The noise generation module creates

noisy patterns with various probability density func-

tions. Five examples of the probability density func-

tions have been shown in Figure 1. Besides the Gaus-

sian and uniform distributions, there are also other

density functions that generate sparse noise patterns.

Given a set of clean image patches X

clean

=

{x

1

clean

, . . . , x

N

clean

} and their noisy versions X

noisy

=

{x

1

noisy

, . . . , x

N

noisy

}, restoration ConvNets are usually

trained by minimizing the Euclidean loss func-

tion(Jain and Seung, 2009; Dong et al., 2014; Svo-

boda et al., 2016; Hradi, 2015; Burger et al., ):

E =

1

N

N

∑

i=1

kx

i

clean

− F

θ

(x

i

noisy

)k

2

+ λkθk

2

(3)

where θ is the set of network weights and biases and

λ is the regularization coefﬁcient. The objective of

this function is to train a restoration ConvNet which

is able to restore clean images from noisy inputs as

accurate as possible. However, we argue that train-

ing a ConvNet using the above loss function could

be accurate if X

clean

is clean in practice. But, this is

not usually the case in datasets collected for road un-

derstanding problems. This is illustrated in Figure 2

on the samples from GTSRB(Stallkamp et al., 2012)

and Caltech-Pedestrian(Doll

´

ar et al., 2009). Each col-

umn shows two different samples belonging to the

same class. The green rectangle shows contradic-

tory patches in each column. For example, the green

patches related to the speed limit sign are pointing to

the same pattern. However, one of these patches are

degraded due to camera motion. In the second and

third columns, shadow and excessive ambient light

on the patches has caused the contradiction. In the

last two columns, there are some irregularities due to

camera noise. X

noisy

is usually generated from X

clean

.

However, because of the above reasons, it might not

be practical to train the ConvNet using (3) due to con-

tradictions in the database.

To tackle with this problem, we propose to add

a new term to the objective function encouraging

the layer to learn a mapping in which kF (x

clean

) −

F (x

noisy

)k is less than kx

clean

− x

noisy

k. This is analo-

gous to locally reducing the Lipschitz constant of the

layer. Our ﬁnal objective function is deﬁned as fol-

lows:

E =

1

N

N

∑

i=1

"

w

1

kx

i

cl ean

− F

θ

(x

i

noisy

)k

2

+ w

2

kF (x

i

cl ean

) − F (x

i

noisy

)k

kx

cl ean

− x

noisy

k

#

(4)

where N is the total number of the images. More-

over, F (x

i

clean

) and F (x

i

noisy

) are computed at the

same time using the top and bottom layers in Figure

1, respectively. It is worth mentioning that the noisy

patterns are generated on the ﬂy. That said, we have

implemented a degradation module which accepts a

mini-batch of clean images and outputs their degraded

version along with identity mapping data. This helps

the network not only learn to restore noisy patches but

also apply identity mapping on clean patches.

One the one hand, our layer learns to restore im-

ages where intensity values is in interval [0, 1]. On

the other, we initialize our ﬁlters close to averag-

ing ﬁlters. Therefore, the output of the layer never

becomes a negative number. Since our approach is

only one layer consisting of convolution operators

and ReLU functions, it is a linear operator which

is applied on the input image. Formally, conditions

f (kx) = k f (x) and f (x + y) = f (x) + f (y) hold in

our approach. Taking into account one convolution

kernel, F (x

i

clean

) − F (x

i

noisy

) can be simpliﬁed as

W ∗ x

i

clean

−W ∗ x

i

noisy

= W ∗ (x

i

clean

− x

i

noisy

) when all

elements of W are positive. Consequently, the second

term in (4) is minimized by reducing kW k. In con-

trast, when all elements of W are negative, the second

term becomes zero. This is similar to regularizing the

objective function with an adaptive weight analogous

to the difference between clean and noisy samples.

For this reason, we do not add other regularization

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

220

Figure 1: The proposed ConvNet for modelling F in (2). C(s, k) shows a convolution-ReLU layer containing s ﬁlters of size

k × k. Probability density function used for generating noise in the training phase is shown in the left.

Figure 2: Unclean training samples with contradictory patches (best viewed in color).

terms to our objective function. In terms of Lips-

chitz constant, the second term reduces the slope of

the hyperplane represented by each convolution ker-

nel. Moreover, it helps to reduce the effect of contra-

dictory patches.

4 EXPERIMENTS

We carry out our experiments on German Trafﬁc Sign

Recognition Benchmark (GTSRB) (Stallkamp et al.,

2012) and Caltech-Pedestrian(Doll

´

ar et al., 2009)

datasets. The GTSRB and the Caltech-Pedestrian

datasets have some important characteristics. First,

they have been collected considering real scenarios

(e.g. shadow, lightening, occlusion, camera motion)

and they contain many degraded images. Second, the

imaging device are noisy and they produce artifacts

on the acquired images. Third, the resolution of im-

ages are low. Therefore, a slight change in the image

may affect the classiﬁcation score.

We use 48 × 48 (the GTSRB dataset) and 64 × 64

(the Caltech-Pedestrian dataset) image patches as the

input of our layer. Also, we do not apply zero-padding

in the training phase to avoid the impact of border ef-

fect on the loss function. Besides, the input is nor-

malized to [0, 1]. All the weights in our layer are ini-

tialized using the normal distribution with mean value

set to 1 and standard deviation set to 0.2. Taking into

account the fact that each activation of the layer must

be in interval [0, 1], the initial weights are divided by

25 in order to make the results of convolution kernels

close to this interval. After we have the layer trained,

zero-padding is applied on the input.

Exploratory Analysis. To evaluate the restoration

accuracy of our method, we generate 150 noisy im-

ages for each sample in the test set. Generating a

noisy pattern is done in several steps. First, we ran-

domly select a uniform or a normal distribution with

probability 0.5. Then, a noisy pattern is generated

with µ = 0 and σ = U(0.5, 15) if the normal random

number generator is selected. Here, U (0.5, 15) re-

turns a number between 0.5 and 15 using the uniform

distribution. In the case of uniform random number

generator, the noisy pattern is generated in interval

[−3U(0.5, 15), 3U(0.5, 15)]. Next, the noisy pattern

is sparsiﬁed with probability 0.25. The sparsiﬁca-

tion is done by generating a binary mask using bi-

nomial distribution with n = 1 and p = U (0.5, 1.0). It

is worth mentioning that we set the seed of random

Increasing the Stability of CNNs using a Denoising Layer Regularized by Local Lipschitz Constant in Road Understanding Problems

221

number generators to an identical value for all meth-

ods. Figure 3 illustrates a few noisy images along

with the noisy patterns generated on the GTSRB and

the Caltech-Pedestrian datasets.

The noisy samples are fed into the layer and the

peak signal to noise ratio of the restored image is

computed using the following equation:

psnr = 10 log

10

255

2

1

HW

∑

H

m=i

∑

W

n= j

(x − x

0

)

2

!

. (5)

In the above equation, x is the clean image and x

0

is

the noisy/restored image (both of them are re-scaled

back to [0, 255]). In addition, we also study how

the Lipschitz constant of our layer changes locally.

This is done by computing kF (x

i

clean

) − F (x

i

noisy

)k

and ν = kx

clean

− x

noisy

k. To compare our results with

other similar methods, we also restored the images

using the bilateral(d=5 and σ

1

= σ

2

= 9), the me-

dian(5 × 5) and Gaussian(5 × 5) ﬁltering approaches.

Figure 4 illustrates the scatter plot of the PSNR study

(left) and the Lipschitz study (right) superimposed

with a polynomial ﬁtted on the data.

According to the results, both Gaussian and me-

dian ﬁltering approaches are not able to restore im-

age accurately. This is due to the fact that objects in

the GTSRB and the Caltech-Pedestrian datasets are

represented using low resolution images. On the one

hand, details of objects are mainly determined using

high frequency pixels. On the other hand, these pix-

els are close in the case of these low resolution im-

ages. As the result, Gaussian and median ﬁltering ap-

proaches oversmooth the images which degrades the

edges of objects. For this reason, PSNR of the ﬁltered

image is much lower than the PSNR of the noisy im-

age. In contrast, bilateral ﬁltering preserves the edges

and this is the main reason that it has a higher PSNR

compared with these two methods. Moreover, bilat-

eral ﬁltering restores image with higher PSNR when

the PSNR of the noisy image is less than 35.

The Lipschitz study shows that Gaussian and me-

dian ﬁltering produce similar results regardless of the

magnitude of noise. In contrary, images restored by

bilateral ﬁltering are scattered at a distance which is

approximately similar to the distance of the noisy im-

age from the clean image. We are looking for a ﬁl-

tering approach which is able to restore images as ac-

curate as possible and it produces results with smaller

Lipschitz constant. Consequently, none of these three

approaches are appropriate for our purpose.

However, the ﬁlter learned by our approach has a

trade off between accuracy and the Lipschitz constant.

Looking at the PSNR values, we observe that, on av-

erage, it is more accurate than these three methods.

Besides, its Lipschitz constant is approximately lin-

ear. More importantly, the Lipschitz constant of our

ﬁlter is less than 1 which means that restored images

become closer after being ﬁltered by our layer. Fi-

nally, we observe that the Lipschitz constant is very

stable with very low variation in our approach. This

means that, the ﬁlter learned by our objective function

is not sensitive to the variations of input image.

We further analyze our ﬁlters in the frequency do-

main using the Fourier transform. Figure 5 illustrates

the frequency response of our ﬁlters along with the

Gaussian ﬁlter. First, our ﬁlters have higher response

to low frequencies than the Gaussian ﬁlter. For this

reason, it passes some of the details in the image more

than Gaussian ﬁlter. Second, they also have higher re-

sponses in very high frequencies. This helps our ﬁl-

ters to preserve edges more than the Gaussian ﬁlter.

Quantitative Analysis. We pick the ConvNet in (An-

gelova et al., 2015) for detecting pedestrians and the

ConvNet in (Aghdam et al., 2015) for classiﬁcation

of trafﬁc signs. First, these ConvNets are trained

on the GTSRB and the Caltech-Pedestrians datasets.

Then, we connect our learned restoration ﬁlters to

these ConvNets and ﬁne-tune them for one epoch on

the original dataset (we do not augment the dataset

with noisy images). Then, the ConvNets are tested

using noisy test sets. We repeat this procedure (ﬁne

tuning the classiﬁcation ConvNets) on Gaussian, me-

dian and bilateral ﬁltering as well.

The noisy test sets are created by generating 1050

Guassian noise pattern with σ ∈ {0.3, 1, 2, 3, 4, 8, 10}

for each sample (150 images per each value of σ).

Then, we feed these noisy samples to the above Con-

vNets (after connecting our layer to these ConvNets)

and compute the accuracy. To generate the same noisy

samples for all methods in our experiment, we always

seed the random number generator with a ﬁxed value.

Table 1 and Table 2 show the results on the GTSRB

and the Caltech-Pedestrian datasets.

We observe that adding a Guassian or a me-

dian3x3 layer to the GTSRB ConvNet increases the

classiﬁcation accuracy of the ConvNet on clean im-

ages. This is due to the fact that some of the test

samples might be noisy for the reasons we discussed

in Section 3. The Gaussian layer helps to deal with

this kind of noise and consequently it increases the

accuracy of the ConvNet on clean samples. Simi-

larly, median3x3 and bilateral1 ﬁltering increases the

accuracy of the Caltech-Pedestrian ConvNet on clean

samples compared with the original ConvNet. How-

ever, while Gaussian ﬁltering works well on the GT-

SRB dataset it does not increase the accuracy on the

Caltech-Pedestrian dataset. Likewise, bilateral ﬁlter-

ing improves the accuracy on the Caltech-Pedestrian

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

222

Figure 3: Samples of noisy images generated by our algorithm.

Figure 4: PSNR (left) and Lipschitz analysis (right) of the the Gaussian, median, bilateral and our approaches (best viewed in

color and electronically).

dataset but they do not increase the accuracy on the

GTSRB dataset.

Notwithstanding, the ﬁlters learned by our method

produce the most accurate results on both datasets. In

addition, our layer produce a ConvNet with highest

stability against noise compared with approaches with

similar computational complexity. In fact, the com-

putational complexity of our layer is identical to the

Gaussian 5x5 and its less than bilateral and median

ﬁltering approaches.

Analyzing Results. Figure 6 illustrates some of the

samples that are classiﬁed incorrectly by the original

ConvNet but they are classiﬁed correctly after being

smoothed by our method. The original image inside

Increasing the Stability of CNNs using a Denoising Layer Regularized by Local Lipschitz Constant in Road Understanding Problems

223

Figure 5: Comparing our ﬁlter with Gaussian ﬁlter in the frequency domain.

Table 1: Accuracy of the GTSRB ConvNet obtained by degrading the test images in the original dataset using a Gaussian

noise with various values of σ.

accuracy (%) for different values of σ

network clean 0.3 1 2 3 4 8 10 overall

original 99.06 98.56 98.56 98.55 98.52 98.48 98.20 97.93 98.48

gaussian3x3 99.22 98.72 98.72 98.71 98.68 98.64 98.36 98.13 98.65

gaussian5x5 99.22 98.71 98.71 98.70 98.68 98.65 98.42 98.23 98.66

median3x3 99.15 98.66 98.66 98.65 98.63 98.60 98.38 98.19 98.62

median5x5 98.94 98.42 98.42 98.39 98.36 98.32 98.04 97.83 98.34

bilateral1 98.99 98.49 98.49 98.48 98.46 98.45 98.27 98.13 98.47

bilateral2 96.94 96.49 96.48 96.48 96.44 96.42 96.28 96.17 96.46

our ﬁlter 99.31 99.31 99.31 99.30 99.28 99.26 99.03 98.84 99.21

Table 2: Accuracy of the Caltech-Pedestrian ConvNet obtained by degrading the test images in the original dataset using a

Gaussian noise with various values of σ.

accuracy (%) for different values of σ

network clean 0.3 1 2 3 4 8 10 overall

original 92.39 91.97 91.97 91.97 91.95 91.92 91.67 91.49 91.92

gaussian3x3 92.36 91.89 91.89 91.87 91.85 91.80 91.61 91.48 91.84

gaussian5x5 92.01 91.52 91.52 91.49 91.46 91.43 91.23 91.12 91.47

median3x3 92.61 92.16 92.16 92.16 92.17 92.16 92.07 92.02 92.19

median5x5 92.18 91.67 91.67 91.66 91.64 91.61 91.46 91.34 91.65

bilateral1 92.74 92.27 92.26 92.25 92.25 92.22 92.13 92.03 92.27

bilateral2 92.27 91.82 91.82 91.83 91.82 91.81 91.83 91.84 91.88

our ﬁlter 92.86 92.84 92.83 92.81 92.78 92.76 92.56 92.46 92.74

the green rectangle is degraded by shadow. Our layer

ﬁlters the edges of the object and reduces the effect

of shadow on the edges. The background of the im-

age inside the red rectangle is smoothed by the layer.

Edges in the original image inside the yellow rectan-

gle has Bayer like pattern because of excessive light-

ening in the background. This effect is reduced by our

ﬁlter. Finally, a general ﬁltering is applied on the im-

age inside the blue rectangle and makes it smoother.

In sum, our method increases the accuracy by improv-

ing degraded edges and smoothing background noise.

5 CONCLUSION

In this paper, we proposed a lightweight approach for

increasing stability of ConvNets. Our method trains

a ReLU layer containing 3 channel-wise ﬁlters. We

proposed a new objective function consisting of the

sum of square error penalized by the Lipschitz con-

stant of the ﬁlters. We showed that the Lipschitiz con-

stant in this particular conﬁguration act as an adap-

tive L

2

regularizer. Our experiments on the GTSRB

and the Caltech-Pedestrian datasets shows that this

approach increases the accuracy of the original Con-

vNets on the clean test sets. Using our approach, the

stability of ConvNets increases while the computa-

tional cost of our layer is negligible. Besides, since

it is a modular approach, we do not need to train a

large ConvNet using thousands of noisy samples to

increase the stability. Rather, we train the classiﬁca-

tion ConvNet on the clean dataset. Then, we train

our restoration layer on the noisy training set. Finally,

the classiﬁcation ConvNet is ﬁne-tune for one epoch

using the clean training set. This approach is very af-

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

224

Figure 6: Images that are correctly classiﬁed after being ﬁltered by our layer. Left to right: Original image, difference with

restored, restored image and normalized difference.

fordable in terms of time and computation resources.

ACKNOWLEDGEMENTS

Hamed H. Aghdam and Elnaz J. Heravi are grateful

for the supports granted by Generalitat de Catalunya’s

Ag

`

ecia de Gesti

´

o d’Ajuts Universitaris i de Recerca

(AGAUR) through the FI-DGR 2015 fellowship and

University Rovira i Virgili through the Marti Franques

fellowship, respectively.

REFERENCES

Aghdam, H. H., Heravi, E. J., and Puig, D. (2015). Rec-

ognizing Trafﬁc Signs using a Practical Deep Neural

Network. In Robot 2015: Second Iberian Robotics

Conference, pages 399–410, Lisbon. Springer.

Aghdam, H. H., Heravi, E. J., and Puig, D. (2016). Ana-

lyzing the Stability of Convolutional Neural Networks

Against Image Degradation. In Proceedings of the

11th International Conference on Computer Vision

Theory and Applications.

Alhussein Fawzi, Omar Fawzi, and Pascal Frossard (2015).

Analysis of classiﬁers’ robustness to adversarial per-

turbations. (2014):1–14.

Angelova, A., Krizhevsky, A., View, M., View, M., Van-

houcke, V., Ogale, A., and Ferguson, D. (2015). Real-

Time Pedestrian Detection With Deep Network Cas-

cades. Bmvc2015, pages 1–12.

Bittel, S., Kaiser, V., Teichmann, M., and Thoma, M.

(2015). Pixel-wise Segmentation of Street with Neural

Networks. pages 1–7.

Burger, H. C., Schuler, C. J., and Harmeling, S. Image

denoising Can plain neural networks compete with

BM3D .

Cirean, D., Meier, U., Masci, J., and Schmidhuber, J.

(2012). Multi-column deep neural network for traf-

ﬁc sign classiﬁcation. Neural Networks, 32:333–338.

Doll

´

ar, P., Wojek, C., Schiele, B., and Perona, P. (2009).

Pedestrian detection: A benchmark. 2009 IEEE Com-

puter Society Conference on Computer Vision and

Pattern Recognition Workshops, CVPR Workshops

2009, pages 304–311.

Dong, C., Loy, C. C., and He, K. (2014). Image

Super-Resolution Using Deep Convolutional Net-

works. arXiv preprint, 8828(c):1–14.

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Ex-

plaining and Harnessing Adversarial Examples. Iclr

2015, pages 1–11.

Gu, S. and Rigazio, L. (2014). Towards Deep Neural Net-

work Architectures Robust to Adversarial Examples.

arXiv:1412.5068 [cs], (2013):1–9.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep

Residual Learning for Image Recognition. In arXiv

prepring arXiv:1506.01497.

Hradi, M. (2015). Convolutional Neural Networks for Di-

rect Text Deblurring. Bmvc, (1):1–13.

Jain, V. and Seung, S. (2009). Natural Image Denoising

with Convolutional Networks. pages 769–776.

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems, pages 1097–1105. Curran Associates,

Inc.

Levi, D., Garnett, N., and Fetaya, E. (2015). StixelNet: a

deep convolutional network for obstacle detection and

road segmentation. Bmvc, pages 1–12.

Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik,

Z. B., and Swami, A. (2015). The Limitations of Deep

Learning in Adversarial Settings.

Sabour, S., Cao, Y., Faghri, F., and Fleet, D. J. (2015).

Adversarial Manipulation of Deep Representations.

arXiv preprint arXiv:1511.05122, (2015):1–10.

Sermanet, P. and Lecun, Y. (2011). Trafﬁc sign recognition

with multi-scale convolutional networks. Proceedings

of the International Joint Conference on Neural Net-

works, pages 2809–2813.

Simonyan, K. and Zisserman, A. (2015). Very Deep Con-

volutional Networks for Large-Scale Image Recogni-

tion. In International Conference on Learning Repre-

sentation (ICLR), pages 1–13.

Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015).

Highway Networks. arXiv:1505.00387 [cs].

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C.

(2012). Man vs. computer: Benchmarking machine

learning algorithms for trafﬁc sign recognition. Neu-

ral Networks, 32:323–332.

Svoboda, P., Hradis, M., Marsik, L., and Zemcik, P. (2016).

CNN for License Plate Motion Deblurring.

Szegedy, C., Reed, S., Sermanet, P., Vanhoucke, V., and

Rabinovich, A. (2014a). Going deeper with convolu-

tions. In arXiv preprint arXiv:1409.4842, pages 1–12.

Szegedy, C., Zaremba, W., and Sutskever, I. (2014b). In-

triguing properties of neural networks.

225