Improving Quality of Training Samples Through Exhaustless Generation

and Effective Selection for Deep Convolutional Neural Networks

Takayoshi Yamashita

, Taro Watasue

, Yuji Yamauchi

and Hironobu Fujiyoshi

Chubu University, 1200, Matsumoto-cho, Kasugai, Aichi, Japan

Tome R&D, kyoto, Japan

Keywords:

Deep Learning, CNN, Data Augmentation, Data Selection, MINIST.

Abstract:

Deep convolutional neural networks require a huge amount of data samples to train efﬁcient networks. Al-

though many benchmarks manage to create abundant samples to be used for training, they lack efﬁciency when

trying to train convolutional neural networks up to their full potential. The data augmentation is one of the

solutions to this problem, but it does not consider the quality of samples, i.e. whether the augmented samples

are actually suitable for training or not. In this paper, we propose a method that will allow us to select effective

samples from an augmented sample set. The achievements of our method were 1) to be able to generate a large

amount of augmented samples from images with labeled data and multiple background images; and 2) to be

able to select effective samples from the additionally augmented ones through iterations of parameter updating

during the training process. We utilized exhaustless sample generation and effective sample selection in order

to perform recognition and segmentation tasks. It obtained the best performance in both tasks when compared

to other methods using, or not, sample generation and/or selection.

1 INTRODUCTION

Deep learning methods have been applied to many

challenging tasks including speech recognition, bioin-

formatics and object recognition. Among those,

Deep Convolutional Neural Networks (ConvNets)

are used in various benchmark tests for handwrit-

ten characters(Y. LeCun ad L. Bottou and P.Haffner,

1998), house numbers(Sermanet et al., 2012), trafﬁc

signs(Ciresan et al., 2012) and ImageNet(Sutskever

and Hinton, 2012). ConvNets lead to signiﬁcantly

advanced performance results with these dataset.

Krizhevsky et al. obtained impressive classiﬁcation

performances using a large ConvNets and won the

ImageNet 2012(Sutskever and Hinton, 2012). Zeiler

et al. also utilized a ConvNets and applied it to not

only the recognition task but also the localization

challenge, winning the last ImageNet 2013(Zeiler and

Fergus, 2013). Because of these achievements, deep

learning including ConvNets is becoming a more and

more popular learning method.

The advantage of ConvNets is that it is able to ex-

tract complex and suitable features for the task. It can

reduce the burden of designing the features since the

entire system is trained from raw pixels.

Although many researchers have proposed meth-

ods to address the disadvantage of ConvNets, but it re-

mains that they require a huge amount of labeled sam-

ples. Although many benchmark methods can prepare

an abundant number of samples for training, this is of-

ten not just enough to fulﬁll all the learning potential

that ConvNets could actually achieve. Creating addi-

tional samples through data augmentation method is

one of the solutions to this problem as it enables us

to control at will the amount of samples to be used.

However, simple data augmentation does not take ac-

count of the quality of the augmented samples, i.e.

whether the additional samples will actually be useful

or not for training. While the amount of samples is

important for an effective training method, so is their

quality.

In this paper, we propose a method to select useful

samples from the ones generated through data aug-

mentation. First, we utilized an asynchronous sample

generation process. This data augmentation method

did not only use translation, scaling and rotation, but

also deformed the samples through elastic distortion.

Then, we selected the useful samples from all the aug-

mented ones through iterations of the parameter up-

date process. We applied this method to a ConvNets

and evaluated its performance for several tasks, in-

cluding recognition and segmentation. For the recog-

228

Yamashita T., Watasue T., Yamauchi Y. and Fujiyoshi H..

Improving Quality of Training Samples Through Exhaustless Generation and Effective Selection for Deep Convolutional Neural Networks.

DOI: 10.5220/0005263802280235

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 228-235

ISBN: 978-989-758-090-1

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

nition task, we tested the efﬁciency of our sample se-

lection method in MNIST for hand posture recogni-

tion with complex ﬁgures in shape variation. Further-

more, we applied our proposed method to a segmen-

tation task in order to show its wide application range.

The rest of this paper is organized as follows: The

related works and Convolutional Neural Networks are

brieﬂy reviewed in Section 2 and 3. Our proposed

method is described in Section 4 and our experiment

results are given in Section 5. Finally, we present our

conclusions in Section 6.

2 RELATED WORKS

Since the early 1990’s, many researchers have pro-

posed methods that use ConvNets(Vaillant et al.,

1994)(Y. LeCun ad L. Bottou and P.Haffner, 1998).

More recently, ConvNets have demonstrated state

of the art performance for text detection(Delakisand

and Garcia, 2008), image recognition(Ciresan et al.,

2012), and pedestrian detection(Sermanet et al.,

2013)(Ouyang and Wang, 2013b)(Ouyang and Wang,

2013a). After being utilized by Krizhevsky et

al. to win the object recognition competition Im-

ageNet, ConvNets moved into the limelight and

their use spread in numerous computer vision re-

search ﬁelds(Sutskever and Hinton, 2012). ConvNets

achieved a breakthrough on the 1000-class ImageNet

task, and while they are mainly used for object recog-

nition, they have a tremendous potential of applica-

tions, like pedestrian detection as proposed by Ser-

manet(Sermanet et al., 2013). ConvNets combine

multi-stage features, skipped layers connections and

unsupervised pre-training with sparse coding. They

integrate both global shape information and local dis-

tinctive motif information. The unsupervised pre-

training initializes the ﬁlters at each layer and obtains

efﬁcient ﬁlters in a ﬁne tuning process. Ouyang et

al. proposed a pedestrian detection method using a

special ConvNets that obtained the ﬁlter output from

local regions(Ouyang and Wang, 2013b). Each lo-

cal region corresponded to a body part, such as the

head, upper body, left or right half body, and all re-

sponses were joined together in the ﬁnal layer. Multi-

task applications are also a ﬁeld of usage where Con-

vNets are able to achieve great results. Osadchy et

al. utilized ConvNets for simultaneous face detection

and pose estimation by directly predicting the param-

eters of face location and head pose(Osadchy et al.,

2007). Face location and head pose were represented

by high-dimensional points on a 3D manifold, and the

feature space was trained by ConvNets to output the

face location and head pose. ConvNets based segmen-

tation can also be used to perform object localization.

For example, training was done to classify the central

pixel of the window with an objectfs category label to

represent the said object with a bounding contour in-

stead of a traditional bounding box(Jain et al., 2007).

ConvNets, however, still have several disadvan-

tages that restrain their full potential. Although Con-

vNets represent high-dimentional features due to hav-

ing a lot of ﬁlter banks, the large number of parame-

ters also may cause overﬁtting problems. Many re-

searchers proposed various approaches to avoid over-

ﬁtting. Here are the two main ones: Applying a

micro architecture to the ConvNets. (Scherer et al.,

2010)(Boureau et al., 2010)(Goodfellow et al., 2013);

and Using data augmentation. Data augmentation

is known as the easiest and most common method

to reduce overﬁtting on training data. Augmented

data is obtained by artiﬁcially enlarging the data set

but preserving their labels. Put simply, the amount

of data is augmented by cropping different regions

from the same image and mirroring them(Sutskever

and Hinton, 2012)(Hinton et al., 2012). Additional

transformations such as scaling or rotation can also

be applied (Wan et al., 2013). Simard proposed a

highly advanced deformation method called elastic

distortion, effective for handwritten character recog-

nition(Simard et al., 2003). We found that elastic dis-

tortion is also useful in hand posture recognition (see

Section 5). However, training a huge amount of pa-

rameters using augmented data takes a lot of time. In

order to reduce learning time, vast parallel comput-

ing with GPU is often used. Fortunately, some GPU

computing software packages that can train ConvNets

are nowadays available(Sermanet et al., 2013). Com-

monly, GPU parallelization is used when updating pa-

rameters. Even though GPU helps updating the pa-

rameters faster, it requires complex data augmenta-

tion that can be a bottleneck to the training process.

Krizhevsky et al. implemented simple data augmen-

tation in Python on CPU while parameter updating

is computed on GPU(Sutskever and Hinton, 2012).

Nevertheless, when a highly effective and highly so-

phisticated data augmentation process is required, if

both the data generation and the parameters reﬁne-

ment are run on the same process, the whole takes

quite some time to complete the training. However, if

data augmentation and parameter updating are sepa-

rate process and those two can asynchronously access

the data, then the program is kept simple and training

time can be reduced. Yet, approaches like these only

focus on the mass-generation of training data and do

not provide solutions to the issues of the effectiveness

of the samples to train or the selection of such high

performing samples. That is, there must be some ef-

ImprovingQualityofTrainingSamplesThroughExhaustlessGenerationandEffectiveSelectionforDeepConvolutional

NeuralNetworks

229

fective sample selection method used to achieve high

performance while using relatively less samples. We

also focus on these issues.

3 CONVOLUTIONAL NEURAL

NETWORK

ConvNets (Y. LeCun ad L. Bottou and P.Haffner,

1998) contain an alternate succession of convolu-

tional layers and subsampling layers, based on the

notion of local receptive ﬁelds discovered by Hubel

(Hubel and Wiesel, 1962). There are several types

of layers, including input layers, convolutional lay-

ers, pooling layers and classiﬁcation layers. Each in-

put layer also has, besides the raw data, edge and nor-

malized data as input. The convolutional layer has M

kernels with size (Kx × Ky) and ﬁlters them in order

to input data. The ﬁltered responses from all the input

data are then subsampled in the pooling layer. Scherer

(Scherer et al., 2010) found that max pooling can

lead to faster convergence and improved generaliza-

tion, while Boureau (Boureau et al., 2010) analyzed

theoretical feature pooling. Max pooling can output

the maximum value in certain regions such as a 2 × 2

pixel. The convolutional layer and pooling layer are

laid alternatively in order to create the deep network

architecture. Finally, the output feature vectors from

the last pooling layer are used in the classiﬁcation

layer. The classiﬁcation layer will output the prob-

ability of each class through a soft max connection

of all the nodes with weights in the previous layer.

Unlike Belief neural networks, ConvNets assume su-

pervised learning where ﬁlters are randomly initial-

ized and updated through backpropagation (Rumel-

hart et al., 1986) (Duffer and Garcia, 2007).

3.1 Training of Convolutional Neural

Networks

ConvNets are trained by backpropagation just as

general neural networks. Backpropagation uses the

function shown in Eq.(1) to estimate the connected

weights with minimized E by gradient descent in

Eq.(2).

E =

∑

p=1

(1)

(l)

← w

(l)

+ ∆w

(l)

= w

(l)

− λ

∂E

∂w

(l)

(2)

Note that {p|1, ..., P} is the training sample, o

the corresponding value of training sample p in out-

put layer and t

is the label data of p. λ is the training

ratio, w

(l)

is the weight that connects from the node

i in layer l to the node j in the next layer. The error

for each training sample E

is the sum of the differ-

ences between the output value and the label. ∆w

(l)

represented as following Eq.(3).

∆w

(l)

= −λδ

(l)

(l−1)

(3)

(l)

= e

ϕ(V

(l)

) (4)

(l)

∑

(l)

k j

∗ y

(l−1)

(5)

(l−1)

is the output of the node j in the (l − 1)th

layer and e

is the error of node kD V

(l−1)

is the ac-

cumulated value connected to node k from all nodes

in the (l − 1)th layer. The local gradient descent is

obtained by Eq.(4). The activation function ϕ has

variation such as sigmoid, hyperbolic tangent and

ReLu(Sutskever and Hinton, 2012). The connected

weights in the entire network are updated concur-

rently for a predetermined number of times, or until

reaching the satisfying convergence condition.

Backpropagation has multiple ways to calculate

the error E, including full-batch, online and Mini-

batch. Full-batch gives all training data at once. As

such, it requires few iterations but is weak for conver-

gence due to the increasing gradient descent. Online

gives the training data iteratively. As such, it obtains

optimized results from its small gradient descent, but

requires lengthy processing time due to its numerous

iterations. Mini-batch is a commonly used middle ap-

proach that updates the weight with small subsets of

training data. It is able to effectively update connected

weights with a huge amount of training data in a real-

istic time length.

4 PROPOSED METHOD

The conventional ConvNets method trains from a

huge yet limited amount of training samples. In most

cases, the training process divides the training sam-

ples into subsets of mini-batch size and uses them

to update the parameters. When the training pro-

cess ﬁnishes using all the subsets, it starts receiving

data again from ﬁrst subset through a process called

epoch. While this enables a continuous updating of

the parameters, it has distinctive limitations in the

variation of its samples. We can assume that varia-

tion in the data to be continuously given to the net-

work is signiﬁcantly important when trying to obtain

the best parameters. To demonstrate this assumption

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

230

!"#$%&'(&)&*"+,)

-*".).)('/"#$%&/'

0'123345

6"7"'"8(#&)7"+,)

98(#&)7&6'/"#$%&/'

0':'2;5

<"=>"(&

?,)@A&7/'7*".).)(

B).+"%')&7C,*>/

<"*"#&7&*/'8$6"+)(

D.)"%')&7C,*>/

#.).'E"7=F

9/G)=F*,),8/!

8$6"+)(

%,"6

Figure 1: Entire training system. It consists of two parts,

sample generation and ConvNets training. The sample gen-

eration make the augmentation samples through data aug-

mentation from basis training samples. The package that

contain certain samples is utilized parameter updating of

networks. The package are asynchronously updated and re-

loaded to training process.

!"#$%&'()$*+,%+-'.'

(/0+,1#%+-'

!"#$%&'()$*+,%+-'.'

(/0+,1#%+-'

2#$)$')1#3/!

2)-#,4')1#3/!

2#&53,+6-(')1#3/!

7631/-*/(')1#3/!

Figure 2: Sample generation process. The augmented im-

age is made by elastic distortion and deformation of scaling,

rotation and translation from basis image and binary image.

It also synthesizes object image and background image.

and achieve better results than the ones that can be

obtained by concrete training samples, we propose

here a method for an asynchronous exhaustless sam-

ple generation coupled with an effective sample se-

lection during parameter updating. We illustrated the

entire training system in Fig.1. The system consists

of two processes, data augmentation and ConvNets

training. The data augmentation continuously creates

a package that includes a certain amount of training

samples, while the ConvNets training updates the pa-

rameters using selected effective samples from the

package. The package is loaded to ConvNets and

used for training, and when all training samples in the

package are used, the package is reloaded. We will

describe details of the sample generation and selec-

tion in the following sections.

4.1 Exhaustless Sample Generation

ConvNets training requires samples with broad vari-

ation including non-controlled backgrounds if it is to

be used for real-world applications. Each sample we

used is generated through data augmentation. We start

with binary labeled data and multiple background im-

Table 1: Deformation range of each factor.

deformation factor range

translation ±3

scaling ± 5%

rotation ± 5

brightness ± 10%

ages to create a vast combination of samples through

elastic distortion, as shown in Fig.2. The elastic dis-

tortion arranges the corresponding new target position

∗

from the original position of every pixel. The new

target position x

∗

is derived from random displace-

ment and is applied a scaling factor α as follows,

∗

= x

x + α∆x

x. (6)

∆x

x is a random real value between -1 and 1 with uni-

form distribution. Like the method used by Simard,

∆x

x is convolved with Gaussian of standard deviation

σ to obtain smoothed displacements(Simard et al.,

2003). The value at position x

∗

is computed through

the neighboring pixels using bilinear interpolation.

After applying elastic distortion on both the sam-

ple images and the binary labeled data, the synthetic

images are then synthesized with background images

while applying scaling, rotation and translation. We

deﬁne the deformation range in Table.1. The back-

ground regions are randomly cropped from the origi-

nal images.

Sample generation results in a package that in-

cludes a set amount of synthetic images. That pack-

age is then updated asynchronously to avoid the recy-

cling of the same samples during the training process.

4.2 Effective Sample Selection

We employ a sample selection of data to obtain ef-

fective training samples from the package, as the data

augmentation process itself does not take in factor the

quality of the additional samples generated. Varia-

tion within training samples data is quite important

when training sophisticated networks. Using only

elementary samples repeatedly for training does not

tend to provide great results. The current training net-

works require an increasing difﬁculty in their samples

in order achieve their expected performances in real

environments and avoid over ﬁtting. For our effec-

tive sample selection, we evaluate all the samples in

the package using current networks and sort them by

worse error order. Then, a set amount of samples M

from the worse ones are used for training while the

rest of the samples that were recognized correctly are

discarded.

ImprovingQualityofTrainingSamplesThroughExhaustlessGenerationandEffectiveSelectionforDeepConvolutional

NeuralNetworks

231

!"#$%&'()*+

,-".-/$0-"

1)2&#--/'"*

1)2&-$%

,-".-/$0-"

3'/%+4+5&()#6

3'/%+4+5&()#6&

789:&6';+<

3'/%+4+5&()#6&

789:&()#6<

3$//=&>-""+>0-"

,/)66'@>)0-"

(a)

!"#$%&'()*+

,-".-/$0-"

1)2&#--/'"*

1)2&-$%

,-".-/$0-"

3'/%+4+5&()#6

3'/%+4+5&()#6&

789:&6';+<

3'/%+4+5&()#6&

789:&()#6<

3$//=&>-""+>0-"

?'")4';)0-"

?'")4=&'()*+&&

(b)

Figure 3: Architecture of our networks. (a) is the networks for recognition task. It has convolution, max pooling, maxout,

fully connected and classiﬁcation layers. (b) is the networks for segmentation task. It has the binarization layer instead of

classiﬁcation layer.

Table 2: Architecture of networks for each dataset.

Recognition task (Fig.3(a)) segmentation task (Fig.3(b))

MNIST Hand Shape Hand Shape

layer type size, # of kernels type size, # of kernels type size,# of kernels

input grayscale 28×28 grayscale 40×40 grayscale 40×40

1st convolution 5×5, 40 convolution 5×5, 40 convolution 5×5, 40

2nd max pooling 2×2 max pooling 2×2 max pooling 2×2

3rd maxout 2 maxout 2 maxout 2

4th convolution 5×5, 400 convolution 5×5, 40 convolution 5×5,40

5th max pooling 2×2 max pooling 2×2 max pooling 2×2

6th maxout 2 maxout 2 maxout 2

7th fully connected 100 fully connected 100 fully connected 400

output soft max 10 soft max 6 L2 norm 1600

4.3 Architecture of Networks

The network architecture for our method is shown in

Fig.3. It consists of six layer types, i.e. convolutional,

max pooling, maxout, fully connected, binarization

and classiﬁcation layers. The architectures are set up

individually for each task. As shown in Fig.3(a), the

recognition task utilizes the convolutional, max pool-

ing, maxout, fully connected and classiﬁcation layers,

but not the binarization layer. However, the binariza-

tion layer comes in handy for the segmentation task,

as illustrated in Fig.3(b). Each parameter, such as the

kernel size in the convolutional layer and the number

of nodes in the fully connected layer, are predeﬁned

for the experiment.

Max pooling can help the network achieve faster

convergence and improves generalization. Unlike

Krizhevsky and Schmidhuber methods, we employ

maxout following max pooling as same as (Goodfel-

low et al., 2013). While max pooling selects the max-

imum output from the same map produced by a kernel

and subsamples the map size to half, maxout selects

the maximum value from several maps and reduces

their number. The fully connected layer receives high

dimensional feature vectors by ﬂattening the output of

previous layer. It then trains with dropout and output

speciﬁc dimensional feature vectors.

For the segmentation task, the feature vectors are

input in the binarization layer in order to obtain bi-

nary image. This layer output the probability of each

pixel of being part of an object or not. The classiﬁca-

tion layer produces a distribution of each class labels

using soft max. When training the network, all the

parameters are optimized through backpropagation.

4.4 Input Layer

The network is trained with a large amount of data to

reduce over ﬁtting. However, as it is difﬁcult to collect

a huge amount of labelled data, the elastic distortions

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

232

described in 4.2 is employed to increase the amount

of data from the initial limited dataset. This data aug-

mentation method is applied to both the original im-

age and the binary labeled data in the segmentation

task.

4.5 Pooling Layer

We employ maxout to avoid the drawbacks of a tra-

ditional activation function design (Goodfellow et al.,

2013). This results in a higher representation com-

pared to a ReLU nonlinearity (Sutskever and Hin-

ton, 2012). The common activation function h

h = (h

, ...h

, i ∈ I) is deﬁned as in Eq. (7):

= σ(x

+ b

). (7)

σ(·) is the sigmoid function, x is the input vector, W

is the weight and b

is the bias. The maxout selects the

maximum value from several ﬁltered output kernels,

as Eq.(8):

= max

j∈[1,k]

i j

, (8)

i j

= x

i j

+ b

i j

(9)

Unlike max pooling, maxout selects the maximum

output from several maps.

4.6 Fully Connected Layer

Dropout is an efﬁcient method to reduce overﬁtting

and improve generalization (Hinton et al., 2012). It

randomly samples hidden units with a probability of

50%. The features of the selected units are then used

for optimization during each iteration of the training

process. Dropout is also used for training in the fully

connected layer.

4.7 Binarization Layer

For the segmentation task, we utilize a binarization

layer based on a fully connected layer. The output

node of the binarization layer receives the value of

all the input nodes that are the response of the previ-

ous layer. Each node output the probability of corre-

sponding to a pixel of the sample image. As such, the

number of output nodes is the same as the number of

pixels of the sample image.

4.8 Network Training

All W

W parameters, such as the kernel elements, the

weights between the units and the bias, are randomly

initialized. We use backpropagation to update the pa-

rameters W

t+1

in iteration t + 1 as deﬁned in Eq.(10):

t+1

= W

− ε

t+1

∂E

∂W

(10)

Figure 4: Example of evaluation dataset including 6 classes

based on the number of upped ﬁngers. Each hand posture

class is identiﬁed by the number of up ﬁngers displayed,

with ”0” for a closed ﬁst and ”5” for an open palm.

Soft max is used for the loss function E in the clas-

siﬁcation layer as follows:

exp(h

)

∑

j=1

exp(h

)

, (11)

E = −log(E

). (12)

E is derived from E

, the loss function of node i with

label y of the training sample, and M is the number of

total output nodes.

Meanwhile, we utilize the L2 norm for the loss

function E in the binarization layer as described in

Eq.(13):

E = ||h

h −

−

− y

y||

(13)

The update efﬁciency will decrease depending on

the update time t as deﬁned in Eq.(14):

t+1

max(t, τ)

(14)

The ε

is initially update-efﬁcient and τ is the deﬁned

parameter.

5 EXPERIMENT

We perform an experiment to demonstrate the effect

of our proposed exhaustless sample generation and

effective selection method for several tasks. First,

we evaluate it in MNIST, including ten classes for

handwritten digits. Although MNIST is an appropri-

ate dataset for a recognition task, its background is

quite simple and it is easy to obtain a high recogni-

tion performance when using it with state of the art

methods. Because of this, we prepare a hand pos-

ture dataset that is more complicated, including object

variation like handwritten digits under cluttered back-

grounds, as shown in Fig.4. In addition, the dataset

has binary images in order to perform the segmen-

tation task. It includes six hand shape classes based

on the number of upward ﬁngers and each class has

scale, position and rotation variations. We evaluate

the recognition performance with both MNIST and

our hand posture dataset. We also evaluate the seg-

mentation performance with the hand posture dataset.

ImprovingQualityofTrainingSamplesThroughExhaustlessGenerationandEffectiveSelectionforDeepConvolutional

NeuralNetworks

233

Table 3: The classiﬁcation error depends on the presence or

absence of sample generation and selection.

methods MNIST Hand Shape

generation & selection 0.57% 9.0%

no generation & selection 0.77% 14.4%

generation & no selection 0.71% 10.6%

no generation & no selection 0.89% 15.6%

The network architectures for each individual dataset

are as described in Table 2. In the sample selection,

training samples are sorted based on Eqns.(12)(13)

in descending order. The ﬁrst 50% samples are se-

lected and randomly shufﬂed to reduce the bias in

mini batch.

5.1 MNIST

MNIST contains 50000 images of each class for train-

ing and 10000 images for testing. The baseline, in

which no data generation and data selection is used,

is trained only with the original training samples. The

method using only sample selection uses the effec-

tive samples from the original ones. The method us-

ing only sample generation creates and uses the aug-

mented samples from original ones. The method us-

ing both generation and selection uses the effective

samples from the augmented samples. All methods

perform update of the network parameters through

200000 iterations using a mini-batch that included 10

samples from above ones. The methods with sample

generation renew the package that contains the aug-

mented samples. The updating parameters τ and ε

are set to 20000 and 0.006 respectively. We imple-

mented the methods using Theano library and train

the network on a NVIDIA GT640 2GB GPU. The

error for each method is displayed in Table 3. The

method using both sample generation and selection

achieves the best performance. There are just slight

differences between the results of the other methods

due to the dataset being composed of images with a

simple black background.

5.2 Recognition of Hand Shape

To evaluate the network in a more challenging envi-

ronment, e.g. like under cluttered background, we

collect images presenting shape variations in the same

class and illumination changes. The basic dataset

contains 1600 original grayscale and binary images

for each class. Our proposed method synthesizes

augmented images from the basic dataset with the

background images, as described in previous section.

We stored a set amount of augmented samples in the

package, here 1000 images. For a mini-batch size of

!#$"

!#%"

!#&"

!#'"

!" $!!!!" %!!!!" &!!!!" '!!!!" (!!!!!" ($!!!!" (%!!!!" (&!!!!" ('!!!!"

$!!!!!"

)**+*

,"+-"./0*12+34

5+"6030*12+3"7"5+"408092+3"

5+"6030*12+3"7"408092+3"

:030*12+3"7"5+"408092+3"

:030*12+3"7";08092+3"

Figure 5: Test error rate in each iteration.

Table 4: The binarization precision-recall rate depends on

the presence or absence of data generation and selection.

methods precision recall f value

generation & selection 0.9023 0.9298 0.9158

no generation & selection 0.8957 0.8909 0.8933

generation & no selection 0.9036 0.9249 0.9141

no generation & no selection 0.8958 0.9008 0.8983

10, the number of iterations for the package is 100.

The package is updated asynchronously until all the

iterations are completed. After the network parame-

ters are updated with 200000 iterations, the total aug-

mented sample number amounted to 2 million indi-

vidual images. The comparison results are show in

Fig. 5 and Table 3. The method with sample genera-

tion and selection is best performance in the dataset.

We can conclude that this is an efﬁcient method for

sample generation and selection.

5.3 Segmentation of Hand Shape

We also evaluate the efﬁciency of our proposed ap-

proach for the segmentation task. We prepare a

dataset of binary images of hand shapes. Through

data generation, we create the augmented images with

their corresponding labeled images. We compare the

methods by precision-recall rate as shown in Table 4.

From comparison results, the sample generation and

selection methods are efﬁcient for segmentation task.

The output of the binarization layer with 130000 iter-

ations is shown in Fig.6. Even with a grayscale image

input, the network successfully extracts the hand re-

gion under a cluttered background. Furthermore, it is

robust to not only the background but also to illumi-

nation changes.

6 CONCLUSION

We proposed a sample generation and selection ap-

proach for ConvNets. Sample generation is per-

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

234

Figure 6: Output of binarization layer: the 1st and 3rd

columns are original images and the 2nd and 4th are out-

put images.

formed asynchronously to create a package including

augmented samples in order to reduce the processing

time. The effective samples are selected from that

package and used to update the parameters. The ex-

haustless sample generation and effective sample se-

lection methods are used in an efﬁcient combination

in order to train the network using a huge amount of

samples, all the while keeping the sample quality. Af-

ter comparison, our proposed approach is not only

useful for the recognition task, but also for the seg-

mentation task. While this is still in its primitive state

for sample generation and selection, we will continue

our research by applying this proposed approach to

other datasets and more challenging recognition tasks.

REFERENCES

Boureau, Y., Bach, F., LeCun, Y., and j. Ponce (2010).

Learning mid-level features for recognition. In IEEE

Conference on Computer Vision and Pattern Recogni-

tion(CVPR2010).

Ciresan, D., Meier, U., and Schmidhuber, J. (2012). Multi-

column deep neural networks for image classiﬁcation.

In IEEE Conference on Computer Vision and Pattern

Recognition(CVPR2012).

Delakisand, M. and Garcia, C. (2008). Text detection with

convolutional neural networks. In InternationalCon-

ference on Computer Vision Theory and Applications

(VISAPP 2008).

Duffer, S. and Garcia, C. (2007). An online backpropa-

gation algorithm with validation error-based adaptive

learning rate. In In International Conference on Artiﬁ-

cial Neural Networks (ICANN), volume 1, pages 958–

962.

Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A.,

and Bengio, Y. (2013). Maxout networks. In arXiv

preprint arXiv:1302.4389.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I.,

and Salakhutdinov, R. R. (2012). Improving neural

networks by preventing co-adaptation of feature de-

tectors. In arXiv preprint arXiv:1207.0580.

Hubel, D. and Wiesel, T. (1962). Receptive ﬁelds, binocu-

lar interaction and functional architecture in the cat’s

visual cortex. Journal of Physiology, 160:106–154.

Jain, V., Murray, J., Roth, F., and Turaga, S. (2007). Super-

vised learning of image restoration with convolutional

networks. In IEEE International Conference on Com-

puter Vision (ICCV2007).

Osadchy, M., LeCun, Y., and Mille, M. (2007). Synergistic

face detection and pose estimation with energy-based

models. In Journal of Machine Learning Research,

number 1197-1215, page 8.

Ouyang, W. and Wang, X. (2013a). Joint deep learning for

pedestrian detection. In IEEE International Confer-

ence on Computer Vision (ICCV2013).

Ouyang, W. and Wang, X. (2013b). Single-pedestrian de-

tection aided by multi-pedestrian detection. In IEEE

Conference on Computer Vision and Pattern Recogni-

tion(CVPR2013).

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986).

Learning internal representations by error propaga-

tion. Parallel Distributed Processing:Explorations in

the Microstructures of Cognition, 1:318–362.

Scherer, D., Muller, A., and S.Behnke (2010). Evaluation of

pooling operations in convolutional architectures for

object recognition. In International Conference on Ar-

tiﬁcial Neural Networks(ICANN2010).

Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolu-

tional neural networks applied to house numbers digit

classiﬁcation. In International Conference on Pattern

Recognition (ICPR 2012).

Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus,

R., and LeCun, Y. (2013). Overfeat: Integrated recog-

nition, localization and detection using convolutional

networks. In arXiv preprint arXiv:1312.6229.

Simard, P., Steinkraus, D., and Platt, J. (2003). Best prac-

tices for convolutional neural networks applied to vi-

sual document analysis. In In International Con-

ference on Document Analysis and Recognition, vol-

ume 2, pages 958–962.

Sutskever, A. K. I. and Hinton, G. (2012). magenet classi-

ﬁcation with deep convolutional neural networks. In

Advances in Neural Information Processing Systems

25(NIPS2012).

Vaillant, R., Monrocq, C., and LeCun, Y. (1994). Origi-

nal approach for the localisation of objects in images.

volume 4, pages 245–250.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R.

(2013). Regularization of neural networks using drop-

connect. In In International Conference on Machine

Learning (ICML2013).

Y. LeCun ad L. Bottou, Y. B. and P.Haffner (1998).

Gradient-basedlearning applied to document recogni-

tion. In Proceedings of the IEEE, 86(11):2278–2324.

Zeiler, M. D. and Fergus, R. (2013). Visualizing and un-

derstanding convolutional networks. In arXiv preprint

arXiv:1311.2901.

ImprovingQualityofTrainingSamplesThroughExhaustlessGenerationandEffectiveSelectionforDeepConvolutional

NeuralNetworks

235