We can only speculate about the underlying reasons, but we are convinced that one of the issues lies in the sampling of the eye-tracker video material, which contains large amounts of motion blur, especially when people move their head quickly for a brief glance down the aisle. Better eye-tracker hardware could be a possible solution, as could restricting the validation data to sharp eye-tracker frames and removing the motion-blurred evaluation frames. We also expect that our model will never be able to detect motion-blurred advertisement boards, since such frames were never part of the training data. Adding them to the training data might therefore improve the generalization capabilities of the trained deep learning model.
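Such motion-blurred training samples could, for example, be synthesised from the existing sharp annotations rather than captured anew. The snippet below is a minimal sketch of this idea, assuming OpenCV and NumPy are available; the kernel size and file name are purely illustrative and not part of our actual pipeline.

```python
# A minimal sketch (not our exact pipeline) of synthesising motion-blurred
# training samples with OpenCV; kernel_size and the file name are hypothetical.
import cv2
import numpy as np

def horizontal_motion_blur(image, kernel_size=15):
    # Convolve with a normalised 1-D horizontal kernel to mimic fast
    # sideways head movement during a glance down the aisle.
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size
    return cv2.filter2D(image, -1, kernel)

# Example: blur an existing sharp training image of an advertisement board.
blurred = horizontal_motion_blur(cv2.imread("advertisement_board.jpg"), kernel_size=21)
```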
A remaining challenge for the package detection and classification case lies in expanding the multi-class model. At present, our model still requires an overnight training step. For most applications this is acceptable, but for some it is not feasible at all. We should therefore investigate how existing models can be extended with an extra class at minimal processing cost. This research field, known as incremental learning, already has several application areas, such as object detection in video sequences, as described in (Kuznetsova et al., 2015).
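A full incremental-learning solution is beyond the scope of this work, but the basic idea of adding a class without retraining the whole network from scratch can be illustrated with a freeze-and-extend sketch. The code below assumes a generic tf.keras classifier whose last layer is a softmax over the existing classes; it illustrates the principle only, and is neither our Darknet pipeline nor the incremental detection framework of (Kuznetsova et al., 2015).

```python
# A minimal sketch, assuming a generic tf.keras classifier, of extending a
# trained model from N to N + 1 classes while freezing the shared layers.
import tensorflow as tf

def expand_with_extra_class(old_model, num_old_classes):
    # Freeze everything that was already learned.
    for layer in old_model.layers:
        layer.trainable = False
    # Reuse the features produced just before the old classification layer.
    features = old_model.layers[-2].output
    # New softmax head with one extra class; only this layer is trained.
    new_head = tf.keras.layers.Dense(num_old_classes + 1, activation="softmax",
                                     name="expanded_head")(features)
    return tf.keras.Model(inputs=old_model.input, outputs=new_head)
```

Since only the expanded head receives gradient updates, the overnight retraining step is replaced by a much shorter fine-tuning step, possibly at the cost of some accuracy compared to full retraining.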
Finally, every augmented training sample is randomly flipped around the vertical axis in the Darknet data augmentation pipeline. While in most cases this helps to make the detector more robust, there are cases where it can actually degrade the model. We therefore suggest training models without these random flips and comparing them to the results obtained so far, in order to quantify their influence.
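To make this comparison concrete, the sketch below shows what such a horizontal flip does to an image and its YOLO-style relative bounding boxes; disabling the augmentation simply amounts to skipping this step. The function name and probability are illustrative and not the actual Darknet implementation.

```python
# A minimal sketch of random horizontal flipping for detection training data,
# assuming YOLO-style relative boxes (x_center, y_center, width, height).
# Conceptual only; this is not the actual Darknet code.
import random
import numpy as np

def maybe_flip_horizontal(image, boxes, flip_probability=0.5):
    if random.random() < flip_probability:
        image = np.fliplr(image).copy()
        # Mirroring only changes the horizontal centre coordinate.
        boxes = [(1.0 - xc, yc, w, h) for (xc, yc, w, h) in boxes]
    return image, boxes

# Setting flip_probability=0.0 reproduces the "no random flips" experiment.
```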
ACKNOWLEDGEMENTS
This work is supported by KU Leuven, Campus De Nayer and Flanders Innovation & Entrepreneurship (AIO).
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Bengio, Y. (2012). Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 17–36.
Bradski, G. et al. (2000). The OpenCV library. Dr. Dobb's Journal, 25(11):120–126.
Dauphin, G. M. Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., et al. (2012). Unsupervised and transfer learning challenge: a deep learning approach. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 97–110.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.
Gan, Z., Henao, R., Carlson, D., and Carin, L. (2015). Learning deep sigmoid belief networks with data augmentation. In Artificial Intelligence and Statistics, pages 268–276.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM.
Kuznetsova, A., Ju Hwang, S., Rosenhahn, B., and Sigal, L. (2015). Expanding object detector's horizon: incremental learning framework for object detection in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 28–36.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer.
Redmon, J. (2013–2016). Darknet: Open source neural networks in C. http://pjreddie.com/darknet/.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.