IS THE GAME WORTH THE CANDLE?
Evaluation of OpenCL for Object Detection Algorithm Optimization
Floris De Smedt (1), Lars Struyf (1), Sander Beckers (1), Joost Vennekens (1,2), Gorik De Samblanx (1,2) and Toon Goedemé (1,3)
(1) Campus De Nayer, Lessius Mechelen, Association K.U. Leuven, Leuven, Belgium
(2) Department of Computing Science, K.U. Leuven, Leuven, Belgium
(3) Department of Electrical Engineering, K.U. Leuven, Leuven, Belgium
Keywords: OpenCL, Object Detection, HOG.
Abstract: In this paper we present our experiences with the implementation of an object detector using OpenCL. With this implementation we fulfil the need for fast and robust object detection, necessary in many applications in multiple domains (surveillance, traffic, image retrieval, ...). The algorithm lends itself to a parallel implementation, and we exploit this opportunity by implementing it on a GPU. For this implementation, we have chosen the OpenCL programming language, since this allows scaling to more powerful and different types of hardware with minimal changes to the implementation. We discuss how the parallelization is done and the challenges we met. We also discuss the experimental timing results we achieved and evaluate the ease-of-use of OpenCL.
1 INTRODUCTION
Object detection has endless possibilities in many application areas of computer vision. For example, the detection of humans can be used in surveillance applications, the detection of vulnerable road users in blind spot cameras (Van Beeck et al., 2011), the blurring of persons for privacy reasons, e-health applications (detecting falling elderly people), ... but also in image retrieval applications, where large databases of images are searched for a specific class of objects.
It is important that the detection of the object happens as fast as possible. Many applications expect real-time performance combined with a small number of false positives. Recently, a number of state-of-the-art object detection algorithms with very high recognition performance have been described in the literature (Felzenszwalb et al., 2008), (Felzenszwalb et al., 2010a), (Leibe et al., 2004), (Gall et al., 2011). The downside of these powerful algorithms is that they come with a high computational cost. The algorithm we chose to implement (Felzenszwalb et al., 2010b) is a very robust algorithm based on the histograms of oriented gradients proposed by Dalal and Triggs (Dalal and Triggs, 2005). To increase performance, we implement this algorithm in OpenCL, a novel open standard for heterogeneous computing. This allows us to execute the algorithm on dedicated hardware that exploits the opportunity of parallelization. We see this implementation task as a test case for OpenCL. In this paper we present our experiences with this process and evaluate the ease-of-use of OpenCL and the optimization results.
In section 2, we explain how the chosen algorithm works and which steps are taken to go from a raw image to the detection of objects. In section 3, we discuss in detail the implementation of the construction of the feature pyramid, which is used to find objects; we also cover the advantages and disadvantages of our implementation choices and how we circumvent the obstacles. In section 4, we discuss the experiments we performed and the resulting timing results. In section 5, we share our experience with the use of OpenCL. We conclude in section 6 and present possible improvements as future work.
2 OBJECT DETECTION ALGORITHM
As mentioned in the introduction, we chose to implement an algorithm proposed by Felzenszwalb
(Felzenszwalb et al., 2010b) for object detection. This algorithm is based on the Histograms of Oriented Gradients (HOG) for human detection proposed by Dalal and Triggs (Dalal and Triggs, 2005), who claim that the use of HOG outperforms other feature sets. To improve the detection rate, Felzenszwalb uses parts, for example the limbs of a person, in a deformable configuration to model an object. They applied this technique with success to multiple object classes, not only humans. The improvement in robustness came at the cost of an increased calculation time, so they proposed a cascaded implementation (Felzenszwalb et al., 2010a) which uses partial hypothesis pruning. A similar approach was used by Viola and Jones (Viola and Jones, 2001), where simple filters are used to prune most of the search space, so that more computationally intensive filters only work on possible detection areas. In this paper we show our work in progress to speed up this implementation even more by using dedicated hardware that exploits the opportunity of parallelization.
The algorithm can be divided into two main parts:
1. The construction of the feature pyramid.
2. The search for a model in this feature pyramid.
In this paper, we focus on the implementation of the first part. The results we present are therefore independent of the model we are looking for. Our optimized implementation can thus be used in a detector for any arbitrary object class for which a trained model is available: pedestrians, bicycles, horses, cars, .... In figure 1 the models for a person and the front view of a car are shown. At the left we see the root model, in the middle the parts and at the right the probability of finding each part at that position on the root model.
Figure 1: Person model (top) and car model (bottom). From left to right: root model, parts model, probability of finding this part at this spot on the root model.
The construction of the feature pyramid can be subdivided into four stages:
1. Rescale the image to different resolutions, resulting in a scale-space pyramid. This allows the model to be found at different sizes without the need to rescale the model, which is very complex.
2. Calculate the gradients of the pixels for each layer in the scale-space pyramid. Using the gradients makes the features invariant to illumination changes.
3. Create histograms of the orientations of the gradients (HOG).
4. Use these histograms to calculate the features of each layer.
Once the feature pyramid is built, it can be used to search for a model at different scales. Each model consists of a root model, which is used to find the object as a whole (comparable to the model used by Dalal and Triggs), and multiple part models. The part models, which search for small parts that can have a deformable placement relative to the root model, work on twice the resolution of the root model. The higher resolution offers more image information, since more pixels of the same image area are present. This can be seen in figure 2. At the left the scale-space pyramid is shown; at the right we can observe the resulting feature pyramid and the layers the different parts are applied to. In figure 3 and figure 4 some detection results for the pedestrian model and the car model are shown.
Figure 2: Scale-space pyramid and resulting feature pyramid.
Figure 3: Detection of pedestrians.
Figure 4: Detection of a car using the side-view model.
3 IMPLEMENTATION
In this section we go deeper into the OpenCL-specific implementation details of this feature pyramid. We explain how the different parts work exactly, and point out the advantages and disadvantages for our implementation. Our implementation is based on a publicly available Matlab implementation released by Felzenszwalb (Felzenszwalb et al., 2010c). We first reimplemented this algorithm in C code, which is easier to port to OpenCL. The values resulting from the different steps of our own implementation are identical to those of the original implementation. This way we obtain identical detection results.
3.1 OpenCL
Modern platforms include one or more CPUs, GPUs, DSPs, ... All these hardware types are designed and optimized for a specific type of calculation. OpenCL, the Open Computing Language (Group, 2011), is a novel open standard for heterogeneous computing. It is a framework for writing programs that can use these platforms in a heterogeneous way, in contrast to CUDA, which was developed by Nvidia for GPU hardware only. This allows us to write an efficient and portable implementation of an algorithm which exploits the possibilities for parallelizing parts of the algorithm on the most suitable devices (multi-core CPU, GPU, cell-type architectures or other parallel processors). Since it is heterogeneous, we don't have to know in advance which hardware will be used to execute the algorithm. The platform that is used can easily be changed by changing an initialisation variable of the program. Since different devices have different instruction sets, the compilation of the OpenCL kernels happens online (during the execution of the program).
The code is written in the form of kernels. A kernel is a block of code, written in a language based on C99, that can be executed in parallel. For example, when each value of a matrix has to be multiplied by a certain value, the kernel contains the code for one multiplication and this kernel is executed for all elements of the matrix. The execution of the NDRange (all kernel instances that have to be executed) is subdivided into workgroups. A workgroup is subdivided into work-items, which execute the kernels in parallel (figure 5). To distinguish the different executing threads, each thread has a unique global id, and within a workgroup each thread has a unique local id. Both are assigned per dimension.
Figure 5: The execution of the kernels is divided into workgroups, which can be subdivided into work items. Each work item executes an instance of the kernel.
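As an illustration of the matrix example above, a minimal kernel could look as follows (a sketch of ours, not code from the detector; the kernel name and arguments are assumptions):

    __kernel void scale_matrix(__global float *data, const float factor)
    {
        // One work-item per matrix element; the global id selects it.
        size_t gid = get_global_id(0);
        data[gid] *= factor;
    }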
Figure 6 shows the memory model of a GPU device. The memory access times range from the slowest at the bottom (starting with the memory of the host computer) to the fastest at the top (private memory). The global memory of the GPU (and CPU) is shared by all executing work items, the local memory is shared by the work items in the same compute unit (workgroup), and the private memory is only accessible by the running work item.
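In kernel code this hierarchy corresponds to address-space qualifiers. A small illustrative fragment of ours, not taken from the detector:

    __kernel void copy_to_local(__global const float *g, __local float *l)
    {
        float p = g[get_global_id(0)];  // 'p' lives in private memory (registers)
        l[get_local_id(0)] = p;         // local memory, shared within the workgroup
        barrier(CLK_LOCAL_MEM_FENCE);   // synchronize the workgroup before reuse
    }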
3.2 Rescale
The first step in the construction of the feature pyramid is the construction of a scale-space pyramid, which contains rescaled versions of the input image computed using linear interpolation. Unlike most implementations of linear interpolation (e.g., those typically provided in hardware on GPUs), more than only the directly neighbouring pixels are used: for each power of two by which the source resolution is bigger than the target resolution, an extra pixel is used in the horizontal and the vertical direction.
Figure 6: Memory model of a GPU, from slow to fast: Global memory, Local memory, Private memory.
In our implementation, we chose to launch one thread (parallel running instance) for each destination pixel. Since the rescaling can be split into a vertical and a horizontal rescaling, the kernel executed by each thread is split up in these two directions. At the beginning of the thread we calculate which pixels will be needed by the linear interpolation process. These pixels are then used to rescale in the vertical direction and the result is stored in a temporary pixel array in private memory (registers of the GPU, the memory with the best access time). In the next step, these vertically rescaled pixels are used in a horizontal rescaling step, which yields the destination pixel. The calculated pixels are written to global memory as part of a layer in the scale-space pyramid.
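Below is a simplified sketch of such a per-destination-pixel kernel. For brevity it only uses the two directly neighbouring source pixels per direction (plain bilinear), whereas our actual kernel adds extra taps per power of two of downscaling; all names are illustrative:

    __kernel void rescale(__global const float *src, __global float *dst,
                          const int src_w, const int src_h,
                          const int dst_w, const int dst_h)
    {
        int x = get_global_id(0);             // destination column
        int y = get_global_id(1);             // destination row
        if (x >= dst_w || y >= dst_h) return;

        float fx = x * (float)src_w / dst_w;  // source coordinates
        float fy = y * (float)src_h / dst_h;
        int x0 = (int)fx, y0 = (int)fy;
        int x1 = min(x0 + 1, src_w - 1);
        int y1 = min(y0 + 1, src_h - 1);

        // Vertical pass first; the intermediate values live in private memory.
        float left  = mix(src[y0 * src_w + x0], src[y1 * src_w + x0], fy - y0);
        float right = mix(src[y0 * src_w + x1], src[y1 * src_w + x1], fy - y0);

        // Horizontal pass combines the vertically rescaled values.
        dst[y * dst_w + x] = mix(left, right, fx - x0);
    }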
The advantage of this method is that each pixel can be calculated independently, so we have maximal parallelism, which can be exploited by an increase in the number of available execution units. The disadvantage is the non-linear access pattern of the needed pixels, which makes the rescaling of images one of the most computationally intensive tasks in the creation of the feature pyramid (see section 4). This problem can probably be solved by using texture memory, since this is capable of handling more random read and write access patterns (see section 7).
3.3 Histogram
When the scale-space pyramid is built, we can create histograms for each layer based on the orientations of the gradients of its pixels. Subtracting the previous from the next pixel in the horizontal and the vertical direction yields the horizontal and the vertical derivative of that pixel, respectively. The gradient magnitude is computed as the sum of the squared derivatives. Only the strongest (largest) gradient of the three color channels is used to vote in the histogram. To determine the orientation of the gradient, the horizontal and vertical derivatives are multiplied with respectively the cosine and the sine of each bin orientation (the use of 18 bins results in 20 degrees per orientation bin) and then summed. The maximum response gives us the orientation of the gradient.
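A sketch of this orientation search (the 18-entry cosine/sine tables and the variable names are our assumptions):

    // dx, dy: derivatives of the strongest color channel;
    // cos_tab/sin_tab: precomputed unit vectors of the 18 bin orientations.
    float best = 0.0f;
    int   bin  = 0;
    for (int o = 0; o < 18; o++) {
        float resp = cos_tab[o] * dx + sin_tab[o] * dy; // projection on bin direction
        if (resp > best) {
            best = resp;
            bin  = o;
        }
    }
    // 'bin' is the orientation this pixel votes for.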
Each pixel votes in four neighbouring histograms, thereby avoiding abrupt changes as a pixel smoothly moves from one histogram to another. The influence of the votes is determined by trilinear interpolation. The construction of the well-known SIFT local feature descriptor (Lowe, 2004), which is also based on HOG, uses a very similar approach. This trilinear interpolation makes it very difficult to split up the histograms in smaller parts, which would make parallelization easier.
Each histogram contains the votes of a limited number of pixels (4x4 or 8x8). If we were to use an approach similar to the rescaling part and use one thread per voting pixel, we would face the problem that multiple pixels need write access to the same memory addresses. We found that the classic solution of using a semaphore to lock a memory location is very complex to implement on a GPU, since the program counter of multiple threads is shared for performance reasons. Sharing the program counter has the effect that the code waiting for the release of the lock is shared with the thread that holds the lock, so the lock is never released, which results in a deadlock. An extra disadvantage of the semaphore approach is that it goes against the philosophy of parallel programming: we create a bottleneck by waiting for the release of the memory lock, which prevents gaining computation time by parallel execution.
Our solution to this problem is to keep the four groups of histograms separate and launch one thread for each block of pixels that vote in the same histogram (4x4 or 8x8). With this approach the kernels don't have to wait to write their result. When the four groups of histograms are filled in, they can be summed together, taking their relative offsets into account, to obtain the final histograms of the image layer. Since the pixels we use for a single histogram are not aligned in memory, the same disadvantage of a random read pattern arises. Therefore, in this part of the algorithm too, the use of texture memory can offer additional speedup (section 7).
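The following sketch illustrates the one-thread-per-block scheme for 8x8 blocks, with the trilinear weighting and the four offset groups omitted for brevity; the names and data layout are assumptions:

    __kernel void block_histogram(__global const int   *bin,  // per-pixel orientation bin
                                  __global const float *mag,  // per-pixel gradient magnitude
                                  __global float *hist,       // 18 bins per block, zeroed by host
                                  const int img_w, const int blocks_x)
    {
        int bx = get_global_id(0);
        int by = get_global_id(1);
        __global float *h = hist + (by * blocks_x + bx) * 18;

        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++) {
                int p = (by * 8 + y) * img_w + (bx * 8 + x);
                h[bin[p]] += mag[p];  // exclusive write: no semaphore needed
            }
    }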
3.4 Features
The last step in the creation of the feature pyramid is the calculation of the features from the HOGs. The feature pyramid has 32 layers, containing four types of features, shown in figure 7. For the calculation of the features, we chose the number of threads launched to be equal to the number of positions in a feature layer. This equals the number of histograms in the horizontal and the vertical direction minus two.
Figure 7: The layers of the feature pyramid.
The first type are the contrast-sensitive features, which take 18 layers, one orientation per layer, and use the energy of the histograms combined with the original content of the histograms. The energy is calculated by summing the squared sums of opposite orientations of each block. For this purpose a separate kernel is used, since the consecutive memory access is exploited this way. The energy is used as a sum over four neighbouring histogram positions. Since the same groups are used multiple times, we calculate these sums in advance to save processing time.
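As a sketch (names are ours), the energy of one 18-bin histogram is then:

    // Sum over the 9 pairs of opposite orientations (180 degrees apart)
    // of the squared pair sum.
    float energy = 0.0f;
    for (int o = 0; o < 9; o++) {
        float s = hist[o] + hist[o + 9];
        energy += s * s;
    }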
To calculate the contrast-sensitive features, each orientation in the histograms is multiplied four times (four possible groups, see figure 8) with the four values we calculated in advance. The feature value is half the sum of the resulting four values. Since the results of these multiplications, summed over all orientations, are used for the texture features, we store them in global memory. This results in four blocks of memory, each the size of the number of histograms.
Figure 8: Four groups from the energy histogram for a position from the original histograms.
The next nine layers of the feature pyramid are filled with contrast-insensitive features. These are calculated very similarly to the contrast-sensitive features, but instead of the original histogram values, the sums of opposite orientations are used.
The third type of features, the texture features, is responsible for four layers in the feature pyramid. Here, the four blocks of memory we stored during the calculation of the contrast-sensitive features are used. Each stored memory block results in one layer by multiplying its values with a fixed number. Here too, the consecutive memory access keeps the calculation cost low.
The last layer contains the truncation features and consists of all zeros.
With the techniques described above, we now have OpenCL implementations of the scale-space pyramid (Rescale), the histogram and the feature computations, which are ready to be tested and compared.
4 EXPERIMENTAL TIMING RESULTS
In this section we present the timing results of our different experiments. We begin with our reference implementation on the CPU and move step by step towards a complete implementation of the feature pyramid in OpenCL.
4.1 Experiment Specifications
All experiments are executed on the same platform, with a Core i7 965 (3.2 GHz) CPU and a dedicated Nvidia GeForce GTX 295 GPU. This GPU can be used as two parallel devices, but we only use one. We run our experiments under the Linux operating system.
The experimental timing results were obtained from 795 images with a resolution of 600x480 from the PETS2010 dataset (PETS, 2010), each of which is processed three times. For OpenCL profiling we used the visual profiler released by Nvidia, which runs the different implementations seven times over 30 images. To profile the C implementation we made use of callgrind.
Table 1: Distribution of calculation time on CPU.
Function Share of calculation time (%)
Transform 0.31
Rescale 20.05
Histogram 69.39
Energy 0.52
Feature calculation 9.73
4.2 C Implementation
(Implementation A)
The CPU implementation of the algorithm is used as a reference. Since OpenCL is an extension of the C programming language, using a C implementation as a starting point is of great use. In table 1 the division of the calculation time for the CPU implementation can be observed. The largest share is spent on the calculation of the histogram. The C implementation is compiled with the gcc compiler using the -O3 optimization option.
Figure 9: Comparison of the time needed to calculate the total feature pyramid of the four implementations.
4.3 Rescale in OpenCL
(Implementation B)
As a first experiment, the rescaling of the images is executed on the GPU. The source image is transferred to the GPU once and is rescaled multiple times. The resulting scale-space pyramid needs to be transferred back to host memory for further processing. In figure 10 one can observe that the amount of time spent transferring data is large compared to the actual computation time. Still, we observe a big gain in time compared to the CPU version, as can be seen in figure 9.
4.4 Rescale and Histogram in OpenCL
(Implementation C)
In this second experiment we execute two parts of the feature pyramid on the GPU, namely the rescaling of the images and the computation of the HOGs from these images. After calculating the HOGs, these are transferred back to host memory for the calculation of the features. In figure 10 we can observe that almost all computation time is consumed by the rescaling and the calculation of the histogram. As we mentioned in section 3.2 and section 3.3, these functions are limited by the memory access speed and the non-sequential memory access pattern. Figure 9 shows that the implementation of these two functions yields the largest time gain. This can be explained by the potential speedup predicted by Amdahl's law (S for the sequential part, P for the parallel part):

speedup = 1 / (S + P / #cores)

Figure 10: Distribution of calculation time on GPU over kernels.
We learned from table 1 that these two functions are the most computationally intensive on the CPU, so by parallelizing these functions we can gain the most overall speedup.
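As a worked example with the numbers of table 1: Rescale and Histogram together account for 89.4% of the CPU time, so even a perfect parallelization of only these two functions (P = 0.894 and an unlimited number of cores) bounds the overall speedup at 1 / (1 - 0.894), roughly 9.5.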
Table 2: Distribution of calculation time on GPU for the
three implementations.
Function B (%) C (%) D (%)
Transform 0.823 0.431 0.576
Rescale 88.99 48.41 44.397
Histogram / 48.28 45.55
AddHist / 1.164 1.374
Contrast / / 2.48
Energy / / 0.443
Texture / / 0.443
Summation / / 0.266
Mem Copy 10.19 1.72 3.90
4.5 Total Feature Pyramid in OpenCL
(Implementation D)
In our final experiment we execute the total feature pyramid on the GPU. The initial image is transferred to device memory and, after the execution of all the kernels, the total feature pyramid is transferred back to host memory. Although the share of memory transfer is large compared to the previous implementation (Implementation C, figure 10), the implementation is worth it: we still obtain a speedup, as can be seen in figure 9.
4.6 Comparison of Results
In figure 9, a comparison of the experimental timing results is given. We can observe that the use of dedicated hardware computes the feature pyramid four times as fast as the CPU implementation. We can also notice that the largest speedup is obtained in the parts that are most computationally intensive, namely the image rescaling and the calculation of the histogram. The speedup we gain by implementing functions in OpenCL is almost directly proportional to the time they need on the CPU.
5 OpenCL EVALUATION
In this section we discuss our experiences with the use of OpenCL as a way to optimize an algorithm by running it on a GPU. We comment on the learning curve (5.1) and give tips and tricks for development (5.2) and for debugging and profiling (5.3).
5.1 Learning Curve
Since OpenCL is a quite novel standard, the available literature is still growing. The specification released by the Khronos group (Group, 2011) is very valuable as a reference for function calls while developing. It does not only explain how to use the functions, but also lists possible errors and shows how to prevent them. Although it is a great help, it does not contain enough information to exploit the possibilities of OpenCL and produce the best implementation. It is necessary to know how OpenCL works to fully exploit these opportunities. When we started using OpenCL, learning it was mostly based on examples and a trial-and-error approach, which results in a long learning period. By now the available literature ((Tsuchiyama et al., 2009), (Benedict et al., 2011)) offers a more complete range of books, which reduces the learning curve drastically and can also teach the reader a correct way of programming for a high performance gain.
5.2 Development
OpenCL focuses on heterogeneity. This comes at the cost of a lot of function calls to set up your execution environment (creating a platform, creating devices, creating a program, creating command queues, ...). This can be seen as a disadvantage, but once these functions are written, they can easily be reused in later projects without losing the flexibility OpenCL offers. This flexibility allows an easy change of execution device without modifying your kernel code.
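A minimal host-side setup sketch (error checks omitted; kernel_src is assumed to be a const char * holding the kernel source text). Note that switching the execution device only requires changing the device-type argument:

    #include <CL/cl.h>

    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); // or CL_DEVICE_TYPE_CPU

    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // The kernels are compiled online, for the instruction set of the chosen device.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "rescale", NULL); // illustrative kernel name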
To make optimal use of the possibilities of OpenCL, it is necessary to understand every detail of the algorithm to implement. Simply copy-pasting existing source code into kernel code can give a speedup, but this will be small compared to an implementation which exploits the availability of fast memory, the highest possible degree of parallelization and sequential memory access. The growing number of publicly available libraries of optimized implementations of commonly used functions (matrix multiplication, image filtering, convolution, ...) can limit the development time, since the developer only needs to focus on the rest of the algorithm.
5.3 Debugging and Profiling
We used a Linux (Ubuntu) workstation, with a dedicated Nvidia GTX 295 card, as development platform. This approach is very useful, since the graphical shell of the operating system can occupy the second GPU device, so that incorrect memory use on the main GPU will not freeze the graphical environment. Nvidia releases a bundle of the newest driver, the CUDA and OpenCL drivers, and examples. These examples can be used to learn how to develop, but also as a template for your own projects.
The use of an OpenCL-compatible CPU can be very useful during development, since it allows debugging, but since the architectures of CPUs and GPUs differ, some problems are very difficult to track down. For example, the implementation of a semaphore is quite easy on a CPU, but the same code can result in a deadlock when executed on a GPU, since the program counter is shared for optimization. The difficulty of debugging is mostly due to the absence of debugging environments for OpenCL on Linux, which have already been released for the Windows operating system.
Also for profiling, the availability of tools is quite sparse and mostly hardware-manufacturer dependent. In this paper we used the visual profiler released by Nvidia, which can give detailed information about memory use and execution times of the kernel code. The possibility of getting timing information from within a kernel is not yet available, but it could be quite useful to track down bottlenecks inside the kernels.
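A host-side workaround does exist in the standard: per-kernel (though still not intra-kernel) timings can be read from OpenCL events when the queue is created with CL_QUEUE_PROFILING_ENABLE. A sketch, assuming queue, kernel and the work sizes are already set up:

    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong start, end;  // device timestamps in nanoseconds
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);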
6 CONCLUSIONS
In this paper, we presented our experiences with the implementation of an algorithm for object detection in OpenCL. We discussed the opportunities we exploited by parallelizing parts of the algorithm on the GPU. We used a CPU implementation as a reference, and observed a speedup from circa 0.60 seconds to 0.15 seconds for the construction of the feature pyramid of images with a resolution of 600x480. During this implementation we encountered several challenges. The most important one is the simultaneous write operation to the same memory location, which called for an approach that allows execution in a more parallel way.
We also discussed the ease-of-use of OpenCL. Its flexibility comes at the cost of needing many function calls before the actual kernels can be executed, but these calls can be reused in other projects, so the big advantage of heterogeneity and scalability outweighs the extra work. Since OpenCL is a novel standard, the available literature is limited, but still growing. Overall, we can thus certainly state that the optimization game is worth the OpenCL candle.
7 FUTURE WORK
In the future, we will extend our implementation with the use of texture memory, which allows a more random access pattern in memory. Recent preliminary experiments show that we could reach an access speed of three to four times the current one; a sketch of such a texture read is given below. We will also integrate the use of vectors, which allows an operation to be executed on multiple elements at once. This requires an additional padding of memory to a multiple of the vector size. We will also implement the search for the models in OpenCL. This part of the algorithm mostly consists of convolutions with the model. It removes the need to transfer the total feature pyramid to host memory; only the coordinates of the detections have to be transferred.
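As an illustration of the planned texture-memory extension, a read through an OpenCL image object would look roughly as follows (an assumption-laden fragment of ours, not part of the current implementation):

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE   |
                               CLK_FILTER_NEAREST;

    __kernel void read_pixel(__read_only image2d_t src, __global float4 *out,
                             const int w)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        // The texture cache tolerates the scattered reads that slow down
        // our current global-memory kernels.
        out[y * w + x] = read_imagef(src, smp, (int2)(x, y));
    }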
ACKNOWLEDGEMENTS
This work is supported by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT) via the Tetra project S.O.S. OpenCL - Multicore cooking.
REFERENCES
Benedict, G. R., David, K., Perhaad, M., and Dana, S. (2011). Heterogeneous Computing with OpenCL. Morgan Kaufmann.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In International Conf. on CVPR, volume 2, pages 886–893.
Felzenszwalb, P., Girshick, R., and McAllester, D. (2010a). Cascade object detection with deformable part models. In Proc. of the IEEE Conf. on CVPR.
Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. (2010b). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9).
Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proc. of the IEEE Conf. on CVPR.
Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2010c). Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
Gall, J., Yao, A., Razavi, N., Van Gool, L., and Lempitsky, V. (2011). Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Group, K. (2011). OpenCL - the open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/.
Leibe, B., Leonardis, A., and Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV'04 Workshop on Statistical Learning in Computer Vision.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision.
PETS (2010). PETS 2010 benchmark data. http://www.cvg.rdg.ac.uk/PETS2010/a.html.
Tsuchiyama, R., Nakamura, T., Iizuka, T., Asahara, A., and Miki, S. (2009). The OpenCL Programming Book. Fixstars.
Van Beeck, K., De Smedt, F., Beckers, S., Struyf, L., Vennekens, J., De Samblanx, G., Goedemé, T., and Tuytelaars, T. (2011). Towards robust automatic detection of vulnerable road users: monocular pedestrian tracking from a moving vehicle. In Proc. of ATINER 7th Annual International Conf. on Computer Science and Information Systems.
Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. of the IEEE Conf. on CVPR.