calculate one feature for this kernel, so there will be
2913 threads. The fourth kernel, which runs the cascade
classifier and detects faces, performs the most intensive
work; it needs (height x width) - (24 x 24) threads. For
more information about the parallelized algorithm, we
refer the reader to (Guerfi et al., 2020).
Since the basic idea of the algorithm is to resize the
image repeatedly, each kernel processes multiple image
sizes, each requiring a different total number of threads.
For that reason, using a static block size degrades
performance. Next, we show that a static block size is a
poor choice for the programmer and how a dynamic block
size can improve performance. After that, we present our
model, which calculates the block size for any image
without intervention from the programmer.
4 TUNING BLOCK SIZE
The optimal block size depends on limits imposed by the
GPU architecture, such as the maximum number of threads
per SM, and on software factors, such as the number of
registers used per thread and the input data size. This
section describes our methods for determining the optimal
block size. First, we illustrate the impact of skipping
the tuning step and choosing a static block size for all
image sizes. After that, we show how selecting the block
size dynamically according to the input size can improve
performance. A dynamic block size gives the best
performance; however, finding it by hand is difficult for
the programmer and time-consuming. For that reason, we
present a model that selects the block size automatically.
The block sizes selected by the model provide near-optimal
performance.
4.1 Static Block Size
When writing the kernels of the Viola-Jones algorithm,
the block size must be specified; however, a single value
cannot suit all input image sizes. Moreover, due to the
nature of the Viola-Jones algorithm, the same input image
is resized multiple times. This section shows how choosing
one fixed block size over another can considerably
decrease performance. We execute ten different image sizes
with all possible block sizes. However, since the search
space for the block size ($B_{size}$) is vast, we need to
reduce it; we use some basic knowledge of the CUDA
programming model that any beginner should know and some
hardware limits (NVIDIA, 2021a) (Cheng et al., 2014).
• A block may contain at most 1024 threads ($MaxT_B$),
which is the maximum block size. Regardless of the block's
dimensions, their product must not exceed this limit
(equation 1).
$$B_{size} \leq MaxT_B \qquad (1)$$
• The hardware allocates threads for a block in units of
32, which is the warp size ($W_{size}$), so it makes sense
to choose a block size that is a multiple of 32
(equation 2).
$$B_{size} = W_B \times W_{size} \qquad (2)$$
where $W_B$ is the number of warps per block. This
condition maximizes performance and avoids wasting
resources; at the same time, it fixes a lower bound for
the block size. From (equation 1) and (equation 2), we
obtain the limits of the block size:
$$32 \leq B_{size} \leq 1024$$
Consequently, because the number of threads per block
should be a multiple of 32 (equation 2), the search space
is restricted to:
$$B_{size} \in \{32, 64, 96, 128, 160, 192, 224, 256, 288, 320,
352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672,
704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024\}$$
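As a minimal sketch, the restricted search space can be enumerated directly from the two constraints (a block size that is a multiple of the warp size and at most the maximum number of threads per block). The function and constant names below are hypothetical and chosen only for illustration; the values 32 and 1024 are the ones given in the text.

```python
# Illustrative only: names are hypothetical, values come from the text.
WARP_SIZE = 32                 # W_size
MAX_THREADS_PER_BLOCK = 1024   # MaxT_B


def candidate_block_sizes(warp_size=WARP_SIZE, max_threads=MAX_THREADS_PER_BLOCK):
    """Return every block size W_B * warp_size that does not exceed max_threads."""
    return list(range(warp_size, max_threads + 1, warp_size))


candidates = candidate_block_sizes()
print(len(candidates), candidates[0], candidates[-1])  # prints: 32 32 1024
```

The search space therefore holds 32 candidate block sizes, one per possible number of warps per block.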
After reducing the search space of the block size, we
claim that even the block size that gives the best
performance among these choices is not sufficient; we
provide an example to explain the problem. Suppose that
the input image size is 1024 x 1024 in the integral image
kernel (we focus on the integral image; however, this
example applies to all the kernels used). The total number
of threads needed for execution equals the image's width
(as explained in the previous section). Assume that, after
an empirical search, the programmer found that 128 is the
best block size. Once launched, the face detection
algorithm resizes the image 21 times. The total number of
threads executed for the integral image in each iteration
is {1024, 853, 711, 592, 493, 411, 342, 285, 238, 198,
165, 137, 114, 95, 79, 66, 55, 46, 38, 32, 26},
respectively. After some iterations (from a width of 114
onward), 128 exceeds the total number of elements, which
means we invoke more threads than needed. That is only one
problem with a static block size: one fixed block size
cannot fit all the input sizes. Next, we attempt to tune
the best block size for the ten input sizes used and all
resized images.
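The over-provisioning in this example can be made concrete with a short sketch. It assumes a one-dimensional launch in which the number of blocks is the thread count divided by the block size, rounded up; this launch scheme and the helper name `launch_stats` are assumptions made for illustration, not details taken from the implementation. The widths are the ones listed above.

```python
# Widths of the successively resized 1024 x 1024 image, as listed in the text.
WIDTHS = [1024, 853, 711, 592, 493, 411, 342, 285, 238, 198, 165,
          137, 114, 95, 79, 66, 55, 46, 38, 32, 26]

STATIC_BLOCK = 128  # block size the programmer is assumed to have selected


def launch_stats(n_threads, block_size):
    """Return (blocks, launched, idle) for an assumed 1-D launch.

    blocks   : number of blocks, rounded up so that all n_threads are covered
    launched : total threads actually launched
    idle     : launched threads that have no element to process
    """
    blocks = (n_threads + block_size - 1) // block_size
    launched = blocks * block_size
    return blocks, launched, launched - n_threads


for width in WIDTHS:
    blocks, launched, idle = launch_stats(width, STATIC_BLOCK)
    print(f"width={width:4d}  blocks={blocks}  launched={launched:4d}  idle={idle:3d}")

# Iterations in which a single 128-thread block is already larger than the input.
smaller_than_block = [w for w in WIDTHS if w < STATIC_BLOCK]
```

Under these assumptions, nine of the 21 widths are smaller than one 128-thread block. For the last width, 26, a block of 128 threads leaves 102 threads idle, whereas a block of 32 threads would leave 6.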
4.2 Empirical Block Size Tuning (Dynamic Block Size)
Since it’s clear that tuning is indispensable, the obvi-
ous way to tune that comes to mind is an empirical
ICSOFT 2022 - 17th International Conference on Software Technologies