Authors:
Imene Guerfi, Lobna Kriaa and Leila Azouz Saidane
Affiliation:
CRISTAL Laboratory, RAMSIS Pole, National School for Computer Sciences (ENSI), University of Manouba, Tunisia
Keyword(s):
GPU Computing, Parallel Programming, Program Optimization, Auto-tuning, and Face Detection.
Abstract:
With the growing amount of data, computational power has become highly required in all fields. GPUs appear to be the appropriate solution to satisfy these requirements, but one of their major drawbacks is their varying architectures, which make writing efficient parallel code very challenging because the programmer must master the GPU's low-level design. CUDA gives the programmer more flexibility to exploit the GPU's power with ease; however, tuning the launch parameters of its kernels, such as the block size, remains a daunting task. Tuning this parameter well requires a deep understanding of the architecture and the execution model. In the Viola-Jones algorithm in particular, the block size is an important factor in improving execution time, yet this optimization aspect is not well explored. This paper offers the first steps toward automatically tuning the block size for any input without deep knowledge of the hardware architecture, which ensures automatic portability of performance across different GPU architectures. The main idea is to define techniques for finding the optimum block size that achieves the best performance. We point out the impact of using a static block size for all input sizes on overall performance. In light of these findings, we present two dynamic approaches that select the block size best suited to the input size. The first is based on an empirical search; it provides optimal performance, but it is demanding for the programmer and its deployment is time-consuming. To overcome this issue, we propose a second approach: a model that automatically selects the block size. Experimental results show that this model improves execution time by up to 2.5x over the static approach.