The starting point of this research area was not GPGPUs, but regular GPUs on embedded platforms. The first approaches used the OpenGL and DirectX APIs to program the shader cores and the programmable pipeline of GPUs with shading languages such as GLSL (Khronos Group, 2012) (Khronos Group, 2004). In particular, using OpenGL ES 2.0 to accelerate image processing algorithms on embedded platforms attracted considerable attention. The introduction of CUDA and OpenCL then greatly expanded the use of GPGPUs for image processing and other applications.
2.1 GPU Model
The general-purpose GPU model was first achieved through the programmable shader infrastructure. GLSL is a shader programming language that can express not only graphics shader code but also general-purpose computations. A survey of general-purpose computation on GPUs is given in (Owens, J. 2007).
Singhal et al. implemented many image processing, color conversion and transformation algorithms, as well as applications such as Harris corner detection and real-time video scaling, for handheld devices (Singhal, N. 2010). They used OpenGL ES 2.0 (OpenGL for Embedded Systems) (Munshi, A. 2008) and its texture hardware to achieve this.
There are also open-source projects such as GPUCV (Farrugia, J-P. 2006), MinGPU (Babenko, P. 2008) and OpenVIDIA (Fung, J. 2005). While GPUCV and MinGPU support using GPU and GPGPU functionality simultaneously, OpenVIDIA relies solely on CUDA and implements a set of useful image processing functions.
2.2 GPGPU Model
By extending the computation and programming capabilities of its GPUs, NVIDIA developed the Compute Unified Device Architecture (CUDA) (Lindholm, E. 2008), turning its GPUs into GPGPUs. GPGPUs implement the Single Instruction Multiple Thread (SIMT) architecture, in which a single instruction is issued to many threads that each operate on different data, yielding substantial performance gains. GPGPUs have been used extensively to exploit this performance in image processing applications. While some works implement image processing functions from scratch (Yang, Z. 2008), others build on existing libraries (Farrugia, J-P. 2006) (Kong, J. 2010). There is also research on particular image processing algorithms; in (Luo, Y. 2008), Luo et al. used CUDA to accelerate the Canny edge detection algorithm.
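As an illustration of the SIMT execution style, the following sketch shows a data-parallel kernel in which every thread runs the same instruction stream on a different pixel. It is written in OpenCL C (the kernel language discussed in Section 3); the kernel name and the thresholding operation are illustrative and not taken from the cited works.

/* Every work item (thread) executes the same instructions on a
 * different pixel: the essence of SIMT/data-parallel execution.
 * Here each thread thresholds one pixel of a grayscale image. */
__kernel void threshold(__global const uchar *src,
                        __global uchar *dst,
                        const uchar level)
{
    size_t i = get_global_id(0);   /* unique index of this thread */
    dst[i] = (src[i] > level) ? (uchar)255 : (uchar)0;
}

Launched with one work item per pixel, all threads execute this same code in parallel, which is the source of the performance gain described above.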
OpenCL is the standardized general-purpose computing platform (Khronos OpenCL Working Group. 2008) (Munshi, A. 2011). Unlike CUDA, OpenCL runs on a variety of processors, including embedded processors, DSPs and FPGAs (Czajkowski, T. 2012). OpenCL and CUDA differ only slightly in coding style and capability, as shown in (Karimi, K. 2010) (Fang, J. 2011) (Du, P. 2012).
3 OpenCL EMBEDDED PROFILE
3.1 OpenCL Programming Model
OpenCL is an API that enables heterogeneous parallel programming on various devices such as GPUs, DSPs, FPGAs and even CPUs. The OpenCL specification is maintained by the Khronos Group, with many vendors contributing to its development (Khronos OpenCL Working Group. 2008). The OpenCL programming model consists of one host and one or more compute devices, as shown in Figure 1.
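As a minimal host-side sketch of this model, the following C program discovers a platform and a GPU compute device and creates a context for it. It assumes a single available GPU and omits all error handling for brevity.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    /* Discover a platform and one GPU compute device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Create a context through which the host talks to the device. */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    char name[128];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Using device: %s\n", name);

    clReleaseContext(context);
    return 0;
}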
The host runs the C/C++ part of the code (OpenCL supports both C and C++) and communicates with the compute devices. The code that runs on the compute devices is called a kernel and is written in the OpenCL C language, a subset of C with extensions for parallel computation. Kernels can be compiled just-in-time, or the program can load a pre-compiled, device-specific binary and execute it on the compute units. Kernels are executed in a single program multiple data fashion over a 1D, 2D or 3D index space of work items. Work items are grouped together into work groups, and work items may be executed in a single instruction multiple data (SIMD) fashion. The processor hierarchy and memory hierarchy are shown in Figure 2.
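Continuing the host-side sketch above, the following fragment compiles a kernel just-in-time from its source string and launches it over a 1D index space of work items; the kernel, buffer size and work-group size are illustrative, and error handling is again omitted.

/* Just-in-time compilation and dispatch (continues from the
 * context and device created in the previous sketch). */
const char *src =
    "__kernel void scale(__global float *data, float factor) {\n"
    "    data[get_global_id(0)] *= factor;\n"
    "}\n";

cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);   /* JIT compile */
cl_kernel kernel = clCreateKernel(program, "scale", NULL);

size_t n = 1024;
float factor = 2.0f;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            n * sizeof(float), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernel, 1, sizeof(float), &factor);

/* 1D index space: n work items, grouped into work groups of 64. */
size_t global = n, local = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(queue);

Using clCreateProgramWithBinary instead of clCreateProgramWithSource would load a pre-compiled, device-specific binary, the second execution path mentioned above.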
A compute device can contain multiple multiprocessors, each of which contains multiple processors. In Figure 2, SP denotes a streaming processor, the building block of the compute device; SPs are grouped into multiprocessors.
Multiprocessors have their own register files and their own on-chip memory, which is called local memory in OpenCL and shared memory in CUDA. Device memory is the global memory of the compute device and resides in the device's RAM. Device memory may also include dedicated parts called texture cache and constant cache. These memory types are not mandatory, but when they exist they can be used to speed up memory accesses.
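The following OpenCL C sketch shows how these memory spaces appear in kernel code; the kernel is a hypothetical example, not part of the library described here.

/* __constant data is read-only and may be served from the constant
 * cache; __local data lives in the on-chip memory shared by one
 * work group. */
__kernel void weighted_copy(__global const float *in,
                            __global float *out,
                            __constant float *weight,  /* constant memory */
                            __local float *tile)       /* on-chip local memory */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];            /* stage data in fast local memory */
    barrier(CLK_LOCAL_MEM_FENCE);   /* synchronize the work group */

    out[gid] = tile[lid] * weight[0];
}

The size of the __local buffer is supplied by the host with clSetKernelArg, passing the byte size and a NULL argument value.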