For example, while a CPU typically has up to 16 cores, a GPU can have thousands of cores, because the GPU was developed as a parallel computing architecture, mainly for image rendering and other graphics and media computations. Another advantage of using the GPU as a parallel platform is that NVIDIA has introduced a programming language called CUDA C, an extended version of the C language, which lets developers exploit all the resources of the GPU without having to learn an entirely new programming language. The CUDA platform is responsible for scheduling the threads and avoiding any load imbalance that could lead to starvation.
The rest of the paper is organized as follows. First, Section 2 offers background information about CUDA, along with some details on the advantages of using this computing platform. Then, a description of the proposed system for solving CSPs with a GPU is presented in Section 3. Finally, concluding remarks and future work are reported in Section 4.
2 COMPUTE UNIFIED DEVICE
ARCHITECTURE (CUDA)
The GPU was designed to relieve the CPU of highly computational processes, and it has traditionally been responsible for graphics workloads such as image rendering. This is why it comes with a large number of transistors that are mostly dedicated to data processing, in contrast to the CPU, which devotes more of its transistors to flow control and memory caching. Moreover, GPU memory has a higher bandwidth than CPU memory, which speeds up arithmetic operations on the data. The reason is that the main task of the GPU is to execute the same operation on a large amount of data, which is exactly what image rendering requires, whereas CPUs are general-purpose architectures designed to execute all types of functions. In image rendering, and in other graphics processing, the main idea is to divide the problem data into smaller blocks, where each block performs the same work on its own sub-problem. Thus, any problem that repeatedly applies the same arithmetic operation to a large amount of data can be partitioned into smaller pieces, and each piece can be tackled independently of the other partitions of the problem, as long as the partitioned data elements are independent of each other.
In 2006, NVIDIA announced a new parallel computing architecture called the Compute Unified Device Architecture, or CUDA (Cook, 2012). It exploits the parallelism of the GPU hardware to solve highly computational problems outside the field of graphics, and for this kind of parallel work the CUDA architecture performs far more efficiently than a CPU. NVIDIA has also developed a software programming environment that allows developers to implement their applications in CUDA C, an extension of the C language. One of the obstacles of multicore processors is that developers must write applications that not only use the multicore technology in the CPU, but also push the parallel computation far enough to reach the top speedup the multicore processor can achieve; this requires a lot of programming experience and a deep understanding of the CPU architecture. CUDA tackles these difficulties by introducing a new programming model whose instruction set is based on the C programming language. Developers who are not familiar with the CUDA platform can still program effectively in CUDA C and obtain very high leverage of the GPU's parallel computing ability. A developer needs only to write the kernel that will run on the GPU, and the CUDA platform will execute this kernel on all of the elements of the data. CUDA C also allows users to write heterogeneous code that runs on the CPU and the GPU in the same file: the part that runs on the CPU is called the host code, while the part that runs on the GPU is called the device code. CUDA C adds a few new constructs to the C programming language. For example, threadIdx is the built-in variable for accessing the current thread's index; it is used to direct each thread to its own portion of the work.
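Echoing the block partitioning described earlier, the sketch below shows a minimal heterogeneous CUDA C file; the kernel name scale, the array size n, and the block size of 256 are assumptions for illustration, not code from this paper.

    // Device code: every thread applies the same operation to the one
    // element selected by its block and thread indices.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n)                     // the last block may run past the data
            data[i] *= factor;
    }

    // Host code: allocate device memory, then launch 256 threads per block,
    // with enough blocks to cover all n elements.
    int main(void) {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
        cudaDeviceSynchronize();       // wait for all GPU threads to finish
        cudaFree(d_data);
        return 0;
    }

Here blockIdx and blockDim, like threadIdx, are built-in variables, so each thread can compute which element of the partitioned data it owns.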
Figure 1, for example, shows a sample code that adds two vectors and stores the result in a third one. The developer only needs to write two functions. The first is executed on the CPU and is responsible for copying the data from CPU memory to GPU memory, triggering the kernel, synchronizing, and then copying the result back from device memory to CPU memory; this code is written in the main function and is executed only once. The second is the kernel, whose code runs once for each of the threads specified by the host code in the main function when the kernel is triggered. In this example, the number of threads is equal to the number of array elements. Each thread accesses a different element of array A, adds it to the corresponding element of array B, and stores the result in the corresponding element of array C. After the kernel function completes, the main function on the host directs the CPU to copy the elements of array C back from device memory to CPU memory, and the program terminates.
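For concreteness, a minimal sketch of such a program is given below; the array length N, the kernel name add, and the d_-prefixed device pointers are illustrative assumptions and need not match the listing in Figure 1.

    #include <stdio.h>
    #define N 256   // assumed array length; one thread per element

    // Device code: each thread adds one pair of elements.
    __global__ void add(const int *a, const int *b, int *c) {
        int i = threadIdx.x;           // this thread's element index
        c[i] = a[i] + b[i];
    }

    int main(void) {
        int a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        int *d_a, *d_b, *d_c;          // device (GPU) copies of the arrays
        cudaMalloc((void **)&d_a, N * sizeof(int));
        cudaMalloc((void **)&d_b, N * sizeof(int));
        cudaMalloc((void **)&d_c, N * sizeof(int));

        // Copy the inputs from CPU memory to GPU memory.
        cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        add<<<1, N>>>(d_a, d_b, d_c);  // trigger one thread per element
        cudaDeviceSynchronize();       // synchronize with the device

        // Copy the result back from device memory to CPU memory.
        cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
        printf("c[1] = %d\n", c[1]);   // expect 3

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

The cudaMemcpy calls and the cudaDeviceSynchronize barrier correspond to the copy and synchronization steps described above.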