multiple cores that work together to handle all the data provided by the application. Using the GPU to process non-graphical workloads is called GPGPU (general-purpose computing on graphics processing units); it performs highly complex mathematical calculations in parallel, reducing both power consumption and execution time (David, Sidd, Jose, 2006; Shuai, Michael, Jiayuan, David, Jeremy W, Kevin, 2008).
In this paper, we consider the physical problem of a vehicle of mass m moving at speed V. This speed does not remain constant over time because of fluid friction. We apply Newton's second law of dynamics to the vehicle to calculate the distance travelled while its speed decreases from 30 m/s to 15 m/s, using the trapezoidal method; a sketch of this setup is given at the end of this section. We implement this method on both the CPU and the GPU using CUDA C/C++. The objective of this study is to examine the implementation of this method and to show the efficiency of using GPUs for parallel computing. This implementation may be helpful for controlling the speed of vehicles with precision. The remainder of this paper is organized as follows: in section 2, we present the architecture of the CUDA program; in section 3, we define the numerical integration method used to solve the problem; in section 4, we describe the hardware used, report the results, and discuss this implementation.
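To make the problem concrete, the following is a minimal sketch of the governing equations, assuming for illustration a quadratic drag force F = kV^2 (the friction model actually used, and the values of m and k, are defined later in the paper and may differ). Newton's second law and the chain rule give

\[
m\frac{dV}{dt} = -kV^{2}, \qquad \frac{dV}{dt} = V\frac{dV}{dx}
\;\Rightarrow\; dx = -\frac{m}{k}\,\frac{dV}{V},
\]

so the distance travelled while the speed falls from \(V_1 = 30\) m/s to \(V_2 = 15\) m/s is

\[
x = \frac{m}{k}\int_{V_2}^{V_1}\frac{dV}{V},
\]

which the trapezoidal rule approximates on \(n\) subintervals of width \(h = (V_1 - V_2)/n\) as

\[
\int_{V_2}^{V_1} f(V)\,dV \approx \frac{h}{2}\Bigl[f(V_2) + 2\sum_{i=1}^{n-1} f(V_2 + ih) + f(V_1)\Bigr],
\qquad f(V) = \frac{1}{V}.
\]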
2 THE CUDA PROGRAM ARCHITECTURE
The CUDA environment is a parallel computing platform and programming model invented by NVIDIA (NVIDIA, 2008). It makes it possible to significantly increase computing performance by exploiting the power of the graphics processing unit (GPU). CUDA C/C++ is an extension of the C/C++ programming languages that is well suited to expressing parallel algorithms. The main idea of CUDA is to have thousands of threads running in parallel to increase computational performance; typically, the higher the number of threads, the better the performance. All threads execute the same code, called a kernel, and each thread is identified by a unique thread ID. The threads thus execute the same instructions on different data (CUDA-Wikipedia).
A CUDA program consists of phases executed on the host (CPU) and phases executed on the GPU device. Phases with little or no data parallelism are executed in the host code, while phases with strong data parallelism are executed on the device (GPU). A CUDA C/C++ program therefore contains both host code and device code: the host code is plain code compiled by an ordinary C or C++ compiler, while the device code is written using CUDA-specific constructs for parallel tasks, called kernels. Kernels can also be executed on the CPU if no GPU device is available; this functionality is provided by a CUDA SDK function.
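As a minimal sketch of this host/device split (not the paper's actual program; the kernel name scale, the array size, and the launch configuration below are illustrative assumptions), a typical CUDA C program looks like this:

#include <cuda_runtime.h>
#include <stdio.h>

/* Device code: the kernel, compiled for the GPU. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)                                       /* guard against extra threads */
        data[i] *= factor;
}

/* Host code: compiled as ordinary C/C++. */
int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);                  /* allocate on the device */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  /* launch the kernel */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);                /* expected: 2.0 */

    cudaFree(d_data);
    free(h_data);
    return 0;
}

The host drives the computation (allocation, transfers, launch), while the kernel expresses only the data-parallel work performed by each thread.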
The CUDA architecture consists of three main components, which allow programmers to use the full computing capabilities of the graphics card more efficiently. It divides the GPU device into grids, blocks and threads in a hierarchical structure, as shown in Figure 1. Since a block contains several threads, a grid contains several blocks, and a single GPU can execute several grids, the parallelism resulting from this hierarchical architecture is very significant (Manish, 2012; Jayshree, Jitendra, Madhura, Amit, January 2012).
Figure 1: The architecture of the CUDA program and its memories.
A grid is composed of several blocks, and each block contains several threads. These threads are not synchronized with one another across blocks and cannot be shared between GPUs, as multiple grids are used for maximum performance. Each call to CUDA from the CPU is made through a single grid. Each block is a logical unit containing multiple threads and some shared memory. Every block in a grid runs the same program, and to identify the current block, the built-in variable "blockIdx" is used.
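As a brief illustration of this identification (a sketch under assumed sizes, not code from the paper; the kernel name whoami and the 4 x 64 configuration are hypothetical), a kernel combines blockIdx, blockDim and threadIdx to compute a unique global index, and the host describes the grid/block hierarchy with dim3:

#include <cuda_runtime.h>
#include <stdio.h>

/* Each thread writes its own global index: blockIdx identifies the
   block within the grid, threadIdx the thread within its block. */
__global__ void whoami(int *out)
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    out[global_id] = global_id;
}

int main(void)
{
    const int nBlocks = 4, nThreads = 64, n = nBlocks * nThreads;
    int *d_out, h_out[nBlocks * nThreads];
    cudaMalloc((void **)&d_out, n * sizeof(int));

    dim3 grid(nBlocks), block(nThreads);   /* the grid/block hierarchy */
    whoami<<<grid, block>>>(d_out);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 130 wrote %d\n", h_out[130]);   /* prints 130 */
    cudaFree(d_out);
    return 0;
}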