Then the agents are updated with their target. For an
agent placed at the bottom of the environment, the
initial target is the top row. In the CUDA program
each agent is handled by a single thread, so the total
number of threads launched equals the total number
of agents present in the environment. First, it must be
ensured that enough threads can be launched to cover
all the agents in the environment; the threads are then
arranged into grids and blocks. In each simulation
step a kernel function is launched which carries out
the agent movement.
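As an illustration, a minimal launch-configuration sketch is given below; the kernel and buffer names (stepKernel, dEnv, dAgentPos) and the block size are placeholders rather than the paper's code, and only show how one thread per agent can be guaranteed.

__global__ void stepKernel(int *env, int *agentPos, int numAgents)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= numAgents)   // surplus threads in the last block do nothing
        return;
    // ... one simulation step (forward check or LEM) for agent 'id' ...
}

// host side, once per simulation step: derive the grid size from the agent count
int threadsPerBlock = 256;
int blocks = (numAgents + threadsPerBlock - 1) / threadsPerBlock;  // round up
stepKernel<<<blocks, threadsPerBlock>>>(dEnv, dAgentPos, numAgents);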
Inside the kernel function, before carrying out
the LEM algorithm, an agent checks whether it is
placed in the target column. If so, it verifies whether
the immediate cell in the forward direction is empty.
If the cell is empty, no further calculation is
performed and the agent moves forward; otherwise,
a cell is selected using LEM. The pseudo code is
shown below:
if (target_col == present_col && forward_cell == 0)
    move one cell forward
else
    calculate LEM
No movement is observed when all neighbouring
cells are occupied.
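A minimal device-side sketch of this check is given below; the environment array env (with 0 marking an empty cell), the row-major indexing and the assumed forward direction are illustrative choices, not the paper's code.

__device__ bool forwardCellFree(const int *env, int width,
                                int row, int col, int target_col)
{
    if (col != target_col)
        return false;                   // not in the target column: fall through to LEM
    int fwd = (row + 1) * width + col;  // cell immediately ahead (direction assumed here)
    return env[fwd] == 0;               // empty: move forward, otherwise calculate LEM
}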
4.1.1 Challenges in Implementation
The primary motivation for implementing the model
on the GPU was to speed up the simulation while
emulating a situation in which the pedestrians are
capable of decision making. The decisions of each
agent are considered to be independent of the others.
The biggest challenge in implementing the algorithm
was to keep the total number of agents the same,
without any loss. As mentioned, each agent is
launched as a single thread and all of them execute
in parallel. A situation could therefore arise in which
two agents try to access the same environment
position at the same time and an agent could be lost.
To avoid this kind of scenario, the atomic features of
CUDA are used.
With atomic operations, when one thread performs
an operation on a particular memory location,
residing either in global or shared memory, it cannot
be interfered with by the operations of other threads.
An atomic function performs a read-modify-write
operation on a 32- or 64-bit word residing in global
or shared memory. Inside the kernel function, the
agent first finds the neighbouring empty cells,
calculates the distance of each empty cell from the
target and arranges them in ascending order. A
random number is then generated which decides the
cell to be selected. Once the cell is chosen, the
movement of the agent is carried out by calling the
atomicExch() function. In this way the number of
agents is kept intact in every simulation step and no
agent (memory) gets overwritten during the
operation.
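A hedged sketch of this movement step is given below, assuming the environment grid stores 0 for an empty cell and 1 for an occupied one; apart from the call to atomicExch(), the names and layout are illustrative.

__device__ bool claimCell(int *env, int currentIdx, int chosenIdx)
{
    // atomicExch() writes 1 and returns the previous value of the word; only
    // the first thread that swaps a 0 for a 1 wins the chosen cell, so two
    // agents can never be merged into one position.
    if (atomicExch(&env[chosenIdx], 1) == 0) {
        env[currentIdx] = 0;   // release the cell the agent is leaving
        return true;           // movement carried out
    }
    return false;              // cell already taken this step; the agent stays put
}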
4.1.2 Generation of Random Numbers
Inside the kernel function, after the agent calculates
the distances of the available neighbouring empty
cells to the target and arranges them in ascending
order, a random number is generated. Random
numbers are drawn from a normal distribution, with
a separate random number obtained for each agent.
This is done using the cuRAND library, which comes
with the CUDA SDK. Before the start of the
simulation, a setup kernel is launched once for the
total number of pedestrians present in the
environment. This setup kernel calls the cuRAND
application programming interface (API)
curand_init() and is essentially responsible for
generating a seed that is later fed to the main random
number generator. The random number from the
normal distribution is generated using the
curand_normal() API, which returns a random
number with a mean of 0 and a standard deviation of
1; the result is multiplied by 3 to obtain the desired
standard deviation.
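The sketch below illustrates this per-agent use of cuRAND; the state array, the seed value and the kernel name are assumptions, while curand_init() and curand_normal() are the API calls described in the text.

#include <curand_kernel.h>

__global__ void setupRng(curandState *states, unsigned long long seed, int numAgents)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < numAgents)
        curand_init(seed, id, 0, &states[id]);   // one generator state per pedestrian
}

__device__ float drawSample(curandState *states, int id)
{
    // curand_normal() returns a sample with mean 0 and standard deviation 1;
    // multiplying by 3 gives the standard deviation used in the model.
    return 3.0f * curand_normal(&states[id]);
}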
5 SPEEDUP
Simulation of the LEM for a large number of agents
is computationally very intensive. The primary
motivation for using the GPU to implement the LEM
algorithm is to accelerate the simulation process.
The simulation time is measured for both the CPU
and GPU implementations; in the case of the GPU,
only the simulation is timed and not the memory
transactions. As mentioned, the GPU used is the
NVIDIA® GEFORCE™ 560ti and the CPU used is
the Intel Xeon E3-1280 (server grade).
For the GPU, the time for the simulation process is
measured using CUDA event (cudaEvent) functions.
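As a sketch, this GPU timing with CUDA events takes the following form; the kernel launch is a placeholder for the simulation loop.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// simulation kernel launches for all steps go here
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);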
The computation times for both the CPU and GPU
are given in Figure 4a and the speedup for the
simulation is provided in Figure 4b. In Figure 4a the
time is measured in seconds along the Y-axis, with
the number of agents along the X-axis. This speedup
is measured using two groups of agents only (bi-