The rest of this paper is organized as follows. Section 2 provides background on CUDA programming and the simulated model. Section 3 covers the optimization methods, while Section 4 presents and discusses the achieved results and Section 5 concludes the work.
2 BACKGROUND
2.1 CUDA Environment
The CUDA environment is a parallel programming model oriented towards highly parallel computing that makes use of a large number of computational cores. With respect to general-purpose CPUs, many more transistors are devoted to data processing rather than to flow control or data caching; memory latency is instead hidden by compute-intensive calculations. The GPU is divided into a set of Streaming Multiprocessors (SMs), in each of which hundreds of threads reside concurrently. A large number of CUDA threads can be launched in parallel by means of kernels, i.e., C functions that are executed in parallel by all the launched CUDA threads (CUDA, 2011).
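As a minimal illustration of the kernel abstraction (a generic sketch, not code from the simulator; names and sizes are arbitrary), each thread below processes one array element, identified by a global index derived from the block and thread identifiers:

#include <cuda_runtime.h>

// Minimal illustrative kernel: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Launch enough 256-thread blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}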
2.2 Coarse Grain Simulator
BRAHMS (Biomembrane Reduced ApproacH Molecular Simulator) is a CG lipid bilayer simulator that we optimized and accelerated for the CUDA environment. It implements a method for CG modeling of lipid bilayers of biomembranes and contains several original features, such as specific representations for water and charges (Orsi, 2010). It simulates the displacement in time of the particles of a system, calculating some of its macroscopic properties such as temperature, lateral pressure, and potential and kinetic energy (Orsi, 2010). It employs Molecular Dynamics, and in particular the Newton Leapfrog Equations, to advance the positions and velocities of the beads, which in CG models represent clusters of atoms, one time step forward (Rapaport, 2004).
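In its standard form (Rapaport, 2004), the leapfrog scheme advances velocities and positions in interleaved steps; for a bead i of mass m_i subject to force f_i:

% Leapfrog integration step (standard form, cf. Rapaport, 2004)
\[
  \mathbf{v}_i\!\left(t + \tfrac{\Delta t}{2}\right)
    = \mathbf{v}_i\!\left(t - \tfrac{\Delta t}{2}\right)
    + \frac{\mathbf{f}_i(t)}{m_i}\,\Delta t,
  \qquad
  \mathbf{r}_i(t + \Delta t)
    = \mathbf{r}_i(t)
    + \mathbf{v}_i\!\left(t + \tfrac{\Delta t}{2}\right)\Delta t .
\]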
3 GPU OPTIMIZATION OF COARSE GRAIN MODELS
Code profiling of the CG simulator has been performed to identify the most computationally demanding parts of the code, which turned out to be: (i) non-bonded forces computation; (ii) integration timestep; (iii) neighbor structure generation, accounting for about 95.2%, 2% and 1% of the total execution time of the sequential simulator, respectively.
According to the identified bottlenecks, we have implemented three kernels to be executed on the device (GPU), one for each of the most onerous parts of the code. Concerning data structures, a comprehensive one is required that contains, for each bead, its position, velocity, force, torque, orientation, angular momentum in the body-fixed frame, rotation matrix, bead type, mechanical type, lipid identifier and lipid unit identifier.
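A possible layout for such a structure is sketched below; the structure-of-arrays organization and the field names are our assumptions (chosen to favor coalesced accesses), not the simulator's actual declarations.

#include <vector_types.h>

// Hypothetical structure-of-arrays layout for the per-bead data
// (illustrative field names, not BRAHMS' actual declarations).
struct BeadData {
    float4 *pos;        // position
    float4 *vel;        // linear velocity
    float4 *force;      // accumulated force
    float4 *torque;     // accumulated torque
    float4 *orient;     // orientation (e.g., a quaternion)
    float4 *angMom;     // angular momentum in the body-fixed frame
    float  *rotMat;     // per-bead 3x3 rotation matrix (9 floats each)
    int    *beadType;   // bead type
    int    *mechType;   // mechanical type
    int    *lipidId;    // lipid identifier
    int    *lipidUnit;  // lipid unit identifier
};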
The integration algorithm exploits the Leapfrog equations (Rapaport, 2004). This algorithm is implemented by the integration kernel on the GPU, so as to avoid additional data transfers between host and device.
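A simplified sketch of such an integration kernel is shown below; it covers only the translational degrees of freedom, and the array layout, the names and the use of an inverse-mass array are our assumptions.

// Illustrative leapfrog integration kernel (translational part only;
// rotational update omitted). Data layout and names are assumptions.
__global__ void integrate(float4 *pos, float4 *vel, const float4 *force,
                          const float *invMass, float dt, int nBeads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nBeads) return;

    float4 v = vel[i];
    float4 f = force[i];
    float  w = invMass[i];

    // v(t + dt/2) = v(t - dt/2) + f(t)/m * dt
    v.x += f.x * w * dt;
    v.y += f.y * w * dt;
    v.z += f.z * w * dt;

    // r(t + dt) = r(t) + v(t + dt/2) * dt
    float4 r = pos[i];
    r.x += v.x * dt;
    r.y += v.y * dt;
    r.z += v.z * dt;

    vel[i] = v;
    pos[i] = r;
}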
For the calculation of the non-bonded forces, a neighbor structure is needed to avoid considering a contribution for each pair of beads, which would lead to a quadratic time complexity. Therefore, the system is divided into cells of beads, where each cell has a cubic shape and an edge equal to the maximum cut-off distance among the cut-offs established for the different bead types. Only bead pairs whose distance is below the cut-off are considered neighbors. Hence, the search for neighbors is restricted to bead pairs belonging to the same cell or to the (at most) 26 adjacent cells.
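As an illustration of this cell decomposition, the cell that contains a bead can be computed as sketched below; the box and grid parameters are our assumptions.

// Illustrative computation of the cell that contains a bead, assuming an
// orthogonal box of size boxSize partitioned into nCells.x * nCells.y *
// nCells.z cubic cells whose edge is at least the largest cut-off distance.
__device__ int cellIndex(float4 r, float3 boxSize, int3 nCells)
{
    int cx = (int)(r.x / boxSize.x * nCells.x);
    int cy = (int)(r.y / boxSize.y * nCells.y);
    int cz = (int)(r.z / boxSize.z * nCells.z);

    // Clamp beads lying exactly on the upper box boundary.
    cx = min(cx, nCells.x - 1);
    cy = min(cy, nCells.y - 1);
    cz = min(cz, nCells.z - 1);

    return (cz * nCells.y + cy) * nCells.x + cx;
}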
The cell, neighbor and interaction type structures of the CPU version cannot be efficiently mapped onto a GPU. Therefore we have created new structures for storing the cells, the neighbor beads and the corresponding interaction types, which provide coalesced accesses for the CUDA threads: when different threads need some particular piece of information, they find it at the same relative address, calculated as base address + absolute thread identifier.
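The sketch below illustrates this kind of layout for the neighbor structure; it is an assumption in the spirit of (Anderson, 2008), not the simulator's actual structure.

// Illustrative coalesced neighbor-list layout: neighbor j of bead i is stored
// at j * nBeads + i, so consecutive threads read consecutive addresses.
__global__ void countPairs(const int *neighList, const int *neighCount,
                           int *pairSum, int nBeads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nBeads) return;

    int acc = 0;
    for (int j = 0; j < neighCount[i]; ++j) {
        int nb = neighList[j * nBeads + i];  // coalesced across the warp
        acc += nb;                           // placeholder for the real pair work
    }
    pairSum[i] = acc;
}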
Implementing and updating the neighbor and interaction type structures within the GPU is needed to obtain maximum performance from the GPU by avoiding data transfer overheads between CPU and GPU. To this purpose we have implemented a kernel called cudaneigh, which uses an algorithm similar to the one proposed in (Anderson, 2008) for AL models, where the texture type used was one-dimensional. Since the texture cache is optimized for 2D spatial locality (CUDA, 2011), we have used two-dimensional textures in our code, also in the other two kernels, and we employ them whenever the accesses to device memory are not coalesced.
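A sketch of this use of 2D textures is given below, based on the legacy texture reference API of the CUDA toolkits of that period; the texture name, the bound data and the access pattern are our assumptions.

#include <cuda_runtime.h>

// Illustrative 2D texture fetch for non-coalesced reads (legacy texture
// reference API); names and the gathered data are assumptions.
texture<float4, 2, cudaReadModeElementType> dataTex;

__global__ void gather(float4 *out, const int *col, const int *row, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Scattered reads are served through the texture cache, which is
    // optimized for 2D spatial locality.
    out[i] = tex2D(dataTex, (float)col[i], (float)row[i]);
}

// Host side: bind a pitched 2D device allocation to the texture reference.
void bindData(float4 *d_data, size_t pitch, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaBindTexture2D(0, dataTex, d_data, desc, width, height, pitch);
}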
The non-bonded forces computation is implemented by the forces kernel, which exploits the optimized access to the neighbor and interaction type structures. Compared to what has been done for AL models in (Anderson, 2008), we use an additional interaction type structure with the same size and organization