number n of putative subregions per read is around n = 180, requiring an exact matching substring of length 12. Thus m ∗ n = 72 billion alignments have to be computed. To deal with this large number, one needs to employ the available hardware as efficiently as possible.
For NGS data, parallelizing the computation of a single alignment is not useful because the computational cost of a single alignment is low. Therefore, we use a single thread to compute one alignment and focus on computing as many alignments in parallel as possible. This raises the problem that every hardware platform has a different capacity for parallel computation. On a CPU, only a small number of threads can be used. In contrast, a GPU only works efficiently when running a high number of threads in parallel. This means that a minimum number of alignment computations is required to fully utilize a GPU. However, this minimal number varies between GPUs. To ensure that our implementation fully loads different GPUs, independent of their architecture, we calculate the batch size b:
b = N_mp ∗ t_max (1)

where N_mp is the number of multiprocessors and t_max is the maximum number of threads per multiprocessor.
We furthermore require that

b ∗ M_align ≤ M_avail (2)

where M_align is the memory required per alignment and M_avail is the available global memory on the graphics card (Nvidia, 2009). If b ∗ M_align exceeds M_avail, b is reduced accordingly. For modern GPUs, b ranges from 50,000 to 100,000 alignment computations. These high numbers mean that the memory available for computing a single alignment is small.
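In practice, Equations (1) and (2) map directly onto the CUDA runtime API. The following is a minimal sketch of this batch-size calculation; the function name and the per-alignment memory footprint passed in are illustrative assumptions, not part of the library's interface:

#include <cuda_runtime.h>

// Sketch: batch size b from Equation (1), capped by the global-memory
// constraint of Equation (2). memPerAlignment (M_align) is an assumed
// parameter; M_avail is taken as the currently free global memory.
size_t computeBatchSize(size_t memPerAlignment) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // query GPU 0

    // Equation (1): b = N_mp * t_max
    size_t b = (size_t)prop.multiProcessorCount
             * (size_t)prop.maxThreadsPerMultiProcessor;

    // Equation (2): require b * M_align <= M_avail
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    if (b * memPerAlignment > freeMem)
        b = freeMem / memPerAlignment;              // reduce b accordingly
    return b;
}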
Thus, we implemented an SW alignment that is less memory demanding. Since most NGS applications either require a reference genome or parts thereof (e.g., RNA-Seq) or sequence genomes that are closely related to the reference, we can safely assume that the number of insertions and deletions between a read and the reference region is small. Therefore, we implemented a banded alignment algorithm (Gusfield, 1997). Here, only a small corridor surrounding the diagonal from the upper left to the lower right of the matrix H is computed. Thus, the complexity is reduced to O(|R| ∗ c), where c is the corridor width and c << |G|. The memory reduction allows the computation of a high number of alignments such that the GPU is fully loaded. However, due to its size, even the reduced matrix H has to be stored in global memory.
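The banded recurrence itself is simple. The following CPU-side sketch fills the corridor of H in band-relative coordinates (cell (i, j) of the full matrix is stored at H[i][j - i + c/2]); the scoring values are illustrative assumptions, not the library's actual parameters:

#include <algorithm>
#include <string>
#include <vector>

// Sketch of banded Smith-Waterman: only a corridor of width c around
// the main diagonal of H is filled, giving O(|R| * c) time and space.
int bandedSW(const std::string& read, const std::string& ref, int c) {
    const int match = 2, mismatch = -1, gap = -2;   // assumed scores
    const int r = (int)read.size();
    // band-relative storage: H[i][k] is cell (i, j) with j = i + k - c/2
    std::vector<std::vector<int>> H(r + 1, std::vector<int>(c + 2, 0));
    int best = 0;
    for (int i = 1; i <= r; ++i) {
        for (int k = 1; k <= c; ++k) {
            int j = i + k - c / 2;                  // column in the full matrix
            if (j < 1 || j > (int)ref.size()) continue;
            int diag = H[i - 1][k] +
                       (read[i - 1] == ref[j - 1] ? match : mismatch);
            int up   = H[i - 1][k + 1] + gap;       // gap in the read
            int left = H[i][k - 1] + gap;           // gap in the reference
            H[i][k] = std::max({0, diag, up, left});
            best = std::max(best, H[i][k]);
        }
    }
    return best;
}

Keeping the whole banded matrix, rather than just the last row, is what allows a subsequent backtracking step to recover the alignment itself.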
To further reduce the computing time, we observe that in NGS mapping only the alignments between a read and the subregions of the genome that show the highest alignment scores are of interest. To identify the best matching subregion, the optimal alignment score is sufficient. Thus, the matrix H can be reduced to a vector of length c, and the space complexity is reduced to O(c) (Gusfield, 1997). Note that the memory usage of the score computation is independent of the read length. However, the computational complexity remains O(|R| ∗ c), and the corridor width c will indirectly depend on the read length, since it determines the number of consecutive insertions and deletions. The reduced memory requirements allow us to
store H in fast on-chip (shared) memory. In addition,
all remaining variables, like the scoring function, are
also stored in fast memory (private, constant and tex-
ture). Only read and reference sequences are stored in
global memory. Therefore global memory access is
minimized. This works for all corridor sizes smaller
than 100 on current GPUs. However, since indels are typically short, this restriction of the corridor size does not limit practical applications.
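A sketch of such a score-only GPU kernel is given below, assuming one thread per alignment, fixed-length reads and reference regions packed contiguously in global memory, and two band rows of 16-bit scores per thread in shared memory; all names and scoring values are illustrative assumptions, not the library's actual interface:

#define CORRIDOR 33   // assumed corridor width c (c < 100 on current GPUs)
#define THREADS  64   // threads per block; one alignment per thread

__global__ void swScoreKernel(const char* reads, const char* refs,
                              int readLen, int refLen, int nAlign,
                              int* scores) {
    // two band rows per thread, kept in fast on-chip shared memory
    __shared__ short band[2][THREADS][CORRIDOR + 2];
    const int tid  = threadIdx.x;
    const int pair = blockIdx.x * blockDim.x + tid;   // alignment index
    if (pair >= nAlign) return;
    const char* read = reads + (size_t)pair * readLen;
    const char* ref  = refs  + (size_t)pair * refLen;

    for (int k = 0; k < CORRIDOR + 2; ++k)            // SW init: all zero
        band[0][tid][k] = band[1][tid][k] = 0;

    short best = 0;
    for (int i = 1; i <= readLen; ++i) {
        short* prev = band[(i - 1) & 1][tid];         // band row i-1
        short* cur  = band[i & 1][tid];               // band row i
        cur[0] = 0;                                   // left band border
        for (int k = 1; k <= CORRIDOR; ++k) {
            int j = i + k - CORRIDOR / 2;             // column in full H
            if (j < 1 || j > refLen) { cur[k] = 0; continue; }
            short diag = prev[k] + (read[i-1] == ref[j-1] ? 2 : -1);
            short up   = prev[k + 1] - 2;             // assumed gap penalty
            short left = cur[k - 1] - 2;
            short h = diag > 0 ? diag : 0;            // local alignment floor
            if (up   > h) h = up;
            if (left > h) h = left;
            cur[k] = h;
            if (h > best) best = h;
        }
    }
    scores[pair] = best;                              // optimal score only
}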
Although GPUs are powerful co-processing units,
they are not available on every computer. In such
cases a fast computation of alignments on CPUs is de-
sirable. Apart from multi-threaded programming, the most common mechanism for parallelization on the CPU is SSE. SSE allows us to speed up score calculations by processing eight alignments in parallel. However, since SSE only supports a limited set of instructions, the observed speedup will be lower than eight.
Another major issue is that the backtracking step con-
tains several control flow statements. This makes it
difficult to process more than one alignment concur-
rently. Therefore, the backtracking is done without
SSE.
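For the score calculation, a single cell update can be carried across eight alignments at once by letting each 16-bit lane of an SSE register belong to a different alignment. The following is a sketch of such an update under assumed scoring values; it is not the library's actual code:

#include <emmintrin.h>   // SSE2 intrinsics

// One SW cell update for eight independent alignments: each 16-bit
// lane holds the same cell of a different alignment. subst carries the
// per-lane match/mismatch score; the gap penalty is an assumption.
static inline __m128i swCell8(__m128i diag, __m128i up, __m128i left,
                              __m128i subst) {
    const __m128i gap = _mm_set1_epi16(-2);
    __m128i h = _mm_adds_epi16(diag, subst);           // H(i-1,j-1) + s(a,b)
    h = _mm_max_epi16(h, _mm_adds_epi16(up,   gap));   // gap in the read
    h = _mm_max_epi16(h, _mm_adds_epi16(left, gap));   // gap in the reference
    return _mm_max_epi16(h, _mm_setzero_si128());      // local alignment floor
}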
OpenCL offers another possibility to increase per-
formance on the CPU. As mentioned, OpenCL code
is independent of the hardware. However, to achieve
maximal performance it is necessary to optimize the
SW algorithm for the different platforms (CPU and
GPU). The most important difference between both
platforms is that on the GPU it is important to use
coalesced memory access patterns (Nvidia, 2009),
whereas on the CPU this hinders optimal caching and
therefore degrades performance. The second major
difference is that on the CPU the OpenCL compiler
implicitly uses SSE instructions to increase perfor-
mance. This is only possible when using specific
(vector) data types, which cannot be used efficiently
on a GPU.
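The difference in preferred access patterns can be made concrete through the layout of the sequence data. A hypothetical sketch:

// Hypothetical layout helpers: on the GPU, interleaving the sequences
// lets neighboring threads read neighboring addresses (coalesced);
// on the CPU, keeping each alignment's data contiguous favors caching.
inline char gpuRead(const char* seqs, int nAlign, int t, int k) {
    return seqs[k * nAlign + t];   // coalesced across threads
}
inline char cpuRead(const char* seqs, int len, int t, int k) {
    return seqs[t * len + k];      // contiguous per alignment
}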
In order to allow NGS analysis pipeline program-
mers to employ these technologies, we created a soft-
ware library that consists of three different implementations: (1) a CPU version including SSE instructions, (2) a CUDA-based version for Nvidia graphic