accesses in mutual exclusion. However, this approach is of no use when porting BFS to CUDA: a lock and a queue may work well for tens of threads, but they will never deliver acceptable performance with thousands of threads. Like other CUDA-based BFS implementations, we replace our queues with a “frontier”, or “status”, array, i.e., an array of n elements, one for every vertex v ∈ V. The value of each element indicates whether v will be part of the frontier set at the next iteration. Henceforth, all our queues are implemented through status arrays and several mechanisms built on top of them. At every iteration, one thread is assigned to each node; the thread processes the vertex according to its value in the status array, and finally updates the status array entries of all of the vertex’s children.
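A minimal sketch of one such iteration, assuming a CSR adjacency representation, may look as follows (kernel and variable names are ours, for illustration only):

    // One BFS iteration over the status array: one thread per vertex.
    // row_offsets/col_indices encode the graph in CSR form (assumed layout).
    __global__ void bfsStep(const int *row_offsets, const int *col_indices,
                            const int *status, int *next_status,
                            int n, int *changed) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || status[v] == 0)        // skip vertices outside the frontier
            return;
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            next_status[col_indices[e]] = 1; // mark each child for the next level
            *changed = 1;                    // benign race: all writers store 1
        }
    }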
As observed in previous works (Su et al., 2017), this strategy causes high levels of thread divergence, which slows down the entire process. In fact, even when only a few vertices belong to the frontier set, the status array must be fully scanned at every iteration. Moreover, the varying number of children each node may have can force threads to behave differently, i.e., to diverge, since some threads process thousands of children while others process none. To reduce divergence by concurrently processing nodes with similar out-degrees, we follow Luo et al. and Liu et al. (Luo et al., 2010; Liu and Howie Huang, 2015). Thus, we divide the frontier set into three distinct queues. Nodes are inserted into the small queue if they have fewer than 32 children, into the large queue if they have more than 256 children, and into the medium queue in all other cases. Subsequently, each queue is processed by a number of threads befitting the amount of work to be performed. Nodes belonging to the small queue are handled by a single thread each, whereas a warp of threads is assigned to each node in the medium queue, and a block of threads to each vertex of the large one. To implement this strategy, three different kernels are launched, each optimized to handle its own queue, as sketched below. This approach results in approximately the same amount of work for each thread within each kernel, and assigns resources proportionally to each node’s processing requirements.
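A sketch of the binning step under these assumptions follows (queue layout and all identifiers are illustrative, not the paper’s actual code; the degree thresholds are those given above):

    // Classify each frontier vertex by out-degree; queue tails grow via atomicAdd.
    __global__ void binFrontier(const int *row_offsets, const int *status, int n,
                                int *small_q, int *small_cnt,
                                int *medium_q, int *medium_cnt,
                                int *large_q, int *large_cnt) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || status[v] == 0)
            return;
        int degree = row_offsets[v + 1] - row_offsets[v];
        if (degree < 32)
            small_q[atomicAdd(small_cnt, 1)] = v;    // handled by a single thread
        else if (degree > 256)
            large_q[atomicAdd(large_cnt, 1)] = v;    // handled by a block
        else
            medium_q[atomicAdd(medium_cnt, 1)] = v;  // handled by a warp
    }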
3.3 The Searching Strategy on CUDA
Once labels have been computed, they can be used to answer reachability queries. Following the Enterprise BFS (Liu and Howie Huang, 2015), we design our search algorithm to maximize the number of queries, i.e., concurrent graph explorations, that can be managed in parallel. Unfortunately, the Enterprise BFS algorithm parallelizes breadth-first visits belonging to the same traversal, whereas our query search strategy must parallelize in-depth queries belonging to distinct traversal procedures. In general, this implies that a significant number of threads visit a large number of nodes at the same time. As a consequence, during each algorithmic step, many visited nodes are logically associated with distinct explorations, making it hard to keep track of what each thread is doing.
More specifically, when we process a query u →? v, we start a BFS on the subtree rooted at u, and
we use labels to prune the set of nodes that have to be
visited in order to discover whether v is reachable or
not. Vertices are visited concurrently, level by level, for each BFS traversal. Since we want to proceed in parallel on all the nodes across several visits, we must implement a mechanism to uniquely identify the frontier set of each query. To do this, we use a status array of integer values rather than Boolean values, and we rely on bit-wise operators to manipulate single bits of these values. In other words, the set of queries is divided into groups of k queries, where k is the highest number of bits that can be efficiently stored (and retrieved) in a single CUDA data type. Groups are then processed one at a time, and queries within the same group are managed in parallel. Within each group, a query is identified by an index value between 1 and k, corresponding to a bit field. Therefore, during the bin generation phase, the status array is scanned and a node is inserted into a bin if its value is different from zero. Similarly, during the traversal, the status array’s values are modified through bit-wise atomic operations. We use functions such as the CUDA API atomicOr(address, value), which allows us to set the i-th bit of vertex v’s status word to indicate that the vertex has to be explored during the next iteration and that its exploration is associated with the i-th search.
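As an illustration, a bit-masked frontier expansion for a group of k = 32 queries packed into one unsigned int per vertex could be sketched as follows (atomicOr is the actual CUDA built-in; all other names are our assumptions):

    // Propagate the bit mask of every active query from each frontier vertex
    // to its children: bit i set on a child means query i must visit it next.
    __global__ void expandFrontier(const int *row_offsets, const int *col_indices,
                                   const unsigned int *status,
                                   unsigned int *next_status, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n)
            return;
        unsigned int mask = status[v];   // one bit per query exploring v
        if (mask == 0)
            return;                      // v is in no query's frontier
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            atomicOr(&next_status[col_indices[e]], mask);
    }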
Furthermore, given the high number of expected label comparisons, these arrays are accessed continuously, and it is important to limit the number of global memory accesses they generate. Thus, at the beginning of each kernel execution, all threads cooperatively load the labels of the searched nodes from the global memory array into a shared memory cache.
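A possible shape of this cooperative load, assuming 32 queries per group and a hypothetical interval-based label layout (the struct and all identifiers are ours, not the paper’s):

    // Stage the k target labels of the current query group in shared memory,
    // so frequent reachability tests read on-chip storage, not global memory.
    struct Label { int pre, post; };          // hypothetical interval label

    __global__ void searchKernel(const Label *labels, const int *targets, int k) {
        __shared__ Label cached[32];          // k <= 32, one label per query
        if (threadIdx.x < k)                  // threads cooperate on the load
            cached[threadIdx.x] = labels[targets[threadIdx.x]];
        __syncthreads();                      // cache visible to all threads
        // ... pruned, bit-masked BFS expansion using cached[i] to test
        // whether query i's target is reachable through the current vertex ...
    }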
4 EXPERIMENTAL RESULTS
Tests have been performed on a machine initially configured for gaming purposes, equipped with an Intel Core i7-7700HQ CPU (quad-core, running at 2.8 GHz), 16 GB of RAM, and a