The Kepler GPU architecture (NVidia, 2012) contains several performance improvements concerning atomic operations, kernel execution and scheduling.
In 2009, Aila and Laine (Aila and Laine, 2009) presented efficient depth-first BVH traversal methods, which still constitute the state-of-the-art approach for GPU ray tracing. Their implementation uses a single kernel containing traversal and intersection, which is run for each input ray in parallel. More recently, they presented several improvements, algorithmic variants that increase the SIMT efficiency, and updated results for NVidia's Kepler architecture in (Aila et al., 2012).
They also reported that ray tracing potentially gen-
erates irregular workloads, especially for incoherent
rays. To handle these uneven distributions of work,
compaction and reordering have been employed in
the context of shading computations (Hoberock et al.,
2009), ray-primitive intersection tasks (Pantaleoni
et al., 2010) and GPU path tracing (Wald, 2011).
Garanzha and Loop propose a breadth-first BVH traversal approach in (Garanzha and Loop, 2010), which is implemented using a pipeline of multiple GPU kernels. While coherent rays can be handled very efficiently using frusta-based traversal optimizations, the performance for incoherent ray loads falls significantly below that of depth-first ray tracing algorithms, as reported in (Garanzha, 2010).
3 THE QUEST FOR EFFICIENCY
While the SIMT model is convenient for the programmer (e.g. no tedious masking operations have
to be implemented), the hardware still has to sched-
ule the current instruction for all threads of the warp.
Diverging code paths within a warp have to be se-
rialized and lead to redundant instruction issues for
non-participating threads. If code executed on SIMT
hardware exhibits much intra-warp divergence, a con-
siderable amount of computational bandwidth will be
wasted. To quantify this waste in our experiments we use, in accordance with Section 1 of (Aila and Laine, 2009), the term SIMT efficiency, which denotes the percentage of useful operations performed by the active threads relative to the total number of operations issued per warp. As computational band-
width continues to increase faster than memory band-
width, SIMT efficiency is one of the key factors for
high performance on current and most probably also
future GPU hardware platforms.
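One possible formalization of this measure (our notation, assuming a fixed warp width of 32 threads, not a definition taken from (Aila and Laine, 2009)) is $E_{\mathrm{SIMT}} = \frac{1}{32\,N}\sum_{i=1}^{N} a_i$, where $N$ is the number of instructions issued for a warp during a measurement interval and $a_i$ denotes the number of active threads performing useful work for the $i$-th issue. A fully converged warp reaches 100%, whereas heavily divergent code paths drive the ratio down.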
Besides computational power, the efficiency of memory accesses is a crucial issue as well. Threads of a warp should access memory in a coherent fashion so that their requests can be coalesced for optimal performance. However, traditional ray tracing algo-
rithms (as well as many others) are known to gener-
ate incoherent access patterns that potentially waste
a substantial amount of memory bandwidth as dis-
cussed in (Aila and Karras, 2010). The caches of
GPUs can help to improve the situation considerably
as discussed by Aila et al. in (Aila and Laine, 2009)
and (Aila et al., 2012), where they describe how re-
cent improvements of the GPU cache hierarchy affect
the overall ray tracing performance.
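The following minimal sketch contrasts the two access patterns; the kernels and names are hypothetical illustrations and do not appear in the cited work.

__global__ void coalescedRead(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];       // adjacent threads read adjacent addresses, so the
                                     // warp's requests coalesce into few memory transactions
}

__global__ void incoherentRead(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[idx[i]];  // scattered indices (e.g. per-ray node addresses) can
                                     // force a separate transaction for almost every thread
}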
In general, GPUs are optimized for workloads that are distributed evenly among the active threads.
As reported in (Aila and Laine, 2009), ray tracing al-
gorithms generate potentially unbalanced workloads,
especially for incoherent rays. This fact poses sub-
stantial challenges to the hardware schedulers and can
still be a major source of inefficiency, as noted by
Tzeng et al. (Tzeng et al., 2010). To support the
scheduling hardware and achieve a more favorable
distribution of work, compaction and task reorder-
ing steps are explicitly included in the algorithms. This
paradigm has been successfully applied to shading
computations of scenes with strongly differing shader
complexity (Hoberock et al., 2009), where shaders of
the same type are grouped together to allow more co-
herent warp execution. In the context of GPU path
tracing, Wald applied a compaction step to the rays
after the construction of each path segment in order
to remove inactive rays from the subsequent compu-
tations (Wald, 2011).
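A minimal sketch of such a compaction step, using Thrust, is given below; the RayState layout and the activity criterion are hypothetical and not taken from (Wald, 2011).

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Hypothetical per-ray state; only the 'active' flag matters for compaction.
struct RayState { float3 origin, direction; int pixel; bool active; };

struct IsActive
{
    __host__ __device__ bool operator()(const RayState& r) const { return r.active; }
};

// Keep only the rays that are still active after a path segment, so that
// subsequent kernels are launched over a dense array with little divergence.
size_t compactRays(thrust::device_vector<RayState>& rays)
{
    thrust::device_vector<RayState> compacted(rays.size());
    auto end = thrust::copy_if(rays.begin(), rays.end(), compacted.begin(), IsActive());
    compacted.resize(end - compacted.begin());
    rays.swap(compacted);
    return rays.size();
}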
3.1 Depth-first Traversals
The most commonly used approach for GPU ray tracing is based on a depth-first traversal of a BVH as discussed in (Aila and Laine, 2009). Their approach follows a monolithic design, which combines BVH traversal and intersection of geometric primitives in a single kernel. This kernel is executed for each ray of an input array in a massively parallel fashion. Given
two different rays contained in the same warp, poten-
tial inefficiencies now stem from the fact that either
they require different operations (e.g. one ray needs
to execute a traversal step, while the other ray needs
to perform primitive intersection) or the sequences of
required operations have different lengths (e.g. one
ray misses the root node of the BVH, while the other
ray does not). These two fundamental problems have
a negative impact on SIMT efficiency and lead to an
uneven distribution of work, especially for incoherent
ray loads.
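The following simplified structural sketch illustrates such a monolithic per-ray loop; the data layout and helper routines are hypothetical placeholders and not the optimized kernel of (Aila and Laine, 2009). The two branches of the loop body mark the points where threads of a warp may diverge.

struct Ray  { float3 orig, dir; float tmax; };
struct Node { float3 lo, hi; int left, right; int firstPrim, primCount; }; // leaf if primCount > 0
struct Tri  { float3 v0, v1, v2; };

// Placeholder intersection routines; the actual slab and triangle tests are omitted.
__device__ bool intersectAABB(const Ray& r, const Node& n)             { return true;  }
__device__ bool intersectTri (const Ray& r, const Tri& t, float& tHit) { return false; }

__global__ void traceKernel(const Ray* rays, const Node* nodes, const Tri* tris,
                            int* hitIds, int numRays)
{
    int rayIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (rayIdx >= numRays) return;

    Ray ray = rays[rayIdx];
    int stack[64];                  // per-thread traversal stack
    int stackPtr = 0;
    stack[stackPtr++] = 0;          // push the root node
    int hit = -1;

    while (stackPtr > 0)            // rays in a warp may iterate different numbers of times
    {
        Node node = nodes[stack[--stackPtr]];
        if (!intersectAABB(ray, node))
            continue;
        if (node.primCount > 0)     // leaf: primitive intersection branch
        {
            for (int i = 0; i < node.primCount; ++i)
            {
                float tHit;
                if (intersectTri(ray, tris[node.firstPrim + i], tHit) && tHit < ray.tmax)
                {
                    ray.tmax = tHit;
                    hit = node.firstPrim + i;
                }
            }
        }
        else                        // inner node: traversal branch
        {
            stack[stackPtr++] = node.left;
            stack[stackPtr++] = node.right;
        }
    }
    hitIds[rayIdx] = hit;
}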
To mitigate the effects of irregular work dis-
tribution, Aila et al. investigated the concept of
dynamic work fetching. Given a large number of