TRAVERSING A BVH CUT TO EXPLOIT RAY COHERENCE

R. Torres, P. J. Martín and A. Gavilanes

Departamento de Sistemas Informáticos y Computación, Universidad Complutense de Madrid, Madrid, Spain

Keywords:

Coherence, Bounding volume hierarchy, CUDA, Path tracing, Ray tracing.

Abstract:

In this paper we study how to deal with the ray incoherence that naturally arises in path tracing-based systems. We introduce the notion of BVH Cut to split the tree into a forest of disjoint subtrees, and we use it to filter the rays that are successively generated by the path tracing algorithm. Each subtree is then traversed by its corresponding group of rays. Despite the overhead of filtering all the rays each time, a significant gain is achieved. Nevertheless, constructing a BVH cut is a challenging task, because it can lead to a huge amount of work if the same ray belongs to many groups. Thus, we present two kinds of building heuristics: structural heuristics, which characterize the root of a subtree by a property (the node’s depth or the surface area of its bounding volume in this paper), and optimization heuristics, which are based on the Simulated Annealing method. The performance of traversing the cuts so built has been experimentally analyzed over four commonly used scenes, using two popular implementations of the subtree traversal (persistent while-while / persistent packet). The results show a significant time saving w.r.t. the classic BVH traversal, which grows as the ray incoherence increases. The best saving ranges from 32.0% / 40.9% for structural heuristics to 32.0% / 51.7% for cuts built with Simulated Annealing.

1 INTRODUCTION

One of the main bottlenecks for most ray tracing algorithms is the traversal stage. Although great progress has been made in its performance through the use of modern GPU architectures (Aila and Laine, 2009), efficiently traversing a large number of incoherent rays in parallel remains a challenging topic, since it is closely tied to the SIMD programming model of the hardware and, more precisely, to the way rays are arranged on the device. Therefore, the notion of coherence is essential to understand the behavior of SIMD-based implementations, and more research on coherence is required to design faster traversal procedures.

Although many definitions of coherence can be found in the literature, most of them refer to a qualitative measure: two rays are said to be coherent if they traverse the same nodes and triangles most of the time. In order to exploit coherence on a GPU, rays are usually grouped into packets, mainly to allow the rays inside a packet to cooperate when reading scene information from global memory. Thus, ray packets become the logical unit of traversal, which gives rise to the so-called packet-based traversals. Their main disadvantage is that the rays inside a packet are forced to traverse the hierarchy in the order the packet chooses, which usually increases the total number of nodes traversed w.r.t. the single-ray traversal. Hence, the success of any packet-based traversal relies on the assumption that the saving due to the cooperative reading is greater than this traversal penalty. Consequently, its success depends on the coherence inside each packet: the more coherent the rays of a packet are, the higher the saving is.

Recently, (Aila and Laine, 2009) suggested that this assumption is not valid for primary, one-bounce diffuse and ambient occlusion rays on modern GPUs. Specifically, their experiments show that a stack-based single-ray traversal is faster than a stack-based packet traversal. Nevertheless, although the rays are not grouped in explicit packets, they are implicitly grouped, since the SIMD model of GPUs is based on the notion of warp. Therefore, the single-ray traversal they compute can actually be considered packet-based, since rays are arranged in the warps according to a specific order (Z-order) of the image.

An important drawback of most coherence definitions is that coherence cannot be known before traversing the tree. Thus, heuristics have to be used for packing rays in order to obtain a high level of coherence afterward. In the literature, we can find two heuristics. On the


Torres R., J. Martín P. and Gavilanes A..

TRAVERSING A BVH CUT TO EXPLOIT RAY COHERENCE.

DOI: 10.5220/0003363401400150

In Proceedings of the International Conference on Computer Graphics Theory and Applications (GRAPP-2011), pages 140-150

ISBN: 978-989-8425-45-4

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: Scenes used for our tests. The images have been generated at a resolution of 1024 × 1024 with 1000 paths per pixel (including primary rays). Each path is formed by 10 rays (each primary ray bounces 9 times). The total rendering times in SINGLE (persistent while-while) with $Cut_{root}$ are CONFERENCEROOM=552.7s (a), FAIRYFOREST=556.3s (b), SPONZA=787.1s (c) and SIBENIK=815.4s (d). The total rendering times with the best structural depth Cuts in Table 1 and inter-BVH pruning are CONFERENCEROOM=535.9s (3.03%), FAIRYFOREST=511.4s (8.07%), SPONZA=645.3s (18.01%) and SIBENIK=605.9s (25.69%). The percentages in brackets are the savings w.r.t. $Cut_{root}$.

one hand, coherence usually has a geometric meaning: two rays are said to be geometrically coherent whenever their origins lie “near” each other and/or the angle between their directions is “small” enough. Geometric coherence thus attempts to ensure a deeper coherence, because two geometrically coherent rays are expected to traverse the same nodes of the acceleration structure.

On the other hand, it is also natural to suggest a behavioral meaning: two rays are behaviorally coherent w.r.t. a node n of the acceleration structure whenever both rays intersect the bounding volume enclosing n. The underlying idea behind behavioral coherence is that the acceleration structure drives the traversal for all the rays, or a sufficiently large set of them, simultaneously. In fact, when a node is explored, only those rays intersecting its bounding volume are considered and the rest have to be filtered out. In that sense, parallel GPU primitives, such as sort, compact and (segmented) scan, become essential for implementing many tasks during the ray classification. Notice that the success of the traversal then depends on the performance of these primitives and that, although most of them are well known, their efficient implementations on GPUs are relatively recent.

In this paper we investigate how to exploit behavioral coherence when a large number of incoherent rays are shot through the scene, which is usual for path tracing-based systems. Our main contribution is twofold. On the one hand, we propose a BVH traversal that begins by classifying the rays on the GPU according to a sequence of descendants of the root, which will be called a Cut throughout the paper. This can be considered a breadth-first traversal for exploiting behavioral coherence, since it results in a set of traversal tasks involving behaviorally coherent packets. Then, these tasks are finally traversed in a classic depth-first way on the GPU. It is worth mentioning that our approach does not depend on the implementation of the traversal that is integrated into the system, since implementations are fully interchangeable. We have actually tested two of the fastest implementations on GPU, the persistent packet and the persistent while-while by (Aila and Laine, 2009), yielding successful saving rates in both cases.

On the other hand, we present different criteria for building cuts, which are compared with each other regarding the performance of their traversal over four commonly used scenes. The results show a significant time saving w.r.t. the classic BVH traversal, which grows as the ray incoherence increases.

2 RELATED WORK

Ray Packets. (Wald et al., 2001) pioneered the use of ray packets for developing an interactive ray tracer on CPU. They use the trivial geometric coherence of neighboring pixels to pack primary rays. Packets make it possible to decrease memory traffic and improve cache efficiency by exploiting the 4-wide SIMD units.

Later, packets were adapted to the 32-wide SIMD units on GPUs. Two different directions have been followed in order to simulate on GPU the recursive nature of hierarchical traversals. The first one is based on stacks, which are implemented in shared memory. Some of the papers included in this trend are (Günther et al., 2007) for BVHs, and (Horn et al., 2007) for KD-trees. The second approach introduces new links in the tree to guide its traversal. Examples of these stackless tracers are (Popov et al., 2007) and (Foley and Sugerman, 2005) for KD-trees, and (Torres et al., 2009) for BVHs. Recently, (Zlatuska and Havran, 2010) compared several GPU implementations, including proposals of both tendencies.

Concerning the efficiency of explicit packets, (Aila and Laine, 2009) question their practical interest. Indeed, their paper shows that traversing each ray independently is faster than traversing ray packets. Nevertheless, it only considers primary and secondary rays, which are arranged on the device according to the image Z-order; thus, they are implicitly packed into geometrically coherent warps.

Geometric Coherence. The notion of geometric coherence appears very often in the ray tracing literature (Wald and Slusallek, 2001), especially in papers concerning packet-based traversals. Thus, we only mention recent papers that analyze different techniques for exploiting geometric coherence among their main contributions. (Mansson et al., 2007) present several geometric heuristics to organize newly spawned rays. Unfortunately, classifying secondary rays on CPU takes too much time to make them applicable. (Noguera et al., 2009) present a KD-tree traversal for ray packets, using the CPU’s SSE instructions. Rays are simply classified according to the signs of their directions. (Boulos et al., 2007) propose several ways of packing secondary rays. They report a performance gain of around 3x for the method that groups rays of the same type vs. the single-ray method.

Behavioral Coherence on CPU. Most of the papers concerning behavioral coherence can be classified into two groups. The first one is composed of those works that use the acceleration structure as a reference to pack the rays into coherent packets. Among them, (Pharr et al., 1997) describe a Monte Carlo renderer that takes advantage of the cache units to reduce memory traffic from disk. In their system, the rays get enqueued in the voxels of a uniform grid. The scheduler subsystem is then responsible for starting the intersection test of the rays in a queue against the geometry at the corresponding voxel, depending on the information already cached.

Similarly, (Navratil et al., 2007) present another technique to decrease the traffic between DRAM and the L2 cache. Queues are now located at some nodes of a KD-tree, called queue points. The subtrees related to these nodes fit in the L2 cache, which is used to accelerate the traversal of the rays in the queue along the subtree.

More recently, (Boulos et al., 2008) introduce the quantitative notion of SIMD-coherence to measure the utilization of the SIMD units. Specifically, they compute the ratio between the number of active rays and the packet size, which is fixed to 256 rays, to express how coherent the packet is. They then use filtering techniques to compact those packets whose ratio drops below a threshold. This demand-driven reordering method gives the best results for diffuse path tracing vs. glossy and perfect specular ray tracing.

The second group includes packing techniques based on the operations that the rays demand, instead of the nodes of the structure they pass through. The aim of these proposals is to get the most out of the SIMD units. Thereby, (Wald et al., 2007) and (Gribble and Ramani, 2008) present ray tracers in which the rays are filtered to output those requiring the same operations. Then, these operations are run over the corresponding rays in a SIMD manner. The experiments included in the former show a high SIMD utilization for ray streams of at most 64×64. The performance of the latter is predicted to be 6-16 FPS, which is subsequently improved to 15-32 FPS by separating address and data processing (Ramani et al., 2009). In both papers, the size of the ray streams is also up to 64×64.

Behavioral Coherence on GPU. As far as we know, (Garanzha and Loop, 2010) is the first paper to explicitly exploit the notion of behavioral coherence on GPU. It first packs the rays using a geometric criterion that is based on the direction and the origin of the ray. In order to accelerate the classification, the rays are first transformed into hash keys and then sorted by using fast GPU primitives (Harris et al., CUDPP). Then, the frustum of each packet traverses a BVH in breadth-first order. Finally, a list of leaves is obtained per ray. The rays related to each leaf are then split into packets and tested for intersection with the bounding volume of the leaf and its triangles.

Finally, (Aila and Karras, 2010) present an architecture similar to NVIDIA Fermi that reduces the memory traffic between DRAM and on-chip caches. Its traversal is based on hierarchically located queue points in the spirit of (Navratil et al., 2007).

3 BVH CUTS

A Cut of a BVH is a set of nodes $C = \{n_1, n_2, \ldots, n_N \in \mathrm{BVH}\}$ such that for every leaf $l$ in the BVH, there exists a unique node $n_i \in C$ satisfying $l \in subtree(n_i)$ (see Figure 2 for an example). Thereby, a cut partitions the BVH into two disjoint sets of nodes T (top) and B (bottom), with $root \in T$ and all the leaves belonging to B, which is actually a forest of N subtrees.
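The uniqueness condition in this definition can be checked mechanically. The following is a minimal sketch, assuming a hypothetical binary `Node` class and a helper `is_cut` (neither comes from the paper): a set of nodes is a cut exactly when every root-to-leaf path crosses one, and only one, of its members.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical binary BVH node; a leaf has no children."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def is_cut(root, cut):
    """True if every root-to-leaf path crosses exactly one node of `cut`."""
    members = {id(n) for n in cut}

    def walk(node, crossed):
        # Count how many cut nodes lie on the path from the root to `node`.
        if id(node) in members:
            crossed += 1
        if node.is_leaf():
            return crossed == 1
        return walk(node.left, crossed) and walk(node.right, crossed)

    return walk(root, 0)
```

For a tree with root → (inner → (a, b), c), the sets {root}, {inner, c} and {a, b, c} are cuts, whereas {inner} misses the leaf c and {root, c} covers c twice.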

In order to exploit the behavioral coherence, rays are classified into N sets of rays, one per node of the cut. Specifically, a ray r is inserted into the set $s_i$ related to the node $n_i$ whenever r intersects the bounding volume $BV(n_i)$ of $n_i$. Observe that a ray can belong to different sets; thus, it can require the subsequent traversal of different subtrees. The classification process can be compared to a breadth-first traversal, since each ray spawns many tasks that are not solved immediately, but later on. Finally, each set $s_i$ is split into packets that are behaviorally coherent w.r.t. the node $n_i$. This splitting is trivial: every 32 consecutive rays are arranged into a packet. Moreover, if the set $s_i$ yields a packet p, then p is related to the BVH hanging from the node $n_i$, which we will call $B_i$.

Figure 2: Example of a BVH Cut.

     1  in:  Cut C; Ray R[N_R];
     2  out: float t_hit[N_R];
     3  var:
     4    float t^i_hit[N_R]; bool mask[N_R];
     5    int id[N_R]; int max_id;

     7  for each r ∈ [1..N_R] in parallel do
     8    t_hit[r] = ∞;

    10  // For each node in the Cut
    11  for each n_i ∈ C do {
    12    // Intersection of all rays
    13    // with the BV(n_i) on GPU
    14    for each r ∈ [1..N_R] in parallel do {
    15      t^i_hit[r] = ∞;
    16      mask[r] = test(r, BV(n_i));
    17    }
    18    // Compacting on GPU
    19    compact(mask, id, max_id);
    20    // Traversal on GPU
    21    traversal(R, id, max_id, B_i, t^i_hit, t_hit);
    22  }

Figure 3: Traversing a BVH cut.

Cut Traversing. Figure 3 shows the traversal scheme for a BVH cut. It is mainly composed of three stages. In the first one (lines 14–17), the array mask is updated with the intersection test of each ray with $BV(n_i)$. The second stage (function compact at line 19) removes the rays that did not pass the last intersection test by compacting the remaining ones. The array id stores the indices of the rays that passed the test and $max_{id}$ keeps the number of them. The third stage (line 21) is a depth-first traversal of the BVH $B_i$. Any traversal algorithm is possible in this stage, and we have tested two GPU approaches, as we will detail in Section 5. The extraction order of the rays to be traversed respects the order inside the array id.
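The compaction stage can be illustrated on the CPU with a prefix sum. This is only a sketch under stated assumptions: the function name `compact` and its return convention are hypothetical, the scan is computed sequentially, and it is not the GPU primitive used in the paper. It does show why compaction preserves the relative order of the surviving rays.

```python
from itertools import accumulate


def compact(mask):
    """Return (ids, count): indices of the True entries of `mask`, in order."""
    flags = [1 if m else 0 for m in mask]
    # Inclusive prefix sum; for a flagged entry, (sum - 1) is its output slot.
    inclusive = list(accumulate(flags))
    count = inclusive[-1] if inclusive else 0
    ids = [0] * count
    for r, flag in enumerate(flags):
        if flag:
            ids[inclusive[r] - 1] = r  # scatter the ray index to its slot
    return ids, count
```

On a GPU, the scan and the scatter would each run in parallel; the output slot of every ray depends only on the scan result, which is what makes the operation data-parallel.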

Traversing a cut leads to N classic traversals that compute the nearest intersection point for each ray inside the part of the scene that the corresponding $B_i$ covers. As usual, we use distances to refer to points, and thus we write $t^{i}_{hit}[r]$ to denote the intersection point related to the (local) traversal of the current $B_i$ w.r.t. a given ray r. Notice that these local traversals are run on the GPU, but sequentially launched from the CPU. Therefore, the final (global) distance for r, $t_{hit}[r]$, is computed as the minimum among the values $t^{i}_{hit}[r]$ related to each $B_i$.

Regarding the integration of pruning techniques, two improvements can be considered. First, an intra-$B_i$ pruning can be applied, and indeed is applied, when launching the function traversal at line 21. The current local distance $t^{i}_{hit}[r]$ is then used during the traversal of $B_i$ to rule out farther intersected nodes for r inside $B_i$. Second, an inter-$B_i$ pruning could be incorporated at line 15 to suitably initialize the local array $t^{i}_{hit}$ to the current global $t_{hit}$, instead of ∞. Thus, this line would become $t^{i}_{hit}[r] = t_{hit}[r]$. Again, the aim would be to take advantage of the traversals that have been completed before running the i-th iteration, i.e. the traversals of those $B_j$ with j < i. Specifically, $B_i$ could be ruled out if the current $t_{hit}[r]$ was less than the entry distance to $BV(n_i)$. Nevertheless, the order among the $B_i$ that leads to the best overall performance cannot be determined in advance. So, we have not implemented this inter-$B_i$ pruning and the results (Section 6) are an upper bound, regardless of how the $B_i$ are sorted.

4 CUT CREATION

In order to boost the efficiency of a cut, we must weigh the benefit from the behavioral coherence of each packet against the overhead due to the total number of packet traversals the rays produce. Since both issues are opposed, let us first analyze the two extremes.

The first one corresponds to the case in which the cut is composed of the leaves of the BVH ($C_{leaves}$). The overhead then outweighs the benefit, because too many traversals arise: each ray requires a test against each leaf whose bounding volume the ray intersects. Hence, traversing the cut would degenerate into the inefficient brute-force approach.

In the other extreme, the cut is just composed of the root of the BVH ($C_{root}$). Each ray then traverses the whole BVH from its root in a depth-first way. Thus, using the cut brings no benefit. To sum up, our traversal method is not efficient in either extreme, and a trade-off between the benefit and the overhead of using a BVH cut should be found. Hence, we present two different groups of heuristics for building cuts, which are later compared with respect to their performance over commonly used scenes.

    Cut create_depth(node n, int d) {
      if (isLeaf(n) ∨ depth(n) == d)
        return {n};
      else {
        Cut C_l = create_depth(left(n), d);
        Cut C_r = create_depth(right(n), d);
        return C_l ∪ C_r;
      }
    }

    Cut create_area(node n, float a) {
      if (isLeaf(n) ∨ area(n) < a)
        return {n};
      else {
        Cut C_l = create_area(left(n), a);
        Cut C_r = create_area(right(n), a);
        return C_l ∪ C_r;
      }
    }

Figure 4: Implementation of the structural heuristics for building a Cut. Top: cut creation by DEPTH. Bottom: cut creation by AREA.

4.1 Structural Heuristics

The first group corresponds to structural heuristics, because the resulting cuts are composed of those nodes satisfying a certain property that only depends on the structure of the BVH. In our experiments we have tested two properties that are respectively based on the node’s depth (called DEPTH), and on the surface area of its bounding volume (called AREA). Concretely, the cut consists of the nodes at a given depth d for the DEPTH heuristic, while it is composed of the first nodes from the root whose surface area falls below a given threshold a for the AREA heuristic.

Figure 4 shows how to build a cut as a function of the property. Observe that a leaf l is immediately added to the cut, even if the property did not hold for any node in the path from the root to l. This prevents the traversal from ruling out parts of the scene.
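Both structural heuristics translate almost directly into runnable code. The sketch below assumes a hypothetical `Node` class that stores its own depth and the surface area of its bounding volume; cuts are returned as lists ordered from left to right.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical BVH node with precomputed area and depth."""
    area: float
    depth: int = 0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def create_depth(n, d):
    """DEPTH heuristic: nodes at depth d, plus any shallower leaves."""
    if n.is_leaf() or n.depth == d:
        return [n]
    return create_depth(n.left, d) + create_depth(n.right, d)


def create_area(n, a):
    """AREA heuristic: first nodes whose area falls below a, plus leaves."""
    if n.is_leaf() or n.area < a:
        return [n]
    return create_area(n.left, a) + create_area(n.right, a)
```

Because a leaf is always added when it is reached, every leaf of the tree ends up below exactly one node of the returned list, so the result is a valid cut for any depth or area threshold.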

     1  Cut SimulatedAnnealing(node root) {
     2    Cut currentCut = {root};
     3    float currentTime = render(currentCut);
     4    Cut bestCut = currentCut;
     5    float bestTime = currentTime;
     6    Cut nextCut = evolve(currentCut);
     7    float nextTime = render(nextCut);
     8    float temp = MAX_TEMP;

    10    for (i = 0; i < NSteps; i++) {
    11      for (j = 0; j < NStepsperTemp; j++) {
    12        // Acceptance threshold
    13        float p = exp(−|currentTime − nextTime| / temp);
    14        if ((nextTime < currentTime) ∨ (rand(0, 1) < p)) {
    15          currentCut = nextCut;
    16          currentTime = nextTime;
    17          // Update the bestTime
    18          if (currentTime < bestTime) {
    19            bestTime = currentTime;
    20            bestCut = currentCut;
    21          }
    22        }

    24        nextCut = evolve(currentCut);
    25        nextTime = render(nextCut);
    26      } // for j
    27      temp = α · temp;
    28    } // for i

    30    return bestCut;
    31  }

Figure 5: Implementation of Simulated Annealing for building a cut.

4.2 Simulated Annealing

The cut construction can be formulated as an optimization problem. Thus, our second group of heuristics consists of methods that look for the minimum inside a search space composed of all possible cuts. The objective function to be minimized is the render time that a cut traversal requires. According to this formulation, many of the algorithms employed in combinatorial optimization can be used to find the best cut. Nevertheless, an exhaustive search for the best cut turns out to be infeasible, as usually happens for combinatorial optimization problems; hence, we focus on approximation algorithms. Among the existing algorithms, we have adapted the Simulated Annealing method (SA in the sequel), since its generic nature makes it easy to apply to these problems (Zomaya and Kazman, 1999).

SA can be described as a randomized iterative improvement algorithm: it not only accepts decreasing moves, regarding the given objective function, but also tolerates increasing moves in order to avoid getting trapped in local minima. Indeed, it uses a probability function, which decreases as the execution advances, for accepting increasing moves. The method asymptotically converges to a global minimum whenever certain conditions on the annealing schedule hold.

Figure 5 describes how to build a BVH cut using SA. Besides the current cut (currentCut), the algorithm also holds another one (nextCut) that corresponds to a random evolution of the former. These two cuts advance together along the execution of two nested loops: one for decreasing the control parameter temp (line 10), the temperature used in the original SA formulation, and another one for trying many moves at the same temp (line 11). A cut that increases the render time is accepted only if its acceptance threshold (line 13) is greater than a uniform random value in [0,1] (line 14). If nextCut is finally accepted, it is assigned to currentCut (line 15) and the best cut is updated if required (lines 18–21). In any case, a new random evolution is computed (line 24) and subsequently stored in nextCut.
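The same loop structure can be exercised on a toy problem. The sketch below is not the paper's implementation: the state, the `evolve` and `cost` callables and all default parameter values are assumptions, the cost replaces the measured render time, and the acceptance probability is taken in the usual Metropolis form exp(−|Δ|/temp). The candidate is generated at the top of the inner loop, which is equivalent to keeping it across iterations.

```python
import math
import random


def simulated_annealing(initial, evolve, cost, max_temp=10.0, alpha=0.9,
                        n_steps=20, n_steps_per_temp=20, rng=None):
    """Generic SA loop; returns the best state seen and its cost."""
    rng = rng or random.Random(0)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temp = max_temp
    for _ in range(n_steps):                    # cooling loop
        for _ in range(n_steps_per_temp):       # moves at a fixed temperature
            candidate = evolve(current, rng)
            candidate_cost = cost(candidate)
            # Acceptance threshold for moves that worsen the cost.
            p = math.exp(-abs(current_cost - candidate_cost) / temp)
            if candidate_cost < current_cost or rng.random() < p:
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temp *= alpha
    return best, best_cost
```

A possible use is `simulated_annealing(0, lambda x, rng: x + rng.choice((-1, 1)), lambda x: (x - 7) ** 2)`, where the state is an integer and the cost is the squared distance to 7. Since the best state is only replaced by a strictly better one, the returned cost never exceeds that of the initial state.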

The function evolve generates a reachable cut from currentCut by applying either the unfold or the join operation. In the former, an inner node n of the cut C is replaced by its two children: $unfold(C, n) = (C - \{n\}) \cup \{left(n), right(n)\}$, whereas in the latter two sibling nodes $l, r \in C$ are replaced by their father: $join(C, l, r) = (C - \{l, r\}) \cup \{father(l)\}$. In this function, one of these operations is randomly chosen (if both are possible).
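The two moves map directly onto set operations. The following sketch assumes a hypothetical `Node` class that stores a link to its father and represents a cut as a Python set; how the paper selects the node to unfold or the siblings to join is not specified, so the uniform choice made here is an assumption.

```python
import random


class Node:
    """Hypothetical BVH node that remembers its father."""

    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right, self.father = name, left, right, None
        for child in (left, right):
            if child is not None:
                child.father = self


def unfold(cut, n):
    """Replace the inner node n by its two children."""
    return (cut - {n}) | {n.left, n.right}


def join(cut, l, r):
    """Replace the sibling nodes l and r by their father."""
    return (cut - {l, r}) | {l.father}


def evolve(cut, rng):
    """Apply one randomly chosen move among those that are possible."""
    unfoldable = [n for n in cut if n.left is not None]
    joinable = [(n, n.father.right) for n in cut
                if n.father is not None and n.father.left is n
                and n.father.right in cut]
    moves = []
    if unfoldable:
        moves.append(lambda: unfold(cut, rng.choice(unfoldable)))
    if joinable:
        moves.append(lambda: join(cut, *rng.choice(joinable)))
    if not moves:
        return set(cut)  # single-leaf tree: nothing to change
    return rng.choice(moves)()
```

Both operations keep the cut property: unfolding hands the leaves below n to its two children, and joining gives the leaves of both siblings back to their father.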

5 EXPERIMENTAL SETTINGS

Our application has been run on an NVIDIA GeForce GTX 285 with 1 GB of RAM. The test scenes are FAIRYFOREST, CONFERENCEROOM, SPONZA and SIBENIK (see Figure 1). The FAIRYFOREST scene is open, but a quadrilateral has been positioned as a roof, preventing the rays from escaping from the scene. All the images have been rendered at a resolution of 1024 × 1024.

The BVHs have been built by following the Surface Area Heuristic (SAH) by (Goldsmith and Salmon, 1987) and using the greedy top-down algorithm by (Ize et al., 2007). To improve the overall performance of the BVH, we have also applied the early split clipping technique by (Ernst and Greiner, 2007). So, before starting the construction, the bounding volume of each triangle is iteratively halved until its surface area is lower than a certain threshold.

We have used path tracing (Kajiya, 1986) as our ray tracing algorithm and, for the sake of convenience, every surface of the scene is considered diffuse (i.e. with a constant BRDF). Hence, as soon as a ray finds the nearest intersection point, a new ray is spawned. Its origin is the intersection point and its direction is randomly chosen over a virtual hemisphere around the surface normal. We have used the cosine as the probability density function, i.e. points near the pole are more likely because the density depends on cos θ (where θ is the angular deviation of the point from the pole). Since the number of rays does not increase, we have absolute control over the memory that is actually allocated.

Each ray is bound to a persistent CUDA thread, according to (Aila and Laine, 2009). The set of rays whose associated threads are simultaneously launched is called a generation. Generations are enumerated: generation 0 is composed of the primary rays, and generation i is composed of the rays spawned from generation i − 1. The number of generations considered in this paper is fixed to 10. The number of rays in a generation is the largest that our implementation and our graphics card are able to store: 8 MRays (= $8 \cdot 2^{20}$ rays). The primary rays are spawned from a two-dimensional array of 4096×2048. Since the images are at a resolution of 1024 × 1024, each subarray of 4 × 2 rays contains 8 samples for the same pixel. When it is stored in memory, the two-dimensional array is flattened according to the Z-order (Morton code).
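A Z-order index is obtained by interleaving the bits of the two coordinates. The sketch below uses the usual bit-spreading masks for coordinates of up to 16 bits, which covers a 4096×2048 array; the function names are hypothetical, and placing x in the even bits is an assumption about the layout.

```python
def part1by1(v):
    """Spread the low 16 bits of v so that a zero bit separates each pair."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton2d(x, y):
    """Z-order index of (x, y): x fills the even bits, y the odd bits."""
    return part1by1(x) | (part1by1(y) << 1)
```

Under this layout, the eight rays of the 4 × 2 subarray at the origin receive the consecutive indices 0 to 7, so samples of the same pixel sit next to each other in memory.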

In these settings, path tracing is especially suitable for our experiments, since no property can be assumed in advance for the rays from generation 1 on (i.e. non-primary rays). As we will see in Section 6, the incoherence becomes maximal from generation 2 on.

We have used the linear congruential generator by (Park and Miller, 1988) as random number generator algorithm. It has a period of $2^{31} - 2$, which is greater than the total amount of random numbers needed in the tests, ensuring that each ray receives different random numbers.
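For reference, a generator of this family fits in a few lines. The sketch assumes the "minimal standard" parameters, multiplier 16807 and modulus 2^31 − 1; whether the paper's kernels use exactly these constants is not stated, and the names `park_miller` and `uniform01` are hypothetical.

```python
M = 2**31 - 1   # prime modulus
A = 16807       # "minimal standard" multiplier (assumed here)


def park_miller(seed):
    """Yield the sequence x_{k+1} = A * x_k mod M for a seed in [1, M - 1]."""
    state = seed
    while True:
        state = (A * state) % M
        yield state


def uniform01(value):
    """Map a generator output to a float strictly inside (0, 1)."""
    return value / M
```

Because M is prime and A is a primitive root modulo M, the state visits every value in [1, M − 1] before repeating, which gives the period of 2^31 − 2.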

Our path tracer has been implemented with five CUDA kernels: RayGenerator (RG), Test, Compact, TraversalIntersection (TI) and Shader (SH). The algorithm runs according to the following scheme. First, the primary rays are spawned from a pinhole camera, in the kernel RG. Then, in the kernel Test, the rays are tested for intersection with a node n of the cut. Next, the rays that passed the previous intersection test are compacted, in the kernel Compact. This kernel is actually the primitive cudppCompact of the CUDPP library by (Harris et al., CUDPP) and preserves the Z-order of the initial rays. Afterward, the kernel TI finds the nearest intersection for every ray by traversing the subtree hanging from n. The two algorithms used for traversing a subtree are due to (Aila and Laine, 2009). They are the persistent packet and the persistent while-while and will be denoted by PACKET and SINGLE, respectively. Finally, a new secondary ray is spawned over the hemisphere from the nearest intersection in the kernel SH.

[Plots: one Depth panel and one Area panel per scene (Conference Room, Fairy Forest, Sibenik, Sponza); y-axis: render time in ms.]

Figure 6: Render times (in ms) measured for the four scenes with the traversal algorithm SINGLE by using structural cuts. The colors are: black (generation 0), blue (generation 1), red (generation 2), green (generation 3).

[Plots: one Depth panel and one Area panel per scene (Conference Room, Fairy Forest, Sibenik, Sponza); y-axis: render time in ms.]

Figure 7: Render times (in ms) measured for the four scenes with the traversal algorithm PACKET by using structural cuts. The colors are: black (generation 0), blue (generation 1), red (generation 2), green (generation 3).

6 RESULTS

Structural Heuristics. Several structural cuts have been built with different values for the parameter of the DEPTH and AREA heuristics. The render times for their traversal are depicted in Figure 6 for SINGLE and in Figure 7 for PACKET. The y-axis displays the measured render times (in ms) of the cut traversal, and the x-axis the different values of the parameter. Points of the same generation are joined in a

continuous line. However, only the ﬁrst four genera-


Table 1: The percentage of saving in render time of the best cut built with the DEPTH heuristics w.r.t. $C_{root}$. The numbers in brackets are the depths of the best cuts.

SINGLE
Scene \ Gen.  0        1        2        3        4        5        6        7        8        9
Conf.Room     0.0(0)   0.0(0)   0.0(0)   0.0(0)   0.1(2)   1.4(2)   1.6(2)   2.0(2)   2.1(2)   2.1(2)
FairyForest   0.0(0)   0.0(0)   8.6(1)   9.6(1)   10.0(1)  9.7(1)   9.5(1)   9.4(1)   9.3(1)   9.0(1)
Sibenik       0.0(0)   1.5(1)   17.3(2)  23.2(3)  26.2(3)  28.0(3)  29.0(3)  29.8(3)  30.3(3)  30.6(3)
Sponza        0.0(0)   0.0(0)   10.4(2)  15.4(2)  17.1(2)  18.3(2)  19.0(2)  19.5(2)  19.8(2)  20.1(2)

PACKET
Scene \ Gen.  0        1        2        3        4        5        6        7        8        9
Conf.Room     0.0(0)   0.0(0)   3.3(5)   11.7(5)  7.4(5)   14.3(5)  9.0(5)   14.4(5)  9.0(5)   14.4(5)
FairyForest   0.0(0)   0.5(1)   8.6(6)   19.7(6)  16.7(6)  22.1(6)  18.0(6)  22.7(6)  18.1(6)  22.6(6)
Sibenik       0.0(0)   0.0(0)   4.9(6)   16.7(6)  13.8(6)  20.1(6)  17.0(6)  22.3(6)  17.8(6)  21.8(6)
Sponza        0.0(0)   0.6(1)   5.7(6)   15.4(6)  12.8(6)  17.9(6)  14.8(6)  19.7(6)  15.6(6)  20.0(6)

Table 2: The percentage of saving in render time of the best cut built with the AREA heuristics w.r.t. $C_{root}$. The numbers in brackets are the percentage of surface area related to the best cut w.r.t. the surface area of the root.

SINGLE
Scene \ Gen.  0           1           2           3           4           5           6           7           8           9
Conf.Room     0.0(100)    0.0(100)    0.0(100)    0.0(100)    0.8(75.3)   2.2(75.3)   2.4(75.3)   2.7(75.3)   2.4(75.3)   2.3(75.3)
FairyForest   0.0(100)    0.0(100)    8.6(99.4)   9.6(99.4)   10.0(99.4)  9.7(99.4)   9.5(99.4)   9.4(99.4)   9.3(99.4)   9.0(99.4)
Sibenik       0.0(100)    1.5(99.5)   18.9(59.7)  25.1(51.2)  27.9(51.2)  29.6(51.2)  30.6(51.2)  31.3(51.2)  31.7(51.2)  32.0(51.2)
Sponza        0.0(100)    0.0(100)    11.1(75.3)  15.4(72.7)  17.1(72.7)  18.3(72.7)  19.0(72.7)  19.5(72.7)  19.8(72.7)  20.1(72.7)

PACKET
Scene \ Gen.  0           1           2           3           4           5           6           7           8           9
Conf.Room     0.0(100)    6.5(86.4)   9.7(53.0)   16.0(53.0)  10.5(53.0)  16.6(53.0)  10.7(53.0)  16.3(53.0)  10.5(53.0)  15.7(53.0)
FairyForest   0.0(100)    2.8(22.7)   25.0(19.8)  34.6(19.8)  34.3(19.8)  39.1(19.8)  36.4(19.8)  40.3(19.8)  37.1(22.7)  40.9(22.7)
Sibenik       0.0(100)    0.0(100)    10.2(31.2)  19.6(28.4)  17.6(28.4)  23.5(31.2)  20.0(28.4)  25.1(28.4)  20.8(28.4)  24.9(28.4)
Sponza        0.0(100)    1.7(54.5)   8.2(44.1)   16.0(44.1)  13.6(44.1)  19.7(44.1)  15.4(44.1)  20.0(44.1)  16.8(44.1)  19.7(44.1)

Table 3: The percentage of saving in render time of the best cut found with Simulated Annealing w.r.t. C

root

. The numbers in

brackets (D/A) are: D, the averaged depth of the nodes in the cut; and A, the percentage of averaged surface area of the nodes

in the cut w.r.t. the surface area of the root.

SINGLE

Scene \ Gen. 0 1 2 3 4 5 6 7 8 9

Conf.Room 0.0 0.0 0.0 2.5 3.9 5.3 4.9 5.3 5.0 5.0

(0.0/100) (0.0/100) (0.0/100) (4.4/46.9) (4.5/43.2) (4.6/42.9) (4.0/49.0) (4.7/42.4) (4.1/48.8) (4.8/42.2)

FairyForest 0.0 5.9 17.1 16.3 15.8 14.7 14.0 13.4 12.8 12.4

(0.0/100) (5.0/17.9) (5.1/26.1) (4.4/31.2) (4.4/31.2) (4.4/31.2) (4.4/31.2) (4.4/31.2) (4.4/31.2) (4.4/31.2)

Sibenik 0.0 1.5 18.9 25.4 28.0 29.6 30.6 31.3 31.7 32.0

(0.0/100) (1.0/71.3) (2.8/46.9) (3.0/45.5) (3.0/45.5) (3.1/43.8) (3.1/43.8) (3.1/43.8) (3.1/43.8) (3.1/43.8)

Sponza 0.0 0.0 11.1 15.4 17.1 18.3 19.0 19.5 19.8 20.1

(0.0/100) (0.0/100) (1.6/68.2) (2.0/64.5) (2.0/64.5) (2.0/64.5) (2.0/64.5) (2.0/64.5) (2.0/64.5) (2.0/64.5)

PACKET

Scene \ Gen. 0 1 2 3 4 5 6 7 8 9

Conf.Room 0.0 21.8 31.5 37.0 33.3 37.2 32.1 36.0 30.7 34.5

(0.0/100) (6.1/26.8) (6.6/18.3) (6.6/17.4) (6.5/17.0) (6.5/17.4) (6.5/17.0) (6.5/17.4) (6.5/17.4) (6.4/17.7)

FairyForest 0.0 45.1 49.4 51.7 48.5 51.4 47.9 47.7 47.4 47.6

(0.0/100) (5.7/23.5) (5.8/20.0) (5.9/17.9) (5.8/16.6) (5.9/17.9) (5.9/17.8) (5.7/17.8) (5.9/17.9) (5.7/17.8)

Sibenik 0.0 9.3 14.8 21.4 18.8 23.7 20.3 24.5 20.9 24.9

(0.0/100) (5.8/28.0) (6.1/22.4) (6.2/21.3) (5.9/20.9) (6.0/20.9) (6.0/21.4) (6.0/21.8) (5.9/21.5) (6.0/21.8)

Sponza 0.0 3.4 14.8 24.1 21.5 26.6 22.9 27.3 23.2 27.5

(0.0/100) (5.5/32.0) (6.0/29.3) (6.4/28.2) (6.5/30.9) (6.4/29.9) (6.3/30.1) (6.2/30.3) (6.3/30.1) (6.3/29.9)

tions are showed for the sake of clarity, which gives

rise four curves per chart. The remaining ones have a

behaviour similar to generation 3.

The first (leftmost) value of the parameter always corresponds to the structural value that builds $C_{root}$. Therefore, the first value of each curve corresponds to the SINGLE or PACKET traversal of the whole BVH plus an extra time due to filtering (around 10 ms according to our measurements). Higher values in DEPTH and lower values in AREA cause an exponential growth in render time, which is not included in the charts. We have measured the generations for different random number seeds. The results are very similar, so only the charts for one seed are displayed in the figures.

As can be seen, the curves of a given generation have a similar shape in every scene. The curves of generation 0 (primary rays) and generation 1 do not undergo any improvement w.r.t. the traversal of $C_{root}$. In contrast, generations 2 to 9 show a drop at the beginning followed by an exponential increase. The depth of this valley depends on both the scene and the traversal algorithm.

The valley is deeper for PACKET than for SINGLE. As Aila and Laine (2009) mention, SINGLE is more efficient than PACKET for both coherent rays (such as primary rays) and incoherent rays. This is because the memory bandwidth of modern GPUs is high, so the bottleneck in PACKET is not the memory traffic but the additional number of traversed nodes.

Notice that the minimum of each curve occurs further to the left in SINGLE than in PACKET (i.e. at shallower nodes or at nodes with a larger surface area). The overload is the same in both algorithms, so the memory system must be responsible for this difference. If the packets are more coherent in SINGLE, the number of nodes read from memory does not vary, but the texture caches are better used. In contrast, if the packets are more coherent in PACKET, the number of nodes read from memory decreases, but the texture cache usage is the same. Therefore, the curves show that the improvement due to the reduction in the number of nodes read becomes relevant further to the right than the benefit of the cache.

For a given scene, the shapes of the curves are very similar in the DEPTH and AREA charts. This is not surprising, since deeper nodes also have smaller surface areas.
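The relation between both heuristics can be illustrated with a minimal sketch of how a structural cut might be built. It assumes that a DEPTH cut keeps the nodes at a given depth (plus any shallower leaves) and that an AREA cut keeps the first node on each root-to-leaf path whose surface area does not exceed a given fraction of the root's area. The `Node` class, the function names and the toy areas are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    area: float                      # surface area of the node's bounding volume
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def depth_cut(root: Node, max_depth: int) -> List[str]:
    """Assumed DEPTH heuristic: stop descending at `max_depth` or at a leaf."""
    cut: List[str] = []
    def visit(node: Node, depth: int) -> None:
        if depth == max_depth or node.is_leaf():
            cut.append(node.name)
        else:
            visit(node.left, depth + 1)
            visit(node.right, depth + 1)
    visit(root, 0)
    return cut

def area_cut(root: Node, fraction: float) -> List[str]:
    """Assumed AREA heuristic: stop when the area is at most `fraction` of the root's area."""
    limit = fraction * root.area
    cut: List[str] = []
    def visit(node: Node) -> None:
        if node.area <= limit or node.is_leaf():
            cut.append(node.name)
        else:
            visit(node.left)
            visit(node.right)
    visit(root)
    return cut

# Toy BVH with made-up surface areas.
toy = Node("root", 100.0,
           Node("A", 60.0, Node("A1", 30.0), Node("A2", 20.0)),
           Node("B", 40.0))
```

In this toy tree, `depth_cut(toy, 0)` and `area_cut(toy, 1.0)` both return the cut formed by the root alone, while increasing the depth or lowering the area fraction moves the cut towards the leaves.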

Generations 0 and 1 do not improve with the use of cuts. This is because these rays are very coherent, and the improvement obtained by launching more coherent packets is not enough to exceed the overload.

Tables 1 and 2 summarize the best savings in the figures. They include a column for each generation that shows the percentage of saving of the best structural cut w.r.t. the performance of traversing $C_{root}$. Hence, it is computed by comparing the first value of the corresponding curve with its minimum, that is, through the expression $\frac{t_{root} - t_{min}}{t_{root}}$, where $t_{min}$ and $t_{root}$ denote these two values. The most relevant savings are 30.6%/32.0% (DEPTH/AREA) for SINGLE applied to SIBENIK, and 22.7%/40.9% for PACKET applied to FAIRYFOREST.
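The computation of a table entry can be sketched as follows. The render times in the usage note are made-up numbers, the function names are hypothetical, and the factor of 100 is added here only because the tables report percentages.

```python
def saving_percent(t_root, t_min):
    """Percentage of render time saved w.r.t. traversing C_root: 100 * (t_root - t_min) / t_root."""
    if t_root <= 0:
        raise ValueError("t_root must be positive")
    return 100.0 * (t_root - t_min) / t_root

def best_saving(curve):
    """Given the render times of one generation (first entry = C_root),
    return (saving in percent, index of the best cut on the curve)."""
    t_root = curve[0]
    best_index = min(range(len(curve)), key=curve.__getitem__)
    return saving_percent(t_root, curve[best_index]), best_index
```

For a hypothetical curve `[200.0, 180.0, 150.0, 400.0]`, `best_saving` returns `(25.0, 2)`. When no cut beats $C_{root}$, the minimum is the first value itself and the saving is 0.0, which matches the `0.0(0)` entries of Table 1.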

Simulated Annealing. The results can be seen in Table 3. The parameters used are MAX TEMP=600, NSteps=1000, NSteps per temp=1000 and α=0.99.
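To make the role of these parameters concrete, the following is a minimal sketch of a Simulated Annealing search over the cuts of a small tree. Several things are assumptions made for illustration: the reading of the parameters (a starting temperature of 600 that is multiplied by α after each of NSteps temperature levels, with NSteps per temp moves per level), the move set (expand a cut node into its children, or collapse two sibling cut nodes into their parent), the heap-style node numbering, and the toy cost table that stands in for the measured render time.

```python
import math
import random

FILTER_COST = 10  # hypothetical cost of filtering the rays against one cut node
TRAVERSAL_COST = {0: 100, 1: 30, 2: 35, 3: 14, 4: 15, 5: 17, 6: 18}  # made-up values

def toy_cost(cut):
    """Toy stand-in for the render time of a cut: filtering plus subtree traversal."""
    return FILTER_COST * len(cut) + sum(TRAVERSAL_COST[n] for n in cut)

def neighbor(cut, num_nodes, rng):
    """Assumed move set on a complete binary tree (children of i are 2i+1 and 2i+2)."""
    node = rng.choice(sorted(cut))
    left, right = 2 * node + 1, 2 * node + 2
    if rng.random() < 0.5 and right < num_nodes:
        return (cut - {node}) | {left, right}          # expand into both children
    if node > 0:
        sibling = node + 1 if node % 2 == 1 else node - 1
        if sibling in cut:
            parent = (node - 1) // 2
            return (cut - {node, sibling}) | {parent}  # collapse into the parent
    return cut                                         # no valid move: stay in place

def anneal(cost, num_nodes, max_temp=600.0, n_steps=1000,
           n_steps_per_temp=1000, alpha=0.99, seed=0):
    """Geometric-cooling Simulated Annealing starting from the root cut {0}."""
    rng = random.Random(seed)
    current = frozenset({0})
    current_cost = cost(current)
    best, best_cost = current, current_cost
    temp = max_temp
    for _ in range(n_steps):
        for _ in range(n_steps_per_temp):
            candidate = neighbor(current, num_nodes, rng)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept worse cuts with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temp *= alpha
    return best, best_cost
```

With the toy costs above, the cheapest of the five possible cuts of the 7-node tree is `{1, 2}`, so a call such as `anneal(toy_cost, 7, n_steps=20, n_steps_per_temp=50)` is enough to find it. The default arguments mirror the parameter values listed in the text.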

Observe that the percentage of saving is almost always at least as good as that obtained with the structural cuts. This is natural, since SA also explores cuts other than the structural ones. The only exception in the tables is SIBENIK with PACKET at generation 7 (24.5% versus 25.1% for AREA).

For some scenes, there is a correspondence between the average depth of the best SA cut and the depth of the best structural DEPTH cut (e.g. SIBENIK with SINGLE). However, this cannot be generalized to all scenes.

7 DISCUSSION AND FUTURE WORK

The benefit of using cuts is a consequence of the fact that the overload due to filtering is smaller than the improvement obtained by traversing more coherent rays. It is an open question whether this technique is also applicable to CPU ray tracers, other rendering algorithms (such as bidirectional path tracing), non-diffuse surfaces (such as specular or glossy ones), and other acceleration structures (such as KD-trees).
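The trade-off can be made tangible with a minimal sketch of a filtering step. It assumes that filtering amounts to testing every ray against the axis-aligned bounding box of every cut node with a standard slab test; the function names, the box layout and the toy rays are hypothetical and are not taken from the CUDA kernels.

```python
def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: does the ray origin + t * direction (t >= 0) intersect the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray parallel to this pair of slabs: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def filter_rays(rays, cut_boxes):
    """Group ray indices by the cut nodes whose bounding box they hit.
    The work grows with len(rays) * len(cut_boxes), and a ray may land in several groups."""
    groups = {name: [] for name in cut_boxes}
    for index, (origin, direction) in enumerate(rays):
        for name, (box_min, box_max) in cut_boxes.items():
            if ray_hits_box(origin, direction, box_min, box_max):
                groups[name].append(index)
    return groups

# Two hypothetical cut nodes and three toy rays.
boxes = {"left": ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
         "right": ((2.0, 0.0, 0.0), (3.0, 1.0, 1.0))}
rays = [((-1.0, 0.5, 0.5), (1.0, 0.0, 0.0)),   # crosses both boxes
        ((2.5, 0.5, -1.0), (0.0, 0.0, 1.0)),   # crosses only the right box
        ((0.5, 5.0, 0.5), (0.0, 1.0, 0.0))]    # misses both boxes
groups = filter_rays(rays, boxes)
```

Here ray 0 ends up in both groups, which is the situation that makes a cut expensive when the same rays belong to many groups.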

Figures 8a and 8b show the render time of the kernel TI alone for SINGLE and PACKET, respectively. Observe that the curves of the highly incoherent generations (red and green) present a minimum, showing that a cut at a certain depth leads to a significant improvement. Nevertheless, the overload due to filtering grows exponentially (Figure 8c). This is why the minima in Figures 6 and 7 are shifted to the left. It is necessary to study ways of making the most of that coherence or of reducing the overload.

In order to reduce the overload (the number of filters), two cuts $C_1$ and $C_2$ can be used. The nodes of $C_1$ are used to filter the rays, whereas the nodes of $C_2$ are used to traverse the scene. Each node $n \in C_1$ is linked to a set of nodes $\{n_1, \ldots, n_k\} \subseteq C_2$, such that the nodes $n_i$ are descendants of $n$. Thus, the number of filters is no larger than the number of nodes in $C_2$ (since $|C_1| \le |C_2|$). The drawback is that the rays launched for traversal are more incoherent. We did not obtain successful results, and this technique was dismissed.
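The link between both cuts can be sketched as follows. The sketch assumes a heap-style numbering of a binary tree (the parent of node i is (i - 1) // 2) and treats a node that belongs to both cuts as linked to itself; the function name `link_cuts` is hypothetical.

```python
def link_cuts(c1, c2):
    """Map each node of C1 to the nodes of C2 that hang from it.

    Assumptions: nodes are heap indices of a binary tree with root 0, and C2 lies
    at or below C1, so every C2 node has exactly one ancestor (or itself) in C1.
    """
    links = {n: [] for n in c1}
    for node in sorted(c2):
        ancestor = node
        while ancestor not in links:
            if ancestor == 0:
                raise ValueError("C2 node %d has no ancestor in C1" % node)
            ancestor = (ancestor - 1) // 2   # climb to the parent
        links[ancestor].append(node)
    return links
```

For example, `link_cuts({1, 2}, {3, 4, 5, 6})` yields `{1: [3, 4], 2: [5, 6]}`: two filters decide which rays are sent to four subtree traversals.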

Nowadays, there already exist cards with more DRAM capacity than the one used in this paper (e.g. the Tesla C2070 has 6 GB). A larger amount of memory would allow more rays to be stored and traversed in parallel. Thus, the coherence would be higher and better results would be expected. However, this expectation should be evaluated experimentally.

Figure 8: (a) Render time without overload (only TI) for SPONZA and SINGLE; (b) render time without overload (only TI) for SPONZA and PACKET; (c) overload for subfigures (a) and (b). The x-axis of each panel is the depth.

In this paper, the Russian roulette method for terminating a path has not been implemented. Instead, every ray is kept alive until generation 9. High generations are not expected to behave similarly if the sizes of their populations are different.
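For reference, the way Russian roulette would shrink the ray population can be sketched as follows. This is not part of the system described in this paper: the constant survival probability and the function names are hypothetical, and practical renderers usually derive the probability from the path throughput.

```python
import random

def russian_roulette(throughput, survival, rng):
    """Return the reweighted throughput if the path survives, or None if it is terminated.

    Dividing by the survival probability keeps the estimator unbiased.
    """
    if rng.random() >= survival:
        return None
    return throughput / survival

def expected_population(n_rays, survival, generations):
    """Expected number of live rays per generation if the roulette is applied at every bounce."""
    return [n_rays * survival ** g for g in range(generations)]
```

With a survival probability of 0.5, `expected_population(1000, 0.5, 4)` gives `[1000.0, 500.0, 250.0, 125.0]`, so high generations would be filtered and traversed with far fewer rays than in the experiments reported here.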

The time used to build our cuts is not included in the results, since the construction is considered a preprocess. It would be worth studying methods that quickly find an effective cut, so that the construction can be executed during rendering.

8 CONCLUSIONS

In this paper we have studied how to deal with the ray incoherence that naturally arises in path tracing-based systems. In order to improve the BVH traversal of a large number of incoherent rays, we split the BVH structure into a forest of disjoint subtrees, called a Cut, which is used to group the rays that are successively generated. Each subtree is then traversed by state-of-the-art algorithms: persistent while-while and persistent packet. We experimentally show that, despite the overload of filtering all the rays for each subtree, the subsequent traversal of all these subtrees is faster than traversing the whole BVH. The reason is that the rays traversing a subtree are more coherent according to the behavioral criterion.

We have presented two kinds of heuristics for building a BVH cut. The first kind is based on structural properties, such as the node's depth and the surface area of the node's bounding volume. For the second kind, the construction of the cut is formulated as an optimization problem, and the Simulated Annealing method is applied to search for the best cut. Our experiments show that using a cut results in a significant improvement w.r.t. the classic traversal of the BVH. Moreover, this improvement increases with the incoherence of the ray generation.

The saving depends on the scene and also on the traversal algorithm (persistent while-while / persistent packet). For example, for the FAIRYFOREST scene, the best savings are 10.0% / 22.7% (SINGLE / PACKET) for DEPTH, 10.0% / 40.9% for AREA, and 17.1% / 51.7% for Simulated Annealing.

ACKNOWLEDGEMENTS

This paper has been supported by the Spanish

projects CCG10-UCM/TIC-5476 and BSCH-UCM

GR58/08-921547.

REFERENCES

Aila, T. and Karras, T. (2010). Architecture considerations for tracing incoherent rays. In Proceedings of High-Performance Graphics 2010.

Aila, T. and Laine, S. (2009). Understanding the efficiency of ray traversal on GPUs. In Proceedings of High-Performance Graphics 2009, pages 145–149.

Boulos, S., Edwards, D., Lacewell, J. D., Kniss, J., Kautz, J., Wald, I., and Shirley, P. (2007). Packet-based Whitted and Distribution Ray Tracing. In Proceedings of Graphics Interface 2007, pages 177–184.

Boulos, S., Wald, I., and Benthin, C. (2008). Adaptive ray packet reordering. Symposium on Interactive Ray Tracing, 0:131–138.

Ernst, M. and Greiner, G. (2007). Early split clipping for bounding volume hierarchies. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, pages 73–78.

Foley, T. and Sugerman, J. (2005). KD-tree acceleration structures for a GPU raytracer. In HWWS’05 Conference on Graphics Hardware, pages 15–22.


Garanzha, K. and Loop, C. (2010). Fast ray sorting and breadth-first packet traversal for GPU ray tracing. In Eurographics, volume 29.

Goldsmith, J. and Salmon, J. (1987). Automatic creation of object hierarchies for ray tracing. IEEE Computer Graphics and Applications, 7(5):14–20.

Gribble, C. P. and Ramani, K. (2008). Coherent ray tracing via stream filtering. In IEEE/Eurographics Symposium on Interactive Ray Tracing, pages 59–66.

Günther, J., Popov, S., Seidel, H.-P., and Slusallek, P. (2007). Realtime ray tracing on GPU with BVH-based packet traversal. In Proceedings of the Eurographics Symposium on Interactive Ray Tracing, pages 113–118.

Harris, M., Owens, J. D., Sengupta, S., Tzeng, S., Zhang, Y., Davidson, A., and Satish, N. CUDA data parallel primitives library (CUDPP). http://gpgpu.org/developer/cudpp.

Horn, D. R., Sugerman, J., Houston, M., and Hanrahan, P. (2007). Interactive KD-tree GPU raytracing. In I3D’07: Proceedings of the symposium on Interactive 3D graphics and games, pages 167–174.

Ize, T., Wald, I., and Parker, S. G. (2007). Asynchronous BVH construction for ray tracing dynamic scenes on parallel multi-core architectures. In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization, pages 101–108.

Kajiya, J. T. (1986). The rendering equation. SIGGRAPH Computer Graphics, 20(4):143–150.

Månsson, E., Munkberg, J., and Akenine-Möller, T. (2007). Deep coherent ray tracing. In RT ’07: Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, pages 79–85.

Navratil, P. A., Fussell, D. S., Lin, C., and Mark, W. R. (2007). Dynamic ray scheduling to improve ray coherence and bandwidth utilization. In RT ’07: Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, pages 95–104.

Noguera, J. M., Ureña, C., and García, R. J. (2009). A vectorized traversal algorithm for ray-tracing. In International Conference on Computer Graphics Theory and Applications (GRAPP 2009), pages 58–63.

Park, S. K. and Miller, K. W. (1988). Random number generators: Good ones are hard to find. Communications of the ACM, 31(10):1192–1201.

Pharr, M., Kolb, C., Gershbein, R., and Hanrahan, P. (1997). Rendering complex scenes with memory-coherent ray tracing. In SIGGRAPH ’97: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 101–108.

Popov, S., Günther, J., Seidel, H.-P., and Slusallek, P. (2007). Stackless KD-tree traversal for high performance GPU ray tracing. Computer Graphics Forum (Proceedings of Eurographics), 26(3):415–424.

Ramani, K., Gribble, C. P., and Davis, A. (2009). StreamRay: A stream filtering architecture for coherent ray tracing. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 325–336.

Torres, R., Martín, P. J., and Gavilanes, A. (2009). Ray casting using a roped BVH with CUDA. In Proc. Spring Conference on Computer Graphics, pages 107–114.

Wald, I., Benthin, C., Wagner, M., and Slusallek, P. (2001). Interactive rendering with coherent ray tracing. In Computer Graphics Forum (Proceedings of Eurographics’01), volume 20, pages 153–164.

Wald, I., Gribble, C. P., Boulos, S., and Kensler, A. (2007). SIMD Ray Stream Tracing - SIMD Ray Traversal with Generalized Ray Packets and On-the-fly Re-Ordering. Technical Report UUSCI-2007-012.

Wald, I. and Slusallek, P. (2001). State of the art in interactive ray tracing. In State of the Art Reports, EUROGRAPHICS 2001, pages 21–42.

Zlatuska, M. and Havran, V. (2010). Ray tracing on a GPU with CUDA – comparative study of three algorithms. In Proceedings of WSCG’2010, communication papers, pages 69–76.

Zomaya, A. and Kazman, R. (1999). Handbook of Algorithms and Theory of Computation, chapter Simulated Annealing Techniques, pages 37.1–33.19. CRC Press.

GRAPP 2011 - International Conference on Computer Graphics Theory and Applications
