Parallel Axis Split Tasks for Bounding Volume Construction with OpenMP

Gustaf Waldemarson (1, 2) and Michael Doggett (1)

(1) Department of Computer Science, Lund University, Sweden

(2) Arm, Lund, Sweden

Keywords: Ray-Tracing, Bounding Volume Hierarchy, OpenMP, Parallelization.

Abstract: Many algorithms in computer graphics make use of acceleration structures such as Bounding Volume Hierarchies (BVHs) to speed up performance-critical tasks, such as collision detection or ray-tracing. However, while the typical algorithms for constructing BVHs are relatively simple, implementing them for performance-critical systems is still challenging. Further, to construct them as quickly as possible, it is also desirable to parallelize the process. To that end, parallelization APIs such as OpenMP can be leveraged to greatly simplify this matter. However, BVH construction is not a trivially parallelizable problem. Thus, in this paper we propose a method of using OpenMP tasking to further parallelize the spatial splitting algorithm and thus improve construction performance. We evaluate the proposed method and compare it with other ways of using OpenMP, finding that several of them improve the construction time by a factor of 3 to 5 on an 8-core machine with a minimal amount of work and a negligible reduction in the quality of the final BVH.

1 INTRODUCTION

Bounding volume hierarchies are arguably one of the most important data structures currently in widespread use in the field of computer graphics, and they are often prominently used for ray-tracing during image synthesis. However, they are also used for various other tasks, such as collision detection or occlusion-based audio mixing (Fowler et al., 2014). As such, it is often important to be able to create these hierarchies with as high quality as possible, thus ensuring that when the structure is used to accelerate some task, that query operation is as fast as possible.

Typically, the recursive algorithms used to build these structures are relatively simple, but rewriting them for maximum throughput in a parallelized context can be challenging. Thus, modern versions of parallelization APIs such as OpenACC or OpenMP (Dagum and Menon, 1998) can be leveraged to trial various parallelization strategies before committing to a particular approach, or in other cases, be used directly in the original algorithm. However, there are often multiple ways to apply these APIs. As such, finding the most performant way to use them can be a beneficial endeavor.

ORCID (Gustaf Waldemarson): https://orcid.org/0000-0003-2524-0329

ORCID (Michael Doggett): https://orcid.org/0000-0002-4848-3481

2 RELATED WORK

Over the years, many variations of acceleration structures have been invented, notable examples being octrees (Meagher, 1980), kd-trees (Bentley, 1975) and bounding volume hierarchies, or BVHs. Lately, however, BVHs have become the more popular category for four main reasons: they have a predictable memory footprint, queries are robust and efficient, they easily adapt to dynamic geometry, and, most crucially, the build itself is scalable, allowing users to either quickly create a hierarchy with possibly slow queries, or to spend more time upfront to yield potentially faster ones (Meister et al., 2021).

Thus, as it is assumed that construction of an optimal BVH is an NP-hard problem (Karras, 2012), numerous heuristics and algorithms have been developed for generating these hierarchies, and depending on the target application, one of the two approaches above is typically preferred:

1. For interactive applications, such as real-time ray-tracing, fast builders running on the GPU are predominantly used, such as the LBVH (Lauterbach et al., 2009), HLBVH (Pantaleoni and Luebke, 2010), and more recently, the H-PLOC algorithm (Benthin et al., 2024).

2. For offline ray-tracing applications, such as those described by Pharr (2018), slower builders, such as the SBVH (Stich et al., 2009) or PRBVH (Meister and Bittner, 2018), may be preferable, where any improvement in the quality of the BVH is often recovered during the actual ray-tracing phase (Aila and Laine, 2009).

Waldemarson, G. and Doggett, M. Parallel Axis Split Tasks for Bounding Volume Construction with OpenMP. DOI: 10.5220/0013317100003912. Paper published under CC license (CC BY-NC-ND 4.0). In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2025) - Volume 1: GRAPP, HUCAPP and IVAPP, pages 347-354. ISBN: 978-989-758-728-3; ISSN: 2184-4321.

No matter the application, however, it is always desirable to create these structures as quickly as possible. To that end, these build processes are usually parallelized as much as possible. Fast builders typically do this by relaxing some spatial constraints to expose more parallelism, making them amenable to fast GPU implementations. In contrast, quality-focused builders typically create their hierarchies with a CPU implementation, as that often provides a bit more flexibility when analyzing the input geometry (Ganestam et al., 2015; Wald et al., 2014). However, many of these algorithms build the hierarchy from the top down, thus initially suffering from poor scaling in the first few splitting tasks. To that end, a number of parallelization schemes have been proposed to extract additional parallelism from these early splits (Wald, 2007; Wald, 2012; Fuetterling et al., 2016). These approaches typically attempt to split up the computations over all primitives, thus providing a large amount of potential parallelism, but they require a number of complex synchronization mechanisms to function. In contrast, in this paper we propose an arguably simpler approach that only parallelizes over the split axes, thus losing some opportunities for parallelization, but in turn only requiring a relatively simple synchronization method.

3 BACKGROUND

This section provides relevant background information about the SBVH algorithm (Stich et al., 2009) targeted for parallelization with OpenMP in this work.

3.1 Spatial Split BVH

While BVHs have many great qualities, they can perform poorly in scenes with many overlapping primitives, as is often the case with triangle meshes. In those cases, kd-trees (Bentley, 1975) are typically able to achieve higher ray-tracing performance. To that end, Stich et al. (2009) developed a variation of the BVH construction algorithm that drew inspiration from the spatial splits used by kd-trees, thus creating one of the highest performing triangle-based BVH algorithms in terms of query time, which is typically referred to as the SBVH algorithm. However, while the query times are fast, its major drawback is the construction time: it is typically the slowest BVH construction algorithm in widespread use. Furthermore, this algorithm is challenging to parallelize, as part of its operation depends on being able to dynamically create new references to triangles with subdivided bounding boxes.

[Figure 1 diagram labels: Task I, Task II, Task III, shown twice.]

Figure 1: Graphical visualization of the typical OpenMP usage for splitting up tasks into parallel regions.

    size_t n = 8;
    #pragma omp parallel for num_threads(n)
    for (size_t i = 0; i < height; ++i)
    {
        for (size_t j = 0; j < width; ++j)
        {
            im[i][j] = ray_trace(i, j);
        }
    }

Figure 2: A simple for-loop parallelized using an OpenMP compiler directive.

This particular aspect was improved by Ganestam and Doggett (2016), who noted that only around 10% extra references are needed in most scenes. Thus, by pre-allocating memory for these and distributing them in each split task, parallelization gets a bit easier. In this paper we further simplify this matter, showing how the SBVH algorithm can be easily parallelized with the help of OpenMP.

3.2 OpenACC and OpenMP

OpenACC and OpenMP are APIs for performing numerous types of multi-processing tasks in a convenient and portable fashion in the C, C++ and Fortran programming languages through the use of compiler directives and library routines. Both of these are managed by non-profit organizations: the OpenMP Architecture Review Board and the OpenACC organization, jointly governed by all major compiler and hardware developers with collaboration from their user communities.

As illustrated in figure 1, these APIs provide a relatively simple way of successively parallelizing portions of a program, and the simplest application of them is typically through the parallel for compiler pragma from the OpenMP API, as shown in figure 2.

OpenMP 3.0 and onwards also enables the manual creation of parallel tasks that may be submitted to a thread pool, a process typically referred to as tasking. Further, tasks may even recursively create more tasks, as seen in figure 3, thus enabling complex algorithms to be parallelized in a simple fashion (Ayguadé et al., 2009). However, some care is still needed to ensure that each task performs enough work to offset the overhead of its creation. Beyond this, OpenMP also provides directives for automatically converting the iterations of loops to tasks with the taskloop directive, and even for performing guided SIMD vectorization of loops.

GRAPP 2025 - 20th International Conference on Computer Graphics Theory and Applications

    int fibonacci(int n)
    {
        int fn1, fn2;
        if (n == 0 || n == 1)
            return n;
        #pragma omp task shared(fn1)
        fn1 = fibonacci(n - 1);
        #pragma omp task shared(fn2)
        fn2 = fibonacci(n - 2);
        #pragma omp taskwait
        return fn1 + fn2;
    }

Figure 3: A more complex parallelization example to demonstrate the use of OpenMP tasking. Note that, while illustrative, this particular example would likely not benefit much from parallelization, as the overhead of creating tasks likely outweighs the cost of the work itself.
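As a minimal sketch of the taskloop directive mentioned above, the following C++ function sums each row of a row-major image, letting OpenMP turn the outer loop iterations into tasks. The function name row_sums and its parameters are hypothetical and only serve as illustration. If the code is compiled without OpenMP support, the pragmas are ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums each row of a row-major image of
// width * height values, creating tasks from the outer loop.
std::vector<double> row_sums(const std::vector<double>& image,
                             std::size_t width, std::size_t height)
{
    assert(image.size() == width * height);
    std::vector<double> sums(height, 0.0);

    // One thread of the team generates the tasks, the others help
    // to execute them.
    #pragma omp parallel
    #pragma omp single
    {
        // taskloop partitions the iterations into tasks and waits for
        // all of them before the construct is left.
        #pragma omp taskloop
        for (std::size_t i = 0; i < height; ++i)
        {
            // Each iteration only writes to its own slot in sums.
            for (std::size_t j = 0; j < width; ++j)
                sums[i] += image[i * width + j];
        }
    }
    return sums;
}
```

For example, calling row_sums on the six values 1 to 6 with a width of 3 and a height of 2 yields the two sums 6 and 15.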

In contrast, OpenACC was originally intended for offloading tasks to discrete accelerator devices, thus providing a simple interface to program coprocessors such as GPUs. However, modern versions of this API can also parallelize on the host CPU when required. Additionally, as of OpenMP 4.0, similar device offloading capabilities are available for that API as well, although the performance of these features may be somewhat worse (Usha et al., 2020).

Still, applying device offloading correctly often requires significantly more effort to ensure that the data can be transferred correctly to the coprocessor, often forcing a major restructuring of the original algorithms. As such, these types of approaches are out of scope for this paper.

4 ALGORITHM

This section provides a high-level overview of how the SBVH algorithm recursively constructs a hierarchy from a collection of primitive references, i.e., a set of triangles with potentially subdivided bounding boxes. Our contribution for parallelizing this algorithm with OpenMP is also described here.

4.1 SBVH Splitting Tasks

In algorithm 1, each SBVH splitting task can recursively create two more work packets up to the point that it decides to create a leaf node instead, in a fashion that is very similar to the OpenMP tasking example in figure 3. This also means that the algorithm is not able to run at full capacity until enough tasks have been spawned to keep all available threads occupied.

    Fn build(references, bounds):
        if create leaf? then
            return;
        end
        obj ← object split(references, bounds);
        spt ← spatial split(references, bounds);
        if spt.cost ≤ obj.cost then
            perform spatial split;
        else
            perform object split;
        end
        (1) build(left-references, left-bounds);
        (2) build(right-references, right-bounds);
    EndFn

Algorithm 1: High-level overview of the SBVH construction algorithm. The functions object split and spatial split are described in algorithms 2 and 3 respectively. Further, lines that may be parallelized with tasks are marked with (1) and (2), as described in section 4.2.

    Fn object split(references, bounds):
        (3) foreach axis do
            sort references;
            foreach reference do
                estimate split cost;
            end
        end
        return optimal split cost and location;
    EndFn

Algorithm 2: High-level overview of the BVH object split estimation: find the appropriate splitting axis and segment the objects to the left and right side of it. Note that the axis loop on line (3) may be parallelized as described in section 4.2.

    Fn spatial split(references, bounds):
        (4) foreach axis do
            foreach reference do
                chop references into bins;
                split references on bin boundaries;
            end
            find axis splitting plane;
        end
        return optimal split axis and plane;
    EndFn

Algorithm 3: High-level overview of the BVH spatial split estimation: find the appropriate splitting axis and plane, and segment or create references to the left and right side of it. Note that the axis loop on line (4) may be parallelized as described in section 4.2.

To that end, we propose that further subdividing the splitting task itself may expose more beneficial parallelism during the early stages of the SBVH construction, e.g., by creating tasks for searching each splitting plane along each of the primary axes, as marked in algorithms 2 and 3 and visualized in figure 4. Further, this may allow more expensive splitting heuristics to be evaluated each time, as more of them can be tried in parallel. For example, more bins can be used for the binned surface area heuristic (SAH) by Wald (2007), or a different, more expensive heuristic such as the original SAH variant (Goldsmith and Salmon, 1987; MacDonald and Booth, 1990) can be used. This could theoretically improve the BVH quality, while still providing some balance between the amount of work being done and the time it takes to execute each task.
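The per-axis task creation proposed above can be sketched in C++ as follows. This is an illustration under stated assumptions rather than the implementation evaluated in this paper: the types Point and AxisSplit, the functions evaluate_axis and find_best_axis, and the simple median-based cost are all hypothetical stand-ins for the real object or spatial split search.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::array<float, 3>;  // Stand-in for a primitive centroid.

struct AxisSplit
{
    int axis;
    float cost;
};

// Hypothetical stand-in for a per-axis search: sort a private copy of
// the centroids along `axis`, split at the median, and weigh each half
// by its extent along that axis.
AxisSplit evaluate_axis(std::vector<Point> pts, int axis)
{
    assert(pts.size() >= 2);
    std::sort(pts.begin(), pts.end(),
              [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    const std::size_t mid = pts.size() / 2;
    const float left = pts[mid - 1][axis] - pts.front()[axis];
    const float right = pts.back()[axis] - pts[mid][axis];
    const float cost = left * static_cast<float>(mid)
                     + right * static_cast<float>(pts.size() - mid);
    return {axis, cost};
}

// One task per axis, synchronized with taskwait before the cheapest
// result is selected.
AxisSplit find_best_axis(const std::vector<Point>& pts)
{
    std::array<AxisSplit, 3> results{};
    #pragma omp parallel
    #pragma omp single
    {
        for (int axis = 0; axis < 3; ++axis)
        {
            // Each task writes to its own slot, so no locking is needed.
            #pragma omp task shared(results, pts) firstprivate(axis)
            results[axis] = evaluate_axis(pts, axis);
        }
        #pragma omp taskwait
    }
    return *std::min_element(results.begin(), results.end(),
                             [](const AxisSplit& a, const AxisSplit& b) { return a.cost < b.cost; });
}
```

In a full builder the parallel region would more likely be opened once around the root build call, with the axis tasks created from inside the recursive split tasks; the region is placed inside find_best_axis here only to keep the sketch self-contained.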

4.2 Variants

In order to evaluate the potential SBVH build time improvement from multithreading with various OpenMP constructs, we apply one or more source-level patches to a base implementation of the algorithm, effectively inserting the necessary compiler directives (i.e., #pragma omp ...) at the correct locations. In total, we evaluate five different variants of this approach:

NoOpenMP. Reference implementation without any OpenMP directives.

TaskingOnly. Parallel tasks are created using the #pragma omp task directive for each recursive call to build in algorithm 1, similar to the tasking example in figure 3.

Tasks. Same as TaskingOnly, but additional tasks are created for each iteration of the object and spatial axis search loops, i.e., for the marked loops in algorithms 2 and 3, and these tasks are synchronized afterwards using the #pragma omp taskwait directive.

Taskloop. Same as Tasks, but using the #pragma omp taskloop directive instead, thus avoiding the need for explicit task synchronization through the #pragma omp taskwait directive.

ParallelFor. Same as TaskingOnly, but using nested parallelism for each object and spatial axis search through the #pragma omp parallel for directive, similar to the example in figure 2.

Further, we also investigate only parallelizing the object and spatial split search tasks, i.e., we do not parallelize the recursive splits in algorithm 1. However, these variants are not expected to scale beyond three available threads, as there are only three axes to search in each task. To that end, we test the following additional variants:

NoTaskingFor. Use the #pragma omp parallel for directive to search each axis, as in the ParallelFor variant.

NoTaskingTasks. Same as NoTaskingFor, but use OpenMP task constructs as in the Tasks variant.

NoTaskingTaskloop. Same as NoTaskingTasks, but use the #pragma omp taskloop directive instead, same as for the Taskloop variant.
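The recursive task creation used by the TaskingOnly variant can be sketched as below. The sketch is an assumption-laden simplification: instead of splitting primitive references, the hypothetical build function halves an index range until at most leaf_size items remain and returns the number of leaves, which keeps the example short while preserving the task structure of algorithm 1.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the recursive build: splits the index range
// [first, last) at its midpoint until at most `leaf_size` items remain
// and returns the number of leaves that were created.
std::size_t build(std::size_t first, std::size_t last, std::size_t leaf_size)
{
    assert(leaf_size > 0 && first <= last);
    if (last - first <= leaf_size)
        return 1;  // Create a leaf.

    const std::size_t mid = first + (last - first) / 2;
    std::size_t left = 0, right = 0;

    // Corresponds to the calls marked (1) and (2) in algorithm 1.
    #pragma omp task shared(left)
    left = build(first, mid, leaf_size);
    #pragma omp task shared(right)
    right = build(mid, last, leaf_size);
    // Both children must finish before their results are read.
    #pragma omp taskwait

    return left + right;
}

// Entry point: a single thread starts the root split while the rest of
// the team executes the child tasks as they are created.
std::size_t build_parallel(std::size_t count, std::size_t leaf_size)
{
    std::size_t leaves = 0;
    #pragma omp parallel
    #pragma omp single
    leaves = build(0, count, leaf_size);
    return leaves;
}
```

The Tasks and Taskloop variants would additionally create tasks inside the split search itself, which this sketch leaves out.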

5 RESULTS

All BVH construction algorithms and their variations were built with the gcc (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) and clang (Ubuntu clang version 14.0.0-1ubuntu1.1) compilers and ran on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz.

Each scene was rendered with an OpenCL-based ray-tracer running on an NVIDIA GeForce RTX 3060 GPU using an ambient occlusion algorithm, example renders of which can be seen in figure 5.

The results depicting how these variants scale with additional threads can be found in figures 6 and 7 for gcc and clang respectively. They clearly demonstrate that OpenMP is able to provide a substantial improvement to the BVH construction time: up to 5 times faster than the single-thread result on our 8-core setup. This shows that the additional task parallelization of the object and spatial axis searches proposed in section 4.1 is able to provide some additional benefits. However, the difference between using plain tasks (Tasks and NoTaskingTasks) and using the taskloop directive (Taskloop and NoTaskingTaskloop) is insignificant. As such, performance-wise, it does not matter which of these is actually used, but the taskloop directive is usually a bit easier to read at a glance and does not require explicit synchronization, and may thus be preferred.

6 DISCUSSION

While figures 6 and 7 demonstrate a notable improvement, it is also evident that the scaling is logarithmic: each additional thread is able to increase the performance, but each subsequent gain is smaller than the last. In fact, as can be seen in figure 8, on a heavily multi-threaded machine scaling almost completely stops between 10 and 20 threads. This can be explained by viewing this type of BVH construction as a divide-and-conquer problem: at first, each additional thread greatly reduces the amount of necessary work, but eventually each task becomes too small to benefit from being processed in parallel, thus stopping the scaling.

[Figure 4 diagram labels: Object Cost, Spatial Cost, Split, Leaf?, Recurse.]

Figure 4: Graphical representation of the task creation process during the construction of an SBVH as well as the further parallelizable regions of a single BVH splitting task.

Figure 5: The scenes investigated during this work along with their mesh and triangle counts.
Sponza: 393 meshes, 262 267 triangles.
Hairball: 2 meshes, 2 880 002 triangles.
Bistro: 1 mesh, 3 847 246 triangles.
San-Miguel: 287 meshes, 9 980 699 triangles.
Buddha: 1 mesh, 1 087 720 triangles.
Powerplant: 21 meshes, 12 759 246 triangles.

Figure 6: Visualization of how each variant scales with more available threads in the gcc implementation.

Figure 7: Visualization of how each variant scales with more available threads in the clang implementation.

Figure 8: Scaling results for the Powerplant scene on a heavily multithreaded machine (Intel(R) Xeon(R) w7-3465X).

Furthermore, depending on how the SBVH algorithm is implemented, some synchronization or critical regions may be necessary. As an example, in our implementation, one such region is used to allocate indices for each of the BVH nodes. Thus, one side effect of the parallelization is that the ordering of the nodes is no longer guaranteed to be deterministic. This is particularly noticeable for the Tasks and Taskloop variants, which may interleave axis searches between the main build tasks. While subtle, this effect is evident in figure 9, where it manifests as a minor increase in the average and variation of the rendering time due to cache misses from the increased memory fragmentation of the BVH nodes.
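A critical region for node index allocation, as described above, could look roughly like the following sketch. The names allocate_node, claim_indices and node_allocation are hypothetical, and a parallel for loop stands in for the build tasks that would request node indices in a real builder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical allocator: returns the next free node index from a
// counter that is shared between all threads.
std::size_t allocate_node(std::size_t& counter)
{
    std::size_t index;
    // Only one thread at a time may read and advance the counter.
    #pragma omp critical(node_allocation)
    index = counter++;
    return index;
}

// Every iteration claims exactly one node index. Which iteration gets
// which index depends on thread scheduling and may change between runs.
std::vector<int> claim_indices(std::size_t count)
{
    std::size_t counter = 0;
    std::vector<int> claims(count, 0);
    #pragma omp parallel for
    for (std::size_t i = 0; i < count; ++i)
    {
        const std::size_t index = allocate_node(counter);
        assert(index < count);
        claims[index] += 1;  // Each index is handed out only once.
    }
    return claims;
}
```

Every index is handed out exactly once, but the mapping from work item to index is not fixed, which illustrates why the node ordering is no longer deterministic once several threads allocate concurrently.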

Further, it appears that there is a moderate gain from only parallelizing the object and spatial split axis searches, with a minor lead for the NoTaskingFor variant, which is likely explained by the parallel for directive being more mature and having less overhead than a task queue implementation. Thus, this method may be beneficial if there is a strict requirement on a deterministic BVH. As expected, none of these approaches scale beyond three threads, and they should in fact be limited to that number, as additional threads cause a significant performance overhead.

Figure 9: Rendering time when using the BVH constructed by each of the OpenMP constructs, using an OpenCL-based ray-tracer.

Additionally, in figure 7, we can see that nested parallel regions, i.e., tasks and parallel for used inside one another as in the ParallelFor variant, work well with the clang compiler. In contrast, the gcc builds in figure 6 regress dramatically for this variant, to several times the baseline. This is most likely a consequence of the thread cache not being used for nested parallel regions (see GCC bug report 108494). Thus, given that the performance is in line with the Tasks and Taskloop variants, it may be prudent to avoid nested regions unless the targeted compiler is known beforehand.

Finally, note that using OpenMP is not strictly beneficial: if the code is serialized, i.e., only a single thread is being used, effectively all variants incur a small but noteworthy penalty to the construction times.

7 FUTURE WORK

[Figure 10 diagram labels: Object Cost, Spatial Cost, Split, Leaves?, Recurse.]

Figure 10: A potentially improved SBVH construction task: the object and spatial cost evaluations are theoretically independent and may thus run in parallel. Further, determining whether a child node would become a leaf before recursing can greatly reduce the number of necessary tasks.

Implementation-wise, there are a number of things that could be improved. Currently, the additional tasks for the axis searches are beneficial, particularly when each split contains a lot of primitives. Smaller tasks are typically less useful, but OpenMP also has support for conditionally merging or spawning tasks. Thus, finding an appropriate metric for tuning the task creation process may be a beneficial endeavor. Moreover, as seen in figure 10, it should be possible to restructure the splitting task itself to expose more opportunities for parallelism and thus improve the performance even further. Additionally, this work only considered binary BVHs, but research is currently being done on hierarchies with higher branching factors. While building such structures is more complicated, the additional branching may provide even more opportunities to parallelize the construction.
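One possible way of conditionally spawning tasks is the final clause of the task directive: when its expression is true, the generated task becomes a final task, and any tasks created inside it are executed immediately by the same thread instead of being queued. The sketch below assumes a simple size-based metric; the cutoff value kTaskCutoff and the functions sum_range and sum_parallel are hypothetical, and a recursive range sum stands in for the splitting task. Related clauses such as if and mergeable exist as well, and this paper does not state which of them would suit the builder best.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff: ranges with fewer items than this no longer
// create deferred tasks. The value is a placeholder, not a tuned result.
constexpr std::size_t kTaskCutoff = 1024;

// Sums data[first, last) by recursive halving, as a stand-in for a
// recursive split over a range of primitives.
long long sum_range(const int* data, std::size_t first, std::size_t last)
{
    assert(first <= last);
    const std::size_t count = last - first;
    if (count <= 1)
        return count == 1 ? data[first] : 0;

    const std::size_t mid = first + count / 2;
    long long left = 0, right = 0;

    // Below the cutoff the tasks are final: everything created inside
    // them runs immediately on the encountering thread.
    #pragma omp task shared(left) final(count < kTaskCutoff)
    left = sum_range(data, first, mid);
    #pragma omp task shared(right) final(count < kTaskCutoff)
    right = sum_range(data, mid, last);
    #pragma omp taskwait

    return left + right;
}

long long sum_parallel(const std::vector<int>& values)
{
    long long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = sum_range(values.data(), 0, values.size());
    return total;
}
```

For a builder, the primitive count of the current split would be a natural candidate for the metric, although a suitable threshold would have to be determined by measurement.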

As noted in section 3.2, modern implementations of OpenMP and OpenACC have support for offloading computations to coprocessors using, e.g., the target directive. This was not investigated as a part of this work due to the additional complexity of mapping the input data structures to the devices. Furthermore, as can be seen in figure 8, this particular algorithm does not appear to scale beyond 30 or 40 threads; it is thus questionable whether it would benefit from a massively parallel architecture.

GCC bug report 108494: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108494

8 CONCLUSIONS

OpenMP is a very convenient way to drastically reduce the amount of code required to implement many complex algorithms, such as the construction of bounding volume hierarchies (BVHs). In this paper we have both devised a new way of further parallelizing the splitting tasks of the Spatial Split BVH algorithm and tested numerous ways of applying OpenMP to it. The results show that OpenMP tasking can be effectively leveraged to keep the construction algorithms simple while still improving the build time by up to 5 times on a modern 8-core consumer workstation.

ACKNOWLEDGEMENTS

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. We would also like to thank Arm Sweden AB for letting Gustaf pursue a PhD as one of their employees. Additionally, we would like to thank Rikard Olajos and Simone Pellegrini for their valuable input during this work.

Finally, we would also like to thank the authors of the models used in this work: San-Miguel (Guillermo M. Leal Llaguno), Bistro (Amazon Lumberyard), Powerplant (University of North Carolina), Hairball (NVIDIA Research), Sponza (Crytek), and Buddha (Stanford).

REFERENCES

Aila, T. and Laine, S. (2009). Understanding the efficiency of ray traversal on GPUs. In Proceedings of the Conference on High Performance Graphics 2009, HPG ’09, pages 145–149, New York, NY, USA. Association for Computing Machinery.

Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X., Unnikrishnan, P., and Zhang, G. (2009). The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst., 20(3):404–418.

Benthin, C., Meister, D., Barczak, J., Mehalwal, R., Tsakok, J., and Kensler, A. (2024). H-PLOC: Hierarchical parallel locally-ordered clustering for bounding volume hierarchy construction. Proc. ACM Comput. Graph. Interact. Tech., 7(3).

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517.

Dagum, L. and Menon, R. (1998). OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55.

Fowler, C., Doyle, M. J., and Manzke, M. (2014). Adaptive BVH: an evaluation of an efficient shared data structure for interactive simulation. In Proceedings of the 30th Spring Conference on Computer Graphics, SCCG ’14, pages 37–45, New York, NY, USA. Association for Computing Machinery.

Fuetterling, V., Lojewski, C., Pfreundt, F.-J., and Ebert, A. (2016). Parallel spatial splits in bounding volume hierarchies. In Proceedings of the 16th Eurographics Symposium on Parallel Graphics and Visualization, EGPGV ’16, pages 21–30, Goslar, DEU. Eurographics Association.

Ganestam, P., Barringer, R., Doggett, M., and Akenine-Möller, T. (2015). Bonsai: Rapid bounding volume hierarchy generation using mini trees. Journal of Computer Graphics Techniques (JCGT), 4(3):23–42.

Ganestam, P. and Doggett, M. (2016). SAH guided spatial split partitioning for fast BVH construction. Computer Graphics Forum, 35(2):285–293.

Goldsmith, J. and Salmon, J. (1987). Automatic creation of object hierarchies for ray tracing. IEEE Computer Graphics and Applications, 7(5):14–20.

Karras, T. (2012). Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In Proceedings of the Fourth ACM SIGGRAPH / Eurographics Conference on High-Performance Graphics, EGGH-HPG’12, pages 33–37, Goslar, DEU. Eurographics Association.

Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., and Manocha, D. (2009). Fast BVH construction on GPUs. Computer Graphics Forum, 28(2):375–384.

MacDonald, J. D. and Booth, K. S. (1990). Heuristics for ray tracing using space subdivision. The Visual Computer, 6:153–166.

Meagher, D. (1980). Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3-D objects by computer. Technical report, IPL: Image Processing Laboratory, Electrical and Systems Engineering Department, Rensselaer Polytechnic Institute, Troy, New York 12181.

Meister, D. and Bittner, J. (2018). Parallel reinsertion for bounding volume hierarchy optimization. Computer Graphics Forum, 37(2):463–473.

Meister, D., Ogaki, S., Benthin, C., Doyle, M. J., Guthe, M., and Bittner, J. (2021). A survey on bounding volume hierarchies for ray tracing. Computer Graphics Forum.

Pantaleoni, J. and Luebke, D. (2010). HLBVH: hierarchical LBVH construction for real-time ray tracing of dynamic geometry. In Proceedings of the Conference on High Performance Graphics, HPG ’10, pages 87–95, Goslar, DEU. Eurographics Association.

Pharr, M. (2018). Guest editor’s introduction: Special issue on production rendering. ACM Trans. Graph., 37(3).

Stich, M., Friedrich, H., and Dietrich, A. (2009). Spatial splits in bounding volume hierarchies. In Proceedings of the Conference on High Performance Graphics 2009, HPG ’09, pages 7–13, New York, NY, USA. Association for Computing Machinery.

Usha, R., Pandey, P., and Mangala, N. (2020). A comprehensive comparison and analysis of OpenACC and OpenMP 4.5 for NVIDIA GPUs. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6.

Wald, I. (2007). On fast construction of SAH-based bounding volume hierarchies. In 2007 IEEE Symposium on Interactive Ray Tracing, pages 33–40.

Wald, I. (2012). Fast construction of SAH BVHs on the Intel many integrated core (MIC) architecture. IEEE Transactions on Visualization and Computer Graphics, 18(1):47–57.

Wald, I., Woop, S., Benthin, C., Johnson, G. S., and Ernst, M. (2014). Embree: a kernel framework for efficient CPU ray tracing. ACM Trans. Graph., 33(4).
