Figure 1: CUDA matrix-based implementation of the inclusion test.
The algorithm constructs a set of tetrahedra between the origin of coordinates and each triangular face of the mesh. The point is tested for inclusion against each tetrahedron, and a counter is incremented whenever the test is positive. If the point lies inside an odd number of tetrahedra, the point is inside the mesh. Notice that if the point lies on a face shared by two tetrahedra, each of them adds 0.5 to the counter, avoiding the double increment that would lead to incorrect results.
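To make the counting rule concrete, the per-tetrahedron contribution can be sketched in C++ with signed volumes; the function names (`signedVolume`, `inclusionContribution`) and the tolerance are our own illustration, not the paper's code:

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Signed volume of the tetrahedron (a, b, c, d): det[b-a, c-a, d-a] / 6.
double signedVolume(const Vec3& a, const Vec3& b, const Vec3& c, const Vec3& d) {
    Vec3 u{b[0]-a[0], b[1]-a[1], b[2]-a[2]};
    Vec3 v{c[0]-a[0], c[1]-a[1], c[2]-a[2]};
    Vec3 w{d[0]-a[0], d[1]-a[1], d[2]-a[2]};
    return (u[0]*(v[1]*w[2] - v[2]*w[1])
          - u[1]*(v[0]*w[2] - v[2]*w[0])
          + u[2]*(v[0]*w[1] - v[1]*w[0])) / 6.0;
}

// Contribution of one tetrahedron to the counter of point p:
// 1 if p is strictly inside, 0.5 if p lies on a boundary face (so the
// two tetrahedra sharing it add 1 in total), 0 if p is outside.
double inclusionContribution(const Vec3& a, const Vec3& b, const Vec3& c,
                             const Vec3& d, const Vec3& p) {
    double v0 = signedVolume(a, b, c, d);
    double s[4] = { signedVolume(p, b, c, d), signedVolume(a, p, c, d),
                    signedVolume(a, b, p, d), signedVolume(a, b, c, p) };
    bool onFace = false;
    for (double si : s) {
        if (si * v0 < 0) return 0.0;              // p on the outer side of one face
        if (std::fabs(si) < 1e-12) onFace = true; // p on a boundary face
    }
    return onFace ? 0.5 : 1.0;
}
```

The point is inside the tetrahedron when the four sub-volumes agree in sign with the total volume, which makes the test independent of the triangle winding.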
The programming model of CUDA fits especially well with problems whose solution can be expressed in matrix form. In our case, we could construct a matrix in which the rows are the tetrahedra to process and the columns are the points to test. This matrix is divided into blocks of threads, and each thread is made responsible for testing the point in column j against the tetrahedron in row i, adding the result of the test (0, 0.5 or 1) to counter j (see Figure 1). Unfortunately, this approach has an important drawback: the support for atomic memory accesses, which ensure a correct result when several concurrent threads perform read-write operations on the same value in global memory, is only available in devices of compute capability 1.1, that is, the GeForce 8500 and 8600 series (NVIDIA, 2007). This problem can be avoided if each thread stores the result of the point-in-tetrahedron inclusion test in position (i, j) of a result matrix. After the computation, the matrix is copied back to CPU memory, and each inclusion counter j is simply calculated as the sum of the values in column j. But this implementation makes inefficient use of the device memory, requiring the allocation of a huge matrix when working with large meshes and many points to test. Moreover, both approaches generate an overwhelming number of threads during the GPU execution, which leads to poor performance.
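The atomic-free workaround can be sketched serially (our illustration, with a small hard-coded result matrix standing in for the per-thread test outputs): each thread writes its contribution to its own cell, and the host reduces each point's column after the copy back:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// results[i][j]: contribution (0, 0.5 or 1) of tetrahedron i for point j,
// as each thread would write it to its own cell of the device matrix.
std::vector<double> reduceCounters(const std::vector<std::vector<double>>& results) {
    std::size_t numPoints = results.empty() ? 0 : results[0].size();
    std::vector<double> counters(numPoints, 0.0);
    for (const auto& row : results)              // one row per tetrahedron
        for (std::size_t j = 0; j < numPoints; ++j)
            counters[j] += row[j];               // column-wise sum = counter of point j
    return counters;
}

// Parity test on the final counter: odd means the point is inside the mesh.
bool isInside(double counter) {
    return std::llround(counter) % 2 == 1;
}
```

Because no two threads ever write to the same cell, no atomic operations are needed; the cost is the large intermediate matrix discussed above.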
We choose a different strategy, computing in each thread the inclusion test of one or several points against the entire mesh. Each thread iterates over the mesh, copying a triangle from global memory to a local variable and performing the inclusion test on the points; it then accumulates the result in a vector that stores an inclusion counter per point (Figure 2).

Figure 2: CUDA implementation of the inclusion test.

It could be argued that the task assigned to each thread is very heavy, especially when compared with the matrix-based implementations, but in practice it works very well. Two implementation aspects require special attention.
First, the accesses from the threads to the triangle list must be interleaved to avoid conflicts that could penalize performance. Second, the relatively high cost of retrieving a triangle from global memory to a processor register makes it worthwhile to test it against several points. These points must also be cached in processor registers for maximum performance, and therefore their number is limited.
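A serial C++ emulation of this strategy (our own sketch, not the paper's kernel; the mesh, constants, and helper names are illustrative) shows the structure: each simulated thread caches its THREADPOINTS points, then makes a single pass over the triangle list, testing the origin tetrahedron of every triangle against all cached points:

```cpp
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;
using Tri  = std::array<Vec3, 3>;

constexpr int BLOCKSIZE    = 2;  // threads per block (illustrative value)
constexpr int THREADPOINTS = 2;  // points cached and processed per thread

// Signed volume of the tetrahedron (a, b, c, d).
double signedVolume(const Vec3& a, const Vec3& b, const Vec3& c, const Vec3& d) {
    Vec3 u{b[0]-a[0], b[1]-a[1], b[2]-a[2]};
    Vec3 v{c[0]-a[0], c[1]-a[1], c[2]-a[2]};
    Vec3 w{d[0]-a[0], d[1]-a[1], d[2]-a[2]};
    return (u[0]*(v[1]*w[2] - v[2]*w[1])
          - u[1]*(v[0]*w[2] - v[2]*w[0])
          + u[2]*(v[0]*w[1] - v[1]*w[0])) / 6.0;
}

// Contribution of the tetrahedron (origin, t) for point p:
// 1 inside, 0.5 on a boundary face, 0 outside.
double contribution(const Tri& t, const Vec3& p) {
    const Vec3 o{0, 0, 0};
    double v0 = signedVolume(o, t[0], t[1], t[2]);
    double s[4] = { signedVolume(p, t[0], t[1], t[2]), signedVolume(o, p, t[1], t[2]),
                    signedVolume(o, t[0], p, t[2]),    signedVolume(o, t[0], t[1], p) };
    bool onFace = false;
    for (double si : s) {
        if (si * v0 < 0) return 0.0;
        if (std::fabs(si) < 1e-12) onFace = true;
    }
    return onFace ? 0.5 : 1.0;
}

// Serial stand-in for one CUDA thread: cache THREADPOINTS points in
// "registers", then iterate once over the whole triangle list.
void threadBody(int blockIdx, int threadIdx, const std::vector<Tri>& mesh,
                const std::vector<Vec3>& points, std::vector<double>& counters) {
    int start = (blockIdx * BLOCKSIZE + threadIdx) * THREADPOINTS;
    Vec3 cached[THREADPOINTS];
    for (int k = 0; k < THREADPOINTS; ++k) cached[k] = points[start + k];
    for (const Tri& t : mesh)                       // one pass over the mesh
        for (int k = 0; k < THREADPOINTS; ++k)
            counters[start + k] += contribution(t, cached[k]);
}
```

In the CUDA version each call to threadBody is a real thread, the cached array lives in registers, and the counters reside in a global memory buffer.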
The host part of the CUDA computation starts by allocating a buffer of triangles and a buffer of points to test in the device memory, and copying these from the data structures in host memory. Another buffer is allocated to store the inclusion counters, which are initialized to 0. The number of blocks of threads is estimated as a function of the total number of points to test, the number of threads per block and the number of points processed by a single thread: numBlocks = numPoints/(BLOCKSIZE ∗ THREADPOINTS). The last two constants should be chosen with care to maximize performance: a high number of threads per block limits the number of registers available to each thread and, therefore, the number of points that can be cached and processed, while a low number of threads per block makes poor use of the multiprocessors. Finally, after the GPU computation has completed, the buffer of inclusion counters is copied back to the host memory.
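The grid-size computation can be sketched as follows; the constant values are illustrative, and we round up where the text shows a plain division, on the assumption that numPoints need not be an exact multiple of the points covered by one block:

```cpp
constexpr int BLOCKSIZE    = 64;  // threads per block (illustrative value)
constexpr int THREADPOINTS = 4;   // points processed by a single thread

// Points covered by one block of threads.
constexpr int POINTS_PER_BLOCK = BLOCKSIZE * THREADPOINTS;

// numBlocks = numPoints / (BLOCKSIZE * THREADPOINTS), rounded up so a
// trailing remainder of points still receives a (partially idle) block.
int numBlocks(int numPoints) {
    return (numPoints + POINTS_PER_BLOCK - 1) / POINTS_PER_BLOCK;
}
```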
A thread begins by copying THREADPOINTS points from the point buffer in global memory to a local array, which is stored in registers by the CUDA compiler. The copy starts at position (blockIdx.x ∗ BLOCKSIZE + threadIdx.x) ∗ THREADPOINTS to assign a different set of points to each thread. After this, the iteration over the triangle list starts, copying each triangle to a local variable and calling a point-in-tetrahedron inclusion test function. In case of success, the inclusion counter of the corresponding point is updated. A good interleaving is en-
GEOMETRIC ALGORITHMS ON CUDA