SYNTHESIS OF BÉZIER SURFACES ON THE GPU

R. Concheiro, M. Amor

Universidade da Coruña, Spain

M. Bóo

Universidade de Santiago de Compostela, Spain

Keywords:

Bézier surfaces, GPU, Adaptive tessellation, Level of detail (LOD).

Abstract:

Bézier surfaces are one of the most useful primitives employed for high quality modeling in CAD/CAM tools and graphics software. Traditionally, Bézier representations are tessellated on the CPU (Central Processing Unit) and the set of generated triangles is sent to the GPU (Graphics Processing Unit). The CPU-GPU bus can become a bottleneck in this approach due to the large number of triangles generated for high quality models. In this paper we present two proposals for synthesizing Bézier models directly on the GPU. With this strategy the compact representation associated with the Bézier models is sent to the GPU, where the rendering is performed. The first proposal exploits the vertex shader to perform the tessellation. In this case a parametric map guides the computation of the final coordinates of a set of virtual vertices. Our second proposal exploits the geometry shader capabilities to perform the tessellation directly. Tests show that both proposals produce high quality images and promising results for real time rendering of complex parametric models.

1 INTRODUCTION

NURBS (Non-Uniform Rational B-Splines) surfaces (Piegl and Tiller, 1997) have been widely employed in CAD/CAM tools and graphics applications due to their capabilities for modeling complex geometries. Together with the high quality associated with NURBS models, another advantage of NURBS representations is the compactness of the description and, as a consequence, the low storage and transmission requirements.

Current GPUs (Graphics Processing Units) are triangle oriented and not designed for the direct rendering of parametric representations. Therefore, these representations are usually tessellated into triangles on the CPU (Central Processing Unit) before being sent to the GPU to be displayed. This strategy presents some problems that diminish system performance, for example, the amount of information to be sent from CPU to GPU or the increase in the storage requirements associated with the triangle mesh.

To synthesize parametric surfaces on the GPU, the tessellation of the models is performed directly on the GPU (Guthe et al., 2005; Dyken et al., 2009). In these proposals the rendering process is performed per patch (Guthe et al., 2005) or per set of patches according to the required level of detail (Dyken et al., 2009). In these applications the computational cost increases with the number of patches due to the number of synchronous calls between CPU and GPU. Another tessellation approach is presented in (Eisenacher et al., 2009; Schwarz and Stamminger, 2009), where the tessellation of bicubic Bézier surfaces is performed following a GPGPU strategy (General-Purpose Computation on GPU). Although the strategy is interesting, the programming platform is in fact inadequate for advanced rendering systems.

In this work we focus on the tessellation of Bézier surfaces on the GPU. Bézier surfaces are a particular case of NURBS surfaces (Piegl and Tiller, 1997). Bézier representations are widely used because of their lower complexity. Additionally, the tessellation of NURBS models is usually performed through their previous conversion to Bézier representations.

Concheiro R., Amor M. and Bóo M. (2010). SYNTHESIS OF BÉZIER SURFACES ON THE GPU. In Proceedings of the International Conference on Computer Graphics Theory and Applications, pages 110-115. DOI: 10.5220/0002847201100115. SciTePress.

In this paper we present two approaches for Bézier surface tessellation on the GPU. Our first proposal uses a parametric map of virtual vertices (Guthe et al., 2005) together with an efficient exploitation of the information stored on the GPU. Specifically, we propose an adaptive technique that optimizes the GPU memory usage to increase the exploitation of data locality. This strategy minimizes the number of draw calls and the CPU-GPU communications. The second proposal uses the geometry shader to generate geometry on the GPU. This technique avoids the precomputation and storage of predefined grids in local memory, as the tessellation can be executed on-the-fly. Both proposals have been tested on different GPU platforms, and good results in terms of quality and timing requirements have been obtained for both. As a result of our analysis we conclude that an adequate exploitation of the GPU capabilities comes close to permitting real time rendering of parametric models, even for very complex scenes.

This paper is organized as follows. Section 2 briefly reviews the tessellation options on current GPUs. Section 3 presents our first proposal, based on the efficient storage and exploitation of the information with the vertex shader. Section 4 develops the second proposal, based on the geometry shader. Section 5 evaluates the proposals and, finally, Section 6 highlights the main conclusions.

2 TESSELLATION OPTIONS IN CURRENT GRAPHICS CARDS

In this section, we briefly summarize the structure of current GPUs and the available hardware options for tessellation. The structure of the GPU according to Direct3D 10 (Blythe, 2006) consists of fixed-function stages (Input Assembler, Rasterizer and Output Merger) and three programmable stages (Vertex Shader, Geometry Shader and Pixel Shader) whose behavior is defined by shader code. With the tessellation procedure in mind, we focus our analysis on the programmable stages and the possibilities of implementing a tessellation procedure on them.

The programmable vertex and pixel shaders cannot be employed for generating or destroying geometry directly, and they have no access to the information associated with neighboring primitives.

The geometry shader works with primitives (point, line segment, or triangle), and the number of output primitives can be higher or lower than the number of input primitives. Adjacency information is available, so that for each triangle the information of the three neighboring triangles can be accessed. However, the main drawback is the limit on the number of output primitives per invocation, as currently only 1024 32-bit values can be output. The intermediate results processed by the vertex shader or the geometry shader can either be sent back to the pipeline through stream out, allowing iterative processing, or be sent directly to the rasterization stage.

Recently, DirectX 11 (Ni and Castaño, 2009) introduced a new tessellator unit that permits tessellation on the GPU. However, this unit generates a fixed and regular pattern, so today the geometry shader is the only option for the direct implementation of a free tessellation algorithm.

Taking these options into account, our proposals exploit two different alternatives for the tessellation. The first one is based on the vertex shader (VST, Vertex Shader Tessellation). In this case, since the vertex shader cannot generate geometry, techniques based on virtual vertices (Boubekeur and Schlick, 2005) are the key to a multiresolution application. The second proposal is based on the geometry shader (GST, Geometry Shader Tessellation). In this case, the tessellation of the surface is performed in the geometry shader. The resolution level can be selected on-the-fly and the generated geometry can be fed back to the standard pipeline through the stream out unit. The next sections describe both proposals in detail.

3 VERTEX SHADER TESSELLATION (VST)

In this section we describe our proposal for Bézier surface tessellation using the vertex shader. Our method is based on the storage and efficient exploitation of the information on the GPU. Specifically, our proposal uses a regular grid of parametric coordinates as the basis for the computation. The scheduling employed makes efficient use of the information stored on the GPU, reducing the transmission requirements between CPU and GPU.

The representation of a Bézier surface $Q(u,v)$, $0 \le u,v \le 1$, is based on two parametric values defined in the normalized interval [0, 1]. In our proposal the tessellation is performed on the GPU, which implies evaluating the surface equation $Q(u,v)$ for different parametric values $(u, v)$. The resulting points are vertices that are connected to build the triangles of the final mesh. For reasons of clarity we work with a simple algorithm that performs a uniform subdivision of the parametric space in the two dimensions. Specifically, for a tessellation level $l$, $2^{l+1}$ parametric values are considered in each dimension. The grid of parametric values $P_l$ to be evaluated is:

$$
P_l = \begin{bmatrix}
(u_1, v_1) & \cdots & (u_{2^{l+1}}, v_1) \\
(u_1, v_2) & \cdots & (u_{2^{l+1}}, v_2) \\
\vdots & & \vdots \\
(u_1, v_{2^{l+1}}) & \cdots & (u_{2^{l+1}}, v_{2^{l+1}})
\end{bmatrix} \tag{1}
$$

where

$$
u_i = v_i = \frac{i - 1}{2^{l+1} - 1}, \quad i \in \{1, \cdots, 2^{l+1}\}
$$
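The grid of Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `parametric_values` and `parametric_grid` are hypothetical, exact fractions are used instead of GPU floats, and the row/column layout (u along a row, v down the rows) is one possible reading of the equation.

```python
from fractions import Fraction

def parametric_values(level):
    """Return the 2**(level+1) uniformly spaced parameters in [0, 1]."""
    count = 2 ** (level + 1)
    # Zero-based i here corresponds to (i - 1) in the one-based formula.
    return [Fraction(i, count - 1) for i in range(count)]

def parametric_grid(level):
    """Return the grid as rows of (u, v) pairs: u varies along a row, v down the rows."""
    values = parametric_values(level)
    return [[(u, v) for u in values] for v in values]

grid = parametric_grid(1)
print(len(grid), len(grid[0]))          # 4 4
print([str(u) for u, _ in grid[0]])     # ['0', '1/3', '2/3', '1']
```

For level 1 this yields a 4 × 4 grid whose parameters are 0, 1/3, 2/3 and 1 in each direction.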

For a resolution level $l$, the grid of parametric values to be evaluated, $P_l$, is made up of $2^{l+1} \times 2^{l+1}$ samples. The resolution level applied to each Bézier surface is selected by the application taking into account different factors (screen space error, model complexity, ...). Accordingly, a system of $L$ grids of parametric values for the different resolution levels, $\{P_1, \cdots, P_L\}$, can be computed a priori, $L$ being the highest resolution level. These grids are computed once and stored on the GPU, where they are selected and employed for the different surfaces of the model.
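As an illustration of what evaluating Q(u,v) on a grid of parametric values involves, the following Python sketch evaluates an (n, n)-degree patch in the Bernstein basis and connects the resulting vertices into triangles. It is a CPU-side illustration under stated assumptions, not the shader code of this work: the function names, the nested-list layout of the control points and the way each grid cell is split into two triangles are all hypothetical choices.

```python
from math import comb

def bernstein(n, i, t):
    """Bernstein polynomial B_{i,n}(t)."""
    return comb(n, i) * (t ** i) * ((1 - t) ** (n - i))

def evaluate_patch(control, u, v):
    """Evaluate an (n, n)-degree Bezier patch; control[i][j] is an (x, y, z) tuple."""
    n = len(control) - 1
    point = [0.0, 0.0, 0.0]
    for i in range(n + 1):
        bu = bernstein(n, i, u)
        for j in range(n + 1):
            weight = bu * bernstein(n, j, v)
            for k in range(3):
                point[k] += weight * control[i][j][k]
    return tuple(point)

def tessellate(control, level):
    """Evaluate the patch on a uniform grid and build two triangles per grid cell."""
    count = 2 ** (level + 1)
    params = [i / (count - 1) for i in range(count)]
    vertices = [evaluate_patch(control, u, v) for v in params for u in params]
    triangles = []
    for row in range(count - 1):
        for col in range(count - 1):
            a = row * count + col      # lower-left corner of the cell
            b = a + 1                  # lower-right
            d = a + count              # upper-left
            e = d + 1                  # upper-right
            triangles.append((a, b, e))
            triangles.append((a, e, d))
    return vertices, triangles

# Hypothetical flat bicubic patch covering the unit square in the xy-plane.
flat = [[(i / 3, j / 3, 0.0) for j in range(4)] for i in range(4)]
vertices, triangles = tessellate(flat, 1)
print(len(vertices), len(triangles))   # 16 18
```

With this layout a level-l patch has (2^(l+1))^2 vertices and 2(2^(l+1) - 1)^2 triangles, for example 16 vertices and 18 triangles at level 1.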

However, the utilization of a single system of grids limits the speed of the application. If a unique system of grids stored in memory is accessed by all surfaces in the scene, a sequential procedure is forced. This means that for each frame there are as many Draw Primitive calls as surfaces, $N_S$, so the performance decreases due to the number of calls. Therefore, the number of synchronizations per frame, $N_{sync}$, is $N_{sync} = N_S$. As only one surface is computed per Draw Primitive call, GPU parallelism is not exploited. Additionally, a large number of synchronous calls decreases the performance because a Draw Primitive call is a slow operation. Therefore, in our proposal we use several copies of the system of grids of parametric values to process more surfaces per draw call. That is, several copies of $\{P_1, \cdots, P_L\}$ are used. The utilization of different copies of the grid system permits the simultaneous evaluation of several models, with the consequent increase in processing speed. To determine the number of surfaces that can be processed per Draw Primitive call, the storage requirements of the application have to be evaluated.

VST performs $N_{DP}$ draw calls, processing and rendering $N_{SD}$ surfaces per call: $N_{SD} = N_S / N_{DP}$, with $1 \le N_{DP} \le N_S$. Then, the required amount of memory is

$$
M = \sum_{l=1}^{L} M_{P_l} \times N_{SD} + M_{[B]} \times N_S
$$

where $M_{P_l}$ is the memory requirement for the grid of resolution level $l$, $P_l$, and $N_S$ is the number of surfaces in the scene. $M_{[B]}$ is the amount of memory used for the control points of each surface. For an $(n, n)$-degree surface this amount is $M_{[B]} = 3 \times (n + 1) \times (n + 1)$.
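The memory estimate above can be sketched as follows. This is an illustration under assumptions that the text does not state: sizes are counted in scalar values rather than bytes, each grid sample is assumed to store two scalars (u and v), the number of surfaces per call is rounded up when the division is not exact, and all function and parameter names are hypothetical.

```python
def grid_values(level, scalars_per_sample=2):
    """Scalars stored for one grid of a given level (assumed: u and v per sample)."""
    side = 2 ** (level + 1)
    return side * side * scalars_per_sample

def control_point_values(degree):
    """Scalars for the control points of one (n, n)-degree surface: 3 * (n+1) * (n+1)."""
    return 3 * (degree + 1) * (degree + 1)

def vst_memory(max_level, surfaces, draw_calls, degree=3):
    """Total scalars: one copy of the grid system per surface in a call, plus all control points."""
    surfaces_per_call = -(-surfaces // draw_calls)   # ceiling division (assumption)
    grid_system = sum(grid_values(level) for level in range(1, max_level + 1))
    return grid_system * surfaces_per_call + control_point_values(degree) * surfaces

print(control_point_values(3))     # 48
print(vst_memory(4, 520, 4))       # 378560
```

The example shows the trade-off discussed in the text: fewer draw calls mean more surfaces per call and therefore more copies of the grid system in memory.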

The desirable framework is storing all data in GPU memory and performing a single draw call. But when the storage requirements exceed a given capacity, the performance decreases. Taking this into account, we have developed an adaptive technique in which, to obtain an optimum application in terms of speed, the following transmission and storage requirements have to be verified:

1. The data transfer between CPU and GPU has to be minimized. In our proposal the required information (parametric grids and control points of the surfaces) is sent once to the GPU. The information is efficiently stored and reused for optimum performance.

2. The storage requirements associated with the grids of parametric values should not exceed the global memory capabilities. Specifically, in our application the grids are stored in a vertex buffer, and exceeding the recommended capacity would limit other uses of the memory and could affect resource swapping. As a result, the following condition has to be verified: $\sum_{l=1}^{L} M_{P_l} \times N_{SD} < per \cdot M_{GPU}$, $M_{GPU}$ being the GPU global memory size, $N_{SD}$ the number of surfaces processed per draw call and $per$ a percentage value that depends on each GPU.

3. In our application, and due to the global memory latency, the control points $[B]$ of the surfaces are stored in texture memory. This memory is cached, so if there is a cache miss the information is obtained from global memory with a delay. Therefore, the storage of the control points associated with the Bézier surfaces to be processed per draw call should not exceed the capacity of the texture memory. That is, $M_{[B]} \times N_{SD} < M_{tex}$, $M_{[B]}$ being the memory used by the control points of one surface and $M_{tex}$ the texture memory size.

4. The number of draw calls ($N_{DP}$) should be minimized due to their fixed-cost overhead (Akenine-Möller et al., 2008). The basic idea of our batching strategy is combining many small transfers into a large one to optimize the data communication procedure.

The analysis of the storage requirements and rec-

ommended number of draw calls according to our

tests is included in the results section.
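Requirements 2 to 4 above can be read as a small search problem: find the fewest draw calls whose batch still fits both memory budgets. The sketch below illustrates that reading only. The function name, the budget parameters and the numbers in the example are hypothetical, and the recommended number of draw calls in this work comes from measurements rather than from such a formula.

```python
def smallest_feasible_draw_calls(surfaces, grid_system_size, control_size,
                                 grid_budget, texture_budget):
    """Return the fewest draw calls whose per-call batch fits both budgets, or None.

    grid_system_size: storage of one copy of all grids (hypothetical units)
    control_size:     storage of the control points of one surface
    grid_budget:      allowed share of global memory for the grids
    texture_budget:   texture memory available for control points
    """
    for draw_calls in range(1, surfaces + 1):
        per_call = -(-surfaces // draw_calls)        # surfaces per call, rounded up
        grids_fit = grid_system_size * per_call < grid_budget
        controls_fit = control_size * per_call < texture_budget
        if grids_fit and controls_fit:
            return draw_calls
    return None

# Hypothetical budgets, chosen only to exercise the function.
print(smallest_feasible_draw_calls(520, 2720, 48, 100_000, 10_000))   # 15
```

In this example the grid budget is the binding constraint: at most 36 surfaces fit per call, which requires 15 draw calls for 520 surfaces.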

GRAPP 2010 - International Conference on Computer Graphics Theory and Applications


4 GEOMETRY SHADER TESSELLATION (GST)

In this section we present our second proposal. This approach exploits the geometry shader for Bézier surface tessellation (GST). The objective is to exploit the geometry shader capabilities for geometry generation. In contrast to the vertex shader proposal, this permits the generation of geometry without requiring a virtual vertices strategy. Therefore, the GPU memory does not limit the level of resolution per surface.

The key idea of our GST proposal is the on-the-fly computation of the $P_l$ values for each input surface. As a consequence, no pre-computed grids are employed and the storage requirements are reduced, since only the control points of the surfaces are required. Current versions of the geometry shader permit the generation of 1024 32-bit elements per input primitive. In our implementation this limits the number of triangles that can be generated per Bézier surface and, in consequence, the maximum resolution level. Specifically, the maximum resolution level allowed is $l = 3$, that is, $2^4 \times 2^4$ triangles can be generated. Our method obtains a higher level of detail with an iterative execution of the geometry shader for each surface. This approach is possible because the geometry shader output can be stored through the stream output and fed back as input to the rendering pipeline. However, the inherent timing cost of iterative procedures makes it important to reduce the number of iterations. The objective of our proposal is to reduce this number through an efficient method that increases the highest level of detail that can be managed per iteration.

The key idea of the GST proposal for increasing the resolution level is partitioning the parametric map into zones and evaluating these zones in parallel in the geometry shader. That is, the $P_l$ grid (see Equation 1) with $2^{l+1} \times 2^{l+1}$ parametric values is partitioned and the corresponding groups of parametric values are processed in parallel in the geometry shader. Considering groups of $m \times m$ parametric values, the $P_l$ matrix of values can be rewritten as a system of submatrices:

$$
P_l = \begin{bmatrix}
P_l[1,1] & \cdots & P_l[1,Nz_v] \\
\vdots & & \vdots \\
P_l[Nz_u,1] & \cdots & P_l[Nz_u,Nz_v]
\end{bmatrix} \tag{2}
$$

where $Nz_u$ and $Nz_v$ are the number of zones in the $u$ and $v$ directions, respectively. That is,

$$
Nz_u = \frac{2^{l+1}}{m}; \quad Nz_v = \frac{2^{l+1}}{m}
$$

In our proposal two geometry shader kernels are

devoted to two tasks: zones identiﬁcation and tessel-

lation per zone.

The first task of the algorithm is the partitioning of the parametric grid into zones. As indicated in Equation 2, the $P_l$ matrix is partitioned into a set of submatrices $P_l[i, j]$, with $i = 1, \cdots, Nz_u$ and $j = 1, \cdots, Nz_v$. In our approach the first shader partitions the parametric map by identifying the first element of each submatrix, $(u_{(i \cdot m)+1}, v_{(j \cdot m)+1})$. Once this value is identified, the remaining parametric values can be generated with simple incremental operations. As a result, the first shader generates four values per zone, $[s, u_{(i \cdot m)+1}, v_{(j \cdot m)+1}, t]$, where $s$ is the surface index and $t$ indicates the iteration number. Due to the geometry shader limitations (only 1024 32-bit values can be generated per input primitive), up to 1024/4 = 256 zones can be processed in each step of the iterative algorithm. The second shader performs the evaluation of the Bézier surface corresponding to each zone. In consequence, the resolution level that can be obtained with our proposal per iteration is $2^4 \cdot m \times 2^4 \cdot m$: the 256 zones form a $2^4 \times 2^4$ arrangement and, with $m = 4$, $2^4 \cdot 4 = 2^6 = 2^{l+1}$ gives $l = 5$.
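The arithmetic behind the zone count and the resulting level can be checked with a short Python sketch. It assumes a square arrangement of zones and a zone size m that is a power of two. The function names and the zero-based zone indices are hypothetical (the text uses one-based indices), and the record layout simply mirrors the four values [s, u, v, t] described above.

```python
from math import isqrt

OUTPUT_VALUES = 1024     # 32-bit values per geometry shader invocation (from the text)
VALUES_PER_ZONE = 4      # one record [s, u, v, t] per zone

def zones_per_iteration():
    """Number of zone records that fit in one invocation."""
    return OUTPUT_VALUES // VALUES_PER_ZONE

def level_per_iteration(m):
    """Level l such that 2**(l+1) equals zones-per-side times m (m a power of two)."""
    zones_per_side = isqrt(zones_per_iteration())
    samples_per_side = zones_per_side * m
    level = samples_per_side.bit_length() - 2
    assert 2 ** (level + 1) == samples_per_side, "m must be a power of two"
    return level

def zone_records(surface, iteration, level, m):
    """First (u, v) of every zone, using zero-based zone indices (an assumption)."""
    side = 2 ** (level + 1)
    step = 1.0 / (side - 1)
    per_side = side // m
    return [(surface, i * m * step, j * m * step, iteration)
            for j in range(per_side) for i in range(per_side)]

print(zones_per_iteration())                  # 256
print(level_per_iteration(4))                 # 5
print(len(zone_records(0, 0, 5, 4)))          # 256
```

With m = 1 the same function returns level 3, which matches the single-pass limit mentioned earlier in this section.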

The second shader of the algorithm performs the surface evaluation for the points assigned to each zone. Since the zones are managed by the geometry shader as isolated input primitives, the vertices located on the border between two zones have to be evaluated for both zones. This avoids cracks between contiguous zones. Consequently, the matrices are of size $(m + 1) \times (m + 1)$, with an overlap of elements between matrices with consecutive indices.
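The overlap between neighboring zones can be illustrated with a sketch that lists the grid indices each zone covers along one axis. The function names are hypothetical, zone indices are zero-based, and clamping the last zone at the edge of the grid is an assumption of this sketch, since the text does not describe how the boundary zones are handled.

```python
def zone_axis_indices(zone, m, side):
    """Grid indices covered by a zone along one axis, including the shared border."""
    start = zone * m
    stop = min(start + m, side - 1)    # clamp at the grid edge (assumption)
    return list(range(start, stop + 1))

def zone_triangles(zone_u, zone_v, m, side):
    """Triangles produced by one zone: two per cell of its (up to) (m+1) x (m+1) block."""
    cells_u = len(zone_axis_indices(zone_u, m, side)) - 1
    cells_v = len(zone_axis_indices(zone_v, m, side)) - 1
    return 2 * cells_u * cells_v

side, m = 64, 4                        # level 5 grid, zones of size 4
zones = side // m
print(zone_axis_indices(0, m, side))   # [0, 1, 2, 3, 4]
print(zone_axis_indices(1, m, side))   # [4, 5, 6, 7, 8]
total = sum(zone_triangles(i, j, m, side) for i in range(zones) for j in range(zones))
print(total == 2 * (side - 1) ** 2)    # True
```

Index 4 appears in both of the first two zones, so the border vertex is evaluated twice with identical parameters, and the per-zone triangle counts add up to the count of the undivided grid.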

5 EXPERIMENTAL RESULTS

In this section we present the results of the evaluation

of our VST and GST proposals.

We ran our implementations on an Intel Core 2 at 2.4 GHz with 2 GB of RAM and on two different GPUs: an Nvidia GeForce 9800 GTX (Nvidia) with DirectX 10 and Microsoft's HLSL, and an ATI Radeon 5870 (ATI) with DirectX 11 and Microsoft's HLSL.

We evaluated our proposals with different scenes comprising replicated versions of a small set of models. The models employed (Teacup, Teapot and Elephant) are depicted in Figure 1 with different resolution levels. The final images have a screen resolution of 1280 × 1024 pixels. Table 1 includes the results obtained for 16 of these scenes, denoted as $S_i$ with $i = 1, \cdots, 16$. Column $N_S$ gives the number of Bézier surfaces, and column L = 1 gives the number of triangles generated for the coarsest level of detail. Columns L = 4 and L = 6 give the number of triangles generated for a non-adaptive tessellation at those levels. The columns marked Adpt. show the average number of triangles generated by an adaptive tessellation with L = 4 and L = 6, i.e., when the resolution level of each surface is at most 4 or 6, respectively. In this case, the resolution of each surface is selected on the basis of its position in the scene, using a varied set of viewpoints. Note that complex scenes with a high number of surfaces were used.

Figure 1: Models employed in the test scenes, shown at resolution levels L1 and L4: (a) Teacup, (b) Teapot, (c) Elephant.

Table 1: Number of triangles generated (in K) for each scene with L = 4 and L = 6.

Scene    N_S     L = 1       L = 4   L = 4 Adpt.        L = 6   L = 6 Adpt.
S1        26      0.46       48.80         25.49       819.05        432.66
S2        32      0.56       60.06         34.39      1008.06        554.98
S3       260      4.57      488.01        254.86      8190.51       3815.56
S4       320      5.63      600.62        343.95     10080.63       5206.89
S5       520      9.14      976.02        509.72     16381.02       6935.26
S6       640     11.25     1201.25        687.90     20161.25       9454.31
S7       780     13.71     1465.02        764.59     24571.52       9259.56
S8       811     14.26     1522.21       1241.23     25548.08      14923.50
S9       960     16.88     1801.88       1031.84     30241.88      12856.10
S10     1040     18.28     1952.03       1019.45     32762.03      10801.40
S11     1280     22.50     2402.50       1375.79     40322.50      15142.10
S12     1300     22.85     2440.04       1274.31     40952.54      11495.80
S13     1600     28.13     3003.12       1719.74     50403.13      16525.70
S14     2600     45.70     4880.08       2548.62     81905.08      38865.80
S15     3200     56.25     6006.25       3439.48    100806.25      30826.25
S16     8110    142.56    15222.09      12421.30    255480.84      55340.50
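The non-adaptive entries of Table 1 can be cross-checked against the grid size of Section 3. The sketch below assumes, without confirmation from the text, that every cell of the 2^(l+1) × 2^(l+1) grid is split into two triangles and that K stands for 1024. Under these assumptions it reproduces the first row of the table. The function names are hypothetical.

```python
def triangles_per_surface(level):
    """Two triangles per cell of a uniform grid with 2**(level+1) samples per side."""
    side = 2 ** (level + 1)
    return 2 * (side - 1) ** 2

def scene_triangles_k(surfaces, level):
    """Triangles of a non-adaptive scene in K, assuming K = 1024."""
    return surfaces * triangles_per_surface(level) / 1024

for level in (1, 4, 6):
    print(f"L={level}: {scene_triangles_k(26, level):.2f} K")
# L=1: 0.46 K
# L=4: 48.80 K
# L=6: 819.05 K
```

The adaptive columns cannot be derived this way, since they depend on the viewpoints used in the tests.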

First, for the VST proposal, the number of draw calls $N_{DP}$ was analyzed. As an example of our analysis, Figure 2 shows the frames per second for one of the test scenes for different $N_{DP}$ and L values on the Nvidia. A similar behavior was obtained for all the scenes tested. As can be observed in the figure, $N_{DP}$ has a strong influence on the performance. For example, the obtained speedup is 1.42 with L = 5 for $N_{DP}$ = 4, and up to 1.31 with L = 6 for $N_{DP}$ = 8. The good performance in terms of frames per second is due to the reduction of global memory accesses and the efficient utilization of the texture memory. In summary, the satisfactory results are associated with the exploitation of data locality and the scheduling strategy employed. For larger $N_{DP}$ values this trend changes due to the overhead of each draw call. With respect to the dependence on L, for larger L values the best frames per second are obtained for larger $N_{DP}$ values. According to Wloka's rule (Akenine-Möller et al., 2008), this is due to the larger number of polygons per surface and the rasterization costs, which make the standard GPU pipeline the bottleneck of the application.

Figure 2: VST proposal for varying $N_{DP}$ for one test scene with L = 4, L = 5 and L = 6 (x-axis: $N_{DP}$; y-axis: frames per second).

Figure 3: VST and GST for L = 4 on (a) Nvidia GeForce 9800 GTX and (b) ATI Radeon 5870 (x-axis in K; y-axis: frames per second).

Finally, we conducted a detailed analysis of the efficiency of our tessellation methods. Figure 3 depicts the performance for L = 4 with the two GPUs. The VST and GST proposals obtain a good performance in terms of fps, allowing real-time adaptive tessellation even for a high number of triangles. For example, for the scene with 9.14 K input triangles, 113.8 fps for the VST proposal and 10.04 fps for the GST proposal are obtained with the Nvidia. With the ATI, 606.36 and 742.64 fps are obtained with the VST and GST proposals, respectively. In this case, as indicated in Table 1, the number of triangles generated is 509.72 K with an adaptive approach. Moreover, on the Nvidia the performance of the GST method is inferior to that of the VST method, while on the ATI using DirectX 11 the result is the opposite. The differences in performance would seem to be due to the improvement in the utilization of the geometry shader output as input to the vertex shader. Specifically, for L = 4 two stages are necessary to obtain the desired resolution level.

Figure 4: Comparison of GST and the tessellation unit on a Radeon 5870 (x-axis in K; y-axis: frames per second).

We also compared our proposals with others in terms of fps. With respect to traditional algorithms for tessellation on the GPU (Guthe et al., 2005), our two proposals achieve better performance in all cases and on both architectures; for example, one of the test scenes runs at 4.05 fps on Nvidia with that approach. Finally, Figure 4 shows a comparison between our GST approach and the tessellation unit (Microsoft, 2009) on the ATI, where our proposal obtains better performance in all cases.

6 CONCLUSIONS

In this paper we have presented two proposals for the tessellation of Bézier surfaces on the GPU. The first method, VST, is based on a virtual vertices strategy and a system of multi-resolution parametric maps. Using this system of maps to evaluate the final coordinates of the virtual vertices allows the processing of multiple surfaces in parallel. Additionally, to exploit data locality and to reduce the number of global memory accesses, an analysis of the optimum number of surfaces to be processed in parallel was performed.

The second method, GST, is based on the exploitation of the geometry shader as a primitive generator. Due to the current limitations of the shader in terms of the number of primitives generated per input primitive, our proposal uses a smaller primitive, a section of the parametric map.

As a result of our analysis we conclude that current and future graphics cards are becoming an adequate platform for the tessellation of parametric surfaces. We have obtained very good results in terms of timing requirements for both proposals on complex scenes.

ACKNOWLEDGEMENTS

This work has been partially supported by the Ministry of Science and Innovation of Spain under contract TIN 2007-67537-C03 and by the Xunta de Galicia under contract 08TIC001206PR.

REFERENCES

Akenine-Möller, T., Haines, E., and Hoffman, N. (2008). Real-Time Rendering. A. K. Peters, Ltd, third edition.

Blythe, D. (2006). The Direct3D 10 System. ACM Trans. Graph., 25(3):724–734.

Boubekeur, T. and Schlick, C. (2005). Generic Mesh Refinement on GPU. In HWWS '05: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 99–104.

Dyken, C., Reimers, M., and Seland, J. (2009). Semi-uniform Adaptive Patch Tessellation. Computer Graphics Forum, 28(8):2255–2263.

Eisenacher, C., Meyer, Q., and Loop, C. (2009). Real-time view-dependent rendering of parametric surfaces. In I3D '09: Proceedings of the 2009 symposium on Interactive 3D graphics and games, pages 137–143, New York, NY, USA. ACM.

Guthe, M., Balázs, Á., and Klein, R. (2005). GPU-Based Trimming and Tessellation of NURBS and T-Spline Surfaces. ACM Trans. Graph., 24(3):1016–1023.

Microsoft (2009). DirectX Sample Browser (August 2009). http://www.microsoft.com/.

Ni, T. and Castaño, I. (2009). Efficient Substitutes for Subdivision Surfaces. Exhibition Tech. SIGGRAPH'09 Course Notes, 2009.

Piegl, L. and Tiller, W. (1997). The NURBS Book. Springer.

Schwarz, M. and Stamminger, M. (2009). Fast GPU-based adaptive tessellation with CUDA. Computer Graphics Forum, 28(2):365–374.
