A HIGH-LEVEL KERNEL TRANSFORMATION RULE SET FOR
EFFICIENT CACHING ON GRAPHICS HARDWARE
Increasing Streaming Execution Performance with Minimal Design Effort
Sammy Rogmans 1,2, Philippe Bekaert 1 and Gauthier Lafruit 2
1 Hasselt University – tUL – IBBT, Expertise Centre for Digital Media, Wetenschapspark 2, 3590 Diepenbeek, Belgium
2 Multimedia Group, IMEC, Kapeldreef 75, 3001 Leuven, Belgium
Keywords: High-level, Transformation, Rule set, GPU, Efficient.
Abstract: This paper proposes a high-level rule set that allows algorithmic designers to optimize their implementation on graphics hardware with minimal design effort. The rules suggest possible kernel splits and merges to transform the kernels of the original design, resulting in an inter-kernel rather than low-level intra-kernel optimization. The rules consider both traditional texture caches and the next-gen shared memory used in abstract stream-centric paradigms such as CUDA and Brook+, and can therefore be implicitly applied in most generic streaming applications on graphics hardware.
1 INTRODUCTION
The landscape of parallel computing has been changing significantly over the past few years. Since the advent of programmable Graphics Processing Units (GPUs), it has been clear to the high-performance computing community that GPUs can be exploited as powerful coprocessors. General-
purpose GPU (GPGPU) computing has therefore pro-
liferated the use of data parallel programming, cer-
tainly since the introduction of abstract programming
environments such as CUDA and Brook+. These
paradigms abstract the GPU as a hybrid distributed-
shared memory architecture, having multiple multi-
processors, each with their individual shared memory.
The popularization of data parallel programming
has triggered researchers to investigate and formal-
ize mapping rules to fit sequentially modelled algo-
rithms on the parallel GPU architecture. Owens et al.
have systematically surveyed traditional GPGPU
(Owens et al., 2007), where the computational re-
sources can only be exploited through the graphics
APIs Direct3D or OpenGL. More recently, the next-
generation GPGPU vendor-specific APIs CUDA and
Brook+ were released for NVIDIA and ATI hard-
ware respectively. Abstracting the graphics hardware
through these generic APIs has further levered the
motivation to investigate formal mapping rules to the
GPU architecture. As the GPU exhibits a massive
parallel architecture, the main challenge for the map-
ping rules is to keep the processors busy instead of
idle due to relatively slow memory reads. The preliminary visions and statements of the University of California at Berkeley in (Asanovic et al., 2006) posit an important set of elementary kernels – consistently defined as dwarfs – which has led many researchers to investigate intra-kernel optimization and formalization. Govindaraju et al. created
a memory model for traditional GPGPU in (Govin-
daraju et al., 2006), where the memory efficiency
of a kernel can be modelled and examined. How-
ever, research such as (Fatahalian et al., 2004) proves
that some stand-alone kernels cannot be optimized
and are therefore always bound by a memory bot-
tleneck, most certainly under the constraints of tradi-
tional GPGPU. Ryoo et al. investigated many possi-
ble intra-kernel optimizations in (Ryoo et al., 2008)
when considering next-generation GPGPU. Nonethe-
less, Victor Podlozhnyuk shows in his image convolu-
tion tutorial for CUDA (Podlozhnyuk, 2007) that tra-
ditional (texture) memory access can be more bene-
ficial in some cases. Therefore, traditional and next-
gen GPGPU are in general not mutually exclusive in
an end-to-end optimized application.
Complementary to previous related research on
low-level intra-kernel optimization, this paper rather
Figure 1: Concepts of a streaming kernel.
focuses on a high-level rule set that forms a methodology for inter-kernel optimization. The most
important concepts related to an individual kernel are
(see Fig. 1) the input stream(s), the computations in-
side, and the output stream(s). Moreover, two im-
portant performance specifications of a kernel are the
arithmetic intensity and processor occupancy. Arithmetic intensity is the ratio of the amount of computation inside the kernel to the total amount of data transferred. In contrast with arithmetic intensity, which has been in use for years, processor occupancy is relatively new; it specifies the number of data-parallel thread batches (i.e. CUDA warps) that are able to execute simultaneously per individual (multi)processor.
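To make these two specifications concrete, consider the following minimal CUDA sketch; the kernel and its byte counts are illustrative and not taken from the paper:

    #include <cuda_runtime.h>

    // Minimal illustrative kernel: y[i] = a * x[i] + y[i].
    // Per element it performs 2 floating-point operations while
    // transferring 12 bytes (two 4-byte reads, one 4-byte write),
    // giving an arithmetic intensity of roughly 2/12 operations per
    // byte -- far too low to hide global memory latency on its own.
    // Its occupancy, in contrast, is high: each thread needs only a
    // few registers, so many warps can be resident per multiprocessor.
    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }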
In Section 2 we describe the formal rule set that optimizes a stream-centric processing chain, motivating the rules at an abstract processing level. Section 3 consequently presents a case study and its results, optimizing the truncated windows algorithm (Lu et al., 2007; Rogmans et al., 2009), which computes a dense depth map from a rectified stereo image pair. The conclusions and possible future work are ultimately presented in Section 4.
2 HIGH-LEVEL RULE SET
By splitting and merging kernels in the high-level
design, the specifications of the resulting individual
kernels can be altered for optimal end-to-end perfor-
mance. Nonetheless, proper care should be taken, as
a random kernel split or merge could influence the
performance negatively. The proposed rule set de-
fines the plausible kernel splits and merges to bene-
fit the overall performance, and can therefore assist a
GPGPU programmer to optimize an implementation
with minimal design effort. The motivation for splitting should always be to lever the processor occupancy of the kernel. If a kernel is split, the memory footprint – i.e. the temporary memory and registers needed for the kernel computations – is consequently reduced. As a single thread consumes less memory, the number of threads that can be managed simultaneously on a single (multi)processor with a fixed amount of shared memory and registers is thereby significantly increased. Although this levers the processor occupancy, a kernel split could potentially hurt
the arithmetic intensity when the two resulting sub-
kernels exhibit redundant input and/or output streams.
Merging two kernels should always be motivated by levering the arithmetic intensity. Since two kernels can only communicate through global memory, merging them avoids this communication completely. Nevertheless,
a random merge could potentially increase the mem-
ory footprint of a kernel, and therefore damage its
processor occupancy. Moreover, in some cases the
arithmetic intensity can suddenly proliferate due to
implicit data dependencies between the two merged
kernels, resulting in a non-trivial computational bot-
tleneck. As kernel merging and splitting can hence
crosswise counter the kernel specifications while at-
tempting to improve the performance, the rule set pro-
poses only those transformations that improve either arithmetic intensity or processor occupancy without negatively affecting the other specification.
2.1 Kernel Splits
Splitting kernels to lever the occupancy while maintaining the arithmetic intensity can be applied in two different forms, depending on the input streams of the kernel that potentially needs to be split. Whenever a single input stream can be isolated to a sub-kernel functionality, that functionality depends on only a single preceding kernel, and can therefore be extracted without breaking any data dependencies.
In contrast, a kernel that has input streams coming from multiple preceding kernels cannot easily be split without breaking data dependencies. However,
the kernel can be duplicated by restructuring the data
streams, while still having a beneficial effect in over-
all performance.
2.1.1 Rule (1): Input Isolation
“If a data input stream can be isolated to a spe-
cific sub-kernel functionality, the concerning kernel
should be split into two kernels harnessing the iso-
lated sub-kernel functionality and the remaining part.
The jointly generated kernel should be recursively checked for the further application of this rule.”
As depicted in Fig. 2a, a sub-kernel functionality with an arbitrary number of output streams, which depends on only a single data input stream, is isolated to form two different kernels. Both kernels will therefore increase their individual processor occupancy, as their memory footprint is reduced. Moreover, the footprint of the isolated sub-kernel is minimized when assuming the further application of intra-kernel optimizations. When using traditional texture caches,
Figure 2: An (a) input isolation and (b) kernel duplication.
this rule will also significantly improve the sampling behaviour, and therefore the memory transfer. Sampling from only a single ‘texture’ or data stream allows an optimal use of the caches with minimal misses, most definitely when a linear sampling pattern can be used. Since the input and output streams can be untangled, no redundancy is necessary, and the average arithmetic intensity of both kernels is still equivalent to the original one.
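As a hedged illustration of rule (1), the CUDA sketch below isolates a hypothetical sub-functionality that depends on stream a alone; all names and the toy computations are assumptions, not the paper's kernels:

    // Before the split: one kernel both filters stream a (a
    // computation depending on a alone) and combines it with b.
    __global__ void fused(const float* a, const float* b,
                          float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float t = a[i] * a[i] + 1.0f; // sub-functionality isolated to a
        out[i] = t + b[i];            // remaining part, which needs b
    }

    // After rule (1): the isolated sub-kernel samples only one stream,
    // which favours texture-cache coherence and shrinks its footprint.
    __global__ void isolate_a(const float* a, float* tmp, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = a[i] * a[i] + 1.0f;
    }

    __global__ void combine(const float* tmp, const float* b,
                            float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tmp[i] + b[i];
    }

In such a toy example the intermediate stream adds traffic of its own; the rule pays off when the isolated sub-functionality carries a substantial register or shared-memory footprint, as argued above.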
2.1.2 Rule (2): Kernel Duplication
“Kernels that cannot be further split by rule (1), and have multiple input streams coming from different kernels with the same functionality, can be duplicated by devectorizing the data streams inside the SIMD components of the multiprocessor. The original output is consequently acquired by revectorizing the output streams of the duplicated kernels.”
Fig. 2b depicts a duplicated kernel with a reduced number of input streams. The SIMD components in the streams (i.e. the traditional RGBA channels, or next-gen CUDA execution warps) therefore have to be restructured. Since the data components of SIMD are by definition independent, they can be implicitly untangled without breaking any data dependencies. If the preceding kernels exhibit the same functionality, the individual SIMD components can be freely interchanged. By doing so, all required (devectorized) scalar input dependencies of the concerning kernel are transmitted over the outputs of a single preceding kernel. Hence, the devectorized kernel can be duplicated to consequently process the output of each of the identical preceding kernels. For this to work, however, the concerning kernel can no longer operate on its original SIMD size, but this poses no concrete problems, as next-gen GPGPU can function on a scalar level. The advantage is again that processor occupancy is levered while maintaining the arithmetic intensity, and it additionally allows for a consecutive kernel merge that further levers the overall performance.
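A hedged CUDA sketch of the devectorization behind rule (2) follows; the stream names, the crosswise packing, and the per-component minimum are assumptions for illustration only:

    // Two identical producers originally emit the float4 streams L
    // and R, and one consumer needs components of both. Since SIMD
    // components are independent, the streams can be re-packed
    // crosswise,
    //   (L0,L1,L2,L3), (R0,R1,R2,R3) -> (L0,L1,R0,R1), (L2,L3,R2,R3),
    // so that each duplicated consumer reads the output of exactly
    // one producer, at a reduced (devectorized) SIMD width.
    __global__ void consumer_dup(const float4* packed,
                                 float2* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 v = packed[i];  // e.g. (L0, L1, R0, R1) for duplicate 0
        // the devectorized work is scalar, here a per-component minimum
        out[i] = make_float2(fminf(v.x, v.z), fminf(v.y, v.w));
    }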
2.2 Kernel Merges
Merging kernels can significantly lever the arith-
metic intensity, whenever input streams can be reused
and/or global communication is avoided. In case
input streams can be reused, the individual kernel
computations are packed together into a single ker-
nel. Nevertheless, packing more computations inside
a single kernel can potentially increase the memory
footprint, and therefore reduce the average processor
occupancy.
Whenever the input streams of a kernel are directly linked to the output streams of a preceding kernel, the stream flow can be simplified – and global communication therefore avoided – by merging both kernels. The risk of simplifying the flow, however, is that the data dependencies between the two concerning kernels can proliferate the arithmetic intensity, causing a computational bottleneck that is larger than the
Figure 3: A (a) computational packing and (b,c) flow simplification.
original global memory bottleneck.
2.2.1 Rule (3): Computation Packing
“If two different kernels read from identical input streams and are not inter-dependent, the computations can be packed into a single kernel with expanded functionality. The number of input streams therefore remains equivalent, while the number of outputs is the sum of the separate output streams.”
As depicted in Fig. 3a, the computations of two kernels with identical input streams are packed together. The arithmetic intensity is significantly levered whenever the two merged kernels share a number of common input streams. However, there is in general no guarantee that the occupancy does not drop, because of a possible increase in the memory footprint of the merged kernel. Nonetheless, when their functionalities are not inter-dependent (i.e. neither kernel needs the other one’s output as input), the memory footprint cannot exceed the maximum of the two individual footprints. This is thanks to the possibility of clearing the temporary memory of the first sub-kernel – excluding the input data – before the following sub-kernel executes.
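As a hedged illustration of rule (3), the CUDA sketch below packs two hypothetical, non-inter-dependent functionalities that read the same stream into one kernel:

    // The shared input is fetched from global memory once instead of
    // twice; the number of inputs stays the same, while the number of
    // outputs is the sum of the two original output streams.
    __global__ void packed(const float* in, float* outA,
                           float* outB, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = in[i];            // single fetch feeds both parts
        outA[i] = 2.0f * v + 1.0f;  // functionality of the first kernel
        outB[i] = v * v;            // functionality of the second kernel
    }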
2.2.2 Rule (4): Flow Simplification
“A flow of two kernels that are interconnected sequentially by their respective output and input streams can be simplified and merged into a single kernel. The merge is only implicitly optimal whenever the global memory footprint between the two kernels exhibits a strict one-to-one relationship. This does not impose any restrictions on additional outputs or inputs that do not interconnect the concerning kernels.”
Fig. 3b shows two sequential kernels that are merged, whereas intermediate in- and output streams are also allowed in this rule (see Fig. 3c). Since inter-kernel communication can only occur through global memory, merging two sequential kernels can avoid a significant amount of global memory communication. Whenever the data dependencies between the two kernels exhibit a one-to-one relationship (e.g. an RGB to YUV conversion), the result of the first kernel can be immediately reused without the need of writing the data to and reading it back from global memory. However, in case the two kernels do not exhibit a one-to-one relationship (e.g. an N-tap convolution filter), the intermediate results of threads in the vicinity (i.e. the size of the filter) of the thread block borders cannot be reused, because thread blocks are not able to communicate with each other. In effect, the computations of the first kernel have to be performed in a redundant manner to provide the required input data of the consecutive kernel. Whether or not the kernel merge is a good design choice then depends on the low-level computations and optimizations, making it difficult to exactly predict an overall performance increase.
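A hedged CUDA sketch of the one-to-one case of rule (4) follows, reusing the RGB to YUV example from above; only the luminance channel is shown, the BT.601 weights are a common choice rather than the paper's, and the thresholding consumer is hypothetical:

    __global__ void rgb_to_y_threshold(const uchar4* rgb,
                                       unsigned char* mask, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uchar4 p = rgb[i];
        // first kernel's functionality: strictly one value in, one out
        float y = 0.299f * p.x + 0.587f * p.y + 0.114f * p.z;
        // the second kernel's functionality consumes it directly from
        // a register, so the intermediate result never travels through
        // global memory
        mask[i] = (y > 128.0f) ? 255 : 0;
    }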
3 CASE STUDY RESULTS
As a case study for the application of the high-level
kernel transformation rules, we present the optimiza-
tion of the truncated windows (Lu et al., 2007) stereo
depth estimation algorithm, with minimal design ef-
fort. Stereo matching has already been intensively researched by the computer vision community, and is systematically surveyed in (Scharstein and Szeliski, 2002). However, only in recent years have these algorithms become real-time, through the use of GPGPU computing. Gong et al. and Rogmans et al. have studied the performance of various important real-time stereo algorithms in (Gong et al., 2007) and (Rogmans et al., 2009) respectively. From this previous research, the truncated windows algorithm is
Figure 4: The (a) original algorithmic flow sketch and (b,c,d,e) optimization phases.
identified as having one of the best trade-offs between
quality and execution speed, hence we present the fur-
ther optimization of this algorithm.
Following the taxonomy of (Scharstein and Szeliski, 2002), the truncated windows algorithm consists of three phases, i.e. an absolute difference (AD), a truncated convolution (TC), and a winner-takes-all (WTA) disparity (depth) selection. As shown in Fig. 4a, the AD-kernel takes in a left and right stereo image, and outputs a pixelwise difference to the TC-kernel, which convolves the input with a truncated filter, resulting in four separate outputs. The final WTA-kernel selects the minimal value, as it indicates the best disparity (depth) hypothesis. For more detail about the algorithm, the reader can consult Lu et al.’s original paper (Lu et al., 2007).
As Fig. 4a only depicts the basic algorithmic flow sketch of the first phase, a trivial algorithmic optimization is to separate the 2D convolution into a 1D horizontal and a 1D vertical filter, resulting in the flow diagram shown in Fig. 4b. Since this decreases the algorithmic complexity, it is a perfect example of significantly reducing the amount of computation to compensate for the extra inter-kernel communication. However, many algorithmic designers do not go beyond these optimizations, as further optimizations mostly require platform-specific knowledge.
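A hedged CUDA sketch of such a separable filter follows; the radius and the uniform (box) weights are illustrative, not the truncated-window coefficients of the actual algorithm:

    #define RADIUS 4

    // Horizontal 1-D pass; an analogous vertical pass completes the
    // 2-D filter. A (2R+1) x (2R+1) window then costs 2 x (2R+1) taps
    // per pixel instead of (2R+1)^2, at the price of an intermediate
    // stream between the two passes.
    __global__ void convolve_h(const float* in, float* out, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            int xc = min(max(x + k, 0), w - 1);  // clamp at the border
            acc += in[y * w + xc];
        }
        out[y * w + x] = acc / (2 * RADIUS + 1);
    }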
The specific GPU architectural knowledge is conveniently embedded inside the proposed high-level rules, to relieve the designer from first having to go through a rather tough learning curve. In a third phase, the first rule can be applied to the vertical convolution, since the horizontal convolution kernel outputs a left (L) and right (R) part. Instead of reading from both streams, the streams are now read separately and processed individually into an upper (U) and lower (D)
part, as shown in Fig. 4c. As previously discussed, this significantly improves the use of traditional texture caches, and levers the processor occupancy in the case of next-gen shared memory. The arithmetic intensity is clearly left untouched by this transformation.
The WTA-kernel now takes input from both vertical convolution filters (see Fig. 4d). Since they both perform the same functionality, but on different data, the WTA-kernel can be duplicated according to rule (2). To be able to apply this rule, the data needs to be restructured accordingly. The original algorithm uses the RGBA-component texture format to communicate between the kernels, hence the data is batched in a four-component SIMD way. As four (independent) disparity estimations are packed inside these components, the data stream L0:3 – representing the four components of the left convolution filter – and R0:3 are restructured into L0:1R0:1 and L2:3R2:3, thus crosswise switching the first and last two components of the vectors. Since these components are independent by definition of SIMD, the restructuring is allowed because both vertical kernels exhibit the same functionality. The four data streams that are required by the WTA-kernel (i.e. UL, DL, UR, and DR) are available at the output of a single vertical convolution kernel, albeit with only two components instead of four. For this, the WTA-kernel has to be slightly modified (devectorized), but in this case that results in no penalty, as the minimum selection is already a scalar operation.
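A hedged sketch of this crosswise repacking, with the stream names assumed from the description above:

    // The four SIMD components of L and R are swizzled so that each
    // duplicated WTA-kernel only ever reads the output of one
    // vertical convolution kernel.
    __device__ void repack(float4 L, float4 R,
                           float4* firstHalf, float4* secondHalf)
    {
        *firstHalf  = make_float4(L.x, L.y, R.x, R.y);  // L0:1 R0:1
        *secondHalf = make_float4(L.z, L.w, R.z, R.w);  // L2:3 R2:3
    }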
In a fifth and final stage, the duplicated WTA-kernels can be merged with the preceding vertical convolutions, as depicted in Fig. 4e. The communication with global memory (i.e. legacy texture memory) can therefore be avoided. Since the dependencies between the WTA- and convolution kernels exhibit a one-to-one relationship (i.e. selecting the minimum), the merge can be carried out implicitly.
To give an indication of the overall performance gain, we have benchmarked the phase 3 implementation (as originally proposed by Lu et al.) and the phase 5 implementation, as suggested by the high-level kernel transformation rules. The implementations were measured on an NVIDIA GeForce 8800 GT, using images with a 450 × 375 resolution and 60 disparity estimations. The phase 3 implementation achieves 115 fps, while phase 5 reaches over 129 fps, an increase of 12.2%. However, the overall performance increase (phase 2 through phase 5) is over 40%, with minimal design effort.
4 CONCLUSIONS
We have proposed a high-level rule set to transform the original algorithmic design, resulting in an inter-kernel optimization rather than a low-level intra-kernel optimization. Since the rules take both traditional texture caches and next-gen shared memory into account, they can be implicitly applied in most streaming applications on graphics hardware. We
have applied the rule set to a state-of-the-art depth
estimation algorithm, and achieved over 40% perfor-
mance increase with minimal design effort.
ACKNOWLEDGEMENTS
Sammy Rogmans would like to thank the IWT for the
financial support under grant number SB071150.
REFERENCES
Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J.,
Husbands, P., Keutzer, K., Patterson, D. A., Plishker,
W. L., Shalf, J., Williams, S. W., and Yelick, K. A.
(2006). The landscape of parallel computing research:
A view from Berkeley. Technical report.
Fatahalian, K., Sugerman, J., and Hanrahan, P. (2004).
Understanding the efficiency of GPU algorithms for
matrix-matrix multiplication. In Graphics Hardware.
Gong, M., Yang, R., Wang, L., and Gong, M. (2007). A
performance study on different cost aggregation ap-
proaches used in real-time stereo matching. Int’l Jour-
nal Computer Vision.
Govindaraju, N. K., Larsen, S., Gray, J., and Manocha, D.
(2006). A memory model for scientific algorithms on
graphics processors. In Supercomputing.
Lu, J., Lafruit, G., and Catthoor, F. (2007). Fast vari-
able center-biased windowing for high-speed stereo
on programmable graphics hardware. In ICIP.
Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger,
J., Lefohn, A., and Purcell, T. (2007). A survey of
general-purpose computation on graphics hardware.
CG Forum.
Podlozhnyuk, V. (2007). Image convolution with CUDA.
Rogmans, S., Lu, J., Bekaert, P., and Lafruit, G. (2009).
Real-time stereo-based view synthesis algorithms:
A unified framework and evaluation on commodity
GPUs. Signal Processing: Image Communication.
Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S.,
Kirk, D. B., and Hwu, W.-M. W. (2008). Optimization
principles and application performance evaluation of a
multithreaded GPU using CUDA. In PPoPP.
Scharstein, D. and Szeliski, R. (2002). A taxonomy and
evaluation of dense two-frame stereo correspondence
algorithms. Int’l Journal Computer Vision.