part, shown in Fig. 4c. As previously discussed, this
significantly improves the use of traditional texture
caches, and raises the processor occupancy in the case
of next-gen shared memory. The arithmetic intensity is
clearly left untouched by this transformation.
The WTA-kernel now takes its input from both vertical
convolution filters (see Fig. 4d). Since they both
perform the same functionality, but on different data,
the WTA-kernel can be duplicated according to rule (2).
To be able to apply this rule, the data needs to be
restructured accordingly. The original algorithm uses
the RGBA-component texture format to communicate
between the kernels, hence the data is batched in a
four-component SIMD fashion. As four (independent)
disparity estimations are packed inside these
components, the data streams L0:3 (representing the
four components of the left convolution filter) and
R0:3 are restructured to L0:1R0:1 and L2:3R2:3, thus
crosswise exchanging the last two components of the
left vector with the first two components of the right
vector. Since these components are independent by
definition of SIMD, and both vertical kernels exhibit
the same functionality, the restructuring is allowed.
The four data streams that are required
by the WTA-kernel (i.e. UL, DL, UR, and DR), are
available at the output of a single vertical convolution
kernel, albeit with only two components instead of four.
The WTA-kernel therefore has to be slightly modified
(devectorized), but in this case this incurs no penalty,
as the minimum selection is already a scalar operation.
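The crosswise restructuring can be illustrated with a minimal sketch in plain Python, where lists stand in for the four-component RGBA vectors. The names `restructure` and `vertical_kernel`, the doubling operation, and all values are hypothetical and chosen for illustration only; the actual kernels run on the GPU and are not reproduced here. The sketch only shows that, as long as the kernel treats each component independently, regrouping the components changes where a result is produced, not what is produced.

```python
def restructure(left, right):
    """Crosswise swap: (L0:3, R0:3) -> (L0:1 R0:1, L2:3 R2:3)."""
    return left[:2] + right[:2], left[2:] + right[2:]


def vertical_kernel(stream):
    """Stand-in for a vertical convolution: any per-component operation."""
    return [2 * c for c in stream]


left = [10, 11, 12, 13]   # L0..L3 (made-up values)
right = [20, 21, 22, 23]  # R0..R3 (made-up values)

first, second = restructure(left, right)

# Original layout: one kernel instance per side (left, right).
out_left, out_right = vertical_kernel(left), vertical_kernel(right)

# Restructured layout: one kernel instance per component pair.
out_first, out_second = vertical_kernel(first), vertical_kernel(second)
```

After the swap, each kernel instance holds the left and right results for its own two components, which is what allows a duplicated WTA-kernel to work on the output of a single vertical kernel.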
In a fifth and final stage, the duplicated WTA-
kernels can be merged with the preceding vertical
convolutions, as depicted in Fig. 4e. The communi-
cation with global memory (i.e. legacy texture mem-
ory) can therefore be avoided. Since the dependencies
between the WTA- and convolution kernels exhibit a
one-to-one relationship (i.e. selecting the minimum),
the merge can be carried out implicitly.
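The effect of this merge can be sketched in plain Python, assuming that the WTA step reduces to a position-wise minimum over its input streams. The functions `convolve`, `wta_unfused` and `wta_fused`, the filter weights, and the data are all hypothetical; the intermediate list in the unfused variant merely stands in for a round trip through global memory.

```python
def convolve(column, weights):
    """Toy 1-D vertical convolution (valid region only)."""
    n = len(weights)
    return [sum(w * column[i + k] for k, w in enumerate(weights))
            for i in range(len(column) - n + 1)]


def wta_unfused(streams, weights):
    # Stage 1: every convolution result is written out in full.
    intermediates = [convolve(s, weights) for s in streams]
    # Stage 2: a separate WTA pass reads all results back.
    return [min(values) for values in zip(*intermediates)]


def wta_fused(streams, weights):
    n = len(weights)
    out = []
    for i in range(len(streams[0]) - n + 1):
        # Output i depends only on the convolution results at position i,
        # so the minimum is taken on the fly without an intermediate buffer.
        out.append(min(sum(w * s[i + k] for k, w in enumerate(weights))
                       for s in streams))
    return out


# Four made-up input streams, standing in for UL, DL, UR and DR.
streams = [[4, 1, 3, 5], [2, 2, 6, 0], [7, 0, 1, 1], [3, 3, 3, 3]]
weights = [1, 1]
```

Both variants return the same values; the fused one simply never materializes the per-stream convolution outputs, which is the point of the one-to-one dependency noted above.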
To give an indication of the overall performance
gain, we benchmarked the phase 3 implementation (as
originally proposed by Lu et al.) and the phase 5
implementation, as suggested by the high-level kernel
transformation rules. The implementations were
measured on an NVIDIA GeForce 8800GT, using images
with a 450 × 375 resolution and 60 disparity
estimations. The phase 3 implementation gives 115 fps,
while phase 5 reaches over 129 fps, an increase of
12.2%. The overall performance increase (from phase 2
through phase 5), however, is over 40%, with minimal
design effort.
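The reported percentage can be checked with a short calculation. The helper `relative_gain` is a hypothetical name, and the implied phase 2 rate is derived here by taking 129 fps and 40% at face value; it is not a measured figure.

```python
def relative_gain(before_fps, after_fps):
    """Relative speed-up in percent."""
    return (after_fps / before_fps - 1.0) * 100.0


# Phase 3 -> phase 5, using the frame rates quoted in the text.
gain_3_to_5 = relative_gain(115.0, 129.0)

# Assumption: a 40% overall gain ending at 129 fps would correspond
# to a phase 2 rate of about 129 / 1.4 fps (derived, not measured).
implied_phase2_fps = 129.0 / 1.4
```

Rounded to one decimal, `gain_3_to_5` matches the 12.2% quoted above, and `implied_phase2_fps` comes out at roughly 92 fps.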
4 CONCLUSIONS
We have proposed a high-level rule set to transform
the original algorithmic design, resulting in an
inter-kernel optimization rather than a low-level
intra-kernel optimization. Since the rules take both
traditional texture caches and next-gen shared memory
into account, they can be implicitly applied in most
streaming applications on graphics hardware. We
have applied the rule set to a state-of-the-art depth
estimation algorithm, and achieved a performance
increase of over 40% with minimal design effort.
ACKNOWLEDGEMENTS
Sammy Rogmans would like to thank the IWT for the
financial support under grant number SB071150.
REFERENCES
Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J.,
Husbands, P., Keutzer, K., Patterson, D. A., Plishker,
W. L., Shalf, J., Williams, S. W., and Yelick, K. A.
(2006). The landscape of parallel computing research:
A view from Berkeley. Technical report.
Fatahalian, K., Sugerman, J., and Hanrahan, P. (2004).
Understanding the efficiency of GPU algorithms for
matrix-matrix multiplication. In Graphics Hardware.
Gong, M., Yang, R., Wang, L., and Gong, M. (2007). A
performance study on different cost aggregation ap-
proaches used in real-time stereo matching. Int’l Jour-
nal Computer Vision.
Govindaraju, N. K., Larsen, S., Gray, J., and Manocha, D.
(2006). A memory model for scientific algorithms on
graphics processors. In Super Computing.
Lu, J., Lafruit, G., and Catthoor, F. (2007). Fast vari-
able center-biased windowing for high-speed stereo
on programmable graphics hardware. In ICIP.
Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger,
J., Lefohn, A., and Purcell, T. (2007). A survey of
general-purpose computation on graphics hardware.
CG Forum.
Podlozhnyuk, V. (2007). Image convolution with CUDA.
Rogmans, S., Lu, J., Bekaert, P., and Lafruit, G. (2009).
Real-time stereo-based view synthesis algorithms:
A unified framework and evaluation on commodity
GPUs. Signal Processing: Image Communication.
Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S.,
Kirk, D. B., and Hwu, W.-M. W. (2008). Optimization
principles and application performance evaluation of a
multithreaded GPU using CUDA. In PPoPP.
Scharstein, D. and Szeliski, R. (2002). A taxonomy and
evaluation of dense two-frame stereo correspondence
algorithms. Int’l Journal Computer Vision.
A HIGH-LEVEL KERNEL TRANSFORMATION RULE SET FOR EFFICIENT CACHING ON GRAPHICS
HARDWARE - Increasing Streaming Execution Performance with Minimal Design Effort