A HIGH-LEVEL KERNEL TRANSFORMATION RULE SET FOR EFFICIENT CACHING ON GRAPHICS HARDWARE - Increasing Streaming Execution Performance with Minimal Design Effort

Sammy Rogmans, Gauthier Lafruit, Philippe Bekaert

Abstract

This paper proposes a high-level rule set that allows algorithm designers to optimize their implementations on graphics hardware with minimal design effort. The rules suggest possible kernel splits and merges to transform the kernels of the original design, resulting in an inter-kernel rather than a low-level intra-kernel optimization. The rules consider both traditional texture caches and next-generation shared memory, which are used in abstract stream-centric paradigms such as CUDA and Brook+, and can therefore be applied implicitly in most generic streaming applications on graphics hardware.
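To make the idea of a kernel split concrete, the following is a minimal CPU-side sketch in Python. It is not one of the paper's rules, which this abstract does not spell out. It assumes a separable box filter as the workload, and all function names (`box_single_kernel`, `row_pass`, `col_pass`, `box_split_kernels`, `reads_per_output`) are hypothetical. It shows one kernel that reads a full 2D window being split into two 1D passes that produce the same result with fewer reads per output element.

```python
def clamp(i, n):
    """Clamp index i to the valid range [0, n - 1] (clamp-to-edge addressing)."""
    return max(0, min(i, n - 1))


def box_single_kernel(img, r):
    """Single kernel: every output sums a full (2r+1) x (2r+1) window."""
    h, w = len(img), len(img[0])
    return [[sum(img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1)
                 for dx in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]


def row_pass(img, r):
    """First kernel after the split: 1D sum along each row."""
    h, w = len(img), len(img[0])
    return [[sum(img[y][clamp(x + dx, w)] for dx in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]


def col_pass(img, r):
    """Second kernel after the split: 1D sum along each column."""
    h, w = len(img), len(img[0])
    return [[sum(img[clamp(y + dy, h)][x] for dy in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]


def box_split_kernels(img, r):
    """Split version: the row kernel feeds the column kernel via an intermediate stream."""
    return col_pass(row_pass(img, r), r)


def reads_per_output(r):
    """Input reads per output element: (single kernel, two split kernels combined)."""
    side = 2 * r + 1
    return side * side, 2 * side


if __name__ == "__main__":
    # Small integer test image, so both versions can be compared exactly.
    image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]
    assert box_single_kernel(image, 2) == box_split_kernels(image, 2)
    print(reads_per_output(2))
```

The split trades an extra pass and an intermediate stream for a smaller per-output read footprint. Whether that trade pays off on a given texture cache or shared-memory configuration is exactly the kind of decision a rule set like the one proposed here would have to guide. A merge is the reverse transformation.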



Paper Citation


in Harvard Style

Rogmans S., Bekaert P. and Lafruit G. (2009). A HIGH-LEVEL KERNEL TRANSFORMATION RULE SET FOR EFFICIENT CACHING ON GRAPHICS HARDWARE - Increasing Streaming Execution Performance with Minimal Design Effort. In Proceedings of the International Conference on Signal Processing and Multimedia Applications - Volume 1: SIGMAP, (ICETE 2009) ISBN 978-989-674-007-8, pages 38-43. DOI: 10.5220/0002188400380043


in Bibtex Style

@conference{sigmap09,
author={Sammy Rogmans and Philippe Bekaert and Gauthier Lafruit},
title={A HIGH-LEVEL KERNEL TRANSFORMATION RULE SET FOR EFFICIENT CACHING ON GRAPHICS HARDWARE - Increasing Streaming Execution Performance with Minimal Design Effort},
booktitle={Proceedings of the International Conference on Signal Processing and Multimedia Applications - Volume 1: SIGMAP, (ICETE 2009)},
year={2009},
pages={38-43},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002188400380043},
isbn={978-989-674-007-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Signal Processing and Multimedia Applications - Volume 1: SIGMAP, (ICETE 2009)
TI - A HIGH-LEVEL KERNEL TRANSFORMATION RULE SET FOR EFFICIENT CACHING ON GRAPHICS HARDWARE - Increasing Streaming Execution Performance with Minimal Design Effort
SN - 978-989-674-007-8
AU - Rogmans S.
AU - Bekaert P.
AU - Lafruit G.
PY - 2009
SP - 38
EP - 43
DO - 10.5220/0002188400380043