6 CONCLUSIONS
The GPU version running on an OpenCL-enabled graphics
card denoises a whole patient dataset in minutes
instead of hours and introduces none of the artifacts
caused by the aggressive optimizations needed on the
CPU. We therefore consider it a good approach for
such computationally intensive problems.
We have also verified that the optimized version
(with voxel selection), which yields about an order of
magnitude speedup on the CPU, is actually slower on
the GPU. The conditional jumps break thread coherency,
and the limited size of the on-chip memory leads to
lengthy memory fetches.
A full evaluation of quality is not given in this
paper; it can be found in (Coupe et al., 2008).
Profiling has shown that we have successfully
eliminated the memory bandwidth problem: on the GT200
architecture there is room for only about 8%
improvement before reaching the theoretical peak
performance.
However, on the new GF104 the performance is
about half of the theoretical peak. This is caused by
heavy usage of the on-chip memory: two warps do
not fit on a single SM, which prevents the hardware
from executing 4 instructions per cycle. We have not
found a way to reach this limit without drastically
increasing the bandwidth dependency or degrading the
quality of denoising.
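The residency limit described above can be estimated with simple arithmetic. The following is a minimal sketch under stated assumptions, not the configuration used in this work: the 48 KB on-chip budget, the cap of 8 resident units, and the per-unit footprints are all assumed values, and the helper `resident_units` is hypothetical. Register pressure and thread-count limits are ignored.

```python
def resident_units(on_chip_bytes: int, bytes_per_unit: int,
                   hw_limit: int = 8) -> int:
    """How many units (warps or work-groups) fit on one multiprocessor.

    Only the on-chip memory budget and a fixed hardware cap are modeled;
    register pressure and thread-count limits are ignored.
    """
    if bytes_per_unit <= 0:
        raise ValueError("bytes_per_unit must be positive")
    return min(hw_limit, on_chip_bytes // bytes_per_unit)


if __name__ == "__main__":
    KB = 1024
    budget = 48 * KB  # assumed on-chip memory per multiprocessor
    for usage_kb in (8, 20, 30):  # hypothetical per-unit footprints
        n = resident_units(budget, usage_kb * KB)
        print(f"{usage_kb} KB per unit: {n} resident")
```

As soon as one unit claims more than half of the on-chip budget, only a single unit stays resident, and the schedulers have nothing else to issue from while it waits. Shrinking the footprint restores residency, but the data that no longer fits on chip must then be fetched from global memory, which is the trade-off noted in the text.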
Future improvements may include implementing
the blockwise approach on the GPU (which would very
probably bring a substantial performance improvement
at the cost of result quality) and overcoming the
one-warp-per-multiprocessor limit when running on
GF104 and newer architectures. Another option is a
GPU implementation of (Darbon et al., 2008), which
has even lower computational complexity than the
original NLM.
Figure 4: Example of denoised result: (left) Original data.
(right) Denoised data with OpenCL GPU implementation
of the original algorithm.
ACKNOWLEDGEMENTS
This work was supported by the Grant Agency of
Charles University, Prague (project number 121409).
REFERENCES
Buades, A., Coll, B., and Morel, J. (2005). A review of im-
age denoising algorithms, with a new one. Multiscale
Modeling and Simulation, 4(2):490–530.
Coupe, P., Yger, P., Prima, S., Hellier, P., Kervrann, C., and
Barillot, C. (2008). An optimized blockwise nonlocal
means denoising filter for 3-D magnetic resonance
images. IEEE Transactions on Medical Imaging, 22.
Darbon, J., Cunha, A., Chan, T. F., Osher, S., and Jensen,
G. J. (2008). Fast nonlocal filtering applied to electron
cryomicroscopy. In ISBI’08, pages 1331–1334.
Federle, M. (2007). CT of the small intestine: Enterography
and angiography. Applied Radiology.
Gallagher, N. and Wise, G. (1981). A theoretical analysis of
the properties of median filters. IEEE Transactions on
Acoustics, Speech, and Signal Processing, ASSP-29(6).
Gu, J., Zhang, L., Yu, G., Xing, Y., and Chen, Z. (2006).
X-ray CT metal artifacts reduction through curvature
based sinogram inpainting. Journal of X-Ray Science
and Technology, 14(2):73–82.
Kharlamov, A. and Podlozhnyuk, V. (2007). Image denoising.
Technical report, NVIDIA Corporation.
Khronos OpenCL Working Group (2009). The OpenCL
specification. Technical report, Khronos OpenCL
Working Group.
NVIDIA Corporation (2008). OpenCL Programming guide.
Technical report, NVIDIA Corporation.
NVIDIA Corporation (2009). OpenCL Best practices
guide. Technical report, NVIDIA Corporation.
Paulsen, S., Huprich, J., Fletcher, J., Booya, F., Young, B.,
Fidler, J., Johnson, C., Barlow, J., and Earnest IV,
F. (2006). CT enterography as a diagnostic tool in
evaluating small bowel disorders: Review of clini-
cal experience with over 700 cases. RadioGraphics,
26(3):641–657.
Rudin, L. and Osher, S. (1981). Total variation based image
restoration with free local constraints. IEEE Transac-
tions on Image Processing, 1:31–35.
Rudin, L., Osher, S., and Fatemi, E. (1992). Nonlinear total
variation based noise removal algorithms. Physica D:
Nonlinear Phenomena, 60.
IMAGAPP 2011 - International Conference on Imaging Theory and Applications