Raw Camera Image Demosaicing using Finite Impulse Response

Filtering on Commodity GPU Hardware using CUDA

Patrik Goorts, Sammy Rogmans and Philippe Bekaert

Hasselt University, Expertise Centre for Digital Media, Wetenschapspark 2, 3590 Diepenbeek, Belgium

Keywords:

Demosaicing, Bayer, Finite Impulse Response Filtering, GPU, CUDA.

Abstract:

In this paper, we investigate demosaicing of raw camera images on parallel architectures using CUDA. To

generate high-quality results, we use the method of Malvar et al., which incorporates the gradient for edge-

sensing demosaicing. The method can be implemented as a collection of ﬁnite impulse response ﬁlters, which

can easily be mapped to a parallel architecture. We investigated different trade-offs between memory opera-

tions and processor occupation to acquire maximum performance, and found a clear difference in optimization

principles between different GPU architecture designs. We show that trade-offs are still important and not

straightforward when using systems with massive fast processors and slower memory.

1 INTRODUCTION

Nowadays, there is a large need for real-time image

processing algorithms for various applications. How-

ever, developing algorithms for real-time execution is

not a trivial task, especially if the algorithm is part of

a more complex processing pipeline. Therefore, we

will investigate the real-time aspect of one of such al-

gorithms, demosaicing of raw camera images.

The majority of the cameras nowadays use a CCD

array of sensors where every pixel sensor on the array

can capture only one light intensity. Therefore, acolor

ﬁlter with different colors for every pixel is placed be-

fore the sensor array to capture red, green or blue val-

ues of the color spectrum. An example of such a ﬁlter

is given in ﬁgure 1. The colors are placed in a spe-

ciﬁc pattern i.e. the Bayer pattern (Bayer, 1976). Typ-

ically, there are more green values than red and blue

values, because of the spectral response of the human

eye. The effect of such a color ﬁlter is that every pixel

of the captured image only has a speciﬁc value for

one color channel and the other color channels should

hence be computed from the surrounding pixels. The

calculation of the missing color channels is frequently

called debayering or demosaicing.

In spite of the fact that most cameras perform de-

mosaicing at the device level, it is useful to perform

this processing later on. Firstly, the raw data is only

a third of the demosaiced image; the raw image has

only one channel, instead of three. This will speed-

up the communication between the camera and the

Figure 1: Example of a Bayer pattern.

processing device, thus increasing the overall perfor-

mance. Secondly, demosaicing on devices with more

processing power can result in higher quality images.

More complex algorithms can be applied and less pro-

cessing restrictions apply.

The most straightforward method of demosaicing

is bilinear interpolation of the surrounding pixels. To

calculate the value of a missing color channel, the val-

ues of the surrounding pixels of that color channel are

averaged. This method is fast, but does not yield sharp

results and ignores borders and details, resulting in se-

vere artifacts, e.g. color bleeding.

To generate better results (Laroche and Prescott,

1994) and (Malvar et al., 2004) propose methods

which incorporatesthe gradient of the values per color

channel. Interpolating along an object edge is better

than across an edge, to reduce color artifacts from se-

lecting the color of the wrong objects in the scene.

Hirakawa and Parks present an adaptive homoge-

neity-directed demosaicing algorithm which cancels

aliasing and selects the interpolation direction with

Goorts P., Rogmans S. and Bekaert P..

Raw Camera Image Demosaicing using Finite Impulse Response Filtering on Commodity GPU Hardware using CUDA.

DOI: 10.5220/0004075900960101

In Proceedings of the International Conference on Signal Processing and Multimedia Applications and Wireless Information Networks and Systems

(SIGMAP-2012), pages 96-101

ISBN: 978-989-8565-25-9

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

the least color artifacts. The interpolation direction is

chosen such that a homogeneity metric is maximized

(Hirakawa and Parks, 2005).

We will investigate the real-time aspect of the de-

mosaicing problem. More speciﬁcally, we will dis-

cuss the method of Malvar et al. implemented on

CUDA. This method is chosen because it uses lin-

ear ﬁnite inpulse response (FIR) ﬁltering to produce

high-quality results. FIR ﬁltering is known to map

very well on CUDA (Goorts et al., 2009), which will

maximize the performance, while preserving the qual-

ity. This method is implemented earlier by (McGuire,

2008) using traditional GPGPU paradigms, but these

optimization principles do not map to CUDA.

We will give an overview of the algorithm in sec-

tion 2 and an overviewof CUDA in section 3. We will

further present our method in section 4 ans ultimately

present the results in section 5.

2 DEMOSAICING WITH LINEAR

FILTERING

To compose demosaiced images, we used the algo-

rithm proposed by (Malvar et al., 2004). We will

brieﬂy describe the algorithm here.

Malvar et al. propose a non-directional demosaic-

ing method implementable by a FIR convolution ﬁl-

ter. Typical demosaicing algorithms, like bilinear

interpolation, only use green ﬁltered pixels for the

green channel, red ﬁltered pixels for the red channel,

etcetera. The method of Malvar et al., on the other

hand, also incorporates pixels where the ﬁltered color

differs from the current channel.

This is obtained by applying a pattern to the pixel

and its neighbors. The different patterns and their

respective pixel weights are shown in ﬁgure 2. The

pattern used is dependent on the ﬁltered color of the

pixel and the desired color channel. For every pixel,

three patterns are applied (one for every color chan-

nel), where one pattern is trivial.

These patterns are designed to improve the result

around edges by incorporating the gradient of the lu-

minance values. When we calculate the green value

on a red ﬁltered pixel, for example, we do not dis-

card the red value. The red value is used to calcu-

late the luminance change (using adjacent red values)

and this is incorporated when calculating the green

value. Thus, we calculate the bilinear interpolation of

the green pixels around the red ﬁltered pixel and use

the red ﬁltered pixels to correct this interpolation for

edges. The same principle holds for different color

channels.

Figure 2: Convolution ﬁlters for demosaicing. The choice

of ﬁlter is based on the desired color channel for that pixel

(column) and the ﬁlter used for that pixel (row).

These patterns can easily be used as ﬁnite impulse

response (FIR) ﬁlters and implemented as such.

3 CUDA ARCHITECTURE

To obtain real-time demosaicing of images, we use

commodity GPUs for processing. More speciﬁcally,

we use the CUDA framework provided by NVIDIA.

Nowadays, commodity GPUs are exposed as a col-

lection of single-instruction multiple-thread (SIMT)

streaming processors, which easily allows parallel

general purpose applications. To use these GPUs

for general purpose applications, the algorithm must

be split up in elementary threads that all execute the

same code, but on different data. All threads share a

block of off-chip global memory, which is accessible

by the host CPU. The access to this global memory

is slow and should be avoided as much as possible.

To allow optimization, different threads are grouped

together in blocks, where it is possible to reuse data

loaded from global memory by using a type of shared

memory. Access to the shared memory is fast, but

limited to a block. All blocks have different shared

memories and no fast communication between blocks

is therefore possible.

RawCameraImageDemosaicingusingFiniteImpulseResponseFilteringonCommodityGPUHardwareusingCUDA

This paradigm is consistently mapped to the GPU

hardware. The GPU is a collection of multiproces-

sors, where each multiprocessor contains one instruc-

tion decoder, but multiple streaming processors. Each

streaming processor in a multiprocessor thus executes

the same instruction, but using different data. All

multiprocessors contain a local on-chip fast mem-

ory, accessible by all streaming processors. Every

multiprocessor is furthermore connected to an off-

chip GPU-wide memory block using a relatively slow

memory bus.

Every block of the execution model maps on one

multiprocessor and every thread of a block can be ex-

ecuted by one streaming processor. Because it is pos-

sible to have more threads in a block than there are

streaming processors available, a thread scheduler is

available to suspend and resume threads. The group

of threads that is executed simultaneously is called a

warp, and is well-deﬁned by the number of stream-

ing processors to allow performance optimizations. It

is furthermore possible to have multiple blocks reside

on a single multiprocessor. These blocks are com-

pletely separated from each other and can not com-

municate in any faster way as blocks on different mul-

tiprocessors can communicate.

Because this distributed shared memory architec-

ture is well-known, it is possible to optimize the algo-

rithm for maximum performance. Firstly, it is possi-

ble to coalesce memory accesses of the global mem-

ory of different threads in one memory call. This will

reduce the load on the memory bus and increase the

overall speed. Secondly, it is possible to optimize

the number of blocks and threads per block to as-

sure every processor is kept busy and no valuable pro-

cessor cycles are lost. The effective fraction of used

streaming processors is called the occupancy. Enough

blocks and threads should be deﬁned to utilize every

streaming processor.

To increase occupancy, memory latency should be

hidden. Threads will become idle when waiting for

the results of a global memory fetch, which can take

hundreds of clock cycles. Therefore, the idle threads

can be swapped out and other threads can continue

their execution. Because the threads can also use data

from other threads in the same block, it will happen

that the majority of the threads are waiting on a few

threads, which in turn are waiting for the result of

their global memory fetch. To counter this idle pro-

cessor time, the multiprocessor can decide to fetch in

another block of threads. This is only possible when

the blocks are small enough to allow the presence of

multiple blocks on the multiprocessor.

The number of blocks per multiprocessor and the

number of threads per block is limited by the com-

Figure 3: Implementation strategies for FIR ﬁltering on par-

allel SIMT architectures. (a) Naive strategy. Every thread

fetches all required data, resulting in multiple slow memory

accesses per thread. (b) Optimized strategy. Every thread

fetches only one data element and stores this in a shared

memory. Because other threads fetch the other data ele-

ments, data reuse is possible, resulting in less slow memory

accesses.

pute capability of the hardware. More recent compute

capabilities allow more ﬂexibility and more simulta-

neous threads, thus the optimizations are dependent

hereof.

4 IMPLEMENTATION

We implemented the method of (Malvar et al., 2004)

using CUDA. Because this method is a specialization

of generic FIR ﬁltering, we will discuss this ﬁrst.

4.1 Generic FIR Filtering

We can use the parallel direct GPU computing ar-

chitecture to implement linear FIR ﬁltering with the

aid of a user-managed cache i.e. the shared memory

(Goorts et al., 2009). When implementing, for exam-

ple, a 3 × 3 ﬁlter without optimizations, we can al-

locate one thread for every pixel of the image. This

thread will access 9 pixels of the image in global

memory to calculate the ﬁnal result for the allocated

pixel (see Figure 3 (a)). However, it is possible to re-

duce the memory accesses by reusing the information

of nearby threads, i.e. the thread loads only his allo-

cated pixel in shared memory and can then use the

values of nearby pixels which are loaded in shared

memory by other threads. Therefore, the amount of

global accesses per thread is reduced to one, which is

shown in Figure 3 (b).

Since the size of the blocks is limited and some

threads at the borders of the blocks don’t have enough

data available to calculate the ﬁltered value, we must

create extra threads at the borders that only read pixel

informationand hereby will not calculate a newvalue.

This way, these threads won’t need the value for

neighboring threads. The set of these speciﬁc threads

is called the apron (see Figure 4).

SIGMAP2012-InternationalConferenceonSignalProcessingandMultimediaApplications

Figure 4: One block for ﬁltering a part of the image. The

threads at the border (red) are inside the apron and do not

calculate new values for their pixels. They only load data

for use by the internal threads (blue).

4.2 FIR Filtering for Demosaicing

We have ported the demosaicing principles of (Malvar

et al., 2004) on the GPU using CUDA. The algorithm

is in essence a 2D linear FIR ﬁltering. We already in-

vestigated FIR ﬁltering using CUDA in our previous

research (Goorts et al., 2009). Because the kernels are

small, separating the kernels in 1D ﬁlters or using the

Fourier transforms will not result in a speedup. There-

fore, we will only use direct, straightforward ﬁltering.

However, the effect of the number of threads per

block and the trade-off between the total size of the

apron and the occupancy must be investigated. When

we havea small amount of threads per block, the over-

all amount of threads in the aprons is large, and a lot

of accesses to global memory must be made. How-

ever, as the blocks are small, this allows for multiple

blocks per multiprocessor, and enough blocks to uti-

lize every available multiprocessor. When the number

of threads per block is large, the overall number of

threads in the aprons is smaller, but less blocks can be

deﬁned and some multiprocessors can become idle or

no memory latency hiding can be employed. There-

fore, we will investigate what the optimal number of

threads per block is to maximize performance.

To avoid divergent branching, we deﬁned a thread

per square of four pixels. This way, every thread will

process the same kind of data and no selection of the

ﬁlter is needed; every thread uses all 12 ﬁlters.

5 EXPERIMENTAL RESULTS

We performed our experiments on two devices using

CUDA version 4.1. The ﬁrst device is an NVIDIA

GeForce 8800 GT with compute capability 1.1 and

112 streaming processors at 600 MHz; the second de-

vice is an NVIDIA GTX 580 with compute capability

2.0 and 512 streaming processors at 772 MHz. The

input images have a resolution of 1600× 1200. To

reduce the effects of low-level process management

by the operating system, we executed every conﬁgu-

ration 5000 times and computed the average running

time for a single execution.

We will present different results for both compute

capabilities to stress the effect of changes in the ar-

chitecture while it evolves to more advance massive

parallelism.

5.1 Compute Capability 1.1

The measured execution times of the different conﬁg-

urations are shown in Figure 5. In this graph, we only

discuss block widths that allow data coalescing; other

widths will result in uncoalesced reads and decrease

the performance severely.

Two effects are noticed: ﬁrst, the general perfor-

mance for very small blocks is high. This is caused

by the occupancy of the multiprocessors; there are

enough blocks to ﬁll every multiprocessor and the

memory footprint is small enough to allow multiple

blocks per multiprocessor. Multiple blocks per mul-

tiprocessor allow for effective hiding of the memory

latencies caused by starting fetches from global mem-

ory for other parallel threads. The performance is

higher compared with larger blocks, which is counter-

intuitive as smaller blocks result in increased number

of apron threads and thus more memory fetches.

The second effect is the almost constant execution

time after a certain block size (shown as the dotted

line on Figure 5). Increasing the block size will pre-

vent allocating multiple blocks to one multiprocessor,

and will hence decrease the performance. Neverthe-

less, when the blocks become larger, the overall num-

ber of threads in the aprons will decrease, simultane-

ously reducing the amount of global memory fetches.

Ergo, after the performance drop due to the reduced

occupancy, the performance will increase again, but

remains lower than small block sizes.

5.2 Compute Capability 2.0

The measured execution times of the different conﬁg-

urations are shown in Figure 6. The results are in-

terestingly different compared to compute capability

1.1; the most performant conﬁguration is no longer

the smallest block size. The total number of simul-

taneous threads raises due to the increased warp size,

thus the total amount of simultaneous global memory

reads increases. The latency becomes too high to ef-

fectively hide it with more threads. Therefore, it is

better to reduce the total number of memory fetches

while keeping the occupancy as high as possible by

RawCameraImageDemosaicingusingFiniteImpulseResponseFilteringonCommodityGPUHardwareusingCUDA

Figure 5: Results for compute capability 1.1. Every graph represents the width of the block, only considering coalesced

conﬁgurations. The height is varied on the horizontal axis.

Figure 6: Results for compute capability 2.0. Every graph represents the width of the block, only considering coalesced

conﬁgurations. The height is varied on the horizontal axis.

making sure that multiple blocks per multiprocessor

can be executed independently in succession.

Slightly increasing the block size does not de-

crease occupancy immediately. The speciﬁcations of

compute capability 2.0 provide more ﬂexibility, thus

allowing larger block sizes. Therefore, we see a more

distinct effect on the occupancy, which is clearly visi-

ble in Figure 6. This phenomenon manifests for block

sizes of 64x9 (576 threads), 32x16 (512 threads) and

16x31 (496 threads), where the performancesuddenly

drops signiﬁcant (shown as the dotted lines on Fig-

ure 6). The reason is that less blocks are allocated

per multiprocessor, thus impeding on the advantage

of memory latency hiding.

6 CONCLUSIONS

In this paper, we investigated the problem of demo-

saicing on CUDA using FIR convolution, and which

trade-offs must be made. We found that there is a

clear difference between compute capability 1.1 and

2.0, two of the most common CUDA hardware plat-

forms.

SIGMAP2012-InternationalConferenceonSignalProcessingandMultimediaApplications

100

Compute capability 1.1 has strict rules for coalesc-

ing and actual achievable occupancy. Therefore, it is

more performant to hide memory latency and raise

the performance by using small block sizes, even if

thereby the memory accesses increase signiﬁcantly.

Compute capability 2.0 allows more simultaneous

threads and has more ﬂexibility, thus automatically

increasing the occupancy by allowing more blocks

per multiprocessor. However, the throughput of the

threads is too high to effectively hide all memory

latency, thus occupancy is decreased for very small

block sizes. Therefore, the amount of memory fetches

can be decreased without affecting the occupancy,

which increases the performance. This is valid until a

speciﬁc threshold is reached. Crossing this threshold,

the number of blocks per multiprocessor decreases,

and the performance drops signiﬁcantly.

These different results prove that the memory wall

for systems with slow memory and fast processors, as

stated by (Asanovic et al., 2006), still holds and that

the effect becomes more distinct when the individ-

ual processor capabilities increase, and the number of

processors increase faster than the speed of the mem-

ory. The trade-offs between processing and memory

accesses are important and must always be properly

investigated to reach maximum performance.

ACKNOWLEDGEMENTS

Patrik Goorts would like to thank the IWT for its PhD

specialization bursary.

REFERENCES

Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J.,

Husbands, P., Keutzer, K., Patterson, D. A., Plishker,

W. L., Shalf, J., Williams, S. W., and Yelick, K. A.

(2006). The Landscape of Parallel Computing Re-

search: A View From Berkeley. Electrical Engineer-

ing and Computer Sciences, University of California

at Berkeley, 18(183):19.

Bayer, B. (1976). Color imaging array. US Patent

3,971,065.

Goorts, P., Rogmans, S., and Bekaert, P. (2009). Optimal

data distribution for versatile ﬁnite impulse response

ﬁltering on next-generation graphics hardware using

cuda. In Parallel and Distributed Systems (ICPADS),

2009 15th International Conference on, pages 300–

307. IEEE.

Hirakawa, K. and Parks, T. (2005). Adaptive homogeneity-

directed demosaicing algorithm. Image Processing,

IEEE Transactions on, 14(3):360–369.

Laroche, C. and Prescott, M. (1994). Apparatus and method

for adaptively interpolating a full color image utilizing

chrominance gradients. US Patent 5,373,322.

Malvar, H., He, L., and Cutler, R. (2004). High-quality lin-

ear interpolation for demosaicing of bayer-patterned

color images. In Acoustics, Speech, and Signal Pro-

cessing, 2004. Proceedings.(ICASSP'04). IEEE Inter-

national Conference on, volume 3, pages 485–488.

IEEE.

McGuire, M. (2008). Efﬁcient, high-quality bayer demo-

saic ﬁltering on gpus. Journal of Graphics, GPU, and

Game Tools, 13(4):1–16.

RawCameraImageDemosaicingusingFiniteImpulseResponseFilteringonCommodityGPUHardwareusingCUDA

101