GPU ACCELERATED REAL-TIME OBJECT DETECTION ON HIGH
RESOLUTION VIDEOS USING MODIFIED CENSUS TRANSFORM
Salih Cihan Tek and Muhittin Gökmen
Department of Computer Engineering, Istanbul Technical University, Maslak, Istanbul, Turkey
Keywords:
Object Detection, Face Detection, GPU Acceleration, CUDA.
Abstract:
This paper presents a novel GPU accelerated object detection system using CUDA. Because of its detection
accuracy, speed and robustness to illumination variations, a boosting based approach with Modified Census
Transform features is used. The system is evaluated on the face detection problem. Results show that
even our single-GPU implementation can run in real-time on high resolution video streams without sacrific-
ing accuracy, and outperforms the single-threaded and multi-threaded CPU implementations for resolutions
ranging from 640 × 480 to 1920 × 1080 by factors of 12-18x and 4-6x, respectively.
1 INTRODUCTION
Real-time object detection has been a very active
topic of research in the last decade. The primary rea-
son for this interest is the large number of possible
real-world applications in both commercial and non-
commercial systems. Demanding applications such
as virtual reality, surveillance, video conferencing
and robotics require the system to
be able to run in real-time, making the speed of the
object detector as important as the accuracy.
Numerous algorithms have been developed that
can run in real-time provided that the input resolu-
tion is relatively small. Most of them are descendants
of the object detection algorithm proposed in (Viola
and Jones, 2004), which is based on scanning the im-
age with a sliding window at multiple scales and us-
ing a cascaded classifier on each location to obtain
the locations containing the searched object. One ex-
ample of these algorithms is (Fröba and Ernst, 2004),
the algorithm this research is based on. Even though
these algorithms are fast, they are not fast enough to
run in real time on video streams having a high res-
olution like 1280 × 720 without sacrificing the accu-
racy or generalization ability of the system in ques-
tion. Considerable amount of effort has been made
to speed up these algorithms, but algorithmic modi-
fications by themselves are not enough to get drastic
speed improvements required.
Another approach that is becoming increasingly
common is to accelerate the algorithms using the
GPU. In (Sharma et al., 2009), the authors followed
this approach by implementing the Viola-Jones algo-
rithm using CUDA. They achieved a detection rate of
19 Frames Per Second (FPS) on a video stream of res-
olution 1280 × 960 on a GTX 285 GPU. Although
there is an important increase in speed, the 81% de-
tection rate with 16 false positives on the CMU test
set shows that accuracy has been sacrificed.
In (Hefenbrock et al., 2010), a multi-GPU implemen-
tation of the Viola-Jones algorithm is presented that
runs at 15.2 FPS at 640 × 480 resolution on 4 Tesla
GPUs. Another CUDA accelerated Viola-Jones algo-
rithm given in (Obukhov, 2004) performs at 14 and 8
FPS on a GTX 480 for resolutions 1280 × 720 and
1920 × 1080, respectively. The scaling factors are
limited to be integers, which means that the detec-
tor will not be able to detect small faces accurately.
In (Harvey, 2009), the author achieved 2.8 FPS on a
single GTX 280 GPU and 4.3 FPS on a dual GTX
295 GPU on VGA image sizes with another CUDA
implementation of the Viola-Jones algorithm. (Her-
out et al., 2010) is the only GPU accelerated object
detection algorithm to date that uses a different fea-
ture and different classifier structure than (Viola and
Jones, 2004). The reported frame rate is 58 FPS for
1280 × 720 resolution on a GTX 280 GPU with no
information about the accuracy.
In this paper, we follow a similar approach, but
use a different algorithm. We present a GPU imple-
mentation of a boosting based object detection algo-
rithm that uses illumination-robust MCT (Modified
Census Transform) (Fr¨oba and Ernst, 2004) based
classifiers using CUDA. Our implementation per-
forms all major steps of the algorithm on the GPU and
hence requires a minimal amount of memory transfers
between the host and the GPU. We evaluate the per-
formance of our implementation on face detection.
2 DETECTION ALGORITHM
MCT (Fröba and Ernst, 2004) is a transform that
maps the pixel values in a given neighborhood
N to a binary string. The binary values corresponding
to a pixel location are obtained by comparing the pixel
values in the neighborhood with their mean. Let $N(x)$
be a spatial neighborhood centered at the pixel loca-
tion $x$ and $\bar{I}(x)$ be the mean of the pixel intensity val-
ues on this neighborhood. If $\bigotimes$ is the concatenation
operator, then the MCT can be defined as follows:

$$\Gamma(x) = \bigotimes_{y \in N(x)} \zeta\big(\bar{I}(x), I(y)\big) \qquad (1)$$

where the comparison function $\zeta$ is defined as

$$\zeta\big(\bar{I}(x), I(y)\big) = \begin{cases} 1, & \bar{I}(x) < I(y) \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
In order to obtain memory efficient classifiers, we
use a neighborhood size of 3 × 3, which leads to 9-bit
MCT values in the range [0, 510]. These val-
ues correspond to indices γ of local structure kernels
that can encode structural information about the im-
age like edges, ridges, etc. in binary form.
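
To make the transform concrete, the following minimal CUDA sketch (ours, not the authors' code) computes the 9-bit MCT value of a single interior pixel directly from Equations (1) and (2); the row-major 8-bit grayscale layout and the function name are our assumptions.

// Minimal sketch: 3x3 MCT value gamma for an interior pixel (x, y),
// following Equations (1) and (2). Assumes a row-major grayscale image.
__host__ __device__ inline unsigned short mct3x3(const unsigned char* img,
                                                 int width, int x, int y)
{
    // Mean intensity over the 3x3 neighborhood N(x) centered at (x, y).
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += img[(y + dy) * width + (x + dx)];
    float mean = sum / 9.0f;

    // Concatenate one comparison bit per neighbor (zeta = 1 iff mean < pixel).
    unsigned short gamma = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            gamma = (gamma << 1) | (mean < img[(y + dy) * width + (x + dx)]);
    return gamma;   // 9-bit index in [0, 510]; 511 cannot occur, since the
                    // mean can never be strictly below all nine pixels
}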
An MCT-based weak classifier $h_x$ consists of a co-
ordinate pair $x = (x, y)$ relative to the origin of the
24 × 24 fixed-size scanning window and a 511-element
lookup table. The lookup table contains a weight for
each $\gamma$ such that $0 \le \gamma \le 510$. The output of a weak clas-
sifier on a window is determined by calculating
$\gamma$ from the neighborhood centered at $x$ and reading the
corresponding value from the lookup table.
The detector has a cascaded structure, where each
stage contains a strong classifier composed of a num-
ber of weak classifiers, each of which has a different
position $x$. $H_j(\Gamma)$, the strong classifier of the $j$-th
stage, is defined as follows:

$$H_j(\Gamma) = \sum_{x \in W} h_x(\Gamma(x)) \qquad (3)$$

where $W$ is the set of unique positions used by
the weak classifiers $h_x$ in the strong classifier. A win-
dow passes the $j$-th stage if the sum of the responses
of the weak classifiers on that window is less than or
equal to the stage threshold $T_j$. A window is classified
as the searched object if it passes all stages. The cas-
cade used in this work has 5 stages utilizing 10, 20,
40, 80, and 217 positions.
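
As a hedged illustration of Equation (3), the following CUDA sketch (ours, not the paper's code) evaluates one cascade stage on a window of a precomputed MCT image; the Stage layout and all names are assumptions made for the example.

// Assumed stage layout: one offset and one 511-entry weight row per weak
// classifier. These names and this layout are ours, not the paper's.
struct Stage {
    int numWeak;            // number of weak classifiers in this stage
    const int2* positions;  // offsets x inside the 24x24 window
    const float* lut;       // numWeak rows of 511 weights each
    float threshold;        // stage threshold T_j
};

// Evaluate Equation (3) on the window whose top-left corner is (wx, wy)
// in the precomputed MCT image 'mct' (pitch = elements per row).
__device__ bool passStage(const Stage& s, const unsigned short* mct,
                          int pitch, int wx, int wy)
{
    float sum = 0.0f;
    for (int i = 0; i < s.numWeak; ++i) {
        int2 p = s.positions[i];
        unsigned short gamma = mct[(wy + p.y) * pitch + (wx + p.x)];
        sum += s.lut[i * 511 + gamma];   // weak classifier = one table lookup
    }
    return sum <= s.threshold;           // the window passes the stage
}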
The detection process involves using the classifier
cascade at each location on the input image at multi-
ple scales. As in any other sliding window approach,
there are scanning parameters like the horizontal and
vertical step sizes ($\Delta x$ and $\Delta y$), the scale factor ($s$) and
the starting scale. In this work, we choose $\Delta x = \Delta y = 1$,
$s = 1.15$ and use a starting scale of 1, which results in
a computationally demanding, fine-grained scanning
process and makes it possible to detect faces as small
as 24 × 24 pixels accurately.
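
As a back-of-the-envelope sketch (our derivation, not the paper's), the number of pyramid levels implied by $s = 1.15$ and the 24 × 24 window can be computed as below, giving roughly 28 levels for a 1920 × 1080 frame.

#include <algorithm>
#include <cmath>

// Levels needed so that a face filling the frame still fits the 24x24
// window at the coarsest scale (an illustration, not the paper's code).
int numPyramidLevels(int w, int h, float s = 1.15f)
{
    float shrink = std::min(w, h) / 24.0f;   // maximum total downscale
    return (int)std::floor(std::log(shrink) / std::log(s)) + 1;
}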
3 CUDA IMPLEMENTATION
3.1 Preprocessing
In order for the detector to detect objects at various
sizes, the input image needs to be scanned in multiple
scales. In (Viola and Jones, 2004), this is performed
by scanning the same input image with up-scaled
classifiers. The approach we followed is to construct
an image pyramid and scan it using a fixed size win-
dow to prevent sparse global memory accesses that
would have been required by the former approach at
larger scales. The whole image pyramid is stored as
a single image as shown on the left in Figure 1. Even
though this layout has empty regions that increase the
number of windows scanned, it greatly simplifies the
scanning process described in the next section. The
regions having a constant gray level are easily elimi-
nated by the first stage of the cascade.
The image at each level of the pyramid is con-
structed by binding the image from the previous level
to a texture to take advantage of the hardware bilinear
interpolation capability of the GPU, and then down-
sampling it according to the scale factor.
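
A minimal sketch of one such downsampling step is given below (our code, not the paper's): the source level is read through a CUDA texture object created with cudaFilterModeLinear and normalized float reads, so the bilinear interpolation happens in the texture hardware; the kernel name and parameters are our assumptions.

// One pyramid level -> next level, sampling the source through a bilinear
// texture. 'scale' is the source-to-destination size ratio (here 1.15).
__global__ void downsampleLevel(cudaTextureObject_t texSrc,
                                unsigned char* dst, int dstW, int dstH,
                                size_t dstPitch, float scale)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // The texture unit performs the bilinear interpolation in hardware;
    // reads return normalized floats in [0, 1].
    float v = tex2D<float>(texSrc, (x + 0.5f) * scale, (y + 0.5f) * scale);
    dst[y * dstPitch + x] = (unsigned char)(v * 255.0f + 0.5f);
}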
Evaluation of a weak classifier becomes a simple
memory lookup when the MCT values are precom-
puted. To take advantage of this property, the whole
image pyramid is transformed with MCT beforehand
by launching a kernel with a block size of 16 × 16.
Each thread block pulls in a patch from the pyramid
in global memory to its dedicated shared memory be-
fore further processing. Each thread in a block pulls
in a single pixel from the global memory. Since the
MCT computation at the edges and corners requires
additional pixels, threads at the edges and corners
pull in additional pixels. After getting the data to
shared memory, threads in each block are synchro-
nized. Then each thread computes the value corre-
sponding to its location and writes the result to the
global memory. At this point, the original image pyra-
mid is no longer needed and hence can be discarded.
Figure 1: Image pyramid and the result of transforming it
with MCT.

In the remaining sections, the term "image pyramid"
refers to the one containing the result of the MCT,
which is shown on the right side of Figure 1.
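
The following CUDA sketch (ours; border handling and the halo-loading scheme are simplified relative to the description above) shows the shape of such an MCT kernel: each 16 × 16 block stages an 18 × 18 patch in shared memory, synchronizes, and then each thread writes one 9-bit value.

#define TILE 16

__global__ void mctKernel(const unsigned char* src, unsigned short* dst,
                          int width, int height)
{
    // 16x16 tile plus a 1-pixel halo for the 3x3 neighborhoods at the rim.
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    // Cooperatively stage the (TILE+2)x(TILE+2) patch, clamping at borders
    // (a strided load; the paper has edge threads fetch the extra pixels).
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        int ly = i / (TILE + 2), lx = i % (TILE + 2);
        int gx = min(max((int)(blockIdx.x * TILE) + lx - 1, 0), width - 1);
        int gy = min(max((int)(blockIdx.y * TILE) + ly - 1, 0), height - 1);
        tile[ly][lx] = src[gy * width + gx];
    }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    // Mean of the 3x3 neighborhood, read from shared memory.
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tile[ty + dy][tx + dx];
    float mean = sum / 9.0f;

    // Concatenate the 9 comparison bits and write the MCT value.
    unsigned short gamma = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            gamma = (gamma << 1) | (mean < tile[ty + dy][tx + dx]);
    dst[y * width + x] = gamma;
}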
During the scanning process, the cascade data will
be heavily accessed and therefore the speed of access-
ing them is crucial for the performance. Since the
size of the constant memory is too small to store a cas-
cade containing more than 32 weak classifiers, it is
stored in the texture memory as a 2D floating point
texture. Each row starts with a single floating point
value containing the x and y coordinates of the weak
classifier in its upper and lower 2 bytes, respectively.
The rest of the row contains the weights of the lookup
table. This approach helps to reduce texture cache
misses by making the accesses to the feature weights
as spatially local as possible in the 2D space.
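
As a small illustration of this packing (our helper names; the paper does not give code), the coordinates can be stored and recovered by reinterpreting the bits of the float rather than by numeric conversion:

// Pack x into the upper and y into the lower 16 bits of a 32-bit word,
// then reinterpret the bits as a float for storage in the texture row.
__device__ inline float packXY(unsigned short x, unsigned short y)
{
    unsigned int bits = ((unsigned int)x << 16) | y;
    return __int_as_float((int)bits);    // bit reinterpretation, no rounding
}

__device__ inline void unpackXY(float f, int& x, int& y)
{
    unsigned int bits = (unsigned int)__float_as_int(f);
    x = (int)(bits >> 16);               // upper 2 bytes
    y = (int)(bits & 0xFFFFu);           // lower 2 bytes
}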
3.2 Object Detection
The detection process involves scanning the image
pyramid with a fixed size sliding window, classifying
the window at each location and writing results back to
the device memory. We found that the best perfor-
mance overall is obtained when making each thread
classify a single window on the image pyramid. As
a natural consequence of using a cascaded classifier
structure, the number of stages that will be evalu-
ated in a window cannot be predetermined. Threads
that classify their windows as negative in early stages
have to wait idle until all other threads in the same
block finish their tasks, only after which the process-
ing of a new block can be started. This results in
under-utilization of GPU resources. In order to over-
come this problem, we split the cascade into 2 smaller
groups and use 2 different detection kernels for scan-
ning. Each kernel uses one of the groups obtained in
the previous step for classification. After performing
a number of experiments, we found that the best loca-
tion for the split is after the 2nd stage.
The first detection kernel is launched with a grid
having as many threads as the number of windows
that need to be scanned and a block size of 16 × 16.
When a window is classified as the searched object,
its coordinates are written to the corresponding loca-
tion in a preallocated 1D array of floats that resides
in global memory. Each element of this array can
store the x and y coordinates of a window in its up-
per and lower two bytes, respectively. Since detec-
tions are rare, this array contains sparse data after the
execution of the kernel finishes. Most of its elements
still contain dummy values that are set when it is first
allocated. To solve this problem, we use a stream
compaction algorithm that copies all elements having
meaningful values to the beginning of the array. Then
we launch another detection kernel with a 1D grid of
thread blocks, each one containing a 1D array of 256
threads. Each thread classifies a single window whose
coordinates are fetched from the array generated by
the first kernel. The whole process is shown in Figure
2 for the first 10 windows in the input image.
Figure 2: Steps of scanning with 2 kernels, shown for the
first 10 windows: the first kernel writes the origins of the
positive windows (here (1,0), (3,0), (4,0) and (8,0)) to a
sparse output array holding dummy values elsewhere, and
stream compaction gathers them into the dense array con-
sumed by the second kernel. Window locations that are
detected as positive are highlighted.
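
A hedged sketch of this pipeline is given below (our code; it reuses the passStage and packXY sketches from the earlier sections, and the sentinel value and compaction primitive are our choices, since the paper does not name them). The launch is flattened to 1D for brevity, whereas the paper uses 16 × 16 blocks.

// First kernel: each thread pushes one window through the first two stages
// and writes its packed origin, or a dummy sentinel, to the sparse array.
__global__ void detectStages12(const unsigned short* mct, int pitch,
                               int winsPerRow, int numWindows,
                               Stage s1, Stage s2, float* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flattened thread id
    if (idx >= numWindows) return;
    int wx = idx % winsPerRow, wy = idx / winsPerRow;

    bool positive = passStage(s1, mct, pitch, wx, wy)
                 && passStage(s2, mct, pitch, wx, wy);
    out[idx] = positive ? packXY(wx, wy) : -1.0f;     // -1.0f = dummy value
}

#include <thrust/copy.h>
#include <thrust/device_ptr.h>

struct IsDetection {
    __host__ __device__ bool operator()(float v) const { return v != -1.0f; }
};

// Host side: compact the sparse array so the second detection kernel reads
// dense window coordinates; returns the number of surviving windows.
int compactDetections(float* d_sparse, float* d_dense, int numWindows)
{
    thrust::device_ptr<float> in(d_sparse), out(d_dense);
    return (int)(thrust::copy_if(in, in + numWindows, out,
                                 IsDetection()) - out);
}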
3.3 Utilizing Multiple GPUs
We have extended our implementation to reduce the
detection times even further on devices that have mul-
tiple GPUs. We spawn M + 1 CPU threads, where M
is the number of GPUs in the system. The main thread
acquires frames from the video stream and does pre-
liminary computations. Each one of the other threads
performs kernel launches and memory transfers be-
tween the GPU assigned to it and the host. Each
GPU generates several levels of the image pyramid
and scans only those levels. The input image and the
cascade data are copied to the memories of each GPU
separately. Distributing different levels of the pyra-
mid to different GPUs makes it possible to achieve
nearly linear speed-up when the levels each GPU will
process are carefully determined.
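
A minimal sketch of this threading scheme follows (our code; processLevels is a hypothetical stand-in for the per-GPU upload, pyramid construction, scanning and download steps). An equal split of levels is shown for brevity, whereas the paper balances the levels by their cost.

#include <cuda_runtime.h>
#include <algorithm>
#include <thread>
#include <vector>

void processLevels(int device, int firstLevel, int lastLevel);  // hypothetical

void detectFrameMultiGpu(int numGpus, int numLevels)
{
    std::vector<std::thread> workers;
    int perGpu = (numLevels + numGpus - 1) / numGpus;  // naive equal split
    for (int dev = 0; dev < numGpus; ++dev) {
        int first = dev * perGpu;
        int last  = std::min(first + perGpu, numLevels);
        workers.emplace_back([=] {
            cudaSetDevice(dev);               // bind this CPU thread to its GPU
            processLevels(dev, first, last);  // pyramid levels owned by 'dev'
        });
    }
    for (std::thread& t : workers) t.join();  // gather before the next frame
}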
4 EXPERIMENTAL RESULTS
We have tested both the CPU and GPU implemen-
tations on the same desktop PC that contains an Intel
Core i5-2500k processor, 3 GTX 580 GPUs and 8 GB
RAM. Figure 3 shows the average number of frames
processed per second by all implementations on video
streams of various sizes. These measurements in-
clude the time required to perform preprocessing and
memory copies between the host and the device. The
multi-threaded CPU implementation uses OpenMP to
distribute the processing to different cores.
Resolution     CPU (1 thread)  CPU (4 threads)  1 GPU  2 GPUs  3 GPUs
640 × 480            14              42           171     320     445
720 × 540            11              34           140     264     380
1280 × 720            4              14            64     121     177
1920 × 1080           2               6            35      68      99

Figure 3: Frame rates (in FPS) of GPU and CPU implemen-
tations on various input resolutions.
As can be seen from Figure 3, even the single-
GPU implementation outperforms the single-threaded
and multi-threaded CPU implementations by a factor
of 12-18x and 4-6x, respectively. As the resolution
increases, so does the difference between the speed
of the GPU and CPU implementations, clearly show-
ing that a GPU, as a massively parallel processor, is
better suited to process high resolution videos than a
CPU. These results also show that, in contrast to the
CPU, the performance of the GPU based implementa-
tion scales nearly linearly with the number of GPUs.
We have also tested both our CPU and GPU im-
plementations on the CMU+MIT frontal face test set
(Rowley et al., 1998) to validate our results. The de-
tection rates for both implementations are measured
as 90.8% with 32 false positives, confirming that our
GPU implementation generates exactly the same re-
sults as the CPU implementation.
5 CONCLUSIONS
We have presented an efficient GPU implementation
of a boosting based, real-time object detection sys-
tem utilizing MCT based classifiers using CUDA. We
have shown that even our single GPU implementa-
tion outperforms both the single-threaded and multi-
threaded CPU implementations running on a high-end
CPU. We have pointed out that, because of their mas-
sively parallel architecture, GPUs are more suitable
for working with high resolution videos than CPUs.
Our implementation, with its ability to detect objects
in high resolution video streams in real-time, can eas-
ily be used in modern multimedia, entertainment and
surveillance systems.
ACKNOWLEDGEMENTS
This work is supported by ITU-BAP and TUBITAK
under the grants 34120 and 109E268, respectively.
REFERENCES
Fröba, B. and Ernst, A. (2004). Face detection with the
modified census transform. In Proceedings of the
Sixth IEEE International Conference on Automatic
Face and Gesture Recognition, FGR '04, pages 91–96,
Washington, DC, USA. IEEE Computer Society.
Harvey, J. P. (2009). GPU acceleration of object classifi-
cation algorithms using NVIDIA CUDA. Master's the-
sis, Rochester Institute of Technology.
Hefenbrock, D., Oberg, J., Thanh, N. T. N., Kastner, R.,
and Baden, S. B. (2010). Accelerating Viola-Jones
face detection to FPGA-level using GPUs. In Proceed-
ings of the 2010 18th IEEE Annual International Sym-
posium on Field-Programmable Custom Computing
Machines, FCCM ’10, pages 11–18, Washington, DC,
USA. IEEE Computer Society.
Herout, A., Jošth, R., Juránek, R., Havel, J., Hradiš, M.,
and Zemčík, P. (2010). Real-time object detection on
CUDA. Journal of Real-Time Image Processing, 2010(1):1–
12.
Obukhov, A. (2004). Haar classifiers for object detection
with CUDA. In Fernando, R., editor, GPU Gems: Pro-
gramming Techniques, Tips and Tricks for Real-Time
Graphics, chapter 33, pages 517–544. Addison Wes-
ley.
Rowley, H., Baluja, S., and Kanade, T. (1998). Neural
network-based face detection. IEEE Trans. Pattern
Anal. Mach. Intell., 20(1):23–38.
Sharma, B., Thota, R., Vydyanathan, N., and Kale, A.
(2009). Towards a robust, real-time face processing
system using CUDA-enabled GPUs. In High Perfor-
mance Computing (HiPC), 2009 International Con-
ference on, pages 368–377. IEEE.
Viola, P. and Jones, M. J. (2004). Robust real-time face
detection. Int. J. Comput. Vision, 57:137–154.