A REAL WORLD DETECTION SYSTEM

Combining Color, Shape and Appearance to Enable Real-time Road Sign Detection

Peng Wang, Jianmin Li and Bo Zhang

State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science

and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China

Keywords: Object Detection, Road Sign, Haar-wavelet, Cascade Detector, HSV Color Space, Contour, RANSAC.

Abstract: Although specific object detection has undergone great advances in recent years, its application to critical

real-time circumstances such as automated vehicle control is still limited, especially when facing

strict speed and precision requirements. This paper uses a hybrid of various computer vision techniques,

including color space analysis, a Haar-wavelet cascade detector, contour analysis and RANSAC shape-fitting,

to achieve real-time detection speed while maintaining reasonable precision and a reasonable false-alarm level. The

result is a practical system that outperformed most rivals in an automated vehicle contest, and an indication

that computer vision can feasibly be applied to speed-critical areas.

1 INTRODUCTION

1.1 Motivation

The real-time road sign detection system introduced

in this paper was motivated by its role as a crucial subsystem

of an automatic driving system for vehicles. Three

challenges arise in designing such a real-time object

detection application, and no single existing

computer vision algorithm meets all of them

together: the strict speed requirement, the strict

precision requirement and the diversity of target objects.

1.2 Outline

To tackle all three challenges, we designed a

combination of various computer vision techniques

that utilizes color, shape and appearance information

together. The design principle of our system is to

rely on a stable detection algorithm, which may be

time consuming, as the main detector, while

using various pre-processing and post-processing

stages to shrink the area of the regions

this main detector operates on. The reduction in

regions of interest (ROIs) compensates for the slow

speed of the main detector. For orthogonality, these

pre- and post-processing stages should exploit

information different from that used by the main

detector.

We designed a pipeline architecture to combine

these pre- and post-processing stages and the main

detector. The pipeline of our road sign detection

system consists of 4 stages, as shown in Figure 1. At

the first stage, color space analysis is used to spot

approximate ROIs. After that, the candidate regions

are filtered by contour analysis to reduce the

number and size of candidate regions. At the third

stage, a Haar-wavelet cascade detector plays the

major role of detecting road signs. At the final stage,

these detected signs are checked by a RANSAC

shape-fitting post-validator.

Figure 1: The 4-stage pipeline.
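
The control flow of the four stages can be sketched as follows. This is a minimal illustration only: the function name `run_pipeline`, the stage signatures and the box format `(x0, y0, x1, y1)` are assumptions made for the sketch, not the interfaces of the actual system.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical format: (x0, y0, x1, y1)


def run_pipeline(
    frame,
    color_stage: Callable[[object], List[Box]],
    contour_stage: Callable[[object, List[Box]], List[Box]],
    cascade_stage: Callable[[object, Box], List[Box]],
    ransac_stage: Callable[[object, Box], bool],
) -> List[Box]:
    """Run the four stages in order; each stage only sees what the previous one kept."""
    rois = color_stage(frame)            # stage 1: approximate ROIs from color
    rois = contour_stage(frame, rois)    # stage 2: drop ROIs with implausible contours
    detections: List[Box] = []
    for roi in rois:                     # stage 3: the slow main detector, ROIs only
        detections.extend(cascade_stage(frame, roi))
    # stage 4: keep only detections that pass the shape check
    return [d for d in detections if ransac_stage(frame, d)]
```

Passing the stages in as callables keeps the sketch independent of any particular implementation; it also makes it easy to switch a stage off by substituting a pass-through function.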

2 RELATED WORK

Color-based methods (Broggi, 2007; Escalera, 1997;

Escalera, 2003) use thresholds within certain color-

space to pick up the pixels that comprise the target

object. The biggest problem of color-based methods

is the poor discriminative power and weak

robustness of color information alone, and hence the

difficulty in distinguishing road signs from

background clutter of a similar color.

[Figure 1 diagram: Color Space → Contour → Cascade → RANSAC]

Wang P., Li J. and Zhang B..

A REAL WORLD DETECTION SYSTEM - Combining Color, Shape and Appearance to Enable Real-time Road Sign Detection.

DOI: 10.5220/0003814306750678

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 675-678

ISBN: 978-989-8565-03-7

© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Papageorgiou (2000) showed that compositions

of simple features such as Haar wavelets have

a great advantage in speed without suffering

much loss of precision. Viola (2004) introduced

the canonical cascade classifier. The problem of

shape-based methods is that each is aimed at

one specific kind of visual object. If we force the

training samples to contain various kinds of objects,

the output will be a detector poor in both hit rate and

false-alarm rate. If we train a detector for each kind

of object, the computational resources required

during detection will be overwhelming.

3 PIPELINE

3.1 Color Space Analysis

Among the many color spaces that can be used for

color space analysis, our system selects HSV

because it corresponds most closely to human

color perception. Rendering a road image in the H, S

and V channels, we found that road signs stand out

prominently in the H and S channels, but not so much in

the V channel. By setting upper and lower thresholds

on the H and S channels, we can approximately pick out

pixels that belong to a road sign. These pixels will

connect with each other to form irregular regions,

which, after certain image processing techniques,

can be used to calculate bounding boxes that most

tightly contain them. These bounding boxes form the

ROIs for the next pipeline stage.
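
The thresholding and bounding-box steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold values, the function names `hs_mask` and `bounding_boxes`, and the use of 4-connected components are all hypothetical, and the intermediate image processing mentioned in the text is omitted. Hue and saturation are in the 0..1 range used by Python's `colorsys` module.

```python
import colorsys
from collections import deque


def hs_mask(image, h_range, s_range):
    """Binary mask of pixels whose H and S fall inside the given ranges.

    image: list of rows, each row a list of (r, g, b) tuples in 0..255.
    Note: a hue band that wraps around 0 (e.g. red) would need two ranges.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(h_range[0] <= h <= h_range[1]
                            and s_range[0] <= s <= s_range[1])
        mask.append(mask_row)
    return mask


def bounding_boxes(mask):
    """Tight (x_min, y_min, x_max, y_max) box of every 4-connected region."""
    height = len(mask)
    width = len(mask[0]) if mask else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one region
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

The V channel is deliberately ignored, matching the observation that signs stand out in H and S but not in V.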

3.2 Contour Analysis

Computer vision toolsets like OpenCV usually

provide contour analysis tools, which can be

used to extract contours from a binary image, to

match a contour against a template contour, etc.

Contour analysis is based on the binary image output

of color thresholding. Contour matching algorithms

take two contours as input and output a real number

indicating the extent to which they match. A

threshold can be put on this number to rule out

candidates whose contours are too far from the

wanted contour. Contour algorithms do not use a

sliding window and are thus much faster than algorithms

that are performed in a sliding window manner.
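
The "match score plus threshold" idea can be illustrated with a deliberately simple stand-in. OpenCV's own matcher (`cv2.matchShapes`) compares moment-based descriptors; the sketch below instead compares circularity, 4πA/P², which is an assumption chosen only to keep the example dependency-free. The function names and the threshold are hypothetical.

```python
import math


def polygon_area(points):
    """Shoelace formula; points is a list of (x, y) vertices in order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))


def circularity(points):
    """4*pi*A / P^2: 1.0 for a perfect circle, smaller for elongated shapes."""
    perimeter = polygon_perimeter(points)
    if perimeter == 0:
        return 0.0
    return 4.0 * math.pi * polygon_area(points) / perimeter ** 2


def match_score(candidate, template):
    """Real-valued dissimilarity; 0 means identical circularity."""
    return abs(circularity(candidate) - circularity(template))


def filter_candidates(candidates, template, max_score):
    """Keep only the contours whose score is within the threshold."""
    return [c for c in candidates if match_score(c, template) <= max_score]
```

Each candidate costs one pass over its contour points, with no window sliding over pixels, which is why this kind of check is cheap compared with the main detector.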

3.3 Haar-wavelet Cascade Detector

The design of the cascade detector is the same as in

Viola (2004). Because the cascade detector is used in a

sliding window manner, it is the biggest time

consumer of the whole system, and the major way of

speeding it up is thus to reduce the area of the regions this

sliding window is performed on.
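
To see why shrinking the scanned area matters, one can count how many windows a multi-scale sliding-window scan visits. The sketch below is a rough cost model only; the base window size, stride and scale step are hypothetical values, not the parameters of our detector.

```python
def count_windows(width, height, window=24, stride=4, scale_step=1.25):
    """Number of square windows visited when scanning a width x height region.

    The window grows by scale_step per pass and the stride grows with it.
    All default parameter values are illustrative assumptions.
    """
    total = 0
    size = float(window)
    while size <= min(width, height):
        side = int(size)
        step = max(1, int(stride * size / window))
        cols = (width - side) // step + 1
        rows = (height - side) // step + 1
        total += cols * rows
        size *= scale_step
    return total


# Compare a full 1280x960 frame with a single small ROI.
full_frame = count_windows(1280, 960)
small_roi = count_windows(160, 160)
print(full_frame, small_roi)
```

Since every window is passed through the cascade, the window count is a reasonable proxy for detector cost, and it falls roughly in proportion to the area scanned.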

3.4 RANSAC Shape-fitting

When we have points that are believed to be

generated from the edge of a certain shape, in

principle we can recover the generating shape (i.e.

its parameters) from the information provided by

these points. Typically we will have many more

points than theoretically needed. RANSAC

(RANdom SAmple Consensus) (Fischler, 1981) is a

method that exploits this redundant information to

improve the precision and stability of shape fitting.

It randomly selects just enough points

to calculate the shape parameters, and

repeats this procedure a number of times to get multiple

sets of calculated parameters. The final parameter

values are decided by a vote among these

calculated parameter sets.

In addition to estimating the values of shape

parameters, RANSAC can also be used to check our

presumed shape model. For example, if we

assume the generating shape is a circle, but the

calculated parameter sets differ too much from each

other, i.e. the statistical deviation exceeds some

criterion, then we should abandon our previous

assumption and conclude that the shape is not a

circle. RANSAC is used in this way as a shape

validator in our system.
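
A minimal sketch of this use of random sampling as a circle validator is given below. Several details are assumptions made for illustration: the vote is taken as a per-parameter median, the deviation criterion is the median absolute deviation of the radius relative to the median radius, and the names, trial count and tolerance are hypothetical.

```python
import math
import random
import statistics


def circle_from_points(p1, p2, p3):
    """Circle (cx, cy, r) through three points, or None if they are collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-9:
        return None
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def validate_shape_as_circle(points, trials=50, max_rel_spread=0.1, seed=0):
    """Return (is_circle, (cx, cy, r)) from repeated minimal-sample fits."""
    rng = random.Random(seed)
    fits = []
    for _ in range(trials):
        # Three points are the minimum needed to determine a circle.
        fit = circle_from_points(*rng.sample(points, 3))
        if fit is not None:
            fits.append(fit)
    if len(fits) < 2:
        return False, None
    # "Vote": per-parameter median over all fitted circles (an assumption).
    cx = statistics.median(f[0] for f in fits)
    cy = statistics.median(f[1] for f in fits)
    radii = [f[2] for f in fits]
    r = statistics.median(radii)
    if r <= 0:
        return False, None
    # Deviation criterion: relative median absolute deviation of the radius.
    spread = statistics.median(abs(x - r) for x in radii) / r
    return spread <= max_rel_spread, (cx, cy, r)
```

Edge points sampled from a true circle give nearly identical fits, so the spread stays small; points from an elongated or angular outline give widely scattered radii and are rejected.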

4 OPTIMIZATION

To stabilize the sizes of ROIs, we implemented a

tracking mechanism. When a sign is detected and

confirmed (for example by a consecutive series of

appearances), subsequent detection is

performed only in its neighborhood, with a

full-frame detection every several frames to allow for

new signs.
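
The tracking rule can be sketched as a function that decides where to run detection on each frame. The confirmation length, the full-scan interval and the neighborhood margin below are hypothetical values, as are the function names.

```python
def is_confirmed(hit_history, needed=3):
    """A sign counts as confirmed after `needed` consecutive detections.

    hit_history: list of booleans, one per recent frame, newest last.
    """
    return len(hit_history) >= needed and all(hit_history[-needed:])


def search_regions(frame_index, frame_size, tracked_boxes,
                   full_scan_every=10, margin=0.5):
    """Regions (x0, y0, x1, y1) to run detection on for this frame."""
    width, height = frame_size
    # Periodic full-frame scan so that newly appearing signs are not missed.
    if not tracked_boxes or frame_index % full_scan_every == 0:
        return [(0, 0, width, height)]
    regions = []
    for x0, y0, x1, y1 in tracked_boxes:
        pad_x = int((x1 - x0) * margin)
        pad_y = int((y1 - y0) * margin)
        # Neighborhood of the confirmed sign, clamped to the frame.
        regions.append((max(0, x0 - pad_x), max(0, y0 - pad_y),
                        min(width, x1 + pad_x), min(height, y1 + pad_y)))
    return regions
```

For example, with a confirmed sign at `(100, 100, 140, 140)` in a 1280×960 frame, frame 3 is searched only in `(80, 80, 160, 160)`, while frame 10 triggers a full scan.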

There are 25 kinds of signs as shown in Figure 2.

Theoretically we should train one cascade detector

for each sign, leading to 25 cascade detectors in total.

Our speed budget cannot accommodate that

much computation, so we grouped the signs

into 9 groups and trained a detector for each group,

as shown in Figure 2. Multiple detectors also open

the possibility of parallelization.

For the trade-off between hit rate and false-alarm rate,

we prefer lowering the false-alarm rate to raising the hit rate

during parameter adjustment, because a low false-alarm

rate benefits both precision and speed, and

a moderate hit rate is somewhat tolerable in object

detection, since detection is measured object-wise;

that is to say, the detection of an object should

be considered successful as long as one of its many

appearances is detected.

5 EXPERIMENT RESULTS

To measure our system’s precision and speed, we

conducted a suite of experiments on a set of videos

captured on real roads. The set consists of 33

videos with a resolution of 1280×960, containing

11,345 frames in total. About 20% of the frames contain one

or more road signs. A road sign is one of the 25

target road signs used in this experiment, as shown

in Figure 2. We thoroughly annotated the bounding

boxes of the signs on all frames, and used them as

ground truth for testing. The training samples were

extracted from a different set of videos than the

testing set, containing 12,454 patches as positive

samples and 384 full images as the source of

negative patch samples.

Figure 2: Road signs used in this experiment. Each cell

contains a group sharing the same cascade detector.

For measurement, we defined several criteria. A

detect (the bounding box of a sign) is treated as

‘correct’ or a ‘hit’ if it intersects with a ground truth

rectangle, and the area of intersection is larger than

both 80% of the area of the detect and 80% of the

area of the ground truth. Hit rate is defined as the

proportion of ground truth

bounding boxes that are hit over the number of all

ground truth bounding boxes. The number of false-

alarms is the number of detects that do not hit any

ground truth. We reported it in the form of false-

alarms per frame. We reported the speed of the

system as Frames Per Second (FPS). The speed is

measured on a quad-core Intel i7 CPU clocked at

2.8 GHz.
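
The hit criterion, hit rate and false-alarms per frame defined above can be written down directly. The sketch assumes boxes in the form `(x0, y0, x1, y1)` and hypothetical function names; the 80% threshold is the one stated in the text.

```python
def area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_area(a, b):
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))


def is_hit(detect, truth, min_overlap=0.8):
    """Intersection must exceed 80% of the detect AND 80% of the ground truth."""
    inter = intersection_area(detect, truth)
    return inter > min_overlap * area(detect) and inter > min_overlap * area(truth)


def evaluate(frames):
    """frames: list of (detects, truths) pairs, one pair per video frame.

    Returns (hit_rate, false_alarms_per_frame).
    """
    total_truths = hit_truths = false_alarms = 0
    for detects, truths in frames:
        total_truths += len(truths)
        # A ground truth box is hit if any detect hits it.
        hit_truths += sum(1 for t in truths if any(is_hit(d, t) for d in detects))
        # A false alarm is a detect that hits no ground truth box.
        false_alarms += sum(1 for d in detects if not any(is_hit(d, t) for t in truths))
    hit_rate = hit_truths / total_truths if total_truths else 0.0
    fa_per_frame = false_alarms / len(frames) if frames else 0.0
    return hit_rate, fa_per_frame
```

Requiring the overlap on both sides rejects a large detect that merely contains the sign as well as a small detect that covers only part of it.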

5.1 Full System Performance

Table 1 shows the performance with all four

stages of the pipeline turned on. The FPS of about 8 is a

conservative estimate, which doesn’t take advantage

of optimization techniques such as the tracking

described in Section 4, whose results are

reported in Subsection 5.5. A false-alarm rate of

about 0.13 is low enough to serve as a basis for further

processing such as consecutive-appearance validation

during tracking and validation during recognition.

Table 1: Performance of the full system. HR stands for hit

rate. FA stands for false-alarms per frame. FPS stands for

frames per second.

HR FA FPS

0.586 0.132 7.86

5.2 Effect of Contour Analysis

Turning off the contour analysis stage and thus

letting all the ROIs output by the previous stage

reach the next stage, we got the results shown in Table 2.

Note the increase in hit rate, but also the increase in

false-alarm rate and the drop in FPS. As the FPS was

roughly halved by the removal of contour

analysis, dropping to an intolerable level, we

conclude that the contour analysis stage is a

crucial component of the whole system, especially

when it is intended for use in real-time

circumstances. It also supports our preference stated in

Section 4 that in object detection tasks, we should

prefer a low false-alarm rate to a high hit rate.

Table 2: Performance with Contour Analysis stage turned

off.

HR FA FPS

0.782 0.244 3.834

5.3 Effect of Color Space Analysis

The effect of removing color space analysis is very

obvious, as shown in Table 3. (Because the contour

analysis is based on the output mask image of color

thresholding, that stage has to be removed as well.)

Even setting aside the nearly unchanged hit rate and the

dramatic rise in false-alarm rate, the FPS

alone would make the system unworkable. This

simple result is enough to show that color space

analysis is a fundamental part of our system.

Table 3: Performance with Color Space Analysis stage

turned off.

HR FA FPS

0.572 1.446 0.694

5.4 Effect of RANSAC

Turning off the RANSAC stage, we got the results

shown in Table 4. Turning off the RANSAC stage

doesn’t affect hit rate much, but increases the false-alarm

rate by about 17% (from 0.132 to 0.154). Equivalently,

introducing RANSAC lowers the false-alarm rate by

about 14% without sacrificing hit rate. The speed is

not affected much either. This shows that RANSAC

is a safe and feasible post-processing stage for the

system, though its influence is not as dramatic as that of

color and contour analysis.

Table 4: Performance with RANSAC stage turned off.

HR FA FPS

0.583 0.154 7.93

5.5 Effect of Tracking

In addition to the full system tested in Subsection

5.1, we can add the simple tracking mechanism

described in Section 4 to stabilize the performance

and further boost the speed. The results of this

addition are shown in Table 5. Both precision and

speed improve significantly. The precision improves in that the

false-alarm rate drops by nearly 40% (from 0.132 to 0.081) while the

hit rate suffers little. The FPS increases by

about 50%. Both benefits are due to the dramatically

reduced ROI sizes. Some detects and false-alarms

are shown in Figure 3.

Table 5: Performance with tracking mechanism added.

HR FA FPS

0.535 0.081 11.750
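
As a cross-check, the relative change of each configuration against the full system can be recomputed from the values in Tables 1 to 5. The snippet below only restates the tabulated numbers; the dictionary keys and the function name are labels chosen for this sketch.

```python
def relative_change(before, after):
    """Signed relative change, e.g. -0.25 means a 25% drop."""
    return (after - before) / before


# Values copied from Table 1 (full system) and Tables 2 to 5.
full = {"HR": 0.586, "FA": 0.132, "FPS": 7.86}
variants = {
    "no contour": {"HR": 0.782, "FA": 0.244, "FPS": 3.834},
    "no color": {"HR": 0.572, "FA": 1.446, "FPS": 0.694},
    "no RANSAC": {"HR": 0.583, "FA": 0.154, "FPS": 7.93},
    "tracking": {"HR": 0.535, "FA": 0.081, "FPS": 11.750},
}

changes = {
    name: {metric: relative_change(full[metric], row[metric]) for metric in full}
    for name, row in variants.items()
}

for name, row in changes.items():
    print(name, {metric: f"{100 * value:+.1f}%" for metric, value in row.items()})
```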

Figure 3: Some detects and false-alarms.

6 CONCLUSIONS

In this paper we presented a system that can fulfill

the task of road sign detection in real-time, real-world

circumstances. By effectively utilizing various

computer vision techniques, we showed that though

there may not be a single algorithm that can tackle

all the challenges in a real-world task, a wise

selection and combination of existing techniques can still

produce a feasible and robust application.

ACKNOWLEDGEMENTS

This work is supported by the National Natural

Science Foundation of China under Grant No.

90820305, National Basic Research Program (973

Program) of China under Grant No. 2012CB316301,

and Basic Research Foundation of Tsinghua

National Laboratory for Information Science and

Technology (TNList).

REFERENCES

Broggi, A., Cerri, P., et al., 2007. Real time road signs

recognition. In Intelligent Vehicles Symposium. IEEE.

Escalera, A., Luis, E., Moreno, M., et al., 1997. Road

traffic sign detection and classification. In

Transactions on Industrial Electronics. IEEE.

Escalera, A., Armingol, J., et al., 2003. Traffic sign

recognition and analysis for intelligent vehicles. In

Image and Vision Computing.

Fischler, M., Bolles, R., 1981. Random sample

consensus: a paradigm for model fitting with

applications to image analysis and automated

cartography. In Communications of the ACM.

Papageorgiou, C., Poggio, T., 2000. A trainable system for

object detection. In International Journal of Computer

Vision.

Viola, P., Jones, M., 2004. Robust real-time face detection.

In International Journal of Computer Vision.
