PARTS-BASED FACE DETECTION AT MULTIPLE VIEWS
Andreas Savakis and David Higgs
Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY 14623, USA
Keywords: Face detection, parts-based, multiple views, neural network, Bayesian network.
Abstract: This paper presents a parts-based approach to face detection that is intuitive, easy to implement, and can be
used in conjunction with other image understanding operations that use prominent facial features. Artificial
neural networks are trained as view-specific parts detectors for the eyes, mouth and nose. Once these salient
facial features are identified, results for each view are integrated through a Bayesian network in order to
reach the final decision. System performance is comparable to other state-of-the-art face detection methods
while providing support for different view angles and robustness to partial occlusions.
1 INTRODUCTION
Face detection in images and video is very important
for many applications including surveillance,
biometrics, content-based image retrieval and
human-computer interaction. Image and video
understanding systems often perform multiple tasks
in addition to face detection, such as pose
estimation, gaze detection, recognition of facial
expressions, etc. Therefore, it would be helpful to
develop face detection methods that utilize features
which can be shared with other modules.
Face detection approaches may be categorized as
feature-based or image-based (Hjelmas and Low,
2001). For example, neural networks and support
vector machines were used in (Rowley et al., 1998)
and (Osuna et al., 1997) without considering
features. In feature-based approaches, primitive parts
can be utilized to match known object relationships,
as in (Yow and Cipolla, 1997) and (Hsu et al.,
2002). In (Yow and Cipolla, 1997), a Bayesian
network was used to detect faces given evidence on
the presence of features. Boosting of features was
employed in (Viola and Jones, 2001) and (Xiao et
al., 2003). The parts selection process can be
automated using statistical methods, as in
(Schneiderman and Kanade, 2004), (Weber et al.,
2000), and (Fergus et al., 2003).
Automatic parts selection works without the need
for expert knowledge to identify important features,
but it does not always result in intuitively obvious
regions. This makes it difficult to share detected
features with other image understanding tasks.
Detection at multiple views may be
accomplished as an extension of existing methods by
training a single detector with all available views
(Pontil and Verri, 1998), (Yang et al., 2000) or by
training one model per viewpoint (Schneiderman
and Kanade, 2000). A constellation model that is
suitable for multipose detection was employed in
(Weber et al., 2000) and (Fergus et al., 2003).
In this paper, we present a parts-based approach
to face detection which utilizes prominent features
that may be used by other image understanding
methods. Thus, the results of feature detection
modules can be shared between various tasks for
efficient operation of the overall system.
Furthermore, the parts-based approach allows
detection under partial occlusion and is extended to
incorporate multiple views.
2 FACE DETECTION SYSTEM
The basic framework for the parts-based face
detection system is shown in Figure 1. This system
contains separate modules, where each module is
designed to support a different object view.
Individual parts detectors are selected to correspond
to prominent facial features, namely eyes, mouth and
nose. Parts detectors for each view are separately
trained and their outputs are fed into a Bayesian
network arbitrator that determines whether a face is
present at a given view.
Multi-layer artificial neural networks (ANNs) are
used for parts detection. The input layer corresponds
to a rectangular aperture at the selected reference
scale, as shown in Figure 2. The ANN detector is
applied as a sliding window, such that the network
response at each location is recorded in an activation
map. To accommodate moderate variations in scale,
three window sizes are used, each rescaled to the
ANN input resolution.
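As an illustration of this sliding-window evaluation, consider the sketch below; the `ann.predict` call, the step size, and the scale factors are hypothetical placeholders rather than details taken from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def activation_map(image, ann, aperture=(12, 20),
                   scales=(0.8, 1.0, 1.2), step=2):
    """Slide a parts detector over a grayscale image and record its
    response at each location.  Three window sizes approximate moderate
    scale variation; every window is resampled to the detector's fixed
    input resolution before evaluation."""
    h, w = aperture
    amap = np.zeros(image.shape, dtype=np.float32)
    for s in scales:
        wh, ww = int(round(h * s)), int(round(w * s))
        for r in range(0, image.shape[0] - wh, step):
            for c in range(0, image.shape[1] - ww, step):
                window = image[r:r + wh, c:c + ww]
                window = zoom(window, (h / wh, w / ww))  # rescale to ANN input
                amap[r, c] = max(amap[r, c], ann.predict(window))
    return amap
```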
The overall face detection system is shown in
Figure 3. Histogram equalization is performed on
each image subwindow to compensate for lighting
variations. The equalized region is processed by the
neural network parts detectors and an activation map
is obtained for each view. The activation map is
lowpass filtered to reduce the effects of outliers, and
the facial feature locations are identified based on
the maximum values of the activation map.
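A minimal sketch of these two steps follows; the Gaussian kernel and its width stand in for the unspecified lowpass filter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def equalize(window):
    """Histogram-equalize a grayscale window to compensate for lighting
    variations before it is fed to the parts detectors."""
    hist, bins = np.histogram(window.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(np.float64)
    cdf = 255.0 * cdf / cdf[-1]              # normalize CDF to [0, 255]
    return np.interp(window.ravel(), bins[:-1], cdf).reshape(window.shape)

def locate_feature(amap, sigma=1.5):
    """Lowpass-filter an activation map to suppress outliers, then take
    the location and value of the maximum response."""
    smooth = gaussian_filter(amap, sigma=sigma)
    peak = np.unravel_index(np.argmax(smooth), smooth.shape)
    return peak, smooth[peak]
```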
The results of the feature detector are interpreted
by the arbitrator network. Bayesian networks were
chosen as view arbitrators because of their natural
resistance to overfitting and ability to incorporate
incomplete data and cause-and-effect relationships
(Heckerman, 1995). This required discretizing the
output of the neural networks to indicate presence or
absence of a feature.
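For instance, the discretization can be a simple per-part thresholding of the peak activations, with thresholds chosen as described in Section 3; the dictionary interface below is illustrative.

```python
def discretize(activations, thresholds):
    """Map each part's peak activation to a binary present/absent event
    for the Bayesian network, using per-part thresholds (Section 3)."""
    return {part: activations[part] >= thresholds[part]
            for part in activations}
```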
The final decision incorporates the detection
results at different poses and this is done with a
simple logical OR operation, since little additional
leverage can be gained by applying a more complex
decision-making process.
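In code, this fusion reduces to a single logical OR over the per-view beliefs; the view names below are illustrative.

```python
def detect_face(view_beliefs):
    """Final decision: declare a face if any view arbitrator believes
    one is present, e.g. {'frontal': False, 'side': True} -> True."""
    return any(view_beliefs.values())
```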
3 RESULTS
The FERET database (Phillips et al., 2000) was used to
construct training and testing sets, as it contains
multiple viewing angles of human faces. Other
desirable characteristics include a large number of
subjects and good diversity across age, race, and
gender. There were 1364 images of the Frontal A
view and 690 images of the Quarter Left view that
selected to illustrate the capabilities of the system,
henceforth referred to as “frontal” and “side” views.
The four facial features selected were the left
eye, right eye, nose, and mouth. Both the frontal and
side views allow for all four facial features to be
visible. The input images were scaled so that most
parts could be represented with 400 or fewer input
neurons. Eyes were detected based on a 12x20 pixel
aperture, with the exception of the right eye in the
side view, which was 12x16 pixels due to foreshortening
effects. The nose and mouth detectors were 18x20
and 14x32 pixel windows, respectively. The
resulting faces had approximate dimensions of 50
pixels high by 55 pixels wide.
The bootstrapping process was used for training
the parts detectors (Sung and Poggio, 1998). The
component parts were extracted from the training
images, preprocessed, and added to the training set
as positive examples. To prevent an initial bias, an
equally sized negative set was added by taking
random preprocessed subwindows from a
background dataset consisting of 451 images from
the Caltech background image database.
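The sketch below outlines one plausible form of this bootstrapping loop; the `fit` and `predict` methods of the detector and the false-positive threshold are assumptions, not the actual training interface used here.

```python
import numpy as np

def bootstrap_train(ann, positives, background_windows,
                    rounds=5, fp_thresh=0.5):
    """Bootstrapping in the spirit of Sung and Poggio (1998): begin with
    the positive parts plus an equally sized random negative set, then
    repeatedly fold false positives from the background back in as hard
    negatives."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(background_windows), size=len(positives),
                     replace=False)
    negatives = [background_windows[i] for i in idx]
    for _ in range(rounds):
        ann.fit(positives, negatives)    # hypothetical training call
        # Windows the network wrongly accepts become hard negatives.
        hard = [w for w in background_windows
                if ann.predict(w) >= fp_thresh]
        if not hard:
            break
        negatives.extend(hard)
    return ann
```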
Arbitration networks for each view were trained
in order to make a decision about the presence or
absence of a face in the scene. Part activation values
were gathered and used to determine event
thresholds for each part. The threshold was selected
by plotting a receiver operating characteristic curve
(ROC curve) for the activation values and choosing
the optimal threshold point.
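One concrete reading of the "optimal threshold point" is the point maximizing TPR minus FPR (Youden's J statistic); the specific criterion is our assumption, since the paper does not name one.

```python
import numpy as np

def optimal_threshold(pos_activations, neg_activations):
    """Sweep every observed activation value as a candidate threshold and
    keep the one maximizing TPR - FPR (Youden's J), i.e. the ROC point
    farthest above the chance diagonal."""
    pos = np.asarray(pos_activations, dtype=float)
    neg = np.asarray(neg_activations, dtype=float)
    best_t, best_j = None, -np.inf
    for t in np.unique(np.concatenate([pos, neg])):
        j = np.mean(pos >= t) - np.mean(neg >= t)
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```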
The use of Bayesian networks allowed for the
inaccuracies of the individual neural networks to be
handled implicitly by the network itself. The
conditional probability tables (CPTs) were
determined experimentally from the training set.
Since the true presence of an object was known prior
to finding the part detections, statistics were
gathered for each part in the form of CPTs. Table 1
shows how the results of training the individual parts
detectors were incorporated in CPTs. The
correspondence between each part and its associated
view was found by counting the frequency of
detection with respect to whether the candidate
image contained a face or not. The experimental part
detection rates conformed to expectations, in that
parts were typically detected when the view was
present and not detected when it was absent.
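Gathering these statistics amounts to a maximum-likelihood frequency count; the sketch below assumes boolean arrays of per-image part detections and ground-truth face labels.

```python
import numpy as np

def estimate_cpt(part_detected, face_present):
    """Maximum-likelihood CPT entries for one part: the detection
    frequency over images that do and do not contain the view."""
    d = np.asarray(part_detected, dtype=bool)
    f = np.asarray(face_present, dtype=bool)
    return {"P(d=T | view=T)": d[f].mean(),
            "P(d=T | view=F)": d[~f].mean()}
```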
Bayes’ rule was applied to each view’s arbitration
network to find an expression relating the presence
of an object at a certain view to the conditional part
probabilities, as shown in the equation below:

\[ P(v \mid d_1, d_2, d_3, d_4) = P(v)\,P(d_1 \mid v)\,P(d_2 \mid v)\,P(d_3 \mid v)\,P(d_4 \mid v) \]

where \(v\) represents a view and \(d_i\) represents the
detection of part \(i\). For a given set of part detections, the
equation was evaluated twice – once for each state
of the view detection – substituting in the
corresponding CPTs for each part. The view state
with the larger network probability was the view
belief for the image. In most cases, two or more of
the four parts at any particular view indicated the
presence of the object.
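Using the frontal-view CPTs from Table 1 and an assumed uniform prior, the arbitration can be evaluated as follows; treating a missed part via the complement probability 1 - P(d_i = T | v) is our reading of the model. Note how an occluded right eye still yields a face decision when the other three parts fire.

```python
import numpy as np

# Frontal-view CPTs from Table 1, ordered left eye, right eye, nose, mouth.
P_DET = {True:  [0.8964, 0.8624, 0.9519, 0.8663],   # P(d_i = T | view = T)
         False: [0.0798, 0.1079, 0.0458, 0.0790]}   # P(d_i = T | view = F)

def view_belief(detections, prior_face=0.5):
    """Evaluate the arbitration equation for both view states and return
    the state with the larger (unnormalized) probability.  A missed part
    contributes the complement probability 1 - P(d_i = T | v)."""
    scores = {}
    for state in (True, False):
        prior = prior_face if state else 1.0 - prior_face
        likelihood = np.prod([p if d else 1.0 - p
                              for d, p in zip(detections, P_DET[state])])
        scores[state] = prior * likelihood
    return max(scores, key=scores.get), scores

# Left eye, nose, and mouth detected; right eye missed (e.g. occluded).
print(view_belief([True, False, True, True]))   # -> (True, {...})
```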
The validation process consisted of running the
standard detection scheme on the validation images
of each cross-validation set. The testing results
illustrate the performance of the system on images
that it had never encountered. The output of the final
stage was a binary decision between face and non-
face; the average performance is shown in Table 2.
The overall detection performance is better than the
performance of any of the individual part detectors,
which demonstrates the strength of Bayesian
decisions in this context. Side face detection
performed slightly better on average than the frontal
face detection, which could be expected by
comparing the part CPTs of each view.
For demonstration purposes, the proposed parts-
based face detection method was applied to subjects
outside the FERET database. Figure 4 shows two
correctly detected faces that are at different scales
and varying lighting conditions. Note that occlusion
of one eye did not affect the detection result.
4 CONCLUSIONS
This paper presents a parts-based face detection
approach that includes support for multiple viewing
angles. Parts detectors for eyes, mouth and nose
were implemented using neural networks trained
using the bootstrapping method. Bayesian networks
were used to integrate part detections in a flexible
manner, and were trained on a separate dataset so
that the experimental performance of each part
detector could be incorporated into the final
decision.
Images from the FERET human face database
were selected for training and testing. Individual part
detection rates ranged from 85% to 95% against
testing images (Table 1). Cross-validation was used
to test the system as a whole, giving average view
detection rates of 96.7% and 97.2% respectively for
the frontal and side views, and an overall face
detection rate of 96.9% (Table 2). A 5.7% false-
positive rate was demonstrated on background
clutter images.
Table 3 shows that the approach presented in this
paper performs in a manner comparable to other
research efforts within the field of face detection,
with minimal restrictions that would hinder
generalization to other object categories. In addition,
this approach supports different view angles. Finally, selecting
prominent facial features for face detection provides
a benefit for other image understanding modules that
may utilize the detected features.
ACKNOWLEDGEMENTS
This research was sponsored in part by the Eastman
Kodak Company and the Center for Electronic
Imaging Systems (CEIS), a NYSTAR-designated
Center for Advanced Technology in New York
State.
REFERENCES
Fergus R., Perona P., and Zisserman A., 2003. Object
Class Recognition by Unsupervised Scale-Invariant
Learning. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition.
Heckerman D., 1995. A Tutorial on Learning With
Bayesian Networks. Technical Report MSR-TR-95-
06, Microsoft Research, Advanced Technology
Division.
Hjelmas, E. and Low, B. K., 2001. Face Detection: A
Survey. In Computer Vision and Image
Understanding, vol. 83, pp. 236–274.
Hsu R. L., Abdel-Mottaleb M., and Jain A. K., 2002. Face
Detection in Color Images. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 24,
pp. 696–706.
Osuna E., Freund R., and Girosi F., 1997. Training
Support Vector Machines: an Application to Face
Detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 130–
136.
Phillips P. J., Moon H., Rizvi S. A., and Rauss P. J., 2000.
The FERET Evaluation Methodology for Face-
Recognition Algorithms. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 22,
pp. 1090–1104, October.
Pontil M. and Verri A., 1998. Support Vector Machines
for 3D Object Recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 20,
pp. 637–646.
Rowley H. A., Baluja S., and Kanade T., 1998. Neural
Network-Based Face Detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 20,
pp. 23–38.
Schneiderman, H. and Kanade T., 2000. A Statistical
Method for 3D Object Detection Applied to Faces and
Cars. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, vol. 1, pp.
1746–1759.
Schneiderman H. and Kanade T., 2004. Object Detection
Using the Statistics of Parts. International Journal of
Computer Vision, vol. 56, pp. 151–177.
Sung K. K. and Poggio T., 1998. Example-based Learning
for View-based Human Face Detection. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, vol. 20, pp. 39–51.
Viola P. A. and Jones M. J., 2001. Rapid Object
Detection using a Boosted Cascade of Simple
Features. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, vol. 1, pp.
511–518.
Weber M., Welling M., and Perona P., 2000.
Unsupervised Learning of Models for Recognition. In
Proceedings of the European Conference on
Computer Vision, vol. 1, pp. 18–32.
Xiao R., Zhu L., and Zhang H. J., 2003. Boosting Cascade
Learning for Object Detection. In Proceedings of the IEEE
International Conference on Computer Vision (ICCV '03).
Yang M. H., Roth D., and Ahuja N., 2000. Learning to
Recognize 3D Objects with SNoW. In Proceedings of
the European Conference on Computer Vision, vol. 1,
pp. 439–454.
Yow, K. C. and Cipolla R., 1997. Feature-based human
face detection. Image and Vision Computing, vol. 15,
pp. 713–735.
Figure 1: Framework for parts-based face detection at
multiple views.
Figure 2: Neural network parts detector.
Figure 3: Overall flow of the face detection system.
Table 1: Conditional Probability Tables (CPTs) based on
the performance of feature detectors.
View State    P(Left Eye = T)   P(Right Eye = T)   P(Nose = T)   P(Mouth = T)
Frontal  T    0.8964            0.8624             0.9519        0.8663
Frontal  F    0.0798            0.1079             0.0458        0.0790
Side     T    0.9111            0.8560             0.9227        0.8449
Side     F    0.0791            0.0924             0.0658        0.1145
Table 2: Face detection performance.

                 Detected Images         Detected Percent
Dataset          Non-Face    Face        Non-Face    Face
Frontal          11.1        329.9       3.3%        96.7%
Side             4.9         167.6       2.8%        97.2%
Face Overall     16.0        497.5       3.1%        96.9%
Background       106.3       6.4         94.3%       5.7%
Table 3: Comparison with existing work.
Method                        Model   Detector(s)             Arbitrator(s)       Detection
Yow & Cipolla (1997)          Parts   Edge Detection          Bayesian            92.9%
View Based Parts (this work)  Parts   Neural Networks         Bayesian            96.9%
Fergus et al. (2003)          Parts   Gaussian Distribution   Statistical Model   96.4%
Rowley et al. (1998)          Image   Neural Networks         Threshold           99.5%
Osuna et al. (1997)           Image   SVM                     Threshold           97.1%
Sung & Poggio (1998)          Image   PCA Distance Metric     Neural Network      96.7%
Figure 4: Examples of detected faces showing robustness
to scale variations, lighting variations and occlusion.