In this work, the face detection algorithm used was developed by Dirk-Jan Kroon (Mathworks, 2011). It consists of a previously trained classifier cascade with 22 classification stages and scans each image 11 times at different scales. To meet processing-time requirements, only the first two frames of the video were processed with the original algorithm. The role of these two frames was to determine the scale of the face and its initial position in the image. That information was then used in the subsequent frames, where a lighter version of the algorithm was applied. This version scanned only a sub-window of the original image, centred on the position detected in the previous frame, and only at the scale defined in the initial frames; the classifier cascade was also simplified to its first 5 stages. As a result of this simplification the algorithm ran considerably faster, but lost some of its specificity, generating many bounding boxes (BBox) around the real face. To select the correct BBox, the image was binarized with a threshold between 0.45 and 0.6, highlighting the darker pixels. The BBox containing the most dark pixels was selected, and the corresponding sub-window was cropped and resized to 50x50 pixels for further analysis.
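For illustration, a lightweight tracking step of this kind could be sketched in Python with OpenCV as follows. This is only an analogue of the method described above, not Kroon's MATLAB implementation: OpenCV's pretrained cascade cannot be truncated to its first 5 stages, and the sub-window margin and the 0.5 binarization threshold are assumptions.

import cv2
import numpy as np

# Pretrained frontal-face cascade shipped with OpenCV (stands in for
# Kroon's 22-stage cascade; truncation to 5 stages is not exposed).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def track_face(gray, prev_box, threshold=0.5):
    """Search a sub-window centred on the previous detection, at a fixed
    scale, and keep the candidate BBox with the most dark pixels."""
    x, y, w, h = prev_box
    # Sub-window twice the size of the previous BBox (assumed margin).
    x0 = max(x - w // 2, 0)
    y0 = max(y - h // 2, 0)
    sub = gray[y0:y0 + 2 * h, x0:x0 + 2 * w]
    # Fixing minSize == maxSize confines the search to the scale
    # determined in the initial frames.
    boxes = cascade.detectMultiScale(sub, scaleFactor=1.1,
                                     minSize=(w, h), maxSize=(w, h))
    if len(boxes) == 0:
        return prev_box
    # Binarize: pixels below the threshold count as dark (the paper
    # used a threshold between 0.45 and 0.6 of the intensity range).
    dark = sub < int(threshold * 255)
    bx, by, bw, bh = max(
        boxes, key=lambda b: dark[b[1]:b[1] + b[3], b[0]:b[0] + b[2]].sum())
    return x0 + bx, y0 + by, bw, bh

# The selected face is then cropped and resized to 50x50 for eye detection:
# face = cv2.resize(gray[y:y + h, x:x + w], (50, 50))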
2.2 Eye Detection
To locate the eyes in the previously detected face, two
methods were studied: Between-the-Eyes (BTE) and
Elliptical Gabor Filter (EGF).
2.2.1 Between-the-Eyes (BTE)
This algorithm was developed with reference to the work of Peng et al. (Peng, 2005). It takes advantage of the characteristic illumination profile of a face and extracts the point that lies between the eyes. It works in two fundamental steps. The first consists in computing the vertical gradient image of the face to detect the increased shadow present in the transition between the forehead and the eyebrow region; a horizontal cumulative sum is performed on this gradient image, and the row of pixels with the highest value is taken as the row (vertical) coordinate of the BTE point. A sub-window of the original greyscale image, centred on this row, is then extracted and eroded by a circular operator 4 pixels in diameter. A vertical cumulative sum is performed, and the column with the highest value is taken as the column (horizontal) coordinate of the point.
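A minimal sketch of this two-step procedure, assuming a Python/NumPy/OpenCV setting; the 8-pixel half-height of the sub-window around the detected row is an assumed value not given above.

import cv2
import numpy as np

def bte_point(face, band=8):
    """Between-the-Eyes point on a greyscale face image."""
    face = face.astype(np.float32)
    # Step 1: the vertical gradient highlights the shadow at the
    # forehead-to-eyebrow transition; the row whose horizontal
    # cumulative sum is highest gives the row coordinate.
    grad = np.abs(np.diff(face, axis=0))
    row = int(np.argmax(grad.sum(axis=1)))
    # Step 2: erode a strip around that row with a circular operator
    # 4 pixels in diameter, then take the column whose vertical
    # cumulative sum is highest as the column coordinate.
    strip = face[max(row - band, 0):row + band, :]
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (4, 4))
    eroded = cv2.erode(strip, disk)
    col = int(np.argmax(eroded.sum(axis=0)))
    return col, row  # (x, y) of the BTE point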
2.2.2 Elliptical Gabor Filter (EGF)
The Gabor filter used to detect the eyes was based on the work done by Yanfang Zhang (Zhang, 2005). It consists in the use of an elliptical Gabor filter, a 2-D band-pass filter generated by a Gaussian function modulated by a sine function; the equation and pictures of the filter are also presented in that work. In this work, the parameters used to compute the filter were σx = 15, σy = 15, F = 64, θ = 0, and the filter size was 15x15. Here σx and σy stand for the variances (scale factors) along the x and y axes, F for the spatial central frequency and θ for the rotation angle of the filter.
real part of the filter was selected and normalized for
further use. The computed filter was then convolved with the greyscale face image to highlight the position of the eyes. The eye regions appeared in the filtered image as dark holes; to select them, the image was binarized, highlighting the darkest regions. The centroid of each of these regions was then used to mark the position of the corresponding eye.
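A sketch of this filter and of the subsequent eye selection, again in a Python/OpenCV setting. The exact Gabor equation is the one given in Zhang (2005), so the mapping of F = 64 to a per-pixel frequency and the 0.2 binarization threshold used here are assumptions.

import cv2
import numpy as np

def elliptical_gabor(size=15, sigma_x=15.0, sigma_y=15.0, F=64, theta=0.0):
    """Real part of an elliptical Gabor filter: a Gaussian envelope
    modulated by a sinusoid, normalized for further use."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
    # Rotate the coordinate frame by theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    carrier = np.cos(2 * np.pi * F * xr / size ** 2)  # assumed scaling of F
    g = envelope * carrier
    return g / np.abs(g).sum()

def eye_centroids(face, threshold=0.2):
    """Convolve the face with the filter and return the centroids of the
    darkest blobs as eye positions; `threshold` is an assumed value."""
    resp = cv2.filter2D(face.astype(np.float32), -1, elliptical_gabor())
    resp = (resp - resp.min()) / (resp.max() - resp.min())
    mask = (resp < threshold).astype(np.uint8)  # eyes appear as dark holes
    centroids = cv2.connectedComponentsWithStats(mask)[3]
    return centroids[1:]  # (x, y) per blob; background component dropped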
3 TEST SETUP
In order to compare the effectiveness of the developed algorithms, a video database was created. This database comprised 6 types of video of 23 different subjects, each performing a different head movement in each video. The protocol for each video is presented in Table 1. The subjects recorded for this database were all volunteers from the research centre.
Table 1: Description of the head movement performed on
each video.
Video Subject’s movement description
I Upright position at rest
II Approach the camera and move away
III Laterally tilt the head
IV Tilt the head
V Pan the head
VI Free head movement
The videos were acquired using a Logitech C210 USB webcam at 30 frames/s, each with a length of 10 s and a resolution of 120x160 pixels. The recorded videos were then processed in Simulink, where each video was run 6 times, covering all possible combinations of face and eye detectors. All videos were processed offline on an Intel Core 2 Duo E6750 CPU @ 2.66 GHz computer (2 GB RAM, 64-bit operating system).
To evaluate the results in terms of accuracy, a score scale with three steps was created: the score "0" was attributed to algorithms that failed to detect the correct face or eye position, "1" was attributed