Automatic Analysis of In-the-Wild Mobile Eye-tracking Experiments
using Object, Face and Person Detection
Stijn De Beugher¹, Geert Brône² and Toon Goedemé¹
¹EAVISE, ESAT, KU Leuven, Leuven, Belgium
²MIDI Research Group, KU Leuven, Leuven, Belgium
Keywords:
Mobile Eye-tracking, Object Detection, Object Recognition, Real World, Data Analysis, Person Detection.
Abstract:
In this paper we present a novel method for the automatic analysis of mobile eye-tracking data in natural envi-
ronments. Mobile eye-trackers generate large amounts of data, making manual analysis very time-consuming.
Available solutions, such as marker-based analysis, minimize the manual labour but require experimental con-
trol, making real-life experiments practically infeasible. We present a novel method for processing this mobile
eye-tracking data by applying object, face and person detection algorithms. Furthermore, we present a temporal
smoothing technique to improve the detection rate, and we trained a new detection model for occluded person
and face detections. This enables the analysis to be performed on the object level rather than the traditionally
used coordinate level. We present speed and accuracy results of our novel detection scheme on challenging,
large-scale real-life experiments.
1 INTRODUCTION
The development of mobile eye-tracking systems has
opened up the paradigm of eye-tracking to a wide va-
riety of research disciplines and commercial appli-
cations. Whereas traditionally, the analysis of eye
gaze patterns was largely confined to controlled lab-
based conditions due to technological restrictions (i.e.
obtrusive hardware restricting the flexibility of use
and potential research questions), mobile systems al-
low for eye-tracking in the wild, without the need for
a predefined set of research conditions. Because
of this increased flexibility, research into visual be-
haviour and real-life user experience now extends to
natural environments such as public spaces (train sta-
tions, airports, museums, etc.), commercial environ-
ments (supermarkets, shopping centers, etc.) or to
interpersonal communicative settings (helpdesk inter-
actions, lectures, face-to-face communication, etc.).
A mobile eye-tracker, as illustrated in figure 1, com-
bines two types of cameras. The scene camera faces
forward and captures the wearer's field of view, while
the eye camera(s) capture the eye movements, also
known as gaze data. The output of such an eye-tracker,
as shown in the right part of this figure, consists of the
images captured by the scene camera with the gaze
locations overlaid on top of them.
Figure 1: Left: illustration of a mobile eye-tracker consisting of a scene camera and an eye camera. Right: output of a mobile eye-tracker in which the data of the scene camera and the captured gaze point (green dot) are combined.

One of the key challenges for this new type
of pervasive eye-tracking, and mobile eye-tracking in
general, is the processing of data generated by the sys-
tems. By abandoning the traditional well-controlled
lab-based conditions, the data stream generated by the
eye-trackers becomes highly complex, both in terms
of the objects and scenes that are encountered, and the
gaze data that need to be analyzed and interpreted.
How can researchers avoid the painstaking task of
manually coding large amounts of data, which is ex-
tremely time-consuming, without losing the full po-
tential of mobile eye-tracking systems?
Eye-tracking experiments are mostly performed in
order to measure how often and for how long the test
subjects looked at a specific object and/or at persons,
to gather information about what ’catches the eye’ in
a certain setting. Recently, several solutions to the
analysis problem have been proposed, some of which
have been integrated in commercially available sys-
tems, see (Evans et al., 2012) for an overview. The
best-known technique is the use of markers to prede-
fine potential Areas Of Interest (AOI). These systems,
which either use physical infrared markers (e.g. Tobii
Glasses) or natural markers (e.g. SMI Eye Tracking
Glasses), determine the boundaries of the Areas Of
Analysis (AOA), generating a two-dimensional plane
within which eye gaze data can be collected for longer
stretches of time and generalized across subjects. The
output of this type of analysis is often represented in
heat maps or opacity maps that highlight the zones
within the AOA that received most visual attention
(measured in terms of visual fixations and fixation
times). Despite their advantages in comparison to
manual analysis, marker based systems suffer from
a range of limitations, as discussed in (Brône et al.,
2011) and (Evans et al., 2012), including the need for
fixed positions of relevant objects to be tracked, and
the sensitivity to the observer’s position. These short-
comings impose limitations on the efficient use of
(mobile) eye-tracking in real-life settings with mov-
ing subjects, objects and a dynamic environment.
This paper presents an alternative to the AOI-
based methods, building on recent studies combin-
ing object recognition algorithms with eye-tracking
data (De Beugher et al., 2012), (Toyama et al., 2012)
and (Yun et al., 2013). By mapping gaze data on ob-
jects and object classes to be recognized in the scene
video data, a number of restrictions of AOI-based ap-
proaches no longer hold, including the need to work
with predefined static areas. Objects for which gaze
data statistics need to be generated can be selected in
the actual videostream, without prior training. The
schematic representation in figure 2 shows the detec-
tion methods that were used in our approach to gener-
ate both graphical and statistical output.
The next section illustrates potential real-life
applications of our approach. In section 3 we dis-
cuss the technological background of the proposed
approach, including a comparison of different poten-
tial feature extraction methods and an overview of
face and body detection algorithms. In section 4 we
will clarify our technical approach, and in section 5
we report on our experiments.
2 TARGETED APPLICATIONS
In order to show the potential of our object-based eye-
tracking data analysis, we detail a selection of real-
world applications that can serve as both a test case
and a show case for our technique: user experience,
market research and (psycho)linguistics.
Figure 2: High-level overview of the approach. Output is
generated in the form of both a visual summary of the eye-
tracker data and statistical information.
2.1 Market Research
Our first case study focuses on the analysis of eye-
tracker data recorded during a shopping experiment.
We have chosen this case because there has been
a substantial interest in eye-tracking for shopper re-
search for several years. Indeed, these experiments
yield an objective measurement of how well prod-
ucts catch the eye in a shop. Market researchers and
(brand) developers benefit from insights into the ef-
fect of package design, shelf placement, and store
planning on shopper experience. The specific con-
text of a supermarket presents a series of challenges,
for example a multitude of objects with different
shapes and colors, products presented in groups on
the shelves or products within the same range exhibit-
ing similar features. As explained in the introduction,
the limitations of using predefined areas of analysis
make it virtually impossible to process large-scale
shopping experiments.
Our proposed object recognition method on the
other hand, allows for a fast analysis of multiple ob-
jects and object categories. For instance, once the sys-
tem has been trained for a specific product, it will rec-
ognize this product type each time it appears within
the visual field of a test person walking through a shop
and even picking up those products.
The main advantage of our system, compared to
the marker-based methods, is that we no longer re-
quire predefined areas of analysis, making it possi-
ble to perform eye-tracker experiments in larger, real-
world shopping environments.
2.2 Customer Journey
A prominent field of application for mobile eye-
tracking is customer journey analysis. The main pur-
pose of customer journey research is to gain insight
into the experience of a company's customers.
Mobile eye-trackers provide potentially useful infor-
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
626
mation on customer experience, particularly when the
paradigm is combined with other sensors, such as
wearable EEG devices (Alves et al., 2012).
Customer experience can be measured through so-
called touchpoints, the contact moments be-
tween the customer and the company, e.g. in the case
of advertising, communication with desk staff,
etc. The recordings of the mobile eye-tracker can be
used to analyse the visual behaviour towards the physi-
cal and human touchpoints. Our approach can be used
to perform these analyses automatically.
In our case study we tackle the analysis of one
type of customer journey involving a series of touch-
points, namely a museum visit (see section 5.3).
Analysis of this data includes wayfinding analysis,
analysis of human contacts, visual behaviour towards
specific works of art, etc.
2.3 Human-human Communication
A third example application tackles the analysis of
eye-tracker data of a human-human communication
experiment. Recent research on multimodal human-
human communication has explored the role of gaze
in turn-taking and feedback in face-to-face conversa-
tion (Jokinen et al., 2009), shared gaze in dialogue
and the function of gaze as a directive instrument
in communication (Brône et al., 2010). Among the
questions that are addressed in this field are: Does
a speaker visually address his/her audience during a
presentation? How does the audience divide its visual
attention between a speaker and relevant artefacts?
In this case, the specific challenge for the object-
based eye-tracking system resides in the recognition
of human bodies (and body parts), and the automatic
analysis of attentional distribution between multiple
objects. This test will allow for a first insight into
the system’s reliability for the analysis of preferential
looking in communication (e.g. a speaker looking at
audience vs. notes).
3 RELATED WORK
In the introduction, we already mentioned the con-
cepts of object recognition and object detection. The
task of object recognition consists of retrieving a
given object that is identical to a trained object in a
set of images. Object detection on the other hand
has extended the principle of detecting objects with a
known specific appearance towards detecting objects
based on a general object class model that contains
intra-class variability. The next subsections describe
a selection of techniques of both object recognition
and object detection that can be used in our applica-
tion.
3.1 Object Recognition Techniques
Object recognition, or finding an object that is iden-
tical to a trained one, is traditionally done with lo-
cal feature matching techniques. Recognition meth-
ods define local interest regions in an image, based on
specific features of the image content, which are de-
scribed with descriptor vectors. The characterisation
of these local regions with descriptor vectors that are
invariant to changes in illumination, scale and view-
point enables the regions to be compared across im-
ages. A survey of object recognition methods is given
in (Tuytelaars and Mikolajczyk, 2008), while (Miko-
lajczyk et al., 2005) report comparative experiments.
Renowned techniques are the Scale Invariant Fea-
ture Transform (SIFT) (Lowe, 2004) and Speeded
Up Robust Features (SURF) by (Bay et al., 2006).
Although SIFT and SURF are regarded as state-
of-the-art, we opted for a class of more recently
developed techniques due to licensing restrictions.
We compared two competitive alternatives for SIFT
and SURF, namely ORB (Rublee et al., 2011) and
BRISK (Leutenegger et al., 2011).
The ORB feature descriptor is built on the well-
known FAST keypoint detector (Rosten and Drum-
mond, 2005) and the recently developed BRIEF de-
scriptor (Calonder et al., 2010). ORB is a compu-
tationally efficient replacement for SIFT and SURF:
it has similar matching performance and is even less
affected by image noise. ORB is suitable for real-
time applications since it is faster than both SURF
and SIFT. Another competitive approach to keypoint
detection and description is Binary Robust Invariant
Scalable Keypoints (BRISK), which is as performant
as the state-of-the-art algorithms, but with a signifi-
cantly lower computational cost.
An evaluation of these detectors is presented
by (Miksik and Mikolajczyk, 2012). Although these
results demonstrate that BRISK outperforms ORB,
we chose to use ORB in our algorithm based on
our own experiments. Mobile eye-trackers are often
equipped with low-resolution scene cameras, for ex-
ample 320 by 240 px on the Arrington mobile eye-
tracker. Moreover, on top of this low resolution we
are only interested in a specific region around the gaze
cursor, yielding a final ROI of approximately 120 by
120 px. We noted that applying BRISK to such small
images often results in an insufficient number of ex-
tracted keypoints as shown in figure 3, and thus does
not generate an adequate number of matches.
AutomaticAnalysisofIn-the-WildMobileEye-trackingExperimentsusingObject,FaceandPersonDetection
627
Figure 3: Comparison between ORB and BRISK. The x-axis gives the value of both the width and the height of the image.
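To make this comparison concrete, the following is a minimal sketch of how such a keypoint-count comparison can be run with OpenCV. The test image file name and the ORB/BRISK parameters are illustrative assumptions, not the exact settings of our evaluation.

```python
import cv2

# Sketch: count the keypoints that ORB and BRISK extract from progressively
# smaller square crops, mimicking the low-resolution gaze ROI of roughly
# 120 by 120 px discussed above. "scene_frame.png" is a hypothetical image.
img = cv2.imread("scene_frame.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=500)
brisk = cv2.BRISK_create()

for size in (320, 240, 160, 120, 80):
    patch = cv2.resize(img, (size, size))
    n_orb = len(orb.detect(patch, None))
    n_brisk = len(brisk.detect(patch, None))
    print(f"{size:3d}x{size:3d} px  ORB: {n_orb:4d}  BRISK: {n_brisk:4d}")
```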
3.2 Object Detection Techniques
Several studies in the field of visual behaviour have
shown that visual attention is particularly attracted to
other persons (Judd et al., 2009; van Gompel, 2007)
and faces (Henderson, 2003). To broaden the use of
our approach we implemented techniques to automat-
ically detect whether a person looked at another per-
son. We make a distinction between specifically look-
ing at a face, e.g. during talking, and looking at some-
one from a larger distance.
Since each body or face is unique it is impossible
to apply the previously mentioned object recognition
techniques. Object detection on the other hand can be
described as detecting instances of objects of a certain
class, in which the appearance of objects may vary,
such as humans or faces, and therefore it is suitable
for this purpose. A short overview of two robust ob-
ject detection algorithms is given below.
The technique presented by (Viola and Jones,
2001) has proven to be a very useful tool to detect
faces in natural images. This technique combines a
set of weak classifiers into a final strong classifier and
uses a sliding window approach to search for specific
patterns in the image. Haar feature models can be
used to detect faces, eyes, mouths, etc. A main
drawback of this technique is the limited viewing an-
gle for which standard models can be used.
For full human bodies, on the other hand, state-of-
the-art detectors include the Deformable Parts Model
(DPM) (Felzenszwalb et al., 2010), Integral Chan-
nel Features (Dollár et al., 2009) and Random Hough
Forests (Gall and Lempitsky, 2009). The DPM
uses a parts-based extension of HOG (Dalal and Triggs,
2005), and is therefore robust to various postures
and viewing angles. Models trained on the PASCAL
and INRIA Person datasets have proven to be very
robust in cases where a full body is visible, but some-
times fail when a body is not visible from head to
foot (Dollár et al., 2012). Unfortunately, since the
scene camera of the eye-tracker has a restricted ver-
tical viewing angle, people at close range appear
cropped in the image, as illustrated in figure 4. In
eye-tracking experiments we are often interested in
the interaction between people, and thus we capture
many such images.

Figure 4: Example eye-tracker images in which a complete human body is not visible from head to foot.
To summarize this section: we selected ORB to detect
whether the subject looked at specific objects, and to
determine whether the subject looked at another per-
son we chose the Viola and Jones and DPM-based al-
gorithms.
4 TECHNICAL APPROACH
The input of our algorithm consists of a videostream,
captured by the scene camera of an eye-tracker, and a
data file which contains the corresponding gaze loca-
tions. As explained in the previous section, we apply
two different techniques to analyse the eye-tracking
data. The first part of this section discusses the imple-
mentation of the ORB technique to detect how often
and for how long a particular object was viewed. The
second part handles the implementation of techniques
to count how often and for how long a face or a person
was viewed.
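As an illustration, a minimal sketch of this input stage is given below. The file names and the CSV column layout of the gaze file are assumptions, since the exact export format depends on the eye-tracker used.

```python
import csv
import cv2

# Sketch: iterate over the scene-camera video together with the per-frame
# gaze coordinates exported by the eye-tracker, yielding one
# (scene image, gaze point) pair per video frame.
def frames_with_gaze(video_path="scene.avi", gaze_path="gaze.csv"):
    with open(gaze_path, newline="") as f:
        gaze = [(float(row["x"]), float(row["y"])) for row in csv.DictReader(f)]
    cap = cv2.VideoCapture(video_path)
    for gaze_xy in gaze:
        ok, frame = cap.read()
        if not ok:
            break
        yield frame, gaze_xy
    cap.release()
```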
4.1 Recognition of Specific Objects
This part of our approach focuses on how we process
eye-tracker data to generate basic statistics for spe-
cific objects to be detected. This is done in five steps:
1. Preprocessing step: since we are only interested
in the objects that appear close to the visual fixa-
tion point, the input images of the forward looking
camera are cropped around the gaze coordinates.
Based on experiments, we chose to crop a ROI of
120 by 120 px around the gaze cursor.
2. In the next step the user selects objects of in-
terest in the datastream by simply clicking on
them while the video is playing. These ob-
jects are then stored in an object database, avoid-
ing the tedious task of manually creating such
a database with training images of the objects,
as proposed in other approaches (Toyama et al.,
2012; De Beugher et al., 2012).
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
628
3. The third step consists of searching for correspon-
dences between each cropped frame and each
frame stored in the database, using ORB features.
We apply a matching algorithm, based on the Eu-
clidean distance to find similar keypoints between
each image pair. Furthermore, we apply
several filtering techniques to eliminate weak or
false matches. First, the distance between the two
best matches is evaluated: if this distance is large
enough it is safe to accept the first best match,
since it is unambiguously the best choice. Sec-
ondly, a symmetrical matching scheme is used,
which imposes that for a pair of matches, both
points must be the best matching feature of the
other. The last step involves a fundamental matrix
estimation method based on RANSAC (Fischler
and Bolles, 1981) to remove the outliers. This
approach ensures that when we match feature
points between two images, we only keep those
matches that fall onto the corresponding epipolar
lines.
4. In the fourth step we assign a score S to each pair
of images:

S = \frac{\sum_{i=1}^{m} d(k_i, k'_i)}{m \left( \sum_{i=1}^{m} A(k_i) + \sum_{i=1}^{m} A(k'_i) \right)},   (1)

where k_i and k'_i stand for the i-th keypoints of the
corresponding images, m is the total number of
matches and A(k_i) stands for the size of the cor-
responding feature. This score S is then used to
decide whether a cropped frame exhibits sufficient
agreement with one of the frames in the database by
comparing S to a tunable threshold. A code sketch
of steps 1 to 4 is given after this list.
5. In the fifth and final step we cluster consecutive
similar frames into a "visual fixation". We define
a visual fixation as a series of images in which the
same object was viewed with a minimal duration
of 100 ms or three consecutive frames. This min-
imal length factor allows us to remove many false
detections.
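The sketch below illustrates steps 1 to 4 using OpenCV. It is a minimal sketch under assumptions: the Hamming norm, customary for binary ORB descriptors, is used here instead of the Euclidean distance mentioned in step 3, and the ratio-test and RANSAC parameters are illustrative values rather than our exact settings.

```python
import cv2
import numpy as np

ROI_SIZE = 120                        # crop size around the gaze cursor (step 1)
orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)

def crop_roi(frame, gaze_xy, size=ROI_SIZE):
    """Step 1: crop a size-by-size region around the gaze coordinates."""
    h, w = frame.shape[:2]
    x0 = min(max(0, int(gaze_xy[0]) - size // 2), max(0, w - size))
    y0 = min(max(0, int(gaze_xy[1]) - size // 2), max(0, h - size))
    return frame[y0:y0 + size, x0:x0 + size]

def ratio_matches(des_a, des_b, ratio=0.75):
    """Keep a match only if it is clearly better than the second-best one."""
    good = {}
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good[pair[0].queryIdx] = pair[0]
    return good

def match_score(img_a, img_b):
    """Steps 3 and 4: symmetric ORB matching, RANSAC outlier removal and the
    score S of Eq. (1); lower values indicate closer agreement here, since the
    numerator sums the descriptor distances of the surviving matches."""
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return None
    ab, ba = ratio_matches(des_a, des_b), ratio_matches(des_b, des_a)
    # Symmetry check: both keypoints must be each other's best match.
    matches = [m for m in ab.values()
               if m.trainIdx in ba and ba[m.trainIdx].trainIdx == m.queryIdx]
    if len(matches) < 8:               # enough points for fundamental-matrix RANSAC
        return None
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.99)
    if mask is None:
        return None
    inliers = [m for m, keep in zip(matches, mask.ravel()) if keep]
    if not inliers:
        return None
    dist = sum(m.distance for m in inliers)
    area = sum(kp_a[m.queryIdx].size + kp_b[m.trainIdx].size for m in inliers)
    return dist / (len(inliers) * area)            # score S of Eq. (1)
```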
An example result of this algorithm is shown in fig-
ure 5. A total of four objects were selected from an
eye-tracker experiment. In this figure we display a
single image per visual fixation for each frame of in-
terest.

Figure 5: Results of the object detection algorithm. First and last frame number of each visual fixation is displayed.
4.2 Detection of Faces and Bodies
The second part of our approach focuses on the detec-
tion of faces and human bodies, as is necessary for
e.g. customer journey experiments (see section 2.2).
As explained in the previous section, we use the
standard Viola and Jones technique for face detec-
tion in combination with the standard OpenCV frontal
face Haar-cascade model, and the standard DPM tech-
nique (Felzenszwalb et al., 2010) in combination with
a newly trained 60% torso model, as explained below.
We apply those techniques to the images captured by
the scene camera of the eye-tracker, but we only keep
the bounding boxes that are close to the gaze coordi-
nates of the corresponding frame.
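A minimal sketch of this gaze-based filtering for the face detector is given below. The maximum distance to the gaze point is an assumed parameter, and the cascade file is the standard frontal face model shipped with the opencv-python distribution.

```python
import cv2
import numpy as np

# Sketch: detect frontal faces with the standard OpenCV Haar cascade and keep
# only the bounding boxes whose centre lies close to the gaze coordinates of
# the corresponding frame. max_dist (in px) is an assumption.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def faces_near_gaze(frame_gray, gaze_xy, max_dist=60):
    boxes = face_cascade.detectMultiScale(frame_gray, scaleFactor=1.1,
                                          minNeighbors=5)
    gx, gy = gaze_xy
    kept = []
    for (x, y, w, h) in boxes:
        cx, cy = x + w / 2.0, y + h / 2.0
        if np.hypot(cx - gx, cy - gy) <= max_dist:
            kept.append((x, y, w, h))
    return kept
```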
This basic implementation performs sufficiently
well, but there is still room for improvement. We pro-
pose a temporal smoothing technique (see figure 6)
by using the gaze-data to improve the detection rate,
thus minimizing both false positives and false nega-
tives. To reduce the number of false positives, we as-
sume that a valid face/person detection should persist
for at least a certain time (tunable via a threshold, for
example 100 ms or 3 subsequent frames). This cri-
terion substantially reduces the number of false posi-
tives (since many false detections occur only sporadically).
On the other hand, if we find gaps between detection
sequences, we can assume those are missed detec-
tions. Interpolating them improves the detection rate
and thus further reduces the number of false negatives.
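The following sketch illustrates this temporal smoothing on a per-frame detection sequence. The minimum run length corresponds to the 100 ms / 3 frame criterion mentioned above, while the maximum gap length that is filled in is an assumed parameter.

```python
def temporal_smoothing(detected, min_run=3, max_gap=3):
    """Sketch: 'detected' is a list of booleans, one per frame, indicating
    whether a face/person was detected near the gaze point.  Runs shorter
    than min_run frames are discarded (false positives), and gaps of at most
    max_gap frames between surviving runs are filled in (false negatives)."""
    smoothed = list(detected)
    n = len(smoothed)
    # 1) Remove isolated short runs of detections.
    i = 0
    while i < n:
        if smoothed[i]:
            j = i
            while j < n and smoothed[j]:
                j += 1
            if j - i < min_run:
                for k in range(i, j):
                    smoothed[k] = False
            i = j
        else:
            i += 1
    # 2) Fill short gaps that lie strictly between two detection runs.
    i = 0
    while i < n:
        if not smoothed[i]:
            j = i
            while j < n and not smoothed[j]:
                j += 1
            if 0 < i and j < n and j - i <= max_gap:
                for k in range(i, j):
                    smoothed[k] = True
            i = j
        else:
            i += 1
    return smoothed
```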
The Haar-cascade method works best for frontal
faces, which is ideally suited for this application
where we want to count face-to-face interactions.
However if a face is presented in profile, detection
will often fail. There are some models designed for
profile face detection, but we saw little or no advan-
tage with this as compared to the standard frontal face
model.

Figure 6: Vertical bars: real detections. Dashed line: output of the temporal smoothing.

Figure 7: Components of the torso model.

The DPM is very robust in the detection of
full bodies, but because in our application persons are
mostly not visible from head to foot (see figure 4), we
used a self-trained torso model instead of the standard
person model.
To overcome these shortcomings, we have trained
a new model based on the standard PASCAL VOC
dataset¹. This new model is trained using only the
upper 60% of the labeled bounding boxes of human
bodies, resulting in a human torso model as illustrated
in figure 7. Our model consists of three components,
each belonging to a specific viewpoint. Every compo-
nent is defined by a root filter (a), several part filters
(b) and a spatial model for the location of each part
relative to the root (c). This approach to cope with
image border occlusion is also followed by (Mathias
et al., 2013), but for a channel features detector. To
the best of our knowledge, we are the first to use it on
a DPM-detector. A second advantage of this cropped
model is the possibility of using the first component
as an upper body (head and shoulders) detector. This
model is, compared to the Haar-cascade model, robust
to various poses of the head.
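The conversion of the PASCAL person annotations into torso annotations amounts to keeping the upper 60% of each labelled bounding box, as sketched below under the assumption of PASCAL-style pixel coordinates.

```python
def torso_box(person_box, keep_fraction=0.60):
    """Crop a full-body bounding box (x_min, y_min, x_max, y_max) to its
    upper keep_fraction, producing the torso annotation used to train the
    occlusion-robust model (sketch; coordinates in pixels, y grows downward)."""
    x_min, y_min, x_max, y_max = person_box
    new_y_max = y_min + keep_fraction * (y_max - y_min)
    return (x_min, y_min, x_max, new_y_max)

# Example: a 200 px tall person box becomes a 120 px tall torso box.
print(torso_box((10, 50, 90, 250)))   # -> (10, 50, 90, 170.0)
```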
5 EXPERIMENTAL RESULTS
In order to test our person and object detection
scheme on real-life eye-tracker experiments, we
recorded a large set of eye-tracking experiments.
These experiments included (i) a subject walking
through a university campus building and looking for
signs, (ii) a subject walking through the streets, while
paying attention to traffic and other signs, (iii) a sub-
ject attending a presentation given by a lecturer and
(iv) a larger experiment where multiple participants
visited a museum. In this last experiment fourteen
participants (7 male, 7 female) were recorded while
they visited a special exhibition at Museum M in Leu-
ven (Belgium), starting from the ticket counter all the
way to the gift shop. The goal of this experiment
was to determine the ease-of-use and experience of
the self-guided tour: signage, information, view time
of specific works, etc. Recordings were made with
Tobii Glasses and Arrington Gig-E60 mobile systems
and resulted in 630 minutes of video material.

¹The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Dataset: http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html

Figure 8: Example of the full body person detection.
5.1 Object Recognition Results
We tested our object recognition technique on a
labeled set of images, captured during the above-
mentioned experiment (i). This set consists of 2000
images with 716 labeled objects of six different cate-
gories. A precision-recall curve indicating the perfor-
mance of our detector on this test set is shown in fig-
ure 9. The obtained detection results are satisfactory
for most of the objects. However, a large scale vari-
ance results in a lower detection rate, as illustrated by
the curve of the toilet sign, which is looked at both
from very far away and from close by. Table 1 shows
the execution time for a given number of selected ob-
jects of interest and a given number of video frames.
As illustrated in this table, data of an eye-tracker ex-
periment of 6000 frames (3m 20s of video data) can
be processed in a couple of minutes, less than the du-
ration of the video itself. These tests were performed
on a normal recent desktop PC.

Figure 9: Precision-recall curve of our object recognition technique tested on a set of 2000 images.

Table 1: Computational time of the object recognition implementation.

# selected objects    2       3       4       5
video of 1m 6s        31 s    42 s    54 s    68 s
video of 2m 13s       61 s    80 s    104 s   133 s
video of 3m 20s       94 s    122 s   162 s   201 s
5.2 Results of Face and Body Detections
In order to present results of the face and torso de-
tections, we have labeled a set of 3000 images cap-
tured during the museum visit. In this labeling we
made a distinction between looking at an upper body
and looking at a person. This distinction is made in
order to detect when the subject is talking to some-
one, or when he/she just looks at another person. In
figure 10 we present a set of precision-recall curves
displaying the improvements we have made. A de-
tection is counted if it corresponds to either an upper
body label or a torso label. The first curve shows the
performance of the standard VOC 2009 model on our
dataset. The second curve shows the performance of
the standard VOC 2009 model in combination with
our temporal smoothing approach. The last curve
shows the new torso model in combination with our
temporal smoothing. Especially in the recall region be-
tween 0.8 and 0.9, we reached a significant improve-
ment compared to the standard model. Indeed, with
our technique, the Mean Average Precision (MAP)
value increases by 6%.
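For reference, the sketch below shows one way to compute a precision/recall point from such labeled data. The IoU threshold and the one-to-one matching of detections to labels are assumptions about the evaluation protocol, which the text above only describes as counting a detection when it corresponds to an upper body or torso label.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(detections, labels, iou_thr=0.5):
    """A detection counts as correct if it overlaps an unmatched ground-truth
    label (upper-body or torso) with IoU >= iou_thr; each label matches once."""
    matched, tp = set(), 0
    for det in detections:
        candidates = [(iou(det, lab), i) for i, lab in enumerate(labels)
                      if i not in matched]
        if candidates:
            best_iou, best_i = max(candidates)
            if best_iou >= iou_thr:
                tp += 1
                matched.add(best_i)
    precision = tp / len(detections) if detections else 1.0
    recall = tp / len(labels) if labels else 1.0
    return precision, recall
```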
As mentioned in the previous section, it is pos-
sible to use the first component of our model to de-
tect upper bodies, and thus use this model as an alter-
native/additional technique to the Haar-cascade face
model. An illustration of those results is presented
in figure 11. It is clear that our combined Haar-
cascade/upper body DPM model (yellow curve) is
significantly better than either the Haar-cascade detec-
tor (blue square) or our DPM upper body model (yel-
low) used separately.
5.3 Combined Results of Objects, Face
and Body Detections
An overview of the automatically generated output of
our algorithm for the museum experiment is given in
figure 12. In this figure we show a visual timeline in-
dicating how often a visitor looked at a specific object
such as a route map, a specific work of art, etc. Fur-
thermore, it indicates interactions with other persons,
for example when buying a ticket or asking for infor-
mation about the exhibition.

Figure 10: Precision-recall curves of our body detection implementation compared to a standard model.

Figure 11: Precision-recall curves of face detection compared to upper body detections.
6 CONCLUSIONS AND FUTURE
WORK
In this paper we presented an approach for automatic
eye-tracker data processing based on object, face and
person detection. As opposed to (Toyama et al., 2012)
and (De Beugher et al., 2012) we presented an object
detection scheme in which a separate training step is no
longer required. On top of the object detection we
presented an approach suited for counting how often
and for how long one looked at a person or a face.
In order to further improve the detection rate we pro-
posed two novelties. The first is a temporal smooth-
ing approach that reduces the number of false positives
and false negatives. The second is the training of a new DPM
model which is designed for torso and upper body de-
tections. We illustrated the accuracy and performance
of our approach, gave a comparison with standard
techniques, and presented results of large-scale real-
life experiments. Our future work concentrates on
intelligent sampling, which should avoid the process-
ing of every eye-tracker tick and thus reduce process-
ing time. We will also pay attention to a more user-
friendly visualisation of the results.

Figure 12: Results of our algorithm applied to the recordings of the museum visit. Each timeline represents a short summary of the viewing behaviour of a participant.
REFERENCES
Alves, R., Lim, V., Niforatos, E., Chen, M., Karapanos, E.,
and Nunes, N. J. (2012). Augmenting customer jour-
ney maps with quantitative empirical data: a case on
eeg and eye tracking.
Bay, H., Tuytelaars, T., and Van Gool, L. (2006). Surf:
Speeded up robust features. In ECCV, pages 404–417.
Brône, G., Oben, B., and Feyaerts, K. (2010). Insight inter-
action. a multimodal and multifocal dialogue corpus.
Brône, G., Oben, B., and Goedemé, T. (2011). Towards
a more effective method for analyzing mobile eye-
tracking data: integrating gaze data with object recog-
nition algorithms. In Proc. of PETMEI, pages 53–56.
Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010).
Brief: binary robust independent elementary features.
In ECCV, pages 778–792.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In CVPR, pages 886–
893.
De Beugher, S., Brône, G., and Goedemé, T. (2012). Au-
tomatic analysis of eye-tracking data using object de-
tection algorithms. PETMEI.
Dollár, P., Tu, Z., Perona, P., and Belongie, S. (2009). Inte-
gral channel features. In BMVC.
Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2012).
Pedestrian detection: An evaluation of the state of the
art. PAMI, 34.
Evans, K. M., Jacobs, R. A., Tarduno, J. A., and Pelz, J. B.
(2012). Collecting and analyzing eye-tracking data
in outdoor environments. Journal of Eye Movement
Research, pages 1–19.
Felzenszwalb, P. F., Girshick, R. B., and McAllester, D.
(2010). Cascade object detection with deformable part
models. In CVPR, pages 2241–2248. IEEE.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: a paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Commun. ACM, 24(6):381–395.
Gall, J. and Lempitsky, V. (2009). Class-specific hough
forests for object detection. In CVPR.
Henderson, J. M. (2003). Human gaze control during real-
world scene perception. Trends in Cognitive Sciences,
7(11):498 – 504.
Jokinen, K., Nishida, M., and Yamamoto, S. (2009). Eye-
gaze experiments for conversation monitoring. In
Proc. of the 3rd IUCS, pages 303–308.
Judd, T., Ehinger, K., Durand, F., and Torralba, A. (2009).
Learning to predict where humans look. In ICCV,
pages 2106–2113. IEEE.
Leutenegger, S., Chli, M., and Siegwart, R. (2011). Brisk:
Binary robust invariant scalable keypoints. In ICCV.
Lowe, D. (2004). Distinctive image features from scale-
invariant keypoints. In IJCV, pages 91–110.
Mathias, M., Benenson, R., Timofte, R., and Van Gool, L.
(2013). Handling occlusions with franken-classifiers.
Submitted for publication.
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A.,
Matas, J., Schaffalitzky, F., Kadir, T., and Van Gool,
L. (2005). A comparison of affine region detectors.
IJCV, 65(1-2):43–72.
Miksik, O. and Mikolajczyk, K. (2012). Evaluation of local
detectors and descriptors for fast feature matching. In
ICPR, pages 2681–2684.
Rosten, E. and Drummond, T. (2005). Fusing points and
lines for high performance tracking. In ICCV,
pages 1508–1515.
Rublee, E., Rabaud, V., Konolige, K., and Bradski, G.
(2011). Orb: An efficient alternative to sift or surf.
In ICCV, pages 2564–2571.
Toyama, T., Kieninger, T., Shafait, F., and Dengel, A.
(2012). Gaze guided object recognition using a head-
mounted eye tracker. In ETRA, pages 91–98.
Tuytelaars, T. and Mikolajczyk, K. (2008). Local invariant
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
632
feature detectors: a survey. Found. Trends. Comput.
Graph. Vis., 3(3):177–280.
van Gompel, R. (2007). Eye Movements: A Window on
Mind and Brain. Elsevier Science.
Viola, P. and Jones, M. (2001). Rapid object detection using
a boosted cascade of simple features. In CVPR, pages
511–518.
Yun, K., Peng, Y., Samaras, D., Zelinsky, G. J., and Berg,
T. L. (2013). Studying relationships between human
gaze, description, and computer vision. In CVPR.
AutomaticAnalysisofIn-the-WildMobileEye-trackingExperimentsusingObject,FaceandPersonDetection
633