Review of the Use of Electroencephalography as an Evaluation Method for Human-Computer Interaction

Jérémy Frey, Christian Mühl, Fabien Lotte and Martin Hachet

Univ. Bordeaux, LaBRI, UMR 5800, F-33400 Talence, France

CNRS, LaBRI, UMR 5800, F-33400 Talence, France

INRIA, F-33400 Talence, France

Keywords:

HCI evaluation, EEG, ErrP, Workload, Attention, Emotions.

Abstract:

Evaluating human-computer interaction is essential as a broadening population uses machines, sometimes in sensitive contexts. However, traditional evaluation methods may fail to combine real-time measures, an “objective” approach and data contextualization. In this review we look at how adding neuroimaging techniques can respond to such needs. We focus on electroencephalography (EEG), as it could be handled effectively during a dedicated evaluation phase. We identify workload, attention, vigilance, fatigue, error recognition, emotions, engagement, flow and immersion as being recognizable by EEG. We find that workload, attention and emotions assessments would benefit the most from EEG. Moreover, we advocate further study of error recognition through neuroimaging to enhance usability and improve user experience.

1 INTRODUCTION

Throughout the history of computer science, interfaces and interactions have become more complex. Nowadays computers are everywhere and used by everyone. It is necessary to make them practical to use and compatible with human capabilities. This is mostly done by evaluating HCI before public release. Yet traditional evaluation methods can be ambiguous, lack real-time recordings, or disrupt the interaction.

On the other hand, new technologies arise. Phys-

iological sensors help to improve the ergonomics

of human-computer interaction (HCI) (Fairclough,

2009). Systems could be tuned to users by monitor-

ing their mental workload in real-time (Kohlmorgen

et al., 2007). Physiological sensors add an insightful

information channel. However, sensors may be intrusive or require calibration to record a proper signal, and some are not readily available to consumers.

These issues could be resolved by using physiological sensors in HCI evaluation. While designing a user interface (UI), it should be acceptable to impose the hindrance of sensors on specially enrolled users. Those testers will then help to improve the UI beforehand. Laboratory conditions permit a controlled setup for repeatable measures. Neuroimaging relies on demanding but sensitive sensors. We consider it an innovative supplement to conventional evaluation methods. Measuring neural activity during HCI can help us to better understand what occurs in the brain when users are interacting (Parasuraman, 2013).

We highlight in this paper which neuroimaging

techniques could be used conveniently within labora-

tories to overcome the difﬁculties encountered by tra-

ditional evaluation methods alone. We review a reper-

toire of patterns of users’ state which could be used to

characterize HCI, and evaluate how neuroimaging ob-

jectively measures them. We call those patterns “con-

structs”, a term which refers to notions as different as

workload and the state of “ﬂow”.

Other papers have already begun to explore how neurotechnologies benefit HCI, but they do not cover evaluation (George and Lécuyer, 2010), or if so they do not study many constructs. (Parasuraman, 2013) only discusses workload, vigilance and error recognition. In the present review we gathered from the HCI literature every major construct which could potentially be evaluated with brain activity.

In this review, we first briefly describe the different families of evaluation methods aimed at assessing HCI and UI quality, along with their advantages and drawbacks. We divided them into four categories: behavioral studies (observations of users' actions in real-time), inquiries (e.g. questionnaires, interviews, think aloud), physiological sensors (e.g. heart rate, galvanic skin response) and neuroimaging (a subset of physiological sensors that records brain activity). We also formalize a new scale (whether the measure is “exocentric” or “egocentric”) which could help to choose the right combination of methods for evaluations.

Frey J., Mühl C., Lotte F. and Hachet M.
Review of the Use of Electroencephalography as an Evaluation Method for Human-Computer Interaction.
DOI: 10.5220/0004708102140223
In Proceedings of the International Conference on Physiological Computing Systems (PhyCS-2014), pages 214-223
ISBN: 978-989-758-006-2
© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

We show that electroencephalography (EEG) is the neuroimaging technique which offers the best trade-off between spatial and temporal resolution, practical use and cost. Therefore we focus on EEG in the second part, where we review constructs related to the quality of HCI. We identified that workload, attention, vigilance, fatigue, error recognition, emotions, engagement, flow and immersion are useful for evaluation and can be measured with EEG.

Finally we outline the challenges and limitations

which arise from this encounter between HCI evalua-

tion and neurotechnologies, as well as constructs that

could beneﬁt from being measurable with EEG.

2 EVALUATION METHODS

2.1 Behavioral Studies

Recording users' interactions, such as mouse speed, is one standard way to evaluate a UI. “Behavioral studies” refers to this method: observing the behavior and actions of users within a piece of software. Behavioral studies are close to performance measures, as seen in human factors.

The easiest way to sense if a UI is well designed is

to watch users. How fast do they complete the task?

Are they more accurate with a bigger mouse cursor?

Such methods helped to formulate a preeminent law

in HCI, Fitts’s law, which is all about time to reach a

target depending on its distance and size (Fitts, 1954).
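As an illustration of the relation that Fitts's law captures, the sketch below uses the original formulation from (Fitts, 1954), in which the index of difficulty of a target of width W at distance D is log2(2D/W), and models movement time as a linear function of that index. The intercept `a` and slope `b` are hypothetical values chosen only for the example; in practice they would be fitted to recorded pointing data.

```python
import math


def index_of_difficulty(distance: float, width: float) -> float:
    """Fitts's index of difficulty, in bits: log2(2D / W)."""
    if distance <= 0 or width <= 0:
        raise ValueError("distance and width must be positive")
    return math.log2(2.0 * distance / width)


def movement_time(distance: float, width: float,
                  a: float = 0.1, b: float = 0.15) -> float:
    """Predicted time (seconds) to reach a target: a + b * ID.

    a and b are hypothetical constants, not values from the paper.
    """
    return a + b * index_of_difficulty(distance, width)


# Same distance, two target sizes: the smaller target is harder to reach.
small_target = movement_time(distance=400, width=20)
large_target = movement_time(distance=400, width=80)
print(f"small target: {small_target:.3f} s, large target: {large_target:.3f} s")
```

Comparing predicted and observed movement times for each UI element is one way a behavioral study could flag targets that are too small or too far away.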

Although behavioral studies are able to account in real-time for users' interactions, they can be hard to interpret: measures may not be specific to one construct. E.g. a long reaction time can be caused either by a low concentration level or by a high workload (Berka and Levendowski, 2007), (Hart and Staveland, 1988). On top of that, behavioral studies may not provide much information on the users' state. With simple tasks in particular, little can be computed besides reaction times and a performance metric.

2.2 Inquiries

While it is possible to infer users’ thoughts through

a behavioral study, it may be simpler to record their

opinion. We call this “inquiries”. In HCI we are inter-

ested in questionnaires related to the use of a UI. Stan-

dardized questionnaires have been validated across

several studies for various measures: e.g. NASA-TLX

for workload (Hart and Staveland, 1988).

Unfortunately those “pen and paper” tests are discrete and are not suited to real-time assessments. The “think aloud” protocol (Weber, 2007) is a way to circumvent this, yet it could influence the interaction, as users have two different things to do: interact and report their experience. It is an example of dual-tasking and divided attention (Ogolla, 2011). “Focus groups” (Bruseberg and McDonagh-Philp, 2002) are a third form of inquiry. They involve experts and advanced users, who discuss their findings under the control of the designer.

Questionnaires, think aloud and focus groups are three different forms of inquiry fraught with the same hazards. Resulting measures are prone to contamination by ambiguities (Nisbett and Wilson, 1977), social pressure (Picard, 1995) or participants' memory limitations (Kivikangas et al., 2010). Answers may also be oriented toward experimenters' expectations if subjects figure out what is at stake.

2.3 Physiological Sensors

When humans interact with computers, bodily changes co-occur with mental changes. E.g. pupils dilate while experiencing strong emotions (Partala and Surakka, 2003). Physiological sensors can be used to account for such bodily changes in HCI (Fairclough, 2009), (Dirican and Göktürk, 2011) or game (Ravaja, 2009), (Nacke et al., 2009) research. Galvanic skin response (GSR, also called “electrodermal activity”) is among those sensors, as well as electrocardiography (ECG, the signal modality from which heart rate is derived) and electromyography (EMG, which records the signals caused by muscular activity, including facial expressions).

Even if someone trained could control his heart-

beat, physiological cues are great for the “objectiv-

ity” they bring into HCI (see section 2.5). Body re-

actions are sometimes misleading though: you may

record ECG to study attention, whereas an increase

in heartbeat can also be caused by strong feelings.

Muscles and organs are controlled by the peripheral

nervous system. Physiological sensors are a second-

order inference about the processing which occurs in

the central nervous system.

2.4 Neuroimaging

Neuroimaging is a currently rising field used in brain-computer interface (BCI) settings (Blankertz et al., 2010), (Hamadicharef, 2010). Neuroimaging techniques allow the assessment of brain activity; we classify them separately even if, strictly speaking, they belong to physiological sensors.

Non-invasive neuroimaging techniques, which do

ReviewoftheUseofElectroencephalographyasanEvaluationMethodforHuman-ComputerInteraction

215

not require surgery, are divided into two main families

(Zander and Kothe, 2011). Functional magnetic res-

onance imaging (fMRI) and functional near-infrared

spectroscopy (fNIRS) record brain activity through

blood ﬂow variations. fMRI has a very good spatial

resolution but is a large device which completely sur-

rounds subjects and costs about one million dollars.

fNIRS is a much more lightweight and affordable de-

vice. Instead of magnetic ﬁelds, it uses direct light

for recordings. Sensors are ﬁxed on a cap, hence sub-

jects are free to interact with a computer while wear-

ing it. Compared to fMRI, the spatial resolution of

fNIRS is less detailed. It records only the outer re-

gion of the brain – light is absorbed by tissues. fMRI

and fNIRS share a poor temporal resolution. With a

latency reaching up to several seconds it is difﬁcult to

observe fast and short responses.

The second family of neuroimaging uses electri-

cal currents generated by neural activity. Magnetoen-

cephalography (MEG) records magnetic ﬁelds. It is

less heavy and expensive than fMRI, but still hardly

manageable for uses in HCI contexts. MEG has a

high temporal resolution, down to the millisecond.

Electroencephalography (EEG) also has a high temporal resolution. It is comparable in size to fNIRS. EEG measures electrical current on the scalp. Electrodes are either “dry” (no electrolyte solution) or, more frequently, “wet” (the solvent is either water or gel). Despite its poor spatial resolution, it is relatively cheap equipment for a laboratory. Because it is portable and non-invasive, it interferes little with the HCI setting.

Experimenters must be cautious with the limita-

tions of the device they choose. Is the signal-to-noise

ratio sufﬁcient for what they intend to measure? What

artifacts could pollute their data? Are they in con-

trol of the algorithms producing measures from raw

signals? That said, EEG is the most promising candi-

date to assist inquiries and other physiological sensors

in a wide range of evaluation measures. Compared to other neuroimaging devices, EEG offers the best compromise between spatial and temporal resolution, practical use and cost. Therefore we focus mostly on this type of brain activity recording in this paper.

2.5 A New Continuum for HCI Evaluation Methods

We have previously mentioned how the evaluation

methods do bring different levels of “objectivity” in

their measures. Unfortunately, in such context “ob-

jective” and “subjective” are scarcely deﬁned in the

literature. According to (van de Laar et al., 2013),

“the objective methods are based on overt and covert

user responses during interaction while the subjective

methods rely on user expressions after the interac-

tion”. From that perspective, inquiries are “subjec-

tive” while behavioral studies, physiological sensors

and neuroimaging are “objective”.

While we agree such a distinction is required, a

more rigorous vocabulary is needed. We also doubt

the “time” variable should be involved in the deﬁni-

tion. As stated in section 2.2, results of inquiries are

prone to social pressure and other self-interpretations,

and this is also true for the real-time think aloud.

Moreover, when studying emotions, it could be ar-

gued that only “subjective” feelings are recorded, as

the evaluation is centered on the user. Hence, with-

out a complex phrasing (i.e. “objective measure of

subjective feelings”), employing such words is open

to criticisms. As an alternative “direct” and “indi-

rect” could be considered. But then those concepts

are more likely to refer to how measures are reported,

not where they originate from (e.g. EMG vs an exter-

nal observer annotating facial expressions).

As such, we would like to introduce a new nomenclature to name those two aspects and avoid ambiguities: exocentric and egocentric. Those terms are borrowed from spatial navigation research (Brandt et al., 1973) and bring in the notion of the self. Exocentric measures are here close to the stimuli, to the source, while egocentric measures are close to the conscious thoughts of the user, to the outcome.

Figure 1: Proposal of an “exocentric / egocentric” scale

aimed at classifying evaluation methods for HCI.

We therefore create a continuous space between

two extremes (see Figure 1). We illustrate this scale

with the measurement of pain. The pressure of a

needle on a ﬁnger would represent a perfect exocen-

tric measure: the stimulus’ strength, a value discon-

nected from human body and perceptions. When the

pressure is transmitted to nociceptors in the skin, the

measure shifts a little from exocentric to egocentric.

As nerves are transmitting signals from the peripheral

nervous system to the brain, we go further to the right

of the axis. Since we may not be interested in skin’s

thickness, this neural activity represents the ﬁrst inter-

esting value from this side of the exo/egocentric scale.

Neuroimaging techniques record such activity, hence

it is the most exocentric evaluation method. When the

signal reaches the central nervous system, autonomic

responses are triggered, such as an increase in heart rate or galvanic skin response (Loggia et al., 2011). Those reactions could be recorded through physiological sensors, a step further from the exocentric extreme.

As the pain grows, it will alter behaviors and thoughts. A runner may slow down when experiencing pain in a foot, regardless of his willingness to continue. Behavioral studies are able to sense modifications occurring against the will of the subject; they could be placed somewhere in the middle of our scale. Concurrently, most of the time, the person is aware of the pain and could describe it if asked to. Many other cognitive processes are involved in such a high level of consciousness (e.g. planning, awareness), thus measures recorded by inquiries are close to the far end of the scale and are indeed egocentric.

This scale can be used for various evaluations.

Eventually, it is possible to add “objective/subjective”

and “direct/indirect” to describe a whole framework.

A construct could be objective (usability) or subjec-

tive (emotions). A tool could be either direct (sen-

sor) or indirect (observer). A method is more exo-

centric (neuroimaging) or egocentric (inquiries). E.g.

the work of an experimenter assessing workload with

ECG can be described as objective/exocentric/direct.
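The three axes of this framework can be sketched as a small data structure. This is only an illustration: the class names and the numeric positions on the exocentric/egocentric scale are assumptions made for the example (the text only gives the ordering of the methods and places behavioral studies "somewhere in the middle").

```python
from dataclasses import dataclass
from enum import Enum


class ConstructKind(Enum):
    OBJECTIVE = "objective"
    SUBJECTIVE = "subjective"


class ToolKind(Enum):
    DIRECT = "direct"      # e.g. a sensor
    INDIRECT = "indirect"  # e.g. an external observer


# Hypothetical positions: 0.0 = most exocentric method, 1.0 = most egocentric.
# Only the ordering follows the text; the numbers are illustrative.
METHOD_POSITION = {
    "neuroimaging": 0.0,
    "physiological sensors": 0.25,
    "behavioral studies": 0.5,
    "inquiries": 1.0,
}


@dataclass(frozen=True)
class Evaluation:
    construct: str
    construct_kind: ConstructKind
    method: str
    tool_kind: ToolKind

    def side(self) -> str:
        """Which end of the exocentric/egocentric scale the method leans to."""
        position = METHOD_POSITION[self.method]
        if position < 0.5:
            return "exocentric"
        if position > 0.5:
            return "egocentric"
        return "intermediate"

    def describe(self) -> str:
        return f"{self.construct_kind.value}/{self.side()}/{self.tool_kind.value}"


# The example from the text: workload assessed with ECG.
workload_ecg = Evaluation("workload", ConstructKind.OBJECTIVE,
                          "physiological sensors", ToolKind.DIRECT)
print(workload_ecg.describe())
```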

3 CONSTRUCTS

“Constructs” designate the patterns of users’ state

which could be used to characterize interactions. This

part reviews relevant constructs from an HCI eval-

uation perspective that can be assessed using neu-

roimaging techniques. We grouped similar measure-

ments.

3.1 Workload

3.1.1 Deﬁnition

Humans have a limited set of resources to process in-

formation (Just et al., 2003). The ratio between pro-

cessing power and data coming from the environment

determines mental workload. Workload increases as

cognitive resources lessen or as the quantity of demands grows. If the workload is too high, the subject's performance decreases, sometimes dramatically.

3.1.2 Neuroimaging

Using a device with 9 channels, (Berka and Levendowski, 2007) correlated EEG with workload. With better equipment, (Mathan et al., 2007) showed how EEG measures more subtle changes compared to ECG. fNIRS is another well-tried technology: neurons require more energy, hence more oxygen, as the load increases. fNIRS showed better results compared to EEG, with 82% of correct classifications between 2 classes (low vs high workload) and 50% with 3 classes (low, medium, high) (Hirshfield et al., 2009).

In (Blankertz et al., 2010) EEG online analyses (i.e. in real-time) discriminate 2 classes with 70% accuracy. A 2-minute time window enables scores from 80% to 90% (Brouwer et al., 2012). Still with 2 classes, reviews report scores close to 100% if EEG is combined with other physiological sensors (van Erp et al., 2010). (Grimes et al., 2008) claim 99% success in distinguishing 2 memory load levels, and 88% with 4.

3.2 Attention – Vigilance – Fatigue

3.2.1 Deﬁnition

Attention, vigilance and fatigue are closely related

and regularly measured altogether (Oken et al., 2006).

“Attention” refers to the ability to focus cognitive resources on a particular stimulus (Kivikangas et al., 2010). Proper selective attention makes it possible to ignore distractors. An insufficient attention level results in a difficulty or an inability to complete the task, whereas attention that is too high or too narrow may prevent someone from disengaging from a sub-task.

In the literature, “attention” more frequently designates the ability to perceive changes from the environment, while the term “vigilance” often refers to a broader resource, dependent on both cognitive performance and the arousal level on the sleep-wake spectrum (Oken et al., 2006). In that sense it refers to a state of sustained attention. One needs to maintain a high degree of vigilance over time in order to focus one's attention on something. Hereafter “alertness” will be considered a synonym of “vigilance”.

“Fatigue” is a state in which cognitive resources are exhausted. If the required level of vigilance or attention puts too great a strain on the organism, fatigue arises and performance decreases (Boksem et al., 2005). Then the task cannot be performed correctly and errors appear (van Erp et al., 2010).

3.2.2 Neuroimaging

The alpha band is associated with attention. When eyes are closed, or when fatigue occurs, alpha wave amplitude increases (Shaw, 2003). This frequency band in the range 8-12Hz is mostly generated by the occipital lobe. It is easily recorded with EEG, even with a single electrode (George et al., 2011). Alpha band analysis discriminates different attention levels (Klimesch et al., 1998). Furthermore, it makes it possible to detect, with 70% accuracy, which side of the visual field a subject is paying attention to while looking straight ahead (Trachel et al., 2013).
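To make the notion of alpha band power concrete, here is a minimal sketch that estimates the mean spectral power in the 8-12Hz range of a single-channel signal with a naive discrete Fourier transform. The signals are synthetic, and the amplitudes, the 128Hz sampling rate and the function names are assumptions made for the example; real analyses typically rely on dedicated signal processing libraries, filtering and artifact rejection.

```python
import cmath
import math


def band_power(samples, sampling_rate, low_hz, high_hz):
    """Mean spectral power between low_hz and high_hz (inclusive), naive DFT."""
    n = len(samples)
    powers = []
    for k in range(1, n // 2):
        freq = k * sampling_rate / n
        if low_hz <= freq <= high_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(samples))
            powers.append(abs(coeff) ** 2 / n)
    if not powers:
        raise ValueError("no frequency bin falls inside the requested band")
    return sum(powers) / len(powers)


def synthetic_eeg(alpha_amplitude, sampling_rate=128, seconds=2):
    """Toy signal: a 10 Hz (alpha) sine plus a weaker 20 Hz component."""
    n = sampling_rate * seconds
    return [alpha_amplitude * math.sin(2 * math.pi * 10 * t / sampling_rate)
            + 0.3 * math.sin(2 * math.pi * 20 * t / sampling_rate)
            for t in range(n)]


# Hypothetical amplitudes mimicking the increase of alpha when eyes are closed.
eyes_open = synthetic_eeg(alpha_amplitude=0.5)
eyes_closed = synthetic_eeg(alpha_amplitude=2.0)

alpha_open = band_power(eyes_open, 128, 8, 12)
alpha_closed = band_power(eyes_closed, 128, 8, 12)
print(f"alpha power, eyes open:   {alpha_open:.2f}")
print(f"alpha power, eyes closed: {alpha_closed:.2f}")
```

The same function can be called with other ranges, such as the theta and beta bands mentioned later in this section.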

Other types of brain activity are used, such as de-

lays in event-related potentials (ERP) – e.g. visual se-

lective attention in (Saavedra and Bougrain, 2012).

(Berka and Levendowski, 2007) suggested that

EEG is the only sensor which can accurately report

attention and vigilance shifts on a second-by-second

timeframe. Works investigating vigilance measures

are reviewed in (Parasuraman, 2013).

Regarding fatigue, although EEG signals are not more accurate than physiological sensors at detecting microsleeps, they offer the possibility to detect preceding inattentive states (Blankertz et al., 2010, sec. 3.1). Mental fatigue has been detected on 4-second time windows with 80% accuracy, or 94% over 30 seconds (Laurent et al., 2013). In order to improve reliability, additional frequency ranges were recorded in this study. For instance alpha, theta (4-8Hz) and beta (13-18Hz) bands have been combined. ERP, on the other hand, have been used to study how fatigue impairs cognitive processes differently (Lorist et al., 2000).

3.3 Error Recognition

3.3.1 Deﬁnition

We call “error recognition” the situation that occurs

when users detect by themselves an outcome differ-

ent from what is expected (Nieuwenhuis et al., 2001).

It can follow an action users genuinely trigger before realizing they made a mistake. Or it can happen due to commands erroneously interpreted by the machine.

Error recognition does not occur when a negative

feedback is given per se (Ferrez and Millan, 2008).

It is a matter of recognition by the user of a faulty

event. In UI evaluation, error recognition could be an

objective measure of subjective (mis)representations,

an objective assessment of how intuitive an HCI is.

3.3.2 Neuroimaging

ERP are “peaks” and “valleys” in averaged EEG

recordings associated with an external event. ERP

differ in their “shapes”, place on the scalp and latency

depending on the source of the stimuli or on the un-

derlying cognitive mechanism. One particular kind

of ERP has been discovered: error-related potentials

(ErrP) (Schalk et al., 2000). They are triggered when

an “error” occurs. It can be caused by something

users themselves did (response ErrP), by an incorrect

response from the command they used (interaction

ErrP), by something they witnessed from another user

(observation ErrP), and also when an explicit negative feedback is given (feedback ErrP). All of these have distinguishable features (Ferrez and Millan, 2008).
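The idea that ERP emerge from averaged recordings can be sketched as follows: segments of the signal time-locked to each event are cut out and averaged, so that activity unrelated to the event tends to cancel out while the event-related deflection remains. The signal below is synthetic, and the noise level, the shape of the deflection and the number of events are assumptions made for the example.

```python
import math
import random


def average_epochs(signal, event_indices, epoch_length):
    """Average the segments of `signal` that start at each event index."""
    epochs = [signal[i:i + epoch_length] for i in event_indices
              if i + epoch_length <= len(signal)]
    if not epochs:
        raise ValueError("no complete epoch available")
    return [sum(column) / len(epochs) for column in zip(*epochs)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
epoch_length = 40

# Hypothetical event-related deflection: a half-sine over the first 20 samples.
template = [math.sin(math.pi * t / 20) if t < 20 else 0.0
            for t in range(epoch_length)]

# Background activity modeled as Gaussian noise as large as the deflection.
event_indices = list(range(0, 5000, 50))  # 100 events
signal = [rng.gauss(0.0, 1.0) for _ in range(5000 + epoch_length)]
for start in event_indices:
    for t, value in enumerate(template):
        signal[start + t] += value

erp = average_epochs(signal, event_indices, epoch_length)
peak_index = max(range(epoch_length), key=lambda t: erp[t])
print(f"averaged peak at sample {peak_index}, amplitude {erp[peak_index]:.2f}")
```

In a single epoch the deflection is hidden in the noise, which is why single-trial detection, discussed below, is a harder problem than averaging.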

Response ErrP and interaction ErrP suit perfectly

our deﬁnition of “error recognition”. Brain signals are

elicited even when users are not consciously aware of

errors (Nieuwenhuis et al., 2001). ErrP have been used to discriminate between incorrect and correct user decisions. In (Chavarriaga and Millan, 2010) accuracies of 76% and 63%, respectively, were obtained when detecting observation ErrP in “single trial”, i.e. when detecting ErrP for each user action.

These scores are common in the literature: 79% and 84% in a task involving interaction ErrP (Ferrez and Millan, 2008). Accuracy relates to the quality of EEG devices: from 70% with an entry-level headset and non gel-based electrodes (Vi and Subramanian, 2012) up to 90% with a more expensive device (Schmidt et al., 2012). While ErrP detection does not reach 100% (chance is 50%), those scores are sufficient to improve HCI reliability (Vi and Subramanian, 2012).
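Whether a given accuracy is meaningfully above the 50% chance level depends on the number of trials. A minimal sketch, assuming independent trials, is a one-sided binomial test: the probability of obtaining at least the observed number of correct detections if the classifier were guessing. The trial counts below are hypothetical and are not taken from the cited studies.

```python
import math


def p_value_above_chance(n_correct, n_trials, chance=0.5):
    """P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    return sum(math.comb(n_trials, k) * chance ** k
               * (1 - chance) ** (n_trials - k)
               for k in range(n_correct, n_trials + 1))


# Hypothetical experiment with 100 trials and two classes.
p_70 = p_value_above_chance(70, 100)  # 70% accuracy
p_55 = p_value_above_chance(55, 100)  # 55% accuracy
print(f"70/100 correct: p = {p_70:.2e}")
print(f"55/100 correct: p = {p_55:.3f}")
```

With the same number of trials, 70% accuracy is very unlikely under guessing whereas 55% is not, which is why accuracies should be reported together with the chance level and the sample size.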

(Sobolewski et al., 2013) recorded EEG while subjects used a mouse to reach different targets. In one-fourth of the trials the hand-to-cursor mapping was randomly offset by several degrees. Users did not expect these shifts, and the analysis gives first indications that the amplitudes of elicited ErrP could relate to the degree of error. If this result is confirmed we may link error recognition to the evaluation of “intuitiveness”.

3.4 Emotions

3.4.1 Deﬁnition

Psychology and neuroscience showed that emotions

are connected to high-level reasoning; they are

tightly linked to decision-making processes (Dama-

sio, 1994). The valence/arousal model is the most

commonly used paradigm to categorize emotions (Pi-

card, 1995). In this two-dimensional representation,

valence is related to hedonic tone and varies from

“negative” to “positive” (e.g. frustrated vs pleasant);

arousal is related to bodily and mental activation

and varies from “calm” to “excited” (e.g. satisﬁed vs

happy). This model must be applied with caution to some populations. Children can hardly distinguish between different arousal levels (Posner et al., 2005).

3.4.2 Neuroimaging

Technologies with the highest temporal resolution, such as MEG or EEG, are better suited when dynamic content is involved (Vecchiato et al., 2011).

PhyCS2014-InternationalConferenceonPhysiologicalComputingSystems

218

An asymmetry within frequency bands (e.g. alpha and theta) in the frontal brain could be related to different emotions (valence), such as pleasantness/unpleasantness (Vecchiato et al., 2011). Still, EEG is not yet a reliable sensor to assess emotions. In (Chanel et al., 2011), even if EEG was better than the other studied physiological sensors over short periods of time, a 56% accuracy barely suffices for the differentiation of three emotions (chance level is 33%).
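One way to quantify such an asymmetry is to compare the band power recorded over the left and right frontal sites. The sketch below uses the difference of log powers, a formulation that is an assumption of this example rather than the method of the cited studies; the power values are hypothetical, and the sketch deliberately makes no claim about how the sign of the index maps to valence.

```python
import math


def asymmetry_index(left_power: float, right_power: float) -> float:
    """ln(right) - ln(left): zero when both sides carry the same band power."""
    if left_power <= 0 or right_power <= 0:
        raise ValueError("band powers must be positive")
    return math.log(right_power) - math.log(left_power)


# Hypothetical alpha powers over a left and a right frontal electrode.
balanced = asymmetry_index(left_power=4.0, right_power=4.0)
right_dominant = asymmetry_index(left_power=3.0, right_power=6.0)
print(f"balanced: {balanced:.2f}, right-dominant: {right_dominant:.2f}")
```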

Some papers report high classification rates. In (Liu et al., 2011) 7 emotions are categorized. The authors state an 85% accuracy for arousal and 90% for valence, using only three channels of an EEG headset which is known to be sensitive to EMG artifacts. In pure EEG studies it is important to control for facial expressions (i.e. EMG signals), because they can easily be recorded by electrodes. This is even more problematic when emotions are involved. Although we have to be cautious when assessing EEG reliability, there is nothing wrong with combining EEG and EMG (or other sensors) to improve overall performance.

Despite the lack of clear indicators of affect in

EEG, neuroimaging is nevertheless a good lead for

novel research in this topic. For example, different patterns of EEG signals have been observed depending on the sense (sight or hearing) which induces an emotion (Mühl et al., 2011). It could then be speculated that neuroimaging will one day be able to discriminate which emotion is elicited by which input modality, or which information channel leads to a positive and which to a negative user experience.

3.5 Engagement – Flow – Immersion

3.5.1 Deﬁnition

Deﬁnitions of “engagement”, “immersion” and

“ﬂow” overlap. From (Matthews et al., 2002), task

engagement is deﬁned as an “effortful striving to-

wards task goals”. Authors add that task engage-

ment increases during a demanding cognitive task

and decreases when participants perform a sustained

and monotonous vigilance task, see also (Fairclough,

2009). In (Chanel et al., 2011) “engagement” is

treated as one particular emotion, expressed as “pos-

itive excited” in the valence/arousal model. Engage-

ment is at a crossroads between several concepts stud-

ied in this paper: workload, attention and emotions.

“Flow” originates from psychological studies involving challenge and/or creativity. It is a state in which someone is totally involved in what they are doing. Flow happens when the skills of the person meet a sufficient amount of challenge. Too great a challenge brings anxiety, too much skill brings boredom, and too little of both results in apathy (Nacke and Lindley, 2009). Here again, several measures are involved. Challenge relates to workload and the resulting state to emotions. By definition, flow implies engagement.
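The relation between challenge, skill and the resulting state can be sketched as a simple four-quadrant rule. This is a deliberate simplification for illustration: the 0-to-1 scales and the single threshold are hypothetical, and the description above does not specify how challenge or skill would be quantified.

```python
def flow_state(challenge: float, skill: float, threshold: float = 0.5) -> str:
    """Map challenge and skill levels (assumed in [0, 1]) to a named state."""
    high_challenge = challenge >= threshold
    high_skill = skill >= threshold
    if high_challenge and high_skill:
        return "flow"      # skills meet a sufficient amount of challenge
    if high_challenge:
        return "anxiety"   # challenge exceeds skills
    if high_skill:
        return "boredom"   # skills exceed challenge
    return "apathy"        # too little of both


for challenge, skill in [(0.8, 0.9), (0.9, 0.2), (0.1, 0.9), (0.1, 0.2)]:
    print(challenge, skill, flow_state(challenge, skill))
```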

“Immersion” is studied mainly in virtual reality

(VR) literature. In (Slater et al., 2009) immersion

stands for the modalities hardware gives to users, how

well devices can preserve ﬁdelity in VR compared to

reality. Then the subjective feeling of being in the VR

is called “presence”. Unfortunately this distinction

between “immersion” and “presence” is less clear-cut

in other papers, see (Nacke and Lindley, 2009).

3.5.2 Neuroimaging

In neuroimaging literature (Fairclough, 2009), (George and Lécuyer, 2010) engagement assessment studies are mentioned, but they often relate only to sub-components such as workload or attention.

(Berka and Levendowski, 2007) see engagement as

a process related to information gathering, visual

scanning, and sustained attention. This study man-

aged to discriminate workload and engagement by

using EEG but the tasks involved (mental additions,

recalls) are close to what is seen elsewhere in

attention/vigilance protocols. Engagement is often

left entangled with other states in a “performance”

measure, see (Blankertz et al., 2010, sec. 3.2).

Experiments conducted during the FUGA project

showed that ﬂow could be related to fMRI measures

(Ravaja, 2009). In a pilot study, the EEG analysis of frequency bands shows different patterns across three conditions of interaction: boredom (i.e. not engaged), flow and immersion (Nacke et al., 2010). (Berta

et al., 2013) improved on this work and achieved a

66% classiﬁcation accuracy.

4 CHALLENGES

We saw how constructs relevant to HCI can be inves-

tigated with neuroimaging techniques. In this section

we will argue that two of them could beneﬁt drasti-

cally from neurotechnologies: error recognition and

attention. Besides accuracy, both could reach a new

level of description. Furthermore we will emphasize

the need for the evaluation of a whole HCI to account

for constructs of higher level, to study usability and

user experience. Finally we have to take care of EEG

devices and reliability in order to make it casual for

experimenters to use neuroimaging techniques.

ReviewoftheUseofElectroencephalographyasanEvaluationMethodforHuman-ComputerInteraction

219

4.1 Improving on Constructs

The measurement of two constructs would particularly benefit from improvements in neuroimaging.

First, as it may enable a real-time measure of how

intuitive a UI is, we would beneﬁt from a continuous

and modulated measure of error recognition. We saw

how error recognition can be indicated through ErrP

(Schalk et al., 2000). This means that it is possible to

detect when an interaction runs against users’ expec-

tations (Ferrez and Millan, 2008), i.e. when it is not

intuitive. At the moment only a binary measure and

poorly detailed data – “an ErrP is detected or not” –

is reliably obtained. Fortunately it seems possible to

measure a modulated ErrP (Sobolewski et al., 2013),

thus sensing by how much an operation in the UI has

perturbed users. If confirmed, this would enable both quantitative and qualitative data assessment.

We saw how single trial detection can be achieved

with EEG. Promising work reported ErrP detection

as the movement is occurring, within a 400ms time-

frame (Milekovic et al., 2013). At the moment this

near continuous detection uses an invasive technique.

The construct revolving around attention would be the second one to profit from neuroimaging. One point would be to distinguish clearly between vigilance and fatigue in measurements. In addition, EEG studies showed that visual artifacts in images or videos are detected by subjects beyond consciousness (Scholler et al., 2012), whether in terms of conscious perception or of attention (Mustafa et al., 2012). This would suggest that ERP could be used to anticipate how much information users are able to process, before even considering their attention level. A (highly) speculative experimental design, in which various cues are hidden within sensory modalities in order to elicit evoked potentials, would create a “human bandwidth” assessment, upstream from vigilance and attention.

4.2 Assessing New Constructs

Three constructs sit apart in our nomenclature. Both

usability and comfort are more closely related to UI

properties than to users’ state, and user experience

is entirely based on previously seen measures. Since they are the subject of many HCI papers, it is worth clarifying their meaning in this review in order to encourage their assessment with neuroimaging techniques.

4.2.1 Usability – Comfort

“Usability” groups together the notions of “ease of use” and “usefulness” (Bowman et al., 2002). It relates to speed, accuracy and error rates in task completion. The learnability of a UI is also a key point of usability. As such, good affordance of UI elements (how the perception of objects induces their proper use) will improve overall usability. Usability is impacted by the nature and constraints of the UI. For example, an input device based on body gestures is likely to be more tiring than a joystick, given that it requires more energy from the user. Usability is thus inextricably bound to users’ comfort.

Although usability could be investigated through behavioral studies or inquiries (Jankowski and Hachet, 2013), to our knowledge there is no neuroimaging study which accounts solely for this construct. Neuroimaging has been used instead to provide indicators, for example of workload through fNIRS (Hirshfield et al., 2009). In conjunction with other evaluation methods, real-time recordings from physiological sensors and neuroimaging give additional insights and help to contextualize data (Pike et al., 2012).

4.2.2 User Experience

For Mandryk et al. (2006), user experience (UX) is a shift away from usability analysis that brings emotions and entertainment into the equation. UX embeds “usability/comfort”, “emotions” and “engagement/flow/immersion”. UX is a higher level of comprehension of what users experience during interactions. Ravaja (2009) compiled various methods to measure media enjoyment. It is possible to refer to UX when studying the social aspect of interactions: e.g. GSR differs depending on whether the opponent in a sports game is played by a friend or by a computer (Mandryk et al., 2006). Assessing UX every time new technologies are used could guide the HCI community in its choices, e.g. with BCI (van de Laar et al., 2013).

4.3 Hardware – Signal Processing

Some limitations observed in EEG research are yet to be resolved to make EEG-based evaluation of HCI more operable. EEG devices, while practical compared to other neuroimaging techniques, take a long time to set up. Hence experiments can be tedious both for the experimenter and for the subject. This is why EEG or BCI experiments often involve only a few subjects, which is a problem for the reliability of the results. EEG signals contain many potential artifacts (e.g. muscular activity and electrical interference), so the quality of the device is essential. EEG signals must be calibrated, processed and interpreted carefully.
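One common first step against such artifacts is to discard epochs whose peak-to-peak amplitude is implausibly large. The sketch below illustrates this idea only; the 100 µV threshold, the toy data and the function names are assumptions chosen for the example, and real pipelines typically combine several, more elaborate criteria.

```python
def peak_to_peak(epoch):
    """Difference between the largest and smallest sample of one epoch (in microvolts)."""
    return max(epoch) - min(epoch)


def reject_artifacts(epochs, max_ptp_uv=100.0):
    """Keep only the epochs whose peak-to-peak amplitude stays within the threshold.
    The default threshold is a hypothetical value, not one taken from the review."""
    return [epoch for epoch in epochs if peak_to_peak(epoch) <= max_ptp_uv]


# Toy data: the second epoch contains a large deflection, as a muscular artifact might.
epochs = [[-10.0, 5.0, 12.0], [-20.0, 300.0, 15.0], [3.0, -4.0, 8.0]]
clean = reject_artifacts(epochs)
print(len(clean))  # 2
```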

In the last few years, new EEG devices oriented toward a larger public have appeared. Their electrodes use either no conductive solution or water as a solvent. These electrodes are faster to set up (there is no more gel to put on each one after the device has been installed) but may be less sensitive, see (Blankertz et al., 2010, sec. 2.1). Hence some companies, while transforming EEG into a mass product, bring less reliable technology to the market. Those devices often possess fewer electrodes. Without a cap, the electrodes are difficult to place in a standardized position on the scalp. Finally, they are often packaged with software development kits which hide the signal processing from the users. Constructs like attention or emotions are then claimed to be directly measured, without further justification or control of muscular artifacts, see (Heingartner, 2009). Nevertheless, while experimenters must be aware of such limits if their intent is to rely solely on brain activity, this increasing appeal of cheap EEG devices is a great opportunity to push forward the use of neuroimaging in HCI.

Improvements in signal processing, either in feature extraction or in classification, could benefit every technology. Some constructs, such as emotions, are not yet accurately assessed with EEG signals alone. When too many classes (e.g. emotions and workload levels) are assessed together, classifier performance drops; see e.g. how the “curse-of-dimensionality” relates to classifiers’ complexity (Friedman, 1997). Improvements in mathematical analysis and machine learning algorithms, as well as a better understanding of brain activity, would greatly increase the reliability of the whole system and favour every construct.
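The drop in performance with the number of classes can be illustrated with a toy simulation. The sketch below is not an EEG pipeline and does not reproduce the analysis of Friedman (1997): it assumes a single hypothetical feature whose class means share a fixed range, Gaussian noise, and a nearest-centroid classifier, all chosen only for illustration.

```python
import random
from statistics import mean


def simulate_accuracy(n_classes, n_train=20, n_test=200, noise_sd=0.3, seed=0):
    """Accuracy of a nearest-centroid classifier on one noisy feature.
    Requires n_classes >= 2. All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    # Hypothetical 1-D feature: class means evenly spread over [0, 1].
    true_means = [k / (n_classes - 1) for k in range(n_classes)]
    # "Training": estimate one centroid per class from a few noisy trials.
    centroids = [mean(rng.gauss(m, noise_sd) for _ in range(n_train))
                 for m in true_means]
    correct = 0
    total = 0
    for label, m in enumerate(true_means):
        for _ in range(n_test):
            x = rng.gauss(m, noise_sd)
            # Predict the class whose estimated centroid is closest to the sample.
            predicted = min(range(n_classes), key=lambda k: abs(x - centroids[k]))
            correct += predicted == label
            total += 1
    return correct / total


for k in (2, 4, 8):
    print(k, round(simulate_accuracy(k), 2))
```

Under these assumptions, accuracy should decrease as more classes are packed into the same feature range, which is the effect described above.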

Finally, no matter how lightweight they are, EEG and physiological sensors change the way users interact. Movements could be restrained by the devices (reducing immersion) and users could perceive the context as more stressful, potentially biasing their experience. As a result, a framework integrating physiological sensors and traditional evaluation methods has to be conceived to profit from the potential of these novel methods, while avoiding their limitations and pitfalls.

5 CONCLUSIONS

We reviewed how neuroimaging techniques could assess constructs relevant for HCI evaluation.

Among the four categories of evaluation methods, inquiries could deliver more qualitative data, while physiological sensors and neuroimaging are exocentric measures (the most “objective” measures of subjectively perceived stimuli). It is particularly interesting to combine those methods for constructs otherwise difficult to assess with exactitude, as investigated in many studies (Ravaja, 2009; Nacke and Lindley, 2009; van Erp et al., 2010; Chanel et al., 2011).

Our analysis of neuroimaging techniques focused on EEG as it promises a good trade-off between cost, time resolution and ease of installation. We argue that neurotechnologies can bring useful insights to HCI evaluation. EEG devices are not yet perfectly reliable and practical to use; hardware and software processing are still evolving. However, their cumbersomeness is partially avoided if they are used during a dedicated evaluation phase in the HCI development process, with specially enrolled users (testers).

Figure 2: One possible view of a simplified characterization of the constructs. The middle circles contain the constructs (dotted = not yet measurable with EEG). The inner circles represent the HCI components most closely related to the constructs, or on which it would be easier to act. The outer circles give a hint about what an evaluation would be useful for.

We studied workload, attention, vigilance, fatigue, error recognition, emotions, engagement, flow and immersion. Figure 2 stimulates thoughts about their relationships with HCI components. Some constructs should benefit more than the others from EEG measures: 1) workload, EEG being more sensitive to changes compared to other methods (Mathan et al., 2007); 2) attention, because event-related potentials could help to anticipate how many details users register (Mustafa et al., 2012); 3) emotions, with an arousal/valence state measured over a short time frame (Chanel et al., 2011). Error recognition could hardly be assessed precisely with anything but neuroimaging. Such a construct highlights how innovative this evaluation method is. Among the outlined challenges, a continuous and modulated error recognition would greatly help to assess usability and comfort.

Future studies should start to combine the various constructs, within a comprehensive framework which gathers every evaluation method, the advantages of one compensating for the drawbacks of the others. This should lead to an improvement of the overall user experience.

ReviewoftheUseofElectroencephalographyasanEvaluationMethodforHuman-ComputerInteraction

221

REFERENCES

Berka, C. and Levendowski, D. (2007). EEG correlates of

task engagement and mental workload in vigilance,

learning, and memory tasks. Aviat Space Environ

Med., 78(5 Suppl):B231–44.

Berta, R., Bellotti, F., De Gloria, A., Pranantha, D.,

and Schatten, C. (2013). Electroencephalogram and

Physiological Signal Analysis for Assessing Flow in

Games. IEEE Trans. Comp. Intel. and AI in Games,

5(2):164–175.

Blankertz, B., Tangermann, M., Vidaurre, C., Fazli, S., Sannelli, C., Haufe, S., Maeder, C., Ramsey, L., Sturm, I., Curio, G., and Müller, K.-R. (2010). The Berlin Brain-Computer Interface: Non-Medical Uses of BCI Technology. Front Neurosci, 4(December):198.

Boksem, M. a. S., Meijman, T. F., and Lorist, M. M. (2005).

Effects of mental fatigue on attention: an ERP study.

Cogn Brain Res, 25(1):107–16.

Bowman, D., Gabbard, J., and Hix, D. (2002). A survey of usability evaluation in virtual environments: classification and comparison of methods. Presence-Teleop. Virt., 11(4):404–424.

Brandt, T., Dichgans, J., and Koenig, E. (1973). Differential

effects of central versus peripheral vision on egocen-

tric and exocentric motion perception. Experimental

Brain Research, 491:476–491.

Brouwer, A.-M., Hogervorst, M. a., van Erp, J. B. F.,

Heffelaar, T., Zimmerman, P. H., and Oostenveld,

R. (2012). Estimating workload using EEG spectral

power and ERPs in the n-back task. J. of neur. engin.,

9(4):045008.

Bruseberg, A. and McDonagh-Philp, D. (2002). Focus

groups to support the industrial/product designer: a

review based on current literature and designers’ feed-

back. Applied ergonomics, 33(1):27–38.

Chanel, G., Rebetez, C., Bétrancourt, M., and Pun, T. (2011). Emotion assessment from physiological signals for adaptation of game difficulty. IEEE T Syst. Man Cy. A, 41(6):1052–1063.

Chavarriaga, R. and Millan, J. D. R. (2010). Learn-

ing from EEG error-related potentials in noninvasive

brain-computer interfaces. IEEE Trans. Neural Syst.

Rehabil. Eng., 18(4):381–8.

Damasio, A. R. (1994). Descartes’ error: emotion, reason,

and the human brain.

Dirican, A. C. and Göktürk, M. (2011). Psychophysiological measures of human cognitive states applied in human computer interaction. Procedia Computer Science, 3:1361–1367.

Fairclough, S. H. (2009). Fundamentals of physiological

computing. Interacting with Comp., 21(1-2):133–145.

Ferrez, P. W. and Millan, J. D. R. (2008). Error-related EEG

potentials generated during simulated brain-computer

interaction. IEEE Trans. Biomed. Eng., 55(3):923–9.

Fitts, P. M. (1954). The information capacity of the hu-

man motor system in controlling the amplitude of

movement. J. of experimental psychology. General,

47(6):381–391.

Friedman, J. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data mining and knowledge discovery, 77:55–77.

George, L. and Lécuyer, A. (2010). An overview of research on ‘passive’ brain-computer interfaces for implicit human-computer interaction. In ICABB 2010.

George, L., Lotte, F., Abad, R. V., and Lécuyer, A. (2011). Using Scalp Electrical Biosignals to Control an Object by Concentration and Relaxation Tasks: Design and Evaluation. In IEEE EMBS 2011.

Grimes, D., Tan, D., and Hudson, S. (2008). Feasibility and

pragmatics of classifying working memory load with

an electroencephalograph. CHI ’08, page 835.

Hamadicharef, B. (2010). BCI literature - a bibliometric

study. In ISSPA ’10, volume 1, pages 626–629. IEEE.

Hart, S. and Staveland, L. (1988). Development of NASA-

TLX (Task Load Index): Results of empirical and the-

oretical research. In Human mental workload.

Heingartner, D. (2009). Mental block. IEEE Spectrum,

46(1):42–43.

Hirshfield, L., Chauncey, K., and Gulotta, R. (2009). Combining electroencephalograph and functional near infrared spectroscopy to explore users’ mental workload. FAC ’09.

Jankowski, J. and Hachet, M. (2013). A Survey of Interac-

tion Techniques for Interactive 3D Environments. In

Eurographics ’13.

Just, M. A., Carpenter, P. a., and Miyake, A. (2003).

Neuroindices of cognitive workload: Neuroimaging,

pupillometric and event-related potential studies of

brain work. Theoretical Issues in Ergonomics Science,

4(1-2):56–88.

Kivikangas, J. M., Ekman, I., Chanel, G., Järvelä, S., Cowley, B., Henttonen, P., and Ravaja, N. (2010). Review on psychophysiological methods in game research. Proc. of 1st Nordic DiGRA.

Klimesch, W., Doppelmayr, M., Russegger, H., Pachinger,

T., and Schwaiger, J. (1998). Induced alpha band

power changes in the human EEG and attention. Neu-

roscience letters, 244(2):73–6.

Kohlmorgen, J., Dornhege, G., Braun, M., Blankertz, B., Müller, K.-R., Curio, G., Hagemann, K., Bruns, A., Schrauf, M., and Kincses, W. (2007). Improving human performance in a real operating environment through real-time mental workload detection. In Toward Brain-Computer Interfacing.

Laurent, F., Valderrama, M., Besserve, M., Guillard, M.,

Lachaux, J.-P., Martinerie, J., and Florence, G. (2013).

Multimodal information improves the rapid detection

of mental fatigue. Biomed. Sig. Proc. Contr., pages

1–9.

Liu, Y., Sourina, O., and Nguyen, M. (2011). Real-time

EEG-based emotion recognition and its applications.

In Trans. comp. science, pages 256–277. Springer.

Loggia, M. L., Juneau, M., and Bushnell, M. C. (2011).

Autonomic responses to heat pain: Heart rate, skin

conductance, and their relation to verbal ratings and

stimulus intensity. Pain, 152(3):592–8.

Lorist, M. M., Klein, M., Nieuwenhuis, S., De Jong, R., Mulder, G., and Meijman, T. F. (2000). Mental fatigue and task control: planning and preparation. Psychophysiology, 37(5):614–25.

Mandryk, R., Inkpen, K., and Calvert, T. (2006). Using

psychophysiological techniques to measure user ex-

perience with entertainment technologies. Behav. &

Inf. Tech.

Mathan, S., Whitlow, S., and Feyereisen, T. (2007). Work-

Sense: Exploring the Feasibility of Human Factors

Assessment using Electrophysiological Sensors. In

4th IACS.

Matthews, G., Campbell, S. E., Falconer, S., Joyner, L. a.,

Huggins, J., Gilliland, K., Grier, R., and Warm, J. S.

(2002). Fundamental dimensions of subjective state in

performance settings: Task engagement, distress, and

worry. Emotion, 2(4):315–340.

Milekovic, T., Ball, T., Schulze-Bonhage, A., Aertsen, A.,

and Mehring, C. (2013). Detection of error related

neuronal responses recorded by electrocorticography

in humans during continuous movements. PloS one,

8(2).

Mühl, C., Brouwer, A., van Wouwe, N., van den Broek, E. L., Nijboer, F., and Heylen, D. (2011). Modality-specific Affective Responses and their Implications for Affective BCI. In 5th Int. BCI Conf., pages 120–123.

Mustafa, M., Lindemann, L., and Magnor, M. (2012). EEG

analysis of implicit human visual perception. CHI ’12,

page 513.

Nacke, L., Ambinder, M., Canossa, A., Mandryk, R., and

Stach, T. (2009). Game Metrics and Biometrics: The

Future of Player Experience Research. Future Play.

Nacke, L. E. and Lindley, C. A. (2009). Affective ludology, flow and immersion in a first-person shooter: Measurement of player experience. J. Can. Game Stud. Ass., 3(5).

Nacke, L. E., Stellmach, S., and Lindley, C. A. (2010).

Electroencephalographic Assessment of Player Expe-

rience: A Pilot Study in Affective Ludology. SAG,

42(5):632–655.

Nieuwenhuis, S., Ridderinkhof, K. R., Blom, J., Band,

G. P., and Kok, A. (2001). Error-related brain po-

tentials are differentially related to awareness of re-

sponse errors: evidence from an antisaccade task. Psy-

chophysiology, 38(5):752–60.

Nisbett, R. E. and Wilson, T. D. (1977). Telling more than

we can know: Verbal reports on mental processes.

Psychological Review, 84(3):231–260.

Ogolla, J. A. (2011). Usability Evaluation: Tasks Susceptible to Concurrent Think-Aloud Protocol. Master thesis, Linköping University.

Oken, B. S., Salinsky, M. C., and Elsas, S. M. (2006).

Vigilance, alertness, or sustained attention: physio-

logical basis and measurement. Clin Neurophysiol,

117(9):1885–901.

Parasuraman, R. (2013). Neuroergonomics: Brain-Inspired

Cognitive engineering. In The Oxford Handbook Of

Cog. Engin., page 672. Oxford University Press, USA.

Partala, T. and Surakka, V. (2003). Pupil size variation as an

indication of affective processing. International Jour-

nal of Human-Computer Studies, 59(1-2):185–198.

Picard, R. W. (1995). Affective computing. Technical Re-

port 321, MIT Media Laboratory.

Pike, M., Wilson, M., Divoli, A., and Medelyan, A. (2012).

CUES: Cognitive Usability Evaluation System. Euro-

HCIR ’12, pages 1–4.

Posner, J., Russell, J. a., and Peterson, B. S. (2005). The

circumplex model of affect: an integrative approach

to affective neuroscience, cognitive development, and

psychopathology. Dev. Psychopathol., 17(3):715–34.

Ravaja, N. (2009). FUGA: The Fun of Gaming: Measur-

ing the Human Experience of Media Enjoyment. Final

Activity Report. Technical report.

Saavedra, C. and Bougrain, L. (2012). Processing Stages

of Visual Stimuli and Event-Related Potentials. The

NeuroComp/KEOpS’12 workshop, 2:1–5.

Schalk, G., Wolpaw, J. R., McFarland, D. J., and

Pfurtscheller, G. (2000). EEG-based communication:

presence of an error potential. Clin. Neurophysiol.,

111(12):2138–44.

Schmidt, N. M., Blankertz, B., and Treder, M. S. (2012).

Online detection of error-related potentials boosts the

performance of mental typewriters. BMC neurosc.,

13(1):19.

Scholler, S., Bosse, S., Treder, M. S., Blankertz, B., Curio, G., Müller, K.-R., and Wiegand, T. (2012). Toward a direct measure of video quality perception using EEG. IEEE Trans. Image Process., 21(5):2619–29.

Shaw, J. C. (2003). The brain’s alpha rhythms and the mind.

Elsevier.

Slater, M., Lotto, B., Arnold, M., and Sanchez-Vives, M. (2009). How we experience immersive virtual environments: the concept of presence and its measurement. Anuario de psicología, 40(2773):193–210.

Sobolewski, A., Chavarriaga, R., and Millán, J. (2013). Error Processing of Self-paced Movements. In TOBI Workshop IV, pages 137–138.

Trachel, R., Brochier, T., and Clerc, M. (2013). Enhanc-

ing visuospatial attention performance with brain-

computer interfaces. CHI ’13, page 1245.

van de Laar, B., Gürkök, H., Bos, D. P.-O., Nijboer, F., and Nijholt, A. (2013). Brain-Computer Interfaces and User Experience Evaluation. In Towards Practical Brain-Computer Interfaces, pages 223–237. Springer.

van Erp, J. B. F., Veltman, H., and Grootjen, M. (2010).

Brain-Based Indices for User System Symbiosis. In

Brain-Computer Interfaces, pages 201–219. Springer,

London.

Vecchiato, G., Astolfi, L., De Vico Fallani, F., Toppi, J., Aloise, F., Bez, F., Wei, D., Kong, W., Dai, J., Cincotti, F., Mattia, D., and Babiloni, F. (2011). On the use of EEG or MEG brain imaging tools in neuromarketing research. Comput Intell Neurosci, 2011:643489.

Vi, C. and Subramanian, S. (2012). Detecting error-related

negativity for interaction design. CHI ’12, page 493.

Weber, J. (2007). Think Aloud Best Practices Study.

Zander, T. O. and Kothe, C. (2011). Towards passive brain-

computer interfaces: applying brain-computer inter-

face technology to human-machine systems in gen-

eral. J. Neural. Eng, 8(2):025005.

ReviewoftheUseofElectroencephalographyasanEvaluationMethodforHuman-ComputerInteraction

223