However, self-report may not always be feasible and is often considered unreliable (Afzal and Robinson, 2009; Kapoor et al., 2007). Consequently, and in particular since the shift towards non-acted data, observer ratings are becoming more common for establishing the ground truth (Kleinsmith and Bianchi-Berthouze, 2013). This method may be particularly relevant when the application into which the recognition software will be integrated is intended to act as a human interaction partner.
When establishing the ground truth using ob-
servers, what labeling model should be used? Two
options are pairwise preference ratings and rating
on multiple scales. Many studies in the Affective
Computing field employ a forced-choice design, e.g.,
(Savva and Bianchi-Berthouze, 2012) and (Klein-
smith et al., 2011). In this design, observers are pre-
sented with a list of choices and are forced to choose
from that list. An advantage of a forced-choice design over a free-form design is that it forces an absolute match and eliminates the possibility of observers providing non-emotion labels, a known issue with free-form designs (Russell, 1994). However, forcing an absolute match is also a disadvantage, as the list may not include all options that observers consider applicable. Similarly, the concurrence of more than one distinct emotional state cannot be captured in a forced-choice design. The intensity of a perceived emotion is also lost, which may be particularly problematic when dealing with subtle emotional expressions.
Pairwise preference rating is used in the artificial intelligence (Fürnkranz and Hüllermeier, 2005; Doyle, 2004) and machine learning (Yannakakis, 2009) fields and may be considered an attempt to overcome the limitations of a forced-choice design. In pairwise preference rating, observers are presented with pairs of stimuli and asked to choose which stimulus best represents a particular label. This process is repeated for all possible stimulus pairs. Because it does not require the observers to determine an absolute match, pairwise preference rating may help to reduce the variability that exists between observers when a forced-choice design is used (Kleinsmith and Bianchi-Berthouze, 2013).
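Although the judgments themselves come from human observers, the way such pairwise preferences can be aggregated into a per-label ranking can be sketched in a few lines of Python. The stimulus IDs and the `preferred` stand-in below are illustrative assumptions, not part of the original study:

```python
from itertools import combinations
from collections import Counter

# Hypothetical stimulus IDs for one label (e.g., "frustrated").
stimuli = ["P1", "P2", "P3", "P4"]

def preferred(a, b):
    # Stand-in for a human judgment: here we simply prefer the
    # lexicographically smaller ID so the sketch is runnable.
    return min(a, b)

wins = Counter()
for a, b in combinations(stimuli, 2):   # all n*(n-1)/2 pairs
    wins[preferred(a, b)] += 1

# Rank stimuli by how often they were preferred for the label.
ranking = sorted(stimuli, key=lambda s: wins[s], reverse=True)
print(ranking)  # ['P1', 'P2', 'P3', 'P4']
```

Counting wins over all pairs yields a relative ordering of stimuli per label without ever asking observers for an absolute match.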
Obtaining ratings for the same stimulus on multiple scales representing discrete emotion labels has also been used in other emotion recognition research (Liscombe et al., 2003). Observers were asked to rate speech tokens on separate scales for 10 discrete emotions. The authors conclude that rating stimuli on multiple emotion scales revealed that their stimuli expressed several different emotions at the same time, resulting in a better, more complete representation of emotion.
3 METHOD
3.1 Participants
For this initial study, 12 participants were recruited via mailing lists within the university community. They were aware of the aims of the study, and no compensation was given.
3.2 Stimuli
As stimuli we use images from the UCLIC Database
of Affective Postures and Body Movements (Klein-
smith et al., 2006; Kleinsmith et al., 2011); these are
readily available online. The database consists of sep-
arate corpora of acted and non-acted whole body pos-
tures, of which we use the latter. The non-acted col-
lection consists of 105 postures that were obtained
from people playing physically active video games
and have been rated for four affective labels: con-
centrating, defeated, frustrated, and triumphant. The
postures are modeled on an abstract avatar seen from
a frontal perspective in front of neutral grey back-
ground.
In order to prevent study fatigue, and because the study was conducted online (as was the original labeling for the database used), the number of ratings required of participants was limited. For this reason, we use only a subset of 10 postures from the UCLIC collection, chosen at random. Pairwise preference rating means that for n postures, n(n − 1)/2 pairs of postures have to be rated; for the entire corpus of 105 postures this would result in 5,460 pairs, whereas reducing the number of stimuli to 10 results in 45 pairs.
Figure 1 shows one of the 45 screens participants see in the preference condition. In a further attempt to keep the number of ratings at a manageable level, in the scale condition we let our participants rate a posture for all four labels on one screen, as can be seen in Figure 2.
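The pair counts above can be verified with a short Python check; the posture IDs below are placeholders, not the actual UCLIC identifiers:

```python
from itertools import combinations

def n_pairs(n):
    """Number of pairwise comparisons for n stimuli: n*(n-1)/2."""
    return n * (n - 1) // 2

print(n_pairs(105))  # 5460 pairs for the full corpus
print(n_pairs(10))   # 45 pairs for the chosen subset

# The 45 screens of the preference condition correspond to all
# unordered pairs of the 10 selected postures.
postures = list(range(1, 11))           # placeholder posture IDs
pairs = list(combinations(postures, 2))
assert len(pairs) == n_pairs(10)
```

The quadratic growth in the number of pairs is what makes rating the full 105-posture corpus impractical for an online study.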
3.3 Procedure
Upon clicking the link in the invitation email, participants reached the welcome screen, where they received information about the study and were told that they would rate postures on 55 screens. The order of the rating conditions was assigned randomly.
In the preference condition, participants saw two postures next to each other. Below the posture images, they were asked: "Which posture looks more