4.2 Differences between Experiment
Settings and Questionnaire Types
4.2.1 Comparison of Results
The effect of psychophysiological classification
accuracy on user experience in the Snake game has
now been evaluated in three ways: Novak et al.
(2014) studied it online with electronic
questionnaires, while we examined it in a lab setting
with both electronic and paper-and-pencil questionnaires.
All three approaches found significant
correlations between classification accuracy and
satisfaction with the difficulty adaptation algorithm.
However, the correlation was weakest in the online
study of Novak et al. (ρ = 0.43), intermediate in our
electronic questionnaires (ρ = 0.58), and strongest in
our paper-and-pencil questionnaires (ρ = 0.74).
Interestingly, the correlation between classification
accuracy and in-game fun was not significant either
in the previous Novak et al. study (ρ = 0.10) or in
our own group that used electronic questionnaires (ρ
= 0.21). However, it was highly significant for our
paper-and-pencil questionnaire group (ρ = 0.53).
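As an illustration of the type of analysis reported
above, the following is a minimal Python sketch of a
Spearman rank correlation computed with SciPy; the
data arrays are hypothetical placeholders rather than
our study's data, and the paper does not prescribe
any particular software.

```python
# Minimal sketch of a Spearman rank-correlation analysis.
# The data below are hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
accuracy = rng.uniform(0.5, 1.0, 25)        # per-participant classification accuracy
fun = accuracy + rng.normal(0.0, 0.15, 25)  # in-game fun ratings from a visual analog scale

rho, p = spearmanr(accuracy, fun)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```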
The link between classification accuracy and
user experience in our laboratory study was stronger
than in the online setting of the previous Novak et al.
(2014) study. Additionally, though the results are not
completely reliable due to the unbalanced group
sizes, the paper-and-pencil questionnaires
indicated a stronger relationship between
classification accuracy and in-game fun than the
electronic questionnaires did. This is somewhat
surprising, as the measured relationship between
classification accuracy and satisfaction with the
difficulty adaptation algorithm was similar for both
types of questionnaires (Figure 2).
Nonetheless, assuming that the results of the paper-
and-pencil questionnaires are the more valid ones (as
discussed in the next section), they clearly show that
increasing the accuracy of psychophysiological
classification increases the amount of fun that
players have in a physiological game. This is
contrary to the surprising result of the previous
Novak et al. (2014) study, which found only a minor
effect of classification accuracy on in-game fun and
thus asked whether increasing classification
accuracy is even worthwhile. Our study instead
shows that classification accuracy has a strong effect
on in-game fun, but that this effect can be difficult
to measure properly. We must thus ask ourselves: what
is the reason for the difference between online and
lab settings, and between electronic and paper-and-
pencil questionnaires?
4.2.2 Possible Explanations
We believe that our lab setting produced better
results than the previous online setting of Novak et
al. (2014) due to the much lower dropout rate (3.5%
in our study vs. 40% in the previous study). The
authors of the previous online study acknowledged
that their high dropout rate likely skewed the results,
as participants who did not enjoy the game simply quit
playing rather than filling out the final questionnaires.
This would explain the generally better results of our
questionnaires, as a more representative sample of
user experience was obtained. A second
possible explanation is that participants in the online
study may not have paid attention to the instructions,
a problem common in Web-based research
(Oppenheimer et al., 2009).
The difference between paper-and-pencil
questionnaires and electronic questionnaires in our
study is more surprising. Having examined the
individual results in detail, we believe that it is due
to a weakness in the electronic visual analog scales.
Specifically, the slider of the electronic visual
analog scale for in-game fun is initially set to the
exact middle value, and the participant can adjust the
answer by moving the slider. Ten of the 80 participants
did not move the slider at all, and 17 moved it to
the very far left or the very far right. Conversely, in
the paper-and-pencil version, only 1 of the 25
participants placed the mark at approximately the
middle of the scale, and only 2 placed the mark at
approximately the far left. We performed a follow-
up qualitative examination of the results of the
Novak et al. (2014) study and found a similar trend
among the 261 participants there: many either left
the slider at the default value or dragged it to one
extreme. This issue can be at least partially avoided
by not providing a default starting position for the
electronic visual analog scale and by discouraging
participants from selecting extreme values.
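Responses of this kind can also be flagged
automatically when analyzing electronic visual
analog scale data. The sketch below is illustrative
only; the 0-100 scale range, midpoint default, and
tolerance are assumptions rather than the settings of
our questionnaire software.

```python
# Illustrative sketch: count electronic VAS responses left at the default
# midpoint or dragged to either extreme. The 0-100 range, midpoint default,
# and tolerance are assumptions, not the settings of our questionnaire.
def flag_suspect_responses(values, lo=0.0, hi=100.0, default=50.0, tol=0.5):
    at_default = sum(abs(v - default) <= tol for v in values)
    at_extremes = sum(v <= lo + tol or v >= hi - tol for v in values)
    return at_default, at_extremes

responses = [50.0, 50.0, 0.0, 100.0, 63.5, 27.0, 50.0, 99.8]
print(flag_suspect_responses(responses))  # -> (3, 3)
```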
We also acknowledge that the observed
difference between electronic and paper-and-pencil
questionnaires requires deeper experimental
investigation. We had originally planned to use only
electronic questionnaires, but later added paper-and-
pencil questionnaires so that we could check the
effect of questionnaire type. However, the two
participant groups have very different sizes, and
more participants should be tested with paper-and-
pencil questionnaires to ensure that the ‘better’ result
of such questionnaires is not simply a statistical
fluke.
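One way to quantify this concern is to compare the
two fun correlations directly. The sketch below
applies the standard Fisher z test for independent
correlations to the reported coefficients (ρ = 0.53
with n = 25 vs. ρ = 0.21 with n = 80); the Fisher
transform is only approximate for Spearman
coefficients, so the result should be read as
indicative rather than definitive.

```python
# Rough check: does the paper-and-pencil fun correlation (rho = 0.53, n = 25)
# differ significantly from the electronic one (rho = 0.21, n = 80)?
# The Fisher z transform is only approximate for Spearman coefficients.
from math import atanh, sqrt
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    z = (atanh(r1) - atanh(r2)) / sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return z, 2.0 * norm.sf(abs(z))  # two-tailed p-value

z, p = compare_correlations(0.53, 25, 0.21, 80)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 1.56, p = 0.12: inconclusive at these sample sizes
```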