Why you should Empirically Evaluate your AI Tool
From SPOSH to yaPOSH
Jakub Gemrot, Martin Černý and Cyril Brom
Faculty of Mathematics and Physics, Charles University in Prague, Malostranské náměstí 25, Prague 1, Czech Republic
Keywords:
Behavior Design, Computer Games, Tool Productivity, User-study, Agent Languages.
Abstract:
The autonomous agents community has been developing specific agent-oriented programming languages for
more than two decades. Some of the languages have been considered by academia as possible tools for
developing artificial intelligence (AI) for non-player characters in computer games. However, as most of
the new AI languages developed within the agent community do not reach
production quality, they are seldom adopted by the games industry. As our experience has shown, it is not only
the actual language that matters. The toolchain supporting the language and its integration (or lack thereof)
with a development environment can make or break the success of the language in practical applications. In
this paper, we describe our methodology for evaluating AI languages and associated tools in practice based
on controlled experiments with programmers and/or game designers. The methodology is demonstrated on
our development and evaluation of the SPOSH and yaPOSH high-level agent behavior languages. We show that
incomplete development support may prevent the tool from giving any benefit to developers at all. We also
present our experience from transferring knowledge gained during yaPOSH development to actual AI design
for an upcoming AAA game.
1 INTRODUCTION
Modern computer games are becoming more and
more complex, and game designers seek ways to
bring rich worlds to their audience. This is especially
true for role-playing games (RPGs), which are advertised
as “large open-worlds” players can explore and inter-
act with. However, the production of such worlds re-
quires a vast amount of authoring time; time that is
spent not only by environment designers and anima-
tors but also by non-player character (NPC) design-
ers. NPC designers are responsible for making those
worlds alive by populating them with a wide variety
of NPCs. We see NPC behavior production as one
of the bottlenecks of the whole game production pro-
cess; the large effort associated with developing com-
plex NPC behavior means that most NPCs demon-
strate only very limited and schematic behavior. Thus
players often do not perceive the NPCs as lively in-
habitants of the virtual world and are instead left with
an impression of indifferent soulless puppets.
Although the demand for better NPC behavior is
increasing, the actual NPC behavior in games im-
proves only slowly. In our view, an important rea-
son for that is the lack of proper NPC behavior
specification languages and development tools (fur-
ther referred to uniformly as tools). Over the years,
academia has proposed many technologies that could
be, in theory, applicable to this problem. But
the industry is reluctant to adopt those solutions, as
they are not accompanied by case studies that would
shed light on their impact on the game production pro-
cess. We report on our experience showing that it is hard to
prove that novel or existing AI tools related to NPC
behavior production are of benefit to the industry.
We also present a methodology for how to (try to) do so.
The methodology is based on controlled experiments
we have conducted over the past four years, in which we
studied the productivity of the custom NPC behavior
creation tool SPOSH (Bryson, 2001) and its improved
version yaPOSH. We briefly present findings of the
studies as well.
The main goal of the paper is to introduce the
community to practical considerations necessary for
proper empirical evaluation of AI tools and to stress
the importance of evaluation for practical applicabil-
ity. It has been noted that even well-established academic tools that seem intuitively beneficial for AI development may perform well below a plain programming language (Píbil et al., 2012), although that evaluation was non-systematic and purely qualitative.
While the paper is phrased in game development
vocabulary, we believe that most of our results are
transferable to other areas of possible industrial AI
applications.
The rest of the paper is organized as follows:
First we discuss various non-obvious requirements
the game industry imposes on AI tools (Section 2) and
deal with the notion of productivity (Section 3), then
we discuss literature related to tool evaluation (Sec-
tion 4) leading to an overview of our methodology in
Section 5. Section 6 describes user studies we have
conducted and Section 7 deals with several important
details for successful application of the methodology.
Finally, Section 8 discusses the conclusions we have
drawn from our experience.
2 NEEDS & FEARS OF THE
GAME INDUSTRY
Recently we have had the chance to participate in an
AI design process for an upcoming AAA game title.
We have successfully transferred many of our ideas
on the NPC behavior design to the game industry;
ideas that are based on our seven years of experience
teaching NPC behavior design at our university
and on the studies we have conducted. In this section
we discuss non-obvious requirements the industry im-
poses on AI tools.
Our view of the industry requirements is based
mainly on our tight cooperation with a newly founded
game studio whose developers have a history of
making AAA game titles (Operation Flashpoint,
Mafia 1 & 2). Although we work with
a single game studio, discussions with industry rep-
resentatives at conferences and experiences of other
AI researchers indicate that the requirements are very
similar in most of the industry.
Alex J. Champandard pointed out that the industry
yields to the argument “You will ship your game three
months earlier.” (Champandard, 2010). Applying this
to our perspective, the industry simply needs to produce
reliable NPC behaviors quickly. However, as NPC
behaviors can be scripted and some tools for behavior
specification already exist, the industry is skeptical to-
wards adopting new ideas and related tools.
The first thing to note is that every AI tool related
to game development has to be usable under dif-
ferent use cases, each with its own pur-
pose: 1) reading/understanding how a concrete behav-
ior works at design time, 2) debugging exist-
ing behaviors at runtime, 3) extending existing behav-
iors and finally 4) creating new ones. Those four sit-
uations will be referred to as the “game development
use cases” (GDUCs).
Secondly, and on a related note, a game develop-
ment studio does not need only good tools, but also a
well thought-out workflow that integrates design, pro-
totyping, development and quality control. If a tool
cannot cope with such a workflow, it is not likely to
be adopted.
Thirdly, it is strongly desirable to redirect as much
development effort as possible to a less qualified
workforce. This is also true for NPC behavior design,
as it is impossible to rely solely on senior programmers for
behavior production: they are a rather scarce
and expensive resource. Thus a tool (as well as the
language) should be usable by less experienced pro-
grammers or game designers as well.
Generally speaking, development of a computer
game is to a large extent a software engineering task.
The comprehensibility, maintainability and reusabil-
ity of the code and data produced play an impor-
tant role and adoption of complicated algorithms is
avoided if it does not have a great impact¹. Simple so-
lutions are preferred as they facilitate debugging and
keep the system predictable.
To conclude, a game studio will typically fear
that a new tool will not fit into the game production
seamlessly, as academic tools are not used in the
whole context of game production, and thus the
negatives might well outweigh the positives.
3 MEASURING PRODUCTIVITY
If the industry is to adopt anything, it needs convincing evidence. First and most importantly, there must exist a fully working prototype of the technology that demonstrates to the industry that they need it. This serves as an opening for a discussion about the time and memory impact of the tool at game runtime. If those
are found acceptable, the discussion will turn to the
productivity of the tool. Unless the
tool allows the designer to achieve something that was
previously unachievable, one has to prove that accom-
plishing a task using the tool is easier compared to not
using any special tool at all or using a different tool.
The term productivity should not be confused
with usability. While usability and productivity are
strongly related, they are not the same. A tool may
prove to be very productive even though it has multiple usability issues. On the other hand, a tool may be perfectly usable yet bring no practical benefit to the user, because the same task can be easily accomplished with a different workflow. However, an increase in usability almost certainly results in an increase in productivity, as it allows the user to work faster and with fewer obstacles.

¹ For instance, the first author of this paper was once present at a discussion among level designers about how high-dynamic-range lighting is great yet extremely hard to tweak: there are a lot of intertwined parameters, which slows the whole level creation process down.
Formally proving that one tool is more productive
than another in the context of NPC behaviors
is hard, and rigorous research faces serious method-
ological as well as practical issues. In contrast to the
tool internals, which can often be evaluated by run-
ning the tool on a large set of test cases, it is not clear
how to evaluate the usability or the productivity of
the tool itself. The way users are working with the
tool is bound strongly to the tool internals and thus
any evaluation effectively evaluates the tool itself, its
usability and productivity at the same time. Also the
evaluation is strongly influenced by the user’s ability
to understand and apply the concepts the tool is built
upon and its workflow.
4 RELATED WORK:
EVALUATING PRODUCTIVITY
AND USABILITY OF TOOLS
To our knowledge, the only published evaluation of
productivity of an AI tool is a purely qualitative (one
subject) case study of using an agent-based language
in a non-trivial setting (Píbil et al., 2012). The study
notes that the language is in many aspects inferior to
plain Java. The ScriptEase AI tool (Cutumisu et al.,
2007) was evaluated by letting students design their
own interactive story in the Neverwinter Nights com-
puter game. The study, however, measured only the
number of various language constructs the students
used and did not focus on usability or productivity,
nor was ScriptEase compared to the default tool
for the respective environment.
Some work has been done in determining the us-
ability of specific general purpose programming lan-
guage constructs (Sadowski and Kurniawan, 2011) or
to evaluate how quickly novice programmers learn
language constructs (Stefik et al., 2011). None of the
works known to us deals with evaluating the usability or
productivity of development tools for a language,
nor are we aware of a work that would quantitatively
evaluate an agent-oriented language. Since our focus
is on the tools as well as the specific language, we
cannot directly reuse any of the methodologies of the
above papers.
Insight into the evaluation problem can
be gained from academic research on human-
computer interaction. A classical paper by Jeffries
et al. (Jeffries et al., 1991) compares four different
techniques for the usability evaluation of software.
They conclude that the best performance is achieved
by letting a group of humans test the application in
a realistic setting. Among user groups, user-interface
(UI) experts following the methodology called heuris-
tic evaluation are better at testing UI than regular soft-
ware beta testers. Two other methods were also tested:
validating the UI against expert-designed guide-
lines, and cognitive walkthrough, which is a structured
analysis of the program by developers. However,
techniques not involving testing with humans fared
much worse at determining usability deficiencies.
Further studies confirmed the findings of Jeffries
et al. (Karat et al., 1992; Nielsen and Phillips, 1993).
A good overview paper on usability testing methods
is (Hollingsed and Novick, 2007). This overview sug-
gests that the most cost-effective results are obtained
by combining internal evaluation within the develop-
ment team early in the design process with user stud-
ies later on.
Although the aforementioned research is aimed at
usability, we think that most of its results generalize
to productivity as well, especially the necessity to in-
volve actual human users working on realistic tasks in
the evaluation process.
5 METHODOLOGY
Based on insights from the related work and the ex-
perience from our studies we propose a general re-
search methodology for comparative controlled ex-
periments that should validate both usability and pro-
ductivity of AI design tools. The proposed methodology
is certainly neither complete nor bullet-proof and is
open to comments. We think that the relation be-
tween the practice of comparative controlled experi-
ments for AI design tools and the respective method-
ology is similar to the chicken-and-egg problem. As there is
no well-thought-out methodology to follow, it is hard
to conduct experiments yielding fruitful data about
a tool's productivity within the whole context of
game development; as there is a lack of experiments
conducted and presented in papers, it is hard to formu-
late any methodology whatsoever. Although some of
the individual steps might seem obvious, doing every-
thing correctly without any reference methodology is
not trivial as we have learned through trial and error.
For more experienced experimenters, our methodol-
ogy could serve as a kind of a checklist for good ex-
periment design.
WhyyoushouldEmpiricallyEvaluateyourAITool-FromSPOSHtoyaPOSH
463
5.1 Methodology Steps
1. Choose the game AI problem you plan to solve.
E.g., behavior specification of game NPCs in the
context of first-person shooters (FPS).
2. Think about the tasks a team of game developers
will face when solving the problem with existing
tools (e.g. the GDUCs introduced in Sec-
tion 2).
3. Find a technology that is promising for solving the
problem, e.g. SPOSH (Bryson, 2001), which was
successfully used in robotics and shown to be us-
able for NPCs in FPS games.
4. Implement the tool internals and create a simple
UI that should support not only the solving of the
problem but (ideally) also the spectrum of tasks
from the Step 2.
5. Work with the interface yourself or within the de-
velopment team to determine the most obvious
deficiencies and fix them. Think about and stress-test (at
least) the different GDUCs that your tool will un-
dergo during game development: a) read-
ing/understanding of the data it produces (e.g.,
is your colleague able to quickly grasp what you
have done/created with the tool?), b) debugging
existing data (e.g., is it easy to diagnose and cor-
rect an error you have made?), c) extending the
data (e.g., is your colleague able to continue with
a half-done work?) and d) creating the data. Some
of the structured approaches mentioned in the re-
lated work might prove beneficial. Long-term
student projects or bachelor theses using the tool
have also shown to be useful for initial evaluation.
6. Design a realistic set of tasks that your tool should
help in solving but limit their complexity to allow
for controlled experiments in a lab. Ideally, users’
solutions should yield qualitative data so the so-
lutions can be compared. The set of tasks should
sample tasks from the Step 2.
7. Do a “pilot study”: solve the task set by yourself
and entice a few colleagues to solve it as well in
order to predict the time your users will need to
solve the task set and to “debug” the experimental
setting.
8. Gather two groups of users sampled randomly to
solve the task set: one will use your tool, the other
will use a baseline tool or no specific tool at all
(e.g., only a standard IDE for the underlying program-
ming language the NPCs are implemented in).
Since involving actual game development teams is
usually not feasible, we did our evaluations with
students of AI courses. We think evaluating on
students is reasonable, because a) many of our stu-
dents are of high skill, already working part-time
as junior programmers, b) behavior developers in
games usually are not programming experts and c)
it is accepted practice in other branches of science
to work with student subjects (e. g. many psycho-
logical experiments are performed with students).
9. Let all users pass a preliminary test that confirms
they have a sufficient experience using any tool,
platform or game engine they will need to utilize
for your tasks.
10. Let each user solve the task set. Measure their
time required to solve respective tasks and quality
of the solution. Gather their feedback: what was
easy and/or difficult to achieve? What features of
the tool were used? Carefully distinguish between features
the tool offers at game runtime and those offered
at design time.
11. Optionally prepare a second set of tasks and swap
groups to mitigate sampling error (e.g., different
average programming experience between user
groups).
12. Analyze the results and see which group per-
formed better and what obstacles users faced. Be
careful to distinguish between artifacts of the ex-
perimental setup (e.g. users became fatigued from
the overly long study, computers used for the
study were misconfigured), artifacts of the as-
signed task (e.g. users misunderstood the assign-
ment, users spent too much time figuring out an
unobvious trick not related to your tool) and the
actual effect of the tool.
13. Analyze source code / data that solves the task set
to gain insight which features of your tool were
used the most, which were avoided and which
were misused. Perform post-hoc (but as soon as
possible) interviews with users that misused your
tool to gain additional information.
14. If the tool did not improve user performance suffi-
ciently, alter the tool and its user interface to miti-
gate the difficulties that were faced most often. If
necessary, also change the experimental setup so
that it really tests user performance with as little
noise as possible.
15. Iterate back to Step 4. Having the same group of
users in the next experimental run can bring valuable
insights about the difference between the two ver-
sions of the software, at the cost of not gathering
data about the performance of novice users.
ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence
464
6 THREE USER-STUDIES: FROM
SPOSH TO YAPOSH
We now present three user-studies that were con-
ducted to gain insight into the productivity of the
academic tool SPOSH in the context of NPC behavior
design. The first two studies were done with SPOSH
and the last study was done with its improved version
yaPOSH. The presented methodology was formed
based on our experience gained from the studies.
SPOSH is a dialect of POSH action selection de-
veloped by Bryson in late 1990s (Bryson, 2001).
POSH specifies clear action selection semantics cap-
turing NPC behavior in a tree. Roughly speaking,
every edge in the tree is annotated with a sense (or
multiple senses), i.e., a condition that must hold in the
environment for the edge to be active. The children
of nodes are ordered by priority. To determine the ac-
tion to perform, the highest-priority child of the root
node connected by an active edge is chosen. If it is a
leaf, an action or an action sequence associated with
the leaf is executed, otherwise the node is searched re-
cursively. SPOSH behavior primitives (senses and ac-
tions) serve as a communication interface between the
SPOSH engine and the rest of the NPC. Senses and
actions are user-defined and implemented in the same
language as the SPOSH engine (in our case Java). Be-
havior primitives are implemented as (class) methods
of a specific signature. The SPOSH language syntax
is Lisp-like. As the Lisp-like syntax was found con-
fusing by our users, we developed a simple drag
& drop graphical editor for SPOSH plans and inte-
grated it into the NetBeans Java IDE (see Figure 1). The
same NetBeans IDE is used for coding behavior primi-
tives in Java.
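To make the selection semantics concrete, the following is a minimal Java sketch of the priority-driven traversal described above. The class and method names (PoshSketch, Node, evaluate, and the seeEnemy/seeWeapon/fire/grab/wander primitives) are illustrative assumptions only and do not correspond to the actual SPOSH/yaPOSH API.

import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

/**
 * Minimal sketch of POSH-style action selection. Names and structure are
 * illustrative assumptions; they do not mirror the actual SPOSH/yaPOSH API.
 */
public class PoshSketch {

    static class Node {
        final BooleanSupplier sense;   // condition guarding the edge to this node
        final Runnable action;         // non-null only for leaves
        final List<Node> children;     // ordered by priority, highest first

        Node(BooleanSupplier sense, Runnable action, List<Node> children) {
            this.sense = sense;
            this.action = action;
            this.children = children;
        }

        boolean isLeaf() { return children == null || children.isEmpty(); }

        /** Returns true if this subtree selected and executed an action. */
        boolean evaluate() {
            if (isLeaf()) {
                action.run();                      // execute the leaf's action
                return true;
            }
            for (Node child : children) {          // children in priority order
                if (child.sense.getAsBoolean()     // the edge is active
                        && child.evaluate()) {
                    return true;                   // first active child wins
                }
            }
            return false;                          // no applicable child
        }
    }

    // Hypothetical behavior primitives; in SPOSH these would be user-defined
    // (class) methods of a specific signature.
    static boolean seeEnemy()  { return false; }
    static boolean seeWeapon() { return true;  }
    static void fire()   { System.out.println("firing");    }
    static void grab()   { System.out.println("grabbing");  }
    static void wander() { System.out.println("wandering"); }

    public static void main(String[] args) {
        // A tiny hunter-like plan: shoot > pick up weapon > explore.
        Node shoot   = new Node(PoshSketch::seeEnemy,  PoshSketch::fire,   null);
        Node pickUp  = new Node(PoshSketch::seeWeapon, PoshSketch::grab,   null);
        Node explore = new Node(() -> true,            PoshSketch::wander, null);
        Node root    = new Node(() -> true, null, Arrays.asList(shoot, pickUp, explore));
        root.evaluate();   // one action-selection tick; prints "grabbing"
    }
}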
Figure 1: Screenshot of the yaPOSH Plan Debugger.
6.1 Studies Setting
All studies focused on creation of NPC behaviors
in game-like tasks. We were interested in compar-
ing the performance of users using Java together with the SPOSH
(yaPOSH) tool to users using Java only. Users were
given necessary low-level primitives (turn, move,
navigate, shoot, speak, do-i-see, do-i-hear, etc.) and
they were asked to create high-level plans for the
NPC, which should solve game-like tasks. We have
been working with three tasks. The first task was to
create a simple HunterBot behavior, where an NPC
had to explore its environment, gather weapons and
hunt down an enemy NPC. The second task was a
GuideBot behavior; to search the environment for
other friendly NPCs and guide them home. The NPCs
were programmed to follow the bot after a request and
to stop following whenever they lost visual contact
with the bot. The third task was to alter the existing
GuideBot behavior and turn it into the GuardBot be-
havior, which protected the friendly NPC while guiding
it, as hostile NPCs were added to the environment.
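For illustration, the granularity of the primitives given to participants could be captured by a Java interface roughly like the sketch below; the interface name and method signatures are our assumptions and not the exact primitives used in the studies.

/** Hypothetical shape of the low-level primitives handed to participants. */
public interface BotPrimitives {
    void turn(double degrees);          // rotate the bot in place
    void move(double distance);         // move forward
    void navigateTo(String location);   // path-follow to a named location
    void shoot(String targetId);        // fire at a target
    void speak(String message);         // say a message to nearby NPCs
    boolean doISee(String targetId);    // visual sense
    boolean doIHear(String targetId);   // auditory sense
}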
The studies were always done as a part of the semester
test of a course on intelligent virtual agents;
during the course, students were taught how to cre-
ate NPC behaviors both in plain Java and using our
SPOSH (yaPOSH) tool.
Users were given 3.5 hours to finish a single task.
We currently measure the productivity by the total
time a subject needs to accomplish the tasks. Future
work should address the question of how to distinguish
between different development use cases during
the subject's work, as they are often mixed together.
6.2 The 1st Study — Java vs. SPOSH + Java
The first study (removed) was a pilot in which we
designed a comparative controlled experiment in or-
der to answer a general question about the suitability of the SPOSH
tool for the specification of NPC behav-
iors. It focused on subjective language
preference rather than the actual productivity of the tool. The
hypothesis was that the NetBeans Java IDE featuring
the graphical SPOSH plan editor plugin would prove to be
better than NetBeans alone. The pilot consisted of
two tasks: HunterBot and GuideBot. 30 students
participated in the study. The main findings were:
(a) SPOSH was favored for the HunterBot task, but
not for the GuideBot task (quantitative answers).
(b) Students argued that GuideBot task could have
been easily solved with event-driven approach,
which could have been easily captured in Java but
not in SPOSH (qualitative answers).
(c) As no user in either group had a problem finishing
both tasks, it raised questions about the tasks'
complexity and the relevance of users' opinions.
WhyyoushouldEmpiricallyEvaluateyourAITool-FromSPOSHtoyaPOSH
465
Students’ comments revealed that their language
preference correlated with the difficulty they had with
its actual use. The HunterBot behavior was re-
ported to be better expressible within SPOSH than
the GuideBot behavior and thus the preference for
the SPOSH language differed between these two task.
Although SPOSH is a tool specifically designed for
NPC behavior development, the study did not indicate
that SPOSH is generally preferred than plain Java.
6.3 The 2nd Study — Java vs. SPOSH + Java
The second study (removed) used GuideBot and
GuardBot tasks. This time we introduced a twist
into the setting for the second task: all users received
an implementation of GuideBot from someone else and
were asked to extend it into GuardBot. This was
intended to raise the difficulty of the second task and to
test the tool under a different GDUC; users were forced
to read and extend code written by someone else. The study was hard
for students, but brought crucial data. Only two out
of 22 students were able to finish the second Guard-
Bot task. Qualitative data were also more fruitful,
as users were commenting on the code of someone
else. The necessity to alter existing (not always well
thought-out) behavior plan/code revealed strong and
weak points of SPOSH.
However, the study’s findings regarding produc-
tivity of the tools used were rather inconclusive: a)
The average time required to solve the GuideBot task
by subjects in both groups was almost the same: 2:42
hours (sd = 28 minutes) for Java and 2:50 hours
(sd = 33 minutes) for SPOSH. b) It was impossible
to extract data from the GuardBot task as it was fin-
ished by only two students. But the study still brought
some interesting insights:
(a) High-level SPOSH plans helped users to quickly
grasp the general idea behind the code created by
someone else. Understanding behavior written in
plain Java was found to be more difficult.
(b) Both user groups had similar opinions about the
suitability of their tool for the assignment.
(c) Users' opinions shifted towards disliking
SPOSH after they failed the 2nd task.
(d) As SPOSH constructs lacked parametrization, the
amount of logic present in the plan was limited.
(e) SPOSH users complained about the complexity
of custom actions they implemented in Java. The
complexity was imposed by the SPOSH engine
implementation that forced users to track the ac-
tion state (action initialization / execution / final-
ization) by themselves.
(f) Usability issues of SPOSH plan editor were re-
ported; most notably a problem with the plan de-
bugging that relied on text logs only.
Despite our intuitions about SPOSH, the study re-
sults once again did not show that SPOSH is more
productive than plain Java. Nevertheless, it revealed
an area where SPOSH (if improved) can beat Java and
thus its features may be relevant to the industry.
6.4 The 3rd Study — Java vs. yaPOSH + Java
Based on the findings from the 2nd study, we altered
SPOSH and created yaPOSH. The changes were: 1) yaPOSH
plan primitives were made parameterizable. 2) A new
yaPOSH debugger was created that allowed users to
place breakpoints on plan nodes. 3) The yaPOSH engine
was improved to track the state of executed actions itself.
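As an illustration of the third change, the sketch below shows one way an engine can drive an action's initialization/execution/finalization lifecycle so that the behavior programmer no longer has to track it manually. The class and method names are our assumptions for illustration and are not the actual yaPOSH API.

/**
 * Hedged sketch: an action whose init/run/done lifecycle is driven by the
 * engine rather than tracked manually inside the action body.
 */
public abstract class LifecycleAction {
    public enum State { INITIALIZED, RUNNING, FINISHED }
    private State state = State.INITIALIZED;

    protected abstract void init();      // called once before the first tick
    protected abstract boolean run();    // called every tick; true when done
    protected abstract void done();      // called once after the action ends

    /** Called by the engine on every action-selection tick. */
    public final boolean tick() {
        if (state == State.INITIALIZED) {
            init();
            state = State.RUNNING;
        }
        if (state == State.RUNNING && run()) {
            done();
            state = State.FINISHED;
        }
        return state == State.FINISHED;
    }
}

Moving this bookkeeping into the engine directly addresses complaint (e) reported in the second study.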
The third study (not published yet) used the same
setup as the second one. Given the same setting, we
decided to have only a single group of users, who
used yaPOSH+Java. We recruited 18 students differ-
ent from those that participated in the second study.
Results of the study are encouraging:
(a) All students successfully completed both tasks.
Students finished the first task in 1:29 hours (sd
= 31 minutes) on average and the second task in
3:15 (sd = 47 minutes).
(b) All yaPOSH changes were adopted by almost
all users, especially the yaPOSH debugger and plan
construct parameterization, which allowed pushing
more logic into yaPOSH plans than was possible with
SPOSH.
(c) Users' opinions shifted towards liking yaPOSH more
after they finished the 2nd task.
6.5 Discussion
The results of the 3rd study show how proper tool
support can boost productivity. Students from the
3rd study using yaPOSH finished the first task
around 1.8 times faster than students from the 2nd
study (both tools counted, as the times were very sim-
ilar). Additionally, all students from the 3rd study
were able to finish the second task in the given time-
frame. Unfortunately, due to the nature of the studies, it
is hard to isolate which new feature contributed the
most to this improvement. We believe this is strong
evidence that standard programming languages (such
as C++/Java or Lua) are not sufficient as a tool for
NPC behavior creation. However, one could object that
the changes made to SPOSH might have helped specifi-
cally with the GuideBot and GuardBot tasks (even though
ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence
466
we did not design yaPOSH this way and we are us-
ing it in different contexts as well) and thus the pro-
ductivity improvement might not generalize to other
tasks. But even if this was the case, our results would
still be applicable to industrial practice as game stu-
dios are always adapting existing tools and engines to
suit the needs of the particular project they are work-
ing on. Therefore, we suggest that AI experts should
be part of the game design team from the early pre-
production stage in order to have time to tailor and
test the AI tools before the production stage starts.
This necessity is even greater considering the re-
sults of our first two studies: a well-thought-out tool
that took a significant amount of time to de-
velop (SPOSH) failed to outperform the tool that
was already available (Java).
7 METHODOLOGY DETAILED
Based on experience from the conducted studies, we now
discuss some of the methodology steps in greater de-
tail to provide a cookbook of useful ingredients for a
good experimental design.
7.1 Designing Task Sets
Choosing the right tasks for the evaluation (Step 6) is crit-
ical. Simple tasks might not let users use the full
power of the tool or might reveal only small differences
between the groups, e.g. in the 1st study.
An overly complex task may uncover weaknesses of
the tools used (e.g. the 2nd study), but such a study will
not provide relevant data about the tool's productivity
(e.g. students failed to finish the 2nd task of the 2nd
study). When multiple tasks can be assigned, it is a
good idea to choose them from the whole spectrum.
Another alternative (yet to be tested) is to design a lot
of simple tasks and ask users to complete as many as
they can in a given time frame.
Various tool advantages and disadvantages man-
ifest themselves under different use-cases, e.g., the
2nd tasks of the 2nd and the 3rd study. Thus the task
set should be designed to test how users work with
the tool under different GDUCs. If possible, each
task should be designed to target a single use-case
(complex tasks can be broken down into several steps,
each step analyzed separately) and the whole task set
should cover them all.
It is also important to minimize the amount of
work needed to solve the task that does not directly
involve your tool, e.g., when testing a behavior tree
debugger, the user should not be forced to fix errors
that stem from inaccurate NPC navigation.
The scale of the tasks is also important. If the task
set can be completed within hours, it allows for a con-
trolled experiment in a lab, which lets the researcher
gather data without much noise and monitor in-
dividual user progress. On the other hand, such a time
frame often precludes realistic problems, as they fre-
quently cannot be solved that quickly. In our research
we have focused on controlled experiments, but the
length of the experiments necessary for proper evalu-
ation was on the border of manageability; in an 8-hour-long
experiment, users became seriously affected by
fatigue during its second half. It is always bene-
ficial to let a few colleagues solve the experiment
tasks during a pilot study. We have observed that
students (a mix of bachelor and master students) required 2-
4 times more time to complete the given tasks than our
colleagues experienced in behavior development.
7.2 The Users
Recruiting users to test the tool is probably the most
difficult part of the study. Obviously the more users
the better, but even in the case of a few users, methods
of qualitative research allow the researcher to extract
valuable data from the test group. In an academic set-
ting we have found it useful to let students of a rel-
evant course test the tool as a part of their semester
evaluation test.
Unless a very focused test group is available (such
as a development team from a specific studio), it is
necessary to account for differences in skills and
background knowledge of individual users. Pre-test
questionnaires determining relevant previous experi-
ence of the users are very useful to this end. If possi-
ble, a within-subject experiment design also helps to
alleviate those issues.
The users should also be already familiar with the
tool they will test so that they understand all of its
features necessary for solving the task. This is es-
pecially the case if they are already proficient with a
baseline tool. One should bear in mind that a tool
might be of different value to novice users and to ad-
vanced users; if possible, experiments should
be conducted with both groups present.
7.3 Collecting Data
There are two major classes of data to collect: objec-
tive and subjective. The objective data include the ac-
tual performance of the user at the tasks, especially
the time it took the user to finish individual tasks,
and logs of user activity (even screen captures) allowing for
deeper analysis of the users' resulting creations.
This is where controlled experiments bring the largest
benefit: they allow the researcher to gather much more ob-
jective data than field studies.
Subjective data include the individual user expe-
riences. The most convenient way to gather them is
through questionnaires and by structured interviews
with the users. While objective data are usually good
at showing “what” happened, the subjective data of-
ten help to clarify “why” it happened or to filter out
biased users. If the user group is large enough, all of
the data should be statistically analyzed.
So far, we do not have any metric for tool
productivity other than the time required to solve the tasks.
8 CONCLUSIONS
We have presented our position on AI tool design;
in particular, we have stressed the importance of user
evaluation of the tools. We believe that comparative
controlled experiments should become the main aca-
demic approach to demonstrating the usefulness of AI tools
to the game industry and to driving tool
improvements.
In general, designing proper comparative exper-
iments for usability and productivity evaluation is
tricky as there are many factors that affect user per-
formance, which are difficult to control. Since this
research area is relatively new, there are no definitive
“best practices” to follow. In this paper we have pre-
sented what we consider a candidate for such “best
practice” for languages and tools for the de-
sign of game AI. We believe that our experience is to
a large extent transferable to other real-world applica-
tions of AI and supporting tools.
The experience we have gathered throughout the
development of SPOSH and yaPOSH has let us de-
sign a new tool for developing NPC behavior that has
been adopted for real-world deployment in the design
process of an upcoming AAA computer game. Al-
though the project leads were initially skeptical about
the idea, thanks to our studies we were able to show
that we know how to create a practical tool. The tool
itself was ultimately received very well by both the
project leads and its users (i.e. scripters) and has
fully replaced their previous behavior design solution.
ACKNOWLEDGEMENTS
Human data were collected with APA princi-
ples in mind. This research is supported by
the Czech Science Foundation under the contract
P103/10/1287 (GACR), by student grant GA UK No.
655012/2012/A-INF/MFF and partially supported by
SVV project number 267 314.
REFERENCES
Bryson, J. J. (2001). Intelligence by design: Principles of
Modularity and Coordination for Engineering Com-
plex Adaptive Agent. PhD thesis, MIT, Department of
EECS, Cambridge, MA.
Champandard, A. J. (2010). Finding a better way to Mordor.
Presentation, CIG 2010. http://vimeo.com/14390998
Accessed 2014-01-05.
Cutumisu, M., Onuczko, C., McNaughton, M., Roy, T.,
Schaeffer, J., Schumacher, A., Siegel, J., Szafron, D.,
Waugh, K., Carbonaro, M., et al. (2007). ScriptEase:
A generative/adaptive programming paradigm for
game scripting. Science of Computer Programming,
67(1):32 – 58.
Hollingsed, T. and Novick, D. G. (2007). Usability inspec-
tion methods after 15 years of research and practice.
In Proceedings of the 25th annual ACM international
conference on Design of communication, pages 249–
255. ACM.
Jeffries, R., Miller, J. R., Wharton, C., and Uyeda, K.
(1991). User interface evaluation in the real world: a
comparison of four techniques. In Proceedings of the
SIGCHI conference on Human factors in computing
systems, pages 119–124. ACM.
Karat, C.-M., Campbell, R., and Fiegel, T. (1992). Com-
parison of empirical testing and walkthrough methods
in user interface evaluation. In Proceedings of the
SIGCHI conference on Human factors in computing
systems, pages 397–404. ACM.
Nielsen, J. and Phillips, V. L. (1993). Estimating the rela-
tive usability of two interfaces: Heuristic, formal, and
empirical methods compared. In Proceedings of the
INTERACT’93 and CHI’93 conference on Human fac-
tors in computing systems, pages 214–221. ACM.
Píbil, R., Novák, P., Brom, C., and Gemrot, J. (2012). Notes
on pragmatic agent-programming with Jason. In Pro-
gramming Multi-Agent Systems, volume LNCS 7217,
pages 58–73. Springer.
Sadowski, C. and Kurniawan, S. (2011). Heuristic evalua-
tion of programming language features: two parallel
programming case studies. In Proceedings of the 3rd
ACM SIGPLAN workshop on Evaluation and usabil-
ity of programming languages and tools, pages 9–14.
ACM.
Stefik, A., Siebert, S., Stefik, M., and Slattery, K. (2011).
An empirical comparison of the accuracy rates of
novices using the Quorum, Perl, and Randomo pro-
gramming languages. In Proceedings of the 3rd ACM
SIGPLAN workshop on Evaluation and usability of
programming languages and tools, pages 3–8. ACM.
ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence
468