AN INFRASTRUCTURE FOR MECHANISED GRADING
Queinnec Christian
LIP6, Université Pierre et Marie Curie, France
Keywords:
Mechanised grading, Grading experimentation.
Abstract:
Mechanised grading is now mature. In this paper, we propose elements for the next step: building a commodity
infrastructure for grading.
We propose an architecture for a grading infrastructure able to be used from various learning environments,
a set of extensible Internet-based protocols to interact with the components of this infrastructure, and a self-
contained file format to thoroughly define an exercise and its entire life-cycle in order to ease deployment. The
infrastructure is designed to be scalable, robust and secure.
First experiments with a partial implementation of this infrastructure show its versatility and its neutrality. It does not impose any IDE and it respects the tenets of the embedding learning environment or course-based system, thus fostering an eco-system around mechanised grading.
1 HISTORY
Around 2000, with the help of some colleagues from
UPMC, we set up two experiments involving mechanised grading. In the first experiment (reported in (Brygoo et al., 2002)), we adjoined to the
DrScheme programming environment (an Integrated
Development Environment (or IDE) devoted to the
Scheme programming language (Findler et al., 2002))
a mechanised grading facility: the student chooses
an exercise, reads the stem, types then debugs his
program and finally hits the “Check” button to ob-
tain a mark and a page explaining the tests that were
performed and led to that mark. This mechanised
grader is a stand-alone plug-in to DrScheme; every
year since 2001, around 700 students have used it
on their home computer (night and day) or during lab
sessions.
In the second experiment (reported in (Queinnec
and Chailloux, 2002)), we organised general pro-
gramming examinations or contests where we had
several hundreds of undergraduate students to grade
in a short time. Centralised mechanised grading was
our sole option. Therefore we wrote unit tests to
check students’ programs considered as black boxes.
Soon after that, we designed a framework and implemented some libraries to ease the production of mechanised graders, that is, specialised programs that grade students' files for a given exercise. Since 2001, we
have been grading several hundred examinations per
year.
Around 2005, our various experiences with different programming languages such as Scheme, C, Ada, PHP, Perl, Java, make or shell convinced us that mechanised grading was mature enough for courses that mainly rely on some programming language(s) for exercises and/or examinations. We proposed
to unify our previous experiments into a generalised
framework and a multi-language grading architecture
to ease the production of graders and to lessen the as-
sociated chores. Our goal was to build
1. an autonomous and scalable grader that may be
embedded within various systems,
2. REST-based (Fielding and Taylor, 2002) proto-
cols to interact with the various components of the
grading infrastructure,
3. a comprehensive file format for self-contained de-
ployable exercises that covers the entire life-cycle
of an exercise.
While building this grader, we added new goals in order to provide a grading infrastructure, that is, grading as a software service (SaaS).
We will present, in the second Section, the ad-
vantages of mechanised grading and some of our
goals. The third and fourth Sections will describe
an overview of the architecture, nicknamed FW4EX
(for “framework for exercises”), and its prominent
features. In the fifth Section, we will summarise the
results of our latest experiments using the FW4EX
architecture. Related and future works conclude this
paper.
2 MECHANISED GRADING
Writing a program is a complex activity requiring
1. to understand a specification,
2. to imagine a solution,
3. to write it down in some programming language
within some development environment (IDE),
4. finally, to test it to make sure it complies with the
specification.
Any problem that arises during this life-cycle forces
the student to re-iterate parts of this cycle. A way to
assess this skill is to let students program with profes-
sional tools and to check programmatically whether
their programs comply with the specification. Pro-
fessionally, this is called “unit testing” and popular
frameworks such as JUnit (www.junit.org) were developed for that
activity. To use professional tools and to obey profes-
sional rules was something we wanted our students to
be exposed to.
There are some differences though with usual unit
testing.
Unit testing is designed to give a binary answer: 1 (pass) or 0 (fail). In a grading context, we want to give a more representative and accurate mark for the student's work, say a mark between 0 and 20 (as usual in French universities). This may be done by multiplying the number of tests and combining their partial results (see the sketch below).
Unit testing frameworks often depend on one programming language. However, some specifications leave the choice of programming language up to the student. Therefore, a grading framework has to cope with multiple languages.
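As an illustration of how partial test results might be combined into a mark out of 20, here is a minimal Python sketch; the test names, weights and aggregation rule are purely illustrative and are not the FW4EX scheme.

def combine_mark(results, max_mark=20):
    """results maps a test name to a (weight, passed) pair."""
    total = sum(weight for weight, _ in results.values())
    earned = sum(weight for weight, passed in results.values() if passed)
    # Scale the earned weight onto the 0..max_mark range.
    return round(max_mark * earned / total, 1)

results = {
    "empty input stream":     (1, True),
    "nominal case":           (3, True),
    "file with a weird name": (2, False),
}
print(combine_mark(results))   # prints 13.3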
Designing mechanised graders is a hot topic, as illustrated by the numerous papers presented in recent years, such as Quiver (Ellsworth et al., 2004), RoboProf (Daly and Waldron, 2004), TorqueMOODa (Hill
et al., 2005), gradem (Noonan, 2006), APOGEE (Fu
et al., 2008). An overview of this topic may be found
in (Douce et al., 2005a). Among well known sys-
tems containing graders are CourseMaker (Higgins
et al., 2005), BOSS (Joy et al., 2005) and now ASAP
(Douce et al., 2005b).
Mechanised graders offer multiple advantages:
consistency: students are graded uniformly (and
anonymously) without the biases due to (multiple)
human grader(s).
fairness: if one student finds a problem in the
specification or the associated tests then all stu-
dents benefit from the upgraded grader.
persistence: once created, assessments are forever available for students, who may practise them whenever they want, under the same conditions as when the assessment was initially offered. We dubbed this “dynamic annals” (Queinnec and Chailloux, 2002).
Many colleagues have negative feelings towards mechanised grading. Unit tests consider programs as opaque black boxes; quite often (but see (Higgins et al., 2005), (Joy et al., 2005)) they do not consider or appreciate style, performance, good algorithmic usage, etc. We think that some aspects of this criticism are due to the youth of the field:
Style checkers (CheckStyle (checkstyle.sourceforge.net), PMD (pmd.sourceforge.net) and FindBugs (findbugs.sourceforge.net) for Java, Perl::Critic (perlcritic.com) for Perl, etc.) exist and are more and more widely used professionally. Incorporating them in unit tests will lessen this criticism. Performance may also be checked given an appropriate environment setup and a non-toy problem size. However these improvements will dramatically increase grading duration.
Since we collect all the answers to our assess-
ments, we expect to be able to build a taxonomy
of common errors and to mechanically recognise
instances of these errors in order to provide appro-
priate answers.
Unit testing may produce very detailed reports of all tests, whether they passed or failed. Analysing these traces is a useful skill for students to acquire (as observed elsewhere, good debuggers are likely to be good programmers, while the converse is less true), provided some familiarity with these tools is organised ahead of assessments.
Therefore a grading infrastructure must be able to grade exercises in various programming languages and in various settings: complete programs (stand-alone, client and/or server), program fragments (function, method, class). A grading infrastructure must be able to be put to work as part of a learning environment or other systems: it must not dictate whether exercises are summative or formative, and it must be as independent as possible from student records databases.
Of course, it must also be simple to use, robust in the face of crashes, and secure (it must respect privacy and anonymity, be piracy-proof and be resistant to malicious students' work). It must ensure the long-term availability of the assessments. Finally, our ideal grading infrastructure also ought to be scalable in order to take advantage of new technologies such as cloud computing (Amazon EC2 for instance) or Internet-based storage (Amazon S3 for instance) if demand for grading should increase.
3 THE FW4EX PROJECT
The FW4EX project aims to specify and build, from
scratch, such a grading infrastructure under three ad-
ditional constraints listed in this Section. These are:
Exercises must be Easily Deployed. Copying
the artefact defining an exercise into an appropri-
ate directory should be sufficient. This requires a
precise format for artefacts defining exercises and
their whole life-cycle.
IDEs should be FW4EX-clients. Students
should program with appropriate tools. Using an
IDE is the current way of programming; therefore IDEs must be able to propose exercises, gather students' files, submit them for grading and display grading reports.
Fortunately, IDEs (Eclipse, NetBeans, Scintilla, Emacs, etc.) often offer a plugin architecture that makes it easy to add a button or a menu item for that task. In order to simplify these plugins, interactions with the FW4EX-related servers only use the HTTP protocol and, more precisely, a REST-based style (Fielding and Taylor, 2002).
Segregate Functionalities into Separate Com-
ponents. FW4EX-related functionalities are com-
ponentised so they may be organised into appro-
priate workflows (including human reviewers for
instance). Moreover, these components should cope with firewalls and university-imposed authentication methods, provide skinning and be scalable. Given that these components are accessed
via HTTP, they are called servers and altogether
form the building blocks for a grading infrastruc-
ture.
These servers gather related functionalities such
as serving exercises (e server(s)), acquiring stu-
dent’s files (a server(s)) or, serving stored grading
reports (s server(s)). A single physical machine
or a cloud of virtual machines may implement
all or part of these logical servers depending on
the topological constraints, the expected load or
throughput, the intended availability and the con-
figuration of the workflow.
We think that these three goals provide a strong
basis for a universal and versatile grading infras-
tructure. This infrastructure takes care of stor-
ing/retrieving exercises, accepting/grading submis-
sions, storing submissions and their associated grading reports, thus relieving teachers of all these chores. FW4EX aims to be a grading infrastructure along the lines of SaaS (Software as a Service).
This infrastructure provides many opportunities for the creation of appropriate eco-systems, such as:
Authoring tools to build artefacts defining exercises.
Teaching tools to design sets (scenarios) of exercises, analyse answers, compute statistics, analyse common errors, etc.
Deploying tools to instantiate networks of servers
to operate within a university (with a single phys-
ical server that serves all aspects of FW4EX to
unauthenticated clients) or within a cloud for a
world-wide company where scalability, latency
and accountability are of paramount importance.
To achieve these goals, we offer an artefact format
for exercises (see Section 4.1), a series of HTTP pro-
tocols (Section 4.4) and some feedback from our experiments (Section 5).
4 THE FW4EX ARCHITECTURE
The FW4EX framework builds on our previous experiments and tries to address a variety of use cases.
In this Section, we present the main lines of the ar-
chitecture, the protocols and the format of exercises
artefacts.
Figure 1 shows a simplified sketch of the logical
architecture of the FW4EX infrastructure. This archi-
tecture assumes the presence of the Internet and heavily
relies on REST-based services (Fielding and Taylor,
2002).
From the student's point of view, after choosing (or being assigned) an exercise, his browser (or any FW4EX-compliant client, see Section 4.4) fetches a zipped file (from an exercises server e) containing the stem and all the necessary files or documents required to practise the exercise. When ready, the student submits his work (to an acquisition server a) where it will be picked up by a grading server gd that will instantiate an appropriate grading slave (m1 or m2) to grade the student's files. Once graded, a final report (an XML file) is made available to the student (on some storage server s) along with the originally submitted work.
Figure 1: Sketch of the logical architecture of the FW4EX infrastructure. Arrows indicate which machine initiates connections.
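The following Python sketch illustrates the student round trip described above, as an FW4EX-compliant client might perform it. The host names, URL paths and form field names are hypothetical (the exact protocol is not reproduced in this paper), and the third-party requests library is an assumption.

import time
import requests

E_SERVER = "http://e.example.org"   # exercises server (hypothetical address)
A_SERVER = "http://a.example.org"   # acquisition server (hypothetical address)

# 1. Fetch the zipped exercise (stem and accompanying files) from an e server.
exercise = requests.get(f"{E_SERVER}/exercises/posix-one-liners-07.tgz")
with open("exercise.tgz", "wb") as f:
    f.write(exercise.content)

# 2. Submit the student's work to an a server; following REST principles,
#    the response carries the URL where the report will later appear.
with open("student-work.tgz", "rb") as work:
    answer = requests.post(f"{A_SERVER}/submissions", files={"job": work})
report_url = answer.headers.get("Location", answer.text.strip())

# 3. Poll the storage server s until the grading report (an XML file) appears.
while True:
    report = requests.get(report_url)
    if report.status_code == 200:
        print(report.text)          # the XML grading report
        break
    time.sleep(10)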
Exercise servers e only serve exercises, i.e., static files. Exercise servers may be implemented with Amazon S3 technology. To increase their availability, e servers do not depend on the central databases. In our experience, e servers are often proxied via a portal (see Figure 1) allowing students to choose exercises among the set prescribed by their teachers. The portal also skins the exercises with the university's web conventions (colours, logos, authentication, etc.).
Acquisition servers a are stand-alone servers that must be as robust as possible since they must be available day and night to accept students' work. To increase their availability, a servers do not depend on the central databases. They may be deployed within firewalled networks and may be polled through encrypted tunnels by the grading machinery, thus ensuring robust security.
Storage servers s only serve reports, i.e., static anonymous XML files. These servers do not need to access the central databases, which increases their availability: students may then get feedback whenever they want, night and day. To preserve anonymity, reports may only be fetched via non-guessable MD5-based URLs. Storage servers may be implemented with Amazon S3 technology. In our experience, reports from storage servers are often accessed via portals that skin the reports with the university's web conventions.
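The exact URL scheme is not detailed in this paper; the following Python sketch shows one plausible construction in which the path is derived from an MD5 digest of the job identifier and a server-side secret, so that a report URL cannot be guessed from a student's identity. The secret and job identifier are assumptions.

import hashlib

SECRET_SALT = b"server-side secret, never disclosed"    # assumption

def report_url(job_id, s_server="http://s.example.org"):
    digest = hashlib.md5(SECRET_SALT + job_id.encode("utf-8")).hexdigest()
    return f"{s_server}/reports/{digest}.xml"

# Prints a URL whose file name is a 32-hex-digit, non-guessable token.
print(report_url("2009-exam1-job042"))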
Teachers prescribe sets of exercises to students
and may access their resulting reports. Teachers may
also collect students' work (jobs) by any convenient means and submit batches of jobs for grading. Teach-
ers have (authenticated) access to grades and names
through the t server.
The author of an exercise (shortened as author in
the rest of this paper) also has access to the anony-
mous reports produced by his exercise in order to im-
prove it.
Information confinement is strict: grading slaves
grade anonymous files, storage servers store anony-
mous files. Among all servers, only grading servers
gd and t (see below) are aware of students’ identities.
Servers are customised for one task only. This of-
fers numerous advantages: redundancy for fault tolerance, concurrency to increase throughput,
versatility where a new version of a component may
be plugged into the architecture and compared (in
speed, resource consumption, accuracy) with its pre-
vious version without disrupting the entire infrastruc-
ture.
For security reasons, the grading machinery is
(quite often) not publicly nor directly reachable. The
grading slaves are virtual machines run with QEMU,
VMware, VirtualBox, etc. as explained in Section 4.2.
4.1 Exercise Format
An exercise is a tar gzipped file (not dissimilar to a
Java jar file or an OpenDocument file) containing
an XML descriptor and all the files required to op-
erate the exercise. This central descriptor rules the
various aspects of the entire life-cycle of the exercise,
successively: autocheck, publish, download, install,
grade. These phases are briefly explained hereafter in order of increasing complexity.
Download. The descriptor specifies the stem and the
files required by the student in order to practise the
exercise. The descriptor also describes, for each question, the files or directories expected from the student, and may contain additional hints such as a
template to fill for an answer or the size of the ex-
pected work. In the case of an exercise with a sin-
gle question expecting a single file, an FW4EX-
compliant client (see Section 4.4) may use these
hints to display an appropriate widget to capture
interactively an answer on one or multiple lines.
Grade. The descriptor defines which scripts (writ-
ten in whatever scripting language) are used to
grade students' files and how these scripts may be combined. It also specifies which kind of grading slave is required, that is, which OS (Debian, Windows, etc.), which CPU, etc. The descriptor also
mentions CPU duration limits, output limits, etc.
Install. The descriptor defines how an exercise is in-
stalled on a grading slave (usually a virtual ma-
chine) or on a student's computer. This installation may require uncompressing data files, pre-compiling libraries, etc. The installation on a grading slave is different from the installation on the student's computer since grading scripts are neither disclosed to the student nor communicated to the student's computer.
Autocheck. The descriptor defines a number of
“pseudo jobs” paired with the expected grade they
should get. This makes it possible to check whether the grading slave contains all the software required to grade a specific exercise, and to ensure non-regression when the author of an exercise evolves it.
This autocheck phase also allows FW4EX developers to evolve the infrastructure while maintaining similar results for the registered exercises.
Most often it is advisable to have at least three pseudo jobs: a null one that contains nothing and expects a 0/20 grade, a perfect one that should be graded 20/20, and an intermediate one that checks that in-between grades are indeed possible.
Publish. When installed and autochecked, an exer-
cise is ready to grade real students' work. It is
then publicly available.
The descriptor also defines some meta-data to
characterise the exercise: name, requirements,
summary, tags, etc. These are needed by e servers
to inform students and teachers about the exer-
cises that are ready to be selected.
The XML descriptor is the file that ties together all the information required to operate the exercise through all the steps of its life-cycle. The exercise file is autonomous and, similarly to a WebApp, it may be deployed or migrated from one application server to another.
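The exact XML vocabulary of the descriptor is not reproduced in this paper. The following Python sketch therefore invents element and attribute names (stem, question, grade, install, autocheck, etc.) only to illustrate how a self-contained descriptor covering the life-cycle phases might be read by a tool.

import xml.etree.ElementTree as ET

# Hypothetical descriptor: element and attribute names are invented for
# illustration and do not reflect the actual FW4EX vocabulary.
DESCRIPTOR = """
<exercise name="posix-sift" summary="Sift a stream with sed" tags="shell sed">
  <stem file="stem.xml"/>
  <question name="q1" expects="answer.sh" hint="one line"/>
  <grade os="debian" cpu-limit="60s" output-limit="64k">
    <script file="grade-q1.sh" on-error="continue"/>
  </grade>
  <install slave="make prepare-slave" student="make prepare-student"/>
  <autocheck>
    <pseudo-job dir="jobs/null"    expected="0"/>
    <pseudo-job dir="jobs/perfect" expected="20"/>
    <pseudo-job dir="jobs/partial" expected="11"/>
  </autocheck>
</exercise>
"""

root = ET.fromstring(DESCRIPTOR)
print(root.get("name"), "expects", root.find("question").get("expects"))
for job in root.iter("pseudo-job"):
    print("autocheck:", job.get("dir"), "should be graded", job.get("expected"), "/ 20")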
Stems, reports and other information coming out of the grading infrastructure are XML documents; therefore, they can be skinned via XSLT style sheets to fit the university's look and feel. They may also be processed to fill student records databases, grade books or any other tool that needs information about grading. These normalised XML documents isolate the grading infrastructure from local idiosyncrasies.
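As a sketch of this skinning step (the lxml dependency and the file names are assumptions; the paper only states that reports are XML and therefore skinnable via XSLT):

from lxml import etree

skin = etree.XSLT(etree.parse("university-skin.xsl"))   # the portal's style sheet
report = etree.parse("report.xml")                      # anonymous grading report
html = skin(report)                                     # same content, local look and feel
print(str(html))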
4.2 Confinement
Our experience shows that students make mistakes (infinite loops, deadlocks, etc.). Their programs need to be confined in time and CPU consumption. Their output must also be limited: no more than a given number of bytes written to files or to output streams. Though rare, some students' programs are malicious and must be tightly restricted (are they allowed to send mail, to open sockets, to start background processes, to destroy resources in order to prevent other students' files from being graded, etc.?). For all these reasons, we run students' programs under student accounts (so they do not harm teachers' files) within virtual machines (QEMU or VMware) and pay much attention to offering the same initial conditions for all jobs.
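The actual FW4EX grading slaves are virtual machines; as a much lighter illustration of the same confinement idea (CPU and output limits plus a wall-clock timeout), the following Python sketch uses POSIX resource limits. The limits and program name are arbitrary.

import resource
import subprocess

def limit_resources():
    resource.setrlimit(resource.RLIMIT_CPU, (10, 10))             # at most 10 s of CPU
    resource.setrlimit(resource.RLIMIT_FSIZE, (64 * 1024,) * 2)   # at most 64 KiB written

try:
    completed = subprocess.run(
        ["./student-program.sh"],
        preexec_fn=limit_resources,   # applied in the child process, before exec
        capture_output=True,
        timeout=30,                   # wall-clock limit (infinite loops, deadlocks)
    )
    print("exit status:", completed.returncode)
except subprocess.TimeoutExpired:
    print("job killed: wall-clock limit exceeded")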
Using virtual machines also guarantees the ability to reproduce, forever, the same conditions under which an exercise was designed to be graded. We previously had bad experiences with system updates breaking configurations or slightly altering runtime libraries. Virtual machines remove that problem entirely and offer eternity to exercises. Virtual machines also make it possible to offer various operating systems (Debian, Windows, etc.), CPUs or networked machines for client-server programs (for instance in PHP).
Our experience also shows that authors of exercises make mistakes. Their programs need to be similarly confined, otherwise they may bog down servers, destroy common resources and harm the students' experience. Grading an exercise is therefore a complex task where the student's errors should appear in the student's report, the author's errors should appear in the author's report, and FW4EX errors should be routed to FW4EX maintainers.
4.3 Grading Scripts
Scripts are regular programs written in whatever language fits the task. They are run within the directory where the student's files are deployed; they may read or run files owned by the author (stored elsewhere) as well as read or run FW4EX libraries (stored elsewhere).
Grading may be performed by comparison (with the teacher's solution) or by verification (checking whether some property holds). Grading by comparison requires the author of the exercise to write his own solution but makes it possible to randomly generate input data.
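The following Python sketch illustrates grading by comparison on randomly generated inputs: the student's program and the author's solution are run on the same input and their outputs compared. The program names and the number of trials are illustrative only.

import random
import subprocess

def run(program, stdin_text):
    out = subprocess.run([program], input=stdin_text, text=True,
                         capture_output=True, timeout=10)
    return out.stdout

TRIALS = 10
passed = 0
for _ in range(TRIALS):
    data = "\n".join(str(random.randint(0, 999)) for _ in range(20)) + "\n"
    if run("./student-answer.sh", data) == run("./author-solution.sh", data):
        passed += 1

print(passed, "/", TRIALS, "random inputs matched the author's solution")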
The exercise descriptor also specifies how to combine the various scripts, that is, stop at the first error, ignore some errors, etc.
Support libraries help write very concise grading scripts for specific languages (for instance in Java,
we refine the JUnit framework to count the number of
successful/failed assertions in addition to the number
of successful/failed tests) or for specific situations (for
instance in bash, input data may be given as command
options, input streams, data file names, etc.).
4.4 FW4EX-compliant Clients
We strongly believe that, apart from introductory Computer Science courses, programming should be done with professional tools such as a usual programming environment (IDE). Browsers and applets cannot be considered decent substitutes for a real IDE.
An FW4EX-compliant IDE (often an IDE plug-in) is
able to speak to the three types of public servers (e,
a and s) according to some REST protocols (Fielding
and Taylor, 2002).
These protocols allow a student to authenticate, to obtain a stem, to submit work and, finally, to get a
report. When submitting a job and following REST
principles, the client receives the URL where the re-
port will pop up on some s server.
From an author’s point of view, it is similarly pos-
sible to submit an exercise, get the corresponding au-
tocheck report leading to the grading reports of the
pseudo-jobs. Additional services exist for FW4EX
maintainers to manage jobs, exercises, configuration.
Presently we have three FW4EX-compliant clients. The first is an AJAX FW4EX-compliant client running in any regular browser. It fulfils all of the needs except the installation of the exercise on the student's computer, which cannot be fully automated for security reasons. It manages the display to accommodate various types of exercises (see experiments in
Section 5): one-liners are captured via a one-line text
widget, short scripts are captured via a textarea wid-
get, multiple files are captured via a file upload wid-
get.
Since it mostly handles XML documents, the
AJAX FW4EX-compliant client imposes its own
XSLT style sheets to display them with the desired look and feel. Additional treatments may also be per-
formed such as using a tree widget to iconise series
of lengthy tests in the grading report. The report is
therefore largely independent of the display technol-
ogy.
The AJAX FW4EX-compliant client may be em-
bedded in a university portal.
Two other clients were developed as proofs of concept: one for Eclipse (for Java, C, PHP, etc.) and one for SciTE (www.scintilla.org). In these IDEs, similarly to what we
did for DrScheme (Brygoo et al., 2002), a single but-
ton or menu will manage authentication, show stem,
install accompanying files, pack student’s work, sub-
mit it and display the associated report, if available,
as annotations to source code.
5 EXPERIMENTS WITH FW4EX
We deployed the FW4EX infrastructure in September
2008. Two experiments were conducted during the two teaching semesters; a third one took place at the end of January 2009 with quite a different modality.
5.1 One-liners
For a course on POSIX skills (mainly shell and Makefile), we deployed approximately 40 exercises. These
exercises are “one liners” since they contain a sin-
gle question that asks for options (for utilities such as
tr, sort, sed, etc.) or for whole commands for var-
ious tasks (sifting, sorting, regexp-ing, etc.). Stems
are terse and imprecise (but always give an example
of use) in order to let students think about limit cases
such as empty streams, empty files, files out of the
current directory, files with weird names, etc. The
goal of these formative exercises was to encourage
students to read the associated man pages in order to
become familiar with the 2 to 4 most useful options of each of these utilities.
These exercises are permanently available for all
students; students are not limited in the number of submissions. Reports are very detailed: every test is ex-
plained, results are shown and reasons for success or
failure are verbalised. Students have to analyse these
reports and sometimes discover that some limit cases
are indeed possible with respect to the stem. In the
current deployment, grading reports are obtained after 20 seconds. Such a duration compels students to consider that the grader is not a fast debugging tool, so debugging should be done on their own computer before asking to be graded.
We offer roughly 5 exercises per week for 7
weeks. We start from exercises asking for options
and end with exercises asking for small shell scripts
(less than 10 lines). With the help of the support libraries,
grading scripts were reduced to a line or two.
Of the 120 students enrolled in the course, half of them tried at least one exercise. The infrastructure has been serving (night, day and holidays) around 3500 submissions per semester.
The exercises were not prescribed; only volunteering students tried them. Some tried them from self-service lab rooms, others from home with their own computer. On average, we observed 2 to 3 attempts per exercise and per student. As already reported (Woit and Mason, 2003), students did not volunteer easily. Apart from the initial curiosity of the first week, we only noted the usual peak during the week that precedes examinations.
Students are not directly exposed to FW4EX, they
only use it through the Web pages of the lecture they
follow. They appreciate the availability (on Sundays and even during the Christmas holidays). Of course, the more
trained the students are, the better results they get at
the examination.
5.2 Examinations
In the same course, we use the FW4EX infrastructure
to grade the mid-term examination and the two final
examinations. The examinations contain three sum-
mative exercises each containing 1 to 6 rather inde-
pendent questions. Contrary to the one-liners, examinations are very carefully specified. Stems are pre-
cise, they ask for scripts (or Makefile rules) perform-
ing various tasks. Examples are always given, deliv-
erables are specified and some hints are given about
how grading will be performed.
These examinations last 3 hours and take place in lab rooms. As usual, the mid-term examination mainly serves to prepare the students for the final examinations. The students' work is sent to the grader after the examination; therefore students do not have any feedback during the examination.
In this setting, we had 120 sets of students’ files
to grade. The exercise file (containing the examina-
tion) was prepared with 6 pseudo jobs. We graded the whole set of students' files 4 times in order to tune the
grader: we had to cope with misnamed scripts or files,
to fix an encoding problem (UTF8 versus Latin1) and
to slightly alter the weight of two questions. Approx-
imately 60 seconds were necessary to grade one stu-
dent.
The design and test of these examinations (writ-
ing the stem, writing a solution, writing the grad-
ing scripts and testing them over pseudo jobs) took
around two days of work.
Five days were necessary, after the examinations, before the reports and grades were made available to the students (a student may only access his own report). Of these five days, three were allotted to the teaching assistants to check whether the grading results agreed with the behaviour of the students during lab sessions. Teachers also use the grading reports to prepare students for the second-chance examination.
The mid-term examination was also made available as a regular (although huge) exercise in itself. Students may then retry this examination in a less stressful context (at home) but, at the same time, be graded by the exact grader used at the examination. This is what we call “dynamic annals”. Only 6 students retried the examination; 1 of them got a grade greater than 19/20 in 4 attempts, the others topped out at 6/20 with 1 to 6 attempts before giving up.
The examination was held on the computers of the
laboratories where no plagiarism was possible. Pla-
giarism detection will be considered as an option for
batch grading.
5.3 Programming Contest
The “Journée Francilienne de Programmation” is a programming contest for undergraduate students from various Parisian universities. A dozen programming languages may indeed be used to solve the proposed problem. The contest lasts 5 hours, during which 12 teams of volunteering students regularly submit their work (120 uploads) and get their grade (the average grading time was 45 seconds) in order to see how they perform with respect to the other teams. Besides individual grading, the portal runs additional scripts using the reports from FW4EX to deliver bonus points to teams' work (mainly based on the speed of programs).
The next edition of this contest will use the same
grading infrastructure.
5.4 Summary of Experiments
These three experiments show the versatility of the FW4EX infrastructure. One-liners are submitted many times and graded from anywhere on the Internet, whereas examinations are graded just once. Examinations and contests are read offline by humans, but one-liners are only graded automatically. One-liners are graded in a couple of seconds; examinations and contests may take up to a minute.
6 RELATED WORK
While there are many mechanised graders on the market, they are often tailored to a single programming language, tightly bound to student records databases or impose a precise workflow. All these characteristics make it difficult to re-use these embedded graders.
The BOSS Online Submission System (http://www.dcs.warwick.ac.uk/boss/) (Joy et al.,
2005) is a course management tool. It allows students
to submit assignments online securely, and contains a
selection of tools to allow staff to mark assignments
online and to manage their modules efficiently. Pla-
giarism detection software is included. Interaction
with BOSS may use an application or a web front end.
BOSS is a complete course-oriented web appli-
cation whereas FW4EX is only a set of protocols to
operate a grading infrastructure. On the other hand, FW4EX achieves better robustness against malicious submissions and better scalability. However, due to the flexible architecture of BOSS, which uses factories to hide implementation choices or binding constraints with
external resources, it would not be difficult to let
BOSS use FW4EX.
CourseMaker (www.coursemaker.co.uk) (Higgins et al., 2005) is a web-based, easy-to-use course creation package commercialised by the Connect company. It is now roughly similar in functionality to Blackboard or Sakai. Like BOSS, it contains a number of tools to check typography, syntax and comments, and to detect plagiarism. CourseMaker is a complete solution that contains course documents, exchanges information with student records databases, and deploys student client applications or web-based forms to collect work. CourseMaker hosts, on its own network, courses along with their exercises as QTI (www.imsglobal.org) files.
There again, FW4EX is different. While Course-
Maker is course-oriented, FW4EX is exercise-centric.
ASAP (Douce et al., 2005b) is a new initiative that
shares a number of goals with FW4EX. ASAP is com-
ponentised with respect to the JISC (Joint Information
Systems Committee) e-learning framework. It sepa-
rates grading from submission and it also uses XML
documents as a way to transmit information between
components. FW4EX improves on ASAP with its
emphasis on protocols, its robustness with the use of
virtual machines and its attention to the whole life-
cycle of exercises.
7 FUTURE WORK
We have recently migrated an instance of the FW4EX
architecture onto two rented dedicated servers. Using virtual hosts, we now offer two a servers, two e servers and, from time to time (for testing purposes), two concurrent graders. We envision offering reasonably free usage of the FW4EX grading infrastructure to cooperating universities.
We plan to develop new exercises for a new lec-
ture on Perl. We also supervise two new student projects (pstl-fw4ex.sourceforge.net): one to extend a previous authoring tool and one to produce a revised version of an FW4EX-client as a plugin for Eclipse.
Finally, we wish to analyse the typical errors for
some exercises we have been proposing for years.
This is possible since we have been gathering thou-
sands of students’ submissions.
8 CONCLUSIONS
In this paper, we present the FW4EX project, an at-
tempt to build a grading infrastructure that can be put
to work into various learning environments. A set of
REST-based protocols allows these learning systems
or IDEs to operate the grading infrastructure. A “stan-
dard” descriptor and a file format specification are
proposed to reduce the deployment of new exercises
to a simple file copy. We finally describe some ex-
periments that illustrate the neutrality and versatility
of the associated infrastructure: one-liners, full exam-
inations, programming contests in multiple program-
ming languages.
We believe that this project provides a strong basis
to foster an eco-system for mechanised grading, thus allowing teachers to focus on the exercises alone and not on the grading infrastructure. Yes, it is more work for teachers to imagine a stem, write (and test) a solution, then design some grading scripts, but, once done, this may last forever.
More information on the FW4EX project is avail-
able on the site paracamplus.org.
REFERENCES
Brygoo, A., Durand, T., Manoury, P., Queinnec, C., and
Soria, M. (2002). Experiment around a training en-
gine. In IFIP WCC 2002 World Computer Congress,
Montréal (Canada). IFIP.
Daly, C. and Waldron, J. (2004). Assessing the assessment
of programming ability. In SIGCSE ’04: Proceedings
of the 35th SIGCSE technical symposium on Com-
puter science education, pages 210–213, New York,
NY, USA. ACM.
Douce, C., Livingstone, D., and Orwell, J. (2005a). Au-
tomatic test-based assessment of programming: A re-
view. J. Educ. Resour. Comput., 5(3):4.
Douce, C., Livingstone, D., Orwell, J., Grindle, S., and
Cobb, J. (2005b). A technical perspective on asap -
automated system for assessment of programming. In
Proceedings of the 9th CAA Conference, Loughbor-
ough University.
Ellsworth, C. C., James B. Fenwick, J., and Kurtz, B. L.
(2004). The quiver system. In SIGCSE ’04: Pro-
ceedings of the 35th SIGCSE technical symposium
on Computer science education, pages 205–209, New
York, NY, USA. ACM.
Fielding, R. T. and Taylor, R. N. (2002). Principled design
of the modern web architecture. ACM Trans. Internet Technol., 2(2):115–150.
Findler, R. B., Clements, J., Flanagan, C., Flatt, M., Krish-
namurthi, S., Steckler, P., and Felleisen, M. (2002).
DrScheme: A programming environment for Scheme.
Journal of Functional Programming, 12:369–388.
Fu, X., Peltsverger, B., Qian, K., Tao, L., and Liu, J. (2008).
Apogee: automated project grading and instant feed-
back system for web based computing. In SIGCSE
’08: Proceedings of the 39th SIGCSE technical sym-
posium on Computer science education, pages 77–81,
New York, NY, USA. ACM.
Higgins, C. A., Gray, G., Symeonidis, P., and Tsintsifas,
A. (2005). Automated assessment and experiences
of teaching programming. J. Educ. Resour. Comput.,
5(3):5.
Hill, C., Slator, B. M., and Daniels, L. M. (2005). The
grader in programmingland. In SIGCSE ’05: Pro-
ceedings of the 36th SIGCSE technical symposium
on Computer science education, pages 211–215, New
York, NY, USA. ACM.
Joy, M., Griffiths, N., and Boyatt, R. (2005). The boss on-
line submission and assessment system. J. Educ. Re-
sour. Comput., 5(3):2.
Noonan, R. E. (2006). The back end of a grading system. In
SIGCSE ’06: Proceedings of the 37th SIGCSE techni-
cal symposium on Computer science education, pages
56–60, New York, NY, USA. ACM.
Queinnec, C. and Chailloux, E. (2002). Une expérience de notation en masse. In TICE 2002 Technologies de l'Information et de la Communication dans les Enseignements d'Ingénieurs et dans l'Industrie, Conférences ateliers, pages 403–404, Lyon (France). Institut National des Sciences Appliquées de Lyon. Full version available at http://lip6.fr/Christian.Queinnec/PDF/cfsreport.pdf.
Woit, D. and Mason, D. (2003). Effectiveness of on-
line assessment. In SIGCSE ’03: Proceedings of the
34th SIGCSE technical symposium on Computer sci-
ence education, pages 137–141, New York, NY, USA.
ACM.