AN INFRASTRUCTURE FOR MECHANISED GRADING
Queinnec Christian
LIP6, Université Pierre et Marie Curie, France
Keywords:
Mechanised grading, Grading experimentation.
Abstract:
Mechanised grading is now mature. In this paper, we propose elements for the next step: building a commodity
infrastructure for grading.
We propose an architecture for a grading infrastructure able to be used from various learning environments,
a set of extensible Internet-based protocols to interact with the components of this infrastructure, and a self-
contained file format to thoroughly define an exercise and its entire life-cycle in order to ease deployment. The
infrastructure is designed to be scalable, robust and secure.
First experiments with a partial implementation of this infrastructure show its versatility and its neutrality. It does not impose any IDE and it respects the tenets of the embedding learning environment or course-based system, thus fostering an eco-system around mechanised grading.
1 HISTORY
Around 2000, with the help of some colleagues from
UPMC, we set up two experiments involving mechanised grading. In the first experiment (reported in (Brygoo et al., 2002)), we adjoined to the
DrScheme programming environment (an Integrated
Development Environment (or IDE) devoted to the
Scheme programming language (Findler et al., 2002))
a mechanised grading facility: the student chooses
an exercise, reads the stem, types then debugs his
program and finally hits the “Check” button to ob-
tain a mark and a page explaining the tests that were
performed and led to that mark. This mechanised
grader is a stand-alone plug-in to DrScheme; every
year since 2001, around 700 students have used it
on their home computer (night and day) or during lab
sessions.
In the second experiment (reported in (Queinnec
and Chailloux, 2002)), we organised general pro-
gramming examinations or contests where we had
several hundreds of undergraduate students to grade
in a short time. Centralised mechanised grading was
our sole option. Therefore we wrote unit tests to
check students’ programs considered as black boxes.
Soon after that, we designed a framework and implemented some libraries to ease the production of mechanised graders, that is, specialised programs that grade students' files for a given exercise. Since 2001, we
have been grading several hundred examinations per
year.
Around 2005, our various experiences with different programming languages such as Scheme, C, Ada, PHP, Perl, Java, make or shell convinced us that mechanised grading was mature enough for courses that mainly rely on some programming language(s) for exercises and/or examinations. We proposed
to unify our previous experiments into a generalised
framework and a multi-language grading architecture
to ease the production of graders and to lessen the as-
sociated chores. Our goal was to build
1. an autonomous and scalable grader that may be
embedded within various systems,
2. REST-based (Fielding and Taylor, 2002) proto-
cols to interact with the various components of the
grading infrastructure,
3. a comprehensive file format for self-contained de-
ployable exercises that covers the entire life-cycle
of an exercise.
While building this grader, we added new goals in order to provide a grading infrastructure, that is, grading as a software service (SaaS).
We will present, in the second Section, the ad-
vantages of mechanised grading and some of our
goals. The third and fourth Sections will describe
an overview of the architecture, nicknamed FW4EX
(for “framework for exercises”), and its prominent
features. In the fifth Section, we will summarise the
results of our latest experiments using the FW4EX
architecture. Related and future works conclude this
paper.
2 MECHANISED GRADING
Writing a program is a complex activity requiring
1. to understand a specification,
2. to imagine a solution,
3. to write it down in some programming language
within some development environment (IDE),
4. finally, to test it to make sure it complies with the
specification.
Any problem that arises during this life-cycle forces
the student to re-iterate parts of this cycle. A way to
assess this skill is to let students program with profes-
sional tools and to check programmatically whether
their programs comply with the specification. Pro-
fessionally, this is called “unit testing” and popular
frameworks such as JUnit (www.junit.org) were developed for that
activity. To use professional tools and to obey profes-
sional rules was something we wanted our students to
be exposed to.
There are some differences though with usual unit
testing.
Unit testing is designed to give a binary answer: 1 (pass) or 0 (fail). In a grading context, we want to give a more representative and accurate mark for the student's work, say a mark between 0 and 20 (as usual in French universities). This may be done by multiplying the number of tests and combining their partial results (see the sketch below).
Unit testing frameworks often depend on one programming language. However, some specifications leave the choice of programming language up to the student. Therefore, a grading framework has to cope with multiple languages.
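As an illustration of how partial test results might be combined into a mark out of 20, here is a minimal Python sketch; the test names, weights and aggregation rule are purely illustrative and are not the FW4EX scheme.

def combine_mark(results, max_mark=20):
    """results maps a test name to a (weight, passed) pair."""
    total = sum(weight for weight, _ in results.values())
    earned = sum(weight for weight, passed in results.values() if passed)
    # Scale the earned weight onto the 0..max_mark range.
    return round(max_mark * earned / total, 1)

results = {
    "empty input stream":     (1, True),
    "nominal case":           (3, True),
    "file with a weird name": (2, False),
}
print(combine_mark(results))   # prints 13.3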
Designing mechanised graders is a hot topic, as illustrated by the numerous papers presented in recent years, such as Quiver (Ellsworth et al., 2004), RoboProf (Daly and Waldron, 2004), TorqueMOODa (Hill
et al., 2005), gradem (Noonan, 2006), APOGEE (Fu
et al., 2008). An overview of this topic may be found
in (Douce et al., 2005a). Among well known sys-
tems containing graders are CourseMaker (Higgins
et al., 2005), BOSS (Joy et al., 2005) and now ASAP
(Douce et al., 2005b).
Mechanised graders offer multiple advantages:
consistency: students are graded uniformly (and
anonymously) without the biases due to (multiple)
human grader(s).
fairness: if one student finds a problem in the
specification or the associated tests then all stu-
dents benefit from the upgraded grader.
persistence: once created, assessments are forever available for students, who may practise them whenever they want, under the same conditions as when the assessment was initially offered. We dubbed this “dynamic annals” (Queinnec and Chailloux, 2002).
Many colleagues have negative feelings towards mechanised grading. Unit tests consider programs as opaque black boxes; quite often (but see (Higgins et al., 2005), (Joy et al., 2005)) they do not consider or appreciate style, performance, good algorithmic usage, etc. We think that some aspects of this criticism are due to the youth of the field:
Style checkers (CheckStyle (checkstyle.sourceforge.net), PMD (pmd.sourceforge.net) and FindBugs (findbugs.sourceforge.net) for Java, Perl::Critic (perlcritic.com) for Perl, etc.) exist and are more and more widely used professionally. Incorporating them in unit tests will lessen this criticism. Performance may also be checked given an appropriate environment setup and a non-toy problem size. However these improvements will dramatically increase grading duration.
Since we collect all the answers to our assess-
ments, we expect to be able to build a taxonomy
of common errors and to mechanically recognise
instances of these errors in order to provide appro-
priate answers.
Unit testing may produce very detailed reports of all tests, whether they passed or failed. Analysing these traces is a useful skill for students to acquire (as observed elsewhere, good debuggers are likely to be good programmers, while the converse is less true), provided some familiarity with these tools is organised ahead of assessments.
Therefore a grading infrastructure must be able to grade exercises in various programming languages and in various settings: complete programs (stand-alone, client and/or server), program fragments (function, method, class). A grading infrastructure must be able to be put to work as part of a learning environment or other systems: it must not dictate whether exercises are summative or formative, and it must be as independent as possible from student records databases.
Of course, it must also be simple to use, robust in the face of crashes, and secure (it must respect privacy and anonymity, be piracy-proof and be resistant to malicious students' work). It must ensure the long-term availability of the assessments. Finally, our ideal grading infrastructure also ought to be scalable in order to take advantage of new technologies such as cloud computing (Amazon EC2 for instance) or Internet-based storage (Amazon S3 for instance) if demand for grading should increase.
3 THE FW4EX PROJECT
The FW4EX project aims to specify and build, from
scratch, such a grading infrastructure under three ad-
ditional constraints listed in this Section. These are:
Exercises must be Easily Deployed. Copying
the artefact defining an exercise into an appropri-
ate directory should be sufficient. This requires a
precise format for artefacts defining exercises and
their whole life-cycle.
IDEs should be FW4EX-clients. Students
should program with appropriate tools. Using an
IDE is the current way of programming; therefore IDEs must be able to propose exercises, gather students' files, submit them for grading and display grading reports.
Fortunately, IDEs (Eclipse, NetBeans, Scintilla, Emacs, etc.) often offer a plugin architecture that makes it easy to add a button or a menu item for that task. In order to simplify these plugins, interactions with the FW4EX-related servers only use the HTTP protocol and, more precisely, a REST-based style (Fielding and Taylor, 2002).
Segregate Functionalities into Separate Com-
ponents. FW4EX-related functionalities are com-
ponentised so they may be organised into appro-
priate workflows (including human reviewers for
instance). Moreover, these components should cope with firewalls and university-imposed authentication methods, provide skinning and be scalable. Given that these components are accessed
via HTTP, they are called servers and altogether
form the building blocks for a grading infrastruc-
ture.
These servers gather related functionalities such
as serving exercises (e server(s)), acquiring stu-
dent’s files (a server(s)) or, serving stored grading
reports (s server(s)). A single physical machine
or a cloud of virtual machines may implement
all or part of these logical servers depending on
the topological constraints, the expected load or
throughput, the intended availability and the con-
figuration of the workflow.
We think that these three goals provide a strong
basis for a universal and versatile grading infras-
tructure. This infrastructure takes care of stor-
ing/retrieving exercises, accepting/grading submis-
sions, storing submissions and their associated grading reports, thus relieving teachers of all these chores. FW4EX aims to be a grading infrastructure along the lines of SaaS (Software as a Service).
This infrastructure provides many opportunities for the creation of appropriate eco-systems, such as:
Authoring tools to build artefacts defining exercises.
Teaching tools to design sets (scenarios) of exercises, analyse answers, compute statistics, analyse common errors, etc.
Deploying tools to instantiate networks of servers
to operate within a university (with a single phys-
ical server that serves all aspects of FW4EX to
unauthenticated clients) or within a cloud for a
world-wide company where scalability, latency
and accountability are of paramount importance.
To achieve these goals, we offer an artefact format
for exercises (see Section 4.1), a series of HTTP pro-
tocols (Section 4.4) and some feedback from our experiments (Section 5).
4 THE FW4EX ARCHITECTURE
The FW4EX framework builds on our previous experiments and tries to address a variety of use cases.
In this Section, we present the main lines of the ar-
chitecture, the protocols and the format of exercises
artefacts.
Figure 1 shows a simplified sketch of the logical
architecture of the FW4EX infrastructure. This archi-
tecture assumes the presence of the Internet and heavily
relies on REST-based services (Fielding and Taylor,
2002).
From the student's point of view, after choosing (or being assigned) an exercise, his browser (or any FW4EX-compliant client, see Section 4.4) fetches a zipped file (from an exercises server e) containing the stem and all the necessary files or documents required to practise the exercise. When ready, the student submits his work (to an acquisition server a) where it will be picked up by a grading server gd that will instantiate an appropriate grading slave (m1 or m2) to grade the student's files. Once graded, a final report (an XML file) is made available to the student (on some storage server s) along with the originally submitted work.
Figure 1: Sketch of the logical architecture of the FW4EX infrastructure. Arrows indicate which machine initiates connections.
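The following Python sketch illustrates the student round trip described above, as an FW4EX-compliant client might perform it. The host names, URL paths and form field names are hypothetical (the exact protocol is not reproduced in this paper), and the third-party requests library is an assumption.

import time
import requests

E_SERVER = "http://e.example.org"   # exercises server (hypothetical address)
A_SERVER = "http://a.example.org"   # acquisition server (hypothetical address)

# 1. Fetch the zipped exercise (stem and accompanying files) from an e server.
exercise = requests.get(f"{E_SERVER}/exercises/posix-one-liners-07.tgz")
with open("exercise.tgz", "wb") as f:
    f.write(exercise.content)

# 2. Submit the student's work to an a server; following REST principles,
#    the response carries the URL where the report will later appear.
with open("student-work.tgz", "rb") as work:
    answer = requests.post(f"{A_SERVER}/submissions", files={"job": work})
report_url = answer.headers.get("Location", answer.text.strip())

# 3. Poll the storage server s until the grading report (an XML file) appears.
while True:
    report = requests.get(report_url)
    if report.status_code == 200:
        print(report.text)          # the XML grading report
        break
    time.sleep(10)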
Exercise servers e only serve exercises, i.e., static files. Exercise servers may be implemented with Amazon S3 technology. To increase their availability, e servers do not depend on the central databases. In our experience, e servers are often proxied via a portal (see Figure 1) allowing students to choose exercises among the set prescribed by their teachers. The portal also skins the exercises with the university's web conventions (colours, logos, authentication, etc.).
Acquisition servers a are stand-alone servers that must be as robust as possible since they must be available day and night to accept students' work. To increase their availability, a servers do not depend on the central databases. They may be deployed within firewalled networks and may be polled through encrypted tunnels by the grading machinery, thus ensuring robust security.
Storage servers s only serve reports, i.e., static anonymous XML files. These servers do not need to access the central databases, which increases their availability: students may then get feedback whenever they want, night and day. To preserve anonymity, reports may only be fetched via non-guessable MD5-based URLs. Storage servers may be implemented with Amazon S3 technology. In our experience, reports from storage servers are often accessed via portals that skin the reports with the university's web conventions.
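The exact URL scheme is not detailed in this paper; the following Python sketch shows one plausible construction in which the path is derived from an MD5 digest of the job identifier and a server-side secret, so that a report URL cannot be guessed from a student's identity. The secret and job identifier are assumptions.

import hashlib

SECRET_SALT = b"server-side secret, never disclosed"    # assumption

def report_url(job_id, s_server="http://s.example.org"):
    digest = hashlib.md5(SECRET_SALT + job_id.encode("utf-8")).hexdigest()
    return f"{s_server}/reports/{digest}.xml"

# Prints a URL whose file name is a 32-hex-digit, non-guessable token.
print(report_url("2009-exam1-job042"))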
Teachers prescribe sets of exercises to students
and may access their resulting reports. Teachers may
also collect students' work (jobs) by any convenient means and submit batches of jobs for grading. Teach-
ers have (authenticated) access to grades and names
through the t server.
The author of an exercise (shortened as author in
the rest of this paper) also has access to the anony-
mous reports produced by his exercise in order to im-
prove it.
Information confinement is strict: grading slaves
grade anonymous files, storage servers store anony-
mous files. Among all servers, only grading servers
gd and t (see below) are aware of students’ identities.
Servers are customised for one task only. This of-
fers numerous advantages: redundancy for fault tolerance, concurrency to increase throughput,
versatility where a new version of a component may
be plugged into the architecture and compared (in
speed, resource consumption, accuracy) with its pre-
vious version without disrupting the entire infrastruc-
ture.
For security reasons, the grading machinery is
(quite often) not publicly nor directly reachable. The
grading slaves are virtual machines run with QEMU,
VMware, VirtualBox, etc. as explained in Section 4.2.
4.1 Exercise Format
An exercise is a tar gzipped file (not dissimilar to a
Java jar file or an OpenDocument file) containing
an XML descriptor and all the files required to op-
erate the exercise. This central descriptor rules the
various aspects of the entire life-cycle of the exercise,
successively: autocheck, publish, download, install,
grade. These phases are briefly explained hereafter in order of increasing complexity.
Download. The descriptor specifies the stem and the
files required by the student in order to practise the
exercise. The descriptor also describes, for each question, the files or directories expected from the student, and may contain additional hints such as a
template to fill for an answer or the size of the ex-
pected work. In the case of an exercise with a sin-
gle question expecting a single file, an FW4EX-
compliant client (see Section 4.4) may use these
hints to display an appropriate widget to capture
interactively an answer on one or multiple lines.
Grade. The descriptor defines which scripts (writ-
ten in whatever scripting language) are used to
grade students' files and how these scripts may be combined. It also specifies which kind of grading slave is required, that is, which OS (Debian, Windows, etc.), which CPU, etc. The descriptor also
mentions CPU duration limits, output limits, etc.
Install. The descriptor defines how an exercise is in-
stalled on a grading slave (usually a virtual ma-
chine) or on a student's computer. This installation may require uncompressing data files, pre-compiling libraries, etc. The installation on a grading slave is different from the installation on the student's computer since grading scripts are neither disclosed to the student nor communicated to the student's computer.
Autocheck. The descriptor defines a number of
“pseudo jobs” paired with the expected grade they
should get. This makes it possible to check whether the grading slave contains all the software required to grade a specific exercise, and to ensure non-regression when the author of an exercise evolves it.
This autocheck phase also allows FW4EX developers to evolve the infrastructure while maintaining similar results for the registered exercises.
Most often it is advisable to have at least three pseudo jobs: a null one that contains nothing and expects a 0/20 grade, a perfect one that should be graded 20/20, and an intermediate one that checks that in-between grades are indeed possible.
Publish. When installed and autochecked, an exer-
cise is ready to grade real students' work. It is
then publicly available.
The descriptor also defines some meta-data to
characterise the exercise: name, requirements,
summary, tags, etc. These are needed by e servers
to inform students and teachers about the exer-
cises that are ready to be selected.
The XML descriptor is the file that ties together all the information required to operate the exercise through all the steps of its life-cycle. The exercise file is autonomous and, similarly to a WebApp, it may be deployed or migrated from one application server to another.
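The exact XML vocabulary of the descriptor is not reproduced in this paper. The following Python sketch therefore invents element and attribute names (stem, question, grade, install, autocheck, etc.) only to illustrate how a self-contained descriptor covering the life-cycle phases might be read by a tool.

import xml.etree.ElementTree as ET

# Hypothetical descriptor: element and attribute names are invented for
# illustration and do not reflect the actual FW4EX vocabulary.
DESCRIPTOR = """
<exercise name="posix-sift" summary="Sift a stream with sed" tags="shell sed">
  <stem file="stem.xml"/>
  <question name="q1" expects="answer.sh" hint="one line"/>
  <grade os="debian" cpu-limit="60s" output-limit="64k">
    <script file="grade-q1.sh" on-error="continue"/>
  </grade>
  <install slave="make prepare-slave" student="make prepare-student"/>
  <autocheck>
    <pseudo-job dir="jobs/null"    expected="0"/>
    <pseudo-job dir="jobs/perfect" expected="20"/>
    <pseudo-job dir="jobs/partial" expected="11"/>
  </autocheck>
</exercise>
"""

root = ET.fromstring(DESCRIPTOR)
print(root.get("name"), "expects", root.find("question").get("expects"))
for job in root.iter("pseudo-job"):
    print("autocheck:", job.get("dir"), "should be graded", job.get("expected"), "/ 20")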
Stems, reports and other information coming out of the grading infrastructure are XML documents; therefore, they can be skinned via XSLT style sheets to fit the university's look and feel. They may also be processed to fill student records databases, grade books or any other tool that needs information about grading. These normalised XML documents isolate the grading infrastructure from local idiosyncrasies.
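As a sketch of this skinning step (the lxml dependency and the file names are assumptions; the paper only states that reports are XML and therefore skinnable via XSLT):

from lxml import etree

skin = etree.XSLT(etree.parse("university-skin.xsl"))   # the portal's style sheet
report = etree.parse("report.xml")                      # anonymous grading report
html = skin(report)                                     # same content, local look and feel
print(str(html))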
4.2 Confinement
Our experience shows that students make mistakes (infinite loops, deadlocks, etc.). Their programs need to be confined in time and CPU consumption. Their output must also be limited: no more than a given number of bytes written to files or to output streams. Though rare, some students' programs are malicious and must be tightly restricted (are they allowed to send mail, to open sockets, to start background processes, to destroy resources in order to prevent other students' files from being graded, etc.?). For all these reasons, we run students' programs under student accounts (so they do not harm teachers' files) within virtual machines (QEMU or VMware) and pay much attention to offering the same initial conditions for all jobs.
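The actual FW4EX grading slaves are virtual machines; as a much lighter illustration of the same confinement idea (CPU and output limits plus a wall-clock timeout), the following Python sketch uses POSIX resource limits. The limits and program name are arbitrary.

import resource
import subprocess

def limit_resources():
    resource.setrlimit(resource.RLIMIT_CPU, (10, 10))             # at most 10 s of CPU
    resource.setrlimit(resource.RLIMIT_FSIZE, (64 * 1024,) * 2)   # at most 64 KiB written

try:
    completed = subprocess.run(
        ["./student-program.sh"],
        preexec_fn=limit_resources,   # applied in the child process, before exec
        capture_output=True,
        timeout=30,                   # wall-clock limit (infinite loops, deadlocks)
    )
    print("exit status:", completed.returncode)
except subprocess.TimeoutExpired:
    print("job killed: wall-clock limit exceeded")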
Using virtual machines also guarantees the ability to reproduce, forever, the same conditions under which an exercise was designed to be graded. We previously had bad experiences with system updates breaking configurations or slightly altering runtime libraries. Virtual machines remove that problem entirely and offer eternity to exercises. Virtual machines also make it possible to offer various operating systems (Debian, Windows, etc.), CPUs or networked machines for client-server programs (for instance in PHP).
Our experience also shows that authors of exercises make mistakes. Their programs need to be similarly confined, otherwise they may bog down servers, destroy common resources and harm the students' experience. Grading an exercise is therefore a complex task where the student's errors should appear in the student's report, the author's errors should appear in the author's report, and FW4EX errors should be routed to FW4EX maintainers.
4.3 Grading Scripts
Scripts are regular programs written in whatever language fits the task. They are run within the directory where the student's files are deployed; they may read or run files owned by the author (stored elsewhere) as well as read or run FW4EX libraries (stored elsewhere).
Grading may be performed by comparison (with the teacher's solution) or by verification (checking whether some property holds). Grading by comparison requires the author of the exercise to write his own solution but makes it possible to randomly generate input data.
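The following Python sketch illustrates grading by comparison on randomly generated inputs: the student's program and the author's solution are run on the same input and their outputs compared. The program names and the number of trials are illustrative only.

import random
import subprocess

def run(program, stdin_text):
    out = subprocess.run([program], input=stdin_text, text=True,
                         capture_output=True, timeout=10)
    return out.stdout

TRIALS = 10
passed = 0
for _ in range(TRIALS):
    data = "\n".join(str(random.randint(0, 999)) for _ in range(20)) + "\n"
    if run("./student-answer.sh", data) == run("./author-solution.sh", data):
        passed += 1

print(passed, "/", TRIALS, "random inputs matched the author's solution")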
The exercise descriptor also specifies how to combine the various scripts, that is, stop at the first error, ignore some errors, etc.
Support libraries help write very concise grading scripts for specific languages (for instance in Java,
we refine the JUnit framework to count the number of
successful/failed assertions in addition to the number
of successful/failed tests) or for specific situations (for
instance in bash, input data may be given as command
options, input streams, data file names, etc.).
4.4 FW4EX-compliant Clients
We strongly believe that, apart from introductory Computer Science courses, programming should be done with professional tools such as a usual programming environment (IDE). Browsers and applets cannot be considered decent substitutes for a real IDE.
An FW4EX-compliant IDE (often an IDE plug-in) is
able to speak to the three types of public servers (e,
a and s) according to some REST protocols (Fielding
and Taylor, 2002).
These protocols allow a student to authenticate, to obtain a stem, to submit work and, finally, to get a
report. When submitting a job and following REST
principles, the client receives the URL where the re-
port will pop up on some s server.
From an author’s point of view, it is similarly pos-
sible to submit an exercise, get the corresponding au-
tocheck report leading to the grading reports of the
pseudo-jobs. Additional services exist for FW4EX
maintainers to manage jobs, exercises, configuration.
Presently we have three FW4EX-compliant clients. The first is an AJAX FW4EX-compliant client running in any regular browser. It fulfils all of the needs except the installation of the exercise on the student's computer, which cannot be fully automated for security reasons. It manages the display to accommodate various types of exercises (see experiments in
Section 5): one-liners are captured via a one-line text
widget, short scripts are captured via a textarea wid-
get, multiple files are captured via a file upload wid-
get.
Since it mostly handles XML documents, the
AJAX FW4EX-compliant client imposes its own
XSLT style sheets to display them with the desired look and feel. Additional treatments may also be per-
formed such as using a tree widget to iconise series
of lengthy tests in the grading report. The report is
therefore largely independent of the display technol-
ogy.
The AJAX FW4EX-compliant client may be em-
bedded in a university portal.
Two other clients were developed as proofs of concept: one for Eclipse (for Java, C, PHP, etc.) and one for SciTE (www.scintilla.org). In these IDEs, similarly to what we
did for DrScheme (Brygoo et al., 2002), a single but-
ton or menu will manage authentication, show stem,
install accompanying files, pack student’s work, sub-
mit it and display the associated report, if available,
as annotations to source code.
5 EXPERIMENTS WITH FW4EX
We deployed the FW4EX infrastructure in September
2008. Two experiments were conducted during the two teaching semesters; a third one took place at the end of January 2009 with quite a different modality.
5.1 One-liners
For a course on POSIX skills (mainly shell and Makefile), we deployed approximately 40 exercises. These
exercises are “one liners” since they contain a sin-
gle question that asks for options (for utilities such as
tr, sort, sed, etc.) or for whole commands for var-
ious tasks (sifting, sorting, regexp-ing, etc.). Stems
are terse and imprecise (but always give an example
of use) in order to let students think about limit cases
such as empty streams, empty files, files out of the
current directory, files with weird names, etc. The
goal of these formative exercises was to encourage
students to read the associated man pages in order to
become familiar with the 2 to 4 most useful options of each of these utilities.
These exercises are permanently available for all
students; students are not limited in the number of submissions. Reports are very detailed: every test is ex-
plained, results are shown and reasons for success or
failure are verbalised. Students have to analyse these
reports and sometimes discover that some limit cases
are indeed possible with respect to the stem. In the
current deployment, grading reports are obtained after 20 seconds. Such a duration compels students to consider that the grader is not a fast debugging tool, so debugging should be done on their own computer before asking to be graded.
We offer roughly 5 exercises per week for 7
weeks. We start from exercises asking for options
and end with exercises asking for small shell scripts
(less than 10 lines). With the help of the support libraries,
grading scripts were reduced to a line or two.
Of the 120 students enrolled in the course, half of them tried at least one exercise. The infrastructure has been serving (night, day and holidays) around 3500 submissions per semester.
The exercises were not prescribed; only volunteering students tried them. Some tried them from self-service lab rooms, others from home with their own computer. On average, we observed 2 to 3 attempts per exercise and per student. As already reported (Woit and Mason, 2003), students did not volunteer easily. Apart from the initial curiosity of the first week, we only noted the usual peak during the week that precedes examinations.
Students are not directly exposed to FW4EX, they
only use it through the Web pages of the lecture they
follow. They appreciate the availability (on Sundays and even during the Christmas holidays). Of course, the more
trained the students are, the better results they get at
the examination.
5.2 Examinations
In the same course, we use the FW4EX infrastructure
to grade the mid-term examination and the two final
examinations. The examinations contain three sum-
mative exercises each containing 1 to 6 rather inde-
pendent questions. Contrary to the one-liners, examinations are very carefully specified. Stems are pre-
cise, they ask for scripts (or Makefile rules) perform-
ing various tasks. Examples are always given, deliv-
erables are specified and some hints are given about
how grading will be performed.
These examinations last 3 hours and take place in lab rooms. As usual, the mid-term examination mainly serves to prepare the students for the final examinations. The students' work is sent to the grader after the examination; therefore students do not have any feedback during the examination.
In this setting, we had 120 sets of students’ files
to grade. The exercise file (containing the examina-
tion) was prepared with 6 pseudo jobs. We graded the whole set of students' files 4 times in order to tune the
grader: we had to cope with misnamed scripts or files,
to fix an encoding problem (UTF8 versus Latin1) and
to slightly alter the weight of two questions. Approx-
imately 60 seconds were necessary to grade one stu-
dent.
The design and test of these examinations (writ-
ing the stem, writing a solution, writing the grad-
ing scripts and testing them over pseudo jobs) took
around two days of work.
Five days were necessary, after the examinations, before the reports and grades were made available to the students (a student may only access his own report). Of these five days, three were allotted to the teaching assistants to check whether the grading results agreed with the behaviour of the students during lab sessions. Teachers also use the grading reports to prepare students for the second-chance examination.
The mid-term examination was also made available as a regular (although huge) exercise in itself. Students may then retry this examination in a less stressful context (at home) but, at the same time, be graded by the exact grader used at the examination. This is what we call “dynamic annals”. Only 6 students retried the examination; 1 of them got a grade greater than 19/20 in 4 attempts, the others topped out at 6/20 with 1 to 6 attempts before giving up.
The examination was held on the computers of the
laboratories where no plagiarism was possible. Pla-
giarism detection will be considered as an option for
batch grading.
5.3 Programming Contest
The “Journée Francilienne de Programmation” is a programming contest for undergraduate students from various Parisian universities. A dozen programming languages may indeed be used to solve the proposed problem. The contest lasts 5 hours, during which 12 teams of volunteering students regularly submit their work (120 uploads) and get their grade (the average grading time was 45 seconds) in order to see how they perform with respect to the other teams. Besides individual grading, the portal runs additional scripts using the reports from FW4EX to deliver bonus points to teams' work (mainly based on the speed of programs).
The next edition of this contest will use the same
grading infrastructure.
5.4 Summary of Experiments
These three experiments show the versatility of the FW4EX infrastructure. One-liners are submitted many times and graded from anywhere on the Internet, whereas examinations are graded just once. Examinations and contests are read offline by humans, but one-liners are only graded automatically. One-liners are graded in a couple of seconds; examinations and contests may take up to a minute.
6 RELATED WORK
While there are many mechanised graders on the market, they are often tailored to a single programming language, tightly bound to student records databases or impose a precise workflow. All these characteristics make it difficult to re-use these embedded graders.
The BOSS Online Submission System (http://www.dcs.warwick.ac.uk/boss/) (Joy et al.,
2005) is a course management tool. It allows students
to submit assignments online securely, and contains a
selection of tools to allow staff to mark assignments
online and to manage their modules efficiently. Pla-
giarism detection software is included. Interaction
with BOSS may use an application or a web front end.
BOSS is a complete course-oriented web appli-
cation whereas FW4EX is only a set of protocols to
operate a grading infrastructure. On the other hand, FW4EX achieves better robustness against malicious submissions and better scalability. However, due to the flexible architecture of BOSS, which uses factories to hide implementation choices or binding constraints with
external resources, it would not be difficult to let
BOSS use FW4EX.
CourseMaker (www.coursemaker.co.uk) (Higgins et al., 2005) is a web-based, easy-to-use course creation package commercialised by the Connect company. It is now roughly similar in functionality to Blackboard or Sakai. Like BOSS, it contains a number of tools to check typography, syntax and comments, and to detect plagiarism. CourseMaker is a complete solution that contains course documents, exchanges information with student records databases, and deploys student client applications or web-based forms to collect work. CourseMaker hosts, on its own network, courses along with their exercises as QTI (www.imsglobal.org) files.
There again, FW4EX is different. While Course-
Maker is course-oriented, FW4EX is exercise-centric.
ASAP (Douce et al., 2005b) is a new initiative that
shares a number of goals with FW4EX. ASAP is com-
ponentised with respect to the JISC (Joint Information
Systems Committee) e-learning framework. It sepa-
rates grading from submission and it also uses XML
documents as a way to transmit information between
components. FW4EX improves on ASAP with its
emphasis on protocols, its robustness with the use of
virtual machines and its attention to the whole life-
cycle of exercises.
7 FUTURE WORK
We have recently migrated an instance of the FW4EX
architecture onto two rented dedicated servers. Using virtual hosts, we now offer two a servers, two e servers and, from time to time (for testing purposes), two concurrent graders. We envision offering reasonably free usage of the FW4EX grading infrastructure to cooperating universities.
We plan to develop new exercises for a new lec-
ture on Perl. We also supervise two new student projects (pstl-fw4ex.sourceforge.net): one to extend a previous authoring tool and one to produce a revised version of an FW4EX-client as a plugin for Eclipse.
Finally, we wish to analyse the typical errors for
some exercises we have been proposing for years.
This is possible since we have been gathering thou-
sands of students’ submissions.
8 CONCLUSIONS
In this paper, we present the FW4EX project, an at-
tempt to build a grading infrastructure that can be put
to work into various learning environments. A set of
REST-based protocols allows these learning systems
or IDEs to operate the grading infrastructure. A “stan-
dard” descriptor and a file format specification are
proposed to reduce the deployment of new exercises
to a simple file copy. We finally describe some ex-
periments that illustrate the neutrality and versatility
of the associated infrastructure: one-liners, full exam-
inations, programming contests in multiple program-
ming languages.
We believe that this project provides a strong basis
to foster an eco-system for mechanised grading, thus allowing teachers to focus on the exercises alone and not on the grading infrastructure. Yes, it is more work for teachers to imagine a stem, write (and test) a solution, then design some grading scripts, but, once done, this may last forever.
More information on the FW4EX project is avail-
able on the site paracamplus.org.
REFERENCES
Brygoo, A., Durand, T., Manoury, P., Queinnec, C., and
Soria, M. (2002). Experiment around a training en-
gine. In IFIP WCC 2002 World Computer Congress,
Montréal (Canada). IFIP.
Daly, C. and Waldron, J. (2004). Assessing the assessment
of programming ability. In SIGCSE ’04: Proceedings
of the 35th SIGCSE technical symposium on Com-
puter science education, pages 210–213, New York,
NY, USA. ACM.
Douce, C., Livingstone, D., and Orwell, J. (2005a). Au-
tomatic test-based assessment of programming: A re-
view. J. Educ. Resour. Comput., 5(3):4.
Douce, C., Livingstone, D., Orwell, J., Grindle, S., and
Cobb, J. (2005b). A technical perspective on asap -
automated system for assessment of programming. In
Proceedings of the 9th CAA Conference, Loughbor-
ough University.
Ellsworth, C. C., James B. Fenwick, J., and Kurtz, B. L.
(2004). The quiver system. In SIGCSE ’04: Pro-
ceedings of the 35th SIGCSE technical symposium
on Computer science education, pages 205–209, New
York, NY, USA. ACM.
Fielding, R. T. and Taylor, R. N. (2002). Principled design
of the modern web architecture. ACM Trans. Internet Technol., 2(2):115–150.
Findler, R. B., Clements, J., Flanagan, C., Flatt, M., Krish-
namurthi, S., Steckler, P., and Felleisen, M. (2002).
DrScheme: A programming environment for Scheme.
Journal of Functional Programming, 12:369–388.
Fu, X., Peltsverger, B., Qian, K., Tao, L., and Liu, J. (2008).
Apogee: automated project grading and instant feed-
back system for web based computing. In SIGCSE
’08: Proceedings of the 39th SIGCSE technical sym-
posium on Computer science education, pages 77–81,
New York, NY, USA. ACM.
Higgins, C. A., Gray, G., Symeonidis, P., and Tsintsifas,
A. (2005). Automated assessment and experiences
of teaching programming. J. Educ. Resour. Comput.,
5(3):5.
Hill, C., Slator, B. M., and Daniels, L. M. (2005). The
grader in programmingland. In SIGCSE ’05: Pro-
ceedings of the 36th SIGCSE technical symposium
on Computer science education, pages 211–215, New
York, NY, USA. ACM.
Joy, M., Griffiths, N., and Boyatt, R. (2005). The boss on-
line submission and assessment system. J. Educ. Re-
sour. Comput., 5(3):2.
Noonan, R. E. (2006). The back end of a grading system. In
SIGCSE ’06: Proceedings of the 37th SIGCSE techni-
cal symposium on Computer science education, pages
56–60, New York, NY, USA. ACM.
Queinnec, C. and Chailloux, E. (2002). Une expérience de notation en masse. In TICE 2002 Technologies de l'Information et de la Communication dans les Enseignements d'Ingénieurs et dans l'Industrie, Conférences ateliers, pages 403–404, Lyon (France). Institut National des Sciences Appliquées de Lyon. Full version available at http://lip6.fr/Christian.Queinnec/PDF/cfsreport.pdf.
Woit, D. and Mason, D. (2003). Effectiveness of on-
line assessment. In SIGCSE ’03: Proceedings of the
34th SIGCSE technical symposium on Computer sci-
ence education, pages 137–141, New York, NY, USA.
ACM.