A SYSTEM FOR AUTOMATIC EVALUATION OF PROGRAMS FOR
CORRECTNESS AND PERFORMANCE
Amit Kumar Mandal, Chittaranjan Mandal
School of Information Technology
IIT Kharagpur, WB 721302, India
Chris Reade
Kingston Business School
Kingston University
Keywords:
Automatic Evaluation, XML Schema, Program Testing, Course Management System.
Abstract:
This paper describes a model and implementation of a system for automatically testing, evaluating, grading
and providing critical feedback on submitted programming assignments. It addresses complete automation of
the evaluation process, with proper attention to monitoring students' progress and to performing a structured,
level-wise analysis of submissions. The tool provides on-line support to both evaluators and students with a
level of granularity, flexibility and consistency that is difficult or impossible to achieve manually.
1 INTRODUCTION
The paper presents both the internal working and ex-
ternal interface of an automated tool that assists the
evaluators/tutors by automatically evaluating, mark-
ing and providing critical feedback for the program-
ming assignments submitted by the students. The tool
helps the evaluator by reducing the manual work by a
considerable amount. As a result the evaluator saves
time that can be applied to other productive work,
such as setting intelligent and useful assignments for the
students and spending more time with them to clear up
misunderstandings and other problems.
The problem of automatic and semi-automatic evaluation has been
highlighted several times in the past, and a considerable amount
of innovative work has been proposed to address it; the key
approaches from the literature are discussed below. Although our
approach is currently limited to evaluating only C programs, the
system has been designed and implemented in such a way that it
takes over almost all of the burden from the shoulders of the
students and the tutors. The tool can be controlled and used over
the Internet: the evaluator and the students can be in any part of
the world and can freely communicate through the system. Hence,
the system supports distance learning, which is a crucial benefit
for any large educational institute or university with extension
centers spread across the world, since the same tutors can provide
services to all the students at any such center.
2 MOTIVATION
The scenario that motivated us to build such a system is the huge
cohort of students at most large educational institutions and
universities across the world. At most large engineering institutes
the undergraduate intake is around 600-800 students. As part of
their curriculum at the institute where this tool was developed,
students attend labs and courses, and in every lab each student has
to submit about 9-12 assignments and take up to three lab tests.
That amounts to nearly 10000 submissions per semester. Even if the
load is distributed among 20 evaluators, each evaluator is required
to test almost 500 assignments. Without automation, the evaluators
would be busy most of the time with testing and grading work, at
the expense of time spent with the students and of time spent
setting useful assignments for them.
3 RELATED WORK
A variety of noteworthy systems have been devel-
oped to address the problem of automatic and semi-
automatic evaluation of programming assignments.
Some of the early systems include TRY (Reek, 1989)
and ASSYST (Jackson and Usher, 1997). Scheme-robo
(Saikkonen et al., 2001) is an automatic assess-
ment system for programming exercises written in the
Scheme programming language; it takes as input a solution
to an exercise and checks the correctness of functions by
comparing return values against the model solution. The
automatic evaluation sys-
tems that are being developed nowadays have a web
interface, so that the system can be accessed univer-
sally through any platform. Such systems include
GAME (Blumenstein et al., 2004), SUBMIT (Pisan
et al., 2003) and the system of Juedes (2003). Other systems,
such as those of Baker et al. (1999), Pisan et al. (2003) and
Juedes (2003), provide mechanisms for giving detailed and rapid
feedback to the student. Systems such as those of Luck and Joy
(1999) and Benford et al. (1993) are integrated with a course
management system in order to manage files and records in a
better way.
Most of the early systems were dedicated to solving a
particular problem. For example, some systems concentrate on
grading and so pay less attention to the quality of testing;
others concentrate on providing formative feedback and pay less
attention to grading. We are concerned with addressing all
aspects of evaluation, whether testing, grading or feedback. The
whole design and implementation of our system was focused on
bringing comfort to the evaluators as well as the students,
without compromising the quality of the testing and grading
being done. Our first aim was to test the programs along all
possible dimensions, i.e. testing on random inputs, testing on
user-defined inputs, testing execution time as well as space
complexity, and performing a sound style assessment of the
programs. Secondly, security was given much attention because
the system was to be used by fresh undergraduate students with
limited programming skills, who might unknowingly cause harm to
the system. Our third focus of attention was grading, which
should not impose a crude classification such as zero or full
marks; efforts have been made to make the grading process
simulate manual grading as closely as possible. Fourthly, the
comments should not be confined to a general list of errors that
any student is likely to commit: all comments should be specific
to a particular assignment. Finally, as has already been
mentioned, the system currently supports only the C language, so
a fifth focus during implementation was to keep the model and
design as general as possible so that the system can easily be
extended to support other popular languages such as C++ and Java.
4 SYSTEM OVERVIEW
4.1 Testing Approach
Black Box testing, Grey Box testing, and White Box
testing are the choices available in the literature for
testing a particular program. In Black Box testing, a
particular submission is treated as a single entity and only
the overall output of the program is tested. In Grey Box
testing, the final output of individual components/functions
is tested. White Box testing allows the structure and
programming logic, as well as the behavior, to be evaluated.
In an
environment where the students are learning to pro-
gram, testing only the conformance of the final output
is not workable because conformance to an overall in-
put/output requirement is particularly hard to achieve.
Evaluations that relied on this would exclude con-
structive evaluation for a majority of students. Grey
Box approaches involve exercising individual func-
tions within the student’s programs and are, unfortu-
nately, language sensitive. Given the pitfalls of the above
two strategies, our decision was to choose the White Box
approach, which is more general than the Black Box approach.
White Box testing involves examining the intermediate results
generated by the functions; therefore, whether the program has
been written in a particular way, following a particular
algorithm and using certain data structures, can be ascertained
with greater confidence.
4.2 High Level System Architecture
Figure 1: High Level System Architecture.
Fig.1 shows the high-level system architecture diagram.
The whole process of automatic evaluation
begins with the evaluator’s submission of the assign-
ment descriptor, model solution of the problem stated
in the assignment descriptor, testing procedure and an
XML file. After the evaluator has finished the submis-
sion for a particular assignment, the system checks
the validity of the XML file against an XML Schema.
If the result of XML validation is negative, the sys-
tem halts any further processing of the file and the
errors are returned to the evaluator. The evaluator has
to correct the errors and then resubmit the files. If the
validation of the XML file against the XML Schema
is positive, then the assignment description becomes
available for students to download or view and subse-
quently submit their solutions.
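The paper does not say which validator performs this schema check; as one hedged illustration, it could be implemented with libxml2's schema API roughly as follows (the file names are placeholders and the error handling is abbreviated, so this is a sketch rather than the system's actual code).

#include <stdio.h>
#include <libxml/xmlschemas.h>

/* Sketch: validate the evaluator's XML file against the system's schema.
 * Returns 0 when the document is valid. */
static int validate_assignment_xml(const char *schema_path, const char *xml_path)
{
    xmlSchemaParserCtxtPtr pctxt = xmlSchemaNewParserCtxt(schema_path);
    xmlSchemaPtr schema = pctxt ? xmlSchemaParse(pctxt) : NULL;
    xmlSchemaValidCtxtPtr vctxt = schema ? xmlSchemaNewValidCtxt(schema) : NULL;
    int ret = -1;

    if (vctxt)
        ret = xmlSchemaValidateFile(vctxt, xml_path, 0);
    if (ret != 0)
        fprintf(stderr, "%s failed schema validation (code %d)\n", xml_path, ret);

    if (vctxt) xmlSchemaFreeValidCtxt(vctxt);
    if (schema) xmlSchemaFree(schema);
    if (pctxt) xmlSchemaFreeParserCtxt(pctxt);
    return ret;
}

When the return value is non-zero, the reported errors would be passed back to the evaluator for correction, as described above.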
These submissions are stored in a database. The
system accepts the submissions only if they are sub-
mitted within the valid dates specified by the evalua-
tor. The next criterion to be met for completion of a
submission is that the proper file names are used, as
specified by the evaluator in the assignment description.
The above operations are performed by the File Verification
module. Once this module has finished, the next module to
come into the picture is the Makefile Prep/Compilation
module, whose task is to prepare a makefile and then use it
to compile the code. This module generates feedback based on
the result of the compilation: if there are any compilation
errors, the system rejects the submission and returns the
compilation errors to the student. Thus a submission is
complete only if the student passes the checks of both the
file verification module and the compilation module. After
the submissions are complete, the next module executed is the
testing module, which performs both dynamic and static
testing. The dy-
namic testing module accepts students’ submissions
as input, generates random numbers using the random
number generator module (not shown in the figure). It
then uses the testing procedures provided by the eval-
uator to test a program on the randomly generated
inputs. Sometimes it is not enough to test the pro-
gram on randomly generated inputs (e.g. where the
test fails for some boundary conditions). By using the
randomly generated inputs we cannot be sure that the
program undergoes specific boundary tests on which
the test may fail.
To overcome this drawback, facilities have been
provided by which the evaluator can specify his/her
own test cases. Evaluators are free to specify the
user-defined inputs either directly in the XML file or,
if the number of inputs is significant, in a separate
text file. This is an optional facility which the evaluator
may or may not use, depending on the requirements
of the assignment. If the results of testing the pro-
gram on random input as well as user defined inputs
are positive, then the next phase of testing examines the
correctness of the program with respect to execution time as
well as space complexity. During this testing, the execution
time and the space usage of the student's program are compared
against those of the model solution submitted by the evaluator:
the execution times of the student's and the evaluator's
programs are determined for a large number of inputs, then
compared and analyzed to determine whether the student's
execution time is acceptable. If the test for space and time
complexity goes well, the next phase is the Static Analysis, or
Style analysis. Static testing involves measuring some of the
characteristics of the program, such as the
average number of characters per line, percentage of
blank lines, percentage of comments included in the
program, total program length, total number of condi-
tional statements, total number of goto, continue and
break statements, total number of looping statements,
etc. All these characteristics are measured and com-
pared with the model values specified by the evaluator
in the XML file. The Testing and Grading modules work hand in
hand, as testing and grading are done simultaneously. For
example, once testing a program on random inputs and on
user-defined inputs is over, the grading for that part is done
immediately, without waiting until all testing is complete. If
at any point during testing a program is rejected or aborted,
the grading done up to that point is discarded (but the
feedback is not).
Security is a non-trivial concern in our system
because automatic evaluation, almost invariably, in-
volves executing potentially unsafe programs. This
system is designed under the assumption that pro-
grams may be unsafe and executes programs within
a ’sand-box’ using regular system restrictions.
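The static (style) analysis described above amounts to simple line- and token-level measurements. The following is a rough sketch of such measurements; the system's actual static-analysis module is not shown in the paper, so the structure and function names here are illustrative only, and comment detection is deliberately simplified to counting lines that contain a comment marker.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Illustrative style metrics only; not the system's real module. */
struct style_metrics {
    int lines, blank_lines, comment_lines;
    long chars;
    int loops, conditionals, jumps;  /* for/while/do, if/switch, goto/break/continue */
};

static int is_blank(const char *s)
{
    while (*s) {
        if (!isspace((unsigned char)*s))
            return 0;
        s++;
    }
    return 1;
}

static void measure(FILE *fp, struct style_metrics *m)
{
    char line[1024], buf[1024];
    memset(m, 0, sizeof *m);
    while (fgets(line, sizeof line, fp)) {
        m->lines++;
        m->chars += (long)strlen(line);
        if (is_blank(line))
            m->blank_lines++;
        if (strstr(line, "//") || strstr(line, "/*"))
            m->comment_lines++;
        /* naive keyword count: split a copy of the line into tokens */
        strcpy(buf, line);
        for (char *tok = strtok(buf, " \t\n(){};,"); tok; tok = strtok(NULL, " \t\n(){};,")) {
            if (!strcmp(tok, "for") || !strcmp(tok, "while") || !strcmp(tok, "do"))
                m->loops++;
            else if (!strcmp(tok, "if") || !strcmp(tok, "switch"))
                m->conditionals++;
            else if (!strcmp(tok, "goto") || !strcmp(tok, "break") || !strcmp(tok, "continue"))
                m->jumps++;
        }
    }
}

From such counts, quantities like the average number of characters per line or the percentage of blank lines follow by simple division and can then be compared with the model values given in the XML file.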
5 AN EXAMPLE
Let us start with an example that is very common in a data
structures course: the Mergesort program. As the regular
structure of a C program con-
sists of a main function and a number of other func-
tions performing different activities, we have decided
to break down the Mergesort Program into two func-
tions (1) Mergesort function - This function performs
the sorting by recursively calling itself and the merge
function. (2) Merge function - This function will ac-
cept two sorted arrays as input and then merge them
into a third array which is also sorted. As a White
Box testing strategy is followed, testing and award-
ing marks on correct output generated by the program
will be of no use because this will not distinguish be-
tween different methods of sorting (Quick sort, Selec-
tion sort, Heap sort, Insertion sort etc.).
5.1 Our Approach
Our approach is to test the intermediate results that a
function generates, instead of the function's final output
or the program's final output. This section explains
one approach by which the mergesort program can be
tested. Our aim is to test that the student has writ-
ten the mergesort and the merge function correctly.
Our strategy is to check each step in the mergesort
program. Fig.2 illustrates the general working of the
mergesort program on the example array 4 2 23 42 9 45 12 1.
Our approach is to catch the data at each step (i.e. merge #1
to merge #7 of the example).

Figure 2: Working Principle of the Mergesort program.

This
can be done by using an extra two dimensional array
of size (n) * (n-1). The idea is to transfer the array
contents to this two dimensional array at the end of
the merge function.
        0    1    2    3    4    5    6    7
  0     2    4   23   42    9   45   12    1
  1     2    4   23   42    9   45   12    1
  2     2    4   23   42    9   45   12    1
  3     2    4   23   42    9   45   12    1
  4     2    4   23   42    9   45    1   12
  5     2    4   23   42    1    9   12   45
  6     1    2    4    9   12   23   42   45

Figure 3: Two Dimensional Array (row i holds the array contents after merge #(i+1)).
After the execution of the student's program, a two-dimensional
array similar to Fig.3 will have been created; this array can be
compared with the two-dimensional array formed by the model
solution. If both arrays are exactly the same then the student is
awarded marks, but if the arrays differ, the action initiated
depends upon the action specified by the evaluator in the XML
file. As mentioned previously, the array contents need to be
transferred to the two-dimensional array at the end of the merge
function. This operation is performed by the function shown in
Fig.4.
void transfer(int *a, int size, int *b)
{
    int i;
    /* depth is a global counter of completed merge steps */
    for (i = 0; i < size; i++)
        *(b + depth * size + i) = *(a + i);
    depth = depth + 1;
}
Figure 4: Auxiliary Transfer Function.
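To illustrate how this trace is used, the sketch below shows a testing-procedure fragment that compares the student's trace against the model's. It assumes, as in Fig.4, that every call to the merge function ends with a call to transfer, filling one row of an (n-1) x n trace per merge step; the function name compare_traces and the flat row-major layout are illustrative choices, not prescribed by the system.

/* Compare the student's merge trace against the model solution's trace.
 * Each of the n-1 rows holds the full array contents recorded by
 * transfer() after one merge step (cf. Fig.3).  Returns 1 when every
 * intermediate step matches, 0 otherwise. */
int compare_traces(const int *student, const int *model, int n)
{
    int row, col;
    for (row = 0; row < n - 1; row++)
        for (col = 0; col < n; col++)
            if (student[row * n + col] != model[row * n + col])
                return 0;   /* first mismatch occurs at merge step row+1 */
    return 1;
}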
5.2 Assignment Descriptor for the
Mergesort Program
The assignment descriptor is one of the crucial sub-
missions necessary for the successful completion of
the whole auto-evaluation process. If the assignment
descriptor is not set up correctly, it will be difficult for
the student to understand the problem statement properly and
most students will end up with a faulty submission. We have
considered some of the general-purpose properties that an
assignment descriptor should have: (1) simple, easy-to-understand
language should be used, so that the student can readily
understand the problem statement; (2) prototypes of the
functions to be submitted by the student must be specified
clearly in the descriptor; (3) the necessary makefile and the auxiliary
functions (for example, the transfer function shown in
Fig.4) should be supplied as a part of the assignment
descriptor to make it easier for the student to reach the
submission stage.
5.3 Evaluator’s Interaction
The Automatic Evaluation process cannot be accom-
plished without the valuable support of the evaluator.
The evaluator needs to communicate a large number of inputs to
the system. Since the number of inputs is large, providing a web
form to accept the values is not a good idea, because it is
cumbersome both for the evaluator to enter the values and for the
system to accept and manage the data properly. Our idea is that
the evaluator provides the inputs in a single XML file, in which
predefined XML tags and attributes are used to specify the inputs.
The XML is used to specify the following information: (1) the
names of the files to be submitted by the students; (2) how to
generate the test cases; (3) how to generate the makefile; (4) the
marks distribution for each function to be tested; and (5) the
inputs necessary to carry out static analysis of the program.
Along with the XML file, the evaluator is also re-
quired to submit the model solution for the prob-
lem, Testing Procedure and the Assignment descrip-
tor. The XML file is crucial because it controls the
working of the system and if the XML file is wrong
then the whole automatic evaluation process becomes
unstable. Therefore it is necessary to ensure that the XML file
is correct, which can be done, to some extent, by validating the
XML file against the XML Schema. (The purpose of the XML Schema
is to define the legal building blocks of an XML document: the
elements that may appear, the attributes they may carry, the
data types of elements and attributes, and so on.) Fig.5 shows
only a part of the main XML file that is submitted. This portion
of the XML is used by the module that checks for proper file
submissions.
Fig.6 shows the part of the XML schema that is used
to validate the part of the XML file as shown in Fig.5.
<source_files>
  <file name="main_mergesort.c">
    <text>File containing the main function</text>
  </file>
  <file name="merge.c">
    <text>File containing the merge function</text>
  </file>
  <file name="mergesort.c">
    <text>File containing the mergesort function</text>
  </file>
</source_files>
Figure 5: XML Spec. for checking file submission.
<xs:element name="source_files">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="file" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="text" minOccurs="0">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:minLength value="0"/>
                  <xs:maxLength value="75"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="name" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Figure 6: XML Schema validating XML in Fig.5.
5.4 Input Generation
The Automatic Program Evaluation System is a so-
phisticated tool, which evaluates the program using
the following criteria: (1) correctness on dynamically generated
random inputs; (2) correctness on user-defined inputs; (3)
correctness with respect to time as well as space complexity; and
(4) style assessment of the programs (e.g. number of looping
statements, number of conditional statements).
Initially the programs are tested on randomly gen-
erated inputs. For effective testing we cannot rely on
fixed data because they are vulnerable to replay at-
tacks. Evaluators have the option to write their own routines
in the testing procedures to generate inputs or, alternatively,
the system provides some assistance in the generation of inputs.
Currently the supported options are as follows:
random integers:
(array / single)
(un-sorted/sorted ascending/sorted
descending)
(positive / negative / mixed)
random floats:
(array / single)
(un-sorted/sorted ascending/sorted
descending)
(positive / negative / mixed)
strings:
(array / single)
(fixed length/variable length)
The evaluator expresses in the XML file his/her choice of the
random inputs on which the programming assignments are to be
tested. Fig.7 shows an example of the XML specification used to
generate input for the mergesort program.
<testing>
  <generation iterations="100">
    <input inputvar="a" vartype="array" type="float"
           range="positive" min="10" max="5000"
           sequence="ascend">
      <arraysize>50</arraysize>
    </input>
  </generation>
  <user_specified source="included">
    <input values="6,45,67,32,69,2,4"></input>
    <input values="5,89,39,95,79,7"></input>
  </user_specified>
</testing>
Figure 7: XML Specification for Input Generation.
The evaluator/tutor specifies the inputs in the testing element.
The generation element is used to specify the random inputs to be
generated; its single attribute, iterations, specifies the number
of times the program is to be tested on random inputs. Any number
of input elements can be placed between the opening and closing
generation tags.
The input element offers a lot of attributes to spec-
ify the type of random number to be generated. The
vartype attribute of an input element is used to spec-
ify whether the randomly generated value should be
an array or single value. The min and max attributes
can be used to generate the inputs within a range; if these
attributes are not mentioned explicitly in the XML, the system
sets them to default values. The type attribute can be set to one
of the three values integer, float or string. Another important
attribute of the input element is the range attribute, which
specifies whether positive, negative or mixed (a mixture of
positive and negative) values are to be generated. The inputvar
is the attribute
that is used to name the variable to be generated. For
example in Fig.7, the array which is generated ran-
domly will be stored in a. Sometimes the evaluator may choose to
generate the values in some order (ascending or descending); this
is handled by the sequence attribute, which can be set to either
of the two values (ascend or descend). Every input element
specified between the tags of the user_specified element is used
to supply user-defined inputs to the program. The user_specified
element has only one attribute, named source, which can take one
of the two values included or file. If the attribute is set to
included, the system looks for the user-defined inputs in the XML
itself; if it is set to file, the system looks for the
user-defined inputs in a text file. Like the generation element,
the user_specified element can contain any number of input
elements, but here each input element has only one attribute,
named values. All the user-defined inputs should be specified in
the same order as is followed for the generation of random inputs.
In Fig.7 an array (vartype="array") of size 50
(<arraysize>50</arraysize>), containing ascending
(sequence="ascend") positive (range="positive") float
(type="float") values in the range 10 (min="10") to 5000
(max="5000"), is generated and supplied as input to the mergesort
program. The above procedure is iterated 100 (iterations="100")
times. Fig.7 also shows that two user-defined inputs have been
provided by the evaluator; these are supplied as-is to the testing
procedures.
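As an illustration of what such a specification implies, the sketch below generates the Fig.7 test case in C: an array of 50 positive floats in [10, 5000], sorted ascending, regenerated for each of the 100 iterations. The real random number generator module is not shown in the paper; the use of rand() and qsort() here is an assumption of this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Compare floats in ascending order, for sequence="ascend". */
static int cmp_float_asc(const void *p, const void *q)
{
    float a = *(const float *)p, b = *(const float *)q;
    return (a > b) - (a < b);
}

/* Generate one test case matching Fig.7: arraysize positive floats
 * in [min, max], sorted ascending. */
static void generate_input(float *a, int arraysize, float min, float max)
{
    int i;
    for (i = 0; i < arraysize; i++)
        a[i] = min + ((float)rand() / (float)RAND_MAX) * (max - min);
    qsort(a, (size_t)arraysize, sizeof a[0], cmp_float_asc);
}

int main(void)
{
    float a[50];
    int iter;
    srand((unsigned)time(NULL));
    for (iter = 0; iter < 100; iter++) {        /* iterations="100" */
        generate_input(a, 50, 10.0f, 5000.0f);
        /* ...pass a[] to the testing procedure and the student program... */
    }
    return 0;
}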
5.5 Grading the Programs
The grading process is made as flexible and granular
as possible. During testing of a program on user de-
fined inputs, suppose the program is unsuccessful in
satisfying the output of the model solution. At this
moment the XML is parsed to determine the choice
mentioned by the evaluator. The evaluator may have
chosen to abort the test and award zero marks to the
student or, alternatively, the evaluator/tutor may have
chosen to move forward and test the program on other
criteria. If the evaluator had decided to test the pro-
gram on other criteria, then the tutor has the flexibility
to either award full marks, zero marks or a fraction of
the full marks meant for testing on random inputs and
user defined inputs. Fig.8 shows the example XML
for grading the Mergesort program.
<status marks="50" abort_on_fail="true">
  <item value="0" factor="1" fail="false"><text>Fine</text></item>
  <item value="1" factor="0.6" fail="false"><text>Improper</text></item>
  <item value="2" factor="0" fail="true"><text>Wrong</text></item>
</status>
Figure 8: XML Specification for Grading.
Based on the result of the execution, some value is
awarded to the program. For example, if the result of
execution of the mergesort program is correct for both
the user defined inputs as well as random inputs then
the value awarded to the program is 0. If the result
is incorrect for user defined inputs or random inputs,
then the value awarded is 1. If the result of execution
is incorrect for both the tests, then the value awarded
is 2. Based on the value awarded, marking is done: if the value
is 0, full marks are awarded for the test (factor="1"); if the
value is 1, only 60% of the marks are awarded (factor="0.6"); and
if the value is 2, no marks are awarded, the program is given
fail status (fail="true") and the program execution is aborted
(abort_on_fail="true").
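The marking rule of Fig.8 therefore reduces to a table lookup: the test outcome yields a status value, the value selects a factor, and the marks awarded are that factor times the full marks, with execution aborted when the matching item is flagged as a failure and abort_on_fail is set. A sketch of this logic follows; the C names are illustrative, not the system's.

#include <stdio.h>

/* Status values as in Fig.8: 0 = both tests correct, 1 = one test
 * incorrect, 2 = both tests incorrect. */
struct grade_item { int value; double factor; int fail; const char *text; };

static const struct grade_item mergesort_items[] = {
    { 0, 1.0, 0, "Fine"     },
    { 1, 0.6, 0, "Improper" },
    { 2, 0.0, 1, "Wrong"    },
};

/* Returns the marks awarded; sets *abort_run when abort_on_fail applies. */
static double grade(int value, double full_marks, int abort_on_fail, int *abort_run)
{
    int i;
    *abort_run = 0;
    for (i = 0; i < 3; i++) {
        if (mergesort_items[i].value == value) {
            if (mergesort_items[i].fail && abort_on_fail)
                *abort_run = 1;
            printf("feedback: %s\n", mergesort_items[i].text);
            return mergesort_items[i].factor * full_marks;
        }
    }
    return 0.0;
}

With the settings of Fig.8, a status value of 1 would thus yield 0.6 x 50 = 30 marks and no abort.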
5.6 Commenting on Programs
The system generates feedback comments, which are
specific to a particular assignment. While testing the programs
on random and user-defined inputs, if the test fails for some
particular random input, the comment contains the specific
information about the random input on which the test failed.
The student can run his/her program on the random
input and find out what is wrong in the program. If
the test fails for user defined inputs, then the system
never returns the user-defined inputs because that may
cause serious security flaws in the system. Instead,
the system returns an error message mentioned by the
tutor in the XML for that particular input.
5.7 Testing the Programs on Time
and Space Complexity
Testing programs on their execution times and space complexity
is important because different algorithms often have different
space and time requirements during execution. For example,
bubble sort is in-place but inefficient; mergesort is efficient
if extra space is used; quicksort is in-place and usually very
efficient; heapsort is in-place and efficient. This section
shows an analysis of two mergesort programs, one submitted by
the student and the other by the evaluator. The programs differ
in the number of variables declared and in the dynamic memory
allocated during execution. Fig.9 shows the execution times of
the two mergesort programs.
Figure 9: Execution time comparison of the model and student Mergesort programs (time in milliseconds against the number of inputs).
As the execution times of the programs are nearly the
same, this suggests that we need to test the programs
further. For a particular input the system executes the
model program N times and determines the accept-
able range of execution time for the student’s pro-
gram. For the same input the system then executes the
student’s program, determines the execution time and
then checks whether it falls within the range deter-
mined after the execution of the evaluator’s mergesort
program. The above procedure will be iterated for a
large number of inputs. After getting results for all
the inputs, the system decides whether the program is
acceptable or not. The same procedure is followed to
test the programs on space usage. Fig.10 shows the
space complexity of the above two programs i.e. the
model mergesort program and the student’s mergesort
program. Fig.10 also shows the space requirement of
another mergesort program where the student has not
used the memory properly.
Figure 10: Space complexity comparison of the Mergesort programs (space in kilobytes against the number of inputs, for the model program, the student program and an unhealthy student program).
Here the space required shows the virtual memory
resident set size which is equal to the stack size plus
data size plus the size of the binary file being exe-
cuted. It is evident from Fig.10 that if the student is
using the memory properly, then the amount of mem-
ory required for the execution of the student’s pro-
gram will not vary drastically from the memory re-
quirement of the Model Solution. When the student
is not using the memory properly, for example when
the student is not freeing the dynamically allocated
memory, then the memory requirements of the pro-
gram will increase drastically as compared to that of
the model solution. Such problematic programs are easily caught
using the same procedure as described above for testing the
programs on execution times.
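The acceptance test described above can be sketched as follows. The paper does not state how a single run is timed or how the acceptable range is widened, so wall-clock timing via clock_gettime and a simple tolerance factor are assumptions here; likewise, the model and student programs are represented as in-process function pointers purely to keep the sketch short, whereas the real system runs them as separate programs.

#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time one run of a sort routine on a private copy of the input, in ms. */
static double time_once(void (*run)(float *, int), const float *input, int n)
{
    struct timespec t0, t1;
    float *work = malloc((size_t)n * sizeof *work);
    memcpy(work, input, (size_t)n * sizeof *work);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run(work, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(work);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Run the model N times on the same input to establish the acceptable
 * range, then accept the student's run if it does not exceed the widened
 * upper bound.  The caller repeats this over many inputs. */
static int time_acceptable(void (*model)(float *, int),
                           void (*student)(float *, int),
                           const float *input, int n, int N, double tolerance)
{
    double worst_model = 0.0, t;
    int i;
    for (i = 0; i < N; i++) {
        t = time_once(model, input, n);
        if (t > worst_model)
            worst_model = t;
    }
    t = time_once(student, input, n);
    return t <= worst_model * tolerance;
}

The space-usage check follows the same pattern, with the resident set size measured in place of the elapsed time.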
6 SECURITY
From the outset of the implementation we have been
concerned with making the system as secure as pos-
sible because we cannot rule out the possibility of a
malicious program, that may intentionally or uninten-
tionally cause damage to the system. Therefore, the
system has been implemented under the assumption
that the programs may be unsafe. The first aspect
that has been taken into consideration is that programs may
intentionally or unintentionally contain calls that delete
arbitrary files. To counter this possibility, a separate login
named 'test' has been created. During testing, the system uses
Secure Shell (ssh) with RSA encryption-based authentication to
execute the necessary commands under this login, and the
necessary files are transferred to it securely using the Secure
Copy Protocol. As the test login does not have access to other
directories, programs executing under this login are not able to
delete files elsewhere.
To protect the files that are present in the test login we use
the immutable file attribute of the Linux second extended file
system. Only the superuser can modify this attribute; ordinary
users cannot. After the attribute is set on a file, the file
cannot be deleted or renamed, no link can be created to it and
no data can be written to it. Only a superuser or a process
possessing the CAP_LINUX_IMMUTABLE capability can set or clear
this attribute.
The next aspect under consideration is the problem
of the presence of infinite loops in the student’s pro-
gram. In order to solve the problem of infinite loops,
the system limits the resources available to the programs. For
the CPU-time limit, both the soft limit and the hard limit are
specified in seconds. The soft limit is the value that the kernel
enforces for the corresponding resource, while the hard limit
acts as a ceiling for the soft limit. Once the soft limit is
exceeded, the kernel sends a SIGXCPU signal to the process; if
the process does not stop executing, SIGXCPU is sent every second
until the hard limit is reached, after which a SIGKILL signal is
sent to the process. The setrlimit system call has been used to
limit both execution time and memory usage; the same call can
also be used to limit several other resources, such as the data
and stack segment sizes and the number of processes per uid.
Care has been taken so that the student is not able to
change the resources available to his/her program.
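For concreteness, the resource restrictions described above might be applied in the child process just before the student's program is executed, roughly as follows. The use of setrlimit and the SIGXCPU/SIGKILL behaviour are as stated in the text, but the particular limit values and the choice of RLIMIT_AS for the memory cap are assumptions of this sketch.

#include <stdio.h>
#include <sys/resource.h>

/* Apply CPU-time and memory limits before running the submitted program.
 * Exceeding the soft CPU limit triggers SIGXCPU; the hard limit ends in
 * SIGKILL, as described above.  The values here are illustrative only. */
static void restrict_child(void)
{
    struct rlimit cpu, mem;

    cpu.rlim_cur = 5;                 /* soft limit: 5 s of CPU time (SIGXCPU) */
    cpu.rlim_max = 10;                /* hard limit: 10 s of CPU time (SIGKILL) */
    mem.rlim_cur = mem.rlim_max = 64UL << 20;   /* 64 MB of address space */

    if (setrlimit(RLIMIT_CPU, &cpu) != 0)
        perror("setrlimit(RLIMIT_CPU)");
    if (setrlimit(RLIMIT_AS, &mem) != 0)
        perror("setrlimit(RLIMIT_AS)");
    /* RLIMIT_DATA, RLIMIT_STACK and RLIMIT_NPROC can be capped similarly. */
}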
7 CONCLUSION
Our system has the potential to open up new hori-
zons in the field of Automatic Evaluation of program-
ming assignments by making the mechanism rela-
tively simple to use. We have illustrated the flexi-
bility of our system and explained the details of the
automatic evaluation process with an example in a
systematic way, ordering the phases serially as they
occur in the practical environment. This covers: de-
ciding the testing approach, designing the assignment
descriptor, the teacher’s interaction with the system,
and testing programs. The tool covers each possible dimension
for testing the programs, i.e. testing the programs on random
inputs, testing them on user-defined inputs, testing time
complexity and space usage, and style assessment of the
student's program. Students benefit because they know the status
of their program before the final submission is made. Evaluators
benefit because they do not have to spend hours grading the
programs. However, achieving the full evaluation benefits of our
approach requires further research, more extensive use and study
of the experience of the evaluators and students using the
system. Currently we have built the system to work with a single
programming language, i.e. the 'C' language. This was done in the
first instance to gain experience with the interactive nature of
the system in a simple form. The implementation has been kept
open so that we can extend the system to test programming
assignments in other popular languages such as C++ and Java by
making small changes to the code.
REFERENCES
Baker, R. S., Boilen, M., Goodrich, M. T., Tamassia, R.,
and Stibel, B. A. (1999). Testers and visualizers for
teaching data structures. In Proceedings of the 30th ACM
SIGCSE Technical Symposium on Computer Science Education,
pages 261–265.
Benford, S. D., Burke, K. E., and Foxley, E. (1993). A
system to teach programming in a quality controlled
environment. The Software Quality Journal, pages 177–197.
Blumenstein, M., Green, S., Nguyen, A., and Muthukkumarasamy, V.
(2004). An experimental analysis of GAME: A generic automated
marking environment. In Proceedings of the 9th Annual SIGCSE
Conference on Innovation and Technology in Computer Science
Education, pages 67–71.
Jackson, D. and Usher, M. (1997). Grading student programs
using ASSYST. In Proceedings of the 28th ACM SIGCSE Technical
Symposium on Computer Science Education, pages 335–339.
Juedes, D. W. (2003). Experiences in web based grading. In
33rd ASEE/IEEE Frontiers in Education Conference.
Luck, M. and Joy, M. (1999). A secure online submission
system. Software: Practice and Experience, 29(8):721–740.
Pisan, Y., Richards, D., Sloane, A., Koncek, H., and
Mitchell, S. (2003). Submit! a web-based system
for automatic program critiquing. In Proceedings of
the fifth Australasian Computing Education Confer-
ence (ACE 2003), pages 59–68.
Reek, K. A. (1989). The TRY system, or, how to avoid testing
student programs. In Proceedings of SIGCSE, pages 112–116.
Saikkonen, R., Malmi, L., and Korhonen, A. (2001). Fully
automatic assessment of programming exercises. In
Proceedings of the 6th annual conference on Innova-
tion and Technology in Computer Science Education
(ITiCSE), pages 133–136.