Growing Grade-by-Grade
Pia Niemelä, Jenni Hukkanen, Mikko Nurminen and Jukka Huhtamäki
Faculty of Information Technology and Communication Sciences, Tampere University,
P.O. Box 1001, FI-33014 Tampere, Finland
ORCIDs: Pia Niemelä https://orcid.org/0000-0002-8673-9089, Jenni Hukkanen https://orcid.org/0000-0002-7691-5974, Mikko Nurminen https://orcid.org/0000-0001-7609-8348, Jukka Huhtamäki https://orcid.org/0000-0003-2707-108X
Keywords:
Learning Management System, Next-Generation Learning Environment, Assessment and Feedback,
Automatic Grading, Manual Grading, Peer-Reviews, Leaderboards, Learning Analytics, Flipped Learning,
Growth Mindset, The Theory of Formative Assessment.
Abstract:
The enrollment of computer science students continues to increase, with record enrollment numbers in the course "Data Structures and Algorithms" in fall 2022. As a result, teaching methods and systems must evolve to support student progress in the face of scarce teaching resources. This paper examines the shift from manual grading to peer-reviewed and auto-graded processes and investigates students' perceptions of the different grading styles. The developed automatic graders utilize data from both the Plussa LMS and GitLab, which serves as the channel for student submissions and provides feedback for formative assessment. The results indicate that peer-reviews are accepted as an exercise for the reviewer, but not as a means to grade the reviewee. Auto-graders are well received due to their instant feedback and the ability to submit multiple times and iterate towards more efficient solutions, which helps foster a growth mindset.
1 INTRODUCTION
The adage "You get what you grade" emphasizes the
significance of dependable, impartial, and efficient
grading methods. To accomplish this, grading must
be designed to provide immediate and ongoing feed-
back to students, allowing them to make adjustments
in real-time and improve their performance in a timely
manner. However, this approach must also be able
to accommodate ever-increasing student populations,
which may seem like a daunting task.
In online learning, students often need to be more autonomous, as scaffolding from the course personnel may only be available in online forums. The learning material is often provided as videos. The videos can be cut into short clips of 15-30 minutes, because cutting the material into shorter portions has been shown to increase student engagement in earlier studies (Saunders et al., 2020; Bergmann and Sams, 2012; Slemmons et al., 2018). The internalization of video content can be effectively assessed through multiple choice questions. The flipped learning approach also suggests the use of "primetime" sessions,
where small groups of students can ask questions to
the instructor (Koskinen et al., 2018). Following this,
students can apply the learned topics in weekly ex-
ercises that involve both theoretical analysis and pro-
gramming. The programming exercises should aim to
minimize the need for additional support from course
personnel. To achieve this, clear and well-structured
exercise instructions should be provided. Addition-
ally, exercise graders should furnish students with ad-
equate feedback in order to support their learning and
development.
In the studied Data Structures and Algorithms course (DSA-2022, N=605), this goal legitimized the effort to make all the exercises automatically graded. In previous course implementations, some of the weekly exercises were already automated. In this implementation, the remaining exercises were automated as well, together with the larger final coursework assignments and the exam.
In this article, we will delve into the advan-
tages and disadvantages of various grading meth-
ods, specifically focusing on the comparison between
manual and automatic grading. The approach adopted
is research-based and evidence-based, and the effects
of the change are evaluated through monitoring learn-
ing outcomes and surveying students about their per-
ceptions of the shift from manual to automatic grading.
These findings will be used to inform and adjust
future course implementations. To gain insight into
how students received the integration of auto-graders,
the following research questions will be posed:
1. Which grading styles do students prefer the most, and why?
2. Which auto-graders were appreciated the most?
3. What submission counts did students desire, and what were their consequences?
2 RELATED WORK
After the COVID-19 pandemic, many students con-
tinue to request remote teaching in the form of videos
and remote exercise sessions, even as in-person teach-
ing resumes. The increased demand for remote teach-
ing and limited resources have led to a shift towards
remote teaching as a norm. In this environment,
the role of educators and course personnel becomes
more implicit, yet they design the course practices and
deliver material, exercises, and assignments to best
serve their target audience. Learning occurs primar-
ily through the use of learning management systems,
which can be enhanced with intelligent tutors. In guiding students, grading is one of the primary means. The purpose of grading is to promote students' learning and to steer the learning process in an appropriate direction; in computer science, this means improving their coding skills.
To maintain the quality of teaching in an online
or hybrid environment, the findings of pedagogical
frameworks such as the growth mindset (Dweck and
Yeager, 2019), the theory of formative assessment
(Black and Wiliam, 2009; Clark, 2012), and flipped
learning (Bergmann and Sams, 2014) can inform the
use of grading tools while developing the DSA-2022 course.
These frameworks can provide insight into what ben-
efits students in terms of feedback, engagement, and
self-directed learning, which can be incorporated into
the design and implementation of grading tools. Ad-
ditionally, the use of formative assessments and active
learning strategies, as promoted by flipped learning,
can improve the effectiveness of learning.
2.1 Pedagogical Frameworks
Pivotal in detecting the growth mindset is a student’s
attitude toward a challenge. If the challenge feels
thrilling and not intimidating, it speaks for a growth
mindset. A growth-mindset student is thrilled by an
opportunity to learn something new and difficult. The opposite of the growth mindset is the fixed mindset
(Haimovitz and Dweck, 2017). Fixed-mindset stu-
dents are constrained by shame and fear of failure.
Ultimately, the fear is induced by the risk of others discovering that they are dumb. In a fixed mindset, this would indeed be tragic, because the judgement is final: intelligence and skills are thought to be fixed, so nothing can be done about them. Not so in the growth mindset, which expects students to grow by taking on challenges; doing so trains students' flexibility and resilience and eventually helps them become more skilled. Growth-minded students are also confident about their progress and believe they have a good chance of achieving the set academic goals, which connects to self-efficacy (Bandura, 1982). Educators should rely on this growth and equip students to regulate their own learning and to learn to learn.
The theory of formative assessment is based on
the idea that assessment should be used as a tool
for improving both learning and instruction, rather
than just evaluating them. Formative assessment takes place throughout the learning process, and it provides feedback not only to students but also to course personnel on what is being learned and what needs to be improved in the course design. For example, instructions and exercise feedback are often refactored so that exercises can be completed more smoothly. According to Clark (2012), three aspects of feedback have the potential to impact meta-cognition and self-efficacy: formative, synchronous, and external/internal feedback. Clark defines formative as an assessment style that supports self-regulation and synchronous as instant feedback, whereas internal feedback refers to the ways persons speak to themselves. Internal feedback generation is typical of self-regulated learners and is often triggered by receiving external feedback first (Clark, 2012).
Flipped learning (FL) has grown into a popular
approach that involves students watching video lec-
tures before engaging in active learning activities such
as problem-solving and discussion. Assessment in
FL typically includes formative assessments that take
place throughout the course.
In MOOCs, active learning implied by FL may in-
corporate interactive and gamified exercises and cod-
ing. The feedback for these exercises can be catego-
rized as formative, if it is instant and detailed enough,
so that students can improve their submissions based
on it. The FL approach is believed to better prepare
students for the 21st century by giving them more
control over their learning, allowing them to learn at
their own pace, while not neglecting the opportunities
for active engagement and collaboration (Avery et al.,
2018).
2.2 Respective Learning Tools
Dweck and her research group have developed a reward system, "brain points", to encourage the development of growth-mindset behaviors by directly incentivizing effort, use of strategy, and incremental progress (Dweck and Yeager, 2019; O'Rourke et al., 2014; O'Rourke et al., 2016). Results show that brain points are capable of encouraging the growth mindset by increasing persistence, time spent playing, strategy use, and perseverance, especially in low-performing students. By providing incentives for effort and progress, players are more likely to engage with the game and continue to challenge themselves, thus fostering the growth mindset among players. Students are thereby encouraged to see challenges as opportunities for growth and development, rather than as a source of frustration.
3 RESEARCH CONTEXT
The studied DSA course provides a comprehensive introduction to data structures and algorithms, from insertion sort to merge sort, from data structures such as sequences and sets to algorithm analysis, including the performance implications of data structure selection. Due to COVID-19, the DSA course replaced earlier live lectures with video recordings, and on-premises tutoring with online tutoring sessions. Special Q&A sessions were provided for tutoring purposes. In these sessions, teachers answered questions posed by students and went into more detail than in the video lectures. Q&A is a similar concept to the primetime sessions utilized in flipped learning (Koskinen et al., 2018), except that in DSA-2022 participation was voluntary and not rewarded with points. The removal of points resulted in a marked decline in participation compared with previous implementations.
The exercises and assignments demonstrate how
well the content was internalized. Students struggling
with the exercises could get help in Teams where TAs
give hints and scaffold students in solving the prob-
lems. However, while doing the exercises students
should get a sufficient amount of feedback from the
developed auto-graders to be able to fulfill the re-
quirements. The assignments are designed to be com-
pleted independently by the students, and success-
fully completing the assignments is evidence of ad-
equate learning.
Assignment 1 and Assignment 2 are divided into
two phases: compulsory and optional. Students have
the option to accept the results from the compulsory
phase or choose to progress to higher grades. Addi-
tionally, students may choose to forgo Assignment 2
entirely, resulting in a maximum grade of 2. These
various grade options can be viewed as an implemen-
tation of flipped assessment, which grants students
more control over their assessments, allowing them
to be more autonomous and focus on areas that align
with their strengths and interests (Toivola, 2019).
3.1 Tools Used: Plussa, OpenDSA and GitLab
Plussa is the learning management system that was used in the course. Plussa was originally developed at the University of Helsinki (Karavirta et al., 2013). The original LMS has a service-oriented architecture that was designed to be easily extendable. This allows for the addition of new services, such as auto-graders, to enhance its functionality (Karavirta et al., 2013; Kantojärvi, 2018). OpenDSA exercises are one such enhancement, but essentially anything is possible: a Docker container with an image of the developer's choice is launched, the Git repository is cloned inside the image, and the tests are executed inside this sandbox.
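The sandbox step can be pictured roughly as follows; this is a minimal sketch only, and the image name, mount point, and test script are illustrative assumptions rather than Plussa's actual configuration.

```python
# Minimal sketch of launching a short-lived grading sandbox with Docker.
# The image name, mount point, and test script below are hypothetical.
import os
import subprocess

workdir = os.path.abspath("exercise")  # grader's working copy of the exercise

subprocess.run(
    ["docker", "run", "--rm",               # remove the container afterwards
     "-v", f"{workdir}:/exercise",          # mount the exercise into the image
     "dsa-grader:latest",                   # an image of the developer's choice
     "/bin/sh", "/exercise/run_tests.sh"],  # run the tests inside the sandbox
    check=True,
)
```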
Open Data Structures and Algorithms (OpenDSA) exercises are specifically designed to support growth via formative assessment. OpenDSA is an open-source, Creative Commons community project; a broad community of developers has been involved in the effort, with the incentive of re-using the materials in their own courses. The University of Virginia has been one of the main promoters; in Finland, Aalto University has been active.
The project has produced a wide variety of self-study resources for learning data structures and algorithms, such as interactive visualizations, quizzes, and coding exercises that help students learn and practice computer science concepts (Shaffer et al., 2011; Fouh et al., 2012). The visualizations cover different algorithms (Karavirta and Shaffer, 2015; Tilanterä et al., 2020) and runtime behaviors (Sirkiä, 2018), that is, illustrations of call stack and heap behavior while executing an algorithm such as recursion (e.g., the annotation editor exercise about recursion). Recursion, like time complexity, is identified as a threshold concept in computer science (Zander et al., 2008), and visualizations are an apt tool for lowering the threshold (Shaffer and Rosson, 2013).
OpenDSA comprehension aids are good for novices, but to advance students' craftsmanship in software engineering, courses need not toys but real tools for improving the efficiency and quality of code. Submitting code through GitLab repositories and running real unit and integration tests better future-proof students' development as coders (Haaranen and Lehtinen, 2015).
Figure 1: Perftests, resource-consuming and resource-savvy versions: (a) Assignment 1, (b) Assignment 2.
In addition to Plussa and OpenDSA, GitLab is a central tool starting from the middle of the course. Besides functioning as a normal version control system, GitLab is the means by which students pull instructions from the course upstream and, on the other hand, submit code for grading. The course upstream is a GitLab repository for pulling only. Course personnel maintain the upstream; new instructions and possible file skeletons are released at the beginning of each exercise round. To complete the course, students had to pass weekly exercises, a coursework assignment, and an exam. The manual assessment of assignments is the most resource-consuming task, thus its automation was the first priority.
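The student-side Git workflow can be sketched roughly as follows; the remote names, branch, and URL are hypothetical placeholders, not the course's actual repository layout.

```python
# Illustrative sketch of the student-side Git workflow (all names are assumptions).
import subprocess

def git(*args: str) -> None:
    """Run a git command and fail loudly if it does not succeed."""
    subprocess.run(["git", *args], check=True)

# One-time setup: register the read-only course upstream next to the student's own repo.
git("remote", "add", "upstream", "https://gitlab.example.org/dsa-2022/upstream.git")

# At the start of each exercise round: pull new instructions and file skeletons.
git("pull", "upstream", "main")

# After solving an exercise: commit and push to the student's own repository,
# whose URL is then submitted to Plussa for grading.
git("add", "-A")
git("commit", "-m", "Solve exercise round")
git("push", "origin", "main")
```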
3.2 Auto-Graders for Assignment
In 2022, the automation of the course was leveled up
by introducing multiple auto-graders to check various
aspects of students’ code.
Figure 2: Plussa and GitLab co-operate during grading.
Fig. 2 illustrates the auto-graders in action. First, students pull instructions from the course-upstream repository. They then commit their code to their own repository and, finally, submit the GitLab URL in the Plussa system, which is responsible for submission and grading. The Plussa system is divided into two parts: the Plussa front-end and the MOOC grader. The system launches temporary Docker containers that are started only for performing the grading. The grader clones the student's Git repository and executes the grading as instructed in a shell script. Examples of graders include the perftest, unit, integration and Valgrind graders. Most of the tests are also given to students, and the respective Plussa graders run the same tests. By running the tests locally, students receive the same feedback as given by the Plussa graders, which decreases the number of needed submissions. This also gives students an idea of how their work will be graded in the LMS. After grading, the points are returned to the Plussa front-end, which stores the grades.
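To make the pipeline concrete, the core of a grading step might look roughly like the following; the clone command, test entry point, and points reporting are illustrative assumptions, not the MOOC grader's actual interface.

```python
# Simplified sketch of one grading step inside a temporary container.
# The test entry point and the way points are reported are hypothetical.
import subprocess
import sys

def grade(repo_url: str, max_points: int = 10) -> int:
    """Clone the student's repository and run its tests inside the sandbox."""
    subprocess.run(["git", "clone", "--depth", "1", repo_url, "submission"],
                   check=True)
    # Run the same test suite that students can also run locally; a non-zero
    # exit code means that at least one test failed.
    result = subprocess.run(["make", "-C", "submission", "test"])
    return max_points if result.returncode == 0 else 0

if __name__ == "__main__":
    points = grade(sys.argv[1])
    # The real grader reports the points back to the Plussa front-end in its
    # own format; here we simply print them.
    print(points)
```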
The primary goal of Data Structures and Algorithms (DSA) is to learn how to write efficient code. This goal is underlined by placing more emphasis on performance testing, that is, measuring the time a program takes to run. However, measuring performance with a stopwatch can be affected by other processes running on the computer at the same time and competing for the same resources. To overcome this issue, an instruction counter is used instead, which measures the number of instructions executed by the program and is not susceptible to interference from other processes. However, the instruction counter requires certain conditions to be met, such as having perf events available on the server. To meet these requirements, a dedicated ESXi server with an instruction counter was set up for this course implementation.
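As an illustration, instruction counts can be collected with the Linux perf tool along the following lines; this is a sketch under the assumption that perf is installed and perf events are accessible, and the course graders' actual invocation may differ.

```python
# Sketch: count executed instructions with `perf stat` instead of wall-clock time.
# Assumes `perf` is installed and perf events are accessible on the host;
# the program name is a hypothetical placeholder.
import subprocess

def instruction_count(program: str, *args: str) -> int:
    """Return the number of instructions executed by the given program."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions", program, *args],
        capture_output=True, text=True, check=True)
    # perf writes its statistics to stderr in CSV form: value,unit,event,...
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("instruction count not found in perf output")

# Example: count = instruction_count("./perftest", "1000")
```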
In DSA-2022, performance tests were conducted
by gradually increasing the amount of data (e.g., N=10, N=100, N=1000, ...). The instruction counts were recorded, and a curve was then fitted using Python to estimate the average complexity of the algorithm. The results were illustrated as a graph showing the average time complexity as a function of the number of data points, see Fig. 1. To ensure accurate results, the input data used for curve fitting should have minimal noise, and the tests should be repeated multiple times to average out the effects of randomized data or other sources of noise.
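The curve fitting itself can be done along the following lines; this is a minimal sketch with illustrative data and candidate models, not the course's exact fitting procedure or thresholds.

```python
# Minimal sketch of estimating average complexity from instruction counts.
# The measurements and candidate models below are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

# Instruction counts measured for growing input sizes (hypothetical data).
n = np.array([10, 100, 1000, 10000], dtype=float)
counts = np.array([2.1e3, 2.9e4, 3.8e5, 4.6e6])

# Candidate asymptotic models; the best-fitting one is reported to the student.
models = {
    "O(n)":       lambda n, a, b: a * n + b,
    "O(n log n)": lambda n, a, b: a * n * np.log2(n) + b,
    "O(n^2)":     lambda n, a, b: a * n ** 2 + b,
}

best_name, best_residual = None, None
for name, f in models.items():
    params, _ = curve_fit(f, n, counts, maxfev=10000)
    residual = float(np.sum((f(n, *params) - counts) ** 2))
    if best_residual is None or residual < best_residual:
        best_name, best_residual = name, residual

print("Estimated average complexity:", best_name)
```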
3.3 Method and Data Collection
The DSA course is developed on a yearly basis. The
method used is a design-based research (DBR) ap-
proach, which involves cyclical development with re-
flective redesign phases (Cobb et al., 2003; Reimann,
2011; Ericson et al., 2016). In DSA-2022, this ap-
proach is guided by the theory of formative assess-
ment (Black and Wiliam, 2009; Clark, 2012) and in-
volves combining educational solutions with empiri-
cal interventions and proof. The DBR approach in-
cludes four stages: design, development, enactment,
and analysis (Anderson and Shattuck, 2012; Wang
and Hannafin, 2005; Ørngreen, 2015). The cycle rep-
resents a course term and the retrospective analysis is
used to inform the design of the next implementation.
Students' feedback was collected in two phases: before Assignment 1 and after Assignment 2. In addition, we utilize the grader feedback and logs to examine the learning process. The course is redesigned based on the results, and adjustments are made where feasible.
4 RESULTS
We collected students' views with two grading-related questionnaires during DSA-2022. The first, the pre-questionnaire, was carried out in the middle of the course, in module 8 (N=360), before Assignment 1. The second was administered as a post-questionnaire in module 14, after Assignment 2 (N=274). 240 students answered both questionnaires.
The questionnaires contained both Likert-scale and open-ended questions. First, we go through the Likert-scale questions about the preferred grading styles; the scale ranges from 'Strongly disagree' (0) to 'Strongly agree' (7). We then interpret the results by selecting illustrative quotations from students' responses.
4.1 Preferred Grading Styles
When transferring from resource-intensive manual grading to auto-grading, the main dilemma is how to maintain the quality of feedback. During this course, alternative ways of giving feedback were experimented with, such as auto-graders, peer-reviews, and comparisons with others. Learning analytics and self-reflection were also listed and students were asked to rate them as grading methods, even though they were omitted from this implementation.
In Fig. 3, automatic grading is upvoted the most in both phases, followed by learning analytics and manual grading; however, in the post-test after Assignment 2, the difference between automatic and manual grading is considerably smaller (1.98 in the pre-test, 1.2 in the post-test; see Table 1). The reason for the drop in automatic grading's rating is the problems with the perftest in Assignment 1, yet the simplified (and thus faster) perftests in Assignment 2 managed to compensate for the perceived discomfort.
4.1.1 Auto-Grading
Auto-grading was by far the most appreciated grading method. In their comments, students praised the speed of the graders and the immediacy of the feedback. In addition, students welcomed the possibility to submit their solution "unpolished" multiple times, as the following respondent did: "Also auto graders don't (hopefully) think you are a complete idiot if you try to submit something that is not yet well polished." The uniformity and fairness of the evaluation was emphasized especially in comparison with peer-reviews, which were not trusted as much. To a minor extent, manual grading was associated with trust issues as well: "Personal written feedback would've been very nice, but emphasis on the word personal." and "There could also be some unknown bias causing certain students to receive better/worse grades than they deserve."
Auto-graders have advantages in terms of trans-
parency and continuous evaluation (characterized by
formative assessment) allowing for intermediate in-
spections and communication with the grader. How-
ever, there were issues with Assignment 1, specifi-
cally with perftests, which compromised fairness and
caused timeouts due to the slowness of visualizing
and estimating asymptotic efficiency. These problems
were addressed in Assignment 2 by simplifying the
perftests, resulting in a smoother deadline. Refactoring the Assignment 1 perftests is the top priority.
Figure 3: Preferred grading style: (a) pre-test, before Assignment 1; (b) post-test, after Assignment 1 and Assignment 2.
Table 1: Summary statistics for the preferred grading style for the 240 students who answered both questionnaires.

grading style        average pre   average post   diff of post and pre   std of the diff   % of changed opinions
manual               4.14          4.80           0.66                   1.66              76.25
auto                 6.12          5.96           -0.16                  1.31              54.17
peer-review          2.54          3.45           0.91                   1.73              73.33
self-reflection      3.52          3.89           0.37                   1.91              72.50
comparison           3.45          3.86           0.41                   1.83              70.83
learning analytics   4.17          4.29           0.12                   1.96              75.00
4.1.2 Manual Grading
Many students prefer manual grading because of its depth and personal touch. A student pointed out the need for constructive feedback, e.g., hints for better readability, efficiency, and coding conventions. (Checking of coding conventions can, however, be achieved with linting tools.) Sometimes the feedback can also degenerate into dealing with irrelevant details, as highlighted in the following response: "Manual grading has on previous courses put enormous weight on nitpicking, which was non-existent now." Students were also concerned about their effort on the exercises going to waste: "If a student cannot do code that passes tests but still has SOME code and has used many hours for the work, the grade cannot be 0. There needs to be some manual grading for that pass/fail situation. Like, you could try just to pass (grade 1) OR get a better grade with autograders." For example, the Valgrind grader represents such a pass/fail grader: it did not allow any memory leaks for a pass, causing "passing panic" among students. While universities must maintain the quality of graduates, the amount of work should not automatically compensate for a lack of skill. It seems that students trust humans to be more empathetic than machines in this sense.
Multiple responses highlighted the size of the
course as a constraint, where manual grading was
identified as the method that consumed the most
teaching resources, and it was thus seen as unrealistic. Additionally, its slowness and the option of submitting work only once were seen as downsides. However, in regard to fairness, manual grading received
fewer negative comments than automatic grading.
4.1.3 Peer-Reviews
In the pre-questionnaire, students were very suspicious about the functionality of peer-reviews. Despite the initial dismay, peer-reviews were introduced in the last module of the course, and the strongest objections seem to have softened after the peer-review period. The primary reason for the softening was the fact that peer-reviews did not directly contribute to the grade of the reviewee. Seeing others' solutions was also enlightening. However, not every review is equally valuable:
"I think peer-reviewing is a good idea, however I would have wanted to see a solution that was more performant than mine, or alternatively get feedback from an author of such a solution."
"Peer reviews depend on the reviewer. Currently it's a pure lottery. If a peer really reads the submission and knows how to give meaningful feedback, peer-reviews are good."
"I dislike peer reviews as the reviewing is not done by a person that has a clear understanding of the grading in the course and I feel like I am not being graded as fairly as I could have."
Students desire feedback from someone whose domain knowledge is equal to or better than their own. To address this, the simplest solution is to increase the number of reviews, as this increases the chances of receiving feedback from a knowledgeable reviewer. However, in some courses, increasing the number of feedback items a student has to give has led to a perceived decrease in the quality of the feedback. Peer-reviews are inherently less uniform in quality than automatic grading, as they are subjective.
Students should be given incentives to provide high-quality feedback to their peers. Additionally, grading the outcome of the peer-review process should be considered in order to ensure the quality of the reviews. Quality checking of the reviews calls for innovative ideas; extending the peer-review process so that students assess the feedback they have received could be a relatively straightforward way to improve it.
Students find it difficult to comment on work if
they are still beginners or there is a gap in skill levels,
but with proper instructions, the task becomes eas-
ier. Thus, the conclusion is to consider increasing the
number of peer-reviews and to put effort into providing
clear and useful instructions to aid the process.
4.1.4 Comparisons
In DSA-2022, the comparison meant a perftest top-10 leaderboard. The leaderboard listed students' initials, hinting at the name but not revealing it fully. Students listed on the leaderboard may take it as an honor. Of course, being omitted from the board may be discouraging or feel unequal: "Comparison with other students is good but if there is only the 10 best, hundreds of people get left out and don't see how they compare. so i think that everyone should be displayed on the leaderboards." The leaderboard has its problems either way, whether only selected people or everyone is included, yet the ethical problems would peak in publishing "bottom boards".
In students' responses, the leaderboard induced a lot of negative feedback. The following collection itemizes the reasons for the strong objection:
"Making this a competition lets only the top students flex their coding muscles when those who need actual help might feel very stupid."
"Seeing other peoples progress puts too much competitive pressure on students. The reality is that sometimes we must prioritize our courses and work in order to stay sane and still learn as much as possible. Seeing other people's progress to that function only wrecks havoc."
"Comparing yourself to others can be really harmful to your self-esteem and many students already struggle with mental health and imposter syndrome."
In the comments above, competition is considered to have a negative effect on one's motivation, yet opposite opinions exist: "I'm very competitive so getting leaderboards is fun and increases my motivation substantially. After reading the internal review I almost wanted to go back to working on prg1 just so I could see if I could get my perfs down to the best ones :D". Taking others' progress as a positive challenge exemplifies the growth mindset.
The information on the leaderboard can also serve as an additional means of guidance: "If we were shown some leaderboards during the development, it would have given us some indication that we were on the right path with the project." Leaderboards
and comparative statistical analysis can be provided
for students as extra means to follow their progress.
Preferably, a student should be able to control whether
the information is shown or not.
4.1.5 Learning Analytics
In larger courses such as DSA-2022, all additional and summarizing tutoring would be welcome. The intention is to use statistical information for the benefit of students, yet the exact means are still to be found. It is also good to be aware of the delicacy of the issue and to focus on discreet handling of the data. Most of all, students wish for encouragement and scaffolding with hard exercises, and desire learning analytics for personalized exercises: "I would be open to learning analytics, if they are user-friendly and give you gentle suggestions and encouragement when you do well." However, at the same time they wonder how personalization will affect the grading, for example: "I don't know how this would be implemented in practice; would everyone be able to get as many points if different exercises are suggested, would everyone be able to complete the same exercises even if they aren't suggested?" Making grade predictions could be useful.
4.1.6 Fairness Aspects
Most students confirmed the claim "I have been graded fairly", yet there was also a group of students strongly disagreeing with it. Responses to the claims "I deserve a better grade than I am getting" and "I am getting a better grade than I deserve" are primarily left-biased, that is, heavily disagreed with, the latter even more heavily. In accordance with human nature, students would readily accept a better grade, but not vice versa.
The claim "Satisfied with the feedback" indicates that the feedback provided to the students is well received, even though there may be room for improvement. The claim "I would have preferred personal written feedback from the course staff" is also agreed with, but not as strongly as the previous claim; this means that while students are generally satisfied with the feedback they received, they would still have preferred personal written feedback from the course staff.
Figure 4: Fairness of grading.
4.2 Most Useful Graders
As seen in Fig. 5, unit and integration tests were con-
sidered to be the most useful form of testing, with vi-
sualizations provided by performance tests also being
highly valued. However, the use of Valgrind as a grad-
ing tool was met with mixed reactions due to its strict
nature as a gatekeeper that must be passed in order to
receive a grade. The main issue with Valgrind is that
it does not allow for any memory leaks or errors and
thus some participants found it to be too decisive.
However, it is questionable to remove graders that check correctness, as the removal would eventually lower the quality of the code. To address the difficulties with Valgrind, providing better resources and guides for using the tool effectively is crucial. This could include incorporating Valgrind checks when learning about invalidating pointers and providing a guide for tackling the most common memory leaks and Valgrind errors. This way, students can learn to use the tool effectively and avoid common pitfalls.
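Such a guide could, for instance, show students how to run the same check locally before submitting; the sketch below assumes a hypothetical executable name and uses standard Valgrind options, mirroring the pass/fail behavior described above.

```python
# Sketch of a local Valgrind check that mirrors the pass/fail grader.
# "./coursework" is a hypothetical placeholder for the student's executable;
# the flags are standard Valgrind options.
import subprocess

result = subprocess.run(
    ["valgrind", "--leak-check=full", "--error-exitcode=1", "./coursework"]
)
if result.returncode == 0:
    print("No memory errors detected; the Valgrind grader should pass.")
else:
    print("Valgrind reported leaks or errors; fix them before submitting.")
```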
Figure 5: Usefulness of the auto-graders.
4.3 Submission Count
The submission count, which refers to the number of possible submissions to the auto-graders, was initially set high at 150 but was later, in Assignment 2, decreased to 20, with the ultimate goal being 3.
The desired submission count was asked about in the second survey, and the results are shown in Fig. 6. Responses suggesting a submission count of 50 or less total three quarters. The weighted average is still far from the intended goal of ten. Fig. 7 shows the histogram of the submission counts that students used for one particular grader in Assignment 1. The grader is the perftest, and it was the most heavily used of all the graders. It is still to be noted that 94% of students submitted at most 20 times, even though a few students needed more than 50. In Assignment 1, the unit, integration and performance test codes were released in the middle of the submission period, whereas in Assignment 2 the test codes were readily available right from the start. The test codes allowed students to test their code locally, and for Assignment 2 the submission counts were much smaller than in Assignment 1.
This suggests that the majority of students are able to test their code if the test codes are released. However, students desired higher submission counts in the auto-graders than they actually needed. This may stem from the need to ensure the possibility of testing with the grader if the test codes are not made available, or if they behave differently in local testing than in the learning management system. In any case, students must be encouraged to test their code by themselves rather than testing against ready-made tests.
Figure 6: Desired submission counts.
Figure 7: The histogram of the submission counts to the
perftest grader in Assignment 1.
Had the students been graded manually, this high number of submissions and this much feedback would have been far beyond reach. At Tampere University, a few previous courses have been transferred from manual to automatic grading (Niemelä and Nurminen, 2020; Nurminen et al., 2021). The previous experiences confirm
the results of this study: if students are provided with the option of gradually improving their code, they fully utilize the option and learn gradually by iterating on their code. Another self-evident but no less important consequence is saving the resources of course personnel. In a course of this size (N=605), auto-grading is simply a necessity. High submission counts and the versatile formative feedback of the various graders, combined with discussion forums and Q&A where the saved course resources can be spent and humans give constructive feedback, make a virtue out of necessity. However, more effort should be put into fostering social cohesion and collaboration among students during the courses. Post-COVID, students at our university have increasingly moved into a remote studying mode. Bringing them back to campus and getting them to work in teams is a brand new challenge.
5 CONCLUSIONS
1. Which grading styles do students prefer the most and the least, and why?
Auto-grading was by far the most appreciated grading method, yet manual grading also had strong support but felt unrealistic in a course of this size. It is appreciated because of its depth and individual touch. Students also pointed out the need for constructive feedback and leniency. Peer-review collected the most negative feedback, but views about peer-reviewing became more neutral after students had gained experience with the DSA-2022 implementation of peer-reviews: students found it beneficial to see alternative solutions, but others had difficulty providing useful feedback and felt frustrated when the reviewed solution was better than theirs. Feedback from ill-motivated peer-reviewers was considered useless.
2. Which auto-graders were appreciated the most?
Unit tests, integration tests and the perftest, whereas the Valgrind grader raised mixed opinions. Perftests, in particular with visualizations, help in reaching the learning goals of the course, efficient code being its very essence. The current implementation must, however, be revised, as the perftest grader is much too slow and thus causes congestion. The revised version must function reliably despite other processes consuming CPU. Preferably, the browser should draw the curves of asymptotic efficiency and do the curve fitting instead of the server, which would save server resources and improve the user experience.
3. What submission counts did students desire, and what were their consequences?
Students wish for quite high submission counts; the mode was 50, and a remarkable share voted for 150 as well. In reality, they should not need that many, and fortunately the majority did not use that many submissions. Instead of submitting to the Plussa LMS, students were instructed to test locally, which is faster and, moreover, aligned with the transversal learning goals of the CS faculty (test your code). The initial submission count of 150 in Assignment 1 was clearly a wrong signal: instead of leaning on the Plussa LMS, students should put more effort into testing their code locally, and preferably they would complete the given test set by writing their own tests if something is missing.
6 FURTHER STUDIES
The data collected by Plussa and GitLab is exten-
sive and could be used for learning analytics. The
analysis should be made accessible to both teachers
and students, with the potential for students to compare their performance to others. Although this might create unnecessary competition, a safer approach would be for students to compare their performance to their own earlier performance. Plussa graders cur-
rently check code quality and conventions, but an ad-
ditional ”self-reflection grader” would be useful in
helping students to improve their understanding of
their strengths and weaknesses, preferably by provid-
ing suggestions for exercises to fill in any gaps, and
showing the path to growth.
An interesting area of research would be investi-
gating the most effective way to combine automatic
grading and the support provided by course person-
nel. While automatic grading has been shown to be
effective and efficient, many students have expressed
a need for support from course personnel. The time
saved with automatic grading could be used to pro-
vide this support and improve teacher-student com-
munication.
In the wake of the COVID-19 pandemic, the uti-
lization of Teams for student-peer and student-teacher
interactions has surged. However, the current ap-
proach to employing these tools is often unstruc-
tured and ad-hoc, lacking clear guidelines on their
appropriate usage and the allocation of communica-
tion responsibilities. This results in an unpredictable
and disorganized communication environment, which
may fail to fully engage students and promote social
cohesion. To address these concerns, a more struc-
tured approach is required, which emphasizes the se-
lection of appropriate communication tools and the
allocation of specific communication responsibilities
to designated individuals. This will facilitate more
predictable and organized communication during the
course implementation, enhancing the overall learn-
ing experience for students. Furthermore, incorpo-
rating team-based activities and assignments can pro-
vide students with valuable opportunities to hone their
collaboration and communication skills, which are
highly sought after in the modern workforce.
REFERENCES
Anderson, T. and Shattuck, J. (2012). Design-based re-
search: A decade of progress in education research?
Educational researcher, 41(1):16–25.
Avery, K., Huggan, C., and Preston, J. P. (2018). The flipped
classroom: High school student engagement through
21st century learning. in education, 24(1):4–21.
Bandura, A. (1982). Self-efficacy mechanism in human
agency. American psychologist, 37(2):122.
Bergmann, J. and Sams, A. (2012). Flip your classroom:
Reach every student in every class every day. Interna-
tional Society for Technology in Education.
Bergmann, J. and Sams, A. (2014). Flipped learning: Gate-
way to student engagement. International Society for
Technology in Education.
Black, P. and Wiliam, D. (2009). Developing the theory of
formative assessment. Educational Assessment, Eval-
uation and Accountability (formerly: Journal of Per-
sonnel Evaluation in Education), 21(1):5–31.
Clark, I. (2012). Formative assessment: Assessment is for
self-regulated learning. Educational Psychology Re-
view, 24(2):205–249.
Cobb, P., Confrey, J., diSessa, A., Lehrer, R., and Schauble,
L. (2003). Design experiments in educational re-
search. Educational Researcher, 32(1):9–13.
Dweck, C. S. and Yeager, D. S. (2019). Mindsets: A view
from two eras. Perspectives on Psychological science,
14(3):481–496.
Ericson, B. J., Rogers, K., Parker, M., Morrison, B., and
Guzdial, M. (2016). Identifying Design Principles for
CS Teacher Ebooks Through Design-Based Research.
In Proceedings of the 2016 ACM Conference on Inter-
national Computing Education Research, ICER ’16,
New York, NY, USA. ACM.
Fouh, E., Sun, M., and Shaffer, C. (2012). Opendsa: A
creative commons active-ebook. In Proceedings of the
43rd ACM technical symposium on Computer Science
Education, pages 721–721.
Haaranen, L. and Lehtinen, T. (2015). Teaching git on the
side: Version control system as a course platform. In
Proceedings of the 2015 ACM Conference on Innova-
tion and Technology in Computer Science Education,
pages 87–92.
Haimovitz, K. and Dweck, C. S. (2017). The origins of
children’s growth and fixed mindsets: New research
and a new proposal. Child development, 88(6):1849–
1859.
Kantojärvi, J. (2018). Architecture of A+ LMS. https://apluslms.github.io/architecture/presentation/.
Karavirta, V., Ihantola, P., and Koskinen, T. (2013).
Service-oriented approach to improve interoperability
of e-learning systems. In 2013 IEEE 13th Interna-
tional Conference on Advanced Learning Technolo-
gies, pages 341–345. IEEE.
Karavirta, V. and Shaffer, C. A. (2015). Creating engaging
online learning material with the JSAV javascript al-
gorithm visualization library. IEEE Transactions on
Learning Technologies, 9(2):171–183.
Koskinen, P., Lämsä, J., Maunuksela, J., Hämäläinen, R., and Viiri, J. (2018). Primetime learning: collaborative and technology-enhanced studying with genuine teacher presence. International Journal of STEM Education, 5(1):20.
Niemelä, P. and Nurminen, M. (2020). Rate your mate for food for thought: Elsewhere use a grader. In Proceedings of the 12th International Conference on Computer Supported Education - Volume 2: CSEDU, pages 422–429. INSTICC, SciTePress.
Nurminen, M., Niemelä, P., and Järvinen, H.-M. (2021). Having it all: auto-graders reduce workload yet increase the quantity and quality of feedback. In SEFI Annual Conference: Blended Learning in Engineering Education: challenging, enlightening–and lasting, pages 385–393.
Ørngreen, R. (2015). Reflections on design-based research.
In Human Work Interaction Design. Work Analysis
and Interaction Design Methods for Pervasive and
Smart Workplaces, pages 20–38. Springer.
O'Rourke, E., Haimovitz, K., Ballweber, C., Dweck, C., and Popović, Z. (2014). Brain points: A growth mindset incentive structure boosts persistence in an educational game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3339–3348.
O’Rourke, E., Peach, E., Dweck, C. S., and Popovic, Z.
(2016). Brain points: A deeper look at a growth mind-
set incentive structure for an educational game. In
Haywood, J., Aleven, V., Kay, J., and Roll, I., editors,
Proceedings of the Third ACM Conference on Learn-
ing @ Scale, L@S 2016, Edinburgh, Scotland, UK,
April 25 - 26, 2016, pages 41–50. ACM.
Reimann, P. (2011). Design-based research. In Methodological Choice and Design, pages 37–50. Springer.
Saunders, F., Gellen, S., Stannard, J., McAllister-Gibson,
C., Simmons, L., and Gibson, A. (2020). Educat-
ing the Netflix Generation: Evaluating the impact of
teaching videos across a Science and Engineering Fac-
ulty.
Shaffer, C. A., Karavirta, V., Korhonen, A., and Naps, T. L.
(2011). OpenDSA: beginning a community active-
ebook project. In Proceedings of the 11th Koli Calling
International Conference on computing education re-
search, pages 112–117.
Shaffer, S. C. and Rosson, M. B. (2013). Increasing stu-
dent success by modifying course delivery based on
student submission data. Inroads, 4(4):81–86.
Sirkiä, T. (2018). Jsvee & Kelmu: Creating and tailoring program animations for computing education. Journal of Software: Evolution and Process, 30(2):e1924.
Slemmons, K., Anyanwu, K., Hames, J., Grabski, D., Ml-
sna, J., Simkins, E., and Cook, P. (2018). The im-
pact of video length on learning in a middle-level
flipped science setting: Implications for diversity in-
clusion. Journal of Science Education and Technol-
ogy, 27(5):469–479.
Tilanterä, A. et al. (2020). Towards automatic advice in visual algorithm simulation.
Toivola, M. (2019). Käänteinen arviointi [Flipped assessment]. Helsinki: Edita.
Wang, F. and Hannafin, M. J. (2005). Design-based
research and technology-enhanced learning environ-
ments. Educational Technology Research and Devel-
opment, 53(4):5–23.
Zander, C., Boustedt, J., Eckerdal, A., McCartney, R., Moström, J. E., Ratcliffe, M., and Sanders, K. (2008). Threshold concepts in computer science: A multi-national empirical investigation. In Threshold Concepts within the Disciplines, pages 105–118. Brill.