Beyond Functionality: Automating Algorithm Design Evaluation in
Introductory Programming Courses
Caio Oliveira, Leandra Silva and João Brunet
Department of Systems and Computing, Federal University of Campina Grande, Campina Grande, Brazil
{caio.oliveira, leandra.silva}@ccc.ufcg.edu.br, joao.arthur@computacao.ufcg.edu.br
Keywords:
Design Test, Introductory Programming Courses, Tests, Feedback, Assessment.
Abstract:
Introductory programming courses often enroll large numbers of students, making individualized feedback
challenging. Automated grading tools are widely used, yet they typically focus on functional correctness
rather than the structural aspects of algorithms—such as the use of appropriate functions or data structures.
This limitation can lead students to create solutions that, while correct in functionality, contain structural flaws
that might hinder their learning. To address this gap, we applied the concept of design tests to automati-
cally detect structural issues in students’ algorithms. Our tool, PyDesignWizard, provides an API leveraging
Python’s Abstract Syntax Tree (AST) to streamline the creation of design tests. We quantitatively evaluated
this approach on 1,714 student programs from an introductory course at the Federal University of Campina
Grande, achieving 98.6% accuracy, a 0.99 true negative rate, 77% precision, and 84% recall. A qualitative
evaluation was also conducted through interviews with 16 programming educators using the Think Aloud
Protocol. The results indicated a high level of understanding and positive feedback: 15 educators grasped the
concept quickly, and 8 highlighted the educational benefits of identifying design issues. In summary, our ap-
proach empowers educators to write structural tests for automated feedback, enhancing the quality of grading
in programming education.
1 INTRODUCTION
The teaching of programming has become increas-
ingly important over the years, driven by technolog-
ical advancements and the growing interest in Infor-
mation Technology. This growth has resulted in larger
classes in introductory programming courses, signifi-
cantly increasing the number of exercises and exams
that instructors need to assess, provide feedback on, and
grade (Watson and Li, 2014). For example, in the con-
text that this work took place, the Introduction to Pro-
gramming course at the Federal University of Camp-
ina Grande (UFCG), a class typically consists of 90
students per semester. The course adopts a practice-
focused approach, where students are required to reg-
ularly submit exercises. On average, each class gen-
erates around 270 exercise submissions, which the in-
structor must assess, provide feedback on, and grade.
This significant workload highlights the challenges
faced by instructors in scaling assessment practices
while ensuring timely and effective feedback.
Studies such as those by Leite (Leite, 2015)
and Papastergiou (Papastergiou, 2009) suggest that
increasing the frequency of programming exercises
can help improve students’ skills, but this also in-
creases the workload for instructors. Automated grad-
ing tools, as mentioned by Janzen (Janzen and others,
2013) and Singh (Singh et al., 2013), are often used
to address this issue. Instructors frequently use On-
line Judges (Juiz and others, 2014), where students’
submissions are automatically graded based on tests
that primarily focus on functional correctness, possi-
bly neglecting important aspects related to code struc-
ture and algorithm design. In summary, the solutions
tend to focus on testing only ’what’ the students’ code
does, rather than ’how’ it does it. While there are
studies exploring the use of static analysis in pro-
gramming education (Truong et al., 2004), these ap-
proaches primarily focus on code quality rather than
addressing algorithm design (Araujo et al., 2016).
In addition to functional correctness, it is essential
to evaluate the design of algorithms. Design refers to
the elements in the algorithm, and how the algorithm
is structured and organized in code. Due to the limited
focus in the literature on how to automatically ver-
ify the algorithm design, and because Online Judges
focus primarily on functional aspects, many students
end up implementing the exercises violating the ex-
pected design and hindering the development of log-
ical reasoning skills. For example, consider that an
instructor asks students to implement the Merge Sort
algorithm in Python (Foundation, 2023), but the stu-
dent chooses to implement the Bubble Sort algorithm
instead. Although the solution is functionally correct,
that is, it sorts a sequence, the student does not meet
the instructor’s design expectations. In some cases,
the problem is even worse, when students end up us-
ing pre-built libraries or functions (sort() in this ex-
ample) that do all the work that they were supposed
to do.
To address this gap, we propose an automated
approach for inspecting algorithm design, based on
static code analysis, with a focus on the educational
context. Our tool, PyDesignWizard, provides an API
that allows the creation of design tests that verify the
structure of algorithms, beyond functional correct-
ness. Unlike traditional tests, our approach covers
everything from parameters and variables to loops
and recursion, enabling a deeper evaluation
of syntactic and semantic conformity. This allows in-
structors to automatically check aspects of the design
of the code, such as control flow structure usage, recursion
depth, complexity, and code modularity, among
others.
The PyDesignWizard API was built based
on Python’s abstract syntax tree (AST) (Beazley,
2012; van Rossum, 2009), facilitating code reading
and manipulation. With this tool, instructors can cre-
ate their own test suites to evaluate both the function-
ality and design of students’ algorithms, ensuring the
originality of their solutions and optimizing grading
time.
In order to evaluate the ability of our approach to
detect design issues and its practical applicability for
instructors, we conducted our research guided by the
following research questions:
RQ1. Can the structure of students’ solutions be
effectively assessed through design tests?
RQ2. Do educators perceive the concept of de-
sign tests as intuitive and easy to grasp?
RQ3. Does a tool based on design tests offer tan-
gible benefits in educational contexts?
To validate our tool and answer these questions,
we conducted two distinct studies. The first study
evaluated the accuracy of algorithm detection in 1,714
programs developed by 90 students in the Introduc-
tion to Programming course at the Federal University
of Campina Grande, one of the top 5 largest Com-
puter Science courses in Brazil. In order to under-
stand how instructors perceive our approach, we have
conducted a qualitative study interviewing 16 instruc-
tors from different universities. We have applied the
Think Aloud Protocol (Ericsson and Simon, 1993a)
to gather and analyze data from the interviews.
The results indicate that the tool was effective in
detecting algorithm details and that design tests re-
duced the incidence of false negatives. Specifically,
the test achieved a 98.6% accuracy, with a precision
of 77% and a recall of 84% in detecting sorting al-
gorithms. In the qualitative assessments, instructors
found the design tests intuitive and recognized the ed-
ucational potential of the tool in the context of pro-
gramming education. Eight interviewees highlighted
the tool’s usefulness in verifying the structure of stu-
dents’ code, while four stated they would use it in
their courses. Another four suggested that the tool
could be incorporated into courses on testing and soft-
ware architecture, complementing existing practices
such as unit testing. Only two interviewees did not
find a clear use for the tool in their educational con-
text. Overall, the feedback suggests that the tool can
serve as an ally in improving the quality of program-
ming education by encouraging best practices in al-
gorithm design and code structure.
This paper is structured as follows. First, we pro-
vide the background for this work, discussing the con-
cept of Design Tests in Section 2. Then, in Section 3,
we describe how we apply the concept of Design Test
for educational purposes, along with the details of the
tool we built to support the implementation of design
tests for Python programs. In Section 4 we present
both studies we conducted: a quantitative evaluation
to demonstrate our tool’s ability to identify structural
design elements in student code with high accuracy
and precision, and a qualitative evaluation to gather
feedback from instructors on the usability and poten-
tial educational benefits of the tool. Finally, in Sec-
tion 6, we summarize our findings, discuss potential
threats to validity, and suggest directions for future re-
search, including expanding the tool’s applicability to
other programming paradigms and educational con-
texts.
2 BACKGROUND: DESIGN
TESTS
In a previous work (Brunet et al., 2009), we intro-
duced the concept of design tests, an innovative ap-
proach to test software aimed at verifying whether a
software system’s implementation adheres to specific
architectural rules or design decisions. While tradi-
tional functional tests focus on ensuring that a pro-
gram delivers correct outputs in response to given in-
puts, design tests go beyond this by checking how
Figure 1: Design Test Overview.
the system has been implemented according to pre-
defined architectural guidelines. This method helps
to ensure that design decisions are consistently ap-
plied, particularly in teams where developers may
change frequently, reducing the likelihood of missed
or misunderstood design aspects. By incorporating
design tests into the daily testing routine, mismatches
between design and implementation can be detected
early, improving overall software quality.
To support the use of design tests for architectural
rules, we developed DesignWizard, an API integrated
with JUnit (JUnit Team, 2023) for Java implementa-
tions. This tool allows developers to express design
rules and decisions in a programming language fa-
miliar to them, making the documentation of design
constraints both executable and easy to maintain. Fig-
ure 1 illustrates an overview of the design tests ap-
proach. In this method, a developer creates a design
test to automate the verification of design rules. The
two key components of this approach are the code
structure analyzer API, which extracts metadata about
the code and exposes this information, and the testing
framework, which provides assertion routines, auto-
mates test execution, and reports the results.
For the sake of clarity, let us analyze a concrete
example. The design test below (Listing 1) illustrates
this: a design rule dictates that classes in the dao
package should only be accessed by classes within
the controller or dao packages. This rule is veri-
fied through a design test that checks all method calls
to ensure compliance.
Listing 1: Design Test pseudo-code for communication rule
between dao and controller package.
daoPackage = myapp.dao
controllerPackage = myapp.controller
FOR each class in daoPackage DO
    callers = class.getCallers()
    FOR each caller in callers DO
        assert caller in controllerPackage ||
               caller in daoPackage
This design test is constructed by first identifying
the target packages and then checking the callers of
each class in the dao package. Assertions are made to
verify that these callers belong to the controller or
dao packages. The test relies on static analysis pro-
vided by DesignWizard to retrieve structural informa-
tion, and the JUnit framework is used to automate the
process.
In the next section, we describe our proposal to
adapt and apply the concept of design tests to the con-
text of programming education. We have adapted the
concept of design tests to the level of algorithms in or-
der to allow instructors to write and assess more fine-
grained rules related to the usage of loops, variables,
conditionals, among others.
3 DESIGN TESTS FOR
EDUCATION
Context. In the context of education, the scenario
for instructors is similar to developers: there are a
plethora of tools and mechanisms to verify whether
students’ programs are functionally correct, that is,
whether their solutions do what they are supposed to
do, but, to the extent of our knowledge, there is no
approach to check the internal aspects (the design) of
the algorithms they produce using the same language
and test infrastructure of the target program. In this
paper, we propose, implement, and evaluate the tool-
ing needed to apply the concept of design tests to the
context of education.
Design at Algorithm Level. As one may note, the
concept of design is broad and applied to different ab-
straction levels. As described in Section 2, design
tests were proposed in the context of high-level ab-
stractions, for example, modules, packages, and sub-
systems. In the context of introductory programming
courses, on the other hand, the instructor’s concerns
are at a different level, but still related to the struc-
ture; hence, the design. These concerns are related,
but not limited to:
the elements students use to implement the pro-
gram, such as conditionals, loops, variables, data
types, and operators;
the choice and manipulation of data structures;
and
the usage of specific functions, classes, and mod-
ules.
The Problem. Let us take as a concrete example the
Introduction to Programming course from the Federal
University of Campina Grande. One of the units of
the course covers linear iteration over lists to
perform tasks such as filtering and finding elements. In
this context, consider a straightforward question ask-
ing students to find and return the index of the small-
est element within a list. The expected design for this
implementation is initializing the first element as the
smallest and then iterating through the list, comparing
each subsequent element to the current smallest. If a
smaller element is found, the index of that element is
stored. After checking all elements, the code returns
the index of the smallest value in the list.
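For reference, a minimal sketch of the expected linear-time solution is shown below; the function name is ours and the instructors’ actual reference solution may differ, but it follows the design described above.

def index_of_smallest(sequence):
    # Assume the first element is the smallest seen so far.
    smallest_index = 0
    for i in range(1, len(sequence)):
        # Store the index whenever a smaller element appears.
        if sequence[i] < sequence[smallest_index]:
            smallest_index = i
    # A single pass, no sorting, and no side effects on the input list.
    # Example: index_of_smallest([5, 2, 8, 1]) returns 3.
    return smallest_index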
Since instructors typically take the well-established
approach of using Online Judges to assess stu-
dents’ code output, they are limited to checking whether the
output matches the expected value, because they rely on black-box
(input/output) tests. However, over the years, the in-
structors noted that some of the students were sorting
the list and then returning the first element. Some of
them implemented their own versions of classic sort-
ing algorithms, while others—more concerning to the
instructors—based their solutions on Python’s built-in
sort() function. In other words, while both Listing 2
and Listing 3 might pass the functional tests, they do
not adhere to the specific design expectations set by
the instructors.
Listing 2: Finding Smallest with Bubble Sort.
def get_smallest(sequence):
    n = len(sequence)
    for i in range(n - 1):
        for j in range(0, n - i - 1):
            if sequence[j] > sequence[j + 1]:
                swap(sequence, j, j + 1)
    return sequence[0]
Listing 3: Finding Smallest with Python’s built-in sort()
function.
def get_smallest(sequence):
    sequence.sort()
    return sequence[0]
The course instructors¹ believe that adhering to
the expected design is crucial for the learning pro-
cess, as it emphasizes key aspects like completing the
task in linear time and avoiding side effects. Given
the high volume of submissions, manually inspecting
students’ solutions is impractical. Therefore, there is
a clear need for automated mechanisms, such as de-
sign tests, to check alignment between the expected
design and the students’ implementations.
¹ The last author was actively involved in the introductory programming course discussed in the paper.
Our Approach. To solve this problem, we have
developed PyDesignWizard, a Code Structure An-
alyzer API that parses Python programs and offers
methods to query detailed information about the an-
alyzed code. This allows us to write design tests
for scenarios like the aforementioned one. PyDe-
signWizard leverages Python’s Abstract Syntax Tree
(AST) for static analysis, integrating unit testing con-
cepts with unittest, Python’s built-in testing framework.
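To give a flavor of the underlying mechanism, the short sketch below shows the kind of structural facts Python’s standard ast module exposes; this is not PyDesignWizard’s actual implementation, only an illustration of the raw material it builds on.

import ast

with open("student-submission.py") as handle:
    tree = ast.parse(handle.read())

# Collect every for/while loop node in the submission.
loops = [node for node in ast.walk(tree)
         if isinstance(node, (ast.For, ast.While))]

# Collect every call whose target is an attribute named 'sort',
# e.g. sequence.sort().
sort_calls = [node for node in ast.walk(tree)
              if isinstance(node, ast.Call)
              and isinstance(node.func, ast.Attribute)
              and node.func.attr == "sort"]

print(len(loops), "loops;", len(sort_calls), "calls to sort()")

PyDesignWizard packages queries of this kind behind the API methods listed below.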
The Python Design Wizard API provides sev-
eral methods that facilitate the automated inspection
of algorithm structure, allowing instructors to ver-
ify whether students followed the expected design in
their solutions. Some of the key methods include:
get_calls_function(function_name).
This method checks if a specific function, such
as sort(), has been called within the student’s
code. It is useful for ensuring that students do
not use built-in functions when the goal is to
manually implement the algorithm.
entity_has_child(parent_entity,
child_type). This method checks whether a
given entity (such as a loop or function) contains
specific child elements, such as an assignment
variable. It can be used to verify the complexity
or modularity of the code.
get_relations(relation_type). With
this method, it is possible to inspect relationships
between different code entities, such as nested
loops or function calls. It allows the detection of
complex patterns, like sorting algorithms that use
multiple loops.
get_entity(entity_name). This method
identifies and returns a specific entity in the code,
such as a function or class, enabling detailed anal-
ysis of that entity in terms of parameters, vari-
ables, and control structures.
Combined, these methods allow instructors to cre-
ate tests that go beyond simple functional verification,
enabling the evaluation of code structure and the de-
tection of patterns or design violations expected in al-
gorithm development.
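As a minimal illustration of how these methods combine with unittest, the sketch below rejects any submission that delegates the work to the built-in sort(); the class name and import line are ours and purely illustrative, while the method name follows the API described above.

import unittest
# from pydesignwizard import PyDesignWizard  # illustrative import path

class NoBuiltInSortTest(unittest.TestCase):
    def test_does_not_call_sort(self):
        dw = PyDesignWizard("student-submission.py")
        # The expected design performs a single linear pass,
        # so no call to sort() should appear in the submission.
        self.assertEqual(dw.get_calls_function('sort'), 0)

Listing 4, discussed next, combines several of these checks into a single, more expressive test.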
Let us revisit the previous example and propose
a design test to determine whether an implementation
relies on a quadratic sorting strategy or Python’s built-
in sort() function. While one could write a test to
check for specific patterns, such as the presence of a
for loop, an if statement, or a variable storing the
smallest element, the instructors in this case aimed
to avoid overly restricting students’ approaches. The
key goal was to prevent the use of a sorting strategy
when a single iteration over the sequence would suf-
fice. Therefore, the design test should focus on two
main characteristics: the presence of nested loops,
which may suggest a sorting algorithm, and the use
of Python’s sort() function. Listing 4 is the design
test implementation for this using PyDesignWizard.
Listing 4: Design Test to check sorting strategies.
1   def test_sort_algorithm(self):
2       dw = PyDesignWizard("student-submission.py")
3       outer_loop = dw.get_entity('for1')
4       if outer_loop == None:
5           outer_loop = dw.get_entity('while1')
6       self.assertTrue(outer_loop != None)
7       inner_loop = outer_loop.get_relations('HASLOOP')
8       self.assertTrue(inner_loop != None)
9       change_list = dw.entity_has_child(inner_loop, 'assign_field')
10      self.assertTrue(change_list)
11      sort_calls = dw.get_calls_function('sort')
12      self.assertTrue(sort_calls == 0)
Line 2 creates an instance of PyDesignWizard,
initializing it with the file "student-submission.py".
When instantiated, the PyDesignWizard object ana-
lyzes the code’s Abstract Syntax Tree (AST) to ex-
tract structural information, representing it as a set of
entities and their relationships. It then offers methods
for querying these entities and relationships, making
it easier to examine the structure and behavior of the
code under analysis.
Lines 3 to 6 attempt to locate an outer loop in
the student’s code, first checking for a for loop
('for1') and, if not found, checking for a while
loop ('while1'). The code then asserts that an outer
loop was successfully found. If neither loop exists,
the test will fail.
Lines 7 and 8 attempt to find an inner loop within
the outer loop by checking for a 'HASLOOP' rela-
tionship. If an inner loop is detected, the test contin-
ues; otherwise, the assertion in line 8 causes the test
to fail.
Lines 9 and 10 verify whether the inner loop con-
tains an assignment operation. Line 9 checks for a
child entity labeled assign_field within the inner
loop, indicating an assignment. Line 10 asserts that
this assignment exists; if it does not, the test will fail.
Lines 11 and 12 check whether the sort function
is called within the student’s code. Line 11 retrieves
the number of times the sort function is invoked us-
ing the method dw.get_calls_function('sort'). Line 12
asserts that the number of calls to the sort function
is zero, indicating that the implementation should not
rely on a built-in sorting function. If the assertion
fails, it suggests that the student’s code improperly
uses the sort function.
Identifying the use of specific functions, such as
the sort() function, is relatively straightforward and
leaves little room for errors in the tests. However, note
that the absence of nested loops does not necessarily
mean that the student did not sort the sequence. They
could have used different sorting strategies, such as
Merge or Quick sort. Therefore, a design test may fail
to detect some cases (false negatives), as well as flag
instances where nested loops are used but are not ac-
tually sorting the sequence (false positives). For this
reason, we evaluated our approach using these types
of tests to assess precision and recall within the con-
text of more complex scenarios.
4 EVALUATION
In this section, we evaluate our approach for writ-
ing tests that verify the program structure of students’
code. We focus on three key aspects to assess its ef-
fectiveness and practicality.
First, we examine through unit tests our tool’s
ability to detect essential code elements. This in-
volves analyzing how accurately the tool can identify
specific components in students’ code, such as loops,
if statements, variables, commands, among others.
Second, we evaluate the performance of a specific
design test in a more complex testing scenario that in-
troduces uncertainty. This allows us to observe how
well our approach handles ambiguous or incomplete
information in the code, simulating real-world chal-
lenges where the structure might not always be clear.
Finally, we assess the feedback from instructors,
particularly in terms of their ability to write tests using
the proposed approach. We investigate whether they
find the tool intuitive and useful, as well as their per-
ceptions of how the approach can enhance the teach-
ing and assessment process.
4.1 Methodology
The methodology of this study was guided by three
key research questions:
RQ1. Can the structure of students’ solutions be
effectively assessed through design tests?
RQ2. Do educators perceive the concept of de-
sign tests as intuitive and easy to grasp?
RQ3. Does a tool based on design tests offer tan-
gible benefits in educational contexts?
In this section, we describe the two studies con-
ducted to address the research questions. The first
study focuses on the first question, while the second
addresses the latter two. These studies are as follows:
(1) a quantitative evaluation of the API, assessing the
confidence level in the test results produced by the
PyDesignWizard, and (2) a qualitative study where
professionals validate the tool’s approach within the
educational context.
4.1.1 Quantitative Study: Answering RQ1
Our API simplifies the identification of specific func-
tions, data types, data structures, and other elements,
with static analysis minimizing the potential for er-
ror. We have conducted extensive testing to ensure ac-
curacy in these fundamental yet critical cases, which
has instilled confidence in the reliability of our tool.
To thoroughly assess the API, we generated synthetic
Python programs that included various language con-
structs, such as loops, variable usage, data structures,
and functions. We used the API to extract informa-
tion from these programs and verified the correctness
of the data retrieved. The unit tests we developed sig-
nificantly mitigate risks associated with the tool’s de-
velopment. The results of our coverage tests can be
summarized as follows:
33 unit tests;
100% function coverage; and
Approximately 86% branch/code coverage.
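As an illustration of what these unit tests look like, a minimal sketch follows; the synthetic program, file name, and test class name are ours, and the entity label 'for1' follows the convention used in Listing 4.

import unittest
# from pydesignwizard import PyDesignWizard  # illustrative import path

SYNTHETIC_PROGRAM = """
def double_all(values):
    result = []
    for value in values:
        result.append(value * 2)
    return result
"""

class PyDesignWizardApiTest(unittest.TestCase):
    def test_detects_loop_and_absence_of_sort(self):
        # Write the synthetic construct to a file and analyze it.
        with open("synthetic_case.py", "w") as handle:
            handle.write(SYNTHETIC_PROGRAM)
        dw = PyDesignWizard("synthetic_case.py")
        self.assertIsNotNone(dw.get_entity('for1'))
        self.assertEqual(dw.get_calls_function('sort'), 0)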
However, we believe it is equally important to
evaluate how design tests perform in more complex
scenarios, where determining the implementation of a
solution requires more detail and multiple approaches
can be used to solve the same problem. To this end,
we chose to assess the precision of a design test we
wrote to detect whether a code implements a sorting
strategy. We selected the sorting context for several
reasons. First, sorting a sequence involves the use of
loops, conditional statements, and data structure ma-
nipulation, making it a rich area for analysis. Second,
there are numerous approaches to sorting, from clas-
sic algorithm implementations to leveraging pre-built
functions. Lastly, over the years, instructors of the
introductory programming course at UFCG have ob-
served that students often resort to sorting algorithms
to solve elementary problems where such complex-
ity is unnecessary—a behavior the instructors seek to
discourage. For these reasons, in this work we have
evaluated the design test (Listing 4) detailed in
Section 3.
Dataset. We executed the design test in Listing 4
on a sample of 1,714 Python solutions, which we au-
tomatically collected from our Online Judge system.
We focused on 21 specific problems, selecting these
because they are the ones where students are most
likely to use sorting as a shortcut to simplify the so-
lution, rather than applying the appropriate strategy.
Examples of these problems include: (i) finding the
smallest or largest element in a list, (ii) checking if a
list is already sorted, (iii) counting the number of dis-
tinct elements, and (iv) verifying if all elements are
unique, among others.
Procedures and Metrics. After executing our se-
lected Design Test, we manually analyzed the 1,714
solutions to ensure that the test accurately detected
the presence of a sorting strategy. We then recorded
the number of true positives, false positives, true neg-
atives, and false negatives to assess the test’s preci-
sion, recall, and accuracy. In this context, true pos-
itives indicate that the test correctly identified a sort-
ing strategy in a solution where one was actually used.
False positives occur when the test incorrectly flagged
a non-sorting solution as employing a sorting strategy.
True negatives reflect cases where the test correctly
indicated the absence of a sorting strategy, while false
negatives represent instances where the test failed to
detect a sorting strategy that was actually present in
the student’s code.
Results and Discussion. Table 1 presents the con-
fusion matrix used to classify test results in terms of
true positives (TP), false positives (FP), true negatives
(TN), and false negatives (FN).
Table 1: Confusion Matrix for Test Results.

                    Detected Positive    Detected Negative
Actual Positive     49 (TP)              9 (FN)
Actual Negative     15 (FP)              1641 (TN)
The confusion matrix was used to calculate the
precision, recall, and accuracy of the proposed ap-
proach. Precision, defined as the ratio of true posi-
tives (TP) to the sum of true positives and false pos-
itives (FP), was 77% (49 / (49 + 15)). Recall, which
measures the ratio of true positives to the sum of true
positives and false negatives (FN), was 84% (49 / (49
+ 9)). Finally, accuracy, calculated as the ratio of all
correctly classified instances (true positives and true
negatives) to the total number of instances, reached
98.6% ((49 + 1641) / (49 + 15 + 1641 + 9)). These
metrics demonstrate the effectiveness of the design
test in correctly identifying sorting strategies while
maintaining a low error rate.
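For completeness, the same figures follow directly from the confusion matrix:

# Metric computation from the confusion matrix (TP=49, FP=15, FN=9, TN=1641).
tp, fp, fn, tn = 49, 15, 9, 1641
precision = tp / (tp + fp)                   # 49 / 64   = 0.77
recall    = tp / (tp + fn)                   # 49 / 58   = 0.84
accuracy  = (tp + tn) / (tp + fp + fn + tn)  # 1690 / 1714 = 0.986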
The manual analysis of false positives revealed
two main categories of programs that the test incor-
rectly identified as involving sorting:
Operations that modify the iterated array: these
operations involved changes to the array during
iteration but were not related to reordering ele-
ments;
Algorithms utilizing auxiliary structures: in these
cases, the algorithm made use of additional data
structures, such as dictionaries or arrays, which
were initialized outside of the loop. These aux-
iliary structures were used for calculations unre-
lated to the array’s ordering.
Listing 5 shows an example of a false positive. Note
that the code does have two loops, but it does not ma-
nipulate the list numbers to sort its elements.
Listing 5: False positive example.
def split_sequence(numbers, divs):
    for i in range(len(numbers)):
        num = numbers[i]
        for div in divs:
            if div != 0:
                num /= div
        numbers[i] = num
These findings highlight the challenge of pre-
cisely differentiating between true sorting strategies
and other types of iterative modifications or calcula-
tions, emphasizing the need for further refinement in
the design of the static analysis tests. Nevertheless, it
is important to note that while the students were not
employing a sorting strategy, they were still diverging
from the design originally intended by the instructors,
which did not require quadratic solutions for the given
problem.
For the false negatives, the issues seem to be
largely related to the absence of test checks for certain
scenarios. In 8 of the false negative cases, the students
used either native functions or custom implementa-
tions that performed operations on the list, which the
test failed to detect. Examples include the use of func-
tions like insert, pop, or custom implementations
of such operations. In one particular case, a student
utilized a lambda function, and our API did not cap-
ture this instance due to the lack of implementation
for handling the structure of lambda expressions.
Listing 6 shows an example of a false negative.
Note that the student implemented the insertion sort,
but used the insert function instead of moving the el-
ements by applying swaps (assign_field).
Listing 6: False negative example.
def my_insertion_sort(arr):
    for i in range(1, len(arr)):
        value = arr.pop(i)
        j = i - 1
        while j >= 0 and arr[j] > value:
            j -= 1
        arr.insert(j + 1, value)
    return arr
The limitation encountered regarding the false
negatives was not inherent to the API itself, but rather
in how the test was written. The test could have been
more specific, explicitly including functions like insert
and pop, which were overlooked. However,
this raises an important discussion about the balance
between writing highly specific tests that aim to cap-
ture many details, and more general tests that, while
simpler, are prone to false positives and false nega-
tives. Finding the right balance depends on the goals
of the test and the context in which it is applied.
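For example, the design test in Listing 4 could be hardened with a helper such as the one sketched below; the helper name and the exact set of flagged functions are our choice, and in the study we deliberately kept the test more general.

def assert_no_list_mutation_shortcuts(self, dw):
    # Extra checks that would have caught the false negatives above:
    # flag sort() as before, plus insert() and pop(), which Listing 6
    # uses to implement insertion sort without explicit swaps.
    for forbidden in ('sort', 'insert', 'pop'):
        self.assertEqual(dw.get_calls_function(forbidden), 0)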
On the other hand, the case involving the lambda
function highlights a genuine need to evolve the API.
This example indicates that the current version of the
API does not fully capture certain language features,
and extending it to handle lambda expressions would
improve its overall accuracy and coverage.
4.1.2 Qualitative Study: Answering RQ2 and
RQ3
Relying solely on professionals’ impressions of the
tool’s documentation would not provide sufficient
confidence to address our research questions concern-
ing the usefulness and usability of the solution. For
this reason, we conducted a qualitative study in which
we interviewed 16 professionals in order to evaluate
and collect deeper insights on the design test approach
for education. Next, we give details on the partici-
pants, the procedures applied to collect and analyze
data, and the results of this study.
Participants. We have interviewed 16 instructors
from three different universities in Brazil. The partic-
ipants were invited through mailing lists sent to these
universities, ensuring a diverse rep-
resentation of institutions. They are all profession-
als in the field of education, with direct experience
teaching programming-related courses. This group
includes both university professors and graduate stu-
dents who serve as instructors for programming dis-
ciplines. From now on, we will refer to these partici-
pants as P1, P2...P16.
Interviews and Data Collection. In addition to
testing our API, we also evaluated whether instruc-
tors are capable of understanding the concept of de-
sign tests, writing their own design tests, and their
perspectives on using this approach in an educational
context. To this end, we asked instructors to perform
two tasks: (i) reading 3 pre-written design tests and
explaining these tests, and (ii) writing 4 design tests
based on specific scenarios we provided.
The 3 pre-written tests were chosen for their in-
creasing complexity, starting with a test for simple
function calls and progressing to more complex struc-
tures such as loops and sorting algorithms, as sort-
ing algorithms are central to verifying structural un-
derstanding of code. This choice was strategic, as
the third test, which required detecting sorting algo-
rithms, allowed us to see whether instructors could
understand and explain a more advanced concept. In
the task of writing 4 new tests, the scenarios provided
ranged from verifying variable assignments to check-
ing for correct usage of functions and loops. These
tasks were selected to reflect common educational sit-
uations, ensuring the instructors could apply the con-
cept of design tests in practical settings.
During the evaluation, we employed the think-
aloud protocol (Ericsson and Simon, 1993b) to collect
data on their experiences while performing the tasks.
The think-aloud protocol is a qualitative research
method where participants verbalize their thought
processes in real-time as they complete a task. This
approach helps capture the reasoning, difficulties, and
strategies used by participants, providing insight into
how they interact with the material. Each session
lasted approximately 20 minutes on average. We
recorded the sessions, including both the task perfor-
mance and follow-up interviews, and later transcribed
the audio. These transcriptions were reviewed for ac-
curacy and then subjected to coding. We worked with
predefined tags related to the usability of the API and
the understanding of design tests to ensure consis-
tency in categorizing the responses. Coding is a sys-
tematic process used in qualitative research to catego-
rize textual data into themes or patterns. By apply-
ing coding, we were able to organize participants’ re-
sponses into meaningful categories that facilitated our
analysis of how well the instructors understood the
design testing process, how they approached writing
tests, and their overall impressions of the approach.
Results. We matched participants’ verbalization to
7 pre-defined tags: Positive feedback (tool or ap-
proach), Negative feedback (tool or approach), Dif-
ficulty (use or understanding), Ease (use or under-
standing), Correct response, Incorrect response, and
Response close to correct. The observations from the
data are the following.
Observation 1: Instructors Can Easily Under-
stand the Concept of Design Tests.
The majority of the instructors quickly grasped the
concept of design tests after a brief introduction. Out
of 16 participants, 15 reported understanding the con-
cept easily, and 10 instructors were able to think about
implementations even before consulting the docu-
mentation. This highlights the intuitive nature of the
concept for professionals in education.
The qualitative data collected shows a significant
amount of positive feedback. There were 103 com-
ments tagged under ”Ease of Understanding”. Eight
instructors asked questions or made comments that
showed a deeper engagement with the idea of design
tests, suggesting a high level of comprehension. On
the other hand, one participant (P12) could not easily
understand the approach and described it as confus-
ing.
P1: ”From what I understand, ’for’ is also an entity,
so in this line here, I am grabbing all occurrences
of ’for’ and returning a list with those occurrences,
and I check if they are different from an empty list.”
This demonstrates the participant’s ability to interpret
the role of programming structures as entities in the
design tests and understand their functional use within
the tool.
P2: ”[...] I really liked it, I found the way of do-
ing these tests interesting, I think it’s cool. My feed-
back is entirely positive.” This feedback emphasizes
the enthusiasm for the design test approach and its
perceived usefulness.
Observation 2: Design Tests Are Seen as a Valu-
able Tool for Educational Use.
Several instructors mentioned that they could see im-
mediate applications for design tests in their teaching.
Out of the 16 participants, 8 directly pointed out po-
tential educational uses, such as automating the struc-
tural correction of student assignments. Furthermore,
4 instructors stated they would consider integrating
the tool into their teaching routines, particularly in
disciplines involving programming practices.
P3: ”Thinking quickly, for correction, the first thing
that comes to mind is grading exams.” This shows
that instructors quickly identified practical applica-
tions for design tests, specifically for automating and
enhancing the feedback process on student submis-
sions.
P4 reflected on the broader use of the tool in program-
ming courses: ”I think in the Programming II course,
it could be interesting to use these design tests along
with unit tests, because they abstract the idea of what
needs to be tested.” This indicates that design tests
could be a valuable addition to traditional unit tests,
providing a complementary layer of feedback that fo-
cuses on the structure of students’ code.
Observation 3: Some Instructors Experienced Mi-
nor Difficulties with the API, but the Overall Con-
cept Remained Clear.
While the majority of participants found the concept
of design tests easy to understand, a few noted mi-
nor challenges when working with specific API func-
tions. For instance, 6 instructors expressed some con-
fusion about how to manipulate certain relationships
between entities, which slowed down their ability to
implement the tests.
P5: ”I had some doubts about how to manipulate
some of the relationships between the entities, but I
think it’s something that can be understood with more
practice.” Despite these difficulties, the participants
generally found that the concept was clear and the tool
was accessible.
P6: ”I didn’t understand the last one, do you want me
to check if there’s at least one ’for’ inside ’func1’?”
This indicates that while there were occasional mis-
understandings, they were more about the technical
details of implementation rather than the core concept
of design tests.
Observation 4: Instructors See Potential in Com-
bining Design Tests with Other Educational Tools.
Some participants reflected on how design tests could
complement existing tools and methodologies in edu-
cation. For example, instructors suggested combining
design tests with unit tests to provide more compre-
hensive feedback to students. This would allow edu-
cators to not only check the correctness of a solution
but also assess the structure of the code.
P7: ”I think design tests could be used to detect
when students are using functions they shouldn’t, like
built-in libraries.” This suggests that design tests can
help enforce specific coding guidelines and encourage
students to focus on learning foundational program-
ming concepts.
In conclusion, the feedback from instructors strongly
supports the observation that design tests are an intu-
itive and valuable tool for educational contexts. The
concept was widely understood, and instructors were
able to think critically about its application in pro-
gramming education. Despite minor challenges with
the API, the overall reception was positive, with par-
ticipants expressing interest in incorporating design
tests into their teaching routines.
5 RELATED WORK
The automatic assessment of programs in introduc-
tory programming courses has been extensively stud-
ied (Ala-Mutka, 2005) (Paiva et al., 2022) (Keun-
ing et al., 2018) (Wasik et al., 2018). One of the
most common approaches is the use of Online Judges,
which automate the verification of functional correct-
ness through input-output tests. Specifically, Janzen
et al. (Janzen and others, 2013) and Singh et al. (Singh
et al., 2013) propose the use of automated judges to
verify whether students’ programs produce the ex-
pected results for a given set of test cases. However,
this method is limited to verifying the program’s out-
put, without considering internal aspects such as the
use of control structures, recursion, and modularity.
On the other hand, Truong et al. (Truong et al.,
2004) explore the application of static code analysis
in an educational context, investigating how verifying
characteristics like cyclomatic complexity, code cov-
erage, and programming best practices can contribute
to automated assessment. However, this approach fo-
cuses on code quality in general, without specifically
covering algorithmic structure.
Araújo et al. (Araujo et al., 2016) emphasize the
need for more robust methods that consider not only
code quality but also the algorithmic structure of pro-
grams. They demonstrate that, in the absence of clear
guidelines, many students adopt solutions that, while
functionally correct, do not adhere to the design cri-
teria expected by instructors, hindering the develop-
ment of logical and abstract reasoning skills.
This work aims to address the limitations of these
existing approaches by proposing a design verifica-
tion tool that uses static analysis to detect structural
deviations in students’ solutions. Thus, the tool com-
plements functionality-based assessments, enabling
educators to provide more comprehensive feedback.
6 CONCLUSION AND FUTURE
WORK
In this work, we introduced the Python Design Wiz-
ard, a tool designed to inspect the structural aspects
of algorithms in programming courses. Our approach
focused on the use of design tests to verify not only
the functional correctness of students’ code but also
its structural design, ensuring that students adhere
to the expected algorithmic patterns. Throughout
the study, we evaluated the effectiveness of the tool
in identifying violations of algorithm design, with
promising results.
Our tests, particularly in detecting sorting strate-
gies, demonstrated a high accuracy of 98.6%, with a
precision of 77% and recall of 84%. These results
reinforce the potential of design tests to supplement
traditional unit tests by addressing aspects related to
the internal structure of algorithms, such as control
flow, recursion, and the use of specific functions.
6.1 Threats to Validity
While the results were promising, there are certain
threats to the validity of our approach. First, the sam-
ple size was limited to 1,714 student programs from
a single institution, which may not fully represent
the diversity of approaches found in broader educa-
tional contexts. Additionally, the focus on sorting al-
gorithms may have limited the generalizability of the
tool’s applicability to other algorithmic topics.
Moreover, the qualitative evaluation was con-
ducted with a relatively small group of instructors,
which may not fully reflect the perspectives of a wider
range of educators. Future studies should consider ex-
panding the qualitative evaluation to a broader sample
of instructors and different educational settings.
6.2 Future Work
As future work, we plan to expand the functionality
of the Python Design Wizard by creating a catalog of
pre-defined design tests that can be shared with the
community of educators. These tests could be tai-
lored to various common topics in introductory pro-
gramming courses, such as recursion, dynamic pro-
gramming, and data structures.
Furthermore, it would be interesting to investigate
the integration of this tool into other programming
paradigms and languages, as well as expanding its ca-
pabilities to more complex scenarios that go beyond
sorting algorithms. We also aim to explore the possi-
bility of developing more specific tests that could han-
dle different levels of complexity, providing a more
granular evaluation of algorithm design.
An interesting observation from our study is that
we evaluated the tests themselves rather than the tool
directly, which is why we included unit tests in our
validation process. This approach allowed us to mea-
sure how well the design tests perform in identifying
algorithmic patterns and violations, rather than focus-
ing solely on the technical capabilities of the tool.
REFERENCES
Ala-Mutka, K. M. (2005). A survey of automated assess-
ment approaches for programming assignments. Com-
puter science education, 15(2):83–102.
Araujo, E., Serey, D., and Figueiredo, J. (2016). Qualita-
tive aspects of students’ programs: Can we make them
measurable? In 2016 IEEE Frontiers in Education
Conference (FIE), pages 1–8. IEEE.
Beazley, D. (2012). Python Essential Reference. Addison-
Wesley.
Brunet, J., Guerrero, D., and Figueiredo, J. (2009). De-
sign tests: An approach to programmatically check
your code against design rules. In 2009 31st Inter-
national Conference on Software Engineering - Com-
panion Volume, pages 255–258.
Ericsson, K. A. and Simon, H. A. (1993a). Protocol Anal-
ysis: Verbal Reports as Data. MIT Press, Cambridge,
MA, revised edition edition.
Ericsson, K. A. and Simon, H. A. (1993b). Protocol Anal-
ysis: Verbal Reports as Data. MIT Press, Cambridge,
MA, revised edition edition.
Foundation, P. S. (2023). Python 3 Documentation. Avail-
able at https://docs.python.org/3/.
Janzen, S. and others (2013). Automated grading systems in
competitive programming contests. In Proceedings of
the 8th International Symposium on Educational Soft-
ware Development.
Juiz, A. and others (2014). Online judge and its applica-
tion to introductory programming courses. In ACM
SIGCSE Bulletin.
JUnit Team (2023). JUnit 5: The Next Generation of JUnit.
Accessed: 2024-11-11.
Keuning, H., Jeuring, J., and Heeren, B. (2018). A system-
atic literature review of automated feedback genera-
tion for programming exercises. ACM Transactions
on Computing Education (TOCE), 19(1):1–43.
Leite, L. (2015). A proposal for improving programming
education with frequent problem solving exercises. In
Brazilian Symposium on Computer Education.
Paiva, J. C., Leal, J. P., and Figueira, Á. (2022). Automated
assessment in computer science education: A state-
of-the-art review. ACM Transactions on Computing
Education (TOCE), 22(3):1–40.
Papastergiou, M. (2009). Digital game-based learning in
computer programming: Impact on educational effec-
tiveness and student motivation. Computers & Educa-
tion.
Singh, J., Gupta, P., and Sharma, M. (2013). Online judge
system for learning and practicing programming. In
International Journal of Computer Applications.
Truong, N., Roe, P., and Bancroft, P. (2004). Static analysis
of students’ java programs. In Proceedings of the Sixth
Australasian Conference on Computing Education-
Volume 30, pages 317–325. Citeseer.
van Rossum, G. (2009). The Python Language Reference
Manual. Network Theory Ltd.
Wasik, S., Antczak, M., Badura, J., Laskowski, A., and
Sternal, T. (2018). A survey on online judge sys-
tems and their applications. ACM Computing Surveys
(CSUR), 51(1):1–34.
Watson, C. and Li, F. W. (2014). Failure rates in intro-
ductory programming revisited. In Proceedings of the
2014 conference on Innovation & technology in com-
puter science education, pages 39–44.