Beyond Functionality: Automating Algorithm Design Evaluation in
Introductory Programming Courses
Caio Oliveira, Leandra Silva and João Brunet
Department of Systems and Computing, Federal University of Campina Grande, Campina Grande, Brazil
{caio.oliveira, leandra.silva}@ccc.ufcg.edu.br, joao.arthur@computacao.ufcg.edu.br
Keywords:
Design Test, Introductory Programming Courses, Tests, Feedback, Assessment.
Abstract:
Introductory programming courses often enroll large numbers of students, making individualized feedback
challenging. Automated grading tools are widely used, yet they typically focus on functional correctness
rather than the structural aspects of algorithms—such as the use of appropriate functions or data structures.
This limitation can lead students to create solutions that, while correct in functionality, contain structural flaws
that might hinder their learning. To address this gap, we applied the concept of design tests to automati-
cally detect structural issues in students’ algorithms. Our tool, PyDesignWizard, provides an API leveraging
Python’s Abstract Syntax Tree (AST) to streamline the creation of design tests. We quantitatively evaluated
this approach on 1,714 student programs from an introductory course at the Federal University of Campina
Grande, achieving 98.6% accuracy, a 0.99 true negative rate, 77% precision, and 84% recall. A qualitative
evaluation was also conducted through interviews with 16 programming educators using the Think Aloud
Protocol. The results indicated a high level of understanding and positive feedback: 15 educators grasped the
concept quickly, and 8 highlighted the educational benefits of identifying design issues. In summary, our ap-
proach empowers educators to write structural tests for automated feedback, enhancing the quality of grading
in programming education.
1 INTRODUCTION
The teaching of programming has become increas-
ingly important over the years, driven by technolog-
ical advancements and the growing interest in Infor-
mation Technology. This growth has resulted in larger
classes in introductory programming courses, signifi-
cantly increasing the number of exercises and exams
that instructors need to assess, provide feedback on, and
grade (Watson and Li, 2014). For example, in the con-
text that this work took place, the Introduction to Pro-
gramming course at the Federal University of Camp-
ina Grande (UFCG), a class typically consists of 90
students per semester. The course adopts a practice-
focused approach, where students are required to reg-
ularly submit exercises. On average, each class gen-
erates around 270 exercise submissions, which the in-
structor must assess, provide feedback on, and grade.
This significant workload highlights the challenges
faced by instructors in scaling assessment practices
while ensuring timely and effective feedback.
Studies such as those by Leite (Leite, 2015)
and Papastergiou (Papastergiou, 2009) suggest that
increasing the frequency of programming exercises
can help improve students’ skills, but this also in-
creases the workload for instructors. Automated grad-
ing tools, as mentioned by Janzen (Janzen and others,
2013) and Singh (Singh et al., 2013), are often used
to address this issue. Instructors frequently use On-
line Judges (Juiz and others, 2014), where students’
submissions are automatically graded based on tests
that primarily focus on functional correctness, possi-
bly neglecting important aspects related to code struc-
ture and algorithm design. In summary, the solutions
tend to focus on testing only ’what’ the students’ code
does, rather than ’how’ it does it. While there are
studies exploring the use of static analysis in pro-
gramming education (Truong et al., 2004), these ap-
proaches primarily focus on code quality rather than
addressing algorithm design (Araujo et al., 2016).
In addition to functional correctness, it is essential
to evaluate the design of algorithms. Design refers to
the elements in the algorithm, and how the algorithm
is structured and organized in code. Due to the limited
focus in the literature on how to automatically ver-
ify the algorithm design, and because Online Judges
focus primarily on functional aspects, many students
end up implementing the exercises violating the ex-
pected design and hindering the development of log-
ical reasoning skills. For example, consider that an
instructor asks students to implement the Merge Sort
algorithm in Python (Foundation, 2023), but the stu-
dent chooses to implement the Bubble Sort algorithm
instead. Although the solution is functionally correct,
that is, it sorts a sequence, the student does not meet
the instructor’s design expectations. In some cases,
the problem is even worse, when students end up us-
ing pre-built libraries or functions (sort() in this ex-
ample) that do all the work that they were supposed
to do.
To address this gap, we propose an automated
approach for inspecting algorithm design, based on
static code analysis, with a focus on the educational
context. Our tool, PyDesignWizard, provides an API
that allows the creation of design tests that verify the
structure of algorithms, beyond functional correct-
ness. Unlike traditional tests, our approach covers
everything from parameters and variables to loops
and recursion, enabling a deeper evaluation
of syntactic and semantic conformity. This allows in-
structors to automatically check aspects of the design
of the code, such as control flow structure usage, recursion
depth, complexity, and code modularity, among
others.
The PyDesignWizard API was built based
on Python’s abstract syntax tree (AST) (Beazley,
2012; van Rossum, 2009), facilitating code reading
and manipulation. With this tool, instructors can cre-
ate their own test suites to evaluate both the function-
ality and design of students’ algorithms, ensuring the
originality of their solutions and optimizing grading
time.
In order to evaluate the ability of our approach to
detect design issues and its practical applicability for
instructors, we conducted our research guided by the
following research questions:
RQ1. Can the structure of students’ solutions be
effectively assessed through design tests?
RQ2. Do educators perceive the concept of de-
sign tests as intuitive and easy to grasp?
RQ3. Does a tool based on design tests offer tan-
gible benefits in educational contexts?
To validate our tool and answer these questions,
we conducted two distinct studies. The first study
evaluated the accuracy of algorithm detection in 1,714
programs developed by 90 students in the Introduc-
tion to Programming course at the Federal University
of Campina Grande, one of the top 5 largest Com-
puter Science courses in Brazil. In order to under-
stand how instructors perceive our approach, we have
conducted a qualitative study interviewing 16 instruc-
tors from different universities. We have applied the
Think Aloud Protocol (Ericsson and Simon, 1993a)
to gather and analyze data from the interviews.
The results indicate that the tool was effective in
detecting algorithm details and that design tests re-
duced the incidence of false negatives. Specifically,
the test achieved a 98.6% accuracy, with a precision
of 77% and a recall of 84% in detecting sorting al-
gorithms. In the qualitative assessments, instructors
found the design tests intuitive and recognized the ed-
ucational potential of the tool in the context of pro-
gramming education. Eight interviewees highlighted
the tool’s usefulness in verifying the structure of stu-
dents’ code, while four stated they would use it in
their courses. Another four suggested that the tool
could be incorporated into courses on testing and soft-
ware architecture, complementing existing practices
such as unit testing. Only two interviewees did not
find a clear use for the tool in their educational con-
text. Overall, the feedback suggests that the tool can
serve as an ally in improving the quality of program-
ming education by encouraging best practices in al-
gorithm design and code structure.
This paper is structured as follows. First, we pro-
vide the background for this work, discussing the con-
cept of Design Tests in Section 2. Then, in Section 3,
we describe how we apply the concept of Design Test
for educational purposes, along with the details of the
tool we built to support the implementation of design
tests for Python programs. In Section 4 we present
both studies we conducted: a quantitative evaluation
to demonstrate our tool’s ability to identify structural
design elements in student code with high accuracy
and precision, and a qualitative evaluation to gather
feedback from instructors on the usability and poten-
tial educational benefits of the tool. Finally, in Sec-
tion 6, we summarize our findings, discuss potential
threats to validity, and suggest directions for future re-
search, including expanding the tool’s applicability to
other programming paradigms and educational con-
texts.
2 BACKGROUND: DESIGN
TESTS
In a previous work (Brunet et al., 2009), we intro-
duced the concept of design tests, an innovative ap-
proach to test software aimed at verifying whether a
software system’s implementation adheres to specific
architectural rules or design decisions. While tradi-
tional functional tests focus on ensuring that a pro-
gram delivers correct outputs in response to given in-
puts, design tests go beyond this by checking how
Figure 1: Design Test Overview.
the system has been implemented according to pre-
defined architectural guidelines. This method helps
to ensure that design decisions are consistently ap-
plied, particularly in teams where developers may
change frequently, reducing the likelihood of missed
or misunderstood design aspects. By incorporating
design tests into the daily testing routine, mismatches
between design and implementation can be detected
early, improving overall software quality.
To support the use of design tests for architectural
rules, we developed DesignWizard, an API integrated
with JUnit (JUnit Team, 2023) for Java implementa-
tions. This tool allows developers to express design
rules and decisions in a programming language fa-
miliar to them, making the documentation of design
constraints both executable and easy to maintain. Fig-
ure 1 illustrates an overview of the design tests ap-
proach. In this method, a developer creates a design
test to automate the verification of design rules. The
two key components of this approach are the code
structure analyzer API, which extracts metadata about
the code and exposes this information, and the testing
framework, which provides assertion routines, auto-
mates test execution, and reports the results.
For the sake of clarity, let us analyze a concrete
example. The design test below (Listing 1) illustrates
this: a design rule dictates that classes in the dao
package should only be accessed by classes within
the controller or dao packages. This rule is veri-
fied through a design test that checks all method calls
to ensure compliance.
Listing 1: Design Test pseudo-code for communication rule
between dao and controller package.
daoPackage = myapp.dao
controllerPackage = myapp.controller
FOR each class in daoPackage DO
    callers = class.getCallers()
    FOR each caller in callers DO
        assert caller in controllerPackage ||
               caller in daoPackage
This design test is constructed by first identifying
the target packages and then checking the callers of
each class in the dao package. Assertions are made to
verify that these callers belong to the controller or
dao packages. The test relies on static analysis pro-
vided by DesignWizard to retrieve structural informa-
tion, and the JUnit framework is used to automate the
process.
In the next section, we describe our proposal to
adapt and apply the concept of design tests to the con-
text of programming education. We have adapted the
concept of design tests to the level of algorithms in or-
der to allow instructors to write and assess more fine-
grained rules related to the usage of loops, variables,
conditionals, among others.
3 DESIGN TESTS FOR
EDUCATION
Context. In the context of education, the scenario
for instructors is similar to developers: there are a
plethora of tools and mechanisms to verify whether
students’ programs are functionally correct, that is,
whether their solutions do what they are supposed to
do, but, to the extent of our knowledge, there is no
approach to check the internal aspects (the design) of
the algorithms they produce using the same language
and test infrastructure of the target program. In this
paper, we propose, implement, and evaluate the tool-
ing needed to apply the concept of design tests to the
context of education.
Design at Algorithm Level. As one may note, the
concept of design is broad and applied to different ab-
straction levels. As described in Section 2, design
tests were proposed in the context of high-level ab-
stractions, for example, modules, packages, and sub-
systems. In the context of introductory programming
courses, on the other hand, the instructor’s concerns
are at a different level, but still related to the struc-
ture; hence, the design. These concerns are related,
but not limited to:
the elements students use to implement the pro-
gram, such as conditionals, loops, variables, data
types, and operators;
the choice and manipulation of data structures;
and
the usage of specific functions, classes, and mod-
ules.
The Problem. Let us take as a concrete example the
Introduction to Programming course from the Federal
University of Campina Grande. One of the units of
the course covers linear iteration over lists to
perform tasks such as filtering and finding elements. In
this context, consider a straightforward question ask-
ing students to find and return the index of the small-
est element within a list. The expected design for this
implementation is initializing the first element as the
smallest and then iterating through the list, comparing
each subsequent element to the current smallest. If a
smaller element is found, the index of that element is
stored. After checking all elements, the code returns
the index of the smallest value in the list.
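For reference, a minimal sketch of the expected linear-time solution is shown below; the function name is ours and the instructors’ actual reference solution may differ, but it follows the design described above.

def index_of_smallest(sequence):
    # Assume the first element is the smallest seen so far.
    smallest_index = 0
    for i in range(1, len(sequence)):
        # Store the index whenever a smaller element appears.
        if sequence[i] < sequence[smallest_index]:
            smallest_index = i
    # A single pass, no sorting, and no side effects on the input list.
    # Example: index_of_smallest([5, 2, 8, 1]) returns 3.
    return smallest_index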
Since instructors typically take the well-established
approach of using Online Judges to assess stu-
dents’ code output, they are limited to checking whether the
output matches the expected value, because they rely on black-box
(input/output) tests. However, over the years, the in-
structors noted that some of the students were sorting
the list and then returning the first element. Some of
them implemented their own versions of classic sort-
ing algorithms, while others—more concerning to the
instructors—based their solutions on Python’s built-in
sort() function. In other words, while both Listing 2
and Listing 3 might pass the functional tests, they do
not adhere to the specific design expectations set by
the instructors.
Listing 2: Finding Smallest with Bubble Sort.
def get_smallest(sequence):
    n = len(sequence)
    for i in range(n - 1):
        for j in range(0, n - i - 1):
            if sequence[j] > sequence[j + 1]:
                swap(sequence, j, j + 1)
    return sequence[0]
Listing 3: Finding Smallest with Python’s built-in sort()
function.
def get_smallest(sequence):
    sequence.sort()
    return sequence[0]
The course instructors¹ believe that adhering to
the expected design is crucial for the learning pro-
cess, as it emphasizes key aspects like completing the
task in linear time and avoiding side effects. Given
the high volume of submissions, manually inspecting
students’ solutions is impractical. Therefore, there is
a clear need for automated mechanisms, such as de-
sign tests, to check alignment between the expected
design and the students’ implementations.
¹ The last author was actively involved in the introductory programming course discussed in the paper.
Our Approach. To solve this problem, we have
developed PyDesignWizard, a Code Structure An-
alyzer API that parses Python programs and offers
methods to query detailed information about the an-
alyzed code. This allows us to write design tests
for scenarios like the aforementioned one. PyDe-
signWizard leverages Python’s Abstract Syntax Tree
(AST) for static analysis, integrating unit testing con-
cepts with unittest, Python’s built-in testing framework.
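To give a flavor of the underlying mechanism, the short sketch below shows the kind of structural facts Python’s standard ast module exposes; this is not PyDesignWizard’s actual implementation, only an illustration of the raw material it builds on.

import ast

with open("student-submission.py") as handle:
    tree = ast.parse(handle.read())

# Collect every for/while loop node in the submission.
loops = [node for node in ast.walk(tree)
         if isinstance(node, (ast.For, ast.While))]

# Collect every call whose target is an attribute named 'sort',
# e.g. sequence.sort().
sort_calls = [node for node in ast.walk(tree)
              if isinstance(node, ast.Call)
              and isinstance(node.func, ast.Attribute)
              and node.func.attr == "sort"]

print(len(loops), "loops;", len(sort_calls), "calls to sort()")

PyDesignWizard packages queries of this kind behind the API methods listed below.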
The Python Design Wizard API provides sev-
eral methods that facilitate the automated inspection
of algorithm structure, allowing instructors to ver-
ify whether students followed the expected design in
their solutions. Some of the key methods include:
get_calls_function(function_name).
This method checks if a specific function, such
as sort(), has been called within the student’s
code. It is useful for ensuring that students do
not use built-in functions when the goal is to
manually implement the algorithm.
entity_has_child(parent_entity,
child_type). This method checks whether a
given entity (such as a loop or function) contains
specific child elements, such as an assignment
variable. It can be used to verify the complexity
or modularity of the code.
get_relations(relation_type). With
this method, it is possible to inspect relationships
between different code entities, such as nested
loops or function calls. It allows the detection of
complex patterns, like sorting algorithms that use
multiple loops.
get_entity(entity_name). This method
identifies and returns a specific entity in the code,
such as a function or class, enabling detailed anal-
ysis of that entity in terms of parameters, vari-
ables, and control structures.
Combined, these methods allow instructors to cre-
ate tests that go beyond simple functional verification,
enabling the evaluation of code structure and the de-
tection of patterns or design violations expected in al-
gorithm development.
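As a minimal illustration of how these methods combine with unittest, the sketch below rejects any submission that delegates the work to the built-in sort(); the class name and import line are ours and purely illustrative, while the method name follows the API described above.

import unittest
# from pydesignwizard import PyDesignWizard  # illustrative import path

class NoBuiltInSortTest(unittest.TestCase):
    def test_does_not_call_sort(self):
        dw = PyDesignWizard("student-submission.py")
        # The expected design performs a single linear pass,
        # so no call to sort() should appear in the submission.
        self.assertEqual(dw.get_calls_function('sort'), 0)

Listing 4, discussed next, combines several of these checks into a single, more expressive test.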
Let us revisit the previous example and propose
a design test to determine whether an implementation
relies on a quadratic sorting strategy or Python’s built-
in sort() function. While one could write a test to
check for specific patterns, such as the presence of a
for loop, an if statement, or a variable storing the
smallest element, the instructors in this case aimed
to avoid overly restricting students’ approaches. The
key goal was to prevent the use of a sorting strategy
when a single iteration over the sequence would suf-
fice. Therefore, the design test should focus on two
main characteristics: the presence of nested loops,
which may suggest a sorting algorithm, and the use
of Python’s sort() function. Listing 4 is the design
test implementation for this using PyDesignWizard.
Listing 4: Design Test to check sorting strategies.
1   def test_sort_algorithm(self):
2       dw = PyDesignWizard("student-submission.py")
3       outer_loop = dw.get_entity('for1')
4       if outer_loop == None:
5           outer_loop = dw.get_entity('while1')
6       self.assertTrue(outer_loop != None)
7       inner_loop = outer_loop.get_relations('HASLOOP')
8       self.assertTrue(inner_loop != None)
9       change_list = dw.entity_has_child(inner_loop, 'assign_field')
10      self.assertTrue(change_list)
11      sort_calls = dw.get_calls_function('sort')
12      self.assertTrue(sort_calls == 0)
Line 2 creates an instance of PyDesignWizard,
initializing it with the file "student-submission.py".
When instantiated, the PyDesignWizard object ana-
lyzes the code’s Abstract Syntax Tree (AST) to ex-
tract structural information, representing it as a set of
entities and their relationships. It then offers methods
for querying these entities and relationships, making
it easier to examine the structure and behavior of the
code under analysis.
Lines 3 to 6 attempt to locate an outer loop in
the student’s code, first checking for a for loop
('for1') and, if not found, checking for a while
loop ('while1'). The code then asserts that an outer
loop was successfully found. If neither loop exists,
the test will fail.
Lines 7 and 8 attempt to find an inner loop within
the outer loop by checking for a 'HASLOOP' rela-
tionship. If an inner loop is detected, the test contin-
ues; otherwise, the assertion in line 8 causes the test
to fail.
Lines 9 and 10 verify whether the inner loop con-
tains an assignment operation. Line 9 checks for a
child entity labeled assign_field within the inner
loop, indicating an assignment. Line 10 asserts that
this assignment exists; if it does not, the test will fail.
Lines 11 and 12 check whether the sort function
is called within the student’s code. Line 11 retrieves
the number of times the sort function is invoked us-
ing the method dw.get_calls_function('sort'). Line 12
asserts that the number of calls to the sort function
is zero, indicating that the implementation should not
rely on a built-in sorting function. If the assertion
fails, it suggests that the student’s code improperly
uses the sort function.
Identifying the use of specific functions, such as
the sort() function, is relatively straightforward and
leaves little room for errors in the tests. However, note
that the absence of nested loops does not necessarily
mean that the student did not sort the sequence. They
could have used different sorting strategies, such as
Merge or Quick sort. Therefore, a design test may fail
to detect some cases (false negatives), as well as flag
instances where nested loops are used but are not ac-
tually sorting the sequence (false positives). For this
reason, we evaluated our approach using these types
of tests to assess precision and recall within the con-
text of more complex scenarios.
4 EVALUATION
In this section, we evaluate our approach for writ-
ing tests that verify the program structure of students’
code. We focus on three key aspects to assess its ef-
fectiveness and practicality.
First, we examine through unit tests our tool’s
ability to detect essential code elements. This in-
volves analyzing how accurately the tool can identify
specific components in students’ code, such as loops,
if statements, variables, commands, among others.
Second, we evaluate the performance of a specific
design test in a more complex testing scenario that in-
troduces uncertainty. This allows us to observe how
well our approach handles ambiguous or incomplete
information in the code, simulating real-world chal-
lenges where the structure might not always be clear.
Finally, we assess the feedback from instructors,
particularly in terms of their ability to write tests using
the proposed approach. We investigate whether they
find the tool intuitive and useful, as well as their per-
ceptions of how the approach can enhance the teach-
ing and assessment process.
4.1 Methodology
The methodology of this study was guided by three
key research questions:
RQ1. Can the structure of students’ solutions be
effectively assessed through design tests?
RQ2. Do educators perceive the concept of de-
sign tests as intuitive and easy to grasp?
RQ3. Does a tool based on design tests offer tan-
gible benefits in educational contexts?
In this section, we describe the two studies con-
ducted to address the research questions. The first
study focuses on the first question, while the second
addresses the latter two. These studies are as follows:
(1) a quantitative evaluation of the API, assessing the
confidence level in the test results produced by the
PyDesignWizard, and (2) a qualitative study where
professionals validate the tool’s approach within the
educational context.
4.1.1 Quantitative Study: Answering RQ1
Our API simplifies the identification of specific func-
tions, data types, data structures, and other elements,
with static analysis minimizing the potential for er-
ror. We have conducted extensive testing to ensure ac-
curacy in these fundamental yet critical cases, which
has instilled confidence in the reliability of our tool.
To thoroughly assess the API, we generated synthetic
Python programs that included various language con-
structs, such as loops, variable usage, data structures,
and functions. We used the API to extract informa-
tion from these programs and verified the correctness
of the data retrieved. The unit tests we developed sig-
nificantly mitigate risks associated with the tool’s de-
velopment. The results of our coverage tests can be
summarized as follows:
33 unit tests;
100% function coverage; and
Approximately 86% branch/code coverage.
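As an illustration of what these unit tests look like, a minimal sketch follows; the synthetic program, file name, and test class name are ours, and the entity label 'for1' follows the convention used in Listing 4.

import unittest
# from pydesignwizard import PyDesignWizard  # illustrative import path

SYNTHETIC_PROGRAM = """
def double_all(values):
    result = []
    for value in values:
        result.append(value * 2)
    return result
"""

class PyDesignWizardApiTest(unittest.TestCase):
    def test_detects_loop_and_absence_of_sort(self):
        # Write the synthetic construct to a file and analyze it.
        with open("synthetic_case.py", "w") as handle:
            handle.write(SYNTHETIC_PROGRAM)
        dw = PyDesignWizard("synthetic_case.py")
        self.assertIsNotNone(dw.get_entity('for1'))
        self.assertEqual(dw.get_calls_function('sort'), 0)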
However, we believe it is equally important to
evaluate how design tests perform in more complex
scenarios, where determining the implementation of a
solution requires more detail and multiple approaches
can be used to solve the same problem. To this end,
we chose to assess the precision of a design test we
wrote to detect whether a code implements a sorting
strategy. We selected the sorting context for several
reasons. First, sorting a sequence involves the use of
loops, conditional statements, and data structure ma-
nipulation, making it a rich area for analysis. Second,
there are numerous approaches to sorting, from clas-
sic algorithm implementations to leveraging pre-built
functions. Lastly, over the years, instructors of the
introductory programming course at UFCG have ob-
served that students often resort to sorting algorithms
to solve elementary problems where such complex-
ity is unnecessary—a behavior the instructors seek to
discourage. For these reasons, in this work we have
evaluated the design test (Listing 4) detailed in
Section 3.
Dataset. We executed the design test in Listing 4
on a sample of 1,714 Python solutions, which we au-
tomatically collected from our Online Judge system.
We focused on 21 specific problems, selecting these
because they are the ones where students are most
likely to use sorting as a shortcut to simplify the so-
lution, rather than applying the appropriate strategy.
Examples of these problems include: (i) finding the
smallest or largest element in a list, (ii) checking if a
list is already sorted, (iii) counting the number of dis-
tinct elements, and (iv) verifying if all elements are
unique, among others.
Procedures and Metrics. After executing our se-
lected Design Test, we manually analyzed the 1,714
solutions to ensure that the test accurately detected
the presence of a sorting strategy. We then recorded
the number of true positives, false positives, true neg-
atives, and false negatives to assess the test’s preci-
sion, recall, and accuracy. In this context, true pos-
itives indicate that the test correctly identified a sort-
ing strategy in a solution where one was actually used.
False positives occur when the test incorrectly flagged
a non-sorting solution as employing a sorting strategy.
True negatives reflect cases where the test correctly
indicated the absence of a sorting strategy, while false
negatives represent instances where the test failed to
detect a sorting strategy that was actually present in
the student’s code.
Results and Discussion. Table 1 presents the con-
fusion matrix used to classify test results in terms of
true positives (TP), false positives (FP), true negatives
(TN), and false negatives (FN).
Table 1: Confusion Matrix for Test Results.

                    Detected Positive    Detected Negative
Actual Positive     49 (TP)              9 (FN)
Actual Negative     15 (FP)              1641 (TN)
The confusion matrix was used to calculate the
precision, recall, and accuracy of the proposed ap-
proach. Precision, defined as the ratio of true posi-
tives (TP) to the sum of true positives and false pos-
itives (FP), was 77% (49 / (49 + 15)). Recall, which
measures the ratio of true positives to the sum of true
positives and false negatives (FN), was 84% (49 / (49
+ 9)). Finally, accuracy, calculated as the ratio of all
correctly classified instances (true positives and true
negatives) to the total number of instances, reached
98.6% ((49 + 1641) / (49 + 15 + 1641 + 9)). These
metrics demonstrate the effectiveness of the design
test in correctly identifying sorting strategies while
maintaining a low error rate.
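For completeness, the same figures follow directly from the confusion matrix:

# Metric computation from the confusion matrix (TP=49, FP=15, FN=9, TN=1641).
tp, fp, fn, tn = 49, 15, 9, 1641
precision = tp / (tp + fp)                   # 49 / 64   = 0.77
recall    = tp / (tp + fn)                   # 49 / 58   = 0.84
accuracy  = (tp + tn) / (tp + fp + fn + tn)  # 1690 / 1714 = 0.986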
The manual analysis of false positives revealed
two main categories of programs that the test incor-
rectly identified as involving sorting:
Operations that modify the iterated array: these
operations involved changes to the array during
iteration but were not related to reordering ele-
ments;
Algorithms utilizing auxiliary structures: in these
cases, the algorithm made use of additional data
structures, such as dictionaries or arrays, which
were initialized outside of the loop. These aux-
iliary structures were used for calculations unre-
lated to the array’s ordering.
Listing 5 shows an example of a false positive. Note
that the code does have two loops, but it does not ma-
nipulate the list numbers to sort its elements.
Listing 5: False positive example.
def split_sequence(numbers, divs):
    for i in range(len(numbers)):
        num = numbers[i]
        for div in divs:
            if div != 0:
                num /= div
        numbers[i] = num
These findings highlight the challenge of pre-
cisely differentiating between true sorting strategies
and other types of iterative modifications or calcula-
tions, emphasizing the need for further refinement in
the design of the static analysis tests. Nevertheless, it
is important to note that while the students were not
employing a sorting strategy, they were still diverging
from the design originally intended by the instructors,
which did not require quadratic solutions for the given
problem.
For the false negatives, the issues seem to be
largely related to the absence of test checks for certain
scenarios. In 8 of the false negative cases, the students
used either native functions or custom implementa-
tions that performed operations on the list, which the
test failed to detect. Examples include the use of func-
tions like insert, pop, or custom implementations
of such operations. In one particular case, a student
utilized a lambda function, and our API did not cap-
ture this instance due to the lack of implementation
for handling the structure of lambda expressions.
Listing 6 shows an example of a false negative.
Note that the student implemented the insertion sort,
but used the insert function instead of moving the el-
ements by applying swaps (assign_field).
Listing 6: False negative example.
def my_insertion_sort(arr):
    for i in range(1, len(arr)):
        value = arr.pop(i)
        j = i - 1
        while j >= 0 and arr[j] > value:
            j -= 1
        arr.insert(j + 1, value)
    return arr
The limitation encountered regarding the false
negatives was not inherent to the API itself, but rather
in how the test was written. The test could have been
more specific, explicitly including functions like insert
and pop, which were overlooked. However,
this raises an important discussion about the balance
between writing highly specific tests that aim to cap-
ture many details, and more general tests that, while
simpler, are prone to false positives and false nega-
tives. Finding the right balance depends on the goals
of the test and the context in which it is applied.
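For example, the design test in Listing 4 could be hardened with a helper such as the one sketched below; the helper name and the exact set of flagged functions are our choice, and in the study we deliberately kept the test more general.

def assert_no_list_mutation_shortcuts(self, dw):
    # Extra checks that would have caught the false negatives above:
    # flag sort() as before, plus insert() and pop(), which Listing 6
    # uses to implement insertion sort without explicit swaps.
    for forbidden in ('sort', 'insert', 'pop'):
        self.assertEqual(dw.get_calls_function(forbidden), 0)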
On the other hand, the case involving the lambda
function highlights a genuine need to evolve the API.
This example indicates that the current version of the
API does not fully capture certain language features,
and extending it to handle lambda expressions would
improve its overall accuracy and coverage.
4.1.2 Qualitative Study: Answering RQ2 and
RQ3
Relying solely on professionals’ impressions of the
tool’s documentation would not provide sufficient
confidence to address our research questions concern-
ing the usefulness and usability of the solution. For
this reason, we conducted a qualitative study in which
we interviewed 16 professionals in order to evaluate
and collect deeper insights on the design test approach
for education. Next, we give details on the partici-
pants, the procedures applied to collect and analyze
data, and the results of this study.
Participants. We have interviewed 16 instructors
from three different universities in Brazil. The partic-
ipants were invited through mailing lists sent to these
universities, ensuring a diverse rep-
resentation of institutions. They are all profession-
als in the field of education, with direct experience
teaching programming-related courses. This group
includes both university professors and graduate stu-
dents who serve as instructors for programming dis-
ciplines. From now on, we will refer to these partici-
pants as P1, P2...P16.
Interviews and Data Collection. In addition to
testing our API, we also evaluated whether instruc-
tors are capable of understanding the concept of de-
sign tests, writing their own design tests, and their
perspectives on using this approach in an educational
context. To this end, we asked instructors to perform
two tasks: (i) reading 3 pre-written design tests and
explaining these tests, and (ii) writing 4 design tests
based on specific scenarios we provided.
The 3 pre-written tests were chosen for their in-
creasing complexity, starting with a test for simple
function calls and progressing to more complex struc-
tures such as loops and sorting algorithms, as sort-
ing algorithms are central to verifying structural un-
derstanding of code. This choice was strategic, as
the third test, which required detecting sorting algo-
rithms, allowed us to see whether instructors could
understand and explain a more advanced concept. In
the task of writing 4 new tests, the scenarios provided
ranged from verifying variable assignments to check-
ing for correct usage of functions and loops. These
tasks were selected to reflect common educational sit-
uations, ensuring the instructors could apply the con-
cept of design tests in practical settings.
During the evaluation, we employed the think-
aloud protocol (Ericsson and Simon, 1993b) to collect
data on their experiences while performing the tasks.
The think-aloud protocol is a qualitative research
method where participants verbalize their thought
processes in real-time as they complete a task. This
approach helps capture the reasoning, difficulties, and
strategies used by participants, providing insight into
how they interact with the material. Each session
lasted approximately 20 minutes on average. We
recorded the sessions, including both the task perfor-
mance and follow-up interviews, and later transcribed
the audio. These transcriptions were reviewed for ac-
curacy and then subjected to coding. We worked with
predefined tags related to the usability of the API and
the understanding of design tests to ensure consis-
tency in categorizing the responses. Coding is a sys-
tematic process used in qualitative research to catego-
rize textual data into themes or patterns. By apply-
ing coding, we were able to organize participants’ re-
sponses into meaningful categories that facilitated our
analysis of how well the instructors understood the
design testing process, how they approached writing
tests, and their overall impressions of the approach.
Results. We matched participants’ verbalization to
7 pre-defined tags: Positive feedback (tool or ap-
proach), Negative feedback (tool or approach), Dif-
ficulty (use or understanding), Ease (use or under-
standing), Correct response, Incorrect response, and
Response close to correct. The observations from the
data are the following.
Observation 1: Instructors Can Easily Under-
stand the Concept of Design Tests.
The majority of the instructors quickly grasped the
concept of design tests after a brief introduction. Out
of 16 participants, 15 reported understanding the con-
cept easily, and 10 instructors were able to think about
implementations even before consulting the docu-
mentation. This highlights the intuitive nature of the
concept for professionals in education.
The qualitative data collected shows a significant
amount of positive feedback. There were 103 com-
ments tagged under ”Ease of Understanding”. Eight
instructors asked questions or made comments that
showed a deeper engagement with the idea of design
tests, suggesting a high level of comprehension. On
the other hand, one participant (P12) could not easily
understand the approach and described it as confus-
ing.
P1: ”From what I understand, ’for’ is also an entity,
so in this line here, I am grabbing all occurrences
of ’for’ and returning a list with those occurrences,
and I check if they are different from an empty list.”
This demonstrates the participant’s ability to interpret
the role of programming structures as entities in the
design tests and understand their functional use within
the tool.
P2: ”[...] I really liked it, I found the way of do-
ing these tests interesting, I think it’s cool. My feed-
back is entirely positive.” This feedback emphasizes
the enthusiasm for the design test approach and its
perceived usefulness.
Observation 2: Design Tests Are Seen as a Valu-
able Tool for Educational Use.
Several instructors mentioned that they could see im-
mediate applications for design tests in their teaching.
Out of the 16 participants, 8 directly pointed out po-
tential educational uses, such as automating the struc-
tural correction of student assignments. Furthermore,
4 instructors stated they would consider integrating
the tool into their teaching routines, particularly in
disciplines involving programming practices.
P3: ”Thinking quickly, for correction, the first thing
that comes to mind is grading exams.” This shows
that instructors quickly identified practical applica-
tions for design tests, specifically for automating and
enhancing the feedback process on student submis-
sions.
P4 reflected on the broader use of the tool in program-
ming courses: ”I think in the Programming II course,
it could be interesting to use these design tests along
with unit tests, because they abstract the idea of what
needs to be tested.” This indicates that design tests
could be a valuable addition to traditional unit tests,
providing a complementary layer of feedback that fo-
cuses on the structure of students’ code.
Observation 3: Some Instructors Experienced Mi-
nor Difficulties with the API, but the Overall Con-
cept Remained Clear.
While the majority of participants found the concept
of design tests easy to understand, a few noted mi-
nor challenges when working with specific API func-
tions. For instance, 6 instructors expressed some con-
fusion about how to manipulate certain relationships
between entities, which slowed down their ability to
implement the tests.
P5: ”I had some doubts about how to manipulate
some of the relationships between the entities, but I
think it’s something that can be understood with more
practice.” Despite these difficulties, the participants
generally found that the concept was clear and the tool
was accessible.
P6: ”I didn’t understand the last one, do you want me
to check if there’s at least one ’for’ inside ’func1’?”
This indicates that while there were occasional mis-
understandings, they were more about the technical
details of implementation rather than the core concept
of design tests.
Observation 4: Instructors See Potential in Com-
bining Design Tests with Other Educational Tools.
Some participants reflected on how design tests could
complement existing tools and methodologies in edu-
cation. For example, instructors suggested combining
design tests with unit tests to provide more compre-
hensive feedback to students. This would allow edu-
cators to not only check the correctness of a solution
but also assess the structure of the code.
P7: ”I think design tests could be used to detect
when students are using functions they shouldn’t, like
built-in libraries.” This suggests that design tests can
help enforce specific coding guidelines and encourage
students to focus on learning foundational program-
ming concepts.
In conclusion, the feedback from instructors strongly
supports the observation that design tests are an intu-
itive and valuable tool for educational contexts. The
concept was widely understood, and instructors were
able to think critically about its application in pro-
gramming education. Despite minor challenges with
the API, the overall reception was positive, with par-
ticipants expressing interest in incorporating design
tests into their teaching routines.
5 RELATED WORK
The automatic assessment of programs in introduc-
tory programming courses has been extensively stud-
ied (Ala-Mutka, 2005) (Paiva et al., 2022) (Keun-
ing et al., 2018) (Wasik et al., 2018). One of the
most common approaches is the use of Online Judges,
which automate the verification of functional correct-
ness through input-output tests. Specifically, Janzen
et al. (Janzen and others, 2013) and Singh et al. (Singh
et al., 2013) propose the use of automated judges to
verify whether students’ programs produce the ex-
pected results for a given set of test cases. However,
this method is limited to verifying the program’s out-
put, without considering internal aspects such as the
use of control structures, recursion, and modularity.
On the other hand, Truong et al. (Truong et al.,
2004) explore the application of static code analysis
in an educational context, investigating how verifying
characteristics like cyclomatic complexity, code cov-
erage, and programming best practices can contribute
to automated assessment. However, this approach fo-
cuses on code quality in general, without specifically
covering algorithmic structure.
Araújo et al. (Araujo et al., 2016) emphasize the
need for more robust methods that consider not only
code quality but also the algorithmic structure of pro-
grams. They demonstrate that, in the absence of clear
guidelines, many students adopt solutions that, while
functionally correct, do not adhere to the design cri-
teria expected by instructors, hindering the develop-
ment of logical and abstract reasoning skills.
This work aims to address the limitations of these
existing approaches by proposing a design verifica-
tion tool that uses static analysis to detect structural
deviations in students’ solutions. Thus, the tool com-
plements functionality-based assessments, enabling
educators to provide more comprehensive feedback.
6 CONCLUSION AND FUTURE
WORK
In this work, we introduced the Python Design Wiz-
ard, a tool designed to inspect the structural aspects
of algorithms in programming courses. Our approach
focused on the use of design tests to verify not only
the functional correctness of students’ code but also
its structural design, ensuring that students adhere
to the expected algorithmic patterns. Throughout
the study, we evaluated the effectiveness of the tool
in identifying violations of algorithm design, with
promising results.
Our tests, particularly in detecting sorting strate-
gies, demonstrated a high accuracy of 98.6%, with a
precision of 77% and recall of 84%. These results
reinforce the potential of design tests to supplement
traditional unit tests by addressing aspects related to
the internal structure of algorithms, such as control
flow, recursion, and the use of specific functions.
6.1 Threats to Validity
While the results were promising, there are certain
threats to the validity of our approach. First, the sam-
ple size was limited to 1,714 student programs from
a single institution, which may not fully represent
the diversity of approaches found in broader educa-
tional contexts. Additionally, the focus on sorting al-
gorithms may have limited the generalizability of the
tool’s applicability to other algorithmic topics.
Moreover, the qualitative evaluation was con-
ducted with a relatively small group of instructors,
which may not fully reflect the perspectives of a wider
range of educators. Future studies should consider ex-
panding the qualitative evaluation to a broader sample
of instructors and different educational settings.
6.2 Future Work
As future work, we plan to expand the functionality
of the Python Design Wizard by creating a catalog of
pre-defined design tests that can be shared with the
community of educators. These tests could be tai-
lored to various common topics in introductory pro-
gramming courses, such as recursion, dynamic pro-
gramming, and data structures.
Furthermore, it would be interesting to investigate
the integration of this tool into other programming
paradigms and languages, as well as expanding its ca-
pabilities to more complex scenarios that go beyond
sorting algorithms. We also aim to explore the possi-
bility of developing more specific tests that could han-
dle different levels of complexity, providing a more
granular evaluation of algorithm design.
An interesting observation from our study is that
we evaluated the tests themselves rather than the tool
directly, which is why we included unit tests in our
validation process. This approach allowed us to mea-
sure how well the design tests perform in identifying
algorithmic patterns and violations, rather than focus-
ing solely on the technical capabilities of the tool.
REFERENCES
Ala-Mutka, K. M. (2005). A survey of automated assess-
ment approaches for programming assignments. Com-
puter science education, 15(2):83–102.
Araujo, E., Serey, D., and Figueiredo, J. (2016). Qualita-
tive aspects of students’ programs: Can we make them
measurable? In 2016 IEEE Frontiers in Education
Conference (FIE), pages 1–8. IEEE.
Beazley, D. (2012). Python Essential Reference. Addison-
Wesley.
Brunet, J., Guerrero, D., and Figueiredo, J. (2009). De-
sign tests: An approach to programmatically check
your code against design rules. In 2009 31st Inter-
national Conference on Software Engineering - Com-
panion Volume, pages 255–258.
Ericsson, K. A. and Simon, H. A. (1993a). Protocol Anal-
ysis: Verbal Reports as Data. MIT Press, Cambridge,
MA, revised edition edition.
Ericsson, K. A. and Simon, H. A. (1993b). Protocol Anal-
ysis: Verbal Reports as Data. MIT Press, Cambridge,
MA, revised edition edition.
Foundation, P. S. (2023). Python 3 Documentation. Avail-
able at https://docs.python.org/3/.
Janzen, S. and others (2013). Automated grading systems in
competitive programming contests. In Proceedings of
the 8th International Symposium on Educational Soft-
ware Development.
Juiz, A. and others (2014). Online judge and its applica-
tion to introductory programming courses. In ACM
SIGCSE Bulletin.
JUnit Team (2023). JUnit 5: The Next Generation of JUnit.
Accessed: 2024-11-11.
Keuning, H., Jeuring, J., and Heeren, B. (2018). A system-
atic literature review of automated feedback genera-
tion for programming exercises. ACM Transactions
on Computing Education (TOCE), 19(1):1–43.
Leite, L. (2015). A proposal for improving programming
education with frequent problem solving exercises. In
Brazilian Symposium on Computer Education.
Paiva, J. C., Leal, J. P., and Figueira, Á. (2022). Automated
assessment in computer science education: A state-
of-the-art review. ACM Transactions on Computing
Education (TOCE), 22(3):1–40.
Papastergiou, M. (2009). Digital game-based learning in
computer programming: Impact on educational effec-
tiveness and student motivation. Computers & Educa-
tion.
Singh, J., Gupta, P., and Sharma, M. (2013). Online judge
system for learning and practicing programming. In
International Journal of Computer Applications.
Truong, N., Roe, P., and Bancroft, P. (2004). Static analysis
of students’ java programs. In Proceedings of the Sixth
Australasian Conference on Computing Education-
Volume 30, pages 317–325. Citeseer.
van Rossum, G. (2009). The Python Language Reference
Manual. Network Theory Ltd.
Wasik, S., Antczak, M., Badura, J., Laskowski, A., and
Sternal, T. (2018). A survey on online judge sys-
tems and their applications. ACM Computing Surveys
(CSUR), 51(1):1–34.
Watson, C. and Li, F. W. (2014). Failure rates in intro-
ductory programming revisited. In Proceedings of the
2014 conference on Innovation & technology in com-
puter science education, pages 39–44.