lated our method's accuracy by comparing the trained models' predictions on their test-set data (10-fold splits × 12 training models, for a total of 120 independent test sets) with the test sets' expected results (which were hidden from the models for which they served as test sets). For a total of 1020 test samples, we achieved the results displayed in Table 3 and Figure 4. Detailed information on the performance of each grade (across every training iteration) can be found in Table 3.
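For illustration, the sketch below outlines how such a per-question, 10-fold evaluation can be assembled. Here, build_model stands in for the actual CNN constructor used in our experiments and is a hypothetical placeholder, as are the hyperparameter values.

# Minimal sketch of the per-question 10-fold evaluation described above.
# build_model() is assumed to return a freshly compiled classifier whose
# predict() outputs class probabilities; it is a hypothetical placeholder.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate_question(X, y, build_model, n_splits=10, epochs=20):
    """Train one model per fold and collect the held-out predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    y_true, y_pred = [], []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                          # fresh CNN per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        fold_pred = model.predict(X[test_idx]).argmax(axis=1)
        y_true.extend(y[test_idx])
        y_pred.extend(fold_pred)
    accuracy = np.mean(np.array(y_true) == np.array(y_pred))
    return np.array(y_true), np.array(y_pred), accuracy

# Repeating this for each of the 12 per-question models yields
# 12 x 10 = 120 independent test sets, whose pooled predictions
# form a single confusion matrix (as in Figure 4).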
As can be seen in the confusion matrix in Figure 4, each class had an accuracy (across the different trained networks, each of them independently trained for a single question) ranging from 69% (grade A) to 80% (grade D). The largest degree of confusion was 13% (between grades D and F), which is expected considering the subjective nature of code grading and the fact that both are the lowest possible grades. More than 10% of confusion also occurred between grades A and B (11%), which is likewise expected for similar reasons (they are the two best possible evaluations). An unexpected confusion occurred between grades D and B (13%). Confusion between the remaining classes did not exceed 10%, which indicates that this method can provide a good analysis of code quality in an academic environment.
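These per-class figures correspond to the rows of a row-normalized confusion matrix, where the diagonal gives each grade's accuracy and the off-diagonal entries give the confusion rates. The sketch below only illustrates this computation; the grade labels and function name are illustrative, and the percentages reported above come from Figure 4, not from this code.

# Sketch: row-normalizing a confusion matrix to read off per-class
# accuracy (diagonal) and confusion rates (off-diagonal).
import numpy as np
from sklearn.metrics import confusion_matrix

GRADES = ["A", "B", "C", "D", "F"]

def per_class_rates(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=range(len(GRADES)))
    rates = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)  # row-normalize
    for i, grade in enumerate(GRADES):
        confusions = ", ".join(
            f"{GRADES[j]}: {rates[i, j]:.0%}"
            for j in range(len(GRADES)) if j != i)
        print(f"grade {grade}: accuracy {rates[i, i]:.0%} ({confusions})")
    return rates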
4.1 Discussions
The proposed method is not intended to replace the evaluation performed by the professors, which is a very important step of the teaching process. Instead, it aims to mitigate possible inconsistencies that might arise during this process, taking previously made corrections (also performed by the professors) as a reference.
Before making each assignment grade available to students, professors could use this method to validate each grade given, in order to find indications of inconsistent corrections. We believe this is a real possibility, since each question on the final exam covers one specific taught concept, and we traditionally repeat the same topics in questions of similar difficulty across different academic terms.
Considering that there are 5 possible classes (grades A, B, C, D, and F), the results presented in this paper can be considered excellent. Usually, when we, as professors, are responsible for evaluating a student submission to a question, opinions tend to diverge when the student's work does not reach a polarized result (either perfectly correct or completely wrong), leading to subjectivity in the evaluation process. For example, while some professors would assign a D to a poor solution, arguing that the student demonstrated a minimum level of skill, other professors would consider the same solution a complete failure and assign an F grade. In our dataset, collected in a blended-learning modality with a unified course and unified exams, this subjectivity between the D and F grades resulted in a variation of 20%. In (Zampirolli et al., 2018), a variation of up to 40% was reported in the evaluation of several classes in the face-to-face modality, when there was no unified process. Analysing the confusion matrix in Figure 4, the difference between these two grades in our method (cases the method classified as F but the teacher attributed a D grade) was 13%.
5 CONCLUSIONS
In this article, we presented a method for helping professors evaluate student code submissions in an undergraduate introductory programming language course (ILP). We believe that our approach could be incorporated into Massive Open Online Courses (MOOCs), since it offers a deeper evaluation of source code instead of the binary pass or fail feedback given by traditional online programming judges such as URI (urionlinejudge.com.br), repl.it, and VPL (Virtual Programming Lab for Moodle, vpl.dis.ulpgc.es), among others.
We validated our method on a corpus of 938 programming exercises developed by undergraduate students during the final exam of an introductory-level programming course, which was offered in a blended-learning modality (combining face-to-face and online classes). As explained in Section 1, these students have different levels of interest and/or skill in computer programming; therefore, we trained our models on a corpus reflecting many different types of students, believing that it would better reflect a real-world scenario.
Our method consists of cleaning the source code text (removing comments and normalizing variable names), representing the source codes as code embeddings based on the Skip-Gram method, and training a Convolutional Neural Network (CNN) on these representations.
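The sketch below illustrates this pipeline end to end, assuming C-style source code, gensim's Skip-Gram implementation (sg=1), and a small Keras CNN; the tokenization rules, identifier normalization, and hyperparameters (MAX_LEN, EMB_DIM, filter sizes) are illustrative assumptions rather than the exact configuration used in our experiments.

# Sketch of the pipeline: cleaning, Skip-Gram code embeddings, 1D CNN.
import re
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

MAX_LEN, EMB_DIM, NUM_GRADES = 300, 100, 5  # assumed hyperparameters

def clean_source(code: str) -> list[str]:
    """Strip C-style comments and roughly normalize identifiers."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)   # block comments
    code = re.sub(r"//[^\n]*", " ", code)                # line comments
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    # Very rough normalization: map non-keyword identifiers to VAR.
    keywords = {"int", "float", "if", "else", "for", "while",
                "return", "printf", "scanf"}
    return [t if (t in keywords or not t[0].isalpha()) else "VAR"
            for t in tokens]

def build_classifier(token_lists, labels):
    # Skip-Gram embeddings over the cleaned token sequences (sg=1).
    w2v = Word2Vec(token_lists, vector_size=EMB_DIM, sg=1,
                   window=5, min_count=1)
    vocab = {tok: i + 1 for i, tok in enumerate(w2v.wv.index_to_key)}
    emb = np.zeros((len(vocab) + 1, EMB_DIM))              # index 0 = padding
    for tok, i in vocab.items():
        emb[i] = w2v.wv[tok]

    # Encode each program as a fixed-length sequence of token indices.
    X = tf.keras.preprocessing.sequence.pad_sequences(
        [[vocab[t] for t in toks] for toks in token_lists], maxlen=MAX_LEN)
    y = tf.keras.utils.to_categorical(labels, NUM_GRADES)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            len(vocab) + 1, EMB_DIM,
            embeddings_initializer=tf.keras.initializers.Constant(emb)),
        tf.keras.layers.Conv1D(128, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(NUM_GRADES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=10, validation_split=0.1)
    return model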
We achieved an average accuracy of 74.9% per question (all of them hard-level exercises). Considering the subjectivity of attributing grades to code assignments, which is a very challenging task in itself, these results suggest many possibilities for using this method to help professors grade actual code assignments.