Table 5: A sample of the data set used for the analysis of the first three years. The attributes are: the student identifier,
Student; the student cohort, Cohort; the number of credits for exams with a grade achieved during the I, II or III year,
CreditsI, CreditsII and CreditsIII; the average grade, in the range 18..30, achieved during the I, II or III year,
AvggradeI, AvggradeII and AvggradeIII; and the grade obtained in the entrance test, in the range 0..25, Test.
Student  Cohort  CreditsI  CreditsII  CreditsIII  AvggradeI  AvggradeII  AvggradeIII  Test
100      2010    60        60         60          26         28          28           18
200      2010    12        36         48          21         23          25           15
...      ...     ...       ...        ...         ...        ...         ...          ...
300      2011    12        24         36          21         23          24           12
...      ...     ...       ...        ...         ...        ...         ...          ...
during their first year. Understanding the productivity
of first-year students can point out these difficulties
and give an opportunity to improve the teaching and
learning processes of the Laurea Degree. The data set
we analyse is not big; however, as pointed out in (Natek
and Zwilling, 2014), a data mining analysis is useful
even in such small contexts. Moreover, the case study
allows us to describe the methodology on a real situation.
In particular, we perform a cluster analysis using
the K-means implementation of the software WEKA.
We measure cluster validity with correlation, using
the concept of proximity and incidence matrices: in
the proximity matrix $P = (P_{i,j})$, each element
$P_{i,j}$ is the Euclidean distance between elements
$i$ and $j$ of the data set; in the incidence matrix
$I = (I_{i,j})$, each element $I_{i,j}$ is 1 if elements
$i$ and $j$ belong to the same cluster and 0 otherwise.
We then compute the Pearson correlation, as defined in
(Tan et al., 2006, page 77), between the row-by-row
linear representations of the matrices $P$ and $I$.
Since elements in the same cluster should be close to
each other, we expect a negative value, where -1 means
a perfect negative linear relationship.
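The validity measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the computation actually used in the study: the function names `pearson` and `cluster_validity` are hypothetical, and the four data points are invented toy values.

```python
from math import dist, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cluster_validity(points, labels):
    """Correlate the row-by-row flattened proximity and incidence matrices."""
    n = len(points)
    # Proximity matrix P: Euclidean distance between elements i and j.
    proximity = [dist(points[i], points[j]) for i in range(n) for j in range(n)]
    # Incidence matrix I: 1 if i and j share a cluster, 0 otherwise.
    incidence = [1 if labels[i] == labels[j] else 0
                 for i in range(n) for j in range(n)]
    return pearson(proximity, incidence)


# Invented toy data: (credits, average grade, test grade) per student.
points = [(60, 28, 18), (54, 27, 17), (12, 21, 12), (18, 22, 13)]
labels = [0, 0, 1, 1]
validity = cluster_validity(points, labels)
print(round(validity, 2))
```

A labelling that groups nearby points together gives a value closer to -1 than one that mixes distant points, which is why a strongly negative correlation is read as evidence of a good clustering.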
We tried the K-means algorithm with several values
of k; with k = 3 we obtained the clustering of the
first-year students of the 5 cohorts from 2010 to
2014 illustrated in Figure 1. As cluster attributes
we used the number of credits corresponding to exams
with a grade, attribute credits_grade, the average
grade, attribute avggrade, and the grade of the
self-assessment test, attribute test_grade.
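For readers unfamiliar with K-means, the following is a minimal sketch of the standard Lloyd iteration on three attributes per student. It is not WEKA's implementation: the function name `k_means` is hypothetical, the data are invented, and the initial centroids are passed in explicitly so that the example is deterministic, whereas implementations typically choose them at random.

```python
from math import dist


def k_means(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels


# Invented toy data: (credits, average grade, test grade) per student.
points = [
    (60, 28, 18), (57, 27, 19), (54, 28, 17),   # high achieving
    (30, 24, 14), (33, 23, 15), (27, 24, 13),   # medium achieving
    (0, 0, 10), (0, 0, 9), (0, 0, 11),          # no credits, no grade
]
centroids, labels = k_means(points, [points[0], points[3], points[6]])
print(labels)
```

Because the result of K-means depends on the initial centroids, trying several values of k (and, with random initialization, several seeds) and comparing the outcomes is a common precaution.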
The centroids of the clusters are given in Table 6.
In particular, cluster 0 identifies medium achieving
students; cluster 1 corresponds to students who,
during the first year, passed only the English exam
and therefore have no credits and no grade in this
clustering; finally, cluster 2 identifies high
achieving students. The clusters are shown in blue,
red and green in Figure 1, respectively. The Pearson
correlation between the linear representations of the
proximity and incidence matrices is -0.66, a fairly
strong negative correlation that supports the
validity of the clustering.
Figures 2, 3, 4, 5 and 6 illustrate the relation
between the students in the clusters of Figure 1 and
the exams of the first year: Algorithms and Data
Structures (ADS), Programming (PRG), Calculus (CAL),
Architectures (ARC), and Discrete Mathematics and
Logic (DML). In these figures, blue means that the
exam has not been passed (the grade is 0) and orange
means that the exam has been passed with a grade
between 18 and 30 (31 means 30 cum laude). As can be
seen, some courses, such as ADS, are organized in
such a way that most students in clusters 0 and 2 are
able to pass the corresponding exam, while two exams,
ARC and DML, are passed mainly by students in cluster
2 and therefore present some critical aspects.
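The comparison shown in these figures can also be expressed numerically as the fraction of students in each cluster with a non-zero grade in each exam. The sketch below assumes a hypothetical record layout (one dictionary per student with a `cluster` key and one key per exam) and uses invented grades; only the exam abbreviations and the convention that 0 means "not passed" come from the text.

```python
from collections import defaultdict

EXAMS = ("ADS", "PRG", "CAL", "ARC", "DML")


def pass_rates(students):
    """Fraction of students per cluster with a non-zero grade in each exam."""
    totals = defaultdict(int)
    passed = defaultdict(lambda: defaultdict(int))
    for s in students:
        totals[s["cluster"]] += 1
        for exam in EXAMS:
            if s[exam] > 0:  # grade 0 encodes "not passed"
                passed[s["cluster"]][exam] += 1
    return {c: {e: passed[c][e] / totals[c] for e in EXAMS} for c in totals}


# Invented records, for illustration only.
students = [
    {"cluster": 2, "ADS": 28, "PRG": 30, "CAL": 27, "ARC": 26, "DML": 31},
    {"cluster": 2, "ADS": 27, "PRG": 0, "CAL": 25, "ARC": 24, "DML": 0},
    {"cluster": 0, "ADS": 22, "PRG": 24, "CAL": 0, "ARC": 0, "DML": 0},
    {"cluster": 0, "ADS": 20, "PRG": 0, "CAL": 0, "ARC": 0, "DML": 0},
    {"cluster": 1, "ADS": 0, "PRG": 0, "CAL": 0, "ARC": 0, "DML": 0},
]
rates = pass_rates(students)
for cluster in sorted(rates):
    print(cluster, rates[cluster])
```

An exam whose pass rate is high in cluster 2 but low in cluster 0 would be a candidate for the kind of critical aspect discussed above.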
Figure 7 highlights the results of the self-assessment
test; however, this figure should be read together
with the Pearson correlation between the test grade
and, respectively, the number of credits and the
average grade: for the five years 2010-14, the
attributes credits_grade and test_grade show a
positive correlation of 0.49, while the attributes
avggrade and test_grade show a positive correlation
of 0.39. A more detailed analysis shows a particularly
marked positive correlation with the average grade of
CAL and DML, that is, the mathematics courses of the
first year. The self-assessment test is mainly
concerned with problems of logic, calculus and
probability, so the correlation between the
mathematics courses and the test is quite natural.
These facts are summarized in Table 7, which shows the
correlation between each of the attributes
credits_grade, avggrade, ADS, ARC, PRG, CAL, DML and
the attribute test_grade for the academic years from
2010 to 2014, for the three years 2010-12, which will
be examined in the next section, and, finally, for the
five years 2010-14.
University Student Progressions and First Year Behaviour