performance in courses that have different objectives from ours but employ a similar methodology.
The work proposed in (Jha et al., 2019) presents a predictive analysis of the performance of online course students. The authors compare the performance of Machine Learning algorithms using different sets of features. The techniques explored were Distributed Random Forest (DRF), Gradient Boosting Machine (GBM), Deep Learning (DL), and Generalized Linear Model (GLM). Their proposed methodology uses 50 features, of which 8 refer to demographic information, which made this type of information less likely to stand out from the rest. The authors note that these demographic features, such as the student’s gender, age, and region, were not very relevant in their context compared to other information, such as the student’s interactions in the virtual environments or the student’s assessment scores. When evaluating the usage of the demographic features, they pointed out that the Area Under the Curve (AUC) obtained with all 50 features was about 0.01 greater than the AUC obtained after discarding the 8 demographic features.
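For illustration, a feature-ablation comparison of this kind can be reproduced with a sketch such as the one below, which trains the same model with and without a demographic feature subset and compares the resulting test-set AUC. The file name, column names, and target label are hypothetical assumptions, not the actual setup of (Jha et al., 2019).

# Hedged sketch of a feature-ablation AUC comparison (not the exact
# pipeline of Jha et al., 2019): train the same classifier with and
# without a demographic feature subset and compare the test AUC.
# Assumes all features are already numerically encoded.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("mooc_students.csv")          # hypothetical dataset
demographic = ["gender", "age", "region"]      # hypothetical subset
target = "passed"                              # hypothetical label

X, y = df.drop(columns=[target]), df[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

def auc_with(columns):
    """Fit a GBM on the given columns and return the test-set AUC."""
    clf = GradientBoostingClassifier().fit(X_train[columns], y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test[columns])[:, 1])

all_features = list(X.columns)
without_demo = [c for c in all_features if c not in demographic]
print("AUC with all features:   ", auc_with(all_features))
print("AUC without demographics:", auc_with(without_demo))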
Our work, similar to what was done in (Jha et al., 2019), analyzes the usage of demographic information to predict the student’s performance, but we focus solely on this type of information instead of using extra information such as the grade on a specific test. The student’s performance on a single test would be extremely valuable for our model, but it would also make the model less useful, since it could only be applied after the exam, when the grades are already being published. We believe that our model’s greatest value lies in being used before the exam, when schools can still take action to try to help the students. The model proposed by (Jha et al., 2019), in contrast, is only applicable after the student has already spent a considerable amount of time in the course, so it cannot help the student early on. Moreover, the authors do not clarify which features are present in the final model, so it is not clear which factors have a greater impact on the student’s performance. Since our work focuses on understanding which socioeconomic features most influence the student’s performance, we have chosen a technique that can easily estimate these probabilities; this way, any school that wishes to compute the probability of a particular student achieving high performance in the exam can do so with relative ease.
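As a minimal sketch of how such per-student probabilities could be obtained from socioeconomic answers alone, the snippet below uses logistic regression as one possible probability-producing technique; the file name, the feature columns, and the high-performance label are hypothetical placeholders and are not the features or the model described in this paper.

# Hedged sketch: estimating the probability of high exam performance
# from socioeconomic features only. Logistic regression is used here
# purely as an illustrative probability-producing technique; column
# names and the 'high_performance' label are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("students.csv")                 # hypothetical file
X = df[["parent_schooling", "family_income",     # hypothetical
        "school_type", "region"]]                # socioeconomic columns
y = df["high_performance"]                       # 1 = high exam grade

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),      # categorical answers
    LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Probability of high performance for each student in the test split,
# available before the exam takes place.
probs = model.predict_proba(X_test)[:, 1]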
The EDM application proposed in (González-Marcos et al., 2019) analyzes the academic performance of students in the fourth year of the Bachelor’s degree in Mechanical Engineering and students in the first year of the master’s degree in Industrial Engineering. In their work, they gathered data related to communication, time, resources, information, documentation, and behavioral assessment, as well as the grades in the first half of the course, and used them as predictive features for their model. The authors discuss the possibility of using the model to identify “weaker” students, those with a higher risk of not finishing the course, so that action may be taken to address the situation before the student withdraws or underperforms.
The work proposed by (Stearns et al., 2017) ana-
lyzes data from the ENEM exam applied in 2014 to
predict the student’s final grade on the math exam.
The authors used two regression techniques based on
Decision Trees, testing the algorithms AdaBoost and
Gradient Boosting. In their experiments, the Gradi-
ent Boosting algorithm had the best performance with
an R² of 35%, meaning that 35% of the final grade variability could be explained by the proposed model. Although their model did not achieve high predictive capability, their results show that socioeconomic features help to explain the student’s performance on the math exam; however, the authors do not discuss which specific features they used.
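For reference, this reading follows the standard definition of the coefficient of determination,
\[
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2},
\]
where $y_i$ are the observed math grades, $\hat{y}_i$ the model predictions, and $\bar{y}$ the mean observed grade; an $R^2$ of 0.35 therefore means that the model accounts for 35% of the variance of the final grade around its mean.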
In their work, (de Castro Rodrigues et al., 2019) explore the data from the 2017 ENEM exam. They analyze how family income relates to the other features in their dataset, leading to an initial selection of 48 features chosen by how strongly they relate to family income. Their final selection consists of six
features: Schooling of the father or male guardian;
Schooling of the mother or female guardian; Has a
computer in their residence; Occupation of the father
or male guardian; Occupation of the mother or female
guardian; Took the exam seeking a scholarship.
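A minimal sketch of this kind of income-driven feature selection is shown below; the correlation measure, the file and column names, and the cut-off of six features are illustrative assumptions, not the exact procedure used by the authors.

# Hedged sketch of selecting features by their association with family
# income (illustrative only; not the exact selection procedure of
# de Castro Rodrigues et al., 2019). Assumes ordinally encoded answers.
import pandas as pd

df = pd.read_csv("enem_2017.csv")                # hypothetical file
income = "family_income"                         # hypothetical column

# Rank the remaining numeric features by absolute Spearman correlation
# with family income and keep the six strongest ones.
corr = (df.drop(columns=[income])
          .select_dtypes("number")
          .corrwith(df[income], method="spearman")
          .abs()
          .sort_values(ascending=False))
selected = corr.head(6).index.tolist()
print(selected)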
Their model then predicted whether the student would get a final grade of at least 550 since, according to the authors, that grade would be good enough for the student to get into a public university. They employed the K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naïve Bayes approaches. In their tests, the
ANN approach achieved the best discriminatory re-
sults, with an accuracy of 99%. Furthermore, to look
for unknown patterns and rules in the dataset, they
applied a rule-based Data Mining method, and one
of the rules they found was that, in a certain region,
students who did not repeat a year in high school had
a final grade greater than 450. However, the authors
do not make it clear why they started with a selection
based on the student’s family income, and they also
do not explore the difference in importance between
the features of the final model. When comparing the
AUC achieved by each of their approaches, it is interesting that the KNN algorithm obtained the best result, with 97.5%, followed by the Naïve Bayes approach, which achieved 87.5%, and the ANN approach, achieving only