the IG by partitioning entropy, causing high entropy
partitioning (a large number of small partitions) to be
penalized. GR is defined by: $GR(A) = \frac{IG(A)}{E(A)}$, where
IG(A) is information gain and E is the entropy.
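As an illustration of the GR definition, the following minimal sketch computes IG and GR for discrete attributes. It assumes that E(A) is the Shannon entropy of the attribute's value distribution and that IG(A) is the class entropy minus the class entropy conditioned on A. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(attribute, labels):
    """IG(A) = E(C) - E(C | A) for a discrete attribute A and class C."""
    total = len(labels)
    conditional = 0.0
    for value in set(attribute):
        subset = [c for a, c in zip(attribute, labels) if a == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(attribute, labels):
    """GR(A) = IG(A) / E(A); returns 0.0 for a constant attribute (assumption)."""
    split_entropy = entropy(attribute)
    if split_entropy == 0.0:
        return 0.0
    return information_gain(attribute, labels) / split_entropy


# Hypothetical toy data: both attributes have the same IG,
# but the second one splits the data into more, smaller partitions.
labels = ["low", "low", "high", "high"]
coarse = ["a", "a", "b", "b"]  # two partitions
fine = ["a", "b", "c", "d"]    # one partition per instance

print(gain_ratio(coarse, labels))
print(gain_ratio(fine, labels))
```

In this toy case both attributes reach the same IG, yet the attribute with one partition per instance receives the lower GR, which is the penalization described above.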
3.4 Symmetric Uncertainty
Symmetric uncertainty (SU) (Yu and Liu, 2003) is
a nonlinear correlation measure developed with the
same purpose as GR, that is, an attempt to normalize
the IG of an attribute A with the class C. SU is defined
by: $SU(A) = 2 \cdot \frac{IG(A)}{E(A) + E(C)}$, where IG(A) is
information gain and E is the entropy.
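The SU definition can be sketched in a few lines. This example assumes discrete attributes and computes IG(A) through the identity IG(A) = E(A) + E(C) − E(A, C), where E(A, C) is the joint entropy of attribute and class. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete (hashable) values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def symmetric_uncertainty(attribute, labels):
    """SU(A) = 2 * IG(A) / (E(A) + E(C)), a value between 0 and 1."""
    e_a = entropy(attribute)
    e_c = entropy(labels)
    if e_a + e_c == 0.0:
        return 0.0  # assumption: both variables constant, nothing to measure
    # IG(A) written as mutual information: E(A) + E(C) - E(A, C)
    ig = e_a + e_c - entropy(list(zip(attribute, labels)))
    return 2.0 * ig / (e_a + e_c)


# Hypothetical toy data
labels = ["low", "low", "high", "high"]
informative = ["a", "a", "b", "b"]  # determines the class completely
unrelated = ["a", "b", "a", "b"]    # independent of the class

print(symmetric_uncertainty(informative, labels))
print(symmetric_uncertainty(unrelated, labels))
```

Because the denominator sums both entropies, the score does not change if the roles of attribute and class are swapped, which is what makes the measure symmetric.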
3.5 Pearson Correlation
Pearson Correlation (PC) (Hall, 1998), also known as
the linear correlation coefficient, measures the degree of
linear correlation between two quantitative (metric-scale)
attributes. It takes values between −1 (perfect negative,
or inverse, correlation) and 1 (perfect positive
correlation). A correlation coefficient near zero
indicates no linear relationship between the attributes.
The PC is given by: $PC(A) = \frac{Cov(X, Y)}{\sqrt{Var(X) \cdot Var(Y)}}$, where
Cov is the covariance between the two attributes and
Var is the variance of each attribute. To calculate the
correlation for qualitative attributes, their values are
first converted into binary data.
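A minimal sketch of the PC computation follows. The text does not detail how qualitative attributes are converted into binary data, so the one-versus-rest indicator used in `binarize` is an assumption, as are the function names and the toy data.

```python
from math import sqrt


def pearson(x, y):
    """Cov(X, Y) / sqrt(Var(X) * Var(Y)) for two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # assumption: a constant attribute is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def binarize(values, category):
    """One-versus-rest indicator for a qualitative attribute (assumed encoding)."""
    return [1 if v == category else 0 for v in values]


# Hypothetical toy data
hours = [1, 2, 3, 4]
grade_up = [2, 4, 6, 8]
grade_down = [8, 6, 4, 2]
modality = ["online", "f2f", "online", "f2f"]

print(pearson(hours, grade_up))     # close to 1: positive linear relationship
print(pearson(hours, grade_down))   # close to -1: inverse relationship
print(pearson(binarize(modality, "online"), hours))
```

With an attribute that has more than two categories, this encoding yields one indicator per category, and how the resulting coefficients are combined is a further design choice not covered here.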
3.6 Relief F
Over the years, a Relief extension called Relief F
(Kononenko, 1994; Kira and Rendell, 1992; Ur-
banowicz et al., 2018) has been developed, aiming to
improve the original algorithm by estimating proba-
bilities more reliably. It handles multiclass and in-
complete datasets, while the complexity remains the
same. It is calculated using a function W defined by:
$W(A) = W(A) - \frac{\mathrm{diff}(A, R_i, H)}{m} + \frac{\mathrm{diff}(A, R_i, M)}{m}$, where A
is the attribute, W(A) is a vector with each attribute
score, $R_i$ is the target instance, H is the closest
instance of the same class, M is the closest instance of
a different class, m is the number of random instances
selected to be part of the calculation, and the function
diff calculates the difference between the values of
attribute A in two instances.
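The weight update can be sketched as follows. This is a simplified illustration that follows the formula as written, with a single nearest hit H and a single nearest miss M per target instance. It does not implement the multiclass averaging or missing-value handling of the full Relief F. It assumes numeric attributes, a range-normalized diff, and a caller-supplied list of sampled indices instead of random sampling so that the result is reproducible. All names and the toy data are hypothetical.

```python
def diff(a, x, y, ranges):
    """Range-normalized difference between instances x and y on attribute a."""
    return abs(x[a] - y[a]) / ranges[a] if ranges[a] else 0.0


def distance(x, y, ranges):
    """Sum of per-attribute differences, used to find the nearest neighbors."""
    return sum(diff(a, x, y, ranges) for a in range(len(x)))


def relief_weights(data, labels, sample_indices):
    """Apply W(A) = W(A) - diff(A, R_i, H)/m + diff(A, R_i, M)/m for each sampled R_i.

    Assumes every class has at least two instances and at least two classes exist.
    """
    n_attrs = len(data[0])
    ranges = [max(row[a] for row in data) - min(row[a] for row in data)
              for a in range(n_attrs)]
    m = len(sample_indices)
    weights = [0.0] * n_attrs
    for i in sample_indices:
        target = data[i]
        same = [row for j, row in enumerate(data) if j != i and labels[j] == labels[i]]
        other = [row for j, row in enumerate(data) if labels[j] != labels[i]]
        hit = min(same, key=lambda row: distance(target, row, ranges))
        miss = min(other, key=lambda row: distance(target, row, ranges))
        for a in range(n_attrs):
            weights[a] += (diff(a, target, miss, ranges)
                           - diff(a, target, hit, ranges)) / m
    return weights


# Hypothetical toy data: attribute 0 separates the classes, attribute 1 does not.
data = [[0.0, 0.0], [0.1, 1.0], [0.9, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]

weights = relief_weights(data, labels, sample_indices=[0, 1, 2, 3])
print(weights)
```

An attribute gains weight when it differs on the nearest miss and loses weight when it differs on the nearest hit, so in the toy data the class-separating attribute ends with a positive score and the other with a negative one.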
4 METHODOLOGY
In this section, we present the methodology used for
this study. We emphasize that the work is supported
by the KDD process, which comprises five stages.
1. Selection. This work uses the ENADE 2018 microdata,
which comprise 548,127 instances and 137 attributes
of numeric or character type. The attributes cover,
among others, institution and course information,
student information, the number of items in the
objective part, types of presence (participant present,
absent, or canceled test), the test perception
questionnaire, and the student questionnaire. The
original database was divided into online students
(96,927 instances) and F2F students (451,200
instances). After analyzing all database attributes, we
focus on the personal and socioeconomic aspects and
the participant’s course. At this point, 23 attributes
were kept in each database¹.
2. Preprocessing. The first preprocessing operation
was a filter that kept only those participants who had
actually taken the test, which removed 32,285
participants from the online modality and 115,765 F2F
students. The removal criteria were: absent candidates,
candidates with a blank test in the objective and
discursive parts of general education, candidates with
a blank test in the objective and discursive parts of
the specific component, and participation with a result
disregarded by the Applicator. The second step checked
for null or incomplete data, including blank test grades
and blank parts of the questionnaire, and excluded 15
online cases and 103 F2F cases. After preprocessing, the
online database had 64,627 instances and the F2F
database had 335,332.
3. Transformation. The first operation was to rename
the attributes. At this stage, the 23 attributes had
names referring to the student questionnaire item
number (QE_I01 to QE_In). The nominal values of the
attributes (A, B, etc.) were also renamed; for example,
the values of the father’s level of schooling became
(None, Elementary 1, Elementary 2, High school,
Undergraduate, Graduate).
The courses were also grouped by primary area,
following the tables provided by CNPq and CAPES,
Brazilian funding agencies. The ENADE exam is applied
to each set of courses every three years, so not all
courses took the test in 2018. The scores obtained by
the candidates were also discretized into three
frequency categories (low, medium, and high
performance), keeping the original distribution. The
discretized performance grades for online students
are: Low (≤ 30), Medium (30 < grade ≤ 60)
¹ The original database and the complete list of attributes are available at https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enade
ICEIS 2022 - 24th International Conference on Enterprise Information Systems