et al. (2017), where language deviations, one of the
criteria evaluated in Competence 1 of the ENEM
evaluation model, are detected based on a set of
predetermined linguistic rules. This system provides
valuable input for more complete approaches to
Competence 1 evaluation. Another work also based on
the ENEM model is that of Passero, Haendchen Filho,
and Dazzi (2016), which addresses Competence 2,
concerning deviation from the proposed theme, and
reports excellent results.
Júnior, Spalenza, and Oliveira (2017) presented a
framework based on machine learning and natural
language processing for the evaluation of Competence 1
of ENEM. The authors define a set of features specific
to Competence 1, along with several ways of refining
them, and train a machine learning model that achieves
good results on the Brazil School essay corpus.
Regarding the evaluation of textual coherence, several
works propose ways of measuring it. The TAACO system
(Crossley, Kyle, & McNamara, 2016) and Coh-Metrix
(Graesser, McNamara, & McCarthy, 2014) are reference
tools in this context. In addition, extensive research
has been carried out on more specific aspects of textual
coherence, such as coreference analysis and the use of
cohesive links for text summarization.
6 CONCLUSIONS AND FUTURE WORK
The automated analysis of textual cohesion presents
several challenges, mainly related to computing
features suitable for its characterization. The shortage
of data and tools for the Portuguese language
aggravates the situation, and more work on developing
and improving NLP tools for Portuguese is needed.
One of the contributions of this work is a corpus
of ENEM-based essays, made available ready to use
(download from <blind review>). This is relevant
because it extends research beyond the usual focus on
English. Furthermore, the work introduces a set of
textual cohesion features adapted to Portuguese. The
adaptation considered the linguistic differences
between English and Portuguese at the morphological
and syntactic levels. These publicly available
features can be explored with other machine learning
models for the same problem.
Regarding accuracy, the confusion matrix shows
that the best results were obtained for the dominant
classes, which together account for more than 80% of
the score occurrences. On the other hand, methods are
needed that assign scores close to the extremes with
greater precision.
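The per-class reading of a confusion matrix described above can be sketched as follows. This is a minimal illustration only: the labels and predictions are made-up values on a hypothetical five-level scale, not the paper's data, and the helper names are assumptions.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        # A class with no true examples gets recall 0.0 by convention here.
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Hypothetical data: classes 2 and 3 dominate, 0 and 4 are the extremes.
labels = [0, 1, 2, 3, 4]
y_true = [2, 2, 2, 2, 3, 3, 3, 3, 0, 4]
y_pred = [2, 2, 2, 3, 3, 3, 3, 2, 2, 3]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix)
for label, value in zip(labels, recall):
    print(f"class {label}: recall {value:.2f}")
```

In this toy example the dominant classes are recovered well while the extreme classes are never predicted, which is the pattern the paragraph above describes.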
The study also showed that applying balancing
techniques can yield gains in accuracy for true
positives.
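The text does not name the balancing technique at this point; the reference list includes SMOTE (Chawla et al., 2002), so the sketch below assumes a SMOTE-style oversampler. It is a simplified illustration of the idea (interpolating between a minority sample and one of its nearest minority neighbours), not the authors' implementation; the function name, parameters, and data are hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Simplified SMOTE-style procedure (assumes at least two minority points):
    pick a base point, pick one of its k nearest minority neighbours,
    and interpolate at a random position on the segment between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-dimensional feature vectors for a rare score class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Synthetic points always lie on a segment between two existing minority samples, so the minority class grows without simply duplicating essays.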
Suggested future work includes: (i) expanding and
improving the quality of the essay corpus; (ii)
evaluating other learning models based on neural
networks and deep learning; (iii) exploring lexical
cohesion; and (iv) comparing the results presented
here for Portuguese with those for other languages,
such as English.
REFERENCES
Avila, R. L. F.; Soares, J. M. (2013) Uso de técnicas de pré-
processamento textual e algoritmos de comparação
como suporte à correção de questões dissertativas:
experimentos, análises e contribuições. SBIE.
Chawla, N. V. et al. (2002) SMOTE: Synthetic Minority
Over-sampling Technique. Journal of Artificial
Intelligence Research 16:321–357.
https://www.jair.org/media/953/live-953-2037-jair.pdf
Crossley, S. A., Kyle, K., & McNamara, D. S. (2016) The
tool for the automatic analysis of text cohesion
(TAACO): Automatic assessment of local, global, and
text cohesion. Behavior Research Methods 48(4).
INEP - Instituto Nacional de Estudos e Pesquisas
Educacionais Anísio Teixeira (2017) Redação no
ENEM 2017: Cartilha do participante.
http://download.inep.gov.br/educacao_basica/enem/guia_participante/2017/manual_de_redacao_do_enem_2017.pdf
Géron, A. (2017) Hands-On Machine Learning with Scikit-
Learn and TensorFlow. O’Reilly.
Graesser, A. C.; McNamara, D. S.; McCarthy, P. M. (2014)
Automated Evaluation of Text and Discourse with
Coh-Metrix. Cambridge University Press.
Halliday, M. A. K.; Hasan, R. (1976) Cohesion in
English. Routledge.
Joachims, T. (2005) Text categorization with support vector
machines: learning with many relevant features.
European Conference on Machine Learning.
Júnior, C. R. C. A.; Spalenza, M. A.; Oliveira, E. (2017)
Proposta de um sistema de avaliação automática de
redações do ENEM utilizando técnicas de
aprendizagem de máquina e processamento de
linguagem natural. Computer on the Beach.
https://siaiap32.univali.br/seer/index.php/acotb/article/view/10592
Klein, R.; Fontanive, N. (2009) Uma nova maneira de
avaliar as competências escritoras na Redação do
ENEM. Ensaio: Avaliação e Políticas Públicas em