Authors:
Pedro Curto (1); Nuno Mamede (1) and Jorge Baptista (2)
Affiliations:
(1) Universidade de Lisboa and INESC-ID Lisboa/L2F – Spoken Language Lab, Portugal
(2) Universidade de Lisboa and Universidade do Algarve, Portugal
Keyword(s):
Readability, Readability Assessment Metrics, Automatic Readability Classifier, Linguistic Features Extraction, Portuguese.
Related Ontology Subjects/Areas/Topics:
Computer-Supported Education; Information Technologies Supporting Learning; Learning/Teaching Methodologies and Assessment; Metrics and Performance Measurement
Abstract:
This paper describes a system that assists in selecting adequate reading materials to support the teaching of European Portuguese, especially as a second language, and highlights the key challenges in selecting linguistic features for text difficulty (readability) classification. The system uses existing Natural Language Processing (NLP) tools to extract linguistic features from texts, which are then used by an automatic readability classifier. Currently, 52 features are extracted: parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features. A classifier was created using these features and a corpus previously annotated by readability level according to the official five-level language classification standard for Portuguese as a Second Language. In the five-level scenario (A1 to C1), the best-performing learning algorithm (LogitBoost) achieved an accuracy of 75.11% with a root mean square error (RMSE) of 0.269. In the three-level scenario (A, B and C), the best-performing learning algorithm (C4.5 grafted) achieved an accuracy of 81.44% with an RMSE of 0.346.