Authors:
Karina Wiechork (1, 2) and Andrea Schwertner Charão (1)
Affiliations:
(1) Department of Languages and Computer Systems, Federal University of Santa Maria, Santa Maria, Brazil
(2) Information Technology Coordination, Federal Institute of Education, Science and Technology Farroupilha, Frederico Westphalen, Brazil
Keyword(s):
Dataset Collection, Ground Truth, Performance Evaluation, PDF Extraction Tools.
Abstract:
The massive production of documents in Portable Document Format (PDF) has motivated research on the automated extraction of data contained in these files. This work focuses on extraction from natively digital PDF documents made available in large repositories of educational exams. To this end, the educational tests applied in Enade were collected automatically using scripts developed with Scrapy. The files used for the evaluation comprise 343 tests, with 11,196 objective and discursive questions, and 396 answers, with 14,475 alternatives extracted from the objective questions. The ground truth for the tests was built with the Aletheia tool. Existing PDF extraction tools were used for three tasks: tabular data extraction, with Excalibur and Tabula, to extract the answers; textual content extraction, with CyberPDF and PDFMiner, to extract the questions; and region-of-interest extraction, with Aletheia and ExamClipper, to produce cutouts of the questions. The extraction results reveal some limitations related to the diversity of layouts across the years in which the exam was applied. The extracted data provide useful information in a wide variety of fields, including academic research and support for students and teachers.