Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests

Karina Wiechork; Andrea Schwertner Charão

Topics: Coupling and Integrating Heterogeneous Data Sources; Subjective Databases

In Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS, 359-366, 2021

Authors: Karina Wiechork ¹,² and Andrea Schwertner Charão ¹

Affiliations: ¹ Department of Languages and Computer Systems, Federal University of Santa Maria, Santa Maria, Brazil; ² Information Technology Coordination, Federal Institute of Education Science and Technology Farroupilha, Frederico Westphalen, Brazil

Keyword(s): Dataset Collection, Ground Truth, Performance Evaluation, PDF Extraction Tools.

Abstract: The massive production of documents in Portable Document Format (PDF) has motivated research on the automated extraction of data contained in these files. This work focuses mainly on extraction from natively digital PDF documents made available in large repositories of educational exams. To this end, the educational tests applied in Enade were collected automatically using scripts developed with Scrapy. The files used for the evaluation comprise 343 tests, with 11,196 objective and discursive questions, 396 answers, and 14,475 alternatives extracted from the objective questions. The Aletheia tool was used to construct the ground truth for the tests. The extractions relied on existing tools for PDF data extraction: Excalibur and Tabula for tabular data (the answers), CyberPDF and PDFMiner for textual content (the questions), and Aletheia and ExamClipper for regions of interest (cutouts of the questions). The results point to some limitations related to the diversity of layouts across the years of application. The extracted data provide useful information for a wide variety of fields, including academic research and support for students and teachers.
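The abstract mentions comparing tool output against a ground truth but does not state which metric was used. As a purely illustrative sketch, one way to score extracted question text against ground-truth text is a character-level similarity ratio from Python's standard `difflib` module. The function names (`normalize`, `similarity`, `score_tool`), the question-id keys, and the choice of metric are all assumptions made for this example, not the authors' evaluation procedure.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs so layout differences are not counted as errors."""
    return " ".join(text.split())


def similarity(extracted: str, ground_truth: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(extracted), normalize(ground_truth)).ratio()


def score_tool(extractions: dict[str, str], truth: dict[str, str]) -> float:
    """Average similarity over all ground-truth question ids.

    Hypothetical convention: a question the tool failed to extract
    is treated as an empty string and therefore scores 0.
    """
    if not truth:
        return 0.0
    total = sum(similarity(extractions.get(qid, ""), text) for qid, text in truth.items())
    return total / len(truth)
```

Normalizing whitespace before comparison is a design choice that keeps line-wrapping differences between tools from dominating the score; a stricter evaluation could compare the raw strings instead.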

CC BY-NC-ND 4.0

Paper citation in several formats:

Wiechork, K. and Charão, A. (2021). Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests. In Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS; ISBN 978-989-758-509-8; ISSN 2184-4992, SciTePress, pages 359-366. DOI: 10.5220/0010524503590366

@conference{iceis21,
author={Karina Wiechork and Andrea Schwertner Charão},
title={Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests},
booktitle={Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2021},
pages={359-366},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010524503590366},
isbn={978-989-758-509-8},
issn={2184-4992},
}

TY - CONF
JO - Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests
SN - 978-989-758-509-8
IS - 2184-4992
AU - Wiechork, K.
AU - Charão, A.
PY - 2021
SP - 359
EP - 366
DO - 10.5220/0010524503590366
PB - SciTePress