to study for other tests, using them as an interesting object
of study to learn and retain new knowledge. In addition, the
questions form a set of interesting material that teachers can
use in the classroom to facilitate understanding and to assign
as exercises. A teacher may, for instance, keep a database of
questions and answers from which new tests can be generated.
Another important aspect is the possibility of creating such a
database from the extracted questions. The objective of this
work was not to make the extracted questions available in
databases or systems, but this remains a suggestion for future work.
ICEIS 2021 - 23rd International Conference on Enterprise Information Systems