eral Web pages and load it into a DW, allowing the
data to be analyzed from different perspectives.
The proposed process was successfully applied to
the case study on athletics event results. This case
study made available a DW with data about the
results of athletic events held in Portugal in the last
12 years. The results are integrated with data about
the geographic location and atmospheric conditions
in which the competitions took place.
Based on the BI analysis of the information stored
in the DW, some conclusions have already been drawn,
and further conclusions may follow. The process
is being used successfully: the information was
loaded into the DW an initial time and has since been
updated a few times. The process is prepared to run
cyclically, detecting new files and then handling and
loading the new data to update the DW. Additionally,
the process is prepared to be used in other contexts;
nevertheless, the PositionParser module will need to be
adapted to deal with new types of metadata. The
proposed process can be applied in other projects, such
as the analysis of curriculum information, student
grades, annual reports or scientific articles.
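The cyclic step of detecting new files can be illustrated with a minimal sketch. It assumes that source files are PDF files collected in a local folder and that a file counts as new when the MD5 digest of its content has not been seen before. The text above does not specify how new files are recognized, so this criterion is an assumption, and the names `file_digest`, `find_new_files` and `seen_digests` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the MD5 hex digest of a file's content, read in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_new_files(folder: Path, seen_digests: set[str]) -> list[Path]:
    """Return the PDF files in `folder` whose content was not seen before.

    Assumption: "new" means the content digest is absent from
    `seen_digests`. The set is updated in place, so calling the function
    again on each cycle returns only files added since the previous call.
    """
    new_files = []
    for path in sorted(folder.glob("*.pdf")):
        digest = file_digest(path)
        if digest not in seen_digests:
            seen_digests.add(digest)
            new_files.append(path)
    return new_files
```

In each cycle, the files returned by `find_new_files` would be passed on to the extraction and loading steps. Persisting `seen_digests` between runs (for example in a table of the DW staging area) would be needed in practice and is left out of this sketch.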
In the future, the data warehouse will continue to
be updated with new files containing new competition
results. The PositionParser module, created to
recognize and extract tables from PDF files, may be
improved by adding a proper graphical user interface.
Extraction and Multidimensional Analysis of Data from Unstructured Data Sources: A Case Study