Authors:
Cristiano Cortez da Rocha (1); Márcio Parise Boufleur (1); Leandro da Silva Fornasier (1); Júlio César Narciso (2); Andrea Schwertner Charão (3); Vinícius Maran (3); João Carlos D. Lima (3) and Benhur O. Stein (3)
Affiliations:
(1) Centro de Informática e Automação do Estado de Santa Catarina (CIASC), Brazil
(2) Secretaria de Estado da Fazenda de Santa Catarina, Brazil
(3) Universidade Federal de Santa Maria (UFSM), Brazil
Keyword(s):
Large Database, Query Performance, Data Management, Business-critical Data.
Related Ontology Subjects/Areas/Topics:
Data Engineering; Databases and Data Security; Databases and Information Systems Integration; Enterprise Information Systems; Large Scale Databases; Performance Evaluation and Benchmarking
Abstract:
Hadoop clusters have established themselves as a foundation for various applications and experiments in high-performance processing of large datasets. In this context, SQL-on-Hadoop has emerged as a trend that combines the popularity of SQL with the performance of Hadoop. In this work, we analyze the performance of SQL queries on Hadoop using the Impala engine and compare it with an RDBMS-based approach. The analysis focuses on a large set of electronic invoice data, an important application that supports fiscal audit operations. The experiments cover queries that are frequent in this context, implemented with and without data partitioning in both the RDBMS and Impala/Hadoop. The results show speedups from 2.7x to 14x with Impala/Hadoop for the queries considered, on a lower-cost hardware/software platform.