SQL Query Performance on Hadoop: An Analysis Focused on Large
Databases of Brazilian Electronic Invoices
Cristiano Cortez da Rocha¹, Márcio Parise Boufleur¹, Leandro da Silva Fornasier¹,
Júlio César Narciso², Andrea Schwertner Charão³, Vinícius Maran⁴,
João Carlos D. Lima³ and Benhur O. Stein³
¹Centro de Informática e Automação do Estado de Santa Catarina (CIASC), Florianópolis, SC, Brazil
²Diretoria de Administração Tributária, Secretaria de Estado da Fazenda de Santa Catarina, Florianópolis, SC, Brazil
³Departamento de Linguagens e Sistemas de Computação, Programa de Pós-Graduação em Informática,
Universidade Federal de Santa Maria (UFSM), Santa Maria, RS, Brazil
⁴Coordenadoria Acadêmica, Universidade Federal de Santa Maria (UFSM), Cachoeira do Sul, RS, Brazil
Keywords:
Large Database, Query Performance, Data Management, Business-critical Data.
Abstract:
Hadoop clusters have established themselves as a foundation for various applications and experiments in the field of high-performance processing of large datasets. In this context, SQL-on-Hadoop emerged as a trend that combines the popularity of SQL with the performance of Hadoop. In this work, we analyze the performance of SQL queries on Hadoop, using the Impala engine, comparing it with an RDBMS-based approach. The analysis focuses on a large set of electronic invoice data, representing an important application to support fiscal audit operations. The experiments included queries that are frequent in this context, implemented with and without data partitioning in both RDBMS and Impala/Hadoop. The results show speedups from 2.7x to 14x with Impala/Hadoop for the queries considered, on a lower-cost hardware/software platform.
1 INTRODUCTION
Large-scale analytical data processing has spread across a variety of industries and areas of interest, guided by low storage costs and the ability to collect large volumes of business-critical data. Providing such data to engineers and analysts has become increasingly mandatory and requires low response times, which represent a key factor in data mining, monitoring, prototyping strategies, simulations, and other common tasks in this field (Melnik et al., 2010).
Consequently, the processing of large datasets requires state-of-the-art high-performance computing solutions. In this context, Hadoop (White, 2012) and its related tools have established themselves as solutions for managing and performing robust analyses of unstructured data, as well as structured and semi-structured data processing. The processing of SQL queries in particular has gained significant attention, since the data management systems of many institutions rely on SQL, and many corporate users are familiar and comfortable with this language (Floratou et al., 2014). With the evolution of the Hadoop ecosystem, solutions have emerged to process SQL queries on Hadoop, seeking to extend its benefits to structured and semi-structured data (Kornacker et al., 2015).
For government institutions, the processing power and storage savings made possible by the Hadoop ecosystem offer new options for IT infrastructure optimization and application development. In addition, governments expect such applications to improve services to citizens, as well as to address major government challenges, such as the economy, revenue collection, job creation, health and natural disasters (Kim et al., 2014).
Tax evasion is mostly practiced by taxpayers to reduce their tax liability. This illegal action usually misrepresents financial facts to government and tax authorities through false tax reporting, such as declaring less income, less profit, and exaggerated costs (Rad and Shahbahrami, 2016).
According to (Allingham and Sandmo, 1972), the
probability of practicing tax evasion is directly related
to the probability of its detection, so governments are
constantly seeking more efficient and effective ways
to deal with this problem. Another challenge faced
Cortez da Rocha, C., Parise Boufleur, M., da Silva Fornasier, L., César Narciso, J., Schwertner Charão, A., Maran, V., D. Lima, J. and O. Stein, B.
SQL Query Performance on Hadoop: An Analysis Focused on Large Databases of Brazilian Electronic Invoices.
DOI: 10.5220/0006690400290037
In Proceedings of the 20th International Conference on Enterprise Information Systems (ICEIS 2018), pages 29-37
ISBN: 978-989-758-298-1
Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
by the tax authorities is the turnover of companies. According to data from the Brazilian Institute of Geography and Statistics (in Portuguese, IBGE) (IBGE, 2012), about 25% of companies close within two years of life; that is, one in four companies closes before two years of existence. In many cases, the closure of the business register is fraudulent, leaving behind an uncollected tax liability.
Since 2007, with the institution of the Public System of Digital Bookkeeping (in Portuguese, SPED), documents and fiscal books, previously filed manually on paper, have been migrating to electronic documents, enabling: (i) faster identification of tax offenses; (ii) faster access to information; and (iii) greater effectiveness in the supervision of operations through data crossing and electronic audit (RFB, 2010). The most commonly used electronic tax document is the electronic invoice, which documents sales operations.
In this context, Big Data technologies can provide a quick and efficient selection of companies with fiscal irregularities, since the work of a tax auditor involves the analysis of a large amount of fiscal data and reports. Thus, it is possible to provide a rapid response to the Brazilian tax reality (Paula et al., 2016; Abrantes and Ferraz, 2016).
In this paper, we aim to contribute new evidence to this context, considering both technological and business aspects. To this end, a comparative analysis of two approaches to processing large volumes of fiscal data, more specifically electronic invoices, was performed through SQL queries executed on an RDBMS and on Hadoop, using Impala (Kornacker et al., 2015) as a parallel query processing engine.
The rest of this paper is organized as follows: Section 2 presents the relevant concepts about the Hadoop ecosystem and the application under consideration. Related work is discussed in Section 3. A more detailed characterization of the application is presented in Section 4. In Section 5, experimental studies are reported and, finally, Section 6 presents final considerations and perspectives for future work.
2 CONCEPTUAL FOUNDATION
2.1 Apache Hadoop
Apache Hadoop is an open-source framework that supports data-intensive, distributed, and fault-tolerant applications (White, 2012). Its aim is to provide data and processing abstractions that facilitate queries over large, dynamic, and rapidly growing datasets. The main modules that make up the Hadoop framework are as follows:
Hadoop Distributed File System (HDFS): inspired by the Google File System, this module is a distributed file system used as distributed storage for the data; furthermore, it provides access to the data with high throughput (Shvachko et al., 2010).
Hadoop YARN (MRv2): this module is responsible for job scheduling and for managing the cluster resources (Vavilapalli et al., 2013).
Hadoop MapReduce: originally Google's MapReduce (Dean and Ghemawat, 2008), this module is a system, based on YARN, for the parallel processing of the data.
One of the main aspects that characterize Hadoop is that HDFS is highly tolerant to hardware failures. Indeed, it is able to automatically handle and resolve such events. Therefore, it enables the deployment of lower-cost clusters, since it can use commodity hardware. Furthermore, through the interaction among the nodes of the cluster, HDFS is able to manage the data, for instance, to rebalance it.
The processing of the data stored on HDFS is performed by the MapReduce framework. MapReduce was designed to address the challenges of large-scale computing in the context of long-running batch jobs, and it splits the tasks to be completed across the nodes of the cluster.
2.2 SQL Language in Hadoop Ecosystem
Hadoop has emerged as a solution for distributed processing of large, unstructured datasets. More recently, the ecosystem of solutions around Hadoop has expanded with the emergence of solutions for structured data processing and SQL queries. Usually, such solutions translate a given SQL query into several MapReduce jobs.
Each job applies a different set of operators to different sets of data. Fast query responses enable interactive exploration and fine-tuning of analytic queries, rather than the long batch jobs traditionally associated with SQL-on-Hadoop technologies. However, performing interactive data analysis at scale demands a high degree of parallelism.
In (Chen et al., 2014), the performance of five solutions for executing SQL queries on Hadoop was analyzed. The experiments employ queries derived from the TPC-DS benchmark and indicate that pioneering solutions, such as Hive, may not perform satisfactorily for many applications. In addition, the results encourage further research on this topic.
Impala (Kornacker et al., 2015) is part of a new generation of SQL query engines that integrate with the Hadoop ecosystem. It is a modular solution, which uses a variety of components (Metastore, HDFS, HBase, YARN and Sentry, for example). Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. Therefore, Impala's goal is to combine the familiarity of the SQL language and the multiuser performance of traditional databases with Hadoop's scalability and flexibility. Impala does not have to translate a SQL query into another processing framework, such as the map/shuffle/reduce operations on which Hive and Apache Pig depend today. As a result, Impala does not suffer the latencies that those operations impose.
One of the components that can be coupled to Impala is Apache Parquet (Melnik et al., 2010), a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It is a format optimized for large blocks of data (tens, hundreds or thousands of megabytes). According to the performance analysis in (Kornacker et al., 2015), Parquet offers the best compression and scan performance among storage formats for the Hadoop ecosystem (plain text, RC, Avro and Sequence).
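As an illustration of this coupling, Impala lets the storage format be chosen per table at creation time. The sketch below uses hypothetical table and column names, since the actual NF-e schema is confidential:

```sql
-- Hypothetical sketch: storing NF-e items in Parquet via Impala DDL.
-- Table and column names are illustrative, not the real SEF-SC schema.
CREATE TABLE nfe_item_parquet (
  nfe_id        BIGINT,
  product_desc  STRING,
  unitary_value DECIMAL(12,2)
)
STORED AS PARQUET;

-- An existing text-format staging table can be converted in one pass:
INSERT INTO nfe_item_parquet
SELECT nfe_id, product_desc, unitary_value
FROM nfe_item_staging;
```

The queries themselves do not change after such a conversion; only the underlying storage format does.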
2.3 Brazilian Electronic Invoice
Brazil is a country that has had to deal with major bouts of tax evasion in the past. The reasons are many and include, among others, stringent labor laws, high interest rates and hefty taxes. Brazilian business owners are quick to point out that the fiscal burden imposed by the government has led to international smuggling and rampant tax evasion.
One aspect of new Brazilian laws passed in 2005 is to better enable the Brazilian government to combat tax evasion by requiring a process of digital invoicing for the sale of goods or services. In this context, the Brazilian Electronic Invoice (in Portuguese, NF-e) aims to implement a national electronic tax document model to gradually replace the model still in force, which is based on paper issuing (RFB, 2010). It seeks to simplify processes related to ancillary obligations, enabling real-time monitoring of commercial operations by Brazilian regulatory agencies.
The NF-e project has as one of its basic premises the interconnection of information systems, which provides a reduction in evasion, since it automatically documents and discloses the data of all buying and selling operations. Specifically, for tax administrations, NF-e provides benefits such as increasing the reliability of the invoice and improving the fiscal control process with exchange and sharing of information, among others. To take full advantage of such benefits, however, new technology-mediated challenges must be addressed. In particular, the wealth of NF-e data generates analytical processing demands on a much larger scale than in the previous model.
3 RELATED WORK
According to (Abouzeid et al., 2009), there are two schools of thought on which technology should be used for analyzing data in a high-volume data environment. Database providers argue that the strong emphasis on the performance and efficiency of parallel databases makes them suitable for performing this analysis. On the other hand, others argue that MapReduce-based systems are more appropriate because of their superior scalability, fault tolerance, and flexibility to handle unstructured data.
For (Stonebraker et al., 2010), MapReduce-style systems have advantages in complex analyses and in the extraction, transformation and loading of data from external sources. Therefore, these systems complement database systems, which are designed to operate on transactions rather than on massive data loads.
The vision of complementary systems gives rise to a third school for processing large volumes of data, which associates the expressiveness of the SQL language with the parallel processing power of Hadoop, and which is the focus of this work. SQL-on-Hadoop systems enable the connection of existing SQL-based business applications with the results of large data pipelines, increasing their value and accelerating the adoption of Hadoop in commercial environments.
There is a variety of works that compare performance among SQL-on-Hadoop systems using standard benchmarks (Costea et al., 2016) and even commercial databases (Kornacker et al., 2015). These comparisons highlight the particularities of the systems and even point out low-cost alternatives for implementing SQL queries on large volumes of data.
At the governmental level, initiatives follow the American regulations (Pannu et al., 2016) and begin a movement of reporting research papers and processing analyses of these databases. In (Abrantes and Ferraz, 2016), a literature review is presented on big data technologies applied to the detection of tax evasion. In (Paula et al., 2016), the authors present the results obtained with an unsupervised deep learning model to classify Brazilian exporters according to the possibility of committing export fraud.
The present work joins other initiatives, shifting the focus to a database marked by the wealth of attributes that make up the NF-e, which contains confidential information that must be processed several times daily. Therefore, the system needs to be adapted to the needs and repeatability of audit procedures.
According to (Earley, 2015), there are four primary benefits of adopting data analysis in auditing procedures: (i) Auditors can test a greater number of transactions; (ii) Audit quality can be improved by providing more insights into the scenarios; (iii) Fraud can be detected more easily, since auditors can use familiar tools and technologies; and (iv) auditors can provide services and solve problems that are currently beyond their reach, using external data to support auditing activities.
4 APPLICATION CHARACTERIZATION
The tax audit activity seeks to find inconsistencies between what is practiced by taxpayers and what is established in the tax legislation. In this context, one of the analyses carried out is to examine the electronic invoices to verify whether the rate associated with a particular product is in accordance with the law. However, there are several specificities and conditions that must be considered to determine taxation correctly.
For companies under the normal taxation regime, that is, excluding companies under the Brazilian National SIMPLES regime, there are several scenarios to be analyzed, depending on the type of commercial operation. For internal transactions, that is, commercialization between companies or individuals of the state of Santa Catarina (SC), Brazil, the ICMS (Tax on Goods Circulation and Services) rate depends on the type of product marketed; therefore, this scenario presents the greatest challenge in verifying that the taxation of marketed items is correct. The other scenarios, involving interstate operations, have fixed rates that do not depend on the type of product sold, so they require less effort.
The strategy used to analyze the most complex scenario is to filter the transactions corresponding to internal operations and then calculate a series of metrics to check the effective rates being practiced (0%, 7%, 12%, 17% and 25%), indicating which aliquots are most used per product. With this statistic, the tax auditors have grounds on which to choose the focus of the investigation.
Tax inspection in the state of Santa Catarina is structured in 15 specialty groups per product segment or economic activity, such as fuels, medicines and supermarkets, among others. Therefore, the most frequently used aliquot analysis constitutes a very important query for the massive audit at the State Secretary of Finance of Santa Catarina (SEF-SC), impacting the work of at least one tax auditor from each expert group. In addition, this analysis is carried out several times, since it is a consolidation of the behavior of the goods movements by the taxpayers and, therefore, its performance is fundamental for the investigative process of the auditors.
In more detail, the analytical scenario found at SEF-SC consists in consolidating the electronic invoice data into two views: (i) identification; and (ii) item. The identification of the NF-e consists of its general and summarized information. The most important fields are: identification of the issuer (identification number (CPF or CNPJ), name and address), identification of the consignee, information about the freight (delivery address, quantity, weight, value and license plate), and tax information (total ICMS value, IPI value, total value of the invoice, total value of products, insurance value, discounts and other ancillary expenses). On the other hand, the item view consists of the finest level of detail, containing information on each item of the NF-e, i.e., each product and/or service marketed, such as: product code, product description, quantity, unit price, ICMS calculation basis, and ICMS and IPI rates.
It is worth mentioning that the fiscal audit comprises investigative work and is sometimes an instrument in the fight against tax crimes. Therefore, certain details of the strategy of the fiscal operations are secret: some details, steps and data considered cannot be revealed.
5 EXPERIMENTS AND RESULTS
The experiments focused on queries that are relevant to auditors and frequently executed in a corporate RDBMS, whose performance was a limiting factor in the context in question. Such queries were chosen within the limits of secrecy, with substitution of terms when necessary, but without changes that impact performance. Queries were performed after data preparation in two execution environments, with time measurements to allow a comparison between the RDBMS- and Impala/Hadoop-based solutions.
5.1 Data Preparation
To prepare the data for the queries, we used data partitioning. Partitioning is a technique for physically splitting the data during the loading process, based on the values of one or more columns of a table, for the purpose of improving management, performance, availability, or load balancing.
Typically, data partitioning is appropriate for large tables, where reading all the data takes significant time. Another criterion that indicates an advantage in partitioning is the existence of columns with a reasonable number of distinct values, while at the same time the data volume is not so small that the reading cost outweighs the benefit.
Thus, we analyzed the NF-e data to verify which columns would be candidates for partitioning the data, considering the size of the tables in question and the need for high performance in the queries. Figure 1 shows the distribution of NF-e items throughout the months of the year 2016. It can be seen, for the data considered, that the issuance of electronic invoices is balanced among the 12 months. Therefore, we chose to partition the data by the month of issue of the NF-e.
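As a sketch of this preparation (table and column names are hypothetical, since the actual schema is confidential), a month-partitioned Parquet table in Impala could be declared and loaded as follows:

```sql
-- Hypothetical sketch of the month-partitioned layout.
CREATE TABLE nfe_item_by_month (
  nfe_id        BIGINT,
  product_desc  STRING,
  unitary_value DECIMAL(12,2)
)
PARTITIONED BY (num_year_month INT)  -- e.g., 201601 .. 201612
STORED AS PARQUET;

-- Dynamic partitioning: each row is routed to the partition matching
-- the value of the last column in the SELECT list.
INSERT INTO nfe_item_by_month PARTITION (num_year_month)
SELECT nfe_id, product_desc, unitary_value, num_year_month
FROM nfe_item_staging;
```

With this layout, each num_year_month value maps to its own HDFS directory, so queries restricted to a few months read only those partitions.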
Table 1 presents the volumetry of these two NF-e data sources for operations carried out in the SC state in 2016.
Table 1: Volumetry of NF-e data sources in RDBMS and Hadoop.

Source            NF-e Identification   NF-e Item
Columns           69                    26
Records           182,584,721           1,153,841,591
Volume - RDBMS    47.20 GB              112.18 GB
Volume - Hadoop   53.66 GB              128.95 GB
It can be observed that the volume and number of records are significant. An increase in data size was observed when loading into the Hadoop environment, compared to the space used in the RDBMS. In addition, SEF-SC's policy is to implement a massive auditing operation, that is, to analyze all goods circulation transactions of all taxpayers, unlike the traditional method of working with samples. This reinforces the demand for a high-performance solution.
5.2 Experimentation Platform
For the experiments presented in this section, two environments were considered: (i) a commercial RDBMS; and (ii) a Hadoop cluster. There is a noticeable difference between these environments; however, it is due to organizational and licensing restrictions. Although they are different, we tried to use a configuration with virtual machines that was similar in parallelism. In addition, it is worth noting that, in financial terms, the RDBMS environment hardware costs around 10x more than the hardware of the Hadoop environment in our experimentation platform. The hardware and software features of these environments are described below.
5.2.1 Commercial RDBMS
The RDBMS execution environment consists of an IBM Power System E850 server, dedicated to the experiments. This machine has 20 PowerPC POWER8 3.36 GHz cores and 512 GB of RAM. A feature of the servers in this series is the possibility of partitioning them into smaller machines. In this way, 2 partitions are used for the RDBMS. Each partition has 8 cores and 48 GB of memory. Both run the AIX 7.1 TL3 operating system.
In terms of storage, an EMC VNX5500 array with 150 TB was used, connected over a storage network with 8 Gb/s Fibre Channel technology. Currently, the RDBMS consumes approximately 80 TB.
The RDBMS used is a popular commercial RDBMS, called "RDBMS-Y" in this work due to proprietary restrictions of the licensing agreement.
5.2.2 Hadoop Cluster
The Hadoop cluster nodes are hosted in a VMware vSphere ESX 6.0 virtual environment. For the experiments, a Dell PowerEdge R620 server with 48 Intel Xeon E5-2697 v2 2.70 GHz cores and 256 GB of RAM was designated as the dedicated host. For storage, an EMC VNX7500 array was used, with a dedicated area of 4 TB, connected to the server with two 8 Gb/s Fibre Channel interfaces.
Five virtual machines were created for the Hadoop cluster. The first node, named NameNode, was configured with 8 cores and 32 GB of RAM. The remaining 4 nodes were configured with 4 cores and 16 GB of RAM. In terms of storage, 800 GB was allocated for each node, totaling 4 TB. From the gross storage, 1.37 TB was made available for HDFS.
In terms of software, we used Impala 2.7.0, Apache Parquet 1.5.0 (Parquet-format 2.1.0) and Apache Hive 1.1.0 with Apache Hadoop 2.6.0. These packages are provided by the Cloudera CDH 5.10.0 distribution. The HDFS replication factor was set to 3, with the Java heap set to 2 GB. The Java heap for Impala was set to 16 GB. Other parameters were kept at the distribution's default configuration.
Figure 1: Monthly distribution of NF-e identification (a) and item (b) data.
5.3 Performance Analysis
For all of the queries discussed below, the parallel query execution property was enabled in RDBMS-Y, since Impala is an inherently parallel query execution engine. In addition, as mentioned in Section 2, the Parquet format presents the best performance among the storage formats. Therefore, all queries performed by Impala/Hadoop operate on tables stored in Parquet format. The following sections present the queries used in the comparison.
5.3.1 Query 1: Join between NF-e Item x NF-e Identification
This query aims to consolidate the data of the identification and item tables of the NF-e into a single table, called nfe_complete_item, using the nfe_id column as key. This is done so that the next steps of the audit analysis can query a single table with all the data of the NF-e universe, without the need for joins, which are computationally costly. Figure 2 shows the SQL query considered in this experiment.
select ident.*, item.*
from nfe_item item
inner join nfe_identification ident
  on item.nfe_id = ident.nfe_id;
Figure 2: SQL code for the NF-e item and NF-e identification join.
For the execution times presented in Figure 3, without data partitioning, we obtained a 2.7x speedup for Impala in comparison to RDBMS-Y. When considering the partitioning of the data by month, Impala achieves a 4x speedup in relation to RDBMS-Y.
Figure 3: Comparison of query performance regarding the
join between NF-e item and NF-e Identification.
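In Impala, this consolidation can be materialized once as a Parquet table and then reused by the subsequent audit analyses. A sketch, with hypothetical column names since the real schema is confidential:

```sql
-- Illustrative materialization of the join, so later audit queries
-- read a single table instead of repeating the costly join.
CREATE TABLE nfe_complete_item
STORED AS PARQUET
AS
SELECT item.nfe_id,
       item.product_desc,   -- illustrative item columns
       item.unitary_value,
       ident.issuer_cnpj,   -- illustrative identification columns
       ident.num_year_month
FROM nfe_item item
INNER JOIN nfe_identification ident
  ON item.nfe_id = ident.nfe_id;
```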
5.3.2 Query 2: Most Used Aliquots
As mentioned in Section 4, the query of statistics on the aliquots associated with electronic invoice products represents an important step in the work of the SEF-SC tax auditors. Specifically, this query calculates a series of arithmetic aggregations, grouping the data by month and by the textual attributes that define the products (called here column_A and column_B, for confidentiality). In addition, only transactions involving the circulation of goods within the SC state, defined by a set of Tax and Operational Codes (CFOPs), are considered. The query in question, shown in Figure 4, is executed once for each aliquot that may exist, namely 0%, 7%, 12%, 17% and 25%.
select column_A, column_B, num_year_month,
       count(*) as number_of_occurrences,
       count(distinct if(icms_aliquota = ${ALIQUOT},
             cnpj_num, null)) as aliq_quantity_${ALIQUOT},
       sum(icms_value) as icms_value,
       sum(icms_calculation_basis_value) as icms_calculation_basis_value,
       max(unitary_value) as unitary_value_max,
       min(unitary_value) as unitary_value_min,
       avg(unitary_value) as unitary_value_avg
from nfe_complete_item
where cod_cfop_item in (5101, 5102, 5103, 5104,
      5105, 5106, 5116, 5117, 5118, 5119, 5120, 5122,
      5123, 5904, 5917)
group by column_A, column_B, num_year_month;
Figure 4: SQL code for the experiment with grouping and filtering.
Figure 5 shows the execution times for this query. Without data partitioning, Impala achieves a 5.2x speedup over the RDBMS. When considering data partitioning by month (column num_year_month), the Impala query displays a 7.18x speedup over RDBMS-Y, with the same partitioning criteria.
Figure 5: Comparison of query performance to search the
most used aliquots.
5.3.3 Query 3: Most Used Descriptions
The ICMS tax legislation is extensive and complex, since the economy and commerce are dynamic, with policies and products being created and discontinued all the time. Often, product classifications (defined here as the textual columns column_A and column_B) are done erroneously, whether due to registration errors or other reasons.
Thus, in many cases, only the product description makes it possible to know whether the product classifications are correct, thus allowing a correct identification of the tax rate under which the product falls. Figure 6 presents the SQL query for the most commonly used electronic invoice description, for each combination of column_A and column_B.
select t.column_A, t.column_B,
       t.product_desc
from (
  select item.column_A,
         item.column_B,
         item.product_desc,
         count(*) as n_occurrences_product,
         rank() over (
           partition by item.column_A,
                        item.column_B
           order by count(*) desc
         ) as ranking
  from nfe_complete_item item
  where length(item.column_B) >= 8
  group by item.column_A, item.column_B,
           item.product_desc
) t
where t.ranking = 1;
Figure 6: SQL code for the aggregation query.
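One detail worth noting about Figure 6 (our observation, not part of the original experiment): rank() assigns the same rank to descriptions with tied counts, so the filter ranking = 1 may return more than one description per (column_A, column_B) pair. When exactly one representative description per pair is required, row_number() breaks ties arbitrarily, as in this variant:

```sql
-- Variant of Figure 6 using row_number(), which yields exactly one
-- description per (column_A, column_B) pair even when counts tie.
select t.column_A, t.column_B, t.product_desc
from (
  select item.column_A, item.column_B, item.product_desc,
         row_number() over (
           partition by item.column_A, item.column_B
           order by count(*) desc
         ) as ranking
  from nfe_complete_item item
  where length(item.column_B) >= 8
  group by item.column_A, item.column_B, item.product_desc
) t
where t.ranking = 1;
```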
Figure 7 presents the execution times of the query. Without data partitioning, the SQL query via Impala presented an 11.25x speedup in relation to RDBMS-Y. When considering data partitioning by month, the speedup using Impala was 14.65x, with the same partitioning criteria.
5.3.4 Discussion
We strongly believe that the modular nature of the Hadoop environment, in which Impala draws on a number of standard components that are shared across the platform, confers advantages that cannot be replicated in a traditional, monolithic RDBMS such as the one presented in the experiments.
Figure 7: Comparison of query performance to search the most used descriptions.
In particular, the ability to mix file formats and processing frameworks means that a much broader spectrum of computational tasks can be handled by a single system without the need for data movement, which itself is typically one of the biggest impediments for an organization to do something useful with its data.
This is one of the reasons for the performance difference shown between Impala and RDBMS-Y. Since RDBMSs need to read data from disk into memory before processing starts, there is a latency that is repeated several times when working with large datasets. On the other hand, Impala, as an MPP solution, takes the processing as close as possible to the data. Thus, there is less data movement and lower latency.
6 CONCLUSION
In this work, a recent open-source technology was explored to perform SQL queries on Hadoop, applying it in a context where RDBMS-based solutions are dominant. The comparison joins other research that has pointed to the benefits of this approach, with the difference that it is based not on benchmarks but on a real application, not found in other works that investigate SQL performance on Hadoop.
Despite Hadoop's origin as a batch processing environment, the results obtained reinforce the idea that the public sector can derive great benefits from adopting large-scale analytical processing technologies in place of traditional RDBMS technologies. In fact, the SQL-on-Hadoop solution has not only shown better performance, but is also less costly in terms of hardware and software, representing a rational use of public resources.
This solution opens the way to tackling the challenge of tax evasion on several fronts, with the possibility of a massive audit that considers the totality of transactions, companies, and individuals involved. In particular, the performance results presented in this paper contributed to the creation of the Fiscal Planning and Monitoring Group, whose goal is to supply the control groups with information for more efficient fiscal operations.
It is also worth noting that the Impala/Hadoop approach is applicable to data from other Brazilian states, since NF-e is a national standard and the codes for operations and benefits are also nationally unified.
The experiments carried out opened the way for further investigations that cross-reference NF-e information with other tax data. From a technological point of view, it is necessary to keep track of new SQL-on-Hadoop alternatives that may present promising attributes for the application in question.
ACKNOWLEDGEMENTS
The authors are especially grateful to CIASC colleagues Fábio Eduardo Thomaz, James Rosa, Robson Marcos da Cunha, Ademir João da Rosa, Luiz Carlos Brehmer Junior, Eduardo Sguario dos Reis, Dante Michels de Mattos, Sergio Luiz Borges da Silva, and Nelson Mussak Guanabara for their support and commitment to the project. In addition, the authors thank the colleagues of SEF-SC, Luiz Carlos de Lima Feitoza, Omar Afif Alemsan, and Dayna Maria Bortoluzzi, for their suggestions and for making available the infrastructure necessary for the experiments presented. This research was partially supported by UFSM/FATEC through project number 041250 - 9.07.0025 (100548).
REFERENCES
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., and Rasin, A. (2009). HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the Very Large Data Base Endowment Inc., 2(1):922–933.
Abrantes, P. C. and Ferraz, F. (2016). Big data applied to tax evasion detection: A systematic review. In Proceedings of the International Conference on Computational Science and Computational Intelligence (CSCI), pages 435–440. IEEE.
Allingham, M. G. and Sandmo, A. (1972). Income tax evasion: A theoretical analysis. Journal of Public Economics, 1(3-4):323–338.
Chen, Y. et al. (2014). A study of SQL-on-Hadoop systems. In Zhan, J., Han, R., and Weng, C., editors, Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 4th and 5th Workshops, BPOE 2014, Salt Lake City, USA, March 1, 2014 and Hangzhou, China, September 5, 2014, Revised Selected Papers, pages 154–166. Springer International Publishing.
Costea, A. et al. (2016). VectorH: Taking SQL-on-Hadoop to the next level. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 1105–1117, New York, NY, USA. ACM.
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.
Earley, C. E. (2015). Data analytics in auditing: Opportunities and challenges. Business Horizons, 58(5):493–500.
Floratou, A., Minhas, U. F., and Özcan, F. (2014). SQL-on-Hadoop: Full circle back to shared-nothing database architectures. Proceedings of the Very Large Data Base Endowment Inc., 7(12):1295–1306.
IBGE (2012). Demographics of companies [online]. Available: http://biblioteca.ibge.gov.br/visualizacao/livros/liv88028.pdf. [Accessed 20 October 2017].
Kim, G.-H., Trimi, S., and Chung, J.-H. (2014). Big-data applications in the government sector. Communications of the ACM, 57(3):78–85.
Kornacker, M., Behm, A., Bittorf, V., Bobrovytsky, T., Choi, A., Erickson, J., Grund, M., Hecht, D., Jacobs, M., Joshi, I., Kuff, L., Kumar, D., Leblang, A., Li, N., Robinson, H., Rorke, D., Rus, S., Russell, J., Tsirogiannis, D., Wanderman-Milne, S., and Yoder, M. (2015). Impala: A modern, open-source SQL engine for Hadoop. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR’2015).
Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. (2010). Dremel: Interactive analysis of web-scale datasets. In Proceedings of the 36th International Conference on Very Large Data Bases, pages 330–339.
Pannu, M., Gill, B., Tebb, W., and Yang, K. (2016). The impact of big data on government processes. In Proceedings of the IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pages 1–5. IEEE.
Paula, E. L., Ladeira, M., Carvalho, R. N., and Marzagão, T. (2016). Deep learning anomaly detection as support fraud investigation in Brazilian exports and anti-money laundering. In Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 954–960. IEEE.
Rad, M. S. and Shahbahrami, A. (2016). Detecting high risk taxpayers using data mining techniques. In International Conference of Signal Processing and Intelligent Systems (ICSPIS), pages 1–5. IEEE.
RFB (2010). Sped: Public system of digital bookkeeping [online]. Available: http://sped.rfb.gov.br. [Accessed 20 October 2017].
Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA. IEEE Computer Society.
Stonebraker, M., Abadi, D., DeWitt, D. J., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. (2010). MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 53(1):64–71.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., and Baldeschwieler, E. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, pages 5:1–5:16, New York, NY, USA. ACM.
White, T. (2012). Hadoop: The Definitive Guide. O’Reilly Media, Inc.