SQL Query Performance on Hadoop: An Analysis Focused on Large
Databases of Brazilian Electronic Invoices
Cristiano Cortez da Rocha¹, Márcio Parise Boufleur¹, Leandro da Silva Fornasier¹,
Júlio César Narciso², Andrea Schwertner Charão³, Vinícius Maran⁴,
João Carlos D. Lima³ and Benhur O. Stein³
¹Centro de Informática e Automação do Estado de Santa Catarina (CIASC), Florianópolis, SC, Brazil
²Diretoria de Administração Tributária, Secretaria de Estado da Fazenda de Santa Catarina, Florianópolis, SC, Brazil
³Departamento de Linguagens e Sistemas de Computação, Programa de Pós-Graduação em Informática,
Universidade Federal de Santa Maria (UFSM), Santa Maria, RS, Brazil
⁴Coordenadoria Acadêmica, Universidade Federal de Santa Maria (UFSM), Cachoeira do Sul, RS, Brazil
Keywords:
Large Database, Query Performance, Data Management, Business-critical Data.
Abstract:
Hadoop clusters have established themselves as a foundation for various applications and experiments in the field of high-performance processing of large datasets. In this context, SQL-on-Hadoop emerged as a trend that combines the popularity of SQL with the performance of Hadoop. In this work, we analyze the performance of SQL queries on Hadoop, using the Impala engine, comparing it with an RDBMS-based approach. The analysis focuses on a large set of electronic invoice data, representing an important application to support fiscal audit operations. The experiments included queries that are frequent in this context, implemented with and without data partitioning in both RDBMS and Impala/Hadoop. The results show speedups from 2.7x to 14x with Impala/Hadoop for the queries considered, on a lower-cost hardware/software platform.
1 INTRODUCTION
Large-scale analytical data processing has spread across a variety of industries and areas of interest, guided by low storage costs and the ability to collect large volumes of business-critical data. Providing such data to engineers and analysts has become increasingly mandatory and requires low response times, which represent a key factor in data mining, monitoring, prototyping strategies, simulations, and other common tasks in this field (Melnik et al., 2010).
Consequently, the processing of large datasets requires state-of-the-art high-performance computing solutions. In this context, Hadoop (White, 2012) and its related tools have established themselves as solutions for managing and performing robust analyses of unstructured data, as well as structured and semi-structured data processing. The processing of SQL queries in particular has gained significant attention, since the data management systems of many institutions rely on SQL, and many corporate users are familiar and comfortable with this language (Floratou et al., 2014). With the evolution of the Hadoop ecosystem, solutions have emerged to process SQL queries on Hadoop, seeking to extend its benefits to structured and semi-structured data (Kornacker et al., 2015).
For government institutions, the processing power and storage savings made possible by the Hadoop ecosystem offer new options for IT infrastructure optimization and application development. In addition, governments expect such applications to improve services to citizens, as well as to address major government challenges, such as the economy, revenue collection, job creation, health and natural disasters (Kim et al., 2014).
Tax evasion is mostly practiced by taxpayers to reduce their tax liability. This illegal action usually misrepresents financial facts to government and tax authorities through false tax reporting, such as declaring less income, less profit, and exaggerated costs (Rad and Shahbahrami, 2016).
According to (Allingham and Sandmo, 1972), the
probability of practicing tax evasion is directly related
to the probability of its detection, so governments are
constantly seeking more efficient and effective ways
to deal with this problem. Another challenge faced
Cortez da Rocha, C., Parise Boufleur, M., da Silva Fornasier, L., César Narciso, J., Schwertner Charão, A., Maran, V., D. Lima, J. and O. Stein, B.
SQL Query Performance on Hadoop: An Analysis Focused on Large Databases of Brazilian Electronic Invoices.
DOI: 10.5220/0006690400290037
In Proceedings of the 20th International Conference on Enterprise Information Systems (ICEIS 2018), pages 29-37
ISBN: 978-989-758-298-1
Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
by the tax authorities is the turnover of companies. According to data from the Brazilian Institute of Geography and Statistics (in Portuguese, IBGE) (IBGE, 2012), about 25% of companies close within two years of life; that is, one in four companies closes before two years of existence. In many cases, the closure of the business register is fraudulent, leaving behind an uncollected tax liability.
Since 2007, with the institution of the Public System of Digital Bookkeeping (in Portuguese, SPED), documents and fiscal books, previously filed manually on paper, have been migrating to electronic documents, enabling: (i) faster identification of tax offenses; (ii) faster access to information; and (iii) greater effectiveness in the supervision of operations through data crossing and electronic audit (RFB, 2010). The most commonly used electronic tax document is the electronic invoice, which documents sales operations.
In this context, Big Data technologies can provide a quick and efficient selection of companies with fiscal irregularities, since the work of a tax auditor involves the analysis of a large amount of fiscal data and reports. Thus, it is possible to provide a rapid response to the Brazilian tax reality (Paula et al., 2016; Abrantes and Ferraz, 2016).
In this paper, we aim to contribute new evidence to this context, considering both technological and business aspects. To this end, a comparative analysis of two approaches to processing large volumes of fiscal data, more specifically electronic invoices, was performed through SQL queries executed on an RDBMS and on Hadoop, using Impala (Kornacker et al., 2015) as a parallel query processing engine.
The rest of this paper is organized as follows: Section 2 presents the relevant concepts about the Hadoop ecosystem and the application under consideration. Related work is discussed in Section 3. A more detailed characterization of the application is presented in Section 4. In Section 5, experimental studies are reported and, finally, Section 6 presents final considerations and perspectives for future work.
2 CONCEPTUAL FOUNDATION
2.1 Apache Hadoop
Apache Hadoop is an open-source framework that supports data-intensive, distributed, and fault-tolerant applications (White, 2012). Its aim is to provide data and processing abstractions that facilitate queries over large, dynamic, and rapidly growing datasets. The main modules that make up the Hadoop framework are as follows:
Hadoop Distributed File System (HDFS): inspired by the Google File System, this module is a distributed file system used as distributed storage for the data; furthermore, it provides access to the data with high throughput (Shvachko et al., 2010).
Hadoop YARN (MRv2): this module is responsible for job scheduling and for managing the cluster resources (Vavilapalli et al., 2013).
Hadoop MapReduce: originally Google's MapReduce (Dean and Ghemawat, 2008), this module is a system, based on YARN, for the parallel processing of the data.
One of the main aspects that characterize Hadoop is that HDFS is highly tolerant to hardware failures. Indeed, it is able to automatically handle and resolve such events. Therefore, it enables the deployment of lower-cost clusters, since it can use commodity hardware. Furthermore, through the interaction among the nodes of the cluster, HDFS is able to manage the data, for instance, to rebalance it.
The processing of the data stored on HDFS is performed by the MapReduce framework. MapReduce was designed to address the challenges of large-scale computing in the context of long-running batch jobs, and it splits the tasks to be completed across the nodes of the cluster.
2.2 SQL Language in Hadoop Ecosystem
Hadoop has emerged as a solution for distributed processing of large, unstructured datasets. More recently, the ecosystem of solutions around Hadoop has expanded with the emergence of solutions for structured data processing and SQL queries. Usually, such solutions translate a given SQL query into several MapReduce jobs.
Each job applies a different set of operators to different sets of data. Fast query responses enable interactive exploration and fine-tuning of analytic queries, rather than the long batch jobs traditionally associated with SQL-on-Hadoop technologies. However, performing interactive data analysis at scale demands a high degree of parallelism.
In (Chen et al., 2014), the performance of five solutions for executing SQL queries on Hadoop was analyzed. The experiments employ queries derived from the TPC-DS benchmark and indicate that pioneering solutions, such as Hive, may not perform satisfactorily for many applications. In addition, the results encourage further research on this topic.
Impala (Kornacker et al., 2015) is part of a new generation of SQL query engines that integrate with the Hadoop ecosystem. It is a modular solution, which uses a variety of components (Metastore, HDFS, HBase, YARN and Sentry, for example). Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. Therefore, Impala's goal is to combine the familiarity of the SQL language and the multiuser performance of traditional databases with Hadoop's scalability and flexibility. Impala does not have to translate a SQL query into another processing framework, such as the map/shuffle/reduce operations on which Hive and Apache Pig depend today. As a result, Impala does not suffer the latencies that those operations impose.
One of the components that can be coupled to Impala is Apache Parquet (Melnik et al., 2010), a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It is a format optimized for large blocks of data (tens, hundreds or thousands of megabytes). According to the performance analysis in (Kornacker et al., 2015), Parquet offers the best compression and scan performance among storage formats for the Hadoop ecosystem (plain text, RC, Avro and Sequence).
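As an illustration of this coupling, Impala lets the storage format be chosen per table at creation time. The sketch below uses hypothetical table and column names, since the actual NF-e schema is confidential:

```sql
-- Hypothetical sketch: storing NF-e items in Parquet via Impala DDL.
-- Table and column names are illustrative, not the real SEF-SC schema.
CREATE TABLE nfe_item_parquet (
  nfe_id        BIGINT,
  product_desc  STRING,
  unitary_value DECIMAL(12,2)
)
STORED AS PARQUET;

-- An existing text-format staging table can be converted in one pass:
INSERT INTO nfe_item_parquet
SELECT nfe_id, product_desc, unitary_value
FROM nfe_item_staging;
```

The queries themselves do not change after such a conversion; only the underlying storage format does.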
2.3 Brazilian Electronic Invoice
Brazil is a country that has had to deal with major bouts of tax evasion in the past. The reasons are many and include, among others, stringent labor laws, high interest rates and hefty taxes. Brazilian business owners are quick to point out that the fiscal burden imposed by the government has led to international smuggling and rampant tax evasion.
One aspect of new Brazilian laws passed in 2005 is to better enable the Brazilian government to combat tax evasion by requiring a process of digital invoicing for the sale of goods or services. In this context, the Brazilian Electronic Invoice (in Portuguese, NF-e) aims to implement a national electronic tax document model to gradually replace the model still in force, which is based on paper issuing (RFB, 2010). It seeks to simplify processes related to ancillary obligations, enabling real-time monitoring of commercial operations by Brazilian regulatory agencies.
The NF-e project has as one of its basic premises the interconnection of information systems, which provides a reduction in evasion, since it automatically documents and discloses the data of all buying and selling operations. Specifically, for tax administrations, NF-e provides benefits such as increasing the reliability of the invoice and improving the fiscal control process with exchange and sharing of information, among others. To take full advantage of such benefits, however, new technology-mediated challenges must be addressed. In particular, the wealth of NF-e data generates analytical processing demands on a much larger scale than in the previous model.
3 RELATED WORK
According to (Abouzeid et al., 2009), there are two schools of thought on which technology should be used for analyzing data in a high-volume data environment. Database providers argue that the strong emphasis on the performance and efficiency of parallel databases makes them suitable for performing this analysis. On the other hand, others argue that MapReduce-based systems are more appropriate because of their superior scalability, fault tolerance, and flexibility to handle unstructured data.
For (Stonebraker et al., 2010), MapReduce-style systems have advantages in complex analyses and in the extraction, transformation and loading of data from external sources. Therefore, these systems complement database systems, which are designed to operate on transactions rather than on massive data loads.
The vision of complementary systems gives rise to a third school for processing large volumes of data, which associates the expressiveness of the SQL language with the parallel processing power of Hadoop, and which is the focus of this work. SQL-on-Hadoop systems enable the connection of existing SQL-based business applications with the results of large data pipelines, increasing their value and accelerating the adoption of Hadoop in commercial environments.
There is a variety of works that compare performance among SQL-on-Hadoop systems using standard benchmarks (Costea et al., 2016) and even commercial databases (Kornacker et al., 2015). These comparisons highlight the particularities of the systems and even point out low-cost alternatives for implementing SQL queries on large volumes of data.
At the governmental level, initiatives follow the American regulations (Pannu et al., 2016) and begin a movement of reporting research papers and processing analyses of these databases. In (Abrantes and Ferraz, 2016), a literature review is presented on big data technologies applied to the detection of tax evasion. In (Paula et al., 2016), the authors present the results obtained with an unsupervised deep learning model to classify Brazilian exporters according to the possibility of committing export fraud.
The present work joins other initiatives, shifting the focus to a database marked by the wealth of attributes that make up the NF-e, which contains confidential information that must be processed several times daily. Therefore, the system needs to be adapted to the needs and repeatability of audit procedures.
According to (Earley, 2015), there are four primary benefits of adopting data analysis in auditing procedures: (i) Auditors can test a greater number of transactions; (ii) Audit quality can be improved by providing more insights into the scenarios; (iii) Fraud can be detected more easily, since auditors can use familiar tools and technologies; and (iv) auditors can provide services and solve problems that are currently beyond their reach, using external data to support auditing activities.
4 APPLICATION CHARACTERIZATION
The tax audit activity seeks to find inconsistencies between what is practiced by taxpayers and what is established in the tax legislation. In this context, one of the analyses carried out is to examine the electronic invoices to verify whether the rate associated with a particular product is in accordance with the law. However, there are several specificities and conditions that must be considered to determine taxation correctly.
For companies under the normal taxation regime, that is, excluding companies under the Brazilian National SIMPLES regime, there are several scenarios to be analyzed, depending on the type of commercial operation. For internal transactions, that is, commercialization between companies or individuals of the state of Santa Catarina (SC), Brazil, the ICMS (Tax on Goods Circulation and Services) rate depends on the type of product marketed; therefore, this scenario presents the greatest challenge in verifying that the taxation of marketed items is correct. The other scenarios, involving interstate operations, have fixed rates that do not depend on the type of product sold, so they require less effort.
The strategy used to analyze the most complex scenario is to filter the transactions corresponding to internal operations and then calculate a series of metrics to check the effective rates being practiced (0%, 7%, 12%, 17% and 25%), indicating which aliquots are most used per product. With this statistic, the tax auditors have grounds on which to choose the focus of the investigation.
Tax inspection in the state of Santa Catarina is structured in 15 specialty groups per product segment or economic activity, such as fuels, medicines and supermarkets, among others. Therefore, the most frequently used aliquot analysis constitutes a very important query for the massive audit at the State Secretary of Finance of Santa Catarina (SEF-SC), impacting the work of at least one tax auditor from each expert group. In addition, this analysis is carried out several times, since it is a consolidation of the behavior of the goods movements by the taxpayers and, therefore, its performance is fundamental for the investigative process of the auditors.
In more detail, the analytical scenario found at SEF-SC consists in consolidating the electronic invoice data into two views: (i) identification; and (ii) item. The identification of the NF-e consists of its general and summarized information. The most important fields are: identification of the issuer (identification number (CPF or CNPJ), name and address), identification of the consignee, information about the freight (delivery address, quantity, weight, value and license plate), and tax information (total ICMS value, IPI value, total value of the invoice, total value of products, insurance value, discounts and other ancillary expenses). On the other hand, the item view consists of the finest level of detail, containing information on each item of the NF-e, i.e., each product and/or service marketed, such as: product code, product description, quantity, unit price, ICMS calculation basis, and ICMS and IPI rates.
It is worth mentioning that the fiscal audit comprises investigative work and is sometimes an instrument in the fight against tax crimes. Therefore, certain details of the strategy of the fiscal operations are secret: some details, steps and data considered cannot be revealed.
5 EXPERIMENTS AND RESULTS
The experiments focused on queries that are relevant to auditors and frequently executed in a corporate RDBMS, whose performance was a limiting factor in the context in question. Such queries were chosen within the limits of secrecy, with substitution of terms when necessary, but without changes that impact performance. Queries were performed after data preparation in two execution environments, with time measurements to allow a comparison between the RDBMS- and Impala/Hadoop-based solutions.
5.1 Data Preparation
To prepare the data for the queries, we used data partitioning. Partitioning is a technique for physically splitting the data during the loading process, based on the values of one or more columns of a table, for the purpose of improving management, performance, availability, or load balancing.
Typically, data partitioning is appropriate for large tables, where reading all the data takes significant time. Another criterion that indicates an advantage in partitioning is the existence of columns with a reasonable number of distinct values, while at the same time the data volume is not so small that the reading cost outweighs the benefit.
Thus, we analyzed the NF-e data to verify which columns would be candidates for partitioning the data, considering the size of the tables in question and the need for high performance in the queries. Figure 1 shows the distribution of NF-e items throughout the months of the year 2016. It can be seen, for the data considered, that the issuance of electronic invoices is balanced among the 12 months. Therefore, we chose to partition the data by the month of issue of the NF-e.
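As a sketch of this preparation (table and column names are hypothetical, since the actual schema is confidential), a month-partitioned Parquet table in Impala could be declared and loaded as follows:

```sql
-- Hypothetical sketch of the month-partitioned layout.
CREATE TABLE nfe_item_by_month (
  nfe_id        BIGINT,
  product_desc  STRING,
  unitary_value DECIMAL(12,2)
)
PARTITIONED BY (num_year_month INT)  -- e.g., 201601 .. 201612
STORED AS PARQUET;

-- Dynamic partitioning: each row is routed to the partition matching
-- the value of the last column in the SELECT list.
INSERT INTO nfe_item_by_month PARTITION (num_year_month)
SELECT nfe_id, product_desc, unitary_value, num_year_month
FROM nfe_item_staging;
```

With this layout, each num_year_month value maps to its own HDFS directory, so queries restricted to a few months read only those partitions.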
Table 1 presents the volumetry of these two NF-e data sources for operations carried out in the SC state in 2016.
Table 1: Volumetry of NF-e data sources in RDBMS and Hadoop.

Source            NF-e Identification   NF-e Item
Columns           69                    26
Records           182,584,721           1,153,841,591
Volume - RDBMS    47.20 GB              112.18 GB
Volume - Hadoop   53.66 GB              128.95 GB
It can be observed that the volume and number of records are significant. An increase in data size was observed when loading into the Hadoop environment, compared to the space used in the RDBMS. In addition, SEF-SC's policy is to implement a massive auditing operation, that is, to analyze all goods circulation transactions of all taxpayers, unlike the traditional method of working with samples. This reinforces the demand for a high-performance solution.
5.2 Experimentation Platform
For the experiments presented in this section, two environments were considered: (i) a commercial RDBMS; and (ii) a Hadoop cluster. There is a noticeable difference between these environments; however, it is due to organizational and licensing restrictions. Although they are different, we tried to use a configuration with virtual machines that was similar in parallelism. In addition, it is worth noting that, in financial terms, the RDBMS environment hardware costs around 10x more than the hardware of the Hadoop environment in our experimentation platform. The hardware and software features of these environments are described below.
5.2.1 Commercial RDBMS
The RDBMS execution environment consists of an IBM Power System E850 server, dedicated to the experiments. This machine has 20 PowerPC POWER8 3.36 GHz cores and 512 GB of RAM. A feature of the servers in this series is the possibility of partitioning them into smaller machines. In this way, 2 partitions are used for the RDBMS. Each partition has 8 cores and 48 GB of memory. Both run the AIX 7.1 TL3 operating system.
In terms of storage, an EMC VNX5500 array with 150 TB was used, connected over a storage network with 8 Gb/s Fibre Channel technology. Currently, the RDBMS consumes approximately 80 TB.
The RDBMS used is a popular commercial RDBMS, called "RDBMS-Y" in this work due to proprietary restrictions of the licensing agreement.
5.2.2 Hadoop Cluster
The Hadoop cluster nodes are hosted in a VMware vSphere ESX 6.0 virtual environment. For the experiments, a Dell PowerEdge R620 server with 48 Intel Xeon E5-2697 v2 2.70 GHz cores and 256 GB of RAM was designated as the dedicated host. For storage, an EMC VNX7500 array was used, with a dedicated area of 4 TB, connected to the server with two 8 Gb/s Fibre Channel interfaces.
Five virtual machines were created for the Hadoop cluster. The first node, named NameNode, was configured with 8 cores and 32 GB of RAM. The remaining 4 nodes were configured with 4 cores and 16 GB of RAM. In terms of storage, 800 GB was allocated for each node, totaling 4 TB. From the gross storage, 1.37 TB was made available for HDFS.
In terms of software, we used Impala 2.7.0, Apache Parquet 1.5.0 (Parquet-format 2.1.0) and Apache Hive 1.1.0 with Apache Hadoop 2.6.0. These packages are provided by the Cloudera CDH 5.10.0 distribution. The HDFS replication factor was set to 3, with the Java heap set to 2 GB. The Java heap for Impala was set to 16 GB. Other parameters were kept at the distribution's default configuration.
Figure 1: Monthly distribution of NF-e identification (a) and item (b) data.
5.3 Performance Analysis
For all of the queries discussed below, the parallel query execution property was enabled in RDBMS-Y, since Impala is an inherently parallel query execution engine. In addition, as mentioned in Section 2, the Parquet format presents the best performance among the storage formats. Therefore, all queries performed by Impala/Hadoop operate on tables stored in Parquet format. The following sections present the queries used in the comparison.
5.3.1 Query 1: Join between NF-e Item x NF-e Identification
This query aims to consolidate the data of the identification and item tables of the NF-e into a single table, called nfe_complete_item, using the nfe_id column as key. This is done so that the next steps of the audit analysis can query a single table with all the data of the NF-e universe, without the need for joins, which are computationally costly. Figure 2 shows the SQL query considered in this experiment.
select ident.*, item.*
from nfe_item item
inner join nfe_identification ident
  on item.nfe_id = ident.nfe_id;
Figure 2: SQL code for the NF-e item and NF-e identification join.
For the execution times presented in Figure 3, without data partitioning, we obtained a 2.7x speedup for Impala in comparison to RDBMS-Y. When considering the partitioning of the data by month, Impala achieves a 4x speedup in relation to RDBMS-Y.
Figure 3: Comparison of query performance regarding the
join between NF-e item and NF-e Identification.
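In Impala, this consolidation can be materialized once as a Parquet table and then reused by the subsequent audit analyses. A sketch, with hypothetical column names since the real schema is confidential:

```sql
-- Illustrative materialization of the join, so later audit queries
-- read a single table instead of repeating the costly join.
CREATE TABLE nfe_complete_item
STORED AS PARQUET
AS
SELECT item.nfe_id,
       item.product_desc,   -- illustrative item columns
       item.unitary_value,
       ident.issuer_cnpj,   -- illustrative identification columns
       ident.num_year_month
FROM nfe_item item
INNER JOIN nfe_identification ident
  ON item.nfe_id = ident.nfe_id;
```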
5.3.2 Query 2: Most Used Aliquots
As mentioned in Section 4, the query of statistics on the aliquots associated with electronic invoice products represents an important step in the work of the SEF-SC tax auditors. Specifically, this query calculates a series of arithmetic aggregations, grouping the data by month and by the textual attributes that define the products (called here column_A and column_B, for confidentiality). In addition, only transactions involving the circulation of goods within the SC state, defined by a set of Tax and Operational Codes (CFOPs), are considered. The query in question, shown in Figure 4, is executed once for each aliquot that may exist, namely 0%, 7%, 12%, 17% and 25%.
select column_A, column_B, num_year_month,
       count(*) as number_of_occurrences,
       count(distinct if(icms_aliquota = ${ALIQUOT},
             cnpj_num, null)) as aliq_quantity_${ALIQUOT},
       sum(icms_value) as icms_value,
       sum(icms_calculation_basis_value) as icms_calculation_basis_value,
       max(unitary_value) as unitary_value_max,
       min(unitary_value) as unitary_value_min,
       avg(unitary_value) as unitary_value_avg
from nfe_complete_item
where cod_cfop_item in (5101, 5102, 5103, 5104,
      5105, 5106, 5116, 5117, 5118, 5119, 5120, 5122,
      5123, 5904, 5917)
group by column_A, column_B, num_year_month;
Figure 4: SQL code for the experiment with grouping and filtering.
Figure 5 shows the execution times for this query. Without data partitioning, Impala achieves a 5.2x speedup over the RDBMS. When considering data partitioning by month (column num_year_month), the Impala query displays a 7.18x speedup over RDBMS-Y, with the same partitioning criteria.
Figure 5: Comparison of query performance to search the
most used aliquots.
5.3.3 Query 3: Most Used Descriptions
The ICMS tax legislation is extensive and complex, since the economy and commerce are dynamic, with policies and products being created and discontinued all the time. Often, product classifications (defined here as the textual columns column_A and column_B) are done erroneously, whether due to registration errors or other reasons.
Thus, in many cases, only the product description makes it possible to know whether the product classifications are correct, thus allowing a correct identification of the tax rate under which the product falls. Figure 6 presents the SQL query for the most commonly used electronic invoice description, for each combination of column_A and column_B.
select t.column_A, t.column_B,
       t.product_desc
from (
  select item.column_A,
         item.column_B,
         item.product_desc,
         count(*) as n_occurrences_product,
         rank() over (
           partition by item.column_A,
                        item.column_B
           order by count(*) desc
         ) as ranking
  from nfe_complete_item item
  where length(item.column_B) >= 8
  group by item.column_A, item.column_B,
           item.product_desc
) t
where t.ranking = 1;
Figure 6: SQL code for the aggregation query.
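One detail worth noting about Figure 6 (our observation, not part of the original experiment): rank() assigns the same rank to descriptions with tied counts, so the filter ranking = 1 may return more than one description per (column_A, column_B) pair. When exactly one representative description per pair is required, row_number() breaks ties arbitrarily, as in this variant:

```sql
-- Variant of Figure 6 using row_number(), which yields exactly one
-- description per (column_A, column_B) pair even when counts tie.
select t.column_A, t.column_B, t.product_desc
from (
  select item.column_A, item.column_B, item.product_desc,
         row_number() over (
           partition by item.column_A, item.column_B
           order by count(*) desc
         ) as ranking
  from nfe_complete_item item
  where length(item.column_B) >= 8
  group by item.column_A, item.column_B, item.product_desc
) t
where t.ranking = 1;
```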
Figure 7 presents the execution times of the query. Without data partitioning, the SQL query via Impala presented an 11.25x speedup in relation to RDBMS-Y. When considering data partitioning by month, the speedup using Impala was 14.65x, with the same partitioning criteria.
5.3.4 Discussion
We strongly believe that the modular nature of the Hadoop environment, in which Impala draws on a number of standard components that are shared across the platform, confers advantages that cannot be replicated in a traditional, monolithic RDBMS such as the one presented in the experiments.
Figure 7: Comparison of query performance to search the most used descriptions.
In particular, the ability to mix file formats and processing frameworks means that a much broader spectrum of computational tasks can be handled by a single system without the need for data movement, which itself is typically one of the biggest impediments for an organization to do something useful with its data.
This is one of the reasons for the performance difference shown between Impala and RDBMS-Y. Since RDBMSs need to read data from disk into memory before processing starts, there is a latency that is repeated several times when working with large datasets. On the other hand, Impala, as an MPP solution, takes the processing as close as possible to the data. Thus, there is less data movement and lower latency.
6 CONCLUSION
In this work, a recent open-source technology was explored to perform SQL queries on Hadoop, applying it in a context where RDBMS-based solutions are dominant. The comparison joins other research that has pointed to the benefits of this approach, with the difference that it is based not on benchmarks but on a real application, not found in other works that investigate SQL performance on Hadoop.
Despite Hadoop's origin as a batch processing environment, the results obtained reinforce the idea that the public sector can derive great benefits from adopting large-scale analytical processing technologies in place of traditional RDBMS technologies. In fact, the SQL-on-Hadoop solution has not only shown better performance, but is also less costly in terms of hardware and software, representing a rational use of public resources.
This solution opens the way to tackling the challenge of tax evasion on several fronts, with the possibility of a massive audit that considers the totality of transactions, companies, and individuals involved. In particular, the performance results presented in this paper contributed to the creation of the Fiscal Planning and Monitoring Group, whose goal is to supply the control groups with information for more efficient fiscal operations.
It is also worth noting that the Impala/Hadoop approach is applicable to data from other Brazilian states, since NF-e is a national standard and the codes for operations and benefits are also nationally unified.
The experiments carried out opened the way for further investigations that cross-reference NF-e information with other tax data. From a technological point of view, it is necessary to keep track of new SQL-on-Hadoop alternatives that may present promising attributes for the application in question.
ACKNOWLEDGEMENTS
The authors are especially grateful to CIASC colleagues Fábio Eduardo Thomaz, James Rosa, Robson Marcos da Cunha, Ademir João da Rosa, Luiz Carlos Brehmer Junior, Eduardo Sguario dos Reis, Dante Michels de Mattos, Sergio Luiz Borges da Silva, and Nelson Mussak Guanabara for their support and commitment to the project. In addition, the authors thank the colleagues of SEF-SC, Luiz Carlos de Lima Feitoza, Omar Afif Alemsan, and Dayna Maria Bortoluzzi, for their suggestions and for making available the infrastructure necessary for the experiments presented. This research was partially supported by UFSM/FATEC through project number 041250 - 9.07.0025 (100548).
REFERENCES
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., and Rasin, A. (2009). HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the Very Large Data Base Endowment Inc., 2(1):922–933.
Abrantes, P. C. and Ferraz, F. (2016). Big data applied to tax evasion detection: A systematic review. In Proceedings of the International Conference on Computational Science and Computational Intelligence (CSCI), pages 435–440. IEEE.
Allingham, M. G. and Sandmo, A. (1972). Income tax evasion: A theoretical analysis. Journal of Public Economics, 1(3-4):323–338.
Chen, Y. et al. (2014). A study of SQL-on-Hadoop systems. In Zhan, J., Han, R., and Weng, C., editors, Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 4th and 5th Workshops, BPOE 2014, Salt Lake City, USA, March 1, 2014 and Hangzhou, China, September 5, 2014, Revised Selected Papers, pages 154–166. Springer International Publishing.
Costea, A. et al. (2016). VectorH: Taking SQL-on-Hadoop to the next level. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 1105–1117, New York, NY, USA. ACM.
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.
Earley, C. E. (2015). Data analytics in auditing: Opportunities and challenges. Business Horizons, 58(5):493–500.
Floratou, A., Minhas, U. F., and Özcan, F. (2014). SQL-on-Hadoop: Full circle back to shared-nothing database architectures. Proceedings of the Very Large Data Base Endowment Inc., 7(12):1295–1306.
IBGE (2012). Demographics of companies [online]. Available: http://biblioteca.ibge.gov.br/visualizacao/livros/liv88028.pdf. [Accessed 20 October 2017].
Kim, G.-H., Trimi, S., and Chung, J.-H. (2014). Big-data applications in the government sector. Communications of the ACM, 57(3):78–85.
Kornacker, M., Behm, A., Bittorf, V., Bobrovytsky, T., Choi, A., Erickson, J., Grund, M., Hecht, D., Jacobs, M., Joshi, I., Kuff, L., Kumar, D., Leblang, A., Li, N., Robinson, H., Rorke, D., Rus, S., Russell, J., Tsirogiannis, D., Wanderman-Milne, S., and Yoder, M. (2015). Impala: A modern, open-source SQL engine for Hadoop. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR’2015).
Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. (2010). Dremel: Interactive analysis of web-scale datasets. In Proceedings of the 36th International Conference on Very Large Data Bases, pages 330–339.
Pannu, M., Gill, B., Tebb, W., and Yang, K. (2016). The impact of big data on government processes. In Proceedings of the IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pages 1–5. IEEE.
Paula, E. L., Ladeira, M., Carvalho, R. N., and Marzagão, T. (2016). Deep learning anomaly detection as support fraud investigation in Brazilian exports and anti-money laundering. In Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 954–960. IEEE.
Rad, M. S. and Shahbahrami, A. (2016). Detecting high risk taxpayers using data mining techniques. In International Conference of Signal Processing and Intelligent Systems (ICSPIS), pages 1–5. IEEE.
RFB (2010). Sped: Public system of digital bookkeeping [online]. Available: http://sped.rfb.gov.br. [Accessed 20 October 2017].
Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA. IEEE Computer Society.
Stonebraker, M., Abadi, D., DeWitt, D. J., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. (2010). MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 53(1):64–71.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., and Baldeschwieler, E. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, pages 5:1–5:16, New York, NY, USA. ACM.
White, T. (2012). Hadoop: The Definitive Guide. O’Reilly Media, Inc.