Elcano: A Geospatial Big Data Processing System based on
SparkSQL
Jonathan Engélinus and Thierry Badard
Centre for Research in Geomatics (CRG), Laval University, Québec, Canada
Keywords: Elcano, ISO-19125, Magellan, Spatial Spark, GeoSpark, Geomesa, Simba, Spark SQL, Big Data.
Abstract: Big data are in the midst of many scientific and economic issues. Furthermore, their volume is continuously
increasing. As a result, the need for management and processing solutions has become critical.
Unfortunately, while most of these data have a spatial component, almost none of the current systems are
able to manage it. For example, while Spark may be the most efficient environment for managing Big data,
it is only used by five spatial data management systems. None of these solutions fully complies with ISO
standards and OGC specifications in terms of spatial processing, and many of them are neither efficient
enough nor extensible. The authors seek a way to overcome these limitations. Therefore, after a detailed
study of the limitations of the existing systems, they define a system in greater accordance with the ISO-
19125 standard. The proposed solution, Elcano, is an extension of Spark complying with this standard and
allowing the SQL querying of spatial data. Finally, the tests demonstrate that the resulting system surpasses
the current available solutions on the market.
1 INTRODUCTION
Today, it becomes crucial to develop systems able to
manage efficiently huge amounts of spatial data.
Indeed, the convergence of the Internet and
cartography has brought forth a new paradigm called
“neogeography”. This new paradigm is
characterized by the interactivity of location based
contents and the possibility for the user to generate
them (Mericksay and Roche, 2010). This
phenomenon, in conjunction with the arrival in the
market of new captors like the GPS chips in
smartphones, resulted in the inflation of production
and retrieval of spatial data (Badard, 2014). This
new interest for cartography makes the process more
complex as it becomes more and more difficult to
manage and represent such large quantities of data
by use of conventional tools (Evans et al, 2014).
The Hadoop environment (White, 2012),
currently one of the most important projects of the
Apache Foundation, is a de facto standard for the
processing and management of Big data. This very
popular tool, involved in the success of many start-
ups (Fermigier, 2011), implements MapReduce
(Dean and Ghemawat, 2008), an algorithm that
allows the distribution of data processing among the
servers of a cluster for a faster execution. The data to
process are also distributed among the servers by the
Hadoop Distributed File System (HDFS), which is
provided by default with Hadoop. The result is a
high degree of horizontal scalability, which can be
defined as the ability to linearly increase the
performances of a multi-server system to meet the
user’s requirements in terms of processing time. A
real ecosystem of interoperable elements has been
built up around Hadoop, which enables the
management of such various aspects as streaming
(e.g. Storm), serialization (e.g. Avro) and data
analysis (e.g. Hive).
In 2014, the University of Berkeley's AMPLab
started to develop a new element of the Hadoop
ecosystem, which has since been taken over by the
Apache Foundation, namely Spark
(http://spark.apache.org/), which offers an
interesting alternative to HDFS and MapReduce. In
Spark, data and processing codes are distributed
together in small blocks called RDD (“Resilient
Distributed Dataset”) on the whole cluster RAM.
This architectural choice, which strongly limits hard
drive accesses, makes Spark up to ten times faster
than conventional Hadoop use, in some cases
(Zaharia et al, 2010), although at the cost of a
greater RAM load (Gu and Li, 2013). Furthermore, a
Engélinus, J. and Badard, T.
Elcano: A Geospatial Big Data Processing System based on SparkSQL.
DOI: 10.5220/0006794601190128
In Proceedings of the 4th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM 2018), pages 119-128
ISBN: 978-989-758-294-3
Copyright
c
2019 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
119
part of Spark called Spark SQL (Armbrust et al,
2015) dress up Spark RDD with a supplementary
level called “DataFrames” which allows to organize
received data from Spark into temporary tables and
to query them with SQL language. Spark SQL
minimizes as well the duration of Spark processes,
thanks to the strategic optimization of queries and
the serialization of data. Finally, it allows the
definition of personalized data types (UDT: “User
Defined Type”) and personalized functions (UDF:
“User Defined Function”), which permit
respectively to get new kinds of data and processing
available from SQL.
This opportunity to query Big data thanks to
SQL is of paramount importance as it helps in their
analysis, with the goal of a better understanding of
the phenomena they represent on the ground. It also
empowers analysts with new analytical capabilities,
using a query language they already master on the
day to day. Together with the availability of a
growing amount of geospatial data, it is profitable to
use these capabilities to analyze the spatial
component of this huge amount of information,
which, according to Franklin, is present in 80% of
all business data (Franklin and Hane, 1992).
According to a frequently mentioned research of the
McKinsey cabinet (Manyika et al, 2011), a better
use of Big data spatial localization could grant 100
billion USD to services providers and in the range of
700 billion USD to final users. Lastly, spatial Big
data management finds itself in the midst of many
important economical, scientific and societal issues.
In this respect, Spark appears again as a promising
solution, because it processes the spatial data at least
more than 7 times faster than Impala, another
Hadoop element managing the SQL (You, et al,
2015).
Today, some systems relying on Hadoop enable
the management of massive spatial data, such as
Hadoop GIS (Aji et al, 2013), Geomesa (Hugues et
al, 2015) and Pigeon (Eldawy and Mokbel, 2014).
But they are mainly about prototypes than mature
technologies (Badard, 2014). In addition, most of
them only relies on the core version of Hadoop
without fully scaling the processing power of the
RAM like it is achieved by Spark. For example,
Spatial Hadoop (Eldawy and Mokbel, 2013) only
uses the Map Reduce algorithm of Hadoop.
Among these systems, only five propose a
management of spatial data relying on Spark. The
first two systems, Spatial Spark (You, et al, 2015)
and GeoSpark (Yu et al, 2015) only add a
management of the spatial component to the basic
version of Spark, which do not fully take advantage
of all the capabilities (e.g. SQL querying) and
performance of Spark. Hence, the current Spatial
Spark version can only interact with data in
command line mode instead of managing SQL
queries. GeoSpark only uses its own spatial
extension of the Spark RDD type, which does not
directly comply with Spark SQL (Yu, 2017). The
third, Magellan (Ram, 2015), defines spatial data
types directly available in Spark SQL, but without
correctly managing some spatial operations like the
union of disjointed polygons, the symmetric
differences involving more than a geometry and the
creation of an envelope. The fourth, Simba (Xie et
al, 2016), enables the querying of data in SQL for
points only and without the possibility to trigger
standard spatial functions. At last, the fifth prototype
is the Geomesa extension, which can be used from
Spark. The system is anyway limited in the spatial
operations that it offers because it has been natively
designed only for the research of points included in
an envelope. Furthermore, it presents limited
performances (Xie et al, 2016) in comparison with
other solutions. That apparently could be explained
by the fact that it imposes the use of a key store
technology (Accumulo, https://accumulo.apache.
org/) to store the spatial data to process.
As a conclusion, there is presently no system for
the management of geospatial data that fully
manages all kinds of 2D geometry data types and
that enables their efficient and actionable SQL
querying. Each model implemented in the five
studied prototypes which pursue a similar goal
presents limited capacities both on the types of
geometry they support as well as on the spatial
processing capabilities they offer. Details about this
last point are given in the next section.
2 LIMITS IN THE GEOSPATIAL
CAPABILITIES SUPPORTED
BY CURRENT SOLUTIONS
In order to assess the capabilities of the different
geospatial Big data management systems currently
relying on Spark to fully manage the 2D spatial
component, the ISO-19125 standard can profitably
be used as a guideline. Indeed, the two parts of this
standard respectively describe the 2D geometry
types and the geospatial functions and operators
(ISO 19125-1, 2004) and their expression in the
SQL language (ISO 19125-2, 2004) that a system
must implement to basically store 2D geospatial data
and support its querying and its analysis in an
GISTAM 2018 - 4th International Conference on Geographical Information Systems Theory, Applications and Management
120
interoperable way. In this context, we will first
introduce the geometry types supported by the
different systems. Then we will analyze which
spatial functions they give access to and whether
they can be extended to easily implement the
missing ones. Finally, we will study how they
manage the spatial indexation issue, which is crucial
when dealing with geospatial data.
2.1 Geometry Data Types
A system complying with ISO 19125-1 is supposed
to handle the seven main 2D geometry types that can
be built by linear interpolation. These can be divided
into three simple types (point, polyline and polygon)
and into four composite types (multipoint,
multipolyline, multipolygon and geometry
collection). Here is a study of how the current
systems meet this standard.
Spatial Spark and GeoSpark integrate all these
types of geometries because their model relies on the
use of the JTS (“Java Topology Suite”,
https://www.locationtech.org/proposals/jts-topology-
suite) library, which has been designed to meet ISO
standards and OGC recommendations (Davis and
Aquino, 2003). Geomesa also manages all the
geometries in its current version (Commonwealth
Computer Research, 2017), while Simba only
manages the point.
The case of the HortonWorks Magellan system is
more mixed. It enables the processing of points,
polylines and polygons. This may seem sufficient if
one assumes, as one of the designers of the system
(Sriharasha, 2016) does, that compound geometries
are reducible to tables of geometries. But in reality,
such an approach can only lead to a dysfunctional
system. Indeed, by not being able to explicitly create
actual complex geometry, such arrays are not
allowed as operands of a spatial function and their
return as a result of a spatial operation like the union
of disjoint polygons causes a type error.
In addition to its development, Magellan's
limitations are also due to the use of ESRI Tools as a
spatial library. The latter does not make it possible
to process all the 2D geometry types defined by the
ISO-19125 standard. It lacks the geometry collection
type, while the multi-polygon type is only partially
implemented. Furthermore, the adaptation of WKT
(“Well-Known Text”) provided by ESRI Tools does
not comply with the ISO standards and the OGC
recommendations.
The limitations of the different solutions studied
in relation to the requirements of ISO-19125-1 are
summarized in Table 1. Those related to ESRI Tools
have been added to give an idea of the limits that
they involve on the evolution of Magellan.
Table 1: Coverage of the different 2D geometry types
specified by ISO-19125 in studied prototypes.
Geo
Spark
Spatial
Spark
Simba Geomssa Magellan ESRI
Point Yes Yes Yes Yes Yes Yes
Polyline Yes Yes No Yes Yes Yes
Polygon Yes Yes No Yes Yes Yes
Multi-Point Yes Yes No Yes No Yes
Multi-polyline Yes No Yes No Yes Yes
Multi-polygon Yes Yes No Yes No In part
Collection Yes Yes No Yes No No
2.2 Spatial Functions and Operators
(ISO 19125-2, 2004) specifies how the spatial
functions (relations, operations, metric functions and
methods), a spatial data management system should
implement in SQL to comply with the ISO 19125-1
standard. It does not specify the way these methods
have to be implemented. It only defines their
signatures. These functions define the minimal set of
operations a system must implement to enable basic
and advanced spatial analysis capabilities. Even if
these functions have been defined for querying data
in classic spatial DBMS, their usage in geospatial
Big data management systems still pertain.
Nevertheless, the application of the ISO-19125-2
standard requires a system allowing SQL queries
and personalized SQL functions. This section details
how the five studied systems partly implement the
standard and describes their extension capabilities.
Spatial Spark only uses the core of Spark.
Indeed, it allows to work with RDD’s but not with
DataFrames or SQL queries. In this context, the
application of the ISO 19125-2 standard to Spatial
Spark seems impossible without a full
reimplementation.
As we saw, GeoSpark extends the RDD type of
Spark, and is therefore not directly compatible with
Spark SQL. Nevertheless, one of its developers
indicates that the integration of this point is planned
for a future version of the system and that there
would be an indirect way of changing these RDD’s
in DataFrames (Yu, 2017). But neither does he
describe a general process for it, nor how to apply
SQL queries afterwards. Indeed, the current version
of GeoSpark does not seem to be compliant with the
ISO 19125-2 standard because all geometry types
cannot be managed from SQL queries.
Simba released its own adaptation of Spark SQL,
which might enable the use of SQL queries and the
Elcano: A Geospatial Big Data Processing System based on SparkSQL
121
creation of User Defined Functions. In practice
however, the only accessible geometry is the point.
Furthermore, the syntactic analyzer does not always
work properly. By example, it forces to write “IN”
before “POINT(x, y)” even without a context of
inclusion. Simba is therefore not a mature and reliable
solution that could meet the ISO 19125-2 standard.
Until recently, Geomesa’s Spark extension only
used Spark’s core. But a recent version tries to
integrate Spark SQL. However, this solution remains
restrained by the mandatory use of the CQL format
and the Accumulo database (Commonwealth
Computer Research, 2017). Indeed, Geomesa does
not allow an autonomic and agnostic implementation
of ISO-19125-2.
Magellan does not directly manage SQL either.
But it defines User Data Types for the point. It is
therefore tempting to assume that the addition of
Used Defined Functions to its model should be
enough to allow the SQL functions of the ISO-
19215-2 standard. In practice however, the extension
of Magellan with these functions only covers two
thirds of spatial relations, half of the spatial
operations and a small part of spatial methods
specified by the ISO-19125-2 standard. These
limitations are due to both implementation errors
and the choice of the ESRI Tools library, which only
partially meets the ISO-19125-2 standard.
In their current states therefore, none of the
studied systems totally comply with the ISO-19125
standard.
2.3 Spatial Indexation Management
Spatial indexation can be defined as the
reorganization of spatial data, typically by using
their proximity relations, with the purpose of
accelerating their processing (Eldawy and Mokbel,
2015). Four of the studied systems provide a spatial
indexation component, but which is never both
efficient and extensible.
The spatial indexation component of Spatial
Spark uses directly the methods of the JTS spatial
library, which is not conceived for Big data
processing in a multi-server environment. GeoSpark
proposes a more integrated and efficient spatial
indexation module (Yu, 2017), but without the
possibility of managing it with SQL queries. The
indexation component of Simba is described as more
efficient by its developers (Xie et al, 2016), but has
important limitations and bugs we already covered.
Finally, Geomesa offers poor performances because it
relies on a specific database system (Xie et al, 2016),
which drastically increases the processing time.
2.4 Synthesis of Limitations
Table 2 sums up the main limitations of the studied
systems. It first recalls their most problematic
limitations. Then it reminds the geometry types they
support and as a result their degree of conformance
to the ISO 19125 standard. Next, it indicates
whether they manage SQL and whether they comply
with the ISO-19125-2 standard.
Table 2: Limitations of current spatial Big data processing
systems.
Magellan
Spatial
Spark
GeoSpark
Simba Geomesa
Main
limitation
Use a
limited
spatial
library
Inextensib
le to SQL
Syntactic
bugs, ́ no
extensible
Force to
use a
NoSQL
database
Types of
geometries
Only
simples
All Only point All
ISO-19125-1
In part Yes In part Yes
SQL
management
No, but
extensible
No
Yes
(replace
Spark
SQL)
Yes,
limited bý
CQL
ISO-19125-2
In part (by
extension)
́
No No In part
Spatial
indexation
No
Yes, but not efficient
or not extensible
Next section presents a new system designed for
the efficient and interoperable management and
rapid processing of geospatial Big data (vector data
only). It relies on Spark and overcomes identified
limitations present in current state-of-the-art
solutions. This prototype is named Elcano. Its
release as an open source project has not yet been
performed but it is envisaged.
3 PRESENTATION OF ELCANO
The main objective leading the design of Elcano is
to model a spatial Big data processing and
management system that surpasses the other systems
studied here. It must then integrate each 2D
geometry types defined in the ISO-19125 standard.
It must also enable the use of associated spatial
functions, in order to improve the analysis of spatial
phenomena. All spatial relations, operations and
methods defined by the ISO-19125-2 standard must
then be implemented by Elcano. For example, a call
to the SQL function ST_Intersects has to indicate if
two generic geometry objects intersect or not. The
GISTAM 2018 - 4th International Conference on Geographical Information Systems Theory, Applications and Management
122
system must also allow to load spatial data in a
simple and generic way, for it to be easy to feed and
to extend toward other formats. It also must ensure
data persistence in memory, in a compact manner so
that it enables a faster processing of the geospatial
component. It has to be easily extensible in order to
potentially support new geometry types or
extensions to the geometry types defined in the ISO-
19125 standard (for example the inclusion of
elevation in geometric features definition, i.e. 2.5D
data). Finally, it must offer good processing
performances in comparison with current processing
systems. A model seeking to meet these objectives is
presented and justified below.
3.1 Architecture
Figure 1 illustrates the model on which Elcano is
based. In this model, classes of the geometry
package integrate the elementary geometries and
spatial functions linked to Elcano.
Figure 1: Elcano’s model.
The loader package enables the use of SQL
spatial functions. Data persistence for processing
and data retrieval is managed by the “Table” class,
together with the support of the conversion methods
from the GeometryFactory class. Finally, the index
package deals with the indexation of spatial data for
its faster processing. Details on the way these
different capabilities are implemented are given in
the next sections.
3.1.1 2D Geometry Types Management
The geometry package of Elcano contains a concrete
class for each geometry type described in the ISO
19125 standard. These classes use the JTS spatial
library, which is specifically conceived to comply
with many ISO standard (including ISO 19125) and
the OGC recommendations (Davis and Aquino,
2003). This choice avoids the problems faced by
Magellan, which are due to the integration of an
inadequate spatial library like stated above. The
system could have used JTS classes directly, as
Spatial Spark and GeoSpark do, but for optimization
purposes, it seemed interesting not to be constrained
by the implementation of a chosen spatial library. To
this end, the Elcano geometry package uses a JTS-
independent class hierarchy by applying the “proxy
design pattern (Gamma et al, 1994). This choice of
conception allows also to accelerate, whenever
possible, the JTS methods by overwriting them.
3.1.2 Spatial SQL Functions Management
In order to make the spatial functions and operators
defined in ISO 19125 available as SQL functions in
Elcano, different User Defined Functions (UDF) has
been defined. All these functions are in fact
shortcuts to the different methods supported by the
different geometry types (i.e. classes included in the
geometry package) and specified in the ISO 191125
standard. The build() method of the SqlLoader class
in the loader package is in charge of declaring all
these functions at the initialization stage of the
application.
3.1.3 Spatial Data Persistence
Elcano provides a unified procedure for the loading
of all 2D geometry types and their persistence. The
Table class of Elcano enables the definition of
geometric features in WKT. WKT is a concise
textual format defined in the ISO 19125 standard.
Elcano thus allows to load tabular data (for example
from a CSV file where the geometry component of
each row is defined in WKT) in the form of an SQL
temporary table. The management of more specific
formats like JSON (Bray, 2014), GeoJSON or
Elcano: A Geospatial Big Data Processing System based on SparkSQL
123
possible spatial extensions to Big data specific file
formats like Parquet (Vorha, 2016) could also be
easily added to the system by simply inheriting the
Table class.
3.1.4 Data Types Extensibility
The GeometryFactory class implements the “abstract
factory” design pattern (Vlissides et al, 1995) and
allows the extensibility of Elcano. Other geometry
types than those defined in the ISO-19215 standard
could thus be added in the future, such as Triangles
and TINs in order to manage DTMs (Digital Terrain
Models).
3.1.5 Spatial Indexation
The index package of Elcano contains all classes in
charge of the spatial indexation of data stored in
Elcano. It drastically speeds up all spatial processes.
This component is inspired from the one
implemented in Spatial Spark but with some hooks
for better performance. Its use is illustrated in the
benchmark section. Its detailed description is though
out of the scope of the present paper. It will be
described and detailed later in another publication.
4 BENCHMARK
The present section compares the performances of
Elcano with another spatial data management and
processing systems using Spark, aka. Spatial Spark.
They are also compared with a well-known and
widely used classical spatial database management
system (DBMS): PostGIS (Obe and Hsu, 2015).
Spatial Spark has been chosen among the studied
systems that manage spatial indexation because it is
the only one that could be extended to support SQL
queries (by performing an important reimplementa-
tion though). So, it is the only one of the tested
prototype that really compares to Elcano. As for
PostGIS, it is to our point of view, a reference
implementation of the ISO 19125 standard with
which we can compare. In addition, it proposes
efficient and reliable spatial indexation methods.
For the needs of this benchmark, Elcano and
Spatial Spark have been installed on a cluster of
servers using a master server with 8 Go of RAM and
nine slave servers with 4 Go of RAM. Each of these
computers uses the CentOS 6.5 operating systems
and height Intel Xeon 2.33 GHz processors. PostGIS
has been optimized with the pgTune library
(https://github.com/le0pard/pgtune) and tested in
comparable conditions.
In each of the 3 tests performed, we count the
number of resulting elements from a spatial join
between two tables. We group the elements of these
tables by pair, according to a given spatial relation,
namely the intersection. This spatial relation has
been chosen because it implies complex and
sometimes time consuming processing. The use of a
fast and reliable spatial indexation system is also of
importance in such a process. The contents of the
tables used in the test is fixed. The management of
changing data is out of the scope of the tests.
Test 1 compares the execution time of the three
systems with a raise in data volume. It consists in
counting the intersections between an envelope
around Quebec province and seven sets of points
randomly dispatched in an envelope around Canada.
These seven sets contain respectively 1000, 10 000,
100 000, 1 million, 100 million and one billion
points.
Table 3 presents a synthesis of the first test
results for the three studied systems. In order to
facilitate their comparison, the duration cumulates
the indexation time and the first query time. It
appears that performances of Elcano performances
are better than those of PostGIS and Spatial Spark
beyond one million points. PostGIS is the best
choice for lower volumes but encounter a significant
slowdown after a certain threshold: it requires many
hours to process 100 million points against five
minutes for Elcano. The difference between Spatial
Spark and Elcano is more tenuous but increases in
favor of Elcano as data volume increases.
The drop in PostGIS performances when data
volume increases is probably explained by its weak
horizontal scalability: this system is not designed for
Big data management. In return, performances of
Elcano when compared to Spatial Spark can be
explained by its usage of Spark SQL. Indeed, the
latter uses specific query optimizations and Spark’s
caching system (Armbrust et al, 2015). But for low
data volumes (under one million points), the
classical PostGIS solution is better, probably
because of its simpler distributed treatments
architecture. In a similar way, the best performances
of Spatial Spark between one and ten million points
can probably be explained by the additional
treatments imposed by the use of Spark SQL by
Elcano.
GISTAM 2018 - 4th International Conference on Geographical Information Systems Theory, Applications and Management
124
Table 3: Test 1 – Processing time with a raise of the data
volume.
Volume
(points)
PostGIS
(ms)
Spatial Spark
(ms)
Elcano
(ms)
1000 234 6 543 9 516
10 000 326 6 622 9 714
100 000 3 783 8 301 9 030
1 000 000 29 898 8 301 10 747
10 000 000 269 257 20 487 17 099
100 000 000 5 752 821 55 017 37 378
1 000 000
000
More
than 10
hours
399 100 273 074
Test 2 compares the horizontal scalability of
Elcano and Spatial Spark for a number of servers
from one to nine. It compares the count of the
intersections between the envelope of the Quebec
province and one billion points randomly dispatched
in a bounding box of Canada. PostGIS performance
is not measured for this test because the previous test
clearly underlies its poor performances for large data
volumes and there is no way to distribute the
processing between many servers as PostgreSQL has
not been designed for horizontal scalability.
The table 4 presents a synthesis of the results of
this second test. Spatial Spark and Elcano both
appear to have a good horizontal scalability.
Furthermore, the execution time of the two systems
presents a similar drop from one to nine servers:
87,4% for Spatial Spark and 87,2% for Elcano. But
Elcano remains approximately 1.5 faster than Spatial
Spark regardless of the number of servers.
Table 4: Test 2 – Horizontal scalability when the number
of server increases.
Servers Spatial Spark (ms) Elcano (ms)
1 3 349 414 2 196 344
2 1 718 672 1 123 153
3 1 143 790 762 536
4 875 284 588 401
5 696 195 473 635
6 586 211 391 297
7 511 111 340 784
8 456 446 314 796
9 423 647 280 761
Elcano’s superior brute speed in this second test
can probably be explained by its using of Spark
SQL. Otherwise the rates of scalability of the two
systems are very close, maybe because both rely on
the JTS spatial library for the implementation of the
spatial analysis algorithms.
Test 3 compares more finely the performances of
PostGIS, Spatial Spark and Elcano. It counts the
intersections between one million points in an
envelope of Canada and the points in a copy of this
set. Therefore, a total of 100 billion intersection tests
(spatial join) are processed. The execution time is
spread between indexation time, first query time
(cold start) and second query time (hot start). Hot
start queries are more representative of the response
times in a running environment in production.
Indeed, while indexing is only necessary once for
the two given tables, an SQL query must be started
for each spatial join operation applied to them.
Table 5 offers a summary of the results for this
third test. PostGIS presents a spatial indexation time
a bit shorter than Spatial Spark, but the execution
time of its first SQL query is then much longer.
Elcano presents the best performances in all cases:
its indexation time is five time lower than with
PostGIS and the execution of its first query is two
times faster than with Spatial Spark. Elcano is also
the only solution to execute a second SQL query on
the same data significantly faster than the first: the
second execution is 26 times faster.
The last point can probably be explained by
Spark SQL’s caching system.
Table 5: Test 3 – Execution time is spread between
indexation time, first query time and second query time.
Solution
Indexation
time (ms)
First
query (ms)
Second
query (ms)
PostGIS 29 756 100 742 100 742
Spatial
Spark
36 824 36 824 36 824
Elcano 13 578 15 754 1 393
So, to sum up, above a given data volume,
Elcano surpasses PostGIS and Spatial Spark in terms
of execution speed. It presents a scalability similar to
the one of Spatial Spark, but a better execution time
when the number of servers increases.
Elcano: A Geospatial Big Data Processing System based on SparkSQL
125
5 CONCLUSION AND
PERSPECTIVES
In conclusion, while Big data with a spatial
component are in the midst of many scientific,
economical and societal issues and while Hadoop
has become a mature de facto standard for Big data
processing, the number of processing and
management systems for this type of data using the
Hadoop environment and available in the market is
limited. All available solutions are only prototypes
with limited capabilities. Moreover, only five
solutions are managing spatial data from Spark,
which is perhaps the most promising Hadoop
module for this type of processing, and none of these
systems can entirely handle the geometry types and
SQL spatial functions specified in the ISO 19125
standard.
To tackle this issue, the present paper proposes a
new spatial Big data processing and management
system relying on Spark: Elcano. It is based on the
SQL library of Spark and uses the JTS spatial library
for its compliance with the ISO’s standards. Thanks
to this approach, all SQL functions and operators
defined by the ISO 19125 standard are fully
supported.
The proposed model on which Elcano relies is
not a simple implementation of JTS. It comes with
the possibility to use SQL spatial queries with a data
model that can evolve. Furthermore, it integrates the
geometric types on a context of Big data and comes
with a scalable spatial indexation system which will
be detailed in an upcoming article.
In addition, Elcano offers better performances
than Spatial Spark and a similar scalability. The
detailed study of all the possibilities in term of
spatial indexation management remains however to
be done. A way to address it could be to adapt the no
Hadoop solution defined by (Cortés et al, 2015) to
the Spark environment, but there is also many
classical spatial data indexation modes that could be
explored and adapted in order to fulfill the big data
processing requirements.
In a larger perspective, it could be interesting in a
near future to enable the management of the
elevation together with dedicated data types such as
Triangles and TINs in the current model. Raster data
types, maybe via the use of RasterWKT, are also
considered for inclusion. That would allow to apply
the model to many new challenging situations such
as the processing of large collection of images
coupled with vector data analytics capabilities or the
building and analysis of high resolution digital
elevation models (DEM) or DTM without being
compelled to split them into tiles in order to be able
to process them at a whole.
The current version of Elcano manages only
batches of data, but adding the possibility of
processing and displaying continuously received
data (in streaming) could be very interesting
(Engélinus and Badard, 2016). Such an extension
could indeed enable the design of real time
geospatial analytical tools that will help in users
(analysts, decision makers, …) in making more
informed decisions on more up-to-date data and in a
shorter period of time. Furthermore, it could provide
some advanced features that deals with the temporal
dimension of the data, as for example by excluding
all data outside of a defined temporal window
(Golab, 2006). Such extensions could allow the
modelling of such data as a spatiotemporal event or
flow and maybe to dynamically detect “hot spots”
(Maciejewsky et al, 2010) in the stream.
But, if Spark can technically handle streaming,
taking it into account would induce several
conceptual and technical problems. It would be
necessary to define a mode of spatial indexation able
to manage fluctuating data. Furthermore, what
would be the visual variables to use for this type of
data in order to represent their dynamic structure?
Those defined by Bertin in 1967 (Bertin, 1967) and
widely used since are inappropriate because of their
strict limitation to a static spatiotemporal context.
More recent works have tried to add visual variables
to Bertin’s models in order to represent motion
(MacEachren, 2001; Fabrikant and Goldsberry,
2005), but their application in a context of Big data
remains unaddressed. Furthermore, once these
conceptual issues are solved, the definition of a
system that is effectively able to represent and
manage streamed data remains to be done. This
could not be a simple add-on to the classic
geographic information systems (GIS): they are
designed to be efficient for classical data only and
are not able to deal with the huge amount of data and
velocity that Big data implies. How then is it
possible to manage and to represent fluctuating Big
data in an efficient way, without losing the
horizontal scalability offered by Hadoop? This rich
problematic seems to require the definition of a new
type of GIS. This will be the bottom line of our
future research works.
ACKNOWLEDGEMENTS
We acknowledge the support of the Natural Sciences
and Engineering Council of Canada (NSERC),
GISTAM 2018 - 4th International Conference on Geographical Information Systems Theory, Applications and Management
126
funding reference number 327533. We also thank
Université Laval and especially the Center for
Research in Geomatics (CRG) and the Faculty of
Forestry, Geography and Geomatics for their support
and their funding. Thanks to Cecilia Inverardi and
Pierrot Seban for their thorough proof reading and to
Judith Jung for her advices in the writing of this
paper.
REFERENCES
A. Aji et al, 2013. “Hadoop GIS: a high performance
spatial data warehousing system over mapre- duce”.
In: Proceedings of the VLDB Endowment 6.11, p.
1009–1020.
M. Armbrust et al, 2015. “Spark SQL: Relational data
processing in spark”. In: Proceedings of the 2015
ACM SIGMOD International Conference on
Management of Data, p. 1383–1394.
T. Badard, 2014. “Mettre le Big Data sur la carte : défis et
avenues relatifs à l’exploitation de la localisation”. In:
Colloque ITIS - Big Data et Open Data au coeur de la
ville intelligente. Québec : CRG.
A. Eldawy, M. F. Mokbel, 2013. "A Demonstration of
SpatialHadoop: An Efficient MapReduce Framework
for Spatial Data." Proceedings of the VLDB
Endowment.
J. Bertin, 1967. “Semiologie Graphique: Les Diagrammes,
Les Reseaux, Les Cartes”.
T. Bray, 2014. The javascript object notation (json) data
interchange format, RFC 7158.
R. Cortés et al, 2015. “A Scalable Architecture for Spatio-
Temporal Range Queries over Big Location Data”. In:
Network Computing and Applications, IEEE 14th
International Symposium, p. 159–166.
M. Davis, J. Aquino, 2003. Jts topology suite technical
specifications.
J. Dean, S. Ghemawat, 2008. “MapReduce: simplified
data processing on large clusters”. In:
Communications of the ACM 51.1, p. 107–113.
J. Engélinus, T. Badard, 2016. “Towards a Real-Time
Thematic Mapping System for Strea-ming Big Data”.
In: GIScience, Montreal.
E. Gamma et al, 1994. Design Patterns: Elements of
Reusable Object-Oriented Software.
A. Eldawy, M. F. Mokbel, 2014. “Pigeon: A spatial
mapreduce language”. In: Data Engineering, 2014 30th
International Conference on IEEE, p. 1242–1245.
A. Eldawy et M. F. Mokbel, 2015. “The Era of Big Spatial
Data: A Survey”. In: Information and Media
Technologies 10.2, p. 305–316.
M. R. Evans et al, 2014. “Spatial big data”. In: Big Data:
Techniques and Technologies in Geoinformatics, p.
149.
S. I. Fabrikant, K. Goldsberry, 2005. “Thematic relevance
and perceptual salience of dynamic geovisualization
displays”. In: Proceedings, 22th ICA/ACI
International Cartographic Conference, Coruna.
S. Fermigier. 2011. Big data et open source: une
convergence inevitable? URL: http: //projet-
plume.org.
C. Franklin, P. Hane, 1992. “An Introduction to
Geographic Information Systems: Linking Maps to
Databases [and] Maps for the Rest of Us: Affordable
and Fun.” In: Database 15.2, p. 12–15.
L. Golab, 2006. “Sliding window query processing over
data streams”. Doctorate thesis. University of
Waterloo.
L. Gu, H. Li, 2013. “Memory or time: Performance
evaluation for iterative operation on hadoop and
spark”. In: High Performance Computing and
Communications & IEEE 10th International
Conference, Embedded and Ubiquitous Computing.
2013, p. 721–727.
J. N. Hugues et al, 2015. “GeoMesa: a distributed
architecture for spatio-temporal fusion”. In: SPIE
Defense + Security. International Society for Optics et
Photonics. 94730F.
ISO 19125-1, 2004. Geographic information -- Simple
feature access -- Part 1: Common architecture. ISO/TC
211, 42 pages. URL: https://www.iso.org/standard/
40114.html.
ISO 19125-2, 2004. Geographic information -- Simple
feature access -- Part 2: SQL option. ISO/TC 211, 61
pages. URL: https://www.iso.org/standard/40115.html.
A. M. MacEachren, 2001. “An evolving cognitive-
semiotic approach to geographic visualization and
knowledge construction”. In: Information Design
Journal 10.1, p. 26–36.
R. Maciejewsky et al, 2010. “A visual analytics approach
to understanding spatiotemporal hots- pots”. In: IEEE
Transactions on Visualization and Computer Graphics
16.2 p. 205– 220.
J. Manyika et al, 2011. “Big data: The next frontier for
innovation, competition, and productivity”. In: The
McKinsey Global Institute.
B. Mericksay, S. Roche, 2010. “Cartographie numérique en
ligne nouvelle génération: impacts de la néogéographie
et de l’information géographique volontaire sur la
gestion urbaine participative”. In: Nouvelles
cartographie, nouvelles villes, HyperUrbain.
R. O. Obe et L. S. Hsu, 2015. PostGIS in action. Manning
Publications Co..
Commonwealth Computer Research, 2017. Apache Spark
Analysis. URL: http://www.geomesa.org/documenta
tion/tutorials/spark.html.
S. Ram, 2015. Magellan: Geospatial Analytics on Spark.
URL: http://hortonworks.com/blog/magellan-geospati
al-analytics-in-spark/.
R. Sriharasha, 2016. Magellan’s Github - issue 30. URL:
https://github.com/harsha2010/magellan/issues.
J. Vlissides et al, 1995. “Design patterns: Elements of
reusable object-oriented software”. In: Reading:
Addison-Wesley 49.120, p. 11.
D. Vorha, 2016. “Apache Parquet”. In: Practical Hadoop
Ecosystem. Springer, p. 325–335.
Elcano: A Geospatial Big Data Processing System based on SparkSQL
127
T. White, 2012. Hadoop: The definitive guide. O’Reilly
Media, Inc.
D. Xie et al, 2016. Simba: Efficient In-Memory Spatial
Analytics. URL: https://www.cs.utah.edu/~lifeifei/
papers/simba.pdf.
S. You, et al, 2015. “Large-scale spatial join query
processing in cloud”. In: Data Engineering Workshops
(ICDEW), 31st IEEE International Conference, p. 34–
41.
J. Yu, 2017. GeoSpark’s Github- issue 33. URL:
https://github.com/DataSystemsLab/GeoSpark/ issues.
J. Yu et al, 2015. “Geospark: A cluster computing
framework for processing large-scale spatial data”. In:
Proceedings of the 23rd SIGSPATIAL International
Conference on Advances in Geographic Information
Systems, p. 70.
M. Zaharia et al, 2010. “Spark: cluster computing with
working sets”. In: Proceedings of the 2nd USENIX
conference on Hot topics in cloud computing. T. 10, p.
10.
GISTAM 2018 - 4th International Conference on Geographical Information Systems Theory, Applications and Management
128