Providing analytical tools would be a valuable
feature for CEBA users.
Hence, in this paper, we aim to investigate using
Elasticsearch as a data warehouse and Kibana as a
Spatial OLAP visualisation tool. Data warehouses
support managers in decision-making (Jarke, 2002),
(Inmon, 2005), (Pinet, 2010). Traditionally, data
warehouses are based on relational data models, but
this type of model is not the most efficient for
real-time sensor streams. The ELK stack is better
suited to stream management, but it does not provide
the analytical features offered by data
warehouses. The authors of (Bicevska, 2017)
discussed NoSQL-based data warehouse solutions
and identified some of their advantages. They noted,
however, the lack of reporting tools compatible with
NoSQL systems.
In this paper, we propose a method to model a
spatial data warehouse with the ELK stack. We
present the main structure of a component called IAT
(Integration and Aggregation Tool) that allows users
to define mappings (Lenzerini, 2002) and aggregation
options between sensor sources and a target index in
Elasticsearch. IAT acts as a streaming ETL (Sabtu,
2017). It continuously extracts records from Logstash,
aggregates them, and transforms and maps them
according to the output schema. The output data is in
JSON format and is stored in an Elasticsearch index.
Elasticsearch (ES) is powerful for search and
aggregation queries but less so for join queries
(Pilato, 2017). Hence, we store the data output by
IAT in a single ES index.
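As an illustration of the streaming ETL role described above, the following Python sketch maps a window of raw records to an output schema, aggregates them, and serialises the result as JSON documents. It is a minimal sketch under stated assumptions, not the IAT implementation: the field names, the `FIELD_MAPPING` dictionary, the averaging per station, and the order of the map and aggregate steps are all hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from source field names to the target index schema.
FIELD_MAPPING = {"temp": "temperature", "ts": "timestamp", "sid": "station_id"}


def map_record(record, mapping=FIELD_MAPPING):
    """Rename source fields to the names expected by the output schema."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def aggregate_by_station(records, measure="temperature"):
    """Average one measure per station over a window of mapped records."""
    groups = defaultdict(list)
    for record in records:
        groups[record["station_id"]].append(record[measure])
    return [
        {"station_id": station, f"avg_{measure}": mean(values), "count": len(values)}
        for station, values in sorted(groups.items())
    ]


def to_json_documents(aggregates):
    """Serialise each aggregate as one JSON document, ready for indexing."""
    return [json.dumps(doc, sort_keys=True) for doc in aggregates]


# Hypothetical window of raw sensor records, as they might arrive from Logstash.
window = [
    {"sid": "s1", "ts": "2020-01-01T10:00:00Z", "temp": 10.0},
    {"sid": "s1", "ts": "2020-01-01T10:05:00Z", "temp": 12.0},
    {"sid": "s2", "ts": "2020-01-01T10:00:00Z", "temp": 8.0},
]
documents = to_json_documents(aggregate_by_station([map_record(r) for r in window]))
```

Because every output document is self-contained, all of them can be written to a single index without requiring joins at query time.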
The paper is organised as follows. In the next
section, we present related work. In Section 3,
we present our work and the architecture built on
the ELK stack, as well as the use case for analytical
queries. We present the functionalities of the IAT
components through the use case. Finally, we present
an example of a measurement station dashboard for
our use case and we conclude.
2 BACKGROUND AND RELATED WORK
In this section, we present the main related work and
the concepts underlying our paper, namely sensor data,
spatial data warehouses, the ETL process, and the ELK
stack.
2.1 Sensor Data
Sensors are a popular technology for collecting
environmental data. As the technology develops,
many kinds of environmental sensors have become
available, e.g., (Werner-Allen, 2006), (Yick, 2008),
(Richter, 2009), (Noury, 2018).
Sensor data are usually georeferenced. The
records consist of measurements or observations taken
at a specific location (geo-point) or within a specific
area (geo-shape). The geographical information in the
measurement is usually the physical location of the
sensor. In CEBA, data collected from sensors are
georeferenced data.
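To make the distinction between a geo-point and a geo-shape concrete, the following Python sketch builds one georeferenced record together with an index mapping that could hold it. `geo_point` and `geo_shape` are Elasticsearch field types; the field names, the coordinates, and the helper `is_valid_geo_point` are hypothetical and serve only as an illustration.

```python
import json

# Hypothetical index mapping. `geo_point` and `geo_shape` are Elasticsearch
# field types; the field names themselves are assumptions.
index_mapping = {
    "mappings": {
        "properties": {
            "station_id": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "temperature": {"type": "float"},
            "location": {"type": "geo_point"},
            "coverage": {"type": "geo_shape"},
        }
    }
}

# Hypothetical record: a measurement taken at a point located inside an area.
record = {
    "station_id": "s1",
    "timestamp": "2020-01-01T10:00:00Z",
    "temperature": 11.5,
    "location": {"lat": 45.76, "lon": 3.11},
    "coverage": {
        "type": "Polygon",
        # GeoJSON order is [lon, lat]; the ring is closed (first == last).
        "coordinates": [[[3.10, 45.75], [3.12, 45.75], [3.12, 45.77],
                         [3.10, 45.77], [3.10, 45.75]]],
    },
}


def is_valid_geo_point(point):
    """Check that latitude and longitude lie within their valid ranges."""
    return -90.0 <= point["lat"] <= 90.0 and -180.0 <= point["lon"] <= 180.0


# The record serialises to the JSON document that would be sent for indexing.
document = json.dumps(record)
```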
2.2 Data Warehouse and Spatial Data Warehouse
In principle, data warehouses are designed for
analytical queries (Inmon, 2005). Data are arranged
as either facts or dimensions and are mainly
modelled in a star or snowflake schema. Facts
consist mainly of measures or metrics (i.e., the data to
analyse), whereas dimensions are mainly descriptive
and define the axes along which aggregations are
computed (Jarke, 2002). Data warehouses can be
represented in a
multidimensional conceptual model. The
multidimensional data structures are also called data
cubes. Users can analyse data using online analytical
processing (OLAP) tools. The most popular OLAP
operations are roll-up, drill-down (also called
roll-down), slicing, and dicing (Matei, 2014).
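The OLAP operations listed above can be illustrated on a very small fact table. The following Python sketch is a minimal illustration, not the behaviour of any particular OLAP tool: the fact table, the dimension names, and the use of a sum as the aggregation function are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: dimensions (station, day, month) and one measure
# (rainfall in mm). "month" is a coarser level of the time dimension than "day".
facts = [
    {"station": "s1", "day": "2020-01-01", "month": "2020-01", "rain_mm": 2.0},
    {"station": "s1", "day": "2020-01-02", "month": "2020-01", "rain_mm": 3.0},
    {"station": "s1", "day": "2020-02-01", "month": "2020-02", "rain_mm": 1.0},
    {"station": "s2", "day": "2020-01-01", "month": "2020-01", "rain_mm": 4.0},
]


def roll_up(rows, dims, measure="rain_mm"):
    """Sum the measure for each combination of the given dimension levels."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return [row for row in rows if row[dim] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall within the allowed value sets."""
    return [row for row in rows if all(row[d] in vals for d, vals in allowed.items())]


by_day = roll_up(facts, ["station", "day"])      # finer level
by_month = roll_up(facts, ["station", "month"])  # roll-up: day -> month
s1_only = slice_(facts, "station", "s1")
s1_january = dice(facts, station={"s1"}, month={"2020-01"})
```

Moving from `by_month` back to `by_day` corresponds to a drill-down, since the same measure is shown at a finer level of the time dimension.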
Spatial data warehouses and spatial OLAP tools
extend these concepts. In particular, they support
storing, aggregating and analysing geographical data
(Nipun Garg, 2011). In spatial data warehouses, facts
and dimensions may be spatial objects.
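A spatial dimension can be sketched in the same spirit. In the following Python example, each fact carries a point location and is rolled up to the zone that contains it. The zones, their bounding boxes, the coordinates, and the choice of the mean as aggregation function are hypothetical; a real spatial data warehouse would typically rely on true geometries rather than bounding boxes.

```python
from statistics import mean

# Hypothetical zones, each given as a bounding box
# (min_lon, min_lat, max_lon, max_lat).
zones = {
    "north": (3.0, 46.0, 4.0, 47.0),
    "south": (3.0, 45.0, 4.0, 46.0),
}

# Hypothetical facts: a point location and one measure.
facts = [
    {"lon": 3.1, "lat": 45.7, "temperature": 10.0},
    {"lon": 3.5, "lat": 45.2, "temperature": 14.0},
    {"lon": 3.2, "lat": 46.5, "temperature": 6.0},
]


def in_bbox(lon, lat, bbox):
    """Half-open containment test, so shared borders belong to one zone only."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon < max_lon and min_lat <= lat < max_lat


def zone_of(fact, zones):
    """Return the name of the first zone containing the fact, or None."""
    for name, bbox in zones.items():
        if in_bbox(fact["lon"], fact["lat"], bbox):
            return name
    return None


def spatial_roll_up(facts, zones, measure="temperature"):
    """Roll point-level facts up to zone level by averaging the measure."""
    groups = {}
    for fact in facts:
        zone = zone_of(fact, zones)
        if zone is not None:
            groups.setdefault(zone, []).append(fact[measure])
    return {zone: mean(values) for zone, values in groups.items()}


per_zone = spatial_roll_up(facts, zones)
```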
2.3 Batch and Streaming ETL (Extract, Transform, Load)
Traditionally, ETL is a process for (i) extracting data
from multiple sources, (ii) transforming them and (iii)
loading them into a data warehouse (Bansal, 2015).
Batch ETL is the form of this process that is
triggered at a specific time and processes a large
volume of data in a single run.
Streaming ETL is an enhancement of the ETL
process: it executes the process in near real-time.
This approach overcomes the limitations of batch ETL
for streaming data and allows data to be analysed
shortly after the sources produce them.
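The difference between the two approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the source generator, the Fahrenheit-to-Celsius transformation, and the `load` callback standing in for the data warehouse are all hypothetical.

```python
def transform(record):
    """Hypothetical transformation: convert a Fahrenheit reading to Celsius."""
    return {
        "station": record["station"],
        "temp_c": (record["temp_f"] - 32) * 5 / 9,
    }


def batch_etl(source, load):
    """Batch: extract everything that has accumulated, then load it in one run."""
    records = list(source)
    load([transform(record) for record in records])


def streaming_etl(source, load):
    """Streaming: transform and load each record as soon as it arrives."""
    for record in source:
        load([transform(record)])


def sensor_source():
    """Hypothetical source yielding raw readings one at a time."""
    yield {"station": "s1", "temp_f": 50.0}
    yield {"station": "s1", "temp_f": 68.0}
    yield {"station": "s2", "temp_f": 32.0}


# Each call to `load` is recorded so the two loading patterns can be compared.
batch_calls, stream_calls = [], []
batch_etl(sensor_source(), batch_calls.append)
streaming_etl(sensor_source(), stream_calls.append)
```

In this sketch the batch variant issues one load containing three records, whereas the streaming variant issues three loads of one record each, so each record becomes available for analysis without waiting for the rest.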
2.4 ELK Stack
The ELK stack (Elasticsearch, 2020) is composed of four
main open-source projects: Beats, Logstash,
Elasticsearch, and Kibana. Beats are data shippers.