ADS-B infrastructure is currently being deployed and
the open challenges around it (Ali, 2016; Strohmeier
et al., 2014). Collaborative projects like OpenSky¹
demonstrate how a network of ADS-B sensors can
be deployed to capture and share ADS-B data with the
community (Schäfer et al., 2014). A recent paper
(Strohmeier et al., 2015) reports the experience of
implementing OpenSky and emphasizes its data
architecture. The authors report that its original
MySQL-based deployment lacked scalability and was
replaced by a Lambda-oriented architecture (see
Section 3). Another paper (Boci and Thistlethwaite,
2015) reports a preliminary experience of designing
a data lake for ADS-B data. In contrast to our
approach, this deployment is restricted to a single type
of surveillance data (CAT033) and does not address
how other data streams can be managed and
integrated to obtain more valuable knowledge.
3 AIRPORTS DL
Considerable effort has been devoted during the
past decade to Big Data solutions and scalable data
systems, and terms like Hadoop and NoSQL have
become buzzwords in computational circles. On
the one hand, Hadoop is able to run large-scale batch
computations in a parallelized fashion, at the price
of high latency. On the other hand, NoSQL
databases are highly scalable solutions, but face some
limitations compared with traditional relational databases.
However, these technologies excel when they are
combined intelligently with other tools (in the Big
Data ecosystem) to build scalable and fault-tolerant
systems which are able to deal with variable and
complex amounts of data (Marz and Warren, 2015). These
systems are also extensible and allow ad-hoc queries
to be performed over the big data repository.
The Lambda architecture (Marz and Warren,
2015) is the main reference for building this type of
system. It divides real-time Big Data management
needs into three layers: (i) the Batch layer is
responsible for preserving the master dataset, and
computes batch views by transforming (raw) data for
particular end-user purposes; (ii) the Serving layer enables
batch views to be efficiently accessed; and (iii) the
Speed layer handles real-time data management.
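
The division of responsibilities among the three layers can be illustrated with a minimal Python sketch. It is not the AIRPORTS DL implementation: the record layout (aircraft identifier, timestamp, altitude), the "maximum altitude per aircraft" view, and all class and function names are assumptions chosen only for illustration. The Speed layer is included to show the general architecture.

```python
# Hypothetical record layout: (aircraft_id, timestamp, altitude_ft).
master_dataset = []  # Batch layer storage: append-only, never updated in place


def append_record(record):
    """Batch layer: preserve the raw record in the master dataset."""
    master_dataset.append(tuple(record))


def compute_batch_view(records):
    """Batch layer: recompute a view from scratch over all raw records.

    The illustrative view is the maximum altitude reported per aircraft.
    """
    view = {}
    for aircraft_id, _timestamp, altitude_ft in records:
        view[aircraft_id] = max(view.get(aircraft_id, altitude_ft), altitude_ft)
    return view


class ServingLayer:
    """Serving layer: exposes the latest batch view for fast lookups."""

    def __init__(self):
        self._view = {}

    def load(self, batch_view):
        self._view = dict(batch_view)  # swap in the freshly computed view

    def get(self, aircraft_id):
        return self._view.get(aircraft_id)


class SpeedLayer:
    """Speed layer: incremental view over records not yet batch-processed."""

    def __init__(self):
        self._view = {}

    def update(self, record):
        aircraft_id, _timestamp, altitude_ft = record
        current = self._view.get(aircraft_id, altitude_ft)
        self._view[aircraft_id] = max(current, altitude_ft)

    def get(self, aircraft_id):
        return self._view.get(aircraft_id)


def query(serving, speed, aircraft_id):
    """Answer a query by merging the batch view with the real-time view."""
    candidates = [serving.get(aircraft_id), speed.get(aircraft_id)]
    candidates = [value for value in candidates if value is not None]
    return max(candidates) if candidates else None


for rec in [("abc123", 1, 30000), ("abc123", 2, 32000), ("def456", 1, 10000)]:
    append_record(rec)

serving = ServingLayer()
serving.load(compute_batch_view(master_dataset))

speed = SpeedLayer()
speed.update(("abc123", 3, 35000))  # arrives after the last batch run

print(query(serving, speed, "abc123"))  # 35000
```

A system that omits the Speed layer would answer queries from `serving.get` alone, at the cost of not seeing records that arrived after the last batch run.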
Our current needs can be satisfied by implementing
only the Batch and Serving layers, because real-time
data management is not currently addressed.
Although different approaches can be adopted, we
choose, as previously explained, the data lake one.
¹ https://opensky-network.org/
This topic has received much attention recently
(Miloslavskaya and Tolstoy, 2016; Madera and Laurent,
2016; Hai et al., 2016). A data lake comprises
a set of centralized repositories with no
schema-on-write restrictions. That is, structured and unstructured
data can be effectively stored and only schema-on-read
restrictions apply. Descriptive metadata must also be
maintained to prevent the data lake from turning into a
data swamp (Gartner, 2014). The data lake also
assumes the traditional ETL (Extract-Transform-Load)
responsibilities, while preserving all ongoing
data for traceability and analysis purposes. Thus, the
data lake implements the storage and data computation
responsibilities of the Batch layer.
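
The schema-on-read principle and the role of descriptive metadata can be sketched as follows. This is a minimal in-memory illustration, not the AIRPORTS DL design: the `DataLake` class, its methods, the metadata fields, and the sample payloads are all hypothetical. Raw payloads are stored untouched, a catalog entry describes each one, and a schema is applied only when the data are read.

```python
import csv
import io
import json
from datetime import datetime, timezone


class DataLake:
    """Toy data lake: raw objects plus a descriptive metadata catalog."""

    def __init__(self):
        self._objects = {}  # object id -> raw bytes, stored as received
        self._catalog = {}  # object id -> descriptive metadata

    def ingest(self, raw, source, fmt):
        """Store a payload without validating it (no schema-on-write)."""
        object_id = len(self._objects)
        self._objects[object_id] = bytes(raw)
        self._catalog[object_id] = {
            "source": source,
            "format": fmt,
            "size_bytes": len(raw),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }
        return object_id

    def find(self, **criteria):
        """Locate objects through the catalog, e.g. find(format='csv')."""
        return [
            object_id
            for object_id, meta in self._catalog.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]

    def read(self, object_id, parser):
        """Apply a schema only at read time (schema-on-read)."""
        return parser(self._objects[object_id])


def parse_json_lines(raw):
    """Interpret a payload as one JSON document per line."""
    lines = raw.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def parse_csv(raw):
    """Interpret a payload as CSV with a header row."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


lake = DataLake()
json_id = lake.ingest(
    b'{"icao24": "abc123", "alt": 11000}\n', source="sensor-a", fmt="jsonl"
)
csv_id = lake.ingest(b"icao24,alt\nabc123,11000\n", source="provider-b", fmt="csv")

for object_id in lake.find(format="csv"):
    print(lake.read(object_id, parse_csv))
```

Without the catalog, a reader would have no way to know which parser applies to which object, which is the "data swamp" situation the metadata is meant to avoid.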
Data lakes are usually deployed using Hadoop-
based technology (White, 2015) to ensure cost-
effective storage and processing using the Hadoop
Distributed File System (HDFS) and the MapReduce
computation model, respectively. The resulting
batch views, which comprise highly curated
data, must be managed outside the data lake.
The Serving layer implementation depends on how
data are finally exploited by end-user systems.
Although it is common to use data warehouse
technology, NoSQL databases are increasingly adopted to
deploy scalable Serving layer implementations.
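
The MapReduce computation model can be illustrated with a single-process Python sketch that derives a batch view from raw records. It does not use Hadoop itself: the record layout (aircraft address, timestamp), the "messages per aircraft" view, and the function names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (key, value) pair per raw record."""
    icao24, _timestamp = record  # hypothetical record layout
    yield icao24, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence and return the batch view."""
    pairs = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


raw_records = [("abc123", 1), ("abc123", 2), ("def456", 1)]
batch_view = run_job(raw_records)
print(batch_view)  # {'abc123': 2, 'def456': 1}
```

In a real deployment the map and reduce functions would run in parallel over HDFS blocks, and the resulting view would be loaded into the Serving layer store rather than kept in a Python dictionary.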
AIRPORTS DL² combines these foundations to
design a scalable architecture able to deal with
voluminous ADS-B data streams and a variety of
flight-related datasets. Figure 2 provides a big picture of
AIRPORTS DL, including the data lake itself,
external data sources, and the system which implements
the Serving layer. All these elements are described by
following their numeric identifiers; we also mention
the technologies³ used to implement each one.
1. Data Sources. This element is “the world
around AIRPORTS DL” and includes all external
databases or live services which feed data into the
data lake. We collect information from many ADS-B
providers to get wide coverage of the airspace.
These include⁴ the aforementioned OpenSky community
network, but also commercial providers. Flight plans,
weather information, provided by the Global Fore-
² AIRPORTS DL relies on the Aviation Data Analytics
Platform Testbed (ADAPT) by Boeing Research and
Technology-Europe (BR&T-E) in Madrid, Spain.
³ These technologies are usually available within Hadoop
distributions, so more information about them can be
found in references like White (2015).
⁴ We also feed ADS-B data captured by the Frambuesa
BR&T-E sensor that currently operates at the
Madrid-Barajas Adolfo Suárez Airport (LEMD).
Towards a Scalable Architecture for Flight Data Management