Towards a Scalable Architecture for Flight Data Management

an Garc

ıa

, Miguel A. Mart

ınez-Prieto

, An

ıbal Breg

, Pedro C.

Alvarez

1,3

and Fernando D

ıaz

Instituto de Investigaci

on en Matem

aticas (IMUVA), Universidad de Valladolid, 47011 Valladolid, Spain

Departamento de Inform

atica, Universidad de Valladolid, 40005 Segovia, Spain

Departamento de Estad

ıstica e Investigaci

on Operativa, Universidad de Valladolid, 47011 Valladolid, Spain

Keywords:

Data Lake, ADS-B, Big Data, Flight Analytics, Hadoop.

Abstract:

The dramatic growth in the air trafﬁc levels witnessed during the last two decades has increased the interest

for optimizing the Air Trafﬁc Management (ATM) systems. The main objective is being able to cope with the

sustained air trafﬁc growth under safe, economic, efﬁcient and environmental friendly working conditions. The

ADS-B (Automatic Dependent Surveillance - Broadcast) system is part of the new air trafﬁc control systems,

since it allows to substitute the secondary radar with cheaper ground stations that, at the same time, provide

more accurate real-time positioning information. However, this system generates a large volume of data that,

when combined with other ﬂight-related data, such as ﬂight plans or weather reports, faces scalability issues.

This paper introduces an (on-going) Data Lake based architecture which allows the full ADS-B data life-cycle

to be supported in a scalable and cost-effective way using technologies from the Apache Hadoop ecosystem.

1 INTRODUCTION

There is an increasing interest for optimizing air traf-

ﬁc management (ATM) to deal with the fast growing

number of ﬂight movements around the world. For

instance, EUROCONTROL predicts that there will be

11.6 ± 1.2 million ﬂight movements in the European

airspace by 2023, 14% more than in 2016 (Eurocon-

trol, 2017). It is a fact that the airspace is getting more

and more congested, and aviation authorities work to

modernize air trafﬁc control (ATC) systems.

ADS-B (Automatic Dependent Surveillance-

Broadcast) is one of the core technologies of the

new generation of ATC systems and is designed

to improve the safety, capacity, and efﬁciency of

airspaces. ADS-B enables ﬂight trajectories to

be tracked by using a Global Navigation Satellite

System (GNSS) instead of traditional radar commu-

nications. Although airspace systems are still in the

transition period to incorporate ADS-B technology,

the corresponding equipment will be mandatory for

some aircraft in Europe by the end of the current 2017

(European Commission, 2011). Nevertheless, some

vehicles have been already equipped with ADS-B

transponders and broadcast messages with informa-

tion about the ﬂight trajectory. These time-stamped

messages comprise the 24-bit aircraft hex code, the

ﬂight callsign, geolocation, altitude, or speed.

ADS-B messages are broadcasted twice per sec-

ond (Strohmeier et al., 2015), so large amounts of data

are streamed during the ﬂight. It is obvious that po-

tential scalability issues arise when ATM systems deal

with data from many ﬂights (e.g. all ﬂights within

a particular airspace over a period of time). In this

work, we focus on ATM systems that preserve ADS-

B data for subsequent processing. In this case, large

streams of ADS-B messages must be ingested into the

system and stored for different types of analytics. It

is usual that these analyses demand complementary

data to enhance trajectory information (e.g. ﬂight-

plans, weather, baggage ticketing, etc.). As a result,

huge repositories of heterogeneous data are consoli-

dated and ATM systems must deal with them to make

the corresponding decisions and forecasts. Thus, im-

plementing ATM systems is a particular case of ex-

ploiting Big Data for ﬂight-related analytics.

This paper describes our in-progress experience to

deploy a Big Data platform for ATM. More speciﬁ-

cally, we propose a scalable data-center architecture

which implements all ﬂight data life-cycle, from raw

data acquisition to highly-reﬁned data load into an

end-user system. This class of solutions, commonly

referred to as Data Lake (Terrizzano et al., 2015),

is designed as a centralized repository allowing large

amounts of (structured and/or unstructured) raw data

to be ingested and stored for subsequent transforma-

tion and/or integration purposes. All data manipula-

tions are continuously tracked by a data governance

García, I., Martínez-Prieto, M., Bregón, A., Álvarez, P. and Díaz, F.

Towards a Scalable Architecture for Flight Data Management.

DOI: 10.5220/0006473402630268

In Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA 2017), pages 263-268

ISBN: 978-989-758-255-4

263

component to ensure high-quality data to be ﬁnally

delivered for particular analytics.

Our main contribution in this paper is AIR-

PORTS DL, a data lake oriented architecture which

enables massive of ADS-B data collections to be ef-

ﬁciently managed and integrated with other ﬂight-

related data. The AIRPORTS DL components are

designed for scalable data management along its full

life-cycle, ensuring high-quality data to be deliv-

ered for ﬂight analytics. AIRPORTS DL is imple-

mented by a Hadoop-based ATM system which ex-

ploits ﬂight-related data to reconstruct gate-to-gate

trajectories and to derive parameters, such as the pre-

dictability or the fuel consumption, that can be later

used to measure and ultimately improve the ﬂight ef-

ﬁciency. This is our main objective in this project.

The rest of the paper is organized as follows. Sec-

tion 2 delves into more detail about ADS-B and re-

views some use-cases using this technology. Then,

Section 3 provides a detailed description of the Data

Lake architecture proposed in this paper. Finally, Sec-

tion 4 concludes about our current achievements and

explains our next steps in this project.

2 ADS-B

The Automatic Dependent Surveillance (ADS) is a

surveillance technology that allows aircraft to auto-

matically provide, through data link, the data ex-

tracted from the on board navigation and position de-

termination systems (Blythe et al., 2011). ADS has

three main features: i) it is automatic, since it requires

no intervention from any operator (such as the pilot)

or external input to transmit data; ii) it is dependent,

since the data generation depends on the on board air-

craft system (such as the navigation system); and iii)

it provides surveillance data similar to radar data for

both the ground controllers and other aircraft.

ADS provides technologies for broadcasting

(ADS-B) and contract (ADS-C). ADS-B automati-

cally and periodically transmits on board parameters

to all possible receivers within its range of inﬂuence

(it can transmit to both a ground station or to other

suitably equipped aircraft within the range). On the

other hand, ADS-C transmits similar information, but

only to on ground Air Trafﬁc Services Units (ATSUs)

and based on an established contract.

We will focus on ADS-B because it provides

foundational technology for the Next Generation Air

Transportation System (NextGen) and the Single Eu-

ropean Sky ATM Research (SESAR) programme

(Richards et al., 2010). Thus, it is a fundamental

building block of the future ATM systems. One of

the main ADS-B advantages is that it provides ATCs

with real-time position information obtained from a

navigation system that is usually more accurate than

that provided by a radar-based system. Increased ac-

curacy means increased safety and increased capacity

of the airspace and the airport. Additionally, ADS-

B works with low altitudes and even at ground level,

which allows its use to monitor aircraft ground ma-

noeuvring, greatly improving the resolution of ASDE

(Airport Surface Detection Equipment).

ADS-B has two components: i) ADS-B Out, that

broadcasts aircraft data, such as the aircraft identiﬁ-

cation, position, altitude, and velocity; and ii) ADS-B

In, that receives useful information for the pilot such

as the trafﬁc information about surrounding aircraft

or information transmitted through the Flight Infor-

mation Service-Broadcast (FIS-B), such as weather

reports or ﬂight information (e.g. temporary ﬂight

restrictions). Deploying ADS-B technology requires

two avionic components, i) a navigation system and a

datalink within the aircraft, and ii) a receiving (typi-

cally ground) station to receive the ADS-B data.

Figure 1: ADS-B operating scheme.

Figure 1 illustrates the ADS-B operating scheme.

Aircraft obtain their position through the built-in sen-

sors along with positioning data from the GNSS satel-

lites, and broadcast such data to any receiver within its

area of coverage. ATC ground stations receive such

data and use it as a replacement for the secondary

radar. Nearby aircraft can also receive such data to

provide situational awareness and allow self separa-

tion. In situations where the aircraft is away from

other aircraft or ground stations (such as uncovered

areas or transatlantic ﬂights), communication satel-

lites (COMM) are used to communicate with the re-

ceivers, thus allowing to have global coverage.

Some research and industrial literature have been

published about ADS-B. The earliest papers, such as

(Hicok and Lee, 1998) and (Zeitlin and Strain, 2002),

describe the system, its components and its imple-

mentation, while more recent works focus on how

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

264

ADS-B infrastructure is currently being deployed and

the open challenges around it (Ali, 2016; Strohmeier

et al., 2014). Collaborative projects like OpenSky

demonstrate how a network of ADS-B sensors can

be deployed to capture and share ADS-B data to the

community (Sch

afer et al., 2014). A recent paper

(Strohmeier et al., 2015) reports the experience of im-

plementing OpenSky and emphasize on its data archi-

tecture. The authors report that its original MySQL-

based deployment lacked of scalability and it was re-

placed by a Lambda-oriented architecture (see Sec-

tion 3). Another paper (Boci and Thistlethwaite,

2015) reports a preliminary experience of designing

a data lake for ADS-B data. In contrast to our ap-

proach, this deployment is restricted to a single type

of surveillance data (CAT033) and does not devise

how other data streams can be managed and inte-

grated to obtain more valuable knowledge.

3 AIRPORTS DL

A considerable effort has been carried out during the

past decade in Big Data solutions and scalable data

systems, and terms like Hadoop or NoSQL are two

of the new buzzwords in computational circles. On

the one hand, Hadoop is able to run large-scale batch

computations in a parallelized fashion, at the price

of high latency time. On the other hand, NoSQL

databases are highly scalable solutions, but face some

limitations regarding traditional relational databases.

However, these technologies excel when they are

combined intelligently with other tools (in the Big

Data ecosystem) to build scalable and fault tolerant

systems which are able to deal with variable and com-

plex amounts of data (Marz and Warren, 2015). These

systems are also extensible and allows ad-hoc queries

to be performed over the big data repository.

The Lambda architecture (Marz and Warren,

2015) is the main reference to build such type of

systems. It isolates real-time Big Data management

needs into three layers: (i) the Batch layer is re-

sponsible of preserving the master dataset, and com-

putes batch views transforming (raw) data for partic-

ular end-user purposes; (ii) the Serving layer enables

batch views to be efﬁciently accessed, and (iii) the

Speed layer assumes real-time data management.

Our current needs must be satisﬁed by only im-

plementing Batch and Serving layers, because real-

time data management is not currently addressed.

Although different approaches can be adopted, we

choose, as previously explained, the data lake one.

https://opensky-network.org/

This topic has received much attention recently

(Miloslavskaya and Tolstoy, 2016; Madera and Lau-

rent, 2016; Hai et al., 2016). A data lake comprises

a set of centralized repositories with no schema-on-

write restrictions. That is, structured and unstructured

data can be effectively stored and only on-read restric-

tions are made. Descriptive metadata must be also

maintained to avoid the data lake to be turned into a

data swamp (Gartner, 2014). The data lake also as-

sumes the traditional ETL (Extract-Transformation-

Load) responsibilities, while preserving all ongoing

data for traceability and analysis purposes. Thus, the

data lake implements storage and data computation

responsibilities of the Batch layer.

Data lakes are usually deployed using Hadoop-

based technology (White, 2015) to ensure cost-

effective storage and processing using the Hadoop

Distributed File System (HDFS) and the MapReduce

computation model, respectively. Regarding the re-

sulting batch views, which comprises highly-curated

data, they must be managed outside of the data lake.

The Serving layer implementation depends on how

data are ﬁnally exploited by end-user systems. Al-

though it is common to use data warehouse technol-

ogy, NoSQL databases are increasingly adopted to de-

ploy scalable Serving layer implementations.

AIRPORTS DL

combines these foundations to

design an scalable architecture able to deal with volu-

minous ADS-B data streams, and a variety of ﬂight-

related datasets. Figure 2 provides a big picture of

AIRPORTS DL, including the data lake itself, exter-

nal data sources, and the system which implements

the Serving layer. All these elements are described by

following their numeric identiﬁers; we also mention

the technologies

used to implement each one.

1. Data Sources. This element is “the world

around AIRPORTS DL” and includes all external

databases or live services which feed data into the

data lake. We collect information from many ADS-

B providers to get a wide coverage of the air space.

It includes

the aforementioned OpenSky community

network, but also comercial providers. Flight plans,

weather information, provided by the Global Fore-

AIRPORTS DL relies on the Aviation Data Analytics

Platform Testbed (ADAPT) by Boeing Research and

Technology-Europe (BR&T-E) in Madrid, Spain

These technologies are usually available within Hadoop

distributions, so more information about them can be

found in references like (White, 2015).

We also feed ADS-B data captured by the Frambuesa

BR&T-E sensor that currently operates at the Madrid-

Barajas Adolfo Su

arez Airport (LEMD).

Towards a Scalable Architecture for Flight Data Management

265

Figure 2: AIRPORTS DL Architecture.

cast System

, or information about lightning strikes,

ﬂights, airlines, and airports, are also ingested into

the data lake to consolidate a huge data collection.

2. Ingestion. This architectural component is re-

sponsible for importing data from the data sources.

Many data is directly streamed from the providers,

but the component must also get data from avail-

able databases or ﬂat ﬁles. We use Apache Flume

to deal with real-time continuous data streams, while

the other ingestion methods are implemented using

Apache Sqoop or Hadoop command utilities. All in-

gested data is preserved by the storage component.

3. Storage. It is the cornerstone of our architecture

and it is designed as an scalable and ﬂexible compo-

nent which allows any type of (big) data to be stored

with no restrictions. This component is implemented

on top of the HDFS (Shvachko et al., 2010) ﬁle sys-

tem, ensuring reliable and low cost data persistence

for large ﬁles. Our storage component forces data

immutability to ensure data traceability. It means that

the original raw data will never be modiﬁed within

the data lake, although it can be copied and progres-

sively reﬁned for different purposes. This component

arranges multiple reﬁnement levels where data at dif-

ferent levels of quality will be preserved. Lower lev-

els are used for transforming data within the scope

of their corresponding data source, while higher lev-

els are used for heterogeneous data integration. This

component allows data to be accessed using HDFS

commands, but also provides a SQL-like interface us-

ing external Hive tables.

4. Transformation. It is the “low-level” data re-

ﬁnery of AIRPORTS DL and allows data to be ma-

nipulated and transformed within their original scope

(with no addition of external data). The MapReduce

http://www.emc.ncep.noaa.gov/index.php?branch=GFS

computation model (Dean and Ghemawat, 2008) is

the core technology of this component, although other

high-level tools such as Apache Hive or Apache Pig,

are also available to implement and run the corre-

sponding jobs. These jobs interact massively with the

storage component. In practice, transformation is a

complex task which needs many jobs to be effectively

orchestrated (we use Apache Oozie for this purpose).

5. Integration. It is the “high-level” data reﬁnery

and enables heterogeneous data to be integrated and

aligned to satisfy end-user application requirements.

Thus, the transformation and integration components

assume the batch view computations (Lambda Batch

layer). The integration component also runs complex

dataﬂows implemented using the same technologies.

6. Exploration. It implements an interface to the

storage component that enables on-going data to be

analyzed before being loaded into the Serving layer.

Although it is a dispensable component at architec-

tural level, it is a highly useful tool for data scientists.

The Hadoop command line is used for simple data

exploration, and more advanced data analysis can be

implemented using HiveQL queries.

7. Statistical Analysis. It is a particular exploration

subcomponent which provides a high-level interface

enabling statistical analysis on top of the storage com-

ponent. It is currently implemented using RStudio

but also makes exhaustive use of Apache Spark and

visualization libraries like Shiny.

8. Load. This is a simple component which moves

batch views from their storage in the data lake to the

Serving Layer. Although Apache Sqoop is a reference

technology at this level, this component is designed

https://www.rstudio.com/

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

266

to adjust particular load requirements from the Serv-

ing layer. As current data loading is implemented us-

ing logical Hive tables which play a mediator role be-

tween the storage component and our Serving Layer.

9. Governance. It is a transversal component that

monitors data manipulations within the data lake and

preserves enough information to trace them. Thus, it

is in continuous communication with all components.

We use Apache Atlas to implement data governance

and lineage policies. It can be easily integrated with

the other tools in the data lake, allowing metadata to

be easily managed for each data manipulation.

10. Serving Layer. This component is out of the

data lake and there are no restrictions about its imple-

mentation. Nevertheless, it must be deployed to en-

sure ATM analytics to be effectively obtained on top

of the batch views loaded from the storage compo-

nent. We currently choose Apache HBase since it pro-

vides straightforward integration within the Hadoop

ecosystem, and allows batch views to be queried in

real-time. This deployment is suitable for the current

status of the project, but it could be replaced if new

ATM requirements demand it.

The main ADS-B processing dataﬂow is as follows:

• Independent Flume agents are deployed, in the

ingestion component, to (continuously) capture

ADS-B data from different providers.

• These corresponding message streams are deliv-

ered to the storage component, which preserves

them in particular repositories.

• A transformation Oozie workﬂow is daily

launched for ADS-B data cleansing and fact dis-

covering. It includes simple tasks (e.g. discarding

useless messages or aligning ﬁeld values to sat-

isfy the AIRPORTS data model), but also more

complex ones (e.g. determining trajectories when

different ﬂights use the same callsign or relevant

messages are lost). As a result, messages and tra-

jectories are effectively cleaned in the scope of

each ADS-B provider. This data is made available

for integration, but also for exploration purposes.

• A second Oozie workﬂow is then launched for in-

tegration. It is performed from two complemen-

tary perspectives. On the one hand, ADS-B mes-

sages are grouped to reduce the existing gaps in

ﬂight trajectories. This issue is due to each data

provider covers particular airspace areas, but does

not capture information outside of them. Thus,

more realistic trajectories are reconstructed after

grouping. On the other hand, we use other repos-

itories in AIRPORTS DL to enhance such trajec-

tories; e.g. ADS-B does not provide information

about departure/arrival airports, but it is inferred

by integrating data from our ﬂight/airport reposi-

tories. All batch views are preserved in the storage

component and then loaded into HBase.

It is just an example, but other dataﬂows are cur-

rently implemented, and preliminary end-user proto-

types also interact with the Serving layer to obtain rel-

evant analytics for ATM.

4 CONCLUSIONS

The data lake oriented architecture AIRPORTS DL

has been motivated and explained in the current paper.

Although it is an on-going deployment, AIRPORTS

DL has been successfully used to ingest, store, and

process ADS-B messages, and other ﬂight related

data, for almost the last two years. During this period,

an average amount of 20 GB of new ﬂight data (from

the most relevant sources) and 20 GB of weather data

have been collected per day.

This data has been reﬁned within the data lake to

satisfy particular needs for ATM optimization and fur-

ther decision making. Although these end-used re-

quirements are part of our future work, initial ﬂight-

related pattern analysis and visualization prototypes

have been deployed. For instance, Figure 3 shows a

trajectory density map which enables the main traf-

ﬁc patterns, in the Spanish air space, to be identiﬁed

from reconstructed trajectories. Regarding visualiza-

tion, Figure 4 shows a screenshot of our on-going

dashboard. It just allows to visualize different ﬂight

trajectory dimensions (GPS positions, speed or alti-

tude proﬁles, etc.), but future versions will enable all

batch views (including reﬁned data and/or computed

analytics) to be efﬁciently accessed.

Our future work also includes the integration of

more data sources (e.g. lightning strikes), and an ex-

haustive benchmarking of all data lake components to

ensure their scalability in a production ATM scenario.

ACKNOWLEDGEMENTS

This work is part of a joint project (AIRPORTS) with

Boeing Research and Technology Europe (BRT&E),

and is funded by CDTI and MINECO (FEDER)

ref. IDI-20150616, CIEN 2015. Authors are

partly funded by MINECO (PGE and FEDER)

projects TIN2013-46238-C4-3-R, TIN2016-78011-

Towards a Scalable Architecture for Flight Data Management

267

Figure 3: Trajectories density map.

Figure 4: AIRPORTS dashboard.

C4-1-R, MTM2014-56235-C2-1-P and MTM2014-

56235-C2-2, and Consejer

ıa de Educaci

on, JCyL,

project VA212U13.

REFERENCES

Ali, B. (2016). System Speciﬁcations for Developing an

Automatic Dependent Surveillance-Broadcast (ADS-

B) Monitoring System. International Journal of Crit-

ical Infrastructure Protection, 15:40–46.

Blythe, W., Anderson, H., and King, N. (2011). ADS-B Im-

plementation and Operations Guidance Document. In-

ternational Civil Aviation Organization. Asia and Pa-

ciﬁc Ocean.

Boci, E. and Thistlethwaite, S. (2015). A Novel Big Data

Architecture in Support of ADS-B Data Analytic. In

Integrated Communication, Navigation and Surveil-

lance Conference, pages C1–1–C1–8.

Dean, J. and Ghemawat, S. (2008). MapReduce: Simpliﬁed

Data Processing on Large Clusters. Communications

of ACM, 51(1):107–113.

Eurocontrol (2017). Flight Movements and Service

Units 2017-2023. Technical report, Eurocontrol.

http://www.eurocontrol.int/sites/default/ﬁles/content/

documents/ofﬁcial-documents/forecasts/seven-

year-ﬂights-service-units-forecast-2017-2023-

Feb2017.pdf.

European Commission (2011). Commission Implement-

ing Regulation (EU) No 1207/2011. Ofﬁcial Jour-

nal of the European Union. http://data.europa.eu/

eli/reg impl/2011/1207/oj.

Gartner (2014). Gartner Says Beware of the Data Lake Fal-

lacy. http://www.gartner.com/newsroom/id/2809117.

Hai, R., Geisler, S., and Quix, C. (2016). Constance: an

intelligent Data Lake System. In International Con-

ference on Management of Data, pages 2097–2100.

Hicok, D. and Lee, D. (1998). Application of ADS-B for

Airport Surface Surveillance. In 17th Digital Avionics

Systems Conference, volume 2, pages F34/1–F34/8.

Madera, C. and Laurent, A. (2016). The Next Information

Architecture Evolution: the Data Lake Wave. In 8th

International Conference on Management of Digital

EcoSystems, pages 174–180.

Marz, N. and Warren, J. (2015). Big Data: Principles

and Best Practices of Scalable Realtime Data Sys-

tems. Manning Publications Co.

Miloslavskaya, N. and Tolstoy, A. (2016). Application of

Big Data, Fast Data, and Data Lake Concepts to In-

formation Security Issues. In 4th International Con-

ference on Future Internet of Things and Cloud, pages

148–153.

Richards, W., O’Brien, K., and Miller, D. (2010). New Air

Trafﬁc Surveillance Technology. Aero, 2:7–14.

Sch

afer, M., Strohmeier, M., Lenders, V., Martinovic, I.,

and Wilhelm, M. (2014). Bringing up OpenSky: a

Large-Scale ADS-B Sensor Network for Research. In

13th International Symposium on Information Pro-

cessing in Sensor Networks, pages 83–94.

Shvachko, K., Kuang, H., Radia, S., and Chansler, R.

(2010). The Hadoop Distributed File System. In 26th

Symposium on Mass Storage Systems and Technolo-

gies, pages 1–10.

Strohmeier, M., Martinovic, I., Fuchs, M., Sch

afer, M., and

Lenders, V. (2015). OpenSky: A Swiss Army Knife

for Air Trafﬁc Security Research. In 34th Digital

Avionics Systems Conference, pages 4A1–1–4A1–14.

Strohmeier, M., Sch

afer, M., Lenders, V., and Martinovic,

I. (2014). Realities and Challenges of NextGen Air

Trafﬁc Management: the case of ADS-B. IEEE Com-

munications Magazine, 52(5):111–118.

Terrizzano, I., Schwarz, P., Roth, M., and Colino, J. (2015).

Data Wrangling: The Challenging Journey from the

Wild to the Lake. In 7th Biennial Conference on In-

novative Data Systems Research. Paper 2.

White, T. (2015). Hadoop: The Deﬁnitive Guide. O’Reilly.

Zeitlin, A. and Strain, R. (2002). Augmenting ADS-B with

Trafﬁc Information Service-Broadcast. In 21st Digital

Avionics Systems Conference, volume 1, pages 3D2–

1–3D2–7.

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

268