Generative Deep Learning for Solutions to Data Deconflation Problems
in Information and Operational Technology Networks
Roger A. Hallman¹,³, John M. San Miguel², Arron Lu², Alejandro Monje², Mohammad R. Alam⁴ and George Cybenko³
¹ C.A.T Labs, San Diego, California, U.S.A.
² Naval Information Warfare Center Pacific, San Diego, California, U.S.A.
³ Thayer School of Engineering, Dartmouth College, Hanover, New Hampshire, U.S.A.
⁴ The M.I.T.R.E. Corporation, San Diego, California, U.S.A.
George.Cybenko@dartmouth.edu
Keywords:
Data Deconflation, Source Separation, Generative Adversarial Networks (GANs), Transformers,
Double-NATed Network Traffic, Network Situational Awareness.
Abstract:
Source separation problems are a long-standing and well-studied challenge in signal processing and informa-
tion sciences. The “Cocktail Party Phenomenon” and other classical source separation problems are vector
representable and additive, and thus solvable by well-established linear algebra techniques. However, the pro-
liferation and adoption of Internet-connected devices (e.g., IoT, distributed sensor networks, etc.) have led
to a “Cambrian explosion” of data that is available for processing. Much of this data is not readily available
for processing because it includes data objects that are categorical or non-additive superpositions (i.e., data
not confined to signals). The Data Deconflation Problem refers to the challenge of identifying and separat-
ing the individual constituent elements of these complex data objects. Real-world data deconflation scenarios
include pattern-of-life tracking (e.g., identifying recreational activities in conjunction with a business trip),
multi-target tracking (e.g., occlusions and track assignment challenges), and network situational awareness
(e.g., monitoring NATed network traffic, detecting and identifying shadow IT, network steganalysis).
This paper details our approach, utilizing Generative Adversarial Networks (GANs) and attention-based Trans-
formers, to solving the data deconflation problem, as well as our experimental application to network situa-
tional awareness tasks. We cover traditional source separation solutions and expound upon why these solutions
are inadequate for network monitoring tasks. Background information on GANs and transformers is presented
before a description of our architecture and initial experimentation which serves as a proof-of-concept. We
then describe experimentation applying our methodology to network monitoring tasks, in particular separat-
ing activities and shadow IT devices within double-NATed network traffic. We discuss our results and our
methodology’s applicability to other network monitoring tasks, such as network steganalysis and covert chan-
nel detection.
1 INTRODUCTION AND
MOTIVATION
The ever-increasing adoption of distributed sensor
networks, Internet of Things (IoT), networked infras-
tructure, as well as mobile and wearable devices has
brought about a “Cambrian Explosion” of data that is
available for processing. It is unlikely that these tor-
rents of data will be ready–or even useful–for process-
ing and computation upon arrival at data centers. For
instance, readings from wearable medical devices may be
corrupted by patient movement, data from sensor networks
may represent co-located individuals, and IP addresses made
‘private’ by a router’s network address translation (NAT),
a workaround for the depletion of IPv4 addresses, are
obfuscated, which makes it difficult to identify individual
NATed machines. A particularly interesting variation of the
last example is the double-NAT, in which two routers are
placed sequentially in a network’s architecture (Karimzadeh
et al., 2017). Double-NATing may be intentionally designed
into a network architecture, but it is often added later,
leading to a condition called shadow IT, which is difficult
for network administrators to deal with because it obscures
their visibility of networked devices. Shadow IT is defined
as any solution (software, hardware, optimization, etc.) on
a network that has not been approved by network
administrators (Silic and Back, 2014). The challenge of
discovering and classifying shadow IT devices is critical
for enterprise network security.
Hallman, R., Miguel, J., Lu, A., Monje, A., Alam, M. and Cybenko, G.
Generative Deep Learning for Solutions to Data Deconflation Problems in Information and Operational Technology Networks.
DOI: 10.5220/0011996700003482
In Proceedings of the 8th International Conference on Internet of Things, Big Data and Security (IoTBDS 2023), pages 231-235
ISBN: 978-989-758-643-9; ISSN: 2184-4976
Copyright © 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
To meet this challenge, we are utilizing the Machine
Learning for Data Deconflation (ML4DD) approach that was
introduced in (Hallman and Cybenko, 2021). ML4DD utilizes
recent advances in deep learning to create novel solutions
to the data deconflation problem, an updated take on
classical source separation problems. Whereas classical
source separation problems can be solved using
well-established linear algebra techniques, many real-world
cases of conflated data include important components that
are not vector representable. Our approach attempts to
address this challenge by utilizing Generative Adversarial
Networks (GANs) (Goodfellow et al., 2014; Tomczak, 2022)
that can observe streams of conflated data and create
candidate processes that are likely to be responsible for
creating the observed data stream. We have conducted initial
experimentation and present our results as a
proof-of-concept. We then describe ongoing experimentation
building on our initial results to monitor traffic and
identify individual devices contributing to double-NATed
network traffic.
The remainder of this paper is organized as fol-
lows: Background information and related work are
presented in Section 2. The ML4DD concept and ar-
chitecture, as well as initial proof-of-concept results
are discussed in Section 3. We discuss the deconflation
of IT/OT network traffic in Section 4. Finally,
concluding remarks are presented in Section 5, along
with directions for follow-on work.
2 BACKGROUND AND RELATED
WORK
Data can be conflated in multiple dimensions (e.g.,
time, space, semantics, etc.) and there are many com-
mon manifestations. Consider the scenario of an em-
ployee using a company-owned computer on an en-
terprise network. This is an example of semantic
data conflation, common to pattern-of-life analysis,
where the employee will very likely be simultane-
ously running business applications (e.g., working in
a spreadsheet, reading or writing business documen-
tation) as well as a web browser with tabs open to
recreational services (e.g., personal email, music or
video streaming services, news aggregation sites, so-
cial media sites).
Many classical deconflation problems are solved
using well-established linear algebra techniques–i.e.,
blind source separation (BSS) (Kofidis, 2016). In
BSS, we seek to solve for a mixture
$$u(n) = F(a(n), v(n), n),$$
in which $N$ source signals
$$a(n) = [a_1(n), a_2(n), \ldots, a_N(n)]^T$$
and $K$ noise signals
$$v(n) = [v_1(n), v_2(n), \ldots, v_K(n)]^T$$
are mixed by a mixing system $F(\cdot, \cdot, \cdot)$, which yields
$$u(n) = [u_1(n), u_2(n), \ldots, u_{N \times K}(n)]^T.$$
BSS has been successfully applied to signal processing
applications across multiple modalities (e.g., the
cocktail party phenomenon, multimedia steganalysis,
etc.).
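To illustrate why additive, vector-representable mixtures yield to linear algebra, consider a worked special case. The assumptions here are ours and are not part of the general formulation above: the mixing system is linear and instantaneous, there is no noise, and there are as many observations as sources. The matrix $A$ and its numerical values are hypothetical.

```latex
% Assumed special case: linear, instantaneous, noise-free mixing
% with a hypothetical invertible 2x2 mixing matrix A.
\begin{align*}
u(n) &= A\,a(n), \qquad
A = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \qquad
\det A = 0.75, \\
W &= A^{-1} = \frac{1}{0.75}
\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}
= \begin{bmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{bmatrix}, \\
a(n) &= W\,u(n).
\end{align*}
```

In the blind setting $A$ is unknown, so the unmixing matrix $W$ has to be estimated from $u(n)$ alone, typically by exploiting statistical properties of the sources such as independence. No comparable inverse exists when the constituents are categorical or combine non-additively, which is the case that motivates the data deconflation problem.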
Process Query Systems (PQS) (Cybenko and
Berk, 2007), the current state-of-the-art deconflation
solution, are well-suited to applications in networked
environments. PQS work for discovering processes
with discrete states, observable events, and dynam-
ics. Multiple hypotheses are built about the processes
behind observed events by taking inputs from arbi-
trary network nodes, ideally matching hypotheses to
known processes. There are many PQS implemen-
tations used for covert channel detection (Giani et al.,
2005), as well as other computer and network security
applications (Berk et al., 2003; Berk and Fox, 2005).
Despite these successful implementations, PQS re-
quire significant background information (e.g., a pri-
ori models, process heuristics, etc.) to be effective.
3 MACHINE LEARNING FOR
DATA DECONFLATION: OUR
APPROACH
The ML4DD approach to data deconflation was introduced
in (Hallman and Cybenko, 2021), incorporating recent
advances in deep learning to move towards a more
generalized solution to the data deconflation problem. Our
approach takes the same fundamental assumptions that
underlie PQS, namely that observed events in an environment
are representative of underlying and interleaved data and
processes. We take sequences of observed events as inputs
and use a transformer-enhanced generative adversarial
network (GAN) architecture to generate candidate
subsequences, which represent possible underlying processes
and data objects.
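For reference, the original GAN formulation (Goodfellow et al., 2014) trains a generator $G$ against a discriminator $D$ through the minimax objective below. The description above does not state which objective or loss ML4DD uses, so this is only the standard starting point and not a statement of our training procedure. Here $p_{\text{data}}$ is the distribution of real samples and $p_z$ is the prior over the generator's noise input $z$.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Read against the deconflation setting, $x$ would plausibly correspond to sequences of observed events and $G(z)$ to generated candidate sequences, although that mapping is an interpretation rather than something specified here.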
3.1 Early Results and Ongoing
Experimentation
Figure 1: An observed sequence of two simple, interleaved
processes for initial experimentation.
Figure 2: The ML4DD data pipeline for generating candi-
date process sequences.
We set up a simple illustrative example to demonstrate
the ML4DD proof-of-concept. We begin with two re-
peating simple processes with observable states (Fig-
ure 1 top):
Process 1 proceeds through its states as follows:
RED BLUE GREEN;
Process 2 proceeds through its states as follows:
GREEN BLUE BLUE.
The processes are mixed according to some unknown
probability to produce a sequence of observed events
(Figure 1 top). The ML4DD GAN ingests this se-
quence of observed events and generates Process 1
and Process 2 after a sufficient number of rounds (Fig-
ure 2).
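The toy setup above can be reproduced with a short script. This is a minimal sketch under stated assumptions, not the ML4DD GAN: the function names `interleave` and `explains`, the `mix_prob` parameter, and the seed are hypothetical choices made for illustration, and the mixing probability (unknown in the experiment) is simply fixed here. The sketch generates an observed sequence from the two processes and then checks whether a pair of candidate processes is consistent with it.

```python
import random

# The two repeating processes from the illustrative example.
PROCESS_1 = ["RED", "BLUE", "GREEN"]
PROCESS_2 = ["GREEN", "BLUE", "BLUE"]


def interleave(p1, p2, length, mix_prob=0.5, seed=0):
    """Emit `length` observed events from two cyclic processes.

    Each event is drawn from p1 with probability `mix_prob` (an assumed,
    hypothetical parameter), otherwise from p2. Each process advances
    through its own states independently of the other.
    Returns the observed events and the hidden source labels (1 or 2).
    """
    rng = random.Random(seed)
    i = j = 0
    observed, labels = [], []
    for _ in range(length):
        if rng.random() < mix_prob:
            observed.append(p1[i % len(p1)])
            labels.append(1)
            i += 1
        else:
            observed.append(p2[j % len(p2)])
            labels.append(2)
            j += 1
    return observed, labels


def explains(observed, p1, p2):
    """Return True if `observed` is some interleaving of p1 and p2.

    Both cyclic processes are assumed to start in their first state.
    The set of feasible (position in p1, position in p2) pairs is
    tracked, so every possible assignment of events is considered.
    """
    feasible = {(0, 0)}
    for event in observed:
        next_feasible = set()
        for i, j in feasible:
            if p1[i] == event:
                next_feasible.add(((i + 1) % len(p1), j))
            if p2[j] == event:
                next_feasible.add((i, (j + 1) % len(p2)))
        feasible = next_feasible
        if not feasible:
            return False
    return True


observed, labels = interleave(PROCESS_1, PROCESS_2, length=12)
print(observed)
print(explains(observed, PROCESS_1, PROCESS_2))
```

A consistency check of this kind only verifies candidates that are already known. The generative approach is aimed at the harder step of proposing the candidate processes in the first place.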
4 DECONFLATION OF
NETWORK TRAFFIC
Following the promising results of our initial experi-
mentation, we are utilizing ML4DD to give improved
situational awareness for network monitoring tasks.
In particular, we are interested in improved detection
and identification of shadow IT assets. We are con-
ducting experimentation on a dataset (Farhat et al.,
2020) of double-NATed network traffic, with the end
goal of being able to identify devices that are ob-
scured behind a second router. This dataset replicates
a shadow IT scenario in an enterprise network and
records the network traffic from different devices: a
PC, an iOS device, and two different Android devices.
There are 294 test sessions taken over one week, with
each session consisting of seven tests recording net-
work traffic over one-minute testing intervals.
We model network traffic as non-deterministic finite
state automata (Figure 3). A TCP connection can only
occupy a finite number of states, so automata are a natural
way to represent network traffic in a form that the ML4DD
architecture can process. Our automata model is capable
of simultaneously ingesting and modeling multiple
TCP streams, a prerequisite for deployments in
real world networks. Importantly, in the shadow IT
scenario, multiple TCP streams will emanate from a
single IP address (i.e., the illicit router that was in-
stalled on the network).
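The idea of tracking several TCP streams behind one IP address with a non-deterministic automaton can be sketched as follows. This is an illustration under loud assumptions: the transition table is a deliberately simplified, hypothetical one loosely based on the TCP connection lifecycle, not the automaton of Figure 3, and the state names, event labels, stream keys, and function names are all invented for this example.

```python
from collections import defaultdict

# Hypothetical, simplified transition relation: (state, event) -> set of
# possible next states. A set with more than one element is what makes
# the automaton non-deterministic.
TRANSITIONS = {
    ("CLOSED", "SYN"): {"SYN_SENT"},
    ("SYN_SENT", "SYN-ACK"): {"SYN_RECEIVED"},
    ("SYN_RECEIVED", "ACK"): {"ESTABLISHED"},
    ("ESTABLISHED", "ACK"): {"ESTABLISHED"},
    ("ESTABLISHED", "FIN"): {"FIN_WAIT"},
    # An ACK seen after a FIN may or may not complete the teardown.
    ("FIN_WAIT", "ACK"): {"FIN_WAIT", "CLOSED"},
    ("FIN_WAIT", "FIN"): {"FIN_WAIT"},
}


def step(states, event):
    """Advance a set of possible states by one observed event."""
    if event == "RST":
        # Assumed simplification: a reset closes the stream from any state.
        return {"CLOSED"}
    next_states = set()
    for state in states:
        next_states |= TRANSITIONS.get((state, event), set())
    return next_states


def track_streams(packets):
    """Track the possible states of every stream in a packet sequence.

    `packets` is an iterable of (stream_key, event) pairs. Each stream
    starts in CLOSED. An empty result set for a stream means that no
    path through the automaton explains its events.
    """
    current = defaultdict(lambda: {"CLOSED"})
    for key, event in packets:
        current[key] = step(current[key], event)
    return dict(current)


# Two streams that share one source IP, as they would behind a second
# router, distinguished here only by a hypothetical source port.
STREAM_A = ("10.0.0.5", 40001)
STREAM_B = ("10.0.0.5", 40002)
packets = [
    (STREAM_A, "SYN"), (STREAM_B, "SYN"), (STREAM_A, "SYN-ACK"),
    (STREAM_A, "ACK"), (STREAM_B, "SYN-ACK"), (STREAM_A, "FIN"),
    (STREAM_A, "ACK"), (STREAM_B, "ACK"),
]
final_states = track_streams(packets)
```

Keeping a set of possible states per stream, rather than a single state, lets ambiguous observations remain open until later events resolve them.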
We are in the process of conducting experimentation
with data from the aforementioned dataset of
double-NATed network traffic. Our GAN is trained
initially with automata models of TCP streams of in-
dividual devices, before receiving models of double-
NATed network traffic. Once this experimentation is
completed, we will extend our methodology to network
steganalysis applications (i.e., detecting covert
channels) and design an implementation of
our data deconflation architecture that can be used
to monitor traffic in operational technology (OT) networks
(e.g., SCADA). Critical infrastructures are reliant on these
networks; however, they are notoriously fragile, and
therefore not well-suited to the security solutions that
are available for IT networks. We anticipate that
ML4DD will prove to be a capable tool for analyzing
OT network traffic, while minimally impacting oper-
ational performance, and proactively detecting poten-
tially malicious traffic.
5 CONCLUSION AND FUTURE
WORK
We are leveraging recent advances in deep learning,
particularly GANs, to solve deconflation/source separation
problems that are inherent to networked systems (though we
hope that our approach will eventually prove to be a general
solution to all classes of deconflation problems). These
deconflation problems are becoming increasingly important as
“smart” Internet-connected technologies and distributed
sensor networks are incorporated into real-world systems.
This paper describes our initial results at deconflating
simple processes, as well as our ongoing work with network
monitoring use cases. In particular, we are pursuing the
challenges of detecting and classifying shadow IT and covert
channels on enterprise networks, as well as challenges that
are unique to the analysis and defense of OT networks.
Figure 3: Automata model for TCP traffic.
ACKNOWLEDGEMENTS
Roger A. Hallman’s contribution to this work oc-
curred while employed by the Naval Information War-
fare Center Pacific, during which time he was par-
tially supported by the United States Department
of Defense SMART Scholarship for Service Pro-
gram, funded by USD/R&E (The Under Secretary
of Defense-Research and Engineering), National De-
fense Education Program (NDEP) / BA-1, Basic Re-
search.
REFERENCES
Berk, V., Chung, W., Crespi, V., Cybenko, G., Gray, R.,
Hernando, D., Jiang, G., Li, H., and Sheng, Y. (2003).
Process query systems for surveillance and awareness.
In Proc. System. Cyber. Infor. (SCI2003). Citeseer.
Berk, V. and Fox, N. (2005). Process query systems for
network security monitoring. In Sensors, and Com-
mand, Control, Communications, and Intelligence
(C3I) Technologies for Homeland Security and Home-
land Defense IV, volume 5778, pages 520–530. SPIE.
Cybenko, G. and Berk, V. H. (2007). Process query sys-
tems. Computer, 40(1):62–70.
Farhat, S., Elhajj, I. H., and Kayssi, A. (2020). NAT network
traffic dataset. https://dx.doi.org/10.21227/zxdq-hg05.
Giani, A., Berk, V., Cybenko, G., and Hanover, N. (2005).
Covert channel detection using process query systems.
proceedings of: FLoCon.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. Advances
in neural information processing systems, 27.
Hallman, R. and Cybenko, G. (2021). The data deconflation
problem: Moving from classical to emerging solutions.
In Proceedings of the 6th International Conference
on Internet of Things, Big Data and Security -
AI4EIoTs, pages 375–380. INSTICC, SciTePress.
Karimzadeh, M., Valtulina, L., Pras, A., Liebsch, M.,
Taleb, T., van den Berg, H., and Schmidt, R. d. O.
(2017). Double-NAT based mobility management for
future LTE networks. In 2017 IEEE Wireless Communications
and Networking Conference (WCNC), pages
1–6. IEEE.
Kofidis, E. (2016). Blind source separation: Fundamentals
and recent advances (a tutorial overview presented at
sbrt-2001). arXiv preprint arXiv:1603.03089.
Silic, M. and Back, A. (2014). Shadow IT – a view from
behind the curtain. Computers & Security, 45:274–283.
Tomczak, J. M. (2022). Deep Generative Modeling.
Springer Cham.