Generative Deep Learning for Solutions to Data Deconflation Problems in Information and Operational Technology Networks

Roger A. Hallman (1,3), John M. San Miguel, Arron Lu, Alejandro Monje, Mohammad R. Alam and George Cybenko

(1) C.A.T Labs, San Diego, California, U.S.A.

(2) Naval Information Warfare Center Pacific, San Diego, California, U.S.A.

(3) Thayer School of Engineering, Dartmouth College, Hanover, New Hampshire, U.S.A.

(4) The M.I.T.R.E. Corporation, San Diego, California, U.S.A.

George.Cybenko@dartmouth.edu

Keywords: Data Deconflation, Source Separation, Generative Adversarial Networks (GANs), Transformers, Double-NATed Network Traffic, Network Situational Awareness.

Abstract: Source separation problems are a long-standing and well-studied challenge in signal processing and information sciences. The “Cocktail Party Phenomenon” and other classical source separation problems are vector representable and additive, and thus solvable by well-established linear algebra techniques. However, the proliferation and adoption of Internet-connected devices (e.g., IoT, distributed sensor networks, etc.) have led to a “Cambrian explosion” of data that is available for processing. Much of this data is not readily available for processing because it includes data objects that are categorical or non-additive superpositions (i.e., data not confined to signals). The Data Deconflation Problem refers to the challenge of identifying and separating the individual constituent elements of these complex data objects. Real-world data deconflation scenarios include pattern-of-life tracking (e.g., identifying recreational activities in conjunction with a business trip), multi-target tracking (e.g., occlusions and track assignment challenges), and network situational awareness (e.g., monitoring NATed network traffic, detecting and identifying shadow IT, network steganalysis).

This paper details our approach, utilizing Generative Adversarial Networks (GANs) and attention-based Transformers, to solving the data deconflation problem, as well as our experimental application to network situational awareness tasks. We cover traditional source separation solutions and expound upon why these solutions are inadequate for network monitoring tasks. Background information on GANs and transformers is presented before a description of our architecture and initial experimentation which serves as a proof-of-concept. We then describe experimentation applying our methodology to network monitoring tasks, in particular separating activities and shadow IT devices within double-NATed network traffic. We discuss our results and our methodology’s applicability to other network monitoring tasks, such as network steganalysis and covert channel detection.

Hallman, R., Miguel, J., Lu, A., Monje, A., Alam, M. and Cybenko, G.

Generative Deep Learning for Solutions to Data Deconflation Problems in Information and Operational Technology Networks.

DOI: 10.5220/0011996700003482

In Proceedings of the 8th International Conference on Internet of Things, Big Data and Security (IoTBDS 2023), pages 231-235

ISBN: 978-989-758-643-9; ISSN: 2184-4976

Copyright © 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

1 INTRODUCTION AND MOTIVATION

The ever-increasing adoption of distributed sensor networks, the Internet of Things (IoT), networked infrastructure, and mobile and wearable devices has brought about a “Cambrian Explosion” of data that is available for processing. It is unlikely that these torrents of data will be ready, or even useful, for processing and computation upon arrival at data centers. For instance, readings from wearable medical devices may be corrupted by patient movement, data from sensor networks may represent co-located individuals, and IP addresses made ‘private’ by a router’s network address translation (NAT), a solution to the depletion of IPv4 addresses, are obfuscated, which makes it difficult to identify individual NATed machines. A particularly interesting variation of the last example is the double-NAT, in which two routers are placed sequentially in a network’s architecture (Karimzadeh et al., 2017). Double-NATing may be intentionally designed into a network architecture, but it is often added later, leading to a condition called shadow IT, which is difficult for network administrators to deal with because it obscures their visibility of networked devices. Shadow IT is defined as any solution (software, hardware, optimization, etc.) on a network that has not been approved by network administrators (Silic and Back, 2014). The challenge of discovering and classifying shadow IT devices is critical for enterprise network security.

To meet this challenge, we are utilizing the Machine Learning for Data Deconflation (ML4DD) approach that was introduced in (Hallman and Cybenko, 2021). ML4DD utilizes recent advances in deep learning to create novel solutions to the data deconflation problem, an updated take on classical source separation problems. Classical source separation problems can be solved using well-established linear algebra techniques; however, there are many real-world cases of conflated data that are mixed with important components that are not vector representable. Our approach attempts to address this challenge by utilizing Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Tomczak, 2022) that can observe streams of conflated data and create candidate processes that are likely to be responsible for creating the observed data stream. We have conducted initial experimentation and present our results as a proof-of-concept. We then describe ongoing experimentation that builds on our initial results to monitor traffic and identify the individual devices contributing to a double-NATed network.

The remainder of this paper is organized as follows: Background information and related work are presented in Section 2. The ML4DD concept and architecture, as well as initial proof-of-concept results, are discussed in Section 3. We discuss the deconflation of IT/OT network traffic in Section 4. Finally, concluding remarks are presented in Section 5, along with directions for follow-on work.

2 BACKGROUND AND RELATED WORK

Data can be conflated in multiple dimensions (e.g., time, space, semantics, etc.) and there are many common manifestations. Consider the scenario of an employee using a company-owned computer on an enterprise network. This is an example of semantic data conflation, common to pattern-of-life analysis, where the employee will very likely be simultaneously running business applications (e.g., working in a spreadsheet, reading or writing business documentation) as well as a web browser with tabs open to recreational services (e.g., personal email, music or video streaming services, news aggregation sites, social media sites).

Many classical deconflation problems are solved using well-established linear algebra techniques, i.e., blind source separation (BSS) (Kofidis, 2016). In BSS, we observe a mixture

$$u(n) = F(a(n), v(n), n)$$

in which a mixing system $F(\cdot, \cdot, \cdot)$ mixes $N$ source signals

$$a(n) = [a_1(n), a_2(n), \ldots, a_N(n)]$$

and $K$ noise signals

$$v(n) = [v_1(n), v_2(n), \ldots, v_K(n)]$$

which yields

$$u(n) = [u_1(n), u_2(n), \ldots, u_{N \times K}(n)]$$

The goal is to recover the source signals $a(n)$ from the observed mixture $u(n)$ alone. BSS has been successfully applied to signal processing applications across multiple modalities (e.g., the cocktail party phenomenon, multimedia steganalysis, etc.).
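To make the "well-established linear algebra techniques" concrete, the block below writes out the linear, instantaneous special case of the mixing system. The matrices $A$, $B$, $W$, $P$ and $\Lambda$ are symbols introduced here for illustration and do not appear in the original formulation; this is a sketch of the textbook setting, not a statement about any particular BSS algorithm.

```latex
% Assumed linear, time-invariant mixing: F(a(n), v(n), n) = A a(n) + B v(n)
%   A : mixing matrix, one column per source signal a_i(n)
%   B : matrix that maps the K noise signals into the mixture
\[
  u(n) = A\,a(n) + B\,v(n)
\]
% Separation estimates a matrix W from the observations u(n) alone:
\[
  \hat{a}(n) = W\,u(n), \qquad W A \approx P\,\Lambda
\]
% P is a permutation matrix and \Lambda a diagonal scaling matrix:
% the sources can be recovered only up to ordering and scale.
```

This formulation depends on the observations being vectors that add, which is the property that categorical or non-additive data objects lack.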

Process Query Systems (PQS) (Cybenko and Berk, 2007), the current state-of-the-art deconflation solution, are well-suited to applications in networked environments. PQS discover processes that have discrete states, observable events, and dynamics. They take inputs from arbitrary network nodes and build multiple hypotheses about the processes behind the observed events, ideally matching hypotheses to known processes. There are many PQS implementations used for covert channel detection (Giani et al., 2005), as well as other computer and network security applications (Berk et al., 2003; Berk and Fox, 2005). Despite these successful implementations, PQS require significant background information (e.g., a priori models, process heuristics, etc.) to be effective.
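The hypothesis-building idea can be illustrated with a toy multiple-hypothesis tracker. This is a minimal sketch under stated assumptions, not the PQS implementation from the cited work: the process models in `MODELS`, the event alphabet, and the function name `track` are all hypothetical. Each hypothesis assigns every observed event to one of the known processes, and a hypothesis is dropped as soon as no model can explain the next event.

```python
from typing import Dict, List, Tuple

# Hypothetical a priori models: each process cycles through its states in order.
MODELS: Dict[str, List[str]] = {
    "P1": ["a", "b", "c"],
    "P2": ["c", "b", "b"],
}

# A hypothesis is (next expected index for each process, event-to-process labels so far).
Hypothesis = Tuple[Tuple[int, ...], Tuple[str, ...]]


def track(observed: List[str], models: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Return every assignment of events to processes that the models can explain."""
    names = sorted(models)
    hypotheses: List[Hypothesis] = [(tuple(0 for _ in names), ())]
    for event in observed:
        extended: List[Hypothesis] = []
        for positions, labels in hypotheses:
            for i, name in enumerate(names):
                states = models[name]
                # Branch only if this process expects the observed event next.
                if states[positions[i]] == event:
                    advanced = (positions[i] + 1) % len(states)
                    new_positions = positions[:i] + (advanced,) + positions[i + 1:]
                    extended.append((new_positions, labels + (name,)))
        hypotheses = extended  # hypotheses that cannot explain the event are dropped
    return [labels for _, labels in hypotheses]


if __name__ == "__main__":
    for labels in track(["a", "c", "b", "b", "c", "b"], MODELS):
        print(labels)
```

Without the entries in `MODELS` the tracker has nothing to match against, which is the dependence on a priori models noted above. Several hypotheses can also survive for the same observed stream, so model knowledge alone does not always resolve the ambiguity.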

3 MACHINE LEARNING FOR DATA DECONFLATION: OUR APPROACH

The ML4DD approach to data deconflation was introduced in (Hallman and Cybenko, 2021), incorporating recent advances in deep learning to move towards a more generalized solution to the data deconflation problem. Our approach takes the same fundamental assumptions that underlie PQS, namely that observed events in an environment are representative of underlying and interleaved data and processes. We take sequences of observed events as inputs and use a transformer-enhanced generative adversarial network (GAN) architecture to generate candidate subsequences which represent possible underlying processes and data objects.
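For reference, the standard GAN training objective from (Goodfellow et al., 2014) is reproduced below. How ML4DD adapts this objective to sequences of observed events, or combines it with the transformer component, is not specified here, so the block should be read as background on GANs in general rather than as the ML4DD loss.

```latex
% Standard two-player minimax objective (Goodfellow et al., 2014):
%   G      : generator, maps noise z ~ p_z to candidate samples
%   D      : discriminator, outputs the probability that a sample is real
%   p_data : distribution of the observed (real) data
\[
  \min_{G} \max_{D} V(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
    + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]
```

In the deconflation setting, the generator's samples would play the role of the candidate subsequences described above.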

IoTBDS 2023 - 8th International Conference on Internet of Things, Big Data and Security


3.1 Early Results and Ongoing Experimentation

Figure 1: An observed sequence of two simple, interleaved processes for initial experimentation.

Figure 2: The ML4DD data pipeline for generating candidate process sequences.

We set up a simple illustrative example to demonstrate the ML4DD proof-of-concept. We begin with two repeating simple processes with observable states (Figure 1 top):

• Process 1 proceeds through its states as follows: RED → BLUE → GREEN;

• Process 2 proceeds through its states as follows: GREEN → BLUE → BLUE.

The processes are mixed according to some unknown probability to produce a sequence of observed events (Figure 1 top). The ML4DD GAN ingests this sequence of observed events and generates Process 1 and Process 2 after a sufficient number of rounds (Figure 2).
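The mixing step described above can be sketched as follows. This is an illustrative data generator only, not the ML4DD pipeline: the function names `interleave` and `project`, the mixing probability `p_first`, and the fixed random seed are assumptions introduced here (the text states only that the mixing probability is unknown). The generator keeps the ground-truth source label of each event so that a deconflation result could be checked against it.

```python
import random
from typing import List, Sequence, Tuple

PROCESS_1 = ["RED", "BLUE", "GREEN"]   # Process 1 from the text
PROCESS_2 = ["GREEN", "BLUE", "BLUE"]  # Process 2 from the text


def interleave(p1: Sequence[str], p2: Sequence[str], length: int,
               p_first: float = 0.5, seed: int = 0) -> List[Tuple[str, int]]:
    """Emit `length` events, drawing each from process 1 with probability p_first.

    Each process advances cyclically through its own states, independently of
    the other. Returns (state, source) pairs, where source is 1 or 2.
    """
    rng = random.Random(seed)
    processes = [p1, p2]
    cursors = [0, 0]  # position of each process within its own cycle
    labelled: List[Tuple[str, int]] = []
    for _ in range(length):
        which = 0 if rng.random() < p_first else 1
        proc = processes[which]
        labelled.append((proc[cursors[which] % len(proc)], which + 1))
        cursors[which] += 1
    return labelled


def project(labelled: Sequence[Tuple[str, int]], source: int) -> List[str]:
    """Recover the subsequence that one source contributed (uses ground truth)."""
    return [state for state, src in labelled if src == source]


if __name__ == "__main__":
    labelled = interleave(PROCESS_1, PROCESS_2, length=12, p_first=0.4, seed=1)
    observed = [state for state, _ in labelled]  # what a deconflation model would see
    print(observed)
    print(project(labelled, 1))
    print(project(labelled, 2))
```

Only `observed` would be shown to a deconflation model. Because both processes emit BLUE and GREEN, a single observed event does not identify its source, and the ordering of events has to be exploited.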

4 DECONFLATION OF NETWORK TRAFFIC

Following the promising results of our initial experimentation, we are utilizing ML4DD to give improved situational awareness for network monitoring tasks. In particular, we are interested in improved detection and identification of shadow IT assets. We are conducting experimentation on a dataset (Farhat et al., 2020) of double-NATed network traffic, with the end goal of being able to identify devices that are obscured behind a second router. This dataset replicates a shadow IT scenario in an enterprise network and records the network traffic from different devices: a PC, an iOS device, and two different Android devices. There are 294 test sessions taken over one week, with each session consisting of seven tests recording network traffic over one-minute testing intervals.

We model network traffic as nondeterministic finite state automata (Figure 3). A TCP connection can only occupy a finite number of states, so this is an ideal way to represent network traffic in a form that the ML4DD architecture can process. Our automata model is capable of simultaneously ingesting and modeling multiple TCP streams, a prerequisite for deployments in real-world networks. Importantly, in the shadow IT scenario, multiple TCP streams will emanate from a single IP address (i.e., the illicit router that was installed on the network).
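A minimal sketch of this kind of automaton is given below. It is not the model shown in Figure 3 and not the full TCP state machine: the state names, the flag-based event alphabet, the transition table `DELTA`, and the addresses (taken from the documentation ranges) are simplifying assumptions made for illustration. Each stream is keyed by its 4-tuple and tracked as a set of possible states, which is where the nondeterminism appears.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# A stream is identified by (src_ip, src_port, dst_ip, dst_port).
FlowKey = Tuple[str, int, str, int]

# Hypothetical, simplified transition relation: (state, event) -> possible next states.
# Events are TCP flags as seen by a passive monitor, in either direction.
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("CLOSED", "SYN"): {"SYN_SENT"},
    ("SYN_SENT", "SYN-ACK"): {"SYN_RCVD"},
    ("SYN_RCVD", "ACK"): {"ESTABLISHED"},
    ("ESTABLISHED", "ACK"): {"ESTABLISHED"},
    ("ESTABLISHED", "FIN"): {"FIN_WAIT"},
    ("FIN_WAIT", "FIN"): {"FIN_WAIT"},
    # Nondeterministic: this ACK may or may not complete the teardown.
    ("FIN_WAIT", "ACK"): {"FIN_WAIT", "CLOSED"},
}


def step(states: FrozenSet[str], event: str) -> FrozenSet[str]:
    """Advance a set of possible states by one event; RST resets from any state."""
    if event == "RST":
        return frozenset({"CLOSED"})
    nxt: Set[str] = set()
    for state in states:
        nxt |= DELTA.get((state, event), set())
    return frozenset(nxt)  # an empty set means the model cannot explain the event


def track_flows(packets: Iterable[Tuple[FlowKey, str]]) -> Dict[FlowKey, FrozenSet[str]]:
    """Run one automaton per stream over an interleaved packet sequence."""
    flows: Dict[FlowKey, FrozenSet[str]] = {}
    for key, event in packets:
        current = flows.get(key, frozenset({"CLOSED"}))
        flows[key] = step(current, event)
    return flows


# Two streams that share one source IP, as behind a second NAT router.
ROUTER = "192.0.2.10"
flow_a: FlowKey = (ROUTER, 40001, "198.51.100.7", 443)
flow_b: FlowKey = (ROUTER, 40002, "198.51.100.9", 80)
packets = [
    (flow_a, "SYN"), (flow_b, "SYN"), (flow_a, "SYN-ACK"), (flow_a, "ACK"),
    (flow_b, "SYN-ACK"), (flow_a, "FIN"), (flow_a, "ACK"), (flow_b, "ACK"),
]
states = track_flows(packets)
```

In this example `flow_a` ends in the state set {FIN_WAIT, CLOSED} and `flow_b` in {ESTABLISHED}. Keying by the full 4-tuple keeps streams apart even though they share a source address, but it does not reveal which device behind the router opened each stream.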

We are in the process of conducting experimentation with data from the aforementioned dataset of double-NATed network traffic. Our GAN is trained initially with automata models of TCP streams of individual devices, before receiving models of double-NATed network traffic. Once this experimentation is completed, we will extend our methodology to network steganalysis applications (i.e., detecting covert channels), and we will design an implementation of our data deconflation architecture that can be used to monitor traffic in operational technology (OT) networks (e.g., SCADA). Critical infrastructures are reliant on these networks; however, they are notoriously fragile, and therefore not well-suited to the security solutions that are available for IT networks. We anticipate that ML4DD will prove to be a capable tool for analyzing OT network traffic and proactively detecting potentially malicious traffic, while minimally impacting operational performance.

5 CONCLUSION AND FUTURE WORK

We are leveraging recent advances in deep learning, particularly GANs, to solve deconflation/source separation problems that are inherent to networked systems (though we hope that our approach will eventually prove to be a general solution to all classes of deconflation problems). These deconflation problems are becoming increasingly important as “smart” Internet-connected technologies and distributed sensor networks are incorporated into real-world systems. This paper describes our initial results in deconflating simple processes, as well as our ongoing work with network monitoring use cases. In particular, we are pursuing the challenges of detecting and classifying shadow IT and covert channels on enterprise networks, as well as challenges that are unique to the analysis and defense of OT networks.

Figure 3: Automata model for TCP traffic.

ACKNOWLEDGEMENTS

Roger A. Hallman’s contribution to this work occurred while employed by the Naval Information Warfare Center Pacific, during which time he was partially supported by the United States Department of Defense SMART Scholarship for Service Program, funded by USD/R&E (The Under Secretary of Defense-Research and Engineering), National Defense Education Program (NDEP) / BA-1, Basic Research.

REFERENCES

Berk, V., Chung, W., Crespi, V., Cybenko, G., Gray, R., Hernando, D., Jiang, G., Li, H., and Sheng, Y. (2003). Process query systems for surveillance and awareness. In Proc. System. Cyber. Infor. (SCI2003). Citeseer.

Berk, V. and Fox, N. (2005). Process query systems for network security monitoring. In Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Security and Homeland Defense IV, volume 5778, pages 520–530. SPIE.

Cybenko, G. and Berk, V. H. (2007). Process query systems. Computer, 40(1):62–70.

Farhat, S., Elhajj, I. H., and Kayssi, A. (2020). NAT network traffic dataset. https://dx.doi.org/10.21227/zxdq-hg05.

Giani, A., Berk, V., Cybenko, G., and Hanover, N. (2005). Covert channel detection using process query systems. Proceedings of: FLoCon.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.

Hallman, R. and Cybenko, G. (2021). The data deconflation problem: Moving from classical to emerging solutions. In Proceedings of the 6th International Conference on Internet of Things, Big Data and Security - AI4EIoTs, pages 375–380. INSTICC, SciTePress.

Karimzadeh, M., Valtulina, L., Pras, A., Liebsch, M., Taleb, T., van den Berg, H., and Schmidt, R. d. O. (2017). Double-NAT based mobility management for future LTE networks. In 2017 IEEE Wireless Communications and Networking Conference (WCNC), pages 1–6. IEEE.

Kofidis, E. (2016). Blind source separation: Fundamentals and recent advances (a tutorial overview presented at SBrT-2001). arXiv preprint arXiv:1603.03089.

Silic, M. and Back, A. (2014). Shadow IT – a view from behind the curtain. Computers & Security, 45:274–283.

Tomczak, J. M. (2022). Deep Generative Modeling. Springer Cham.
