with as it obscures their visibility of networked de-
vices. Shadow IT is defined as any solution (software,
hardware, optimization, etc.) on a network that has
not been approved by network administrators (Silic
and Back, 2014). The challenge of discovering and
classifying shadow IT devices is critical for enterprise
network security.
To meet this challenge, we use the Machine Learning
for Data Deconflation (ML4DD) approach introduced in
(Hallman and Cybenko, 2021). ML4DD applies recent
advances in deep learning to the data deconflation
problem, an updated take on classical source separation
problems. Classical source separation problems can be
solved using well-established linear algebra techniques;
however, many real-world cases of conflated data include
important components that are not vector representable.
Our approach attempts to address this challenge by using
Generative Adversarial Networks (GANs) (Goodfellow et
al., 2014; Tomczak, 2022) that observe streams of
conflated data and create candidate processes that are
likely to be responsible for the observed data stream. We
have conducted initial experimentation and present
our results as a proof-of-concept. We then describe
ongoing experimentation that builds on our initial
results to monitor traffic and identify the individual
devices contributing to a double-NATed network.
The remainder of this paper is organized as fol-
lows: Background information and related work are
presented in Section 2. The ML4DD concept and
architecture, as well as initial proof-of-concept results,
are discussed in Section 3. We discuss the deconflation
of IT/OT network traffic in Section 4. Finally,
concluding remarks are presented in Section 5, along
with directions for follow-on work.
2 BACKGROUND AND RELATED WORK
Data can be conflated in multiple dimensions (e.g.,
time, space, semantics, etc.) and there are many com-
mon manifestations. Consider the scenario of an em-
ployee using a company-owned computer on an en-
terprise network. This is an example of semantic
data conflation, common to pattern-of-life analysis,
where the employee will very likely be simultane-
ously running business applications (e.g., working in
a spreadsheet, reading or writing business documen-
tation) as well as a web browser with tabs open to
recreational services (e.g., personal email, music or
video streaming services, news aggregation sites, so-
cial media sites).
Many classical deconflation problems are solved
using well-established linear algebra techniques, i.e.,
blind source separation (BSS) (Kofidis, 2016). In
BSS, we seek to unmix a mixture

$$u(n) = F(a(n), v(n), n),$$

in which $N$ source signals

$$a(n) = [a_1(n), a_2(n), \ldots, a_N(n)]^T$$

and $K$ noise signals

$$v(n) = [v_1(n), v_2(n), \ldots, v_K(n)]^T$$

are combined by a mixing system $F(\cdot, \cdot, \cdot)$, which yields

$$u(n) = [u_1(n), u_2(n), \ldots, u_{N \times K}(n)]^T.$$

BSS has been successfully applied to signal processing
applications across multiple modalities (e.g., the
cocktail party problem and multimedia steganalysis).
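
To make the linear-algebra case concrete, the following is a minimal sketch of BSS for the simplest special case, in which the mixing system is assumed to be linear and instantaneous with additive noise, u(n) = A a(n) + v(n). The mixing matrix `A`, the two toy sources, the noise level, and the function names are illustrative assumptions and do not come from the cited work. The unmixing step is a FastICA-style fixed-point iteration, which is one common BSS technique among several.

```python
import numpy as np

def whiten(x):
    """Center the mixtures and rescale them to identity covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x))
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T @ x

def sym_decorrelate(w):
    """Orthonormalize the rows of w via (w w^T)^(-1/2) w."""
    eigvals, eigvecs = np.linalg.eigh(w @ w.T)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T @ w

def fast_ica(u, n_iter=200, seed=0):
    """Estimate independent sources from mixtures u (rows = channels)."""
    z = whiten(u)
    n_channels, n_samples = z.shape
    rng = np.random.default_rng(seed)
    w = sym_decorrelate(rng.standard_normal((n_channels, n_channels)))
    for _ in range(n_iter):
        g = np.tanh(w @ z)                        # contrast nonlinearity
        g_prime_mean = (1.0 - g ** 2).mean(axis=1)
        # Fixed-point update for all rows, then re-orthonormalize.
        w = sym_decorrelate(g @ z.T / n_samples - g_prime_mean[:, None] * w)
    return w @ z

# Toy example (assumed): two sources, a sine and a square wave.
n = np.linspace(0.0, 8.0, 2000)
a = np.vstack([np.sin(2 * n), np.sign(np.sin(3 * n))])
A = np.array([[1.0, 0.5], [0.5, 2.0]])            # assumed mixing matrix
v = 0.02 * np.random.default_rng(1).standard_normal(a.shape)
u = A @ a + v                                      # observed mixtures

a_hat = fast_ica(u)
# Sources are recovered only up to order, sign, and scale, so compare
# absolute correlations between true and estimated signals.
corr = np.abs(np.corrcoef(np.vstack([a, a_hat]))[:2, 2:])
best_match = corr.max(axis=1)
print(best_match)
```

If separation succeeds, each entry of `best_match` should be close to 1. The sketch relies on the sources being vector representable and linearly mixed, which is exactly the assumption that fails for the conflated data considered in this paper.
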
Process Query Systems (PQS) (Cybenko and
Berk, 2007), the current state-of-the-art deconflation
solution, are well-suited to applications in networked
environments. PQS are designed to discover processes
that have discrete states, observable events, and dynamics.
Multiple hypotheses are built about the processes
behind observed events by taking inputs from arbi-
trary network nodes, ideally matching hypotheses to
known processes. There are many PQS implemen-
tations used for covert channel detection (Giani et al.,
2005), as well as other computer and network security
applications (Berk et al., 2003; Berk and Fox, 2005).
Despite these successful implementations, PQS re-
quire significant background information (e.g., a pri-
ori models, process heuristics, etc.) to be effective.
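
The hypothesis-building idea can be illustrated with a toy sketch. It is not an implementation of any PQS described in the cited work: the two process models, the event alphabet, and the function name `track` are hypothetical, and real PQS score hypotheses against richer (often probabilistic) models. Here each a priori model is simply a fixed cycle of events, and every hypothesis is one way of assigning the observed events to processes.

```python
from typing import Dict, List, Tuple

# Hypothetical a priori models: each process emits its events in a fixed cycle.
MODELS: Dict[str, List[str]] = {
    "A": ["a", "b", "c"],
    "B": ["b", "d"],
}

def track(stream: List[str], models: Dict[str, List[str]] = MODELS) -> List[List[str]]:
    """Return every assignment of events to processes consistent with the models."""
    names = list(models)
    # A hypothesis is (events consumed so far per process, labels assigned so far).
    hypotheses: List[Tuple[Tuple[int, ...], List[str]]] = [((0,) * len(names), [])]
    for event in stream:
        extended = []
        for positions, labels in hypotheses:
            for i, name in enumerate(names):
                cycle = models[name]
                # Extend the hypothesis only if this process expects the event next.
                if cycle[positions[i] % len(cycle)] == event:
                    new_positions = positions[:i] + (positions[i] + 1,) + positions[i + 1:]
                    extended.append((new_positions, labels + [name]))
        hypotheses = extended  # hypotheses that cannot explain the event are dropped
    return [labels for _, labels in hypotheses]

# Interleaved observations from both processes; the two "b" events are ambiguous.
observed = ["a", "b", "b", "d", "c"]
for labels in track(observed):
    print(list(zip(observed, labels)))
```

For this stream two hypotheses survive, differing only in which process emitted which "b". The sketch also shows the limitation noted above: without the a priori `MODELS`, no hypothesis can be formed at all.
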
3 MACHINE LEARNING FOR DATA DECONFLATION: OUR APPROACH
The ML4DD approach to data deconflation was
introduced in (Hallman and Cybenko, 2021), incorporating
recent advances in deep learning to move
towards a more generalized solution to the data
deconflation problem. Our approach makes the same
fundamental assumption that underlies PQS, namely
that observed events in an environment are
representative of underlying and interleaved data and
processes. We take sequences of observed events as
inputs and use a transformer-enhanced generative
adversarial network (GAN) architecture to generate
candidate subsequences, which represent possible
underlying processes and data objects.
IoTBDS 2023 - 8th International Conference on Internet of Things, Big Data and Security