Hybrid Statistical Modeling for Anomaly Detection in Multi-Key Stores
Based on Access Patterns
Tiberiu Boros (1) (https://orcid.org/0000-0003-1416-915X) and Marius Barbulescu (2)

(1) Security Coordination Center, Adobe Systems, Bucharest, Romania
(2) Cloudops, CES Security Solutions, Adobe Systems, Bucharest, Romania
Keywords:
Machine Learning, Statistical Modeling, Multi-Key Store, Access Pattern, Anomaly Detection.
Abstract:
Anomaly detection in datasets with massive amounts of sparse data is not a trivial task, given that working with high-intake data in real time requires careful design of the algorithms and data structures. We present a hybrid statistical modeling strategy which combines an effective data structure with a neural network for Gaussian Process Modeling. The network is trained in a residual learning fashion, which enables learning with fewer parameters and in fewer steps.
1 INTRODUCTION
Secrets management platforms (SMPs) are the indus-
try standard for managing and protecting sensitive in-
formation such as passwords, private keys, and other
secrets. Leveraging the features provided by these
platforms helps eliminate secret sprawl and reduce the exposure of credentials. Since these platforms
hold access keys to private services and systems, they
are also a potential target for malicious users. Given
the delicate nature of the content, it is exceedingly
important that the data is kept secure and away from
unauthorized access.
Relying on SMPs for storing and retrieving secrets
has the following advantages: (a) one does not have to
implement any custom secrets management strategy
in order to avoid locally storing secrets on production
machines; (b) SMPs come with everything needed to implement the least-privilege access model; and (c) SMPs keep records of what has been accessed, when, and how – which means that in the case of an incident you have a valuable source of logs that can be used in the investigation.
Our research explores whether SMP logs are suitable for detecting a compromise. In other words, we are focused on access pattern modeling for anomaly detection in secrets consumption. To be precise, we detect unauthorized access to credentials, while detection of credential theft is out of scope for our research. While generally applicable, our constraints
state that the object (secret) is retrieved based on a
combination of categorical values (keys). In a typi-
cal scenario, the lookup multi-key is a combination
of, say, customer, namespace, and path. Fine-grained
modeling is obtained by adding additional attributes
such as: (a) hosting-based information such as the
autonomous system number (ASN), region (Europe,
US, etc.), country, city, IP address, etc. and (b) time-
lag features such as time of day (ToD), day of week
(DoW), week of month (WoM), etc.
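For illustration, a single enriched lookup event could be represented as follows (a minimal sketch; the field names and values are hypothetical and not taken from any particular SMP):

    # Hypothetical multi-key lookup event, enriched with hosting-based and
    # time-lag attributes (all names and values are illustrative).
    event = {
        # lookup multi-key
        "customer": "customer-42",
        "namespace": "payments",
        "path": "prod/db/primary-credentials",
        # hosting-based attributes
        "asn": 16509,
        "region": "EU",
        "country": "RO",
        "city": "Bucharest",
        "source_ip": "10.20.30.40",
        # time-lag attributes
        "tod": 14,   # time of day, 0-23
        "dow": 2,    # day of week, 0 = Monday
        "wom": 1,    # week of month, 0-4
    }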
We start with a brief overview of the related work
(Section 2), we describe the dataset we are using in
our experiments (Section 3.1), and we introduce our
methodology (Section 3.2). Finally, we provide an
evaluation (Section 4) and deliver our conclusions in
Section 5. The key contributions of this work concern the anomaly detection modeling process itself and the hybrid algorithmic/machine-learning approach we took to mitigate the issues of processing a high volume of input data.
2 RELATED WORK
Our work closely relates to Security Information and
Event Management (SIEM) log analysis for intrusion
detection. Prior work is divided between static rules
and machine learning-based methods. Both types of
methods are used as a primary filtering mechanism
for interesting events. The resulting subset of events
is later analyzed and sorted by human experts.
Anumol (2015) introduces a statistical ML model
for intrusion detection based on network logs. Feng
et al. (2017) present a user-centric ML model designed to reduce the number of false positive alerts generated by static rules. Du et al. (2017) employ a deep learning approach for anomaly detection in system logs. Finally, many other works (Bryant and Saiedian, 2020; Noor et al., 2019; Zhou et al., 2020; Das et al., 2020; Gibert Llauradó et al., 2020; Piplai et al., 2020; Idhammad et al., 2018; Zekri et al., 2017; Osanaiye et al., 2016) employ some type of machine learning algorithm for solving security-related problems by analyzing some type of log data, whether application or network logs. Our
work is focused on a subset of security-related mod-
eling problems that share the following items:
High Volume of Data. We focus on operating with a high volume of input data, which is specific to application/system logs for large intranet infrastructures and network logs for medium-sized intranet configurations.
Low-Resource Computational Environment. The computational resources used by our approach are minimal, especially when compared with state-of-the-art deep learning models, which use billions of parameters. This is because we reduce the computational effort required to model an entity's behavior by employing a hybrid ML/data-structure approach.
Mixed Numerical and Categorical Attributes.
Our input data is a mixture of categorical and nu-
merical values. We mention that the target attributes (the attributes we want to model) should be predominantly numerical; otherwise, the data structure we use in our hybrid approach is not efficient.
Skewed Numerical Ranges. Finally, our approach addresses a corner case that poses serious issues to neural networks with a small number of parameters, but also affects the training of large models – data with non-uniform variance. To be precise, if one were to cluster the input data based on the variance of the output targets, one would obtain high- and low-variance clusters. When modeling the output for both types of data clusters at the same time, the larger variance has a higher impact on the overall loss. Thus, the network is unlikely to balance automatically unless some mathematical trick is applied, such as a weighted loss (sketched below).
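As a rough illustration of such a weighted-loss trick (our own Python sketch, not the mitigation adopted later in this paper), each data cluster can be weighted inversely to the variance of its targets so that low-variance clusters are not drowned out:

    import numpy as np

    def variance_weighted_mse(y_true, y_pred, cluster_ids):
        """Illustrative weighted MSE: each data cluster contributes to the loss
        in inverse proportion to its target variance, so low-variance clusters
        are not dominated by high-variance ones."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        cluster_ids = np.asarray(cluster_ids)
        weights = np.empty_like(y_true)
        for c in np.unique(cluster_ids):
            mask = cluster_ids == c
            var = y_true[mask].var()
            weights[mask] = 1.0 / max(var, 1e-8)  # guard against zero variance
        return float(np.mean(weights * (y_true - y_pred) ** 2))

The approach we actually adopt (Section 3.2) sidesteps the need for such per-cluster weights by normalizing each datapoint with statistics specific to its own cluster of data.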
3 PROPOSED METHODOLOGY
As previously mentioned, our journey starts with the
dataset description (Section 3.1), to give a better understanding of how the input looks and how it can be interpreted. Then, we introduce our proposed hybrid
method, and we discuss several issues and their miti-
gation using a residual learning scheme (Section 3.2).
3.1 Dataset Description
Our reference dataset is a collection of access logs for
an SMP. The objects (secrets) in our dataset are accessed via a namespace (categorical value) and a path.
To identify the client, the logs contain the Customer
ID, a hash of the token used to authenticate and the
source IP address. Additionally, the log contains the
status of the operation (successful/failed with rea-
son), and the type of the request, which takes one of
the following discrete values: read (R), create (C), up-
date (U), delete (D) and list (L).
Data preprocessing is crucial to any machine
learning (ML) approach, and in our case, we first an-
alyzed activity from adversary emulations that mim-
icked real-world scenarios, in order to have a better
understanding of current attack patterns that mali-
cious users follow. This revealed that:
(a) It is impossible to say whether a single event is malicious or not without context.
(b) Adversary emulations showed activity that is as-
sociated with the discovery phase of any attack
(listing of secrets on specific paths).
(c) Automations, such as secret cycling, resemble
discovery and lateral movement phases of an at-
tack, but there are certain cues and patterns that
highlight their benign nature.
(d) Operations performed by an attacker are likely list
and read. Update operations are avoided, since
they could potentially cause a critical system out-
age (CSO), and thus the attack would not go un-
detected.
(e) Malicious activity might come as a spike of oper-
ations in certain cases.
Based on this, we converted our dataset into hourly aggregations over the read, list, create, update, and delete operations. We also compute spikes for all five operation types as the maximum number of events of the same type in a single minute.
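A minimal pandas sketch of this aggregation step (assuming a raw log frame with a timestamp column 'ts' and an operation column 'op'; both names are our own):

    import pandas as pd

    def hourly_features(df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate raw access-log events into hourly statistics per operation type.

        Expects a datetime column 'ts' and an operation column 'op' with values
        such as 'R', 'L', 'C', 'U', 'D'. Returns, per hour and operation type,
        the mean and the maximum (spike) number of events per minute.
        Minutes with no events of a type are not counted, which is a simplification.
        """
        per_minute = (
            df.groupby([pd.Grouper(key="ts", freq="1min"), "op"])
              .size()
              .rename("count")
              .reset_index()
        )
        return (
            per_minute.groupby([pd.Grouper(key="ts", freq="1h"), "op"])["count"]
                      .agg(mu="mean", spike="max")
                      .reset_index()
        )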
Figure 1 represents a histogram of the number of read operations over a timespan of 1 hour, at a resolution of 1 minute. The total number of operations is 590, with an average of 590/60 ≈ 9.83 per minute. As can be easily observed, there is a spike of 50 operations at the
32-minute mark. Similarly, this is computed for the
other 4 types of operations. Thus, a single entry in
our dataset contains the following attributes:
Categorical values:
  Customer  - A unique identifier for the current customer.
  Cluster   - Region-based segregation for secrets.
  Path      - A full path to the object (secret) that is being accessed.
  Namespace - A namespace used to segregate the secret landscape; can be any value chosen by the customer.
  Source IP - The IP address (internal or external) from which the request originated.
Numerical values:
  µ_R, µ_L, µ_C, µ_U, µ_D           - The average number of operations performed in the timespan for (R) read, (L) list, (C) create, (U) update and (D) delete.
  max_R, max_L, max_C, max_U, max_D - The maximum number of operations (i.e., spikes) performed in the timespan (1h) for read, list, create, update and delete.
Time-lag features (also categorical values):
  ToD   - Time of Day: a numeric identifier between 0 and 23, specifying the exact timespan within a day for which the statistics are computed.
  DoW   - Day of Week: a numeric identifier between 0 and 6, with 0 representing Monday and 6 representing Sunday.
  WoM   - Week of Month: a numeric identifier between 0 and 4, like ToD and DoW (not used in the current implementation).
  Month - Like the above: a numeric identifier between 0 and 11 (not used in the current implementation).
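Put together, a single dataset entry could be sketched as the following record (a minimal illustration; the field names are ours):

    from dataclasses import dataclass

    @dataclass
    class HourlyEntry:
        # categorical attributes
        customer: str
        cluster: str
        path: str
        namespace: str
        source_ip: str
        # per-minute averages within the hour, per operation type
        mu_r: float
        mu_l: float
        mu_c: float
        mu_u: float
        mu_d: float
        # per-minute maxima (spikes) within the hour, per operation type
        max_r: int
        max_l: int
        max_c: int
        max_u: int
        max_d: int
        # time-lag attributes (treated as categorical)
        tod: int    # 0-23
        dow: int    # 0-6, 0 = Monday
        wom: int    # 0-4 (not used in the current implementation)
        month: int  # 0-11 (not used in the current implementation)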
From a quantitative perspective, the timespan of the dataset is 6 weeks, and it contains a total of 80,398,543 records (i.e., hourly aggregations). There are 7 unique clusters, 1,248 customers, 49 namespaces, and 71,832 IP addresses. To get a better understanding of how heterogeneous customer behavior is, consider that the number of read operations has a standard deviation of 231.40, while the minimum and maximum values are 0 and 69,239 respectively. Similarly, the average number of list operations is 0.47, with a standard deviation of 322.31 and values ranging from 0 to 227,907.

Figure 1: Example histogram of the number of read operations per minute over the 1-hour timespan.

These numbers
suggest that threshold-based alerting is not a good
candidate for incident detection, as our initial assessment confirmed. Furthermore, our analysis showed that normal operations are consistent with
the generation of spikes in list and read operations
(see Figure 2). This is because customer automations
are scheduled periodically, and when triggered, they
quickly generate list, read, update, delete and create
events. Also, the set of operations that will be triggered is not straightforward to determine. In Fig-
ure 2, images (a) and (e) represent two different au-
tomations, one that only performs read operations on
the secrets, and another that only lists secrets from the
repository. Image (f) is a mixture of both operations,
where we can see some linearity between list and read
operations, while (b), (c) and (d) have a blend of be-
haviors (automations that only perform reads and au-
tomations that perform list and read operations at the
same time). Observing these automations over longer
periods of time, we noticed sporadic behaviors that
seem to fall out of line with normal operations. This
happens in two cases: (a) when multiple automations that are scheduled with different frequencies end up being triggered at the same time (in one cycle) and then drift apart again, and (b) whenever scheduled secret rotation takes place, the distribution between the
different types of operations gets skewed in favor of
update, delete, and create.
3.2 Residual Learning Model
Anomaly detection can be achieved by directly scor-
ing datapoints, or by modeling normal behavior and checking how divergent the observed behavior is from what is modeled. Our approach implements the second option and, to target attacker behavior, we model read and list operations based on the categorical attributes and on the remaining numerical attributes (update, create, and delete). The reason for using these numerical attributes as input instead of output is that attackers avoid causing CSOs (updating or deleting secrets would likely generate such an outage).
Figure 2: Spikes for read and list operations computed for different automations/customers. The vertical axis represents list operations and the horizontal axis represents read operations. Images (a) and (e) perform a single type of operation – (a) read operations and (e) list operations. Image (f) is a mixture of both read and list operations and, judging by the distribution, they are linearly dependent. Images (b), (c) and (d) show both types of behaviors: independent read or list operations combined with dependent ones.
Figure 3: Overview of our proposed architecture: the input is transformed using a tree-like data-structure for statistics and
the network is trained to output the normalized values. In the end, the output is de-normalized, so that the values are in the
original range of their cluster.
Instead, these operations are indicators of secret rotation running, which likely influences secret listing and reading behavior. As previously mentioned, there is a gap
between customers or automations that generate large
spikes in the data and those that have a consistent and
predictable behavior. Modeling them at the same time
proved difficult, because low-variance customers are
underrepresented in the loss function (at least in our
case), and data normalization and centering do not yield good results when one data cluster has a variance of, say, 100 and another a variance of just 0.1. To mitigate these issues, we employ a hybrid ap-
proach (Figure 3), which is inspired by residual learn-
ing, but instead of training the network to model the
residual signal, we use a non-uniform data normal-
ization technique, where the mean and standard de-
viation are predicted using a tree. Each level of the
tree handles a different attribute, and in our case, the
order in which attributes are processed is static, based
on our analysis of the data and their semantics: clus-
ter, customer, namespace, IP address. However, this
tree can be constructed automatically by using stan-
dard ML metrics, such as the entropy of the data. Every leaf and every node inside the tree contains statistics that model every datapoint that passed through the node or ended up in that leaf. This enables us to perform some type of backoff for previously unseen attributes. For instance, if an IP address is previously unseen, we back off to the previous level in the tree and model this entry based on the (cluster, customer, namespace) prefix. This ensures data and normalization consistency as well as possible, given previously observed datapoints.
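A minimal Python sketch of such a statistics tree (our own illustration; the production version maintains its core data structures in C via Cython, as discussed in Section 4, and may track its statistics differently):

    import math

    class StatsNode:
        """Running statistics (Welford's algorithm) for all datapoints that
        passed through this node, plus children keyed by the next attribute."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            self.children = {}

        def update(self, x: float) -> None:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def std(self) -> float:
            return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

    class StatsTree:
        """Tree keyed by (cluster, customer, namespace, source IP); each level
        refines the statistics, and lookups back off to the deepest seen prefix."""
        def __init__(self):
            self.root = StatsNode()

        def add(self, key: tuple, value: float) -> None:
            node = self.root
            node.update(value)
            for part in key:  # e.g. ("eu-1", "customer-42", "payments", "10.20.30.40")
                node = node.children.setdefault(part, StatsNode())
                node.update(value)

        def lookup(self, key: tuple) -> StatsNode:
            node = self.root
            for part in key:
                if part not in node.children:  # unseen attribute: back off to the parent level
                    break
                node = node.children[part]
            return node

        def normalize(self, key: tuple, value: float) -> float:
            node = self.lookup(key)
            return (value - node.mean) / max(node.std, 1e-8)

        def denormalize(self, key: tuple, value: float) -> float:
            node = self.lookup(key)
            return value * max(node.std, 1e-8) + node.mean

During training, each target value is normalized with the statistics of the node matched by its key before being presented to the network, and the network's predictions are de-normalized back into the original range of their cluster, as in Figure 3.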
4 EVALUATION
Direct evaluation of this type of modeling is hard in
the absence of a proper dataset. Also, measuring the
F-score for synthetically generated data is irrelevant,
since in normal operations, where you do not have incidents daily, this score will be 0 regardless of how well the system scored on the synthetic data. Instead, we
measure (a) the modeling capacity of the network
with and without the residual scheme, (b) the reduc-
tion in training time and (c) the reduction in False
Positives (FPs). We mention that the output of the
network is a normal distribution (µ, ρ), which brings
major modeling advantages (i.e. we can directly use
the z-rule to score anomalies) but requires us to use
the Gaussian loss function during training (Equation 1).
L = 1/2 · log(max(ρ, ε)) + (µ − y)² / max(ρ, ε)    (1)

where y is the target value, µ and ρ are the predicted mean and variance, and ε is a constant used for numeric stability (1e-8 in our experiments).
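A direct transcription of Equation 1, together with the z-rule scoring it enables (a NumPy sketch under our own naming; the actual training code may differ):

    import numpy as np

    EPS = 1e-8  # numeric-stability constant from Equation 1

    def gaussian_loss(mu, rho, y):
        """Equation 1: 0.5 * log(max(rho, eps)) + (mu - y)^2 / max(rho, eps)."""
        rho = np.maximum(rho, EPS)
        return np.mean(0.5 * np.log(rho) + (mu - y) ** 2 / rho)

    def z_score(mu, rho, y):
        """Anomaly score: how many standard deviations the observed value lies
        from the predicted mean (rho is a variance in the paper's notation)."""
        return np.abs(y - mu) / np.sqrt(np.maximum(rho, EPS))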
The results shown in Table 1 are reported using the
following procedure: (a) each network is trained with
and without residual learning on 80% of the dataset;
(b) we compute the loss on previously unseen 10%
of the dataset and we use the result as an early stop-
ping criterion with a patience of 5 (epochs without
improvement); (c) we report the loss computed on
another 10% of the dataset, which is also unseen during training. Note that the negative values of the loss function are expected, because we are working with the Gaussian loss function, whose logarithmic term can become negative. Also, the network is a three-layer perceptron with tanh activation and Gaussian output (µ, ρ). The input is composed of the concatenated embeddings for the categorical fields (128-sized partitions) and the log-normalized values for the
numeric attributes. As can be seen, with one excep-
tion, the number of training epochs is reduced when
relying on the tree-based statistics for normalization.
Also, there is a consistent reduction in the loss value when using this hybrid scheme. Finally, we tested the 1000-1000-1000 network in a production environment and obtained a 98% reduction in False Positives, which translates into 20-30 events per day on about 1.9M datapoints.
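For reference, a rough PyTorch sketch of the network described above (hidden sizes, field handling, and the softplus used to keep the predicted variance positive are our assumptions; the paper only specifies a three-layer perceptron with tanh activations, 128-sized categorical embeddings, log-normalized numeric inputs, and a Gaussian (µ, ρ) output):

    import torch
    import torch.nn as nn

    class GaussianMLP(nn.Module):
        def __init__(self, cardinalities, num_numeric, n_targets=2, hidden=1000, emb_dim=128):
            super().__init__()
            # one 128-sized embedding per categorical field (cluster, customer, namespace, ...)
            self.embeddings = nn.ModuleList(nn.Embedding(c, emb_dim) for c in cardinalities)
            input_dim = emb_dim * len(cardinalities) + num_numeric
            self.body = nn.Sequential(
                nn.Linear(input_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
            )
            self.mu_head = nn.Linear(hidden, n_targets)   # one mean per modeled target
            self.rho_head = nn.Linear(hidden, n_targets)  # one variance per modeled target

        def forward(self, categorical, numeric):
            # categorical: LongTensor [batch, n_fields]; numeric: log-normalized floats [batch, num_numeric]
            embs = [emb(categorical[:, i]) for i, emb in enumerate(self.embeddings)]
            h = self.body(torch.cat(embs + [numeric], dim=-1))
            mu = self.mu_head(h)
            rho = nn.functional.softplus(self.rho_head(h)) + 1e-8  # keep the variance strictly positive
            return mu, rho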
Observation 1. With this information in hand, we can determine that using SMP logs to identify unauthorized access to credentials is feasible, but the results must be correlated with other event sources in order to properly classify events as expected or not.
Table 1: Loss on the test set obtained by several network configurations with residual (R) and non-residual (NR) learning.

Configuration    Type   Loss   Best epoch
100-100-100      NR      1.23   5
100-100-100      R      -3.40   8
300-300-300      NR     -1.59   6
300-300-300      R      -4.76   3
700-700-700      NR     -6.21   8
700-700-700      R      -7.45   3
1000-1000-1000   NR     -6.84   8
1000-1000-1000   R      -7.67   5
1500-1500-1500   NR     -6.26   7
1500-1500-1500   R      -7.58   4

Observation 2. Regarding the scalability of the proposed method, we observed that the memory consumed during the training and analysis phases grows in direct proportion to the number of datapoints in the dataset, which makes the statistics tree expand to many nodes. To overcome this limitation, we decided to implement, in C, some core data structures that the model uses to compute the statistics tree, leveraging Cython to develop a Python extension. With this solution in place, we were able to lower the memory consumed by up to 65% when using a statistics tree with 40M nodes.
5 CONCLUSIONS AND FUTURE
WORK
Our work introduced a hybrid modeling method for
numerical data, based on a mixture of categorical
and numeric inputs. As shown in the evaluation sec-
tion, we achieved a 98% reduction in FPs with a
significantly smaller number of parameters required
for modeling and faster training times. The non-uniform normalization scheme permitted by the tree enables us to efficiently train models on datasets
with skewed numerical outputs and provides a nat-
ural back-off method for the statistical estimation.
Our future research plans will focus on (a) how the consumption order of the attributes can be efficiently computed, (b) investigating more use cases, and (c) further refining our results and reducing the number of False Positives.
With every SMP, frequent requests to fetch credentials are expected, especially in enterprise-grade, rapidly changing environments. This translates into dozens of access logs being generated with every request, which could lead to the introduction of new access patterns. To accommodate the continuous shift in patterns, the model would have to be retrained often to avoid data drift. One aspect of our future work is improving the model to automatically detect that it has drifted and decide when it is time to retrain.
Adding to our future research plans: the current work focused on attack patterns observed while inspecting adversary emulations of real-world scenarios. We will investigate how to improve the model's robustness against new and evolving attack patterns and consider more viewpoints when deciding whether an event is anomalous or not.
Finally, although we have shown that our hybrid modeling method is successful in reducing the number of False Positives and identifying anomalous events in SMP access patterns, we only had access to a single SMP to work with and test our approach against. To validate that the proposed solution is effective when confronted with access logs from other SMPs as well, we will conduct a new experiment using data gathered from multiple sources and compare the results.
REFERENCES
Anumol, E. (2015). Use of machine learning algorithms
with siem for attack prediction. In Intelligent Com-
puting, Communication and Devices: Proceedings of
ICCD 2014, Volume 1, pages 231–235. Springer.
Bryant, B. D. and Saiedian, H. (2020). Improving siem
alert metadata aggregation with a novel kill-chain
based classification model. Computers & Security,
94:101817.
Das, S., Ashrafuzzaman, M., Sheldon, F. T., and Shiva, S.
(2020). Network intrusion detection using natural lan-
guage processing and ensemble machine learning. In
2020 IEEE Symposium Series on Computational In-
telligence (SSCI), pages 829–835. IEEE.
Du, M., Li, F., Zheng, G., and Srikumar, V. (2017).
Deeplog: Anomaly detection and diagnosis from sys-
tem logs through deep learning. In Proceedings of the
2017 ACM SIGSAC conference on computer and com-
munications security, pages 1285–1298.
Feng, C., Wu, S., and Liu, N. (2017). A user-centric ma-
chine learning framework for cyber security opera-
tions center. In 2017 IEEE International Conference
on Intelligence and Security Informatics (ISI), pages
173–175. IEEE.
Gibert Llauradó, D., Mateu Piñol, C., and Planes Cid, J. (2020). The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. Journal of Network and Computer Applications, 153:102526.
Idhammad, M., Afdel, K., and Belouch, M. (2018). Semi-
supervised machine learning approach for ddos detec-
tion. Applied Intelligence, 48:3193–3208.
Noor, U., Anwar, Z., Amjad, T., and Choo, K.-K. R. (2019).
A machine learning-based fintech cyber threat attribu-
tion framework using high-level indicators of compro-
mise. Future Generation Computer Systems, 96:227–
242.
Osanaiye, O., Cai, H., Choo, K.-K. R., Dehghantanha,
A., Xu, Z., and Dlodlo, M. (2016). Ensemble-based
multi-filter feature selection method for ddos detec-
tion in cloud computing. EURASIP Journal on Wire-
less Communications and Networking, 2016(1):1–10.
Piplai, A., Mittal, S., Abdelsalam, M., Gupta, M., Joshi,
A., and Finin, T. (2020). Knowledge enrichment by
fusing representations for malware threat intelligence
and behavior. In 2020 IEEE International Conference
on Intelligence and Security Informatics (ISI), pages
1–6. IEEE.
Zekri, M., El Kafhali, S., Aboutabit, N., and Saadi, Y.
(2017). Ddos attack detection using machine learn-
ing techniques in cloud computing environments. In
2017 3rd international conference of cloud computing
technologies and applications (CloudTech), pages 1–
7. IEEE.
Zhou, H., Hu, Y., Yang, X., Pan, H., Guo, W., and Zou,
C. C. (2020). A worm detection system based on deep
learning. IEEE Access, 8:205444–205454.