3. Proposing the “HMM-SGD” approach to model the sequence data samples.
The remainder of this paper is structured as follows:
In Section 2, we review related work on insider threats. In Section 3, we explain how we implement and train our models to detect insiders. Section 4 provides a brief description of the CERT dataset. Sections 5 and 6 present the final results of the two models along with the evaluation analysis. Section 7 provides a case study similar to the one in (Rashid et al., 2016). Finally, we briefly wrap up our work with the limitations and conclusion sections.
2 LITERATURE SURVEY
HMMs have been used in intrusion detection modeling for years. The authors in (Jain and Abouzakhar, 2012) used HMMs to model TCP network data from the KDD Cup 1999 dataset and proposed an intrusion detection system. They used Baum-Welch training (BWT) to learn the model parameters. To evaluate the model, they applied the Forward and Backward algorithms to calculate the likelihood of each sample, and used Receiver Operating Characteristic (ROC) curves to measure overall model effectiveness. Furthermore, the authors in (Lee et al., 2008) proposed a multi-stage intrusion detection system using HMMs. They evaluated their system on the first section of the “DARPA 2000 intrusion detection” dataset.
This dataset provides five different stages, or scenarios. They applied an HMM to each of these scenarios independently to create their multi-stage intrusion detection system. The authors in (Rashid et al., 2016) claim to be the first to adapt the Hidden Markov Model to the domain of insider threat detection. In addition to applying the original HMM, they proposed the new concept of combining a moment of inertia with the HMM to improve accuracy. To train and test their work, they used the same CERT division dataset as in (Bose et al., 2017), but an updated version, r4.2. To evaluate their work, they used the ROC curve method. Their highest accuracy using the original HMM was 0.797, while the accuracy of their proposed approach was 0.829.
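As a concrete illustration of the likelihood-based evaluation these works describe, the sketch below computes the log-likelihood of an observation sequence under a discrete HMM with the scaled forward algorithm. This is a minimal, self-contained example, not code from any of the cited systems; the function name and toy parameters are ours.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    pi  -- (N,)   initial state distribution
    A   -- (N, N) transition probabilities, A[i, j] = P(state j at t+1 | state i at t)
    B   -- (N, M) emission probabilities,   B[i, k] = P(symbol k | state i)
    obs -- sequence of integer symbol indices
    """
    alpha = pi * B[:, obs[0]]          # forward variables at t = 0
    c = alpha.sum()                    # scaling factor (avoids underflow)
    alpha /= c
    log_like = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # forward recursion, one step
        c = alpha.sum()
        alpha /= c
        log_like += np.log(c)
    return log_like

# Toy check: a fair coin modelled as a single-state HMM; any
# three-symbol sequence then has probability 0.5**3 = 0.125.
pi = np.array([1.0])
A = np.array([[1.0]])
B = np.array([[0.5, 0.5]])
ll = forward_log_likelihood(pi, A, B, [0, 1, 0])   # = log(0.125)
```

Summing the scaled factors in log space, rather than multiplying raw probabilities, is what lets this evaluation run on long event sequences without numerical underflow.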
3 MACHINE LEARNING BASED
MODELS
In the presented work, we used sequence-based data samples. Section 4.3 shows how we reformed and generated our data samples, or event sequences. We modeled the data samples using the Hidden Markov Model in two different ways, i.e. the base HMM and the HMM-SGD approach.
3.1 Training of Hidden Markov Model
This section illustrates how we train the proposed approach. An HMM has three parameters that need to be prepared: the initial probability vector (π), the transition matrix (A), and the emission matrix (B). We use the Baum-Welch algorithm to train these parameters. Baum-Welch is the form that the expectation-maximization (EM) algorithm takes in the HMM context; details of the EM algorithm can be found in (Bilmes, 1998). The training process is set according to the structure of the adopted model. For example, Figure 1 illustrates a four-state HMM. We need to find the initial distribution over the four states and the transition distribution between them; the distribution of the observed symbols at each state must be determined as well. The list below shows how the model parameters are trained:
1. Initialize the model parameters π, A, and B with positive random numbers between 0 and 1, where:
• (π) : the initial distribution over the states, i.e. which state the model is most likely to start in.
• (A) : the initial distribution of the transitions between states.
• (B) : the initial distribution of the observed symbols at each state.
2. The Baum-Welch algorithm is applied to learn the HMM parameters; details of the Baum-Welch algorithm are presented in (Rabiner, 1989).
3. To make sure that there are no zeros among the trained HMM parameters, we add a small number to each parameter, followed by a scaling step that restores the probability condition, i.e. each distribution sums to one. In addition, we use the scaled version of the Hidden Markov Model, which overcomes numerical underflow problems during the training process. Information about the scaled version of HMM is provided in (Rabiner, 1989).
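Steps 1 and 3 above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the Baum-Welch update of step 2 is omitted (see (Rabiner, 1989)), and the sizes N = 4 states and M = 10 symbols are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    """Random positive matrix whose rows each sum to one (step 1)."""
    m = rng.uniform(size=(rows, cols))
    return m / m.sum(axis=1, keepdims=True)

N, M = 4, 10                       # hypothetical: 4 hidden states, 10 symbols
pi = random_stochastic(1, N)[0]    # (π) initial state distribution
A = random_stochastic(N, N)        # (A) state transition distribution
B = random_stochastic(N, M)        # (B) symbol emission distribution

# ... step 2, Baum-Welch re-estimation of pi, A, B, would run here ...

def smooth(m, eps=1e-6):
    """Step 3: floor every entry and re-normalise, so no parameter is
    exactly zero and each row remains a valid distribution."""
    m = m + eps
    return m / m.sum(axis=-1, keepdims=True)

pi, A, B = smooth(pi), smooth(A), smooth(B)
```

The flooring in `smooth` matters at test time: a single zero in B would assign zero likelihood to any sequence containing that symbol, regardless of the rest of the sequence.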
The training process aims to find the model pa-
rameters that maximize the likelihood of the se-
quences that represent the user’s normal behavior and
ICISSP 2019 - 5th International Conference on Information Systems Security and Privacy