LSTM Autoencoder-Based Insider Abnormal Behavior Detection Using De-Identified Data

Seo-Yi Kim (1, a) and Il-Gu Lee (1, 2, b)

(1) Department of Future Convergence Technology Engineering, Sungshin Women’s University, Seoul, South Korea
(2) Department of Convergence Security Engineering, Sungshin Women’s University, Seoul, South Korea

Keywords: De-Identification, LSTM Autoencoder, Abnormal Behavior Detection.

Abstract: Leakages of national core technologies and industrial secrets have occurred frequently in recent years. Because most perpetrators of confidential data leaks are IT managers, executives, and employees who have easy access to confidential information, more sophisticated theft is possible, and there is a risk of large-scale data leakage incidents. Insider behavior monitoring is conducted to prevent confidential data leaks, but personal information is collected indiscriminately during this process. This paper proposes a security solution that protects personal privacy by de-identifying data while maintaining detection performance in monitoring abnormal insider behavior. A long short-term memory (LSTM) autoencoder was used for abnormal behavior detection. To demonstrate the effectiveness of the proposed method, a de-identification evaluation and an abnormal behavior detection performance comparison were conducted. According to the experimental results, detection performance did not degrade when the data were de-identified. Furthermore, the average re-identification probability was approximately 1.2% and the attack success probability was approximately 0.2%, indicating that the proposed de-identification method yields a low possibility of re-identification and good data safety.

1 INTRODUCTION

Today, as science and technology increasingly become a competitive edge among nations, data leakages and theft incidents between countries or organizations occur more frequently. If national core technologies are leaked overseas, it can have a fatal impact on national security and the economy, lowering national competitiveness and further leading to cyber warfare between countries. Additionally, if the internal secrets of an organization are leaked, it can cause significant damage to the image of the organization and lead to loss of profits and competitive advantage, thereby hindering corporate sustainability (Goryunova et al., 2020). For this reason, countries and organizations are trying to protect confidential data and minimize damage caused by the leakage of industrial secrets.

However, cases of internal data leaks by insiders have recently increased, emerging as a global security problem. Because insiders already have access to the network and internal services, they can easily access confidential data, and thus, more sophisticated theft is possible (Abiodun et al., 2023). To solve the problem of confidential data leakage by insiders, research is actively being conducted to detect abnormal behavior among insiders.

a https://orcid.org/0009-0004-4890-1972
b https://orcid.org/0000-0002-5777-4029

According to the 2023 Insider Threat Report (Gurucul, 2023), in 2022, approximately 35% of the total respondents reported that they experienced insider attacks 1 to 5 times, whereas approximately 8% reported that they experienced insider attacks more than 20 times. Figure 1 shows the percentage of insiders ranked by cybersecurity experts as posing the greatest security risk to their organizations. Privileged IT users/admins were approximately 60%, privileged business users/executives were approximately 53%, and other IT staff were 24%, indicating a high proportion of IT managers, employees, and executives with extensive access rights. These are all groups of users that have easy access to confidential or sensitive information within an organization.

For example, in May 2022, a researcher who was working at Company A at the time transferred to Company B, a competitor, and leaked the confidential information of Company A to their new workplace. This researcher was a senior manager in an advertising-related product team in Company A who, after receiving a job offer from Company B, stole approximately 570,000 pages of source code and confidential product-related information using a personal external device in only a few minutes. A few weeks later, Company A, which became aware of the leak, filed a lawsuit against the researcher, who had held the confidential information of Company A on a personal external device until they were ordered to stop.

Kim, S. and Lee, I.
LSTM Autoencoder-Based Insider Abnormal Behavior Detection Using De-Identified Data.
DOI: 10.5220/0012458000003648
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 10th International Conference on Information Systems Security and Privacy (ICISSP 2024), pages 609-620
ISBN: 978-989-758-683-5; ISSN: 2184-4356

Figure 1: Types of insiders that pose the greatest security threat to an organization (Gurucul, 2023).

In July 2021, cybersecurity Company C was robbed of confidential sales support data by a former employee. The employee stole confidential data using a personal USB device before moving to competitor Company D. Company C built and used its own data loss prevention (DLP) solution but did not block internal staff from accessing, downloading, and sharing critical documents to external storage devices. A few months later, Company C discovered the data leak and sued the employee, but at that point, the leak may have already proven useful to the channel sales power of Company D, which recorded an increase in sales following the incident.

As the number of cases and scale of damage caused by confidential data leaks by insiders gradually increase, security solutions to prevent such incidents are increasingly being introduced within organizations. One of these methods is to monitor insider behavior using a user behavior analytics (UBA) tool. According to the 2023 Insider Threat Report, approximately 86% of organizations monitor insider behavior; however, their most utilized method is to monitor access logging only. The next most utilized method, employed by approximately 25% of organizations, is to monitor all actions of insiders 24/7. Figure 2 provides a graph of whether and to what extent insider behavior is monitored.

However, although monitoring all actions of insiders can be effective at detecting abnormal behaviors among insiders, it raises privacy infringement concerns. In the process of monitoring all actions, sensitive information that can seriously harm the human rights of the users may be included, creating another type of security problem. Therefore, organizations must protect the privacy of insiders while quickly and accurately detecting abnormal behaviors among them to minimize the risk of leakage of confidential data and damage to the company.

Figure 2: Percentages of organizations by whether and to what extent insider behavior is monitored (Gurucul, 2023).

In this paper, we aim to present a security solution that can protect the privacy of insiders through data de-identification in the process of detecting abnormal behavior among the insiders. After insider data are de-identified using the ARX Data Anonymization Tool (Koll et al., 2022), abnormal actors are detected using a long short-term memory (LSTM) autoencoder, an algorithm suitable for anomaly detection. Three attack models were used to evaluate the level of de-identification of the de-identified data, and the level of re-identification was evaluated by detecting insiders belonging to specific departments. The proposed method protects the privacy of insiders through de-identification while providing a similar abnormal behavior detection rate to that of the conventional method in the detection of abnormal actors.

The main contributions of this study are summarized as follows:

• After the insider data are de-identified to make it difficult to identify individuals, a high degree of de-identification, as evaluated by applying three attacker models, was obtained, as indicated by a probability of re-identification of approximately 1.2% and a probability of successful attack of approximately 0.2%. Thus, the ability of the proposed de-identification method to ensure the safety of the dataset was proven.

• Based on the results of detecting abnormal behavior among insiders using the LSTM autoencoder on the de-identified data, an accuracy of 94% and F1-score of 97% were achieved, showing similar abnormal behavior detection results to those provided by the corresponding identifiable data (original data).

This paper is structured as follows. Section 2 introduces background knowledge related to de-identification and analyzes existing research on abnormal behavior detection using machine learning. Section 3 explains the proposed technology, and then Section 4 describes the experimental environment in which to evaluate the de-identification level and abnormal behavior detection performance of the proposed technology. Section 5 provides a presentation and analysis of the experimental results. Finally, Section 6 presents the conclusions of the study.

2 RELATED WORK

2.1 De-Identification

As the transition to a digital society accelerates, the need for new regulations and innovations to protect personal privacy emerges. All areas of society are being digitized, and personal information and privacy data are being collected online and transmitted, used, and stored through networks. Various digitalized services provide convenience, but at the same time, they cause security problems such as personal information theft and privacy threats (Yun et al., 2023). Thus, concerns regarding the need to protect individual privacy continue to be raised, and laws in each country are being revised accordingly.

In Europe, the General Data Protection Regulation (GDPR), which took effect in 2018, stipulates that personal privacy data should be protected in all transactions occurring within EU countries (Li et al., 2023). GDPR specifies that de-identification measures such as pseudonymization and anonymization must be taken when personal information is used. Pseudonymized data can be used for research and statistical purposes because a specific individual can no longer be identified from the data without additional information. In the United States, the California Consumer Privacy Act (CCPA) took effect in 2020 and applies both to domestic companies with business establishments in California and to companies headquartered overseas (Naim et al., 2023). CCPA defines “de-identification” as the use of technical and administrative protection measures to prevent re-identification, and de-identified information is not treated as personal information. As such, the data protection laws of many countries identify de-identification as a necessary step in collecting and utilizing personal information. Through appropriate de-identification, personal privacy can be protected while effectively utilizing personal information.

De-identification is the process of modifying or replacing personal identifiers to hide some information from a public perspective (Chomutare, 2022). De-identification begins with a preliminary review to determine which data correspond to personal information. After the data are de-identified, an adequacy assessment determines whether an individual can be easily identified when the data are combined with other information, and a follow-up management process covers re-identifiability monitoring and safety measures. De-identification methods include pseudonymization, aggregation or average value substitution, data deletion, categorization, and data masking. Table 1 lists traditional de-identification methods. Through this approach, data containing sensitive information are processed and then used for research or statistical indicators. Information loss must be minimized by selecting an appropriate de-identification method according to the data type and purpose. In this study, de-identification was performed using data reduction and data masking methods. In the “adequacy evaluation,” which evaluates the possibility of re-identification after de-identification, privacy models such as k-anonymity, l-diversity, and t-closeness are typically used.

• k-anonymity. k-anonymity is a de-identification model that prevents the identification of specific individuals by ensuring that every record shares its quasi-identifier values with at least k - 1 other records, so that each combination of values appears at least k times in the dataset. If some of the released information can be combined with other publicly available information to identify an individual, a linkage attack may occur. The k-anonymity model is used to compensate for this vulnerability. In this way, attackers cannot determine exactly which record in the de-identified data belongs to the attack target (Ito and Kikuchi, 2022).

• l-diversity. l-diversity is a de-identification model used to complement the vulnerabilities of k-anonymity. Records that are de-identified together must have at least l different values of the sensitive information. For context, k-anonymity is vulnerable to homogeneity attacks because it does not consider the diversity of sensitive information during de-identification. Additionally, it is vulnerable to background knowledge attacks because it does not consider what the attacker may know beyond the provided data. Therefore, it must be ensured that each group of de-identified records has at least l different sensitive values, to enable some degree of defense even in situations where the attacker possesses background knowledge (Rai, 2022).
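The two models above can be illustrated with a minimal sketch. It assumes the simplest ("distinct") form of l-diversity, and the field names and toy records are hypothetical rather than taken from the paper's dataset.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical, already-masked toy records (not from the CERT dataset).
records = [
    {"dept": "1*", "team": "2*", "role": "engineer"},
    {"dept": "1*", "team": "2*", "role": "manager"},
    {"dept": "3*", "team": "4*", "role": "engineer"},
    {"dept": "3*", "team": "4*", "role": "engineer"},
]

print(is_k_anonymous(records, ["dept", "team"], k=2))
print(is_l_diverse(records, ["dept", "team"], "role", l=2))
```

In this toy data, both groups contain two records, so the table is 2-anonymous, but the second group holds a single role value, which is exactly the homogeneity weakness that l-diversity is meant to catch.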


Table 1: De-identification methods (method, followed by its explanation).

• Pseudonymization: Replacing key identifying elements with other values to make it difficult to identify an individual. Examples include heuristic pseudonymization, encryption, and exchange methods.

• Aggregation: Preventing individual data values from being exposed by replacing them with the total value of the data. Examples include total processing, partial totals, rounding, and rearrangement.

• Data reduction: Deleting values that are unnecessary or serve as personal identifiers among the values included in the dataset depending on the purpose of data collection and the level of sharing and openness. Examples include deleting or partially deleting identifiers, deleting records, and deleting all identifiers.

• Data suppression: Replacing data values with category values without directly exposing them. Examples include hiding, random rounding, range method, and control rounding.

• Data masking: Replacing all or part of a data value with placeholder characters such as “*” so that the original value is not directly exposed.

2.2 Abnormal Behavior Detection Using Machine Learning

Al-Mhiqani et al. (Al-Mhiqani et al., 2022) proposed a multi-layer framework for insider threat detection. In the first step, the levels of performance of nine machine learning models were evaluated using multi-criteria decision making (MCDM) to select a model optimized for insider threat detection. As a result of simulations, random forest and k-nearest neighbors (KNN) were selected. Based on these results, for the second step, hybrid insider threat detection (HITD), consisting of a misuse insider threat detection (MITD) component based on random forest and an anomaly insider threat detection (AITD) component based on KNN, was proposed. To evaluate its performance, the CERT r4.2 dataset was employed to test unknown and known insider attack scenarios. The evaluation indicators used were recall, accuracy, precision, area under the curve (AUC), F-score, and true negative rate (TNR). In terms of these measures, the proposed HITD method demonstrated the best performance. However, although the proposed method showed significant improvement in terms of detection performance, it did not consider the overhead and waiting delays that may occur when adopting a hybrid method. In addition, the original data were used as is after preprocessing. Consequently, there is a risk of infringing on individual privacy when the proposed method is applied to actual situations.

Cui et al. (Cui et al., 2021) observed that traditional federated learning, while effective for privacy protection and low latency, lacks stability because of non-uniform data distribution among distributed clients. Therefore, they proposed a blockchain-based distributed asynchronous federated learning model for anomaly detection in an Internet of things (IoT) environment. The model mitigates the problem of imbalanced data distribution because it is stored on the blockchain rather than a central server, ensuring that all clients share the same model regardless of data distribution or quantity during model updates. Simultaneously, it efficiently addresses privacy concerns by storing only update information on the blockchain, without exposing sensitive data directly to a central server. For performance evaluation, a generative adversarial network (GAN) algorithm was used; the model demonstrated superior performance in terms of accuracy and convergence speed compared to those of traditional federated learning models. However, the model has a number of limitations, such as significant variation in accuracy depending on the learning rate setting, as observed in the IoT device learning rate comparison graph, and additional computational effort and time overhead in the process of setting the optimal learning rate.

Jamshidi et al. (Jamshidi et al., 2024) proposed a privacy enhancement model using an autoencoder structure to efficiently de-identify personal sensitive information when collecting data from providers during the anomaly detection model learning process. The data are first compressed using an encoder; the confidential and non-confidential attributes are separated and then passed through a classifier to weaken the correlation. Among them, the confidential attributes are de-identified by adding appropriate noise based on differential privacy and combined with the non-confidential attributes through a decoder, thus reconstructing the data. To evaluate its performance, experiments were conducted using image datasets and categorical datasets, and on both datasets, the proposed model exhibited better accuracy, precision, recall, and F1-score than the conventional autoencoder algorithms CelebA-G-M, CelebA-G-S, and CelebA-G-C. Additionally, in a differential privacy-based noise optimization experiment, an appropriate value was obtained for the de-identification parameter, which resulted in high accuracy. However, the process of adding noise to de-identify data may also increase the model complexity and computational overhead compared to those of a conventional autoencoder model.

According to a survey of previous studies related to detecting abnormal behavior using machine learning, there are two challenging aspects that must be considered: on one hand, privacy issues often occurred when data were not de-identified, and on the other hand, overhead was generated when de-identification was performed. In this paper, we present an abnormal behavior detection solution that protects privacy by de-identifying data, while maintaining high detection performance by using an LSTM autoencoder.

3 LSTM AUTOENCODER-BASED INSIDER ABNORMAL BEHAVIOR DETECTION

Herein, an LSTM autoencoder-based abnormal insider behavior detection technology that uses de-identified data is proposed. This security solution can solve the problem of insider privacy infringement when abnormal-behavior monitoring is performed within an organization. When insider activity is monitored, the actions of all devices connected to the network are recorded. Before these raw data are used, they are subjected to a de-identification process to ensure that they do not contain sensitive information. Abnormal behaviors among insiders are then detected from the de-identified data using the LSTM autoencoder.

LSTM is an algorithm that addresses the limitation of recurrent neural networks (RNNs), which operate effectively only on short sequences and have difficulty modeling dependencies in long sequences. LSTM improves gradient propagation by adding a cell state to the state of the hidden layer. In addition, its four layers interact, and in this process, the short-term state and long-term state are learned separately and then merged to produce a prediction (Nguyen et al., 2021). LSTM can process relatively long time-series data without performance degradation and can use memory efficiently by discarding information that is less relevant to prediction (Ashraf et al., 2020).
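For reference, the interaction of the four layers can be written in the commonly used LSTM formulation below. The notation is an assumption introduced here for illustration (it is not taken from the paper): x_t is the input at step t, h_t is the short-term (hidden) state, c_t is the long-term (cell) state, \sigma is the logistic sigmoid, and \odot is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(long-term state update)}\\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(short-term state)}
\end{aligned}
```

The forget gate f_t is what allows information that is less relevant to prediction to be discarded, and the additive update of c_t is what eases gradient propagation over long sequences.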

An autoencoder, on the other hand, is a type of artificial neural network (ANN) model used for tasks such as image data compression. It consists of an encoder and a decoder and learns a compressed representation of the data by repeating the encoding and decoding process (Roelofs et al., 2021). The autoencoder is an unsupervised learning model that is actively considered in situations such as anomaly detection where there are only small amounts of labeled data (Thill et al., 2021).

An LSTM autoencoder is a model that applies the LSTM algorithm to the encoder and decoder of an autoencoder and is used for the dimensionality reduction and anomaly detection of a dataset (Said Elsayed et al., 2020). Because the detection of abnormal insider behavior is performed on large datasets with long time-series data, the LSTM autoencoder is suitable for the purpose (Nam et al., 2020).

Figure 3 shows a flowchart illustrating how abnormal insider behavior is detected using an LSTM autoencoder that uses de-identified data. After data generated from all PCs connected to the network of an organization are collected, preprocessing is performed to replace the collected data with numerical data suitable for machine learning use. Among the collected data, data that are personally identifiable and have a high possibility of violating privacy are deleted, whereas sensitive information is de-identified. After de-identification, the data are subjected to a risk assessment process to ensure that they are de-identified to an appropriate level. The data processed in this way are used as input data for the LSTM autoencoder to detect abnormal behavior. If abnormal behavior is detected, the abnormal actor is traced through a re-identification process.
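The final detection step of this flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each session has already been passed through a trained autoencoder, that the anomaly score is the mean squared reconstruction error, and that the decision threshold is the mean plus three standard deviations of the errors on normal data. The threshold rule, function names, and toy values are all hypothetical.

```python
from statistics import mean, pstdev


def mean_squared_error(original, reconstructed):
    """Reconstruction error between an input vector and the autoencoder output."""
    return mean((o - r) ** 2 for o, r in zip(original, reconstructed))


def fit_threshold(normal_errors, n_sigma=3.0):
    """Assumed rule: mean + n_sigma * standard deviation of errors on normal data."""
    return mean(normal_errors) + n_sigma * pstdev(normal_errors)


def flag_anomalies(errors_by_session, threshold):
    """Return the (de-identified) session IDs whose error exceeds the threshold."""
    return [sid for sid, err in errors_by_session.items() if err > threshold]


# Hypothetical reconstruction errors measured on normal training sessions.
normal_errors = [0.010, 0.012, 0.011, 0.009, 0.013]
threshold = fit_threshold(normal_errors)

# Hypothetical errors for new sessions, keyed by de-identified session ID.
errors_by_session = {"s1": 0.011, "s2": 0.250, "s3": 0.014}
flagged = flag_anomalies(errors_by_session, threshold)
print(flagged)
```

Only the flagged session IDs would then need to go through the re-identification process, so normal users remain de-identified.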

Figure 3: Flowchart of abnormal-insider-behavior detection based on LSTM autoencoder using de-identified data.


Table 2: CERT dataset configuration.

File: logon.csv
Contents: PC login or logoff
Characteristics:
• Fields: ID, Date, User, PC, Activity (Logon/Logoff)
• 1,000 insiders each have an assigned PC.
• The following items appear similar among users.
  - Start time (slight error)
  - End time (slight error)
  - Length of work day (slight error)
  - After-hours work: Most users do not log on outside of working hours.

File: http.csv
Contents: Internet access history
Characteristics:
• Fields: id, date, user, pc, url, content
• URL includes the domain name and path. Words included in the URL are generally related to the content of the web page.
• Each web page can contain multiple pieces of content.

File: file.csv
Contents: Copy files to removable media devices
Characteristics:
• Fields: id, date, user, pc, filename, content
• content: Consists of a hexadecimal encoded file header followed by a space-separated list of content keywords.
• Each file can contain multiple topics.
• File header is related to file name extension.
• Each user has a normal number of file copies per day (deviations from these normal numbers can be used as an important indicator).

File: email.csv
Contents: Incoming and outgoing emails
Characteristics:
• Fields: id, date, user, pc, to, cc, bcc, from, size, attachment count, content
• Some noise edges are introduced.
• A small number of insiders send emails to outsiders.
• There may be multiple recipients.
• Email size indicates the number of bytes of the message, excluding attachments (email size and number of attachments have no correlation to each other).

File: device.csv
Contents: External device input or output
Characteristics:
• Fields: id, date, user, pc, activity (connect/disconnect)
• Some users use flash drives.
• If the user shuts down the system before removing the drive, the disconnect record is missing.
• Users are assigned a typical average number of flash drive uses per day (deviations from the normal number of uses can be used as an important indicator).

File: LDAP
Contents: Personnel information of insider
Characteristics:
• Fields: employee name, user id, email, role, business unit, functional unit, department, team, supervisor
• Data for approximately 1 year and 6 months exist by month from 2009-12 to 2011-05.
• There is a significant difference in the numbers of emails received and sent, depending on the role.
• role - ITAdmin: Systems administrators with global access privileges

4 EVALUATION ENVIRONMENT

4.1 Dataset

The CERT Insider Threat Test Dataset (Institute, 2013) was used in the experiment. The CERT dataset was created by the Carnegie Mellon University Software Engineering Institute in collaboration with ExactData, LLC, and with support from DARPA I2O. It was created for the purpose of researching insider threat behavior, and currently, six releases have been updated to 10 versions. The dataset includes data on 1,000 insiders and malicious actors executing five malicious behavior scenarios. In this experiment, CERT r4.2 was used because its rate of malicious behavior data is higher than that of other versions. Table 2 provides a description of the files included in the CERT dataset. In this experiment, logon.csv, http.csv, file.csv, email.csv, device.csv, and LDAP were used. In the CERT r4.2 dataset, there are 70 malicious actors among 1,000 insiders: thirty in scenario 1, thirty in scenario 2, and ten in scenario 3. Table 3 provides descriptions of the scenarios used in this experiment on the CERT r4.2 dataset.

4.2 ARX

The open source software ARX, which is used for data de-identification, was used on the original data. ARX supports a variety of anonymization tools and privacy models and can analyze risks by applying attacker models. Its de-identification process is illustrated in Figure 4. After the data to be de-identified are imported, the type of each attribute is set as either Identifying, Quasi-Identifying, or Sensitive / Insensitive. In the case of the Identifying and Quasi-Identifying types, the value of each attribute is replaced with “*”. The privacy model to be applied is then set. On the other hand, in the case of the Sensitive type, a separate privacy model must be set for each. After all settings for data de-identification are completed, the appropriate de-identification level is determined via the Explore process. Afterward, the input data and output data are compared to evaluate the de-identification performance. The Analyze utility function can be used to check the classification performance and quality model according to the target variable. On the other hand, the Analyze risks function can be used to evaluate the risk level according to various attacker models.

4.3 Evaluation Method

Experiments were conducted to demonstrate and evaluate the performance of the proposed technology. Figure 5 shows the simulation model. All records of operations by 1,000 insiders from PCs connected to the internal network of the organization are transmitted to the central monitoring server. The monitoring server subjects the data to preprocessing to use the collected data for machine learning. Data divided into files such as logon.csv, http.csv, file.csv, email.csv, device.csv, and LDAP are reclassified based on insider ID and then undergo processing for missing values and the removal of outliers.

Two procedures are performed to de-identify the preprocessed data. In the first step, a new item is created by combining up to three appropriate items to avoid using items that may contain content directly related to the actions of the insider and their personal privacy, such as date, content, and url. For example, through the use of the logon and logoff records of the logon data, it is possible to determine the time work was performed within designated working hours or the time work was performed outside working hours. These newly created items can be combined again with http and device data to create data such as records of Internet access outside working hours and records of files copied to external devices.
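The first step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the working-hours boundaries, the timestamp format, and the derived feature names n_afterhours_logon and n_afterhours_device are assumptions made for the example. Only per-user counts leave the function, so the raw date values are not carried forward.

```python
from datetime import datetime

# Assumed working hours; the paper does not state the exact boundaries.
WORK_START_HOUR = 9
WORK_END_HOUR = 18
# Assumed layout of the "date" field.
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def is_after_hours(date_text):
    """True if the timestamp falls outside the assumed working hours."""
    hour = datetime.strptime(date_text, DATE_FORMAT).hour
    return not (WORK_START_HOUR <= hour < WORK_END_HOUR)


def derive_features(logon_rows, device_rows):
    """Combine logon and device records into per-user after-hours counts."""
    features = {}

    def row_for(user):
        return features.setdefault(
            user, {"n_afterhours_logon": 0, "n_afterhours_device": 0}
        )

    for row in logon_rows:
        counts = row_for(row["user"])
        if row["activity"].lower() == "logon" and is_after_hours(row["date"]):
            counts["n_afterhours_logon"] += 1

    for row in device_rows:
        counts = row_for(row["user"])
        if row["activity"].lower() == "connect" and is_after_hours(row["date"]):
            counts["n_afterhours_device"] += 1

    return features


# Hypothetical toy records using the field names listed in Table 2.
logon_rows = [
    {"user": "U1", "date": "01/04/2010 09:05:00", "activity": "Logon"},
    {"user": "U1", "date": "01/04/2010 22:10:00", "activity": "Logon"},
    {"user": "U2", "date": "01/04/2010 09:30:00", "activity": "Logon"},
]
device_rows = [
    {"user": "U1", "date": "01/04/2010 22:15:00", "activity": "Connect"},
    {"user": "U2", "date": "01/04/2010 10:00:00", "activity": "Connect"},
]

features = derive_features(logon_rows, device_rows)
print(features)
```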

The second step is de-identification using the ARX tool. First, the type of each field is set. In this experiment, the type of user was set to Identifying, that of sessionid was set to Quasi-identifying, and those of role, f_unit, dept, team, ITAdmin, and n_email, which relate to position and department information, were set to Sensitive. The field n_email holds the sum of incoming and outgoing e-mails; as noted in Table 2, the numbers of e-mails received and sent differ greatly depending on the role, and thus it was classified as Sensitive-type data. The types of all remaining fields were set to Insensitive.

Next, the generalization hierarchy method to be applied to the Sensitive-type data is set. In this experiment, character masking was applied to the six Sensitive-type data fields. Character masking is a general-purpose mechanism and is the most widely available method for data anonymization. After all values were aligned to the left, masking was performed from right to left. The length of the longest string in each field was set as Max.characters, and padding was added so that all values have a length of Max.characters. The settings are specified in Table 4. Domain size refers to the number of different possible values for each field. For example, in the case of role, the possible values for the field, that is, the domain, are 0 to 41, and thus, the domain size is 42. Max.characters indicates the number of characters in the longest integer value in each field.

Therefore, Max.characters is set to 1 for f_unit and ITAdmin, which have domain sizes of 10 or less, and to 2 for the remaining fields, which have domain sizes of 100 or less. Padding is then added according to the alphabet size so that all values have the same length. The privacy model was set to l-diversity with l = 2. Abnormal behavior is then detected by using the de-identified data as input to the LSTM autoencoder. If abnormal behavior is detected during this process, the abnormal actor is tracked through a re-identification process. If abnormal behavior is not detected, the data are stored in the database.
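The l-diversity setting with l = 2 mentioned above can be illustrated with a small check of the distinct variant, which requires every group of indistinguishable records to contain at least l different values of a sensitive field. This is a sketch under assumptions, not the ARX implementation: the function name, the grouping field "group", and the sample records are hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, group_keys, sensitive_key, l=2):
    """Check that each group of records holds at least l distinct sensitive values."""
    classes = defaultdict(set)
    for record in records:
        # Records that share the same values for group_keys form one class.
        key = tuple(record[k] for k in group_keys)
        classes[key].add(record[sensitive_key])
    return all(len(values) >= l for values in classes.values())


# Hypothetical masked records; "group" stands in for whatever makes records indistinguishable.
records = [
    {"group": "A", "role": "4*"},
    {"group": "A", "role": "1*"},
    {"group": "B", "role": "4*"},
    {"group": "B", "role": "4*"},
]
print(is_distinct_l_diverse(records[:2], ["group"], "role", l=2))  # True
print(is_distinct_l_diverse(records, ["group"], "role", l=2))      # False
```

Group B fails the check because both of its records carry the same sensitive value, so an observer who places a person in that group learns the value with certainty.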

In this experiment, 20% of the overall dataset constituted the training dataset, whereas the remaining 80% constituted the validation dataset. We evaluated the degree of de-identification achieved by the proposed technology and compared its detection rate with those of conventional technologies.

This study used log data collected over a short period of time. This makes it difficult to reflect the bias and errors that may occur when massive amounts of log data are collected in real situations. Therefore, to use the method in an actual intrusion detection system, training must be done using sufficient sample data.

Table 3: CERT experiment scenario descriptions.

Scenario    Description
Scenario 1  User who did not previously use removable drives or work after hours begins logging in after hours, using a removable drive, and uploading data to wikileaks.org. Leaves the organization shortly thereafter.
Scenario 2  User begins surfing job websites and soliciting employment from a competitor. Before leaving the company, they use a thumb drive (at markedly higher rates than their previous activity) to steal data.
Scenario 3  System administrator becomes disgruntled. Downloads a keylogger and uses a thumb drive to transfer it to the machine of their supervisor. The next day, he uses the collected keylogs to log in as his supervisor and send out an alarming mass email, causing panic in the organization. He leaves the organization immediately.

Figure 4: ARX de-identification process.

Figure 5: Simulation of abnormal-insider-behavior detection based on LSTM autoencoder.

5 PERFORMANCE EVALUATION AND ANALYSIS

We analyzed the results of the experiments conducted to evaluate the performance of the proposed technology. In this assessment, we compared the performance obtained using identifiable data (the original dataset), data de-identified using the conventional method, and data de-identified using the proposed method. The conventional de-identification method selects sensitive data and then deletes them.

ICISSP 2024 - 10th International Conference on Information Systems Security and Privacy

Table 4: Demonstration of data de-identification using ARX.

Field      Type         Hierarchy  Domain size  Alphabet size  Max. characters  Privacy model
user       Identifying  Delete     -            -              -                -
sessionid  Quasi-id.    Delete     -            -              -                -
role       Sensitive    Masking    42           10             2                Distinct-2-diversity
f_unit     Sensitive    Masking    7            7              1                Distinct-2-diversity
dept       Sensitive    Masking    23           10             2                Distinct-2-diversity
team       Sensitive    Masking    39           10             2                Distinct-2-diversity
ITAdmin    Sensitive    Masking    2            2              1                Distinct-2-diversity
n_email    Sensitive    Masking    37           10             2                Distinct-2-diversity

5.1 De-Identification Performance Evaluation

The safety of the de-identified data was analyzed using the Analyze risk function of ARX. For safety analysis, it is important to consider what kind of knowledge the attacker has and how much. The more of the knowledge needed for an attack the attacker has, the easier the attack becomes. In the risk analysis of this experiment, we employed three attacker models: the Prosecutor attacker model, the Journalist attacker model, and the Marketer attacker model. The evaluation indicators were Records at risk, Highest risk, and Success rate. Records at risk indicates the percentage of records whose risk is above the standard value, whereas Highest risk indicates the highest risk for a single record. Meanwhile, Success rate indicates the percentage of records that can be re-identified on average.
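The three indicators can be made concrete with a small sketch. It assumes the commonly used formulation in which the re-identification risk of a record is 1 divided by the number of records that are indistinguishable from it; whether ARX computes the indicators in exactly this way is an assumption here, and the threshold value stands in for the unspecified standard value. The function name prosecutor_risk and the sample data are hypothetical.

```python
from collections import Counter


def prosecutor_risk(class_keys, threshold):
    """Return (records at risk, highest risk, success rate) as fractions.

    class_keys holds, for each record, a label of the group of records
    that are indistinguishable from it.
    """
    sizes = Counter(class_keys)
    # Risk of a record: 1 / size of the group it belongs to.
    risks = [1.0 / sizes[key] for key in class_keys]
    n = len(risks)
    records_at_risk = sum(1 for r in risks if r > threshold) / n
    highest_risk = max(risks)
    success_rate = sum(risks) / n  # average risk over all records
    return records_at_risk, highest_risk, success_rate


# Ten hypothetical records in groups of size 4, 4, and 2.
keys = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
print(prosecutor_risk(keys, threshold=0.3))  # (0.2, 0.5, 0.3)
```

In this formulation the highest risk is set by the smallest group, which is why a few small groups can raise Highest risk while Success rate stays low.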

The Prosecutor attacker model assumes that an attacker targets a specific individual and knows that the data regarding the target individual are included in the dataset. Table 5 shows the results of risk analysis when the Prosecutor attacker model was applied to the original dataset and the de-identified datasets. In the case of the original dataset, all evaluation indicators, i.e., Records at risk, Highest risk, and Success rate, were at 100%. On the other hand, when the conventional de-identification method was applied, the risk was at its lowest because sensitive data were completely deleted. When the proposed de-identification method was applied, the highest risk was only in the 1% range, indicating that the dataset is safe.

Table 5: Risk analysis results for when the Prosecutor attacker model was applied.

Data type                          Records at risk  Highest risk  Success rate
Identifiable data                  100%             100%          100%
De-identified data (conventional)  0%               0.02%         0.02%
De-identified data (proposed)      0%               1.30%         0.23%

By contrast, the Journalist attacker model assumes a situation wherein an attacker targets a specific individual but has no background knowledge about that individual. Table 6 shows the results of risk analysis when the Journalist attacker model was applied. When the original dataset was used, all indicators of risk were at 100%, whereas when the conventional de-identification method was employed, all indicators were at approximately 0%, showing that the dataset was safe. When the proposed de-identification method was applied, the highest risk was in the 1% range.

Table 6: Risk analysis results for when the Journalist attacker model was applied.

Data type                          Records at risk  Highest risk  Success rate
Identifiable data                  100%             100%          100%
De-identified data (conventional)  0%               0.02%         0.02%
De-identified data (proposed)      0%               1.30%         0.23%

The Marketer attacker model assumes an attacker who aims to re-identify a large number of individuals rather than targeting a specific individual. An attack is considered successful when a large number of records can be re-identified. Table 7 shows the results of risk analysis when the Marketer attacker model was applied. Even in this case, the proposed de-identification method resulted in excellent safety: the success rate, which was 100% for the original dataset, was in the 0% range for both the conventional and proposed de-identification methods.

Table 7: Risk analysis results for when the Marketer attacker model was applied.

Data type                          Success rate
Identifiable data                  100%
De-identified data (conventional)  0.02%
De-identified data (proposed)      0.23%

The risk analysis with the three attacker models showed that the dataset de-identified using the conventional method had the lowest re-identification probability. The dataset de-identified using the proposed method also had a re-identification probability low enough to be considered safe.

5.2 Detection Rate Performance Evaluation

Subsequently, an experiment was conducted to compare abnormal behavior detection performance with respect to whether and how data de-identification was applied. The loss value obtained using the validation data is compared with an arbitrarily set threshold value, and if the loss value is greater than the threshold value, the behavior is judged to be abnormal. In this experiment, the threshold was set to 1.1. Figure 6 compares the accuracy and F1-score results of the LSTM autoencoder-based abnormal behavior detection experiment with respect to whether and how de-identification was applied.
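The thresholding rule above can be sketched as follows. The example assumes that the loss is the mean squared error between an input sequence and the reconstruction produced by the autoencoder, which the text does not state explicitly; the function names and the sample sequences are hypothetical, and only the threshold of 1.1 comes from the experiment.

```python
def reconstruction_loss(sequence, reconstruction):
    """Mean squared error between a sequence and its reconstruction.

    Both arguments are lists of time steps, each a list of feature values.
    """
    squared = [
        (x - y) ** 2
        for step, rec_step in zip(sequence, reconstruction)
        for x, y in zip(step, rec_step)
    ]
    return sum(squared) / len(squared)


def is_abnormal(loss, threshold=1.1):
    """Judge a sequence abnormal when its loss exceeds the threshold."""
    return loss > threshold


# Hypothetical sequence with two time steps and two features.
sequence = [[0.0, 1.0], [2.0, 3.0]]
close = [[0.0, 1.0], [2.0, 1.0]]  # reconstruction with a small error
far = [[0.0, 1.0], [0.0, 1.0]]    # reconstruction with a large error

print(reconstruction_loss(sequence, close), is_abnormal(reconstruction_loss(sequence, close)))  # 1.0 False
print(reconstruction_loss(sequence, far), is_abnormal(reconstruction_loss(sequence, far)))      # 2.0 True
```

Because the autoencoder is trained to reconstruct typical behavior, a sequence that it reconstructs poorly yields a large loss and is flagged.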

In the case of identifiable data, the accuracy was 0.950. By contrast, in the case of data de-identified using the conventional method, the accuracy was 0.788, showing significant deterioration in detection performance. Conventional de-identification causes loss of information because it deletes sensitive data rather than replacing or generalizing them, which resulted in a decrease in detection performance. On the other hand, in the case of data de-identified using the proposed method, the accuracy was 0.954, a slight improvement over that obtained using the identifiable data.

The F1-score was 0.975 for the identifiable data, 0.881 for the data de-identified using the conventional method, and 0.976 for the data de-identified using the proposed method. The F1-scores likewise show that the data de-identified using the conventional method resulted in significantly deteriorated detection performance, whereas the data de-identified using the proposed method resulted in a slight performance improvement compared to that obtained using the identifiable data.
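For reference, accuracy and F1-score can be computed from predicted and true labels as in the sketch below. The labels are hypothetical and are not the experimental data; which class is treated as positive (label 1) is also an assumption, since the text does not specify it.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1-score) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical labels for eight sequences.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(accuracy_and_f1(y_true, y_pred))  # (0.75, 0.8)
```

The F1-score ignores true negatives, so it can differ noticeably from accuracy when the two classes are unbalanced.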

Figure 6: Comparison of LSTM autoencoder-based abnormal behavior detection accuracy and F1-score results with respect to de-identification application.

The de-identification evaluation confirmed that the proposed de-identification leads to a low possibility of re-identification and therefore good safety for the de-identified dataset. In addition, a comparison of the LSTM autoencoder-based anomaly detection performance obtained with the identifiable data and with the data de-identified using the proposed method confirmed that the proposed de-identification resulted in a slight performance improvement. On the other hand, conventional de-identification provided the lowest possibility of re-identification and, therefore, superior data safety, but because significant information loss occurred during the de-identification process, the abnormal behavior detection performance deteriorated. We demonstrated that the proposed LSTM autoencoder-based insider abnormality detection technology that uses de-identified data provides safety in terms of personal privacy and leads to high detection performance.

6 CONCLUSION

In this paper, we presented a security solution that protects individual privacy by applying data de-identification when monitoring abnormal behavior among organization insiders, while maintaining abnormal behavior detection performance similar to that obtained using existing identifiable data. In the abnormal behavior detection process, we processed long time-series data using an LSTM autoencoder, an algorithm suitable for abnormal behavior detection.

To demonstrate the effectiveness of the proposed method, a de-identification evaluation and an abnormal behavior detection performance comparison were conducted. In the de-identification evaluation, risk analysis was conducted by applying three attacker models, and the results showed that the de-identified dataset had only a low possibility of re-identification and was therefore safe. In the abnormal behavior detection performance comparison experiment, the de-identified data resulted in slightly improved performance and a higher detection rate than those obtained using the identifiable data.

In follow-up research, we plan to expand the scope of application of anomaly detection solutions that use de-identified datasets by setting up various anomaly detection situations and providing anomaly detection solutions tailored to each situation.

ACKNOWLEDGEMENTS

This work was partly supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korean Government (MOTIE) (P0008703, The Competency Development Program for Industry Specialists) and MSIT under the ICAN (ICT Challenge and Advanced Network of HRD) program (No. IITP-2022-RS-2022-00156310), supervised by the Institute of Information Communication Technology Planning and Evaluation (IITP).
