Anomalous File System Activity Detection Through Temporal Association Rule Mining∗

M. Reza H. Iman, Pavel Chikul, Gert Jervan, Hayretdin Bahsi and Tara Ghasempouri

Department of Computer Systems, Tallinn University of Technology, Tallinn, Estonia

Centre for Digital Forensics and Cyber Security, Tallinn University of Technology, Tallinn, Estonia

Keywords:

NTFS, USN Journal, Forensics, Pattern Recognition, Association Rule Mining, Anomaly Detection.

Abstract:

The NTFS USN Journal tracks all changes to the files, directories, and streams of a volume for various
purposes, including backup. Although this data source is considered a significant artifact in digital forensic
investigations, its use for automatic malicious behavior detection is less explored. This paper applies
temporal association rule mining to data obtained from the NTFS USN Journal for malicious behavior
detection. The proposed method extracts association rules from two data sources, the first containing
normal behavior and the second containing malicious behavior. The obtained rules, which embed the
sequence of events, are compared with respect to their support and confidence values to identify the ones
indicating malicious behavior. The method is applied to a ransomware case to demonstrate its feasibility in
finding relevant rules based on USN Journal activities.

1 INTRODUCTION

The detection and exploration of malicious behavior
is one of the mainstream research directions in the
digital forensics domain. A huge number of data

sources can be utilized in cyber incident investiga-

tions for identifying such behavior. The sources in-

clude but are not limited to network trafﬁc captures,

processes in memory, system call sequences, or Win-

dows registry modiﬁcations. Microsoft NTFS Change

Journal or USN Journal is another alternative that ac-

cumulates information regarding all of the operations

performed on the ﬁle system.

NTFS forensics stands out as one of the corner-

stones of conventional PC forensics due to the usage

of ﬁle systems across all of the Microsoft Windows

operating system lines. USN Journal is often used

in system forensics to manually determine malicious

or criminal actions (Cohen, 2020). It can shed some

light on the executables launched in the system. File

deletion traces of these ﬁles can still conﬁdently be re-

covered from the journal. It enables tracking the ﬁle

system operations related to ﬁle creation, renaming,

deletion, or changing security attributes, thus providing
valuable information about malicious behavior once the
benign usage is profiled (Corey, 2013; Russinovich, 2000).
This data is easy to access and extract compared to, for
instance, network traces or system calls, which require
additional tools and usually have limited historical
coverage. Despite its huge potential, limited work has
been done to date on the automated analysis of this
important source of evidence.

∗ This work was supported in part by the European
Union through European Social Fund in the frames of the
“ICT programme” (“ITA-IoIT” topic) and by the Estonian
Research Council grants PSG837.

In this paper, we examine the ways of forensic

pattern recognition in the NTFS USN Journal using

the Apriori algorithm (Han et al., 2012) and Tempo-

ral Association Rules (TAR) (Antunes and Oliveira,

2001; Bilqisth and Mustofa, 2020). Apriori is a fast

algorithm that can provide accurate association rules

(Han et al., 2012). Association rules demonstrate in-

teresting relations among variables and data in a large

dataset (Zaki, 2000).

Association rule mining has become a promising
technique for engineers to extract and explore useful
information from a system. It has shown its strength in
many different domains, such as market analysis (Brin
et al., 1997), accident and traffic analysis (Shahin et al.,
2022), intrusion detection infrastructures (Treinen and
Thurimella, 2006), and health informatics (Altaf et al.,
2017), and it is widely applied to the dependability and
reliability of safety-critical applications (Danese et al.,
2015; Heidari Iman et al., 2021).

H. Iman, M., Chikul, P., Jervan, G., Bahsi, H. and Ghasempouri, T.
Anomalous File System Activity Detection Through Temporal Association Rule Mining.
DOI: 10.5220/0011805100003405
In Proceedings of the 9th International Conference on Information Systems Security and Privacy (ICISSP 2023), pages 733-740
ISBN: 978-989-758-624-8; ISSN: 2184-4356
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

In this research, temporal association rule mining
identifies rules that are applicable to the USN Journal
data for the detection of anomalies caused by malicious
behavior. Specifically, the data mining approach extracts
two sets of rules, one from a snapshot of a benign file
system and another from a target file system that is
suspected to be infected or attacked. These two rule sets
are compared, and the rules that detect the anomalies are
determined. Security experts can use these rules to reveal
and enumerate the files used or infected by the actions of
the adversary. Thus, the proposed method not only predicts
the existence of anomalies but also makes it possible to
discriminate the infected files from the benign ones,
assisting in the impact assessment of incidents and the
planning of recovery actions during incident handling.

In summary, the contributions of this paper are as

follows:

• An automatic malicious behavior detection

method is proposed to analyze the NTFS USN

Journal and extract a set of association rules for

detecting the anomalies induced by malicious

behavior.

• The method does not require file-level labeled
data. Instead, a normal file system, which is easy
to obtain, and a target file system, which is the
main subject of the analysis, are sufficient.

• An incident regarding the ransomware analysis is

presented to demonstrate the applicability of the

method.

The outline of this paper is as follows: Section 2 gives

background information and reviews the related work.

The preliminaries of the proposed method are pre-

sented in Section 3. The datasets and their generation

are detailed in Section 4. Section 5 introduces the

proposed methodology. The case study and the rele-

vant results are presented and discussed in Section 6.

Section 7 concludes the study.

2 BACKGROUND INFORMATION

AND RELATED WORK

The USN Journal or Update Sequence Number Jour-

nal is an advanced feature of the Windows NT ﬁle

system introduced with version 3.1 of the ﬁle system

(Russinovich, 2000). It was designed to keep a record

of all changes made to the volume. There are sev-

eral use cases for the ﬁle system to maintain a full

log of changes within itself. Backup applications may

use the change journal information in order to iden-

tify ﬁles that were created or modiﬁed since the last

backup without the need to recursively parse the di-

rectory tree which is time- and resource-consuming.

Another useful application of the journal is real-time

antivirus protection: the AV application can monitor

the live USN journal to identify any incoming ﬁles

and scan them at the same moment.

The journal is stored in the system-maintained
metafile $Extend\$UsnJrnl in an alternate data stream
called $J and comprises a number of records consisting
of the following fields: a USN ID (a 64-bit unique
identifier which is incremented with each new record
created but is not guaranteed to be contiguous
(Cooperstein and Richter, 1999)), a timestamp, the
filename, a reference to the parent Master File Table
(MFT) ID, the update reason, and some other attributes.
The parent MFT ID can in some cases lead to the real
location of the file. However, if the MFT entry has
already been reused, the reference becomes invalid. The
update reason is a 64-bit integer that uses bit flags to
describe what changed in the file or directory. According
to Microsoft's documentation (Microsoft, 2022), there are
23 flags available, including creation, renaming, deletion,
and security information change. Multiple flags can be set
in a single update reason value. For example, the two flags
USN_REASON_FILE_CREATE (0x100) and USN_REASON_CLOSE
(0x80000000) combined result in the integer 0x80000100,
or 2147483904 in decimal. One of the most important
aspects of the journal is that it stores information about
operations on files that may already have been deleted and
whose entries in the Master File Table may have been
reused. Thus, it is possible to prove that some data
existed even if the data is long gone.
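To make the bit-flag encoding of the update reason concrete, the following minimal Python sketch decodes a reason value into flag names. The table covers only a subset of the documented USN_REASON_* flags, and the helper name `decode_reason` is a hypothetical one introduced here for illustration.

```python
# Subset of the USN_REASON_* bit flags from Microsoft's documentation.
# The table is intentionally incomplete; unknown bits are reported as such.
USN_REASONS = {
    0x00000001: "DATA_OVERWRITE",
    0x00000002: "DATA_EXTEND",
    0x00000004: "DATA_TRUNCATION",
    0x00000100: "FILE_CREATE",
    0x00000200: "FILE_DELETE",
    0x00000800: "SECURITY_CHANGE",
    0x00001000: "RENAME_OLD_NAME",
    0x00002000: "RENAME_NEW_NAME",
    0x00008000: "BASIC_INFO_CHANGE",
    0x80000000: "CLOSE",
}


def decode_reason(reason: int) -> list:
    """Return the names of all known flags set in an update reason value."""
    names = [name for bit, name in sorted(USN_REASONS.items()) if reason & bit]
    known = sum(bit for bit in USN_REASONS if reason & bit)
    leftover = reason & ~known
    if leftover:
        names.append(f"UNKNOWN(0x{leftover:x})")
    return names


combined = 0x100 | 0x80000000
print(combined, hex(combined), decode_reason(combined))
# prints: 2147483904 0x80000100 ['FILE_CREATE', 'CLOSE']
```

The same helper can be applied to any reason value found in the journal, for example `decode_reason(258)`.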

Different approaches for analyzing the USN journal in
order to discover patterns are presented in several works.
Lees (Lees, 2013) explores identifying a user who uses
Private Browsing mode or anti-forensic software such as
CCleaner. The proposed method made it possible to clearly
identify traces and, most importantly, patterns for such
activities within the change journal. Corey, in the article
"Re-introducing $UsnJrnl" (Corey, 2013), discusses ways of
using the change journal to identify malware activity,
including self-destruction, hiding in unusual locations,
and tampering with the file system metadata. Cohen (Cohen,
2020) demonstrates real-time monitoring and capture of the
change journal with the Velociraptor software in order to
update a hash database of modified files and trace
malicious activity.

Association rule mining has been applied to the

detection of ransomware by using the data regarding

dynamic link libraries called by the programs (Subedi

et al., 2018). Another study extracts association rules

from the user login information for the purpose of

user proﬁling (Abraham and de Vel, 2002). A so-

lution based on a classiﬁer composed of association

rules is proposed for the problem of email authorship

attribution (Schmid et al., 2015).

All the reviewed works that use the USN journal as
the evidence source rely on semi-manual pattern
recognition processes that mostly depend on the
investigators' observations and prior knowledge of
specific behavior. These approaches need human expertise
and are costly and error-prone because of the human in
the loop. Thus, we see a clear need for an automated way
to extract file system behavior patterns. To the best of
the authors' knowledge, this work is the first automatic
malicious file system behavior detection method that
adopts data mining for this purpose.

3 PRELIMINARIES

Deﬁnition 1. Apriori is a seminal data mining algo-

rithm for mining frequent itemsets for Boolean asso-

ciation rules (Han et al., 2012). To mine association

rules, Apriori employs an iterative approach called

level-wise search, where k-itemsets are used to ex-

plore (k + 1)-itemsets (Han et al., 2012).

Definition 2. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items
and $D = \{d_1, d_2, \ldots, d_m\}$ be a data set, i.e., a set of
observations, called transactions, with respect to the set
of items $I$. Each element in $D$ contains a subset of the
items in $I$. An association rule is defined as an
implication of the form $X \rightarrow Y$ where $X, Y \subseteq I$ and
$X \cap Y = \emptyset$. $X$ and $Y$ are called itemsets (Han et al., 2012).

Definition 3. A Temporal Association Rule (TAR) is
a kind of association rule that takes time into account
when the data in the data set form a sequence that
changes over time (Antunes and Oliveira, 2001; Bilqisth
and Mustofa, 2020).

Definition 4. TAR mining uses different patterns,
including Next and Before, that relate itemsets at
different points in time in a data set (Antunes and
Oliveira, 2001; Bilqisth and Mustofa, 2020). As an example,
X → Next(5min) Y means that when X occurs, Y is implied
5 minutes later. Similarly, the rule X → Before(5min) Y
means that when X occurs, Y should have occurred 5
minutes before it.
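As an illustration of the Next and Before patterns, the sketch below counts how often each pattern holds in a small list of timestamped events. It assumes that the time lag must match exactly and that events are given as (time, item) pairs; both the representation and the function names are hypothetical choices made for this example.

```python
def next_matches(events, x, y, lag):
    """Count occurrences of x that are followed by y exactly `lag` time units later."""
    times_y = {t for t, item in events if item == y}
    return sum(1 for t, item in events if item == x and t + lag in times_y)


def before_matches(events, x, y, lag):
    """Count occurrences of x that are preceded by y exactly `lag` time units earlier."""
    times_y = {t for t, item in events if item == y}
    return sum(1 for t, item in events if item == x and t - lag in times_y)


# Toy event log: (time in minutes, item)
events = [(0, "X"), (5, "Y"), (7, "X"), (10, "X"), (15, "Y")]

# X -> Next(5) Y holds for the X at t=0 and the X at t=10, but not for t=7.
print(next_matches(events, "X", "Y", 5))
# Y -> Before(5) X holds for both occurrences of Y.
print(before_matches(events, "Y", "X", 5))
```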

Definition 5. Support is an indication of how
frequently the itemset appears in the data set (Han et al.,
2012). This value is between 0 and 1. For the rule
$X \rightarrow Y$, the support is calculated with the
following formula (Han et al., 2012):

$$\mathrm{Supp}(X \rightarrow Y) = P(X \cup Y) \tag{1}$$

In (1), $P(X \cup Y)$ is the probability that a transaction
contains both X and Y, that is, the union of the itemsets
X and Y.

Furthermore, in Apriori, min_supp is the minimum support
threshold chosen by the expert to decide whether an
itemset is frequent (i.e., occurs frequently in the data
set) or not. If the frequency of the itemset exceeds this
threshold, the itemset is considered a frequent itemset.

Definition 6. Confidence is an indication of how
often the rule has been found to be true. For the rule
$X \rightarrow Y$, the confidence is calculated with the
following formula (Han et al., 2012):

$$\mathrm{Conf}(X \rightarrow Y) = P(Y \mid X) \tag{2}$$

Confidence assesses the degree of certainty of the
detected association rule. It is taken to be the
conditional probability $P(Y \mid X)$, that is, the
probability that a transaction containing X also contains
Y. This value is between 0 and 1. The min_conf is the
minimum confidence threshold chosen by the expert.
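Definitions 5 and 6 can be illustrated with a few lines of Python. This is a minimal sketch over a toy transaction list, not the implementation used in the paper; the function names and the toy data are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset` (formula 1)."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Conditional probability P(Y | X) for the rule X -> Y (formula 2)."""
    supp_x = support(transactions, antecedent)
    if supp_x == 0:
        return 0.0  # rule never applies; treated as zero confidence here
    return support(transactions, set(antecedent) | set(consequent)) / supp_x


transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]

print(support(transactions, {"a", "b"}))       # 2 of 4 transactions -> 0.5
print(confidence(transactions, {"a"}, {"b"}))  # 0.5 / 0.75
```

A rule is kept only if its support and confidence reach the thresholds min_supp and min_conf chosen by the expert.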

4 DATA SETS

As noted in (Cohen, 2020) and (Lees, 2013), differ-

ent software utilizes different approaches in regard to

ﬁle manipulations depending on their needs and im-

plementation speciﬁcs that usually result in several

change records being created. For example, unpack-

ing a ﬁle from an archive will in most cases result in

three USN records being generated:

• 256 (FILE_CREATE)

• 258 (DATA_EXTEND | FILE_CREATE)

• 2147483906 (DATA_EXTEND | FILE_CREATE | CLOSE)

Various software actions (both operating system and

user applications) performing ﬁle operations result in

a continuous ﬂow of USN records created in the jour-

nal. Thus, our assumption is that it is possible to

ﬁngerprint speciﬁc software behavioral patterns and

classify such actions (both legitimate and malicious).

To test our assumption with different behavioral

patterns we created two datasets: the ﬁrst one with

legitimate behavior only and the second one intro-

ducing some malicious activity inside the normal op-

erating system lifecycle. A fully patched Windows

7 virtual machine was set up and an origin snap-

shot was created (snapshot 1). For the ”legitimate”

dataset creation, some user activities were simulated.

Figure 1: General ﬂow of the proposed method.

These activities included web browsing, user docu-

ment editing, and, most importantly, software instal-

lation. On a high level, software installation involves

various ﬁle operations that might be common to ma-

licious software actions as well: ﬁles unpacked, tem-

porary ﬁles created in different locations to be later

deleted, etc. which could make the analysis harder.

After some time a snapshot was created representing

the ﬁrst ”normal” dataset (snapshot 2). To introduce

malicious activity the system was reverted to the ori-

gin snapshot and infected with the WannaCry mal-

ware as a typical ransomware representative. Wan-

naCry is a ransomware crypto-worm that when trig-

gered on a target machine iterates over user ﬁles en-

crypting them and by the end of the encryption phase

displays a notiﬁcation demanding ransom in order to

decrypt the files. In a worldwide attack in 2017, WannaCry
infected more than 200,000 machines in more than 150
countries, causing billions of dollars in damage
(Trautman and Ormerod, 2018). When the system

went into the ransom-demanding state another snap-

shot was created representing the second ”infected”

dataset (snapshot 3).

The ﬁle operation sequences that represent a sin-

gle action (such as the un-archiving of a ﬁle men-

tioned above) tend to be atomic meaning that the

records representing an action will stay close to each

other in the journal. However, due to the parallel writ-

ing in the journal, the patterns relevant to several ﬁles

may be mixed with each other. Thus, to overcome this

behavior we do the initial preparation of the datasets

so that the records related to a single ﬁle are batched

together in a one-second timeframe. Table 1
illustrates this preparation.
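One way to realize this batching is sketched below. It assumes that each journal record is available as a (timestamp in seconds, file identifier, update reason) tuple in journal order; the tuple layout and the function name `batch_by_file` are hypothetical and only serve the illustration.

```python
def batch_by_file(records):
    """Reorder records so that, inside each one-second window, all records
    of the same file are contiguous. Window order and the order of records
    within a file are preserved."""
    windows = {}  # second -> {file_id -> [records]}, dicts keep insertion order
    for rec in records:
        ts, file_id, _reason = rec
        per_file = windows.setdefault(int(ts), {})
        per_file.setdefault(file_id, []).append(rec)
    return [
        rec
        for per_file in windows.values()
        for recs in per_file.values()
        for rec in recs
    ]


# Interleaved records of two files within the same second.
raw = [
    (0.1, "File-1", 256),
    (0.2, "File-2", 256),
    (0.3, "File-1", 258),
    (0.4, "File-2", 258),
    (0.5, "File-1", 2147483906),
]
batched = batch_by_file(raw)
print([file_id for _, file_id, _ in batched])
# prints: ['File-1', 'File-1', 'File-1', 'File-2', 'File-2']
```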

We extracted the USN journals from snapshots 2 and
3 and, after running the preparation procedure described
above, converted them into arrays of USN update reasons,
i.e., lists of 64-bit integers. This resulted in two
datasets representing file system activity under different
circumstances: legitimate actions (~19,000 records) and
legitimate actions with some malicious actions mixed in
(~14,000 records).

Table 1: Raw journal data preprocessing (records within a 1-second window).

Original          Preprocessed
File-1 record 1   File-1 record 1
File-2 record 1   File-1 record 2
File-1 record 2   File-1 record 3
File-2 record 2   File-2 record 1
File-1 record 3   File-2 record 2

5 PROPOSED METHODOLOGY

The general ﬂow of the proposed method has been il-

lustrated in Fig. 1. As mentioned in Section 4, in our

case study, one data set is related to the normal behav-

ior of a user while he/she was using the system. The

other data set is related to the behavior of the system

when ransomware inﬂicted damage on it. As can be

seen in Fig. 1, association rule mining is applied sep-

arately on both sets of data. Therefore, at ﬁrst, a data

preprocessing phase is performed on data sets to pre-

pare suitable data for association rule mining. After-

ward, in the data mining phase, the Apriori algorithm

is applied to the prepared data sets separately. The
outcome is two sets of association rules, one mined from
the normal data set and one from the infected data set.

As illustrated in Fig. 1, once the association rules
have been mined, the two data sets are compared based on
these rules. This comparison relies on the values of the
Support (Definition 5) and Confidence (Definition 6)
metrics. Our assumption is that if both data sets are
similar, the rules mined from each of them should be
similar. This similarity means that, in addition to the
mined rules themselves, the Support and Confidence values
of these rules should be the same. Therefore, any
difference in these values can indicate an anomaly. The
result of the comparison is two new sets of association
rules (shown in the green box in Fig. 1), both of which
indicate anomalies in the system. One of these sets
contains the rules that occur only in the infected data
set. The other contains the rules that occur in both data
sets but with different support and confidence values.

More details about each phase and how the mined

rules will be compared are discussed in the following

subsections.

5.1 Data Preprocessing

In this phase, data preprocessing is performed to pro-

vide suitable data for Apriori to extract TARs (Def-

inition 3) from the datasets. The Apriori algorithm

extracts frequent itemsets in the form of association

rules without considering the sequence of events dur-

ing the time. However, we are interested in rules

which illustrate the sequence of events through time.

In other words, in Apriori, it does not matter

whether an itemset is followed by another or pre-

ceded by another. It only ﬁnds those itemsets that

have occurred together (without considering their se-

quences and orders). However, in NTFS USN Jour-

nal the concept of time or more speciﬁcally the se-

quence of operations that occur in the system mat-

ters. More precisely, if event X happens at second 1

and event Y happens at second 6, then the associa-

tion rule regarding this sequence of events would be

X → 5seconds Y , which means Y happens 5 seconds

after the happening of X. The mentioned association

rule is what we are interested in extracting for this

work. Mining these kinds of rules will be helpful for

security experts to more accurately ﬁnd the ﬁles or di-

rectories related to the malicious behavior in the sys-

tem.

Accordingly, the preprocessing step works as follows.
First, the user specifies the length of time covered by
the rules. For instance, if a rule such as X → 5seconds Y
is of interest, the value 5 should be specified. Second,
all the events in the dataset that fall within the
specified length (in this case 5) are clustered into the
same sub-dataset. Third, the time information of each
event in the sub-dataset is removed and saved for future
reference (the technical details are omitted here for
readability). Finally, this sub-dataset is fed to the
next step for mining the association rules.
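Since the technical details of this step are not given, the following Python sketch shows only one plausible reading of it. It assumes that the "length" is a number of consecutive journal records (a positional stand-in for time) and that each window becomes one transaction whose items are (offset, reason) pairs, where the offset counts how many records an item lies before the last record of the window. The function name and the item encoding are assumptions.

```python
def windows_to_transactions(reasons, length):
    """Slide a window of `length` consecutive update reasons over the sequence.

    Each transaction is a frozenset of (offset, reason) items; offset 0 is the
    last record of the window and offset k is the record k positions before it.
    Keeping the offset inside the item lets an order-agnostic miner such as
    Apriori still distinguish the positions of events.
    """
    transactions = []
    for end in range(length - 1, len(reasons)):
        window = reasons[end - length + 1 : end + 1]
        transactions.append(
            frozenset((length - 1 - i, r) for i, r in enumerate(window))
        )
    return transactions


reasons = [256, 258, 2147483906, 256, 258]
transactions = windows_to_transactions(reasons, 3)
print(len(transactions))        # 3 windows of length 3
print(sorted(transactions[0]))  # [(0, 2147483906), (1, 258), (2, 256)]
```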

5.2 Data Mining

In this phase, the Apriori algorithm (Deﬁnition 1)

(Han et al., 2012) is applied to the Preprocessed data

sets to generate association rules. According to Fig.

1, this phase takes two sets of data as the inputs, one

for the normal set of data, and the other one for the

infected set. The outputs of this phase are association

rules related to both sets. Due to space limitations, we
refer readers interested in Apriori to the literature
(Han et al., 2012).
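For readers who prefer code to prose, the level-wise search of Apriori can be sketched in a few lines of Python. This is a simplified textbook-style version written for illustration, not the implementation used in the experiments; it generates only rules with a single-item consequent, and all names are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_supp):
    """Return {itemset: support} for all itemsets with support >= min_supp."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {c: s for c in level if (s := supp(c)) >= min_supp}
        frequent.update(current)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        level = {
            a | b
            for a in current
            for b in current
            if len(a | b) == k
            and all(frozenset(sub) in current for sub in combinations(a | b, k - 1))
        }
    return frequent


def rules(frequent, min_conf):
    """Yield (antecedent, consequent, support, confidence) for each kept rule."""
    for itemset, s in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            conf = s / frequent[x]  # x is frequent by downward closure
            if conf >= min_conf:
                yield x, frozenset([y]), s, conf


transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
frequent = apriori(transactions, min_supp=0.5)
mined = list(rules(frequent, min_conf=0.6))
print(len(frequent), len(mined))  # 6 frequent itemsets, 6 rules
```

The assignment expression in `apriori` requires Python 3.8 or later.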

5.2.1 Applying Temporal Filters and Labels

This phase aims to restore the time instance of events

that were removed in the Preprocessing phase, Sec-

tion 5.1. In accordance with our previous statement,

the extracted association rules are generated in two

formats, namely next and before (Deﬁnition 4). De-

tailed instructions on how time instances are set back

to the rules are provided below:

After mining association rules in the previous

phase (section 5.2), the method provides us a set of

rules in the form of P → Q. By considering P → Q,

we will have two different conditions as follows:

• next: If the value of P is equal to some events in

the data set, and the value of Q is equal to the

events that in the data set have appeared after the

events of P, this means that the extracted associ-

ation rule is next. Therefore, the mined rule is

labeled as a next TAR.

• before: If the value of P is equal to the events that

have appeared in the data set before the events of

Q, this means that the extracted association rule is

before. Therefore, the mined rule is labeled as a

before TAR.

5.3 Anomaly Detection

This phase is in charge of automatically detecting ma-

licious behavior in the NTFS USN Journal which is

typically performed by ransomware. The assumption
of the proposed method is that, in a 'normal' scenario
in which the target (infected) data set contains no
malicious behavior, the two data sets (normal and
infected) should be similar. This means that if the
Apriori algorithm is applied to both data sets, the mined
rules, as well as the values of their supports
(Definition 5) and confidences (Definition 6), should be
similar.

In order to find the anomalies, the method compares
the two sets of mined rules. This comparison looks for
the cases that violate our assumption (i.e., similar
behavior and similar mined rules for both data sets in a
normal scenario), which yields two different sets of
anomalies.

The first set consists of the rules that do not occur
in the normal data set and occur only in the infected
data set. Based on our assumption, these rules show
malicious behavior. The other set consists of the rules
that are the same in both data sets. For these rules, the
support and confidence values of each rule are
calculated. Next, we calculate the difference between
their supports and confidences according to the following
formulas:

$$DS = (\mathit{Support1} - \mathit{Support2}) \times 100 \tag{3}$$

It should be noted that each rule that has been

mined from the normal data set or infected data set has

a support value. With the aid of the formula (3), we

calculate the support difference for each pair of rules

that has been mined from each data set and are exactly

the same (i.e., similar rules that have been mined from

both data sets, but with different support values). In

the above formula, Support1 is the calculated support

for a speciﬁc rule that has been mined from the nor-

mal data set, and Support2 is the calculated support

of that speciﬁc rule that has been mined from the in-

fected data set.

If DS > 0, malicious behavior has occurred, and
compared with the normal data set, some records have been
removed from the infected data set. On the other hand, if
DS < 0, malicious behavior has again occurred; however,
compared with the normal data set, additional records
have been added to the infected data set. Furthermore, in
line with Definition 6, formula (4) indicates the
probability of malicious behavior for a specific rule.

$$DC = (\mathit{Confidence1} - \mathit{Confidence2}) \times 100 \tag{4}$$

In the above formula, Confidence1 is the confidence
of a specific rule mined from the normal data set, and
Confidence2 is the confidence of the same rule mined from
the infected data set. For instance, if DC is equal to 95
for a mined rule P → Q, it means that 95% of the
operations covered by this rule in the data sets exhibit
malicious behavior.
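The comparison step can be summarized in a short Python sketch. It assumes each mined rule set is stored as a dictionary that maps a hashable rule key to its (support, confidence) pair; this representation, the tolerance, and the toy values are illustrative assumptions, not data from the experiments.

```python
def compare_rule_sets(normal, infected, tol=1e-9):
    """Split the infected rules into the two anomaly sets.

    normal / infected: {rule: (support, confidence)}
    Returns (infected_only, changed), where `changed` maps each rule present
    in both sets with differing values to its (DS, DC) pair (formulas 3, 4).
    """
    infected_only = {r: v for r, v in infected.items() if r not in normal}
    changed = {}
    for rule in normal.keys() & infected.keys():
        (s1, c1), (s2, c2) = normal[rule], infected[rule]
        ds = (s1 - s2) * 100
        dc = (c1 - c2) * 100
        if abs(ds) > tol or abs(dc) > tol:
            changed[rule] = (ds, dc)
    return infected_only, changed


# Toy rule sets with made-up values.
normal = {"P->Q": (0.30, 0.90), "A->B": (0.10, 0.85)}
infected = {"P->Q": (0.20, 0.90), "A->B": (0.10, 0.85), "M->N": (0.25, 1.0)}

infected_only, changed = compare_rule_sets(normal, infected)
print(sorted(infected_only))  # ['M->N']
print(sorted(changed))        # ['P->Q'], with DS close to 10 and DC equal to 0
```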

Table 2: Number of Mined Association Rules.

Rules                 Unequal Support   Infected Only
#Association Rules    1                 14
#Before Rules         0                 1
#Next Rules           1                 13

6 EXPERIMENTAL RESULTS

The experimental results of the proposed method have

been elaborated in this section. The normal data set

that we have used in this paper has 19055 records and

the infected data set has 13721 records.

In Table 2, the number of all mined rules (’#As-

sociation Rules’), as well as the number of Before

(’#Before Rules’) and Next (’#Next Rules’) rules

have been presented. It should be noted that the length

of the sequence of operations in the data set has been

set to 9 based on the expert’s decision. Thereby, for

the preprocessing phase, the number of shifts is equal

to 9. Since the detection of attacks is critically
important, the minimum confidence value has been set to
0.80 (out of 1). Note that this value can easily be
changed by the expert, as the proposed method is fully
automated. For highly critical cases this value can be
set higher; otherwise, a lower value can be set by the
user to introduce more sensitivity in rule mining.

As mentioned in Section 5.3 (Anomaly Detection),
the method provides two sets of rules (anomalies). The
first set contains the rules that have unequal support
values in the two data sets; their difference is
calculated according to formula (3) (DS). The second set
contains the rules that occur only in the infected data
set. Table 2 shows the number of mined rules for both
sets in the 'Unequal Support' and 'Infected Only'
columns. Only one rule occurs in both data sets with
unequal support values, and it was mined with the Next
pattern. There are 14 rules that do not occur in the
normal data set: 1 rule mined with the Before pattern and
the rest with the Next pattern. Regarding execution time,
the method is able to mine both categories of rules in
less than a second.

6.1 Digital Forensics Interpretation of

Rules

All 14 rules are presented in the unified format in
Table 3. The unified format lists all of the reason
records of a rule in the consecutive order in which they
are supposed to be found in the dataset. Taking the first
rule as an example, the consequent of the original mined
rule was 2147483652 and the list of antecedents was as
follows: [6 before 4, 4 before 2147484160, 1 before 4,
8 before 256, 7 before 2147483904, 3 before 256,
5 before 2147483652, 2 before 2147483904]. In practice,
this means that we look for a record 256, followed by a
record 2147483904, followed by 4, and so on until we find
an exact match of the whole sequence ending with
2147483652. It should be noted that all parts of the
antecedent of this rule should occur together in the data
set to imply the consequent.
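The conversion of a mined rule into the unified format, and the subsequent search for exact matches, can be sketched as follows. The sketch assumes that "k before v" is encoded as the pair (k, v), meaning that value v occurs k records before the consequent; the function names are hypothetical.

```python
def to_unified(antecedents, consequent):
    """Order the (offset, reason) antecedents from farthest to nearest and
    append the consequent, giving the sequence expected in the journal."""
    ordered = sorted(antecedents, key=lambda pair: pair[0], reverse=True)
    return [reason for _offset, reason in ordered] + [consequent]


def find_matches(reasons, pattern):
    """Return the start indices at which `pattern` occurs contiguously."""
    m = len(pattern)
    return [i for i in range(len(reasons) - m + 1) if reasons[i : i + m] == pattern]


# Antecedents of the first rule as listed in the text.
rule1_antecedents = [
    (6, 4), (4, 2147484160), (1, 4), (8, 256),
    (7, 2147483904), (3, 256), (5, 2147483652), (2, 2147483904),
]
unified = to_unified(rule1_antecedents, 2147483652)
print(unified)
# prints: [256, 2147483904, 4, 2147483652, 2147484160,
#          256, 2147483904, 4, 2147483652]   (on one line)

# A toy journal excerpt that contains the pattern once, starting at index 1.
journal = [8192] + unified + [4096]
print(find_matches(journal, unified))  # [1]
```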

Table 3: Association rules in unified format.

#   Rule                                                                                 Confidence
1   256, 2147483904, 4, 2147483652, 2147484160, 256, 2147483904, 4, 2147483652           1
2   256, 2147483904, 4, 2147483652, 2147484160, 256, 2147483904, 4, 2147483652           0.831050228
3   2147483652, 2147484160, 256, 2147483904, 4, 2147483652, 2147484160, 256, 2147483904  0.965714286
4   2147483904, 4, 2147483652, 2147484160, 256, 2147483904, 4, 2147483652, 2147484160    1
5   33026, 2147516674, 4096, 256, 258, 32768, 2147516416, 8192, 2147491840               1
6   4096, 8192, 2147491840, 4096, 8192, 2147491840, 4096, 8192, 2147491840               0.886956522
7   4, 2147483652, 2147484160, 256, 2147483904, 4, 2147483652, 2147484160, 256           0.945945946
8   258, 32768, 2147516416, 8192, 2147491840, 33026, 2147516674, 4096, 256               0.843137255
9   32768, 2147516416, 8192, 2147491840, 33026, 2147516674, 4096, 256, 258               0.931818182
10  2147484160, 256, 2147483904, 4, 2147483652, 2147484160, 256, 2147483904, 4           0.982248521
11  8192, 2147491840, 4096, 8192, 2147491840, 4096, 8192, 2147491840, 4096               1
12  256, 258, 32768, 2147516416, 8192, 2147491840, 33026, 2147516674, 4096               1
13  49152, 2147532800, 8192, 2147491840, 256, 258, 33026, 2147516674, 4096               1
14  2147491840, 4096, 8192, 2147491840, 4096, 8192, 2147491840, 4096, 8192               0.982142857

As the first part of our validation, we ran all our
mined rules against the infected dataset and extracted
a histogram of the affected file types (Table 4). The
second and third most frequent file types are WNCRYT and
WNCRY. These file types represent the temporary storage
and the final encrypted container generated by the
WannaCry ransomware, respectively (Team, 2017). As for
the TMP files, we suppose that those are also temporary
files generated by the malware, since they were created
in the infected directories (as indicated by the Parent
File Reference entry in the record) and their timestamps
match the timeframe of the attack. The remaining file
types, which are false positives, comprise less than 9%
of the total detected records. Given this information, we
may conclude that the rules correctly detect the
anomalies caused in the file system by malicious
activity. To measure the accuracy of the identification,
we took all of the unique file entries that were affected
by the attack and compared them with the ones detected by
the rules: out of 235 affected files we detected 206,
which gives an accuracy of 87.7%.

Table 4: Detected ﬁle types histogram.

File Type Number of Hits

tmp 1020

wncryt 710

wncry 411

png 101

txt 31

db 24

docx 18

zip 12

js 6

vbs 5

gif 3

lnk 1
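The percentages quoted above can be re-derived from Table 4 and from the reported file counts with a few lines of Python. Treating tmp, wncryt, and wncry as the malware-related types follows the interpretation given in the text; the variable names are illustrative.

```python
# Number of hits per file type, copied from Table 4.
hits = {
    "tmp": 1020, "wncryt": 710, "wncry": 411, "png": 101, "txt": 31, "db": 24,
    "docx": 18, "zip": 12, "js": 6, "vbs": 5, "gif": 3, "lnk": 1,
}
malware_types = {"tmp", "wncryt", "wncry"}

total = sum(hits.values())
other = sum(n for ext, n in hits.items() if ext not in malware_types)

print(total, other)                # 2342 201
print(f"{other / total:.1%}")      # 8.6%, i.e. less than 9% of detected records
print(f"{206 / 235:.1%}")          # 87.7% of the affected files were detected
```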

If we look closer at the 14 mined rules, we can see
that some of them are just shifted versions of others,
for example, rules 1, 2, 3, 4, 7, and 10. This behavior
was expected since the contiguous repetitive patterns in
the USN Journal can be picked up by the algorithm from
different starting points. This leaves us with 4 groups
representing the unique rules: (1, 2, 3, 4, 7, 10),
(6, 11, 14), (5, 8, 9, 12), and (13). Only rule number 13
does not have a shifted version of itself. We extracted
the individual outputs of single rules from the
identified groups. A comparison of the outputs showed
little to no difference in the identified records. Thus,
we end up with only 4 distinct rules for malicious
behavior detection. Another

aspect noted is the repetitiveness of the pattern in the

mined rules. For example, rule number 6 [4096, 8192,

2147491840, 4096, 8192, 2147491840, 4096, 8192,

2147491840] is a repetition of the same 3-value pat-

tern [4096, 8192, 2147491840] three times. It is a

part of future work to address both the elimination of

shifted rule versions and the shortening of repeated

patterns.
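One possible way to approach both reductions is sketched below in Python. It is not part of the proposed method. It assumes that a "shifted version" can be modeled as a cyclic rotation of the rule, which holds for the purely periodic rules 6, 11, and 14 but may not cover every shifted pair produced by the miner. The function names are hypothetical. Each rule is first shortened to its smallest repeating block and then mapped to a canonical rotation, so that rules sharing the same canonical form fall into one group.

```python
from collections import defaultdict


def shortest_period(rule):
    """Return the shortest block whose repetition reproduces the whole rule."""
    n = len(rule)
    for p in range(1, n + 1):
        if n % p == 0 and all(rule[i] == rule[i % p] for i in range(n)):
            return tuple(rule[:p])
    return tuple(rule)  # only reached for an empty rule


def canonical_rotation(block):
    """Pick the lexicographically smallest rotation as the representative."""
    if not block:
        return ()
    return min(tuple(block[i:]) + tuple(block[:i]) for i in range(len(block)))


def canonical_rule(rule):
    """Shorten a repeated pattern, then normalize its starting point."""
    return canonical_rotation(shortest_period(rule))


# Rules 6, 11, 13, and 14 as listed in the paper.
rules = {
    6: (4096, 8192, 2147491840) * 3,
    11: (8192, 2147491840, 4096) * 3,
    13: (49152, 2147532800, 8192, 2147491840, 256, 258, 33026, 2147516674, 4096),
    14: (2147491840, 4096, 8192) * 3,
}

# Group rule numbers by their canonical form.
groups = defaultdict(list)
for number, rule in rules.items():
    groups[canonical_rule(rule)].append(number)

for pattern, members in groups.items():
    print(members, pattern)
```

Under this assumption, rules 6, 11, and 14 collapse to the single 3-value pattern, while rule 13 remains in a group of its own, matching the manual grouping described above.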

Machine learning methods can be considered a significant alternative to the proposed method. However, there are some obstacles to applying them in this context. It is easy for a forensic expert to create a snapshot of a benign file system, whereas the target snapshot under investigation usually contains benign and malicious files blended into one file system. Supervised learning models require a label for each file, which is very hard to achieve in digital forensics tasks due to the high cost of labeling. One-class learning models, which learn only from the files in the benign snapshot, cannot use the target snapshot while inducing the models, which limits the knowledge that can be obtained from both snapshots. Unsupervised methods (e.g., clustering) that do not use any labeled data may give the expert some intuition, but they do not provide explicit rules. More importantly, most machine learning models do not provide human-readable rules, which severely limits their applicability in this context. Even explainable methods such as decision trees may require additional steps to generate rules, and strict pruning strategies must be applied to obtain comprehensible rule sets at the expense of detection performance.


7 CONCLUSION AND FUTURE WORK

In this work, we proposed an automated way of discovering patterns in the NTFS USN change journal by utilizing temporal association rule mining. We introduced a data preprocessing method that customizes the Apriori algorithm to mine temporal association rules instead of traditional rules, in which time has no meaning. The method can be applied to both real-time and post-mortem pattern recognition. We assume that normal and malicious software leave distinct fingerprints in the file system, which are recorded by the change journal. To test this assumption, we validated the method by detecting the patterns of ransomware presence in a system. This was done by infecting an operating system with malware and then running the proposed system against the extracted USN journal. As a result of this validation, we identified patterns specific to malware activity. More specifically, the files infected or generated by the malicious activity were found by the association rules mined from the normal and infected data sets.
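As a rough illustration of comparing a rule across a normal and an infected data set, the Python sketch below counts fixed-length windows of USN reason codes and computes the support and confidence of one window, read as "the prefix implies the last code". The window-based reading of a rule, the function names, and the toy sequences are assumptions made for illustration only. They are not the preprocessing or the Apriori customization used in this work.

```python
from collections import Counter


def window_counts(codes, width):
    """Count every contiguous window of the given width in a code sequence."""
    return Counter(
        tuple(codes[i:i + width]) for i in range(len(codes) - width + 1)
    )


def rule_metrics(codes, rule):
    """Return (support, confidence) of `rule` over the windows of `codes`.

    Support is the fraction of windows equal to the rule. Confidence is the
    fraction of windows starting with the rule's prefix that also end with
    the rule's last code.
    """
    counts = window_counts(codes, len(rule))
    total = sum(counts.values())
    hits = counts[rule]
    prefix_hits = sum(c for w, c in counts.items() if w[:-1] == rule[:-1])
    support = hits / total if total else 0.0
    confidence = hits / prefix_hits if prefix_hits else 0.0
    return support, confidence


# Toy sequences (hypothetical): a benign trace, and the same trace followed
# by the repeated 3-value pattern seen in rule 6.
normal = [256, 258, 4096] * 4
infected = normal + [4096, 8192, 2147491840] * 4

rule = (4096, 8192, 2147491840)
normal_support, normal_confidence = rule_metrics(normal, rule)
infected_support, infected_confidence = rule_metrics(infected, rule)

print("normal:  ", normal_support, normal_confidence)
print("infected:", infected_support, infected_confidence)
```

In this toy setting the rule never occurs in the normal sequence but holds with full confidence in the infected one, which is the kind of contrast in support and confidence that marks a rule as a candidate indicator of malicious behavior.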

As part of future work, we envision a system that uses the proposed method in real time to monitor the activity of a live system and detect patterns as they emerge. Another promising application would be the automatic generation of a forensic timeline that shows the system behavior as well as possible attack timeframes and volume. In terms of technical improvement, we plan to handle and merge shifted rules in order to reduce the number of redundant patterns. The same applies to the repetitive patterns inside the rules: an identified pattern should be shortened if it consists solely of a repeated sub-pattern.

REFERENCES

Abraham, T. and de Vel, O. (2002). Investigative profiling with computer forensic log data and association rules. In 2002 IEEE International Conference on Data Mining, Proceedings, pages 11–18. IEEE.

Altaf, W., Shahbaz, M., and Guergachi, A. (2017). Applications of association rule mining in health informatics: A survey. Artif. Intell. Rev., 47(3):313–340.

Antunes, C. and Oliveira, A. L. (2001). Temporal data mining: an overview.

Bilqisth, S. and Mustofa, K. (2020). Determination of temporal association rules pattern using apriori algorithm. IJCCS (Indonesian Journal of Computing and Cybernetics Systems), 14:159.

Brin, S., Motwani, R., Ullman, J. D., and Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 255–264.

Cohen, M. (2020). The Windows USN journal.

Cooperstein, J. and Richter, J. (1999). Keeping an eye on your NTFS drives: the Windows 2000 change journal explained. Microsoft Systems Journal (US Edition), 14:17–30.

Corey, H. (2013). Re-introducing $UsnJrnl.

Danese, A., Filini, F., Ghasempouri, T., and Pravadelli, G. (2015). Automatic generation and qualification of assertions on control signals: A time window-based approach. In IFIP/IEEE International Conference on Very Large Scale Integration-System on a Chip, pages 193–221. Springer.

Han, J., Kamber, M., and Pei, J. (2012). 6 - Mining frequent patterns, associations, and correlations: Basic concepts and methods. In Han, J., Kamber, M., and Pei, J., editors, Data Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems, pages 243–278. Morgan Kaufmann, Boston, third edition.

Heidari Iman, M. R., Raik, J., Jenihhin, M., Jervan, G., and Ghasempouri, T. (2021). A methodology for automated mining of compact and accurate assertion sets. In 2021 IEEE Nordic Circuits and Systems Conference (NorCAS), pages 1–7.

Lees, C. (2013). Determining removal of forensic artefacts using the USN change journal. Digital Investigation, 10(4):300–310.

Microsoft (2022). USN_RECORD_V2 - Win32 apps.

Russinovich, M. (2000). Inside Win2K NTFS, part 1.

Schmid, M. R., Iqbal, F., and Fung, B. C. (2015). E-mail authorship attribution using customized associative classification. Digital Investigation, 14:S116–S126.

Shahin, M., Heidari Iman, M. R., Kaushik, M., Sharma, R., Ghasempouri, T., and Draheim, D. (2022). Exploring factors in a crossroad dataset using cluster-based association rule mining. Procedia Computer Science, 201:231–238. The 13th International Conference on Ambient Systems, Networks and Technologies (ANT).

Subedi, K. P., Budhathoki, D. R., and Dasgupta, D. (2018). Forensic analysis of ransomware families using static and dynamic analysis. In 2018 IEEE Security and Privacy Workshops (SPW), pages 180–185. IEEE.

Team, C. T. U. R. (2017). WCry (WannaCry) ransomware analysis.

Trautman, L. J. and Ormerod, P. C. (2018). WannaCry, ransomware, and the emerging threat to corporations. Tenn. L. Rev., 86:503.

Treinen, J. J. and Thurimella, R. (2006). A framework for the application of association rule mining in large intrusion detection infrastructures. In Zamboni, D. and Kruegel, C., editors, Recent Advances in Intrusion Detection, pages 1–18, Berlin, Heidelberg. Springer Berlin Heidelberg.

Zaki, M. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390.

ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy
