TI-NERmerger: Semi-Automated Framework for Integrating NER
Datasets in Cybersecurity
Inoussa Mouiche and Sherif Saad
School of Computer Science, University of Windsor, ON, Canada
Keywords:
Threat Intelligence, Named Entity Recognition, Data Annotation, Data Augmentation.
Abstract:
Recent advancements highlight the crucial role of high-quality data in developing accurate AI models, espe-
cially in threat intelligence named entity recognition (TI-NER). This technology automates the detection and
classification of information from extensive cyber reports. However, the lack of scalable annotated security
datasets hinders TI-NER system development. To overcome this, researchers often use data augmentation
techniques such as merging multiple annotated NER datasets to improve variety and scalability. Integrating
these datasets faces challenges like maintaining consistent entity annotations and entity categories and ad-
hering to standardized tagging schemes. Manually merging datasets is time-consuming and impractical on a
large scale. Our paper presents TI-NERmerger, a semi-automated framework that integrates diverse TI-NER
datasets into scalable, compliant datasets aligned with cybersecurity standards like STIX-2.1. We validated
the framework’s efficiency and effectiveness by comparing it with manual processes using the DNRTI and
APTNER datasets, producing Augmented APTNER (2APTNER). The results demonstrate over 94% reduc-
tion in manual labour, saving several months of work in just minutes. Additionally, we applied advanced ML
algorithms to validate the effectiveness of the integrated NER datasets. We also provide publicly accessible
datasets and resources, supporting further research in threat intelligence and AI model development.
1 INTRODUCTION
Threat intelligence named entity recognition (TI-NER) is a specialized NLP task in the cybersecurity and threat intelligence domain. It identifies and classifies cybersecurity-related entities within unstructured text reports, such as malware, threat actors, indicators of compromise (IoCs), security tools, and
vulnerabilities. Although manual analysis by secu-
rity analysts is precise, the large volume and varied
sources of daily threat reports make this approach im-
practical. To address this, researchers have turned
to machine learning (ML) models to automate the
extraction of actionable intelligence from these re-
ports. Examples of such tools include AGIR (Per-
rina et al., 2023), TTPHunter (Rani et al., 2023a),
Vulcan(Jo et al., 2022), AttackKG(Li et al., 2022),
CyberRel(Guo et al., 2021), EXTRACTOR(Kiavash
et al., 2021), and CyberEntRel(Ahmed et al., 2024).
These deep learning tools depend on high-quality an-
notated datasets. In cybersecurity, the most com-
mon tagging schemes for annotating these entities in
text sequences are BIO (beginning, inside, or outside)
and BIOES (beginning, inside, outside, end, or sin-
gle), although there is limited research on the costs
of choosing one scheme over another. The effective-
ness of TI-NER is further underscored by its integra-
tion with the Structured Threat Information eXpres-
sion (STIX) framework, specifically STIX 2.1 (Jordan et al., 2022). STIX organizes extracted entities in a stan-
dardized format, facilitating data sharing and analy-
sis. It includes at least 19 entity types or STIX do-
main objects (SDOs) and 18 STIX cyber-observable
objects (SCOs) or artifacts, each of which represents
a unique entity commonly found in cyber threat intel-
ligence (CTI) datasets.
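For illustration only (the paper itself does not prescribe any library), the short sketch below creates one SDO and one SCO of the kinds just described with the open-source `stix2` Python library; the object names and values are invented.

```python
# Minimal sketch: one STIX Domain Object (SDO) and one STIX Cyber-observable
# Object (SCO), created with the oasis-open `stix2` library (pip install stix2).
# The names and values are illustrative, not taken from any dataset.
from stix2.v21 import IPv4Address, Malware

poison_ivy = Malware(name="Poison Ivy", is_family=True)   # SDO
c2_address = IPv4Address(value="1.1.1.1")                 # SCO (a low-level IoC)

print(poison_ivy.serialize(pretty=True))
print(c2_address.serialize(pretty=True))
```

Serialized this way, an entity extracted by a TI-NER model can be exchanged in the same standardized format the rest of the paper targets.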
However, a significant challenge in developing a
dynamic and effective TI-NER AI model for real-
world use is the scarcity of suitably scalable labelled
datasets. These datasets need to be well-annotated
and readily accessible to facilitate progress in the
field. Additionally, they should encompass a diverse
range of entity categories with many instances per
category and include a substantial volume of tokens
(Wang et al., 2020c). Moreover, adherence to the
widely adopted STIX 2.1 specifications is essential, as
these serve as a standard format for TI data exchange
among security firms. Data augmentation (DA), by
merging existing annotated NER datasets, offers a
potential solution to this challenge. DA is the pro-
cess of generating new data from existing data. Ro-
bust ML models require large and varied datasets for
initial training, but sourcing sufficiently diverse real-
world datasets can be challenging because of data si-
los, regulations, and other limitations (Ding et al.,
2024; Zhou et al., 2020). While various DA tech-
niques are available in the literature, it’s important
to note that they do not always guarantee improved
dataset quality or subsequent model performance (Lin
et al., 2024; Bakır et al., 2024). The DA approach of integrating two or more NER datasets in cybersecurity presents several challenges, including inconsistencies in tagging schemes, number of labels, label names, and compliance with standards like STIX 2.1. Merging NER datasets without addressing
these issues degrades the model’s performance. Man-
ual merging processes are time-consuming and cum-
bersome, akin to re-annotating each dataset manually.
To address these challenges, this study intro-
duces TI-NERmerger, a semi-automated framework
for merging TI-NER datasets. Our framework stream-
lines the integration process, significantly reducing
manual effort. Experimental results using two promi-
nent open-source TI-NER datasets, DNRTI(Wang
et al., 2020c) and APTNER(Wang et al., 2022),
demonstrate that our framework saves over 94% of
manual work, which would typically take several
months, in just a few minutes.
Our key contributions can be summarized as follows:
• We introduced TI-NERmerger, a semi-automated framework designed to integrate threat intelligence NER datasets. A case example involving open-source NER datasets such as DNRTI and APTNER illustrates the framework's effectiveness and performance.
• We curated the DNRTI-STIX NER dataset, comprising 175,354 tokens, 39,435 labeled entities, and 6,580 sentences. This dataset adheres to the STIX 2.1 data exchange standard.
• We created the curated 2APTNER dataset by merging DNRTI-STIX and APTNER. This more extensive augmented dataset contains 434,150 tokens, 79,161 labelled security entities, and 16,691 sentences. It offers greater scalability than existing datasets and complies with the STIX 2.1 standard, establishing itself as the premier dataset for building robust NER AI models.
• We implemented deep learning models, including BiLSTM and BERT, to demonstrate the effectiveness of the curated DNRTI-STIX and 2APTNER datasets for TI-NER.
Figure 1: Sample threat information in BIO (A) and BIOES (B) format.
(A) BIO: "admin@338 uses Poison Ivy, LaZagne, and Cobalt strike to target financial organizations in Western China; it exfiltrates data via 1.1.1.1", where admin@338 is tagged B-HackOrg; Poison Ivy as B-Tool I-Tool; LaZagne as I-Tool; Cobalt strike as B-Tool I-Tool; financial organizations as B-Org I-Org; Western China as B-Area I-Area; exfiltrates data as B-Purp I-Purp; and the IP address 1.1.1.1 is left as O.
(B) BIOES: "Aquatic Panda leverages Cobalt strike, LaZagne, and njRAT to target military industries in Hong Kong and exfiltrate data over 0.0.0.0 address", where Aquatic Panda is tagged B-APT E-APT; Cobalt strike as B-MAL E-MAL; LaZagne as B-TOOL; njRAT as B-MAL; military industries as B-IDTY E-IDTY; Hong Kong as B-LOC E-LOC; exfiltrate data as B-ACT E-ACT; and 0.0.0.0 as B-IP.
To promote research in this field, in addition to the TI-NERmerger framework, we will make both the DNRTI-STIX and 2APTNER datasets available through our GitHub repository, accessible at https://github.com/imouiche/TI-NERmerger.
The paper proceeds as follows: Section 2 discusses
the challenges motivating this study, Section 3 re-
views previous research efforts, Section 4 outlines the
methodology for merging TI-NER datasets, Section
5 introduces the TI-NERmerger framework, and Sec-
tion 6 concludes the paper, summarizing key findings
and contributions.
2 PROBLEM DEFINITION
We illustrate the research problem using a simple
case example in Figure 1, which mirrors a real-world
scenario. In this example, A and B represent two
annotated TI-NER datasets collected from different
sources. The objective is to merge these datasets into
a single consolidated dataset suitable for training a ro-
bust AI model.
The analysis of datasets A and B reveals the fol-
lowing challenges:
1. Tagging schemes: Dataset A utilizes the BIO
tagging scheme, whereas Dataset B employs the
BIOES tagging scheme.
2. Label names and entity categories: Dataset A in-
cludes label names such as HackOrg, Tool, Org,
Area, and Purp, while dataset B uses labels such
as APT, MAL, TOOL, IDTY, LOC, ACT, and IP.
This also highlights the difference in the number
of entity types between A and B.
3. Annotation: There is inconsistency in entity annotation between the datasets. For example, "Cobalt Strike", labelled B-Tool I-Tool in dataset A, is annotated B-MAL E-MAL in dataset B, marking it as malware rather than a tool. Another inconsistency is between "financial organizations"
labelled as B-Org I-Org in dataset A and "military
industries" labelled as B-IDTY E-IDTY in dataset
B. Both entity types ("Org" and "IDTY") identify
the object being targeted by hackers or malware.
The only difference is that "Org" is more specific.
4. Uncovered entities: Dataset B includes low-level indicators of compromise (IoCs), such as IP addresses, which are not annotated in Dataset A.
Integrating datasets A and B without addressing these
challenges will degrade the model’s performance.
While the manual process can be completed within
minutes if A and B only contain one sentence each,
real-world datasets like DNRTI(Wang et al., 2020c)
and APTNER(Wang et al., 2022) contain tens of thou-
sands of sentences, which makes the manual approach
cumbersome and even intractable at large scale. In
addition, datasets contain entities that span several to-
kens, making their identification and extraction more
complex. This paper aims to alleviate these chal-
lenges by transitioning from the manual process to a
semi-automated one, taking advantage of the fact that
these datasets are already annotated and come from
the same domain.
3 RELATED WORKS
Previous literature lacks any work explicitly target-
ing the development of a framework for merging
TI-NER datasets. Given that this paper also aims
to release suitably annotated NER datasets compli-
ant with cybersecurity data exchange standards like
STIX-2.1 (Jordan et al., 2022), we will review previ-
ous efforts in this direction to provide research con-
text. (Zhou et al., 2018) conducted a comprehensive
study in which they crawled 687 Advanced Persistent
Threat (APT) reports published between 2008 and
2018. They then annotated 370 articles, focusing on
11 predefined indicators of compromise (IoC) entity
types. (YI et al., 2020) introduced a novel NER ap-
proach called RDF-CRF, which combines regular ex-
pressions, a dictionary of known entities, and the con-
ditional random field (CRF) algorithm. To evaluate
the model, they created a NER dataset using 14,000 web security reports, encompassing 22 predefined entity categories and featuring 7,413 labelled entities.
(Kim et al., 2020) designed a NER system that lever-
aged the character-level feature vector to detect cy-
ber threats within unstructured text reports. To evalu-
ate the performance of their model, they constructed
a corpus that contained 498,000 entity tags and 11
cyber keywords or entity names. (Guo et al., 2021)
gathered security reports from diverse CTI sources,
including APT reports, hacker forums, security bul-
letins, and more. They created a dataset named OS-
INT, consisting of 13,000 sentences, to assess the
capabilities of CyberRel, a model designed for the
simultaneous extraction of entities and relationships
from security reports. (Marchiori et al., 2023) intro-
duced the STIXnet model, which employs rule-based
methods, NLP, and deep learning techniques to ex-
tract 18 STIX entities and relationships within secu-
rity reports. As part of their work, the authors made
available a sample of annotated APT groups, which
they gathered by crawling data from the MITRE
ATT&CK repository (Corporation, 2023).
Previous attempts to address the lack of large-scale, high-quality annotated NER datasets in cybersecurity are summarized in Table 1 as a comparative study. It is essential to highlight that, at present, none of the annotated datasets mentioned is publicly accessible except DNRTI (Wang et al., 2020c) and APTNER (Wang et al., 2022). DNRTI covers only 13 entity categories and does not conform to the STIX 2.1 specification for sharing cyber threat intelligence (CTI) information (Wang et al., 2022). DNRTI-STIX is our newly generated TI-NER dataset that adheres to the STIX 2.1 standard. Integrating DNRTI-STIX and APTNER results in the augmented APTNER, also known as 2APTNER. The 2APTNER dataset surpasses existing datasets in the number of tokens, annotated entities, and sentences, making it the largest NER dataset in the field of threat intelligence.
Table 1: The DNRTI-STIX2 and 2APTNER Datasets and their Comparison with Existing TI-NER Datasets.
| Datasets | Open | # of entity types | # of tokens | # of labeled ents. | # of sents. | vocab size | # of Reports |
| (Zhou et al., 2018) | ✗ | 11 | 1773638 | 69032 | - | - | 390 |
| (YI et al., 2020) | ✗ | 23 | - | 7413 | - | - | 14128 |
| (Kim et al., 2020) | ✗ | 11 | 498000 | 15720 | 13570 | - | 160 |
| (Guo et al., 2021) | ✗ | - | - | 75990 | 13000 | - | - |
| (Marchiori et al., 2023) | ✗ | 18 | - | - | - | - | - |
| (Wang et al., 2020c) | ✓ | 13 | 175461 | 36808 | 6592 | 9426 | - |
| (Wang et al., 2022) | ✓ | 21 | 258796 | 39726 | 10111 | 15608 | - |
| DNRTI-STIX2 | ✓ | 21 | 175354 | 39435 | 6580 | 9444 | - |
| 2APTNER | ✓ | 21 | 434150 | 79161 | 16691 | 16439 | - |
4 METHODOLOGY FOR
INTEGRATING TI-NER
DATASETS
This section outlines the step-by-step procedure fol-
lowed in this paper for merging labelled TI-NER
datasets in cybersecurity. After defining the datasets,
the methodology comprises four main phases: Tag
Representation, Entity Categories, Entity Mappings,
and Annotation. The paper begins with a manual ap-
proach to establish the baseline for developing the automation framework known as TI-NERmerger.
4.1 Datasets
The two datasets utilized for the experiment are
DNRTI(Wang et al., 2020c) and APTNER(Wang
et al., 2022), sourced from their respective reposito-
ries [(Wang et al., 2020b), (Wang et al., 2020a)]. We
combined the training, testing, and validation sets into
a unified dataset for each dataset. We conducted pre-
processing to eliminate non-ASCII characters and in-
complete sentences, and the resulting distribution of
the number of sentences, labelled entities, and vocab-
ulary size can be found in Table 1. The objective is
to merge these datasets to create a more scalable an-
notated dataset for building robust NER AI systems.
In this case, the resulting dataset is called augmented
APTNER or simply 2APTNER.
The definitions and examples of each entity type are provided in Table 2.
4.2 Tag Representation
The goal here is to select the tagging scheme for the
resulting dataset (2APTNER). DNRTI is labelled us-
ing the BIO (beginning, inside, or outside) scheme,
while APTNER employs BIOES (beginning, inside,
outside, end, or single). Since the two source datasets use different schemes, the choice reduces to BIO versus BIOES. To maintain simplicity and leverage the finer granularity provided by BIOES, we opted for this format; this decision addresses issue (1) stated in Section 2.
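Since DNRTI is BIO-tagged, adopting BIOES means every DNRTI sentence must be re-tagged before any label mapping. A minimal conversion sketch, assuming one list of tags per sentence (our own helper, not the released tool's code):

```python
def bio_to_bioes(tags):
    """Convert one sentence's BIO tags to BIOES.

    A B-X/I-X token becomes E-X when it ends an entity, and a
    single-token B-X entity becomes S-X; O tags are left unchanged.
    """
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes

# Example from Figure 1(A): "Cobalt strike" and "financial organizations".
print(bio_to_bioes(["B-Tool", "I-Tool", "O", "B-Org", "I-Org"]))
# -> ['B-Tool', 'E-Tool', 'O', 'B-Org', 'E-Org']
```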
4.3 Entity Categories
This step tackles challenge (2) of the problem def-
inition in Section 2 by specifying the entity cate-
gories for the target dataset (2APTNER). The DNRTI
dataset comprises 13 entity types: HackOrg, OffAct, SamFile, SecTeam, Time, Way, Tool, Exp, Idus, Org, Area, Purp, and Features. In contrast, the APTNER
dataset features 21 entity categories, including APT,
SECTEAM, LOC, TIME, VULNAME, VULID, TOOL,
MAL, FILE, MD5, SHA1, SHA2, IDTY, ACT, DOM,
ENCR, EMAIL, OS, PROT, URL, and IP. Given that
APTNER complies with the STIX 2.1 standard for
data exchange, using its entity types ensures align-
ment with this standard. Therefore, utilizing APT-
NER as the base dataset and converting DNRTI to
align with APTNER for seamless integration is bene-
ficial.
4.4 Entity Mappings
This step involves defining possible entity mappings
when aligning two datasets. Entity mappings elu-
cidate the types of relationships that exist between
entity types in different datasets. Once entity cate-
gories for the resulting or target dataset have been
defined, up to four possible entity mappings can be
distinguished. Due to this finite number, it becomes
feasible to semi-automate the process. For DNRTI
and APTNER, the four established mappings are il-
lustrated with examples in Table 2.
1. 1-to-1 Mappings indicate a direct mapping be-
tween DNRTI and APTNER entities.
2. 1-to-many Mappings: DNRTI entities or categories that are expanded into two or more APTNER categories.
3. many-to-1 Mappings: the reverse of 1-to-many mappings; several DNRTI entities are merged into a single APTNER entity.
4. Uncovered Entities: additional entities matching APTNER categories that were not included in the original DNRTI article but were uncovered during the annotation process while converting DNRTI to align with APTNER's 21 entity types. This mapping is optional, as one may decide to consider only the initially annotated entities.
The primary objective of this phase is to establish a
foundation for seamless manual and automated har-
monization of datasets.
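Because the number of mapping kinds is finite, the Table 2 relationships can be written down directly as small lookup tables; the sketch below shows one plausible encoding (the dictionary names are ours, not part of the released framework).

```python
# 1-to-1: a DNRTI label is renamed to a single APTNER/STIX-aligned label.
ONE_TO_ONE = {"HackOrg": "APT", "SecTeam": "SECTEAM",
              "Area": "LOC", "Time": "TIME"}

# many-to-1: several DNRTI labels collapse into one APTNER label.
MANY_TO_ONE = {"Idus": "IDTY", "Org": "IDTY",
               "OffAct": "ACT", "Way": "ACT", "Purp": "ACT", "Features": "ACT"}

# 1-to-many: a DNRTI label fans out into several APTNER labels; the target
# must be decided per entity (e.g. via regex or a MITRE ATT&CK lookup).
ONE_TO_MANY = {"Exp": ["VULNAME", "VULID"],
               "Tool": ["TOOL", "MAL"],
               "SamFile": ["MAL", "FILE", "MD5", "SHA1", "SHA2"]}

# Uncovered: entity types annotated in APTNER but absent from DNRTI.
UNCOVERED = ["DOM", "ENCR", "EMAIL", "OS", "PROT", "URL", "IP"]
```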
4.5 Annotations or Alignments
This phase aims to tackle challenges (3) and (4) from
Section 2. To resolve the inconsistency issue in entity
annotation between both datasets, it is crucial to have
a reliable reference source of truth. We relied on the
MITRE ATT&CK framework (Corporation, 2023) as
our primary point of reference to determine the cor-
rect entity types. The MITRE ATT&CK framework is
a knowledge base of adversary tactics and techniques
based on real-world observations. It is widely used
in cybersecurity for threat intelligence, threat hunt-
ing, and incident response purposes. The example
provided in Figure 1, utilizing the MITRE repository,
highlights that "Cobalt Strike" in Dataset A should
be classified as part of the Malware class rather than
a Tool, thus offering enhanced precision in address-
ing inconsistency for a coherent integration. Address-
ing challenge (4) is important but not mandatory.
It involves identifying entities that were not initially
included in the original dataset. The case example
shown in Table 2 entails discovering entities such as
DOM, ENCR, EMAIL, OS, PROT, URL, and IP in
the DNRTI dataset. It is important as it helps increase
the number of instances of these classes in the target
dataset, thereby enhancing classification accuracy.
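A minimal sketch of such a MITRE ATT&CK lookup is given below. It assumes the ATT&CK Enterprise STIX bundle published in the `mitre/cti` GitHub repository; the URL, field handling, and defaulting rule are our assumptions, not the framework's own implementation.

```python
# Minimal sketch: look up a software name in MITRE ATT&CK to decide between
# TOOL and MAL. ATT&CK is distributed as a STIX bundle; the URL is assumed.
import requests

ATTACK_URL = ("https://raw.githubusercontent.com/mitre/cti/"
              "master/enterprise-attack/enterprise-attack.json")

def build_software_index():
    """Map lower-cased software names/aliases to 'malware' or 'tool'."""
    bundle = requests.get(ATTACK_URL, timeout=60).json()
    index = {}
    for obj in bundle.get("objects", []):
        if obj.get("type") in ("malware", "tool") and not obj.get("revoked"):
            names = [obj.get("name", "")] + obj.get("x_mitre_aliases", [])
            for name in names:
                index[name.lower()] = obj["type"]
    return index

def classify_software(name, index):
    """Return 'MAL' or 'TOOL'; default to TOOL when ATT&CK has no entry."""
    kind = index.get(name.lower())
    return "MAL" if kind == "malware" else "TOOL"
```

Unknown software names fall back to TOOL, mirroring the default described for the 1-to-many module in Section 5.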
After completing the analysis phases, the manual re-
labeling of DNRTI using BIOES format and the 21
predefined entity categories of the target dataset was
initiated. This process involved four annotators: one
PhD student and three master’s students, all from a
cybersecurity background. The process began with
two one-hour meetings coordinated by the PhD student.
Figure 2: Sample conversion of DNRTI (a) to DNRTI-STIX (b) for the sentence "Similar to RIPTIDE campaigns, APT12 infects target systems with HIGHTIDE using a Microsoft Word (.doc) document that exploits CVE-2012-0158." In (a), DNRTI tags RIPTIDE campaigns as B-OffAct I-OffAct, APT12 as B-HackOrg, HIGHTIDE as B-Tool, Microsoft Word as B-Tool I-Tool, .doc as B-Tool, and CVE-2012-0158 as B-Exp. In (b), DNRTI-STIX tags RIPTIDE campaigns as B-ACT E-ACT, APT12 as S-APT, infects as S-ACT, HIGHTIDE as S-MAL, Microsoft Word as B-FILE E-FILE, .doc as S-FILE, and CVE-2012-0158 as S-VULID.
During the first meeting, 25 sentences were
re-labeled to serve as examples. At the end of the
meeting, each student selected 10 sentences to an-
notate for the next meeting. In the second meet-
ing, all 40 sentences were reviewed for better under-
standing. Subsequently, the remaining DNRTI sen-
tences were distributed among all annotators, with the
PhD student receiving 40% and each master’s student
receiving 20%. Annotators collaborated to address
any confusion that arose during the annotation pro-
cess and the consensus was obtained through a ma-
jority vote. The voting weight was distributed such
that the PhD student’s vote counted for 40%, while
each master’s student’s vote counted for 20%. This
distribution of voting power effectively resolved any
tie situations. The manual process to align DNRTI
with APTNER, ensuring adherence to the STIX 2.1
specification, took three months to complete. The re-
sulting dataset, named DNRTI-STIX, will seamlessly
merge with APTNER to create 2APTNER. This com-
bined dataset offers a more scalable annotated TI-
NER dataset for building reliable AI systems. A sam-
ple conversion of DNRTI to DNRTI-STIX is shown
in Figure 2. For instance, the named entity "HIGH-
TIDE" initially labeled as "Tool" is changed to "Mal-
ware" according to the MITRE ATT&CK repository.
Similarly, "Microsoft Word .doc" classified initially
as "Tool" (i.e., "B-Tool I-Tool B-Tool"), becomes "B-
FILE E-FILE S-FILE" after conversion.
Table 2: Entity Mappings aligning DNRTI with APTNER and STIX 2.1.
| DNRTI Entities | APTNER Entities | STIX-2.1 | Examples |
| 1-to-1 Mappings | | | |
| HackOrg | APT | Threat groups | APT19, admin@338, MuddyWater |
| SecTeam | SECTEAM | Security teams | FireEye, MATI, Palo Alto Networks |
| Area | LOC | Location | China, Russia, North Korea |
| Time | TIME | Time | Sept 10, April 9th, 2016 |
| 1-to-many Mappings | | | |
| Exp | VULNAME | Exploit | EternalBlue, zero-day |
| Exp | VULID | Vulnerability ID | CVE-2017-8759, CVE-2016-4117 |
| Tool | TOOL | Tool | PowerShell, LaZagne |
| Tool | MAL | Malware | SHIRIME, FinSpy, Clayslide |
| SamFile | MAL | Malware | Backdoor.APT.FakeWinHTTPHelper |
| SamFile | FILE | File | checker1.exe, .docs, Excel worksheets |
| SamFile | MD5 | Hash value | 12hj34ng34ghjdf802n3inf |
| SamFile | SHA1 | Hash value | AA0FA4584768CE9E16D67D8C520... |
| SamFile | SHA2 | Hash value | cca268c13885ad5751eb70371bbc9ce8c... |
| many-to-1 Mappings | | | |
| Idus | IDTY | Identity | Military Industry, Financial Institutes |
| Org | IDTY | Industry | Google, Technology organizations |
| OffAct | ACT | Attack patterns | Spear-phishing |
| Way | ACT | Attack patterns | Brute force |
| Purp | ACT | Attack patterns | Exfiltration, DoS |
| Features | ACT | Attack patterns | Lateral movement |
| Uncovered Entities | | | |
| - | DOM | Domain | adobe.com, mydomain1607.com |
| - | ENCR | Encryption methods | RSA, AES |
| - | EMAIL | Email | edmundj@chmail.ir, hostay88@gmail.com |
| - | OS | Operating system | Windows, Linux |
| - | PROT | Protocol | ssh, HTTP, POP3 |
| - | URL | URL | https://github.com |
| - | IP | IP address | 185.162.235.0, 0.0.0.0 |
4.6 Integration and Results
As shown in Table 3, converting DNRTI to DNRTI-STIX using the fine-grained BIOES format resulted in a total of 39,435 labelled entities, adding 2,625 entities to the original DNRTI. A slight reduction in the number of tokens and sentences for DNRTI-STIX can be observed, primarily attributable to the removal of noisy data, including non-ASCII characters and incomplete sentences, during the migration process. Additionally, DNRTI-STIX features the same 21 entity categories as APTNER and can therefore be merged with it without issue. Their integration gives rise to the 2APTNER dataset, which is more expansive and encompasses 434,150 tokens, 79,161 labelled security entities, and 16,691 sentences. It provides increased scalability compared to existing datasets and adheres to the STIX 2.1 standard, solidifying its position for building real-world AI systems.
Figure 3 provides a visual representation of dif-
ferent class distributions that distinguish 2APTNER
as the most scalable TI-NER dataset when com-
pared to DNRTI-STIX and the leading APTNER. The
labelling quality and effectiveness of the resulting
datasets (DNRTI-STIX and 2APTNER) are assessed
in the following sections.
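The corpus statistics reported in Tables 1 and 3 (tokens, labelled entities, sentences, vocabulary size) can be recomputed directly from the CoNLL-style files; a minimal sketch, assuming whitespace-separated token/tag lines with blank lines between sentences and a lower-cased vocabulary (the file name is illustrative):

```python
def corpus_stats(path):
    """Count tokens, labelled entities, sentences, and vocabulary size in a
    CoNLL-style file (token and tag per line, blank line = sentence end)."""
    tokens = entities = sentences = 0
    vocab = set()
    in_sentence = False
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:                        # sentence boundary
                if in_sentence:
                    sentences += 1
                    in_sentence = False
                continue
            parts = line.split()
            token, tag = parts[0], parts[-1]
            tokens += 1
            vocab.add(token.lower())
            in_sentence = True
            if tag.startswith(("B-", "S-")):    # one count per entity mention
                entities += 1
    if in_sentence:                             # file may not end with a blank line
        sentences += 1
    return {"tokens": tokens, "entities": entities,
            "sentences": sentences, "vocab": len(vocab)}

print(corpus_stats("2APTNER.txt"))
```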
4.7 Evaluations and Discussions
This section evaluates the quality and effectiveness of
the curated DNRTI-STIX and 2APTNER datasets re-
sulting from the manual labelling process. Various
state-of-the-art (SOTA) algorithms in the literature
Table 3: Curated DNRTI-STIX and 2APTNER Datasets from Manual Approach.
| Datasets | # ents type | # of tokens | # of labeled ents. | # of sents. | vocab size |
| DNRTI | 13 | 175461 | 36808 | 6592 | 9426 |
| DNRTI-STIX | 21 | 175354 | 39435 | 6580 | 9444 |
| APTNER | 21 | 258796 | 39726 | 10111 | 15608 |
| 2APTNER | 21 | 434150 | 79161 | 16691 | 16439 |
Figure 3: Comparison of DNRTI-STIX, APTNER, and
2APTNER.
have demonstrated significant performance on NER
tasks. For this study, we implement one recurrent neural network (RNN)-based architecture, BiLSTM, and one transformer-based model, BERT. The pri-
mary objective is to evaluate the effectiveness of the
datasets rather than focusing on the performance of
the models. Both BiLSTM and BERT have bidirec-
tional capabilities, allowing them to capture informa-
tion from past and future contexts, significantly en-
hancing their ability to comprehend the overall con-
text of a sequence. This quality contributes to their
impressive performance in NER tasks (Huang et al., 2015; Zhou et al., 2021; Varghese et al., 2023; Wang et al., 2020a; Devlin et al., 2019). Moreover,
BERT undergoes pre-training on an extensive cor-
pus of text data before fine-tuning for specific down-
stream tasks, employing an attention mechanism to
consider the entire context of a word within a sen-
tence. BERT has gained prominence, particularly for
its transformer architecture, which excels in capturing
long-range dependencies. This study implements the
base forms of these models to demonstrate the effec-
tiveness of our datasets.
4.7.1 DNRTI-STIX vs DNRTI
The hyperparameters for each base model used in the
experiment are detailed in Table 4. A Dropout of 0.2
was applied with BiLSTM to prevent overfitting. The
datasets were split into training, test, and validation
sets in a ratio of 7:1.5:1.5 for both models.
Table 4: Models’ parameter settings.
parameters BERT BiLSTM
batch size 8 16
dropout 0.5 0.5
learning rate 1e-5 1e-5
epsilon 1e-6 1e-6
weight decay 0.001 0.001
hidden layer size 100 -
optimizer Adam Adam
embedding size 768 300
number of epochs 1 10
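As a rough illustration of the setup behind Table 4, the sketch below wires a BERT token classifier to those hyperparameter values with the Hugging Face `transformers` and PyTorch libraries; the checkpoint name, the truncated label list, and the dummy batch are our assumptions and only approximate the paper's training code.

```python
# Minimal sketch of the BERT token-classification setup with the Table 4
# values (lr 1e-5, eps 1e-6, weight decay 0.001, batch size 8).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-APT", "I-APT", "E-APT", "S-APT"]   # truncated; 60 tags in full
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-5, eps=1e-6, weight_decay=0.001)

# One illustrative training step on a batch of 8 sentences.
batch = tokenizer(["APT12 infects target systems with HIGHTIDE"] * 8,
                  padding=True, truncation=True, return_tensors="pt")
# Dummy labels (index 0 = "O"); real code would mask sub-word and special
# tokens with -100 so they do not contribute to the loss.
batch["labels"] = torch.zeros_like(batch["input_ids"])
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```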
Table 5 provides a comparative summary of
the DNRTI-STIX and DNRTI datasets for BiLSTM and
BERT models. DNRTI-STIX features more unique
entity tags (60) than DNRTI (27) due to the conver-
sion of DNRTI to 21 entity categories aligned with
the STIX standard. Despite covering a broader range
of entity categories and exhibiting more diversity in
entity types, DNRTI-STIX maintains relatively simi-
lar performance to DNRTI and even slightly outper-
forms it in terms of Precision (P), Recall (R), and
F1 scores (F1) for both BiLSTM and BERT models.
This highlights the quality of the manual relabeling
process undertaken by the authors. As expected, the
BERT model achieves higher Precision, Recall, and
F1 scores compared to the BiLSTM model for both
datasets, indicating its superior performance. For this
reason, we used the BERT model to report the indi-
vidual class classification for both datasets in Table
6.
This table comprehensively shows how effectively
the model predicts each class, presenting per-class
and overall performance metrics, including Micro
Avg, Macro Avg, and Weighted Avg.
• The Micro Avg row represents the weighted average of precision, recall, and F1-score across all classes, considering individual predictions and support for each instance.
Table 5: DNRTI-STIX vs DNRTI using BiLSTM and BERT models.
| | DNRTI-STIX | DNRTI |
| # unique entity tags | 60 | 27 |
| Model | P / R / F1 | P / R / F1 |
| BiLSTM-CRF | 0.68 / 0.70 / 0.69 | 0.67 / 0.70 / 0.68 |
| BERT | 0.79 / 0.84 / 0.81 | 0.77 / 0.82 / 0.80 |
• The Macro Avg row displays the unweighted average of precision, recall, and F1-score across all classes, treating all classes equally without considering class imbalances.
• The Weighted Avg row provides a weighted average of precision, recall, and F1-score, with each class's contribution weighted by its support after the split. (A sketch of how such a report can be computed follows this list.)
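A minimal sketch of how such a per-class report, including the three averages, can be produced from predicted tag sequences; it assumes the `seqeval` library, which scores at the entity level for BIO/BIOES tags (our choice of tool, not necessarily the one used by the authors):

```python
# Minimal sketch: entity-level precision/recall/F1 report with seqeval.
# The tag sequences are illustrative; any entity-level scorer for
# BIO/BIOES sequences would serve the same purpose.
from seqeval.metrics import classification_report

y_true = [["B-APT", "E-APT", "O", "S-MAL"],
          ["B-IDTY", "E-IDTY", "O"]]
y_pred = [["B-APT", "E-APT", "O", "S-TOOL"],
          ["B-IDTY", "E-IDTY", "O"]]

# Prints per-class rows plus micro, macro, and weighted averages.
print(classification_report(y_true, y_pred, digits=2))
```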
It’s important to note that specific entity classes, such
as SHA1, and URL, are not included in the report.
This omission is due to the insufficient number of
instances for these classes in DNRTI-STIX, as seen
in Fig 3, and they were not considered during train-
ing and evaluation. Simultaneously, Table 6 presents
the classification report for the BERT model on the
original DNRTI dataset. This not only highlights
the unique characteristics of the STIX 2.1 format
in extracting more detailed entity information from
TI-NER datasets but also underscores the quality of
data relabeling done by the authors, maintaining the
model’s performance relatively high despite the in-
creased number of entity categories in DNRTI-STIX.
4.7.2 DNRTI-STIX vs APTNER vs 2APTNER
The Augmented APTNER, also known as 2APTNER, is formed by merging the DNRTI-STIX and APTNER datasets. Table 7 displays the classification report for these datasets using the BiLSTM and BERT models. DNRTI-STIX demonstrates superior performance compared to APTNER and 2APTNER. Previous studies have shown that BiLSTM struggles with larger datasets due to memory requirements, which is also visible in Table 7, whereas BERT performs consistently well across all datasets.
Upon reviewing the BERT classification reports in Table 8 for both the APTNER and 2APTNER datasets, it becomes evident that the "SHA1" entity class hurts 2APTNER's averages, contributing 0.00 across all metrics. This is because the class has fewer than 50 instances, and machine learning (ML) models typically require a minimum of 50 samples to learn the context within a sentence (Rani et al., 2023b; Pedregosa et al., 2011). For DNRTI-STIX, in contrast, the "SHA1" and "URL" classes were excluded altogether. Taking this into account, DNRTI-STIX and 2APTNER perform similarly under the BERT model.
This concludes the demonstration of the high-quality annotation and effectiveness of the DNRTI-STIX and 2APTNER datasets obtained through the manual relabeling approach, using the BiLSTM and BERT base models. These datasets will serve as baselines for evaluating the TI-NERmerger framework proposed in this study, which, given that the source datasets are already annotated and belong to the same domain, aims to automate and optimize the dataset integration process.
5 TI-NERmerger: A
SEMI-AUTOMATED
FRAMEWORK FOR
INTEGRATING NER DATASETS
IN CYBERSECURITY: A CASE
STUDY OF DNRTI AND
APTNER
With the rise of data-centric AI and the emergence
of Large Language Models (LLMs) like BERT, GPT-
3, RoBERTa, and others, the importance of high-
quality, scalable, and diverse datasets for training ro-
bust AI systems has become increasingly apparent.
To meet these requirements, a common approach in
data-centric AI is data augmentation. This involves merging multiple open-source an-
notated datasets into a single, consolidated, and di-
verse dataset with the aim of significantly improv-
ing the resulting AI systems. However, integrat-
ing threat intelligence named entity recognition (TI-
NER) datasets poses several challenges, as outlined
in Section 2. These challenges include using differ-
ent tagging formats, entity types, and inconsistency
in entity annotation. The manual process to address
these issues and align datasets for integration is time-
consuming and becomes increasingly difficult when
dealing with numerous datasets.
This section introduces TI-NERmerger, a semi-
automated framework designed for merging TI-NER
datasets. Leveraging the fact that these datasets originate
from the same domain and are already annotated
for NER tasks, TI-NERmerger facilitates the tran-
sition from the current manual approach to a semi-
Table 6: DNRTI-STIX vs DNRTI Classification Report using the BERT Model.
| DNRTI-STIX Class | P / R / F1 | DNRTI Class | P / R / F1 |
| ACT | 0.72 / 0.80 / 0.76 | Area | 0.85 / 0.93 / 0.89 |
| APT | 0.80 / 0.88 / 0.84 | Exp | 0.96 / 0.98 / 0.97 |
| DOM | 1.00 / 0.80 / 0.89 | Features | 0.73 / 0.83 / 0.78 |
| EMAIL | 0.80 / 1.00 / 0.89 | HackOrg | 0.78 / 0.83 / 0.81 |
| ENCR | 0.75 / 0.60 / 0.67 | Idus | 0.79 / 0.80 / 0.79 |
| FILE | 0.78 / 0.89 / 0.83 | OffAct | 0.71 / 0.84 / 0.77 |
| IDTY | 0.78 / 0.81 / 0.79 | Org | 0.65 / 0.68 / 0.66 |
| IP | 0.67 / 1.00 / 0.80 | Purp | 0.63 / 0.74 / 0.68 |
| LOC | 0.85 / 0.91 / 0.88 | SamFile | 0.81 / 0.81 / 0.81 |
| MAL | 0.79 / 0.83 / 0.81 | SecTeam | 0.88 / 0.87 / 0.88 |
| MD5 | 1.00 / 1.00 / 1.00 | Time | 0.87 / 0.91 / 0.89 |
| OS | 0.84 / 0.95 / 0.89 | Tool | 0.68 / 0.77 / 0.72 |
| PROT | 0.94 / 0.64 / 0.76 | Way | 0.73 / 0.64 / 0.68 |
| SECTEAM | 0.87 / 0.87 / 0.87 | | |
| SHA2 | 1.00 / 1.00 / 1.00 | | |
| TIME | 0.85 / 0.90 / 0.87 | | |
| TOOL | 0.70 / 0.71 / 0.70 | | |
| VULID | 1.00 / 0.99 / 1.00 | | |
| VULNAME | 0.86 / 0.90 / 0.88 | | |
| Micro Avg | 0.78 / 0.84 / 0.81 | Micro Avg | 0.77 / 0.82 / 0.80 |
| Macro Avg | 0.84 / 0.87 / 0.85 | Macro Avg | 0.77 / 0.82 / 0.79 |
| Weighted Avg | 0.79 / 0.84 / 0.81 | Weighted Avg | 0.77 / 0.82 / 0.80 |
Table 7: DNRTI-STIX vs APTNER vs 2APTNER Classification Report using BiLSTM and BERT.
| Model | Metrics | DNRTI-STIX (P / R / F1) | APTNER (P / R / F1) | 2APTNER (P / R / F1) |
| BiLSTM-CRF | Micro Avg | 0.65 / 0.74 / 0.69 | 0.65 / 0.61 / 0.63 | 0.60 / 0.61 / 0.63 |
| BiLSTM-CRF | Macro Avg | 0.56 / 0.50 / 0.54 | 0.45 / 0.47 / 0.46 | 0.41 / 0.46 / 0.42 |
| BiLSTM-CRF | Weighted Avg | 0.66 / 0.74 / 0.69 | 0.59 / 0.61 / 0.60 | 0.60 / 0.60 / 0.60 |
| BERT | Micro Avg | 0.78 / 0.84 / 0.81 | 0.73 / 0.78 / 0.76 | 0.76 / 0.80 / 0.78 |
| BERT | Macro Avg | 0.84 / 0.87 / 0.85 | 0.77 / 0.79 / 0.78 | 0.81 / 0.82 / 0.81 |
| BERT | Weighted Avg | 0.79 / 0.84 / 0.81 | 0.73 / 0.78 / 0.76 | 0.76 / 0.80 / 0.78 |
automated one. This transition allows the annotation
task, which typically takes several months, to be com-
pleted in just a few minutes.
Figure 4 illustrates the framework, which comprises
four main components leading to the formation of
the target or merged dataset. The framework is in-
spired by the manual process of integrating DNRTI
and APTNER, as outlined in Table 2. The four com-
ponents are classified into two phases: Analysis and
Automation. The analysis phase includes Tag Rep-
resentation, Entity Categories, and Entity Mappings.
The automation phase involves annotation or align-
ment and integration into the target dataset. As de-
picted in Figure 4, the framework supports merging multiple datasets (denoted as n). In
practice, this is achieved by merging two datasets at a
time, and then the resulting dataset is merged with the
next dataset in the sequence.
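In code, this pairwise strategy amounts to a left fold over the list of datasets; a minimal sketch, with corpora reduced to plain lists of annotated sentences and the alignment step abstracted away (names and data are illustrative):

```python
from functools import reduce

def merge_pair(base, other):
    # In the framework, `other` is first aligned to `base`'s tagging scheme
    # and entity categories; with already-aligned corpora the merge itself
    # reduces to concatenating the lists of annotated sentences.
    return base + other

aptner = [[("APT12", "S-APT")]]          # illustrative one-sentence corpora
dnrti_stix = [[("admin@338", "S-APT")]]
merged = reduce(merge_pair, [dnrti_stix], aptner)
print(len(merged))  # -> 2 sentences in the merged corpus
```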
For clarity, we maintain the example of DNRTI and
APTNER to describe the four components:
1. Tag Representation.
Select the tagging scheme for the target dataset,
such as BIO or BIOES. This decision should be
influenced by the specific requirements of the
NER dataset, the dataset’s characteristics, and
the desired level of detail in entity recognition
((Konkol and Konopík, 2015), (Alshammari and
Alanazi, 2021)). The framework implements only
these two tagging formats because almost all NER
datasets in security use one of these formats. To
repeat the experiment using our tool for the case
of DNRTI and APTNER, we have chosen BIOES.
Figure 4: TI-NERmerger: Semi-automation framework to integrate NER datasets in cybersecurity.
Table 8: APTNER and 2APTNER Classification Reports using the BERT model.
| Class | APTNER (P / R / F1) | 2APTNER (P / R / F1) |
| ACT | 0.56 / 0.68 / 0.62 | 0.62 / 0.68 / 0.65 |
| APT | 0.82 / 0.88 / 0.85 | 0.82 / 0.86 / 0.86 |
| DOM | 0.93 / 0.90 / 0.92 | 0.83 / 0.98 / 0.92 |
| EMAIL | 0.67 / 0.56 / 0.61 | 1.00 / 0.73 / 0.84 |
| ENCR | 0.85 / 0.92 / 0.89 | 0.76 / 0.85 / 0.80 |
| FILE | 0.72 / 0.75 / 0.74 | 0.77 / 0.74 / 0.76 |
| IDTY | 0.71 / 0.82 / 0.76 | 0.68 / 0.74 / 0.78 |
| IP | 0.94 / 0.94 / 0.94 | 0.94 / 0.97 / 0.96 |
| LOC | 0.88 / 0.91 / 0.89 | 0.84 / 0.89 / 0.86 |
| MAL | 0.71 / 0.72 / 0.72 | 0.72 / 0.80 / 0.76 |
| MD5 | 0.69 / 0.63 / 0.66 | 0.93 / 0.97 / 0.95 |
| OS | 0.77 / 0.78 / 0.77 | 0.83 / 0.86 / 0.85 |
| PROT | 0.72 / 0.77 / 0.74 | 0.70 / 0.79 / 0.74 |
| SECTEAM | 0.87 / 0.89 / 0.88 | 0.80 / 0.82 / 0.81 |
| SHA1 | - / - / - | 0.00 / 0.00 / 0.00 |
| SHA2 | 0.77 / 0.97 / 0.86 | 0.98 / 1.00 / 0.99 |
| TIME | 0.87 / 0.91 / 0.89 | 0.77 / 0.82 / 0.79 |
| TOOL | 0.53 / 0.57 / 0.55 | 0.72 / 0.74 / 0.73 |
| URL | 0.89 / 0.71 / 0.79 | 0.78 / 0.50 / 0.61 |
| VULID | 1.00 / 0.99 / 0.99 | 1.00 / 1.00 / 1.00 |
| VULNAME | 0.52 / 0.51 / 0.51 | 0.76 / 0.76 / 0.76 |
This choice is based on its previous use in rep-
resenting the APTNER dataset, which was larger
and contained 21 entity categories, and it also ad-
heres to the STIX standard.
2. Entity Categories.
Thoroughly analyze the entity categories in each
dataset and predefine the entity types for the fi-
nal dataset. This task should be carried out by
a domain expert who understands the specific re-
quirements of the NER tasks. In the case example,
the authors opted for the 21 predefined APTNER
entity categories for the target dataset.
3. Entity Mappings.
Establish distinct mappings between each dataset
and the predefined entity types of the target
dataset. Possible mappings include 1-to-1 map-
pings, 1-to-many mappings, many-to-1 mappings,
and the discovery module. Refer to Table 2 for
a visual representation of the four different map-
pings defined for the case of DNRTI and APT-
NER.
4. Annotation or Alignment.
This component involves translating the different
mappings established in the preceding phase into
a programming language. The TI-NERmerger
framework is implemented in Python and com-
prises six main modules, each of which can work
and be executed independently. The first mod-
ule reads the command-line inputs, while the next
four modules implement each identified mapping:
1-to-1 mappings, 1-to-many mappings, many-to-1
mappings, and uncovered entities. Finally, the last
module performs the merge and outputs the result.
The component is approached as an active anno-
tation initiative, with a domain expert in the loop
who decides which module to run and provides
the required entity classes. For the case example
of DNRTI and APTNER:
(a) The first module reads user inputs from the
command line and applies any tagging con-
version if needed. For example, a command-
line input of "TI-NERmerger BIOES DNRTI
APTNER 2APTNER" means the user wants
to merge DNRTI and APTNER into a single
dataset called 2APTNER using BIOES tagging.
This module automatically detects which of DNRTI and APTNER does not align with this format and makes the necessary conversion. In this case, DNRTI is changed from BIO to BIOES. (A condensed sketch of modules (a)-(f) follows this list.)
(b) The 1-to-1 mappings module assigns new
labels, namely APT, SECTEAM, LOC, and
TIME, to all DNRTI entities with labels Hack-
Org, SecTeam, Area, and Time, respectively.
(c) The many-to-1 mappings module merges all
DNRTI entities tagged as Idus and Org into
IDTY; and all DNRTI entities annotated as Of-
fAct, Way, Purp, and Features into ACT.
(d) The 1-to-many mappings module implements
an algorithm that uses the Python Scrapy library to query the ATT&CK repository (Corporation, 2023) and categorizes all DNRTI entities labelled as
Tool into either TOOL or MAL (malware). It
defaults to TOOL if the software is not found
on the MITRE platform. It works similarly to the manual process for addressing inconsistency in entity annotation. This module employs regular
expressions (regex) to parse all SamFile entities
into MAL, FILE, MD5, SHA1, and SHA2. Sim-
ilarly, it categorizes all Exp entities into VUL-
NAME and VULID using regex. It’s important
to note that this module can also identify hacker
groups defined in the ATT&CK repository.
(e) The discovery module reveals indicators of
compromise (IoCs) such as IP, URL, DOM,
EMAIL, and PROT from the DNRTI that were
not originally considered. This was necessary
to augment the number of instances in the tar-
get dataset, as these entities were annotated in
APTNER. The module also identifies encryp-
tion algorithms (ENCR) and operating systems
(OS) in DNRTI by matching unlabeled entities
with pre-defined lists of encryptions and oper-
ating systems.
(f) The integration module merges both datasets,
combining DNRTI and APTNER into a single
dataset called 2APTNER.
This explains how we successfully completed the
annotation task that took several months in just a
few minutes.
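A condensed, self-contained sketch of the modules above for the DNRTI/APTNER case is given below. It restates the mapping tables from Section 4.4, replaces the Scrapy-based ATT&CK query with a tiny stand-in dictionary, and uses our own regexes and function names, so it only approximates the released TI-NERmerger code.

```python
# Minimal sketch of the annotation/alignment modules (a)-(f) for DNRTI/APTNER.
import argparse
import re

ONE_TO_ONE = {"HackOrg": "APT", "SecTeam": "SECTEAM", "Area": "LOC", "Time": "TIME"}
MANY_TO_ONE = {"Idus": "IDTY", "Org": "IDTY",
               "OffAct": "ACT", "Way": "ACT", "Purp": "ACT", "Features": "ACT"}
# Stand-in for the ATT&CK lookup (the real module queries the MITRE repository).
MITRE_SOFTWARE = {"lazagne": "TOOL", "hightide": "MAL"}

HASH_RE = {"MD5": re.compile(r"^[0-9a-fA-F]{32}$"),
           "SHA1": re.compile(r"^[0-9a-fA-F]{40}$"),
           "SHA2": re.compile(r"^[0-9a-fA-F]{64}$")}
# Order matters: IP and EMAIL must be tried before the broader DOM pattern.
IOC_RE = {"IP": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
          "URL": re.compile(r"^https?://\S+$"),
          "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
          "DOM": re.compile(r"^[\w-]+(\.[\w-]+)+$")}

def map_label(token, label):
    """Modules (b)-(d): rename one DNRTI entity label to an APTNER label."""
    if label in ONE_TO_ONE:
        return ONE_TO_ONE[label]          # 1-to-1
    if label in MANY_TO_ONE:
        return MANY_TO_ONE[label]         # many-to-1
    if label == "Tool":                   # 1-to-many, resolved via ATT&CK
        return MITRE_SOFTWARE.get(token.lower(), "TOOL")
    if label == "Exp":                    # 1-to-many, resolved via regex
        return "VULID" if token.upper().startswith("CVE-") else "VULNAME"
    if label == "SamFile":                # 1-to-many, resolved via regex
        for name, rx in HASH_RE.items():
            if rx.match(token):
                return name
        return "FILE"
    return label

def discover(token):
    """Module (e): tag single-token IoCs that were left as O in DNRTI."""
    for name, rx in IOC_RE.items():
        if rx.match(token):
            return f"S-{name}"
    return "O"

def align_sentence(sentence):
    """Re-label one [(token, BIOES tag), ...] DNRTI sentence for APTNER."""
    out = []
    for token, tag in sentence:
        if tag == "O":
            out.append((token, discover(token)))
        else:
            prefix, label = tag.split("-", 1)
            out.append((token, f"{prefix}-{map_label(token, label)}"))
    return out

if __name__ == "__main__":
    # Module (a): e.g. `TI-NERmerger BIOES DNRTI APTNER 2APTNER`
    parser = argparse.ArgumentParser(prog="TI-NERmerger")
    parser.add_argument("scheme", choices=["BIO", "BIOES"])
    parser.add_argument("dataset_a")
    parser.add_argument("dataset_b")
    parser.add_argument("output")
    args = parser.parse_args()

    # Module (f) would read, align, and concatenate the two corpora; here we
    # only show the alignment step on one BIOES-converted DNRTI sentence.
    sample = [("APT12", "S-HackOrg"), ("uses", "O"), ("HIGHTIDE", "S-Tool"),
              ("from", "O"), ("1.1.1.1", "O")]
    print(args.output, align_sentence(sample))
```

Invoked as `TI-NERmerger BIOES DNRTI APTNER 2APTNER`, the argument parsing mirrors the command-line example in module (a), and the sample sentence shows a HackOrg entity, a Tool entity, and an unlabelled IP being re-tagged as APT, MAL, and IP according to the mapping tables and regexes.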
The code of the whole TI-NERmerger framework
spans more than 1,000 lines, and we plan to release it along
with various datasets to support continuous develop-
ment and improvement.
5.1 Results and Discussions
To evaluate the effectiveness of the TI-NERmerger
framework, we employed it to align the original
DNRTI with APTNER, resulting in DNRTI-STIX.
We compared the results with the manual process in
Table 9. It is important to note that APTNER re-
mained unchanged during the integration process un-
til the final stage, where it was merged with DNRTI-
STIX. APTNER was selected as the baseline for the
final dataset because it adheres to the STIX-2.1 data
exchange standard and offers a diverse range of entity
categories. Consequently, inconsistencies were re-
solved by aligning DNRTI-STIX with APTNER. For
example, as shown in Figure 1, "financial organizations" in dataset A, initially tagged as B-Org I-Org, was mapped to B-IDTY E-IDTY to align with dataset B during the merging process. In dataset B, the la-
bel category "IDTY" is used to identify the object or
victim entity targeted by malware and hacker organi-
zations.
Table 9 indicates that the framework successfully extended DNRTI from 13 entity types to the 21 entity categories of DNRTI-STIX. Both the manual process and the TI-NERmerger framework resulted in datasets with the same number of entity types (21). The total number of tokens is almost the same for both approaches, with a slight difference of 111 tokens, because noisy words and incomplete sentences were removed during the manual approach (hence 6,580 sentences for the manual process versus 6,592 for the TI-NERmerger framework).
We observe more labelled entities (39,435) resulting from the manual process than from the TI-NERmerger framework (37,335). This is because the manual process discovered more entities, such as IP, URL, DOM, EMAIL, PROT, OS, and ENCR, that were not initially considered in the original DNRTI dataset. The framework's discovery module faces challenges in uncovering an operating system or encryption algorithm when the entity name cannot be found in the predefined list of operating systems or encryption methods. Both approaches resulted in datasets with a similar number of sentences (6,580 for the manual process and 6,592 for the TI-NERmerger framework). The vocabulary size of the datasets is also very close, with a difference of only 5 vocabulary items. These results carry over to the figures reported in Table 10 for the 2APTNER dataset, which is the outcome of merging DNRTI-STIX and APTNER. The TI-NERmerger framework can thus produce datasets with characteristics comparable to those obtained through manual processes, demonstrating its effectiveness in automating the dataset integration process. In just a few minutes, the framework successfully accounted for over 94.67% of the annotated entities, a task that had previously taken several months using manual methods. This coverage would be even higher if no new entity types had to be uncovered from the dataset.
5.2 Evaluation and Discussions
The effectiveness of the datasets obtained using
our TI-NERmerger framework is displayed in Ta-
ble 11. The results from the manual approach serve
as baselines to evaluate the framework’s capability.
The TI-NERmerger framework performs similarly
to the manual approach, especially regarding Micro
and Weighted Averages for both BiLSTM-CRF and
BERT models on both datasets.
Table 12 presents the classification reports for each in-
dividual entity class for the DNRTI-STIX dataset us-
ing the BERT model. We observed only slightly better
performance in favour of the manual approach. The
absence of a line for the EMAIL entity class indicates
that the framework did not uncover enough instances
of this class. Similarly, the framework successfully
identified 53 of the 60 unique tags produced by the manual approach.
Table 9: DNRTI-STIX: Manual process vs Framework.
| Approach | # of ents type | # of tokens | # of labelled ents. | # of sents. | vocab size |
| Manual | 21 | 175354 | 39435 | 6580 | 9444 |
| TI-NERmerger | 21 | 175465 | 37335 | 6592 | 9439 |
Table 10: 2APTNER: Manual Approach vs Framework.
| Approach | # of unique tags | # of tokens | # of labeled ents. | # of sents. | vocab size |
| Manual | 21 | 434150 | 79161 | 16691 | 16439 |
| TI-NERmerger | 21 | 434261 | 77061 | 16703 | 16023 |
As stated earlier, the missing 7 tags result
from discovering entities that were initially ignored in
the original dataset. This demonstrates that the overall
performance of the framework remains above 94%.
5.3 Generalization, Advantages, and
Limitations
1. Generalization:
Our TI-NERmerger framework aligns with the
widely adopted STIX-2.1 data exchange stan-
dard in cybersecurity, which defines a set of
STIX Domain Objects (SDOs) and STIX Cyber-
observable Objects (SCOs). Each object cor-
responds to a unique concept commonly repre-
sented in CTI datasets. STIX ensures that orga-
nizations can share CTI in a consistent, machine-readable way, encouraging dataset owners to use STIX
objects as entity types for different downstream
tasks. Although these datasets may utilize dif-
ferent labelling and tagging schemes, their in-
tegration is facilitated once they establish map-
pings between entity categories and STIX base-
line objects. The identification of the four pos-
sible mappings, outlined in Table 2, is supported
by integrating MITRE ATT&CK within the STIX
framework. This integration offers a detailed be-
havioural context that significantly enhances the
understanding and differentiation of threat enti-
ties. This facilitates the effective alignment of
entity categories with the established STIX ob-
jects. As a result, our TI-NERmerger framework
offers strong generalizability across datasets uti-
lizing STIX objects or a subset of STIX objects to
define the entity categories. In other words, their
integration is assured as long as CTI data can be
mapped to the STIX standard. Conversely, in ar-
eas where these standard references are not guar-
anteed or are inapplicable, establishing mappings
across CTI datasets or demonstrating their exis-
tence can be challenging.
2. Advantages:
• The modules are independent of each other, flexible, and extensible, allowing them to be adapted for different purposes.
• Large datasets, often annotated by groups of students, can be effectively combined using this framework. It ensures annotation quality and consistency when merging datasets from different groups.
• The framework reduces labour-intensive work that typically takes several weeks to only a few minutes.
• Small to medium-sized NER datasets are typically well-annotated, and the tool can be employed to combine them into a scalable and high-quality labelled dataset.
• It is designed to conform to the STIX 2.1 data sharing standard and functions effectively with datasets encompassing a diverse range of entity categories.
3. Limitations. The TI-NERmerger framework has certain limitations:
• The framework merges two datasets at a time. In the case of multiple datasets, the result of the first two datasets is merged with the third dataset, and so on. This requires running the tool multiple times, since each dataset has its own peculiarities.
• Despite its capability to uncover artifact entities initially overlooked in original datasets, the framework relies on the MITRE ATT&CK repository as the sole source of truth when classifying high-level security entities such as attack groups, tools, and malware.
6 CONCLUSION AND FUTURE
WORK
This study introduces TI-NERmerger, a semi-
automated framework integrating threat intelligence
named entity recognition (TI-NER) datasets in cyber-
security. It serves as a data augmentation tool de-
signed to promptly tackle the scarcity of scalable and
Table 11: Manual Approach vs TI-NERmerger: Classification Report using BiLSTM and BERT.
| Model | Metrics | Manual DNRTI-STIX (P / R / F1) | TI-NERmerger DNRTI-STIX (P / R / F1) | Manual 2APTNER (P / R / F1) | TI-NERmerger 2APTNER (P / R / F1) |
| BiLSTM-CRF | Micro Avg | 0.65 / 0.74 / 0.69 | 0.65 / 0.61 / 0.63 | 0.60 / 0.61 / 0.63 | 0.61 / 0.62 / 0.62 |
| BiLSTM-CRF | Macro Avg | 0.56 / 0.50 / 0.54 | 0.45 / 0.47 / 0.46 | 0.41 / 0.46 / 0.42 | 0.44 / 0.47 / 0.45 |
| BiLSTM-CRF | Weighted Avg | 0.66 / 0.74 / 0.69 | 0.59 / 0.61 / 0.60 | 0.60 / 0.60 / 0.60 | 0.61 / 0.63 / 0.62 |
| BERT | Micro Avg | 0.78 / 0.84 / 0.81 | 0.80 / 0.84 / 0.90 | 0.76 / 0.80 / 0.78 | 0.75 / 0.79 / 0.77 |
| BERT | Macro Avg | 0.84 / 0.87 / 0.85 | 0.77 / 0.80 / 0.78 | 0.81 / 0.82 / 0.81 | 0.79 / 0.80 / 0.79 |
| BERT | Weighted Avg | 0.79 / 0.84 / 0.81 | 0.80 / 0.84 / 0.80 | 0.76 / 0.80 / 0.78 | 0.77 / 0.79 / 0.77 |
Table 12: DNRTI-STIX (Manual process vs Framework): Classification Reports using the BERT model.
| | Manual | TI-NERmerger |
| # of unique tags | 60 | 53 |
| Class | P / R / F1 | P / R / F1 |
| ACT | 0.72 / 0.80 / 0.76 | 0.78 / 0.83 / 0.80 |
| APT | 0.80 / 0.88 / 0.84 | 0.80 / 0.86 / 0.83 |
| DOM | 1.00 / 0.80 / 0.89 | 0.83 / 1.00 / 0.91 |
| EMAIL | 0.80 / 1.00 / 0.89 | - |
| ENCR | 0.75 / 0.60 / 0.67 | 0.70 / 0.56 / 0.68 |
| FILE | 0.78 / 0.89 / 0.83 | 0.94 / 0.95 / 0.95 |
| IDTY | 0.78 / 0.81 / 0.79 | 0.75 / 0.82 / 0.78 |
| IP | 0.67 / 1.00 / 0.80 | 1.00 / 1.00 / 1.00 |
| LOC | 0.85 / 0.91 / 0.88 | 0.83 / 0.88 / 0.86 |
| MAL | 0.79 / 0.83 / 0.81 | 0.88 / 0.87 / 0.88 |
| MD5 | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| OS | 0.84 / 0.95 / 0.89 | 0.84 / 1.00 / 0.92 |
| PROT | 0.94 / 0.64 / 0.76 | 1.00 / 0.83 / 0.91 |
| SECTEAM | 0.87 / 0.87 / 0.87 | 0.90 / 0.91 / 0.90 |
| SHA2 | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| TIME | 0.85 / 0.90 / 0.87 | 0.84 / 0.87 / 0.85 |
| TOOL | 0.62 / 0.68 / 0.65 | 0.69 / 0.75 / 0.72 |
| VULID | 1.00 / 0.99 / 1.00 | 1.00 / 1.00 / 1.00 |
| VULNAME | 0.86 / 0.90 / 0.88 | 0.93 / 0.96 / 0.95 |
| Micro Avg | 0.78 / 0.84 / 0.81 | 0.80 / 0.84 / 0.80 |
| Macro Avg | 0.84 / 0.87 / 0.85 | 0.77 / 0.80 / 0.78 |
| Weighted Avg | 0.79 / 0.84 / 0.81 | 0.80 / 0.84 / 0.80 |
diverse annotated NER datasets suitable for building
robust AI systems efficiently. The framework’s per-
formance and capabilities are demonstrated by merg-
ing two prominent open-source NER datasets, DNRTI
and APTNER, as a practical case study. By compar-
ing against the manual approach as a baseline, TI-
NERmerger efficiently covers over 94% of the man-
ual work within a few minutes, a task that initially
required several months to complete manually. The
effectiveness of the resulting datasets (DNRTI-STIX
and 2APTNER) from both approaches was evaluated
using BiLSTM and BERT models. Merging DNRTI-
STIX with APTNER produced the Augmented APT-
NER dataset, denoted as 2APTNER, which signifi-
cantly surpasses existing TI-NER datasets. 2APT-
NER comprises 434,150 tokens, 79,161 labeled entities, 16,691 sentences, and 16,439 unique terms, and is compliant with the STIX 2.1 data exchange standard. TI-NERmerger currently leverages the MITRE ATT&CK repository as the primary source of truth to address annotation inconsistency issues across datasets; looking forward, this approach can be expanded to include other repository references to enhance model reliability and generalizability. Furthermore, the re-
sulting datasets adhere to STIX 2.1 standards, cover-
ing diverse entities, making them valuable resources
for extracting cyber threat intelligence (CTI) from se-
curity reports.
REFERENCES
Ahmed, K., Khurshid, S. K., and Hina, S. (2024). Cyber-
entrel: Joint extraction of cyber entities and relations
using deep learning. Computers & Security, 136.
Alshammari, N. and Alanazi, S. (2021). The impact of
using different annotation schemes on named entity
recognition. Egyptian Informatics Journal, 22:295–
302.
Bakır, H., Çayır, A. N., and Navruz, T. S. (2024). A compre-
hensive experimental study for analyzing the effects of
data augmentation techniques on voice classification.
Multimed Tools Appl 83, page 17601–17628.
Corporation, T. M. (2015-2023). ATT&CK.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In North Amer-
ican Chapter of the Association for Computational
Linguistics.
Ding, B., Qin, C., Zhao, R., Luo, T., Li, X., Chen, G.,
Xia, W., Hu, J., Luu, A. T., and Joty, S. (2024). Data
augmentation using llms: Data perspectives, learning
paradigms and challenges. arXiv:2403.02990.
Guo, Y., Liu, Z., Huang, C., Liu, J., Jing, W., Wang, Z.,
and Wang, Y. (2021). Cyberrel: Joint entity and re-
lation extraction for cybersecurity concepts. Infor-
mation and Communications Security: 23rd Interna-
tional Conference, ICICS 2021, Chongqing, China,
page 447–463.
Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional lstm-
crf models for sequence tagging. ArXiv.
Jo, H., Lee, Y., and Shin, S. (2022). Vulcan: Automatic ex-
traction and analysis of cyber threat intelligence from
unstructured text. Computers & Security, 120.
Jordan, B., Piazza, R., and Darley, T. (25 January 2022).
STIX version 2.1.
Kiavash, S., Rigel, G., and N, V. V. (2021). Extractor: Ex-
tracting attack behavior from threat reports. In: IEEE
EuroS&P, pages 598–615.
Kim, G., Lee, C., Jo, J., and Lim, H. (2020). Automatic
extraction of named entities of cyber threats using a
deep bi-lstm-crf network. Int. J. Mach. Learn. & Cy-
ber, 11:2341–2355.
Konkol, M. and Konopík, M. (2015). Segment representa-
tions in named entity recognition. International Con-
ference on Text, Speech and Dialogue, 9302.
Li, Z., Zeng, J., Chen, Y., and Liang, Z. (2022). At-
tackg: Constructing technique knowledge graph from
cyber threat intelligence reports. In Computer Secu-
rity ESORICS 2022: 27th European Symposium on
Research in Computer Security, Copenhagen, Den-
mark, September 26–30, 2022, Proceedings, Part I,
Lecture Notes in Computer Science. Springer Interna-
tional Publishing.
Lin, C.-H., Kaushik, C., Dyer, E. L., and Muthukumar,
V. (2024). The good, the bad and the ugly sides of
data augmentation: An implicit spectral regularization
perspective. Journal of Machine Learning Research,
25:1–85.
Marchiori, F., Conti, M., and Verde, N. V. (2023). Stixnet:
A novel and modular solution for extracting all stix
objects in cti reports. ARES 23: Proceedings of the
18th International Conference on Availability, Relia-
bility and Security, 2(3):1–11.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., and
Thirion, B. (2011). Scikit-learn: Machine learning
in python. Journal of Machine Learning Research,
12:2825–2830.
Perrina, F., Marchiori, F., Conti, M., and Verde, N. V.
(2023). Agir: Automating cyber threat intelligence
reporting with natural language generation. 2023
IEEE International Conference on Big Data (Big-
Data), pages 3053–3062.
Rani, N., Saha, B., Maurya, V., and Shukla, S. K. (2023a).
Ttphunter: Automated extraction of actionable in-
telligence as ttps from narrative threat reports. In
Australasian Information Security Conference (AISC
2023), page 126–134.
Rani, N., Saha, B., Maurya, V., and Shukla, S. K. (2023b).
Ttphunter: Automated extraction of actionable intel-
ligence as ttps from narrative threat reports. ACSW
’23: Proceedings of the 2023 Australasian Computer
Science Week, page 126–134.
Varghese, V., S, M., and Kb, S. (2023). Extraction of ac-
tionable threat intelligence from dark web data. 2023
International Conference on Control, Communication
and Computing (ICCC), Thiruvananthapuram, India,
pages 1–5.
Wang, X., He, S., Xiong, Z., Wei, X., Jiang, Z., Chen, S.,
and Jiang, J. (2020a). Aptner.
Wang, X., He, S., Xiong, Z., Wei, X., Jiang, Z., Chen, S.,
and Jiang, J. (2022). Aptner: A specific dataset for
ner missions in cyber threat intelligence field. 2022
IEEE 25th International Conference on Computer
Supported Cooperative Work in Design (CSCWD),
Hangzhou, China, pages 1233–1238.
Wang, X., Liu, X., Ao, S., Li, N., Jiang, Z., Xu, Z., Xiong,
Z., Xiong, M., and Zhang, X. (2020b). Dnrti.
Wang, X., Liu, X., Ao, S., Li, N., Jiang, Z., Xu, Z., Xiong,
Z., Xiong, M., and Zhang, X. (2020c). Dnrti: A large-
scale dataset for named entity recognition in threat
intelligence. 2020 IEEE 19th International Confer-
ence on Trust, Security, and Privacy in Computing
and Communications (TrustCom), Guangzhou, China,
pages 1842–1848.
YI, F., JIANG, B., WANG, L., and WU, J. (2020). Cyberse-
curity named entity recognition using multi-modal en-
semble learning. in IEEE Access, 8(10):63214–63224.
Zhou, S., Liu, J., Zhong, X., and Zhao, W. (2021). Named
entity recognition using bert with whole word mask-
ing in cybersecurity domain. 2021 IEEE 6th Inter-
national Conference on Big Data Analytics (ICBDA),
Xiamen, China, pages 316–320.
Zhou, S., Long, Z., Tan, L., and Guo, H. (2018). Auto-
matic identification of indicators of compromise us-
ing neural-based sequence labelling. Proceedings of
the 32nd Pacific Asia Conference on Language, Infor-
mation and Computation, Hong Kong.
Zhou, S., Zhang, J., Jiang, H., Lundh, T., and Ng, A. Y.
(2020). Data augmentation with Möbius transforma-
tions. arXiv:2002.02917.