DISC: A Dataset for Information Security Classification
Elijah Bass (https://orcid.org/0009-0001-8680-1815), Massimiliano Albanese (https://orcid.org/0000-0002-2675-5810) and Marcos Zampieri (https://orcid.org/0000-0002-2346-3847)
Center for Secure Information Systems, George Mason University, Fairfax, U.S.A.
Keywords:
Information Security, Information Protection, Security Classification, Artificial Intelligence, Datasets.
Abstract:
Research in information security classification has traditionally relied on carefully curated datasets. However, the sensitive nature of the classified information contained in such documents poses challenges in terms of accessibility and reproducibility. Existing data sources often lack openly available resources for automated data collection and quality review processes, making it difficult to facilitate reproducible research. Additionally, datasets constructed from declassified information, though valuable, are not readily available to the public, and their creation methods remain poorly documented, rendering them non-reproducible. This paper addresses these challenges by introducing DISC, a dataset and framework, driven by artificial intelligence principles, for information security classification. The framework aims to streamline all stages of dataset creation, from preprocessing of raw documents to annotation. By enabling reproducibility and augmentation, this approach enhances the utility of available document collections for information security classification research and allows researchers to create new datasets in a principled way.
1 INTRODUCTION
In the realm of information security, classification
refers to the process of categorizing and protecting
information based on its sensitivity, confidentiality,
or criticality (NIST, 2004). Information classifica-
tion is an essential aspect of information security and
data governance, enabling organizations to adeptly
manage information while ensuring the protection of
sensitive data and compliance with regulatory stan-
dards. Classification systems are designed to as-
sist organizations in developing robust classification
strategies that align with their security requirements
and facilitate efficient data management. Govern-
ment agencies, private sector entities, and industry-
specific users (e.g., healthcare, payment card indus-
try) leverage information classification systems as
integral components of their information protection
planning. Government agencies rely on information
classification systems to control access to national
security information. Industry-specific users adhere
to classification protocols dictated by regulatory and
standards requirements, such as the Health Insurance
Portability and Accountability Act (HIPAA) for safe-
guarding medical industry information in the United
States and the Payment Card Industry Data Security
Standard (PCI DSS) for protecting data used in credit
card transactions.
Information classification stands as a critical as-
pect of government and military operations, with each
country maintaining a unique classification system
tailored to its specific needs. In the United States,
Executive Order (EO) 13526 mandates a uniform
system for classifying, safeguarding, and declassify-
ing national security information within the govern-
ment and military (White House, 2009). (Once is-
sued, executive orders remain in force until they are
canceled or revoked by the current president, expire,
or are adjudicated as unlawful.) Govern-
ment officials with the role of Original Classification
Authorities (OCAs) determine the classification level
based on the assessment of potential damage to na-
tional security due to unauthorized disclosure. Secu-
rity Classification Guides detail and manage the sensi-
tivity of information (Information Security Oversight
Office, 2018). Section 1.2 of EO 13526 prescribes
three distinct levels of information classification la-
bels, namely CONFIDENTIAL, SECRET, and TOP
SECRET (White House, 2009). The CONFIDENTIAL
classification is assigned to information where unau-
thorized disclosure is expected to cause damage to na-
tional security, while the SECRET classification is re-
served for information whose unauthorized disclosure
is expected to cause serious harm to national secu-
rity. The TOP SECRET classification is applied when
the unauthorized disclosure is anticipated to cause ex-
ceptionally grave damage to national security. Infor-
mation that does not meet the sensitivity requirements
outlined in EO 13526 Section 1.2 is designated as
UNCLASSIFIED.
The advancement of information security classifi-
cation research has historically been hindered by lim-
ited shareability of curated datasets and resources re-
quired to automate dataset curation and assess qual-
ity. Previous information security efforts have fo-
cused on researching classification algorithms, but us-
able datasets and frameworks to replicate such re-
search are not yet available. The sensitive nature of information
used in security classification precludes the sharing of
most datasets. The need for manual review of data ex-
tracted from the few available document repositories
has been demonstrated as essential for ensuring data
quality and accuracy (Engelstad et al., 2015b; Brown
and Charlebois, 2010; Engelstad et al., 2015a). This is
particularly evident in correcting errors resulting from
optical character recognition (OCR) applied to legacy
documents, and in preserving the integrity of docu-
ment information.
To tackle these challenges, we propose a frame-
work for automating the creation of datasets to serve
as training data to support the development of ad-
vanced information security classification systems.
We illustrate the use of this framework by creat-
ing DISC, the first reproducible information security
dataset, which we make publicly available. The pro-
posed approach ensures the integrity and quality of
the data, thus reducing the need for manual review, by
harnessing the power of AI-based methods and Large
Language Models (LLMs). We report the results of
initial experiments on information classification us-
ing the DISC dataset and discuss future research di-
rections.
The remainder of this paper is organized as fol-
lows: Section 2 describes datasets curated in prior re-
search. Section 3 introduces a formal model for repre-
senting documents and supporting granular classifica-
tion. Section 4 outlines the methodology for collect-
ing, processing, and storing information from a de-
classified document database to create a shareable and
reusable dataset. Section 5 describes areas of research
that can benefit from the proposed methods and DISC
whereas Section 6 demonstrates the utility of DISC for
information security classification. Finally, Section 7
provides concluding remarks and identifies future re-
search directions.
2 RELATED WORK
The sensitivity of CLASSIFIED information and the
potential security impact of its public release pre-
clude the availability of readily accessible pub-
lic datasets. As a result, information classification
datasets have been curated using publicly released,
declassified, or leaked information sources. In re-
sponse to the evolving need for diverse datasets, an
emerging research field focuses on generating doc-
ument datasets, leveraging advancements in deep
learning to augment existing capabilities and re-
sources.
In the United States, information must undergo re-
classification or the declassification process upon the
expiration of its original classification period (White
House, 2009). Publicly released government docu-
ments that have completed the declassification pro-
cess serve as raw data for creating a realistic classifi-
cation dataset. The Digital National Security Archive
(DNSA) stands out as a comprehensive collection,
comprising over 100,000 declassified records that
document historic United States policy decisions
(https://nsarchive.gwu.edu/publications-collections).
This archive contains electronic copies of both previ-
ously CLASSIFIED and UNCLASSIFIED govern-
ment documents related to U.S. foreign policy from
the post-World War II era to the present day. No-
tably, research highlighted in (Engelstad et al., 2015b;
Brown and Charlebois, 2010; Engelstad et al., 2015a)
utilized specific DNSA domains, encompassing a mix
of CLASSIFIED and UNCLASSIFIED documents, in-
cluding:
- Afghanistan: The Making of U.S. Policy, 1973-1990.
- China and the United States: From Hostility to Engagement, 1960-1998.
- The Philippines: U.S. Policy during the Marcos Years, 1965-1986.
Information made publicly available through data
breaches and data leaks can aid in the creation of
difficult-to-obtain research datasets. However, re-
searchers must conscientiously identify and address
the ethical concerns associated with the process
through which the information was released into the
public domain before utilizing leaked data (Boustead
and Herr, 2020).
WikiLeaks, founded in 2006, is an organization
known for collecting and publishing sensitive leaked
documents. Notably, WikiLeaks published confiden-
tial United States diplomatic cables spanning from
2003 to 2010. These diplomatic cables, originat-
ing from U.S. Embassies and Consulates worldwide,
were presented on the WikiLeaks website in a static
HTML format, organized by embassies, and sorted
by their respective origination dates. In the research
conducted in (Alzhrani et al., 2016), the leaked diplo-
matic cables were processed, leading to the creation
of datasets for the Baghdad, London, Berlin, and
Damascus embassies. The content of these datasets
was classified at the paragraph level, with a total
of 21,718 paragraphs labeled as UNCLASSIFIED and
9,291 paragraphs labeled as SECRET.
The challenge of insufficient datasets for train-
ing and validating sensitive information classification
models is a common hurdle and may be addressed us-
ing data augmentation. The research in (Jadli et al.,
2020) introduced the use of deep convolutional ad-
versarial networks (DCGAN) to generate synthetic
document images from an existing scanned document
dataset. These generated images were combined with
original images in a dataset to train an image clas-
sifier using a convolutional neural network for doc-
ument classification. The accuracy of the resulting
model, trained on the augmented dataset, was then
compared to a model trained on a similar-sized dataset
consisting solely of original documents. Remarkably,
the constructed model performed comparably to the
model trained with authentic data, achieving accuracies
of 90% and 91%, respectively.
The research in (Chakraborty et al., 2021) fo-
cuses on generating additional documents by replac-
ing concepts with semantically related concepts se-
lected from an ontology. The results yield, for every
real document d, a set Fake(d) of fake documents that
closely resemble d. Another approach demonstrated
in (Whitham, 2017) employs a rule-based and preset
template-based method for textual information gener-
ation. This technique involves parsing text to iden-
tify part-of-speech tags, selecting an original docu-
ment as a template, and utilizing word transposition
and substitution based on part-of-speech tagging and
n-grams.
In varying degrees, previous research efforts have
described the processes utilized to create datasets for
training and testing information security classifier al-
gorithms. However, efforts in contacting prior re-
searchers to obtain datasets to recreate prior research
results have been unsuccessful, confirming the chal-
lenges in accessing quality data for information clas-
sification research. To the authors' knowledge, DISC
is the first openly available and reusable dataset for
information security classification. (The dataset and
associated resources will be made publicly available
upon publication of this paper.)
3 CORPUS FORMAL MODEL
The availability of high-quality and well-curated
datasets, often referred to as corpora, is essential for
advancing the state-of-the-art in information security
classification. A suitable corpus serves as the foun-
dation upon which researchers can build and evalu-
ate models, algorithms, and methodologies to address
key challenges and issues within the domain of infor-
mation security classification. This section describes
a formal model for representing documents and sup-
porting fine-grained classification of document ele-
ments (paragraphs and sentences), whereas Section 4
provides a comprehensive description of the corpus
utilized in our research, encompassing both the data
collection methodology and relevant information re-
garding the composition, characteristics, and struc-
ture of the corpus.
In the field of information security, the infor-
mation security classification of documents based
on their sensitivity and confidentiality requirements
plays a pivotal role in safeguarding valuable data from
unauthorized access and disclosure. However, de-
spite the critical importance of this task, existing in-
formation security classification literature often lacks
a formalized information model capable of describing
documents at different levels of granularity, such as
paragraphs or sentences. This gap presents a signifi-
cant challenge in efficiently and accurately predicting
the security classification of different parts of a doc-
ument and the classification of the information that
could be derived by combining different parts of the
same or different documents. This modeling capabil-
ity is therefore critical for reasoning about inference
attacks and preventing the under-classification of doc-
uments within a corpus.
3.1 Modeling Documents in a Corpus
A corpus of documents in an information system can
be represented as a set:

    D = \{d_1, d_2, \ldots, d_i, \ldots, d_{m-1}, d_m\}    (1)

where m = |D| is the number of documents in the cor-
pus. A document d_i can then be represented as an
ordered sequence of paragraphs:

    d_i = \langle p_{i1}, p_{i2}, \ldots, p_{ij}, \ldots, p_{im_i} \rangle    (2)

where m_i is the number of paragraphs in document d_i
and p_{ij} is the j-th paragraph in document d_i.
The ability to accurately model and analyze para-
graphs within a corpus holds significant importance
for various information security applications ranging
from information security categorization to threat de-
tection. Paragraphs serve as fundamental units of
Figure 1: Hierarchical corpus model. (Each document d_i in the corpus d_1, ..., d_m decomposes into paragraphs p_{i1}, ..., p_{im_i}, and each paragraph p_{ij} decomposes into sentences s_{ij1}, ..., s_{ijm_{ij}}.)
information within documents and include coherent
and semantically meaningful segments of text. Un-
derstanding the structural and contextual attributes of
paragraphs within a corpus is essential for extracting
valuable insights, identifying patterns, and making in-
formed decisions regarding information security poli-
cies and practices. This section focuses on the task
of modeling paragraphs within a corpus, with a spe-
cific emphasis on its relevance to information security
classification.
Finally, a paragraph p_{ij} can be represented as an
ordered sequence of sentences:

    p_{ij} = \langle s_{ij1}, s_{ij2}, \ldots, s_{ijk}, \ldots, s_{ijm_{ij}} \rangle    (3)

where m_{ij} is the number of sentences in paragraph p_{ij}
and s_{ijk} is the k-th sentence in paragraph p_{ij}.
Considering that the number of documents in the
corpus is m and m_i denotes the number of paragraphs
in document d_i, the total number of paragraphs in the
corpus is given by:

    \sum_{i=1}^{m} m_i    (4)

Similarly, considering that the number of sen-
tences in a paragraph p_{ij} is m_{ij}, the total number of
sentences in the corpus is given by:

    \sum_{i=1}^{m} \sum_{j=1}^{m_i} m_{ij}    (5)
This hierarchical model of a document corpus is
illustrated in Figure 1.
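This hierarchical model maps directly onto a simple in-memory representation. The following minimal Python sketch mirrors Eqs. (1)-(5); the class and method names are our own illustrative choices, not part of DISC:

from dataclasses import dataclass

@dataclass
class Paragraph:
    sentences: list[str]            # s_ij1, ..., s_ijm_ij

@dataclass
class Document:
    paragraphs: list[Paragraph]     # p_i1, ..., p_im_i

@dataclass
class Corpus:
    documents: list[Document]       # d_1, ..., d_m

    def num_paragraphs(self) -> int:
        # Eq. (4): sum of m_i over all documents
        return sum(len(d.paragraphs) for d in self.documents)

    def num_sentences(self) -> int:
        # Eq. (5): sum of m_ij over all documents and paragraphs
        return sum(len(p.sentences)
                   for d in self.documents
                   for p in d.paragraphs)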
3.2 Achieving Granular Classification
In the domain of information security classification,
achieving granular classification at the level of indi-
vidual sentences and paragraphs is crucial for effec-
tively safeguarding sensitive data and ensuring com-
pliance with security policies, while preserving ac-
cess to information for those with sufficient clearance.
Traditional approaches to document classification
often focus on categorizing entire documents or sec-
tions based on their overall sensitivity. There exists
a growing recognition of the need for more granu-
lar classification techniques that can discern the vary-
ing levels of sensitivity present within a document’s
elements. This section explores the importance of
achieving granular sentence and paragraph classifica-
tion within the context of information security.
To model granular classification at the sentence
and paragraph level, let C = {UNCLASSIFIED, CON-
FIDENTIAL, SECRET, TOP SECRET} denote the set
of security classifications, and let P and S respectively
denote the sets of all paragraphs and sentences in the
corpus, which can be defined as follows:

    P = \bigcup_{i=1}^{m} \bigcup_{j=1}^{m_i} \{p_{ij}\}    (6)

    S = \bigcup_{i=1}^{m} \bigcup_{j=1}^{m_i} \bigcup_{k=1}^{m_{ij}} \{s_{ijk}\}    (7)

We can represent the sentence-level classification
as a mapping \delta_s : S \rightarrow C that associates a classification
label \delta_s(s) to each sentence s \in S. Similarly, we can
represent the paragraph-level classification as a map-
ping \delta_p : P \rightarrow C and the document-level classification
as a mapping \delta_d : D \rightarrow C.

Given a paragraph p_{ij} \in P, one would intuitively
assume that its classification is the highest classifica-
tion among its sentences:

    \delta_p(p_{ij}) = \max_{k \in [1, m_{ij}]} \delta_s(s_{ijk})    (8)

Similarly, given a document d_i \in D, one would in-
tuitively assume that its classification is the highest
classification among its paragraphs:

    \delta_d(d_i) = \max_{j \in [1, m_i]} \delta_p(p_{ij})    (9)
However, combining multiple pieces of informa-
tion at one classification level may result in informa-
tion that should be classified at a higher level. Thus,
information classification must consider whether the
aggregate of multiple units of text (sentences or para-
graphs) should be assigned a higher classification
Figure 2: Dataset for Information Security Classification (DISC) processing chain. (Stage 1: PDF documents retrieved from the DNSA database are converted into page images; Stage 2: OCR turns the images into error-prone text; Stage 3: LLM correction produces error-corrected text; Stage 4: document indexing information from the DNSA database is combined with the corrected text into JSON records that form the DISC dataset.)
level than the highest classification level of all the in-
dividual units. In other words, Eqs. (8) and (9) must be
rewritten to define lower bounds on the classification
levels:

    \delta_p(p_{ij}) \geq \max_{k \in [1, m_{ij}]} \delta_s(s_{ijk})    (10)

    \delta_d(d_i) \geq \max_{j \in [1, m_i]} \delta_p(p_{ij})    (11)
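These bounds can be operationalized by totally ordering the labels in C and taking maxima over the hierarchy. A minimal sketch follows; the enum and function names are illustrative, and the returned values are lower bounds that an aggregation-aware classifier may still raise:

from enum import IntEnum

class Level(IntEnum):
    # Ordered so that max() yields the most restrictive label
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3

def paragraph_floor(sentence_levels: list[Level]) -> Level:
    # Right-hand side of Eq. (10)
    return max(sentence_levels)

def document_floor(paragraph_levels: list[Level]) -> Level:
    # Right-hand side of Eq. (11)
    return max(paragraph_levels)

# A paragraph containing one SECRET sentence is at least SECRET
assert paragraph_floor([Level.UNCLASSIFIED, Level.SECRET]) == Level.SECRET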
The formal information model introduced in this
section forms a foundational framework for organiz-
ing and categorizing document content in a systematic
manner. At its core, the formal information model
delineates the hierarchical structure of documents,
breaking them down into constituent paragraphs and
sentences. Each level of granularity within the model
is endowed with metadata attributes that capture rele-
vant information about the content and potential se-
curity implications of the text. By leveraging this
formalized representation of document content, the
proposed method aims to streamline the process of
security classification by automating key tasks such
as document collection, OCR for text extraction, and
subsequent analysis. By facilitating the utilization of
advanced Natural Language Processing (NLP) tech-
niques and machine learning algorithms, the proposed
formal information model enables intelligent classifi-
cation decisions based on the semantic understanding
of document contents at the paragraph and sentence
levels.
4 CORPUS DESCRIPTION
In this section, we first describe the methodology we
employed to generate the DISC corpus from the orig-
inal DNSA database, and then describe in detail the
structure of the corpus itself.
4.1 Data Collection Methodology
The DISC corpus is built upon the documents con-
tained within the DNSA database, and specifically the
documents in the three topic areas outlined in Sec-
tion 2. Out of the 2,459 documents available within
those topic areas, documents deemed not relevant to
information classification were excluded. Specifi-
cally, the following documents were omitted: dupli-
cate documents, documents classified as UNKNOWN,
and documents marked as EXCISED (i.e., the classi-
fied text sections were redacted).
The DISC processing pipeline is illustrated in
Figure 2. Stage 1 of the processing pipeline in-
volves performing image processing on the pages of
the PDF documents. The documents in the DNSA
database consist of non-searchable PDF files con-
taining scanned images of the original printed docu-
ments. The PDF files were searched and organized ac-
cording to their initial classification (UNCLASSIFIED,
CONFIDENTIAL, SECRET, or TOP SECRET) and the
topic areas mentioned earlier. The initial classifica-
tion of each document was determined by the author
and classification originator, based on the procedures
in (White House, 2009). The documents within the
DNSA database have undergone the declassification
process; therefore, the value of their reclassification
metadata is set to UNCLASSIFIED. The document re-
classification (UNCLASSIFIED) and DNSA database
document search parameters (initial classification and
domain) form the metadata associated with the resul-
tant documents during Stage 1 of the DISC processing
pipeline. The Python fitz package was used to in-
gest individual PDF files and convert their pages to
images. Figure 3 shows an example of a page from a
PDF document in the DNSA database converted to an
image in Stage 1.
Figure 3: Stage 1: Example of non-searchable image-based
PDF document.
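A minimal sketch of this Stage 1 conversion, assuming the PyMuPDF (fitz) package and a hypothetical input file name:

import fitz  # PyMuPDF

def pdf_to_images(pdf_path: str) -> list[str]:
    # Render each page of a scanned, non-searchable PDF to a PNG file
    image_paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=300)   # rasterize the page
            out = f"{pdf_path}.page{i}.png"
            pix.save(out)
            image_paths.append(out)
    return image_paths

pages = pdf_to_images("dnsa_document.pdf")   # illustrative file name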
Stage 2 of the DISC processing pipeline uses an
Optical Character Recognition (OCR) algorithm to
convert the images of document pages into editable
and searchable data. The OCR processing stage en-
ables subsequent information manipulation, editing,
searching, and analysis using information security
classification algorithms. The Python pytesseract
module was used to perform OCR processing on the
images formed from the pages of individual docu-
ments. Figure 4 shows the output of OCR processing
for the document image of Figure 3. In the interest of
space, only the first few paragraphs of the document
are shown in Figure 4.
RELEASE OF AMERICAN WOSTAGES In BEIRUT
Background. In June 1985, privete American and Yereeli
citeizens Commenced 8m Operation to ecfect the release of
the Americen +iostages in Beirut in exchange for providing
certain factions in Iran with 0.$.-origan Isreeli militery sateriel.
By September, U.S. and Israeli Government officials became
involved in thie -endeavor in order to ensure that the 086 would:
not object to the Isrseli transfer of embergoed na-
terie! tc Iran;
=< eell replacement items to lerae)] as replenishment for
like items sold to fran by lserael.
ih On September 3f, the Israeli Government, with che en-
dorsement of the OSG, transferred 508 besic TOW missiles
to Iran. Forty-eight hours later, Reverend Benjamin Weir was
relesseé in beirut.
Figure 4: Stage 2: OCR error-prone textual output for the
document image in Figure 3. OCR errors in the Background
section of the document are indicated in red text.
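The corresponding Stage 2 step can be sketched with the pytesseract module, which requires a local Tesseract installation; the function name and file handling are illustrative:

import pytesseract
from PIL import Image

def ocr_pages(image_paths: list[str]) -> str:
    # Extract raw, error-prone text from the Stage 1 page images
    return "\n".join(pytesseract.image_to_string(Image.open(path))
                     for path in image_paths)

raw_text = ocr_pages(pages)   # 'pages' from the Stage 1 sketch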
RELEASE Of AMERICAN WOSTAGES In BEIRUT
Background. In June 1985, private American and Yereeli
citizens Commenced 8m Operation to effect the release of the
American + hostages in Beirut in exchange for providing actions
in Iran with 0.$.-origin Israeli military materiel. By September,
U.S. and Israeli Government officials became involved in the
-endeavor in order to ensure that the 086 would:
not object to the Israeli transfer of emerged nature!
tc Iran;
=< well replacement items to were)] as replenishment
for like items sold to fran by israel.
ih) On September 3f, the Israeli Government, with che
endorsement of the OS, transferred 508 basic OW missiles to
Iran. Forty-eight hours later, Reverend Benjamin Weir was
releaseé in bert.
Figure 5: Results of the Python Speller package’s
autocorrection module on the OCR output in Figure 4.
Uncorrected errors in the Background section of the doc-
ument are indicated in red text. Autocorrection module
corrections on the OCR output are in blue text.
Stage 3 of the DISC processing pipeline performs
error correction on the OCR output. The Python
Speller package’s autocorrection module was in-
vestigated for accuracy in correcting the OCR pro-
cessing errors. The primary motivations for investi-
gating the Python Speller package are its ability to
perform local processing, thus maintaining the confi-
dentiality of information, and its cost-effectiveness
and speed.
The result of the autocorrection module on the in-
formation in Figure 4 is illustrated in Figure 5. This
module was able to correct only words with errors in-
volving one or two characters.
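A sketch of this baseline, assuming the Speller class from the Python autocorrect package:

from autocorrect import Speller

spell = Speller(lang="en")
# Word-by-word spelling correction of the OCR output; as noted above,
# this is effective only for errors of one or two characters
baseline_text = spell(raw_text)   # 'raw_text' from the Stage 2 sketch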
The performance of the LLM-based error correc-
tion methodology was compared against the Python
Speller package. LLMs detect OCR errors by analyz-
Table 1: Stage 4: DNSA documents indexing information.
ID Title Abstract Pub. Date Authors
1679059219 Mujajedin Cross Border Cow Raid... Soviet Union Armed Forces retaliate... June 14, 1987 U.S. Consulate of Peshawar
1679076686 Afghanistan: Gailani Welcome Zahir Shah's... Itihad-i Islami Baraye Azad-i Afghanistan... May 18, 1987 U.S. Embassy of Pakistan
1679059734 Jamiat-i-Islami Comment on Mujahideen... Newspapers report that the Soviet Union... May 2, 1987 U.S. Consulate of Peshawar
1679077357 Soviet Reprisals in Northern Afghanistan United States Embassy. Afghanistan officers... April 26, 1987 U.S. Embassy of Afghanistan
1679060091 Xinhua: Soviet, Afghan Planes Change... United States Embassy. Afghanistan officers... April 21, 1987 U.S. Foreign Broadcast Information Service
ing inconsistencies or grammatical anomalies within
the uncorrected text. Therefore, words that do not fit
within the context of the document or are unlikely to
occur based on language patterns can be identified
as OCR errors and replaced with the most likely
words. Previous related works outlined the need for
manual visual reviews to ensure the quality of the in-
formation. The LLM-based technique implemented
in the DISC processing pipeline allows automation
of the OCR error correction process and greatly re-
duces the need for manual information review. The
LLM prompt utilized was meticulously crafted to cor-
rect OCR errors while preventing the generation of
additional or alternative content. The result of the
LLM-based OCR error correction on the information
in Figure 4 is illustrated in Figure 6. The LLM-based
OCR error correction greatly outperformed Python’s
autocorrection module, demonstrating the ability
to understand the context and remediate poor OCR
results. The accuracy of LLM-based error correc-
tion, compared to the Python Speller package, out-
weighs the additional overhead in terms of time and
cost. Therefore, the LLM methodology was selected
to perform Stage 3 of the DISC processing pipeline,
as illustrated in Figure 2.
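Our pipeline does not hinge on a specific LLM provider. The sketch below assumes an OpenAI-style chat completion client, and the prompt shown is an illustrative stand-in for the meticulously crafted prompt described above:

from openai import OpenAI

# Illustrative prompt: correct OCR errors without generating
# additional or alternative content
SYSTEM_PROMPT = ("Correct the OCR errors in the following text. "
                 "Preserve the original wording and structure; do not "
                 "add, remove, or paraphrase content.")

def llm_correct(ocr_text: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads the API key from the environment
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # discourage creative rewrites
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": ocr_text}])
    return response.choices[0].message.content

corrected_text = llm_correct(raw_text)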
RELEASE OF AMERICAN HOSTAGES IN BEIRUT
Background. In June 1985, private American and Israeli
citizens commenced an operation to effect the release of the
American hostages in Beirut in exchange for providing certain
factions in Iran with U.S.-origin Israeli military material.
By September, U.S. and Israeli Government officials became
involved in this endeavor in order to ensure that the U.S. would:
not object to the Israeli transfer of embargoed mate-
rial to Iran;
sell replacement items to Iran as replenishment for like
items sold to Iran by Israel.
On September 3, the Israeli Government, with the en-
dorsement of the U.S.G, transferred 508 basic TOW missiles
to Iran. Forty-eight hours later, Reverend Benjamin Weir was
released in Beirut.
Figure 6: Stage 3: LLM-based error correction of the
OCR output in Figure 4. Corrected errors in the Back-
ground section of the document are indicated in blue text.
Finally, Stage 4 of the DISC processing pipeline
collects indexing information of DNSA documents
for correlation to the LLM-enhanced OCR error-
corrected textual information obtained from Stage 3
to create the DISC JSON entries. In addition to the
image-based PDF copies of the original documents,
the DNSA database user interface facilitates gener-
ating and exporting an Excel spreadsheet contain-
ing document metadata. This metadata includes in-
dexing information and an abstract summarizing the
document. The indexing information collected from
DNSA consists of document title, classification date,
database name, author information, and assigned
unique document identifier, as illustrated in Table 1.
The final step in the DISC processing pipeline con-
sists of correlating document abstract and indexing in-
formation with the textual information produced in
Stage 3 for storage in the DISC JSON format, as
shown in Figure 7.
{
  "DocID": 7,
  "Title": "Release of American Hostages in Beirut ...",
  "OCR-Text": "RELEASE OF AMERICAN WOSTAGES In BEIRUT ...",
  "Text": "RELEASE OF AMERICAN HOSTAGES IN BEIRUT ...",
  "Abstract": "Oliver L. North calls Soviet policy ...",
  "Classification": [
    {
      "ClassID": "7_1",
      "Label": "Top Secret",
      "Date": "c. April 4, 1986"
    },
    {
      "ClassID": "7_2",
      "Label": "Unclassified"
    }
  ],
  "Database": "Digital National Security Archive",
  "Domain": "Afghanistan: The Making of U.S. Policy, 1973\u20131990",
  "Author": "United States. National Security Council",
  "StoreID": "1679059818"
}
Figure 7: Document indexing and textual information stored in a JSON record.
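A sketch of this Stage 4 assembly step follows, with field names taken from Table 2 and Figure 7; the metadata row is illustrative, and raw_text and corrected_text come from the Stage 2 and Stage 3 sketches:

import json

index_row = {   # illustrative metadata row exported from the DNSA interface
    "Title": "Release of American Hostages in Beirut ...",
    "Abstract": "Oliver L. North calls Soviet policy ...",
    "Classification": [
        {"ClassID": "7_1", "Label": "Top Secret", "Date": "c. April 4, 1986"},
        {"ClassID": "7_2", "Label": "Unclassified"}],
    "Domain": "Afghanistan: The Making of U.S. Policy, 1973-1990",
    "Author": "United States. National Security Council",
    "StoreID": "1679059818"}

def build_disc_record(doc_id, row, ocr_text, corrected_text):
    # Correlate DNSA indexing metadata with the Stage 2 and Stage 3 text
    return {"DocID": doc_id, "Title": row["Title"],
            "OCR-Text": ocr_text, "Text": corrected_text,
            "Abstract": row["Abstract"],
            "Classification": row["Classification"],
            "Database": "Digital National Security Archive",
            "Domain": row["Domain"], "Author": row["Author"],
            "StoreID": row["StoreID"]}

records = [build_disc_record(7, index_row, raw_text, corrected_text)]
with open("disc.json", "w") as f:
    json.dump(records, f, indent=2)   # DISC stores an array of such records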
4.2 Corpus Description
Section 4.1 provided a comprehensive overview of
the methodology utilized in the creation of the DISC
corpus from its precursor, the DNSA database. The
processing chain and information processing strat-
egy illustrated in Figure 2 provide a repeatable pro-
cess to recreate or augment the resulting reusable
DISC dataset. This section dives deeper into
the DISC corpus, focusing on its information organi-
zation and presenting detailed statistics pertaining to
the archived documents available in DISC.
The information within the DNSA not only con-
sists of PDF documents comprised of textual images
but also includes index information created during
content curation; this index information describes the
document content and its provenance. The
information obtainable for each document in the DISC
corpus is illustrated in Table 2.
Table 2: Attributes for documents in the DISC corpus.
Attribute       Description
DocID           Unique Document ID
Title           Document title
Abstract        Brief summary of the document
OCR-Text        Text extracted via OCR from the PDF document
Text            Text reconstructed via LLM from the OCR output
Classification  Document classification events (each document may have multiple classification events)
Database        Reference to the database the document was extracted from
Domain          Domain within the database the document was extracted from
Author          Authorship information
StoreID         Database-assigned unique document ID
Multiple classification events per document capture
changes in information sensitivity, facilitate modeling
of the time decay of information sensitivity, and record
decisions to increase the level of information confiden-
tiality. The attributes of classification events are listed
in Table 3.
Table 3: Attributes for classification events.
Attribute Description
ClassID Unique classification event ID
Label      Classification level (UNCLASSIFIED, SECRET, or TOP SECRET)
Date Classification date
The DISC corpus consists of the following DNSA do-
main areas:
- AF, Afghanistan: The Making of U.S. Policy, 1973-1990.
- CH, China and the United States: From Hostility to Engagement, 1960-1998.
- PH, The Philippines: U.S. Policy during the Marcos Years, 1965-1986.
The document statistics and original document
classifications contained in DISC are listed in Ta-
ble 4.
Table 4: Information classification statistics for documents
in the DISC corpus.
Domain UNCLASSIFIED SECRET TOP SECRET TOTAL
AF 401 206 13 620
CH 151 666 132 949
PH 411 469 1 881
TOTAL 963 1,341 146 2,450
5 APPLICATIONS
DISC was developed primarily to support NLP algo-
rithm research for reactive information security clas-
sification of existing documents (i.e., security classifi-
cation after information creation). However, DISC can
be utilized for other research purposes such as proac-
tive information control and access applications.
Proactive information classification involves the
preemptive process of categorizing and labeling data
or information based on its content, sensitivity, or
other characteristics before it is created or shared.
This approach aims to enhance information manage-
ment and security by implementing automated sys-
tems or policies that identify sensitive information
during its creation, sharing, or storage. Through
proactive information classification, high-value infor-
mation can be identified in real-time, enabling organi-
zations to implement immediate data protection mea-
sures. The DISC dataset facilitates research in NLP
techniques to enforce security controls, authorization
permissions, and data access protection measures at
the point of information creation.
LLMs exhibit proficiency in processing and com-
prehending intricate language found in textual docu-
ments, making them valuable for enhancing informa-
tion classification tasks. Leveraging their knowledge
and contextual understanding, LLMs can extract cru-
cial information, identify topics, and conduct senti-
ment analysis. An analysis of LLM performance in
classifying public affairs documents was conducted
in (Peña et al., 2023). Four distinct Spanish LLMs
were employed to classify up to 30 different topics
in the dataset. The findings underscored the effec-
tiveness of LLMs in performing information classi-
fication on domain-specific documents. DISC would
provide the necessary dataset to assist in the research
and training of LLMs to automate the information se-
curity classification process.
6 EXPERIMENTAL EVALUATION
This section demonstrates how DISC can be used
for information security classification (Brown and
Charlebois, 2010; Engelstad et al., 2015a; Engel-
stad et al., 2015b). We use Term Frequency-Inverse
Document Frequency (TF-IDF) with multiple classi-
fiers such as Naive Bayes (NB), Support Vector Ma-
chines (SVM), K-nearest Neighbor (KNN), and Gra-
dient Boost (GB). A Bidirectional Encoder Represen-
tations from Transformers (BERT) classifier (Devlin
et al., 2019) was utilized to compare performance
with the TF-IDF classifiers. We present the results
of the different models in Table 5 along with perfor-
mance results for each individual model in Tables 6 to
10.
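As an illustration of this setup, the sketch below trains the TF-IDF/Naive Bayes model with scikit-learn; the label derivation (taking the first classification event as the initial label, as in Figure 7) and the hyperparameters are our assumptions, not the exact experimental configuration:

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

with open("disc.json") as f:
    records = json.load(f)
texts = [r["Text"] for r in records]
labels = ["UNCLASSIFIED" if r["Classification"][0]["Label"] == "Unclassified"
          else "CLASSIFIED" for r in records]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# TF-IDF features feeding a Naive Bayes classifier; SVM, KNN, or
# Gradient Boost classifiers can be swapped into the same pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))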
Table 5: Performance comparison of classifier models (precision and recall per class, overall F1 score).
Classifier  UNCLASSIFIED Precision  UNCLASSIFIED Recall  CLASSIFIED Precision  CLASSIFIED Recall  F1 Score
NB    1.00  0.94  0.94  1.00  0.97
SVM   0.97  0.99  0.99  0.97  0.98
KNN   0.97  0.91  0.91  0.91  0.94
GB    0.91  0.98  0.97  0.90  0.94
BERT  0.97  0.99  0.99  0.97  0.98
The CLASSIFIED documents misclassified by
BERT were examined to provide insight into how
the algorithm was learning language structure. The
BERT algorithm classified the text “Reports on Tai-
wan’s air defense capability, including availability of
surface-to-air missiles and fighter aircraft” as being
CLASSIFIED while the DNSA Database labeled the
information as being UNCLASSIFIED. However, mil-
itary capabilities and vulnerabilities of countries are
considered classified information. An Internet search
of the Taiwan air defense topic confirmed that infor-
mation on military capabilities and vulnerabilities is
indeed classified and was recently leaked within an
online chat forum (Nakashima et al.,
2023). This result confirms that the manual process of
classifying data is inconsistently applied within or-
ganizations and is prone to human error.
Table 6: Confusion matrix for the TF-IDF/Naïve Bayes
classifier.
UNCLASSIFIED CLASSIFIED
UNCLASSIFIED 291 19
CLASSIFIED 0 315
Table 7: Confusion matrix for the TF-IDF/SVM classifier.
UNCLASSIFIED CLASSIFIED
UNCLASSIFIED 307 3
CLASSIFIED 8 307
Table 8: Confusion matrix for the TF-IDF/KNN classifier.
UNCLASSIFIED CLASSIFIED
UNCLASSIFIED 281 29
CLASSIFIED 8 307
Table 9: Confusion matrix for the TF-IDF/Gradient Boost
classifier.
UNCLASSIFIED CLASSIFIED
UNCLASSIFIED 303 7
CLASSIFIED 31 284
Table 10: Confusion matrix for the BERT classifier.
UNCLASSIFIED CLASSIFIED
UNCLASSIFIED 306 4
CLASSIFIED 8 307
In conclusion, the experimentation conducted uti-
lizing DISC for information security classification re-
search has demonstrated several key benefits of our
approach. By making our dataset openly available
to the research community and providing a detailed
methodology for its creation, we have enabled other
researchers to access and utilize the identical dataset
for their own studies. Moreover, the method utilized
for creating DISC is replicable and facilitates aug-
mentation and expansion of the dataset to accommo-
date diverse research needs and objectives. One of
the most significant advantages of DISC is the uti-
lization of an LLM, which significantly reduces the
need for manual quality review of the data to ad-
dress optical character recognition errors. The uti-
lization of an LLM not only streamlines the dataset
creation process but also ensures the accuracy and re-
liability of the data, thereby enhancing the credibil-
ity and trustworthiness of research findings derived
from the dataset. LLM error correction is especially
suited for declassified document archives, which often
consist of poor-quality historical handwritten, type-
written, or photocopied documents due to the long
time periods required before declassification.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Generated schema for DISC",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "DocID": {"type": "number"},
      "Title": {"type": "string"},
      "OCRtext": {"type": "string"},
      "Text": {"type": "string"},
      "Abstract": {"type": "string"},
      "Classification": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "ClassID": {"type": "string"},
            "Label": {"type": "string"},
            "Date": {"type": "string"}
          },
          "required": ["ClassID", "Label"]
        }
      },
      "Database": {"type": "string"},
      "Domain": {"type": "string"},
      "Author": {"type": "string"},
      "StoreID": {"type": "string"}
    },
    "required": ["DocID", "Title", "OCRtext", "Text", "Abstract", "Classification", "Database", "Domain", "Author", "StoreID"]
  }
}
Figure 8: JSON Schema for DISC.
Furthermore, the experimentation results demonstrate
the practical utility of DISC in reproducing prior re-
search approaches. The research efforts in (Engelstad
et al., 2015b; Brown and Charlebois, 2010; Engelstad
et al., 2015a) leverage documents from the DNSA.
However, the researchers do not explicitly state the
document selection process or list the documents se-
lected. DISC stores the unique database-assigned doc-
ument identifier in the StoreID field, allowing research
datasets to be duplicated. Prior experimental results
can therefore be reproduced exactly by listing StoreID
values to recreate a subset dataset from DISC.
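For example, given a published list of StoreID values (the values below are illustrative), a prior study's dataset can be recreated from DISC in a few lines of Python:

import json

wanted = {"1679059818", "1679059219", "1679076686"}   # published StoreIDs

with open("disc.json") as f:
    disc = json.load(f)

subset = [record for record in disc if record["StoreID"] in wanted]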
validates the reliability and consistency of DISC but
also highlights its utility to serve as a valuable re-
source for bench-marking, comparison, and valida-
tion of information security classification algorithms
and methodologies.
7 CONCLUSIONS
This paper presented DISC, the first reproducible
information security classification dataset publicly
available to information security researchers and pro-
fessionals. By providing detailed instructions on how
to curate this dataset from the original raw data, we
not only enable others to reproduce the dataset, but
also offer a framework for curating additional datasets
from similar data sources. DISC contains informa-
tion about 2,450 documents from the DNSA database,
including 963 UNCLASSIFIED, 1,341 SECRET, and 146
TOP SECRET documents. As discussed in Section 5,
DISC can facilitate several areas of future research
in information security classification. In addition,
DISC can be utilized for various research purposes be-
yond information security classification, such as im-
plementing information control and supporting access
decisions. Future work would ideally incorporate ad-
ditional DNSA database domains into the dataset, fol-
lowing the same method-
ology and annotation. DISC facilitates collaboration,
reproducibility, and innovation in future research on
mitigating information security chal-
lenges. Overall, DISC represents a significant con-
tribution to the information security classification re-
search community, offering an accessible, reliable,
and scalable resource for advancing research in this
critical domain.
Using the proposed framework, we are currently
working to include more documents in DISC. While
the models tested in this paper have proven to achieve
high performance in this task, we intend to evaluate
the performance of open-source LLMs for this task.
Expanding the DISC dataset creation framework to
encompass open-source LLMs offers the capability to
uphold the confidentiality of sensitive data during the
processing of private information. We intent to uti-
lize an extended version of DISC to refine recently in-
troduced LLMs like Falcon and Llama-2 in crafting
decision-making processes for information security
classification levels. This endeavor will furnish the
community with a vital resource for preserving confi-
dentiality in classifying highly sensitive information.
Finally, the framework presented in this paper can be
applied to other domains and languages. We encour-
age the community to pursue research with data from
other repositories (e.g., industry data) as well as on
documents in languages other than English.
REFERENCES
Alzhrani, K., Rudd, E. M., Boult, T. E., and Chow, C. E.
(2016). Automated big text security classification. In
Proceedings of the 2016 IEEE Conference on Intelli-
gence and Security Informatics (ISI 2016), pages 103–
108, Tucson, AZ, USA. IEEE.
Boustead, A. E. and Herr, T. (2020). Analyzing the ethical
implications of research using leaked data. Political
Science and Politics, 53(3):505–509.
Brown, J. D. and Charlebois, D. (2010). Security classifi-
cation using automated learning (scale): Optimizing
statistical natural language processing techniques to
assign security labels to unstructured text. Technical
Memorandum 2010-215, Defence R&D Canada – Ot-
tawa.
Chakraborty, T., Jajodia, S., Katz, J., Picariello, A., Sperli,
G., and Subrahmanian, V. S. (2021). A fake on-
line repository generation engine for cyber deception.
IEEE Transactions on Dependable and Secure Com-
puting, 18(2):518–533.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Burstein,
J., Doran, C., and Solorio, T., editors, Proceedings of
the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, pages 4171–4186, Min-
neapolis, USA. Association for Computational Lin-
guistics.
Engelstad, P. E., Hammer, H., Yazidi, A., and Bai, A.
(2015a). Advanced classification lists (dirty word
lists) for automatic security classification. In Proceed-
ings of the 2015 International Conference on Cyber-
Enabled Distributed Computing and Knowledge Dis-
covery, pages 44–53.
Engelstad, P. E., Hammer, H. L., Kongsgård, K. W., Yazidi,
A., Nordbotten, N. A., and Bai, A. (2015b). Auto-
matic security classification with lasso. In Proceed-
ings of the 16th International Workshop on Informa-
tion Security Applications (WISA 2015), volume 9503
of Lecture Notes in Computer Science. Springer.
Information Security Oversight Office (2018). Developing
and using security classification guides.
Jadli, A., Hain, M., Chergui, A., and Jaize, A. (2020).
DCGAN-based data augmentation for document clas-
sification. In Proceedings of the 2nd IEEE Interna-
tional Conference on Electronics, Control, Optimiza-
tion and Computer Science (ICECOCS 2020).
Nakashima, E., Shepherd, C., and Cadell, C. (2023). Tai-
wan highly vulnerable to Chinese air attack, leaked
documents show. Washington Post.
NIST (2004). FIPS 199: Standards for security categoriza-
tion of federal information and information systems.
Federal Information Processing Standards Publication
199, National Institute of Standards and Technology.
Peña, A., Morales, A., Fierrez, J., Serna, I., Ortega-Garcia,
J., Puente, Í., Córdova, J., and Córdova, G. (2023).
Leveraging large language models for topic classifi-
cation in the domain of public affairs. In Coustaty,
M. and Fornés, A., editors, Proceedings of the 17th
International Conference on Document Analysis and
Recognition (ICDAR 2023), pages 20–33. Springer.
White House (2009). Executive order 13526: Classified
national security information.
Whitham, B. (2017). Automating the generation of enticing
text content for high-interaction honeyfiles. In Pro-
ceedings of the 50th Hawaii International Conference
on System Sciences (HICSS 2017).
APPENDIX
The database documents and associated information
are stored within DISC in a JSON (JavaScript Object
Notation) data structure based on the JSON Schema
illustrated in Figure 8.
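Records can be validated against this schema programmatically; a minimal sketch assuming the jsonschema package and illustrative file names:

import json
from jsonschema import validate

with open("disc_schema.json") as f:   # the schema of Figure 8
    schema = json.load(f)
with open("disc.json") as f:          # the DISC records
    disc = json.load(f)

validate(instance=disc, schema=schema)   # raises ValidationError on mismatch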