Authors:
Elijah Bass; Massimiliano Albanese and Marcos Zampieri
Affiliation:
Center for Secure Information Systems, George Mason University, Fairfax, U.S.A.
Keyword(s):
Information Security, Information Protection, Security Classification, Artificial Intelligence, Datasets.
Abstract:
Research in information security classification has traditionally relied on carefully curated datasets. However, the sensitive nature of the classified information contained in such datasets poses challenges for accessibility and reproducibility. Existing data sources often lack openly available resources for automated data collection and quality review, which makes reproducible research difficult. Additionally, datasets constructed from declassified information, though valuable, are not readily available to the public, and their creation methods remain poorly documented, rendering them non-reproducible. This paper addresses these challenges by introducing DISC, a dataset and framework for information security classification driven by artificial intelligence principles. The framework aims to streamline all stages of dataset creation, from preprocessing of raw documents to annotation. By enabling reproducibility and augmentation, this approach enhances the utility of available document collections for information security classification research and allows researchers to create new datasets in a principled way.