
ology and annotation. DISC facilitates collaboration,
reproducibility, and innovation in future research on
mitigating information security challenges. Overall,
DISC represents a significant contribution to the
information security classification research community,
offering an accessible, reliable, and scalable resource
for advancing research in this critical domain.
Using the proposed framework, we are currently
working to include more documents in DISC. While
the models tested in this paper achieved high
performance on this task, we also intend to evaluate
the performance of open-source LLMs on it.
Expanding the DISC dataset creation framework to
encompass open-source LLMs would make it possible to
uphold the confidentiality of sensitive data while
private information is processed, since such models
can be run on local infrastructure. We intend to
utilize an extended version of DISC to refine recently
introduced LLMs such as Falcon and Llama-2 so that
they can support decision-making about information
security classification levels. This effort will give
the community a resource for preserving confidentiality
when classifying highly sensitive information.
Finally, the framework presented in this paper can be
applied to other domains and languages. We encour-
age the community to pursue research with data from
other repositories (e.g., industry data) as well as on
documents in languages other than English.
APPENDIX
The database documents and associated information
are stored within DISC in a JSON (JavaScript Object
Notation) data structure based on the JSON Schema
illustrated in Figure 8.
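As a rough illustration of how a consumer of such JSON records might parse and sanity-check them, the following minimal Python sketch loads one record and verifies a few required fields. The field names (`id`, `text`, `classification`) and the sample values are hypothetical placeholders chosen for this example; the actual DISC fields are those defined by the JSON Schema in Figure 8.

```python
import json

# Hypothetical field names and types, for illustration only.
# The real DISC structure is defined by the JSON Schema in Figure 8.
REQUIRED_FIELDS = {"id": str, "text": str, "classification": str}


def load_record(raw: str) -> dict:
    """Parse one JSON record and check the assumed required fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name} must be {expected_type.__name__}")
    return record


# Made-up sample record; not taken from DISC.
sample = (
    '{"id": "doc-0001", "text": "Example body.", '
    '"classification": "UNCLASSIFIED"}'
)
record = load_record(sample)
print(record["classification"])
```

A full validation against the published schema would more naturally use a JSON Schema validator; the hand-written check above only keeps the sketch free of third-party dependencies.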
DISC: A Dataset for Information Security Classification