TI-NERmerger: Semi-Automated Framework for Integrating NER Datasets in Cybersecurity

Inoussa Mouiche, Sherif Saad

2024

Abstract

Recent advancements highlight the crucial role of high-quality data in developing accurate AI models, especially in threat intelligence named entity recognition (TI-NER). This technology automates the detection and classification of information from extensive cyber reports. However, the lack of scalable annotated security datasets hinders TI-NER system development. To overcome this, researchers often use data augmentation techniques such as merging multiple annotated NER datasets to improve variety and scalability. Integrating these datasets poses challenges, such as keeping entity annotations and entity categories consistent and adhering to standardized tagging schemes. Manually merging datasets is time-consuming and impractical on a large scale. Our paper presents TI-NERmerger, a semi-automated framework that integrates diverse TI-NER datasets into scalable, compliant datasets aligned with cybersecurity standards like STIX-2.1. We validated the framework’s efficiency and effectiveness by comparing it with manual processes using the DNRTI and APTNER datasets, producing Augmented APTNER (2APTNER). The results demonstrate a reduction of over 94% in manual labour, completing in minutes work that would otherwise take several months. Additionally, we applied advanced ML algorithms to validate the effectiveness of the integrated NER datasets. We also provide publicly accessible datasets and resources, supporting further research in threat intelligence and AI model development.
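
The two integration challenges named in the abstract, consistent entity categories and a standardized tagging scheme, can be illustrated with a minimal sketch. It is not the TI-NERmerger implementation: the BIO scheme, the function names, and the category mapping in LABEL_MAP are all assumptions made for illustration, and the mapping is not taken from the paper. Categories with no mapping are collected for manual review, which is one plausible reading of "semi-automated".

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, BIO tag)

# Hypothetical mapping from a source dataset's categories to a target schema.
# Illustrative only; not the mapping used by the paper.
LABEL_MAP: Dict[str, str] = {
    "HackOrg": "APT",
    "Area": "LOC",
    "Tool": "TOOL",
}


def remap_tag(tag: str, label_map: Dict[str, str], unmapped: Set[str]) -> str:
    """Translate one BIO tag into the target schema, recording unknown categories."""
    if "-" not in tag:
        return "O"
    prefix, label = tag.split("-", 1)
    target = label_map.get(label)
    if target is None:
        unmapped.add(label)  # left for a human to resolve
        return "O"
    return f"{prefix}-{target}"


def repair_bio(tags: List[str]) -> List[str]:
    """Turn any I- tag that does not continue the previous entity into a B- tag."""
    fixed: List[str] = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed


def merge_sentence(
    sentence: List[Token], label_map: Dict[str, str]
) -> Tuple[List[Token], Set[str]]:
    """Remap one sentence and return it with the categories needing review."""
    unmapped: Set[str] = set()
    tags = [remap_tag(tag, label_map, unmapped) for _, tag in sentence]
    tags = repair_bio(tags)
    words = [word for word, _ in sentence]
    return list(zip(words, tags)), unmapped


if __name__ == "__main__":
    example = [
        ("APT28", "B-HackOrg"),
        ("targets", "O"),
        ("Ukraine", "I-Area"),  # malformed: I- tag with no preceding B-
        ("via", "O"),
        ("phishing", "B-Way"),  # category absent from LABEL_MAP
    ]
    merged, review = merge_sentence(example, LABEL_MAP)
    print(merged)
    print("needs manual review:", sorted(review))
```

Keeping the mapping as plain data separates the manual decision (which source category corresponds to which target category) from the automated pass that applies it to every sentence.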

Paper Citation


in Harvard Style

Mouiche I. and Saad S. (2024). TI-NERmerger: Semi-Automated Framework for Integrating NER Datasets in Cybersecurity. In Proceedings of the 21st International Conference on Security and Cryptography - Volume 1: SECRYPT; ISBN 978-989-758-709-2, SciTePress, pages 357-370. DOI: 10.5220/0012867900003767


in Bibtex Style

@conference{secrypt24,
author={Inoussa Mouiche and Sherif Saad},
title={TI-NERmerger: Semi-Automated Framework for Integrating NER Datasets in Cybersecurity},
booktitle={Proceedings of the 21st International Conference on Security and Cryptography - Volume 1: SECRYPT},
year={2024},
pages={357-370},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012867900003767},
isbn={978-989-758-709-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 21st International Conference on Security and Cryptography - Volume 1: SECRYPT
TI - TI-NERmerger: Semi-Automated Framework for Integrating NER Datasets in Cybersecurity
SN - 978-989-758-709-2
AU - Mouiche I.
AU - Saad S.
PY - 2024
SP - 357
EP - 370
DO - 10.5220/0012867900003767
PB - SciTePress