Leveraging Clustering and Natural Language Processing to Overcome Variety Issues in Log Management

Tobias Eljasik-Swoboda, Wilhelm Demuth

2020

Abstract

When introducing log management or Security Information and Event Management (SIEM) practices, organizations are frequently challenged by Gartner’s three Vs of Big Data: there is a large volume of data that is generated at a rapid velocity. These first two Vs can be handled effectively by current scale-out architectures. The third V, variety, affects log management efforts through the lack of a common mandatory format for log files. Essentially, every component can log its events differently, and the way events are logged can change with every software update. This paper describes the Log Analysis Machine Learner (LAMaLearner) system. It uses a blend of different Artificial Intelligence techniques to overcome variety issues and identify relevant events within log files. LAMaLearner is able to cluster events and generate human-readable representations for all events within a cluster. A human being can annotate these clusters with specific labels. Once these labels exist, LAMaLearner leverages machine-learning-based natural language processing techniques to label events even in changing log formats. Additionally, LAMaLearner is capable of identifying previously known named entities occurring anywhere within a logged event, as well as identifying frequently co-occurring variables in otherwise fixed log events. To stay up to date, LAMaLearner includes a continuous feedback interface that facilitates active learning. In experiments with multiple differently formatted log files, LAMaLearner was capable of reducing the labeling effort by up to three orders of magnitude. Models trained on this labeled data achieved an F1 score above 93% in detecting relevant event classes. LAMaLearner thus helps log management and SIEM operations in three ways: Firstly, it creates a quick overview of the content of previously unknown log files. Secondly, it can be used to massively reduce the required manual effort in log management and SIEM operations. Thirdly, it identifies commonly co-occurring values within logs, which can be used to identify otherwise unknown aspects of large log files.
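
As a rough illustration of the clustering idea described in the abstract, the following sketch groups log events by masking their variable parts and uses the masked line as a human-readable representation of each cluster. It is not the LAMaLearner implementation: the masking rules, the function names `template_of` and `cluster_events`, and the sample events are all hypothetical and chosen only to show why labeling clusters can take far less effort than labeling individual events.

```python
import re
from collections import defaultdict

# Hypothetical masking rules (assumption, not taken from the paper).
# Order matters: IP addresses must be masked before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def template_of(event: str) -> str:
    """Replace variable tokens with placeholders to get a readable template."""
    for pattern, placeholder in MASKS:
        event = pattern.sub(placeholder, event)
    return event


def cluster_events(events):
    """Group events that share the same template."""
    clusters = defaultdict(list)
    for event in events:
        clusters[template_of(event)].append(event)
    return dict(clusters)


# Made-up sample events, for illustration only.
SAMPLE_EVENTS = [
    "Failed login for user 1001 from 10.0.0.5",
    "Failed login for user 1002 from 10.0.0.9",
    "Disk usage at 91 percent",
    "Failed login for user 1001 from 192.168.1.20",
    "Disk usage at 87 percent",
]

clusters = cluster_events(SAMPLE_EVENTS)

if __name__ == "__main__":
    # A human would label each template once instead of every single event.
    for template, members in clusters.items():
        print(f"{len(members):3d} events: {template}")
```

In this toy setting, five events collapse into two templates, so an annotator would assign two labels instead of five. A real system would need a more robust similarity measure than exact template matching, since log formats can change between software versions.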


Paper Citation


in BibTeX Style

@conference{icaart20,
author={Tobias Eljasik-Swoboda and Wilhelm Demuth},
title={Leveraging Clustering and Natural Language Processing to Overcome Variety Issues in Log Management},
booktitle={Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2020},
pages={281-288},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0008856602810288},
isbn={978-989-758-395-7},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Leveraging Clustering and Natural Language Processing to Overcome Variety Issues in Log Management
SN - 978-989-758-395-7
AU - Eljasik-Swoboda T.
AU - Demuth W.
PY - 2020
SP - 281
EP - 288
DO - 10.5220/0008856602810288


in Harvard Style

Eljasik-Swoboda T. and Demuth W. (2020). Leveraging Clustering and Natural Language Processing to Overcome Variety Issues in Log Management. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-395-7, pages 281-288. DOI: 10.5220/0008856602810288