Real-time Statistical Log Anomaly Detection with Continuous AIOps Learning
Lu An [a], An-Jie Tu, Xiaotong Liu and Rama Akkiraju
IBM Watson, 555 Bailey Ave, San Jose, U.S.A.
Keywords:
AI for IT Operations, Log Anomaly Detection, Online Statistical Learning, Error Entity Extraction, Continuous Model Updating.
Abstract:
Anomaly detection from logs is a fundamental Information Technology Operations (ITOps) management task.
It aims to detect anomalous system behaviours and find signals that can provide clues to the reasons and the
anatomy of a system’s failure. Applying advanced, explainable Artificial Intelligence (AI) models throughout
the entire ITOps is critical to confidently assess, diagnose and resolve such system failures. In this paper, we
describe a new online log anomaly detection algorithm which helps significantly reduce the time-to-value of
Log Anomaly Detection. This algorithm continuously updates the Log Anomaly Detection model
at run-time and automatically avoids a potentially biased model caused by contaminated log data. In
experiments on multiple datasets, the methods described here showed a 60% improvement in average
F1-scores compared to the existing method in the product pipeline, which demonstrates their efficacy.
1 INTRODUCTION
The explosive growth of Information Technology (IT) systems and services has made systems and applications increasingly complex to operate, manage and monitor. By utilizing log processing, machine learning and other advanced analytics technologies, Artificial Intelligence for IT Operations (AIOps) (Lerner, 2017) provides a promising solution to enhance the reliability of IT operations. Today,
most planet-scale service operators employ their own
AIOps to collect logs, traces and telemetry data, and
analyze the collected data to enhance their offerings
(Levin et al., 2019). One of the critical tasks in AIOps is anomaly detection, which is an essential step in detecting anomalous system behaviours and finding signals that can provide clues to the reasons and the anatomy of a system’s failure (Goldberg and Shan, 2015; Gu et al., 2017; Chandola et al., 2009).
As system logs record system states and events at various critical points, and log data is universally available in nearly all IT systems, logs are a valuable resource for AIOps to process, analyze and apply anomaly detection algorithms to. We refer to anomaly detection methods that use logs as their data source as Log Anomaly Detection (LAD). Traditional LAD methods were mostly manual operations and rule-based methods, but such methods are no longer suitable for large-scale IT systems with sophisticated system incidents. In recent years, with the development of AI technologies, machine learning based anomaly detection methods have received more and more attention. For instance, some works utilized unsupervised clustering-based methods (Givental et al., 2021a; Givental et al., 2021b) to detect outliers. Though such methods do not require labeled data for training, their anomaly detection performance is unstable and not guaranteed. Moreover, it is hard to apply such methods to streaming log data, as log patterns change over time.
[a] https://orcid.org/0000-0003-4050-3625
Another popular and widely used LAD approach is to first collect enough labeled training data during the system’s normal operation period, adopt a log template-based method for feature engineering, and then employ Principal Component Analysis (PCA) based methods to learn normal log patterns from the labeled training data and find anomalous log patterns in streaming log data at inference time (Liu et al., 2020; Liu et al., 2021). Even though the PCA-based method is successful in certain scenarios, it still suffers from some limitations in practice. Firstly, log template learning often requires customers to provide one week’s worth of training logs without incidents
An, L., Tu, A., Liu, X. and Akkiraju, R.
Real-time Statistical Log Anomaly Detection with Continuous AIOps Learning.
DOI: 10.5220/0011069200003200
In Proceedings of the 12th International Conference on Cloud Computing and Services Science (CLOSER 2022), pages 223-230
ISBN: 978-989-758-570-8; ISSN: 2184-5042
Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved