Authors:
Rudolf Hoffmann and Christoph Reich
Affiliation:
Institute for Data Science, Cloud Computing and IT Security, Furtwangen University, Germany
Keyword(s):
Cloud Computing, Reliability, Machine Learning, AI, XAI, Transparency, Explainability, Surrogate Model, Failure Detection, Fault Tree Analysis, Root Cause Analysis.
Abstract:
The availability of cloud computing infrastructures relies on many components, such as software, hardware, the cloud management system (CMS), security, environmental factors, and human operation. When a failure occurs, root cause analysis (RCA) is often complex. This paper explores the integration of Machine Learning (ML) with Fault Tree Analysis (FTA) to enhance explainable failure detection in cloud computing systems. We introduce a framework that employs ML for fault tree (FT) selection and generation, and for predicting Basic Events (BEs), to make failure analysis more explainable. Our experimental validation focuses on predicting BEs and using these predictions to calculate the Top Event (TE) probability. The results demonstrate improved diagnostic accuracy and reliability, highlighting the potential of combining ML predictions with traditional FTA to identify root causes of failures in cloud computing environments and to make failure diagnosis more explainable.
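To illustrate the final step mentioned in the abstract, the following is a minimal sketch of how predicted BE probabilities could be propagated through a fault tree to obtain the TE probability. It is not the authors' implementation: the tree structure, the event names, and the probability values are hypothetical, and the BEs are assumed to be statistically independent, which is the textbook simplification for AND/OR gates.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List, Union


@dataclass
class Gate:
    """A fault tree gate; children are nested gates or basic event names."""
    kind: str  # "AND" or "OR"
    children: List[Union["Gate", str]]


def probability(node: Union[Gate, str], be_probs: Dict[str, float]) -> float:
    """Return the occurrence probability of a node, assuming independent BEs."""
    if isinstance(node, str):
        # Leaf: a basic event whose probability would come from an ML model.
        return be_probs[node]
    child_probs = [probability(child, be_probs) for child in node.children]
    if node.kind == "AND":
        # All children must occur.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one child occurs.
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical BE probabilities standing in for ML predictions.
be_probs = {
    "disk_failure": 0.10,
    "network_outage": 0.20,
    "cms_misconfiguration": 0.05,
}

# Hypothetical tree: TE = (disk_failure AND network_outage) OR cms_misconfiguration
tree = Gate("OR", [Gate("AND", ["disk_failure", "network_outage"]),
                   "cms_misconfiguration"])

te_probability = probability(tree, be_probs)
print(f"Top Event probability: {te_probability:.3f}")
```

Because each gate's value is derived from named BEs, the contribution of every predicted event to the TE can be traced through the tree, which is the sense in which FTA supports explainability here.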