TIMING BEHAVIOR ANOMALY DETECTION IN ENTERPRISE
INFORMATION SYSTEMS
Matthias Rohr, Simon Giesecke and Wilhelm Hasselbring
Graduate School TrustSoft, Software Engineering Group, University of Oldenburg, 26111 Oldenburg, Germany
Keywords:
Software Measurement, Software Engineering, Anomaly Detection, Failure Diagnosis.
Abstract:
Business-critical enterprise information systems (EIS) have to satisfy high availability requirements. In order
to achieve the required availability, automatic failure detection and diagnosis techniques must be used. A
major cause of failures in EIS is software faults in the application layer. In this paper, we propose to use
anomaly detection to diagnose failures in the application layer of EIS. Anomaly detection aims to identify
unusual system behavior in monitoring data. These anomalies can be valuable indicators for availability or
security problems, and support failure diagnosis. In this paper we outline the basic principles of anomaly
detection, present the state of the art, and discuss typical application challenges. We outline a new approach for
anomaly detection in Enterprise Information Systems that addresses some of these challenges.
1 INTRODUCTION
Business-critical enterprise information sys-
tems (EIS) have to satisfy high availability require-
ments. The availability of a system is determined
by its reliability (the probability of failure-free operation over a period of time) and the time required for its repair after a failure has occurred. To achieve
a higher availability, the reliability must be increased
or the time to repair must be reduced. Automatic
failure diagnosis is a strategy to reduce repair times.
Anomaly detection is a promising strategy for automatic failure diagnosis in EIS. Its main idea is to use
“unusual” system runtime behavior as an indicator of
a problem in the system. System runtime behavior
is considered unusual if it deviates from the behav-
ior that was observed earlier. Anomaly detection re-
quires an instrumentation to monitor system runtime
behavior. Even without anomaly detection, monitor-
ing data are a valuable source of information in fail-
ure diagnosis, especially for failures that are difficult
to reproduce. However, as enterprise information sys-
tems typically produce a large amount of monitoring
data, tool support for anomaly detection is required to
efficiently realize failure diagnosis.
This paper motivates the use of an anomaly detec-
tion method to increase the dependability of EIS. Ex-
isting approaches that apply anomaly detection to EIS
are presented and open research questions are iden-
tified. Additionally, we outline a new approach for
anomaly detection based on timing behavior analysis
for distributed, multi-user, enterprise-scale software
systems.
This paper is structured as follows: Sections 2 and
3 present the basic concepts of anomaly detection and
its application to EIS. A new approach for anomaly
detection in EIS is introduced in Section 4 before the
paper is summarized in Section 5.
2 FOUNDATIONS OF ANOMALY
DETECTION
A (behavior) anomaly of a system is a significant de-
viation from the behavior that was observed earlier.
For instance, exceptionally high or low component ser-
vice response times are timing behavior anomalies.
The incidence of an anomaly does not automatically
imply that the system behavior is incorrect (Maxion,
1990). For example, exceptionally long component
service response times may be caused by an excep-
tionally high number of concurrent system users.
The threats to dependability are faults, errors, and
failures (Avižienis et al., 2004). In short, a (system)
failure is the incidence of system behavior that is in-
correct with respect to stated requirements. A fail-
ure is caused by faults. Active faults cause incorrect
states, which are called errors. When using anomaly
detection in failure diagnosis (e.g., failure detection,
fault localization), it is assumed that active faults and
failure processes cause anomalies.
Anomaly detection requires some criterion to
discriminate normal and unusual behavior. Many
anomaly detection methods require the manual spec-
ification of static thresholds for this. Static thresh-
olds require expert knowledge on the behavior of
the system under operation and may have to be up-
dated every time the system or its operating conditions change. More maintainable anomaly detection
techniques automatically learn threshold models by
statistical analysis of historical system behavior from
log files. Usually, only data describing normal sys-
tem behavior are available (Maxion and Olszewski,
1993), while training data for particular faults or fail-
ures are unavailable, in particular for fault classes that
have never occurred in the system before.
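To make the learned-threshold idea concrete, here is a minimal sketch (our illustration, not one of the cited techniques) that derives a percentile-based threshold purely from historical observations of normal behavior:

```python
import numpy as np

def learn_threshold(normal_response_times, quantile=0.99):
    """Learn an upper response-time threshold from historical
    observations of normal behavior only; no failure data needed."""
    return float(np.quantile(normal_response_times, quantile))

def is_anomalous(response_time, threshold):
    """Flag an observation that exceeds the learned threshold."""
    return response_time > threshold

# Usage with synthetic 'historical' response times (milliseconds):
history = np.random.lognormal(mean=3.0, sigma=0.4, size=10_000)
threshold = learn_threshold(history)
print(threshold, is_anomalous(250.0, threshold))
```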
A typical architecture of an anomaly detection
failure diagnosis system monitors different system
behavior metrics at measurement points at multiple
places in the system architecture. A separate anomaly
detector is used for each source of measurement data.
In combination, the anomaly detectors produce a pat-
tern of anomalies that is evaluated by so-called event
correlation (Steinder and Sethi, 2004) for a final fail-
ure diagnosis.
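As a deliberately simplified illustration of this architecture, the following sketch combines per-detector verdicts with a toy voting rule; real event correlation, as surveyed by Steinder and Sethi (2004), is considerably more sophisticated, and the component names are hypothetical:

```python
from collections import Counter

def correlate(anomaly_events, min_detectors=2):
    """Toy correlation rule: raise an alarm for a component only if
    several independent detectors report an anomaly for it, which
    suppresses isolated false positives."""
    votes = Counter(component
                    for component, is_anomaly in anomaly_events
                    if is_anomaly)
    return [c for c, n in votes.items() if n >= min_detectors]

events = [("OrderService", True), ("OrderService", True),
          ("PaymentService", False), ("PaymentService", True)]
print(correlate(events))  # -> ['OrderService']
```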
In the domain of network management or com-
munication infrastructure management, anomaly de-
tection is used for failure diagnosis by analyzing the
monitoring data provided by communication middle-
ware (e.g., Maxion, 1990; Hoke et al., 2006).
In industrial manufacturing, anomaly detection
has been well-studied for failure diagnosis based on
mathematical prediction models or computational intelligence methods (Bocaniala and Palade, 2006).
In system security engineering, many intrusion
detection approaches assume that user or system be-
havior is similar to historical data. Hence, strong changes in user or system behavior are indicative of
intrusions. For instance, Denning (1987) presented
an approach for detecting deviations from the normal
sequence of user commands.
2.1 Typical Challenges
This subsection presents typical challenges of
anomaly detection. Our approach presented in Sec-
tion 4 addresses some of these challenges.
False Alarms. A false alarm occurs when an anomaly is mistakenly judged to indicate a failure. A false alarm may lead to incorrect failure handling or reduce acceptance by administrators.
Usage Dependence. Changes in the amount or
types of user behavior can lead to changes in system
runtime behavior. For instance, the “normal” average
response time depends on the degree of resource shar-
ing by concurrent requests.
Modeling Uncertainty, Non-linearity of System
Runtime Behavior. In complex systems, behavior
metrics such as response times are difficult to predict,
because not all relevant influences and dynamics can
be efficiently modeled. Therefore, it is difficult to de-
velop criteria that can be used to classify runtime be-
havior as either normal or unusual. Additionally, only
a part of all operational conditions can be monitored,
because complete monitoring would cause an overhead that is too large to be tolerable.
3 ANOMALY DETECTION IN
ENTERPRISE INFORMATION
SYSTEMS
Several runtime behavior metrics, such as service re-
sponse times, memory utilization, or trace shapes, are
candidates for anomaly detection in EIS. Since tim-
ing behavior can be efficiently monitored, it is one of
the most common characteristics used to detect oper-
ational anomalies. System timing behavior or trace
shapes in enterprise information systems are usually
strongly dependent on the context and on the system
usage. It has been demonstrated that statistical analy-
sis of timing behavior can still be effectively used for
failure diagnosis in large software systems.
For instance, Agarwal et al. (2004) demonstrated
that timing behavior anomalies are a discriminatory
indicator in fault localization. Mielke (2006) reported
that end-to-end response times in Enterprise Resource
Planning systems are well-described by log-normal
distributions. Deviations from this distribution often
indicate performance problems (Mielke, 2006).
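A sketch of how such a deviation check could look, assuming SciPy is available; the choice of a Kolmogorov-Smirnov test is ours for illustration and is not prescribed by Mielke (2006):

```python
import numpy as np
from scipy import stats

def deviates_from_lognormal(historical, recent, alpha=0.01):
    """Fit a log-normal distribution to historical end-to-end
    response times and test whether a recent sample still follows
    it; a rejection hints at a performance problem."""
    shape, loc, scale = stats.lognorm.fit(historical, floc=0.0)
    _, p_value = stats.kstest(recent, "lognorm", args=(shape, loc, scale))
    return p_value < alpha

# Usage with synthetic data:
rng = np.random.default_rng(0)
normal = rng.lognormal(mean=3.0, sigma=0.4, size=5_000)
slow = normal * rng.uniform(1.0, 3.0, size=5_000)  # degraded system
print(deviates_from_lognormal(normal[:4_000], slow[:500]))
```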
Maintainability is a major requirement of fail-
ure diagnosis tools for EIS. EIS are regularly modified in response to changed market requirements, to offer new products, or to increase the level of integration.
Traditional software fault tolerance methods, such as
multi-version design, multiply the efforts required for
changes in system design. A failure diagnosis ap-
proach suitable for EIS should be maintainable, i.e.,
it should be easily adaptable to changes in the system
and should not reduce the changeability of the EIS by
imposing strict architectural constraints. Two critical
maintainability aspects of anomaly detection are the
monitoring instrumentation and the threshold specifi-
cation method.
In recent years, first approaches to anomaly
detection for failure diagnosis in EIS have been pro-
posed. For instance, Agarwal et al. (2004) evaluate re-
sponse times of internal components using automati-
cally generated threshold models derived from histor-
ical violations of end-to-end response time require-
ments and component dependency graphs. Kiciman
(2005) presents an anomaly detection method that is
applied to structural system behavior such as compo-
nent interactions and the “shape” of service execution
sequences.
4 APPROACH TO TIMING
BEHAVIOR ANOMALY
DETECTION
The general strategy of our approach is as follows:
The target system is instrumented to monitor op-
eration response times and operation execution se-
quences to build a profile of “normal” system behav-
ior. For failure diagnosis, a degree of anomaly is computed for each recent response time; it quantifies the “usualness” of the response time in the context of the execution conditions given by the operation execution sequences. All anomaly degrees are evaluated
by statistical pattern analysis to identify a possible
path of a failure process through the software archi-
tecture over time, or to discriminate types of failures.
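The paper does not fix a formula for the anomaly degree; one plausible definition, shown here purely for illustration, measures how far the empirical percentile rank of a response time lies from the median of a reference sample:

```python
import numpy as np

def anomaly_degree(response_time, reference_sample):
    """Distance of the empirical percentile rank from the median:
    0 means perfectly typical, values near 1 mean the observation
    lies outside almost everything seen under normal operation."""
    rank = float(np.mean(np.asarray(reference_sample) <= response_time))
    return abs(2.0 * rank - 1.0)

reference = [12.0, 14.5, 13.2, 15.1, 12.8, 14.0]
print(anomaly_degree(13.5, reference))   # 0.0: typical
print(anomaly_degree(95.0, reference))   # 1.0: highly unusual
```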
The system is assumed to be assembled out of
software components, i.e. reusable building blocks of
software offering operations via interfaces. Further-
more, the allocation of software components to execu-
tion environments must be known. Each user request
is processed by a separate request process that man-
ages the execution of required component services. A
request process has a unique request identifier. The
middleware platform manages the execution for con-
current user requests by starting a separate request
process for each incoming user request. The sequence
of operation executions for a user request is called re-
quest trace.
Workload Awareness. In multi-user systems, ser-
vice response times depend on the current level of
concurrent system usage. We define the workload
metric w_{d,t} as the number of active request processes for a deployment context d at time t. A request process is active iff it is not currently waiting for responses from other deployment contexts. This simple
metric gives a coarse approximation of resource uti-
lization without the need for additional monitoring.
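A minimal sketch of how w_{d,t} could be computed from monitoring data; the RequestProcess structure and its fields are our assumptions rather than the paper's actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RequestProcess:
    context: str                           # deployment context d
    start: float                           # monitored start time
    end: float                             # monitored end time
    # time intervals spent waiting for other deployment contexts
    remote_waits: List[Tuple[float, float]] = field(default_factory=list)

    def active_at(self, t: float) -> bool:
        """Active iff running at t and not waiting for a response
        from another deployment context (definition of w_{d,t})."""
        if not (self.start <= t < self.end):
            return False
        return not any(a <= t < b for a, b in self.remote_waits)

def workload(processes: List[RequestProcess], d: str, t: float) -> int:
    """Compute w_{d,t} from monitored request processes."""
    return sum(p.context == d and p.active_at(t) for p in processes)

procs = [RequestProcess("node1", 0.0, 5.0, [(1.0, 2.0)]),
         RequestProcess("node1", 0.5, 3.0)]
print(workload(procs, "node1", 1.5))  # -> 1 (first process is waiting)
```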
Our research is based on the hypothesis that
timing behavior evaluation is improved by explic-
itly modeling the workload, i.e. influences on tim-
ing behavior that arise from hardware resources be-
ing shared by concurrent user requests. Moreover, the resulting robustness against workload changes extends the applicability to multi-user applications in the first place, since it removes the requirement that there be no concurrent system usage.
User Request Awareness. The execution of opera-
tions can heavily depend on particular request prop-
erties such as input parameter values or the request's origin.
For instance, consider an operation that is the only
access point exposed by a large legacy system. This
operation is used both for requests with short response
times from private customers and for batch jobs with
long response times. We expect that the evaluation of
the response times of this operation can benefit from
taking the execution sequences that precede the call
into account.
Including the execution sequences in timing behavior anomaly detection improves the awareness
of individual characteristics of user requests. Since
not all variations in the execution sequence are rel-
evant to timing behavior analysis, we identify pat-
terns within operation execution sequences (traces)
that have shown distinct timing behavior. If
such patterns are discovered, they will be handled
independently in the timing behavior evaluation, by
being represented through separate probability den-
sity functions of operation response times. We hypothesize that timing behavior evaluation can be
more accurate if such trace patterns are explicitly con-
sidered.
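A sketch of this idea with hypothetical names: response times are collected per trace pattern, and each pattern is evaluated against its own empirical distribution, reusing the illustrative anomaly degree from above:

```python
from collections import defaultdict
import numpy as np

class PatternAwareModel:
    """One empirical response-time distribution per trace pattern,
    so that, e.g., short interactive calls and long batch calls to
    the same operation are evaluated independently."""

    def __init__(self):
        self.samples = defaultdict(list)

    def train(self, trace_pattern, response_time):
        self.samples[trace_pattern].append(response_time)

    def anomaly_degree(self, trace_pattern, response_time):
        # Same illustrative percentile-rank degree as above.
        ref = np.asarray(self.samples[trace_pattern])
        rank = float(np.mean(ref <= response_time))
        return abs(2.0 * rank - 1.0)

model = PatternAwareModel()
for rt in (10.0, 12.0, 11.0, 11.5):
    model.train("interactive", rt)
for rt in (900.0, 1100.0):
    model.train("batch", rt)
print(model.anomaly_degree("interactive", 11.0))  # 0.0: typical
print(model.anomaly_degree("batch", 11.0))        # 1.0: unusual for batch
```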
Maintainability. We employ source code instru-
mentation (monitoring logic is integrated into the
source code) to realize maintainable timing-behavior
monitoring for enterprise-scale systems. This is in
contrast to middleware-interception methods that use
proprietary interfaces of middleware products or ma-
nipulate middleware for monitoring. A disadvan-
tage of code instrumentation is that the primary busi-
ness logic of the software is mixed with monitoring
logic, which may reduce the readability of the code
(and therefore, the maintainability). However, code
instrumentation avoids dependencies on manipulated
middleware or proprietary middleware interfaces. To
reduce the problem of mixing business and monitor-
ing logic in the source code, we employ an approach
based on Aspect-Oriented Programming (AOP). AOP
is a technique to isolate cross-cutting concerns, such
as monitoring, from the primary business logic. In
a previous case study, we evaluated the monitoring
overhead and the maintainability of this monitoring
approach in a large business system of a German
telecommunication provider (Focke et al., 2007).
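The case study used AOP in a Java-based system; as a language-neutral analogy, the following sketch shows the same separation principle with a Python decorator, where all names are illustrative:

```python
import functools, time

def monitored(operation_name, log):
    """Record the response time of an operation; the business logic
    below stays free of monitoring code, in the spirit of an AOP
    'around' advice."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                log.append((operation_name, time.perf_counter() - start))
        return inner
    return wrap

measurements = []

@monitored("Catalog.search", measurements)
def search(query):
    # primary business logic only
    return [item for item in ("book", "boot", "shirt") if query in item]

search("boo")
print(measurements)  # [('Catalog.search', <duration in seconds>)]
```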
Additionally, maintainability is addressed by au-
tomatically creating the profile of normal timing be-
havior from historical data. To derive the model, no intrusive changes to the software system are required; it is merely necessary to perform the system instrumentation described above.
5 CONCLUSION
This paper motivates anomaly detection for failure di-
agnosis in enterprise information systems. Anomaly
detection makes it possible to target failures caused by faults in the application layer. Anomaly detection has been successfully applied to failure diagnosis in industrial manufacturing, network management, and system security. Anomaly detection for failure diagnosis in EIS is far less studied, but promising first results indicate the potential benefits.
We propose a new approach to anomaly detection
for EIS based on timing behavior analysis. Its major concepts are workload awareness and user request awareness, which improve anomaly detection quality by reducing the dependence on changes in the operational profile; this should reduce the number of false alarms. Moreover, the increased robustness against workload changes extends the applicability to multi-user applications by removing the requirement of non-concurrent system usage. The
maintainability of our monitoring approach has been
shown during an evaluation in an EIS of a telecom-
munication company. In future work we will perform
an evaluation of the anomaly detection approach.
We discussed that the dynamic nature of enterprise information systems makes the maintainability of a failure diagnosis approach for EIS particularly important, and we outlined how anomaly detection can be realized in a maintainable way.
ACKNOWLEDGEMENTS
This work is supported by the German Research
Foundation (DFG), grant GRK 1076/1.
REFERENCES
Agarwal, M. K., Appleby, K., Gupta, M., Kar, G., Neogi,
A., and Sailer, A. (2004). Problem determination us-
ing dependency graphs and run-time behavior mod-
els. In 15th IFIP/IEEE International Workshop on
Distributed Systems: Operations and Management
(DSOM’04), volume 3278 of Lecture Notes in Com-
puter Science, pages 171–182. Springer.
Avižienis, A., Laprie, J.-C., Randell, B., and Landwehr, C.
(2004). Basic concepts and taxonomy of dependable
and secure computing. IEEE Transactions on Depend-
able and Secure Computing, 1(1):11–33.
Bocaniala, C. D. and Palade, V. (2006). Computational in-
telligence methodologies in fault diagnosis: Review
and state of the art. In Computational Intelligence in
Fault Diagnosis, Advanced Information and Knowl-
edge Processing, chapter 1, pages 1–36. Springer.
Denning, D. (1987). An intrusion-detection model. IEEE
Transactions on Software Engineering, 13(2):222–
232.
Focke, T., Hasselbring, W., Rohr, M., and Schute, J.-G.
(2007). Instrumentierung zum Monitoring mittels
Aspekt-orientierter Programmierung. In Proceedings
Software Engineering 2007, Hamburg, GI-Edition
Lecture Notes in Informatics. Bonner Köllen Verlag.
Hoke, E., Sun, J., Strunk, J. D., Ganger, G. R., and Falout-
sos, C. (2006). Intemon: continuous mining of sensor
data in large-scale self-infrastructures. SIGOPS Oper.
Syst. Rev., 40(3):38–44.
Kiciman, E. (2005). Using Statistical Monitoring to De-
tect Failures in Internet Services. PhD thesis, Stanford
University.
Maxion, R. A. (1990). Anomaly detection for network diag-
nosis. In Randell, B., editor, Proceedings of the 20th
International Symposium on Fault-Tolerant Comput-
ing (FTCS ’90), pages 20–27. IEEE.
Maxion, R. A. and Olszewski, R. T. (1993). Detection and
discrimination of injected network faults. In Digest of
Papers of the 23rd International Symposium on Fault-
Tolerant Computing, pages 198–207. IEEE.
Mielke, A. (2006). Elements for response-time statistics
in ERP transaction systems. Performance Evaluation,
63(7):635–653.
Steinder, M. and Sethi, A. S. (2004). A survey of fault lo-
calization techniques in computer networks. Science
of Computer Programming, 53(2):165–194.