Detecting Data Stream Dependencies on High Dimensional Data
Jonathan Boidol¹,² and Andreas Hapfelmeier²
¹ Institute for Informatics, Ludwig-Maximilians University, Oettingenstr. 67, D-80538, Munich, Germany
² Corporate Technology, Siemens AG, Otto-Hahn-Ring 6, D-81739, Munich, Germany
Keywords:
Sensor Application, Online Algorithm, Entropy-based Correlation Analysis.
Abstract:
Intelligent production in smart factories and wearable devices that measure our activities produce an
ever-growing amount of sensor data. In these environments, the validation of measurements to distinguish sensor
flukes from significant events is of particular importance. We developed an algorithm that detects dependencies
between sensor readings. These can be used, for instance, to verify or analyze large-scale measurements. An
entropy-based approach allows us to detect dependencies beyond linear correlation and is well suited to deal
with high-dimensional and high-volume data streams. Results show statistically significant improvements in
reliability and execution time on par with other stream monitoring systems.
1 INTRODUCTION
Large-scale wireless sensor networks (WSN) and
other forms of remote monitoring, ranging from
personal activity tracking to the surveillance of
industrial plants or whole ecological systems, are
advancing towards cheap and widespread deployment.
This progress has
spurred the need for algorithms and applications that
work on high dimensional streaming data. Stream-
ing data analysis is concerned with applications where
the records are processed in unbounded streams of
information. Popular examples include the analysis of
text streams, as on Twitter, or of image streams, as
on Flickr. However, there is also increasing interest
in industrial applications. The nature and volume of
this type of data make traditional batch learning
exceedingly difficult and lend themselves naturally to
algorithms that work in one pass over the data, i.e. in
an online fashion. To achieve the transition from batch
to online algorithms, window-based and incremental
algorithms are popular, often favoring heuristics over
exact results.
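As a minimal illustration of the window-based, incremental style described above, the following Python sketch maintains the mean of the most recent values of a stream with constant work per arriving record. The class name `SlidingMean` and the window size are assumptions made for this example only; they are not taken from the paper.

```python
from collections import deque


class SlidingMean:
    """Mean over the last `size` values, updated in O(1) per arrival.

    Hypothetical helper for illustration; not part of the paper's algorithm.
    """

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        # Add the new record and, if the window is full, evict the oldest one.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


stream = [1.0, 2.0, 3.0, 4.0, 5.0]
sm = SlidingMean(size=3)
means = [sm.update(v) for v in stream]  # one pass, no stored history beyond the window
```

Each record is touched once and only the current window is kept in memory, which is the property that makes such algorithms suitable for unbounded streams.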
Instead of relying only on single-stream statistics
to, e.g., detect anomalies or find patterns in the
data, this paper is concerned with a setting in which
many sensors monitor the same or closely related
phenomena, for example temperature sensors in close
spatial proximity or voltage and rotor speed sensors
in large turbines. It seems natural to exploit the
information shared between sensor pairs, which is in
some sense redundant, to validate measurements. The
task at hand then becomes to reliably and efficiently
compute and report dependencies between pairs or
groups of data streams. We can imagine such a scenario
in the context of smart homes or smart cities with
personal monitoring, or of automated manufacturing,
which together form the Internet of Things. A
particular application could be the validation of
sensor readings when multiple cheap sensors are used
and measurements may be impaired by limited technical
precision, processing errors or natural fluctuations.
Unusual readings might then either indicate actual
changes in the monitored system or be due to these
measurement uncertainties. Finding correlations helps
to differentiate such cases.
The best-known indicator for pairwise correlation
is Pearson’s correlation coefficient ρ, essentially
the normalized covariance between two random
variables. Direct computation of ρ, however, is
prohibitively expensive and, more problematically, it
is only a suitable indicator for linear or linearly
transformed relationships (Granger and Lin, 1994).
Non-linearity in time series has been studied to some
extent and may arise, for example, due to shifts in the
variance (Fernandez et al., 2002) or simply if the
underlying processes are determined by non-linear
functions.
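The limitation of ρ for non-linear relationships can be seen in a small sketch. The Python code below compares Pearson's coefficient with a simple histogram-based (plug-in) estimate of mutual information, an entropy-based dependency measure, on the deterministic but non-linear relation y = x². The function names, the equal-width binning and the choice of 8 bins are assumptions made for this illustration; this is not the estimator proposed in the paper.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson's correlation coefficient (normalized covariance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def mutual_information(xs, ys, bins=8):
    """Plug-in estimate of I(X;Y) in bits from equal-width histograms.

    Illustrative assumption: naive binning, no bias correction.
    """
    def discretize(vs):
        lo, hi = min(vs), max(vs)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
        return [min(int((v - lo) / width), bins - 1) for v in vs]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over observed cells.
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )


xs = [i / 100 for i in range(-100, 101)]  # symmetric around zero
ys = [x * x for x in xs]                  # fully determined by x, but non-linear

rho = pearson(xs, ys)            # close to zero despite the exact dependency
mi = mutual_information(xs, ys)  # clearly positive
```

Because the sample is symmetric around zero, the covariance between x and x² vanishes and ρ reports no relationship, whereas the mutual information estimate remains well above zero.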
We propose an algorithm that detects dependencies
in high-volume and high-dimensional data streams
based on the mutual information between time series.
Our approach has three advantages: mutual information
captures global dependencies, is algorithmically
suited to incremental calculation and can be computed
Boidol, J. and Hapfelmeier, A.
Detecting Data Stream Dependencies on High Dimensional Data.
DOI: 10.5220/0005953303830390
In Proceedings of the International Conference on Internet of Things and Big Data (IoTBD 2016), pages 383-390
ISBN: 978-989-758-183-0
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved