detection in WSNs, classifying them into
Statistical-Based approaches, Nearest Neighbour
approaches, Cluster-Based approaches,
Classification-Based approaches and Spectral
Decomposition-Based approaches.
Cluster-based approaches in IoT have several
advantages over the other techniques. Firstly,
although supervised systems (mainly
Classification-Based approaches) are commonly used
for sensor calibration (Spinelle et al., 2015) and
(Pena et al., 2003), they are difficult to deploy in a
real-world IoT scenario because they do not adapt
easily to new conditions (e.g. a change in the
location of the sensor). Secondly, statistical
approaches are difficult to adapt to real-time
processing and to the Big Data paradigm associated
with IoT. Thirdly, unsupervised systems can be
combined with spectral decomposition-based
approaches, since the latter focus on variable
reduction (usually for visualization, although there
has been research on spectral decomposition-based
outlier detection using PCA techniques (Ghorbel et
al., 2015) and (Zheng et al., 2018)). Finally,
unsupervised cluster-based systems do not need to
know the number of clusters in advance.
In general, unsupervised clustering techniques
have the advantage that they model the “usual”
behaviour of the system and then detect any
deviation from this behaviour as an anomaly. This is
exactly the problem to be solved by outlier detection
in sensor networks. Moreover, outlier detection
techniques can be combined with supervised
learning techniques so that the detected anomalies
can be classified into different classes (different
types of anomalies or events).
Most of the work related to outlier detection in
WSNs has focused on exploiting the temporal
(time-sequenced) and spatial (location) correlation of
the sensor measurements (Zheng et al., 2018) and
(Yang et al., 2008), with distributed approaches also
considering multi-variable correlation as a
secondary correlation (Barakkath Nisha et al., 2014).
All these techniques are usually evaluated using a
clean dataset to which outliers are added artificially,
since a labelled dataset is needed to evaluate an
algorithm and it is difficult to obtain such a dataset
from real measurements.
In the Big Data paradigm introduced by the
MapReduce technique (Dean and Ghemawat, 2008),
which is implemented in the best-known technologies
in the field, such as Hadoop (Shvachko et al., 2010)
and Spark (Zaharia et al., 2016), data is not
processed in temporal order. Instead, chunks of data
are distributed to several nodes and MapReduce tasks
are launched in parallel. Hence, it is difficult to base
an outlier analysis on time correlations using a Big
Data approach.
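The argument above can be illustrated with a toy sketch in plain Python. It is not a MapReduce implementation: the readings, the round-robin `partition` helper and the linear relation between the two variables are all hypothetical, chosen only to show that a per-record multi-variable feature is unaffected by how records are split across nodes, whereas a time-difference feature computed inside a chunk is not.

```python
# Hypothetical readings: (timestamp, temperature, humidity), with
# humidity + 2 * temperature == 100 by construction.
readings = [(t, 20.0 + 0.1 * t, 60.0 - 0.2 * t) for t in range(12)]


def partition(records, n_chunks):
    """Deal records round-robin into chunks (an assumed stand-in for
    the distribution of data blocks to worker nodes)."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_residual(record):
    """Multi-variable feature: uses only the variables of one record."""
    t, temp, hum = record
    return t, round(hum + 2.0 * temp - 100.0, 6)


def in_chunk_diffs(chunk):
    """Temporal feature: temperature change with respect to the previous
    record that happens to be in the same chunk."""
    return [round(cur[1] - prev[1], 6) for prev, cur in zip(chunk, chunk[1:])]


chunks = partition(readings, 3)

# Same per-record result whether or not the data was split.
split_residuals = sorted(map_residual(r) for c in chunks for r in c)
full_residuals = sorted(map_residual(r) for r in readings)
print(split_residuals == full_residuals)

# The temporal feature changes: neighbours in a chunk are not
# neighbours in time.
print(in_chunk_diffs(readings)[:3], in_chunk_diffs(chunks[0]))
```

In this sketch the step between consecutive readings is distorted inside each chunk because the true temporal neighbour sits on another node, while the residual of each record is identical in both settings.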
Moreover, to the best of our knowledge, there is
no study that evaluates the multi-variable correlation
separately from the temporal and spatial correlations.
Such a study would be useful to show the
characteristics of outlier detection for each of
these correlations (temporal, spatial, multi-variable)
in isolation.
Finally, in real-world IoT scenarios, most of the
variables are correlated, like the temperature and
vibration of a machine in a manufacturing line or the
different pollutants in a smart city.
Hence, in this paper we aim to exploit and
evaluate the multi-variable correlation in outlier
detection. Firstly, to detect these outliers we use a set
of three well-known unsupervised algorithms,
namely Elliptic Envelope, Isolation Forest and Local
Outlier Factor (Section 2). With their outputs, we
build an Ensemble Outlier Detector (EOD) based on
a majority voting system.
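A minimal sketch of such a majority-voting ensemble is given below. It assumes the scikit-learn implementations of the three algorithms; the function name `ensemble_outlier_detector`, the `contamination` value, the `n_neighbors` setting and the synthetic temperature/humidity data are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def ensemble_outlier_detector(X, contamination=0.05, random_state=0):
    """Return a boolean mask that is True where at least two of the
    three detectors flag the row as an outlier (majority vote)."""
    detectors = [
        EllipticEnvelope(contamination=contamination,
                         random_state=random_state),
        IsolationForest(contamination=contamination,
                        random_state=random_state),
        LocalOutlierFactor(n_neighbors=20, contamination=contamination),
    ]
    # fit_predict returns +1 for inliers and -1 for outliers.
    votes = np.stack([d.fit_predict(X) for d in detectors])
    outlier_votes = (votes == -1).sum(axis=0)
    return outlier_votes >= 2


# Synthetic, hypothetical data: two correlated variables.
rng = np.random.default_rng(42)
temperature = rng.normal(20.0, 2.0, size=300)
humidity = 80.0 - 2.0 * temperature + rng.normal(0.0, 1.0, size=300)
X = np.column_stack([temperature, humidity])

# Append one reading that breaks the correlation between the variables.
X = np.vstack([X, [26.0, 60.0]])

mask = ensemble_outlier_detector(X)
print("injected reading flagged:", bool(mask[-1]))
print("rows flagged by the ensemble:", int(mask.sum()))
```

With three detectors a tie is impossible, so requiring at least two outlier votes is equivalent to a simple majority. The contamination parameter fixes the share of rows each detector labels as outliers, so it should be tuned for the dataset at hand.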
Secondly, we perform this analysis using the well-
known and broadly used Intel Berkeley dataset (Intel
Berkeley Research Lab, 2004) (Section 3). We
evaluate this system using the standard evaluation
techniques based on Detection Rate (DR) and False
Alarm Rate (FAR) with artificially generated outliers
composed of local and global outliers.
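As an illustration, the two metrics can be computed from ground-truth and predicted labels as sketched below. The sketch assumes the usual definitions, DR = TP / (TP + FN) and FAR = FP / (FP + TN), with outliers as the positive class; the function name `detection_metrics` and the example labels are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Compute (DR, FAR) from boolean sequences where True = outlier.

    Assumed definitions:
      DR  = TP / (TP + FN)  share of true outliers that are detected
      FAR = FP / (FP + TN)  share of normal readings flagged as outliers
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return dr, far


# Hypothetical labels: 8 normal readings followed by 2 injected outliers.
y_true = [False] * 8 + [True, True]
# Hypothetical detector output: one false alarm and one missed outlier.
y_pred = [False, False, True, False, False, False, False, False, True, False]

dr, far = detection_metrics(y_true, y_pred)
print(dr, far)  # 0.5 0.125
```

A good detector pushes DR towards 1 while keeping FAR close to 0, and both values are needed because flagging every reading trivially gives DR = 1.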
Thirdly, we evaluate the proposed model in a
real-world scenario in the city of Barcelona, within
the scope of the GrowSmarter project (Growsmarter
project, 2019), using the data provided by a cluster of
sensors measuring 16 variables, installed on bikes
that move around the city (Section 4).
Finally, we present future evolutions of the EOD
and the conclusions (Section 5).
2 ENSEMBLE OUTLIER DETECTOR
Ensemble methods are widely used to increase the
accuracy of the predictions when different criteria
need to be applied for decision making using data-
driven systems. For example, Skyline is a popular
open source project which uses ensemble methods for
outlier detection in time-series data (Stanway, 2013).
This work takes advantage of well-known
unsupervised techniques for outlier detection in order
to obtain a single robust classification.
In the presented Ensemble Outlier Detector, the
first implemented technique is Elliptic Envelope (EE)
(Rousseeuw and Van Driessen, 1999), which is based
Ensembled Outlier Detection using Multi-Variable Correlation in WSN through Unsupervised Learning Techniques