5 CONCLUSIONS, DISCUSSION, AND FUTURE WORK
Summary of Contribution and Critical Discussion.
The literature does not currently give a structured
methodology for cleaning a raw dataset of sensor
readings, and especially for obtaining an annotated
dataset of readings which includes data faults in
configurable patterns. Without such a methodology,
studies which design and evaluate algorithms for
anomaly detection, classification, and correction in
sensor data are difficult to compare.
Cleaning a given dataset is ideally done via the
resource-heavy process of comparing the sensor data
against data acquired concurrently by a second, cal-
ibrated, high-fidelity sensor, which is able to col-
lect digital data, and either download it over a net-
work, or store large datasets in memory. However,
an ideal such second sensor is rarely available in
real-world deployments, and arguably all the sensor
datasets currently available for experimentation have
been cleaned via an unstructured process which com-
bines a degree of domain knowledge with a form of
basic inspection of the data. This process is error-
prone, and may not differentiate (for all types of phys-
ical phenomena sensed) between legitimate faults in
the data and true anomalous conditions in the envi-
ronment. Other means for classifying raw data points
are surveyed in (Zhang, 2010).
This paper does not contribute a data-cleaning
methodology, but provides a framework to prepare
annotated datasets with configured injected faults,
which are well suited for then evaluating fault-
detection methods. The framework requires a clean
dataset as input. The datasets we publish here use as
clean data those subsets of the raw datasets which we
judged, using our own domain knowledge and visual
inspection, to be the least affected by faults in the
sensing system, and which thus required minimal
manual cleaning and interpolation for missing values.
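To make the interpolation step concrete, the following is a minimal sketch of one common way to fill missing readings. It assumes that missing values are encoded as NaN and that gaps are filled linearly between the nearest valid neighbours; the text above does not state which interpolation scheme was used, and the class name GapFillSketch and the method fillGaps are hypothetical.

```java
/** Illustrative sketch only; the interpolation scheme is an assumption. */
public class GapFillSketch {

    /**
     * Returns a copy of raw in which NaN gaps are filled by linear
     * interpolation between the nearest valid neighbours. Gaps at the
     * start or end of the series copy the nearest valid reading.
     * A series with no valid reading is returned unchanged.
     */
    public static double[] fillGaps(double[] raw) {
        double[] out = raw.clone();
        int prev = -1; // index of the last valid sample seen so far
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                continue;
            }
            if (prev < 0) {
                // Leading gap: copy the first valid reading backwards.
                for (int k = 0; k < i; k++) {
                    out[k] = out[i];
                }
            } else if (i - prev > 1) {
                // Interior gap: interpolate between out[prev] and out[i].
                double step = (out[i] - out[prev]) / (i - prev);
                for (int k = prev + 1; k < i; k++) {
                    out[k] = out[prev] + step * (k - prev);
                }
            }
            prev = i;
        }
        if (prev >= 0) {
            // Trailing gap: copy the last valid reading forwards.
            for (int k = prev + 1; k < out.length; k++) {
                out[k] = out[prev];
            }
        }
        return out;
    }
}
```

Filled samples are estimates rather than measurements, which is one reason to prefer subsets that need as little interpolation as possible.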
We provide three benchmark datasets for the eval-
uation of fault detection and classification in wire-
less sensor networks, and a Java library which im-
plements configurable fault-injection algorithms. The
first benchmark dataset includes 280 subsets of tem-
perature and light sensors of 10 nodes extracted from
the Intel Lab raw data. The second benchmark dataset
includes 140 subsets of ambient temperature sensors
extracted from the SensorScope dataset. The third
benchmark dataset includes 224 subsets of tempera-
ture measurements obtained from 16 sensors as part
of the Smart Santander project. The three benchmark
datasets total 5,783,504 data samples, covering
six types of faults that have been observed in sen-
sor data by prior literature. Faults are injected using
known, generic fault models. We publish the datasets
at http://tuananh.io/datasets. We believe that all papers
listed in Table 3 would have benefited from using
such annotated datasets.
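As an illustration of what configurable fault injection into a clean series involves, the following is a minimal Java sketch. It is not the API of the library published with this paper: the class name FaultInjectionSketch, the methods injectOffset and injectSpikes, and the two fault models shown (a constant offset over an interval, and isolated random spikes) are assumptions chosen for illustration. Each method returns the modified readings together with a per-sample ground-truth label, which is what makes the resulting dataset annotated.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not the API of the published Java library. */
public class FaultInjectionSketch {

    /** Readings plus a per-sample label (true = injected fault). */
    public static final class Annotated {
        public final double[] values;
        public final boolean[] faulty;

        Annotated(double[] values, boolean[] faulty) {
            this.values = values;
            this.faulty = faulty;
        }
    }

    /**
     * Adds a constant offset to up to length consecutive samples
     * beginning at index start. The clean input is left untouched.
     */
    public static Annotated injectOffset(double[] clean, int start,
                                         int length, double offset) {
        double[] values = Arrays.copyOf(clean, clean.length);
        boolean[] faulty = new boolean[clean.length];
        int end = Math.min(clean.length, start + length);
        for (int i = Math.max(0, start); i < end; i++) {
            values[i] = clean[i] + offset;
            faulty[i] = true;
        }
        return new Annotated(values, faulty);
    }

    /**
     * Turns each sample into a spike of +/- magnitude with the given
     * probability. A fixed seed keeps the injected pattern reproducible.
     */
    public static Annotated injectSpikes(double[] clean, double probability,
                                         double magnitude, long seed) {
        Random rng = new Random(seed);
        double[] values = Arrays.copyOf(clean, clean.length);
        boolean[] faulty = new boolean[clean.length];
        for (int i = 0; i < clean.length; i++) {
            if (rng.nextDouble() < probability) {
                double sign = rng.nextBoolean() ? 1.0 : -1.0;
                values[i] = clean[i] + sign * magnitude;
                faulty[i] = true;
            }
        }
        return new Annotated(values, faulty);
    }
}
```

In this sketch, a call such as injectOffset(clean, 100, 50, 5.0) would shift samples 100 to 149 by 5.0 and label exactly those samples as faulty, so a detector's output can be scored against the labels sample by sample.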
This paper attempts an initial, systematic treat-
ment of the problem of the missing annotated
datasets. Its main limitation lies in the fact that the an-
notated datasets have been obtained based on cleaned
sensor data that is still not entirely guaranteed to be
free of all faulty readings.
Future Work. We plan to extend the current
datasets in two ways: (a) by processing and publishing
sensor datasets pertaining to physical phenomena
other than light and temperature, and (b) by preparing
annotated datasets based on cleaned datasets with
a better guarantee of correctness. We also aim to
develop a more user-friendly software tool for
configuring the fault-injection algorithms.
ACKNOWLEDGEMENT
The work is supported by (1) the FP7-ICT-2013-EU-Japan
Collaborative project ClouT, EU FP7 Grant number
608641, NICT management number 167; (2) the Dutch
National Research Council Energy Smart Offices project,
contract no. 647.000.004; and (3) the Dutch National
Research Council Beijing Groningen Smart Energy Cities
project, contract no. 467-14-037.
REFERENCES
Baljak, V., Tei, K., and Honiden, S. (2013). Fault classi-
fication and model learning from sensory readings –
framework for fault tolerance in wireless sensor net-
works. In Intelligent Sensors, Sensor Networks and
Information Processing, 2013 IEEE Eighth Interna-
tional Conference on, pages 408–413.
Box, G. E., Jenkins, G. M., and Reinsel, G. C. (2013). Time
Series Analysis: Forecasting and Control. Wiley.
Gaillard, F., Autret, E., Thierry, V., Galaup, P., Coatanoan,
C., and Loubrieu, T. (2009). Quality control of large
Argo datasets. Journal of Atmospheric and Oceanic
Technology, 26.
Hamdan, D., Aktouf, O., Parissis, I., El Hassan, B., and Hi-
jazi, A. (2012). Online data fault detection for wireless
sensor networks - case study. In Wireless Communi-
cations in Unusual and Confined Areas (ICWCUCA),
2012 International Conference on, pages 1–6.
SENSORNETS 2016 - 5th International Conference on Sensor Networks