Efficient Processing of Semantically Represented Sensor Data
Farah Karim
1
, Maria-Esther Vidal
2
and S
¨
oren Auer
1,2
1
Enterprise Information Systems (EIS), University of Bonn, Germany
2
Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Germany
Keywords:
Sensor, Linked Data, Data Factorization, Query Optimization and Execution.
Abstract:
Large collections of sensor data are semantically described using ontologies, e.g., the Semantic Sensor Net-
work (SSN) ontology. Semantic sensor data are RDF descriptions of sensor observations from related sam-
pling frames or sensors at multiple points in time, e.g., climate sensor data. Sensor values can be repeated in a
sampling frame, e.g., a particular temperature value can be repeated several times, resulting in a considerable
increase in data volume. We devise a factorized compact representation of semantic sensor data using linked
data technologies to reduce repetition of same sensor values, and propose algorithms to generate collections
of factorized semantic sensor data that can be managed by existing RDF triple stores. We empirically study
the effectiveness of the proposed factorized representation of semantic sensor data. We show that the size of
semantic sensor data is reduced by more than 50% on average without loss of information. Further, we have
evaluated the impact of this factorized representation of semantic sensor data on query execution. Results
suggest that query optimizers can be empowered with semantics from factorized representations to generate
query plans that effectively speed up query execution time on factorized semantic sensor data.
1 INTRODUCTION
The increasing use of semantic technologies in indu-
stry, health care, and other domains has brought at-
tention to scalability and performance improvements.
One crucial dimension of semantic technologies is
the semantic enrichment and integration of sensor
data. Repetition of sensor measurement values of
real-world phenomena causes extraordinary expan-
sion of data volumes. Particularly, duplicated mea-
surement values impact on the size of datasets when
sensor data is represented using RDF and ontologies,
e.g., the Semantic Sensor Network (SSN) ontology.
RDF data compression is a possible solution for
managing large volumes of RDF data efficiently. Ho-
wever, compressed RDF data can only be processed
and queried by RDF query engines customized for
a specific compression tool, e.g., HDT (Fern
´
andez
et al., 2013) binary serialization format for storing and
exchange of RDF data, and the RDF-3X (Neumann
and Weikum, 2010) compression schema for RDF.
Further, data semantics or ontological knowledge are
not utilized by these compression techniques.
Inspired by existing work on factorized databa-
ses (Bakibayev et al., 2013), we propose the Factori-
zing Semantic Sensor Data (FSSD) approach to facto-
rize semantic sensor data. FSSD exploits ontological
knowledge to: i) generate a compact representation of
sensor observations where repeated measurement va-
lues are reduced, and ii) rewrite and execute queries
over factorized data.
We empirically evaluate the effectiveness of the
FSSD framework on a collection of semantic sensor
data of different sizes. These datasets contain ob-
servations of different climate phenomena during the
hurricane and blizzard seasons in the United States
in the years 2003, 2004, and 2005. The experimen-
tal results show that the proposed factorization met-
hod reduces the amount of semantic sensor data (i.e.,
number of RDF triples) by more than 50% without
loss of information. Moreover, observed results show
that query processing over factorized sensor data is
boosted by up to two orders of magnitude.
This paper comprises seven additional sections.
Using a real-world example, the need for efficient re-
presentation of semantic sensor data is motivated in
2. Preliminary definitions are presented in 3. We then
define our approach in 4 and 5. Results of our empiri-
cal evaluation are reported in 6. Existing approaches
are reviewed in 7. We conclude and present an out-
look to future work in 8.
252
Karim, F., Vidal, M-E. and Auer, S.
Efficient Processing of Semantically Represented Sensor Data.
DOI: 10.5220/0006287002520259
In Proceedings of the 13th International Conference on Web Information Systems and Technologies (WEBIST 2017), pages 252-259
ISBN: 978-989-758-246-2
Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
1 SELECT ? val , C O U NT (? val ) a s ? v C o u n t
2 WHERE {
3 ? o b s a : T e m p O bse r v a tio n ;
4 : result ? r e s .
5 ? r e s : f l oatV a l u e ? v a l .
6 } G R O U P BY ? val HAVING ? v C o u n t > 9
(a) Exemplar SPARQL query to calcu-
late the frequency distribution of tempera-
ture values repeated more than 9 times in
observations in 2003 MesoWest dataset.
(b) Frequency Distribution of Tem-
perature Values of Mesowest dataset. (c) Percentage of Repeated Values.
Figure 1: Motivating Example: (a) SPARQL query against year 2003 MesoWest dataset with temperature observations.
Prefixes are used as in https://www.w3.org/wiki/SRBench; (b) Frequency Distribution of temperature values from year 2003
Mesowest Linked Observation Data repeated more than 9 times observations; (c) Percentage of RDF triples per repeated
temperature values with respect to Linked Observations of temperature from year 2003 Mesowest Data.
2 MOTIVATING EXAMPLE
The MesoWest RDF datasets
1
comprise sensor lin-
ked data describing hurricane and blizzard observa-
tions in the United States. Observations include mea-
surements of different climate phenomena, e.g., tem-
perature, precipitation, wind speed, and humidity.
The SSN ontology is utilized to semantically describe
these observations. Together these datasets contain
almost two billion RDF triples comprehensively des-
cribing major storms in the United States since 2002.
The RDF dataset on the storm season in the year
2003 comprises 12,011,466 RDF triples about tem-
perature, and 1,193,345 observations. The SPARQL
query in Figure 1a calculates the frequency distribu-
tion of temperature values that are repeated more than
nine times in the temperature observations. Figure 1b
shows the results produced by the evaluation of this
query. The results reveal that 1,191,218 observations
out of 1,193,345 meet this condition. Additionally,
temperature values are repeated on average more than
1,100 times, and the temperature value 32 is the mode,
i.e., 32 is the most repeated value and is associated
with 24,632 observations. Also, each observation is
described using 11 RDF triples; Figure 1c illustrates
the percentage RDF triples repeated per temperature
value. As can be observed, some repeated tempera-
ture values are associated with up to 2.5% of the to-
tal number of temperature related RDF triples in the
whole dataset. Similar frequency distributions can be
observed for other climate phenomena, and corrobo-
rate the natural intuition that the number of phenome-
non distinct values is much smaller than the number
of observations.
We exploit these characteristics of semantic sen-
sor data, and propose a compact representation where
RDF triples of repeated measurement values are fac-
1
http://wiki.knoesis.org/index.php/LinkedSensorData
torized from the observations and included in the da-
taset only once. Unlike other compression techniques,
queries can be directly executed against a factorized
representation of semantic sensor data. Further, se-
mantics encoded in factorized representations can be
utilized to guide the query optimizer to generate query
plans able to speed up query processing.
3 PRELIMINARIES
The Semantic Sensor Network (SSN) Onto-
logy (Compton et al., 2012) developed by the
W3C Semantic Sensor Network Incubator Group
2
,
has been extensively used to semantically describe
sensor data (Henson et al., 2009; Gao et al., 2014; Ali
et al., 2015). SSN comprises the Skeleton and Data
modules, which provide RDF classes and properties
to describe sensor data in terms of observations,
feature of interest, observed property, measurement
values and units. Figure 2a depicts classes and pro-
perties in these modules; an RDF graph describing an
observation is presented in Figure 2b. Formally, an
RDF graph is defined as follows:
Definition 3.1 (RDF triple and RDF graph). (Arenas
et al., 2009) Let I, B, L be disjoint infinite sets of
URIs, blank nodes, and literals, respectively. A tri-
ple (s, p, o) (I B)×I×(I B L) is denomina-
ted an RDF triple, where s is called the subject, p the
predicate, and o the object. An RDF graph is a pair
G = (V, E), where V is a set of nodes in I B L,
and E is a set of RDF triples.
Example 3.1. Figure 3 presents an RDF graph that
corresponds to a portion of the RDF dataset from
the storm season in year 2003. Nodes correspond to
resources representing observations, measurements,
2
https://www.w3.org/2005/Incubator/ssn/
Efficient Processing of Semantically Represented Sensor Data
253
Data
ObservationValue
Skeleton
SensorOutput
SensorInput
Observation
Sensor
Property
FeatureOfInterest
Sensing
hasValue
isProducedBy
detects
observationResult
includesEvent
observedBy
isProxyFor
observedProperty
featureOfInterest
hasProperty
isPropertyOf
implements
observes
(a) Portion of the SSN Ontology.
:oMolecule
:obs1
:
hasObs
:mMolecule
:fahrenheit
:
uom
"96.0"^^:float
time1
:
samplingTime
:m1
:
result
:result
:
floatValue
(b) RDF graph with three molecules.
Figure 2: The SSN Ontology and RDF Molecule Example. (a) Concepts and relationships in Skeleton and Data Modules
from the SSN Ontology; (b) RDF graph with three subject molecules, each associated with two RDF triples.
and timestamps. Further, literals are also represented
as nodes in the RDF graph and properties are from a
variety of RDF vocabularies that include the Semantic
Sensor Network (SSN) Ontology. We ignore prefixes
and replace long URLs by short identifiers for clarity.
According to (Patni et al., 2010), an RDF graph
of semantically described observations models the
act of observing a real world phenomenon, and
produce a value of the observed property. Fi-
gure 3 presents an RDF graph with two observa-
tions, :obs1 and :obs2, from same sensor :TR197
and of same type :TempObservation and the obser-
ved property :AirTemp in different timestamps ts1
and ts2, having the same measurement value and unit:
"96.0"ˆˆ:float and :fahrenheit, respectively.
RDF graphs usually comprise entity description
sub-graphs, sometimes also referred to as Concise
Bounded Descriptions (CBD). These subgraphs are
named RDF subject molecules defined as follows:
Definition 3.2 (RDF Subject Molecule). (Fern
´
andez
et al., 2014) Given an RDF graph G, an RDF subject-
molecule M G is a set of triples t1, t2, ... , tn in
which subject(t1) = subject(t2) = ... = subject(tn).
Example 3.2. Figure 2b presents an RDF data set
with three RDF subject molecules, where the sub-
jects of the molecules are :obs1, :oMolecule and
:mMolecule, that describe resources in terms of their
properties. For simplicity, we will refer to RDF sub-
ject molecules as molecules in the rest of the paper.
4 THE FSSD APPROACH
We propose FSSD, a framework that relies on a de-
ductive database system to solve the problem of fac-
torizing semantic sensor data (FSSD). Figure 4 de-
picts the FSSD architecture. FSSD is composed of
two main components: the FSSD factorizer and the
the FSSD query engine. Given an RDF graph G
1
the
FSSD factorizer identifies a factorized RDF graph G
2
,
such that, G
1
`
F SS D
G
2
. Input RDF graphs are re-
presented as Datalog predicates using the FSSD data
model; a deductive system relies on a set of rules that
produce the factorized RDF graph G
2
. The FSSD
query engine is able to rewrite a SPARQL query Q
against the original RDF graph G
1
into a SPARQL
query Q
0
, and perform query processing techniques
to ensure that the answers of evaluating Q over G
1
and Q
0
over G
2
are the same, i.e., [[Q]]
G
1
= [[Q
0
]]
G
2
.
Optimization techniques are conducted to identify a
bushy query execution plan for Q
0
. Thus, the FSSD
query engine even evaluating Q
0
against the factori-
zed RDF graph G
2
generates the same answers as if
Q were evaluated over G
1
. However, as we will report
on our experimental study, query execution time can
be reduced by up to two orders of magnitude.
4.1 The FSSD Factorizer
The FSSD factorizer provides a solution to the pro-
blem of factorizing semantic sensor data (FSSD) in
RDF graph G
1
by generating a factorized graph G
2
.
Given an RDF graph G
1
, the FSSD factorizer relies
on the FSSD data model to represent G
1
as an exten-
sional database (EDB) in Datalog. The knowledge re-
quired to generate the factorized is encoded in a set of
intensional database (IDB) rules; a fixed point evalu-
ation of the IDB on EDB allows for the generation of
the instances of the FSSD data model that are used by
the factorized RDF graph G
2
. The factorized graph
creator receives the instances of the FSSD data mo-
del and generates the factorized RDF graph G
2
.
4.2 The FSSD Data Model
The FSSD factorizer resort to a deductive database sy-
stem to solve the problem of factorizing semantic sen-
sor data (FSSD). RDF triples in an input RDF graph
G
1
are formally represented as a Datalog extensional
facts (EDB) containing three arguments. For exam-
ple, the RDF triple (:obs1 :procedure :TR197) is for-
mally represented as the following EDB fact:
triple(: obs1,: procedure, : T R197)
The FSSD data model also provides predicates:
measurementMolecule() and observationMolecule(),
which represent RDF molecules of measurements and
WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies
254
:obs1
time1
:TR197
:Temp
Observation
:AirTemp
ts1
:m1
:Measure
Data
:fahrenheit
:result
"96.0"^^:float
:obs2
time2
:TR197
:Temp
Observation
:AirTemp
ts2
:m2
:Measure
Data
:fahrenheit
"96.0"^^:float
Figure 3: An Example of the Repeated Measurement Values: RDF graph G
1
has measurements, :m1 and :m2, and obser-
vations, :obs1 and :obs2, duplicated.
observations in a factorized RDF graph G
2
. For ex-
ample, an RDF molecule :mMolecule with value 96.0
and unit of metric Fahrenheit is modeled as follows:
measurementMolecule(:mMolecule,T,V,U), where
T : (:type,:MeasureData)
V : (:floatValue,“96.0”ˆˆ:float)
U: (:uom,:fahrenheit)
4.3 Deductive System Engine
The deductive system engine relies on a Datalog re-
presentation of triples in an RDF graph G
1
, as well as
on Datalog intentional rules 1-3 (Figure 5) to generate
a factorized RDF graph G
2
. Rule 1 (Figure 5a) creates
a measurement molecule for all the Datalog instanti-
ations of measurements in G
1
such that all the mea-
surements with same measurement unit and value and
are mapped to the one measurement molecule. Thus,
rule 1 ensures that repeated measurements are combi-
ned together in the form of a measurement molecule
in G
2
. Similarly, base and recursive cases of Rule 2
(Figure 5b and 5c) map all the Datalog instantiations
of observations, with same sensor, observed pheno-
menon, property, measurement value and unit, in G
1
to the one observation molecule in G
2
.
Rule 3 (Figure 5d) creates Datalog instantiations
of relationships between measurement and observa-
tion molecules. Following a semi-naive approach de-
ductive system engine computes Datalog representa-
tion of measurement and observation molecules in G
2
as a bottom-up evaluation of Rules 1-3 on Datalog re-
presentation of G
1
following a semi-naive algorithm
that stops when the least fixed-point is reached. Built-
in predicates are considered as EDB predicates that
instead of being physically stored, are implemented as
external programs and executed during the evaluation
of a Datalog program. To ensure the safety condition
of Datalog programs, built-in predicates are not eva-
luated until their input variables are bound (Ceri et al.,
1989). The Datalog inferred facts are given to the fac-
torized graph creator to generate the RDF graph G
2
.
4.4 The Factorized Graph Creator
Facts generated by the deductive system engine corre-
spond to the Datalog intentional predicates in the head
of Rules 1-3 (Figure 5), i.e., these predicates repre-
sent measurement and observation molecules and re-
lationship between them. The molecule creator com-
ponent of the factorized graph creator creates RDF
triples from these intentional predicates representing
measurement and observation molecules. The mole-
cule linker component of the factorized graph creator
derives RDF triples from the instances of the FSSD
data model, describing the relationships between the
measurement and the observation molecules, as well
as the rest of the RDF triples associated with them.
5 THE FSSD QUERY ENGINE
The FSSD query engine is described based on query
rewriting over factorized RDF graphs, as well as op-
timization and execution of the rewritten query.
5.1 The Query Rewriter and Optimizer
Given RDF graphs G
1
and G
2
, where G
1
`
F SS D
G
2
,
the FSSD query rewriter reformulates a SPARQL
query Q over G
1
into a SPARQL query Q
0
over G
2
according to σ, i.e., qr(Q,σ)=Q
0
. Then, query optimi-
zation is conducted to generate efficient query plans.
During query rewriting two BGPs, b
o
and b
m
, cor-
responding to observation and measurement molecu-
les, respectively, are generated against each BGP b in
Q
0
. Then query rewriter identifies a set of triple pat-
terns that both BGPs share, and generates three new
BGPs. One BGP b
0
includes shared triples patterns
and the other two BGPs, b
0
o
and b
0
m
are generated as
a result of eliminating shared triple patterns from b
o
and b
m
, respectively. Then, optimization techniques
are executed to identify star-shaped groups in all the
Efficient Processing of Semantically Represented Sensor Data
255
Factorized
Graph Creator
FSSD
Data Model
Deductive
System Engine
Molecule
Creator
Molecule
Linker
EDB IDB
Query
Rewriter
Rules
Query
Optimizer
Execution Engine
The FSSD Factorizer The FSSD Query Engine
Factorized SSD
G
2
Semantic Sensor
Data (SSD)
G
1
Answer
SPARQL
Query
Q
Figure 4: The FSSD Architecture Components: The FSSD Factorizer receives an RDF graph G
1
containing observations
with repeated measurement values, and generates a factorized RDF graph G
2
where repeated observations and measurements
are factorized. The FSSD Data Model internally represents RDF graphs as Datalog extensional facts (EDB); a set of Datalog
rules (IDB) represent the transformations to produce G
2
. A Deductive System Engine performs a fixed point to evaluate IDB
rules on from EDB facts. Inferred facts are transformed into RDF graphs by the RDF graph Creator. The FSSD Query Engine
allows for the execution of SPARQL queries against factorized RDF graphs and ensures efficient query processing.
BGPs, as well as to create corresponding bushy tree
plans.
5.2 The FSSD Execution Engine
We implemented the FSSD query engine on top of
the RDF triple store GH-RDF3X (Vidal et al., 2010)
to exploit the benefits of the plan produced by the
query rewriter. GH-RDF3X is an extension of RDF-
3X (Neumann and Weikum, 2010), a best-of-breed
RDF engine (Huang et al., 2011), to accept plans of
any shape and assign particular operators to join star-
shaped groups in a bushy plan. The features of RDF-
3X and GH-RDF3X are crucial to allow for efficient
executions of the reformulated queries; particularly,
the cache system implemented by RDF-3X is able to
load portions of data in resident memory, and thus,
speed up execution time in query plans that produce
small intermediate results.
6 EXPERIMENTAL STUDY
We empirically study the benefits of the proposed
factorization techniques for semantically represented
sensor data, and evaluate the impact on the size of the
factorized RDF graphs as well as on query execution
time. An extension of the best-of-breed RDF engine,
RDF-3X is utilized in the evaluation. We empirically
assessed the following research questions: RQ1) Are
the proposed factorization techniques able to reduce
redundancy of measurements and observations in sen-
sor data? RQ2) Can queries against factorized RDF
graphs speed up the execution time? RQ3) Is the per-
formance of queries against factorized RDF graphs
affected by the size of the factorized RDF graphs?
The experimental configuration to evaluate the rese-
arch questions mentioned above is as follows:
Datasets: We conduct the evaluation on three sen-
sor datasets
3
. These datasets comprise observations
of different climate phenomena e.g., temperature, vi-
sibility, precipitation, wind speed, and humidity, du-
ring the hurricane and blizzard seasons in the United
States in the years 2003, 2004, and 2005. Table 1 des-
cribes the main characteristics of these datasets.
Queries: SPARQL queries range from simple queries
with one triple pattern to complex queries having up
to fourteen triple patterns with UNION, OPTIONAL
and FILTER clauses, are used as baseline in our ex-
perimental testbed.
4
.
Metrics: We report on the following metrics:
a) Number of Triples (NT) in the semantic sensor
data collection. b) Query Execution Time (ET) is
the elapsed time between the submission of the query
to RDF-3X engine and the complete output of the ans-
wer. ET is measured as the real time produced by the
time command of the Linux operation system.
Implementation: Experiments were performed on
Linux Debian 8 machine with an CPU Intel I7 980X
3.3GHz with 32GB RAM 1333MHz DDR3. Queries
were run with cold and warm cache.
5
To run on warm
3
Available at: http://wiki.knoesis.org/index.php/
LinkedSensorData
4
Details can be found at
https://sites.google.com/site/fssdexperimets/
5
To run cold cache, we clear the cache before running
WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies
256
1 m ea su re me nt Mo le cu l e (: m M olec u le ,(: t ype ,: M e as ur e D a ta ) ,(: floa t Val u e , " 96 .0 " ˆ ˆ : flo a t ) ,(: uom ,: fah re n he it ) ) : -
2 tri p le (: m1 ,: type ,: Me a su r e D at a ) & tri p le (: m1 ,: f l oat V alu e ,"9 6 .0 "ˆ ˆ : flo a t ) &
3 tri p le (: m1 ,: uom ,: f a h r en h e i t ) & uri (: mMole c ule ,[ f loa t ( "9 6 . 0 "ˆ ˆ : fl o a t ) , l oc a ln am e (: fah r e n he it ) ] ) .
(a) Datalog Rule 1. A measurement molecule with URI :mMolecule.
1 o bs er va ti on Mo le cu l e (: o M olec u le ,(: proc e d ure ,: T R19 7 ) ,(: type ,: T e m p O bs er va ti o n ),
2 (: o bs e rv e dP r op e rt y ,: Ai r Te m p ) ,[(: h a sObs ,: obs1 ) ] ) : -
3 tri p le (: obs1 ,: p roce d ure ,: T R19 7 ) & t r ipl e (: obs1 ,: t ype ,: T e m p O b se rv at i o n ) &
4 tri p le (: obs1 ,: obs e rv e dP r op e rt y ,: A ir T em p ) & t r ip l e (: obs1 ,: result ,: m1 ) &
5 tri p le (: m1 ,: f l oat V alu e ,"9 6 .0 "ˆ ˆ : flo a t ) & t r ipl e (: m1 ,: uom ,: fa h re nh e it )& uri (: oM o lecu l e ,[ l o c a ln a me (:
obs1 ) .
(b) Base Case of Datalog Rule 2. An observation molecule with URI :oMolecule.
1 o bs er va ti on Mo le cu l e (: o M olec u le ,(: proc e d ure ,: T R19 7 ) ,(: type ,: T e m p O bs er va ti o n ),
2 (: o bs e rv e dP r op e rt y ,: Ai r Te m p ) , EM = [ (: hasObs , : obs1 ) ,(: hasOb s ,: o b s2 ) ]) : -
3 o b s e r v a t i o n M o l e c ul e (: o1 ,(: proce d ure ,: T R19 7 ) ,(: type ,: Te mp Ob se r v a t i o n ) ,
4 (: ob s er v ed P ro p er t y ,: Air T em p ) ,[(: hasObs ,: ob s 1 ) ]) & o b s e r v a t i o n M o l e c u le (: o2 ,(: p r oced u re ,: T R 197 ) ,
5 (: type , : T e m p O b s e rv at io n ) ,(: ob s er ve d Pr o pe r ty ,: Ai r Te m p ) ,[( ssd : has O b s ,: o bs2 ) ]) &
6 E M 1 =[ ( : hasObs ,: o b s 1 ) | EM1 ] , t r ipl e (: obs1 ,: result , : m1) & tr i ple (: m1 ,: uom ,: f ah r en he i t ) &
7 tri p le (: m1 ,: f l oat V alu e ,"9 6 .0 "ˆ ˆ : flo a t ) & E M 2 =[ ( : hasObs ,: o b s 2 ) | EM2 ] & tri p le (: obs2 ,: res u l t ,: m2 ) &
8 tri p le (: m2 ,: uom , : fah re n he it ) & tr i pl e (: m2 ,: floa t Val u e , " 96 .0 " ˆˆ : flo a t ) & : m 1 !=: m2 &
9 u r i (: o M olec u le ,[ l oc a l n am e (: o1 ) , l oc a ln a me (: o2 ) ]) & con c at ([(: hasObs ,: ob s 1 ) ] ,[(: h a s O b s ,: o bs2 ) ] , E M )
.
(c) Recursive Case of Datalog Rule 2. An observation molecule with URI :oMolecule.
1 tr i pl e (: o Mole c u le , : result , : m Mo le c ul e ) : -
2 m e a s u r e m e n t M o l e c ul e (: mM o lecu l e ,(: type ,: M e a su re D a t a ) ,(: f l oat V a lue ,"96 .0 " ˆ ˆ : flo a t ) ,
3 (: uom ,: fah r e n he it ) ) & o b s e r v a t i o nM ol ec ul e (: oM o lecu l e , (: p roc e d ure ,: T R19 7 ) ,
4 (: type , : T e m p O b s e rv at io n ) ,(: ob s er ve d Pr o pe r ty ,: Ai r Te m p ) , EM = [ ( : h a sObs ,: obs1 ) | E M 1 ] ,
5 tri p le (: obs1 ,: r e s u l t ,: m1 ) , t r ipl e (: m1 ,: uom ,: fa h re n h e it ) , t rip l e (:m1 ,: fl o atV a lue , " 9 6. 0 " ˆ ˆ: f l o at ) .
(d) Datalog Rule 3. Observation and Measurement Molecules are Linked.
Figure 5: Example of Datalog Rules for Molecule Creation and Linking: (a) Rule 1 creates a measurement molecule
:mMolecule with value and unit. (b) Rule 2 is the base case to create the observation molecule :oMolecule. (c) Recursive
case of Rule 2 combines observation molecules that comprise observation molecule :oMolecule, i.e., molecules such that all
observation molecules with same sensor, observed phenomenon, observed property, measurement unit, and value; (d) Rule 3
links observation and measurement molecules. Molecules of measurements and observations are identified with URIs.
cache, we execute the same query five times by drop-
ping the cache just before running the first iteration of
the query; thus, data temporally stored in cache du-
ring the execution of iteration i can be used in itera-
tion i + 1. The RDF triple store GH-RDF-3X
6
is used
to execute the SPARQL queries. GH-RDF3X is built
on top of RDF-3X version 0.3.4., and it is tailored to
execute star-shaped queries and bushy tree plans.
Effectiveness of the Semantic Sensor Data Facto-
rization For evaluating the effectiveness of the pro-
posed factorization techniques and answer research
each query by performing the command sh -c "sync ; echo
3 > /proc/sys/vm/drop caches".
6
https://github.com/gh-rdf3x/gh-rdf3x. Downloaded on
March 2016
questions RQ1, we execute the factorized decompo-
ser on datasets D1, D2, and D3. The effectiveness
of the factorization approach is studied in terms of
the reduction of RDF triples (NT). Table 1 shows the
number of RDF triples (NT) in datasets D1, D2, and
D3 before and after the factorization. The results de-
monstrate that the proposed factorization techniques
are capable of reducing the RDF triples by at least
53.24%. Moreover, the results report that the factori-
zed representation of sensor observations requires in
average a small number of RDF triples, i.e., five RDF
triples instead of ten, while preserving all the infor-
mation within the original RDF graph. These results
allows us to positively answer research question RQ1,
i.e., factorized RDF graphs effectively reduce redun-
Efficient Processing of Semantically Represented Sensor Data
257
Table 1: Effectiveness of the Semantic Sensor Data Fac-
torization (NT). Number of triples before and after factori-
zation along with savings. Percentage of savings increases
as size of the dataset (NT), while Avg. NT decreases.
Dataset NT NT Savings Avg. NT per Obs. Avg. NT per Obs.
ID before Factorization after Factorization in % before Factorization after Factorization
D1 38,054,493 17,795,661 53.24 9.29 4.35
D2 108,644,568 47,788,732 56.01 9.32 4.10
D3 179,128,407 78,421,471 56.22 9.31 4.08
dancy of measurements and observations.
Efficiency of Query Plans of Reformulated Queries
To answer research questions RQ2 and RQ3, we ana-
lyze the efficiency of generated query execution plans,
and the benefits of running these queries on cold and
warm caches. The original queries Q are compared to
reformulated queries Q’ with bushy query plans com-
posed of star-shaped groups. Original queries (Q) are
executed against the original datasets, while plans for
reformulated queries (Q’) are run against gradually in-
creasing factorized datasets. Figure 6 reports on the
query execution time (milliseconds. log-scale) with
cold cache, and the minimum value observed with
warm cache. In all cases bushy query plans exhi-
bit better performance whenever they are run on cold
and warm caches. This observation supports the state-
ment reported in the literature (Vidal et al., 2010) that
bushy tree plans composed of star-shaped groups pro-
duce small intermediate results which can be maintai-
ned in resident memory and re-used in further executi-
ons. Thus, the performance of optimized query plans
is considerable better with warm cache, overcoming
other executions by up to two orders of magnitude,
e.g., Q2. Q5 retrieves DISTINCT sensors against ob-
servation molecules, therefore, execution time in fac-
torized dataset is reduced by 5 orders of magnitudes.
Results also suggest that performance of optimized
query plans is not affected by the RDF graph size,
e.g., D1, D2, and D3 with 216,953,114 RDF triples.
These results allow us to positively answer research
questions RQ2 and RQ3.
7 RELATED WORK
In (Joshi et al., 2013) RDF data is compressed using
frequent pattern mining techniques. (Pan et al., 2014)
implements data summarizing and image compres-
sion techniques to minimize RDF data redundancy.
Similarly, Fernandez et al. (Fern
´
andez et al., 2013)
represents RDF triples as collection of data identifiers
instead of original subject, predicate, and object va-
lues. The compressed RDF data, generated by above
techniques, can only be queried by customized query
engines, and decompression techniques need to be
executed to execute any data management task. Con-
trary, we device factorization techniques to generate a
compact RDF representation, where query execution
can be performed directly and efficiently by exploi-
ting semantics encoded in the factorized representa-
tion to generate query execution plans.
Factorization techniques have have been utilized
for optimization of relational data and SQL query pro-
cessing by applying logical axioms of relational alge-
bra (Bakibayev et al., 2013; Bakibayev et al., 2012).
Queries can be executed in factorized relational data,
and efficient execution plans can be found to speed
up execution time. We build on these experimental
results and proposed factorization technique tailored
for semantically described sensor data.
8 CONCLUSIONS AND FUTURE
WORK
This paper presents factorization techniques for se-
mantic sensor data to reduce redundancy, ensure cor-
rectness of query answers, and preserve complexity
of query processing tasks. A set of Datalog rules pro-
vides the basis for a deductive system that allows for
the creation of factorized RDF graphs. Additionally,
these logical rules are utilized to transform SPARQL
queries against factorized RDF graphs, and to gene-
rate query plans that speed up query execution time.
We empirically evaluate the effectiveness of the pro-
posed factorizations techniques and results confirm
that exploiting semantics encoded in semantic sensor
data allows for reducing redundancy by up to 50%.
Our experiments confirm the efficiency of the query
plans generated by our proposed approach. In sum-
mary, factorization techniques provide a feasible so-
lution to the problem of reducing RDF redundancy.
In the future, we will focus on extending the propo-
sed factorization and query optimization techniques
to streaming RDF data. Finally, we plan to incorpo-
rate these factorization techniques as part of existing
RDF engines.
ACKNOWLEDGEMENTS
Farah Karim is supported by a scholarship of German
Academic Exchange Service (DAAD), and part of
this work is supported by the EU H2020 programme
for the project BigDataEurope (GA 644564).
WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies
258
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1
10
100
1000
10000
100000
1000000
10000000
D1
Original (Q) Factorized(Q')
ET (ms) Log-scale
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1
10
100
1000
10000
100000
1000000
10000000
D1+D2
Original(Q) Factorized(Q')
ET (ms) Log-scale
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1
10
100
1000
10000
100000
1000000
10000000
D1+D2+D3
Original(Q) Factorized(Q')
ET (ms) Log-scale
(a) Query Execution Time (ET) on Gradually Increasing Original and Factorized Semantic Sensor Data in Cold Cache.
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1
10
100
1000
10000
100000
1000000
10000000
D1
Original(Q) Factorized(Q')
ET (ms) Log-scale
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1
10
100
1000
10000
100000
1000000
10000000
D1+D2
Original(Q) Factorized(Q')
ET (ms) Log-scale
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1
10
100
1000
10000
100000
1000000
10000000
D1+D2+D3
Original(Q) Factorized(Q')
ET (ms) Log-scale
(b) Query Execution Time (ET) on Gradually Increasing Original and Factorized Semantic Sensor Data in Warm Cache.
Figure 6: Execution Time (ET in ms log-scale) of Testbed Queries (Q1-Q10) in GH-RDF3X. Original queries Q against
original RDF graphs of sensor data, and optimized queries Q’ are run in cold (First Row) and warm caches (Second Row).
Optimized query plans reduce execution time on factorized RDF graphs in cold and warm caches.
REFERENCES
Ali, M. I., Gao, F., and Mileo, A. (2015). Citybench: a
configurable benchmark to evaluate rsp engines using
smart city datasets. In International Semantic Web
Conference, pages 374–389. Springer.
Arenas, M., Gutierrez, C., and P
´
erez, J. (2009). Foundations
of rdf databases. In Reasoning Web. Semantic Techno-
logies for Information Systems, pages 158–204. Sprin-
ger.
Bakibayev, N., Kocisk
´
y, T., Olteanu, D., and Zavodny, J.
(2013). Aggregation and ordering in factorised data-
bases. PVLDB, 6(14):1990–2001.
Bakibayev, N., Olteanu, D., and Zavodny, J. (2012). FDB:
A query engine for factorised relational databases.
PVLDB, 5(11):1232–1243.
Ceri, S., Gottlob, G., and Tanca, L. (1989). What you al-
ways wanted to know about datalog (and never dared
to ask). IEEE Trans. Knowl. Data Eng., 1(1):146–166.
Compton, M., Barnaghi, P., Bermudez, L., Garc
´
ıA-Castro,
R., Corcho, O., Cox, S., Graybeal, J., Hauswirth, M.,
Henson, C., Herzog, A., et al. (2012). The ssn on-
tology of the w3c semantic sensor network incubator
group. Web Semantics: Science, Services and Agents
on the World Wide Web, 17:25–32.
Fern
´
andez, J. D., Llaves, A., and Corcho,
´
O. (2014). Ef-
ficient RDF interchange (ERI) format for RDF data
streams. In The Semantic Web - ISWC 2014 - 13th
International Semantic Web Conference, Riva del
Garda, Italy, October 19-23, 2014. Proceedings, Part
II, pages 244–259.
Fern
´
andez, J. D., Mart
´
ınez-Prieto, M. A., Guti
´
errez, C., Pol-
leres, A., and Arias, M. (2013). Binary RDF repre-
sentation for publication and exchange (HDT). J. Web
Sem., 19:22–41.
Gao, L., Bruenig, M., and Hunter, J. (2014). Semantic-
based detection of segment outliers and unusual
events for wireless sensor networks. arXiv preprint
arXiv:1411.2188.
Henson, C. A., Neuhaus, H., Sheth, A. P., Thirunarayan,
K., and Buyya, R. (2009). An ontological representa-
tion of time series observations on the semantic sensor
web.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable
SPARQL querying of large RDF graphs. PVLDB,
4(11):1123–1134.
Joshi, A. K., Hitzler, P., and Dong, G. (2013). Logical lin-
ked data compression. In 10th Extended Semantic Web
Conf. ESWC, pages 170–184.
Neumann, T. and Weikum, G. (2010). The RDF-3X en-
gine for scalable management of RDF data. VLDB J.,
19(1):91–113.
Pan, J. Z., G
´
omez-P
´
erez, J. M., Ren, Y., Wu, H., Wang,
H., and Zhu, M. (2014). Graph pattern based RDF
data compression. In 4th Joint Int. Conf. on Semantic
Technology (JIST).
Patni, H., Henson, C., and Sheth, A. (2010). Linked sen-
sor data. In Collaborative Technologies and Systems
(CTS), 2010 International Symposium on, pages 362–
370. IEEE.
Vidal, M., Ruckhaus, E., Lampo, T., Mart
´
ınez, A., Sierra, J.,
and Polleres, A. (2010). Efficiently joining group pat-
terns in SPARQL queries. In 7th Extended Semantic
Web Conf. (ESWC).
Efficient Processing of Semantically Represented Sensor Data
259