Efficient Processing of Semantically Represented Sensor Data

Farah Karim, Maria-Esther Vidal and Sören Auer

1,2

Enterprise Information Systems (EIS), University of Bonn, Germany

Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Germany

Keywords:

Sensor, Linked Data, Data Factorization, Query Optimization and Execution.

Abstract:

Large collections of sensor data are semantically described using ontologies, e.g., the Semantic Sensor Network (SSN) ontology. Semantic sensor data are RDF descriptions of sensor observations from related sampling frames or sensors at multiple points in time, e.g., climate sensor data. Sensor values can be repeated in a sampling frame, e.g., a particular temperature value can be repeated several times, resulting in a considerable increase in data volume. We devise a factorized compact representation of semantic sensor data using linked data technologies to reduce the repetition of identical sensor values, and propose algorithms to generate collections of factorized semantic sensor data that can be managed by existing RDF triple stores. We empirically study the effectiveness of the proposed factorized representation and show that the size of semantic sensor data is reduced by more than 50% on average without loss of information. Further, we evaluate the impact of this factorized representation on query execution. Results suggest that query optimizers can be empowered with semantics from factorized representations to generate query plans that effectively speed up query execution on factorized semantic sensor data.

1 INTRODUCTION

The increasing use of semantic technologies in industry, health care, and other domains has brought attention to scalability and performance improvements. One crucial dimension of semantic technologies is the semantic enrichment and integration of sensor data. Repetition of sensor measurement values of real-world phenomena causes extraordinary expansion of data volumes. In particular, duplicated measurement values inflate the size of datasets when sensor data is represented using RDF and ontologies, e.g., the Semantic Sensor Network (SSN) ontology.

RDF data compression is a possible solution for managing large volumes of RDF data efficiently. However, compressed RDF data can only be processed and queried by RDF query engines customized for a specific compression tool, e.g., the HDT (Fernández et al., 2013) binary serialization format for storing and exchanging RDF data, and the RDF-3X (Neumann and Weikum, 2010) compression schema for RDF. Further, data semantics or ontological knowledge are not utilized by these compression techniques.

Inspired by existing work on factorized databases (Bakibayev et al., 2013), we propose the Factorizing Semantic Sensor Data (FSSD) approach to factorize semantic sensor data. FSSD exploits ontological knowledge to: i) generate a compact representation of sensor observations where repeated measurement values are reduced, and ii) rewrite and execute queries over factorized data.

We empirically evaluate the effectiveness of the FSSD framework on a collection of semantic sensor datasets of different sizes. These datasets contain observations of different climate phenomena during the hurricane and blizzard seasons in the United States in the years 2003, 2004, and 2005. The experimental results show that the proposed factorization method reduces the amount of semantic sensor data (i.e., the number of RDF triples) by more than 50% without loss of information. Moreover, the observed results show that query processing over factorized sensor data is sped up by up to two orders of magnitude.

This paper comprises seven additional sections. Using a real-world example, the need for an efficient representation of semantic sensor data is motivated in Section 2. Preliminary definitions are presented in Section 3. We then define our approach in Sections 4 and 5. Results of our empirical evaluation are reported in Section 6. Existing approaches are reviewed in Section 7. We conclude and present an outlook to future work in Section 8.


Karim, F., Vidal, M-E. and Auer, S.

Efﬁcient Processing of Semantically Represented Sensor Data.

DOI: 10.5220/0006287002520259

In Proceedings of the 13th International Conference on Web Information Systems and Technologies (WEBIST 2017), pages 252-259

ISBN: 978-989-758-246-2

# Count how often each temperature value occurs and keep values seen more than 9 times.
SELECT ?val (COUNT(?val) AS ?vCount)
WHERE {
  ?obs a :TempObservation ;
       :result ?res .
  ?res :floatValue ?val .
}
GROUP BY ?val
HAVING (COUNT(?val) > 9)

(a) Exemplar SPARQL query to calculate the frequency distribution of temperature values repeated more than 9 times in observations in the 2003 MesoWest dataset.

(b) Frequency distribution of temperature values of the MesoWest dataset.

(c) Percentage of repeated values.

Figure 1: Motivating Example: (a) SPARQL query against the year 2003 MesoWest dataset with temperature observations. Prefixes are used as in https://www.w3.org/wiki/SRBench; (b) Frequency distribution of temperature values from the year 2003 MesoWest Linked Observation Data that are repeated in more than 9 observations; (c) Percentage of RDF triples per repeated temperature value with respect to the Linked Observations of temperature from the year 2003 MesoWest data.

2 MOTIVATING EXAMPLE

The MesoWest RDF datasets comprise sensor linked data describing hurricane and blizzard observations in the United States. Observations include measurements of different climate phenomena, e.g., temperature, precipitation, wind speed, and humidity. The SSN ontology is utilized to semantically describe these observations. Together these datasets contain almost two billion RDF triples comprehensively describing major storms in the United States since 2002.

The RDF dataset on the storm season in the year 2003 comprises 12,011,466 RDF triples about temperature, and 1,193,345 observations. The SPARQL query in Figure 1a calculates the frequency distribution of temperature values that are repeated more than nine times in the temperature observations. Figure 1b shows the results produced by the evaluation of this query. The results reveal that 1,191,218 out of 1,193,345 observations report a value that meets this condition. Additionally, temperature values are repeated on average more than 1,100 times, and the temperature value 32 is the mode, i.e., 32 is the most repeated value and is associated with 24,632 observations. Also, each observation is described using 11 RDF triples; Figure 1c illustrates the percentage of RDF triples repeated per temperature value. As can be observed, some repeated temperature values are associated with up to 2.5% of the total number of temperature-related RDF triples in the whole dataset. Similar frequency distributions can be observed for other climate phenomena, and corroborate the natural intuition that the number of distinct values of a phenomenon is much smaller than the number of observations.
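As a quick sanity check of the figures quoted above, the following Python sketch recomputes the share of temperature-related triples tied to the mode value. It assumes that every observation contributes exactly 11 triples, as stated in the text; the constant and function names are illustrative only.

```python
# Figures quoted in the text for the 2003 MesoWest temperature data.
TEMPERATURE_TRIPLES = 12_011_466   # RDF triples about temperature
TRIPLES_PER_OBSERVATION = 11       # triples describing one observation
MODE_OBSERVATIONS = 24_632         # observations reporting the value 32


def triple_share(observations: int) -> float:
    """Percentage of temperature triples tied to `observations` observations."""
    return 100.0 * observations * TRIPLES_PER_OBSERVATION / TEMPERATURE_TRIPLES


mode_share = triple_share(MODE_OBSERVATIONS)
print(f"{mode_share:.2f}% of the temperature triples describe the mode value")
```

Under these assumptions the mode alone accounts for a little over 2% of the temperature triples, which is in line with the range reported for Figure 1c.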

We exploit these characteristics of semantic sensor data, and propose a compact representation where RDF triples of repeated measurement values are factorized from the observations and included in the dataset only once. Unlike other compression techniques, queries can be directly executed against a factorized representation of semantic sensor data. Further, semantics encoded in factorized representations can be utilized to guide the query optimizer to generate query plans able to speed up query processing.

MesoWest datasets: http://wiki.knoesis.org/index.php/LinkedSensorData

3 PRELIMINARIES

The Semantic Sensor Network (SSN) Ontology (Compton et al., 2012), developed by the W3C Semantic Sensor Network Incubator Group, has been extensively used to semantically describe sensor data (Henson et al., 2009; Gao et al., 2014; Ali et al., 2015). SSN comprises the Skeleton and Data modules, which provide RDF classes and properties to describe sensor data in terms of observations, features of interest, observed properties, measurement values, and units. Figure 2a depicts classes and properties in these modules; an RDF graph describing an observation is presented in Figure 2b. Formally, an RDF graph is defined as follows:

Definition 3.1 (RDF triple and RDF graph). (Arenas et al., 2009) Let I, B, L be disjoint infinite sets of URIs, blank nodes, and literals, respectively. A triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple, where s is the subject, p the predicate, and o the object. An RDF graph is a pair G = (V, E), where V is a set of nodes in I ∪ B ∪ L, and E is a set of RDF triples.

Example 3.1. Figure 3 presents an RDF graph that corresponds to a portion of the RDF dataset from the storm season in the year 2003. Nodes correspond to resources representing observations, measurements, and timestamps. Further, literals are also represented as nodes in the RDF graph, and properties are from a variety of RDF vocabularies that include the Semantic Sensor Network (SSN) Ontology. We ignore prefixes and replace long URLs by short identifiers for clarity.

[Figure 2a: diagram with the module labels Data and Skeleton, the classes ObservationValue, SensorOutput, SensorInput, Observation, Sensor, Property, FeatureOfInterest, and Sensing, and the properties hasValue, isProducedBy, detects, observationResult, includesEvent, observedBy, isProxyFor, observedProperty, featureOfInterest, hasProperty, isPropertyOf, implements, and observes.]

(a) Portion of the SSN Ontology.

[Figure 2b: graph with the labels :oMolecule, :obs1, :mMolecule, :m1, :fahrenheit, time1, "96.0"^^:float, hasObs, result, samplingTime, uom, and floatValue.]

(b) RDF graph with three molecules.

Figure 2: The SSN Ontology and RDF Molecule Example. (a) Concepts and relationships in the Skeleton and Data modules of the SSN Ontology; (b) RDF graph with three subject molecules, each associated with two RDF triples.

SSN Incubator Group: https://www.w3.org/2005/Incubator/ssn/

According to Patni et al. (2010), an RDF graph of semantically described observations models the act of observing a real-world phenomenon, which produces a value of the observed property. Figure 3 presents an RDF graph with two observations, :obs1 and :obs2, from the same sensor :TR197, of the same type :TempObservation and observed property :AirTemp, at different timestamps ts1 and ts2, having the same measurement value and unit: "96.0"^^:float and :fahrenheit, respectively.

RDF graphs usually comprise entity description subgraphs, sometimes also referred to as Concise Bounded Descriptions (CBD). These subgraphs are named RDF subject molecules and are defined as follows:

Definition 3.2 (RDF Subject Molecule). (Fernández et al., 2014) Given an RDF graph G, an RDF subject molecule M ⊆ G is a set of triples t1, t2, ..., tn in which subject(t1) = subject(t2) = ... = subject(tn).

Example 3.2. Figure 2b presents an RDF dataset with three RDF subject molecules, where the subjects of the molecules are :obs1, :oMolecule and :mMolecule, that describe resources in terms of their properties. For simplicity, we will refer to RDF subject molecules as molecules in the rest of the paper.
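To make Definition 3.2 concrete, the following Python sketch groups a small set of triples by subject, so that each group is one subject molecule. The identifiers are taken from Figure 2b, but the exact edges of that figure cannot be read from the text, so the triples listed here are an assumption made for illustration.

```python
from collections import defaultdict

# Triples are plain (subject, predicate, object) tuples. The identifiers
# mirror Figure 2b; the concrete edges are assumed for illustration.
graph = [
    (":oMolecule", ":hasObs", ":obs1"),
    (":oMolecule", ":result", ":mMolecule"),
    (":obs1", ":samplingTime", ":time1"),
    (":obs1", ":result", ":m1"),
    (":mMolecule", ":floatValue", '"96.0"^^:float'),
    (":mMolecule", ":uom", ":fahrenheit"),
]


def subject_molecules(triples):
    """Group triples by subject; each group is one RDF subject molecule."""
    molecules = defaultdict(list)
    for s, p, o in triples:
        molecules[s].append((s, p, o))
    return dict(molecules)


molecules = subject_molecules(graph)
for subject, members in molecules.items():
    print(subject, len(members))
```

With this input the sketch yields three molecules of two triples each, matching the description in the caption of Figure 2.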

4 THE FSSD APPROACH

We propose FSSD, a framework that relies on a deductive database system to solve the problem of factorizing semantic sensor data (FSSD). Figure 4 depicts the FSSD architecture. FSSD is composed of two main components: the FSSD factorizer and the FSSD query engine. Given an RDF graph G, the FSSD factorizer identifies a factorized RDF graph G' such that G' is the FSSD factorization of G. Input RDF graphs are represented as Datalog predicates using the FSSD data model; a deductive system relies on a set of rules that produce the factorized RDF graph G'. The FSSD query engine is able to rewrite a SPARQL query Q against the original RDF graph G into a SPARQL query Q', and performs query processing techniques to ensure that the answers of evaluating Q over G and Q' over G' are the same, i.e., [[Q]]_G = [[Q']]_G'. Optimization techniques are conducted to identify a bushy query execution plan for Q'. Thus, even though the FSSD query engine evaluates Q' against the factorized RDF graph G', it generates the same answers as if Q were evaluated over G. However, as we report in our experimental study, query execution time can be reduced by up to two orders of magnitude.

4.1 The FSSD Factorizer

The FSSD factorizer provides a solution to the problem of factorizing semantic sensor data (FSSD) in an RDF graph G by generating a factorized graph G'. Given an RDF graph G, the FSSD factorizer relies on the FSSD data model to represent G as an extensional database (EDB) in Datalog. The knowledge required to generate the factorized graph is encoded in a set of intensional database (IDB) rules; a fixed-point evaluation of the IDB on the EDB allows for the generation of the instances of the FSSD data model that are used by the factorized RDF graph G'. The factorized graph creator receives the instances of the FSSD data model and generates the factorized RDF graph G'.

4.2 The FSSD Data Model

The FSSD factorizer resorts to a deductive database system to solve the problem of factorizing semantic sensor data (FSSD). RDF triples in an input RDF graph are formally represented as Datalog extensional (EDB) facts with three arguments. For example, the RDF triple (:obs1 :procedure :TR197) is formally represented as the following EDB fact:

triple(:obs1, :procedure, :TR197)
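The mapping from RDF triples to EDB facts can be sketched in a few lines of Python. This is only an illustration of the representation, not the authors' loader; the helper name `to_edb_fact` and the trailing period, which follows common Datalog syntax, are assumptions.

```python
def to_edb_fact(subject: str, predicate: str, obj: str) -> str:
    """Render one RDF triple as a three-argument Datalog fact."""
    return f"triple({subject}, {predicate}, {obj})."


# Triples describing observation :obs1 and its measurement :m1 (cf. Figure 3).
triples = [
    (":obs1", ":procedure", ":TR197"),
    (":obs1", ":result", ":m1"),
    (":m1", ":uom", ":fahrenheit"),
]

facts = [to_edb_fact(*t) for t in triples]
print("\n".join(facts))
```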

The FSSD data model also provides the predicates measurementMolecule() and observationMolecule(), which represent RDF molecules of measurements and observations in a factorized RDF graph G'. For example, an RDF molecule :mMolecule with value 96.0 and unit of measurement Fahrenheit is modeled as follows:

measurementMolecule(:mMolecule, T, V, U), where

• T: (:type, :MeasureData)

• V: (:floatValue, "96.0"^^:float)

• U: (:uom, :fahrenheit)

[Figure 3: RDF graph with the observations :obs1 and :obs2, each connected to :TR197, :TempObservation, :AirTemp, a timestamp (ts1 and ts2, respectively), and a measurement (:m1 and :m2, respectively); both measurements are of type :MeasureData with unit :fahrenheit and value "96.0"^^:float.]

Figure 3: An Example of Repeated Measurement Values: RDF graph G has measurements, :m1 and :m2, and observations, :obs1 and :obs2, duplicated.

4.3 Deductive System Engine

The deductive system engine relies on a Datalog representation of the triples in an RDF graph G, as well as on the Datalog intensional rules 1-3 (Figure 5), to generate a factorized RDF graph G'. Rule 1 (Figure 5a) creates a measurement molecule for all the Datalog instantiations of measurements in G such that all the measurements with the same measurement unit and value are mapped to one measurement molecule. Thus, Rule 1 ensures that repeated measurements are combined in the form of a single measurement molecule in G'. Similarly, the base and recursive cases of Rule 2 (Figures 5b and 5c) map all the Datalog instantiations of observations in G with the same sensor, observed phenomenon, property, measurement value, and unit to one observation molecule in G'.

Rule 3 (Figure 5d) creates Datalog instantiations of relationships between measurement and observation molecules. The deductive system engine computes the Datalog representation of the measurement and observation molecules in G' as a bottom-up evaluation of Rules 1-3 on the Datalog representation of G, following a semi-naive algorithm that stops when the least fixed point is reached. Built-in predicates are considered EDB predicates that, instead of being physically stored, are implemented as external programs and executed during the evaluation of a Datalog program. To ensure the safety condition of Datalog programs, built-in predicates are not evaluated until their input variables are bound (Ceri et al., 1989). The inferred Datalog facts are given to the factorized graph creator to generate the RDF graph G'.
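The effect of Rules 1-3 can be approximated outside a Datalog engine by grouping. The Python sketch below is not the authors' deductive system: it replaces the semi-naive fixed-point evaluation with a single pass over an in-memory index, and the function names and grouping keys are assumptions derived from the rule descriptions above.

```python
from collections import defaultdict

# Toy input mirroring Figure 3: two observations with identical measurements.
TRIPLES = [
    (":obs1", ":procedure", ":TR197"),
    (":obs1", ":type", ":TempObservation"),
    (":obs1", ":observedProperty", ":AirTemp"),
    (":obs1", ":result", ":m1"),
    (":obs2", ":procedure", ":TR197"),
    (":obs2", ":type", ":TempObservation"),
    (":obs2", ":observedProperty", ":AirTemp"),
    (":obs2", ":result", ":m2"),
    (":m1", ":type", ":MeasureData"),
    (":m1", ":floatValue", '"96.0"^^:float'),
    (":m1", ":uom", ":fahrenheit"),
    (":m2", ":type", ":MeasureData"),
    (":m2", ":floatValue", '"96.0"^^:float'),
    (":m2", ":uom", ":fahrenheit"),
]


def describe(triples):
    """Index triples as subject -> {predicate: object}."""
    index = defaultdict(dict)
    for s, p, o in triples:
        index[s][p] = o
    return dict(index)


def factorize(triples):
    """Group measurements and observations that share the same description."""
    desc = describe(triples)
    measurements = defaultdict(list)  # (value, unit) -> measurement URIs
    observations = defaultdict(list)  # (sensor, type, property, value, unit) -> observation URIs
    for subject, props in desc.items():
        if props.get(":type") == ":MeasureData":
            # Counterpart of Rule 1: same value and unit, same molecule.
            measurements[(props[":floatValue"], props[":uom"])].append(subject)
        elif ":result" in props:
            # Counterpart of Rule 2: same sensor, type, property, value, unit.
            m = desc[props[":result"]]
            key = (props[":procedure"], props[":type"],
                   props[":observedProperty"], m[":floatValue"], m[":uom"])
            observations[key].append(subject)
    # Counterpart of Rule 3: link each observation molecule to the
    # measurement molecule carrying its (value, unit) pair.
    links = {key: key[3:] for key in observations}
    return dict(measurements), dict(observations), links


measurement_molecules, observation_molecules, links = factorize(TRIPLES)
print(measurement_molecules)
print(observation_molecules)
```

On the toy input, :m1 and :m2 collapse into one measurement molecule and :obs1 and :obs2 into one observation molecule, which is the behaviour the rules are described as enforcing.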

4.4 The Factorized Graph Creator

Facts generated by the deductive system engine correspond to the Datalog intensional predicates in the heads of Rules 1-3 (Figure 5), i.e., these predicates represent measurement and observation molecules and the relationships between them. The molecule creator component of the factorized graph creator creates RDF triples from the intensional predicates representing measurement and observation molecules. The molecule linker component of the factorized graph creator derives RDF triples from the instances of the FSSD data model that describe the relationships between the measurement and the observation molecules, as well as the rest of the RDF triples associated with them.

5 THE FSSD QUERY ENGINE

We describe the FSSD query engine in terms of query rewriting over factorized RDF graphs, and the optimization and execution of the rewritten query.

5.1 The Query Rewriter and Optimizer

Given RDF graphs G and G', where G' is the FSSD factorization of G, the FSSD query rewriter reformulates a SPARQL query Q over G into a SPARQL query Q' over G' according to σ, i.e., qr(Q, σ) = Q'. Then, query optimization is conducted to generate efficient query plans.

During query rewriting, two BGPs, b_o and b_m, corresponding to observation and measurement molecules, respectively, are generated for each BGP b in Q. Then, the query rewriter identifies the set of triple patterns that both BGPs share, and generates three new BGPs. One BGP, b_s, includes the shared triple patterns, and the other two BGPs, b_o' and b_m', are generated as a result of eliminating the shared triple patterns from b_o and b_m, respectively. Then, optimization techniques are executed to identify star-shaped groups in all the BGPs, as well as to create corresponding bushy tree plans.

[Figure 4: architecture diagram. The FSSD Factorizer (FSSD Data Model with EDB and IDB rules, Deductive System Engine, and Factorized Graph Creator with Molecule Creator and Molecule Linker) transforms Semantic Sensor Data (SSD) into Factorized SSD. The FSSD Query Engine (Query Rewriter, Query Optimizer, and Execution Engine) receives a SPARQL query and returns the answer.]

Figure 4: The FSSD Architecture Components: The FSSD Factorizer receives an RDF graph G containing observations with repeated measurement values, and generates a factorized RDF graph G' where repeated observations and measurements are factorized. The FSSD Data Model internally represents RDF graphs as Datalog extensional facts (EDB); a set of Datalog rules (IDB) represents the transformations to produce G'. A Deductive System Engine performs a fixed-point evaluation of the IDB rules on the EDB facts. Inferred facts are transformed into RDF graphs by the Factorized Graph Creator. The FSSD Query Engine allows for the execution of SPARQL queries against factorized RDF graphs and ensures efficient query processing.
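The splitting of shared triple patterns and the detection of star-shaped groups can be sketched as follows. The two input BGPs are hypothetical, since the text does not list concrete rewritten queries, and grouping by subject is one common reading of "star-shaped group" rather than a confirmed detail of the FSSD optimizer.

```python
from collections import defaultdict


def split_shared(b_obs, b_meas):
    """Return (shared, observation-only, measurement-only) triple patterns."""
    shared = [tp for tp in b_obs if tp in b_meas]
    only_obs = [tp for tp in b_obs if tp not in shared]
    only_meas = [tp for tp in b_meas if tp not in shared]
    return shared, only_obs, only_meas


def star_shaped_groups(bgp):
    """Group triple patterns that share the same subject variable or URI."""
    groups = defaultdict(list)
    for subject, predicate, obj in bgp:
        groups[subject].append((subject, predicate, obj))
    return dict(groups)


# Hypothetical BGPs over observation and measurement molecules.
b_obs = [
    ("?om", ":type", ":TempObservation"),
    ("?om", ":result", "?mm"),
    ("?mm", ":floatValue", "?val"),
]
b_meas = [
    ("?mm", ":floatValue", "?val"),
    ("?mm", ":uom", ":fahrenheit"),
]

shared, only_obs, only_meas = split_shared(b_obs, b_meas)
stars = star_shaped_groups(shared + only_obs + only_meas)
for subject, patterns in stars.items():
    print(subject, patterns)
```

Each resulting group could then serve as a leaf of a bushy plan, with joins between groups performed on the variables they share (here ?mm).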

5.2 The FSSD Execution Engine

We implemented the FSSD query engine on top of the RDF triple store GH-RDF3X (Vidal et al., 2010) to exploit the benefits of the plan produced by the query rewriter. GH-RDF3X is an extension of RDF-3X (Neumann and Weikum, 2010), a best-of-breed RDF engine (Huang et al., 2011), that accepts plans of any shape and assigns particular operators to join star-shaped groups in a bushy plan. The features of RDF-3X and GH-RDF3X are crucial to allow for efficient executions of the reformulated queries; particularly, the cache system implemented by RDF-3X is able to load portions of data in resident memory, and thus speeds up execution time in query plans that produce small intermediate results.

6 EXPERIMENTAL STUDY

We empirically study the benefits of the proposed factorization techniques for semantically represented sensor data, and evaluate the impact on the size of the factorized RDF graphs as well as on query execution time. An extension of the best-of-breed RDF engine RDF-3X is utilized in the evaluation. We empirically assessed the following research questions: RQ1) Are the proposed factorization techniques able to reduce redundancy of measurements and observations in sensor data? RQ2) Can queries against factorized RDF graphs speed up the execution time? RQ3) Is the performance of queries against factorized RDF graphs affected by the size of the factorized RDF graphs? The experimental configuration to evaluate these research questions is as follows:

Datasets: We conduct the evaluation on three sensor datasets. These datasets comprise observations of different climate phenomena, e.g., temperature, visibility, precipitation, wind speed, and humidity, during the hurricane and blizzard seasons in the United States in the years 2003, 2004, and 2005. Table 1 describes the main characteristics of these datasets.

Queries: SPARQL queries ranging from simple queries with one triple pattern to complex queries having up to fourteen triple patterns with UNION, OPTIONAL, and FILTER clauses are used as the baseline in our experimental testbed.

Metrics: We report on the following metrics: a) Number of Triples (NT) in the semantic sensor data collection. b) Query Execution Time (ET), the elapsed time between the submission of the query to the RDF-3X engine and the complete output of the answer. ET is measured as the real time produced by the time command of the Linux operating system.

Implementation: Experiments were performed on a Linux Debian 8 machine with an Intel i7 980X 3.3GHz CPU and 32GB of 1333MHz DDR3 RAM. Queries were run with cold and warm cache. To run with warm cache, we execute the same query five times, dropping the cache just before running the first iteration of the query; thus, data temporarily stored in cache during the execution of iteration i can be used in iteration i + 1. The RDF triple store GH-RDF3X is used to execute the SPARQL queries. GH-RDF3X is built on top of RDF-3X version 0.3.4, and it is tailored to execute star-shaped queries and bushy tree plans.

Datasets available at: http://wiki.knoesis.org/index.php/LinkedSensorData

Details can be found at https://sites.google.com/site/fssdexperimets/

To run with cold cache, we clear the cache before running each query by performing the command sh -c "sync ; echo 3 > /proc/sys/vm/drop_caches".

GH-RDF3X: https://github.com/gh-rdf3x/gh-rdf3x. Downloaded in March 2016.

measurementMolecule(:mMolecule, (:type,:MeasureData), (:floatValue,"96.0"^^:float), (:uom,:fahrenheit)) :-
    triple(:m1,:type,:MeasureData) & triple(:m1,:floatValue,"96.0"^^:float) &
    triple(:m1,:uom,:fahrenheit) & uri(:mMolecule, [float("96.0"^^:float), localname(:fahrenheit)]).

(a) Datalog Rule 1. A measurement molecule with URI :mMolecule.

observationMolecule(:oMolecule, (:procedure,:TR197), (:type,:TempObservation),
        (:observedProperty,:AirTemp), [(:hasObs,:obs1)]) :-
    triple(:obs1,:procedure,:TR197) & triple(:obs1,:type,:TempObservation) &
    triple(:obs1,:observedProperty,:AirTemp) & triple(:obs1,:result,:m1) &
    triple(:m1,:floatValue,"96.0"^^:float) & triple(:m1,:uom,:fahrenheit) &
    uri(:oMolecule, [localname(:obs1)]).

(b) Base Case of Datalog Rule 2. An observation molecule with URI :oMolecule.

observationMolecule(:oMolecule, (:procedure,:TR197), (:type,:TempObservation),
        (:observedProperty,:AirTemp), EM=[(:hasObs,:obs1),(:hasObs,:obs2)]) :-
    observationMolecule(:o1, (:procedure,:TR197), (:type,:TempObservation),
        (:observedProperty,:AirTemp), [(:hasObs,:obs1)]) &
    observationMolecule(:o2, (:procedure,:TR197), (:type,:TempObservation),
        (:observedProperty,:AirTemp), [(:hasObs,:obs2)]) &
    EM1=[(:hasObs,:obs1)|EM1] & triple(:obs1,:result,:m1) & triple(:m1,:uom,:fahrenheit) &
    triple(:m1,:floatValue,"96.0"^^:float) & EM2=[(:hasObs,:obs2)|EM2] & triple(:obs2,:result,:m2) &
    triple(:m2,:uom,:fahrenheit) & triple(:m2,:floatValue,"96.0"^^:float) & :m1 != :m2 &
    uri(:oMolecule, [localname(:o1), localname(:o2)]) & concat([(:hasObs,:obs1)], [(:hasObs,:obs2)], EM).

(c) Recursive Case of Datalog Rule 2.

triple(:oMolecule, :result, :mMolecule) :-
    measurementMolecule(:mMolecule, (:type,:MeasureData), (:floatValue,"96.0"^^:float),
        (:uom,:fahrenheit)) &
    observationMolecule(:oMolecule, (:procedure,:TR197), (:type,:TempObservation),
        (:observedProperty,:AirTemp), EM=[(:hasObs,:obs1)|EM1]) &
    triple(:obs1,:result,:m1) & triple(:m1,:uom,:fahrenheit) & triple(:m1,:floatValue,"96.0"^^:float).

(d) Datalog Rule 3. Observation and Measurement Molecules are Linked.

Figure 5: Example of Datalog Rules for Molecule Creation and Linking: (a) Rule 1 creates a measurement molecule :mMolecule with value and unit. (b) Rule 2 is the base case to create the observation molecule :oMolecule. (c) The recursive case of Rule 2 combines the observation molecules that make up the observation molecule :oMolecule, i.e., all observation molecules with the same sensor, observed phenomenon, observed property, measurement unit, and value; (d) Rule 3 links observation and measurement molecules. Molecules of measurements and observations are identified with URIs.

Effectiveness of the Semantic Sensor Data Factorization: For evaluating the effectiveness of the proposed factorization techniques and answering research question RQ1, we execute the factorized decomposer on datasets D1, D2, and D3. The effectiveness

of the factorization approach is studied in terms of the reduction of RDF triples (NT). Table 1 shows the number of RDF triples (NT) in datasets D1, D2, and D3 before and after the factorization. The results demonstrate that the proposed factorization techniques are capable of reducing the number of RDF triples by at least 53.24%. Moreover, the results show that the factorized representation of sensor observations requires on average fewer RDF triples per observation, i.e., fewer than five RDF triples instead of more than nine, while preserving all the information within the original RDF graph. These results allow us to positively answer research question RQ1, i.e., factorized RDF graphs effectively reduce redundancy of measurements and observations.

Table 1: Effectiveness of the Semantic Sensor Data Factorization (NT). Number of triples before and after factorization along with savings. The percentage of savings increases with the size of the dataset (NT), while the average NT per observation after factorization decreases.

Dataset   NT before        NT after         Savings   Avg. NT per Obs.       Avg. NT per Obs.
ID        Factorization    Factorization    in %      before Factorization   after Factorization
D1        38,054,493       17,795,661       53.24     9.29                   4.35
D2        108,644,568      47,788,732       56.01     9.32                   4.10
D3        179,128,407      78,421,471       56.22     9.31                   4.08
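The savings column of Table 1 follows directly from the triple counts before and after factorization. The short Python sketch below recomputes it; the dictionary layout and function name are illustrative, while the numbers are those reported in the table.

```python
# (NT before factorization, NT after factorization) as reported in Table 1.
TABLE_1 = {
    "D1": (38_054_493, 17_795_661),
    "D2": (108_644_568, 47_788_732),
    "D3": (179_128_407, 78_421_471),
}


def savings_percent(before: int, after: int) -> float:
    """Relative reduction in the number of triples, in percent."""
    return 100.0 * (before - after) / before


savings = {name: round(savings_percent(before, after), 2)
           for name, (before, after) in TABLE_1.items()}
for name, value in savings.items():
    print(f"{name}: {value:.2f}% fewer triples after factorization")
```

The recomputed values agree with the reported savings of 53.24%, 56.01%, and 56.22%.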

Efficiency of Query Plans of Reformulated Queries: To answer research questions RQ2 and RQ3, we analyze the efficiency of the generated query execution plans, and the benefits of running these queries on cold and warm caches. The original queries Q are compared to reformulated queries Q' with bushy query plans composed of star-shaped groups. Original queries (Q) are executed against the original datasets, while plans for reformulated queries (Q') are run against gradually increasing factorized datasets. Figure 6 reports the query execution time (in milliseconds, log-scale) with cold cache, and the minimum value observed with warm cache. In all cases, bushy query plans exhibit better performance, whether they are run on cold or warm caches. This observation supports the statement reported in the literature (Vidal et al., 2010) that bushy tree plans composed of star-shaped groups produce small intermediate results which can be maintained in resident memory and reused in further executions. Thus, the performance of optimized query plans is considerably better with warm cache, outperforming other executions by up to two orders of magnitude, e.g., Q2. Q5 retrieves DISTINCT sensors against observation molecules; therefore, execution time on the factorized dataset is reduced by five orders of magnitude. Results also suggest that the performance of optimized query plans is not affected by the RDF graph size, e.g., D1, D2, and D3 with 216,953,114 RDF triples. These results allow us to positively answer research questions RQ2 and RQ3.

7 RELATED WORK

In (Joshi et al., 2013), RDF data is compressed using frequent pattern mining techniques. Pan et al. (2014) implement data summarization and image compression techniques to minimize RDF data redundancy. Similarly, Fernández et al. (2013) represent RDF triples as collections of data identifiers instead of the original subject, predicate, and object values. The compressed RDF data generated by the above techniques can only be queried by customized query engines, and decompression techniques need to be executed to perform any data management task. In contrast, we devise factorization techniques to generate a compact RDF representation, where query execution can be performed directly and efficiently by exploiting semantics encoded in the factorized representation to generate query execution plans.

Factorization techniques have been utilized for the optimization of relational data and SQL query processing by applying logical axioms of relational algebra (Bakibayev et al., 2013; Bakibayev et al., 2012). Queries can be executed on factorized relational data, and efficient execution plans can be found to speed up execution time. We build on these results and propose a factorization technique tailored for semantically described sensor data.

8 CONCLUSIONS AND FUTURE WORK

This paper presents factorization techniques for semantic sensor data to reduce redundancy, ensure correctness of query answers, and preserve the complexity of query processing tasks. A set of Datalog rules provides the basis for a deductive system that allows for the creation of factorized RDF graphs. Additionally, these logical rules are utilized to transform SPARQL queries into queries against factorized RDF graphs, and to generate query plans that speed up query execution. We empirically evaluate the effectiveness of the proposed factorization techniques, and the results confirm that exploiting semantics encoded in semantic sensor data allows for reducing redundancy by more than 50%. Our experiments confirm the efficiency of the query plans generated by our proposed approach. In summary, factorization techniques provide a feasible solution to the problem of reducing RDF redundancy. In the future, we will focus on extending the proposed factorization and query optimization techniques to streaming RDF data. Finally, we plan to incorporate these factorization techniques as part of existing RDF engines.

ACKNOWLEDGEMENTS

Farah Karim is supported by a scholarship from the German Academic Exchange Service (DAAD), and part of this work is supported by the EU H2020 programme for the project BigDataEurope (GA 644564).

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies


[Figure 6 plots: two rows of three bar charts, each showing ET (ms), log-scale from 100 to 10000000, for queries Q1-Q10, comparing Original (Q) with Factorized (Q'). The second and third panels in each row are titled D1+D2 and D1+D2+D3.]

(a) Query Execution Time (ET) on Gradually Increasing Original and Factorized Semantic Sensor Data in Cold Cache.

(b) Query Execution Time (ET) on Gradually Increasing Original and Factorized Semantic Sensor Data in Warm Cache.

Figure 6: Execution Time (ET in ms, log-scale) of Testbed Queries (Q1-Q10) in GH-RDF3X. Original queries Q are run against original RDF graphs of sensor data, and optimized queries Q' against factorized RDF graphs, in cold (first row) and warm (second row) caches. Optimized query plans reduce execution time on factorized RDF graphs in cold and warm caches.

REFERENCES

Ali, M. I., Gao, F., and Mileo, A. (2015). CityBench: A configurable benchmark to evaluate RSP engines using smart city datasets. In International Semantic Web Conference, pages 374–389. Springer.

Arenas, M., Gutierrez, C., and Pérez, J. (2009). Foundations of RDF databases. In Reasoning Web. Semantic Technologies for Information Systems, pages 158–204. Springer.

Bakibayev, N., Kočiský, T., Olteanu, D., and Zavodny, J. (2013). Aggregation and ordering in factorised databases. PVLDB, 6(14):1990–2001.

Bakibayev, N., Olteanu, D., and Zavodny, J. (2012). FDB:

A query engine for factorised relational databases.

PVLDB, 5(11):1232–1243.

Ceri, S., Gottlob, G., and Tanca, L. (1989). What you always wanted to know about Datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng., 1(1):146–166.

Compton, M., Barnaghi, P., Bermudez, L., García-Castro, R., Corcho, O., Cox, S., Graybeal, J., Hauswirth, M., Henson, C., Herzog, A., et al. (2012). The SSN ontology of the W3C semantic sensor network incubator group. Web Semantics: Science, Services and Agents on the World Wide Web, 17:25–32.

Fernández, J. D., Llaves, A., and Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II, pages 244–259.

Fernández, J. D., Martínez-Prieto, M. A., Gutiérrez, C., Polleres, A., and Arias, M. (2013). Binary RDF representation for publication and exchange (HDT). J. Web Sem., 19:22–41.

Gao, L., Bruenig, M., and Hunter, J. (2014). Semantic-

based detection of segment outliers and unusual

events for wireless sensor networks. arXiv preprint

arXiv:1411.2188.

Henson, C. A., Neuhaus, H., Sheth, A. P., Thirunarayan, K., and Buyya, R. (2009). An ontological representation of time series observations on the semantic sensor web.

Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable

SPARQL querying of large RDF graphs. PVLDB,

4(11):1123–1134.

Joshi, A. K., Hitzler, P., and Dong, G. (2013). Logical linked data compression. In 10th Extended Semantic Web Conf. ESWC, pages 170–184.

Neumann, T. and Weikum, G. (2010). The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113.

Pan, J. Z., Gómez-Pérez, J. M., Ren, Y., Wu, H., Wang, H., and Zhu, M. (2014). Graph pattern based RDF data compression. In 4th Joint Int. Conf. on Semantic Technology (JIST).

Patni, H., Henson, C., and Sheth, A. (2010). Linked sensor data. In Collaborative Technologies and Systems (CTS), 2010 International Symposium on, pages 362–370. IEEE.

Vidal, M., Ruckhaus, E., Lampo, T., Martínez, A., Sierra, J., and Polleres, A. (2010). Efficiently joining group patterns in SPARQL queries. In 7th Extended Semantic Web Conf. (ESWC).
