Listing 1: Dailymed Sample Triples.
dailymeddrug:82 a dailymed:drug
dailymeddrug:82 dailymed:activeingredient dailymeding:Phenytoin
dailymeddrug:82 rdfs:label "Dilantin-125 (Suspension)"
dailymeddrug:201 a dailymed:drug
dailymeddrug:201 dailymed:activeingredient dailymeding:Ethosuximide
dailymeddrug:201 rdfs:label "Zarontin (Capsule)"
dailymedorg:Parke-Davis a dailymed:organization
dailymedorg:Parke-Davis rdfs:label "Parke-Davis"
dailymedorg:Parke-Davis dailymed:producesDrug dailymeddrug:82
dailymedorg:Parke-Davis dailymed:producesDrug dailymeddrug:201
dailymeding:Phenytoin a dailymed:ingredients
dailymeding:Phenytoin rdfs:label "Phenytoin"
dailymeding:Ethosuximide a dailymed:ingredients
dailymeding:Ethosuximide rdfs:label "Ethosuximide"
the Operating System and Fuseki 1.0 as the SPARQL
Endpoint server. For each dataset, we set up Fuseki
on different ports. We re-used the query set from our
previous work (Rakhmawati and Hausenblas, 2012).
We limited the query processing duration to one hour.
Each query was executed three times on two federation engines, namely SPLENDID (Görlitz and Staab, 2011) and DARQ (Quilitz and Leser, 2008). These
engines were chosen because SPLENDID employs VoID (http://www.w3.org/TR/void/) as a data catalogue that contains a list of predicates and entities, whereas DARQ stores its list of predicates in a SPARQL Service Description (http://www.w3.org/TR/sparql11-service-description/). Apart from using VoID, SPLENDID also sends SPARQL ASK queries to determine whether a source can potentially return part of the answer. We explain the details of our dataset generation and metrics as follows:
4.1 Data Distribution
To determine the correlation between the commu-
nication cost of the federated SPARQL query and
the data distribution, we generate nine datasets by dividing the Dailymed dataset (http://wifo5-03.informatik.uni-mannheim.de/dailymed/) into three partitions based on the following strategies:
4.1.1 Graph Partition
Inspired by data clustering for a single RDF storage
(Huang et al., 2011), we performed graph partition
over our dataset by using METIS (Karypis and Ku-
mar, 1998). The aim of this partition scheme is to
reduce the communication needed between machines
during the query execution process by storing the con-
nected components of the graph in the same machine.
We initially identify connections between the subjects and objects of different triples, considering only URI objects that also appear as subjects of other triples. Intuitively, such an object creates a connection whenever its triples are located in different dataset partitions.
V(D) denotes the set of pairs of subject and object that are connected in the dataset D, which can be formally specified as V(D) = {(s, o) | ∃ p, p′, o′ ∈ U : (s, p, o) ∈ D ∧ (o, p′, o′) ∈ D}. We assign a numeric identifier
for each s, o ∈ V(D). After that, we create a list of sequential adjacent vertexes for each vertex and use it as input to the METIS API. We then run METIS to divide the vertexes, obtaining as output the partition number of each vertex. Finally, we distribute each triple based on the partition numbers of its subject and object. Consider an example: given Listing 1 as a dataset sample, then
V(D) = {(dailymeddrug:82, dailymeding:Phenytoin), (dailymeddrug:201, dailymeding:Ethosuximide), (dailymedorg:Parke-Davis, dailymeddrug:82), (dailymedorg:Parke-Davis, dailymeddrug:201)}
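Under the definition above, V(D) can be computed in a single pass over the triples: collect all subjects, then keep each (s, o) pair whose object also occurs as a subject. A minimal sketch in Python (prefixed names stand in for full URIs; the triple data is an abridged version of Listing 1):

```python
# Sketch: compute V(D), the set of (subject, object) pairs whose URI
# object is itself the subject of some other triple.
# Prefixed names stand in for full URIs; data follows Listing 1.
triples = [
    ("dailymeddrug:82", "a", "dailymed:drug"),
    ("dailymeddrug:82", "dailymed:activeingredient", "dailymeding:Phenytoin"),
    ("dailymeddrug:201", "a", "dailymed:drug"),
    ("dailymeddrug:201", "dailymed:activeingredient", "dailymeding:Ethosuximide"),
    ("dailymedorg:Parke-Davis", "a", "dailymed:organization"),
    ("dailymedorg:Parke-Davis", "dailymed:producesDrug", "dailymeddrug:82"),
    ("dailymedorg:Parke-Davis", "dailymed:producesDrug", "dailymeddrug:201"),
    ("dailymeding:Phenytoin", "a", "dailymed:ingredients"),
    ("dailymeding:Ethosuximide", "a", "dailymed:ingredients"),
]

subjects = {s for s, _, _ in triples}
# Keep (s, o) only when o also appears as a subject elsewhere; literal
# objects (e.g. rdfs:label strings) never pass this membership test.
v_d = {(s, o) for s, p, o in triples if o in subjects}
```

Note that the membership test against the subject set also filters out literal objects automatically, which matches the restriction to URI objects above.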
Starting the identifier value from one and incrementing it, we set dailymeddrug:82 = 1, dailymeding:Phenytoin = 2, dailymeddrug:201 = 3, dailymeding:Ethosuximide = 4 and dailymedorg:Parke-Davis = 5. After that, we can create the list of sequential adjacent vertexes, which for V(D) is {(2,5), 1, (4,5), 3, (1,3)}, where the i-th entry lists the neighbours of vertex i. Suppose that we divide the sample dataset into two partitions; then the output of METIS is {1, 1, 2, 2, 1}, where each value is
the partition number for each vertex. According to the
METIS output, we can say that dailymeddrug:82 be-
longs to partition 1, dailymeding:Phenytoin belongs
to partition 1, dailymeddrug:201 belongs to partition
2 and so on. In the end, we have two following parti-
tions:
Partition 1: all triples that contain dailymeddrug:82, daily-
meding:Phenytoin and dailymedorg:Parke-Davis
Partition 2: all triples that contain dailymeddrug:201 and
dailymeding:Ethosuximide
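The steps above can be sketched as follows; the METIS call itself is external (a C library), so its output vector for this example is assumed here rather than recomputed, and the helper name is illustrative:

```python
# Sketch: number the vertexes of V(D), build the undirected adjacency
# structure METIS consumes, and look up the partition of a URI.
# The partition vector mirrors the METIS output quoted in the text.
ids = {"dailymeddrug:82": 1, "dailymeding:Phenytoin": 2,
       "dailymeddrug:201": 3, "dailymeding:Ethosuximide": 4,
       "dailymedorg:Parke-Davis": 5}
edges = [(1, 2), (3, 4), (5, 1), (5, 3)]   # numeric form of V(D)

# Adjacency list, one entry per vertex (the input format for METIS).
adj = {v: set() for v in ids.values()}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Partition vector as METIS returns it for 2 partitions:
# vertex i (1-based) -> partition number.
part = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}

def partition_of(term):
    """Partition number for a URI, or None for terms outside V(D)."""
    return part.get(ids.get(term))
```

Each triple is then shipped to the partition assigned to its subject (and, for connecting triples, its object), which is how the two partitions listed above are obtained.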
4.1.2 Entity Partition
The goal of this partition is to distribute the number of
entities evenly in each partition. Different classes can
be located in a single partition. However, the entities
of the same class should be grouped in the same parti-
tion until the number of entities reaches the maximum
number of entities for each source. We initially create a list of the subjects along with their classes, E(D). The set E(D) of pairs of a subject and its class in the dataset D is defined as E(D) = {(s, o) | (s, rdf:type, o) ∈ D}.
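The grouping described above can be sketched as follows, assuming a fixed per-source capacity; the capacity of two entities per source and the E(D) pairs (taken from Listing 1) are illustrative:

```python
# Sketch: group subjects by class using E(D), then fill partitions so
# that entities of the same class stay together until a partition
# reaches the maximum number of entities per source.
from collections import defaultdict

# E(D): (subject, class) pairs extracted from rdf:type triples.
typed = [
    ("dailymeddrug:82", "dailymed:drug"),
    ("dailymeddrug:201", "dailymed:drug"),
    ("dailymeding:Phenytoin", "dailymed:ingredients"),
    ("dailymeding:Ethosuximide", "dailymed:ingredients"),
    ("dailymedorg:Parke-Davis", "dailymed:organization"),
]

CAPACITY = 2  # illustrative maximum number of entities per source

by_class = defaultdict(list)
for s, c in typed:
    by_class[c].append(s)

# Fill partitions class by class; a new class may share a partition
# that still has room, so classes are kept together but not isolated.
partitions, current = [], []
for cls, entities in by_class.items():
    for e in entities:
        if len(current) == CAPACITY:
            partitions.append(current)
            current = []
        current.append(e)
if current:
    partitions.append(current)
```

With this capacity the sample yields three partitions of at most two entities each, and a class smaller than the remaining capacity would share a partition with another class, as permitted above.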
WEBIST 2014 - International Conference on Web Information Systems and Technologies