why these tools could take advantage of ontology
matching and alignments.
The main problem on the Web of data is to create
links between entities of different datasets. Most of-
ten, this consists of identifying the same entity across
different datasets and publishing a link between them
as an owl:sameAs statement. We call this task data
interlinking and summarize it in Figure 1.
[Figure 1: The data interlinking problem (URI1 and URI2 related by an owl:sameAs link).]
Once identified, the links discovered between two
datasets must also be published in order to be reused.
The VoID vocabulary (Alexander et al., 2009) allows
for describing linksets as special datasets containing
sets of links between resources of two given datasets.
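As an illustration of what publishing a linkset can look like, the following minimal sketch serializes a set of owl:sameAs links together with a short VoID description as N-Triples lines. The function names and the example.org URIs are hypothetical and chosen only for illustration; the VoID terms used (void:Linkset, void:linkPredicate, void:target) are taken from the VoID vocabulary, but the sketch does not claim to be a complete or validated VoID description.

```python
# Namespaces of the vocabularies used in the description.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID = "http://rdfs.org/ns/void#"


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_linkset(linkset_uri, source_dataset, target_dataset, links):
    """Return N-Triples lines describing a linkset and listing its links.

    `links` is an iterable of (uri1, uri2) pairs judged to denote
    the same real-world entity.
    """
    lines = [
        triple(linkset_uri, RDF_TYPE, VOID + "Linkset"),
        triple(linkset_uri, VOID + "linkPredicate", OWL_SAME_AS),
        triple(linkset_uri, VOID + "target", source_dataset),
        triple(linkset_uri, VOID + "target", target_dataset),
    ]
    lines += [triple(a, OWL_SAME_AS, b) for a, b in links]
    return lines


if __name__ == "__main__":
    # Hypothetical dataset and resource URIs.
    for line in describe_linkset(
        "http://example.org/linkset",
        "http://example.org/dataset-a",
        "http://example.org/dataset-b",
        [("http://example.org/a/bach", "http://example.org/b/js_bach")],
    ):
        print(line)
```

Keeping the description and the links in the same serialization is only one possible choice; the links could equally be stored separately from the metadata that describes them.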
Once linksets are constructed, two approaches have been
proposed to retrieve equivalences between resources.
The first assigns to each real-world entity a
global identifier that is then related to every URI
describing this entity. This is the approach taken in
the OKKAM project (Bouquet et al., 2008) that pro-
poses the usage of Entity Name Servers taking the
role of resource name repositories. The other ap-
proach uses equivalence lists maintained with inter-
linked resources across datasets. There is thus no
global identifier in this approach but equivalence links
can be followed using a third-party Web service, e.g.,
http://sameas.org, or a bilateral protocol (Volz et al.,
2009).
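Both approaches rely on grouping the URIs that denote the same entity. The sketch below, given purely as an illustration, computes such groups from a list of sameAs pairs with a union-find structure. It does not reproduce the OKKAM Entity Name Servers or the sameas.org service; the function name and the input format are assumptions.

```python
def equivalence_classes(same_as_links):
    """Group URIs connected, directly or transitively, by sameAs links.

    `same_as_links` is an iterable of (uri1, uri2) pairs.
    Returns a list of sets, one per group of equivalent URIs.
    """
    parent = {}

    def find(uri):
        # Register unseen URIs as their own representative.
        parent.setdefault(uri, uri)
        # Walk up to the representative, halving the path on the way.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    classes = {}
    for uri in list(parent):
        classes.setdefault(find(uri), set()).add(uri)
    return list(classes.values())
```

Under the first approach, each resulting group could be given one global identifier; under the second, each group can be stored as an equivalence list and consulted when a link has to be followed.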
The data interlinking task can be achieved manu-
ally or with the help of data interlinking tools. These
tools take as input two datasets and ultimately provide
a linkset. In addition, they use what we call a linking
specification, i.e., a “script” specifying how and/or
what to link. Indeed, given dataset sizes, the search
space for resource interlinking can reach many
billions of resources, e.g., for DBpedia. It is thus necessary to
use heuristics that tell the interlinking system
where to look for the corresponding resources in the
two datasets. These linking specifications can be spe-
cific to a pair of datasets and can be reused for re-
generating linksets (we provide an example of such a
specification in the Silk language in Section 4).
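One common way of reducing the search space is to compare only resources that share a cheap-to-compute key, a technique usually called blocking. The following sketch is a generic illustration of this kind of heuristic and is not written in the Silk language; the resource representation (dictionaries with a "uri" entry) and the `blocking_key` parameter are assumptions.

```python
from collections import defaultdict


def candidate_pairs(source, target, blocking_key):
    """Yield (source_uri, target_uri) pairs that share a blocking key.

    Only resources falling in the same block are paired, so the full
    cross product of the two datasets is never enumerated.
    """
    blocks = defaultdict(list)
    for resource in target:
        blocks[blocking_key(resource)].append(resource)
    for resource in source:
        # .get() avoids creating empty blocks for unmatched keys.
        for candidate in blocks.get(blocking_key(resource), []):
            yield resource["uri"], candidate["uri"]


if __name__ == "__main__":
    # Hypothetical resources, blocked on year of birth.
    source = [{"uri": "a:1", "born": 1685}, {"uri": "a:2", "born": 1756}]
    target = [{"uri": "b:1", "born": 1685}, {"uri": "b:2", "born": 1770}]
    print(list(candidate_pairs(source, target, lambda r: r["born"])))
```

The choice of key is a trade-off: a coarse key leaves many pairs to compare, while an overly strict one can discard true matches before any similarity is computed.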
Mining for similar resources in two Web datasets
raises many problems. Since each dataset has its own
namespace, resources in different datasets are given
different URIs. Also, although naming conventions
exist, there is no formal or standard way of naming
resources. For example, the URIs for the famous
musician Johann Sebastian Bach in various Web
datasets are very different, even though
they all represent the same real-world object.
Fortunately, dereferencing URIs can be used for
retrieving more information about entities: property
values and related resources can be observed. But
for the same real-world entity, the same property can
take different values, making the interlinking process
more difficult. This can be because of varying value
approximations across datasets, because of different
units of measure, because of mistakes in the datasets,
or because of loose ontological specifications. For
instance, the property foaf:name does not specify in
what format the name should be given. “J.S. Bach”,
“Bach, J.S.” or “Johann Sebastian Bach” are possible
values for this property. Hence, data interlinking tools
have to compare property values in order to decide whether
two entities are the same and must therefore be linked.
For that purpose, tools use similarity measures based
on the type of values (e.g., string, numbers, dates) and
aggregate the results of these measures. This activity
is reminiscent of record linkage, which has received
considerable attention in the database community (Fellegi and Sunter,
1969; Winkler, 2006; Elmagarmid et al., 2007).
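To make the comparison and aggregation step concrete, the following sketch computes one similarity per value type and combines them with a weighted average. It is only an illustration under stated assumptions: the token-sorting normalization, the use of Python's difflib.SequenceMatcher as string measure, the tolerance on years and the weights are all arbitrary choices, not those of any particular interlinking tool.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, drop punctuation and sort tokens, so that
    'J.S. Bach' and 'Bach, J.S.' yield the same string."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(sorted(tokens))


def string_similarity(a, b):
    """Similarity in [0, 1] between two normalized name strings."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def year_similarity(a, b, tolerance=5):
    """Similarity in [0, 1], decreasing linearly with the gap in years."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def aggregate(scores, weights):
    """Weighted average of per-property similarity scores."""
    total = sum(weights.values())
    return sum(scores[key] * weights[key] for key in weights) / total
```

A tool would then compare the aggregated score with a threshold to decide whether to emit a link. Abbreviated forms such as “J.S. Bach” still differ from “Johann Sebastian Bach” after normalization, which is why a graded measure is used here instead of strict equality.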
Another problem is caused by the usage of het-
erogeneous ontologies for describing datasets. In this
case, the same resource is typed according to different
classes and described with different RDF predicates
belonging to different ontologies. For example,
a name in a dataset can be attributed using the
foaf:name data property from the FOAF ontology
while it is attributed using the vcard:N object prop-
erty from the VCard ontology in another dataset.
Hence, for the interlinking techniques to work, it
is necessary that the datasets use the same ontology
or that data interlinking tools are aware of the corre-
spondences between ontologies.
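The role of such correspondences can be sketched as follows: before values are compared, each description is brought to a common form using a table that says how a name is obtained from each predicate. This is a deliberately simplified illustration and not an alignment format. Descriptions are modelled as plain dictionaries, and the nested keys "given-name" and "family-name" used for the value of vcard:N are assumptions made for the example.

```python
# Hypothetical correspondence table: for each predicate that can carry
# a name, a function turning its value into a plain string.
ALIGNMENT = {
    # Data property: the value is already a string.
    "foaf:name": lambda value: value,
    # Object property: the value is a structured node (assumed keys).
    "vcard:N": lambda value: f'{value["given-name"]} {value["family-name"]}',
}


def extract_name(description):
    """Return the name carried by a resource description, or None.

    `description` maps predicate names to values, whichever of the
    two vocabularies the dataset happens to use.
    """
    for predicate, convert in ALIGNMENT.items():
        if predicate in description:
            return convert(description[predicate])
    return None
```

With such a table, a resource described with foaf:name and another described with vcard:N can be handed to the same similarity measure, which is precisely what a tool unaware of the correspondence cannot do.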
The goal of this paper is to investigate the relation-
ships between data interlinking and ontology match-
ing (Euzenat and Shvaiko, 2007). In particular, we
want to understand whether these two activities would benefit
from being merged into a single activity sharing the
same formats.
3 A FRAMEWORK FOR DATA
INTERLINKING
We provide in this section a general framework en-
compassing the various approaches used to interlink
resources on the Web of data. We first consider each
case that may arise when interlinking data and describe
it abstractly and through an example. In the
KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development