approach to devise a schema and its respective
instances, called entity-per-row (O'Connor et al., 2010).
In this approach, besides the schema, each row of
the table should describe a different entity and each
column an attribute for that entity. The spreadsheet
of Figure 1, for example, follows this kind of
organization: each column corresponds to an
attribute – e.g., Date, Genus, Species etc. – and
each row to an event – a collection of a specimen.
Han et al. (2008) and several related works assume
the entity-per-row organization to support the
process of manually mapping attributes, making
them semantically interoperable. Initially, the user
must indicate a cell whose column contains a field
which plays the role of identifier – equivalent to the
primary key of a database. In the example of
Figure 1, it would be the field date and time
start. Then, the system allows manual association
between each cell of a field and an attribute of the
semantic entity, considering that the respective
column of the field will contain its values.
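This manual mapping can be sketched as follows. The column names and attribute names below are illustrative stand-ins for the fields of Figure 1, not the actual vocabulary used by Han et al.; the user-chosen identifier column keys each row to a subject, and every mapped cell becomes a triple.

```python
import csv
import io

# Hypothetical CSV content mirroring the entity-per-row layout of Figure 1.
SHEET = """Date,Time Start,Genus,Species
2013-05-02,08:30,Puma,concolor
2013-05-03,09:10,Panthera,onca
"""

# Manual field-to-attribute mapping, as a user would indicate it;
# the "ex:" attribute names are illustrative placeholders.
FIELD_TO_ATTR = {"Date": "ex:date", "Time Start": "ex:timeStart",
                 "Genus": "ex:genus", "Species": "ex:species"}
ID_FIELD = "Date"  # the column playing the role of primary key

def rows_to_triples(sheet, mapping, id_field):
    """Turn each row (one entity) into subject/predicate/object triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(sheet)):
        subject = "ex:specimen/" + row[id_field]  # identifier keys the entity
        for field, attr in mapping.items():
            triples.append((subject, attr, row[field]))
    return triples

triples = rows_to_triples(SHEET, FIELD_TO_ATTR, ID_FIELD)
```

Each of the two rows yields one triple per mapped column, so the column of a field indeed supplies the values of its attribute.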
Langegger and Wöß (2009) propose a similar,
but more flexible, solution to map spreadsheets in an
entity-per-row organization. They are able to treat
hierarchies among fields, when a field is divided into
sub-fields. In Figure 1, for example, the fields
Date, Time Start and Time End refer to when
the species was collected. Authors usually create a
label spanning the entire range above these
columns – e.g., "CollectionPeriod" –
to indicate that all these fields are subdivisions of
the larger field. This hierarchical perspective can be
expressed in our model, since a property can be
typed (rdfs:range) by a class, which in turn has
properties related to it – e.g., the identifier
property in Figure 13 is typed by the class
Specimen Identifier, which affords the
properties kingdom, phylum etc.
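The hierarchy can be made concrete with a handful of schema triples. The class and property names below follow the description of Figure 13 but are illustrative: a property is typed by a class via rdfs:range, and the sub-fields are the properties whose rdfs:domain is that class.

```python
# Schema triples expressing the hierarchy described for Figure 13:
# the identifier property is typed (rdfs:range) by a class that in
# turn carries its own sub-properties. Names are illustrative.
SCHEMA = [
    ("ex:identifier", "rdfs:range",  "ex:SpecimenIdentifier"),
    ("ex:kingdom",    "rdfs:domain", "ex:SpecimenIdentifier"),
    ("ex:phylum",     "rdfs:domain", "ex:SpecimenIdentifier"),
]

def sub_properties(schema, prop):
    """Fields subdivided under `prop`: properties whose rdfs:domain is
    the class that `prop` points to via rdfs:range."""
    ranges = {o for s, p, o in schema if s == prop and p == "rdfs:range"}
    return sorted(s for s, p, o in schema
                  if p == "rdfs:domain" and o in ranges)
```

Under this sketch, `sub_properties(SCHEMA, "ex:identifier")` recovers the kingdom and phylum sub-fields from the typed class.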
RDF has been widely adopted by related work as
an output format to integrate data from multiple
spreadsheets, since it is an open standard that
supports syntactic and semantic interoperability.
Langegger and Wöß (2009) propose to access these data through SPARQL
(Pérez et al., 2009) – a query language for RDF.
O'Connor et al. (2010) propose a similar solution, but
using OWL.
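At its core, a SPARQL basic graph pattern binds variables against stored triples. The toy matcher below is not real SPARQL – production systems use an engine such as the one behind a SPARQL endpoint – but it illustrates, over in-memory triples, what a single-pattern WHERE clause does.

```python
# Illustrative in-memory triples; subjects and predicates are placeholders.
TRIPLES = [
    ("ex:s1", "ex:genus", "Puma"),
    ("ex:s1", "ex:species", "concolor"),
    ("ex:s2", "ex:genus", "Panthera"),
]

def match(triples, pattern):
    """Match one SPARQL-like triple pattern; terms starting with '?'
    are variables. Returns a list of variable bindings."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value        # variable: bind it
            elif term != value:
                break                        # constant mismatch: reject
        else:
            results.append(binding)
    return results

# Analogue of: SELECT ?s WHERE { ?s ex:genus "Puma" }
bindings = match(TRIPLES, ("?s", "ex:genus", "Puma"))
```

A real query language adds joins over several patterns, filters, and optional matches on top of this mechanism.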
Abraham and Erwig (2006) observed that
spreadsheets are widely reused, but that, due to their
flexibility and level of abstraction, the reuse of a
spreadsheet by people outside its domain increases
errors of interpretation and therefore inconsistency.
Thus they propose a spreadsheet life cycle defined in
two phases – development and use – in order to
separate the schema from its respective instances.
The schema is developed in the first phase and used
in the second. Instances are inserted and
manipulated in the second phase, guided by the
schema, which cannot be changed at this stage.
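The two-phase discipline can be sketched as a small state machine, assuming a hypothetical `Sheet` class (not from Abraham and Erwig's system): the schema is editable only in the development phase, and instances are validated against the frozen schema in the use phase.

```python
class Sheet:
    """Sketch of a two-phase spreadsheet life cycle: schema editable
    during development, frozen during use. Illustrative only."""

    def __init__(self):
        self.phase = "development"
        self.schema = []
        self.instances = []

    def add_field(self, name):
        if self.phase != "development":
            raise RuntimeError("schema is frozen in the use phase")
        self.schema.append(name)

    def start_use_phase(self):
        self.phase = "use"

    def add_row(self, row):
        # Instances are inserted under the guidance of the fixed schema.
        if set(row) != set(self.schema):
            raise ValueError("row does not match schema")
        self.instances.append(row)

sheet = Sheet()
sheet.add_field("Genus")
sheet.add_field("Species")
sheet.start_use_phase()
sheet.add_row({"Genus": "Puma", "Species": "concolor"})
```

Once `start_use_phase` runs, any further `add_field` call fails, which is exactly the separation of schema from instances the authors argue for.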
Another approach to address this problem is
automating the semantic mapping using Linked
Data. Syed et al. (2010) argue that a manual process
to map spreadsheets is not feasible, so they propose
to automate the semantic mapping by linking
existing data in the spreadsheets to concepts
available in knowledge bases, such as DBpedia
(http://dbpedia.org) and Yago (http://www.mpi-inf.mpg.de/yago-naga/yago/). Yago is a large
knowledge base, whose data are extracted, among
others, from Wikipedia and WordNet
(http://wordnet.princeton.edu). The latter is a digital
lexicon of the English language, which semantically
relates words.
An advantage of the latter approach is that such
bases are constantly maintained and updated by
people from various parts of the
world. On the other hand, the search for labels
without considering their contexts can generate
ambiguous connections, producing inconsistencies.
Thus, there are studies that stress the importance of
delimiting a scope before attempting to find links.
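The effect of scope on ambiguity can be illustrated with a toy linker. The dictionary below is a local stand-in for a knowledge base – real systems query DBpedia or Yago – and the concept identifiers are illustrative, not actual DBpedia resources.

```python
# Toy stand-in for a knowledge base: label -> candidate concepts.
# Entries are illustrative, not real DBpedia resource names.
KB = {
    "Puma": ["dbpedia:Puma_(genus)", "dbpedia:Puma_(brand)"],
    "Jaguar": ["dbpedia:Jaguar", "dbpedia:Jaguar_Cars"],
}

def link(label, scope=None):
    """Return candidate concepts for a label, optionally filtered by a
    scope keyword; without a scope, ambiguous labels yield several links."""
    candidates = KB.get(label, [])
    if scope:
        candidates = [c for c in candidates if scope.lower() in c.lower()]
    return candidates
```

Here `link("Puma")` returns two conflicting candidates, while `link("Puma", "genus")` – the scoped search – narrows them to one, which is the point the cited studies make about delimiting a scope first.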
Venetis et al. (2011) exploit the semantics
existing in tables to drive the consistent
manipulation operations applicable to them. Their
proposal describes a system that analyzes pairs of
column-heading terms and the relationship between
them, in order to improve their semantic
interpretation. The authors state that a main
problem in the interpretation of tabular data is
analyzing terms independently. This paper tries to
identify the scope by recognizing a construction
pattern related to the nature of a spreadsheet inside
a context.
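One way to avoid interpreting terms independently – in the spirit of, though much simpler than, Venetis et al.'s system – is to label a column by letting all of its cells vote against an isA database. The database below is a hypothetical miniature; the real one is mined from the web at scale.

```python
# Toy isA database mapping cell values to classes; entries are
# illustrative placeholders for a web-mined database.
ISA = {"Puma": "genus", "Panthera": "genus", "concolor": "species"}

def label_column(values):
    """Assign the column the class supported by most of its cells,
    instead of interpreting each term on its own."""
    votes = {}
    for v in values:
        cls = ISA.get(v)
        if cls:
            votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

Even with an unknown cell like "Felis" in the column, the majority of known cells still yields a consistent column label.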
Jannach et al. (2009) state that the compact and
precise way in which tables present data is primarily
directed to human reading and not to machine
interpretation and manipulation. They propose a
system to extract
information from web tables, associating them to
ontologies. They organize the ontologies in three
groups: 1. core: concepts related to the model
disassociated from a specific domain; 2. core +
domain: domain concepts of a schema related to the
information to be retrieved; 3. instance of ontology:
domain concepts of instances. These ontologies aim
at gradually linking the information to a semantic
representation, directed by the user's goal.
Among these solutions, we note that some of
them address individual pieces of information inside
spreadsheets – devoid of context – and others
ICEIS 2014 - 16th International Conference on Enterprise Information Systems