Graph-based ETL Processes for Warehousing Statistical Open Data
Alain Berro, Imen Megdiche and Olivier Teste
IRIT UMR 5505, University of Toulouse, CNRS, INPT,
UPS, UT1, UT2J, 31062 Toulouse Cedex 9, France
Keywords:
Statistical Open Data, RDF Graphs, Tables Extraction, Holistic Integration, Multidimensional Schema.
Abstract:
Warehousing is a promising means to cross and analyse Statistical Open Data (SOD). But extracting structures,
integrating and defining multidimensional schemas from several scattered and heterogeneous tables in the SOD
are major problems challenging traditional ETL (Extract-Transform-Load) processes. In this paper, we
present a three-step ETL process which relies on RDF graphs to address all these problems. In the first step,
we automatically extract table structures and values using a table anatomy ontology. This phase converts
structurally heterogeneous tables into a unified RDF graph representation. The second step performs a holistic
integration of several semantically heterogeneous RDF graphs. The optimal integration is obtained through
an Integer Linear Program (ILP). In the third step, the system interacts with users to incrementally transform the
integrated RDF graph into a multidimensional schema.
1 INTRODUCTION
Crossing and analysing Statistical Open Data (SOD) in a data warehouse is a promising way to derive new insights and to reuse these mines of information. However, the characteristics of Open Data make it ill-suited to traditional ETL (Extract-Transform-Load) processes. Indeed, an important part¹ of governmental SOD is held in spreadsheets containing structurally and semantically heterogeneous tables. Moreover, these sources are scattered across multiple providers, and even within a single provider, which hampers their integration.
To cope with these problems, we propose RDF-based ETL processes composed of three steps which adapt the traditional extract, transform and load steps. The first step provides a solution to the automatic table extraction and annotation problem. The second provides a solution to the automatic holistic data integration problem. The third step focuses on an incremental multidimensional schema definition from the integrated RDF graphs. In this last step, we aim to maintain the reusability of the processed SOD.
RDF Graph-based ETL Processes. Our ETL processes are based on manipulating RDF graphs. They are depicted in Fig.1 and proceed as follows (a schematic sketch in code follows the list):

Figure 1: Graph-Based ETL Processes.
¹ http://fr.slideshare.net/cvincey/opendata-benchmark-fr-vs-uk-vs-us
Step 1: "Extraction and Annotation" takes as input flat SOD spreadsheets and produces as output unified RDF instance-schema graphs. The main challenge is to automatically identify and restructure data from spreadsheet tables so that they can be reused. We propose a workflow that automatically performs the ETL extraction phase. Each activity in the workflow extracts and annotates table components guided by a table anatomy ontology, then iteratively constructs the vertices and edges of the output graph. Finally, users validate the automatic annotations and the output graphs are stored in a staging area.
Step 2: "Holistic Integration" takes as input the structural parts of several SOD RDF graphs and produces as output an integrated RDF graph and the underlying matchings. The integrated graph is composed of collapsed and simple vertices connected by simple edges. Collapsed vertices reference groups of matched vertices; unmatched vertices remain simple. We propose a linear program to perform an automatic and optimal holistic integration. Users can validate and/or adjust the proposed solution. The linear program maximizes the semantic and syntactic similarity between the labels of the vertices of the input graphs. It also resolves conflicting matching situations involving different graph structures by ensuring strict hierarchies (Malinowski and Zimányi, 2006) and preserving logical edge directions.
Step 3: "Incremental Multidimensional Schema Definition" takes as input an integrated graph and produces as output a multidimensional schema. An interactive process is established between user actions and the system. On the one hand, user actions consist of incrementally identifying multidimensional components such as dimensions, parameters, hierarchies, facts and measures. On the other hand, the system responds to these actions by transforming the identified components into a multidimensional schema.
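To fix ideas, the chain can be pictured as three composable functions. The sketch below is illustrative only: the function names and signatures are invented, not the actual tool's API.

```python
# An illustrative outline of the three-step chain; names are hypothetical.
from rdflib import Graph

def extract_and_annotate(spreadsheet_path: str) -> Graph:
    """Step 1: detect and annotate tables, build an RDF instance-schema graph."""

def holistic_integration(graphs: list) -> Graph:
    """Step 2: match the structural parts of all graphs at once (ILP)."""

def define_multidimensional_schema(integrated: Graph) -> str:
    """Step 3: interactively identify dimensions and facts, emit a DW script."""
```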
Motivating Example. The case study deals with an agricultural company in England aiming at launching a new project on cereal production. The company is interested in analysing cereal production activity in the UK over previous years. With a simple search on the UK governmental provider https://www.gov.uk/, the company retrieves relevant information on this topic. For instance, one link² provides several detailed sources on cereal production by cereal type, by farm size, by farm type, and so on. These sources suffer from the problems presented above. They are: (i) scattered across different spreadsheets, even within the same link, (ii) structurally heterogeneous, since a spreadsheet can contain one or several tables disposed randomly, and (iii) semantically heterogeneous, due to the different analysis axes influencing the agricultural statistics.

² https://www.gov.uk/government/statistical-data-sets/structure-of-the-agricultural-industry-in-england-and-the-uk-at-june
The following sections describe the three steps of our approach, in Sections 2, 3 and 4 respectively.
2 EXTRACTION AND ANNOTATION
In this section, we describe a workflow performing extraction, annotation and transformation of flat Open Data tables into RDF instance-schema graphs.

Figure 2: The Open Data Tables Annotating Concepts.
2.1 The Tables Annotating Ontology
In order to annotate tables, we have extended the basic table anatomy described by (Wang, 1996). A table contains a body (a set of numerical data) indexed by structural data, namely a StubHead (row headers) and/or a BoxHead (column headers).
We propose, as depicted in Fig.2, a formalisation of the ontology describing table annotations. We constructed this ontology manually. It serves as a support for the automatic construction of each Open Data source ontology, as will be detailed in the next section. Each table is composed of two types of data: (i) "StructData", concepts (sequences of words) able to express table schemas, and (ii) "NumData", statistical or quantitative data. With each data type, we associate three overlapping annotation types as follows (a declarative sketch in code follows the list):
Inherent Annotation (IA) describes the low-level data types, namely Numeric, Label, Formula and Date.
Topological Annotation (TA) describes the name and location of a table's parts. As depicted in Fig.2, the class "TopologicalAnnotation" has four properties FL, LL, FC, LC (First Line, Last Line, First Column, Last Column) representing the landmarks of the component in the table. This class has different sub-classes: (i) "UnitNumBlock", the body of the table, itself composed of "NumData", which describes a measure; (ii) "StubHead" and "BoxHead", which are both composed of "StructData" and index a "UnitNumBlock"; and (iii) "SimBlock", composed of "UnitNumBlock". "SimBlock" has two sub-classes: "SimBlockC" (a block composed of UnitNumBlock indexed by the same StubHead and different BoxHead) and "SimBlockL" (a block composed of UnitNumBlock indexed by the same BoxHead and different StubHead).
Semantic Annotation (SA) describes the semantic class of data; we focus on Temporal (Year, Month, ...) and Spatial (Region, GPS coordinates, ...) classes. The different types of concepts are related with a "skos:broader"³ relationship to express the order between types.
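As an illustration, the core of this annotation ontology could be declared with rdflib along the following lines; the namespace URI is a placeholder and only the class and property names come from the paper.

```python
# A minimal sketch of the table anatomy ontology of Fig.2 (rdflib);
# the namespace URI is a placeholder.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

TAB = Namespace("http://example.org/table-anatomy#")
g = Graph()
g.bind("tab", TAB)

# Topological annotations and their sub-classes.
g.add((TAB.TopologicalAnnotation, RDF.type, OWL.Class))
for cls in ("UnitNumBlock", "StubHead", "BoxHead", "SimBlock"):
    g.add((TAB[cls], RDFS.subClassOf, TAB.TopologicalAnnotation))
for cls in ("SimBlockC", "SimBlockL"):
    g.add((TAB[cls], RDFS.subClassOf, TAB.SimBlock))

# FL, LL, FC, LC: the landmarks of a component in the table.
for prop in ("FL", "LL", "FC", "LC"):
    g.add((TAB[prop], RDF.type, OWL.DatatypeProperty))
    g.add((TAB[prop], RDFS.domain, TAB.TopologicalAnnotation))
```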
2.2 The Workflow Activities Description

Figure 3: The Workflow Activities Description.

We propose, as depicted in Fig.3, a workflow of activities transforming flat Open Data (encapsulating tables) into enriched annotated graphs. Each activity performs, at the same time, table component detection, ontology-guided annotation and graph construction.
The first workflow activity converts the flat Open Data into a matrix M which encodes low-level cell types, namely Numeric, Label, Date and Formula. The vertices and relations for Inherent Annotations are created in $G_i$. Then, four activities, dedicated to the search for all types of blocks, are executed. The numerical detection activity extracts all the UnitNumBlock, creates in $G_i$ all the UnitNumBlock instance vertices with their landmark properties, creates NumData instance vertices and finally creates relations (i.e. annotations) between each NumData and the UnitNumBlock encapsulating it. The label (resp. date, formula) detection activity extracts all blocks containing concepts (resp. dates and formulas) but does not add vertices to $G_i$.
³ http://www.w3.org/2009/08/skos-reference/skos.html
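A minimal sketch of this first activity follows, assuming the spreadsheet rows have already been read into lists of raw cell strings; the parsing rules are simplified assumptions, not the tool's exact ones.

```python
import re
from datetime import datetime

# Sketch of the first activity: encode each cell's low-level type
# (Numeric, Label, Date, Formula) into a matrix M. The rules below
# are simplified assumptions.
def cell_type(value: str) -> str:
    value = value.strip()
    if not value:
        return "Empty"
    if value.startswith("="):
        return "Formula"
    if re.fullmatch(r"(19|20)\d{2}", value):      # bare year, e.g. 2010
        return "Date"
    if re.fullmatch(r"-?\d+([.,]\d+)?", value):
        return "Numeric"
    try:
        datetime.strptime(value, "%d/%m/%Y")
        return "Date"
    except ValueError:
        return "Label"

def build_matrix(rows):
    return [[cell_type(c) for c in row] for row in rows]

M = build_matrix([["", "2010", "2011"], ["Wheat", "12.5", "13.1"]])
# [['Empty', 'Date', 'Date'], ['Label', 'Numeric', 'Numeric']]
```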
Thereafter, the BoxHead and StubHead detection activities use the detection results of the numerical and label activities. To identify the BoxHead (respectively StubHead) of each UnitNumBlock, our algorithm performs a backtracking search for the first line (respectively column) of labels situated above (respectively to the left of) the UnitNumBlock, starting the search from UnitNumBlock.FL (respectively UnitNumBlock.FC). The BoxHead, StubHead and StructData instance vertices contained in these blocks are added to $G_i$, then the relationships between them are created in $G_i$. Finally, the indexing relationships between UnitNumBlock, StubHead and BoxHead are added to $G_i$.
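The backtracking search can be sketched as follows, assuming the cell-type matrix M of the first activity and a block described by its FL/LL/FC/LC landmarks; the function name and data layout are illustrative.

```python
# Sketch of BoxHead detection for one UnitNumBlock, using the cell-type
# matrix M and the block's FL/LL/FC/LC landmarks (dict keys here,
# ontology properties in the tool).
def find_boxhead_line(M, block):
    """Backtrack upwards from the block's first line and return the
    index of the first line of labels spanning the block's columns."""
    for row in range(block["FL"] - 1, -1, -1):
        cells = M[row][block["FC"]:block["LC"] + 1]
        if all(c == "Label" for c in cells):
            return row
    return None  # no header line found above the block

# StubHead detection is symmetric: backtrack leftwards from block["FC"]
# over the columns, looking for a column of labels.
```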
The similar block detection activity can be executed after the StubHead or BoxHead activities. It groups disjoint UnitNumBlock having either the same BoxHead (SimBlockC type) or the same StubHead (SimBlockL type). The spatial detection activity uses the results of the label and numerical detections. It uses predefined lists⁴ to annotate instances present in the detected blocks. The found instances are added to $G_i$ and annotated with their appropriate types. Hierarchical relationships are expressed by the property skos:broader (a hierarchical link indicating that a concept is more general than another concept) between all spatial concepts.
Temporal detection uses the date detection or numerical detection results. For temporal concepts, we refer to the generic temporal graph defined by (Mansmann and Scholl, 2007). We use regular expressions to identify instances of each temporal entity. The measure detection activity uses the date detection and relies on users to textually input measure names and units.
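For instance, temporal instances can be recognised with simple regular expressions; the patterns below are illustrative and certainly coarser than the tool's actual ones.

```python
import re

# Illustrative regular expressions for temporal entity detection;
# the actual patterns used by the workflow may differ.
TEMPORAL_PATTERNS = {
    "Year":    re.compile(r"^(19|20)\d{2}$"),
    "Month":   re.compile(r"^(January|February|March|April|May|June|July|"
                          r"August|September|October|November|December)$", re.I),
    "Quarter": re.compile(r"^Q[1-4]$", re.I),
}

def temporal_type(label: str):
    for entity, pattern in TEMPORAL_PATTERNS.items():
        if pattern.match(label.strip()):
            return entity
    return None

print(temporal_type("2010"))   # Year
print(temporal_type("Q3"))     # Quarter
```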
The last activity is the hierarchical relationship detection (skos:broader relations between StructData, as shown in Fig.2). Given the impact of complex hierarchies (Malinowski and Zimányi, 2006) on summarizability problems, we focus on constructing graphs with non-complex hierarchies by applying a set of rules (not detailed in this paper).

⁴ www.geonames.org
Figure 4: An excerpt from a resulting RDF Open Data Graph.
We propose two complementary alternatives for hierarchical classification: (i) concept classification using the annotations detected by the workflow activities, and (ii) concept classification using data mining techniques, to complement the first. The first alternative is based on the arrangement of the numerical blocks, which can determine hierarchical relationships between StubHead or BoxHead concepts. The second alternative concerns StubHead or BoxHead concepts. We cross two conceptual classification techniques and select the relevant classes resulting from them. The first technique is lattices (Birkhoff, 1967), which does not consider semantic aspects and generates formal contexts formed by concepts and attributes. The second technique is RELEVANT (Bergamaschi et al., 2011), which clusters a set of concepts and provides relevant attributes for each cluster.
An Example of Open Data RDF Graph. An Open Data Extraction Tool (ODET) has been implemented to validate our approach (Berro et al., 2014). It performs the workflow described above. Fig.4 represents an excerpt of the graph output by ODET for one source from our running example. This source contains statistics on wheat production per city, district and year. As depicted in Fig.4, the UnitNumBlock describes the Yield measure, whose unit is Tonnes/Hectares. The StubHead is composed of StructData; for instance we have StructData1 with value "England" and StructData2 with value "South West". StructData1 has the semantic annotation city and StructData2 has the semantic annotation district. The relation skos:broader between city and district in the ontology has been instantiated in the instance-schema graph between StructData1 and StructData2. BoxHead1, an instance of the BoxHead class, is the topological annotation of StructData3, which has value 2010 and semantic annotation Year.
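In code, this excerpt could be materialised with rdflib roughly as follows; the EX namespace and its property names are placeholders for the tool's vocabulary, only skos:broader is a standard term.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# A sketch of the Fig.4 excerpt; EX is a placeholder namespace.
EX = Namespace("http://example.org/od#")
g = Graph()

g.add((EX.StructData1, RDF.type, EX.StructData))
g.add((EX.StructData1, EX.value, Literal("England")))
g.add((EX.StructData1, EX.semanticAnnotation, EX.City))

g.add((EX.StructData2, RDF.type, EX.StructData))
g.add((EX.StructData2, EX.value, Literal("South West")))
g.add((EX.StructData2, EX.semanticAnnotation, EX.District))

# The skos:broader link between city and district, instantiated
# between StructData1 and StructData2.
g.add((EX.StructData1, SKOS.broader, EX.StructData2))

g.add((EX.StructData3, EX.value, Literal(2010)))
g.add((EX.StructData3, EX.semanticAnnotation, EX.Year))
g.add((EX.BoxHead1, RDF.type, EX.BoxHead))
g.add((EX.StructData3, EX.partOf, EX.BoxHead1))  # topological annotation
```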
3 HOLISTIC OPEN DATA GRAPH INTEGRATION
This section presents our holistic integration approach, which is based on the integer linear programming technique.
3.1 Pre-matching Phase
This phase takes as input a set of N graphs $G_i = (V_i, E_i)$, $i \in [1, N]$, $N \geq 2$, representing only the structural schema elements, which are StubHead, BoxHead, spatio-temporal concepts and enrichment concepts. The output of this phase is composed of: (1) $\sum_{i=1}^{N-1}(N-i)$ similarity matrices and (2) $N$ direction matrices representing the hierarchical relationships between structural vertices.
Similarity Matrices Computation. We compute $\sum_{i=1}^{N-1}(N-i)$ similarity matrices, denoted $Sim_{i,j}$, of size $n_i \times n_j$, defined between two different graphs $G_i$ and $G_j$, $i \in [1, N-1]$, $j \in [i+1, N]$. Each matrix encodes similarity measures. We define the similarity measure as the maximum of two known similarity distances: (i) the syntactic measure of Jaccard (Jaccard, 1912) and (ii) the semantic measure Wup (Wu and Palmer, 1994). As we deal with N input graphs, similarity measures are computed between all pairs of vertex concepts $v^i_k$ and $v^j_l$ belonging to two different input graphs $G_i$ and $G_j$, with $i \in [1, N-1]$, $j \in [i+1, N]$ (below, $i_k$ stands for the vertex $v^i_k$ of graph $G_i$):

$Sim_{i,j} = \{sim_{i_k,j_l},\ k \in [1, n_i],\ l \in [1, n_j]\}$
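A sketch of this computation, with Jaccard over label token sets and Wu-Palmer through NLTK's WordNet interface; taking the maximum of the two follows the definition above, the rest is illustrative.

```python
# Requires nltk plus the WordNet corpus: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def jaccard(a: str, b: str) -> float:
    """Syntactic similarity: Jaccard over the labels' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def wup(a: str, b: str) -> float:
    """Semantic similarity: best Wu-Palmer score over WordNet synsets."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(a) for s2 in wn.synsets(b)]
    return max(scores, default=0.0)

def sim(a: str, b: str) -> float:
    # The paper takes the maximum of the syntactic and semantic measures.
    return max(jaccard(a, b), wup(a, b))

# Sim_{i,j}: one entry per pair of vertex labels of graphs G_i and G_j.
labels_i = ["wheat", "year"]
labels_j = ["grain", "date"]
Sim = [[sim(a, b) for b in labels_j] for a in labels_i]
```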
Direction Matrices Computation. We compute a set of N direction matrices $Dir_i$ of size $n_i \times n_i$, one for each graph $G_i$, $i \in [1, N]$. Each matrix encodes the edge directions and is defined as follows:

$Dir_i = \{dir^i_{k,l},\ (k, l) \in [1, n_i] \times [1, n_i]\}$

$dir^i_{k,l} = \begin{cases} 1 & \text{if } e^i_{k,l} \in E_i \\ -1 & \text{if } e^i_{l,k} \in E_i \\ 0 & \text{otherwise} \end{cases}$
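The direction matrix is computed directly from each graph's edge list, for example:

```python
# Sketch: build Dir_i from graph G_i's edge list. Vertices are indexed
# 0..n-1; dir_m[k][l] is 1 for an edge k->l, -1 for l->k, 0 otherwise.
def direction_matrix(n, edges):
    dir_m = [[0] * n for _ in range(n)]
    for k, l in edges:
        dir_m[k][l] = 1
        dir_m[l][k] = -1
    return dir_m

Dir = direction_matrix(3, {(0, 1), (1, 2)})  # a 3-vertex chain 0 -> 1 -> 2
```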
3.2 Matching Phase
The idea of the LP4HM matcher consists of extending the maximum-weight graph matching problem with additional logical implication constraints. The latter stand for constraints on the matching setup and the graph structure. Fig.5 depicts three abstract graphs $G_1$, $G_2$ and $G_3$, and the similarities between graph vertices (similarities not shown are considered null).
Figure 5: An example of holistic graph matching problem.
Decision Variable. We define a single type of decision variable, which expresses whether or not two vertices belonging to two different input graphs match. For each $G_i$ and $G_j$, $i \in [1, N-1]$, $j \in [i+1, N]$, $x_{i_k,j_l}$ is a binary decision variable equal to 1 if the vertex $v^i_k$ in the graph $G_i$ matches the vertex $v^j_l$ in the graph $G_j$, and 0 otherwise.
Logical Implication Constraints. A logical implication (Plastria, 2002) is expressed as follows: for binary variables $x_i$, $i \in I$ ($I$ a finite set), and a binary variable $y$, if $x_i = 0$ for all $i \in I$ then $y = 0$, which is equivalent to $y \leq \sum_{i \in I} x_i$.
The first type of constraints concerns the Matching Setup (MS). MS1 encodes the matching cardinality; in particular, we focus on 1:1 matching (Rahm and Bernstein, 2001). MS2 encodes the matching threshold; in particular, the model should not match vertices whose similarity is lower than a given threshold.
MS1 (Matching Cardinality.) Each vertex $v^i_k$ in the graph $G_i$ can match at most one vertex $v^j_l$ in the graph $G_j$, $\forall (i, j) \in [1, N-1] \times [i+1, N]$:

$\sum_{l=1}^{n_j} x_{i_k,j_l} \leq 1, \quad \forall k \in [1, n_i]$
MS2 (Matching Threshold.) We set a similarity threshold thresh in order to get better matching results. Our model encodes the threshold setup in the following constraint: $\forall (i, j) \in [1, N-1] \times [i+1, N]$ and $\forall (k, l) \in [1, n_i] \times [1, n_j]$:

$sim_{i_k,j_l} \cdot x_{i_k,j_l} \geq thresh \cdot x_{i_k,j_l}$
The second type of constraints concerns the Graph Structure (GS). Constraint GS1 ensures strict hierarchies and GS2 ensures consistent edge directions.
GS1 (Strict Hierarchies.) This constraint allows us to resolve the non-strict hierarchy problem (Malinowski and Zimányi, 2006), which is an important issue in multidimensional schema definition. Hence, users will save the time spent resolving such problems in the multidimensional schema definition step. If two vertices match, their predecessors must match as well: $\forall (i, j) \in [1, N-1] \times [i+1, N]$ and $\forall (k, l) \in [1, n_i] \times [1, n_j]$:

$x_{i_k,j_l} \leq x_{i_{pred(k)},j_{pred(l)}}$
GS2 (Edge Direction.) The purpose of this constraint is to prevent the generation of conflicting edges: when the edge between $v^i_k$ and $v^i_{k'}$ and the edge between $v^j_l$ and $v^j_{l'}$ have opposite directions (i.e. $dir^i_{k,k'} \cdot dir^j_{l,l'} = -1$), at most one of the two matchings can be kept. $\forall (i, j) \in [1, N-1] \times [i+1, N]$ with $i \neq j$, and $\forall k, k' \in [1, n_i]$, $\forall l, l' \in [1, n_j]$:

$x_{i_k,j_l} + x_{i_{k'},j_{l'}} - (dir^i_{k,k'} \cdot dir^j_{l,l'}) \leq 2$
Objective Function. The objective of our model is to maximize the sum of the similarities between matched vertices. The objective function is expressed as follows:

$\max \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} sim_{i_k,j_l} \cdot x_{i_k,j_l}$
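A minimal sketch of the model with the PuLP solver (pip install pulp) on two toy graphs; the similarity values are invented and the GS1/GS2 structural constraints are omitted for brevity.

```python
import pulp

n1, n2 = 2, 2
sim = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8}  # sim[k, l]
thresh = 0.3

prob = pulp.LpProblem("LP4HM", pulp.LpMaximize)
x = {(k, l): pulp.LpVariable(f"x_{k}_{l}", cat="Binary") for (k, l) in sim}

# Objective: maximise the total similarity of matched vertices.
prob += pulp.lpSum(sim[p] * x[p] for p in x)

# MS1: 1:1 matching, each vertex matches at most one vertex.
for k in range(n1):
    prob += pulp.lpSum(x[k, l] for l in range(n2)) <= 1
for l in range(n2):
    prob += pulp.lpSum(x[k, l] for k in range(n1)) <= 1

# MS2: matchings below the similarity threshold are forbidden.
for p in x:
    if sim[p] < thresh:
        prob += x[p] == 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
matched = [(p, sim[p]) for p in x if x[p].value() == 1]
print(matched, "objective =", pulp.value(prob.objective))
# [((0, 0), 0.9), ((1, 1), 0.8)] objective = 1.7
```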
The resolution of the linear program for the example in Fig.5 gives the following solution:
$v^1_2$ matches with $v^2_1$ with $sim_{1_2,2_1} = 0.9$
$v^1_2$ matches with $v^3_1$ with $sim_{1_2,3_1} = 0.3$
$v^2_1$ matches with $v^3_1$ with $sim_{2_1,3_1} = 0.9$
$v^2_2$ matches with $v^3_2$ with $sim_{2_2,3_2} = 0.8$
$v^2_3$ matches with $v^3_3$ with $sim_{2_3,3_3} = 0.8$
The objective function value is 3.7.
Graph-basedETLProcessesforWarehousingStatisticalOpenData
275
4 INCREMENTAL DATA WAREHOUSE SCHEMA DEFINITION
In this section, we describe an incremental process to define a multidimensional schema from the integrated Open Data RDF graphs. The system interacts with user actions in order to generate a script that creates the data warehouse schema and populates the data; at the same time, the integrated RDF graph is refined with the QB4OLAP vocabulary proposed in (Etcheverry et al., 2014). QB4OLAP is more complete than the RDF Data Cube Vocabulary⁵ (RDF QB) for describing multidimensional schemas. This twofold output is motivated by: (i) the interest of analysing the populated data with OLAP tools, and (ii) the interest of keeping the integrated Open Data tables reusable in other scenarios, for instance linking them with available Statistical Open Data or Linked Open Data in general.

⁵ http://www.w3.org/TR/vocab-data-cube/
4.1 A Conceptual Multidimensional Schema and an Augmented RDF Graph
Two types of output are expected from our incremental process. They are described as follows:

A Conceptual Multidimensional Schema. We use the conceptual specifications of a generic multidimensional schema (a group of star schemas) proposed in the works of (Ravat et al., 2008). This generic model is the most suitable for the alternatives that the integrated graph can produce (for instance, several facts sharing the same dimensions). A multidimensional schema $E$ is defined by $(F^E, D^E, Star^E)$ such that (a sketch in code follows these definitions):

$F^E = \{F_1, \ldots, F_n\}$ is a finite set of facts;
$D^E = \{D_1, \ldots, D_m\}$ is a finite set of dimensions;
$Star^E : F^E \to 2^{D^E}$ is a function associating each fact with a set of dimensions.
A dimension $D \in D^E$ is defined by $(N^D, A^D, H^D)$:

$N^D$ is the dimension name,
$A^D = \{a^D_1, \ldots, a^D_u\} \cup \{id^D, All\}$ is a set of attributes,
$H^D = \{H^D_1, \ldots, H^D_v\}$ is a set of hierarchies.
A hierarchy denoted $H_i \in H^D$ is defined by $(N^{H_i}, Param^{H_i}, Weak^{H_i})$ such that:

$N^{H_i}$ is the hierarchy name,
$Param^{H_i} = \langle id^D, p^{H_i}_1, \ldots, p^{H_i}_{v_i}, All \rangle$ is an ordered set of attributes, called parameters, which represent the relevant graduations of the dimension, with $\forall k \in [1..v_i],\ p^{H_i}_k \in A^D$,
$Weak^{H_i} : Param^{H_i} \to 2^{A^D \setminus Param^{H_i}}$ is a function associating each parameter with one or several weak attributes.
A fact denoted $F \in F^E$ is defined by $(N^F, M^F)$:

$N^F$ is the fact name,
$M^F = \{f_1(m^F_1), \ldots, f_w(m^F_w)\}$ is a set of measures, each associated with an aggregation function $f_i$.
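Transcribed into code, these definitions read naturally as record types; the following sketch mirrors the notation above with illustrative field layouts.

```python
from dataclasses import dataclass, field

# A sketch of the conceptual model as record types; the field layouts
# are illustrative, the names mirror the definitions above.
@dataclass
class Hierarchy:
    name: str                                   # N^Hi
    params: list                                # Param^Hi, ordered id^D .. All
    weak: dict = field(default_factory=dict)    # Weak^Hi: parameter -> weak attrs

@dataclass
class Dimension:
    name: str                                   # N^D
    attributes: set                             # A^D
    hierarchies: list                           # H^D

@dataclass
class Fact:
    name: str                                   # N^F
    measures: list                              # M^F as (aggregation, measure)

@dataclass
class Schema:                                   # E = (F^E, D^E, Star^E)
    facts: list
    dimensions: list
    star: dict                                  # fact name -> dimension names

# The Time dimension of Example 1 below, encoded with these types:
time = Dimension("Time", {"Year", "All"}, [Hierarchy("HTime", ["Year", "All"])])
```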
Example 1. The example depicted in Fig.6(a) is the multidimensional schema expected from phase 3 of our process applied to the motivating example. For the sake of simplicity, facts $F_i$, dimensions $D_i$ and hierarchies $H_i$ are cited by their names. The multidimensional schema E is composed of one fact and three dimensions: E = ({AgriculturalStatistics}, {Time, Geography, AgriculturalProduct}, Star) with Star(AgriculturalStatistics) = {Time, Geography, AgriculturalProduct}. The dimensions are: (Time, {Year, All}, {HTime}), (Geography, {City, Region, All}, {HRegion}) and (AgriculturalProduct, {TypeProduct, Product, All}, {HProduct}). The fact is: (AgriculturalStatistics, {AVG(Yield), SUM(Area), AVG(Area), SUM(Production), AVG(Production)}).
An Augmented RDF Graph. The integrated RDF graph resulting from the matching phase is augmented with concepts belonging to the QB4OLAP vocabulary proposed in (Etcheverry et al., 2014). The latter reuses the basic concepts of RDF QB and extends them with concepts representing hierarchies, hierarchy levels, aggregation functions and cardinalities. The correspondences between the multidimensional schema and the concepts of the RDF graph are as follows (a sketch in code follows the list):
$E$ is an instance of qb:DataStructureDefinition.
$F^E$ is an instance of the qb:Observation class.
$D^E$ is an instance of the qb:DimensionProperty class.
$Star^E : F^E \to 2^{D^E}$ is an instance of qb:DataSet.
$A^D$ is an instance of qb4o:LevelProperty.
The members of $A^D$ are instances of qb:levelMember.
$H^D$ is an instance of qb4o:HierarchyProperty.
$Param^{H_i}$ is an instance of qb4o:LevelProperty.
$Weak^{H_i}$ is an instance of a qb:AttributeProperty.
$M^F$ is an instance of qb:MeasureProperty, and the associated aggregation function $f_i$ is an instance of qb4o:AggregateFunction.

(a) An example of Multidimensional Schema (b) An Augmented RDF Graph
Figure 6: Expected outputs for the incremental schema definition step.
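Following these correspondences, the augmentation of the graph can be sketched with rdflib; the qb URI is the published RDF Data Cube namespace, the qb4o URI and the class/property names follow the paper's terms (written in camelCase here), and EX is a placeholder namespace.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

QB   = Namespace("http://purl.org/linked-data/cube#")
QB4O = Namespace("http://purl.org/qb4olap/cubes#")  # QB4OLAP vocabulary
EX   = Namespace("http://example.org/od#")           # placeholder

g = Graph()
g.add((EX.Geography, RDF.type, QB.DimensionProperty))   # a dimension
g.add((EX.HRegion, RDF.type, QB4O.HierarchyProperty))   # one of its hierarchies
g.add((EX.City, RDF.type, QB4O.LevelProperty))          # a level of HRegion
g.add((EX.HRegion, QB4O.hasLevel, EX.City))
g.add((EX.StructData1, QB4O.inLevel, EX.City))          # a promoted level member
```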
Fig.6(b) shows an excerpt of an expected augmented RDF graph following the correspondences explained above. The processes we use to obtain the two expected outputs are explained in the next section.
4.2 An Incremental Process to Define Multidimensional Components
We propose an incremental process in which users begin by identifying dimensions and their components (hierarchies, attributes, members), then define the fact and its measures. Three actors interact in this process: (1) the users, who interact with the integrated graph through different functions to incrementally define multidimensional components; (2) the graph transformation (GT), which reacts to user actions by enriching the integrated graph with QB4OLAP objects and properties; (3) the data warehouse script generator (SG), which reacts to user actions by generating the script describing the multidimensional schema, from which the data warehouse will be populated with SOD.
Dimensions Identification. When a user adds a dimension, the GT creates a new vertex with the dimension name as an instance of qb:dimensionProperty, and the SG creates a new dimension table without attributes. The user then adds hierarchies to the dimension; since hierarchies are composed of levels, the user has to choose either to add a level or to select a level from the graph. If the user chooses the action add level, the GT creates a vertex with the hierarchy name as an instance of qb4o:hierarchyProperty, a vertex for the level as an instance of qb4o:levelProperty, and a property qb4o:haslevel between the hierarchy and the level. If the user chooses the action select level from graph, this level has to be a StructData vertex; it is changed into an instance of qb4o:levelProperty and the property qb4o:haslevel is created between this vertex and the hierarchy. Then, for both actions, the user has to select members from the graph. The GT collects these members, which are StructData, from the graph and changes them into instances of qb:levelMember. Properties qb4o:inlevel are created between the concerned level and its members, and visually these member vertices are collapsed into the level vertex. At the same time, the SG updates the dimension with its attributes, creates the hierarchies with their levels and populates the dimension attributes with their members. Users iterate this process for all dimensions; once a dimension has been created, the system can suggest potential dimensions to the user. Indeed, since each NumData is related to a path composed of a succession of StructData, the GT deduces the NumData related to the paths belonging to the currently identified dimensions and proposes to the user the not-yet-identified paths related to this NumData as potential dimensions.
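The interplay between user actions, the GT and the SG can be sketched as follows; the class, method and DDL details are invented for illustration.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/od#")  # placeholder

# A sketch of the 'add dimension' action: the GT adds a
# qb:DimensionProperty vertex to the integrated graph while the SG
# appends the corresponding DDL; all names are illustrative.
class SchemaBuilder:
    def __init__(self, integrated: Graph):
        self.g = integrated
        self.script = []

    def add_dimension(self, name: str):
        self.g.add((EX[name], RDF.type, QB.DimensionProperty))       # GT
        # SG: the paper creates the dimension table without attributes
        # first; columns are added later as levels and members are chosen.
        self.script.append(f"CREATE TABLE dim_{name.lower()} (id INTEGER);")

builder = SchemaBuilder(Graph())
builder.add_dimension("Geography")
print("\n".join(builder.script))
```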
Fact Identification. When all dimensions have been identified, the user can proceed to fact and measure identification. When the user creates a fact, the GT creates a vertex with the fact name as an instance of qb:observation. The user should then link dimensions to the fact and select measures from the graph. The SG accordingly creates a fact table with its measures and relates it to the created dimension tables. The GT collects the NumData related to the selected dimensions and measures, which are instances of qb:dataset. Finally, the SG adds these instances to the fact table.
Graph-basedETLProcessesforWarehousingStatisticalOpenData
277
Discussion and Future Work. In the process described above, user intervention can be error-prone, hence some improvements may be envisaged. For instance, we could inject some of the algorithms proposed by (Romero and Abelló, 2007) in order to semi-automate the identification of dimensions and facts. Moreover, even if we have resolved the problem of complex hierarchies in the first two steps of our ETL processes by generating strict hierarchies, we acknowledge that the summarizability problem (Mazón et al., 2010) is not totally resolved, especially regarding the completeness and disjointness integrity constraints (Lenz and Shoshani, 1997). Both the works of (Romero and Abelló, 2007) and (Prat et al., 2012) propose solutions applied to ontologies. Compared to (Romero and Abelló, 2007), the authors of (Prat et al., 2012) propose more explicit rules, which seem straightforward to add to our process. Since the two outputs of our process are constructed in parallel, the verification of the summarizability constraints in the ontology can be used to check the same constraints in the multidimensional model.
5 CONCLUSIONS
In this paper, we have presented a full RDF graph-based ETL chain for warehousing SOD. This ETL chain takes as input a set of flat SOD containing statistical tables and transforms them into a multidimensional schema through three steps. The first step automatically extracts, annotates and transforms input tables into RDF instance-schema graphs. The second step automatically performs a holistic graph integration through an integer linear program. The third one incrementally defines the multidimensional schema through an interactive process between users and the system. The main contributions of our approach are: (i) the unified representation of tables, which facilitates their schema discovery, and (ii) the extension of the maximum-weight graph matching problem with structural constraints in order to resolve the holistic open data integration problem. In future work, we aim to conduct a user study of our approach, to measure the efficiency of the automatic detections and integrations, and to measure the difficulties that users may encounter when they incrementally define the multidimensional schema from visual graphs.
REFERENCES
Bergamaschi, S., Guerra, F., Orsini, M., Sartori, C., and
Vincini, M. (2011). A semantic approach to ETL
technologies. Data and Knowledge Engineering,
70(8):717 – 731.
Berro, A., Megdiche, I., and Teste, O. (2014). A content-
driven ETL processes for open data. In New Trends in
Database and Information Systems II - Selected pa-
pers of the 18th East European Conference on Ad-
vances in Databases and Information Systems and As-
sociated Satellite Events, ADBIS 2014 Ohrid, Mace-
donia, pages 29–40.
Birkhoff, G. (1967). Lattice Theory. American Mathemati-
cal Society, 3rd edition.
Etcheverry, L., Vaisman, A., and Zimányi, E. (2014). Mod-
eling and querying data warehouses on the semantic
web using QB4OLAP. In Proceedings of the 16th
International Conference on Data Warehousing and
Knowledge Discovery, DaWaK’14, Lecture Notes in
Computer Science. Springer-Verlag.
Jaccard, P. (1912). The distribution of the flora in the alpine
zone. New Phytologist, 11(2):37–50.
Lenz, H.-J. and Shoshani, A. (1997). Summarizability in
OLAP and statistical data bases. In Scientific and Sta-
tistical Database Management, 1997. Proceedings.,
pages 132–143.
Malinowski, E. and Zimányi, E. (2006). Hierarchies in
a multidimensional model: From conceptual mod-
eling to logical representation. Data Knowl. Eng.,
59(2):348–377.
Mansmann, S. and Scholl, M. H. (2007). Empowering the
olap technology to support complex dimension hierar-
chies. IJDWM, 3(4):31–50.
Mazón, J.-N., Lechtenbörger, J., and Trujillo, J. (2010). A
survey on summarizability issues in multidimensional
modeling. In JISBD, pages 327–327. IBERGARC-
ETA Pub. S.L.
Plastria, F. (2002). Formulating logical implications in com-
binatorial optimisation. European Journal of Opera-
tional Research, 140(2):338 – 353.
Prat, N., Megdiche, I., and Akoka, J. (2012). Multidi-
mensional models meet the semantic web: defining
and reasoning on OWL-DL ontologies for OLAP. In
DOLAP 2012, ACM 15th International Workshop on
Data Warehousing and OLAP, pages 17–24.
Rahm, E. and Bernstein, P. A. (2001). A survey of
approaches to automatic schema matching. VLDB
JOURNAL, 10.
Ravat, F., Teste, O., Tournier, R., and Zurfluh, G. (2008).
Algebraic and graphic languages for OLAP manipu-
lations. International Journal of Data Warehousing
and Mining, 4(1):17–46.
Romero, O. and Abelló, A. (2007). Automating multidi-
mensional design from ontologies. In Proceedings of
the ACM Tenth International Workshop on Data Ware-
housing and OLAP, DOLAP ’07, pages 1–8. ACM.
Wang, X. (1996). Tabular abstraction, editing, and format-
ting. Technical report, University of Waterloo, Waterloo, Ontario, Canada.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, New Mexico, pages 133–138.
ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems
278