Using Signifiers for Data Integration in Rail Automation

Alexander Wurl¹, Andreas Falkner¹, Alois Haselböck¹ and Alexandra Mazak²

¹ Siemens AG Österreich, Corporate Technology, Vienna, Austria
² TU Wien, Business Informatics Group, Austria

Alexandra Mazak is affiliated with the CDL-MINT at TU Wien.
Keywords:
Data Integration, Signifier, Data Quality.
Abstract:
In Rail Automation, planning future projects requires the integration of business-critical data from heteroge-
neous data sources. As a consequence, data quality of integrated data is crucial for the optimal utilization of
the production capacity. Unfortunately, current integration approaches mostly neglect uncertainties and incon-
sistencies in the integration process in terms of railway specific data. To tackle these restrictions, we propose
a semi-automatic process for data import, where the user resolves ambiguous data classifications. The task
of finding the correct data warehouse classification of source values in a proprietary, often semi-structured
format is supported by the notion of a signifier, which is a natural extension of composite primary keys. In a
case study from the domain of asset management in Rail Automation, we show that this approach facilitates
high-quality data integration while minimizing user interaction.
1 INTRODUCTION
In order to properly plan the utilization of production
capacity, e.g., in a Rail Automation factory, informa-
tion from all business processes and project phases
must be taken into account. Sales people scan the
market and derive rough estimations of the number
of assets (i.e. producible units) of various types (e.g.
control units for main signals, shunting signals, dis-
tant signals, etc.) which may be ordered in the next
few years. The numbers of assets get refined phase by phase, e.g., during bid preparation or order fulfillment. Since these phases are often executed by differ-
ent departments with different requirements and in-
terests (e.g. rough numbers such as 100 signals for
cost estimations in an early planning phase, vs. de-
tailed bill-of-material with sub-components such as
different lamps for different signal types for a final
installation phase), the same assets are described by
different properties (i.e. with - perhaps slightly - dif-
ferent contents) and in different proprietary formats
(e.g. spreadsheets or XML files). Apart from the tech-
nical challenges of extracting data from such propri-
etary structures, heterogeneous feature and asset rep-
resentations hinder the process of mapping and merg-
ing information which is crucial for a smooth over-
all process and for efficient data analytics which aims
at optimizing future projects based upon experiences
from all phases of previous projects. One solution ap-
proach is to use a data warehouse and to map all het-
erogeneous data sets of the different departments to
its unified data schema.
To achieve high data quality in this process, it is
important to avoid uncertainties and inconsistencies
while integrating data into the data warehouse. Espe-
cially if data includes information concerning costs, it
is essential to avoid storing duplicate or contradicting
information because this may have business-critical
effects. Part of the information can be used to identify
corresponding data in some way (i.e. used as key),
part of it can be seen as relevant values (such as quan-
tities and costs). Only if keys of existing information
objects in the data warehouse are comparable to that
one of newly added information from heterogeneous
data sets, that information can be stored unambigu-
ously and its values are referenced correctly.
Keys are formed from one or many components of
the information object and are significant for compar-
ing information of heterogeneous data sets with infor-
mation stored in the data warehouse. If two such keys do not match, this has one of two significantly different causes: (i) the two objects should have the same key but the keys slightly differ from each other, or (ii) the two objects really have different keys. Us-
ing solely heuristic lexicographical algorithms (Co-
hen et al., 2003) to automatically find proper matches
does not necessarily succeed in reliably distinguishing those two cases, because each character of the key might be important and have a deeper meaning, or not. Inappropriate heuristics lead to wrong matching results, which may have major consequences for the business. Therefore, it is critical to rely on semantics in matching algorithms.
To support this proposition, we adapt the concept
of signifiers (Langer et al., 2012). A human actor - an
expert of the domain - defines a useful combination
of properties during the design phase. In the integra-
tion process of a new data set to the data warehouse,
we use this definition for calculating each information
object’s key. In case of mismatches between the data
warehouse and the new data set during data integra-
tion, a manual control step either confirms the mismatch or resolves it into a match. The signifier serves
as a key to uniquely identify an object in the course of
normalization of information in the data warehouse.
The main contributions of this paper are: (i) we re-
veal challenges of integrating heterogeneous data sets
into a data warehouse in an industrial context; (ii) we
introduce a technique using a signifier to avoid in-
consistencies between assets in the data warehouse;
(iii) we show how some minimal human interaction
can significantly improve the matching result; (iv) we
evaluate in a case study how those techniques improve
data quality.
2 PROBLEM DEFINITION
We address data integration scenarios where values
from highly heterogeneous data sources are merged
into a conjoint data warehouse. In more detail, features and assets from different data sources often describe the same content, but in varying, partially overlapping compositions, caused by differing amounts of data generated by manual estimations or obtained from the installed base. In order
to avoid duplicate and contradicting data in the data
warehouse, corresponding information from different
data sources needs to be identified. Unfortunately,
there are no global IDs available to identify informa-
tion objects across different data sources. Instead, computing keys from the different data representations in the sources makes it possible to match them with keys which are
already stored in the data warehouse. Figure 1 shows
the setting and the issues in more detail.
Figure 1: Integration scenario of heterogeneous data sets.

One typical format is spreadsheets (such as Microsoft Excel) where each row represents an information object, e.g., the expected quantity from a planning phase. "Source1" in Figure 1 has five columns where the first three are used to compute a key (by function "extractKey1") and the fourth column (selected by function "extractValue1") contains a relevant value, e.g., the quantity in the form of a numerical value.
Another format is XML which contains structured
objects (such as representations of physical objects),
e.g., a bill of materials from an installation phase.
”SourceN” in Figure 1 shows some objects. Each ob-
ject consists of sub-elements in which all information
of the corresponding object is included - either for use
in keys (accessed by function ”extractKeyN”) or for
values. In this example, the value is not selected di-
rectly, but aggregated from all objects with the same
key (using aggregation function ”aggregateValueN”),
e.g., by counting all those object instances to derive
their quantity as a numerical value.
The essence of each data source can be seen as a
set of triples, containing a key, a value, and the source
identifier. As an intermediate representation, those
triples are merged into the data warehouse which
comprises the history of all values with references
to data sources and keys. The keys are necessary to
map different data representations of the same infor-
mation from different data sources onto each other. If
a key from a triple matches an existing key, the latter
is reused, e.g., ”key11” for the triple from ”source1”
in Figure 1. Else, a new key is added to the data ware-
house, e.g., ”keyN1” for the triple from ”sourceN”.
This means that the new information does not corre-
spond to other information in the data warehouse.
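To make this intermediate triple representation concrete, the following Python sketch shows the (key, value, source) form and the reuse-or-add decision for keys; the names Triple and load and the in-memory store are illustrative assumptions, not the actual warehouse implementation.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """Essence of one extracted information object (see Figure 1)."""
    key: tuple    # computed by an extractKey function of the source
    value: float  # selected or aggregated by an extractValue/aggregateValue function
    source: str   # identifier of the originating data source

known_keys = set()   # keys already present in the data warehouse
data_table = []      # history of all loaded values with key and source references

def load(triple: Triple) -> None:
    """Reuse an existing key if it matches; otherwise register a new one."""
    if triple.key not in known_keys:
        known_keys.add(triple.key)   # no corresponding information found yet
    data_table.append(triple)        # value stored with key and source reference

# Hypothetical example: key computed from the first three columns of Source1.
load(Triple(key=("col1", "col2", "col3"), value=10.0, source="source1"))
```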
Such a scenario poses the following questions:
- Can a simple approach, such as assembling some properties as components of a unique identification key, cover many use cases?

- High heterogeneity in technical aspects, data model, and semantics requires an advanced approach. How can the extraction functions (e.g., "extractKey1", "extractValue1", "extractKeyN") be defined in a systematic way?

- How shall a match be defined in detail? Perfect match vs. near match (case-sensitivity, lexicographical distance)? How to avoid wrong matches?

- How to decide whether (syntactically) non-matching keys refer to the same information? Are synonyms used?

- The process of comparing keys needs some user interaction (expert knowledge). What is the best process? How to minimize efforts?
3 RELATED WORK
Data cleansing, also known as data cleaning, is an
inevitable prerequisite to achieve data quality in the
course of an ETL-process (Bleiholder and Naumann,
2009). Naumann describes data cleansing as use case
of data profiling to detect and monitor inconsistencies
(Naumann, 2014). Resolving inconsistencies as part
of the transformation phase has been a topic for the
last two decades (Leser and Naumann, 2007; Sharma
and Jain, 2014).
In the work of (Rahm and Do, 2000; Naumann,
2014) tasks for ensuring data quality are classified;
various data cleaning techniques have been proposed
such as rule-based (Dallachiesa et al., 2013; Fan and
Geerts, 2012), outlier detection (Dasu and Johnson,
2003; Hellerstein, 2008), missing values (Liu et al.,
2015), and duplicate detection (Bilenko and Mooney,
2003; Wang et al., 2012). Most of these techniques
require human involvement.
The work of (Müller and Freytag, 2005; Krishnan
et al., 2016) points out that integrating data is an itera-
tive process with user interaction. Various approaches
take this into consideration. Frameworks proposed in
(Fan et al., 2010; Khayyat et al., 2015) enable the user
to edit rules, master data, and to confirm the calcu-
lations leading to correct cleaning results. A higher
detection accuracy in duplicate detection by a hybrid
human-machine approach is achieved in the work of
(Wang et al., 2012). As presented in (Liu et al., 2015),
numerous techniques are used to associate the data to
get useful knowledge for data repairing, e.g., calculat-
ing similarities of contextual and linguistic matches
being able to determine relationships. In (Volkovs
et al., 2014) a logistic regression classifier learns from
past repair preferences and predicts the type of repair
needed to resolve an inconsistency.
(Dai et al., 2016) reports that, although there are various data profiling tools to improve data quality, people who use them without a clear quality measurement method for their needs are challenged by limited performance and by unexpectedly weak robustness. (Gill and
Singh, 2014) claims that various frameworks offer the
integration of heterogeneous data sets, but a framework addressing quality issues such as naming conflicts, structural conflicts, missing values, and changing dimensions has not yet been implemented in a single tool.
The work of (Gottesheim et al., 2011) analyzes
the representation of real-world objects in the con-
text of ontology-driven situations. Similar to how
real-world objects are characterized by attributes, in
(Langer et al., 2012) the characteristics of models are
described by a signifier. Basically, the concept of a
signifier has its origin in the domain of model-driven
engineering (MDE) where a signifier enhances the
versioning system by describing the combination of
features of model element types that convey the su-
perior meaning of its instances. A signifier improves
versioning phases in comparing and merging models
leading to a higher quality of finally merged mod-
els. As we integrate objects from different sources,
a signifier structures the combination of properties
and finally improves data integration when objects
to be integrated are compared and merged with ob-
jects in the data warehouse. Similarly, the work of
(Papadakis et al., 2015) addresses real-world entities
with blocking approaches based on schema-agnostic
and schema-based configurations. A schema-based approach may be an alternative to signifiers, but with precision limitations due to the larger number of duplicates detected when comparing properties. A schema-agnostic approach, on the other hand, may skip important distinguishing information of real-world objects when clustering similar properties. In the integra-
tion process of objects, signifiers (1) provide a careful
and flexible identification structure for properties, and
(2) support normalization of information in the data
warehouse.
4 USING SIGNIFIERS FOR DATA
INTEGRATION
We propose a strategic technique in the ETL process to instantly react to potential inconsistencies, e.g.,
when there is not a perfect match of a source ob-
ject with an object in the data warehouse. Instead
of names or IDs, we use and extend the concept of
signifiers, as introduced in (Langer et al., 2012), for
mapping an object of a data source to the right object
type in the data target (the data warehouse). In simple
terms, a signifier is an object consisting of different
components. To check if two objects match, the com-
ponents of their signifiers are checked pairwise.
To cope with the usage of different wordings or
words in different languages in the data source, we
need two versions of signifiers: a source signifier and
a target signifier which is extended by aliases.
Definition. A Source Signifier is an n-tuple of strings. The term S_i refers to the i-th component of a source signifier S.
The meaning of each element of the signifier is
determined by its position in the tuple. In the railway
asset management example described in section 1, we
use signifiers of length 3, representing category, sub-
category and subsubcategory, respectively. Example:
(”Signal”, ”Main Signal”, ”8 Lamps”).
Definition. A Target Signifier is an n-tuple of sets of strings. The term T_i refers to the i-th component of a target signifier T.
Target signifiers allow specifying more than one string per component. These strings represent
aliases. Example: ({”Signal”, ”S”}, {”Main Signal”,
”MainSig”, ”Hauptsignal”, ”HS”}, {”8”, ”8 Lamps”,
”8lamps”})
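In code, these two notions map directly onto simple data shapes; the sketch below mirrors the examples above (the position of a component encodes its meaning) and is purely illustrative.

```python
# A source signifier: an n-tuple of strings; position i carries the meaning of
# the i-th component (here: category, subcategory, subsubcategory).
source_signifier = ("Signal", "Main Signal", "8 Lamps")

# A target signifier: an n-tuple of sets of strings; each set holds the
# accepted aliases for that component.
target_signifier = (
    {"Signal", "S"},
    {"Main Signal", "MainSig", "Hauptsignal", "HS"},
    {"8", "8 Lamps", "8lamps"},
)
```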
The main task of integrating a new object from a
data source into a target data warehouse is to match
source and target signifiers. To be able to deal with
approximate matches, too, we use a distance function
dist(s,t), returning a value from [0,1], where 0 means
that the two strings s and t are equal. There are many
well-studied string metrics that can be used here; see, e.g., (Cohen et al., 2003). Given a string distance function dist(., .), we define the minimum distance of a string s and a set of strings ts by:

dist_min(s, ts) = min_{i=1..|ts|} dist(s, ts_i).
In order to express different significances of dif-
ferent components of a signifier, we use weight fac-
tors w_i for each component i. The weights of all components of a signifier sum up to 1. In Section 5, the weighting of the components is demonstrated.
Definition. Let S be a source signifier and T be a
target signifier with n components. Let dist(.,.) be
a string distance function. Let w_i be component weights. The function Dist(S, T) returns a value from [0,1] and is defined in the following way:

Dist(S, T) = Σ_{i=1..n} dist_min(S_i, T_i) · w_i
Now we are in the position to formally define per-
fect and approximate matches.
Definition. Let S be a source signifier and T be a
target signifier. S and T perfectly match, if and
only if Dist(S, T) = 0. For a given threshold value τ (0 < τ < 1), S and T approximately match, if and only if 0 < Dist(S, T) ≤ τ.
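A minimal sketch of these definitions is given below. It uses difflib.SequenceMatcher from the Python standard library as a stand-in string distance (our case study uses a case-insensitive Jaro-Winkler distance instead); weights and threshold are the parameters defined above.

```python
from difflib import SequenceMatcher

def dist(s: str, t: str) -> float:
    """Stand-in string distance in [0, 1]; 0 means the strings are equal."""
    return 1.0 - SequenceMatcher(None, s.lower(), t.lower()).ratio()

def dist_min(s: str, ts: set) -> float:
    """Minimum distance between a string and a set of alias strings."""
    return min(dist(s, t) for t in ts)

def Dist(S: tuple, T: tuple, w: tuple) -> float:
    """Weighted signifier distance: sum_i dist_min(S_i, T_i) * w_i."""
    return sum(dist_min(si, ti) * wi for si, ti, wi in zip(S, T, w))

def perfect_match(S, T, w) -> bool:
    return Dist(S, T, w) == 0.0

def approximate_match(S, T, w, tau: float) -> bool:
    d = Dist(S, T, w)
    return 0.0 < d <= tau
```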
The algorithm of integrating a source data object
into the target database (data warehouse) is a semi-
interactive task based on the previously defined per-
fect and approximate matches. If a perfect match is
found, the new object is automatically assigned to the
target data warehouse. In all other cases, the user is
asked to decide what to do. The following steps sum-
marize this algorithm for adding a source object with
signifier S to a target database with existing target sig-
nifiers TS:

1. If we find a T ∈ TS that is a perfect match to S, add the new object to the target database using T. Done.

2. Let TS_approx be the (possibly empty) set of target signifiers that are approximate matches to S. Ask the user for a decision with the following options:

(a) Accept one T ∈ TS_approx as match. The new object will be added to the target database using T. The aliases of T are updated in the target database by adding all components of S that do not fully match to the aliases in T.

(b) Accept S as a new target signifier. The new object will be added to the database and S will be added to the target signifiers.

(c) Abort the import process and fix the source database.
Each import of a data source potentially extends the set of target signifiers and their aliases. Hence, user interaction decreases over time, and the import of source data comes closer and closer to running fully automatically.
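The following sketch outlines this semi-interactive import step in Python, reusing the Dist, perfect_match, and approximate_match helpers sketched above; the ask_user callback and the in-memory list of target signifiers are placeholder assumptions for the actual warehouse and user interface.

```python
def integrate(S, target_signifiers, w, tau, ask_user):
    """Assign source signifier S to a target signifier, possibly creating a
    new one. Returns the chosen target signifier, or None if aborted."""
    # Step 1: a perfect match is assigned automatically.
    for T in target_signifiers:
        if perfect_match(S, T, w):
            return T

    # Step 2: collect approximate matches (sorted by distance) and ask the user.
    candidates = sorted(
        (T for T in target_signifiers if approximate_match(S, T, w, tau)),
        key=lambda T: Dist(S, T, w),
    )
    choice = ask_user(S, candidates)   # a candidate, the string "new", or "abort"

    if choice == "abort":              # 2(c): fix the source database first
        return None
    if choice == "new":                # 2(b): S becomes a new target signifier
        T_new = tuple({c} for c in S)
        target_signifiers.append(T_new)
        return T_new

    # 2(a): accepted candidate; non-matching components of S become new aliases.
    for s_comp, t_aliases in zip(S, choice):
        if s_comp not in t_aliases:
            t_aliases.add(s_comp)
    return choice
```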
5 CASE STUDY
In this section, we perform an empirical case study
based on the guidelines introduced in (Runeson and Höst, 2009). The main goal is to evaluate if the ap-
proach using signifiers for the integration of data from
heterogeneous data sets into the data warehouse sig-
nificantly improves data quality, i.e., reduces incor-
rect classifications of objects imported into a given
data warehouse. Especially in the domain of rail au-
tomation with business critical data it is crucial to
avoid incorrect classification of data. We conducted
the case study for the business unit Rail Automation,
in particular for importing planning, order calculation
and configuration data of railway systems into an as-
set management repository.
5.1 Research Questions
Q1: Can signifiers, as described in Section 4, mini-
mize the incorrect classification of objects at data im-
port?
Q2: Can our import process based on signifiers mini-
mize user interactions?
Q3: From an implementation point of view, how eas-
ily can conventional (composite) primary keys of ob-
jects in a data warehouse be extended to signifiers?
5.2 Case Study Design
Requirements. We perform our empirical tests in an
asset management scenario for Rail Automation. A
data warehouse should collect the amount of all assets
(i.e., hardware modules and devices) of all projects
and stations that are installed, currently engineered,
or planned for future projects. To populate this data
warehouse, available documents comprising planned
and installed systems should be imported. Planning
and proposal data, available in Excel format, can be
classified as semi-structured input: While the column
structuring of the tables is quite stable, the names of
the different assets often differ because of the usage of
different wording, languages, and formatting. The ta-
bles also contain typing errors, because they are filled
out manually as plain text. Configuration data for al-
ready installed systems is available in XML format
and is therefore well-structured. As the underlying XML schema changed over time due to different engineering tool versions, the XML data also contains structural variability.
The exemplary import of a couple of different
Excel files, including files from projects of different
countries, and one XML file should be performed.
The test should start with an empty data warehouse,
i.e., no target signifiers are known at the beginning;
the set of appropriate target signifiers should be built
up during import of the different files.
Setup. The software architecture for our implementa-
tion of the ETL process for the case study is sketched
in Fig. 2. We use a standard ETL process as, e.g., de-
scribed in (Naumann, 2014). The extraction phase is
implemented with KNIME (https://www.knime.org/).

Figure 2: Case study ETL process.
The result of the extraction phase is a dataset
containing source signifiers and corresponding data
values. Our Signifier Matching component, imple-
mented in Python, tries to find for each entry in that
dataset a matching target signifier already stored in
the data warehouse. Ambiguities are resolved by ask-
ing the user. The load phase consists of updates to
the data warehouse: input data are added with their
references to a target signifier and a source descrip-
tor. New target signifiers and new aliases of existing
target signifiers are added, as well.
Using our method, the following application-
specific parameters must be adjusted: the number of
signifier components, the string distance metrics, the
weights of the signifier components, and the thresh-
old value for approximate matches. In our tests,
we used signifiers consisting of 3 components, rep-
resenting category, subcategory and subsubcategory
of a data object. For string comparison we used a
case-insensitive Jaro-Winkler distance (Cohen et al.,
2003). Weights and threshold are determined as
shown in the next section, specifically cf. Table 4.
5.3 Results
In this section, we present the results of our case study
from data and behavioral perspectives. Our main goal
was to analyze the applicability of signifiers in the
transform phase of the ETL process with respect to
data quality and amount of necessary user decisions.
Data Integration with Signifiers. We demonstrate
our approach by describing the key steps of data inte-
gration of an Excel source and an XML source on a
concrete example from the Rail Automation domain.
We sketch the main database tables according to Figure 1.
Target signifiers are represented in a table as de-
picted in Table 1. Each row represents an asset type
with a primary key, ID, for being referenced from data
tables. The other columns represent the components
of the target signifiers. Please note that such compo-
nents typically contain not only a single string but a
set of strings (i.e., aliases) for different wordings or
different languages.
Table 1: Target signifier table of the data warehouse.

ID  Category  SubCat.                             SubSubCat.
1   Signal    Main Signal, Hauptsignal, HS        4 lamps, 4
2   Signal    Shunting Signal, Rangiersignal, RS  2 lamps, 2
...

Figure 3 shows a small part of an input table (in German language). The extraction of the first row results in the source signifier [Signale,
Hauptsignal, 4 Lampen] and value 10 represent-
ing the amount of main signals with 4 lamps in the
railway station. Obviously, this signifier has no per-
fect match in the target signifiers as shown in Table 1,
because German language and some extended word-
ing is used in the source Excel file. The matching dis-
tances according to our settings are 0.05 for the first
and 0.182 for the second target signifier. The first one
has a clearly lower distance value. After the user has
confirmed that this is the right match, firstly, the tar-
get signifier table is updated by adding aliases ”Sig-
nale” (for category) and ”4 Lampen” (for subsubcate-
gory) to signifier 1, and secondly, the object value 10
is added to the data table (see first row in Table 2) with
links to the right target signifier and its source. The
other rows are extracted in the same way (not shown
in the table).
Figure 3: An excerpt of objects in an Excel spreadsheet.
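As an illustration, this row can be run against the two target signifiers of Table 1 with the Dist helper sketched in Section 4; the concrete distances of 0.05 and 0.182 reported above stem from our case-insensitive Jaro-Winkler setting, so the stand-in metric yields different (but similarly ordered) values.

```python
source = ("Signale", "Hauptsignal", "4 Lampen")
targets = {
    1: ({"Signal"}, {"Main Signal", "Hauptsignal", "HS"}, {"4 lamps", "4"}),
    2: ({"Signal"}, {"Shunting Signal", "Rangiersignal", "RS"}, {"2 lamps", "2"}),
}
w = (1/3, 1/3, 1/3)

for target_id, T in targets.items():
    print(target_id, round(Dist(source, T, w), 3))  # target 1 is clearly closer
```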
For the XML source document depicted in Fig-
ure 4, the signifier components for the first object are
built in the following way: The first component is the
tag name of the XML element, signal, the second
one its signal type, HS, and the third one is created
by counting the lamp sub-elements. Now we get the
source signifier [signal, HS, 4]. Since our distance metric is case-insensitive, we have a perfect match
with target signifier 1 in the data warehouse. The data
value for this kind of object is computed by counting
the number of objects of this type in the XML file.
After processing the second XML element in a simi-
lar way, the data table is updated. Table 2 shows the
resulting two entries at the end.
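A possible extraction for such an XML source is sketched below with Python's xml.etree.ElementTree; since Figure 4 shows only an excerpt, the exact element and attribute names (signal, type, lamp) are assumptions about the file structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def extract_xml_signifiers(path: str) -> Counter:
    """Build source signifiers from signal elements and count the objects
    per signifier; the count becomes the data value (cf. aggregateValueN)."""
    counts = Counter()
    root = ET.parse(path).getroot()
    for signal in root.iter("signal"):        # assumed element name
        category = signal.tag                 # first component: tag name
        subcategory = signal.get("type", "")  # assumed attribute for the signal type, e.g. "HS"
        lamps = len(signal.findall("lamp"))   # third component: number of lamp sub-elements
        counts[(category, subcategory, str(lamps))] += 1
    return counts

# e.g. extract_xml_signifiers("station.xml") might yield
# Counter({("signal", "HS", "4"): 1, ("signal", "RS", "2"): 1})
```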
Figure 4: An XML data source containing object elements.

Table 2: Data objects table after import of values from an Excel and an XML source.

Signifier  Value  Source
1          10     Source 1 (Excel)
...        ...    ...
1          1      Source 2 (XML)
2          1      Source 2 (XML)

Results in Data Quality and User Interactions. We performed our tests on several data sources in Excel and XML format, starting with an empty data warehouse without any target signifiers. In total, about 90 signifiers were created; 10% of these signifiers got
aliases. The different sources partly contained dif-
ferent wordings, languages, and abbreviations for ob-
ject designation, and contained typing errors. Qual-
itatively speaking, for the most of these different
wordings, our signifier matcher found approximate
matches and could very specifically ask the user for a
match decision. In case of no perfect match could be
found, the number of provided options was mostly in
the range of 3. A few signifiers had name variations
which fell out of the class of approximate matches,
and the user had to match them manually. Provided
that incorrect object classifications should be avoided,
the number of user decisions was minimal.
Table 3: The correlation of match candidates and expert reference (correct match).

                   correct match   no
match candidate    tp              fp
no                 fn              tn
For a quantitative assessment of the quality of matches, we use the F-measure (more specifically, the F_2-measure) from the field of information retrieval (Salton and Harman, 2003; Wimmer and Langer, 2013). This measure is based on the notion of precision and recall, which are defined in terms of true/false positives/negatives. Table 3 shows how true/false positives/negatives are defined by comparing the correct value as defined by an expert with the match candidates computed by signifier matching (perfect or approximate matches). Precision, recall, and the F_2-measure are defined as follows:

P = |tp| / (|tp| + |fp|), R = |tp| / (|tp| + |fn|), F_2 = 5 × P × R / (4 × P + R).

We use the F_2-measure here to emphasize the recall value; it is important in our application that we do not miss any correct matches.
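For reference, the F_2 computation from the confusion-matrix counts of Table 3 can be written as a small generic helper; beta = 2 reproduces the formula above.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F_beta from true positives, false positives, and false negatives;
    beta = 2 weights recall higher than precision, as used in our evaluation."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```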
Table 4 shows the results of our tests by precision, recall, and F_2 values. We varied the threshold and weight values to find an optimal combination. We used threshold values τ1 = 0.01, τ2 = 0.025, τ3 = 0.05, τ4 = 0.1; we used weight values w1 = (1/14, 4/14, 9/14), w2 = (1/3, 1/3, 1/3), w3 = (9/14, 4/14, 1/14). It was observed that:
- As expected, small threshold values lead to high precision values, and large threshold values lead to high recall values. Precision and recall are negatively correlated.

- For the set of used test data, the value combination τ = 0.025 and w = (1/3, 1/3, 1/3) achieves the best matching results, i.e., has the highest F_2-measure value.

- The F_2 values of many threshold/weight combinations are, in absolute terms, quite high, indicating that the proposed method achieves high-quality matches and is quite robust against small changes of the input parameters.
Table 4: Precision, recall, and F_2 values for signifier matching tests, varying the approximate match threshold (τ) and the signifier component weights (w).

         τ1 = 0.00   τ2 = 0.025   τ3 = 0.05   τ4 = 0.1
P    w1: 1.000       0.891        0.779       0.710
     w2: 1.000       0.930        0.887       0.835
     w3: 1.000       0.901        0.793       0.432
R    w1: 0.895       0.950        0.956       0.961
     w2: 0.895       0.950        0.950       0.950
     w3: 0.895       0.950        0.950       0.956
F_2  w1: 0.914       0.938        0.914       0.898
     w2: 0.914       0.946        0.937       0.925
     w3: 0.914       0.940        0.914       0.770
5.4 Interpretation of Results
We analyze the results with regard to our research
questions.
Q1: Can signifiers minimize the incorrect classifi-
cation of objects? Yes, according to the test results
shown in Table 4, our method is capable of achieving high precision and recall values. Compared
to the ”perfect match only” scenario (τ = 0) with
its perfect precision, approximate matches showed a
weaker precision but a much better recall and there-
fore a better F-measure.
Q2: Can our import process based on signifiers min-
imize user interactions? Yes, provided that automated matches are only made based on a perfect match, the number of user interactions was minimal in the sense that the user was not asked for a
similar match twice. Furthermore, the list presented
to the user for a manual match was sorted by match
distance; in most cases the user found his/her match
on the first or second place in that list.
Q3: From an implementation point of view, how eas-
ily can conventional (composite) primary keys of ob-
jects in a data warehouse be extended to signifiers?
Instead of using data tables with composite keys in
the data warehouse, signifiers are stored in a separate
table with a generated key for referencing from data
tables. Design and implementation of this separate signifier table are straightforward.
5.5 Threats to Validity
The first import of a data source where all signifier
components were translated to another language pro-
duces a lot of user interaction. This could be avoided
by generating and adding translations as aliases to the
signifiers in the data warehouse.
In situations where a signifier component repre-
sents a number (e.g., number of lamps of a railway
signal), plain string distance is not an optimal choice.
E.g., a human would consider ”2” closer to ”3” than
to ”5”, which is usually not the case in a string dis-
tance function. Our definition of signifiers as a tu-
ple of string components could be extended to com-
ponents of different types. This would allow the im-
plementation of type-specific distance functions and
solve that problem.
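One possible realization of such type-specific distance functions is sketched below, reusing the string distance dist from Section 4; the per-component assignment and the normalization constant are illustrative assumptions only, not part of the evaluated implementation.

```python
def numeric_dist(s: str, t: str, max_diff: float = 10.0) -> float:
    """Distance in [0, 1] for numeric components: '2' is closer to '3' than to '5'."""
    try:
        a, b = float(s), float(t)
    except ValueError:
        return dist(s, t)                    # fall back to the string distance
    return min(abs(a - b) / max_diff, 1.0)   # relative difference, capped at 1

# A typed signifier could carry one distance function per component, e.g.:
component_dists = (dist, dist, numeric_dist)   # category, subcategory, number of lamps

def typed_Dist(S, T, w):
    """Weighted signifier distance with a type-specific metric per component."""
    return sum(min(d(si, t) for t in ti) * wi
               for d, si, ti, wi in zip(component_dists, S, T, w))
```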
6 CONCLUSION AND FUTURE
WORK
In this paper, we identified various issues in the in-
tegration process of business-critical data from het-
erogeneous data sources. To address these issues, we
proposed a semi-interactive approach. We introduced
a technique using the notion of a signifier which is a
natural extension of composite primary keys, to support the user in resolving ambiguous data classifications.
In a case study, we validated the applicability of our
approach in the industrial environment of Rail Au-
tomation. The results show a significant improvement
of data quality.
There are several ideas for future work. One re-
lates to extending the textual representation of com-
ponent types with numerical values. This affects the
storage of values in the data warehouse as well as the
algorithm that compares values. Another direction
is to strive for a joint dictionary bridging language-specific component terms. This would accelerate the integration process, especially in companies operating internationally.
ACKNOWLEDGEMENTS
This work is funded by the Austrian Research Promo-
tion Agency (FFG) under grant 852658 (CODA). We
thank Walter Obenaus (Siemens Rail Automation) for
supplying us with test data.
REFERENCES
Bilenko, M. and Mooney, R. J. (2003). Adaptive duplicate
detection using learnable string similarity measures.
In Proceedings of the ninth ACM SIGKDD, pages 39–
48. ACM.
Bleiholder, J. and Naumann, F. (2009). Data fusion. ACM
Computing Surveys (CSUR), 41(1):1.
Cohen, W. W., Ravikumar, P., and Fienberg, S. E. (2003).
A comparison of string distance metrics for name-
matching tasks. In Proceedings of IJCAI-03, August
9-10, 2003, Acapulco, Mexico, pages 73–78.
Dai, W., Wardlaw, I., Cui, Y., Mehdi, K., Li, Y., and Long, J.
(2016). Data profiling technology of data governance
regarding big data: Review and rethinking. In In-
formation Technology: New Generations, pages 439–
450. Springer.
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.,
Ilyas, I. F., Ouzzani, M., and Tang, N. (2013). Nadeef:
a commodity data cleaning system. In Proceedings of
the 2013 ACM SIGMOD, pages 541–552. ACM.
Dasu, T. and Johnson, T. (2003). Exploratory data mining
and data cleaning: An overview. Exploratory data
mining and data cleaning, pages 1–16.
Fan, W. and Geerts, F. (2012). Foundations of data quality
management. Synthesis Lectures on Data Manage-
ment, 4(5):1–217.
Fan, W., Li, J., Ma, S., Tang, N., and Yu, W. (2010). To-
wards certain fixes with editing rules and master data.
Proceedings of the VLDB Endowment, 3(1-2):173–
184.
Gill, R. and Singh, J. (2014). A review of contemporary
data quality issues in data warehouse etl environment.
Journal on Today’s Ideas - Tomorrow’s Technologies.
Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger,
W., and Baumgartner, N. (2011). Semgen: Towards a
semantic data generator for benchmarking duplicate
detectors. In DASFAA, pages 490–501. Springer.
Hellerstein, J. M. (2008). Quantitative data cleaning for
large databases. United Nations Economic Commis-
sion for Europe (UNECE).
Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani,
M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., and Yin,
S. (2015). Bigdansing: A system for big data cleans-
ing. In Proceedings of the 2015 ACM SIGMOD, pages
1215–1230. ACM.
Krishnan, S., Haas, D., Franklin, M. J., and Wu, E. (2016).
Towards reliable interactive data cleaning: a user sur-
vey and recommendations. In HILDA@ SIGMOD,
page 9.
Langer, P., Wimmer, M., Gray, J., Kappel, G., and Valle-
cillo, A. (2012). Language-specific model version-
ing based on signifiers. Journal of Object Technology,
11(3):4–1.
Leser, U. and Naumann, F. (2007). Informationsintegration
- Architekturen und Methoden zur Integration verteil-
ter und heterogener Datenquellen. dpunkt.verlag.
Liu, H., Kumar, T. A., and Thomas, J. P. (2015). Clean-
ing framework for big data-object identification and
linkage. In 2015 IEEE International Congress on Big
Data, pages 215–221. IEEE.
Müller, H. and Freytag, J.-C. (2005). Problems, methods, and challenges in comprehensive data cleansing. Professoren des Inst. für Informatik.
Naumann, F. (2014). Data profiling revisited. ACM SIG-
MOD Record, 42(4):40–49.
Papadakis, G., Alexiou, G., Papastefanatos, G., and
Koutrika, G. (2015). Schema-agnostic vs schema-
based configurations for blocking methods on homo-
geneous data. Proceedings of the VLDB Endowment,
9(4):312–323.
Rahm, E. and Do, H. H. (2000). Data cleaning: Prob-
lems and current approaches. IEEE Data Eng. Bull.,
23(4):3–13.
Runeson, P. and Höst, M. (2009). Guidelines for conduct-
ing and reporting case study research in software engi-
neering. Empirical software engineering, 14(2):131.
Salton, G. and Harman, D. (2003). Information retrieval.
John Wiley and Sons Ltd.
Sharma, S. and Jain, R. (2014). Modeling etl process for
data warehouse: an exploratory study. In ACCT, 2014 Fourth International Conference on, pages 271–276. IEEE.
Volkovs, M., Chiang, F., Szlichta, J., and Miller, R. J.
(2014). Continuous data cleaning. In 2014 IEEE 30th
ICDE, pages 244–255. IEEE.
Wang, J., Kraska, T., Franklin, M. J., and Feng, J. (2012).
Crowder: Crowdsourcing entity resolution. Proceed-
ings of the VLDB Endowment, 5(11):1483–1494.
Wimmer, M. and Langer, P. (2013). A benchmark for
model matching systems: The heterogeneous meta-
model case. Softwaretechnik-Trends, 33(2).