A Strategy for Selecting Relevant Attributes for Entity Resolution in

Data Integration Systems

Gabrielle Karine Canalle, Bernadette Farias Loscio and Ana Carolina Salgado

Center of Informatics (CIn), Federal University of Pernambuco (UFPE), Recife, Pernambuco, Brazil

Keywords:

Attribute Selection, Entity Resolution, Data Integration.

Abstract:

Data integration is an essential task for achieving a unified view of data stored in heterogeneous and distributed data sources. A key step in this process is Entity Resolution, which consists of identifying instances that refer to the same real-world entity. In general, similarity functions are used to discover equivalent instances. The quality of the Entity Resolution result is directly affected by the set of attributes selected to be compared. However, such attribute selection can be challenging. In this context, this work proposes a strategy for selecting the relevant attributes to be considered in the Entity Resolution process, more precisely in the instance matching phase. This strategy considers characteristics of the attributes, such as the quantity of duplicated and null values, in order to identify the most relevant ones for the instance matching process. In our experiments, the proposed strategy achieved good results for the Entity Resolution process: the attributes classified as relevant were the ones that contributed to finding the greatest number of true matches with few incorrect matches.

1 INTRODUCTION

The growing ease of data generation and sharing has contributed to the rapid rise of the data volume available in digital environments. However, this growth has occurred in an uncontrolled manner, so that a lot of data has been published containing erroneous, missing or duplicate values, which may hinder its use. Nevertheless, there is a rising demand for data integration solutions over distributed and heterogeneous data sources. Examples are price comparison websites, such as Booking.com and Trivago.com.

Data integration solutions aim to combine data from different sources to provide users with a unified view of these data. An important step in this process is Entity Resolution (Christen, 2012), which aims to identify equivalent instances, i.e., instances that represent the same real-world concept (Dong and Srivastava, 2015). Entity Resolution consists of several stages, including a comparison step between instance pairs. During this step, the similarity between the attribute values that describe the compared instances is evaluated. In this context, one of the main challenges concerns the choice of attributes to use in the comparison step.

http://www.booking.com

http://www.trivago.com

Generally, the choice of attributes is performed manually (Gruenheid et al., 2014). However, given the existence of many attributes and the lack of prior knowledge of the domain of the sources, this choice may include attributes that do not contribute effectively to the Entity Resolution process. Since the quality of Entity Resolution results is directly affected by the selected attributes, the use of strategies to help select the best ones becomes essential.

In this paper, we propose a strategy that automates the Attribute Selection task in order to facilitate the identification of good attributes to be used in the comparison of instances. Our proposal considers specific criteria related to attribute values, such as density and repetition, to evaluate the relevance of an attribute. An attribute is considered relevant if it contributes positively to the identification of true correspondences, and irrelevant if it contributes to the identification of incorrect matches (false positives and false negatives). By using the proposed strategy, we expect to obtain, at the end of the Entity Resolution process, the largest possible number of true matches with the lowest number of incorrect matches.

The rest of this paper is organized as follows. Sec-

tion 2 presents a motivational example to illustrate the

relevance of the proposed strategy. Section 3 presents

Canalle, G., Lóscio, B. and Salgado, A.

A Strategy for Selecting Relevant Attributes for Entity Resolution in Data Integration Systems.

DOI: 10.5220/0006316100800088

In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 1, pages 80-88

ISBN: 978-989-758-247-9

Table 1: Result of a query about Data Integration in the sources CiteSeerX and DBLP.

| ID Paper | ID | Source | Author | Title | Year | Venue | Pages |
|---|---|---|---|---|---|---|---|
| 1 | 1 | CiteSeerX | M. Lanzerini | Data Integration: A Theorical Perspective (2002) | 2002 | Symposium on Principles of Database Systems | NULL |
| 1 | 2 | DBLP | Maurizio Lanzerini | Data Integration: A Theorical Perspective | 2002 | PODS 2002 | 233-246 |
| 2 | 3 | DBLP | Guy Pierra | The PLIB ontology-based approach to data integration | 2004 | IFIP Congress Topical Sessions | 13-18 |
| 3 | 4 | CiteSeerX | Patrick Ziegler and Klaus R. Dittrich | Three decades of data integration - all problems solved | NULL | In 18th IFIP Computer Congress (WCC) | NULL |
| 3 | 5 | DBLP | Patrick Ziegler and Klaus R. Dittrich | Three decades of data integration - All problems solved? | 2004 | IFIP Congress Topical Sessions | NULL |

a summary of the main concepts related to Entity Resolution and Attribute Selection. Section 4 specifies the proposed Attribute Selection strategy. Section 5 discusses the results obtained through experiments. Section 6 presents some related works and, finally, Section 7 presents the final considerations and future work.

2 MOTIVATIONAL EXAMPLE

To illustrate the need to select the best attributes to

be used in the comparison step of the Entity Resolu-

tion process, and how these attributes impact on the

quality of the process results, consider the following

example.

An online digital library service, in the domain of Computer Science, integrates data from multiple sources, such as CiteSeerX and DBLP, and makes it possible to search by title, author, or keyword. Assume that a user is interested in searching for papers about "Data Integration". The integration service submits the query to the data sources CiteSeerX and DBLP, and gets a set of papers as the result of the query. A small fraction of this result can be seen in Table 1.

In this example, we present five instances, each one with its own identifier (ID). These instances are related to three different papers, identified by the column ID Paper. In order to provide the integrated result to the user, the Entity Resolution process should be performed. For this, in the comparison step, the attributes that describe the instances are compared. As mentioned before, attribute selection is usually done manually or by considering all attributes in the comparison step. We examine both situations.

CiteSeerX: http://citeseerx.ist.psu.edu
DBLP: http://dblp.uni-trier.de

Suppose an Entity Resolution process that is performed considering all attributes. Possibly, instances 1 and 2 would be considered as not duplicated, given that, of the five considered attributes, two (Venue and Pages) have a low similarity value. The same would happen with instances 4 and 5, where the attributes Year and Pages also have a low similarity value. We can observe that attributes containing null values may negatively affect the Entity Resolution result. This happens because we consider that a null value leads to a similarity equal to 0 (zero). For this reason, it does not make sense to use attributes that contain a large number of missing values, because this can cause two equivalent instances to be treated as distinct. Therefore, the comparison considering all attributes would result in incorrect matches, known as false negatives, given that the instances of papers 1 and 3 are duplicates, and the Entity Resolution algorithm would possibly consider them as not corresponding.
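The effect of null values described above can be illustrated with a minimal sketch. It assumes that the instance similarity is the plain average of the attribute similarities, and it uses Python's `difflib.SequenceMatcher` as a stand-in string similarity; both choices, as well as the function names, are assumptions made for illustration and are not taken from the paper.

```python
from difflib import SequenceMatcher

def attribute_similarity(a, b):
    """Similarity in [0, 1]; a null (None) on either side yields 0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def mean_similarity(x, y, attributes):
    """Average attribute similarity of two instances (assumed aggregation)."""
    total = sum(attribute_similarity(x[a], y[a]) for a in attributes)
    return total / len(attributes)

# Instances 1 and 2 from Table 1: the same paper in two sources.
inst1 = {"Author": "M. Lanzerini",
         "Title": "Data Integration: A Theorical Perspective (2002)",
         "Year": "2002",
         "Venue": "Symposium on Principles of Database Systems",
         "Pages": None}
inst2 = {"Author": "Maurizio Lanzerini",
         "Title": "Data Integration: A Theorical Perspective",
         "Year": "2002",
         "Venue": "PODS 2002",
         "Pages": "233-246"}

score_all = mean_similarity(inst1, inst2, ["Author", "Title", "Year", "Venue", "Pages"])
score_subset = mean_similarity(inst1, inst2, ["Author", "Title", "Year"])
print(score_all, score_subset)
```

Under these assumptions, the score over all five attributes is pulled down by Venue and by the null Pages value, while the score over Author, Title and Year is higher for the same pair.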

Now, consider that a subset of attributes was selected randomly, without considering the attribute values. Suppose a subset composed of the attributes Year and Venue. Probably, instances 3 and 5 would be considered duplicates, given that the values of these attributes have a high similarity. This happens because repeated values can increase the similarity value, which can cause two distinct instances to be classified as corresponding. This would result in an incorrect correspondence, known as a false positive, in which a non-corresponding instance pair is classified as corresponding. On the other hand, instances 1 and 2, as well as 4 and 5, would be classified as non-corresponding, resulting in two false negatives.

We can notice that using all attributes, or a subset of attributes selected randomly, may lead to Entity Resolution results with a low F-measure. Because of this, an Attribute Selection strategy, in which only the attributes relevant to the Entity Resolution process are selected, becomes necessary.

F-measure: a measure that indicates the quality of the Entity Resolution process, calculated as the harmonic mean of the precision and recall values. More details are presented in Section 5.


3 ENTITY RESOLUTION AND

ATTRIBUTE SELECTION

Data Integration is a complex process, which can be divided into three main steps: Schema Alignment, Entity Resolution, and Data Fusion (Dong and Srivastava, 2015). In this work, we are interested in the Entity Resolution step, whose objective is to identify different instances referring to the same real-world entity. Entity Resolution, in turn, is subdivided into the following activities: Blocking, Pair Comparison and Clustering (Christen, 2012; Dong and Srivastava, 2015), as shown in Figure 1.

Figure 1: Activities of the Entity Resolution process (Dong

and Srivastava, 2015).

In the Blocking activity, the goal is to decrease the number of comparisons between instances. For this, blocking techniques are used. Standard blocking uses a blocking key to partition the instances so that, preferably, all instances of the same entity fall within the same block. Then, in the Pair Comparison activity, a detailed comparison between the instance pairs is performed. Finally, in the Clustering activity, the instances are clustered so that each cluster is composed of instances that refer to the same entity (Christen, 2012). In this work, we focus on the Pair Comparison activity.
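As an illustration of how standard blocking reduces the number of comparisons, the following minimal sketch partitions records by a blocking key and generates candidate pairs only within each block. The records reuse titles from Table 1; the blocking key (first letter of the title) and the function name are hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations

def standard_blocking(instances, key_func):
    """Partition instances by blocking key and return the pairs inside each block."""
    blocks = defaultdict(list)
    for inst in instances:
        blocks[key_func(inst)].append(inst)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

records = [
    {"id": 1, "title": "Data Integration: A Theorical Perspective (2002)"},
    {"id": 2, "title": "Data Integration: A Theorical Perspective"},
    {"id": 3, "title": "The PLIB ontology-based approach to data integration"},
    {"id": 4, "title": "Three decades of data integration - all problems solved"},
    {"id": 5, "title": "Three decades of data integration - All problems solved?"},
]

# Hypothetical blocking key: first letter of the title, lowercased.
candidate_pairs = standard_blocking(records, lambda r: r["title"][0].lower())
candidate_ids = sorted((a["id"], b["id"]) for a, b in candidate_pairs)
print(candidate_ids)
```

With this key, only four candidate pairs are compared instead of the ten pairs of an exhaustive comparison, and both true duplicate pairs (1, 2) and (4, 5) remain in the same block.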

During the Pair Comparison activity, the similarity values of the attributes describing two compared instances are evaluated. The higher the similarity values of these attributes, the higher the chances that these two instances are equivalent. One of the main challenges is to determine which attributes to consider for the comparison. Intuitively, one can think that the higher the number of attributes, the better the result of the comparison, given that we consider the maximum amount of information. However, some attributes may not be relevant for the pair comparison. Therefore, it is necessary to select a subset of attributes to be used during the comparison step, in order to maximize the quality of the Entity Resolution results.

In general, attribute selection aims to find a subset of attributes from the original set, removing those that are irrelevant or redundant, which may significantly improve the efficiency of the task performed (Dash et al., 2002). For this purpose, the following issues should be addressed: identification of the attributes common to all instances participating in the Entity Resolution, and selection of an optimal subset of these common attributes. In other words, we look for a set of attributes able to support the generation of true entity correspondences (Gu et al., 2003). In addition to data integration, attribute selection has been a subject of research in other Computer Science areas, such as Data Mining and Machine Learning (Jouve and Nicoloyannis, 2005; Li et al., 2006). Beyond improving process efficiency, these studies aim to reduce the dimensionality of the samples, minimizing computational costs and data storage space.

4 A STRATEGY FOR SELECTING

RELEVANT ATTRIBUTES FOR

THE ENTITY RESOLUTION

PROCESS

In this section, we present an overview of our strategy

for attribute selection in the Entity Resolution process

and its main tasks.

4.1 Overview

As mentioned before, the Entity Resolution process

can be seen as part of a Data Integration process,

whose main goal is to provide an integrated view of

data distributed in multiple data sources.

An overview of the proposed attribute selection strategy is presented in Figure 2 and described below. In the following, consider a set of data sources $F = \{f_1, f_2, ..., f_m\}$ that provides data about one or more entities $E$ of the real world. An entity $E$ in a given data source $f_i$ has a set of instances $f_i.E = \{e_1, e_2, ..., e_n\}$ such that each $e_l$ represents an instance of the real world, and is described by a set of attributes $E.A_i = \{a_{i1}, a_{i2}, ..., a_{ip}\}$. An instance $e_l$ of an entity $E$ in the data source $f_i$, denoted by $f_i.E.e_l$, is defined by a set of pairs $\{(a_{i1}, v_1), ..., (a_{ip}, v_p)\}$ where $a_{ij} \in E.A_i$, and $v_j$ is the value of $a_{ij}$ for the entity $E$ in the data source $f_i$.

The Entity Resolution process receives as input a set of instances $E_I = \{e_1, e_2, ..., e_n\}$ referring to the same entity of the real world. Each instance $e_l$ from $E_I$ can be from a different data source $f_i$ and is described by a set of attributes, which depends on the data source $f_i$ where the instance comes from. As shown in Figure 2, the attribute selection task is performed as an auxiliary task of the Entity Resolution process and receives as input the set $E_I$ and the set $A_{int}$, which denotes the set of attributes common to all instances from $E_I$ (Gu et al., 2003).
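The set of attributes common to all instances can be obtained as a plain set intersection. The sketch below assumes each instance is represented as a dictionary from attribute name to value; this representation, the function name and the sample instances are illustrative assumptions, not part of the original strategy description.

```python
def common_attributes(instances):
    """Return the attributes present in every instance (the set A_int)."""
    if not instances:
        return set()
    common = set(instances[0])
    for inst in instances[1:]:
        common &= set(inst)
    return common

# Hypothetical instances coming from sources with different schemas.
instances = [
    {"author": "M. Lanzerini", "title": "Data Integration", "year": "2002", "venue": "PODS"},
    {"author": "Maurizio Lanzerini", "title": "Data Integration", "year": "2002", "pages": "233-246"},
    {"author": "Guy Pierra", "title": "The PLIB approach", "year": "2004", "venue": "IFIP"},
]

a_int = common_attributes(instances)
print(sorted(a_int))
```

Only the attributes shared by every instance are kept, so attributes such as venue or pages, which some sources do not provide, are excluded before relevance is evaluated.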

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

Figure 2: Overview of the Attributes Selection Strategy.

Attribute selection is divided into two main steps: (i) Analysis of Individual Relevance, which is based on characteristics related to the data, such as repetition and density values; (ii) Analysis of Global Relevance, which considers characteristics of the data sources, such as reliability. In order to support our strategy, we consider the existence of a metadata catalog (Oliveira et al., 2015) that stores information about the data sources being integrated, such as name, URL, schema (containing entities and attributes), and quality metadata, including the reliability of the data source.

During the Analysis of Individual Relevance, values for the repetition and density criteria are calculated, and the Individual Relevance value is then obtained for each $a_{ij} \in A_{int}$. The Analysis of Global Relevance of a given attribute $a_{ij}$ consists of calculating a new value based on the Individual Relevance value of $a_{ij}$ and the reliability value of the data sources where the instances from $E_I$ come from.

As output of the attribute selection process, a set of global relevance values $R = \{RG_1, ..., RG_p\}$ is given. Each $RG_j$ is a pair $(a_{ij}, r_{ij})$, where $r_{ij}$ is the Global Relevance value of the attribute $a_{ij} \in A_{int}$.

4.2 Evaluation Criteria

In order to identify whether an attribute is relevant or not, we should first define the meaning of a relevant attribute for the task at hand. Then, we can define the criteria to evaluate the relevance of a given attribute. In our case, the task at hand is the Entity Resolution process, and the relevance of an attribute is proportional to its capability to discriminate instances that belong to different entities and to not discriminate instances that belong to the same entity. We also consider that the quality of the data sources can have a direct impact on the relevance of an attribute. For example, a data source may be likely to provide data that is not true. In this sense, weighting the relevance of an attribute according to the quality of the data sources can help to verify whether the result of the attribute selection is really reliable.

We identified six criteria to measure the relevance of attributes for the Entity Resolution process (Canalle, 2016). However, in this work only four criteria are used. Two of them are related to the data and are used to measure the individual relevance of attributes. In general, these characteristics include the level of errors and the number (and distribution) of attribute values, i.e., their informative content. For example, a field such as sex has only two possible values and, consequently, may not give enough information to identify similarity between instances. On the other hand, an attribute like surname contains more information; however, it is more prone to errors (Gu et al., 2003). The other two criteria, related to the data sources, are used to measure the global relevance of a given attribute.

4.3 Analysis of Individual Relevance

In this section, we will detail each of the criteria cho-

sen for the evaluation of the Individual Relevance of

each attribute, specifying how it is calculated.

4.3.1 Repetition and Density

The repetition of an attribute $a_{ij}$ is given by the number of times the same attribute value appears in the set of instances $E_I$. The choice of this criterion for attribute selection was motivated by the fact that using attributes with high repetition values can contribute to the generation of false positives, i.e., distinct instances classified as similar.

Given the set of instances $E_I$, the value of the repetition criterion is calculated for each attribute $a_{ij} \in A_{int}$. In this paper, this value is calculated according to Equation 1, where $\tau$ is the total number of distinct values of $a_{ij}$, and $\eta$ is the total number of values of $a_{ij}$.

$$Rep(a_{ij}) = 1 - \frac{\tau}{\eta} \tag{1}$$
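A minimal sketch of the repetition criterion follows. It assumes that two values are "the same" when they are exactly equal (the paper uses a similarity function for this decision) and that null values are skipped when counting; both simplifications and the function name are assumptions made for illustration.

```python
def repetition(values):
    """Rep = 1 - tau / eta, with tau distinct values and eta total values.

    Assumption: None marks a null value and is ignored in both counts.
    """
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    tau = len(set(present))  # number of distinct values
    eta = len(present)       # total number of values
    return 1 - tau / eta

# Year column of Table 1 (instances 1 to 5).
years = ["2002", "2002", "2004", None, "2004"]
rep_year = repetition(years)
print(rep_year)
```

For the Year column, two distinct values occur among four non-null values, so the repetition is 0.5 under these assumptions; an attribute whose values are all distinct gets a repetition of 0.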

We use a similarity function to identify whether attributes have the same value. For this, there are a number of functions proposed in the literature (Christen, 2012). In this work, the Levenshtein (Edit Distance) similarity function was adopted. This function is frequently used to compare relatively small strings, which do not necessarily have the same size.
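For reference, the Levenshtein distance can be computed with the classic dynamic programming recurrence. The sketch below also derives a similarity in [0, 1] by normalizing with the length of the longer string; this normalization is a common convention assumed here, not a detail given in the paper, and the function names are illustrative.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_similarity(s, t):
    """Similarity in [0, 1]; a null (None) value yields 0 (assumed normalization)."""
    if s is None or t is None:
        return 0.0
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest

print(levenshtein("(1989)", "1989"))
print(levenshtein_similarity("M. Lanzerini", "Maurizio Lanzerini"))
```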

Given the set of instances $E_I$, the density criterion is calculated for each attribute $a_{ij} \in A_{int}$. The density of an attribute is given by the percentage of non-null values in the set of values that describes it (Naumann and Freytag, 2000). During the comparison of instances, missing values can make two corresponding instances be considered distinct, since the comparison with a missing value results in a similarity equal to 0 (zero). In this sense, empty values may contribute to the occurrence of false negatives, in other words, similar instances classified as distinct. In this paper, the density value is calculated according to Equation 2, where $\alpha$ is the total number of non-null values of $a_{ij}$ and $\beta$ is the total number of values of $a_{ij}$.

$$Den(a_{ij}) = \frac{\alpha}{\beta} \tag{2}$$
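The density criterion translates directly into code. In this sketch, `None` is assumed to represent a null value and the function name is illustrative.

```python
def density(values):
    """Den = alpha / beta: share of non-null values among all values."""
    if not values:
        return 0.0
    alpha = sum(1 for v in values if v is not None)  # non-null values
    beta = len(values)                               # all values
    return alpha / beta

# Pages column of Table 1 (instances 1 to 5).
pages = [None, "233-246", "13-18", None, None]
den_pages = density(pages)
print(den_pages)
```

For the Pages column of Table 1, only two of the five values are non-null, giving a density of 0.4, which signals that this attribute is likely to produce false negatives.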

4.3.2 Individual Relevance Calculation

The Individual Relevance $R_{ind}$ of a given attribute $a_{ij}$ is calculated based on the repetition and density criteria, according to Equation 3.

$$R_{ind}(a_{ij}) = Den(a_{ij}) * p_1 + (1 - Rep(a_{ij})) * p_2 \tag{3}$$

where $Den(a_{ij})$ is the value of the density criterion for the attribute $a_{ij}$, $Rep(a_{ij})$ is the value of the repetition criterion for the attribute $a_{ij}$, $p_1$ is the weight of the density criterion, and $p_2$ is the weight of the repetition criterion. The values of $p_1$ and $p_2$ are defined according to the degree of importance of each criterion for the relevance of a given attribute, and the sum of the weights must be equal to 1 (one).
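The weighted combination of density and repetition can be sketched as follows. The equal weights used in the example are hypothetical, since the paper does not state which weights were adopted; the function and parameter names are also illustrative.

```python
import math

def individual_relevance(den, rep, p_den, p_rep):
    """Weighted sum of density and (1 - repetition); weights must sum to 1."""
    if not math.isclose(p_den + p_rep, 1.0):
        raise ValueError("the weights must sum to 1")
    return den * p_den + (1 - rep) * p_rep

# Hypothetical attribute with density 0.4 and repetition 0.5, equal weights.
r_ind = individual_relevance(den=0.4, rep=0.5, p_den=0.5, p_rep=0.5)
print(r_ind)
```

A dense attribute with few repeated values approaches a relevance of 1, while a sparse or highly repetitive attribute is pushed toward 0; the weights let one criterion dominate when it matters more for a given dataset.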

4.4 Analysis of Global Relevance

The Global Relevance of an attribute $a_{ij} \in A_{int}$ is denoted by $R_{glob}(a_{ij})$. We evaluate the global relevance of an attribute $a_{ij}$ in order to include data source quality information in the attribute selection process. This is necessary because the data sources from $F$ that provide instances to $E_I$ may have low quality. Therefore, to calculate $R_{glob}(a_{ij})$, in addition to the value of $R_{ind}(a_{ij})$, we also consider quality information about the data sources.

Specifically, in this work, the quality of a source is based on two criteria: Reliability and Coverage. The Reliability of a source $f_k$ concerns the degree to which the data provided by $f_k$ is true and reliable (Wang and Strong, 1996). We assume that the sources have quality metadata associated with them (Mihaila et al., 2000), from which the reliability value, denoted by $Rel(f_k)$ such that $0 \leq Rel(f_k) \leq 1$, can be obtained.

The Coverage of a data source $f_k$ is defined by the percentage of instances that $f_k$ provides to the set of instances $E_I$. In this work, we use the evaluation metric proposed by Naumann and Freytag (2000) (Equation 4). The coverage of a data source $f_k$, denoted by $Cov(f_k)$, is calculated by dividing the total number of instances that $f_k$ provides to $E_I$, denoted by $\pi$, by the total number of instances contained in $E_I$, denoted by $|E_I|$.

$$Cov(f_k) = \frac{\pi}{|E_I|} \tag{4}$$

After calculating the values of reliability and coverage for each data source $f_k \in F$, the quality of $F$, denoted by $Q(F)$, is given by Equation 5.

$$Q(F) = \sum_{k=1}^{|F|} Rel(f_k) * Cov(f_k) \tag{5}$$

Finally, $R_{glob}(a_{ij})$ can be calculated according to Equation 6, where $R_{ind}(a_{ij})$ is the Individual Relevance value of the attribute $a_{ij}$.

$$R_{glob}(a_{ij}) = R_{ind}(a_{ij}) * Q(F) \tag{6}$$
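The coverage, source quality and global relevance computations can be chained as in the following sketch. The reliabilities and instance counts are hypothetical values chosen for illustration, and the function names are not taken from the paper.

```python
def coverage(provided, total):
    """Cov: share of the instances in the input set provided by one source."""
    return provided / total

def source_set_quality(sources, total):
    """Q(F): sum over the sources of reliability times coverage.

    `sources` is a list of (reliability, instances_provided) pairs.
    """
    return sum(rel * coverage(n, total) for rel, n in sources)

def global_relevance(r_ind, q):
    """R_glob: individual relevance weighted by the quality of the source set."""
    return r_ind * q

# Hypothetical sources: (reliability, number of instances provided).
sources = [(0.8, 30), (0.6, 10)]
q = source_set_quality(sources, total=40)
r_glob = global_relevance(0.5, q)
print(q, r_glob)
```

Because the coverages sum to 1 when every instance comes from exactly one source, $Q(F)$ behaves as a coverage-weighted average of the reliabilities, so the global relevance never exceeds the individual relevance.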

5 EXPERIMENTAL EVALUATION

To evaluate the proposed strategy we used the Cora database. This database contains 1,879 instances from different data sources related to literary productions. The instances are described by 15 attributes: id, author, title, journal, volume, pages, year, publisher, address, note, venue, editor, type, institution and month. Before performing our experiment, we performed a pre-processing step on the database to clean the data. For example, the attribute year contains characters that are inconsistent with its format (e.g., "(1989)" instead of "1989").

Table 2: Scenarios with the duplicate data percentages.

| Scenario | Percentage of duplicate data |
|---|---|
| Scenario 1 | 5% - 10% |
| Scenario 2 | 15% - 30% |
| Scenario 3 | 35% - 50% |
| Scenario 4 | 55% - 70% |
| Scenario 5 | >75% |

Cora: https://people.cs.umass.edu/~mccallum/code-data.html

Furthermore, the Cora database has a high percentage of duplicate data (approximately 90%). Intuitively, we believed that, for a dataset with a high percentage of duplicate instances, any attribute selected for the comparison of instances could provide a good result for the Entity Resolution process. For this reason, we expected that the large amount of duplication would lower the effectiveness of our strategy. We therefore conducted our experiments in five different scenarios. Each scenario has a different percentage of duplicate data extracted from the Cora database, as shown in Table 2. The goal of this experiment was to investigate how the proposed strategy would behave in each scenario and to confirm the following hypotheses:

H1 - Using all attributes in the comparison step of the Entity Resolution process results in a low F-measure, with a high number of incorrect correspondences (false positives and false negatives) and a low number of correct correspondences (true positives and true negatives).

H2 - Considering just the most relevant attributes, identified by the proposed strategy, in the comparison step of the Entity Resolution process results in a high F-measure, i.e., a greater number of correct correspondences with a smaller number of incorrect correspondences.

H3 - As less relevant attributes are added to the group of attributes considered in the comparison step of the Entity Resolution process, the number of incorrect correspondences increases and the F-measure of the result decreases.

To evaluate our strategy, we considered the results obtained from an existing Entity Resolution tool, executed in the different scenarios using the attributes chosen by the proposed Attribute Selection strategy. We used the DuDe Toolkit as the Entity Resolution tool (Draisbach and Naumann, 2010), with the Naive Duplicate Detection algorithm. For the similarity calculation, we adopted the Levenshtein function. The Cora database has a Gold Standard containing the duplicate instance pairs, which allows us to evaluate how good the result of the Entity Resolution process is. Specifically, we consider the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) obtained as the result of the Entity Resolution process. Based on these values, different quality measures can be considered (Christen, 2012). For the evaluation of the Entity Resolution, we adopted the F-measure, which is recommended as the best measure to evaluate the quality of the Entity Resolution process (Christen, 2012). This measure combines precision (Equation 7) and recall (Equation 8), and is calculated as shown in Equation 9.

$$precision = \frac{TP}{TP + FP} \tag{7}$$

$$recall = \frac{TP}{TP + FN} \tag{8}$$

$$F\text{-}measure = 2 \left( \frac{precision * recall}{precision + recall} \right) \tag{9}$$

In these equations, TP is the number of corresponding instances correctly identified, FP is the number of non-corresponding instances identified as corresponding, and FN is the number of corresponding instances identified as non-corresponding.
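The three measures can be computed from the TP, FP and FN counts as in the sketch below. Returning 0 when a denominator is zero is a convention assumed here, and the counts in the example are hypothetical, not results from the experiments.

```python
def precision(tp, fp):
    """Share of the pairs reported as matches that are true matches."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of the true matches that were reported."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * (p * r) / (p + r) if p + r else 0.0

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
print(precision(8, 2), recall(8, 2), f_measure(8, 2, 2))
```

Note that true negatives do not enter any of the three formulas, which is convenient in Entity Resolution, where non-matching pairs vastly outnumber matching ones.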

5.1 Analysis of Results

Figure 3 presents the classification of each attribute according to the Attribute Individual Relevance analysis, as described in Section 4.3.2. In Figure 3, column A represents the attributes, D the density values, R the repetition values, and RI the Individual Relevance values. It is important to emphasize that, in the Entity Resolution process, instead of using a single attribute (the first attribute of the ranking), a set of attributes should be used for comparing instance pairs, because a single attribute may not be sufficient for the comparison phase.

In this sense, to validate our hypotheses, we have created four groups of attributes. These groups contain, respectively, the two, three and four most relevant attributes. In order to evaluate our approach, the Entity Resolution process was performed for each group. The obtained results are presented in Figure 4.
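Building groups of the most relevant attributes amounts to ranking the attributes by relevance and taking prefixes of the ranking. The sketch below uses hypothetical relevance values, not the values reported in Figure 3, and an illustrative function name.

```python
def top_k_groups(relevance, sizes):
    """Rank attributes by relevance (descending) and return the top-k group per size."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return {k: ranked[:k] for k in sizes}

# Hypothetical relevance values for five attributes.
relevance = {"title": 0.9, "author": 0.8, "venue": 0.6, "year": 0.4, "pages": 0.3}
groups = top_k_groups(relevance, sizes=(2, 3, 4))
print(groups)
```

Each larger group extends the previous one with the next attribute in the ranking, which makes it possible to observe how the Entity Resolution result changes as less relevant attributes are added.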

To evaluate the impact of the Attribute Global Relevance, we consider Group 1, which has a dataset with 40 instances about scientific publications. Taking into account the provenance of these instances, we assume that 25 instances are from data source 1 (DS1), which has a reliability of 90%. First, for each data source, we calculate the coverage with regard to the dataset, according to Equation 4, obtaining Cov(DS1) = 0.625 and Cov(DS2) = 0.375. Then, we calculate the quality of the data source set, composed of DS1 and DS2, following Equation 5, obtaining $Q(F) = 0.76$. Thereby, for each $a_{ij} \in A_{int}$, the value of the Attribute Global Relevance is calculated from the Attribute Individual Relevance value, considering the quality $Q(F)$ of the set of data sources. In our experiment, for every attribute, the value of $R_{glob}$ was lower than $R_{ind}$, given that one of the sources contributes fewer instances and has a lower reliability value.
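The reported coverage values can be checked by hand. The worked arithmetic below assumes that DS2 provides the remaining 15 of the 40 instances, which is implied by the reported coverage of 0.375 but not stated explicitly.

```latex
\begin{align*}
Cov(DS1) &= \frac{25}{40} = 0.625 \\
Cov(DS2) &= \frac{40 - 25}{40} = \frac{15}{40} = 0.375 \\
Rel(DS1) * Cov(DS1) &= 0.9 * 0.625 = 0.5625
\end{align*}
```

The reliability of DS2 is not given in the text, so $Q(F)$ cannot be recomputed from these figures alone; with the reported $Q(F) = 0.76$, the sum in Equation 5 leaves roughly $0.76 - 0.5625 \approx 0.20$ as the contribution of DS2.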


Figure 3: Individual Relevance values of the attributes.

Analyzing the results, we can conclude that our strategy is effective in all scenarios. In scenarios with a significant percentage of duplication (Scenario 4 and Scenario 5), we can observe that the relevance values of the attributes are not as high as in the other scenarios. This is due to the fact that, in scenarios with a significant percentage of duplication, the value of the repetition criterion is always very high. Our strategy was able to correctly list the best attributes for such scenarios, which demonstrates its effectiveness. Furthermore, we confirmed that using a large quantity of attributes in the Entity Resolution process is not advisable, given the results obtained using the set of attributes of Group 4. It was also observed that the greatest F-measure of the Entity Resolution process was obtained using Group 1, with only the two

most relevant attributes. We also realize that as 1.

Thus, the results obtained through experiments vali-

dated our hypothesis.

With respect to the Global Relevance, we can verify that if a source contributes the majority of the instances of a dataset but has a low reliability, the attribute relevance should probably be contested. Thus, we concluded that the analysis of the Global Relevance can be useful, mainly to evaluate whether the results of the Attribute Selection are really reliable.

6 RELATED WORKS

Entity Resolution is a research area that has attracted the attention of researchers from different areas of Computer Science, such as Data Mining, Artificial Intelligence and Databases. Because of this, several studies have proposed to solve the Entity Resolution problem in different ways, such as using active learning (Sarawagi and Bhamidipaty, 2002), genetic programming (de Carvalho et al., 2010), and functional dependencies (Fan et al., 2009; Caruccio et al., 2016). Several tools for the Entity Resolution process have also been proposed in the literature. Kopcke and Rahm (2010), for example, present a comparative evaluation of some tools that help in the Entity Resolution process.

Considering the context of Entity Resolution, specifically in the data integration process, only a few works discuss the attribute selection problem. Among them, we highlight Chen et al. (2012) and Su et al. (2010). Chen et al. (2012) propose a method for Entity Resolution based on Machine Learning. This method searches, by means of a training set, for the best group of attributes to be used in the comparison step of the Entity Resolution process. First, for each attribute, the proper similarity measure is found by analyzing the F-measure value that each measure provides. Then, the groups of attributes are evaluated, and the groups with the highest F-measure values are chosen as the best ones for the comparison step.

Figure 4: Results of the Entity Resolution process considering the attribute groups.

Su et al. (2010) propose duplicate detection for the results of queries over Web data sources. In this scenario, the work focuses on a Machine Learning algorithm that adjusts the weights of the attributes for the similarity evaluation. The algorithm learns how to adjust the attribute weights using a data sample that contains instances without correspondence from different data sources. The authors argue that using this sample facilitates the Entity Resolution process, since in queries over Web databases the percentage of duplicate instances is far lower than the percentage of non-duplicate instances. To balance the attribute weights according to their importance, the authors use a vector of non-duplicate data and a vector of duplicate data. For the duplicate vector, higher weights are assigned to attributes whose values have high similarity and lower weights to attributes whose values have low similarity; the opposite is done for the non-duplicate vector.
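The weighting intuition described above can be sketched as follows. This is only an illustration of the stated heuristic, not the algorithm of Su et al. (2010): the function names, the proportional weighting rule, and the normalization to a sum of 1 are assumptions introduced for this example.

```python
def normalise(raw):
    """Scale non-negative raw scores so that they sum to 1 (uniform if all are zero)."""
    total = sum(raw.values())
    if total == 0:
        return {a: 1.0 / len(raw) for a in raw}
    return {a: v / total for a, v in raw.items()}

def weights_from_duplicates(sims):
    """Duplicate vector: attributes with similar values receive higher weights."""
    return normalise(sims)

def weights_from_non_duplicates(sims):
    """Non-duplicate vector: attributes with dissimilar values receive higher weights."""
    return normalise({a: 1.0 - s for a, s in sims.items()})

def weighted_similarity(sims, weights):
    """Overall similarity of a record pair as a weighted sum of attribute similarities."""
    return sum(weights[a] * sims[a] for a in sims)

# Assumed per-attribute similarities (in [0, 1]) of one non-duplicate pair:
# the titles differ strongly, while the years are almost identical.
non_duplicate_pair = {"title": 0.1, "year": 0.9}
weights = weights_from_non_duplicates(non_duplicate_pair)
```

In this toy case the title receives most of the weight, because it is the attribute that separates the two non-duplicate records, whereas the year agrees and therefore says little about whether the records match.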

Unlike Chen et al. (2012) and Su et al. (2010), this work proposes an Attribute Selection strategy that evaluates the relevance of attributes using criteria related to the data and metadata related to the data sources. Another difference between our work and that of Chen et al. (2012) is that we do not need a training set. Defining a training set can be a difficult task, especially in scenarios containing large volumes of data. Recently, some studies have been proposed to facilitate this task (Bianco et al., 2015).

7 CONCLUSIONS

In this work, we propose a strategy for the selection of relevant attributes for the Entity Resolution process. This strategy consists of two steps: (i) Individual Relevance Analysis and (ii) Global Relevance Analysis. In the former, we analyze data features, such as repetition and density, to measure the individual relevance of an attribute. In the latter, we refine the results of the first step, weighting the relevance of each attribute based on quality criteria of the data sources considered in the Entity Resolution.
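As a rough illustration of the two data features named above, the sketch below computes density and repetition for a single attribute column. The exact formulas are assumptions made for this example: density is taken as the fraction of non-null values, and repetition as the share of non-null values that repeat an earlier value. How these features are combined into a relevance score is not reproduced here.

```python
def is_null(value):
    """Treat None and the empty string as null values (an assumption of this sketch)."""
    return value is None or value == ""

def density(values):
    """Fraction of values in the column that are not null."""
    if not values:
        return 0.0
    return sum(1 for v in values if not is_null(v)) / len(values)

def repetition(values):
    """Share of non-null values that duplicate another value in the column."""
    present = [v for v in values if not is_null(v)]
    if not present:
        return 0.0
    return 1.0 - len(set(present)) / len(present)

# Hypothetical attribute columns taken from a bibliographic source.
year_column = ["2010", "2010", "2012", None]
title_column = ["Paper A", "Paper B", "Paper C", "Paper D"]

features = {
    "year": (density(year_column), repetition(year_column)),
    "title": (density(title_column), repetition(title_column)),
}
```

In this toy data the title column is fully populated and has no repeated values, while the year column contains a null and a repeated value, which suggests that the title would be the more discriminative attribute for instance matching.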

To evaluate the proposed strategy, we performed several experiments using the CORA dataset. These experiments showed that the groups of attributes selected by our strategy provide the best result for the Entity Resolution process, which validates our hypothesis. In addition, we ran experiments with the Febrl dataset and obtained similar results.

As future work, we intend to include other criteria in the attribute selection process, such as the susceptibility of an attribute to contain errors (e.g., surname) and the attribute dynamism, i.e., whether the attribute contains values that may change over time (e.g., age). We believe that such characteristics can also be helpful for the selection of relevant attributes in the Entity Resolution process.

REFERENCES

Bianco, G. D., de Matos Galante, R., Gonçalves, M. A., Canuto, S. D., and Heuser, C. A. (2015). A practical and effective sampling selection strategy for large scale deduplication. IEEE Trans. Knowl. Data Eng., 27(9):2305–2319.

Canalle, G. K. (2016). Uma estratégia para seleção de atributos relevantes no processo de resolução de entidades.

Caruccio, L., Deufemia, V., and Polese, G. (2016). Relaxed functional dependencies - a survey of approaches. IEEE Trans. Knowl. Data Eng., 28(1):147–165.

Chen, J., Jin, C., Zhang, R., and Zhou, A. (2012). A learning method for entity matching. In Proceedings of 10th International Workshop on Quality in Databases, East China Normal University, China.

Christen, P. (2012). Data Matching. Springer, Heidelberg.

Dash, M., Choi, K., Scheuermann, P., and Liu, H. (2002). Feature selection for clustering - a filter solution. In ICDM, pages 115–122. IEEE Computer Society.

de Carvalho, M. G., Laender, A. H. F., Gonçalves, M. A., and da Silva, A. S. (2010). A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints).

Dong, X. L. and Srivastava, D. (2015). Big Data Integration. Synthesis Lectures on Data Management. Morgan & Claypool Publishers.

Draisbach, U. and Naumann, F. (2010). Dude: The duplicate detection toolkit. In Proceedings of the International Workshop on Quality in Databases (QDB).

Fan, W., Jia, X., Li, J., and Ma, S. (2009). Reasoning about record matching rules. PVLDB, 2(1):407–418.

Gruenheid, A., Dong, X. L., and Srivastava, D. (2014). Incremental record linkage. PVLDB, 7(9):697–708.

Gu, L., Baxter, R., Vickers, D., and Rainsford, C. (2003). Record linkage: Current practice and future directions. Technical report, CSIRO Mathematical and Information Sciences.

Jouve, P.-E. and Nicoloyannis, N. (2005). A filter feature selection method for clustering. In Hacid, M.-S., Murray, N. V., Ras, Z. W., and Tsumoto, S., editors, ISMIS, volume 3488 of Lecture Notes in Computer Science, pages 583–593. Springer.

Köpcke, H. and Rahm, E. (2010). Frameworks for entity matching: A comparison. Data Knowl. Eng., 69(2):197–210.

Li, Y., Lu, B.-L., and Wu, Z.-F. (2006). A hybrid method of unsupervised feature selection based on ranking. In ICPR (2), pages 687–690. IEEE Computer Society.

Mihaila, G. A., Raschid, L., and Vidal, M.-E. (2000). Using quality of data metadata for source selection and ranking. In WebDB (Informal Proceedings), pages 93–98.

Oliveira, M. I. d. S., Lóscio, B., and Gama, K. (2015). Análise de desempenho de catálogo de produtores de dados para internet das coisas baseado em sensorml e nosql. XIV Workshop em Desempenho de Sistemas Computacionais e de Comunicação.

Sarawagi, S. and Bhamidipaty, A. (2002). Interactive deduplication using active learning. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA. ACM.

Su, W., Wang, J., and Lochovsky, F. H. (2010). Record Matching over Query Results from Multiple Web Databases. IEEE Transactions on Knowledge and Data Engineering, 22(4):578–589.

Wang, R. Y. and Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. J. Manage. Inf. Syst., 12(4):5–33.

ICEIS 2017 - 19th International Conference on Enterprise Information Systems