use the concept of a distinct super entity. A distinct
super entity is a super entity possessing at least one
property that does not exist in other super entities of
an instance. To extract the list of distinct super
entities, a pruner algorithm is proposed that checks
whether all elements of a super entity (the set of
‹property, value› pairs specifying that super entity)
exist in at least one other super entity; if so, that
super entity is not distinct and is removed.
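The inclusion test can be illustrated with a minimal sketch, assuming that each super entity is modelled as a plain set of (property, value) pairs. The function names `is_included` and `distinct` are hypothetical and chosen only for this illustration; the data are e1, e3 and e4 from Figure 5. This version compares every pair of super entities by brute force and does not yet use the schema graph.

```python
# Each super entity is modelled as a set of (property, value) pairs.
e1 = {("dName", "D1"), ("building", "B1")}
e3 = {("dName", "D3"), ("building", "B2")}
e4 = {("name", "S1"), ("program", "P1"), ("dep", "D1"), ("dName", "D1"),
      ("building", "B1"), ("supervisor", "prof1"), ("pName", "Prof1"),
      ("degree", "deg1"), ("profDep", "D1")}


def is_included(candidate, other):
    """True if every (property, value) pair of `candidate` occurs in `other`."""
    return candidate <= other


def distinct(entities):
    """Brute-force pruning: keep the entities not included in any other one."""
    return {
        label: pairs
        for label, pairs in entities.items()
        if not any(is_included(pairs, other_pairs)
                   for other, other_pairs in entities.items() if other != label)
    }


kept = sorted(distinct({"e1": e1, "e3": e3, "e4": e4}))
print(kept)  # ['e3', 'e4']
```

Here e1 is dropped because both of its pairs occur in e4, whereas e3 carries pairs that appear nowhere else and is therefore kept.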
e1: {(dName, D1), (building, B1)}, src = {Dep}
e2: {(dName, D2), (building, B1)}, src = {Dep}
e3: {(dName, D3), (building, B2)}, src = {Dep}
e4: {(name, S1), (program, P1), (dep, D1), (dName, D1),
     (building, B1), (supervisor, prof1), (pName, Prof1),
     (degree, deg1), (profDep, D1)}, src = {Student}
e5: {(name, S2), (program, P2), (dep, D2), (dName, D2),
     (building, B1), (supervisor, prof2), (pName, Prof2),
     (degree, deg1), (profDep, D1)}, src = {Student}
e6: {(name, S3), (program, P3), (dep, D2), (dName, D2),
     (building, B1), (supervisor, prof3), (pName, Prof3),
     (degree, deg2), (profDep, D2)}, src = {Student}
e7: {(sName, S1), (name, S1), (program, P1), (dep, D1),
     (dName, D1), (building, B1), (supervisor, prof1),
     (pName, Prof1), (degree, deg1), (profDep, D1),
     (course, C1), (regDate, dt1)}, src = {Registration}
e8: {(sName, S2), (name, S2), (program, P2), (dep, D2),
     (dName, D2), (building, B1), (supervisor, prof2),
     (pName, Prof2), (degree, deg1), (profDep, D1),
     (course, C2), (regDate, dt2)}, src = {Registration}
e9: {(sName, S2), (name, S3), (program, P3), (dep, D2),
     (dName, D2), (building, B1), (supervisor, prof3),
     (pName, Prof3), (degree, deg2), (profDep, D2),
     (course, C1), (regDate, dt3)}, src = {Registration}
e10: {(sName, S1), (name, S1), (program, P1), (dep, D1),
     (dName, D1), (building, B1), (supervisor, prof1),
     (pName, Prof1), (degree, deg1), (profDep, D1),
     (course, C2), (regDate, dt4)}, src = {Registration}
e11: {(pName, Prof1), (degree, deg1), (profDep, D1),
     (dName, D1), (building, B1)}, src = {Prof}
e12: {(pName, Prof2), (degree, deg1), (profDep, D1),
     (dName, D1), (building, B1)}, src = {Prof}
e13: {(pName, Prof3), (degree, deg2), (profDep, D2),
     (dName, D2), (building, B1)}, src = {Prof}
Figure 5: Super entities generated for RATs in Figure 4.
To avoid a brute-force search, the pruner algorithm
checks a given super entity only against super entities
extracted from neighbours of its source relation, that
is, from relations that reference the source relation
in the schema graph. For example, for super entities
extracted from Dep, only super entities extracted from
Student and Prof are checked (these are the only
relations referencing Dep). Similarly, only super
entities extracted from Registration are checked for
each super entity extracted from Student. The order in
which super entities are checked for inclusion can be
problematic, as different checking orders may produce
different outputs. To address this problem, once an
inclusion is found, the included super entity is marked
as “deleted” instead of being physically deleted.
Actual deletion is performed only after all inclusion
tests have been completed.
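The effect of the checking order can be seen in a small sketch. It assumes abbreviated versions of e1, e4 and e7 from Figure 5 and a hypothetical `REFERENCED_BY` map that records which relations reference Dep and Student, as stated in the text; the function names are likewise assumptions made for illustration. Physical deletion on the spot gives an order-dependent result, whereas marking first and deleting afterwards does not.

```python
# Abbreviated (property, value) sets and source relations, after Figure 5.
ENTITIES = {
    "e1": ({("dName", "D1"), ("building", "B1")}, "Dep"),
    "e4": ({("name", "S1"), ("dName", "D1"), ("building", "B1")}, "Student"),
    "e7": ({("sName", "S1"), ("name", "S1"), ("dName", "D1"),
            ("building", "B1"), ("course", "C1")}, "Registration"),
}
# Relations referencing each relation (assumed from the text).
REFERENCED_BY = {"Dep": {"Student", "Prof"}, "Student": {"Registration"}}


def is_covered(label, pool):
    """True if `label` is included in some entity of `pool` whose source
    references the source of `label`."""
    pairs, src = ENTITIES[label]
    neighbours = REFERENCED_BY.get(src, set())
    return any(
        other != label
        and ENTITIES[other][1] in neighbours
        and pairs <= ENTITIES[other][0]
        for other in pool
    )


def prune_eager(order):
    """Delete physically as soon as an inclusion is found."""
    alive = list(order)
    for label in order:
        if is_covered(label, alive):
            alive.remove(label)
    return sorted(alive)


def prune_deferred(order):
    """Mark during the tests, delete only after all tests are done."""
    marked = {label for label in order if is_covered(label, order)}
    return sorted(label for label in order if label not in marked)


eager_a = prune_eager(["e1", "e4", "e7"])        # e4 still present when e1 is tested
eager_b = prune_eager(["e4", "e1", "e7"])        # e4 already gone when e1 is tested
deferred_a = prune_deferred(["e1", "e4", "e7"])
deferred_b = prune_deferred(["e4", "e1", "e7"])
print(eager_a, eager_b, deferred_a, deferred_b)
```

With eager deletion, e1 survives in the second order because e4, the only super entity that includes it among the neighbours of Dep, has already been removed. With deferred deletion, both orders leave only e7.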
In our example, the Super Entity Pruner algorithm
removes super entity e1 as it is completely included
in e4. e2 is removed because of its inclusion in e5
(and e6). Accordingly, {e1, e2, e3} are checked for
inclusion in {e4, e5, e6, e11, e12, e13}. In
the same way, {e4, e5, e6} are checked for inclusion
in {e7, e8, e9, e10}. Nothing is checked for e7, e8,
e9, e10, as their source (i.e., Registration) is not
referenced by any relation in the schema graph.
─────────────────────────────────────
Algorithm 1: Super Entity Pruner.
─────────────────────────────────────
Input: a list of super entities suprEnt;
       a schema graph G = (V, E) of the source schema
Output: a pruned list of super entities
foreach super entity e1 in suprEnt
    src1 = the source of e1
    refNeighbors = the set of nodes in G referencing src1
    // skip e1 if no node vi in V references src1
    if (refNeighbors is empty)
        continue
    foreach super entity e2 in suprEnt
        src2 = the source of e2
        if (refNeighbors contains src2)
            if (e1 is included in e2)
                mark e1 as “deleted”
foreach super entity e1 in suprEnt
    if (e1 is marked as “deleted”)
        remove e1 from suprEnt
─────────────────────────────────────
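Algorithm 1 could be realised along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the `SuperEntity` class, the `prune` function, the `se` helper and the `referenced_by` map (standing in for the schema graph) are hypothetical names, and the guard that skips comparing a super entity with itself is an addition for self-referencing relations. The data are e1 to e5, e7 and e8 from Figure 5.

```python
from dataclasses import dataclass


@dataclass
class SuperEntity:
    label: str
    pairs: frozenset      # the (property, value) pairs of the super entity
    src: str              # source relation it was extracted from
    deleted: bool = False


def prune(super_entities, referenced_by):
    """Mark every super entity included in a super entity extracted from a
    referencing relation, then drop the marked ones in a second pass."""
    for e1 in super_entities:
        ref_neighbors = referenced_by.get(e1.src, set())
        if not ref_neighbors:              # nothing references the source of e1
            continue
        for e2 in super_entities:
            if e2 is e1 or e2.src not in ref_neighbors:
                continue                   # added guard: never compare e1 with itself
            if e1.pairs <= e2.pairs:       # e1 is included in e2
                e1.deleted = True          # mark only; e1 stays available as e2
    return [e for e in super_entities if not e.deleted]


def se(label, src, *pairs):
    """Hypothetical helper to build a super entity compactly."""
    return SuperEntity(label, frozenset(pairs), src)


student1 = [("name", "S1"), ("program", "P1"), ("dep", "D1"), ("dName", "D1"),
            ("building", "B1"), ("supervisor", "prof1"), ("pName", "Prof1"),
            ("degree", "deg1"), ("profDep", "D1")]
student2 = [("name", "S2"), ("program", "P2"), ("dep", "D2"), ("dName", "D2"),
            ("building", "B1"), ("supervisor", "prof2"), ("pName", "Prof2"),
            ("degree", "deg1"), ("profDep", "D1")]

entities = [
    se("e1", "Dep", ("dName", "D1"), ("building", "B1")),
    se("e2", "Dep", ("dName", "D2"), ("building", "B1")),
    se("e3", "Dep", ("dName", "D3"), ("building", "B2")),
    se("e4", "Student", *student1),
    se("e5", "Student", *student2),
    se("e7", "Registration", ("sName", "S1"), *student1,
       ("course", "C1"), ("regDate", "dt1")),
    se("e8", "Registration", ("sName", "S2"), *student2,
       ("course", "C2"), ("regDate", "dt2")),
]
# Relations referencing each relation, as stated in the text.
referenced_by = {"Dep": {"Student", "Prof"}, "Student": {"Registration"}}

kept = [e.label for e in prune(entities, referenced_by)]
print(kept)  # ['e3', 'e7', 'e8']
```

On this subset, e1 and e2 are marked because they are included in e4 and e5, and e4 and e5 are in turn marked because they are included in e7 and e8. Since marked items remain in the list until the second pass, e1 is still matched against e4 even though e4 is itself removed at the end.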
Step 3 (Host Relation Selection). Selecting
target host relations requires considering several
issues. First, the same concept may be represented
differently in the source and target, so two
different properties can represent the same concept.
To connect source and target, we use property
correspondences of the form ‹p1, p2›, each representing
a correspondence between property p1 in the source and
property p2 in the target. Each correspondence shows
that an attribute of the target is semantically related
to an attribute in the source. In our approach, value
correspondences are directly used to select the best
hosts regarding source