found in the vicinity of a data source concept name occurring within a phrase in a document. These neighboring words are useful for evaluating similarity between concepts. The extracted words are integrated and related to their corresponding concept in the data source. In INDIGO, this enhancement of a data source with information gathered from its context is called data source enrichment.
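The enrichment idea can be sketched as follows: a small helper scans context documents for occurrences of a concept name and collects the words found within a fixed-size window around each occurrence. The function name, the window size and the token handling are illustrative assumptions, not INDIGO's actual implementation.

```python
import re
from collections import Counter

def enrich_concept(concept_name, documents, window=3):
    """Collect words appearing within `window` tokens of `concept_name`
    in the given documents, as a rough model of data source enrichment."""
    neighbors = Counter()
    target = concept_name.lower()
    for doc in documents:
        tokens = re.findall(r"\w+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for t in tokens[lo:hi]:
                    if t != target:
                        neighbors[t] += 1
    return neighbors

docs = ["Each order references a customer and lists the ordered items.",
        "An order has a shipping address and a total price."]
print(enrich_concept("order", docs))
```

The resulting bag of neighboring words is what gets attached to the concept for later similarity evaluation.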
To show the benefits of data source enrichment, INDIGO was compared to two well-known matching approaches, namely Similarity Flooding (Melnik et al., 2002) and V-Doc (Y. Qu and Cheng, 2006). The experiment consisted of applying each matching system to the following two case studies: (1) matching two database schemas taken from two open-source e-commerce applications, Java Pet Store (Microsystems, 2005) and eStore (McUmber, 2003), and (2) matching two real data sources describing courses taught at Cornell University and at the University of Washington.
The rest of the paper is organized as follows. Section 2 surveys recent work on linguistic matching. Section 3 describes the current implementation of INDIGO's Context Analyzer and Mapper modules. Experimental results for our two case studies are presented and discussed in Section 4. Concluding remarks and comments on future work are given in Section 5.
2 RELATED WORK
As pointed out in Section 1, there are essentially two categories of techniques for linguistic matching: string comparison metrics and lexical-based matching methods. Among string metrics, the most popular are certainly the following (Euzenat et al., 2004): Levenshtein, Needleman-Wunsch, Smith-Waterman, Jaro-Winkler and Q-Gram. Lexical-based metrics, on the other hand, were mostly developed within the framework of specific mapping systems such as ASCO (B.T. Le and Gandon, 2004) and HCONE-merge (K. Kotis and Stergiou, 2004) to serve their own application objectives. Both of these systems resort to WordNet to collect additional semantic information (sets of words) about the concepts to be matched. For instance, ASCO searches for synonyms while HCONE-merge looks for hyponyms.
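For concreteness, the first metric in the string-metric family above, the Levenshtein distance, can be computed with the classic dynamic-programming recurrence. This is a textbook sketch, not code taken from any of the cited systems.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("address", "adress"))  # 1: a single deletion separates them
```

Matchers typically normalize such a distance, e.g. 1 - d / max(len(a), len(b)), to obtain a similarity score in [0, 1].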
Among the numerous linguistic matching solutions mentioned in the literature, we took a closer look at two systems frequently cited by researchers for their performance and available on the net: Similarity Flooding (SF) (Melnik et al., 2002) and V-Doc (Y. Qu and Cheng, 2006). We selected these two applications to experimentally compare them with our own system, INDIGO.
Similarity Flooding (SF) is a generic algorithm used to match different kinds of data structures called models. Models can be composed of data schemas, data instances or a combination of both. The SF algorithm converts both the source and target models into a proprietary labeled directed graph representation (G1 and G2). It then applies an iterative fixpoint procedure over these two graphs to discover matches between their respective nodes.
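The fixpoint idea behind SF can be illustrated with a heavily simplified sketch: the similarity of a node pair is repeatedly boosted by the similarities of neighboring pairs connected through identically labeled edges, then normalized, until the values stabilize. The graph encoding, initial similarities and damping coefficients of the real algorithm are omitted here; this is an assumption-laden toy, not Melnik et al.'s implementation.

```python
def similarity_flooding(edges1, edges2, init, iterations=10):
    """Toy fixpoint propagation: the similarity of a pair (b, y) is boosted
    by the similarity of a pair (a, x) whenever a->b in graph 1 and x->y in
    graph 2 carry the same edge label (and symmetrically)."""
    sim = dict(init)
    for _ in range(iterations):
        nxt = dict(sim)
        for (a, label1, b) in edges1:
            for (x, label2, y) in edges2:
                if label1 == label2:
                    # propagate along matching edges in both directions
                    nxt[(b, y)] = nxt.get((b, y), 0.0) + sim.get((a, x), 0.0)
                    nxt[(a, x)] = nxt.get((a, x), 0.0) + sim.get((b, y), 0.0)
        top = max(nxt.values())
        sim = {pair: v / top for pair, v in nxt.items()}  # normalize to [0, 1]
    return sim

# Two one-edge "models": an initial match between Order and Purchase
# flows to their neighbors Customer and Client.
edges1 = [("Order", "has", "Customer")]
edges2 = [("Purchase", "has", "Client")]
sim = similarity_flooding(edges1, edges2, {("Order", "Purchase"): 1.0})
```

Even this toy shows the key property exploited by SF: structural neighbors of matched nodes become matched themselves.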
V-Doc is the linguistic matching module of Falcon-AO (Hu et al., 2006), a matching system for ontologies. For each entity in the ontologies to be aligned, V-Doc constructs a virtual document consisting of a set of words extracted from the name, label and comment fields of the entity and of its neighbors within the ontology. It then compares virtual documents using the well-known TF/IDF (term frequency / inverse document frequency) technique.
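The comparison of virtual documents can be illustrated with a standard TF/IDF cosine similarity over bags of words. This is a generic sketch of the technique (with a smoothed idf so that terms shared by all documents do not vanish), not V-Doc's actual code.

```python
import math
from collections import Counter

def tfidf_cosine(docs):
    """Return pairwise cosine similarities between bags of words,
    weighting each term by tf * smoothed idf over the collection."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(u, v) for v in vecs] for u in vecs]

# Three toy "virtual documents": the first two share vocabulary.
sims = tfidf_cosine([["order", "customer", "address"],
                     ["purchase", "client", "address"],
                     ["course", "teacher"]])
```

Documents with overlapping vocabularies score higher, which is what makes the virtual-document comparison sensitive to an entity's textual neighborhood.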
INDIGO pursues the same objectives as the above systems. From a conceptual standpoint, it belongs to the category of systems relying on multiple strategic matchers (A. Doan and Halevy, 2003; Do and Rahm, 2002). It distinguishes itself by taking the informational context of data sources into account in its alignment process.
3 INDIGO’S ARCHITECTURE
To handle both context analysis and semantic matching, INDIGO has an architecture composed of two main modules: a Context Analyzer and a Mapper. The Context Analyzer takes the data sources to be matched along with related context documents and enriches them before delivering them to the Mapper module for their effective matching.
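The two-stage pipeline can be sketched with two stub classes. The class names mirror the module names above, but the bodies are illustrative assumptions: enrichment is reduced to attaching context-document tokens, and matching to a Jaccard overlap standing in for INDIGO's real strategies.

```python
class ContextAnalyzer:
    """Enriches a data source with words drawn from its context documents
    (here reduced to collecting the documents' tokens)."""
    def enrich(self, source_name, context_docs):
        words = {w.lower() for doc in context_docs for w in doc.split()}
        return {"name": source_name, "context_words": words}

class Mapper:
    """Matches two enriched sources by the overlap of their context words
    (Jaccard similarity as a stand-in for the real matching strategies)."""
    def match(self, s1, s2):
        a, b = s1["context_words"], s2["context_words"]
        return len(a & b) / len(a | b) if a | b else 0.0

# Enrichment first, then effective matching.
analyzer, mapper = ContextAnalyzer(), Mapper()
e1 = analyzer.enrich("Order", ["order placed by a customer"])
e2 = analyzer.enrich("Purchase", ["purchase made by a customer"])
score = mapper.match(e1, e2)
```

The point of the split is that enrichment is computed once per source and the Mapper only ever sees enriched representations.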
3.1 Context Analyzer
The role of the Context Analyzer consists of parsing the artifacts related to a data source in order to extract pertinent semantic information and enrich the data source with it. Our implementation targets two types of enrichment: (1) enhancement of a data source with complex concepts (a kind of enhancement not discussed in this article because it is not directly involved in linguistic matching; cf. (Bououlid and Vachon, 2007) for details) and (2) enhancement
ICSOFT 2007 - International Conference on Software and Data Technologies