characterization of the actual contents of schema
attributes (Rahm and Bernstein, 2001; De Carvalho
et al., 2013).
By analysing the instance based schema
matching approaches, we observed that neural
network, machine learning, information theoretic
discrepancy and rule based techniques have been utilized (Kang
and Naughton, 2003; Chua et al., 2003; Bilke and
Naumann, 2005; Liang, 2008; Kang and Naughton,
2008; Dai et al., 2008). The goal of these approaches
is to discover correspondences between schema
attributes, whereby all instances, including those
with numeric values, are treated as strings. This
prevents discovering common patterns or
performing statistical computation between the
numeric instances (De Carvalho et al., 2013). As a
consequence, matches among numeric instances go
unidentified, which further reduces the quality of
match results. Furthermore, textual similarity is also
not the best alternative for numeric instances (e.g.,
page number, year, phone number, price, quantity)
(De Carvalho et al., 2013; Cortez et al., 2010).
Thus, for instance level approaches, specific
strategies for identifying existing instance patterns
must be deployed.
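The point about numeric instances can be illustrated with a minimal sketch.
It is not the method of this paper: the pattern names, the regular
expressions, the 0.8 threshold and the sample values are all hypothetical and
chosen only for illustration. The sketch compares a plain string similarity
between two year values with a pattern (format) check over whole columns of
instances.

```python
import re
from difflib import SequenceMatcher

# Hypothetical format patterns, for illustration only (not from the paper).
PATTERNS = {
    "year": re.compile(r"(?:19|20)\d{2}"),
    "price": re.compile(r"\d+\.\d{2}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def pattern_profile(instances):
    """Return, per pattern name, the fraction of instances that fully match it."""
    total = len(instances)
    if total == 0:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for value in instances if rx.fullmatch(value)) / total
        for name, rx in PATTERNS.items()
    }


def best_pattern(instances, threshold=0.8):
    """Return the dominant pattern name, or None if no pattern reaches the threshold."""
    profile = pattern_profile(instances)
    name, score = max(profile.items(), key=lambda item: item[1])
    return name if score >= threshold else None


# Hypothetical instances of three attributes from two sources.
pub_year = ["1998", "2003", "2011"]
released = ["2007", "1995", "2012"]
cost = ["19.99", "5.00", "120.50"]

# Character-level similarity between two year values that share no characters.
text_sim = SequenceMatcher(None, pub_year[0], released[0]).ratio()

print(text_sim)                # string similarity of "1998" and "2007"
print(best_pattern(pub_year))  # dominant format of the first attribute
print(best_pattern(released))  # dominant format of the second attribute
print(best_pattern(cost))      # dominant format of the third attribute
```

Under these assumptions, the two year attributes share the same dominant
format even though their individual values have no characters in common,
which is the kind of evidence a purely textual comparison would miss.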
In this paper, we propose a framework for
instance based schema matching that aims at finding
the correspondences between schema attributes of
two semantically and syntactically related data sources.
Since we only explore the instances, we rely on
matching strategies that are based on Google
similarity (Cilibrasi and Vitanyi, 2007) and regular
expression (Friedl, 2006; Liu et al., 2012) to find the
correspondences of schema attributes. As pointed
out by Doan and Halevy (2005) and Li and Clifton
(2000), different types of matching algorithms
have been applied in this area. However, improving
the accuracy of schema matching remains an active
research problem, and our framework is a step
towards solving it. We use Google similarity
because it draws on the World Wide Web as its
database and Google as its search engine, whereas
regular expressions are an efficient way to describe
and identify text through pattern (format)
matching. In addition, a regular expression is
relatively inexpensive and does not require training
as in learning-based techniques. It can provide a
quick and concise method to capture valuable
knowledge (Doan and Halevy, 2005).
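As a hedged illustration of the first strategy, the sketch below computes the
normalized Google distance defined by Cilibrasi and Vitanyi (2007) from page
hit counts. It assumes that this is the measure meant by Google similarity
here. The function name, the hit counts and the index size are hypothetical,
and no search engine is actually queried.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google distance from hit counts.

    fx, fy: number of pages containing each term (hypothetical inputs).
    fxy:    number of pages containing both terms.
    n:      assumed total number of indexed pages; must exceed fx and fy.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur (or never co-occur) are treated as maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Made-up counts: two terms that always co-occur, and two that rarely do.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
rarely_together = normalized_google_distance(100, 100, 10, 10**4)
print(always_together, rarely_together)
```

A smaller distance indicates terms that tend to appear on the same pages. How
such distances would be turned into attribute-level similarity scores is not
specified in this section, so the sketch stops at the pairwise measure.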
In summary, the main contribution of this paper
is a framework which: (1) uses only the instances to
find matches between schema attributes (1-1
matches) and (2) relies on matching strategies that
are based on Google similarity and regular
expression to find the matches.
The rest of this paper is organized as follows.
Section 2 discusses the related work. Section 3
presents the proposed framework of instance based
schema matching. In Section 4, the evaluation
metrics and the results are presented and discussed.
Finally, Section 5 draws the conclusions and points
out some future work directions.
2 RELATED WORK
Instance based schema matching examines instances
to determine corresponding schema attributes. It
represents an alternative choice for schema
matching (Rahm and Bernstein, 2001; Bernstein et
al., 2011). Even when substantial schema
information is available, considering instances can
complement schema based approaches with
additional insights on the semantics and contents of
schema attributes and can be beneficial in
uncovering wrong interpretation of schema
information, i.e. it would be helpful to disambiguate
between schema level matches by matching the
attributes whose instances are syntactically and
semantically more similar. Neural network, machine
learning, information theoretic discrepancy and rule
based approaches have been used for instance based
schema matching.
A neural network is able to obtain the similarities
among data directly from their instances and
empirically infer solutions from data in the absence
of prior knowledge of regularities. Neural networks
are employed to cluster similar attributes, whose
instances are uniformly characterized using a feature
vector of constraint based criteria. For instance
based schema matching, the Back Propagation
Neural Network (BPNN), which can acquire and
store a large number of mappings between input and output,
is ideal. However, a neural network can be viewed as
a domain-specific tool, since it is trained on
domain-specific training data and can only be used to
resolve problems associated with that domain (Li et al.,
2000). Furthermore, neural network approaches (Li
and Clifton, 1994; Li and Clifton, 2000; Yang et al.,
2008; Li et al., 2005) for instance based schema
DATA 2014 - 3rd International Conference on Data Management Technologies and Applications