4. Context-based approach where two
attributes are considered to be similar if
their contexts are similar (Ehrig et al.,
2004).
5. Hybrid approach that combines two or
more approaches from the first four
categories to minimize false positives (i.e.
dissimilar attributes that appear similar) and
false negatives (i.e. similar attributes that
appear dissimilar) (Eyal et al., 2005; Ehrig
et al., 2004).
These approaches can also be used for detecting
duplicated (i.e. very similar) attributes. However, the
term-based approach can incur problems in
situations where the same terms are used to name
dissimilar attributes (i.e. homonyms) or where
different terms are used to name similar attributes
(i.e. synonyms). The value-based approach can
incur problems in situations where similar attributes
have no or few common values or where dissimilar
attributes have many common values. The
structure-based approach can incur problems in
situations where similar attributes are not organized
in the taxonomy or where the taxonomy is shallow.
Domain experts could resolve these problems, but
they are often not available. In such situations,
duplicated attributes could be detected by analyzing
information on past user interaction with the
ontology. This information may take the form of a
query workload or an edit history.
Wu and Weld (2008) proposed recording a history
of changes made to the ontology and analyzing it to
detect duplicates. For example, attributes in a class
may be frequently renamed, or their values may be
copied to one and the same attribute in another
class. Such an edit history is evidence that these
attributes are duplicates. However, the edit history
must be recorded over a long period to minimize false
positives and false negatives.
3 OUR APPROACH
Since terms, values and structures are not sufficient
criteria for identifying similar attributes, we use
the context-based approach, in which two
attributes are considered to be similar if their
contexts are similar.
The main problem with this approach is how to
identify similar contexts. We address this problem
by adopting a similarity measure from market basket
analysis.
3.1 Market Basket Analysis
Market baskets are the sets of products bought
together by customers in transactions. These may be
the results of customer visits to the supermarket or
customer online purchases in a virtual store.
Typically, market baskets are represented as a binary
matrix where rows correspond to transactions and
columns to products. A row has a value of 1 for a
column if the customer has bought the product in the
transaction and 0 otherwise. The quantities
purchased and the prices of products are ignored.
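As a minimal sketch of the representation described above, the following Python snippet builds the binary matrix from a list of transactions. The helper name build_basket_matrix and the sample transactions are illustrative assumptions and do not come from the cited work.

```python
def build_basket_matrix(transactions):
    """Return (products, matrix): one row per transaction, one column per product."""
    # Column order: sorted product names, so the layout is deterministic.
    products = sorted({p for t in transactions for p in t})
    # A cell is 1 if the product occurs in the transaction, else 0;
    # quantities and prices are deliberately not represented.
    matrix = [[1 if p in t else 0 for p in products] for t in transactions]
    return products, matrix


# Hypothetical transactions (illustrative data only).
transactions = [
    {"coke", "chips", "pizza"},
    {"pepsi", "chips"},
    {"coke", "hamburger"},
]
products, matrix = build_basket_matrix(transactions)
```

Each row of matrix is one market basket; a column can be read off to see which transactions contain a given product.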
One of the most popular tasks of market basket
analysis is to reveal customer buying patterns. These
patterns can be used to identify similar products.
Consider Coke and Pepsi. These two products
appear dissimilar because they have few customers
in common. However, it was observed that the
customers of Coke and Pepsi bought many other
products in common such as hamburgers,
cheeseburgers, pizzas and chips. Based on this
observation, Das and Mannila (2000) defined the
following similarity measure for products: two
products are considered to be similar if the buying
patterns of their customers are similar.
We adapt this similarity measure to attributes:
two attributes are considered to be similar if the
querying patterns of their users are similar. For
example, suppose that many users have asked about an
actor’s birth place together with the actor’s name
and birth date, and that many other users have asked
about an actor’s origin, again together with the
actor’s name and birth date. We could then conclude
that the attributes birth place and origin in the
class actor are similar to each other.
User querying patterns (i.e. contexts) are
revealed by analyzing a workload of queries asked
by users against the ontology. In the example above,
the revealed pattern is that many users ask about the
actor’s name and birth date.
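The idea of comparing attributes by their querying contexts can be sketched as follows. This is only an illustrative stand-in, not the measure defined in this paper or by Das and Mannila (2000): it assumes that each query is represented as a set of attribute names, and it compares two attributes by the sum of absolute differences between the relative frequencies of the attributes co-queried with them. The function names, attribute names and workload are hypothetical.

```python
from collections import Counter


def context_profile(workload, attribute):
    """Relative frequency of each co-queried attribute among the
    queries that mention `attribute` (its context)."""
    queries = [q for q in workload if attribute in q]
    if not queries:
        return {}
    counts = Counter(a for q in queries for a in q if a != attribute)
    return {a: c / len(queries) for a, c in counts.items()}


def context_distance(workload, a, b):
    """Sum of absolute differences between the two context profiles,
    ignoring a and b themselves. 0 means identical contexts."""
    pa = context_profile(workload, a)
    pb = context_profile(workload, b)
    probes = (set(pa) | set(pb)) - {a, b}
    return sum(abs(pa.get(p, 0.0) - pb.get(p, 0.0)) for p in probes)


# Hypothetical query workload against a class "actor" (assumed data).
workload = [
    {"name", "birth_date", "birth_place"},
    {"name", "birth_date", "birth_place"},
    {"name", "birth_date", "origin"},
    {"name", "birth_date", "origin"},
    {"name", "spouse"},
]

d_similar = context_distance(workload, "birth_place", "origin")
d_dissimilar = context_distance(workload, "birth_place", "spouse")
```

In this toy workload, birth_place and origin never occur in the same query, yet their contexts coincide, so their distance is smaller than the distance between birth_place and spouse.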
3.2 Assumptions
We assume that users do not ask about all attributes
in the ontology at once. (This is by analogy with
market basket analysis, which assumes that a market
basket contains a small set of products from
hundreds or thousands of products available in the
supermarket or virtual store.) In the example above,
the users did not ask about the actor’s nationality
or marital status. Such attributes are called missing
attributes.
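Under the assumption that a query is represented as a set of attribute names, the missing attributes of a query can be sketched as a simple set difference. The attribute list for the class actor and the helper name missing_attributes are hypothetical.

```python
def missing_attributes(class_attributes, query):
    """Attributes of the class that the query does not ask about."""
    return set(class_attributes) - set(query)


# Hypothetical attributes of a class "actor" (assumed for illustration).
actor_attributes = {
    "name", "birth_date", "birth_place",
    "origin", "nationality", "marital_status",
}
query = {"name", "birth_date", "birth_place"}
missing = missing_attributes(actor_attributes, query)
```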
In addition, we assume that users understand the
ontology well enough to submit queries that reveal
ICEIS 2009 - International Conference on Enterprise Information Systems