cation scenarios. To the best of our knowledge, however, no current
system is able to automatically discover
and describe product relationships as proposed in this
paper.
A well-known related area is recommender systems,
which are nowadays commonly found in all large
online shops such as Amazon or Buy.com. A good
overview of the state of the art in this field can
be found in (Adomavicius and A., 2005). The authors
interpret recommender systems as a way to help
customers deal with the information overload caused
by modern information technology. Other sources
such as (Han and Karypis, 2005) show that deploying a
good recommender system may directly result in a
commercial gain. Traditionally, these systems mainly
fall into two categories: Content-Based Filtering
(CBF) and Collaborative Filtering (CF).
CBF suggests items from different categories
based upon the similarity of the currently viewed
item X to each item category (van Meteren and van
Someren M., 2000). Such systems require an item-to-category
similarity matrix, which is usually built by
analyzing the items’ textual descriptions. CF, or people-to-people
correlation (Schafer et al., 1999), recommends
items to users based on what other people with
similar interests found interesting, so it does not directly
relate products but people. Besides these two
approaches, there are many systems that try to combine
them, such as Amazon’s approach in (Linden
et al., 2003), or (Shen et al., 2007), where the user
provides an initial scenario which is then matched to
previous choices of other users. These systems have
in common that, in the eyes of a user, they appear to
relate products to each other. In contrast to our system,
however, this relation is based not on relevance semantics
(Product A is relevant to Product B) but on shopping
behaviour, and the systems do not further explain the relationships.
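To make the distinction between the two families concrete, the following is a minimal sketch, not the method of any of the cited systems. It assumes bag-of-words vectors, cosine similarity, and a nearest-neighbour rule for CF; the function names (`cbf_rank_categories`, `cf_recommend`) and the sample data in the usage note are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bag_of_words(text):
    """Very crude text model (assumption): lower-cased whitespace tokens."""
    return Counter(text.lower().split())


def cbf_rank_categories(item_description, category_descriptions):
    """CBF: rank categories by textual similarity to the viewed item."""
    item_vec = bag_of_words(item_description)
    scores = {
        category: cosine(item_vec, bag_of_words(description))
        for category, description in category_descriptions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def cf_recommend(user, purchases):
    """CF: recommend what the most similar *other user* bought.

    Note that products are never compared with each other here,
    only users are (people-to-people correlation).
    """
    mine = {item: 1 for item in purchases[user]}
    best_user, best_sim = None, 0.0
    for other, items in purchases.items():
        if other == user:
            continue
        sim = cosine(mine, {item: 1 for item in items})
        if sim > best_sim:
            best_user, best_sim = other, sim
    if best_user is None:
        return []
    return sorted(set(purchases[best_user]) - set(purchases[user]))
```

With hypothetical purchase histories such as `{"ann": {"camera", "tripod"}, "bob": {"camera", "tripod", "memory card"}, "eve": {"stilettos"}}`, `cf_recommend("ann", ...)` suggests the memory card because Bob's history overlaps with Ann's. Neither function says why two products belong together, which is the gap the text points out.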
Another interesting class of systems is Product
Comparison Agents (PCA), online applications that
retrieve, process and re-format product information to
aid a customer’s decision-making process (Wan et al.,
2007). A prime example is CNetShopper.com (CBS
interactive, 2010), where a product is related to similar
ones and popular accessories. The website also offers
a detailed comparison tool that relates products on the
level of features. A shortcoming of most PCA systems,
however, is that they do not gather data automatically
from independent sources but rely on manually
tagged data sources.
A different type of product comparison is performed
by (Liu et al., 2005) and (Kawamura et al., 2008).
Both systems extract customer opinions on products’
features and then compare products to each other,
feature by feature, based on these opinions. While the
first system extracts from rather well-known sources
such as CNet reviews, the second one is able to extract
opinions from arbitrary blogs, a characteristic that
makes it very powerful and interesting for this work.
A rather new subfield of Information Extraction
(IE) is Relationship Extraction (RE) (Bach and
Badaskar, 2007), whose task is to extract related entities
from documents and, where possible, also specify the
relationship that holds between them. The TextRunner
system (Banko and Etzioni, 2008), with its underlying
theory, is a well-known example of state-of-the-art
work in that area. It takes two input terms and
searches large document collections in order to extract
text pieces that relate the terms. Up to now,
the system does not interpret its results semantically.
In (Schutz and Buitelaar, 2005) the authors present an
interesting system that searches documents from soccer
game news tickers in order to extract relationship
triples containing two concept terms and the relation
between them. Both systems are novel and interesting
but do not yet provide enough foundation for
the task of this work, as they are either incapable of
semantic interpretation of relations or too fixed on
a special domain and therefore bound to a specific textual
representation of relationship facts.
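The notion of a relationship triple can be illustrated with a deliberately naive sketch. It is neither the TextRunner algorithm nor the approach of Schutz and Buitelaar; it only assumes that the text between two given terms in the same sentence is taken as the raw relation phrase. The `Triple` type and the `extract_triples` function are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str  # raw, uninterpreted text found between the two terms
    object: str


def extract_triples(text: str, term_a: str, term_b: str) -> List[Triple]:
    """Collect (term_a, connecting text, term_b) for each sentence
    in which term_a is followed by term_b."""
    pattern = re.compile(
        re.escape(term_a) + r"\s+(.+?)\s+" + re.escape(term_b),
        re.IGNORECASE,
    )
    triples = []
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        match = pattern.search(sentence)
        if match:
            triples.append(Triple(term_a, match.group(1), term_b))
    return triples
```

The relation phrase returned here is an uninterpreted string. Mapping such phrases to a fixed set of relationship classes is the semantic interpretation step that the text notes is missing from such extraction systems.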
3 MAIN CONCEPT
Our concept for relationship classification consists of
three main steps:
1. Setting up a hierarchically clustered product tree,
2. Discovering connections between these products,
3. Classifying the product relationships found in
step 2.
The rationale behind the first step is to limit the
number of necessary comparison operations by exploiting
the relatedness of products. It is, for example, very
unlikely to find any connection between a digital camera
and stilettos, so it makes sense to cluster
products before finding relationships between them.
This can be done either using an existing hierarchical
classification such as the Amazon catalogue classification
or, without any dependency on external classification
schemes, using k-means clustering. The latter
approach was used for the prototype implementation
shown in Section 6. We describe this method in a separate
companion paper.
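The saving obtained by clustering first can be sketched as follows. This is an illustration only: it assumes a flat, already computed cluster assignment instead of the hierarchical tree, and it does not reproduce the clustering method of the companion paper. The names `pair_count` and `candidate_pairs` and the sample clusters are hypothetical.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n products: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def candidate_pairs(clusters):
    """Yield only those product pairs whose members share a cluster."""
    for members in clusters.values():
        yield from combinations(sorted(members), 2)


# Hypothetical cluster assignment for five products.
clusters = {
    "photo": ["digital camera", "tripod", "memory card"],
    "shoes": ["stilettos", "sneakers"],
}

total_products = sum(len(members) for members in clusters.values())
all_pairs = pair_count(total_products)             # every product against every other
within_cluster = list(candidate_pairs(clusters))   # pairs that are actually compared
```

In this toy case only 4 of the 10 possible pairs remain, and the pair of digital camera and stilettos is never examined. For k clusters of roughly equal size, the number of comparisons drops by about a factor of k.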
In this paper we focus on steps 2 and 3, i.e., connection
discovery and classification. These steps do
not depend on how the hierarchical clusters were
built in step 1. Thus, we only assume that we already have a