the concept of “luxury bedding” depends on the
brands and designs available on the market that are
considered luxury, and on their attributes. Bridging
the semantic gap is therefore, in essence, the problem
of inferring the meaning of search phrases in all
their nuances.
Our Approach: In this paper we present an
algorithm that (i) structures item information and (ii)
uses a frequent itemset mining algorithm to learn the
“target phrase” definitions.
2 RELATED WORKS
In (Aholen, 1998), generalized episodes and episode
rules are used for Descriptive Phrase Extraction.
Episode rules are a modification of association
rules, and episodes are a modification of frequent
sets. An episode is a collection of feature vectors
with a partial order; the authors claim that their
approach is useful for phrase mining in Finnish, a
language with a relaxed word order. In our previous
work (Nguyen, 2003), we presented a co-occurrence
clustering algorithm that identifies phrases that
frequently co-occur with the target phrase from the
meta-tags of Web documents. In this paper, however,
we address a different problem: we attempt to mine
the phrase definitions in terms of extracted item
information, so that the mined definitions can be
used to connect “search phrases” to real items in
all their nuances.
The frequent itemset mining problem is to
discover sets of items shared among a large number
of records in a database. There are two main
search strategies for finding frequent itemsets.
Apriori (Agrawal, 1994) and several other
Apriori-like algorithms adopt a breadth-first search
model, while Eclat (Zaki, 2000) and FPGrowth (Han,
2000) are well-known algorithms that search for all
frequent itemsets of a database in a depth-first
manner. Our algorithm also searches for frequent
itemsets in a depth-first manner. However, unlike
the lattice structure used in Eclat or the
conditional frequent pattern tree used in FPGrowth,
we propose the so-called 2-frequent itemset graph
and utilize heuristic syntheses to prune the search
space and improve performance. We plan to further
optimize our algorithm and conduct detailed
comparisons with the above algorithms.
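To make the depth-first strategy concrete, the following is a minimal
sketch of an Eclat-style search over record-id sets. It is not the
2-frequent itemset graph algorithm proposed in this paper; the function
name frequent_itemsets, the min_support parameter and the sample
records are assumptions introduced only for illustration.

```python
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(records: List[Set[str]],
                      min_support: int) -> Dict[FrozenSet[str], int]:
    """Depth-first (Eclat-style) frequent itemset mining.

    Illustrative sketch only: names and parameters are hypothetical.
    Returns a mapping from each frequent itemset to its support count.
    """
    # Vertical layout: map each item to the ids of the records containing it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, record in enumerate(records):
        for item in record:
            tidsets.setdefault(item, set()).add(tid)

    # Keep only frequent single items, in a fixed order so that every
    # itemset is generated exactly once.
    items = sorted(i for i, tids in tidsets.items() if len(tids) >= min_support)
    result: Dict[FrozenSet[str], int] = {}

    def extend(prefix: List[str], prefix_tids: Set[int],
               candidates: List[str]) -> None:
        for pos, item in enumerate(candidates):
            tids = prefix_tids & tidsets[item]
            if len(tids) < min_support:
                continue  # prune: no superset of an infrequent set is frequent
            itemset = prefix + [item]
            result[frozenset(itemset)] = len(tids)
            # Go deeper before trying the next sibling (depth-first).
            extend(itemset, tids, candidates[pos + 1:])

    extend([], set(range(len(records))), items)
    return result


# Hypothetical records: each record is the set of attribute values of one item.
records = [
    {"silk", "king", "600tc"},
    {"silk", "king"},
    {"cotton", "king", "600tc"},
    {"silk", "queen", "600tc"},
]
itemsets = frequent_itemsets(records, min_support=2)
```

With these sample records and a support threshold of 2, the search
keeps the three frequent single items and their three pairs, and prunes
the triple, which occurs in only one record.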
The relevance feedback (Salton, 1990) method
can also be used to refine the original keyword
phrase by using the document vectors (Baeza-Yates,
1999) of the extracted relevant items as additional
information. In Section 6, we present experimental
results and show that the rules that our system
learns, by utilizing the extracted relevant item
information, are easier to validate and perform better
than retrieval with the relevance feedback method.
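For comparison, relevance feedback is commonly realized as a
Rocchio-style update of the query vector. The sketch below assumes
sparse term-weight dictionaries and positive feedback only; the
function name refine_query and the weights alpha and beta are
illustrative assumptions, not the configuration used in our
experiments.

```python
from typing import Dict, List


def refine_query(query: Dict[str, float],
                 relevant_docs: List[Dict[str, float]],
                 alpha: float = 1.0,
                 beta: float = 0.75) -> Dict[str, float]:
    """Rocchio-style refinement using relevant documents only.

    Illustrative sketch: alpha and beta are assumed weights.
    refined = alpha * query + beta * centroid(relevant_docs)
    """
    refined = {term: alpha * weight for term, weight in query.items()}
    if not relevant_docs:
        return refined  # nothing to learn from
    for doc in relevant_docs:
        for term, weight in doc.items():
            # Add this document's share of the centroid of relevant vectors.
            refined[term] = refined.get(term, 0.0) + beta * weight / len(relevant_docs)
    return refined


# Hypothetical query and document vectors of two relevant items.
query = {"luxury": 1.0, "bedding": 1.0}
relevant = [
    {"silk": 1.0, "bedding": 1.0},
    {"silk": 1.0, "king": 1.0},
]
refined = refine_query(query, relevant)
```

The refined query gains weighted terms such as "silk" from the relevant
items, but the result is a flat vector of term weights rather than an
explicit rule.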
3 SYSTEM DESCRIPTION
I. Item Name Structuring: This component takes a
product catalogue and extracts structured
information for mining the phrase-based and
parametric definitions. Details are discussed in
Section 4.
II. Mining Search Phrase Definitions: In this
phase, we divide the phrase definition mining
problem into two subproblems: (i) mining the
parametric definitions from the extracted
attribute-value pairs of items, and (ii) mining
phrase-based definitions from the long item
descriptions. Details are discussed in Section 5.
4 DATA LABELING
For the sake of concrete examples, this section
presents our techniques in an e-commerce domain;
they can be customized for other domains. The major
tasks in this phase are structuring and labeling
the extracted data. Readers are referred to
(Davulcu, 2003) for further details.
4.1 Labeling and Structuring Extracted Data
This section describes a technique to partition the
short product item names into their various
attributes. We achieve this by grouping and aligning
the tokens in the item names such that the instances
of the same attribute from multiple products fall
under the same category indicating that they are of
similar types.
The motivation for the partition is to organize
the data. By discovering attributes in product data
and arranging their values in a table, one can
build a search engine that supports faster and more
precise product searches.
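As an illustration of how such a table supports parametric search, the
following sketch indexes rows by (attribute, value) pairs and answers a
query by intersecting the matching row sets. The attribute names, the
sample rows and the helpers build_index and search are hypothetical
and only assume that item names have already been partitioned into
attributes.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]


def build_index(table: List[Row]) -> Dict[Tuple[str, str], Set[int]]:
    """Map each (attribute, lower-cased value) pair to the matching row ids."""
    index: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for row_id, row in enumerate(table):
        for attribute, value in row.items():
            index[(attribute, value.lower())].add(row_id)
    return index


def search(table: List[Row],
           index: Dict[Tuple[str, str], Set[int]],
           **constraints: str) -> List[Row]:
    """Return the rows that satisfy every attribute=value constraint."""
    matches = set(range(len(table)))
    for attribute, value in constraints.items():
        # Intersect with the rows holding this attribute value, if any.
        matches &= index.get((attribute, value.lower()), set())
    return [table[i] for i in sorted(matches)]


# Hypothetical table obtained after partitioning item names into attributes.
table = [
    {"brand": "Acme", "material": "Silk", "size": "King"},
    {"brand": "Acme", "material": "Cotton", "size": "King"},
    {"brand": "Zenith", "material": "Silk", "size": "Queen"},
]
index = build_index(table)
hits = search(table, index, material="silk", size="king")
```

Each constraint is resolved with one dictionary lookup, so a query
touches only the rows that share the requested attribute values instead
of scanning every item name.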
4.2 The Algorithm
Before proceeding to the algorithm, it helps to
view each item name as a sequence of tokens obtained
by using white space as a delimiter. Since the
sequences of tokens obtained from item names are