BOOSTING ITEM FINDABILITY: BRIDGING THE SEMANTIC GAP BETWEEN SEARCH PHRASES AND ITEM INFORMATION

Hasan Davulcu, Hung V. Nguyen, Viswanathan Ramachandran

Department of Computer Science and Engineering, Arizona State University Tempe, AZ 85287, USA

Keywords: E-commerce, Data Mining, Frequent Itemsets, Web Data, Information Retrieval, Information Extraction,

Relevance Feedback.

Abstract: Most search engines do their text query and retrieval based on keyword phrases. However, publishers

cannot anticipate all possible ways in which users search for the items in their documents. In fact, many

times, there may be no direct keyword match between a search phrase and descriptions of items that are

perfect “hits” for the search. We present a highly automated solution to the problem of bridging the

semantic gap between item information and search phrases. Our system can learn rule-based definitions that

can be ascribed to search phrases with dynamic connotations by extracting structured item information

from product catalogs and by utilizing a frequent itemset mining algorithm. We present experimental results

for a realistic e-commerce domain. Also, we compare our rule-mining approach to vector-based relevance

feedback retrieval techniques and show that our system yields definitions that are easier to validate and

perform better.

1 INTRODUCTION

Most search engines do their text query and retrieval using keywords. The average keyword query length is under three words (2.2 words (Crescenzi, 2001)).

Recent research (Andrews, 2003) found that 40

percent of companies rate their search tools as “not

very useful” or “only somewhat useful.” Further, a

review of 89 sites (Andrews, 2003) found that 75

percent have keyword search engines that fail to

retrieve important information and put results in

order of relevance; 92 percent fail to provide guided

search interfaces to help offset keyword deficiencies

(Andrews, 2003), and seven out of 10 web shoppers

were unable to find products using the search

engine, even when the items were stocked and

available.

The Defining Problem: Publishers cannot

anticipate all possible ways in which users search for

the items in their documents. In fact, many times,

there may be no direct keyword match between a

search phrase and descriptions of items that are

perfect “hits” for the search. For example, if a

shopper uses “motorcycle jacket” then, unless the

publisher or search engine knows that every “leather

jacket” is a “motorcycle jacket”, it cannot produce

all matches for the user’s search. Thus, for certain

phrases, there is a semantic gap between the search

phrase used and the way the corresponding matching

items are described. A serious consequence of this

gap is that it results in unsatisfied customers. Thus

there is a critical need to boost item findability by

bridging the semantic gap that exists between search

phrases and item information. Closing this gap has

the strong potential to translate web search traffic

into higher conversion rates and more satisfied

customers.

Issues in Bridging the Semantic Gap: We call a search phrase a “target search phrase” if it does not directly match certain relevant item descriptions. The semantics of items matching such “target search phrases” is implicit in their descriptions. For phrases with fixed meanings, i.e. phrases whose connotations do not change, such as “animal print comforter”, it is possible to close the gap by extracting their meaning with a thesaurus (Voorhees, 1998) and relating it to product descriptions such as “zebra print comforter” or “leopard print bedding”. Target search phrases pose a more interesting challenge

when their meaning is subjective, driven by

perceptions, and hence their connotations change

over time as in the case of “fashionable handbag”

and “luxury bedding”. The concept of a fashionable

handbag is based on trends, which change over time,

and correspondingly the attribute values characterizing such a bag also change. Similarly,

Davulcu H., V. Nguyen H. and Ramachandran V. (2005).

BOOSTING ITEM FINDABILITY: BRIDGING THE SEMANTIC GAP BETWEEN SEARCH PHRASES AND ITEM INFORMATION.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 48-55

DOI: 10.5220/0002525800480055

 SciTePress

the concept of “luxury bedding” depends on the

brands and designs available on the market that are

considered luxury, and on their attributes. Bridging the semantic gap is therefore, in essence, the problem of inferring the meaning of search phrases in all their nuances.

Our Approach: In this paper we present an

algorithm that (i) structures item information and (ii)

uses a frequent itemset mining algorithm to learn the

“target phrase” definitions.

2 RELATED WORKS

In (Aholen, 1998), generalized episodes and episode rules are used for descriptive phrase extraction. Episode rules are a modification of association rules, and an episode is a modification of a frequent set. An episode is a collection of feature vectors with a partial order; the authors claim that their approach is useful for phrase mining in Finnish, a language with a relaxed word order within a sentence. In our previous work (Nguyen, 2003), we presented a co-occurrence clustering algorithm that identifies phrases that frequently co-occur with the target phrase in the meta-tags of Web documents. In this paper we address a different problem: we attempt to mine the phrase definitions in terms of extracted item information, so that the mined definitions can be used to connect “search phrases” to real items in all their nuances.

The frequent itemset mining problem is to discover sets of items shared among a large number of records in a database. There are two main search strategies for finding frequent itemsets. Apriori (Agrawal, 1994) and several other Apriori-like algorithms adopt a breadth-first search, while Eclat (Zaki, 2000) and FPGrowth (Han, 2000) are well-known algorithms that search for all frequent itemsets of a database in a depth-first manner. Our algorithm also searches for frequent itemsets depth-first. However, unlike the lattice structure used in Eclat or the conditional frequent pattern tree used in FPGrowth, we propose the so-called 2-frequent itemset graph and use heuristics to prune the search space and improve performance. We plan to further optimize our algorithm and conduct detailed comparisons to the above algorithms.

The relevance feedback (Salton, 1990) method

can also be used to refine the original keyword

phrase by using the document vectors (Baeza-Yates,

1999) of the extracted relevant items as additional

information. In Section 6, we present experimental

results and show that the rules that our system

learns, by utilizing the extracted relevant item

information, are easier to validate and perform better

than retrieval with the relevance feedback method.

3 SYSTEM DESCRIPTION

I. Item Name Structuring: This component takes a

product catalogue and extracts structured

information for mining the phrase based and

parametric definitions. Details are discussed in

Section 4.

II. Mining Search Phrase Definitions: In this phase, we divide the phrase definition mining problem into two subproblems: (i) mining parametric definitions from the extracted attribute-value pairs of items, and (ii) mining phrase-based definitions from the long item descriptions. Details are discussed in Section 5.

4 DATA LABELING

This section presents the techniques for an e-

commerce domain, for the sake of providing

examples. Our techniques can be customized for

different domains. The major tasks in this phase are

structuring and labeling of extracted data. Readers are referred to (Davulcu, 2003) for further details.

4.1 Labeling and Structuring

Extracted Data

This section describes a technique to partition the

short product item names into their various

attributes. We achieve this by grouping and aligning

the tokens in the item names such that the instances

of the same attribute from multiple products fall

under the same category indicating that they are of

similar types.

The motivation for this partitioning is to organize the data. By discovering attributes in product data and arranging the values in a table, one can build a search engine that enables quicker and more precise product searches.

4.2 The Algorithm

Before proceeding to the algorithm, it helps to view each item name as a sequence of tokens obtained by using white space as a delimiter. Since the

sequences of tokens obtained from item names are


all from a single web page and belong to the same

category, they are likely to have a similar pattern. As

mentioned before, our algorithm is designed to

process collections of such item names without any

labeling whatsoever, so it can be run on the fly as data is extracted from the Web sites. The data our algorithm can process has the following general properties:

Super-Tokens: Any pair of tokens $t_1$, $t_2$ that always co-occur and occur more than once belong to a multi-token instance of a type.

Context: All single tokens occurring between identical attribute types belong to the same type. That is, if two tokens $t_1$ and $t_2$ from distinct item names occur between the same types $T_1$ and $T_2$, then they should be of the same type.

Anchor Type: A token that uniquely occurs within

all item names should belong to a unique type,

which we call an Anchor Type.

Density: Attribute types should be densely populated; that is, every type should occur within the majority of item names.

Ordering: Pairwise ordering of all types should be

consistent within a collection.

Tokenization: The item names are tokenized using white space characters as delimiters. Tokens are then stemmed using the Porter stemmer (Porter, 1980).

Super Tokenization: The second step identifies

multi-token attributes.
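The Super-Tokens property can be illustrated with a minimal Python sketch. It assumes that "co-occur" means the two tokens are adjacent, and it skips stemming; the function name `find_super_tokens` and the sample item names are hypothetical and not part of the original system.

```python
from collections import Counter

def find_super_tokens(item_names):
    """Return adjacent token pairs that always occur together and more than once."""
    token_counts = Counter()
    pair_counts = Counter()
    for name in item_names:
        tokens = name.lower().split()          # white-space tokenization
        token_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    # A pair qualifies if neither token ever appears without the other.
    return {
        (a, b)
        for (a, b), n in pair_counts.items()
        if n > 1 and token_counts[a] == n and token_counts[b] == n
    }

# Hypothetical item names, for illustration only.
names = [
    "hp laser jet printer black",
    "canon ink jet printer color",
    "hp laser jet printer color",
]
super_tokens = find_super_tokens(names)
```

Here "jet printer" and "hp laser" qualify, while "laser jet" does not, because "jet" also appears after "ink".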

Initialization of Types: To initialize, every item

name is prefixed and suffixed with a Begin and an

End token.

Context Based Inference: This step aligns tokens from different item names under a single type. It takes advantage of tokens repeating across descriptions and relies on the Context assumption that tokens within similar contexts have similar attribute types.

If token sequences $t_x, t, t_y$ and $t'_x, t', t'_y$ exist in $D$ such that $t_x, t'_x \in T_i$ and $t_y, t'_y \in T_j$, then combine and replace the types of tokens $t$ and $t'$ with a new type $T_n = \mathrm{Typeof}(t) \cup \mathrm{Typeof}(t')$.
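One way to picture the context-based merge is as a union-find over tokens, repeated until no more types can be combined. The sketch below is an illustration under simplifying assumptions, not the implementation used in the system: every token starts in its own type, `<begin>` and `<end>` stand for the Begin and End tokens, and the name `infer_types` and the sample data are hypothetical.

```python
from collections import defaultdict

def infer_types(item_names):
    """Group tokens into types by repeatedly merging tokens that share a context."""
    sequences = [["<begin>"] + name.split() + ["<end>"] for name in item_names]
    parent = {tok: tok for seq in sequences for tok in seq}

    def find(tok):
        # Follow parent links to the representative token of a type.
        while parent[tok] != tok:
            parent[tok] = parent[parent[tok]]
            tok = parent[tok]
        return tok

    changed = True
    while changed:
        changed = False
        by_context = defaultdict(set)
        for seq in sequences:
            for left, tok, right in zip(seq, seq[1:], seq[2:]):
                by_context[(find(left), find(right))].add(find(tok))
        # Tokens seen between the same left and right types get one type.
        for roots in by_context.values():
            first, *rest = sorted(roots)
            for other in rest:
                if find(other) != find(first):
                    parent[find(other)] = find(first)
                    changed = True

    types = defaultdict(set)
    for tok in parent:
        types[find(tok)].add(tok)
    return list(types.values())

# Hypothetical item names, for illustration only.
names = [
    "hp printer black",
    "canon printer color",
    "epson scanner color",
    "hp scanner black",
]
inferred = {frozenset(t) for t in infer_types(names)}
```

On this toy input the brands, the device kinds, and the colors each end up in a single type.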

Type Ordering: In this step, the set of inferred types $T$ is sorted based on their ordering in the original item names. We use the Pairwise Offset Difference (POD) metric to compare different types. The POD between types $T_i$ and $T_j$ is defined as:

where $f_x$ is the token offset of $x$ from the start of its item name and $f_y$ is the token offset of $y$. If this value is greater than zero, then the type $T_i$ comes after type $T_j$ in the sorted order.

Due to space constraints, tokens have been

aligned such that those from the same type are offset

at the same column. The type numbers the tokens

belong to are indicated at the top.

Algorithm 1: Item Name Partition

Type Merging: Careful observation shows that some of the neighbouring types are fillers for each other; that is, they are never instantiated together in any item name. Such types are candidates for merging and are called merge compatible. Merging at this point is logical because of our assumption that the types are densely populated.

Merge Concatenation: Finally, merge-

concatenation is performed to eliminate sparsely

populated types. Sparsely populated types are those

with a majority of missing values. By our

assumption, collections of item names should have

dense attributes. This implies that the tokens of a

sparsely populated type should be concatenated with

the values of one of the neighbouring types.

4.3 Experimental Results

To evaluate the algorithm, our DataRover system

was used to crawl and extract list-of-products from

the following five Web sites: www.officemax.com,


ICEIS 2005 - SOFTWARE AGENTS AND INTERNET COMPUTING

www.officedepot.com, www.acehardware.com,

www.homeclick.com and www.overstock.com.

Three metrics were used to measure the effectiveness of the algorithm. The first two evaluate the ability to assign fragments of the descriptions to the correct type, and the last indicates the correctness of the number of attributes.

Precision indicates how correctly type-value pairs are identified.

Recall indicates whether every existing type-value pair is identified.

Attributes Error Rate indicates the error in the number of attributes described in the set of product names.
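The exact formulas for these measures are not given here. As a sketch, assuming the standard set-based definitions of precision and recall over (item, type, value) triples, they could be computed as follows; the function name and the sample triples are hypothetical.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over (item, type, value) triples."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical labels: one value of item p2 was assigned to the wrong type.
gold = {
    ("p1", "brand", "hp"), ("p1", "color", "black"),
    ("p2", "brand", "canon"), ("p2", "type", "printer"),
}
predicted = {
    ("p1", "brand", "hp"), ("p1", "color", "black"),
    ("p2", "brand", "canon"), ("p2", "color", "printer"),
}
precision, recall = precision_recall(predicted, gold)
```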

Table 1: Summary of Evaluation Measures for Different Web Sites for the Item Name Structuring Algorithm

5 MINING THE DEFINITION OF A

TARGET PHRASE

In this section, we introduce the problem of mining

definitions of a phrase from product data extracted

from the matching Web pages. Using extraction

techniques discussed in Section 4 we can retrieve

tabular parametric attributes of matching products as

well as their long descriptions. Next, we apply

frequent itemset mining algorithms to learn the

parametric definitions and phrase-based definitions

of target phrases from the extracted product data.

First, in Sections 5.1 through 5.3 we introduce an algorithm that finds all frequent itemsets in a database. Section 5.4 discusses the problem of mining parametric definitions, and Section 5.5 discusses textual definition mining. Since their introduction by Agrawal et al. (Agrawal, 1994), the frequent itemset and association rule mining problems have received a lot of attention in the data mining research community. Over the last decade, many research papers (Han, 2001) have been published presenting new algorithms as well as improvements on existing algorithms to tackle the efficiency of frequent itemset mining. The frequent itemset mining problem is to discover sets of items shared among a large number of transaction instances in the database. For example, consider the product information database matching ‘trendy shoes’ that we extract from retail Web sites. Here, each instance represents the collection of a product’s <attribute, value> pairs for attributes such as brand, price, style, gender, color and description. The discovered patterns are the sets of <attribute, value> pairs that most frequently co-occur in the database. These patterns define the parametric description of the target phrase ‘trendy shoes’.

5.1 Boolean Representation of the

Database

The advantage of the Boolean representation is that many logical operations such as superset, subset, set subtraction, OR, and XOR between any number of attribute vectors can be performed extremely fast.
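As a rough illustration of why this representation is fast, the sketch below stores one bit vector per item (a Python integer with one bit per transaction) and computes the support of an itemset with bitwise AND. The encoding, the function names, and the sample data are assumptions made for this example.

```python
def to_bit_vectors(transactions):
    """Map each item to an integer whose bit r is set if row r contains the item."""
    vectors = {}
    for row, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << row)
    return vectors

def support(vectors, itemset, n_rows):
    """Fraction of rows containing every item of the itemset."""
    mask = (1 << n_rows) - 1                 # start with all rows selected
    for item in itemset:
        mask &= vectors.get(item, 0)         # keep rows that contain this item
    return bin(mask).count("1") / n_rows

# Hypothetical <attribute, value> transactions.
transactions = [
    {"brand=Nike", "color=black"},
    {"brand=Nike", "color=white"},
    {"brand=Nike", "color=black"},
    {"brand=Puma", "color=black"},
]
vectors = to_bit_vectors(transactions)
nike_black = support(vectors, {"brand=Nike", "color=black"}, len(transactions))
```

A subset test between two vectors `u` and `v` reduces to checking `u & v == u`, which is the kind of operation the superset heuristic in Section 5.3 relies on.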

5.2 Constructing 2-frequent Itemsets

Graph

The set of 2-frequent itemsets plays a crucial role in finding all frequent itemsets. The main idea follows from the observation that if $\{I_1, \ldots, I_k\}$ is a frequent itemset, then every pair of items in this set must also be a frequent itemset. Using this property, our algorithm first creates a graph that represents the 2-frequent itemsets among all items that satisfy the minimum support threshold.

The 2-frequent itemset graph is the directed graph $G(V, E)$, constructed as follows:

$V = I$, where $I$ is the set of items that satisfy the minimum support in database $D$.

$E = \{(v_i, v_j) \mid \{i, j\}$ is a 2-frequent itemset and $i < j\}$.

We sort the frequent single items into

lexicographical order and for a 2-frequent itemset,

we construct a directed edge from the node (item)

whose index is lower to the node whose index is

higher.
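The construction can be sketched in a few lines of Python. This is an illustrative version that represents transactions as sets and recounts support for each pair; the function name `build_2_frequent_graph` and the toy transactions are hypothetical.

```python
from itertools import combinations

def build_2_frequent_graph(transactions, min_support):
    """Return (nodes, edges) of the 2-frequent itemset graph."""
    n = len(transactions)

    def support(items):
        return sum(1 for t in transactions if items <= t) / n

    # V: frequent single items, in lexicographical order.
    all_items = sorted(set().union(*transactions))
    nodes = [i for i in all_items if support({i}) >= min_support]
    # E: an edge from the lower-indexed to the higher-indexed item of each frequent pair.
    edges = {v: [] for v in nodes}
    for a, b in combinations(nodes, 2):
        if support({a, b}) >= min_support:
            edges[a].append(b)
    return nodes, edges

# Hypothetical toy database with four items.
transactions = [
    {"a", "b", "c"}, {"b", "c", "d"}, {"a", "c"}, {"a", "b"},
    {"c", "d"}, {"a", "b"}, {"b", "d"}, {"a", "d"},
]
nodes, edges = build_2_frequent_graph(transactions, min_support=0.25)
```

With a 25% threshold, every pair except {a, d} is frequent in this toy database, so the graph has five edges.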


Example 1.

1 1 1 0
0 1 1 1
1 0 1 0
1 1 0 0
0 0 1 1
1 1 0 0
0 1 0 1
1 0 0 1

Figure 1: Database I and its 2-frequent item graph.

For this database, if the minimum support δ is set to 25%, then the 2-frequent itemsets are the item pairs joined by an edge in the 2-frequent itemset graph shown in Figure 1.

5.3 Searching for Frequent Itemsets

The algorithm starts from every node in the graph in turn and recursively traverses depth-first to its descendants. At any step $k$ ($k > 1$), the algorithm moves to a child node $v$ of the current node only if the path from the starting node to $v$ forms a $k$-frequent itemset. If so, the algorithm continues to expand to $v$’s children to search for $(k+1)$-frequent itemsets, and so on. Several algorithms (Han, 2000; Zaki, 2000) generate frequent itemsets in a depth-first manner. A distinguishing feature of our algorithm is that it searches on the 2-frequent itemset graph. Finding all 2-frequent itemsets takes $O(n^2)$ operations, where $n$ is the number of frequent single items. Our algorithm uses the following heuristics to guide the search.

Heuristic 1: At step $k$, choose only children of node $v_{k-1}$ whose incoming degree is greater than or equal to the number of visited nodes, counted from the starting node. The incoming degree of a node $v$, denoted deg($v$), is the number of nodes that point to $v$. The rationale is that if deg($v$) is smaller than the number of visited nodes (nodes in the path), then at least one of the $k-1$ previously visited nodes does not point to $v$. In other words, at least one node in the current path does not form a 2-frequent itemset with $v$. Therefore the $k-1$ nodes in the path and $v$ cannot form a $k$-frequent itemset, and $v$ is pruned without candidate itemset generation.

Heuristic 2: At step $k$, choose only children of node $v_{k-1}$ whose set of incoming nodes is a superset of the set of all $k-1$ nodes in the visited path. This heuristic, applied after Heuristic 1, ensures that all previously visited nodes in the current path point to the node under consideration. This is a necessary condition for each visited node to form a 2-frequent itemset with the node under consideration.

Heuristic 1 is efficient since the 2-frequent itemset graph is already constructed and the degree of every node is stored before the search proceeds. The superset test of Heuristic 2 can also be performed efficiently using the bit-vector representation. Consequently, by using these heuristics, we can prune many nodes that cannot be added to the visited nodes to form a frequent itemset, and avoid generating many candidate itemsets.
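The search and its two pruning heuristics can be sketched as follows. This is a simplified illustration rather than the optimized algorithm: transactions are plain sets instead of bit vectors, support is recounted for each candidate, and the function name and toy data are hypothetical.

```python
from itertools import combinations

def mine_frequent_itemsets(transactions, min_support):
    """Depth-first search for frequent itemsets on the 2-frequent itemset graph."""
    n = len(transactions)

    def support(items):
        return sum(1 for t in transactions if items <= t) / n

    nodes = sorted(i for i in set().union(*transactions)
                   if support({i}) >= min_support)
    children = {v: [] for v in nodes}
    incoming = {v: set() for v in nodes}
    for a, b in combinations(nodes, 2):
        if support({a, b}) >= min_support:
            children[a].append(b)
            incoming[b].add(a)

    found = [(v,) for v in nodes]

    def extend(path):
        for v in children[path[-1]]:
            if len(incoming[v]) < len(path):      # Heuristic 1: in-degree too small
                continue
            if not incoming[v] >= set(path):      # Heuristic 2: some visited node misses v
                continue
            candidate = path + (v,)
            if support(set(candidate)) >= min_support:
                found.append(candidate)
                extend(candidate)

    for start in nodes:
        extend((start,))
    return found

# Hypothetical toy database.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"},
    {"b", "c", "d"}, {"a", "d"}, {"c", "d"},
]
itemsets = mine_frequent_itemsets(transactions, min_support=0.3)
```

In this toy run, node d has in-degree 1, so Heuristic 1 rejects it as soon as the path holds two or more nodes, and no candidate containing d together with two other items is ever counted.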

5.4 Mining Parametric Definition of

Phrases

Note that, since we extract data from the Web by posing a search phrase query to a Web search engine, all the instances in the extracted data contain the search phrase. Therefore, association rule generation is simple: the search phrase becomes the head of each association rule and a frequent itemset becomes its body. The support of each association rule equals the support of the frequent itemset in its body, since the search phrase occurs in every instance in which the frequent itemset occurs. Next, we use the extracted product information to mine parametric phrase definition rules made up of conjunctions of distinct <attribute, value> pairs, such as:

Trendy shoe ← brand = Steve Madden, color = black, material = leather
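Because the head of every rule is the search phrase itself, rule generation reduces to formatting each frequent itemset as a rule body. A minimal sketch, in which the function name `make_rules` and the support value are hypothetical:

```python
def make_rules(phrase, frequent_itemsets):
    """Turn (itemset, support) pairs into 'phrase <- body' rules.

    Each itemset is a set of (attribute, value) pairs. The rule inherits the
    support of its body, because every instance contains the search phrase.
    """
    rules = []
    for items, support in frequent_itemsets:
        body = ", ".join(f"{attr} = {value}" for attr, value in sorted(items))
        rules.append((f"{phrase} <- {body}", support))
    return rules

# The itemset mirrors the example rule above; the support value is made up.
mined = [
    ({("brand", "Steve Madden"), ("color", "black"), ("material", "leather")}, 0.05),
]
rules = make_rules("trendy shoe", mined)
```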

5.5 Mining Textual Definitions of

Target Phrases

Another rich source of phrase definitions is the long product descriptions of the matching products. In Section 4, we described how we plan to collect long product descriptions from product Web pages that match a given target search phrase. In this section we describe the proposed algorithm for mining phrase definitions


that can connect hidden phrases to the product descriptions themselves. To generate candidate phrases, we first perform part-of-speech (POS) tagging and noun and verb phrase chunking (Finch, 1997) on the long description to obtain a more structured textual description. Part-of-speech (POS) tagging and chunking the above description yields the following structure. In the next step, we use the noun phrases as transaction instances and mine frequently used phrases from all the noun phrases of all the product descriptions that we have collected from the Web documents.

Algorithm 2: Frequent Itemset Mining

Next, we use the mined frequent phrases as items and create transaction instances by marking all of the frequently used phrases that match anywhere in the long description. This yields transaction instances made up of the frequently used phrases matching the product descriptions.

Next, we mine the frequent itemsets among the instances corresponding to the long descriptions to find the phrase definitions. Note that, because of the way we construct the items, all items are combinations of single words; therefore, some items subsume other items. As a consequence, many of the resulting frequent itemsets are redundant. For example, a long description might yield the following items: “suede”, “pump”, “suede pump”, “fashion”, “savvy”, “woman”, “fashion savvy”, “savvy woman”, “fashion savvy woman”. Hence, we only want to mine the frequent itemsets “suede pump” and “fashion savvy woman”, because these subsume the other items.
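One simple way to discard subsumed items is to keep only the phrases that are not a contiguous word subsequence of another phrase. The sketch below applies this filter to the example items above; the function name `drop_subsumed` and the contiguity criterion are assumptions made for this illustration.

```python
def drop_subsumed(phrases):
    """Keep only phrases that are not contained, word for word, in a longer phrase."""
    def contains(longer, shorter):
        lw, sw = longer.split(), shorter.split()
        return any(lw[i:i + len(sw)] == sw
                   for i in range(len(lw) - len(sw) + 1))

    unique = list(dict.fromkeys(phrases))     # drop duplicates, keep order
    return [p for p in unique
            if not any(p != q and contains(q, p) for q in unique)]

items = [
    "suede", "pump", "suede pump", "fashion", "savvy", "woman",
    "fashion savvy", "savvy woman", "fashion savvy woman",
]
maximal = drop_subsumed(items)
```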

6 EXPERIMENTAL RESULTS

The tables below show some of the definitions that

were mined. It is a relatively easy task for a domain

expert to inspect and evaluate the quality of such

rule-based definitions.

6.1 Comparison to Relevance

Feedback Method

To compare the performance of our definition miner to the standard relevance feedback retrieval method, we mined a large database of shoes (33,000 items) from a collection of online vendors. Next, we keyword-queried the database with the exemplary target search phrase “trendy shoe”. From the 166 keyword-matching shoes, we mined rule-based phrase definitions for “trendy shoes”, yielding rules such as fashionable sneaker, platform shoes, etc., which were validated by a domain expert. These mined rules matched 3,653 additional shoes.

Alternatively, we also computed the relevance feedback query vector using the above 166 matching shoes. We identified a similarity threshold by finding the maximal cosine theta, Θ, between the relevance feedback query vector and all of the 166 shoe vectors. Retrieval using the relevance feedback vector with this threshold yields more than 29,000 matches out of 33,000. The light colored bars in Figure 3 show the histogram of the 29,293 instances that fall into various similarity ranges. Similarly, the dark colored bars plot the similarity ranges of the 3,653 shoes that were retrieved by matching with our mined definitions. As can be seen from the distributions in Figure 3, the items retrieved with our mined definitions have a fairly uniform similarity distribution (with around 300 of these falling below the threshold), rather than a distribution skewed towards the higher values of similarity. Since the dark colored bars correspond to relevant “trendy shoes” matching our rules, which were validated by an expert, most of


these items should have ranked towards the higher

end of the similarity spectrum. However, the relevance feedback measure failed to rank them as such; hence, it performed poorly for this task.
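The relevance feedback baseline can be pictured with the following sketch. It makes several assumptions that are not stated above: documents are raw term-frequency vectors, the feedback query vector is the centroid of the relevant documents, and the threshold is the smallest cosine similarity (that is, the largest angle) between the query vector and any relevant document. All function names and sample descriptions are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Raw term-frequency vector of a description."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def feedback_query(relevant_docs):
    """Centroid of the relevant document vectors."""
    total = Counter()
    for doc in relevant_docs:
        total.update(vectorize(doc))
    return {t: c / len(relevant_docs) for t, c in total.items()}

def retrieve(collection, relevant_docs):
    """Return every document at least as similar as the least similar relevant one."""
    query = feedback_query(relevant_docs)
    threshold = min(cosine(query, vectorize(d)) for d in relevant_docs)
    return [d for d in collection if cosine(query, vectorize(d)) >= threshold]

# Hypothetical descriptions, for illustration only.
relevant = ["trendy black leather shoe", "trendy red platform shoe"]
collection = relevant + ["garden hose", "black leather boot"]
matches = retrieve(collection, relevant)
```

Because the threshold is set by the least similar seed document, it can become very permissive when the seed set is diverse, which is one plausible reading of why so large a share of the collection passes it in the experiment above.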

6.2 Comparison to Relevance

Feedback with LSI

The plot of similarity ranges obtained by ranking the 3,653 shoes retrieved with our mined rules, using relevance feedback with and without the latent semantic indexing (LSI) technique (Deerwester, 1990), is shown in Figure 2. The light colored dashed line represents the cosine theta threshold Θ for the relevance feedback ranking; similarly, the dark colored dashed line represents the cosine theta threshold for relevance feedback with LSI. The recall for relevance feedback is nearly 93%; however, since it also matches 88% of a random collection of shoes, its precision is lower. On the other hand, even though the ranking of relevance feedback with LSI falls into a higher similarity range, it appears to have a much lower recall (25%) for this experiment with the exemplary target phrase “trendy shoes”.

7 CONCLUSIONS AND FUTURE

WORK

Our initial experimental results for mining phrase

definitions are promising according to our retail

domain expert who is the Webmaster of an affiliate

marketing web site. We plan to scale up our

experiments to hundreds of product categories and

thousands of phrases. Also, we would like to perform experiments to determine how precisely our algorithm learns the definitions of phrases whose meaning changes over time.

Phrase           | Parametric Rules                                                          | Support
Fashion handbags | Brand = Jil Sander, material = leather, type = clutch → fashion handbags  | 4.25%
Fashion handbags | Brand = Carla, design = mancini, material = leather → fashion handbags    | 2.4%
Fashion handbags | Brand = Butterfly, design = beaded → fashion handbags                     | 2.4%
Fashion handbags | Brand = Sven, material = leather → fashion handbags                       | 10.2%
Fashion handbags | Design = beaded, color = pink → fashion handbags                          | 2%
Fashion handbags | Design = beaded, color = blue, type = tote → fashion handbags             | 3.2%
Luxury beddings  | Design = Baffled box, material = cotton → luxury beddings                 | 5%
Luxury beddings  | Design = Waterford, material = linen → luxury beddings                    | 6%
Luxury beddings  | Material = silk → luxury beddings                                         | 3%
Luxury beddings  | Design = Sussex, material = polyester → luxury beddings                   | 6%
Sport beddings   | Design = All American, material = polyester → sport beddings              | 6%
Sport beddings   | Design = All star, material = polyester → sport beddings                  | 9%
Sport beddings   | Design = Big and bold → sport beddings                                    | 17%
Sport beddings   | Design = sports fan → sport beddings                                      | 45%

Textual Rules                               | Support
Egyptian cotton mate-lass → luxury beddings | 0.6%
Silk, smooth, King set → luxury beddings    | 0.75%
Piece ensemble → luxury beddings            | 0.75%
American sport ensemble → sport beddings    | 0.4%
Paraphernalia sport → sport beddings        | 0.6%
fashionable sneaker → trendy shoes          | 7%
Wedge edge → trendy shoes                   | 5%
Platform shoes → trendy shoes               | 6%


REFERENCES

R. Agrawal and R. Srikant. 1994, “Fast Algorithms for

mining association rules”. In Proc. 20th Int. Conf.

VLDB pp. 487-499

H. Aholen, O. Heinonen, M. Klemettinen, and A. I. Verkamo. 1998, “Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Collections”. In Proceedings of ADL’98, Santa Barbara, USA.

W. Andrews. 2003 “Gartner Report: Visionaries Invade

the 2003 Search Engine Magic Quadrant”,

V. Crescenzi, G. Mecca, and P. Merialdo. 2001

“Roadrunner: Towards automatic data extraction from

large web sites”, In Proc. of the 2001 Intl. Conf. on

Very Large Data Bases.

[Figure 3 plot: x-axis Similarity Measurement, y-axis Number of Instances; legend: Vector Space Relevance Feedback, Definition Query]

H. Davulcu, S. Vadrevu, S. Nagarajan, I.V. Ramakrishnan.

2003, “OntoMiner: Bootstrapping and Populating

Ontologies From Domain Specific Web Sites”, in

IEEE Intelligent Systems, Volume 18, Number 5.

S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. 1990, “Indexing by latent semantic analysis”, Journal of the American Society for Information Science, 41(6), pp. 391-407.

Steve Finch and Andrei Mikheev. 1997, “A Workbench

for Finding Structure in Texts”. Applied Natural

Language Processing , Washington D.C.

J. Han, J. Pei, Y. Yin, and R. Mao. 2000, “Mining frequent patterns without candidate generation”. In Proceedings of the ACM SIGMOD International Conference on Management of Data, volume 29(2) of SIGMOD Record, ACM Press.

J. Han, and M. Kamber. 2001, “Data Mining: Concepts

and Techniques”, Morgan Kaufmann Publishers.

Hung V. Nguyen, P. Velamuru, D. Kolippakkam, H. Davulcu, H. Liu, and M. Ates. 2003, “Mining "Hidden Phrase" Definitions from the Web”. APWeb, Xi'an, China, Springer-Verlag, LNCS Vol 2642, pp. 156-165.

M.F.Porter. 1980, “An algorithm for suffix stripping”,

Program, 14 no. 3, pp. 130-137.

G. Salton and C. Buckley. 1990, “Improving retrieval performance by relevance feedback”, Journal of the American Society for Information Science, pp. 288-297.

Ellen M. Voorhees. 1998, “Using WordNet for Text Retrieval”. In WordNet: An Electronic Lexical Database, edited by Christiane Fellbaum, MIT Press.

R. A. Baeza-Yates and Berthier A. Ribeiro-Neto. 1999,

“Modern Information Retrieval”, ACM Press /

Addison-Wesley.

M.J. Zaki. 2000, “Scalable algorithms for association

mining”. IEEE Transactions on Knowledge and Data

Engineering, 12(3), pp. 372-390.

[Figure 2 plot: x-axis Similarity; legend: LSI/RF]

Figure 2: Similarity histogram for relevance feedback and

relevance feedback with LSI

Figure 3: Similarity histogram for rule-based and

relevance feedback based matches
