large, the synthesis is repeated by constructing a new
set of good minterms at the next lower precision
level for which minterms can be found. The initial
query guarantees that the repetitions will eventually
terminate successfully; in the worst case we may
have the initial query as the best query for the
search.
Definition
Let R and S be two sets of minterms. coalesce(R, S)
is a set of minterms defined as follows:
coalesce(R, S) = {π(r, s) | r ∈ R and s ∈ S},
where π is defined as follows:
π(m₁, m₂) = m, where m is a minterm such that
m₁ ⇒ m, m₂ ⇒ m, and for every minterm m′,
(m₁ ⇒ m′ ⇒ m and m₂ ⇒ m′ ⇒ m) ⇒ (m′ ≡ m). □
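Under one concrete encoding (an illustrative assumption, not fixed by the paper), a minterm is a frozenset of literals, so that m₁ ⇒ m holds exactly when m's literals form a subset of m₁'s; π then reduces to set intersection, and coalesce can be sketched as:

```python
def pi(m1, m2):
    # Weakest minterm implied by both m1 and m2.  With minterms encoded
    # as frozensets of literals (a negated term spelled '-term'),
    # m1 => m holds iff m <= m1, so the maximal common consequence is
    # the intersection of the two literal sets.
    return m1 & m2

def coalesce(R, S):
    # coalesce(R, S) = { pi(r, s) | r in R and s in S }
    return {pi(r, s) for r in R for s in S}

m1 = frozenset({"java", "compiler", "-coffee"})
m2 = frozenset({"java", "bytecode", "-coffee"})
print(sorted(pi(m1, m2)))  # ['-coffee', 'java']
```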
Definition
Let M be a set of p-minterms. We derive the set of
all useful minterms, U, as follows: U = M₁ ∪ M₂ ∪ …,
where M₁ = M and Mᵢ₊₁ = coalesce(Mᵢ, M). □
The finite size of the set of terms makes it easy to
see that for some finite k, Mₖ = Mₖ₊₁. To each u ∈ U
we associate precision(u) = |Relevant σ u| /
|Irrelevant σ u|; where the denominator is 0, a large
value is assigned as the precision.
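Continuing the same illustrative frozenset-of-literals encoding (an assumption, not the paper's notation), the fixpoint construction of U and the precision measure might be sketched as follows, with a document modelled as the set of literals it satisfies:

```python
def coalesce(R, S):
    # Minterms as frozensets of literals: the weakest minterm implied
    # by both r and s is the intersection of their literal sets.
    return {r & s for r in R for s in S}

def useful_minterms(M):
    # U = M1 ∪ M2 ∪ ..., with M1 = M and M_{i+1} = coalesce(M_i, M).
    # The M_i grow monotonically inside a finite universe of terms,
    # so the loop stops at the fixpoint M_k = M_{k+1}.
    U, Mi = set(M), set(M)
    while True:
        Mnext = coalesce(Mi, M)
        if Mnext == Mi:
            return U
        U |= Mnext
        Mi = Mnext

LARGE = 1e9  # stand-in for the "large value" used when no irrelevant document matches

def precision(u, relevant, irrelevant):
    # |Relevant σ u| / |Irrelevant σ u|: a document (a set of literals)
    # satisfies minterm u when it contains all of u's literals.
    hits = sum(1 for d in relevant if u <= d)
    misses = sum(1 for d in irrelevant if u <= d)
    return hits / misses if misses else LARGE
```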
A set of minterms in which each minterm meets a
stipulated precision constraint is a set of good
minterms. These minterms can be used to synthesise a
query at the matching precision level. Formally,
Definition
GoodMinterms(U, p) = {u ∈ U | precision(u) ≥ p and
∀w ∈ U [(u ⇒ w) ⇒ ((precision(w) < p) or (u ≡ w))]}
□
The precision values in set U are discrete; by
progressively lowering the value one can derive a
query that meets the search engine constraints at the
best precision value possible.
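A sketch of GoodMinterms and of the progressive lowering of p, under the same assumed frozenset encoding; here `prec` is a caller-supplied precision map and `acceptable` is a hypothetical stand-in for the search-engine constraints, which the paper does not fix:

```python
def good_minterms(U, p, prec):
    # GoodMinterms(U, p): minterms of precision at least p that cannot
    # be weakened any further without falling below p.  With minterms
    # as frozensets of literals, u => w holds iff w <= u.
    return {
        u for u in U
        if prec(u) >= p
        and all(prec(w) < p or w == u for w in U if w <= u)
    }

def best_good_minterms(U, prec, acceptable):
    # Try the discrete precision levels found in U from highest to
    # lowest; return the first level whose good minterms pass the
    # (hypothetical) search-engine constraint check.
    for p in sorted({prec(u) for u in U}, reverse=True):
        G = good_minterms(U, p, prec)
        if G and acceptable(G):
            return p, G
    return None
```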
6 CONCLUDING REMARKS
The paper has described a method for constructing a
search query from a collection of relevant and
irrelevant text documents. The method constructs the
query through a series of stages, each a greedy
heuristic directed at certain goals. In our
experiments, we found the greedy approach to be
adequate; there is little need to seek optimised,
perfect Boolean expressions, as there are
uncertainties outside our control. For example, the
small number of documents used to construct the
query can define only a very imprecise image of
documents and resources on the Web. Effort spent
on matching this imperfect image would not deliver
proportionate rewards.
The method performs well when a reasonable
collection of relevant and irrelevant documents is
available. As the size of the collection increases, it
becomes a better image of the resources on the Web.
Too small a set does not help in synthesising a
good-quality search query. A very large collection,
on the other hand, requires additional processing
effort; in particular, it requires more human effort
to classify the documents in the collection, and the
goal of the exercise is to reduce human effort. A set
of about 10 relevant documents with 10 to 20
irrelevant documents is generally adequate.
The constructed queries include some unintuitive
terms. However, the queries retrieve fresh links and
are precise when used over the Web. An unintuitive
term is one that web searchers are not likely to use
in queries; it is not a counter-intuitive term.
The tests that we have conducted using the
algorithm so far have shown very good
improvements in the quality of links to resources in
comparison with the links retrieved using naïve
queries. Even in areas well understood by a user,
the synthesised queries have performed better than
those constructed by humans. Some performance
results are available in Patro and Malhotra (2005).
REFERENCES
Aho, A.V. and Ullman, J.D., 1992. Foundations of
computer science, Computer Science Press, New
York.
Apte, C., Damerau, F.J., and Weiss, S.M., 1994.
Automated Learning of Decision Rules for Text
Categorisation, ACM Transactions on Information
Systems, 12(3), 233-251.
Brin, S. and Page, L., 1998. The anatomy of a large-scale
hypertextual web search engine, J. Computer
Networks and ISDN Systems, 30(1), 107-117.
Patro, S. and Malhotra, V., 2005. Characteristics of the
Boolean web search query: Estimating success from
characteristics, http://eprints.comp.utas.edu.au:81/
Sanchez, S.N., Triantaphyllou, E., Chen, J. and Liao,
T.W., 2002. An Incremental Learning Algorithm for
Constructing Boolean Functions from Positive and
Negative Examples, Computers and Operations
Research, 29(12), 1677-1700.
Sebastiani, F., 2002. Machine learning in automated text
categorization, ACM Computing Surveys, 34(1), 1-47.
ICEIS 2005 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS