use of search engines and search services to find
specific information. Users are not satisfied with the
performance of the current generation of search
engines because of slow retrieval speed,
communication delays and poor quality of retrieved
results [1].
In this paper, we propose an efficient method
called word-intersection clustering, which can cluster
more than two documents based on the words they
share. The method computes a correlation similarity
score for the documents, and documents whose
similarity score is above a given threshold are
clustered together. A definition of a document
profile is derived, so that each document has a
profile based on its category classification and
similarity score. The documents are then clustered
under different categories. The offline computation
of the proposed algorithm scales independently of
the number of documents. If one document in a
cluster is relevant, then the whole cluster is
considered relevant, which makes information
retrieval more efficient.
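The thresholding step can be illustrated with a minimal sketch. The sketch rests on several assumptions that are not taken from the paper: Jaccard overlap of word sets stands in for the correlation similarity score, documents are merged transitively with a union-find structure, and the names `jaccard`, `cluster_by_threshold` and the threshold value 0.4 are hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Stand-in similarity (assumption): shared words over all distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_threshold(docs: dict[str, set[str]], threshold: float) -> list[set[str]]:
    """Group documents whose pairwise score exceeds the threshold, transitively."""
    parent = {name: name for name in docs}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(docs[a], docs[b]) > threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: dict[str, set[str]] = {}
    for name in docs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Toy documents, each reduced to its set of words.
docs = {
    "d1": {"information", "filtering", "retrieval"},
    "d2": {"information", "filtering", "search"},
    "d3": {"information", "retrieval", "filtering", "search"},
    "d4": {"data", "storage"},
}
clusters = cluster_by_threshold(docs, threshold=0.4)
```

With these toy inputs, d1, d2 and d3 end up in one cluster and d4 stays alone, so a cluster can hold more than two documents.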
This paper is organized as follows. The next
section discusses the structure of a document based
on the words shared by various documents. Section 3
presents the proposed algorithm and technique for
clustering documents, and the final section
concludes the paper.
2 RESTRUCTURING OPERATION
Existing clustering methods focus on clustering two
documents [2]. There has been little work on
clustering more than two documents.
We propose a new restructuring operation that
uses the keywords appearing in the documents.
Each keyword has a weight ranging from 0 to 1.
The weight is set by the system designer based on
the importance and relevance of the keyword in
its category and the number of times the keyword
appears in the document.
Figure 1 shows the idea of the restructuring
operation on documents. After the restructuring
operation, the documents in the same category are
clustered according to the words they share. For
example, documents 1, 15, 18 and 22 are clustered,
documents 2 and 3 are clustered, and so are
documents 7, 8 and 10.
Figure 1: Restructuring operation of documents
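As a small illustration of words shared by more than two documents, the following sketch computes the word set common to every document in a group. It assumes each document has already been reduced to a set of words; the function name `shared_words` and the sample data are hypothetical and not part of the proposed method.

```python
def shared_words(docs: list[set[str]]) -> set[str]:
    """Return the words that appear in every document of the group."""
    if not docs:
        return set()
    return set.intersection(*docs)


# Three toy documents from one category, each as a set of words.
group = [
    {"information", "filtering", "agent"},
    {"information", "filtering", "search"},
    {"filtering", "information", "profile"},
]
common = shared_words(group)
```

Here `common` holds the two words present in all three documents, "information" and "filtering".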
3 DOCUMENT CLUSTERING
Section 2 proposed a restructuring operation to
cluster documents. In this section, we discuss the
algorithm and technique of word-intersection
clustering. The subsections are organized as
follows: section 3.1 presents the calculation of the
similarity score, and section 3.2 derives the
document profiles. Finally, the proposed k-time
clustering algorithm is applied.
3.1 Computation of similarity score
To compute the similarity score of documents, we
first select some keywords appearing in the
documents of a given category, and each keyword
is assigned a weight ranging from 0 to 1. Different
words have different weights based on how important
and relevant each word is in a particular category.
The weights are calibrated by the system
administrator. For example, in the category of
information management, the words “information
filtering” might be assigned a higher weight by the
system designer than the words “data storage”.
The number of times a word appears in a
document also signifies its relevance with
respect to all other documents.
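The two ingredients described above, a per-keyword weight and a per-document occurrence count, can be sketched as follows. This is only an illustration under stated assumptions: the weight values are invented, multiplying weight by count is one plausible way to combine them rather than the formula used in this paper, and the names `WEIGHTS`, `keyword_counts` and `weighted_profile` are hypothetical.

```python
import re

# Hypothetical weights in [0, 1] for the category "information management".
WEIGHTS = {"information filtering": 0.9, "data storage": 0.4}


def keyword_counts(text: str, keywords) -> dict[str, int]:
    """Count whole-phrase, case-insensitive occurrences of each keyword."""
    lowered = text.lower()
    return {
        kw: len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
        for kw in keywords
    }


def weighted_profile(text: str, weights: dict[str, float]) -> dict[str, float]:
    """Combine weight and count per keyword (assumed rule: weight * count)."""
    counts = keyword_counts(text, weights)
    return {kw: weights[kw] * n for kw, n in counts.items()}


text = (
    "Information filtering reduces overload. Data storage is cheap, "
    "but information filtering needs user profiles."
)
counts = keyword_counts(text, WEIGHTS)
profile = weighted_profile(text, WEIGHTS)
```

In this toy text, "information filtering" occurs twice and "data storage" once, so the more heavily weighted keyword dominates the resulting profile.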
Table 1 shows the number of times a keyword
appears in the document in the category of
information management.
LEGEND
K -- KEYWORDS
ICETE 2004 - GLOBAL COMMUNICATION INFORMATION SYSTEMS AND SERVICES