Table 2: Performance comparison of clustering methods. P, E and F refer to purity, entropy and F-measure, respectively.
          SCuBA             Soft              K-means           Hard
Keyword   P     E     F     P     E     F     P     E     F     P     E     F
apple     0.80  0.57  0.26  0.77  0.70  0.39  0.72  1.08  0.65  0.78  0.89  0.47
gold      0.79  0.67  0.19  0.78  0.78  0.33  0.75  1.11  0.63  0.73  0.99  0.44
jaguar    0.78  0.71  0.23  0.91  0.40  0.18  0.65  1.12  0.40  0.91  0.40  0.18
paper     0.59  1.34  0.44  0.52  1.87  0.42  0.45  2.27  0.40  0.46  2.29  0.38
saturn    0.47  1.52  0.36  0.73  0.91  0.27  0.31  1.89  0.41  0.73  0.91  0.27
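As an illustration of how the P and E columns can be computed, the following is a minimal sketch of purity and entropy under their commonly used definitions. It is not the authors' evaluation code: the function names, the base-2 logarithm, and the choice to weight each cluster by its share of all cluster memberships (so that a page placed in several overlapping clusters is counted once per cluster) are assumptions made here. F-measure is omitted because its definition varies between evaluations.

```python
import math
from collections import Counter


def purity(clusters):
    """Fraction of cluster members that carry their cluster's majority label.

    `clusters` is a list of clusters; each cluster is a list holding the
    reference category label (e.g. an ODP category) of every page in it.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total


def entropy(clusters):
    """Size-weighted average of the label entropy inside each cluster.

    Assumption: base-2 logarithm, no normalisation by the number of categories.
    """
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        size = len(c)
        h = -sum((n / size) * math.log2(n / size) for n in Counter(c).values())
        result += (size / total) * h
    return result


# Toy example with two clusters and hypothetical category labels.
example = [["Computers", "Computers", "Fruit"], ["Fruit", "Fruit"]]
print(f"purity  = {purity(example):.2f}")
print(f"entropy = {entropy(example):.2f}")
```

Higher purity and lower entropy both indicate clusters that agree more closely with the reference categories.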
Table 3: Comparison of actual clusters with clustering methods. Clusters and Pages refer to the number of clusters and the
total number of Web pages in the clusters, respectively.
          # of   # of    SCuBA              Soft               K-means    Hard
Keyword   Cat.   pages   Clusters   Pages   Clusters   Pages   Clusters   Clusters
apple     6      648     801        4402    181        2406    6          23
gold      6      670     578        3276    165        2077    6          27
jaguar    4      138     15         88      4          95      4          4
paper     10     1422    2054       12593   120        3516    10         6
saturn    4      71      12         86      3          66      4          3
Table 4: Sample features of one subspace cluster for each keyword.
Keyword   Sample features                                                        Related Category
apple     macintosh, users, computer, product, check, imac, mac                  Computers
gold      necklace, ring, pendant, earring, bracelet, shipping, silver, jewelry  Shopping
jaguar    auto, cars, parts, support                                             Cars
paper     printing, contact, privacy, product, unique, see                       Shopping
saturn    planet, sun, image, satellite                                          Planets
the performance of subspace clustering significantly
as shown in Figure 3. In particular, TF and TF/IDF
showed nearly the same performance. Contrary to
expectations, TF/IDF ranked the common terms lower
but was not able to eliminate them.
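The effect described above can be reproduced on a toy corpus. The sketch below is an illustration only: the four documents are invented, and the weighting tf * log(N / df) is one common TF/IDF variant assumed here, not necessarily the one used in the experiments. A term that occurs in every document receives a weight of zero, but a common term that occurs in most documents (here "contact") only drops in rank and keeps a positive weight.

```python
import math
from collections import Counter

# Hypothetical search-result snippets for the keyword "apple".
docs = [
    "apple mac computer product contact",
    "apple imac mac users contact",
    "apple fruit recipe pie contact",
    "apple orchard fruit harvest",
]


def term_scores(docs):
    """Return (tf, tfidf) dictionaries computed over the whole corpus."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Corpus-wide term frequency and document frequency.
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    # Assumed variant: unsmoothed IDF, log(N / df).
    tfidf = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return tf, tfidf


tf, tfidf = term_scores(docs)
for term in ("apple", "contact", "mac", "imac"):
    print(f"{term:8s} tf={tf[term]}  tfidf={tfidf[term]:.3f}")
```

Under raw TF, "contact" outranks "imac"; under this TF/IDF variant the order is reversed, yet "contact" still has a nonzero weight and would survive any filter that only removes zero-weight terms.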
Consequently, our subspace clustering based algorithm
preserves the quality of the clusters initially obtained
from SCuBA while significantly reducing the number of
clusters. Furthermore, it performs better than a common
state-of-the-art clustering algorithm, even though that
algorithm is provided with the number of clusters.
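The size of the reduction can be read off Table 3. The short sketch below computes, for each keyword, the relative drop in the number of clusters from the SCuBA column to the Soft and Hard columns. Treating Soft and Hard as the outputs of the proposed algorithm is an assumption based on the table layout; the cluster counts themselves are copied from Table 3.

```python
# Cluster counts from Table 3: keyword -> (SCuBA, Soft, Hard).
table3 = {
    "apple": (801, 181, 23),
    "gold": (578, 165, 27),
    "jaguar": (15, 4, 4),
    "paper": (2054, 120, 6),
    "saturn": (12, 3, 3),
}


def reduction(before, after):
    """Relative reduction in the number of clusters, as a fraction of `before`."""
    return 1 - after / before


for keyword, (scuba, soft, hard) in table3.items():
    print(
        f"{keyword:7s} SCuBA={scuba:5d}  "
        f"Soft={soft:4d} ({reduction(scuba, soft):.0%} fewer)  "
        f"Hard={hard:3d} ({reduction(scuba, hard):.0%} fewer)"
    )
```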
5 CONCLUSION
We present a novel subspace clustering based algorithm
that organizes search results by simultaneously
clustering them and identifying their distinguishing terms.
Experimental results illustrate the effectiveness of our
algorithm in terms of the purity, entropy and F-measure
of the generated clusters, evaluated against the
Open Directory Project (ODP). As future work, we will
investigate feature selection to eliminate common terms
and thereby increase the quality of the features.
The merging step is currently a simple rule-based method;
more sophisticated relationship analysis over clusters
could be explored.
SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS