in $P(c_i \mid t)$ (see Section 4.1). Hence, the information
that a term is absent from a document is not impor-
tant when selecting features with our metrics.
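To illustrate this point, the following minimal Python sketch estimates $P(c_i \mid t)$ from term presence only. The estimator (the fraction of documents containing t that belong to c_i), the function name `class_given_term`, and the toy documents are assumptions made for illustration; they are not the exact definition used in Section 4.1.

```python
from collections import Counter, defaultdict


def class_given_term(docs):
    """Estimate P(category | term) from presence counts only.

    docs: iterable of (terms, category) pairs, where terms is a collection
    of the words found in one search interface (hypothetical input format).
    Documents that do not contain a term never contribute to that term's
    estimate, so term absence carries no information here.
    """
    term_total = Counter()            # number of documents containing each term
    term_class = defaultdict(Counter) # per-term counts broken down by category
    for terms, category in docs:
        for t in set(terms):
            term_total[t] += 1
            term_class[t][category] += 1
    return {
        t: {c: n / term_total[t] for c, n in term_class[t].items()}
        for t in term_total
    }


# Toy data, invented for illustration only.
docs = [
    ({"title", "author", "isbn"}, "books"),
    ({"title", "artist"}, "music"),
    ({"author", "isbn"}, "books"),
]
probs = class_given_term(docs)
```

In this toy data, "isbn" occurs only in book interfaces, so its estimate concentrates on a single category, whereas "title" is split evenly between two categories and is therefore less discriminative.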
7 CONCLUSIONS
In this paper, we study the problem of categorizing
structured web sources by using their search interfaces.
Our approach employs a filtering feature selection
technique together with a Gaussian process classifier.
In our research, we treated each search interface
simply as a bag-of-words. We conducted experiments
with our FS methods, which use new metrics and a
novel simple ranking scheme, as well as with existing
FS methods. The experimental results indicate that:
(1) feature selection techniques improve classification
performance significantly; (2) our classification approach
and the proposed FS methods are effective.
Our research also points out that rare words are not
important to the categorization task.
For future work, in terms of the classification
method, one possible improvement to our research
is to identify and use different feature types from a
search interface. In terms of the feature selection
technique, we plan to evaluate the effectiveness of the
new methods in other text categorization problems.