Authors:
Hieu Quang Le and Stefan Conrad
Affiliation:
Heinrich-Heine University, Germany
Keyword(s):
Deep web, Classification, Database, Feature selection, Gaussian processes.
Related Ontology Subjects/Areas/Topics:
Databases and Datawarehouses; e-Business and e-Commerce; Internet Technology; Society, e-Business and e-Government; System Integration; Web Information Systems and Technologies
Abstract:
This paper studies the problem of classifying structured data sources on the Web. While prior work uses all features extracted from search interfaces, we further refine the feature set. In our research, each search interface is treated simply as a bag of words. We choose a subset of words suited to classifying web sources by applying our feature selection methods, which use new metrics and a novel, simple ranking scheme. Using an aggressive feature selection approach together with a Gaussian process classifier, we obtained high classification performance in an evaluation over real web data.
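As a rough illustration of the bag-of-words and feature selection steps described in the abstract, the following Python sketch scores each word per class and keeps only the top-ranked words. It is not the authors' method: the abstract does not specify the new metrics or the ranking scheme, so a one-sided chi-square statistic and a simple "top k words per class, merged" ranking are used here as stand-ins. The function names, the toy search-interface texts, and the domain labels are all hypothetical, and the Gaussian process classification step is omitted.

```python
import re


def tokenize(text):
    """Treat a search interface as a bag of words (here: a set of lowercase words)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def chi_square(n11, n10, n01, n00):
    """One-sided chi-square score for a 2x2 word/class contingency table.

    Assumption: this standard metric stands in for the paper's unspecified metrics.
    Words that are not positively associated with the class score 0.
    """
    if n11 * n00 <= n10 * n01:
        return 0.0
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k best-scoring words of each class and merge them (assumed ranking scheme)."""
    bags = [tokenize(d) for d in docs]
    vocab = set().union(*bags)
    selected = set()
    for c in sorted(set(labels)):
        scores = {}
        for w in vocab:
            n11 = sum(1 for b, y in zip(bags, labels) if w in b and y == c)
            n10 = sum(1 for b, y in zip(bags, labels) if w in b and y != c)
            n01 = sum(1 for b, y in zip(bags, labels) if w not in b and y == c)
            n00 = len(bags) - n11 - n10 - n01
            scores[w] = chi_square(n11, n10, n01, n00)
        # Rank by descending score; break ties alphabetically for determinism.
        ranked = sorted(vocab, key=lambda w: (-scores[w], w))
        selected.update(ranked[:k])
    return sorted(selected)


# Hypothetical toy data: text extracted from four search interfaces in two domains.
docs = [
    "search books by title author isbn",
    "find book title author publisher",
    "search flights departure arrival date",
    "flight departure city arrival city date",
]
labels = ["books", "books", "airfares", "airfares"]

features = select_features(docs, labels, k=2)
print(features)
```

In this toy setting a small k plays the role of aggressive selection: the word "search", which appears in both domains, receives a score of 0 and is dropped, while domain-specific words are kept. The reduced word set would then be used to build the feature vectors passed to a classifier.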