WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION

Yuxin Wang, Keizo Oyama

Abstract

We propose a web page classification method that considers page group structure in order to build a high-quality homepage collection. We use a support vector machine (SVM) with textual features obtained from each page and its surrounding pages. The surrounding pages are grouped according to connection type (in-link, out-link, or directory entry) and relative URL hierarchy (same, upper, or lower), and an independent feature subset is generated from each group. The feature subsets are then concatenated to compose the feature set of a classifier. Experimental results on the ResJ-01 data set, manually created by the authors, and on the WebKB data set show the effectiveness of the proposed features compared with a baseline and some prior works. By tuning the classifiers, we then build a three-way classifier that combines a recall-assured classifier and a precision-assured classifier to accurately select the pages that need manual assessment to assure the required quality. This three-way classifier is also shown to be effective in reducing the amount of manual assessment.
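The three-way combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each classifier outputs a real-valued decision score (as an SVM does), that a page accepted by the precision-assured classifier is kept without review, that a page rejected by the recall-assured classifier is discarded without review, and that everything else goes to manual assessment. The function name, the label strings, the zero thresholds, and the example scores are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical output labels for the three-way decision.
ASSURED_POSITIVE = "assured_positive"   # kept without manual review
ASSURED_NEGATIVE = "assured_negative"   # discarded without manual review
UNCERTAIN = "uncertain"                 # sent to manual assessment


def three_way_decision(
    recall_score: float,
    precision_score: float,
    recall_threshold: float = 0.0,
    precision_threshold: float = 0.0,
) -> str:
    """Combine two classifier scores into one of three outcomes.

    recall_score:    decision value of the recall-assured classifier
                     (tuned so that few true homepages fall below threshold).
    precision_score: decision value of the precision-assured classifier
                     (tuned so that few non-homepages reach the threshold).
    The order of the two checks is an assumption of this sketch.
    """
    if precision_score >= precision_threshold:
        return ASSURED_POSITIVE
    if recall_score < recall_threshold:
        return ASSURED_NEGATIVE
    return UNCERTAIN


def split_for_assessment(scores: List[Tuple[float, float]]) -> List[str]:
    """Apply the three-way decision to (recall_score, precision_score) pairs."""
    return [three_way_decision(r, p) for r, p in scores]


# Made-up scores for four pages, for illustration only.
example_scores = [(1.2, 0.6), (0.4, -0.3), (-0.9, -1.5), (0.1, -0.8)]
decisions = split_for_assessment(example_scores)
manual_count = decisions.count(UNCERTAIN)
```

With these made-up scores, two of the four pages fall between the two classifiers and would be assessed manually. The other two are decided automatically, which is how such a combination reduces the manual workload.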



Paper Citation


in Harvard Style

Wang Y. and Oyama K. (2007). WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-972-8865-78-8, pages 170-175. DOI: 10.5220/0001271701700175


in BibTeX Style

@conference{webist07,
author={Yuxin Wang and Keizo Oyama},
title={WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION},
booktitle={Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2007},
pages={170-175},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001271701700175},
isbn={978-972-8865-78-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION
SN - 978-972-8865-78-8
AU - Wang Y.
AU - Oyama K.
PY - 2007
SP - 170
EP - 175
DO - 10.5220/0001271701700175