MACHINE LEARNING AND LINK ANALYSIS FOR WEB CONTENT MINING

Moreno Carullo, Elisabetta Binaghi

Abstract

In this work we define a hybrid Web Content Mining strategy that aims to recognize, within a Web page, the main entity, defined as the short text that refers directly to the main topic of the page. The salient aspect of the strategy is a novel supervised Machine Learning model that represents, in a unified framework, the integrated use of visual page layout features, textual features and hyperlink description. The proposed approach has been evaluated with promising results.

References

  1. Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y. (2003). Extracting content structure for web pages based on visual representation. In Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xian, China, April 23-25, 2003. Proceedings, page 596.
  2. Carullo, M., Binaghi, E., and Gallo, I. (2009). Soft categorization and annotation of images with radial basis function networks. In VISAPP, International Conference on Computer Vision Theory and Applications, volume 2, pages 309-314.
  3. Chakrabarti, S., Dom, B., and Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pages 307-318, New York, NY, USA. ACM.
  4. Congalton, R. (1991). A review of assessing the accuracy of classifications of remotely sensed data. Remote sensing of environment, 37(1):35-46.
  5. Frakes, W. B. and Baeza-Yates, R. A., editors (1992). Information Retrieval: Data Structures & Algorithms. Prentice-Hall.
  6. Fürnkranz, J. (2002). Web structure mining - exploiting the graph structure of the world-wide web. ÖGAI Journal, 21(2):17-26.
  7. Joachims, T., Cristianini, N., and Shawe-Taylor, J. (2001). Composite kernels for hypertext categorisation. In Proceedings of the International Conference on Machine Learning (ICML), pages 250-257. Morgan Kaufmann Publishers.
  8. Kosala, R. and Blockeel, H. (2000). Web mining research: a survey. SIGKDD Explor. Newsl., 2(1):1-15.
  9. Michalski, R. S., Carbonell, J. G., and Mitchell, T. M. (1983). Machine Learning, An Artificial Intelligence Approach. McGraw-Hill.
  10. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
  11. Moody, J. E. and Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294.
  12. Oh, H.-J., Myaeng, S. H., and Lee, M.-H. (2000). A practical hypertext categorization method using links and incrementally available class information. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 264-271, New York, NY, USA. ACM.
  13. Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vision, 40(2):99-121.
  14. Spertus, E. (1997). Parasite: mining structural information on the web. Comput. Netw. ISDN Syst., 29(8-13):1205-1215.
  15. Zhang, M.-L. and Zhou, Z.-H. (2006). Adapting rbf neural networks to multi-instance learning. Neural Process. Lett., 23(1):1-26.


Paper Citation


in Harvard Style

Carullo M. and Binaghi E. (2010). MACHINE LEARNING AND LINK ANALYSIS FOR WEB CONTENT MINING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 156-161. DOI: 10.5220/0003065401560161


in Bibtex Style

@conference{kdir10,
author={Moreno Carullo and Elisabetta Binaghi},
title={MACHINE LEARNING AND LINK ANALYSIS FOR WEB CONTENT MINING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={156-161},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003065401560161},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - MACHINE LEARNING AND LINK ANALYSIS FOR WEB CONTENT MINING
SN - 978-989-8425-28-7
AU - Carullo M.
AU - Binaghi E.
PY - 2010
SP - 156
EP - 161
DO - 10.5220/0003065401560161