MINING THE COSTA RICAN WEB

Esteban Meneses

Abstract

There is much to say about the structure and composition of a local web. Identifying authorities, topics and web communities can help improve search engines, redesign a portal or develop marketing strategies. The Costa Rican web was chosen as a test case for web mining analysis. The study yielded several descriptors of this web, along with answers to typical questions such as how many pages a site has on average, which file type is preferred for building a dynamic site, which site is the most referenced, which sites are similar, and many more.



Paper Citation


in Harvard Style

Meneses E. (2006). MINING THE COSTA RICAN WEB. In Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-972-8865-46-7, pages 414-421. DOI: 10.5220/0001249504140421


in Bibtex Style

@conference{webist06,
author={Esteban Meneses},
title={MINING THE COSTA RICAN WEB},
booktitle={Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2006},
pages={414-421},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001249504140421},
isbn={978-972-8865-46-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - MINING THE COSTA RICAN WEB
SN - 978-972-8865-46-7
AU - Meneses E.
PY - 2006
SP - 414
EP - 421
DO - 10.5220/0001249504140421