Architecture for a Garbage-less and Fresh Content Search Engine

Víctor M. Prieto, Manuel Álvarez, Rafael López García, Fidel Cacheda

Abstract

This paper presents the architecture of a Web search engine that integrates solutions for several state-of-the-art problems, such as Web Spam and Soft-404 detection, content update and resource use. To this end, the system incorporates a Web Spam detection module based on techniques that have been presented in previous works and whose effectiveness has been assessed on well-known public datasets. For Soft-404 pages, we propose some new techniques that improve on those described in the state of the art. A final module allows the search engine to detect when a page has changed, taking user interaction into account. The tests we have performed allow us to conclude that, with the architecture we propose, it is possible to achieve significant improvements in the efficacy and efficiency of crawling systems. This, in turn, affects the content that is provided to users.


Paper Citation


in Harvard Style

Prieto V. M., Álvarez M., López García R. and Cacheda F. (2012). Architecture for a Garbage-less and Fresh Content Search Engine. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 378-381. DOI: 10.5220/0004167203780381


in Bibtex Style

@conference{kdir12,
author={Víctor M. Prieto and Manuel Álvarez and Rafael López García and Fidel Cacheda},
title={Architecture for a Garbage-less and Fresh Content Search Engine},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={378-381},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004167203780381},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Architecture for a Garbage-less and Fresh Content Search Engine
SN - 978-989-8565-29-7
AU - Prieto V. M.
AU - Álvarez M.
AU - López García R.
AU - Cacheda F.
PY - 2012
SP - 378
EP - 381
DO - 10.5220/0004167203780381