Architecture for a Garbage-less and Fresh Content Search Engine

Víctor M. Prieto, Manuel Álvarez, Rafael López García, Fidel Cacheda



This paper presents the architecture of a Web search engine that integrates solutions to several current problems in Web crawling: Web Spam detection, Soft-404 detection, content updating, and resource use. To this end, the system incorporates a Web Spam detection module based on techniques presented in previous works, whose effectiveness has been assessed on well-known public datasets. For Soft-404 pages, we propose new techniques that improve on those described in the state of the art. Finally, a last module allows the search engine to detect when a page has changed by taking user interaction into account. The tests we have performed allow us to conclude that the proposed architecture can achieve significant improvements in the efficacy and efficiency of crawling systems. These improvements carry over to the content that is served to users.


