Authors:
Domenico Benfenati; Antonio M. Rinaldi; Cristiano Russo and Cristian Tommasino
Affiliation:
Department of Electrical Engineering and Information Technology (DIETI), University of Naples Federico II, Naples, Italy
Keyword(s):
Web Crawling, Web Pages Classification, Generative AI, Web Topic Analysis.
Abstract:
The unprecedented expansion of the internet calls for increasingly efficient techniques for systematic data categorization and organization. However, state-of-the-art techniques often struggle with the heterogeneous multimedia content of web pages. This challenge, which grows more pressing as the internet expands, highlights the need for advances in information retrieval that improve classification accuracy and relevance for varied and dynamic web content. In this work, we propose GenCrawl, a generative multimedia-focused crawler designed to enhance web document classification by integrating textual and visual content analysis. Our approach combines the most relevant topics extracted from textual and visual content, using generative techniques to create a visual topic. The reported findings demonstrate significant improvements in classification efficiency and accuracy over traditional methods. GenCrawl offers a promising solution for systematically organizing web content, paving the way for more efficient and accurate information retrieval.