GenCrawl: A Generative Multimedia Focused Crawler for Web Pages Classification
Domenico Benfenati, Antonio M. Rinaldi, Cristiano Russo, Cristian Tommasino
2024
Abstract
The unprecedented expansion of the internet necessitates the development of increasingly efficient techniques for systematic data categorization and organization. However, contemporary state-of-the-art techniques often struggle with the complex nature of heterogeneous multimedia content within web pages. These challenges, which are becoming more pressing with the rapid growth of the internet, highlight the urgent need for advancements in information retrieval methods to improve classification accuracy and relevance in the context of varied and dynamic web content. In this work, we propose GenCrawl, a generative multimedia-focused crawler designed to enhance web document classification by integrating textual and visual content analysis. Our approach combines the most relevant topics extracted from textual and visual content, using innovative generative techniques to create a visual topic. The reported findings demonstrate significant improvements and a paradigm shift in classification efficiency and accuracy over traditional methods. GenCrawl represents a substantial advancement in web page classification, offering a promising solution for systematically organizing web content. Its practical benefits are immense, paving the way for more efficient and accurate information retrieval in the era of the expanding internet.
Paper Citation
in Harvard Style
Benfenati D., Rinaldi A. M., Russo C. and Tommasino C. (2024). GenCrawl: A Generative Multimedia Focused Crawler for Web Pages Classification. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-716-0, SciTePress, pages 91-101. DOI: 10.5220/0012998900003838
in Bibtex Style
@conference{kdir24,
author={Domenico Benfenati and Antonio M. Rinaldi and Cristiano Russo and Cristian Tommasino},
title={GenCrawl: A Generative Multimedia Focused Crawler for Web Pages Classification},
booktitle={Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2024},
pages={91-101},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012998900003838},
isbn={978-989-758-716-0},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - GenCrawl: A Generative Multimedia Focused Crawler for Web Pages Classification
SN - 978-989-758-716-0
AU - Benfenati D.
AU - Rinaldi A. M.
AU - Russo C.
AU - Tommasino C.
PY - 2024
SP - 91
EP - 101
DO - 10.5220/0012998900003838
PB - SciTePress