DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study
Adrian Sterca, Oana Nourescu, Adriana Guran, Camelia Serban
2023
Abstract
Web page segmentation plays a crucial role in analyzing and understanding the content of web pages, enabling various web-related tasks. Approaches based on computer vision and machine learning are limited by their need for large datasets for training and validation. In this paper, we propose a Document Object Model (DOM) based approach that uses clustering algorithms for web page segmentation. By leveraging the hierarchical structure of the DOM, our approach aims to achieve accurate and reliable segmentation results. We conduct an empirical study, using a custom-built dataset, to compare the performance of different clustering algorithms for web page segmentation. Our research objectives focus on dataset creation, feature identification, definition of distance metrics, and selection of appropriate clustering algorithms. The findings provide insights into the effectiveness and limitations of our approach, enabling informed decision-making in real-world applications.
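To make the general idea of DOM-based clustering concrete, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions that do not come from the abstract: each text-bearing DOM node is described by a single feature (its path of child indices from the root), the distance between two nodes is the number of tree edges separating them, and single-linkage agglomerative clustering stands in for the clustering algorithms compared in the paper. The names `LeafCollector`, `tree_distance`, `segment`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LeafCollector(HTMLParser):
    """Record (path of child indices, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.tags = []            # tags of currently open elements
        self.path = []            # child index of each open element
        self.child_counts = [0]   # children seen so far at each open level
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.path.append(self.child_counts[-1])
        self.child_counts[-1] += 1
        self.tags.append(tag)
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return  # stray closing tag: ignore it
        while True:  # pop up to the matching tag to tolerate unclosed children
            closed = self.tags.pop()
            self.path.pop()
            self.child_counts.pop()
            if closed == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.path), text))


def tree_distance(a, b):
    """Number of DOM edges between two nodes given their index paths (assumed metric)."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common


def segment(html, k):
    """Group text nodes into k segments with single-linkage agglomerative clustering."""
    parser = LeafCollector()
    parser.feed(html)
    paths = [p for p, _ in parser.leaves]
    clusters = [[i] for i in range(len(paths))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(tree_distance(paths[p], paths[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return [[parser.leaves[i][1] for i in c] for c in clusters]


SAMPLE = """<html><body>
<nav><a>Home</a><a>About</a></nav>
<main><h1>Title</h1><p>First paragraph.</p></main>
<footer><span>Contact</span></footer>
</body></html>"""

if __name__ == "__main__":
    for block in segment(SAMPLE, 3):
        print(block)
```

On the toy page in `SAMPLE`, asking for three segments groups the navigation links, the main content, and the footer separately. A realistic system would likely combine several features (for example visual position or tag type) and would choose the number of segments from the data; both are left out of this sketch.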
Paper Citation
in Harvard Style
Sterca A., Nourescu O., Guran A. and Serban C. (2023). DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study. In Proceedings of the 19th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-989-758-672-9, SciTePress, pages 302-309. DOI: 10.5220/0012184000003584
in Bibtex Style
@conference{webist23,
author={Adrian Sterca and Oana Nourescu and Adriana Guran and Camelia Serban},
title={DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study},
booktitle={Proceedings of the 19th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2023},
pages={302-309},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012184000003584},
isbn={978-989-758-672-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 19th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study
SN - 978-989-758-672-9
AU - Sterca A.
AU - Nourescu O.
AU - Guran A.
AU - Serban C.
PY - 2023
SP - 302
EP - 309
DO - 10.5220/0012184000003584
PB - SciTePress