DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study

Adrian Sterca, Oana Nourescu, Adriana Guran, Camelia Serban

2023

Abstract

Web page segmentation plays a crucial role in analyzing and understanding the content of web pages, enabling various web-related tasks. Approaches based on computer vision and machine learning are limited by their need for large datasets for training and validation. In this paper, we propose a Document Object Model (DOM) based approach that uses clustering algorithms for web page segmentation. By leveraging the hierarchical structure of the DOM, our approach aims to achieve accurate and reliable segmentation results. We conduct an empirical study, using a custom-built dataset, to compare the performance of different clustering algorithms for web page segmentation. Our research objectives focus on dataset creation, feature identification, distance metric definition, and the selection of appropriate clustering algorithms. The findings provide insights into the effectiveness and limitations of our approach, enabling informed decision-making in real-world applications.
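
The abstract does not state which features, distance metrics, or clustering algorithms the study evaluates, so the following is only a minimal illustrative sketch of the general idea of clustering DOM nodes by structural proximity, not the authors' method. Every choice in it is an assumption made for illustration: the feature (the root-to-node path of each text-bearing element), the distance (number of tree edges between two nodes), the clustering rule (single-linkage merging below a fixed threshold), and all names (`LeafCollector`, `tree_distance`, `segment`, `SAMPLE_HTML`).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collects each text node together with its path from the DOM root."""

    def __init__(self):
        super().__init__()
        self.path = []      # open elements as (tag, child_index) pairs
        self.counts = [0]   # number of children seen at each open level
        self.leaves = []    # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        index = self.counts[-1]
        self.counts[-1] += 1
        if tag not in VOID_TAGS:
            self.path.append((tag, index))
            self.counts.append(0)

    def handle_endtag(self, tag):
        # Close the most recently opened element with this tag name.
        if any(t == tag for t, _ in self.path):
            while self.path:
                self.counts.pop()
                if self.path.pop()[0] == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.path), text))


def tree_distance(p, q):
    """Number of DOM edges between two nodes, given their root paths."""
    common = 0
    for a, b in zip(p, q):
        if a != b:
            break
        common += 1
    return len(p) + len(q) - 2 * common


def segment(html, threshold=2):
    """Group text nodes whose pairwise tree distance is <= threshold
    (single-linkage clustering, implemented with union-find)."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    leaves = parser.leaves
    parent = list(range(len(leaves)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(leaves)):
        for j in range(i + 1, len(leaves)):
            if tree_distance(leaves[i][0], leaves[j][0]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, (_, text) in enumerate(leaves):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = """
<html><body>
<nav><a>Home</a><a>About</a></nav>
<main><article><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></article></main>
<footer><span>Contact</span></footer>
</body></html>
"""

segments = segment(SAMPLE_HTML)
```

On this toy page, sibling text nodes are two edges apart while nodes in different page regions are at least four edges apart, so a threshold of 2 separates the navigation links, the article text, and the footer. Real pages would likely need richer features (for example visual or textual attributes) and a tuned threshold or a different clustering algorithm.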

Paper Citation


in Harvard Style

Sterca A., Nourescu O., Guran A. and Serban C. (2023). DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study. In Proceedings of the 19th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-989-758-672-9, SciTePress, pages 302-309. DOI: 10.5220/0012184000003584


in Bibtex Style

@conference{webist23,
author={Adrian Sterca and Oana Nourescu and Adriana Guran and Camelia Serban},
title={DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study},
booktitle={Proceedings of the 19th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2023},
pages={302-309},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012184000003584},
isbn={978-989-758-672-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study
SN - 978-989-758-672-9
AU - Sterca A.
AU - Nourescu O.
AU - Guran A.
AU - Serban C.
PY - 2023
SP - 302
EP - 309
DO - 10.5220/0012184000003584
PB - SciTePress