we have explored.
In conclusion, our research contributes to the field of web page
segmentation by comparing various DOM-based clustering algorithms.
The empirical study evaluates the effectiveness of the studied
clustering algorithms for web segmentation and provides insights
for their application in real-world scenarios. Future research
directions include refining the approach to handle more complex
web page layouts, investigating its scalability on large-scale
web page datasets, and exploring additional features and
clustering techniques for further improvements.
ACKNOWLEDGEMENT
The present work has received financial support through the
project "Integrated system for automating business processes
using artificial intelligence", POC/163/1/3/121075, a project
co-financed by the European Regional Development Fund (ERDF)
through the Competitiveness Operational Programme 2014-2020.
DOM-Based Clustering Approach for Web Page Segmentation: A Comparative Study