DOM-Based Online Store Comments Extraction

Julián Alarte, Carlos Galindo, Carlos Martín, Josep Silva

2024

Abstract

Online stores often include a customer comments section on their product pages. This section is valuable to other customers, who can read reviews from users who have already purchased or tried the products. The feedback also matters to the owners and managers of online stores, who can learn buyer opinions and ratings of the products they sell, and to the manufacturers, who can analyze comments posted across different online stores. This work presents a novel technique that automatically extracts the customer comments from a web page without prior knowledge of the page's structure. The technique extracts not only text but also other relevant content, such as images, animations, and videos. It is based on the DOM tree and needs to load only a single web page to extract its product comments; therefore, it can be used in real time during browsing, with no page preprocessing. To train and evaluate the technique, we built a benchmark suite from real, heterogeneous web pages. The empirical evaluation shows that the technique achieves an average F1 score of 90.4% and reaches 100% on most web pages.
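
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score can be computed for a single web page. It assumes the evaluation compares the set of DOM nodes an extractor returns against a hand-labeled gold set; the abstract does not state the unit of comparison, so the node-level granularity, the function name `f1_score`, and the node identifiers below are all illustrative assumptions, not the authors' evaluation code.

```python
def f1_score(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for two sets of DOM node identifiers.

    `retrieved` holds the nodes an extractor returned; `relevant` holds the
    nodes labeled as customer comments in a gold standard. Both the names
    and the node-level comparison are assumptions made for illustration.
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: four gold comment nodes, four extracted nodes,
# three of which are correct.
gold = {"n1", "n2", "n3", "n4"}
extracted = {"n1", "n2", "n3", "n9"}
precision, recall, f1 = f1_score(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading, a page scores 100% only when the extracted set equals the gold set exactly, and an average over a benchmark would be the mean of the per-page F1 values.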


Paper Citation


in Harvard Style

Alarte J., Galindo C., Martín C. and Silva J. (2024). DOM-Based Online Store Comments Extraction. In Proceedings of the 20th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-989-758-718-4, SciTePress, pages 15-24. DOI: 10.5220/0012894600003825


in Bibtex Style

@conference{webist24,
author={Julián Alarte and Carlos Galindo and Carlos Martín and Josep Silva},
title={DOM-Based Online Store Comments Extraction},
booktitle={Proceedings of the 20th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2024},
pages={15-24},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012894600003825},
isbn={978-989-758-718-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 20th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - DOM-Based Online Store Comments Extraction
SN - 978-989-758-718-4
AU - Alarte J.
AU - Galindo C.
AU - Martín C.
AU - Silva J.
PY - 2024
SP - 15
EP - 24
DO - 10.5220/0012894600003825
PB - SciTePress