DOM-Based Online Store Comments Extraction
Julián Alarte, Carlos Galindo, Carlos Martín, Josep Silva
2024
Abstract
Online stores often include a customer comments section on their product pages. This section helps other customers, who can read reviews from users who have previously purchased or tried the products. The feedback is also important for the owners and managers of online stores, who can learn how buyers rate and perceive the products they sell. The comments section is equally useful to product manufacturers, who can analyze comments posted on various online stores to gather feedback about their products. This work presents a novel technique to automatically extract customer comments from a web page without knowing the structure of the page a priori. The technique extracts not only text but also other types of relevant content, such as images, animations, and videos. It is based on the DOM tree and needs to load only a single web page to extract its product comments; therefore, it can be used in real time during browsing, with no page preprocessing. To train and evaluate the technique, we have built a benchmark suite from real and heterogeneous web pages. The empirical evaluation shows that the technique achieves an average F1 score of 90.4% and reaches 100% on most web pages.
Paper Citation
in Harvard Style
Alarte J., Galindo C., Martín C. and Silva J. (2024). DOM-Based Online Store Comments Extraction. In Proceedings of the 20th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-989-758-718-4, SciTePress, pages 15-24. DOI: 10.5220/0012894600003825
in Bibtex Style
@conference{webist24,
author={Julián Alarte and Carlos Galindo and Carlos Martín and Josep Silva},
title={DOM-Based Online Store Comments Extraction},
booktitle={Proceedings of the 20th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2024},
pages={15-24},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012894600003825},
isbn={978-989-758-718-4},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - DOM-Based Online Store Comments Extraction
SN - 978-989-758-718-4
AU - Alarte J.
AU - Galindo C.
AU - Martín C.
AU - Silva J.
PY - 2024
SP - 15
EP - 24
DO - 10.5220/0012894600003825
PB - SciTePress