but also animations, images, and videos. Our imple-
mentation is open and free, available on the official
Mozilla Firefox add-ons portal. The webExtension
can also be combined with tools such as Selenium in
order to automate the extraction of product reviews
from different websites.
As future work, we plan to add other output formats
to the webExtension (besides HTML). One of these
formats will be plain text, since sentiment analysis
and opinion mining algorithms usually take the
comments as plain text. In addition, many scraping
tools rely on knowing the className or the id of the
target DOM node. Therefore, another useful output is
the className and id of the DOM node that is the root
of the comments section. We also plan to augment our
dataset of product web pages with more real webpages
labelled for comments extraction.
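To illustrate how a scraping tool could consume these two planned outputs, the following minimal Python sketch takes the id or the className of the comments root and returns the text of that subtree as plain text. It is only an illustration and not part of the webExtension: the names SubtreeTextExtractor and comments_as_plain_text, as well as the sample page, are hypothetical, and the sketch assumes well-formed markup in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SubtreeTextExtractor(HTMLParser):
    """Collects the text inside the first element matching an id or a class.

    Hypothetical helper for illustration; assumes well-formed markup.
    """

    def __init__(self, node_id=None, class_name=None):
        super().__init__()
        self.node_id = node_id
        self.class_name = class_name
        self.depth = 0      # 0 means we are outside the target subtree
        self.done = False   # True once the target subtree has been closed
        self.chunks = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.node_id is not None and attrs.get("id") == self.node_id:
            return True
        classes = (attrs.get("class") or "").split()
        return self.class_name is not None and self.class_name in classes

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1          # nested element inside the target
        elif self._matches(attrs):
            self.depth = 1           # entering the comments root

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True         # leaving the comments root

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            # Normalise internal whitespace of each text fragment.
            self.chunks.append(" ".join(data.split()))


def comments_as_plain_text(html, node_id=None, class_name=None):
    """Returns the text of the matched subtree, or "" if nothing matches."""
    parser = SubtreeTextExtractor(node_id=node_id, class_name=class_name)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the root of the comments section has both an id and a class.
page = """
<html><body>
  <div id="product">Blue kettle<img src="k.png"></div>
  <div id="reviews" class="comments-root">
    <p>Great product, <b>fast</b> delivery.</p>
    <p>Stopped working after a week.</p>
  </div>
  <footer>Contact us</footer>
</body></html>
"""

by_id = comments_as_plain_text(page, node_id="reviews")
by_class = comments_as_plain_text(page, class_name="comments-root")
print(by_id)
```

Under these assumptions, both lookups select the same subtree, so the text passed on to a sentiment analysis or opinion mining step does not depend on whether the scraper was given the id or the className.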
Finally, it should be highlighted that not all comments
are always visible on a web page (sometimes the user
has to press a button called “Show more” or similar).
This is a limitation of our technique. However, we are
investigating how to retrieve all comments, whether
they are visible or not.
ACKNOWLEDGMENTS
This work has been partially supported by the Spanish
MCIN/AEI under grant PID2019-104735RB-C41
and by Generalitat Valenciana under grant
CIPROM/2022/6 (Fasslow). Carlos Galindo was par-
tially supported by the Spanish Ministerio de Univer-
sidades under grant FPU20/03861.
REFERENCES
Alarte, J. and Silva, J. (2021). Page-level main content
extraction from heterogeneous webpages. ACM Trans.
Knowl. Discov. Data, 15(6).
Alarte, J. and Silva, J. (2022a). A benchmark suite for
template detection and content extraction.
Alarte, J. and Silva, J. (2022b). Hybex: A hybrid tool for
template extraction. In Companion Proceedings of the
Web Conference 2022, WWW ’22, pages 205–209, New
York, NY, USA. Association for Computing Machin-
ery.
Aren, S., Güzel, M., Kabadayı, E., and Alpkan, L. (2013).
Factors affecting repurchase intention to shop at the
same website. Procedia - Social and Behavioral Sci-
ences, 99:536–544. The Proceedings of 9th Interna-
tional Strategic Management Conference.
Bar-Yossef, Z. and Rajagopalan, S. (2002). Template detec-
tion via data mining and its applications. In Proceed-
ings of the 11th International Conference on World
Wide Web (WWW’02), pages 580–591, New York, NY,
USA. ACM.
Baroni, M., Chantree, F., Kilgarriff, A., and Sharoff, S.
(2008). Cleaneval: a Competition for Cleaning Web
Pages. In Proceedings of the International Conference
on Language Resources and Evaluation (LREC’08),
pages 638–643.
Binali, H., Potdar, V., and Wu, C. (2009). A state of the art
opinion mining and its application domains. In 2009
IEEE International Conference on Industrial Technol-
ogy, pages 1–6.
Chen, L., Qi, L., and Wang, F. (2012). Comparison of feature-
level learning methods for mining online consumer re-
views. Expert Systems with Applications, 39(10):9588–
9601.
Consortium, W. (1997). Document Object Model (DOM).
Available from URL: http://www.w3.org/DOM/.
Ding, X., Liu, B., and Yu, P. S. (2008). A holistic lexicon-
based approach to opinion mining. In Proceedings of
the 2008 International Conference on Web Search and
Data Mining, WSDM ’08, pages 231–240, New York,
NY, USA. Association for Computing Machinery.
Faty, L., Ndiaye, M., Sarr, E. N., and Sall, O. (2020). Opin-
ionscraper: A news comments extraction tool for opin-
ion mining. In 2020 Seventh International Conference
on Social Networks Analysis, Management and Secu-
rity (SNAMS), pages 1–5.
Gottron, T. (2007). Evaluating content extraction on HTML
documents. In Proceedings of the 2nd International
Conference on Internet Technologies and Applica-
tions (ITA’07), pages 123–132. National Assembly for
Wales.
Heinonen, K. (2011). Consumer activity in social media:
Managerial approaches to consumers’ social media
behavior. Journal of Consumer Behaviour, 10(6):356–
364.
Hossin, M., Mu, Y., Fang, J., and Kofi Frimpong, A. N.
(2019). Influence of picture presence in reviews on
online seller product rating: Moderation role approach.
KSII TIIS, 13.
Hu, M. and Liu, B. (2004). Mining and summarizing cus-
tomer reviews. In Proceedings of the Tenth ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’04, pages 168–177,
New York, NY, USA. Association for Computing Ma-
chinery.
Jamshed, H., Khan, S. A., Khurrum, M., Inayatullah, S., and
Athar, S. (2019). Data preprocessing: A preliminary
step for web data mining. 3c Tecnología: glosas de
innovación aplicadas a la pyme, 8(1):206–221.
Johan, A. (2021). Product ranking: Measuring product re-
views on the purchase decision. Business & Economic
Review, 4.
Kumar, A., Morabia, K., Wang, W., Chang, K., and Schwing,
A. (2022). Cova: Context-aware visual attention for
webpage information extraction. In Proceedings of The
Fifth Workshop on e-Commerce and NLP (ECNLP 5).
Association for Computational Linguistics.
Leonhardt, J., Anand, A., and Khosla, M. (2020). Boilerplate
removal using a neural sequence labeling model. In