Authors:
Shuhua Liu
and
Thomas Forss
Affiliation:
Arcada University of Applied Sciences, Finland
Keyword(s):
Web Content Classification, Text Summarization, Topic Similarity, Sentiment Analysis, Online Safety Solutions.
Related
Ontology
Subjects/Areas/Topics:
Artificial Intelligence
;
Knowledge Discovery and Information Retrieval
;
Knowledge-Based Systems
;
Mining Text and Semi-Structured Data
;
Soft Computing
;
Symbolic Systems
;
Web Mining
Abstract:
Automatic classification of web content has been studied extensively, using different learning methods and tools, investigating different datasets to serve different purposes. Most of the studies have made use of the content and structural features of web pages. However, previous experience has shown that certain groups of web pages, such as those that contain hatred and violence, are much harder to classify with good accuracy when both content and structural features are already taken into consideration. In this study we present a new approach for automatically classifying web pages into pre-defined topic categories. We apply text summarization and sentiment analysis techniques to extract topic and sentiment indicators of web pages. We then build classifiers based on combined topic and sentiment features. A large amount of experiments were carried out. Our results suggest that incorporating the sentiment dimension can indeed bring much added value to web content classification. Topi
c similarity based classifiers solely did not perform well, but when topic similarity and sentiment features are combined, the classification model performance is significantly improved for many web categories. Our study offers valuable insights and inputs to the development of web detection systems and Internet safety solutions.
(More)