Authors:
Shuhua Liu and Thomas Forss
Affiliation:
Arcada University of Applied Sciences, Finland
Keyword(s):
Web Content Classification, Text Summarization, Topic Similarity, Sentiment Analysis, Online Safety Solutions.
Abstract:
This research concerns the development of web content detection systems that can automatically classify any web page into pre-defined content categories. Our work is motivated by practical experience and observations that certain categories of web pages, such as those containing hatred and violence, are much harder to classify with good accuracy, even when both content and structural features are taken into account. To further improve the performance of detection systems, we bring web sentiment features into the classification models. In addition, we incorporate n-gram representations into our classification approach, based on the assumption that n-grams capture more local context information in text and thus could help to enhance topic similarity analysis. Unlike most studies, which consider only the presence or frequency count of n-grams, we use tf-idf weighted n-grams in building the content classification models. Our results show that unigram-based models, although a much simpler approach, show their unique value and effectiveness in web content classification. Higher-order n-gram based approaches, especially 5-gram based models that combine topic similarity features with sentiment features, bring significant improvement in precision levels for the Violence category and the two Racism-related web categories.
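
To make the idea of tf-idf weighted n-grams concrete, the following Python sketch weights word n-grams by tf-idf and compares two documents by cosine similarity. It is only an illustration under stated assumptions: the function names, the whitespace tokenization, the particular tf and idf formulas, and the use of cosine similarity as a topic similarity measure are all hypothetical choices and are not taken from the paper.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_ngrams(docs, n):
    """Return one {n-gram: tf-idf weight} dict per document.

    Assumed weighting (illustrative only):
      tf  = n-gram count / total n-grams in the document
      idf = log(number of documents / documents containing the n-gram)
    """
    counts_per_doc = [Counter(word_ngrams(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())  # each n-gram counted once per document
    num_docs = len(docs)
    weighted = []
    for counts in counts_per_doc:
        total = sum(counts.values())
        weighted.append({
            gram: (count / total) * math.log(num_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return weighted


def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * b.get(gram, 0.0) for gram, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell sharply today",
]
bigram_weights = tfidf_ngrams(docs, n=2)
similarity = cosine(bigram_weights[0], bigram_weights[1])
```

With this weighting, an n-gram that occurs in every document receives a weight of zero, so only n-grams that distinguish some documents from others contribute to the similarity score. Setting `n=1` gives the unigram case, and `n=5` the 5-gram case mentioned in the abstract.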