Combining N-gram based Similarity Analysis with
Sentiment Analysis in Web Content Classification
Shuhua Liu and Thomas Forss
Arcada University of Applied Sciences, Jan-Magnus Janssonin aukio 1, 00560 Helsinki, Finland
Keywords: Web Content Classification, Text Summarization, Topic Similarity, Sentiment Analysis, Online Safety
Solutions.
Abstract: This research concerns the development of web content detection systems that can automatically
classify any web page into pre-defined content categories. Our work is motivated by practical experience
and observations that certain categories of web pages, such as those that contain hatred and violence, are
much harder to classify with good accuracy even when both content and structural features are already taken into
account. To further improve the performance of detection systems, we bring web sentiment features into
classification models. In addition, we incorporate n-gram representation into our classification approach,
based on the assumption that n-grams can capture more local context information in text, and thus could
help to enhance topic similarity analysis. Unlike most studies, which only consider the presence or
frequency count of n-grams in their applications, we make use of tf-idf weighted n-grams in building the
content classification models. Our results show that unigram based models, even though a much simpler
approach, show their unique value and effectiveness in web content classification. Higher order n-gram
based approaches, especially 5-gram based models that combine topic similarity features with sentiment
features, bring significant improvement in precision levels for the Violence and two Racism related web
categories.
1 INTRODUCTION
This study concerns the development of web content
detection systems that can automatically
classify any web page into pre-defined content
categories. Previous experience and observations
with web detection systems in practice have shown
that certain groups of web pages, such as those that carry
hate and violence content, prove to be much harder to
classify with good accuracy even when both content
and structural features are already taken into
consideration. There is a great need for better
content detection systems that can accurately
identify excessively offensive and harmful websites.
Hate and violence web pages often carry strong
negative sentiment while their topics may vary a lot.
Advanced developments in computing
methodologies and technology have brought us
many new and better means for text content analysis
such as topic extraction, topic modeling and
sentiment analysis. In our recent work we have
developed topic similarity and sentiment analysis
based classification models (Liu and Forss, 2014),
which bring encouraging results and suggest that
incorporating the sentiment dimension can bring
much added value in the detection of sentiment-rich
web categories such as those carrying hate, violent
and racist messages. In addition, our results also
highlight the effectiveness of integrating topic
similarity and sentiment features in web content
classifiers.
Meanwhile, we observed from our earlier
experiments that topic similarity based classifiers
alone perform rather poorly and worse than
expected. To further improve the performance of the
classification models, in this study we develop new
models by incorporating n-gram representations into
our classification approach. Our assumption is that
n-grams can capture more local context information
in text, thus could help to enhance topic similarity
analysis. N-grams are commonly applied in many
text mining tasks. However, their effects are
uncertain and depend very much on the nature of
the text and the purpose of the task. It is our goal to
investigate the effects of adopting higher order n-grams
in web content classification. Unlike most
Liu S. and Forss T..
Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification.
DOI: 10.5220/0005170305300537
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (SSTM-2014), pages 530-537
ISBN: 978-989-758-048-2
Copyright © 2014 SCITEPRESS (Science and Technology Publications, Lda.)
applications that only consider presence (Boolean
features) or frequency counts of n-grams, in this
study we will explore the use of tf-idf weighted n-
grams in topic analysis of web pages and web
categories. We then build our content classification
models by integrating topic similarity analysis with
sentiment analysis of web pages.
Automatic classification of web pages has been
studied extensively, using different learning methods
and tools, investigating different datasets to serve
different purposes (Qi and Davison, 2007). The
earliest studies on web classification appeared
already in the late 1990s soon after the web was
invented. Chakrabarti et al (1998) studied hypertext
categorization using hyperlinks. Cohen (2002)
combined anchor extraction with link analysis to
improve web page classifiers. The method exploits
link structure within a site as well as page structure
within hub pages, and it brought substantial
improvement to the accuracy of a bag-of-words
classifier, reducing error rate by about half on
average (Cohen, 2002).
Dumais and Chen (2000) explored the use of
hierarchical structure for classifying a large,
heterogeneous collection of web content. They
applied SVM classifiers in the context of
hierarchical classification and found small
advantages in accuracy for hierarchical models over
flat (non-hierarchical) models. They also found that a
sequential Boolean decision rule achieved the same
accuracy as a multiplicative decision rule while being
much more efficient.
There exists a huge amount of research on text
classification in general. However, web content
classification differs from general text categorization
due to its special structure, meta-data and its
dynamics. Shen et al (2004, 2007) studied web-page
classification based on text summarization. They
gave empirical evidence that web-page summaries
created manually by human editors can indeed
improve the performance of web-page classification
algorithms. They proposed a sentence-based
summarization method and showed that their
summarization-based classification algorithm
achieves an approximately 8.8% improvement as
compared to pure-text-based classification
algorithm, and an ensemble classifier using the
improved summarization algorithm achieves about
12.9% improvement over pure-text based methods.
Our approach differs in that we take a word-based
instead of sentence-based approach.
In recent years, there have been many studies of
text classification techniques for social media
analysis (e.g. customer reviews, Twitter), sentiment
analysis, etc. For example, an interesting study by
Zhang et al (2013) investigated classification of
short text using information paths to deal with the
less informative word co-occurrences and sparseness
of such texts. Their method makes use of ordered
subsets in short texts, which are termed “information
paths”. They found that classification based on each
subset resulted in higher overall accuracy than
classifying the entire data set directly.
Related to online safety solutions, Hammami et
al (2003) developed a web filtering system,
WebGuard, that focuses on automatically detecting
and filtering adult content on the Web. It combines
the textual content, image content, and URL of a
web page to construct its feature vector, and classifies
a web page into two classes: Suspect and Normal.
The suspect URLs are stored in a database, which is
constantly and automatically updated in order to
reflect the highly dynamic evolution of the Web.
Last et al (2003) and Elovici et al (2005)
developed systems for anomaly detection and
terrorist detection on the Web using content-based
methods. Web content is used as the audit
information provided to the detection system to
identify abnormal activities. The system learns the
normal behavior by applying an unsupervised
clustering algorithm to the content of web pages
accessed by a normal group of users and computes
their typical interests. The content models of normal
behavior are then used in real-time to reveal
deviation from normal behavior at a specific location
on the web (Last et al, 2003). They can thus monitor
the traffic emanating from the monitored group of
users, issue an alarm if the access information is not
within the typical interests of the group, and track
down suspected terrorists by analyzing the content
of information they access (Elovici et al, 2005).
In more recent years, Calado et al (2006) studied
link-based similarity measures, as well as their
combination with text-based similarity metrics, for
the classification of web documents for Internet
safety and anti-terrorism applications. Qi and
Davison (2007) presented a survey
of features and algorithms in the space of web
content classification.
Fürnkranz (1998) and Fürnkranz et al (1999) are
the earliest studies on n-grams in text classification.
They studied the effect of using n-grams and
linguistic phrases for text categorization. They found
that bigrams and trigrams were most useful when
applied to a 20 newsgroups data set and 21,578
REUTERS newswire articles. Longer sequences
were found to reduce classification performance.
Fürnkranz et al (1999) and Riloff et al (2001) then
CombiningN-grambasedSimilarityAnalysiswithSentimentAnalysisinWebContentClassification
531
revealed that linguistic phrase features can help
improve the precision of learned text classification
models at the expense of coverage.
The rest of the paper is organized as follows. In
Section 2, we describe our approach for web content
classification and explain the methods and
techniques used in topic extraction, topic similarity
analysis and sentiment analysis. In Section 3 we
describe our data and experiments for the
classification of Hate, Violence and Racist web
content. We compare the performance of different
models based on unigrams, trigrams and 5-grams.
Section 4 concludes the paper.
2 COMBINING N-GRAM BASED
CONTENT SIMILARITY
ANALYSIS WITH SENTIMENT
ANALYSIS IN WEB CONTENT
CLASSIFICATION
Our approach to web content classification is
illustrated in Figure 1. Exploring the textual
information, we apply word weighting, text
summarization and sentiment analysis techniques to
extract topic features, content similarity features and
sentiment indicators of web pages to build
classifiers.
In this study we only take into consideration the
page attributes that are text-related. Our focus is on
added value to web classification that can be gained
from textual content analysis. We should point out
that structural features and hyperlink information
capture the design elements of web pages that may
also serve as effective indicators of their content
nature and category (Cohen, 2002). They contain
very useful information for web classification. In
addition, analysis of images contained in a web page
would provide another source of useful information
for web classification (Chen et al, 2006; Kludas,
2007). However, these topics are dealt with in other
projects.
2.1 Content Representation and Topic
Extraction
The Topic Extraction step takes web textual
information as input and generates a set of topic
terms. We start by extracting topics from each web
page and then from each collection of web pages
belonging to the same category. The extracted
topics should give a good representation of the
core content of a web page or a web category.
We use the tf-idf weighted vector space model of
n-grams (where n=1, 3, 5) to represent the original
text content and extracted topics of web pages and
web categories. When n=1, it is a feature vector that
contains one weight attribute (instead of Boolean or
simple frequency count) for each unique term that
occurs in a web page or collection and their topics.
In other words, each web page or collection or topic
is represented by the set of unique words it consists
of. Similarly, when n>1, each web page or collection
or topic is represented by the set of unique n-grams
it contains. We planned an experiment testing the
possibilities with n = 1 to 6. However, due to time
constraints we had to scale down the number of
experiments, and chose to test only unigrams,
trigrams and 5-grams.
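The representation described above can be illustrated with a minimal sketch. It assumes whitespace-tokenized text and a smoothed idf of the form log(N/df) + 1; the paper does not state the exact tf-idf variant, and the function names `ngrams`, `idf_table` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_table(documents, n):
    """Compute an idf value for every n-gram seen in a tokenized collection."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(ngrams(tokens, n)))
    total = len(documents)
    # Assumed smoothing: log(N / df) + 1, so n-grams found in every
    # document still keep a small positive weight.
    return {gram: math.log(total / df) + 1.0 for gram, df in doc_freq.items()}


def tfidf_vector(tokens, n, idf):
    """Represent one page as a sparse {n-gram: tf-idf weight} mapping."""
    counts = Counter(ngrams(tokens, n))
    return {gram: tf * idf.get(gram, 0.0) for gram, tf in counts.items()}


pages = [
    "hate speech on the web".split(),
    "web page classification on the web".split(),
]
idf = idf_table(pages, 1)
vector = tfidf_vector(pages[1], 1, idf)
```

Calling the same functions with n = 3 or n = 5 gives the trigram and 5-gram vectors; each unique n-gram becomes one weighted dimension.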
In pre-processing, we apply stemming and stop-word
removal when obtaining unigrams, but no stop-word
removal when obtaining higher order n-grams.
However, we remove n-grams that begin or end with a
stop word. We build our own
IDF databases using the entire data collection of
over 165,000 web pages in 20 categories. The tf-idf
weight of an n-gram is adjusted using the weights of
the unigrams it contains, i.e. we add to the tf-idf value of
the n-gram the tf-idf values of the words it contains.
This is done independently for unigrams, trigrams
and 5-grams.
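A minimal sketch of the boundary stop-word filter and the unigram-based weight adjustment follows. The stop-word list is a tiny illustrative one, the helper names are hypothetical, and treating words without a unigram weight (for example removed stop words) as contributing zero is our assumption for the sketch.

```python
# Tiny illustrative stop-word list; not the list used in the experiments.
STOP_WORDS = {"the", "of", "a", "in", "on", "and"}


def keep_ngram(gram, stop_words=STOP_WORDS):
    """Keep an n-gram only if neither its first nor its last word is a stop word."""
    return gram[0] not in stop_words and gram[-1] not in stop_words


def adjust_weights(ngram_weights, unigram_weights):
    """Add the tf-idf weights of the member words to each n-gram's own weight.

    Words missing from unigram_weights (assumed: removed stop words)
    contribute 0 to the adjustment.
    """
    adjusted = {}
    for gram, weight in ngram_weights.items():
        if not keep_ngram(gram):
            continue
        bonus = sum(unigram_weights.get(word, 0.0) for word in gram)
        adjusted[gram] = weight + bonus
    return adjusted


trigram_weights = {
    ("freedom", "of", "speech"): 1.5,
    ("of", "speech", "online"): 2.0,  # starts with a stop word, so it is dropped
}
unigram_weights = {"freedom": 0.5, "speech": 1.0, "online": 0.25}
adjusted = adjust_weights(trigram_weights, unigram_weights)
```

An n-gram with a stop word in the middle is kept; only the first and last positions are checked.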
For each webpage, we make use of its multiple
content attributes as raw data input for the term/
n-gram weighting process. The content attributes
include full-page content as well as URL and three
other meta-contents (T8, T10, T13).
Figure 1: Web content classification based on topic and sentiment analysis.
The topic of a web page is obtained simply based
on tf-idf n-gram weighting. For each webpage, by
applying different compression rates, we can obtain
different sets of topic words (for example top 10%,
20%, 50%, 100%). The topic of a web category is
obtained through summarization of all the web pages
in the same category. For each web page collection,
we apply the Centroid method of the MEAD
summarization tool (Radev et al, 2003; 2004) to
make summaries of the document collection.
Through this we try to extract topics that are a good
representation of a specific web category.
Based on our earlier experiments, we set a cut-off
level at the top 15,000 highest weighted unigrams,
the top 100,000 highest weighted trigrams, and the top
120,000 highest weighted 5-grams to represent the
topic content of a web page or web category.
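Selecting page topic terms by compression rate and cut-off level can be sketched as below. This covers only the tf-idf based selection for a single page; the MEAD centroid summarization used for web categories is not reproduced. The function name `top_terms`, the rounding up of the kept share, and the alphabetical tie-break are assumptions made for illustration.

```python
import math


def top_terms(weights, rate=1.0, max_terms=None):
    """Keep the highest-weighted share `rate` of terms, optionally capped at max_terms."""
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    keep = math.ceil(len(ranked) * rate)
    if max_terms is not None:
        keep = min(keep, max_terms)
    return dict(ranked[:keep])


page_weights = {"hate": 3.2, "forum": 1.1, "group": 0.9, "news": 0.4, "post": 0.2}
topic_20 = top_terms(page_weights, rate=0.2)
topic_50 = top_terms(page_weights, rate=0.5)
capped = top_terms(page_weights, rate=1.0, max_terms=2)
```

Here `rate` plays the role of the compression rate (10%, 20%, 50%, 100%) and `max_terms` the role of the cut-off level.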
We should point out that there exists a large
body of literature on topic extraction. For example,
LDA based topic models are another popular means
for topic detection, especially deriving hidden topics
in large text collections. It is our intention to develop
and test such topic models in web content
classification soon, but it is out of the scope of this
article.
2.2 Web-Page vs. Web-Category Topic
Similarity
We use topic similarity to measure the content
similarity between a web page and a web category.
Topic similarity is implemented as the cosine
similarity between the topic terms of a web page and
the topic terms of each web category. For each web
page, we compute its cosine similarity to each web
category in each of the n-gram vector spaces. We
compare results from using only unigrams with results
from using unigram-weight adjusted n-grams.
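The page-to-category similarity features can be sketched with a sparse cosine similarity. The dictionary-based vectors and the function names are illustrative assumptions; the category names and weights are toy values, not data from the experiments.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity_features(page_topic, category_topics):
    """One similarity value per category, in alphabetical category order."""
    return [cosine_similarity(page_topic, category_topics[name])
            for name in sorted(category_topics)]


page_topic = {"hate": 3.0, "forum": 4.0}
category_topics = {
    "Hate": {"hate": 1.0},
    "Violence": {"fight": 2.0},
}
features = similarity_features(page_topic, category_topics)
```

Each page thus yields one similarity value per web category, which can be used directly as a feature for the classifier.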
2.3 Extracting Sentiment Features
Based on the extracted topic content for each web
page, we assess its sentiment strength
using the SentiStrength (Thelwall et al,
2010, 2012) sentiment analysis tool.
Sentiment analysis methods generally fall into
two categories: (1) the lexical approach, which is
unsupervised and uses direct indicators of sentiment,
i.e. sentiment bearing words; (2) the learning approach,
which is supervised, relies on classification algorithms,
and exploits indirect indicators of sentiment that can
reflect genre or topic specific sentiment patterns
(Thelwall et al, 2010). SentiStrength takes a lexical approach to
sentiment analysis, making use of a combination of
sentiment lexical resources, semantic rules, heuristic
rules and additional rules. It contains an
EmotionLookupTable of 2310 sentiment words and
word stems taken from the Linguistic Inquiry and Word
Count (LIWC) program (Pennebaker et al, 2003),
the General Inquirer list of sentiment terms (Stone et
al, 1966) and ad-hoc additions made during testing
of the system. The SentiStrength algorithm has been
tested on several social web data sets such as
MySpace, Twitter, YouTube, Digg, Runner’s World and
BBC Forums. It was found to be robust enough to be
applied to a wide variety of social web contexts.
While most opinion mining algorithms attempt to
identify the polarity of sentiment in text (positive,
negative or neutral), SentiStrength measures sentiment
in both the positive and the negative direction, with
the strength of each expressed on a separate
scale. To help web content classification, we use
sentiment features to get a grasp of the sentiment
tone of a web page. This is different from the
sentiment of opinions concerning a specific entity,
as in traditional opinion mining literature.
Sentiment features are extracted by using the key
topic terms from the topic extraction
process as input to SentiStrength. This gives
sentiment strength values for each web page in the
range of -5 to +5, with -5 indicating strong negative
sentiment and +5 indicating strong positive
sentiment. We found that negative sentiment
strength value was a better discriminator of web
content than positive sentiment strength value at
least for the three web categories Hate, Violence and
Racism. Thus, in our first set of experiments we only
use negative sentiment strength value as data for
learning and prediction. Corresponding to the six
sets of topic words for each web page, six sentiment
features are obtained.
In sentiment analysis, we only apply topic
content in the form of unigrams (stemmed, with
stop words removed). In addition to applying the
original SentiStrength tool, we also tried to
customize the SentiStrength algorithm in two ways:
(1) counts of positive and negative sentiment
words in a web page; (2) the sum of word sentiment
values weighted by word frequency and normalized by
total word count, giving a value between -5 and 5.
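The two customized sentiment measures can be sketched as follows. The mini-lexicon is hypothetical and stands in for the SentiStrength EmotionLookupTable, which is not reproduced here; the function names are ours as well.

```python
from collections import Counter

# Hypothetical mini-lexicon: word -> strength in [-5, 5].
# It is NOT the SentiStrength lookup table.
LEXICON = {"hate": -4, "kill": -5, "love": 3, "good": 2}


def sentiment_counts(tokens, lexicon=LEXICON):
    """Variant (1): number of positive and of negative sentiment words on a page."""
    positives = sum(1 for t in tokens if lexicon.get(t, 0) > 0)
    negatives = sum(1 for t in tokens if lexicon.get(t, 0) < 0)
    return positives, negatives


def normalized_sentiment(tokens, lexicon=LEXICON):
    """Variant (2): frequency-weighted sum of word strengths over total word count."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = sum(freq * lexicon.get(word, 0) for word, freq in counts.items())
    return total / len(tokens)


tokens = ["hate", "group", "hate", "love"]
pos, neg = sentiment_counts(tokens)
score = normalized_sentiment(tokens)
```

Because every word strength lies between -5 and 5, the normalized value in variant (2) stays within the same range.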
3 CLASSIFICATION MODELS
FOR THE DETECTION OF
HATE, VIOLENCE AND
RACISM WEB PAGES
3.1 Data and Experiments
Our dataset is a collection of over 165,000 single-labeled
web pages in 20 categories. As described
earlier, in our study we selected a subset of the content
features as the raw data, taking into account
missing entries for different attributes. More
specifically we utilized full-page free text content, in
combination with the textual meta-content of web
pages including URL words, title words (TextTitle)
and meta-description terms (CobraMetaDescription,
CobraMetaKeywords, TagTextA and TagTextMeta
Content).
To build classifiers for identifying violence, hate
and racism web pages, four datasets are sampled
from the entire database. The datasets contain
training data with balanced positive and negative
examples for the four web categories: Violence,
Racism, Racist and Hate. Each dataset makes
maximal use of the positive examples available, with
negative samples distributed evenly across the other 19
web categories.
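One way to build such a balanced dataset is sketched below. It assumes the pages are already grouped by category in a dictionary; the function name `balanced_sample`, the fixed random seed and the handling of remainders are illustrative choices, not a description of the sampling procedure actually used.

```python
import random


def balanced_sample(pages_by_category, target, seed=0):
    """Build (page, label) pairs: all positives plus an equal number of negatives.

    Negatives are drawn as evenly as possible from every other category.
    """
    rng = random.Random(seed)
    positives = list(pages_by_category[target])
    others = sorted(c for c in pages_by_category if c != target)
    quota, extra = divmod(len(positives), len(others))
    negatives = []
    for index, category in enumerate(others):
        # The first `extra` categories supply one more page to cover the remainder.
        wanted = quota + (1 if index < extra else 0)
        pool = pages_by_category[category]
        negatives.extend(rng.sample(pool, min(wanted, len(pool))))
    data = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    rng.shuffle(data)
    return data


pages_by_category = {
    "Hate": ["h1", "h2", "h3", "h4"],
    "News": ["n1", "n2", "n3", "n4", "n5"],
    "Sports": ["s1", "s2", "s3", "s4", "s5"],
}
dataset = balanced_sample(pages_by_category, "Hate")
```

With 20 categories, the same call would spread the negatives over the 19 categories other than the target.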
Features for learning in the data for each web
page include topic similarity to ten web categories
(including the four selected categories) and a number
of sentiment strength values of each web page. A
summary of the features is given in Table 1.
In a series of experiments we develop three types
of classification models and compare their
performance: topic similarity based, sentiment
based, and combined models over unigrams,
trigrams and 5-grams. We choose to apply the
Naïve Bayes (NB) method with cross validation to
build binary classifiers: c = 1 if a page belongs to the
category (Violence, Hate, Racism or Racist), and c = 0
if it does not. The NB classifier is simple
but has been shown to perform very well on
language data. Support Vector Machines (SVM),
another of the most commonly used classification
algorithms, often achieve similar results but
take much longer to train.
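A binary Naïve Bayes classifier with cross validation can be sketched in plain Python as below. The Gaussian likelihood for numeric features, the small variance floor, the interleaved fold assignment and the toy feature rows (one similarity value and one negative sentiment strength per page) are all assumptions made for this sketch; the paper does not specify the NB variant or the number of folds.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Estimate class priors and per-feature mean/variance for each class."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[c] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(c):
        prior, stats = model[c]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def cross_validate(rows, labels, folds=5):
    """Plain k-fold cross validation; returns predictions aligned with labels."""
    predictions = [None] * len(rows)
    for fold in range(folds):
        test_idx = [i for i in range(len(rows)) if i % folds == fold]
        train_idx = [i for i in range(len(rows)) if i % folds != fold]
        model = fit_gaussian_nb([rows[i] for i in train_idx],
                                [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, rows[i])
    return predictions


# Toy rows: [similarity to the target category, negative sentiment strength].
rows = [[0.9, -4], [0.1, -1], [0.8, -5], [0.2, -1], [0.85, -4],
        [0.15, -2], [0.7, -3], [0.3, -1], [0.95, -5], [0.05, -2]]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predictions = cross_validate(rows, labels, folds=5)
```

Each page is predicted by a model trained without it, so the predictions can be compared with the labels to estimate out-of-sample performance.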
Table 1: List of extracted features for web pages.
Page-Category topic similarity     Sentiment strength features
Sim1-Sim10: topic similarity       Pos3-Pos5 and Neg3-Neg5 (counts of
between a web page and web         SentiStrength values as +3, +4, +5
category #1 to #10                 and -3, -4, -5)
                                   NewScale1
                                   NewScale2
3.2 Results and Discussion
The results of our experiment with unigram, trigram
and 5-gram based models are summarized in Table
2, Table 3 and Table 4.
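Precision and recall for a binary classifier follow the standard definitions: precision is the share of predicted positives that are correct, and recall is the share of actual positives that are found. A minimal sketch with toy labels (not data from the experiments) and a hypothetical helper name:

```python
def precision_recall(labels, predictions, positive=1):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(labels, predictions)
```

A model can trade one measure for the other: labelling more pages as positive tends to raise recall and lower precision.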
3.2.1 Unigram based Models
Among the unigram models, topic similarity based
models perform surprisingly well when compared to
our earlier studies, especially with the three web
categories Hate, Racism and Racist. One reason
could be that the raw data input had a big effect
here. Another could be that stemming and a
customized IDF database helped very much in
content similarity analysis.
The sentiment based models alone, on the other
hand, did not perform as well as the similarity based
models, and also performed slightly below the results
from our previous experiments. The reason lies mainly in the
differences in the negative samples of the training
set. We could still try to improve the sentiment-based
models by looking more closely into the sentiment
features.
In all four web-categories, we were able to
develop combined classification models with rather
decent performance.
3.2.2 Trigram based Models
When compared to using unigrams, in the case of
trigram models, the topic similarity based method
did not gain on precision level but rather on recall
level. This is counter-intuitive. One reason could be
that our selection of the cut-off level for the category
centroid (the dimension of the trigram vectors) is not
suitable, which would have limited the performance
of the similarity-based models. We will try to improve
the results by enlarging the vector space, which will
have an effect on the precision of the classifiers.
When topic similarity and sentiment features are
combined, we are able to build classifiers that are
slightly better than in the case of unigrams,
especially for the Racist category, which has
increased precision level by a big margin. This is
interesting and seems to tell us that there is
something genre-specific about the Racist category.
3.2.3 5-gram based Models
Compared with the unigram and trigram
approaches, 5-gram based combined models show
significantly improved precision levels for the
Violence and two Racism related web categories.
However, when comparing the trigram and 5-gram
results, the effect of higher-order n-grams on topic
similarity based models appears to be minor. We need to
look further into this to understand whether the large
amount of computation needed to process higher-order
n-grams is justified by the improvement in
classification performance.
KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
534
Table 2: Unigram: similarity based classifiers vs sentiment based vs combined.
Unigram based classification models
Web Category   Topic similarity based   Sentiment based       Best combined
               Precision   Recall       Precision   Recall    Precision   Recall
Violence       63.64%      83.17%       69.96%      53.87%    87.11%      60.16%
Hate           79.51%      75.97%       65.86%      65.49%    81.06%      76.25%
Racism         76.96%      77.35%       70.89%      58.83%    78.17%      79.77%
Racist         79.31%      80.50%       68.42%      62.22%    86.23%      88.41%
Table 3: Trigram: similarity based classifiers vs sentiment based vs combined.
Trigram based classification models
Web Category   Topic similarity based   Sentiment based       Combined
               Precision   Recall       Precision   Recall    Precision   Recall
Violence       59.66%      92.67%       69.96%      53.87%    84.57%      82.09%
Hate           61.73%      94.64%       65.86%      65.49%    64.07%      94.75%
Racism         65.85%      92.73%       70.89%      58.83%    78.63%      84.94%
Racist         67.67%      95.97%       68.42%      62.22%    95.85%      81.36%
Table 4: 5-gram: similarity based classifiers vs sentiment based vs combined.
5-gram based classification models
Web Category   Topic similarity based   Sentiment based       Combined
               Precision   Recall       Precision   Recall    Precision   Recall
Violence       60.71%      92.57%       69.96%      53.87%    96.56%      70.52%
Hate           62.05%      95.96%       65.86%      65.49%    67.56%      96.45%
Racism         60.87%      95.43%       70.89%      58.83%    74.34%      90.80%
Racist         63.77%      96.22%       68.42%      62.22%    97.62%      82.62%
Table 5: Combined models (only meta content as raw data, unigram, no stemming).
Model Performance (combined features)
Category       Precision   Recall
Violence       93.69%      82.75%
Hate           64.43%      96.28%
Racism         69.96%      91.82%
Racist         98.26%      96.30%
Finally, comparing our results with results from an
earlier study (Table 5), the 5-gram models making
use of combined topic similarity and sentiment
strength measurements achieve higher precision
for the Violence, Hate and Racism categories, but
not for the Racist category. The much simpler
approach we applied earlier thus still shows its value
for detecting Racist category content.
4 CONCLUSIONS
In this study we investigated the application of n-gram
representation in web content classification
models. Our assumption is that n-grams can capture
more local context information in text, and thus
could help to improve accuracy in capturing content
similarity, which will subsequently help to further
improve the performance of the classification
models. N-gram weighting, text summarization and
sentiment analysis techniques are applied to extract
topic and sentiment indicators of web pages.
Naïve Bayes classifiers are developed based on the
extracted topic similarity and sentiment features.
CombiningN-grambasedSimilarityAnalysiswithSentimentAnalysisinWebContentClassification
535
A large number of experiments were carried out.
Our results reveal that unigram based models,
although a much simpler approach, show their
unique value and effectiveness in web content
classification. Raw data input, stemming and the IDF
database all play important roles in determining
topic similarity, as does the choice of representation
model as unigram or higher order n-grams.
Higher order n-gram based approaches, especially
the 5-gram based models in our study, when combined
with sentiment features, bring significant
improvement in precision levels for the Violence
and two Racism related web categories. However,
the effect of higher-order n-grams on topic similarity based
models seems to be minor. We need to look
into this further to understand whether the improvements
made in classification models justify the large
amount of computation needed in processing n-grams.
The main contributions of our paper are: (1)
investigation of a new approach for web content
classification to serve online safety applications; (2)
in contrast to most studies, which only consider
presence or frequency count of n-grams in their
applications, we make use of tf-idf weighted n-grams
in building the content classification models;
(3) a large number of feature extraction and model
development experiments, which contribute to a better
understanding of text summarization, sentiment
analysis methods, and learning models; (4)
analytical results that directly benefit the
development of cyber safety solutions.
In our future work we will explore the
incorporation of probabilistic topic models (Blei et
al, 2003; Blei, 2012; Lu et al, 2011b), revisit topic-aware
sentiment lexicons (Lu et al, 2011a), and fine-tune
the models with different learning methods.
We believe there is still much room for
improvement, and some of these methods will
hopefully help to raise classification
performance to a new level.
ACKNOWLEDGEMENTS
This research is supported by the Tekes funded
DIGILE D2I research program, Arcada Research
Foundation, and our industry partner.
REFERENCES
Blei, D., Ng, A., and Jordan, M. I. 2003. Latent Dirichlet
allocation. Advances in Neural Information Processing
Systems. 601-608.
Blei, D. 2012. Probabilistic topic models. Communications
of the ACM, 55(4):77–84.
Calado, P., Cristo, M., Goncalves, M. A., de Moura, E. S.,
Ribeiro-Neto, B., and Ziviani, N. 2006. Link-based
similarity measures for the classification of web
documents. Journal of the American Society for
Information Science and Technology (57:2), 208-221.
Chakrabarti, S., B. Dom and P. Indyk. 1998. Enhanced
hypertext categorization using hyperlinks.
Proceedings of ACM SIGMOD 1998.
Chen, Z., Wu, O., Zhu, M., and Hu, W. 2006. A novel web
page filtering system by combining texts and images.
Proceedings of the 2006 IEEE/WIC/ACM
International Conference on Web Intelligence, 732–
735. IEEE Computer Society.
Cohen, W. 2002. Improving a page classifier with anchor
extraction and link analysis. In S. Becker, S. Thrun,
and K. Obermayer (Eds.), Advances in Neural
Information Processing Systems (Volume 15,
Cambridge, MA: MIT Press) 1481–1488.
Dumais, S. T., and Chen, H. 2000. Hierarchical
classification of web content. Proceedings of
SIGIR'00, 256-263.
Elovici, Y., Shapira, B., Last, M., Zaafrany, O., Friedman,
M., Schneider, M., and Kandel, A. 2005. Content-
based detection of terrorists browsing the web using
an advanced terror detection system (ATDS),
Intelligence and Security Informatics (Lecture Notes
in Computer Science Volume 3495), 244-255.
Fürnkranz J, Exploiting structural information for text
classification on the WWW, Advances in Intelligent
Data Analysis, 487-497, 1999
Fürnkranz J., T. Mitchell and E. Riloff, A Case Study in
Using Linguistic Phrases for Text Categorization on
the WWW, Working Notes of the 1998 AAAI/ICML
Workshop on Learning for Text Categorization.
Fürnkranz J, A study using n-gram features for text
categorization, Austrian Research Institute for
Artificial Intelligence 3 (1998), 1-10
Fürnkranz J, T Mitchell, E Riloff, A case study in using
linguistic phrases for text categorization on the
WWW, Proceedings from the AAAI/ICML Workshop
on Learning for Text Categorization, 5-12, 1999
Gabrilovich, E., and Markovitch, S. 2007. Computing
Semantic Relatedness using Wikipedia-based Explicit
Semantic Analysis. In Proceedings of the 20th
International Joint Conference on Artificial
Intelligence (IJCAI’07), Hyderabad, India.
Hammami, M., Chahir, Y., and Chen, L. 2003.
WebGuard: web based adult content detection and
filtering system. Proceedings of the IEEE/WIC Inter.
Conf. on Web Intelligence (Oct. 2003), 574 – 578.
Kludas, J. 2007. Multimedia retrieval and classification for
web content, Proc. of the 1st BCS IRSG conference on
Future Directions in Information Access, British
Computer Society, Swinton, UK.
Last, M., Shapira, B., Elovici, Y., Zaafrany, O., and
Kandel, A. 2003. Content-Based Methodology for
Anomaly Detection on the Web. Advances in Web
Intelligence, Lecture Notes in Computer Science (Vol.
2663, 2003), 113-123.
Liu, B. 2012. Sentiment Analysis and Opinion Mining.
Synthesis Lectures on Human Language Technologies,
Morgan & Claypool Publishers 2012
Liu S. and T. Forss, “Web Content Classification based on
Topic and Sentiment Analysis of Text”, accepted by
KDIR 2014, Rome, Italy, October 2014
Lu, Y., M. Castellanos, U. Dayal, C. Zhai. 2011a.
"Automatic Construction of a Context-Aware
Sentiment Lexicon: An Optimization Approach",
Proceedings of the 20th international conference on
World wide web (WWW'2011) Pages: 347-356
Lu, Y., Q. Mei, C. Zhai. 2011b. "Investigating Task
Performance of Probabilistic Topic Models - An
Empirical Study of PLSA and LDA", Information
Retrieval, April 2011, Volume 14, Issue 2, pp 178-203
Pang, B., and Lee, L. 2008. Opinion mining and sentiment
analysis. Foundations and Trends in Information
Retrieval 2(1-2), 1-135, July 2008
Pennebaker, J., Mehl, M., & Niederhoffer, K. 2003.
Psychological aspects of natural language use: Our
words, our selves. Annual review of psychology, 54(1),
547–577.
Qi, X., and Davison, B. 2007. Web Page Classification:
Features and Algorithms. Technical Report LU-CSE-
07-010, Dept. of Computer Science and Engineering,
Lehigh University, Bethlehem, PA, 18015
Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J.,
Celebi, A., Dimitrov, S., and Zhang, Z. 2004a.
MEAD-a platform for multidocument multilingual text
summarization. Proceedings of the 4th LREC
Conference (Lisbon, Portugal, May 2004)
Radev, D., Jing, H., Styś, M., and Tam, D. 2004b.
Centroid-based summarization of multiple documents.
Information Process. and Management (40) 919–938.
Riloff E, J Fürnkranz, T Mitchell, A Case Study in Using
Linguistic Phrases for Text Categorization on the
WWW, AAAI/ICML Workshop on Learning for Text
Categorization, 2001
Salton, G., and Buckley, C. 1988. Term-weighting
approaches in automatic text retrieval. Information
processing and management, 24(5), 513-523.
Shen, D., Z. Chen, Q. Yang, H. Zeng, B. Zhang, Y. Lu, W.
Ma: Web-page classification through summarization.
SIGIR 2004: 242-249
Shen, D., Qiang Yang, Zheng Chen: Noise reduction
through summarization for Web-page classification.
Info. Process. and Manage. 43(6): 1735-1747 (2007)
Stone, P. J., Dunphy, D. C., Smith, M. S., and Ogilvie, D.
M. 1966. The general inquirer: a computer approach
to content analysis. The MIT Press, Cambridge,
Massachusetts, 1966. 651
Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., and
Kappas, A. 2010. Sentiment strength detection in short
informal text. Journal of the American Society for
Information Sci. and Technology, 61(12), 2544–2558.
Thelwall, M., Buckley, K., and Paltoglou, G. 2012.
Sentiment strength detection for the social Web.
Journal of the American Society for Information
Science and Technology, 63(1), 163-173.
Zhang, S., Xiaoming Jin, Dou Shen, Bin Cao, Xuetao
Ding, Xiaochen Zhang: Short text classification by
detecting information path. CIKM 2013: 727-732.
CombiningN-grambasedSimilarityAnalysiswithSentimentAnalysisinWebContentClassification
537