Unsupervised machine learning methods avoid
dependence on training data. They still require a
corpus of documents, but no prior annotation of it.
In this approach, probabilistic and statistical
regularities of the text are identified and used to
solve the key subtasks of aspect-based sentiment
analysis: identifying aspect terms and determining
their tonality. However, such methods require
complex tuning to a given domain. For example, the
method based on Latent Dirichlet Allocation (LDA)
in its original form cannot detect topics
effectively, so it must be adapted further and the
identified topics must be matched to the target set
of contexts (Titov, 2008).
The text classification methods considered above
require a sentiment dictionary for evaluating text
tonality. There are three basic approaches to
building such a dictionary (Liu, 2012): the expert
approach; the approach based on dictionaries or
thesauri; and the approach based on text
collections.
In the expert approach, the dictionary is compiled
by experts. On the one hand, this approach is
complex and the resulting dictionary is likely to
lack domain-specific words; on the other hand, the
dictionary is of high quality in the sense that the
assigned tonality is adequate.
In the approach based on dictionaries or thesauri,
an initial small list of evaluative words is expanded
using various dictionaries, for example explanatory
dictionaries or dictionaries of synonyms and
antonyms. This approach also does not take the
subject area into account.
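The expansion step described above can be sketched as follows. This is a minimal illustration, not a method taken from the cited works: the toy `SYNONYMS` and `ANTONYMS` tables, the function name `expand_seed_lexicon`, and the rule that the first polarity assigned to a word wins are all assumptions made for the example.

```python
# Toy thesaurus (hypothetical data standing in for a real
# synonym / antonym dictionary).
SYNONYMS = {
    "good": ["fine", "great"],
    "great": ["superb"],
    "bad": ["poor", "awful"],
}
ANTONYMS = {
    "good": ["bad"],
    "great": ["awful"],
}


def expand_seed_lexicon(seeds, synonyms, antonyms, max_rounds=2):
    """Propagate polarity (+1 / -1) from seed words through the thesaurus."""
    lexicon = dict(seeds)
    frontier = list(seeds)
    for _ in range(max_rounds):
        next_frontier = []
        for word in frontier:
            polarity = lexicon[word]
            # Synonyms inherit the polarity; antonyms get the opposite one.
            candidates = [(s, polarity) for s in synonyms.get(word, [])]
            candidates += [(a, -polarity) for a in antonyms.get(word, [])]
            for candidate, candidate_polarity in candidates:
                if candidate not in lexicon:  # assumption: first assignment wins
                    lexicon[candidate] = candidate_polarity
                    next_frontier.append(candidate)
        frontier = next_frontier
    return lexicon


lexicon = expand_seed_lexicon({"good": 1}, SYNONYMS, ANTONYMS)
```

Because the thesaurus is general-purpose, nothing in this procedure adapts the resulting list to a particular subject area, which is the limitation noted above.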
In the approach based on text collections, the
dictionary is compiled through statistical analysis
of annotated texts that, as a rule, belong to the
subject domain in question.
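One possible form of such a statistical analysis is sketched below. The source does not specify which statistic is used, so the smoothed log-ratio of relative frequencies, the `min_count` threshold, the function name `build_corpus_lexicon`, and the four toy reviews are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def build_corpus_lexicon(labeled_texts, min_count=2):
    """Weight each word by the smoothed log-ratio of its relative
    frequency in positive versus negative texts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labeled_texts:
        counts[label].update(text.lower().split())

    vocab = set(counts["pos"]) | set(counts["neg"])
    total_pos = sum(counts["pos"].values())
    total_neg = sum(counts["neg"].values())

    lexicon = {}
    for word in vocab:
        n_pos, n_neg = counts["pos"][word], counts["neg"][word]
        if n_pos + n_neg < min_count:
            continue  # skip words that are too rare to be reliable
        # Add-one smoothing keeps the ratio finite for one-sided words.
        p_pos = (n_pos + 1) / (total_pos + len(vocab))
        p_neg = (n_neg + 1) / (total_neg + len(vocab))
        lexicon[word] = math.log(p_pos / p_neg)
    return lexicon


# Hypothetical annotated reviews from a single domain.
reviews = [
    ("a brilliant and moving film", "pos"),
    ("brilliant acting in a moving story", "pos"),
    ("a boring and dull film", "neg"),
    ("dull plot and boring acting", "neg"),
]
lexicon = build_corpus_lexicon(reviews)
```

In this sketch, words that occur mostly in positive texts receive positive weights, words that occur mostly in negative texts receive negative weights, and words spread evenly across both classes stay close to zero.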
In (Klekovkina and Kotelnikov, 2012), a dictionary
of emotional vocabulary compiled manually by experts
was used to determine the tonality of individual
words. In this dictionary, each word or phrase is
associated with a tonality orientation (positive /
negative) and a strength (in points).
The methods proposed in (Taboada et al., 2011;
Boiy, 2007) are based on a dictionary approach: to
determine the tonality of texts, a dictionary of
evaluative words is used, in which each word has a
numerical weight that expresses the degree of its
significance. The method of working with the
dictionary is closest to that of (Boucher and
Osgood, 1969), with two differences: first, the
dictionary is created on the basis of a statistical
analysis of a training collection; second, the
weights of words are determined with the help of a
genetic algorithm.
In most studies, the tonality of a text is
determined by summing the weights of the evaluative
words it contains:
$$W_T^C = \sum_{i=1}^{N_C} w_i \qquad (1)$$
where $W_T^C$ is the weight of text T for tonality C;
$w_i$ is the weight of the evaluated word i; and $N_C$
is the number of estimated bigrams of tonality C in
the text T.
Texts are classified according to the linear function:
$$f(W_T^{pos}, W_T^{neg}) = W_T^{pos} + k_{neg} \cdot W_T^{neg}, \qquad (2)$$
where $W_T^{pos}$ is the positive weight of the text T;
$W_T^{neg}$ is the negative weight of the text T; and
$k_{neg}$ is a coefficient compensating for the
preponderance of positive vocabulary in text (Pang,
2008). If the value of the function f is greater than
zero, the text is positive; otherwise it is negative.
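A minimal sketch of the weight summation (1) and the linear decision function (2) is given below. It rests on assumptions not stated in the source: the dictionary entries are single tokens rather than bigrams, negative entries carry negative weights so that they lower f, and the example weights and the names `text_weight` and `classify` are hypothetical.

```python
def text_weight(tokens, lexicon):
    """Eq. (1): sum the weights of the dictionary entries found in the text."""
    return sum(lexicon[token] for token in tokens if token in lexicon)


def classify(tokens, pos_lexicon, neg_lexicon, k_neg=1.0):
    """Eq. (2): f = W_pos + k_neg * W_neg; the text is positive if f > 0."""
    w_pos = text_weight(tokens, pos_lexicon)
    w_neg = text_weight(tokens, neg_lexicon)
    f = w_pos + k_neg * w_neg
    return ("positive" if f > 0 else "negative"), f


# Hypothetical weights. Assumption: negative words have negative weights.
POS = {"brilliant": 2.0, "moving": 1.0}
NEG = {"boring": -1.5, "dull": -1.0}

tokens = "a brilliant but boring film".split()
label, score = classify(tokens, POS, NEG, k_neg=1.0)
label_k, score_k = classify(tokens, POS, NEG, k_neg=2.0)
```

With `k_neg=1.0` the example text scores 0.5 and is labeled positive; raising `k_neg` to 2.0 gives -1.0 and flips the label to negative, which shows how the coefficient offsets a surplus of positive vocabulary.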
3 METHODOLOGY OF WEBSITES CONTENT SENTIMENT CLASSIFICATION
The objective of this research is to test and
evaluate a Text Classification Methodology grounded
in a Manually Created Corpora-based Sentiment
Dictionary (SC-methodology).
The developed Methodology comprises three main
practical stages:
1) Manual creation of a Corpora-based Sentiment
Dictionary (CBSD).
2) Classification of texts based on the created
CBSD.
3) Evaluation of the adequacy of the text
classification results.
Polish-language film review corpora will be used as
a case study to test the basic workability and the
quality of the proposed Methodology.
3.1 Novelty and Motivation
In this paper the following scientific research
questions (RQ) were raised:
RQ_1: Does the structure of the Sentiment
Dictionary influence the quality of classification?
RQ_2: Does the writing style of the analyzed text
influence the quality of classification?