different, complementary research areas: text mining, sentiment prediction, classification, clustering, natural language processing, the use of lexical resources and so on.
In (Hatzivassiloglou, 1997) the authors extract adjectives joined by conjunction relations (and/or/but), based on the observation that conjoined adjectives tend to share the same polarity and semantic orientation (e.g., with “and”) or the opposite one (e.g., with “but”).
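As an illustration, this heuristic can be sketched as a simple propagation over conjoined adjective pairs; the pair list, the seed polarity and the helper name below are hypothetical, and only the and/but cases are handled.

```python
# A minimal sketch of conjunction-based polarity propagation in the spirit
# of (Hatzivassiloglou, 1997); the input pairs and seed are toy examples.
def propagate_polarity(adj_pairs, polarities):
    """adj_pairs: (adj1, conjunction, adj2) triples mined from a corpus."""
    changed = True
    while changed:
        changed = False
        for a1, conj, a2 in adj_pairs:
            for known, unknown in ((a1, a2), (a2, a1)):
                if known in polarities and unknown not in polarities:
                    # "and" links same-polarity adjectives, "but" opposite ones
                    sign = 1 if conj == "and" else -1
                    polarities[unknown] = sign * polarities[known]
                    changed = True
    return polarities

pairs = [("clean", "and", "comfortable"), ("comfortable", "but", "expensive")]
print(propagate_polarity(pairs, {"clean": +1}))
# {'clean': 1, 'comfortable': 1, 'expensive': -1}
```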
In (Turney, 2002) three-word windows are compared against a predefined table of syntactic patterns, extracting targets and their associated opinion words along with their semantic orientation values.
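A much simplified sketch of such pattern matching follows; the partial pattern table below reconstructs only a few of the original patterns, and the POS tags are assumed to follow the Penn Treebank tagset.

```python
# Two-word phrases are kept when the POS tags of a sliding three-word
# window match the table; the third tag acts as a negative constraint.
PATTERNS = [
    ({"JJ"}, {"NN", "NNS"}, None),                   # adjective + noun
    ({"RB", "RBR", "RBS"}, {"JJ"}, {"NN", "NNS"}),   # adverb + adjective, no following noun
    ({"JJ"}, {"JJ"}, {"NN", "NNS"}),                 # adjective + adjective, no following noun
]

def extract_phrases(tagged):
    """tagged: list of (word, POS) pairs for one sentence."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, forbidden in PATTERNS:
            if t1 in first and t2 in second and (forbidden is None or t3 not in forbidden):
                phrases.append((w1, w2))
    return phrases

print(extract_phrases([("very", "RB"), ("nice", "JJ"), ("indeed", "RB")]))
# [('very', 'nice')]
```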
In (Hu and Liu, 2004) frequent nouns and noun
phrases are used to extract product feature candidates.
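A minimal sketch of this frequency-based candidate generation, assuming POS-tagged sentences as input and an illustrative support threshold:

```python
from collections import Counter

def frequent_features(tagged_sentences, min_support=3):
    """Nouns occurring in at least min_support sentences become candidates."""
    counts = Counter()
    for sent in tagged_sentences:  # each sentence: list of (word, POS) pairs
        nouns = {w.lower() for w, t in sent if t.startswith("NN")}
        counts.update(nouns)       # count each noun once per sentence
    return [n for n, c in counts.items() if c >= min_support]
```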
The target extraction proposed in (Popescu, 2005) determines whether a noun or noun phrase is a product feature or not. A PMI score is computed, from Web search hit counts, between the phrase and discriminator phrases derived from the known product class.
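The assessment can be sketched as follows; hits stands in for a hypothetical Web search hit-count lookup, and the logarithm-free ratio is one common hit-count variant of PMI:

```python
def pmi(phrase, discriminator, hits):
    # co-occurrence hit count relative to the individual hit counts
    return hits(f'"{phrase}" "{discriminator}"') / (hits(phrase) * hits(discriminator))
```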
In (Jin et al., 2009) lexicalized Hidden Markov Models are employed. A propagation module extends the previously extracted targets and opinion words: the authors expand the opinion words with synonyms and antonyms, expand the targets with related words, and combine them into bigrams. Noise is handled by assigning weights to the resulting bigrams.
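The expansion step can be sketched with WordNet via NLTK; the choice of WordNet and the restriction to adjectival senses are our assumptions, and the bigram weighting is omitted.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def expand_opinion_word(word):
    """Collect synonyms and antonyms over all adjectival senses of a word."""
    synonyms, antonyms = set(), set()
    for synset in wn.synsets(word, pos=wn.ADJ):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            antonyms.update(a.name() for a in lemma.antonyms())
    return synonyms, antonyms
```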
The extraction of product features using grammar rules is described in (Zhang et al., 2010). The authors also use the HITS algorithm, a link analysis algorithm originally devised for rating Web pages, together with feature frequency, to rank features by relevance.
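A compact sketch of the HITS iteration over a bipartite graph linking opinion words (hubs) to features (authorities); the toy graph is hypothetical, and combining the authority score with feature frequency is left out.

```python
def hits(edges, n_iter=50):
    """edges: (hub, authority) pairs; returns normalized authority scores."""
    hubs = {h: 1.0 for h, _ in edges}
    auths = {a: 1.0 for _, a in edges}
    for _ in range(n_iter):
        for a in auths:                      # authority <- sum of linking hub scores
            auths[a] = sum(hubs[h] for h, a2 in edges if a2 == a)
        norm = sum(v * v for v in auths.values()) ** 0.5
        auths = {a: v / norm for a, v in auths.items()}
        for h in hubs:                       # hub <- sum of linked authority scores
            hubs[h] = sum(auths[a] for h2, a in edges if h2 == h)
        norm = sum(v * v for v in hubs.values()) ** 0.5
        hubs = {h: v / norm for h, v in hubs.items()}
    return auths

print(hits([("great", "battery"), ("great", "screen"), ("sharp", "screen")]))
```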
In (Liu, 2012) seed word set expansion and feature identification are described. The seed word set, also referred to as a lexicon, is composed of adjectives with an associated polarity, in the form of a positive, neutral or negative score. Features and opinion words are extracted in pairs, by using a dependency grammar and by exploiting the syntactic dependencies between nouns and adjectives in sentences.
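A minimal sketch of such dependency-based pair extraction, using spaCy as the parser (the toolkit choice is ours) and covering only the adjectival-modifier relation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def noun_adj_pairs(text):
    doc = nlp(text)
    return [(tok.head.lemma_, tok.lemma_)        # <feature, opinion>
            for tok in doc
            if tok.dep_ == "amod" and tok.head.pos_ == "NOUN"]

print(noun_adj_pairs("The camera has an excellent lens."))
# [('lens', 'excellent')]
```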
Supervised and unsupervised approaches are
combined for extracting opinion words and their
targets in (Su Su Htay and Khin Thidar Lynn, 2013).
Targets are extracted by using a training corpus, while
opinion words are extracted by using grammar rules.
The drawback of combining approaches lies in the domain dependency introduced by the supervised part.
In (Hu et al., 2013) sentiments are extracted from the emoticons used in social texts such as blogs, comments and tweets. The authors use the orthogonal nonnegative matrix tri-factorization model (ONMTF), clustering data instances based on their distribution over features, and features based on their distribution over data instances.
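A sketch of ONMTF, X ≈ F S Gᵀ, with standard multiplicative updates for this model; F clusters data instances (rows) and G clusters features (columns). The initialization and iteration count are illustrative.

```python
import numpy as np

def onmtf(X, k_rows, k_cols, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    m, n = X.shape
    F = rng.random((m, k_rows))       # row (data instance) clusters
    S = rng.random((k_rows, k_cols))  # cluster association matrix
    G = rng.random((n, k_cols))       # column (feature) clusters
    for _ in range(n_iter):
        F *= np.sqrt((X @ G @ S.T) / (F @ F.T @ X @ G @ S.T + eps))
        G *= np.sqrt((X.T @ F @ S) / (G @ G.T @ X.T @ F @ S + eps))
        S *= np.sqrt((F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps))
    return F, S, G
```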
(Guerini et al., 2013) tackle a polarity assignment problem, using posterior polarity to achieve polarity consistency throughout the text. The authors also obtain better results with a framework built from a collection of posterior polarity scoring formulas. Their results further show the advantage of averaging over all senses of a word rather than relying on its most frequent sense.
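The sense-averaging idea can be sketched with SentiWordNet via NLTK (the lexicon choice is ours): the prior polarity of a word becomes the mean positive-minus-negative score over all of its senses instead of the score of the first, most frequent sense.

```python
from nltk.corpus import sentiwordnet as swn  # requires the SentiWordNet corpus

def prior_polarity(word):
    senses = list(swn.senti_synsets(word))
    if not senses:
        return 0.0
    # average over all senses rather than using senses[0] (the most frequent)
    return sum(s.pos_score() - s.neg_score() for s in senses) / len(senses)
```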
In order to determine opinion polarity values, a lexicon- and rule-based approach is proposed in (Marrese-Taylor et al., 2013). A polarity lexicon and linguistic rules are used to obtain a list of words with known orientations.
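A minimal sketch of such lexicon- and rule-based scoring; the tiny lexicon and the single negation rule are illustrative.

```python
LEXICON = {"good": +1, "great": +1, "bad": -1, "poor": -1}
NEGATIONS = {"not", "never", "no"}

def sentence_polarity(tokens):
    score = 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            polarity = LEXICON[tok]
            if i > 0 and tokens[i - 1] in NEGATIONS:
                polarity = -polarity       # a negation rule flips the orientation
            score += polarity
    return score

print(sentence_polarity("the service was not good".split()))  # -1
```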
Our work devises a generalized methodology by considering a comprehensive set of grammar rules for the identification of opinion-bearing words. Moreover, we focus on tuning our method for the best trade-off between precision and recall, running time and the number of seed words. The method is general enough to perform well using just two seed words; therefore we can state that it is an unsupervised strategy. Moreover, since the two seed words are class representatives (“good”, “bad”), we claim that the method is domain independent.
3 THE PROPOSED TECHNIQUE
The method proposed in this paper is presented in Figure 1, where the conceptual modules of our architecture, together with the intermediate data produced, are depicted. The architecture is composed of three components: 1 – Retriever Service; 2 – Feature-Opinion Pair Identification; 3 – Polarity Aggregator.
The Retriever Service generates syntactic trees from the given input corpus. This preprocessing module handles the usual NLP tasks. The transformations applied at sentence level are: tokenization, lemmatization, part-of-speech tagging and syntactic parsing. First, each review document is segmented into sentences, which are then split into words in the tokenizing step. Lemmatization reduces each word to its base (root) form. Finally, the parsing step generates a syntactic tree for each sentence, given the output of the previous steps. This syntactic decomposition serves as input for the second main task of the system, the identification of feature-opinion pairs.
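A sketch of this preprocessing chain using spaCy (the paper does not name a toolkit, so the library choice is an assumption); spaCy's dependency parse stands in here for the syntactic parsing step.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def preprocess(review):
    doc = nlp(review)
    for sent in doc.sents:                     # sentence segmentation
        for tok in sent:                       # tokenization
            print(tok.text, tok.lemma_,        # lemmatization
                  tok.pos_,                    # part-of-speech tagging
                  tok.dep_, tok.head.text)     # syntactic (dependency) parsing

preprocess("The battery life is great. I bought the phone last week.")
```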
The <feature, opinion> tuple identification
component extracts the feature-opinion pairs using
the double propagation algorithm. The rule-based
strategy followed, double propagation, uses the extraction rules listed in (Cosma, 2014).
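A condensed sketch of double propagation follows; the single noun-adjective rule below stands in for the full rule set of (Cosma, 2014), and the input pairs are toy examples.

```python
def double_propagation(sentences, seed_opinion_words):
    """sentences: per-sentence lists of (noun, adjective) dependency pairs."""
    opinions, features = set(seed_opinion_words), set()
    changed = True
    while changed:                   # iterate until no new extractions appear
        changed = False
        for pairs in sentences:
            for noun, adj in pairs:
                if adj in opinions and noun not in features:
                    features.add(noun)   # known opinion word -> new target
                    changed = True
                if noun in features and adj not in opinions:
                    opinions.add(adj)    # known target -> new opinion word
                    changed = True
    return features, opinions

sents = [[("phone", "good")], [("phone", "sleek")], [("screen", "sleek")]]
print(double_propagation(sents, {"good"}))
# ({'phone', 'screen'}, {'good', 'sleek'})
```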