Unsupervised Irony Detection: A Probabilistic Model

with Word Embeddings

Debora Nozza, Elisabetta Fersini and Enza Messina

DISCo, University of Milano-Bicocca, Viale Sarca, 336, Milan, Italy

Keywords:

Irony Detection, Unsupervised Learning, Probabilistic Model, Word Embeddings.

Abstract:

The automatic detection of ﬁgurative language, such as irony and sarcasm, is one of the most challenging tasks

of Natural Language Processing (NLP). This is because machine learning methods can be easily misled by the

presence of words that have a strong polarity but are used ironically, which means that the opposite polarity

was intended. In this paper, we propose an unsupervised framework for domain-independent irony detection.

In particular, to derive an unsupervised Topic-Irony Model (TIM), we built upon an existing probabilistic topic

model initially introduced for sentiment analysis purposes. Moreover, in order to improve its generalization

abilities, we took advantage of Word Embeddings to obtain domain-aware ironic orientation of words. This is

the ﬁrst work that addresses this task in unsupervised settings and the ﬁrst study on the topic-irony distribution.

Experimental results have shown that TIM is comparable, and sometimes even better with respect to supervised

state of the art approaches for irony detection. Moreover, when integrating the probabilistic model with word

embeddings (TIM+WE), promising results have been obtained in a more complex and real world scenario.

1 INTRODUCTION

Mining opinions and sentiments from user generated

texts expressed in natural language is an extremely

difﬁcult task. It requires a deep understanding of ex-

plicit and implicit information conveyed by language

structures, whether in a single word or an entire docu-

ment (Bosco et al., 2013). In particular, social media

users are inclined to adopt a creative language mak-

ing use of original devices such as sarcasm and irony

(Ghosh et al., 2015a).

These ﬁgures of speech are commonly used to in-

tentionally convey an implicit meaning that may be

the opposite of the literal one. According to Colston

and Gibbs (Colston and Gibbs, 2007) an ironic mes-

sage typically conveys a negative opinion using only

positive words.From the sentiment analysis perspec-

tive such utterances represent a challenge as an in-

terfering factor that can revert the message polarity

(usually from positive to negative). The detection of

ironic expressions is crucial in different application

domains, such as marketing and politics, where the

users tend to subtly communicate dissatisfaction usu-

ally referring to a product or to a political ideology or

politician.

Although sarcasm and irony are a well-studied

phenomenons in linguistics, psychology and cog-

nitive science, their automatic detection is still a

great challenge because of its complexity. Stan-

dard dictionary-based methods for sentiment analysis,

based on a predeﬁned sentiment-driven lexicon, have

often shown to be inadequate in the face of indirect

ﬁgurative meanings (Ghosh et al., 2015a). Several

methods have been proposed to evaluate the abilities

of semi-supervised and supervised machine learning

approaches to tackle irony detection problem. How-

ever, they assume as prerequisite human annotation

of texts as training data, which in a real social media

context is costly and difﬁcult even for human, so as to

make it prohibitive. Moreover, it is commonly known

that supervised machine learning classiﬁers trained

on one domain often fail to produce satisfactory re-

sults when shifted to another domain, since natural

language expressions can be quite different (Blitzer

et al., 2007).

In this paper we propose a fully unsupervised

framework for domain-independent irony detection.

To perform unsupervised topic-irony detection, we

built upon an existing probabilistic topic model, ini-

tially introduced for sentiment analysis purposes. The

aim of this model is to discover the hidden thematic

structure in large archives of texts. Probabilistic topic

models are particularly suitable for two main reasons:

ﬁrst, they are able to discover topics embedded in text

Nozza, D., Fersini, E. and Messina, E.

Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings.

DOI: 10.5220/0006052000680076

In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - Volume 1: KDIR, pages 68-76

ISBN: 978-989-758-203-5

messages in an unsupervised way, and second, they

result in a language model that estimates how much a

word is related to each topic and to the irony ﬁgure of

speech.

Moreover, in order to improve the generaliza-

tion abilities we took advantage of word embeddings

to obtain domain-aware ironic orientation for words.

This is the ﬁrst work that addresses the problem of

irony detection in a fully unsupervised settings. Fur-

thermore, this paper contributes as a ﬁrst investigation

on irony-topic models.

The rest of the paper is organized as follows. Sec-

tion 2 introduces the related work. In Section 3, the

proposed framework grounded on an unsupervised

Topic-Irony model and Word Embeddings are pre-

sented. In Section 4, the experimental investigation

is presented. Finally, we conclude and discuss further

research directions in Section 5.

2 RELATED WORK

As deﬁned in (Edward and Connors, 1971), a ﬁg-

ure of speech is any artful deviation from the ordi-

nary mode of speaking or writing. Among the most

problematic ﬁgures of speech in Natural Language

Processing (NLP) we focused on sarcasm and irony

(Katz et al., 2005), which are commonly used to con-

vey implicit criticism with a particular victim as its

target, saying or writing the opposite of what the

author means (McDonald, 1999). As mentioned in

(Weitzel et al., 2016), language should not be taken

literally, especially when addressing a sentiment anal-

ysis task. The presence of strongly positive (or nega-

tive) words that are used ironically, which means that

the opposite polarity was intended, can easily mislead

sentiment analysis classiﬁcation models (Reyes and

Rosso, 2014).

In the last year several approaches for irony de-

tection based on different set of features have been

investigated. In (Davidov et al., 2010), the authors

proposed a semi-supervised technique to detect sar-

casm in Amazon product reviews and tweets. They

used pattern-based (high frequency words and con-

tent words) and punctuation-based features to build

the sarcasm detection model. A supervised approach

has been proposed in (Gonz

alez-Ib

anez et al., 2011),

where the irony detection problem is studied for sen-

timent analysis in Twitter data. The authors used uni-

grams, word categories, interjections (e.g., ah, yeah),

and punctuation as features. Emoticons and ToUser

(which marks if a tweet is a reply to another tweet)

were also used. In (Riloff et al., 2013), the authors

considered a speciﬁc type of sarcasm where sarcas-

tic tweets include a positive sentiment (such as “love”

or “enjoy”) followed by an expression that describes

an undesirable activity or state (e.g., “taking exams”

or “being ignored”). In (Reyes et al., 2013) the au-

thors focused on developing classiﬁers to detect ver-

bal irony based on a set of high-level features: ambi-

guity, polarity unexpectedness and emotional cues. In

(Pt

cek et al., 2014) a supervised model has been ex-

ploited for document-level irony detection in Czech

and English by using n-grams, patterns, POS tags,

emoticons, punctuation and word case.

A similar approach, where a novel set of linguisti-

cally related features are used, has been presented in

(Barbieri and Saggion, 2014). In (Fersini et al., 2015)

the authors proposed an ensemble approach, based on

a Bayesian Model Averaging paradigm, which makes

use of models trained using several linguistic fea-

tures, such as pragmatic particles and Part-Of-Speech

tags. In (Hern

andez-Far

ıas et al., 2015), the irony

detection problem has been addressed by investigat-

ing statistical-based and lexicon-based features paired

with two semantic similarity measures, i.e. Lesk and

Wu-Palmer (Pedersen et al., 2004).

Other recent works (Bamman and Smith, 2015;

Rajadesingan et al., 2015) aim to address the sarcasm

detection in microblogs by including extra-linguistic

information from the context such as properties of the

author, the audience, the immediate communicative

environment and the user’s past messages. Word em-

beddings have been used as features in a supervised

approach in (Ghosh et al., 2015b), where the authors

expressed the sarcasm detection task as a word sense

disambiguation problem.

Although the above mentioned studies represent

a fundamental step towards the deﬁnition of effective

irony detection systems, they suffer of three main lim-

itations:

• they assume a labelled corpus for training super-

vised and semi-supervised models;

• they are tailored for domain-dependent irony de-

tection, restraining their applicability to other do-

main of interest;

• they disregard the topic subjected to the irony.

In order to overcome these limitations, we inves-

tigated an unsupervised topic-irony model enriched

with domain-independent word embeddings.

Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings

3 PROPOSED FRAMEWORK

3.1 Topic-Irony Model (TIM)

In order to perform unsupervised irony detection, tak-

ing into account also the topic-dependency of the

words, we focused our investigation on the suite of

generative models called probabilistic topic models,

originally deﬁned for sentiment purposes. We consid-

ered three main generative models, which are exten-

sions of the well-known Latent Dirichlet Allocation

model (Blei et al., 2003).

The ﬁrst one is Topic Sentiment Mixture (TSM)

(Mei et al., 2007), that jointly models the mixture of

topics and sentiment predictions for the entire docu-

ment. Here, the sentiment language model is consid-

ered as separated from the topics ones, that can lead to

a language model that is not able to explain the hidden

correlation between a topic and sentiment. The sec-

ond one is Joint Sentiment-Topic (JST) model (Lin

and He, 2009), which assumes that topics are depen-

dent on sentiment distributions and words are condi-

tioned on sentiment-topic pairs. The last one is As-

pect and Sentiment Uniﬁcation Model (ASUM) (Jo

and Oh, 2011), that slightly differs from JST with re-

spect to the language distribution constraints. While

in JST each word may come from different language

models, ASUM constrains the words in a single sen-

tence to come from the same language model.

Among these models, we based our Topic-Irony

Model on ASUM. This choice is motivated by the

fact that (1) the topic-irony model should generate

a topic and an ironic/not-ironic orientation for each

word (2) this model is particularly suitable for mi-

croblog text, where messages have a maximum num-

ber of characters and a sentence would be either ironic

or not ironic with respect to a speciﬁc topic (3) ASUM

makes use of a set of seed words explicitly integrated

into the generative process, making the model more

stable from a statistical point of view.

The proposed Topic-Irony model (TIM) is able

to model irony toward different topics in a fully

unsupervised paradigm, enabling each word in a

sentence to be generated from the same irony-topic

distribution. More formally, let D be the number

of documents, M the number of sentences, N the

number of words, T the number of topics, I the

number of irony classes {ironic, not ironic} and V

the vocabulary size.

The generative process is as follows:

1. For every pair of (i, z) such that i ∈ I and z ∈ T ,

draw a word distribution φ

∼ Dirichlet(β

2. For each document d,

(a) Draw the document’s irony distribution π

∼

Dirichlet(γ)

(b) For each i ∈ I, draw a topic distribution θ

∼

Dirichlet(α)

i. Choose an irony class

i ∼ Multinomial(π

)

ii. Given

i, choose a topic ˆz ∼

Multinomial(θ

)

iii. Generate words w ∼ Multinomial(φ

iˆz

)

Following (Jo and Oh, 2011), β is the parameter

that controls the integration of seed words in the mod-

els and we used its asymmetric form. Indeed, one

can expect that the words “news, bbc, science” are

not probable in ironic expressions, and similarly “lol,

oh, duh” are probably ironic expressions. This expec-

tation can be encoded in β. The latent variables θ, π,

and φ are inferred by Gibbs sampling. The graphical

representation of TIM is shown in Figure 1.

Figure 1: Graphical representation of TIM. Nodes are ran-

dom variables, edges are dependencies, and plates are repli-

cations. Shaded nodes are observable.

3.2 Word Embeddings (WE)

The original ASUM topic model makes use of known

general sentiment seed words to derive domain-

speciﬁc sentiment words (Jijkoun et al., 2010). For

sentiment seed words, existing sentiment word lex-

icons can be used (e.g., SentiWordNet (Esuli and

Sebastiani, 2006)) or a new set of words may be

obtained by using sentiment propagation techniques

(Kaji and Kitsuregawa, 2007; Mohammad et al.,

2009; Rao and Ravichandran, 2009; Lu et al., 2011).

For irony detection, a lexicon cannot be a pri-

ori deﬁned, but it can be automatically derived in an

unsupervised way using huge quantity of text. To

this purpose, word embeddings can be adopted to de-

rive latent relationships among words (e.g. irony is

strictly related to epic fail) and therefore to automat-

ically create lexicons based on the language model

used in online social networks. This representation

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval

is derived by various training methods inspired from

neural-network models In our investigation the ironic-

lexicon, among the available distributed representa-

tions (Bengio et al., 2006; Turian et al., 2010; Huang

et al., 2012), two model architectures have been used

(Mikolov et al., 2013), Continuous Bag of Words

(CBOW) and Skip-gram have been chosen because

of their efﬁciency on training and their limited loss of

information. The training objective of CBOW is to

combine the representations of surrounding words to

predict the word in the middle. Similarly, in the Skip-

gram model, the training objective is to learn word

vector representations that are good at predicting con-

text in the same sentence. The model architectures of

these two methods are shown in Figure 2.

Figure 2: Graphical representation of the CBOW and Skip-

gram model. The CBOW architecture predicts the current

word based on the context, and the Skip-gram predicts sur-

rounding words given the current word.

In practice, Skip-gram gives better word represen-

tations when the monolingual data is small. However,

CBOW is faster and more suitable for larger datasets.

The Skip-gram and CBOW models are typically

trained using stochastic gradient descent. The gradi-

ent is computed using back propagation rule (Rumel-

hart et al., 1986). When trained on a large dataset,

these models capture a substantial amount of seman-

tic information. Closely related words will have sim-

ilar vector representations, e.g., Italy, France, Portu-

gal will be similar to Spain. More interestingly, the

word vectors can also capture complex analogy pat-

terns. For example, vector(king) is to vector(man) as

vector(queen) is to vector(woman).

4 EXPERIMENTS

4.1 Dataset and Evaluation Settings

We evaluated the proposed framework on a bench-

mark dataset for irony detection (Reyes et al., 2013).

The dataset contains 10,000 ironic tweets and 30,000

non-ironic tweets (10,000 for each topic: Education,

Humour and Politics). As in the original paper, we

performed a series of binary classiﬁcations, between

Irony vs Education, Irony vs Humour and Irony vs

Politics in a balanced settings (50% ironic texts and

50% not ironic texts). We also considered the task

with unbalanced classes, i.e. to learn ironic vs oth-

ers. In order to deal with a more realistic and complex

scenario, where the term irony can not be explicitly

available, we evaluated the proposed model accord-

ing to two experimental conditions:

• Original scenario (O): the dataset has been main-

tained as it is (where the hashtags have been re-

moved), in order to allow a direct comparison with

the state of the art models;

• Simulated scenario (S): the hashtags and the term

irony have been removed from the data in or-

der to simulate a more realistic and complex sce-

nario where the presence of irony is not explicitly

pointed out.

Concerning the proposed model, two hyper-

parameters, γ and β, have been tuned. γ is a prior

for the irony distribution in texts. Because it is not

possible to make assumptions on this distribution,

several conﬁgurations have been evaluated. The

second hyper-parameter, β, is the key elements for

integrating the seed words that originate through

WE into TIM. β is the prior of the word-irony-topic

distribution deﬁned for ironic seed words, not ironic

seed words and all the other words.

The construction of the irony lexicon (to be en-

closed as seed words) has been performed by training

the word embedding model on all the tweets in the

corpus. The seed words have been obtained by ex-

tracting the most similar words to the term “irony”.

After a preliminary experimental investigation, we

decided to report the results related to the best dis-

tributed representation. In particular, the following

results are related to the CBOW model thanks to its

ability to deal with large corpus.

In the following experimental results, TIM will

denote the Topic-Irony Model, while TIM+WE will

represent the Topic-Irony Model based on the lexi-

cons induced by CBOW. The experimental investiga-

tion is conducted by comparing TIM, TIM+WE and

two supervised approaches available in the literature.

We evaluated the performance in terms of Preci-

sion (P), Recall (R), F-Measure (F), distinguishing

between ironic (+) and not ironic (-). A global perfor-

mance measure is also reported in terms of Accuracy.

Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings

4.2 Irony Detection Results

4.2.1 Balanced Dataset

Original Scenario (O). The results of the proposed

framework is compared with (Reyes et al., 2013) in

Table 1. Our framework clearly outperforms the su-

pervised method with signiﬁcant improvements, i.e.

(on average) 11% for Precision, 14% for Recall and

13% for F-Measure.

Table 1: Results compared with a supervised state-of-the-

art method for each binary problem (O).

P R F

irony (Reyes et al., 2013) 0.7600 0.6600 0.7000

vs TIM 0.8225 0.8746 0.8477

education TIM + WE 0.8228 0.8629 0.8423

irony (Reyes et al., 2013) 0.7500 0.7100 0.7300

vs TIM 0.9127 0.8560 0.8834

politics TIM + WE 0.9131 0.8373 0.8735

irony (Reyes et al., 2013) 0.7800 0.7400 0.7600

vs TIM 0.8414 0.8174 0.8292

humour TIM + WE 0.8142 0.7832 0.7983

A further remark relates to the Precision and Re-

call obtained by TIM and TIM+WE. It can be eas-

ily noted that the two proposed models achieve ho-

mogeneous performance on both orientations and in

all the binary classiﬁcation problems, obtaining Pre-

cision and Recall performance of similar magnitude.

In order to grasp more peculiar behaviours, the per-

formance measures both for ironic (+) and not ironic

(-) texts have been reported in Table 3. In this case,

we can highlight that Precision and Recall for both

classes are well balanced, ensuring good performance

also on the most difﬁcult (ironic) target. Concerning

accuracy, TIM and TIM+WE are able not only to out-

perform a trivial classiﬁer that would ensure 50% of

accuracy, but they perform differently according to

the binary problem that they address. We can note

that tackling Irony vs Humor is more difﬁcult than

Irony vs Politics and Irony vs Education. In fact,

as stated by the authors of the original paper (Reyes

et al., 2013), the similarity estimated between pairs of

classes is signiﬁcantly higher in Irony vs Humor than

the other binary problems.

Moreover, we can notice that in this scenario the

contribution of the ironic-lexicon derived through WE

does not generally improve the performances of TIM.

This is probably due to the impact that the word irony

has into the dataset and into the model: the lexicon of

TIM only composed of the irony term is sufﬁcient to

discriminate between the ironic and non-ironic orien-

tations. Although the additional seed words enclosed

in TIM+WE allow the model to obtain remarkable re-

sults with respect to the supervised settings and simi-

lar performance compared to TIM, the only presence

of the term irony guarantees better performance than

richer lexicons. As expected, TIM better ﬁts the orig-

inal scenario where the ironic statements available

into the dataset are strongly characterized by the irony

term. In order to evaluate the generalization abilities

of the proposed models in a real and more complex

scenario, where the term irony is not explicitly avail-

able into the ironic statements, we evaluated the per-

formance in the following simulated scenario.

We report some additional results to compare the

proposed approaches with respect to some related

works (supervised) on the same dataset used for the

experimental investigation. In particular, the bench-

mark corpus exploited for training and inference TIM

and TIM+WE has been previously adopted also in

(Barbieri and Saggion, 2014) and (Hern

andez-Far

ıas

et al., 2015) (only in a balanced settings for the orig-

inal scenario). The results reported in terms of F-

Measure by the original authors are shown in Table 2.

Table 2: Results in terms of F-Measure of the proposed

models against the state of the art approaches.

Irony vs.

Education Humour Politics

(Reyes et al., 2013) 0,70 0,76 0,73

(Barbieri and Saggion, 2014) 0,73 0,75 0,75

(Hern

andez-Far

ıas et al., 2015)

0,78 0,75 0,79

(Hern

andez-Far

ıas et al., 2015)

0,78 0,79 0,79

TIM 0,85 0,83 0,88

TIM+WE 0,84 0,80 0,87

This ﬁnal comparison clearly highlights the contribu-

tion that the proposed models are able to provide. Not

only TIM and TIM+WE perform signiﬁcantly better

than the state of the art models, but it is even more

remarkable that they perform better although their na-

ture is completely unsupervised.

Simulated Scenario (S). We report in the following

the computational results on the simulated scenario,

where the ironic ﬁgurative language is not explicitly

marked in the dataset, but embedded in to the sen-

tences. In Table 4 the results in terms of precision,

recall and F-measure are reported distinguishing be-

tween ironic (+) and not-ironic (-) classes, together

with the global Accuracy measure.

As expected, the recognition performance of TIM

and TIM+WE decrease, compared to the original sce-

nario (see Table3), once the term irony is removed

from the corpus. However, in this case where the

In this experiment, the authors used the Lesk similarity

measure.

In this experiment, the authors used the Wu-Palmer

similarity measure.

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval

Table 3: Results of our framework for each binary problem (O).

P (+) R (+) F (+) P (-) R (-) F (-) Accuracy

irony vs education

TIM 0,8225 0,8746 0,8477 0,8664 0,8116 0,8380 0,8430

TIM + WE 0,8228 0,8629 0,8423 0,8566 0,8146 0,8350 0,8388

irony vs politics

TIM 0,9127 0,8560 0,8834 0,8644 0,9183 0,8905 0,8871

TIM + WE 0,9131 0,8373 0,8735 0,8498 0,9204 0,8836 0,8788

irony vs humour

TIM 0,8414 0,8174 0,8292 0,8227 0,8461 0,8342 0,8318

TIM + WE 0,8142 0,7832 0,7983 0,7911 0,8214 0,8059 0,8022

Table 4: Results of our framework for each binary problem (S).

P (+) R (+) F (+) P (-) R (-) F (-) Accuracy

irony vs education

TIM 0,7996 0,7934 0,7964 0,7958 0,8022 0,7989 0,7977

TIM + WE 0,8050 0,8103 0,8075 0,8098 0,8046 0,8070 0,8073

irony vs politics

TIM 0,8719 0,8358 0,8534 0,8426 0,8775 0,8596 0,8567

TIM + WE 0,8780 0,8420 0,8596 0,8485 0,8833 0,8655 0,8627

irony vs humour

TIM 0,7356 0,7675 0,7510 0,7574 0,7247 0,7405 0,7460

TIM + WE 0,7205 0,8392 0,7752 0,8079 0,6752 0,7354 0,7570

presence of irony is not explicitly pointed out, a lex-

icon able to boost TIM and therefore the recognition

performance of ironic messages becomes beneﬁcial.

By analysing all the performance measures, it is clear

that the introduction of WE derived-lexicon allow the

probabilistic model TIM+WE to achieve better results

than simple TIM. Also in this experimental settings,

we can remark that the two proposed models are able

to obtain Precision and Recall performance of simi-

lar magnitude, highlighting robust performance in this

complex scenario.

4.2.2 Unbalanced Dataset

Original Scenario (O). In order to compare the

proposed framework with the state of the art on irony

detection, we reported in Figure 3 the results ob-

tained by TIM and TIM+WE with two supervised ap-

proaches, i.e. the irony model presented in (Reyes

et al., 2013) and the ensemble approach introduced in

(Fersini et al., 2015). First of all, TIM and TIM+WE

are able perform better than a trivial classiﬁer that

would ensure 70% of accuracy. Furthermore, we

can point out that both proposed unsupervised mod-

els achieve remarkable results compared to the super-

vised ones. In particular, we can highlight that both

Figure 3: Comparison of TIM and TIM+WE with super-

vised state of the art methods on the unbalanced dataset.

TIM and TIM+WE are able to obtain higher recog-

nition performance than the supervised irony model

introduced in (Reyes et al., 2013). When compar-

ing the proposed models to the ensemble presented in

(Fersini et al., 2015), we can point out that TIM and

TIM+WE are not so far from the supervised model.

Considering that our method is fully unsuper-

vised, we can state that the proposed models are

promising. Extended results of our framework are

shown in Table 5. Similar to the balanced experi-

mental settings on the original scenario, the contri-

bution of WE does not generally improve the perfor-

mance of TIM, still remaining comparable. Again,

the better performance obtained by TIM with respect

to TIM+WE is related to the dataset composition,

where more than 36% of the ironic textual messages

contains the word irony.

Simulated Scenario (S). In the following, we re-

port the computational results on the simulated sce-

nario, where irony meaning is embedded in the sen-

tences with no reference to the term irony. We re-

port in Table 6 the behavior of both proposed models.

Similar to the previous balanced case study, the recog-

nition performance of TIM and TIM+WE decrease,

compared to the original scenario (see Table 5), once

the term irony is removed from the corpus. However,

in this context we can grasp even more the contri-

bution of WE. In a more complex and real scenario,

where the ratio of ironic and not ironic messages is

low and the ironic orientation in a sentence can be de-

rived only by the surrounding context, TIM+WE is

able to provide a valuable contribution to bridge the

semantic gap. If we analyse in details Precision and

Recall of both models, we can derive two main obser-

vations:

Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings

Table 5: Results of our framework on the unbalanced dataset (O).

P (+) R (+) F (+) P (-) R (-) F (-) Accuracy

TIM 0,7543 0,6408 0,6929 0,8861 0,9305 0,9078 0,8581

TIM + WE 0,7122 0,6478 0,6784 0,8862 0,9128 0,8993 0,8466

Table 6: Results of our framework on the unbalanced dataset (S).

P (+) R (+) F (+) P (-) R (-) F (-) Accuracy

TIM 0,4406 0,5902 0,5044 0,8464 0,7507 0,7957 0,7107

TIM + WE 0,5320 0,4958 0,5132 0,8361 0,8550 0,8455 0,7654

• TIM and TIM+WE, although induced in the worst

scenario where the dataset is imbalanced and

lacks explicit reference to irony, are able to per-

form better than a trivial classiﬁer that would en-

sure 70% of accuracy. This makes the proposed

models particularly suitable for real world appli-

cations.

• TIM+WE obtains Precision and Recall of the

same magnitude both for the ironic class (0,5320

for P(+) and 0,4958 for R(+)) and not ironic class

(0,8361 for P(-) and 0,8550 for R(-)), compared to

TIM which obtains a poor trade-off between the

two performance measures (0,4406 for P(+) and

0,5902 for R(+), and 0,8464 for P(-) and 0,7507

for R(-)). This suggests that TIM+WE has good

predictive performance characterized by well pro-

portioned abilities both in terms of precision and

recall on both ironic and not-ironic orientations.

4.3 Topic Detection Results

In order to perform a qualitative analysis of the ob-

tained results, we report in the following some ex-

amples of discovered ironic and not ironic topics. In

particular, Table 7 shows a sample of the topics un-

derlying ironic and not-ironic messages derived in the

original scenario and in a balanced settings by TIM.

In Table 8, the same output is shown for TIM+WE in

the simulated scenario again in a balanced settings.

As general remark, the experimental results sug-

gests that the proposed Topic-Irony Model may not

only help the irony classiﬁcation step, but also the

ability to identify the underlying topics. In fact, the

considered topics are well distinguished by looking

at most relevant keywords identiﬁed by the proposed

approach, still maintaining a good characterization of

ironic and not ironic orientations. For instance, the

sentence @user Deeper irony would be Sarah Palin

campaigning for literacy” is correctly classiﬁed as

ironic and properly related to the topic Politics.

A further remark concerns TIM+WE, and in par-

ticular to its ability to deal with short and noisy text.

The fact that social network text is composed of few

words poses considerable problems when applying

traditional probabilistic topic models. These models

typically suffer from data sparsity to estimate robust

word co-occurrence statistics when dealing with short

and ill- formed text. The proposed model is able to re-

duce the negative impact of short and noisy text in real

and complex scenarios thanks its ability to take ad-

vantage of distributed representation derived through

word embeddings.

TIM+WE is therefore particularly suitable for

dealing with those topic-related ironic sentences

where the ironic orientation is not explicitly available.

Table 7: Topic-related words are reported in bold, while the irony-related ones are marked as underlined . These results are

related to TIM in the original scenario (O) and the balanced settings.

humour(-) humour(+) politics(-) politics(+) education(-) education(+)

funny unions tcot irony technology irony

posemoticon workers politics oh education linux

shoy beneﬁt obama get new org

award always news lol apple microsoft

nominate cd p like google open

lol movies gop u school tsunami

humor labor tlot people news attack

jokes porn teapay day ipad creates

joke fox us one posemoticon sponsors

q tv palin love twitter openmainframe

comedy news iran common via gnu

quote playboy pay got ac religion

like weed sgp time iphone ban

get marijuana iranelection posemoticon edtech thought

one cannabis hcr see web dilemma

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval

Table 8: Topic-related words are reported in bold, while the irony-related ones are marked as underlined . These results are

related to TIM+WE in the simulated scenario (S) and the balanced settings.

humour(-) humour(+) politics (-) politics(+) education (-) education (+)

funny quote tcot oh technology common

posemoticon popular obama u education postrank

shoy love politics lol new education

award palin news get apple health

nominate blind p like google nowplaying

lol lingerie gop day news make

humor vote tlot posemoticon school lol

jokes quickpolls us one twitter ﬂaker

joke anonymous teapay people ipad cholesterol

q com pay got via video

comedy voteglobal iran common posemoticon man

quote gotpolitics sgp love ac difference

like politics hcr yet iphone causes

one friends iranelection politics one sense

get barbie health time edtech ﬁction

An instance of its ability can be grasped by the fol-

lowing sentence “catching up on news... see that Pres.

Obama’s aunt is in the news again, and that she said

she loves Pres. Bush.”, where the model correctly

classiﬁes the statement as Politics and recognizes as

ironic (even if the ironic orientation is not explicitly

marked in the text).

5 CONCLUSION

In this paper, we proposed an unsupervised gener-

ative model for topic-irony detection, enriched with

a neural language lexicon derived through word em-

beddings. The proposed model has been shown to

achieve remarkable results, signiﬁcantly outperform-

ing existing supervised models currently available in

the state of the art.

Concerning the future work, two main research

directions will be investigated to improve the gener-

alization abilities of the proposed generative model.

First, we would like to overcome the limitation related

to the word independence assumption by introducing

latent relationships that could exist among different

terms and/or sentences. Second, we would like to

model parameter switching when dealing with ironic

and not ironic statements, in order to set the different

level of importance of seed words according to each

modeled class.

REFERENCES

Bamman, D. and Smith, N. A. (2015). Contextualized sar-

casm detection on twitter. In Proceedings of the 9th

International AAAI Conference on Web and Social

Media, pages 574–77.

Barbieri, F. and Saggion, H. (2014). Modelling irony in

twitter. In Proceedings of the Student Research Work-

shop at the 14th Conference of the European Chap-

ter of the Association for Computational Linguistics,

pages 56–64.

Bengio, Y., Schwenk, H., Sen

ecal, J.-S., Morin, F., and

Gauvain, J.-L. (2006). Neural probabilistic language

models. In Innovations in Machine Learning: Theory

and Applications, pages 137–186. Springer.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent

dirichlet allocation. The Journal of Machine Learning

Research, 3:993–1022.

Blitzer, J., Dredze, M., and Pereira, F. (2007). Biographies,

bollywood, boom-boxes and blenders: Domain adap-

tation for sentiment classiﬁcation. In Association for

Computational Linguistics, volume 7, pages 440–447.

Bosco, C., Patti, V., and Bolioli, A. (2013). Developing

corpora for sentiment analysis: The case of irony and

senti-tut. IEEE Intelligent Systems, 28(2):55–63.

Colston, H. and Gibbs, R. (2007). A brief history of irony.

In Irony in language and thought: A cognitive science

reader, pages 3–21. Lawrence Erlbaum Assoc Incor-

porated.

Davidov, D., Tsur, O., and Rappoport, A. (2010). Semi-

supervised recognition of sarcastic sentences in twitter

and amazon. In Proceedings of the 14th Conference

on Computational Natural Language Learning, pages

107–116. Association for Computational Linguistics.

Edward, P. C. and Connors, R. (1971). Classical rhetoric

for the modern student.

Esuli, A. and Sebastiani, F. (2006). Sentiwordnet: A pub-

licly available lexical resource for opinion mining. In

Proceedings of the 5th Conference on Language Re-

sources and Evaluation, volume 6, pages 417–422.

Citeseer.

Fersini, E., Pozzi, F. A., and Messina, E. (2015). Detect-

ing irony and sarcasm in microblogs: The role of ex-

pressive signals and ensemble classiﬁers. In Proceed-

ings of IEEE International Conference on Data Sci-

ence and Advanced Analytics, pages 1–8. IEEE.

Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings

Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barn-

den, J., and Reyes, A. (2015a). Semeval-2015 task

11: Sentiment analysis of ﬁgurative language in twit-

ter. In Proceedings of the 9th International Workshop

on Semantic Evaluation, pages 470–478.

Ghosh, D., Guo, W., and Muresan, S. (2015b). Sarcas-

tic or not: Word embeddings to predict the literal or

sarcastic meaning of words. In Proceedings of the

2015 Conference on Empirical Methods in Natural

Language Processing, pages 1003–1012.

Gonz

alez-Ib

anez, R., Muresan, S., and Wacholder, N.

(2011). Identifying sarcasm in twitter: a closer look.

In Proceedings of the 49th Annual Meeting of the As-

sociation for Computational Linguistics: Human Lan-

guage Technologies: short papers-Volume 2, pages

581–586. Association for Computational Linguistics.

Hern

andez-Far

ıas, I., Bened

ı, J.-M., and Rosso, P. (2015).

Applying basic features from sentiment analysis for

automatic irony detection. In Pattern Recognition and

Image Analysis, pages 337–344. Springer.

Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y.

(2012). Improving word representations via global

context and multiple word prototypes. In Proceed-

ings of the 50th Annual Meeting of the Association

for Computational Linguistics: Long Papers-Volume

1, pages 873–882. Association for Computational Lin-

guistics.

Jijkoun, V., de Rijke, M., and Weerkamp, W. (2010). Gen-

erating focused topic-speciﬁc sentiment lexicons. In

Proceedings of the 48th Annual Meeting of the Associ-

ation for Computational Linguistics, pages 585–594.

Association for Computational Linguistics.

Jo, Y. and Oh, A. H. (2011). Aspect and sentiment uniﬁ-

cation model for online review analysis. In Proceed-

ings of the 4th ACM International Conference on Web

Search and Data Mining, WSDM ’11, pages 815–824,

New York, NY, USA. ACM.

Kaji, N. and Kitsuregawa, M. (2007). Building lexicon

for sentiment analysis from massive collection of html

documents. In Proceedings of the 2007 Joint Confer-

ence on Empirical Methods in Natural Language Pro-

cessing and Computational Natural Language Learn-

ing, pages 1075–1083, Prague, Czech Republic. As-

sociation for Computational Linguistics.

Katz, A. N., Colston, H., and Katz, A. (2005). Dis-

course and sociocultural factors in understanding non-

literal language. In Figurative language comprehen-

sion: Social and cultural inﬂuences, pages 183–207.

Lawrence Erlbaum Associates, Inc. Mahwah, NJ.

Lin, C. and He, Y. (2009). Joint sentiment/topic model for

sentiment analysis. In Proceedings of the 18th ACM

Conference on Information and Knowledge Manage-

ment, pages 375–384. ACM.

Lu, Y., Castellanos, M., Dayal, U., and Zhai, C. (2011).

Automatic construction of a context-aware sentiment

lexicon: An optimization approach. In Proceedings

of the 20th International Conference on World Wide

Web, pages 347–356. ACM.

McDonald, S. (1999). Exploring the process of inference

generation in sarcasm: A review of normal and clini-

cal studies. Brain and Language, 68(3):486–506.

Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C. (2007).

Topic sentiment mixture: modeling facets and opin-

ions in weblogs. In Proceedings of the 16th Inter-

national Conference on World Wide Web, pages 171–

180. ACM.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space. CoRR, abs/1301.3:1–12.

Mohammad, S., Dunne, C., and Dorr, B. (2009). Generat-

ing high-coverage semantic orientation lexicons from

overtly marked words and a thesaurus. In Proceedings

of the 2009 Conference on Empirical Methods in Nat-

ural Language Processing: Volume 2-Volume 2, pages

599–608. Association for Computational Linguistics.

Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004).

Wordnet::similarity: Measuring the relatedness of

concepts. In Demonstration Papers at HLT-NAACL

2004, HLT-NAACL–Demonstrations ’04, pages 38–

41, Stroudsburg, PA, USA. Association for Computa-

tional Linguistics.

cek, T., Habernal, I., and Hong, J. (2014). Sarcasm

detection on czech and english twitter. In Proceed-

ings of the 25th International Conference on Com-

putational Linguistics: Technical Papers, pages 213–

223, Dublin, Ireland. Dublin City University and As-

sociation for Computational Linguistics.

Rajadesingan, A., Zafarani, R., and Liu, H. (2015). Sarcasm

detection on twitter: A behavioral modeling approach.

In Proceedings of the 8th ACM International Confer-

ence on Web Search and Data Mining, pages 97–106.

ACM.

Rao, D. and Ravichandran, D. (2009). Semi-supervised po-

larity lexicon induction. In Proceedings of the 12th

Conference of the European Chapter of the Associ-

ation for Computational Linguistics, pages 675–682.

Association for Computational Linguistics.

Reyes, A. and Rosso, P. (2014). On the difﬁculty of au-

tomatically detecting irony: beyond a simple case

of negation. Knowledge and Information Systems,

40(3):595–614.

Reyes, A., Rosso, P., and Veale, T. (2013). A multidimen-

sional approach for detecting irony in twitter. Lan-

guage resources and evaluation, 47(1):239–268.

Riloff, E., Qadir, A., Surve, P., De Silva, L., Gilbert, N.,

and Huang, R. (2013). Sarcasm as contrast between a

positive sentiment and negative situation. In Proceed-

ings of the 2013 Conference on Empirical Methods in

Natural Language Processing, pages 704–714. Asso-

ciation for Computational Linguistics.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986).

Learning representations by back-propagating errors.

Nature, 323:533–536.

Turian, J., Ratinov, L., and Bengio, Y. (2010). Word rep-

resentations: a simple and general method for semi-

supervised learning. In Proceedings of the 48th An-

nual Meeting of the Association for Computational

Linguistics, pages 384–394. Association for Compu-

tational Linguistics.

Weitzel, L., Prati, R. C., and Aguiar, R. F. (2016). The

Comprehension of Figurative Language: What Is the

Inﬂuence of Irony and Sarcasm on NLP Techniques?,

pages 49–74. Springer International Publishing.

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval