Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification

Elias de Oliveira, Henrique Gomes Basoni, Marcos Rodrigues Saúde and Patrick Marques Ciarelli

Programa de Pós-Graduação em Informática, Universidade Federal do Espírito Santo, Vitória, Brazil

Programa de Pós-Graduação em Engenharia Elétrica, Universidade Federal do Espírito Santo, Vitória, Brazil

Keywords:

Text Classification, Social Network, Text Mining.

Abstract:

The classification problem has gained new importance with the growing value attributed to social media such as Twitter. The huge number of small documents to be organized into subjects is challenging the resources and techniques that have been used so far. Furthermore, today more than ever, personalization is the most important feature that a system needs to exhibit. The goal of many online systems, which are available in many areas, is to address the needs or desires of each individual user. To achieve this goal, these systems need to be more flexible and faster in order to adapt to the user's needs. In this work, we explore a variety of techniques with the aim of better classifying a large Twitter data set according to a user's goal. We propose a methodology in which we cascade an unsupervised technique followed by a supervised one. For the unsupervised technique we use standard clustering algorithms, and for the supervised technique we propose the use of a kNN algorithm and a Centroid-Based Classifier to perform the experiments. The results are promising because we reduced the amount of work to be done by the specialists and, in addition, we were able to mimic the human assessment decisions with an F1-measure of 0.7907.

1 INTRODUCTION

Social media presents us with a wealth of user information that is valuable for market analysis, event planning, product monitoring and much more. However, the challenge is still to deal with all this information at once and unveil its hidden semantic layers.

Twitter may be one of the most studied social networks at present. The most common approach to working with Twitter data is to collect a number of tweets from Twitter's API based on some given keywords or previously known hashtags (Bruns and Liang, 2012; Gundecha and Liu, 2012). We choose the hashtags, or keywords, that encompass the subjects we are interested in studying. Nevertheless, using solely these tools to find and understand the messages conveyed by the target audience is not good enough, due to hashtag hijacking (Hadgu et al., 2013) and the variety of viewpoints within a community, among other problems. Hence, traditional subject text classification plays an important role in the organization of this type of short document. In fact, the huge number of small documents to be organized into subjects is challenging the resources and techniques that have been used so far (Sebastiani, 2002; Berry, 2003).

In addition, tweets differ from traditional documents in that users are forming their own linguistic tribes (Bryden et al., 2013). Sometimes no clear formal rule is applied when people express themselves in these languages. Usually, in these cases, the meaning is grasped by association, by human inference from the context, or only by individuals within the communities. This is one of the major problems when dealing with tweets if we are interested in having the message processed, and understood, by machines.

As a consequence, some researchers still struggle with great manual effort to classify their data sets when they are interested in a more realistic meaning of the messages being analyzed.

In this work, we introduce a combination of two strategies that are usually used separately. We propose applying a clustering and re-clustering process over the entire data set so that a user can get a quick overview of its content. Given an overview of the structure of the data space, the user can more easily decide on the classes which s/he wants to observe closely from that point on. At this point the user can further structure the data set by assigning labels to samples. The label assignment process is guided by our proposed procedure, which tries to minimize the effort of building a good training sample for the classification process that follows.

de Oliveira E., Gomes Basoni H., Rodrigues Saúde M. and Marques Ciarelli P.
Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification.
DOI: 10.5220/0005159304650472
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2014), pages 465-472
ISBN: 978-989-758-048-2
2014 SCITEPRESS (Science and Technology Publications, Lda.)

This work is organized as follows. We present the general problem and its context in Section 2. In Section 3, some related works are briefly reviewed. In Section 4, we describe how the experiments were performed and the results obtained with the group of strategies we used to achieve a high level of accuracy. The conclusions are then presented in Section 5.

2 THE PROBLEM DESCRIPTION

In order to mine what people are effectively saying within an event mediated by social media, many researchers have resorted to manually classifying their data sets according to some subjects. This was the case when we decided to analyze a data set of the Brazilian national discussion regarding the Marco Civil for the internet.

The discussion of a Marco Civil for the internet by the Brazilian parliament began on October 27th, 2009, together with the creation of the hashtag #MarcoCivil and the @MarcoCivil Twitter profile. The discussion proceeded at a very slow pace from the beginning. The leak by Edward Snowden that the U.S. government had obtained unauthorized confidential information about some foreign governments motivated Brazilian politicians to intensify the discussion about the implementation of rules for the use of the internet in Brazil. Due especially to this event, the number of comments on social media increased greatly. Many people started expressing their opinions via Twitter, for instance. In light of that, we collected a data set of messages within the period from August 2012 to December 2013. We searched the Twitter data stream via the keyword "marco civil" and any hashtag containing the substring marcocivil.

(Footnote: Edward Snowden is a former employee of the National Security Agency.)

There are several opinions about this theme. In order to better address the social problem, the government and politicians need to understand each class of demands in order to work towards a social consensus. Considering that a good portion of society was using Twitter to express its true opinions, this medium can be used as a good sample of the population's opinions. Nevertheless, we still need to read and manually label some of these opinions according to our own understanding so that the machine can later imitate our way of organizing the information. Note that each group of analysts can have its own objectives and, therefore, can label the same data set differently. We argue that although one can use predefined classes to classify tweet messages (Sriram et al., 2010), such strategies are not always accurate with regard to the user's needs.

The problem we are interested in solving is the following: given an unlabeled data set Ω ⊂ D, work with the specialist sample by sample, asking them to label these samples. The goal at this point is to minimize the number of steps needed to gather a set of good labeled examples, in order to provide the user with suggestions, as precise as possible, for the classification of what is left in the data set. Figure 1 depicts our combined model for minimizing the amount of work needed when one wants to organize a data set according to the user's subjects within it.

Figure 1: Clustering-Classification combined model for minimizing the effort of classifying a large data set of tweets.

The general idea is that we initially have $D$, the domain of documents. From this domain we work with $S_i$, a subset of $D$. Every document $d_j \in S_i$ is preprocessed and represented as a vector of features; therefore $S_i = \{d_1, \ldots, d_n\}$ is our given sample data set. The process is such that the user assigns labels to each of $\{S_1, \ldots, S_p\}$ until the user is satisfied with the homogeneity within each cluster. At this point, $\bigcup_{i=1}^{p} S_i = \Omega$, where $i = 1, 2, \ldots, l, \ldots, p$, $p \le n$, and we expect to have $|\Omega| \ll |D|$.

We present each part of this model in the

following subsections.

KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval


2.1 Text Clustering

The Text Clustering Problem is usually defined as the task of identifying natural groupings of texts, or documents $d_i$. This process is usually carried out on the basis of their extracted features (Jain et al., 1999; Everitt et al., 2011). In other words, given a finite set of documents, this multidimensional problem is to cluster similar objects together. Because it is difficult to define what a good clustering is (Kleinberg, 2002), we put the user in the loop of our clustering process, as shown in Figure 1. The user thereby decides how many groups are necessary to represent their needs.

After the first step of clustering $D$, a sample $S_i$ is given to the user, who assigns labels of their interest. With this new input, our system can improve its clustering result by taking into account what the user wants as a grouping result. During this process, different similarity measures can be tested in order to be more coherent with what the user intends. Hence, a new clustering step can be carried out until the user is satisfied with the groupings (Vens et al., 2013).

The loop between the building of each sample $S_i$ and the decision to re-cluster, which is a user decision, aims to make each cluster as homogeneous as the user intends. Therefore, this is a continuous process of convergence guided by the expert according to their interest and the quality they require.

2.2 Text Classiﬁcation

The Text Classification Problem is usually defined as the task of assigning labels from a predefined set of classes to unclassified documents (Baeza-Yates and Ribeiro-Neto, 2011; Sebastiani, 2002).

Let $D$ be the domain of documents, $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ a set of predefined classes, and $\Omega = \{d_1, \ldots, d_{|\Omega|}\}$ an initial corpus of documents previously classified manually by a domain expert into subsets of categories of $\mathcal{C}$. In the machine learning process, the training(-and-validation) set $TV = \{d_1, \ldots, d_{|TV|}\}$ contains documents, each associated with its respective label $c_j \in \mathcal{C}$. $TV$ is used to train and validate (i.e., to tune eventual parameters of) a classification system that associates the appropriate combination of classes with the characteristics of each document in $TV$. The test set $Te = \{d_{|TV|+1}, \ldots, d_{|\Omega|}\}$, conversely, contains documents for which the categories are unknown to the classification system. After being (tuned and) trained with $TV$, the classification system is used to predict the set of classes of each document in $Te$.

To statistically validate the experiments, we apply k-fold validation tests. We divide the $|\Omega|$ documents into k parts; in each experimental run one part is used as Te and the other k − 1 parts are used as TV. In total, k experiments are performed, each using a different part as Te.
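The k-fold procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `k_fold_splits`, the fixed random seed, and the round-robin way of cutting the shuffled documents into parts are all assumptions made for the example.

```python
import random


def k_fold_splits(documents, k, seed=0):
    """Yield (tv, te) pairs; each of the k parts serves exactly once as the test set Te."""
    if not 2 <= k <= len(documents):
        raise ValueError("k must be between 2 and the number of documents")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    # Part i takes every k-th document starting at offset i,
    # so part sizes differ by at most one document.
    parts = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        te = parts[i]
        # The remaining k - 1 parts together form the training-and-validation set TV.
        tv = [d for j, part in enumerate(parts) if j != i for d in part]
        yield tv, te
```

Calling `list(k_fold_splits(range(10), 5))` would give five runs, each with 8 documents in TV and 2 in Te. Note that this sketch does not balance classes across folds; a stratified variant would be needed for that.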

There are many ways to evaluate a text classifier system. The classical approach is to take a binary function $F : D \times \mathcal{C} \to \{0, 1\}$ that assigns a value of 1 when the document $d_i$ belongs to the class $c_j$, where $[d_i, c_j] \in \{D \times \mathcal{C}\}$, and 0 otherwise.

3 RELATED WORKS

Usually, the strategies used to collect and analyze social media events, in particular from Twitter, are based on a careful selection of the hashtags and keywords that should be chosen in advance (Bruns and Liang, 2012; Makazhanov et al., 2014) in order to monitor the movements we are interested in following. This methodology has been shown, in certain cases, to be not good enough to capture the depth of what actually happens within social movements (Hadgu et al., 2013). To address this, Sriram et al. (2010) present a strategy to deal with the problem of mining this social medium by actually classifying the short messages passed on by users through tweets.

Sriram et al. (2010) propose an approach to classify incoming tweets into predefined categories. They consider the following categories: News, Events, Opinions, Deals, and Private Messages. To achieve their goal, they used only 8 types of features within the tweets. The first feature was 1) authorship. They claim, based on empirical results, that authorship plays a crucial role in classification. In fact, it is reasonable to assume that authors identify themselves with a few specific subjects. The other features were 2) presence of shortened words and slang, 3) time-event phrases, 4) opinionated words, 5) emphasis on words, 6) currency and percentage signs, 7) @username at the beginning of the tweet, and 8) @username within the tweet. They show experimentally that their approach improves accuracy and outperforms the traditional Bag-of-Words strategy, with an average improvement of 32.1% over Bag-of-Words.

The work presented in (Kyriakopoulou and Kalamboukis, 2007) is the basis for our work. Their goal is to explore the intrinsic information unveiled by a first clustering phase, applied over the whole data set (both training and testing examples), to improve the second phase of classification. Their experimental results showed that, for all the collections they tested, their clustering approach combined with two versions of an SVM classifier outperformed the standard SVM classifier without the clustering phase. The best average improvement reported was 6.6% when the SVM classifier is used with clustering and 3.2% when the transductive SVM classifier is used. Although we can also observe improvements in classification accuracy in our experiments, our main goal is different: to apply our combined approach to build a good yet reduced labeled sample for the training data set.

In the next section, we show some of the results of our strategy over a Brazilian tweets data set. We first discuss the results without the clustering phase, and then with the clustering phase as a process for forming a training set for the subsequent classification phase, which uses the kNN and CBC algorithms.

4 EXPERIMENTS AND RESULTS

We collected tweets from August 2012 until December 2013, gathering a total of 21,000 tweets in 2012 and 110,000 in 2013. To do so, we searched for any hashtag containing the substring marcocivil. After removing all identical tweets and some tweets that were unreadable due to problems during the collection process, we ended up with 2,080 tweets.

Each tweet was manually classified under 2 meta-categories long before we carried out the experiments discussed in this work. The first meta-class was named Political Positioning, which assigns a tweet message to one of 3 classes: Neutral, Progressive, and Conservative. Tweets whose messages are not clear enough with regard to political positioning were assigned to the Neutral class. Those whose messages were clearly in favor of broadening and deepening the discussions were assigned to the Progressive class. All messages that were against any change to the current legislation were assigned to the Conservative class. The results for this meta-facet of the data set are presented under the name Marco Civil I.

Under the name Marco Civil II we refer to the second meta-class, Opinion. The goal is now to assign a tweet message to one of 9 classes: Alert, Antagonism, Support, Compliance, Explanation, Indignation, Information, Mobilization, and Note. Alert is a class that aggregates all the tweets which draw people's attention to the evolution of the discussion within the parliament. For instance, in the following tweet the user points out that politicians are trying to include a nonsense subject (copyright) in the core of the Marco Civil project, a strategy usually used to postpone the main point of a discussion:

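@penas – Vão usar o Marco Civil da Internet para defender o copyright. E querem votar hoje! Nem pensar, esse assunto precisa sair fora!
[English: They are going to use the Marco Civil da Internet to defend copyright. And they want to vote today! No way, this subject needs to be dropped!]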

The Antagonism class gathers the messages in opposition to the approval of the Marco Civil project. In the following example, the user says we do not need the government telling us what we can or cannot do on the internet:

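@RobertoElleryJr – Eu apoio a campan[h]a contra o marco civil na internet. não precisamos do governo nos dizendo o que fazer
[English: I support the campaign against the Marco Civil on the internet. We do not need the government telling us what to do]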

In contrast to the previous class, the Support class represents those tweets in favor of both the discussion and the approval of the Marco Civil project. The Compliance class contains messages that show sympathy towards the project but do not openly support an official legislation on the matter. Some people posted messages mainly to comment on and analyze the evolution of the discussions about the project. We assigned these messages to the class Note. Although very similar to the previous class, the class Information gathers those tweets which share some sort of news about the project with the community, rather than a personal opinion. All tweets which explain what the Marco Civil project is, the legislation proposals and their consequences were grouped in the class Explanation. The Indignation class stands for those users who are against the attitude of the press, the way the deputies postponed the voting in the parliament, and essentially the lack of any kind of legislation about the use of the internet in Brazil. Finally, the class Mobilization gathers those messages which try to bring people to participate and engage in the movement. The following is a tweet calling on people to send messages to their deputies on the Marco Civil Special Committee in the parliament:

@idec – Envie uma mensagem agora aos deputados da Comissão Especial do Marco Civil! http://t.co/kslJpTOh
[English: Send a message now to the deputies of the Marco Civil Special Committee!]

In Table 1, we show the characterization of both points of view of the same Marco Civil data set.

Table 1: Characterization of the data sets used in the experiments.

Data set          ASDC (x)   ASCC       ASPC (y)   Ratio (y/x)
Marco Civil (I)   0.561809   0.989217   0.967745   1.722553
Marco Civil (II)  0.568465   0.962814   0.918002   1.614879

ASDC is the Average Similarity between the Documents of a class and their respective Centroids.

On the one hand, the values in Table 1 show that the tweets of the same class are spatially well separated from one another, given the low ASDC value. On the other hand, ASCC is the Average Similarity between the Centroid of each class and the main Centroid, and the ASCC value is high, close to the maximum value. Hence, we can say that the centroids of the classes are very close to the main centroid. ASPC is the Average Similarity between Pairs of Centroids. The high value of ASPC indicates that the classes overlap, causing high y/x ratios. One can, therefore, conclude that the tweets of all categories are spatially quite mixed, which complicates the classification of the tweets within this data set.
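The three measures in Table 1 can be illustrated with a small sketch. It rests on assumptions that the text does not spell out: similarity is taken to be the cosine, the main centroid is taken to be the mean of all document vectors, and ASDC is averaged over all documents rather than per class. The names `cosine`, `centroid` and `characterize` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def characterize(classes):
    """classes maps a label to its document vectors (at least two classes).

    Returns (ASDC, ASCC, ASPC) under the assumptions stated above.
    """
    cents = {label: centroid(docs) for label, docs in classes.items()}
    all_docs = [d for docs in classes.values() for d in docs]
    main = centroid(all_docs)  # assumption: main centroid = mean of all documents
    # ASDC: each document against the centroid of its own class.
    asdc = sum(cosine(d, cents[label])
               for label, docs in classes.items() for d in docs) / len(all_docs)
    # ASCC: each class centroid against the main centroid.
    ascc = sum(cosine(c, main) for c in cents.values()) / len(cents)
    # ASPC: every unordered pair of class centroids.
    pairs = list(combinations(cents.values(), 2))
    aspc = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return asdc, ascc, aspc
```

The ratio y/x reported in the table would then simply be `aspc / asdc`.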

4.1 Clustering Experiments

The objective of the clustering phase is to help the user label their data set in as few steps as possible. At this phase the user sets a threshold ρ for the average similarity between pairs of elements within each cluster.

We adopted a very naive but, for our problem, very effective strategy. We used the CLUTO Clustering Toolkit (Karypis, 2002) with a divisive clustering algorithm based on repeated bisections. For each cluster yielded under the chosen ρ value, we asked the user to assign labels to the most dissimilar pair of elements. If the same label is given to both elements, we assign this label to the remaining elements of the cluster, forming a subset $S_i$. Otherwise, we set this cluster apart and recursively treat it as if it were a new data set itself.
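The labeling strategy just described can be sketched in a few lines. This is an illustrative reading, not the authors' code: `cluster_fn` is a hypothetical stand-in for the clustering tool (it would apply the ρ threshold internally and must return non-empty groups), `ask_user` stands in for the human specialist, and the fallback of asking for every label when a cluster has fewer than three elements or cannot be split further is an assumption added so that the loop always terminates.

```python
from itertools import combinations


def label_by_clusters(items, cluster_fn, similarity, ask_user):
    """Return ({item_index: label}, number_of_labels_requested).

    cluster_fn(indices) -> list of index lists  (hypothetical clustering step)
    similarity(x, y)    -> float                (higher means more similar)
    ask_user(index)     -> label                (hypothetical human oracle)
    """
    labels, asked = {}, 0
    pending = [list(range(len(items)))]
    while pending:
        data = pending.pop()
        clusters = cluster_fn(data)
        for cluster in clusters:
            if len(cluster) < 3 or len(clusters) == 1:
                # Assumed fallback: nothing left to propagate, or no further split.
                for i in cluster:
                    labels[i] = ask_user(i)
                asked += len(cluster)
                continue
            # The most dissimilar pair is the pair with the lowest similarity.
            a, b = min(combinations(cluster, 2),
                       key=lambda p: similarity(items[p[0]], items[p[1]]))
            label_a, label_b = ask_user(a), ask_user(b)
            asked += 2
            if label_a == label_b:
                for i in cluster:
                    labels[i] = label_a  # propagate the agreed label
            else:
                pending.append(cluster)  # treat the cluster as a new data set
    return labels, asked
```

In this sketch the pairwise search is quadratic in the cluster size, which is acceptable for a data set of about two thousand tweets but would need a cheaper heuristic for much larger collections.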

Applying this strategy, we carried out some experiments. In Table 2, we show some values of ρ and their impact on the amount of work passed on to the user, together with avgNC, the average number of clusters generated at each step of our strategy. Note that for each generated cluster we ask the user to assign a pair of labels. That is to say, given that 30 steps were necessary to cover the whole data set when ρ is 0.8 for Marco Civil I, the user is asked to assign 2 × 7 × 30 = 420 labels, plus 1,111 labels for tweets which could not be aggregated into any cluster. This and other results are depicted in Figure 4. In this case, our error is on average 15.71%; in other words, fewer than 327 tweets within the data set received an incorrect label. In the second case, even if we relax the value of ρ down to 0.6 for the average similarity among the elements within each cluster, the error did not change significantly.

Table 2: Clustering process phase results.

Data set         ρ      avgNC   Steps   Error (%)
Marco Civil I    0.6    5       29      12.21
Marco Civil I    0.75   7       37      16.39
Marco Civil I    0.8    7       30      15.71
Marco Civil I    0.9    6       17       2.84
Marco Civil I    0.95   4       11       0.04
Marco Civil II   0.6    6       46      39.25
Marco Civil II   0.75   8       31      20.44
Marco Civil II   0.8    9       25      19.32
Marco Civil II   0.9    6       38      17.53
Marco Civil II   0.95   5       29       8.59

On the Marco Civil II data set, the number of steps went from 29 (ρ = 0.95) to 46 (ρ = 0.6). Although we can see a great impact on the number of steps for ρ = 0.6, showing that the recursive part of the process was in greater demand in this case, the amount of work carried out by the user was still reduced compared with having to assign labels to the whole data set.

From these results we can also infer that, should one adopt ρ = 0.95, the mislabeling error does not increase relative to the other values, even as the number of assigned labels grows; on the contrary, the error is greatly reduced.
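The label count quoted for Marco Civil I at ρ = 0.8 can be checked with a few lines of arithmetic. The helper name `labels_requested` is hypothetical; the numbers (7 clusters on average, 30 steps, 1,111 unclustered tweets, 15.71% error, 2,080 tweets) are taken from the text and Table 2.

```python
def labels_requested(avg_clusters, steps, unclustered):
    """Two labels per cluster per step, plus one label per tweet left outside any cluster."""
    return 2 * avg_clusters * steps + unclustered


total = labels_requested(avg_clusters=7, steps=30, unclustered=1111)
mislabeled = 0.1571 * 2080  # error rate applied to the 2,080 tweets

print(total, round(mislabeled, 1))  # 1531 326.8
```

The total of 1,531 labels is below the 2,080 tweets in the data set, and the roughly 326.8 mislabeled tweets agree with the "fewer than 327" figure given in the text.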

4.2 Classiﬁcation Experiments

Although each tweet is classified concurrently under both meta-categories, which could therefore be treated as a multi-label classification problem (Ciarelli et al., 2013), in this work we tackled it as a single-label classification problem in each of our 2 meta-categories: Marco Civil I and II.

CombiningClusteringandClassificationApproachesforReducingtheEffortofAutomaticTweetsClassification

469

The Marco Civil data set was preprocessed by removing some stopwords. Each word was turned into its stem form using the RSLP suffix-removal algorithm for the Portuguese language proposed in (Orengo and Huyck, 2001). This algorithm extracts the stem of a word through eight steps, consisting of the removal of the plural form, the feminine form, the adverbial form, the augmentative or diminutive form, and verb endings, and the removal of vowels and accents. A major advantage of using this stemming process for Portuguese is that it relies on an external and editable dictionary of rules. This dictionary contains about 32,000 words, with rules for their proper stemming, which makes it possible to adapt its content or even improve the extraction by inserting new exception rules into its configuration. In addition, we applied a set of feature selection techniques during the training phase with the goal of eliminating noisy terms and keeping, as far as possible, only the terms which could contribute positively to correct classification results.

We chose to use two well-known algorithms. The choice of these algorithms was based on the aim of comparing our results with those found in the literature. To this end, kNN (Soucy and Mineau, 2001) is a well-known classifier widely used in experiments involving information retrieval, and it has been shown to yield good results in various situations. It measures the distance between every document within the training subset of the data set and the test document, and then ranks these distances. The most common class among the k nearest documents is chosen as the class of the test document.
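A minimal kNN sketch under stated assumptions follows. It uses cosine similarity over dense feature vectors (the similarity the paper reports using), and breaks voting ties in favor of the label encountered first among the nearest neighbors; both the tie-breaking rule and the names `cosine` and `knn_classify` are choices made for this example only.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(train, query, k=1):
    """train is a list of (vector, label) pairs.

    Ranks training documents by similarity to the query and returns
    the most common label among the k most similar ones.
    """
    ranked = sorted(train, key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With `k=1`, the value the experiments later select, the query simply inherits the label of its single most similar training document.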

The other classifier used in our experiments is the CBC (Centroid-Based Classifier) (Han and Karypis, 2000), which classifies each test document based on its proximity to the centroid of each category in the data set. This approach was also chosen for its simplicity of implementation and because it is fast both in training and in testing a large number of documents within our data set.
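A centroid-based classifier can be sketched just as briefly. This is an illustration under assumptions, not the implementation of Han and Karypis: centroids are plain arithmetic means of the training vectors of each class, proximity is cosine similarity, and the names `train_centroids` and `cbc_classify` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def train_centroids(train):
    """train is a list of (vector, label) pairs; returns {label: mean vector}."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def cbc_classify(centroids, query):
    """Return the label whose centroid is most similar to the query vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))
```

Training is a single pass over the data and classification compares the query with one vector per class, which is consistent with the speed argument made above.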

For both classifiers, we use the cosine of the angle between two documents, or between a document and a class centroid, to measure similarity. The metrics Recall (Equation 1), Precision (Equation 2) and F1-measure (Equation 3) were adopted in this work to evaluate the classification results, as shown below:

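$$\mathrm{Recall}(C_i) = \frac{TP(C_i)}{TP(C_i) + FN(C_i)} \qquad (1)$$

$$\mathrm{Precision}(C_i) = \frac{TP(C_i)}{TP(C_i) + FP(C_i)} \qquad (2)$$

$$\mathrm{F1\text{-}measure}(C_i) = \frac{2 \times \mathrm{Precision}(C_i) \times \mathrm{Recall}(C_i)}{\mathrm{Precision}(C_i) + \mathrm{Recall}(C_i)} \qquad (3)$$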

where TP is the number of documents correctly assigned to class $C_i$ by the automatic classifier, FP is the number of documents incorrectly assigned to class $C_i$ by the automatic classifier, and FN is the number of documents belonging to class $C_i$ that the automatic classifier incorrectly assigned to another class.
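Equations 1 to 3 translate directly into code. The sketch below computes the three metrics for one class from parallel lists of expert and predicted labels; the function name `per_class_scores` and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
def per_class_scores(gold, predicted, cls):
    """Return (recall, precision, f1) for class cls.

    gold and predicted are equal-length sequences of labels:
    the expert's label and the classifier's label for each document.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == cls and p == cls)  # correctly assigned to cls
    fp = sum(1 for g, p in pairs if g != cls and p == cls)  # wrongly assigned to cls
    fn = sum(1 for g, p in pairs if g == cls and p != cls)  # cls documents sent elsewhere
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

For example, with expert labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, class `"a"` has TP = 1, FP = 0 and FN = 1, giving a recall of 0.5 and a precision of 1.0.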

The experimental results were obtained by applying k-fold cross-validation and calculating the average values of Precision, Recall and F1-measure. In order to optimize the parameters of the techniques, 9 folds were used for training and another fold was used for validation. The elements of each fold were randomly chosen so that each fold has a balanced number of elements of each class. We repeated this process 50 times to compute statistics over these experiments.

4.3 Analysis of the Results

We tested many values of k for the kNN algorithm in order to maximize the F1-measure. The value of k which achieved the highest F1-measure was k = 1 for both versions of the data set, so we chose this value to carry out the rest of our experiments. Figure 2 displays this calibration for the data set Marco Civil I, plotting the selected k (horizontal axis) against the F1-measure (vertical axis). Figure 3 shows the same for the data set Marco Civil II.

Figure 2: The kNN best k for Marco Civil I. (Horizontal axis: neighborhood size, 0 to 60; vertical axis: F1-measure.)

Figure 3: The kNN best k for Marco Civil II. (Horizontal axis: neighborhood size, 0 to 60; vertical axis: F1-measure.)

KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

470

The calibration of kNN consists of finding the best value for k. When k = 1, we are using only the single nearest neighbor to decide the class of the new test document. In contrast, the CBC uses the centroid of each training class to find the class of the same test document. Hence, the latter approach uses more neighbor documents to make a decision. The results of these two approaches are shown in Table 3.

We performed an experiment with the initial state of the data set, treating it as a pure classification problem. Hence we preprocessed the 2080 records as described in Section 4.2. The results are shown in Table 3.

Table 3: The results of kNN & CBC classification.

Data set         Classifier   Recall   Precision   F1
Marco Civil I    kNN          0.4858   0.4955      0.4853
Marco Civil I    CBC          0.4941   0.4920      0.4884
Marco Civil II   kNN          0.5042   0.6079      0.5381
Marco Civil II   CBC          0.5667   0.6146      0.5253

The results show that the CBC approach is slightly better than kNN with respect to the F1-measure on Marco Civil I, but slightly worse on Marco Civil II. On this data set the CBC functions on average as a good classifier for the test documents. An interesting result is that the CBC algorithm is better on the Recall metric in both cases (0.4941 against 0.4858, and 0.5667 against 0.5042). The results also suggest that the centroid of a class gives a better memory of the class position than a single document. In these experiments, the simple nearest neighbor is slightly better on the Precision metric for Marco Civil I.

Figure 4: Performance of the kNN classifier used after clustering process – Marco Civil I. (Horizontal axis: ρ, from 0.60 to 0.95; vertical axis: F1-measure; points annotated with the number of labels assigned by the user: 327, 1178, 1531, 1723, 1888.)

In another experiment, we first applied the clustering process described in Section 4.1. In this process the user is asked to assign a number of labels during the clustering phase. As mentioned before, we claim that this clustering phase can spare the user a lot of work in assigning labels for the classification problem. In Figure 4, we show, for each value of ρ, the number of labels assigned by the user and the quality of the subsequent classification.

These results show that even when the clustering phase is very tight, ρ = 0.95, the number of assigned labels was only 1888 elements of the data set, whereas for ρ = 0.60 the number of assigned labels came down to 327. Note that even for ρ = 0.60 the F1-measure is better than that obtained for the pure classification problem at the beginning of this section. This shows that applying a clustering process as a starting point for a problem such as the one discussed here can in fact spare the user a lot of work. The only case where the F1-measure of the classification was worsened by the clustering phase is ρ = 0.8.

Figure 5: Performance of the kNN classifier used after clustering process – Marco Civil II. (Horizontal axis: ρ, from 0.60 to 0.95; vertical axis: F1-measure; points annotated with the number of labels assigned by the user: 538, 1227, 1471, 1883, 2000.)

Marco Civil II is a version of the data set in which we seek to assign 9 different labels/classes, whereas in the first case there were three. This is a possible cause of the poorer results in this version, as we are dealing with the same set of texts. We managed to improve the F1-measure when choosing ρ = 0.80 and ρ = 0.90, obtaining F1 = 0.59 and 0.57, respectively. For the other values of ρ, the F1-measure was worse than the results in Table 3. Nevertheless, in all cases the number of assigned labels was reduced.

5 CONCLUSIONS

In this paper, we propose a strategy to reduce the user's effort in classifying a large data set of tweets by introducing a clustering phase as the first step of the whole process. The ultimate goal is to have a good, flexible and fast algorithm to help an expert with the semi-automatic classification of large tweet data sets.

CombiningClusteringandClassificationApproachesforReducingtheEffortofAutomaticTweetsClassification

471

For the clustering process we used a clustering toolkit to cluster the
tweets. For the classification phase, we applied two classical
algorithms, kNN and CBC, in order to analyze their impact on the results.
In the experiments we analyzed a variety of clustering configurations and
their influence on the following step of the proposed strategy: the
classification phase.
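As an illustration of the two classifiers named above, here is a minimal sketch of kNN and of a centroid-based classifier (CBC) over cosine similarity. It is not the implementation used in the experiments: the whitespace tokenizer, the raw term-frequency weighting, the function names, and the toy tweets and labels are all assumptions. It also assumes that the labels obtained in the clustering phase serve as the training set.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term-frequency vector (whitespace tokenizer, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def cbc_fit(train):
    """Sum the vectors of each class. Cosine is scale-invariant, so the sum
    ranks classes exactly as the mean (the centroid) would."""
    centroids = defaultdict(Counter)
    for vec, label in train:
        centroids[label].update(vec)
    return dict(centroids)


def cbc_predict(centroids, query):
    """Assign the class whose centroid is most similar to the query."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))


# Hypothetical labelled tweets standing in for the output of the clustering phase.
train = [(vectorize(text), label) for text, label in [
    ("net neutrality must be protected", "support"),
    ("protect net neutrality now", "support"),
    ("the bill threatens privacy online", "oppose"),
    ("privacy is at risk with this bill", "oppose"),
]]
query = vectorize("we must protect net neutrality")

print("kNN:", knn_predict(train, query, k=3))
print("CBC:", cbc_predict(cbc_fit(train), query))
```

The design difference is that kNN compares the query against every labelled tweet, whereas CBC compares it against one vector per class, which makes prediction cheaper once the centroids are built.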

The comparison of the results obtained by our strategy with those
produced by an expert revealed that our approach was able to imitate the
human expert with an F1-measure of up to 0.7907. These findings also
showed that we can greatly reduce the effort of the expert.
Our future work is directed at finding a way to predict the best ρ with
which to start the clustering process, in order to minimize the effort
and maximize the accuracy of the classification process.
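Predicting ρ in advance remains open. Purely as a point of reference, the trade-off can be resolved after the fact: among the configurations already measured, keep those whose F1 reaches a required floor and take the one needing the fewest labels. The sketch below assumes exactly that rule; the function name `pick_rho`, the floor, and all the numbers are hypothetical and not taken from the experiments.

```python
def pick_rho(results, min_f1):
    """Return the (rho, labels, f1) tuple with the fewest labels among those
    with f1 >= min_f1, or None if no configuration reaches the floor.
    Ties on label count are broken in favour of the higher F1."""
    feasible = [r for r in results if r[2] >= min_f1]
    if not feasible:
        return None
    return min(feasible, key=lambda r: (r[1], -r[2]))


# Hypothetical measurements: (rho, number of assigned labels, F1).
results = [
    (0.60, 300, 0.52),
    (0.70, 900, 0.58),
    (0.80, 1500, 0.61),
    (0.90, 1900, 0.60),
]
print(pick_rho(results, min_f1=0.55))
```

A predictive method would have to estimate these measurements before the expert labels anything, which this baseline does not attempt.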

ACKNOWLEDGEMENTS

The first author would like to thank CAPES for its partial support of
this research under grant no. BEX-6128/12-2.

REFERENCES

Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern

Information Retrieval. Addison-Wesley, New York,

2nd edition.

Berry, M. W. (2003). Survey of Text Mining: Clustering,

Classification, and Retrieval. Springer-Verlag, New

York.

Bruns, A. and Liang, Y. (2012). Tools and Methods

for Capturing Twitter Data During Natural Disasters.

First Monday, 17(4).

Bryden, J., Funk, S., and Jansen, V. A. A. (2013). Word

Usage Mirrors Community Structure in the Online

Social Network Twitter. EPJ Data Science, 2(1):3+.

Ciarelli, P. M., Oliveira, E., and Salles, E. O. T.

(2013). Multi-label Incremental Learning Applied to

Web Pages Categorization. Neural Computing and

Applications, pages 1–17.

Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011).

Cluster Analysis. John Wiley & Sons, Ltd, London, 5th

edition.

Gundecha, P. and Liu, H. (2012). Mining Social Media: A

Brief Introduction. Tutorials in Operations Research,

1(4).

Hadgu, A. T., Garimella, K., and Weber, I. (2013). Political

Hashtag Hijacking in the U.S. In Proceedings of

the 22nd International Conference on World Wide

Web Companion, WWW ’13 Companion, pages

55–56, Republic and Canton of Geneva, Switzerland.

International World Wide Web Conferences Steering

Committee.

Han, E.-H. S. and Karypis, G. (2000). Centroid-Based

Document Classification: Analysis and Experimental

Results. Springer.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data

Clustering: a Review. ACM Computing Surveys,

31(3):264–323.

Karypis, G. (2002). CLUTO: A Clustering Toolkit. Technical

report, Dept. of Computer Science, University of

Minnesota. Technical Report 02-017.

Kleinberg, J. (2002). An Impossibility Theorem for

Clustering. pages 446–453. MIT Press.

Kyriakopoulou, A. and Kalamboukis, T. (2007). Using

Clustering to Enhance Text Classification. In 30th

International ACM SIGIR Conference on Research

and Development in Information Retrieval, pages

805–806, New York, NY, USA. ACM Press.

Makazhanov, A., Rafiei, D., and Waqar, M. (2014).

Predicting Political Preference of Twitter Users.

Social Network Analysis and Mining, 4(1).

Orengo, V. M. and Huyck, C. R. (2001). A Stemming

Algorithm for the Portuguese Language. In SPIRE,

volume 8, pages 186–193.

Sebastiani, F. (2002). Machine Learning in Automated Text

Categorization. ACM Computing Surveys, 34(1):1–47.

Soucy, P. and Mineau, G. W. (2001). A Simple KNN

Algorithm for Text Categorization. In ICDM

’01: Proceedings of the 2001 IEEE International

Conference on Data Mining, pages 647–648,

Washington, DC, USA. IEEE Computer Society.

Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H.,

and Demirbas, M. (2010). Short Text Classification

in Twitter to Improve Information Filtering. In 33rd

International ACM SIGIR Conference on Research

and Development in Information Retrieval, SIGIR

’10, pages 841–842, New York, NY, USA. ACM.

Vens, C., Verstrynge, B., and Blockeel, H. (2013).

Semi-supervised Clustering with Example Clusters.

In 5th International Joint Conference on Knowledge

Discovery, Knowledge Engineering and Knowledge

Management, pages 45–51, Vilamoura, Algarve,

Portugal.

KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
