For the clustering process we used a Clustering
Toolkit to cluster the tweets. For the classification
phase, we applied two classical algorithms,
kNN and CBC, in order to analyze their
impact on the results. In the experiments
we analyzed a variety of clustering configurations and
their influence on the subsequent step of the proposed
strategy: the classification phase.
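The difference between the two classifiers can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes that the clustering step has already produced labeled groups of tweets, uses plain term-frequency vectors with cosine similarity, and all function names and the toy tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Term-frequency vector of a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, labeled, k=3):
    """kNN: majority vote among the k labeled vectors most similar to query."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def centroids(labeled):
    """One summed vector per class (cosine is unaffected by the 1/n scaling)."""
    sums = defaultdict(Counter)
    for vec, label in labeled:
        sums[label].update(vec)
    return dict(sums)


def cbc_classify(query, class_centroids):
    """CBC: assign the class whose centroid is most similar to query."""
    return max(class_centroids, key=lambda c: cosine(query, class_centroids[c]))


# Toy data, invented for illustration only (not taken from the experiments).
labeled = [
    (vectorize("flood water river rising"), "disaster"),
    (vectorize("river flood warning issued"), "disaster"),
    (vectorize("storm flood damage"), "disaster"),
    (vectorize("election vote candidate debate"), "politics"),
    (vectorize("candidate wins election vote"), "politics"),
    (vectorize("vote debate tonight"), "politics"),
]
query = vectorize("flood warning near the river")

knn_label = knn_classify(query, labeled, k=3)
cbc_label = cbc_classify(query, centroids(labeled))
print(knn_label, cbc_label)
```

In this sketch kNN compares the query against every labeled tweet, whereas CBC compares it against only one vector per class, which is why the quality of the clusters that define those classes matters for the subsequent classification phase.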
Comparing the results obtained by our
strategy with those produced by an expert revealed that
our approach was able to imitate the human expert up
to 79.07% of the time. These findings also show
that the expert's effort can be greatly reduced.
In future work, we intend to find a way to
predict the best initial ρ for the clustering process
in order to minimize the effort and maximize the
accuracy of the classification process.
ACKNOWLEDGEMENTS
The first author would like to thank CAPES for its
partial support of this research under grant no.
BEX-6128/12-2.
KDIR 2014 - International Conference on Knowledge Discovery and Information Retrieval