Gender Clustering of Blog Posts using Distinguishable Features

Yaakov HaCohen-Kerner, Yarden Tzach, Ori Asis


This research investigates how to effectively cluster unlabeled personal blog posts written in English by gender. Given a gender-labeled blog corpus and a blog corpus that is not gender-labeled, we extracted distinguishable unigrams for both males and females from the labeled corpus. We then defined two general features, males’ frequency and females’ frequency, that represent the relative frequencies of the distinguishable males’ unigrams and females’ unigrams, respectively. The best distinguishable feature was found to be the males’ frequency feature with a ratio factor of at least 1.4 times that of females. This feature leads to an accuracy rate of 83.7% for gender clustering of the unlabeled blog corpus. To the best of our knowledge, this study presents two novelties: (1) it is the first study to cluster blog posts by gender, and (2) it clusters an unlabeled corpus using distinguishable features that were extracted from a labeled corpus.
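The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the reading of the 1.4 ratio factor as a threshold on relative unigram frequencies, the treatment of unigrams absent from the other gender's corpus, the simple one-dimensional 2-means clustering, and all function names are assumptions made here for illustration only.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def relative_freqs(posts):
    # Relative frequency of each unigram over all tokens in a list of posts.
    counts = Counter(tok for post in posts for tok in tokenize(post))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def distinguishable_unigrams(own_posts, other_posts, ratio=1.4):
    # Assumption: a unigram is "distinguishable" for one gender when its
    # relative frequency is at least `ratio` times its relative frequency
    # in the other gender's posts. Unigrams absent from the other corpus
    # (frequency 0) always pass this test.
    own = relative_freqs(own_posts)
    other = relative_freqs(other_posts)
    return {w for w, f in own.items() if f >= ratio * other.get(w, 0.0)}


def frequency_feature(post, unigrams):
    # Share of a post's tokens that belong to the given unigram set.
    tokens = tokenize(post)
    if not tokens:
        return 0.0
    return sum(tok in unigrams for tok in tokens) / len(tokens)


def two_means_1d(values, iterations=20):
    # Hypothetical stand-in for the clustering step: 2-means on a single
    # feature, with centroids initialised at the minimum and maximum.
    # Expects a non-empty list of numbers.
    lo, hi = min(values), max(values)
    labels = []
    for _ in range(iterations):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return labels


# Invented toy data, only to show how the pieces fit together.
male_posts = ["the game was great and the team won",
              "fixed the car engine after the game"]
female_posts = ["my sister and i went shopping today",
                "i love my new dress and my sister loves it"]
unlabeled_posts = ["the team won the game", "my sister loves shopping"]

male_unigrams = distinguishable_unigrams(male_posts, female_posts)
features = [frequency_feature(p, male_unigrams) for p in unlabeled_posts]
clusters = two_means_1d(features)
```

In this sketch, cluster 1 collects the posts with the higher males’ frequency value. Deciding which cluster corresponds to which gender, and measuring accuracy, would still require a labeled evaluation set.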



Paper Citation

in Harvard Style

HaCohen-Kerner Y., Tzach Y. and Asis O. (2016). Gender Clustering of Blog Posts using Distinguishable Features. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016), ISBN 978-989-758-203-5, pages 384-391. DOI: 10.5220/0006077403840391
