CLUSTERING OF THREAD POSTS IN ONLINE DISCUSSION FORUMS

Dina Said, Nayer Wanas

Abstract

Online discussion forums are considered a challenging repository for data mining tasks. Forums usually contain hundreds of threads, which in turn may be composed of hundreds, or even thousands, of posts. Clustering these posts can potentially provide better visualization and exploration of online threads. Moreover, clustering can be used to discover outlier and off-topic posts. In this paper, we propose Leader-based Post Clustering (LPC), a modification of the Leader algorithm for clustering posts in threads of discussion boards. We also suggest using asymmetric pair-wise distances to measure the dissimilarity between posts. We further investigate the effect of the indirect distance between posts, and how to calibrate it with the direct distance. To evaluate the proposed methods, we conduct experiments using artificial and real threads extracted from the Slashdot and Ciao discussion forums. Experimental results demonstrate the effectiveness of the LPC algorithm when using a linear combination of direct and indirect distances, as well as when using an averaging approach to evaluate a representative indirect distance.
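
To make the ingredients named in the abstract concrete, the following is a minimal sketch of the classic single-pass Leader algorithm paired with an asymmetric distance. It is not the LPC algorithm itself, whose modifications are not described here. The set-of-terms post model, the `asymmetric_distance` definition, the `combined_distance` weight `alpha`, and the `threshold` value are all assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence, Set

# Assumption: a post is modelled as a set of terms.
Post = Set[str]


def asymmetric_distance(a: Post, b: Post) -> float:
    """Hypothetical asymmetric distance: fraction of a's terms absent from b.

    In general asymmetric_distance(a, b) != asymmetric_distance(b, a).
    """
    if not a:
        return 0.0
    return 1.0 - len(a & b) / len(a)


def combined_distance(direct: float, indirect: float, alpha: float = 0.5) -> float:
    """Linear combination of a direct and an indirect distance.

    The weight alpha is a hypothetical parameter, not taken from the paper.
    """
    return alpha * direct + (1.0 - alpha) * indirect


def leader_clustering(
    posts: Sequence[Post],
    distance: Callable[[Post, Post], float],
    threshold: float,
) -> List[List[int]]:
    """Classic Leader algorithm: one pass over the posts in order.

    Each post joins the first cluster whose leader lies within the
    threshold; otherwise it becomes the leader of a new cluster.
    Returns clusters as lists of post indices, leader first.
    """
    leaders: List[int] = []
    clusters: List[List[int]] = []
    for i, post in enumerate(posts):
        for c, leader_idx in enumerate(leaders):
            if distance(post, posts[leader_idx]) <= threshold:
                clusters[c].append(i)
                break
        else:
            # No leader was close enough: this post starts a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


posts = [
    {"linux", "kernel", "patch"},
    {"linux", "kernel"},
    {"pizza", "recipe"},
    {"pizza"},
]
clusters = leader_clustering(posts, asymmetric_distance, threshold=0.5)
```

Because the result of the Leader algorithm depends on the order in which posts are visited, a thread's chronological order is a natural choice of input order in this sketch.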

Paper Citation


in Harvard Style

Said D. and Wanas N. (2010). CLUSTERING OF THREAD POSTS IN ONLINE DISCUSSION FORUMS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 314-319. DOI: 10.5220/0003104303140319


in Bibtex Style

@conference{kdir10,
author={Dina Said and Nayer Wanas},
title={CLUSTERING OF THREAD POSTS IN ONLINE DISCUSSION FORUMS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={314-319},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003104303140319},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - CLUSTERING OF THREAD POSTS IN ONLINE DISCUSSION FORUMS
SN - 978-989-8425-28-7
AU - Said D.
AU - Wanas N.
PY - 2010
SP - 314
EP - 319
DO - 10.5220/0003104303140319