ing the same combining function. For example, the
performance of the Avg and Med is about 1% lower than
that of the Min and AvgF for the Slashdot-Art corpus,
while the performance of the Min is about 0.4% lower than
that of the AvgF for the Ciao corpus. In general, we
recommend using the AvgF function, since it
is not biased towards the minimum indirect distance
like the Min. Additionally, it considers only five
indirect links, which makes it more representative of the
indirect distance than the Avg and Med.
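The difference between the four aggregating functions can be illustrated with a minimal sketch. It assumes that the distances of the indirect links between two posts are already available as a list of numbers, and that AvgF averages the five shortest of them; the text only states that AvgF considers five indirect links, so this selection rule, the function name aggregate_indirect, and the parameter first_k are assumptions made for illustration.

```python
from statistics import mean, median


def aggregate_indirect(distances, method="AvgF", first_k=5):
    """Aggregate the distances of several indirect links into one value.

    `distances` holds one distance per indirect link between two posts.
    The AvgF rule used here (mean of the `first_k` shortest links) is an
    assumption; the other three follow their usual definitions.
    """
    if not distances:
        raise ValueError("at least one indirect link is required")
    if method == "Min":
        return min(distances)
    if method == "Avg":
        return mean(distances)
    if method == "Med":
        return median(distances)
    if method == "AvgF":
        # Sort ascending and keep only the shortest links before averaging.
        return mean(sorted(distances)[:first_k])
    raise ValueError(f"unknown aggregating function: {method}")


if __name__ == "__main__":
    links = [0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0]
    for name in ("Min", "Avg", "Med", "AvgF"):
        print(name, aggregate_indirect(links, name))
```

Under these assumptions, Min reports only the single shortest link, Avg and Med are pulled towards the many long links, and AvgF falls in between.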
5 CONCLUSIONS
Online discussion boards represent a rich repository
for data mining tasks in user-generated texts. This
research addresses the problem of clustering posts in
different threads. The purpose of this clustering is
mainly to provide improved usability of threads in on-
line discussion boards. This may also facilitate the
discovery of off-topic and outlier posts in discussion
threads. The suggested Leader-based Posts Clustering (LPC)
approach captures the time dependency between
posts. Starting from the head post, subsequent
posts are assigned to either the most related clus-
ter or to new clusters, based on an automatically-
determined threshold of distances. An asymmetric
distance is suggested for measuring the pair-wise distance
between posts. This distance allows for modeling
inter-post tagging and commenting. Additionally,
we suggest incorporating the indirect distance
between posts. Four aggregating functions based on the
minimum, average, and median (Min, Avg, AvgF, and Med)
have been suggested for aggregating different indirect
links. In addition, four methods for combining indirect
and direct distances have been proposed, namely the
Constant, Power, Linear, and Tanh functions.
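The leader-based assignment step summarized above can be sketched as follows. This is a simplified illustration and not the LPC implementation itself: the threshold is passed in as a parameter rather than determined automatically, each post is compared only with the first post (leader) of every existing cluster, indirect distances are omitted, and the word-overlap distance is a hypothetical stand-in chosen only because it is asymmetric.

```python
from typing import Callable, List, Sequence


def overlap_distance(post: str, leader: str) -> float:
    """Hypothetical asymmetric distance: share of the post's words that
    do not appear in the leader. d(a, b) generally differs from d(b, a)."""
    post_words = set(post.lower().split())
    leader_words = set(leader.lower().split())
    if not post_words:
        return 1.0
    return 1.0 - len(post_words & leader_words) / len(post_words)


def leader_cluster(
    posts: Sequence[str],
    distance: Callable[[str, str], float],
    threshold: float,
) -> List[List[int]]:
    """Assign posts, in chronological order, to the closest existing
    cluster or to a new one. Returns clusters as lists of post indices;
    the first index in each list is the cluster leader."""
    clusters: List[List[int]] = []
    for i, post in enumerate(posts):
        best_cluster, best_distance = None, float("inf")
        for c, members in enumerate(clusters):
            d = distance(post, posts[members[0]])  # compare with the leader
            if d < best_distance:
                best_cluster, best_distance = c, d
        if best_cluster is not None and best_distance <= threshold:
            clusters[best_cluster].append(i)
        else:
            clusters.append([i])  # the post becomes a new leader
    return clusters


if __name__ == "__main__":
    thread = [
        "cheap flights to paris",
        "paris flights in may",
        "best python book",
        "python book for beginners",
    ]
    print(leader_cluster(thread, overlap_distance, threshold=0.6))
```

Because posts are processed in their original order, the head post always leads the first cluster, and a later post can only join clusters whose leaders precede it in time.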
Our experiments have been conducted using four
corpora: two are artificially generated, so that the
true clusters are known, and the other two are real online
threads. These were generated from threads crawled
from the Slashdot and Ciao discussion boards. The results
show the potential of the LPC when using the Linear
combining function and an averaging aggregate function
(Avg, AvgF). This is in comparison with the
performance of the k-means algorithm on the artificial
corpora when k is set to the true number
of clusters. Moreover, the LPC algorithm, unlike
k-means, eliminates the need to estimate
the number of actual clusters or to predefine thresholds.
For the real corpora, the Linear combining function
along with the averaging aggregate function has
demonstrated the best performance among all the
examined methods.
ACKNOWLEDGEMENTS
The authors would like to thank the Fundacion
Barcelona Media (FBM) for crawling the corpora
used in this research and making them available for
research use. This research has been conducted dur-
ing an internship granted to the first author at the
Cairo Microsoft Innovation Lab.
CLUSTERING OF THREAD POSTS IN ONLINE DISCUSSION FORUMS