forums, etc. We address the problem by using a
three-dimensional representation for each message, which
covers its textual semantic content, social interactions,
and creation time. We then propose a suite of features
based on this three-dimensional representation to compute
a similarity measure between messages, which is used in a
clustering algorithm to detect the threads. We also
propose a supervised model that combines these features:
the probability, estimated by the model, that two messages
are in the same thread is used as a distance measure
between them. We show that the use of a classifier leads
to higher accuracy in thread detection, outperforming all
earlier approaches.
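The pipeline summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the `Message` fields, the three toy features (word-set Jaccard overlap, participant overlap, inverse time gap), the hand-set logistic weights standing in for a trained classifier, the conversion of the estimated probability p into a distance as 1 - p, and the single-link threshold clustering are all hypothetical choices made only to keep the example self-contained.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Message:
    text: str
    author: str
    recipients: frozenset
    time: float  # hours since an arbitrary reference point (assumption)


def pair_features(a, b):
    """One toy feature per dimension: text, social interaction, time."""
    wa, wb = set(a.text.lower().split()), set(b.text.lower().split())
    text_sim = len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0
    pa, pb = a.recipients | {a.author}, b.recipients | {b.author}
    social_sim = len(pa & pb) / len(pa | pb)
    time_sim = 1.0 / (1.0 + abs(a.time - b.time))
    return [text_sim, social_sim, time_sim]


# Hypothetical weights standing in for a classifier trained on labeled pairs.
WEIGHTS = [4.0, 3.0, 2.0]
BIAS = -4.0


def prob_same_thread(a, b):
    """Logistic score playing the role of the model's estimated probability."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, pair_features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def detect_threads(messages, max_distance=0.5):
    """Single-link threshold clustering (union-find) on distance = 1 - p."""
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(messages)), 2):
        distance = 1.0 - prob_same_thread(messages[i], messages[j])
        if distance < max_distance:
            parent[find(i)] = find(j)
    # Messages sharing a label are assigned to the same thread.
    return [find(i) for i in range(len(messages))]


messages = [
    Message("budget meeting moved to friday", "alice", frozenset({"bob"}), 0.0),
    Message("friday works for the budget meeting", "bob", frozenset({"alice"}), 1.0),
    Message("server outage in the lab", "carol", frozenset({"dave"}), 0.5),
    Message("lab server outage is fixed", "dave", frozenset({"carol"}), 2.0),
]
labels = detect_threads(messages)
print(labels)
```

In this toy run the first two messages share one label and the last two share another. In practice the fixed weights would be replaced by a classifier trained on message pairs labeled as same-thread or not, and the threshold clustering by whichever clustering algorithm is being evaluated.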
For future work, an interesting variation of the
problem is conversation tree reconstruction, in which
the reply structure of the conversations inside a
thread must be detected.
DATA 2016 - 5th International Conference on Data Management Technologies and Applications