Authors:
Giacomo Domeniconi (1); Konstantinos Semertzidis (2); Vanessa Lopez (3); Elizabeth M. Daly (3); Spyros Kotoulas (3) and Gianluca Moro (1)
Affiliations:
(1) University of Bologna, Italy; (2) University of Ioannina, Greece; (3) IBM Research, Ireland
Keyword(s):
Clustering Algorithms, Conversation Threads, Topic Detection.
Related Ontology Subjects/Areas/Topics:
Data Engineering; Data Management and Quality; Data Management for Analytics; Data Modeling and Visualization; Data Structures and Data Management Algorithms; Information Quality
Abstract:
Efficiently detecting conversation threads from a pool of messages, such as social network chats, emails,
comments on posts, news, etc., is relevant for various applications, including Web Marketing, Information
Retrieval and Digital Forensics. Existing approaches focus on text similarity using keywords as features that
are strongly dependent on the dataset. Therefore, dealing with new corpora requires further costly analyses
conducted by experts to identify new relevant features. This paper introduces a novel method to detect threads
in any type of conversational text without having to determine dataset-specific features in advance.
To determine the relevant features of messages automatically, we map each message into a three-dimensional
representation based on its semantic content, its social interactions (sender and recipients), and its
timestamp; clustering is then used to detect conversation threads. In addition, we propose a supervised
approach that builds a classification model combining the extracted features to predict whether a pair
of messages belongs to the same thread. Our model harnesses the
distance measure of a message to a cluster representing a thread to capture the probability that a message is
part of that same thread. We present our experimental results on seven datasets, pertaining to different types
of messages, and demonstrate the effectiveness of our method in the detection of conversation threads, clearly
outperforming the state of the art and yielding an improvement of up to 19%.
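
As an illustration only, the three-dimensional idea described in the abstract can be sketched as follows. This is not the authors' implementation: the word-overlap (Jaccard) score standing in for semantic content, the exponential time decay, the equal weighting of the three scores, the greedy threshold clustering, and all names (`Message`, `pair_features`, `detect_threads`, `time_scale`, `threshold`) are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    participants: frozenset  # sender and recipients together
    timestamp: float         # seconds since some reference point


def jaccard(a, b):
    """Overlap of two sets in [0, 1]; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(m1, m2, time_scale=3600.0):
    """Return (content, social, temporal) similarities, each in [0, 1].

    Assumption: content is plain word overlap, a crude stand-in for a
    semantic representation; temporal closeness decays exponentially.
    """
    content = jaccard(set(m1.text.lower().split()), set(m2.text.lower().split()))
    social = jaccard(m1.participants, m2.participants)
    temporal = math.exp(-abs(m1.timestamp - m2.timestamp) / time_scale)
    return content, social, temporal


def similarity(m1, m2):
    """Assumption: the three dimensions are weighted equally."""
    return sum(pair_features(m1, m2)) / 3.0


def detect_threads(messages, threshold=0.5):
    """Greedy clustering: attach each message, in time order, to the thread
    with the highest average similarity, or start a new thread when no
    thread reaches the threshold."""
    threads = []
    for msg in sorted(messages, key=lambda m: m.timestamp):
        best, best_score = None, threshold
        for thread in threads:
            score = sum(similarity(msg, other) for other in thread) / len(thread)
            if score >= best_score:
                best, best_score = thread, score
        if best is None:
            threads.append([msg])
        else:
            best.append(msg)
    return threads
```

In a supervised variant along the lines the abstract mentions, the tuple returned by `pair_features` could serve as the input of a binary classifier that predicts whether two messages share a thread, in place of the fixed equal weighting used here.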