context utilize publicly shared attributes including
name, gender, location, and other information to
identify user profiles in social networks. However,
due to privacy settings, users' attributes are not
available in many cases, which makes these existing
approaches fragile. In addition, some researchers address
the problem of suicide only through tweets (Kavuluru
et al., 2016; Colombo et al., 2015). However, even
though tweets contain rich information that can
identify users, they can miss significant details that
may be available in the public attributes of the user
profile and that may contribute to a higher accuracy of
suicide detection. Unlike these works, our approach
uses both user-shared information, which we call
account features, and tweets to address the problem of
suicidal profile detection. First, inferring information
about users from their posted tweets poses important
challenges. The most relevant one is that semantic
features such as stylometry, writing style, sentiments,
emojis, hashtags and n-grams are difficult to extract
directly from a user's posted tweets. Unlike many
existing studies, which ignore these features when
identifying users, we analyse tweets and extract as
many semantic features as possible.
Second, adding account features to the user’s posted
tweets can help to improve the suicide detection task
since they may reflect the habits and characteristics of
users.
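To make the kind of tweet-level features mentioned above more concrete, the following is a minimal sketch of extracting hashtags, emojis, simple stylometric counts and word n-grams from a single tweet. The function name `tweet_features`, the regular expressions and the chosen feature set are assumptions made for illustration; they are not the extraction tools used in this work, and the emoji range is a rough approximation.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji code-point ranges (assumption: not exhaustive coverage).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def tweet_features(text, n=2):
    """Return a few surface features of one tweet (illustrative only)."""
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    emojis = EMOJI_RE.findall(text)
    # Lowercased alphabetic tokens; hashtag words are kept without the '#'.
    words = re.findall(r"[a-z']+", text.lower())
    # Count word n-grams by zipping n shifted copies of the token list.
    ngrams = Counter(zip(*(words[i:] for i in range(n))))
    return {
        "hashtags": hashtags,
        "emoji_count": len(emojis),
        "word_count": len(words),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "ngrams": ngrams,
    }
```

Features computed per tweet in this way could then be aggregated over all tweets of a profile, for example by summing the n-gram counters.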
Although many studies (Sueki, 2014; O'dea, 2018)
focus on the particular problem of suicidality
detection in social networks, they do not take the
profile itself into account. They only consider
suicide-related communication, with the aim of
classifying text relating to suicide. However, the
biggest challenge for suicide detection is how to
detect users who want to commit suicide from their
public profiles in social networks.
In this paper, we consider the challenge of suicidal
profile detection on Twitter. We analyse posted
tweets to extract semantic features, including
linguistic, emotional and stylometric ones. These
features allow us to distinguish between the writing
styles of different users and thus facilitate the final
classification of users as suicidal or not suicidal. In
addition, posted tweets contain temporal information
that indicates when a user actually posts. Such
information helps to enrich user identification and
improve suicide detection. We also use account
features related to publicly shared information such as
profile photo, location, biography and followees. We
exploit these features to infer other implicit ones and
build a rich profile that can help us to predict suicidal
users. We adopt
different data mining tools and techniques for the
extraction process. We also introduce a supervised
machine learning model to learn the features
identifying each user. Moreover, we adopt several
classification techniques to classify profiles into
suicidal and not suicidal. We apply our method to a
data set collected from Twitter that includes profiles
whose owners committed suicide.
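The combination of account features, temporal information and supervised classification outlined above can be sketched as follows. This is only an illustration under stated assumptions: the account field names (`has_photo`, `location`, `biography`, `followees`), the ISO 8601 timestamp format, the scaling constants and the nearest-centroid classifier are all hypothetical stand-ins, since the concrete features and classification techniques are described later in the paper.

```python
from datetime import datetime
from math import dist


def profile_vector(account, tweet_times):
    """Build a numeric vector from account fields and posting times.

    `account` is a dict with hypothetical keys; `tweet_times` is a list
    of ISO 8601 timestamp strings (an assumed input format).
    """
    hours = [datetime.fromisoformat(t).hour for t in tweet_times]
    # Temporal feature: share of tweets posted between 00:00 and 05:59.
    night = sum(h < 6 for h in hours) / len(hours) if hours else 0.0
    bio = account.get("biography") or ""
    return [
        night,
        1.0 if account.get("has_photo") else 0.0,
        1.0 if account.get("location") else 0.0,
        min(len(bio), 160) / 160,                          # biography length, capped
        min(account.get("followees", 0), 1000) / 1000,     # followees, capped
    ]


def fit_centroids(vectors, labels):
    """Supervised step: average the training vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, vec):
    """Assign the label of the closest class centroid (Euclidean distance)."""
    return min(centroids, key=lambda y: dist(centroids[y], vec))
```

A nearest-centroid rule is used here only because it fits in a few lines of standard-library code; any supervised classifier that accepts numeric feature vectors could take its place.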
The rest of this paper is organized as follows:
Section 2 discusses related work. Our method of
suicidal profiles detection is explained in Section 3,
which also presents the collection of data from Twitter
based on tweets and account features. Section 4
reports on the evaluation. Section 5 concludes the paper
and outlines directions for future work.
2 RELATED WORK
Social media have become increasingly popular and
the number of active users continues to increase.
Several phenomena such as suicide are now visible on
social media. To address suicide and reduce the
related mortality rates, many studies were conducted
on suicidality in social networks.
Kavuluru et al. (2016) conducted a suicide study
by classifying text relating to suicide on Twitter. They
built a set of account classifiers using lexical,
structural, emotive and psychological features
extracted from Twitter posts. Their aim was to
distinguish between the more worrying content, such
as suicidal ideation, and other suicide-related topics.
Other studies (Kavuluru et al., 2014) have focused
on writing style, using the LIWC tool as a sampling
technique to identify 'sad' Twitter posts that were
subsequently classified by a machine learning
classifier into levels of distress on an ordinal scale,
with around 64% accuracy in the best case.
Additionally, Birjali et al. (2017) based their work on
WordNet to semantically analyse Twitter data. They
addressed the lack of terminological resources related to
suicide by constructing a vocabulary associated with
suicide.
A case study (O'dea et al., 2015) used both human
coders and a machine classifier to confirm that
Twitter is used by individuals to express suicidality
and that it is possible to distinguish the level of
concern among suicide-related tweets.
In another work, De Choudhury et al. (2016)
considered online platforms such as Reddit and
applied topic analysis and linguistic features to
identify behavioural shifts and mental health issues
such as suicidal ideation, thus highlighting the risks of
supposedly helpful messages in such online forums.
Furthermore, Colombo et al. (2015) investigated the
WEBIST 2019 - 15th International Conference on Web Information Systems and Technologies