pattern recognition applications, such as targeted
advertising and event detection. The majority of the
studies performed clustering in order to detect
news, topics, events, and facts, and to predict
sentiment. Different clustering methods and
algorithms were implemented in these studies, each
with a different dataset and number of clusters. From the
13 reviewed datasets, it can be observed that the
average dataset size is 162,550 tweets for textual data,
ranging from 50 to 1,084,200, and an average of 126,329
for Twitter user accounts, ranging from 10,000 to
242,658 distinct user accounts. The majority of the
dataset sizes observed in the surveys are relatively
small, which means that the high-volume challenge of
Twitter data has not been taken into consideration.
Therefore, in order for these algorithms to be
effective, they should be able to scale well to the
massive amounts of Twitter data. In this regard, the
scalability (in terms of clustering performance) of
most of the algorithms implemented in the surveys is
questionable, as these algorithms have not been tested
on considerably large datasets.
As partitioning algorithms require the number of
clusters, c, to be pre-set, c has been included in the
review to provide an indication of the number of
clusters that might be appropriate for similar tasks.
From the provided comparisons, the average number
of clusters used can be derived: 7, with a minimum of
2 clusters and a maximum of 10.
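Since partitioning algorithms take c as an input rather than discovering it, even a minimal implementation makes this dependence explicit. The sketch below is not drawn from any of the reviewed studies; the data, function name, and the naive first-c-points initialisation are purely illustrative. It implements Lloyd's k-means in plain Python with a pre-set c:

```python
def kmeans(points, c, iterations=50):
    """Partition `points` (tuples of floats) into c clusters (Lloyd's algorithm)."""
    # Naive, deterministic initialisation: the first c points act as centroids.
    centroids = [list(p) for p in points[:c]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance, the usual choice for partitioning).
        clusters = [[] for _ in range(c)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return clusters

# c must be chosen up front; the reviewed studies used values between 2 and 10.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (9.0, 9.1), (9.2, 9.0), (9.1, 8.9)]
groups = kmeans(data, c=2)
```

With two well-separated groups and c = 2, the algorithm recovers the groups; a poorly chosen c would split or merge them, which is why c appears as a reviewed parameter.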
The table additionally compared the different
distance measures used. It can be observed that
Euclidean distance is the most prominent for
partitioning algorithms, whereas hierarchical
algorithms commonly implemented the cosine
similarity measure. In terms of clustering features,
different sets were used depending on the
implemented approach. The features observed from
the review include some or all of the following:
Hashtags – 31% of the reviewed studies included
hashtags in the feature set and considered their
impact, 23% treated hashtags as normal words in
the text, and 31% removed hashtags before
analysis (excluding the 15% of studies that
cluster upon user accounts).
Account metadata – username, date, status,
latitude, longitude, followers, and account
followings.
Tweet metadata – tweet ID, published date, and
language.
BOW – maintaining a BOW of the unique words
contained in the textual data of each tweet, with
their frequencies, as the feature vector. Some studies
included hashtags in the BOW while others ignored them.
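To make the feature construction concrete, the sketch below (illustrative only; the tweet strings and function names are invented for the example) builds BOW frequency vectors over a shared vocabulary and compares them with the two distance measures discussed above, Euclidean distance and cosine similarity:

```python
import math
from collections import Counter

def bow_vectors(tweets):
    """Bag-of-words frequency vectors over the corpus-wide vocabulary."""
    counts = [Counter(t.lower().split()) for t in tweets]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

tweets = ["breaking news earthquake hits city",
          "earthquake news breaking city hit",
          "new phone release announced today"]
vecs = bow_vectors(tweets)
# The first two tweets share four words, so they are close under both measures,
# while the third shares none.
```

On short, sparse tweet vectors the two measures usually rank neighbours similarly, but cosine similarity ignores vector length, which is one reason hierarchical text-clustering work favours it.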
None of the surveys studied the impact of retweets
or “@mentions”. Moreover, some datasets did not
remove retweeted tweets, which affected the
credibility of the resulting clustering. Because tweets
commonly receive a large number of retweets, keeping
them in the dataset will produce large clusters
containing redundant tweets rather than tweets with
similar features. This will consequently reinforce
false patterns and increase run time.
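The retweet issue described above amounts to a simple pre-processing step that the reviewed datasets omitted. A minimal sketch (the tweet strings and the function name are illustrative) filters native-style retweets before clustering:

```python
def drop_retweets(tweets):
    """Remove native-style retweets ("RT @user: ...") so that near-identical
    copies of a popular tweet do not form an artificial cluster."""
    return [t for t in tweets if not t.startswith("RT @")]

stream = ["Flooding reported downtown",
          "RT @newsdesk: Flooding reported downtown",
          "RT @newsdesk: Flooding reported downtown",
          "Power outage in the east district"]
cleaned = drop_retweets(stream)
# Only the two original tweets remain.
```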
Evaluation methods vary from robust measures,
such as ASW, to manual observations, such as
manually comparing an algorithm’s detected topics
with Google News headlines. ASW has been utilised
by most of the studies to measure clustering
performance. Some of the evaluation methods are
derived from other data mining techniques, such as
association rules and classification. These methods
include clustering based on confidence and support
levels, and calculating precision, recall, and the
F-measure from a confusion matrix.
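As a concrete example of the classification-derived measures, the sketch below (the counts are hypothetical, not taken from any reviewed study) computes precision, recall, and the F-measure from confusion-matrix counts for a single cluster treated as a predicted class:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical counts: 40 tweets correctly assigned to a topic cluster,
# 10 wrongly assigned to it, and 10 belonging to it but placed elsewhere.
p, r, f1 = prf(tp=40, fp=10, fn=10)  # 0.8, 0.8, 0.8
```

Unlike ASW, which needs only the distance matrix and the cluster assignments, these measures require labelled ground truth, which is why they appear mainly in studies that validated clusters against external topic labels.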
5 CONCLUSION
The review contributes to the literature in several
significant ways. First, it provides a comparative
analysis of applications that utilised and tuned text
mining methods, particularly clustering, to the
characteristics of Twitter's unstructured data. Second,
the review concentrated on algorithms of the general
clustering methods in Twitter mining: (1) partition-based,
(2) hierarchical-based, (3) hybrid-based, and (4)
density-based. Third, unlike existing
reviews, which provide high-level and abstract
specifications of the surveyed studies, this review was
comprehensive in that it provided comparative
information and discussion across the dataset size,
approach, clustering method, algorithm, number of
clusters, distance measure, clustering features,
evaluation methods, and results.
Thirteen articles were reviewed in this paper, and
the results indicated that there has been substantial
improvement in the exploratory analysis of social
media data. However, many of the existing
methodologies have limited capabilities in their
performance and thus limited potential for
recognising patterns in the data:
Most of the dataset sizes are relatively small,
which is not indicative of the patterns in social
behaviour; therefore, generalised conclusions
cannot be drawn. Because of the sparsity of
Twitter textual data, it is difficult to discover
representative information in small datasets.