pernym and synonym approach are superior. The
whiskers are higher in both of these approaches, suggesting
that there is large variation in a small number of results.
This indicates that while the addition of hypernyms and
synonyms did, on average, improve the results, there is a
minority of instances where they added noise to the dataset
and skewed some of the results. This phenomenon is not
observed in the graph approach, which only improved upon
the results.
Figure 7: A box plot of all of the errors.
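To make the reading of the whiskers concrete, the following is a minimal sketch of how a small number of noisy results can raise the upper whisker even though the median error falls. The error values are invented for illustration and are not the paper's data. The helper name `box_stats` and the Tukey 1.5 x IQR whisker rule are assumptions, since the plotting convention behind Figure 7 is not stated.

```python
import statistics

def box_stats(errors):
    """Return the median, quartiles and whisker ends for a list of errors."""
    q1, med, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers reach the most extreme observations that lie inside the fences
    # (assumed Tukey convention).
    lower = min(e for e in errors if e >= lo_fence)
    upper = max(e for e in errors if e <= hi_fence)
    return {"median": med, "q1": q1, "q3": q3, "lower": lower, "upper": upper}

# Hypothetical error values, for illustration only.
baseline = [0.40, 0.42, 0.44, 0.45, 0.46, 0.48, 0.50]
augmented = [0.30, 0.32, 0.34, 0.35, 0.37, 0.52, 0.58]

base, aug = box_stats(baseline), box_stats(augmented)
```

In this toy data the augmented run has the lower median but the higher upper whisker, which is the pattern described for the hypernym and synonym approaches.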
7 CONCLUSIONS
This paper investigates the use of external ontologies to
improve the performance of a clustering algorithm through
meaningful augmentation of documents. A standard package,
WordNet, was used to determine whether the use of hypernyms
and synonyms can improve performance. Additionally, a
bespoke ontology was constructed that represents
relationships between terms based on co-occurrence, to see
whether the use of context can improve results. Our dataset
is not a standard document collection, so it poses
additional challenges that limit the effectiveness of
traditional clustering approaches. The best results were
achieved when context was used.
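The idea of a co-occurrence ontology used for document augmentation can be illustrated with a short sketch. This is not the construction used in the paper, whose details are not given in this section. It assumes document-level co-occurrence, a hypothetical `min_count` threshold for keeping an edge, and augmentation by appending neighbouring terms. The function names and the sample threads are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(docs, min_count=2):
    """Link two terms if they share a document at least min_count times."""
    pair_counts = Counter()
    for tokens in docs:
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:  # assumed threshold, not taken from the paper
            graph[a].add(b)
            graph[b].add(a)
    return graph

def augment(tokens, graph):
    """Append co-occurring terms that the document does not already contain."""
    present = set(tokens)
    extra = sorted({n for t in tokens for n in graph.get(t, ())} - present)
    return list(tokens) + extra

# Hypothetical related threads, already tokenised.
related = [
    ["election", "vote", "ballot"],
    ["election", "vote", "candidate"],
    ["vote", "ballot", "count"],
]
graph = build_cooccurrence(related)
augmented = augment(["election", "result"], graph)
```

Here "vote" is added to the short document because it co-occurs with "election" in two of the related threads. The augmented token list would then be vectorised and clustered as usual.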
In this work, we manually identified related threads from
which to construct our ontology. Future work will address
the automatic identification of related Reddit threads. This
can be achieved by measuring syntactical similarity between
threads. Our future aim is to enhance this context
construction and apply it to identifying different points of
view.
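As a rough illustration of how related threads might be identified automatically, the sketch below scores surface similarity between thread titles with the Jaccard coefficient over token sets. The paper does not specify a similarity measure, so the choice of Jaccard, the threshold of 0.3, the function names and the sample titles are all assumptions.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def related_threads(target, candidates, threshold=0.3):
    """Keep the candidates whose similarity to the target meets the threshold."""
    return [c for c in candidates if jaccard(target, c) >= threshold]

# Hypothetical thread titles, for illustration only.
target = "Election results and vote count"
candidates = ["Vote count for the election", "Best pizza toppings"]
related = related_threads(target, candidates)
```

Threads selected in this way could then feed the ontology construction in place of the manually chosen ones.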