tering and extraction techniques. arXiv preprint
arXiv:1707.02919.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic mod-
els. In International Conference on Machine Learning
(ICML).
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F.,
Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P.,
Gramfort, A., Grobler, J., Layton, R., VanderPlas, J.,
Joly, A., Holt, B., and Varoquaux, G. (2013). API de-
sign for machine learning software: experiences from
the scikit-learn project. In ECML PKDD Workshop:
Languages for Data Mining and Machine Learning,
pages 108–122.
Bunk, S. and Krestel, R. (2018). WELDA: Enhancing topic models by incorporating local word context. In ACM/IEEE Joint Conference on Digital Libraries, pages 293–302.
Churchill, R. and Singh, L. (2020). Percolation-based topic
modeling for tweets. In WISDOM 2020: Workshop on
Issues of Sentiment Discovery and Opinion Mining.
Churchill, R., Singh, L., and Kirov, C. (2018). A tempo-
ral topic model for noisy mediums. In Pacific-Asia
Conference on Knowledge Discovery and Data Min-
ing (PAKDD).
Denny, M. J. and Spirling, A. (2018). Text preprocessing
for unsupervised learning: Why it matters, when it
misleads, and what to do about it. Political Analysis,
26(2):168–189.
Dieng, A. B., Ruiz, F. J., and Blei, D. M. (2019a).
Topic modeling in embedding spaces. arXiv preprint
arXiv:1907.04907.
Dieng, A. B., Ruiz, F. J. R., and Blei, D. M. (2019b).
The dynamic embedded topic model. CoRR,
abs/1907.05545.
Foundation, I. (2021). Reddit statistics for 2021.
https://foundationinc.co/lab/reddit-statistics/. Ac-
cessed: 2021-03-01.
InternetLiveStats (2021). Twitter usage statistics.
http://www.internetlivestats.com/twitter-statistics/.
Accessed: 2021-03-01.
Knoblock, C. A., Lerman, K., Minton, S., and Muslea, I.
(2003). Accurately and reliably extracting data from
the web: A machine learning approach. In Intelligent
exploration of the web, pages 275–287. Springer.
Lafferty, J. D. and Blei, D. M. (2006). Correlated topic
models. In Advances in Neural Information Process-
ing Systems (NIPS), pages 147–154.
Lau, J. H., Newman, D., and Baldwin, T. (2014). Machine
reading tea leaves: Automatically evaluating topic co-
herence and topic model quality. In Conference of
the European Chapter of the Association for Compu-
tational Linguistics, pages 530–539.
Li, C., Wang, H., Zhang, Z., Sun, A., and Ma, Z. (2016).
Topic modeling for short texts with auxiliary word
embeddings. In ACM SIGIR Conference on Re-
search and Development in Information Retrieval,
pages 165–174.
McCallum, A. K. (2002). MALLET: A machine learning for language toolkit.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
Moody, C. E. (2016). Mixing Dirichlet topic models and word embeddings to make lda2vec. CoRR, abs/1605.02019.
Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134.
Noyes, D. (2020). The top 20 valuable Facebook statistics - updated May 2017. https://zephoria.com/top-15-valuable-facebook-statistics/. Accessed: 2021-03-01.
Pushshift.io (2021). Pushshift.io API documentation. https://pushshift.io/api-parameters/. Accessed: 2021-03-07.
Qiang, J., Chen, P., Wang, T., and Wu, X. (2016). Topic
modeling over short texts by incorporating word em-
beddings. CoRR, abs/1609.08496.
Qiang, J., Zhenyu, Q., Li, Y., Yuan, Y., and Wu, X.
(2019). Short text topic modeling techniques, appli-
cations, and performance: A survey. arXiv preprint
arXiv:1904.07695.
Quan, X., Kit, C., Ge, Y., and Pan, S. J. (2015). Short and
sparse text topic modeling via self-aggregation. In In-
ternational Joint Conference on Artificial Intelligence.
Rahm, E. and Do, H. H. (2000). Data cleaning: Problems
and current approaches. IEEE Data Engineering Bul-
letin, 23(4):3–13.
Raman, V. and Hellerstein, J. M. (2001). Potter’s wheel: An
interactive data cleaning system. In Very Large Data
Bases (VLDB), volume 1, pages 381–390.
Řehůřek, R. and Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In LREC Workshop on New Challenges for NLP Frameworks, pages 45–50.
Schofield, A., Magnusson, M., and Mimno, D. (2017).
Pulling out the stops: Rethinking stopword removal
for topic models. In Conference of the European
Chapter of the Association for Computational Lin-
guistics: Volume 2, Short Papers, volume 2, pages
432–436.
Singh, L., Bansal, S., Bode, L., Budak, C., Chi, G., Kawintiranon, K., Padden, C., Vanarsdall, R., Vraga, E., and Wang, Y. (2020). A first look at COVID-19 information and misinformation sharing on Twitter.
Srividhya, V. and Anitha, R. (2010). Evaluating preprocessing techniques in text categorization. International Journal of Computer Science and Application, 47(11):49–51.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data