gorithms to the benchmark Reuters dataset and a web-scraped Daily Star dataset. The experiments with the random forest and KNN classifiers show promising results: the random forest classifier achieves around 90% accuracy on the Reuters dataset and 75% accuracy on the Daily Star dataset. Compared against existing work, the classifiers show up to a 50% improvement in accuracy.
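The classification setup summarized above can be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the paper's exact configuration: the toy corpus, TF-IDF features, and hyperparameters (`n_estimators`, `n_neighbors`) are assumptions standing in for the Reuters and Daily Star experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the paper trains on the Reuters and Daily Star datasets.
docs = [
    "stocks rallied as the central bank cut interest rates",
    "the company reported quarterly earnings above forecasts",
    "the team won the championship after a late goal",
    "the striker scored twice in the final match",
]
labels = ["business", "business", "sports", "sports"]

# TF-IDF features feeding each classifier (assumed feature representation).
rf = make_pipeline(TfidfVectorizer(),
                   RandomForestClassifier(n_estimators=100, random_state=0))
knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=1))

for model in (rf, knn):
    model.fit(docs, labels)

# Classify unseen headlines with each model.
print(rf.predict(["bank profits and quarterly earnings rose"]))
print(knn.predict(["the goal sealed the match"]))
```

In practice the corpus would be split into train and test sets, and accuracy would be measured on the held-out portion, as in the reported experiments.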
The extended version of the framework will include a hybrid recommender system integrating user-centric preferences and supporting both content-based and collaborative filtering. Moreover, multiple repositories of text documents can be added to the framework for a smoother information-browsing experience. The framework will eventually allow users to browse documents and to submit their own in order to find contextually similar records. It will also grow more robust over time through a monitoring scheme that keeps user profiles updated based on search history. The authors are already working towards building such an extended system.
In sum, the proposed text classification framework delivers strong results on classification tasks and envisions an all-around platform for text processing jobs, including classification, clustering, and document recommendation. The framework brings state-of-the-art algorithms, tools, and techniques under the same umbrella to provide a unified user experience in text categorization.
ACKNOWLEDGEMENT
We thank S. M. Keramat Ali, Assistant Engineer at the California Department of Transportation, USA, who completed an MA in Language & Linguistics at the University of Dhaka, for helping us validate the classification results on the Daily Star dataset.
REFERENCES
Adhikari, A., Ram, A., Tang, R., and Lin, J. (2019). DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural ma-
chine translation by jointly learning to align and trans-
late. arXiv preprint arXiv:1409.0473.
Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., and Dhillon, I. (2019). X-BERT: eXtreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv:1905.02331.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Godoy, D. and Amandi, A. (2000). Personalsearcher: an
intelligent agent for searching web pages. pages 43–
52.
Hu, W., Xu, D., and Niu, Z. (2021). Improved k-means text clustering algorithm based on BERT and density peak. pages 260–264.
Kim, Y., Denton, C., Hoang, L., and Rush, A. M.
(2017). Structured attention networks. arXiv preprint
arXiv:1702.00887.
Lewis, D. (1997). Reuters-21578 text categorization test collection, distribution 1.0. http://www.research.att.com/lewis/reuters21578.html.
Li, Y., Cai, J., and Wang, J. (2020). A text document clustering method based on weighted BERT model. 1:1426–1430.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL.
Qaiser, S. and Ali, R. (2018). Text mining: use of
tf-idf to examine the relevance of words to docu-
ments. International Journal of Computer Applica-
tions, 181(1):25–29.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever,
I. (2018). Improving language understanding with un-
supervised learning.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Shi, H. and Wang, C. (2020). Self-supervised document clustering based on BERT with data augment. arXiv preprint arXiv:2011.08523.
Yeasmin, S., Afrin, N., Saif, K., and Huq, M. R. (2022). Daily Star dataset. https://github.com/Sumona062/Daily-Star-Datasets.git.
DATA 2022 - 11th International Conference on Data Science, Technology and Applications