Authors:
Sumona Yeasmin; Nazia Afrin; Kashfia Saif and Mohammad Rezwanul Huq
Affiliation:
Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Keyword(s):
Natural Language Processing, Machine Learning, Classification, Transformer-based Embedding, Contextual Similarity.
Abstract:
Traditional text document classification methods represent documents with non-contextualized word embeddings and vector space models, while recent techniques often rely on word embeddings as a transfer learning component. Existing text document classification methodologies are explored first, and their strengths and limitations are evaluated, starting with Bag-of-Words models and moving towards transformer-based architectures. The analysis concludes that transformer-based embeddings are necessary to capture contextual meaning. BERT, one of the transformer-based embedding architectures, produces robust word embeddings by analyzing text in both directions, left-to-right and right-to-left, thereby capturing the proper context. This research introduces a novel text classification framework based on BERT embeddings of text documents. Several classification algorithms have been applied to the word embeddings of the pre-trained state-of-the-art BERT model. Experiments show that the random forest classifier achieves higher accuracy than the decision tree and k-nearest neighbor (KNN) algorithms. Furthermore, the obtained results have been compared with existing work and show up to a 50% improvement in accuracy. In the future, this work can be extended by building a hybrid recommender system that combines content-based documents with similar features and user-centric interests. This study shows promising results and validates the proposed methodology as viable for text classification.
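The pipeline the abstract describes, namely feeding pre-trained BERT embeddings of documents into a conventional classifier such as a random forest, can be sketched roughly as follows. This is a minimal illustration only: the model name bert-base-uncased, the mean-pooling strategy, the hyperparameters, and the toy data are assumptions, not details taken from the paper.

```python
# Minimal sketch: BERT document embeddings + random forest classifier,
# assuming the Hugging Face `transformers` and `scikit-learn` libraries.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Return one fixed-size BERT embedding per document (mean-pooled)."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden layer's token embeddings into a document vector.
    return outputs.last_hidden_state.mean(dim=1).numpy()

# Hypothetical toy data for illustration only.
train_docs = ["stock markets rallied today", "the team won the final match"]
train_labels = ["business", "sports"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(embed(train_docs), train_labels)
print(clf.predict(embed(["shares fell after the earnings report"])))
```

The same embedding function could be reused with a decision tree or KNN classifier in place of the random forest, which is how the classifiers compared in the paper would plug into such a pipeline.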