Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis
Bardia Rafieian, Pere-Pau Vázquez
2024
Abstract
Long documents pose a significant challenge for natural language processing (NLP), which relies on high-quality embeddings. Despite numerous approaches spanning both deep learning and machine learning methodologies, the task remains hard. In this study, we address long document classification by leveraging recent advances in machine learning and deep learning. We conduct a comprehensive evaluation of several state-of-the-art models, including Doc2vec, Longformer, LLaMA-3, and SciBERT, focusing on their effectiveness on long to very long documents (measured in number of tokens). Furthermore, we train a Doc2vec model on a massive dataset, achieving state-of-the-art quality and surpassing other methods such as Longformer and SciBERT, which are very costly to train. Notably, while LLaMA-3 outperforms our model in certain aspects, Doc2vec remains highly competitive, particularly in speed: it is the fastest of the evaluated methods. Through experimentation, we evaluate our custom-trained Doc2vec model on classifying documents with a very large number of tokens, demonstrating its efficacy, especially on very long documents. However, our analysis also uncovers inconsistencies in the performance of all models on documents containing larger volumes of text.
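The abstract's per-length comparison (long versus very long documents, measured in tokens) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the bucket boundaries, the function names `bucket_for` and `accuracy_by_length`, and the toy records are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict

# Hypothetical token-count boundaries; the abstract does not state the actual ones.
BUCKETS = [("long", 0, 4096), ("very long", 4096, float("inf"))]


def bucket_for(n_tokens):
    """Return the name of the length bucket that contains n_tokens."""
    for name, low, high in BUCKETS:
        if low <= n_tokens < high:
            return name
    raise ValueError(f"no bucket for token count {n_tokens}")


def accuracy_by_length(records):
    """Compute classification accuracy separately for each length bucket.

    records: iterable of (n_tokens, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for n_tokens, truth, predicted in records:
        name = bucket_for(n_tokens)
        total[name] += 1
        correct[name] += int(truth == predicted)
    return {name: correct[name] / total[name] for name in total}


# Toy records invented for illustration only (not data from the paper).
toy = [
    (1200, "cs", "cs"),
    (3000, "math", "cs"),
    (9000, "cs", "cs"),
    (15000, "math", "math"),
]
print(accuracy_by_length(toy))
```

Reporting accuracy per bucket rather than as one aggregate figure is one way to surface the kind of length-dependent inconsistency the abstract mentions, since an overall average can hide a drop on the longest documents.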
Paper Citation
in Harvard Style
Rafieian B. and Vázquez P. (2024). Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-716-0, SciTePress, pages 320-327. DOI: 10.5220/0012950400003838
in Bibtex Style
@conference{kdir24,
author={Bardia Rafieian and Pere-Pau Vázquez},
title={Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis},
booktitle={Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2024},
pages={320-327},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012950400003838},
isbn={978-989-758-716-0},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis
SN - 978-989-758-716-0
AU - Rafieian B.
AU - Vázquez P.
PY - 2024
SP - 320
EP - 327
DO - 10.5220/0012950400003838
PB - SciTePress