
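Paper: Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis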

Authors: Bardia Rafieian and Pere-Pau Vázquez

Affiliation: ViRVIG Group, Department of Computer Science, UPC-BarcelonaTECH, C/ Jordi Girona 1-3, Ed. Omega 137, 08034 Barcelona, Spain

Keyword(s): Long Document Classification, Document Embeddings, Doc2vec, Longformer, LLaMA-3, SciBERT, Deep Learning, Machine Learning, Natural Language Processing (NLP).

Abstract: Long documents pose a significant challenge for natural language processing (NLP), which requires high-quality embeddings. Despite the numerous approaches that encompass both deep learning and machine learning methodologies, tackling this task remains hard. In our study, we tackle the issue of long document classification by leveraging recent advancements in machine learning and deep learning. We conduct a comprehensive evaluation of several state-of-the-art models, including Doc2vec, Longformer, LLaMA-3, and SciBERT, focusing on their effectiveness in handling long to very long documents (in number of tokens). Furthermore, we trained a Doc2vec model using a massive dataset, achieving state-of-the-art quality, and surpassing other methods such as Longformer and SciBERT, which are very costly to train. Notably, while LLaMA-3 outperforms our model in certain aspects, Doc2vec remains highly competitive, particularly in speed, as it is the fastest among the evaluated methods. Through experimentation, we thoroughly evaluate the performance of our custom-trained Doc2vec model in classifying documents with an extensive number of tokens, demonstrating its efficacy, especially in handling very long documents. However, our analysis also uncovers inconsistencies in the performance of all models when faced with documents containing larger text volumes.

CC BY-NC-ND 4.0

Paper citation in several formats:
Rafieian, B. and Vázquez, P. (2024). Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR; ISBN 978-989-758-716-0; ISSN 2184-3228, SciTePress, pages 320-327. DOI: 10.5220/0012950400003838

@conference{kdir24,
author={Bardia Rafieian and Pere{-}Pau Vázquez},
title={Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis},
booktitle={Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR},
year={2024},
pages={320-327},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012950400003838},
isbn={978-989-758-716-0},
issn={2184-3228},
}

TY - CONF

JO - Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR
TI - Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis
SN - 978-989-758-716-0
IS - 2184-3228
AU - Rafieian, B.
AU - Vázquez, P.
PY - 2024
SP - 320
EP - 327
DO - 10.5220/0012950400003838
PB - SciTePress