Evaluating the Suitability of Long Document Embeddings for
Classification Tasks: A Comparative Analysis
Bardia Rafieian (https://orcid.org/0000-0003-4591-8934) and Pere-Pau Vázquez (https://orcid.org/0000-0003-4638-4065)
ViRVIG Group, Department of Computer Science, UPC-BarcelonaTECH, C/ Jordi Girona 1-3, Ed. Omega 137, 08034, Barcelona, Spain
{bardia.rafieian, pere.pau.vazquez}@upc.edu
Keywords:
Long Document Classification, Document Embeddings, Doc2vec, Longformer, LLaMA-3, SciBERT,
Deep Learning, Machine Learning, Natural Language Processing (NLP).
Abstract:
Long documents pose a significant challenge for natural language processing (NLP) tasks that require high-quality embeddings. Despite the numerous approaches spanning both deep learning and classical machine learning, the task remains hard. In our study, we address long document classification by leveraging recent advancements in machine learning and deep learning. We conduct a comprehensive
evaluation of several state-of-the-art models, including Doc2vec, Longformer, LLaMA-3, and SciBERT, fo-
cusing on their effectiveness in handling long to very long documents (in number of tokens). Furthermore, we
trained a Doc2vec model on a massive dataset, achieving state-of-the-art quality and surpassing methods such as Longformer and SciBERT, which are very costly to train. Notably, while LLaMA-3 outperforms
our model in certain aspects, Doc2vec remains highly competitive, particularly in speed, as it is the fastest
among the evaluated methods. Through experimentation, we thoroughly evaluate the performance of our
custom-trained Doc2vec model in classifying documents with an extensive number of tokens, demonstrating
its efficacy, especially in handling very long documents. However, our analysis also uncovers inconsistencies
in the performance of all models when faced with documents containing larger text volumes.
1 INTRODUCTION
Text embeddings are pivotal in natural language pro-
cessing (NLP) tasks, such as text classification, where
the quality of embeddings significantly affects per-
formance. Traditional methods, such as Word2vec
(Mikolov et al., 2013) and GloVe (Pennington et al.,
2014), have been foundational in generating embed-
dings at the token and sentence levels. Recent ad-
vancements include transformer-based models like
BERT (Devlin et al., 2018) and ELMo (Peters et al.,
2018), which have improved the quality of embed-
dings through contextualized representations.
Despite these advancements, handling very long
documents presents substantial challenges. Models
like BERT are limited by maximum sequence lengths,
which restricts their ability to generate embeddings
for extensive texts. While recent models such as
Longformer (Beltagy et al., 2020) and large language
models offer better performance, they come with high
computational costs and resource demands (Samsi
et al., 2023). These models are expensive to train and
deploy, and there is a lack of standardized evaluations
across various benchmarks (Tay et al., 2021).
Moreover, the scarcity of datasets with very long documents further complicates the issue: most existing labeled datasets consist of short articles (up to 800 tokens per document), yet training a classifier for long texts requires a labeled dataset of long documents. This gap in the liter-
ature highlights the need for more effective methods
to handle lengthy texts while considering computa-
tional efficiency. This paper provides an evaluation
of several state-of-the-art models, including Doc2vec,
Longformer, LLaMA-3, and SciBERT, focusing on
their effectiveness in handling long to very long documents (in token count). The study specifically aims to assess
how well these models generate embeddings from
documents that are exceptionally long in terms of to-
ken count. The key question is how agnostic these
models are to document length, and how the quality
of the generated embeddings influences their perfor-
mance in downstream text classification tasks.
We also trained a Doc2vec model on a large
dataset to evaluate its capability against BERT-based
models and large language models (LLMs) in gener-
ating embeddings for very long documents. The goal
was to assess how well the Doc2vec model performs
compared to these advanced models in terms of both
the quality of the embeddings produced and their ef-
fectiveness in a text classification task. This com-
parison provides insights into the relative strengths
and limitations of traditional embedding methods
like Doc2vec versus modern transformer-based ap-
proaches when handling lengthy documents. Given
the scarcity of very long document datasets, our eval-
uation utilizes both public datasets and newly cre-
ated datasets with over 4,000 tokens from arXiv and
bioRxiv documents. The paper is structured as fol-
lows: Section 2 reviews related work. Section 3 de-
tails data preparation and preprocessing steps and the
preparation of our pretrained Doc2vec model. Section 4 presents our experiments, and finally we conclude and discuss future directions.
2 RELATED WORK
With the introduction of Word2Vec and GloVe, vari-
ous methods have emerged to encode sentences, para-
graphs, and longer texts into embeddings. Among
these methods, Doc2vec (Le and Mikolov, 2014), also known as Paragraph Vector, stands out as an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text. Empirical
results have shown that Doc2vec outperforms bag-
of-words models and other text representation tech-
niques.
The advent of transformer models brought signif-
icant improvements in text encoders. BERT-based
models, in particular, demonstrated substantial per-
formance gains. The first application of BERT to
document classification, as presented in "DocBERT: BERT for Document Classification" (Adhikari et al.,
2019), improved baseline results by fine-tuning
BERT, achieving higher classification accuracy across
various datasets. However, BERT-based models were
limited by a fixed input sequence length of 512 to-
kens. To address domain-specific text, SciBERT fine-tuned the BERT architecture on scientific corpora; note that it retains the 512-token input limit, while its hidden representations are 768-dimensional (see Table 2). Since SciBERT follows the BERT architecture, which uses the Transformer encoder, the process of generating embeddings can be described as follows:
The input is a tokenized text sequence:
x = [x_1, x_2, \ldots, x_n]
The tokens are then converted to embeddings:
E = [E(x_1), E(x_2), \ldots, E(x_n)]
Next, these embeddings pass through multiple
transformer layers. Each transformer layer applies
self-attention:
H^{(l)} = \mathrm{TransformerLayer}(H^{(l-1)})
where H^{(0)} = E, and l is the layer number.
The final hidden states from the last transformer
layer are used for downstream tasks:
H^{(L)} = [h_1, h_2, \ldots, h_n]
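For illustration, this pipeline maps directly onto a few lines of code. The following is a minimal sketch, assuming the Hugging Face transformers library and a generic BERT checkpoint (bert-base-uncased); mean pooling is one common way to collapse H^{(L)} into a single vector, not the only one.

    # Minimal sketch: obtaining the final hidden states H^(L) of a BERT-style
    # encoder with Hugging Face transformers. The checkpoint and the mean
    # pooling step are illustrative assumptions, not prescribed by the text.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    text = "Transformer encoders map token sequences to contextual embeddings."
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    hidden_states = outputs.last_hidden_state  # H^(L), shape (1, n, 768)
    doc_embedding = hidden_states.mean(dim=1)  # mean pooling over tokens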
Despite these advancements, transformer-based mod-
els struggle with processing long sequences due to
the computational complexity of their self-attention
mechanism, which can lead to information loss in
documents with more than 1,000 tokens. To over-
come this limitation, the Longformer was introduced.
It features an attention mechanism that scales lin-
early with sequence length, allowing it to handle doc-
uments with thousands of tokens. The Longformer
achieves this by sparsifying the full self-attention ma-
trix according to an "attention pattern" that speci-
fies which input locations attend to each other. This
makes the model efficient for longer sequences. At
the time of its introduction, the Longformer con-
sistently outperformed RoBERTa on long document
tasks, setting new state-of-the-art results on datasets
like WikiHop and TriviaQA. The process of generating embeddings with the Longformer is as follows:
The input is a tokenized sequence:
x = [x_1, x_2, \ldots, x_n]
The attention mechanism is restricted to a fixed-
size window:
A_i = \mathrm{Attention}\left(H^{(l-1)}_i,\; H^{(l-1)}_{i-w:i+w}\right)
where w is the window size. Global attention is ap-
plied to selected important tokens across the entire
sequence. The embeddings are passed through multi-
ple layers of this modified attention mechanism. The
final hidden states are used for downstream tasks:
H^{(L)} = [h_1, h_2, \ldots, h_n]
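To make the windowed pattern concrete, the toy sketch below computes A_i in plain NumPy for a single head with no learned projections. It is meant only to show that the cost grows as O(n·w) rather than O(n^2); it does not reproduce Longformer's actual implementation.

    # Toy illustration of Longformer's local attention pattern: each token i
    # attends only to the window [i-w, i+w] instead of the full sequence.
    import numpy as np

    def sliding_window_attention(H, w):
        n, d = H.shape
        out = np.zeros_like(H)
        for i in range(n):
            lo, hi = max(0, i - w), min(n, i + w + 1)
            scores = H[lo:hi] @ H[i] / np.sqrt(d)    # scores within the window
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()                 # softmax over the window
            out[i] = weights @ H[lo:hi]              # A_i: weighted sum of values
        return out

    H = np.random.default_rng(0).normal(size=(1000, 64))
    A = sliding_window_attention(H, w=256)  # cost grows as O(n * w), not O(n^2)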
More recently, significant efforts have been made
to improve the performance of text encoders on long
texts. Notable examples include the LLaMA-2 (Touvron and Lavril, 2023) and LLaMA-3 models. Although detailed technical information about these models is limited, they introduce novel methods for generating embeddings from long texts, further advancing the field of NLP. Since we used LLaMA-3 and GEMMA-2B, we describe the process of generating embeddings below:
LLaMA follows a standard transformer architec-
ture with self-attention and feedforward networks.
The input is a tokenized text sequence:
x = [x_1, x_2, \ldots, x_n]
Each transformer layer consists of multi-head self-
attention and feedforward networks:
H^{(l)} = \mathrm{MultiHeadAttention}(H^{(l-1)}) + \mathrm{FFN}(H^{(l-1)})
The final hidden states are used for downstream
tasks:
H^{(L)} = [h_1, h_2, \ldots, h_n]
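In practice, such decoder-only hidden states can be extracted with the transformers library. The sketch below is hedged: the checkpoint name (meta-llama/Meta-Llama-3-8B, which is gated) and last-token pooling are assumptions for illustration, not the exact recipe used in this paper (our LLaMA-3 embeddings were obtained through Ollama, as described in Section 3.2).

    # Hedged sketch: last-layer hidden states H^(L) from a decoder-only LM.
    # The checkpoint is a gated stand-in; last-token pooling is one common
    # convention for causal models, not prescribed by the paper.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint; requires access
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

    inputs = tokenizer("A very long document ...", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    last_hidden = outputs.hidden_states[-1]  # H^(L), shape (1, n, 4096)
    doc_embedding = last_hidden[:, -1, :]    # last-token pooling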
A recent study on transformer-based models
(Fields et al., 2024) addresses key questions such as
"How Wide, How Large, How Long, How Accurate, How Expensive, and How Safe are they?" The study
emphasizes the latest advancements in large language
models (LLMs) by evaluating their accuracy across
358 datasets spanning 20 different applications. The
findings challenge the assumption that LLMs are uni-
versally superior, revealing unexpected results related
to accuracy, cost, and safety. LLMs now encompass
both unimodal and multimodal tasks, where unimodal
models use only textual information, and multimodal
models incorporate text, video, signals, images, au-
dio, and columnar data for classification. The pa-
per highlights that while recent models like GPT-4
and Longformer can handle input text lengths of up
to 8,192 tokens with high accuracy in classification
tasks, the cost of training these LLMs, along with the
associated economic and environmental concerns, has
become a significant issue in recent years. Another
notable study by (Wagh et al., 2021) examines the
classification of long documents. The authors reaf-
firm that while BERT-based models can perform well
across various datasets and are suitable for document
classification tasks, they come with a high compu-
tational cost. They also point out that long docu-
ment classification is a relatively simple task, and
even basic algorithms can achieve competitive perfor-
mance compared to BERT-based approaches on most
datasets.
3 METHODOLOGY
In our paper, we compare and discuss the capabil-
ities of state-of-the-art models in generating high-
quality embeddings for very long texts. Subsequently,
we evaluate the generated embeddings using various
methods, including Doc2vec, in the context of docu-
ment classification tasks.
3.1 Datasets
Given the ongoing challenge of benchmarking very
long texts due to the lack of agreement on datasets
and baselines (Tay et al., 2021), we have prepared and
introduced datasets with more than 1,000 tokens per
text to evaluate embedding quality. Table 1 shows detailed information about each dataset.
Table 1: Dataset information including average token count, sample size, and number of labels. Note (*): details for the 20 news and arxiv 100 datasets are given in the Appendix.
Dataset # Avg. Tokens Size Labels
Dataset#1 7630 554 11
Dataset#2 11305 1101 11
s2orc 3450 58905 4
20 news* 149 11297 20
arxiv 100* 121 100004 10
S2ORC. Semantic Scholar Open Research Corpus
(Lo et al., 2020) is a comprehensive corpus designed
for natural language processing and text mining on
scientific papers. It includes over 136 million pa-
per nodes, with more than 12.7 million full-text pa-
pers connected by approximately 467 million cita-
tion edges, derived from various sources and aca-
demic disciplines. The number of tokens in our se-
lected dataset ranges from 1 to 287,400. We chose
documents with at least 200 tokens from two classes, Computer Science and Physics, to ensure they are not shorter than the shortest document in our test set.
arxiv + Biorxiv. This dataset includes documents from 2022 and 2023, combining arXiv and bioRxiv, with roughly 550 and 1,100 documents respectively (see Table 1). Each document includes the full text of the paper, averaging more than 7,000 tokens after preprocessing (tokenization, lemmatization, stop-word removal, and removal of extra phrases). These datasets encompass multiple classes: arxiv+biorxiv 2022 (labeled Dataset#1) includes Evolutionary Biology, Paleontology, Mathematics, Computer Science, Zoology, Statistics, Pharmacology and Toxicology, Biochemistry, Economics, Physics, and Electrical Engineering. The arxiv+biorxiv 2023 dataset (labeled Dataset#2) contains the labels Biochemistry, Paleontology, Genomics, Quantitative Biology, Quantitative Finance, Statistics, Computer Science, Electrical Engineering and Systems Science, Mathematics, Physics, and Zoology.
To prepare them, we first converted PDF docu-
ments to text format and then removed author names,
images, tables, captions, references, acknowledg-
ments, and formulas. Furthermore, we eliminated
sentences with fewer than three tokens. All prepro-
cessing steps, as well as subsequent operations, were
executed using Python 3.
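As one concrete instantiation of these steps, the sketch below uses NLTK; the paper only states that Python 3 was used, so the specific toolkit and the exact order of operations are our assumptions.

    # Sketch of the preprocessing described above: tokenization, lemmatization,
    # stop-word removal, and dropping sentences with fewer than three tokens.
    # NLTK is an assumed toolkit choice.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(raw_text: str) -> str:
        kept = []
        for sent in sent_tokenize(raw_text):
            tokens = [lemmatizer.lemmatize(t.lower())
                      for t in word_tokenize(sent)
                      if t.isalpha() and t.lower() not in stop_words]
            if len(tokens) >= 3:   # drop sentences with fewer than three tokens
                kept.append(" ".join(tokens))
        return " ".join(kept)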
3.2 Embeddings
Doc2vec. Given Doc2vec’s scalability with large
datasets, we explored its functionalities by training it
on extensive technical corpora. For training purposes,
we focused on technical documents of S2ORC and
collected 341,891 documents, totaling approximately
10GB, from fields including Engineering, Computer
Science, Physics, and Math. It is important to note
that we excluded the test set from the training set to
train the Doc2vec model effectively. Finally, we gen-
erated the embeddings using Doc2vec.
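A minimal gensim sketch of this training setup follows; vector_size=400 matches Table 2, while the remaining hyperparameters and the toy corpus are illustrative assumptions.

    # Minimal sketch of Doc2vec training with gensim. vector_size=400 matches
    # Table 2; the other settings are assumptions (min_count=1 only so this
    # tiny example runs; a real corpus would use a larger threshold).
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    training_texts = [  # placeholder documents standing in for the S2ORC corpus
        "graph neural networks accelerate physics simulations",
        "sparse attention enables long document retrieval",
    ]
    corpus = [TaggedDocument(words=doc.split(), tags=[i])
              for i, doc in enumerate(training_texts)]

    model = Doc2Vec(vector_size=400, window=5, min_count=1, workers=8, epochs=20)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    embedding = model.infer_vector("an unseen long document".split())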
SciBERT. (Beltagy et al., 2019), is a BERT-based
model pre-trained on a large corpus of scientific text,
which includes papers from the corpus of Semantic
Scholar. The model aims to address the unique chal-
lenges posed by scientific text, such as specialized ter-
minology and longer sentence structures. By leverag-
ing this specialized pre-training, SciBERT achieves
better performance on downstream scientific NLP
tasks compared to the vanilla BERT model, particu-
larly in domains like biomedical and computer sci-
ence literature. In this model, full-text documents are encoded in chunks of 512 tokens. We generated text embeddings using the scibert_scivocab_uncased checkpoint with a maximum sequence length of 512 tokens. The final layer's hidden states were used as embeddings, with mean pooling applied to obtain sentence-level embeddings. We used the Hugging Face transformers library (version 4.x) for model loading and inference.
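The chunked encoding can be sketched as follows; splitting with return_overflowing_tokens and averaging the per-chunk mean-pooled vectors is our assumed way of combining chunks into one document vector.

    # Hedged sketch of chunked SciBERT encoding: split a long document into
    # 512-token windows, mean-pool tokens within each chunk, then average
    # across chunks (the chunk-combination step is our assumption).
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

    text = " ".join(["Scientific prose about proteins and physics."] * 400)
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512,
                 return_overflowing_tokens=True, padding=True)

    with torch.no_grad():
        out = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])

    mask = inputs["attention_mask"].unsqueeze(-1)             # (chunks, 512, 1)
    chunk_means = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    doc_embedding = chunk_means.mean(0)                       # (768,)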
Longformer. Introduced by (Beltagy et al., 2020),
addresses the challenge of processing long documents
by extending the input sequence token size up to
4096 tokens, significantly more than BERT’s 512-
token limit. Longformer employs a combination of
local and global attention mechanisms that scale lin-
early with the sequence length, allowing it to han-
dle much longer documents efficiently. This model
is specifically designed to mitigate the computational
inefficiencies of the quadratic complexity of the stan-
dard self-attention mechanism in BERT. In our ex-
periments, we utilized the Longformer-large model
allenai/longformer-large-4096 to generate document
embeddings. This model comprises 24 layers, each
with a hidden size of 1024, and uses 16 attention
heads. It is capable of processing sequences up to
4096 tokens in length, leveraging a sliding window
attention mechanism with a window size of 512 to-
kens and supporting global attention for key tokens.
Embeddings were generated by extracting the CLS
token’s output from the last hidden layer, optionally
followed by mean pooling for a fixed-size representa-
tion.
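A sketch of this setup with the transformers library follows; placing global attention only on the [CLS] token matches the configuration described above, while the sample input is a placeholder.

    # Hedged sketch of Longformer document embedding: global attention on the
    # [CLS] token, whose last-layer output is taken as the document vector.
    import torch
    from transformers import LongformerModel, LongformerTokenizer

    name = "allenai/longformer-large-4096"
    tok = LongformerTokenizer.from_pretrained(name)
    model = LongformerModel.from_pretrained(name)

    long_text = " ".join(["A very long scientific document."] * 500)  # placeholder
    inputs = tok(long_text, return_tensors="pt", truncation=True, max_length=4096)

    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1  # [CLS] attends to the whole sequence

    with torch.no_grad():
        out = model(**inputs, global_attention_mask=global_attention_mask)

    doc_embedding = out.last_hidden_state[:, 0, :]  # CLS token, shape (1, 1024)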
LLaMA-3 and GEMMA-2B. Large Language
Model for AI Assistance (Touvron and Lavril, 2023)
represents a significant advancement in the realm of
large-scale language models. Unlike earlier models
like BERT or even Longformer, which are constrained
by their maximum input sequence lengths, LLaMA-
3 is designed to handle extremely large contexts, accommodating up to 8,192 tokens per sequence (see Table 2).
This makes it particularly suitable for tasks involv-
ing extensive documents, such as entire books, com-
prehensive reports, and complex dialogues. More-
over, GEMMA-2B (Team, 2024) (Generative Em-
bedding Model with Multi-headed Attention) distin-
guishes itself with a focus on generating high-quality
embeddings for downstream NLP tasks. This model
operates with a maximum input sequence length of
2048 tokens, striking a balance between the exten-
sive context capabilities of models like LLaMA-3 and
the more focused scope of traditional models. We
generated text embeddings using the LLaMA 3 8B
model provided by Ollama (Ollama, 2024). This
model, which has 8.03 billion parameters, is opti-
mized for instruction-following tasks and operates
efficiently through quantization techniques, such as
Q4_0. The embedding generation process utilizes
the output from the model’s last hidden layer, en-
suring rich contextual representations of the input
text. Ollama’s quantization reduces the model’s size
to 5.5GB, allowing for effective deployment on lo-
cal hardware while maintaining high-quality perfor-
mance. Table 2 lists detailed information for each model.
Table 2: Characteristics of different models including vo-
cabulary size, corpus size, maximum length, and embed-
ding size.
Model Vocab Corpus Max Len Embedding
SciBERT 30K 1.14M 512 768
Doc2vec 33K 1.2M 10k 400
LLaMA-3 128K 15 Tn 8k 4096
GEMMA-2B 256K 6 Tn 8k 2048
Longformer 30K 33 Tn 4k 768
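As a hedged sketch of the Ollama-based setup described above, a locally running Ollama server exposes an HTTP embeddings endpoint; the request below follows its documented /api/embeddings interface, with the prompt as a placeholder.

    # Hedged sketch: requesting an embedding from a locally running Ollama
    # server (default port 11434) for the LLaMA 3 8B model.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "llama3", "prompt": "A very long scientific document ..."},
    )
    vector = resp.json()["embedding"]  # 4096-dimensional for LLaMA-3 8B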
4 EXPERIMENTS AND RESULTS
In this section, we present the results of our experi-
ments on several datasets using state-of-the-art mod-
els to generate high-quality embeddings for text clas-
sification. The models evaluated include Doc2vec,
LLaMA-3, Longformer, SciBERT, and GEMMA-2B.
We utilized both SVM and MLP classifiers to assess
the performance of these embeddings. The evalua-
tion metrics include accuracy, precision, recall, and
F1 score. The reason we selected these classifiers,
rather than model-based ones like LongformerClas-
sifier, is to remain agnostic regarding classifier selec-
tion. This approach allows us to reuse the embeddings
for other NLP tasks, providing greater flexibility and
utility. Below we give more information on each:
SVM. We utilized a Support Vector Machine (SVM)
classifier with a linear kernel to perform the classifi-
cation tasks. The model was configured with a regu-
larization parameter, C, set to 1.0 to balance the trade-
off between minimizing training error and achieving
low testing error. The SVM classifier was trained
on the given feature set and corresponding labels, fa-
cilitating effective class separation within the feature space.
MLP. We utilized the MLPClassifier from
scikit-learn to build a neural network classifier for our
dataset. The model features two hidden layers with
100 and 50 neurons, respectively, and was trained for
a maximum of 60 iterations. We set the random seed
to 42 for reproducibility.
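The two classifier configurations can be reproduced with scikit-learn as sketched below; the synthetic data merely stands in for the document embeddings and their labels.

    # Sketch of the SVM and MLP configurations stated above. The synthetic
    # data is a placeholder for the document embeddings and labels.
    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=400, n_informative=50,
                               n_classes=11, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    svm = SVC(kernel="linear", C=1.0)                  # linear kernel, C = 1.0
    mlp = MLPClassifier(hidden_layer_sizes=(100, 50),  # two hidden layers
                        max_iter=60, random_state=42)

    for clf in (svm, mlp):
        clf.fit(X_train, y_train)
        print(classification_report(y_test, clf.predict(X_test)))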
4.1 Results
We observed Doc2vec consistently demonstrated ro-
bust performance on the Dataset#2 dataset, achiev-
ing an MLP accuracy of 0.67, and an F1 score of
0.65. Longformer also delivered competitive results,
with SVM accuracy of 0.64, and an F1 score of 0.65.
In contrast, SciBERT and LLaMA-3 showed slightly
lower performance, with SVM accuracies of 0.61 and
0.56, and MLP accuracies of 0.64 and 0.60. The
GEMMA-2B model, however, had the least favor-
able outcomes. The weaker results of the GEMMA-2B model compared with LLaMA-3 can be attributed to its smaller embedding dimension and parameter count. We were surprised by the strong performance of TF-IDF representations, which outperformed all other models, likely due to their effectiveness in handling massive documents (a sketch of such a baseline follows this paragraph).
On Dataset#1, Doc2vec
emerged as the top performer, achieving an SVM ac-
curacy of 0.7590, an MLP accuracy of 0.71, and an
F1 score of 0.78. SciBERT followed closely, with an
SVM accuracy of 0.72, an MLP accuracy of 0.71, and
an F1 score of 0.72. Longformer, however, showed a
decline in performance, reflected by an SVM accu-
racy of 0.5500, an MLP accuracy of 0.5833, and an
F1 score of 0.5550. LLaMA-3 provided moderate re-
sults with an SVM accuracy of 0.4940, an MLP ac-
curacy of 0.3976, and an F1 score of 0.7804. Mean-
while, GEMMA-2B continued to struggle, recording
the lowest performance metrics with an SVM accu-
racy of 0.4700, an MLP accuracy of 0.4600, and an
F1 score of 0.4500.
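For reference, the kind of TF-IDF baseline reported above can be assembled in a few lines; the vectorizer settings and toy data below are assumptions, since the paper does not specify them.

    # Sketch of a TF-IDF + linear SVM baseline. Vectorizer settings and the
    # toy documents are assumptions; real inputs are raw document strings.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    train_texts = ["long physics paper about lattices",
                   "long biology paper about genomes"]   # placeholders
    train_labels = [0, 1]

    tfidf_svm = make_pipeline(TfidfVectorizer(max_features=50000),
                              SVC(kernel="linear", C=1.0))
    tfidf_svm.fit(train_texts, train_labels)
    print(tfidf_svm.predict(["a new paper about genome assembly"]))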
Finally, LLaMA-3 demonstrated superior perfor-
mance on the S2ORC dataset with two classes,
achieving nearly perfect scores with an SVM ac-
curacy, MLP accuracy, and F1 score all at 0.99.
Doc2vec also showed strong results, with an SVM
accuracy of 0.97, an MLP accuracy of 0.99, and an
F1 score of 0.9776. Both Longformer and SciBERT
maintained high levels of accuracy, with SVM scores
of 0.96 and 0.97, and MLP accuracies of 0.99 and
0.97, respectively, complemented by high F1 scores.
These findings highlight the remarkable efficiency of
LLaMA-3 and Doc2vec in managing large-scale sci-
entific documents.
The results in Table 3 indicate that LLaMA-3 consistently outperforms other models across various datasets, particularly on the 20 news (see Appendix) and
S2ORC datasets, demonstrating its robustness and ef-
fectiveness in handling long and shorter documents
by generating high-quality embeddings. Doc2vec
also shows competitive performance, especially on
the S2ORC dataset. Longformer and SciBERT ex-
hibit moderate performance, with SciBERT perform-
ing better on the arxiv 100 dataset (see Appendix).
GEMMA-2B, while a powerful model for embedding
generation, did not perform as well in this classifica-
tion task, suggesting that its embeddings might need
further fine-tuning for specific tasks or datasets. To
further analyze the effectiveness of the embeddings
generated by the different models, we projected the
high-dimensional embeddings into a 2D space us-
ing the PACMAP dimensionality reduction technique
(Wang et al., 2021). This visualization allows for a deeper understanding of how well the models differentiate between classes in various datasets (see the Appendix).
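A sketch of this projection step is shown below; the random vectors stand in for the extracted document embeddings, and the pacmap package provides the PaCMAP implementation.

    # Sketch of the PaCMAP projection behind Figure 3; random vectors stand in
    # for the document embeddings and their class labels.
    import matplotlib.pyplot as plt
    import numpy as np
    import pacmap

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(500, 768)).astype(np.float32)  # placeholder
    labels = rng.integers(0, 4, size=500)                        # placeholder

    reducer = pacmap.PaCMAP(n_components=2)
    xy = reducer.fit_transform(embeddings)

    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4)
    plt.title("2D PaCMAP projection of document embeddings")
    plt.show()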
4.2 Training and Inference Time
Transformer-based models, such as SciBERT, LLMs,
and Longformer, possess a complex architecture in-
volving multi-head self-attention mechanisms and
multiple layers, which enable them to capture complex dependencies and contextual information. These models typically require massive datasets for pretraining; depending on the model size and hardware, training time can range from several days to months, although fine-tuning usually takes a few
hours to a few days on powerful GPUs (Devlin et al.,
2018). In contrast, simpler models like Word2Vec
and Doc2vec use much less complex architectures.
Word2Vec, for example, leverages shallow neural net-
works with a single hidden layer, while Doc2vec ex-
tends Word2Vec by considering document context but
remains relatively straightforward. These models also
utilize large datasets but not to the extent required
for transformer models, typically training on corpora
KDIR 2024 - 16th International Conference on Knowledge Discovery and Information Retrieval
324
Table 3: Evaluation metrics (macro-averaged Precision, Recall, F1 score) and SVM/MLP classification accuracy for different models on Dataset#2 (arxiv+biorxiv 2023), Dataset#1 (arxiv+biorxiv 2022), and the s2orc dataset.
Dataset Model Macro avg. P Macro avg. R Macro avg. F1 SVM acc MLP acc
Dataset#2
Doc2vec 0.6702 0.6545 0.6593 0.6545 0.6780
GEMMA-2B 0.5100 0.5100 0.5000 0.5000 0.5000
LLaMA-3 0.5964 0.5697 0.5744 0.5697 0.6000
Longformer 0.6919 0.6424 0.6519 0.6424 0.6420
SciBERT 0.6295 0.6182 0.6213 0.6180 0.6400
TF-IDF 0.8900 0.8800 0.8800 0.8800 0.8900
Dataset#1
Doc2vec 0.8181 0.7711 0.7825 0.7590 0.7100
GEMMA-2B 0.4800 0.4700 0.4500 0.4700 0.4600
LLaMA-3 0.7819 0.7819 0.7804 0.4940 0.3976
Longformer 0.6789 0.5500 0.5550 0.5500 0.5833
SciBERT 0.7597 0.7229 0.7279 0.7229 0.7100
TF-IDF 0.7400 0.7200 0.7200 0.7600 0.7800
s2orc
Doc2vec 0.9775 0.9777 0.9776 0.9778 0.9998
LLaMA-3 0.9976 0.9976 0.9976 0.9976 0.9976
Longformer 0.9674 0.9677 0.9675 0.9678 0.9993
SciBERT 0.9749 0.9749 0.9749 0.9749 0.9797
TF-IDF 0.9700 0.9700 0.9700 0.9800 0.9800
containing millions to billions of words. Training
these models is much faster, with Doc2vec being
trainable on a large corpus in a matter of hours using a
few CPUs or a single GPU, still considerably quicker
than transformer models. Figures 1a and 1b compare the fine-tuning/training time and the memory usage of full self-attention, different Longformer implementations, and Doc2vec.
As shown in Figure 2, Doc2vec can perform competi-
tively while offering significant advantages in terms
of inference time, resource requirements, and energy
consumption. Specifically, Doc2vec demonstrates
much faster average embedding inference times on a CPU, needing significantly less computa-
tional resources and consuming less energy compared
to other models.
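The per-document inference times behind Figure 2 can be measured with a simple wall-clock loop of the following kind; here infer stands in for any model's embedding call.

    # Simple wall-clock timing of the kind behind Figure 2. `infer` stands in
    # for any model's embedding call (e.g., Doc2vec's infer_vector on CPU).
    import time

    def average_inference_time(infer, docs):
        start = time.perf_counter()
        for doc in docs:
            infer(doc)
        return (time.perf_counter() - start) / len(docs)  # seconds per document

    # e.g.: avg = average_inference_time(
    #     lambda d: model.infer_vector(d.split()), test_documents)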
5 LIMITATIONS
One of the significant challenges encountered in this
study was finding datasets with tokens exceeding
1,000 to effectively compare the models’ ability to ex-
tract embeddings from very long texts. Such datasets
are crucial for evaluating model performance on ex-
tended sequences.
Additionally, running inference with heavy models like LLaMA-3 and GEMMA-2B required substantial
time, effort, and computational resources. These
models have considerable demands, and their infer-
ence process was constrained by the limitations of
available libraries and computing environments.
6 CONCLUSIONS
In this study, we have evaluated the performance of
various state-of-the-art models, including Doc2vec,
SciBERT, Longformer, LLaMA-3, and GEMMA-2B,
on the task of generating high-quality embeddings for
text classification. Our experiments spanned multi-
ple datasets such as 20 news, arxiv 100, Dataset#1,
Dataset#2, and S2ORC, providing a comprehensive
analysis of each model’s strengths and limitations.
The results indicate that LLaMA-3 consistently
outperforms other models across different datasets,
particularly excelling in the 20 news and S2ORC
datasets with superior accuracy and F1 scores. SciB-
ERT also demonstrated robust performance, espe-
cially with the arxiv 100 dataset. Notably, Doc2vec,
while slightly behind in absolute performance met-
rics, offers competitive results with significantly bet-
ter computational efficiency, making it an excel-
lent choice for applications requiring faster inference
times and lower resource consumption. This bal-
ance between performance and efficiency is critical
for practical deployment in real-world scenarios.
Additionally, our study highlighted the challenges
associated with handling very long documents, where
models like Longformer and LLaMA-3, designed for
extended context processing, showed significant ad-
vantages. However, GEMMA-2B, despite its pow-
erful embedding capabilities, requires further fine-
tuning.
In future work, we aim to investigate the quality of em-
beddings in additional NLP tasks, such as question
(a) Fine-tuning time of full self-attention, Longformer variants, and Doc2vec (Beltagy et al., 2020).
(b) Memory usage of full self-attention, Longformer variants, and Doc2vec (Beltagy et al., 2020).
Figure 1: Performance comparison of Longformer and Doc2vec models.
Figure 2: Inference time of different models for embedding
extraction.
answering and summarization on very long texts. We
will also review the tuned combinations of embed-
dings for specific tasks and domains.
ACKNOWLEDGEMENTS
This project has been supported by grant PID2021-122136OB-C21 from the Ministerio de Ciencia e Innovación and by FEDER (EU) funds.
REFERENCES
Adhikari, A., Ram, A., Tang, R., and Lin, J. (2019).
Docbert: BERT for document classification. CoRR,
abs/1904.08398.
Beltagy, I., Lo, K., and Cohan, A. (2019). Scibert: A pre-
trained language model for scientific text.
Beltagy, I., Peters, M. E., and Cohan, A. (2020). Long-
former: The long-document transformer.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018).
BERT: pre-training of deep bidirectional transformers
for language understanding. CoRR, abs/1810.04805.
Fields, J., Chovanec, K., and Madiraju, P. (2024). A survey
of text classification with transformers: How wide?
how large? how long? how accurate? how expensive?
how safe? IEEE Access, 12:6518–6531.
Lang, K. (1995). Newsweeder: learning to filter netnews. In
Proceedings of the Twelfth International Conference
on International Conference on Machine Learning,
ICML’95, page 331–339, San Francisco, CA, USA.
Morgan Kaufmann Publishers Inc.
Le, Q. V. and Mikolov, T. (2014). Distributed rep-
resentations of sentences and documents. CoRR,
abs/1405.4053.
Lo, K., Wang, L. L., Neumann, M., Kinney, R., and Weld,
D. (2020). S2ORC: The semantic scholar open re-
search corpus. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 4969–4983, Online. Association
for Computational Linguistics.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. CoRR, abs/1301.3781.
Ollama (2024). Ollama: AI models locally. Accessed: July
26, 2024.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe:
Global vectors for word representation. In Moschitti,
A., Pang, B., and Daelemans, W., editors, Proceed-
ings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1532–
1543, Doha, Qatar. Association for Computational
Linguistics.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M.,
Clark, C., Lee, K., and Zettlemoyer, L. (2018).
Deep contextualized word representations. CoRR,
abs/1802.05365.
Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A.,
Jones, M., Bergeron, W., Kepner, J., Tiwari, D., and
Gadepally, V. (2023). From Words to Watts: Bench-
marking the Energy Costs of Large Language Model
Inference. arXiv e-prints, page arXiv:2310.03003.
Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham,
P., Rao, J., Yang, L., Ruder, S., and Metzler, D. (2021).
Long range arena : A benchmark for efficient trans-
formers. In International Conference on Learning
Representations.
Team, G. (2024). Gemma: Open models based on gemini
research and technology.
Touvron, H. and Lavril, T. (2023). Llama: Open and efficient foundation language models.
Wagh, V., Khandve, S. I., Joshi, I., Wani, A., Kale, G., and
Joshi, R. (2021). Comparative study of long document
classification. CoRR, abs/2111.00702.
Wang, Y., Huang, H., Rudin, C., and Shaposhnik, Y. (2021).
Understanding how dimension reduction tools work:
An empirical approach to deciphering t-sne, umap,
trimap, and pacmap for data visualization. Journal
of Machine Learning Research, 22(201):1–73.
APPENDIX
Datasets. 20 News: (Lang, 1995) is widely used for
text classification and natural language processing
(NLP) tasks. It contains approximately 20,000 news-
group documents, divided into 20 different news-
groups.
arxiv 100: This dataset comprises 100,000 arXiv paper
abstracts and averages 121 tokens per document, cov-
ering subjects such as Electrical Engineering and Sys-
tems Science, Statistics, Computer Science, Physics,
Quantum Physics, Mathematics, High Energy Physics
- Theory, High Energy Physics, Condensed Matter
Physics, and Astrophysics.
Results: On the 20 news dataset, LLaMA-3 significantly
outperformed other models, achieving an SVM accu-
racy of 0.97 and an F1 score of 0.97. Doc2vec showed
decent performance with an SVM accuracy of 0.75,
while its F1 score was 0.67. Longformer and SciB-
ERT demonstrated moderate results, with SVM accu-
racies of 0.75 and 0.66, and MLP accuracies of 0.65
and 0.65, respectively. LLaMA-3’s results reflect its
superior ability to handle the complexity of the news-
group data.
On arxiv 100, SciBERT led with an SVM accuracy of 0.81 and an F1 score of 0.81. Doc2vec
followed closely, with SVM and MLP accuracies of
0.81 and 0.76. LLaMA-3 also performed well, show-
ing an SVM accuracy of 0.78, and an F1 score of 0.78.
Longformer lagged behind with an SVM accuracy of
0.72 and an MLP accuracy of 0.74, with an F1 score
of 0.72. These results underscore SciBERT’s effec-
tiveness in handling scientific abstracts and technical
documents. Table 4 summarizes the classification and
F-score results on these datasets.
Dimensionality Reduction and Embedding Analy-
sis: We applied the PACMAP dimensionality reduc-
tion method (Wang et al., 2021) to embeddings ex-
tracted from various models on the S2ORC test set.
As illustrated in Figure 3, LLaMA-3 effectively sep-
arated the embeddings in the 2D space, demonstrat-
ing distinct class separation. While Doc2Vec and
SciBERT also achieved some degree of separation be-
tween classes, the resulting data points remained in
close proximity within the 2D space. Finally, Longformer, despite distinguishing the classes, showed the weakest separation among the models.
Table 4: Evaluation metrics (Precision, Recall, F1 Score)
and SVM/MLP classification results for different models
across arxiv 100 and 20 news datasets.
Data Model P R F1 SVM MLP
20n
Doc2vec 0.6700 0.6665 0.6665 0.747 0.694
LLaMA-3 0.9775 0.9741 0.9749 0.974 0.971
Longf 0.6481 0.6324 0.6303 0.746 0.646
SciBERT 0.6628 0.6581 0.6589 0.658 0.653
arxiv 100
Doc2vec 0.8009 0.8007 0.8007 0.805 0.756
LLaMA-3 0.7819 0.7819 0.7804 0.781 0.780
Longf 0.7183 0.7173 0.7171 0.716 0.739
SciBERT 0.8094 0.8093 0.8093 0.809 0.785
a) PACMAP with LLaMA-3, b) PACMAP with Doc2vec, c) PACMAP with Longformer, d) PACMAP with SciBERT
Figure 3: 2D embedding visualization on the S2ORC test set, extracted from a) LLaMA-3, b) Doc2vec, c) Longformer, and d) SciBERT; the results show the strong class separation achieved by LLaMA-3.