Antibiotic Resistance Gene Identification from Metagenomic Data Using
Ensemble of Finetuned Large Language Models
Syama K (https://orcid.org/0009-0000-6297-4407) and J. Angel Arul Jothi (https://orcid.org/0000-0002-1773-8779)
Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai, U.A.E.
Keywords:
Antibiotic Resistance Gene, Ensemble Learning, Transformer, LLM.
Abstract:
Antibiotic resistance is a serious challenge to global health, as it limits the effectiveness of antibiotics in humans. Antibiotic resistant genes (ARG) are primarily associated with acquired resistance, where bacteria gain resistance through horizontal gene transfer or mutation. Hence, the identification of ARGs is essential for the treatment of infections and for understanding resistance mechanisms. Though there are several methods for ARG identification, the majority of them are based on sequence alignment and hence fail to provide accurate results when the ARGs diverge from those in the reference ARG databases. Additionally, a significant fraction of proteins in public repositories remains unannotated. This work introduces ARG-LLM, a multi-task ensemble model of multiple large language models (LLMs) for ARG identification and antibiotic category prediction. We finetune three pre-trained protein language models, ProtBert, ProtAlbert, and Evolutionary Scale Modelling (ESM), with the ARG prediction data. The predictions of the finetuned models are combined using a majority vote ensembling approach to identify the ARG sequences. Then, another ProtBert model is finetuned for the antibiotic category prediction task. Experiments are conducted on the PLM-ARGDB dataset to establish the superiority of the proposed ARG-LLM. Results demonstrate that ARG-LLM outperforms other state-of-the-art methods with the best Recall of 96.2%, F1-score of 94.4%, and MCC of 90%.
1 INTRODUCTION
Antibiotics are one of the significant discoveries of
the 20th century, saving millions of lives from infec-
tious diseases. However, their widespread use and
misuse make pathogens increasingly resistant to an-
tibiotics. The World Health Organization (WHO) has
listed antibiotic resistance among the top 10 threats to
global health. Furthermore, according to WHO, an-
tibiotic resistance directly caused 1.27 million deaths
worldwide in 2019, and if no action is taken, this
number is predicted to increase to 10 million by
2050 (Murray et al., 2022; Lázár and Kishony, 2019).
Additionally, antibiotic resistance is spread between
pathogens by transferring antibiotic resistant genes
(ARG) through food, water, animals, and humans.
Therefore, identifying ARG in pathogens is signifi-
cant in stopping their spread, understanding the resis-
tance mechanism, and developing the targeted treat-
ment or control measures. Global efforts such as the
Global Antimicrobial Resistance Surveillance System
and the Global Antibiotic Research and Development
Partnership have been initiated for this. The pri-
mary focus of these consortium efforts is to develop
an efficient tool for identifying antibiotic resistance
(Mendelson M, 2015). Culture-based Antibiotic Sus-
ceptibility Tests (AST) are the standard practice in
clinical microbiology that determine the effectiveness
of antibiotics against specific bacteria. However, these tests take weeks to produce results and are not applicable to unculturable bacteria (Pham and Kim, 2012).
The emergence of high-throughput DNA sequenc-
ing techniques in metagenomics helped the develop-
ment of various tools to profile the DNA of pathogens
and increased the amount of DNA and protein se-
quences in public databases. For example, UniProt
(Consortium, 2015) is the largest collection of pro-
tein sequences available after merging it with pro-
teins translated from multiple metagenomic sequenc-
ing projects. This, in turn, encouraged researchers to
enhance the understanding of the functional diversity
of microbial communities significantly. This knowl-
edge helped identify ARGs in different pathogens
present in livestock manure, compost, wastewater
treatment plants, soil, water, and the human micro-
biome (Mao et al., 2015; Pehrsson et al., 2016). How-
ever, the main challenge faced by researchers is that a notable portion of proteins remains unannotated.
ARG identification methods are categorized into
sequence-based alignment or assembly and machine
learning (ML)-based. For alignment-based methods
(McArthur et al., 2013), the query ARG sequence
is compared against the existing ARG sequences in
the database using alignment tools such as BLAST
(Altschul et al., 1990), DIAMOND (Buchfink et al.,
2015), and BWA (Li and Durbin, 2009). Although
these methods are widely used in ARG identifica-
tion, they also have disadvantages. For example,
the sequence-based methods may miss novel genes
that are not present in the reference genome database
(Chowdhury et al., 2019), and their accuracy is highly dependent on critical hyperparameters such as the similarity threshold (Li et al.,
2021). Alternatively, multiple ML methods have been
developed for ARG identification tasks (Gibson et al.,
2015; Arango-Argoty et al., 2018a). ML-based meth-
ods depend on the features representing the character-
istics of ARGs (Ruppé et al., 2019) and learn the statistical patterns of ARGs. As a result, ML methods are able
to identify novel genes (Li et al., 2018). However, ML methods are typically trained on genetic features extracted from the ARG sequences of a particular organism of interest, which limits their generalizability. Deep learning (DL) meth-
ods are especially powerful due to their inherent capa-
bility to learn features, avoiding separate feature ex-
traction. In both ML and DL methods, researchers al-
ways try to improve and optimize classification mod-
els to achieve better accuracy. Ensemble learning is
a widely used technique to enhance classification ac-
curacy (Miah et al., 2024). It aggregates two or more
base classifiers to improve the predictive performance
of the combined classifier, and it overcomes the weak-
ness of a single weak base classifier.
Presently, to uncover the properties of the novel
ARGs, the ideas embedded in natural language pro-
cessing (NLP) are adopted into protein sequence pro-
cessing. Protein sequences are considered as sen-
tences in protein language, and then NLP techniques
are used to extract the features in the protein se-
quences. In particular, transformer-based large lan-
guage models (LLM) (Devlin et al., 2018) have
achieved state-of-the-art (SOTA) performance for
several NLP and protein language tasks (Bepler and
Berger, 2021). A few LLM-based models have been developed for ARG identification (Wu et al., 2023). These models have been widely
used as feature extractors, demonstrating significant
improvements in various tasks. However, finetuning
the pre-trained model further improves the model’s
predictive power. Finetuning involves training a pre-
trained model further on a specific task or dataset
to enhance its performance for that task. Since the
model is already pre-trained on a large dataset, fine-
tuning requires significantly less time and computa-
tional resources. Hence, an ARG prediction tool that
harnesses the power of LLM-based models is in high
demand.
In this work, a multi-task ensemble model, ARG-LLM, is proposed to predict whether a sequence is an ARG and to further identify which antibiotic family it confers resistance to. It harnesses the capabilities of three publicly available pre-trained transformer-based LLMs, namely ProtBert (Elnaggar et al., 2021), ProtAlbert (Elnaggar et al., 2021), and Evolutionary Scale Modelling (ESM) (Rao et al., 2021). In the first task, the three
LLMs are finetuned with the ARG prediction dataset.
The prediction output of each of the language models
is passed through a majority-voting ensemble method.
In the second task, the ProtBert model is finetuned
with the Antibiotic category prediction dataset, and
those sequences predicted as ARGs in the first task
are further passed through the fine-tuned model for
the prediction of antibiotic categories.
This paper is organized as follows. Section 2 re-
views previous works done in ARG prediction and
Antibiotic category prediction tasks. Details of the
dataset used in this work are explained in Section 3.
Section 4 presents the methodology. The experiments
and the evaluation metrics are provided in Section 5.
The results and discussion are presented in Section
6. Section 7 provides the conclusion and the future
work.
2 RELATED WORKS
Antibiotic resistance is a serious global threat to hu-
man health that urgently requires practical action.
Identifying antibiotic resistant genes is a crucial step
in understanding the mechanism of antibiotic resis-
tance. This section provides an overview of related work in the ARG identification field, emphasizing ML and DL methods.
The traditional computational methods devel-
oped for ARG identification are all sequence-
based. Hence, they are designed to identify specific
pathogens’ ARGs. For instance, ResFinder (Klein-
heinz et al., 2014) predicts specifically plasmid-borne
ARGs and the tool developed in (Bradley et al., 2015)
is dedicated to 12 types of antimicrobials. Simi-
larly, another study (Davis et al., 2016) is limited to
identifying ARGs encoding resistance to carbapenem,
methicillin, and beta-lactam antibiotics. Most of
these tools identify the query sequence’s similarity
with the sequences in the existing microbial resis-
tance databases, using a "best hit" approach to pre-
dict whether a sequence is an ARG. These methods
require a cutoff threshold to identify the similarity be-
tween the sequences. This restricts those models from
identifying novel ARGs (McArthur and Tsang, 2017).
To overcome the disadvantages of the previous meth-
ods, many ML and DL-based methods have been in-
troduced.
The work by Arango et al. (Arango-Argoty et al.,
2018b) introduced DeepARG, a novel DL approach
for predicting ARGs from metagenomic data. It con-
tained two components: DeepARG-SS for classifying
short reads and DeepARG-LS for annotating novel
ARG genes. It used a Deep Neural Network (DNN)
architecture for predicting ARGs from metagenomic
data, and a bitscore-based dissimilarity index was
used as the feature for the DL model. The DeepARG-
SS model, trained on short sequence reads, achieved
an overall precision of 0.97 and recall of 0.91 for the
30 antibiotic categories tested.
The HMD-ARG model in (Li et al., 2021) con-
sisted of three hierarchically connected DL models
that predict ARG properties by focusing on antibiotic
resistance type, mechanism, and gene mobility. Con-
volutional Neural Network (CNN) models were used
at each level. At the first level, the sequences were
classified into ARG or not. The ARG sequences were
classified in the second level based on the resistant
antibiotic family, resistant mechanism, and gene mo-
bility information. In the final level, if the predicted
antibiotic family was beta-lactamase, the framework
further predicted the subclass of beta-lactamase. The
framework could identify ARGs without querying ex-
isting databases. The HMD-ARG model achieved
an Accuracy of 0.948, Precision of 0.939, Recall of
0.951, and F1 of 0.938.
Another work named ARG-SHINE by (Wang
et al., 2021) introduced a novel ARG prediction
framework by integrating sequence homology and
functional information with DL techniques. It used
CNN for the classification, and the integration of sequence homology, functional information, and DL improved antibiotic resistance prediction accuracy. The
ARG-SHINE model achieved an Accuracy of 0.8557
and an F1 of 0.8595.
A recent work named PLM-ARG proposed by
(Wu et al., 2023) introduced a novel method for ARG
identification using a pre-trained protein language
model, ESM-1b. It harnessed the power of ESM-
1b to generate embedding for protein sequences and
utilized the Extreme Gradient Boosting (XGBoost)
ML model to classify the antibiotic category. The
study provided insights into applying Artificial Intel-
ligence (AI)-powered language models for ARG iden-
tification. The PLM-ARG model achieved an Accu-
racy of 0.912, Precision of 1, Recall of 0.825, F1 of
0.904, and Mathews Correlation Coefficient (MCC)
of 0.838.
The literature review shows that the efficacy of
transformer-based NLP models is less utilized in the
ARG identification task. Researchers have identified
that finetuning the transformer-based models gives an
exceptional performance in NLP tasks (Devlin et al.,
2018). However, finetuning the transformer-based
models for ARG prediction with the ARG dataset has
yet to be explored. Hence, in this work, we finetune
the protein language models and use the finetuned
model for classification. Additionally, we ensemble the predictions of the finetuned models to identify ARG sequences.
3 DATASET
We collected antibiotic resistance gene sequences
from the published ARG database PLM-ARGDB
(Wu et al., 2023). It contains 57158 gene sequences,
28579 of which are labeled as ARG and 28579 of
which are labeled as non-ARGs. The sequences
which are labeled as ARG are further labeled with
their antibiotic category. The 26391 ARGs in the
28579 ARG sequences are labeled with 22 explicit
resistance categories, and 2188 ARGs are tagged
with a general category “multi-drug” or “antibiotic
without defined classification." PLM-ARGDB is constructed by extracting ARG sequences from six publicly available ARG databases: 4790 from CARD
(Jia et al., 2016), 859 from ResFinder (Zankari et al.,
2012), 2044 from MEGARes (Lakin et al., 2017), 444
from AMRFinderPlus (Feldgarden et al., 2019), 9863
from ARGMiner, and 10579 from HMD-ARG-DB
(Li et al., 2021). The non-ARG sequences are taken
from the UniProt database.
4 PROPOSED METHODOLOGY
In this work, we introduce a novel multi-task en-
semble framework, ARG-LLM, which automatically
identifies the ARGs and the categories of antibi-
otics to which the pathogen is resistant. Figure 1
presents the overall methodology of this work. ARG-
LLM performs two tasks: one is the ARG predic-
tion task, and the other is the Antibiotic category
prediction task. ARG-LLM starts with preprocess-
ing the dataset and preparing the data for subsequent
finetuning and prediction.

Figure 1: Overview of the proposed methodology. The ARG prediction task finetunes ProtBert, ProtAlbert, and ESM (esm2_t12_35M_UR50D) on the ARG prediction dataset and combines the predictions of the resulting ProtBert_bin, ProtAlbert_bin, and ESM_bin models by majority voting; sequences predicted as ARG are then passed to ProtBert_Cat, a ProtBert model finetuned on the Antibiotic category prediction dataset, which assigns the antibiotic categories the pathogen is resistant to (21 categories, e.g., aminoglycoside, beta-lactam, tetracycline, and others).

The ARG prediction task
finetunes the pre-trained base LLM models, such as
ProtBert, ProtAlbert, and ESM. Then, these finetuned
models (ProtBert_bin, ProtAlbert_bin, ESM_bin) are
used as base classifiers for the majority voting classi-
fier to predict whether the given sequence is ARG or
not. The Antibiotic category prediction task finetunes
the ProtBert model and then predicts the categories of
antibiotics for those sequences that are predicted as
ARG during the ARG prediction task. The following
subsections explain each step in detail.
4.1 Preprocessing
In this step, the sequences are read from the database, which is provided as a FASTA file. The ARG sequences labeled "multi-drug" or "antibiotic without defined classification" are relabeled as "others". Thus, each sequence has two types of labels: an ARG label and one of 21 antibiotic category labels. The ARG label is given as 0 or 1,
where 0 represents non-ARG and 1 represents ARG.
The antibiotic category labels are present for only
those sequences with ARG equal to 1. The antibiotic
category labels are transformed into a binary matrix
format using sklearn MultiLabelBinarizer(). Then,
separate train and validation sets are formed, one for
the binary (ARG) prediction and the other for the mul-
tilabel (Antibiotic category) prediction. Hence, in this
work, we refer to the ARG prediction dataset as the
training and validation datasets used for ARG pre-
diction. These datasets contain only the protein se-
quences and their ARG labels. Furthermore, these
datasets are used to finetune the three base LLMs.
Similarly, we refer to the Antibiotic category predic-
tion dataset as the train and the validation datasets
used for Antibiotic category prediction. This dataset
contains the protein sequences and their Antibiotic
category labels, which are used to finetune the Prot-
Bert model for Antibiotic category prediction.
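As an illustration only, a minimal preprocessing sketch is given below. The FASTA file name, the header layout, and the use of Biopython are assumptions made for the example and are not taken from the paper; only the relabeling of the two general categories to "others" and the use of scikit-learn's MultiLabelBinarizer follow the description above.

```python
# Minimal preprocessing sketch (hypothetical file name and header layout).
# Assumes each FASTA header looks like ">seq_id|ARG|tetracycline;macrolide"
# or ">seq_id|nonARG|"; the real PLM-ARGDB layout may differ.
from Bio import SeqIO
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

GENERAL_LABELS = {"multi-drug", "antibiotic without defined classification"}

sequences, arg_labels, category_labels = [], [], []
for record in SeqIO.parse("plm_argdb.fasta", "fasta"):
    seq_id, arg_flag, categories = record.description.split("|")
    sequences.append(str(record.seq))
    arg_labels.append(1 if arg_flag == "ARG" else 0)
    # Map the two general categories to "others", as described above.
    cats = {c if c not in GENERAL_LABELS else "others"
            for c in categories.split(";") if c}
    category_labels.append(sorted(cats))

# Binary matrix of antibiotic categories for the ARG sequences only.
mlb = MultiLabelBinarizer()
arg_only_cats = [c for c, a in zip(category_labels, arg_labels) if a == 1]
category_matrix = mlb.fit_transform(arg_only_cats)

# Separate train/validation splits are then formed for the two tasks.
bin_train, bin_val = train_test_split(
    list(zip(sequences, arg_labels)), test_size=0.1, random_state=42)
```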
4.2 Architecture of ARG-LLM
The two tasks of ARG-LLM are explained in the fol-
lowing subsections.
4.2.1 ARG Prediction
The ARG prediction task includes finetuning the base LLMs with the ARG prediction dataset and combining the predictions of the finetuned models using ensemble prediction.
a) Finetuning the LLMs:
This task utilized three transformer-based LLMs.
The transformer model was introduced in 2017
by Vaswani et al. (Vaswani et al., 2017). It is a
neural network model that understands the context
of the input sequence. Usually, the transformer
has an encoder-decoder architecture. However, the
pre-trained models used in this study are based
on Encoder-only Transformer (EOT) architecture
because they focus on generating embeddings for the
protein sequences.

Figure 2: Architecture of a single layer of the transformer encoder (tokenization and encoding of the input protein sequence, multi-head attention over query, key, and value with add & norm, and a feed-forward module with add & norm).

EOT understands the features
and patterns in the input sequence and generates a
representation for the input. The encoder is a stack of
multiple layers. The encoder takes the input protein
sequences composed of amino acids, passes them
through a series of operations, and generates the
abstract representation that encapsulates the learned
information from the entire sequence.
Figure 2 shows a single encoder layer in the trans-
former. It comprises three modules: the tokeniza-
tion and encoding module, self-attention module, and
feed-forward module. Tokenization aims to tokenize
each amino acid (word) in the protein sequence (sen-
tence). Then, the encoding step converts each token
to a vector. In order to provide information about the
position of a token in the sequence, positional encod-
ing is then added to the vector of each token. Since
transformers lack an inherent sense of sequence or-
der, positional encoding is necessary to add informa-
tion about the order of tokens in each sequence. All
the pre-trained models used in this study use absolute
positional encoding (Vaswani et al., 2017). Absolute
positional encoding uses sine and cosine functions to
generate a unique vector for each token’s fixed posi-
tion in the sequence. These vectors are added to the
input representations of amino acids before being fed
into the transformer layers. The positional encoding
for each position pos is calculated as follows.

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)    (1)

where i is the dimension index of the positional encoding and d_{model} is the dimensionality of the encoded input.
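For illustration, a minimal PyTorch sketch of the sinusoidal positional encoding of Equation 1 is shown below; the tensor shapes and the assumption of an even d_model are choices made for the example, not details from the paper.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix following Equation 1 (d_model assumed even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                 # even dimensions
    div_term = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(position / div_term)   # PE(pos, 2i + 1)
    return pe

# Example: add positional information to token embeddings of shape (seq_len, d_model).
# embeddings = embeddings + sinusoidal_positional_encoding(embeddings.size(0), embeddings.size(1))
```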
The self-attention module consists of self-
attention and layer normalization. Self-attention
uses a multi-head attention mechanism to relate each
amino acid in the input protein sequence with other
amino acids. The encoded vector of each amino acid
(token) is then fed to three parameters: Query (Q),
Key (K), and Value (V). Q is a vector representing the
token for which the attention scores are calculated. K
are vectors associated with each token in the sequence
and are used to compare against the Q vector to com-
pute a score. V are vectors the same as K but are used
to calculate the final representation of the word after
the attention mechanism is applied. In a multi-head
attention mechanism with h heads, the Q, K, and V
are linearly projected, and h versions of Q, K, and V
are obtained as follows.
Q_i = X W_i^Q; \quad K_i = X W_i^K; \quad V_i = X W_i^V    (2)

where i = 1, 2, ..., h; W_i^Q, W_i^K, and W_i^V are the learned projection matrices for head i, and X is the matrix of input token embeddings.
Each attention head i performs a scaled dot prod-
uct attention as follows.
\text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i    (3)

where d_k is the dimension of the Key vectors.
After computing the attention from all the heads,
the attention vectors are concatenated and trans-
formed using a linear transformation as given below.
\text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h) W^O    (4)

where \text{head}_i = \text{Attention}(Q_i, K_i, V_i), and W^O is the learned weight matrix of the linear transformation.
By computing attention scores across multiple
heads and combining the results, the transformer
model can better understand the context and depen-
dencies within the data. The output of the multi-head
attention is added to the input using the residual con-
nection, and the sum is passed to the layer normaliza-
tion operation.
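For illustration, a minimal PyTorch sketch of Equations 2 to 4 follows; the model dimension and head count are placeholder values rather than the configurations of the pre-trained models used in this work, and a production implementation would normally rely on torch.nn.MultiheadAttention.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention following Equations 2-4."""

    def __init__(self, d_model: int = 1024, num_heads: int = 16):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q for all heads
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        # Project and split into heads: (batch, heads, seq_len, d_k).
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Scaled dot-product attention (Equation 3).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn = F.softmax(scores, dim=-1)
        heads = torch.matmul(attn, v)
        # Concatenate heads and apply the output projection (Equation 4).
        concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(concat)
```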
The output of the self-attention module is passed
to the feed-forward module. The feed-forward mod-
ule consists of a fully connected feed-forward net-
work containing two linear transformations with a
ReLU activation in between. Equation 5 shows the
feed-forward network operation performed on input
x.
\text{FFN}(x) = \text{ReLU}(W_1 x + b_1) W_2 + b_2    (5)

where W_1 and W_2 are the learned weight matrices, and b_1 and b_2 are the biases.
The output of the feed-forward module is added
to its input using a residual connection, followed by
layer normalization. These operations are performed
in each of the layers of the encoder. The transformer
encoder can have N such layers. The output of the
final encoder layer is the abstract representation of the
input sequence with a rich contextual understanding.
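Building on this description, a self-contained sketch of the feed-forward module of Equation 5 inside a single encoder layer, with the residual connections and layer normalization, is given below; the dimensions and dropout rate are placeholders, and PyTorch's built-in nn.MultiheadAttention stands in for the attention mechanism.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward, each with residual + layer norm."""

    def __init__(self, d_model: int = 1024, num_heads: int = 16,
                 d_ff: int = 4096, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Feed-forward network of Equation 5: two linear layers with ReLU in between.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)                  # self-attention: Q = K = V = x
        x = self.norm1(x + self.dropout(attn_out))        # residual connection + layer norm
        return self.norm2(x + self.dropout(self.ffn(x)))  # same pattern for the FFN

# Example: a stack of N such layers processes a batch of embedded sequences.
# encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
# out = encoder(torch.randn(2, 100, 1024))   # (batch, seq_len, d_model)
```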
After the success of transformers in many NLP
tasks, Devlin et al. introduced a bidirectional Encoder
Only transformer called Bidirectional Encoder Rep-
resentations from Transformers (BERT) for text pro-
cessing in 2018 (Devlin et al., 2018). BERT differs
from traditional transformer models by using a bidi-
rectional approach, meaning it considers the context
from both the left and right sides of a sequence. BERT
is pre-trained on a large corpus of text using two unsu-
pervised tasks: Masked Language Modeling (MLM)
(Taylor, 1953) and Next Sentence Prediction (NSP).
BERT can be adapted to various NLP tasks by adding
a simple output layer. The models used in this work,
such as ProtBert, ProtAlbert, and ESM, are based on
the BERT architecture.
ProtBert: It is a protein-specific variant of BERT (https://github.com/google-research/bert) developed by training a BERT model on 393 billion amino acids from the UniRef (Suzek et al., 2015) and BFD (Steinegger and Söding, 2018) databases. It is trained with the MLM objective
in a self-supervised manner. The number of layers
of ProtBert was increased to 30 compared to BERT,
which had 24 layers.
ProtAlbert: It is a protein-specific variant of the A Lite BERT (ALBERT, https://github.com/google-research/albert) model, developed by pretraining the ALBERT model on the UniRef100 (Suzek et al., 2015) dataset. ALBERT models use parameter sharing
across layers, which reduces the total number of pa-
rameters while maintaining a similar model depth,
making it a Lite version of BERT.
ESM: It is a transformer-based model designed ex-
plicitly for protein sequence analysis and was devel-
oped by Meta AI (formerly Facebook AI Research).
ESM is trained with UniRef50 (Suzek et al., 2015), a
massive dataset of 250 million protein sequences en-
compassing 86 billion amino acids. The model uti-
lizes unsupervised learning to learn representations
that capture biological properties and evolutionary di-
versity from sequence data. It comes in different vari-
ants based on the number of parameters and layers. In
this work, we used "esm2_t12_35M_UR50D", which
refers to a specific variant or configuration of the ESM
model.
To finetune the pre-trained LLMs for the ARG
prediction task, the model is modified by adding a
classification head on top of the model architecture.

Figure 3: Architecture of the classification head.

Figure 3 presents the architecture of the classification head. It includes two fully connected layers with a
ReLU activation function. Each fully connected layer
is followed by a dropout layer. Finally, a softmax ac-
tivation function is used for the classification.
The model is then trained with the ARG prediction dataset. During training, only the weights
of the last k layers of the pre-trained model and the
newly added classification head are updated based on
the loss calculated from the classification task. The
loss function used here is the Binary Cross Entropy
(BCE) loss. After finetuning, the finetuned models
are called ProtBert_bin, ProtAlbert_bin, and ESM_bin, respectively.
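A minimal finetuning sketch is shown below. The Hugging Face checkpoint name (Rostlab/prot_bert), the dropout rate, the use of CrossEntropyLoss over two classes in place of the softmax-plus-BCE formulation described above, and the omission of a full training loop are assumptions and simplifications made for the example; only the head sizes (512 and 128) and the idea of unfreezing the last k encoder layers follow the description and Table 1.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ARGClassifier(nn.Module):
    """Pre-trained protein LLM with a classification head like the one in Figure 3."""

    def __init__(self, checkpoint: str = "Rostlab/prot_bert", unfrozen_layers: int = 8):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(checkpoint)
        hidden = self.backbone.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(128, 2),  # softmax applied implicitly via the loss / at prediction time
        )
        # Freeze everything, then unfreeze only the last k encoder layers.
        for p in self.backbone.parameters():
            p.requires_grad = False
        for layer in self.backbone.encoder.layer[-unfrozen_layers:]:
            for p in layer.parameters():
                p.requires_grad = True

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # sequence-level embedding from the last hidden layer
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
model = ARGClassifier()
# ProtBert-style tokenizers expect residues separated by spaces, e.g. "M K V L ...".
batch = tokenizer(["M K V L T A"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))  # stand-in for the binary loss
```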
b) ARG Prediction using Ensemble of Finetuned
LLMs:
The finetuned models are used for prediction with
the test data. The predictions furnished by the
models mentioned above are combined through a
process known as majority voting (Dietterich, 2000).
This entails tallying the occurrences of ARG and
non-ARG labels. The final prediction is obtained
depending on the votes achieved by each label. For
a given protein sequence x, each base classifier h_i produces a predicted class label h_i(x); the majority voting method is then performed as follows.

\hat{y} = \arg\max_{c \in C} \sum_{i=1}^{N} \mathbb{1}(h_i(x) = c)    (6)

where C = {ARG, non-ARG} is the set of possible class labels, N = 3 is the number of base classifiers, \mathbb{1} is the indicator function, which returns 1 if its argument is true, \arg\max selects the class with the maximum number of votes, and \hat{y} is the final predicted class label.
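A minimal sketch of Equation 6 is given below, assuming each finetuned model exposes a prediction function returning 0 (non-ARG) or 1 (ARG); the function names in the usage comment are hypothetical.

```python
from collections import Counter

def majority_vote(predictions: list[int]) -> int:
    """Equation 6: return the class label with the most votes among the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical usage with the three finetuned base classifiers:
# votes = [protbert_bin_predict(x), protalbert_bin_predict(x), esm_bin_predict(x)]
# y_hat = majority_vote(votes)   # 1 = ARG, 0 = non-ARG
```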
4.2.2 Antibiotic Category Prediction
In this task, a ProtBert model with a classification
head for predicting the antibiotic categories of the
ARG sequences is finetuned with the Antibiotic cate-
gory prediction dataset. The finetuned model is called
ProtBert_cat. Then, ProtBert_cat is used to predict
the antibiotic categories of those sequences which are
predicted as ARG by the ensemble model.
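As an illustration, a sketch of how ProtBert_cat could score the ARG-flagged sequences is shown below; the sigmoid scoring and the 0.5 decision threshold are assumptions made for the multilabel setting, and the model, tokenizer, and fitted MultiLabelBinarizer objects are assumed to come from a finetuning step analogous to the one sketched earlier.

```python
import torch

# Hypothetical: protbert_cat is a finetuned multilabel model whose head has one
# output per antibiotic category, and mlb is the fitted MultiLabelBinarizer.
def predict_categories(sequence: str, protbert_cat, tokenizer, mlb, threshold: float = 0.5):
    spaced = " ".join(sequence)                      # ProtBert-style tokenization
    batch = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        logits = protbert_cat(batch["input_ids"], batch["attention_mask"])
    probs = torch.sigmoid(logits).squeeze(0)         # independent per-category scores
    chosen = (probs >= threshold).nonzero(as_tuple=True)[0].tolist()
    return [mlb.classes_[i] for i in chosen]
```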
5 EXPERIMENTAL SETUP AND
EVALUATION METRICS
5.1 Experimental Setup
The proposed framework is written in Python 3, and
the libraries used are scikit-learn version 1.0.2 and PyTorch version 1.13. All the experiments are executed
on an ml.g5.xlarge instance type in Amazon Sage-
Maker, equipped with an NVIDIA A10G Tensor Core
GPU and 24 GB dedicated memory. Table 1 presents
the parameters used by each LLM.
5.2 Evaluation Metrics
The performance of the proposed model is evaluated
using the following metrics: accuracy, precision, recall, F1-score (F1), and Matthews Correlation Coefficient (MCC).
Let TP, TN, FP, and FN be the number of true pos-
itives, true negatives, false positives, and false neg-
atives, respectively. For the multilabel antibiotic category prediction task, the performance is computed using micro-averages of each metric. Each metric is calculated as shown in Equation 7.
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

\text{Recall} = \frac{TP}{TP + FN}

\text{Precision} = \frac{TP}{TP + FP}

\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}    (7)
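These metrics can be computed with scikit-learn; a minimal sketch for the binary ARG task and the micro-averaged multilabel category task is shown below, with placeholder label arrays.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

# Binary ARG prediction (placeholder labels); MCC applies to this binary task.
y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 1, 0, 0]
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred),
      matthews_corrcoef(y_true, y_pred))

# Multilabel category prediction: micro-averaged scores over the binary label matrix.
Y_true = [[1, 0, 0], [0, 1, 1]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
print(precision_score(Y_true, Y_pred, average="micro"),
      recall_score(Y_true, Y_pred, average="micro"),
      f1_score(Y_true, Y_pred, average="micro"))
```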
6 RESULTS AND DISCUSSIONS
This section presents and discusses the results of the
proposed ensemble framework obtained on the test
dataset.
Table 1: The parameters and configurations used by each model.

Parameters                                        ProtBert   ProtAlbert   ESM
Number of layers                                  30         12           12
Embedding size                                    1024       4096         4096
Number of parameters                              420 M      224 M        35 M
Learning rate                                     0.0005     0.0005       0.0005
Optimizer                                         Adam       Adam         Adam
Batch size                                        1          1            1
1st dense layer size in the classification head   512        512          512
2nd dense layer size in the classification head   128        128          128
Number of unfrozen layers                         8          5            8
Loss function                                     BCE        BCE          BCE
Table 2: Comparison results of individual finetuned LLMs and ARG-LLM on ARG and Antibiotic category prediction (best results are highlighted in bold).

                        ARG Prediction                                Category Prediction
            Accuracy  Precision  Recall   F1      MCC      Accuracy  Precision  Recall  F1      MCC
ProtBert    0.9827    0.9689     0.9753   0.9721  0.9712   0.9168    0.9324     0.9289  0.9306  0.8754
ProtAlbert  0.9752    0.9638     0.9747   0.9692  0.9624   0.9175    0.9461     0.9293  0.9376  0.8854
ESM         0.9832    0.9723     0.9859   0.9791  0.9763   0.9281    0.9085     0.9612  0.9343  0.8967
ARG-LLM     0.9931    1          0.9859   0.9929  0.9862   0.9232    0.9261     0.9616  0.9435  0.9001
Table 3: Comparison with pre-trained LLM models as embedding generators and XGBoost as classifier for ARG and Antibiotic category prediction (best results are highlighted in bold).

                        ARG Prediction                                Category Prediction
            Accuracy  Precision  Recall   F1      MCC      Accuracy  Precision  Recall  F1      MCC
ProtBert    0.9562    0.9678     0.9587   0.9632  0.9215   0.9108    0.9075     0.9151  0.9113  0.8941
ProtAlbert  0.9487    0.9758     0.9475   0.9614  0.9245   0.9012    0.9161     0.9327  0.9243  0.8995
ESM         0.9923    0.9954     0.9852   0.9902  0.9758   0.9174    0.9167     0.9296  0.9231  0.789
ARG-LLM     0.9931    1          0.9859   0.9929  0.9862   0.9232    0.9261     0.9616  0.9435  0.9001
Figure 4: Comparison of the performance of ARG-LLM with state-of-the-art approaches on Antibiotic category prediction.
6.1 Comparison with Individual
Finetuned LLMs
Experiments are conducted using the individual finetuned models ProtBert_bin, ProtAlbert_bin, and ESM_bin separately for ARG prediction and the ProtBert_cat model for category prediction. The results
achieved by each individual finetuned model are com-
pared with those of ARG-LLM. Table 2 presents
the comparison results. The results show that the
ARG-LLM invariably delivered competitive results
across multiple metrics, highlighting its ability to pre-
dict ARG and its categories. The proposed model
achieved the best accuracy of 0.9931, Precision of 1,
Recall of 0.9859, F1 of 0.9929, and MCC of 0.9862
for the ARG prediction task. For the Antibiotic cat-
egory prediction task, the ESM model achieves the
best accuracy with a value of 0.9281, and the Pro-
tAlbert model achieves the best Precision of 0.9461.
ARG-LLM achieves the best Recall, F1, and MCC
with values 0.9616, 0.9435, and 0.9001, respectively.
Additionally, among the three transformer models, the ESM model stands out, as it consistently shows strong prediction capability.
6.2 Comparison with Pre-Trained
LLMs as Embedding Generators
In this experiment, the pre-trained ProtBert, Pro-
tAlbert, and ESM models are used to generate em-
beddings for the sequences in the dataset. Then,
the embeddings are provided as input for a subse-
quently trained XGBoost model for ARG prediction.
A trained multilabel XGBoost classifier is used to pre-
dict the antibiotic categories. The comparison results
are presented in Table 3. From Table 3, it is evident
that the ARG-LLM achieves the best performance on
the ARG prediction task with an Accuracy of 0.9931,
Precision of 1, Recall of 0.9859, F1 of 0.9929, and
MCC of 0.9862. Similarly, ARG-LLM performs best on the antibiotic category prediction task, with an Accuracy of 0.9232, Precision of 0.9262, Recall of 0.9616, F1 of 0.9435, and MCC of 0.9001.
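For completeness, a sketch of the embedding-plus-XGBoost baseline used in this comparison is shown below; the mean-pooled sequence embedding, the checkpoint name, and the XGBoost settings are illustrative assumptions rather than the exact configuration used in these experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
backbone = AutoModel.from_pretrained("Rostlab/prot_bert")

def embed(sequence: str) -> list[float]:
    """Mean-pooled last-hidden-state embedding of a protein sequence."""
    batch = tokenizer(" ".join(sequence), return_tensors="pt")
    with torch.no_grad():
        out = backbone(**batch)
    return out.last_hidden_state.mean(dim=1).squeeze(0).tolist()

# Placeholder training data: raw sequences and binary ARG labels.
X = [embed(s) for s in ["MKVLTA", "MASQKD"]]
y = [1, 0]
clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
clf.fit(X, y)
print(clf.predict([embed("MKVLTA")]))
```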
6.3 Comparison with SOTA Methods
Figure 4 compares ARG-LLM results with SOTA
methods like DeepARG (Arango-Argoty et al., 2018b),
HMD-ARG (Li et al., 2021), and PLM-ARG (Wu
et al., 2023) for Antibiotic category prediction. The
referenced studies have not provided results for ARG prediction, so a comparison on ARG prediction is not given in this section. Also, the HMD-ARG study did not report the MCC value in its paper; thus, it is not included in our comparison. From the available
results, it can be observed that the highest accuracy
of 0.948 is achieved by the HMD-ARG model, and
the PLM-ARG model achieves the highest precision
of 1. However, ARG-LLM achieves the best Recall,
F1, and MCC of 0.962, 0.944, and 0.900, respectively.
Additionally, ARG-LLM achieves the 2nd highest ac-
curacy of 0.923. Overall, if F1 is taken as a metric,
ARG-LLM outperforms other SOTA methods.
Overall, this work aimed to predict ARG and
antibiotic resistance categories using an ensemble
of finetuned transformer-based LLMs. The exper-
imental results reveal promising performance gains
achieved by the ARG-LLM framework. The results
from Table 3 show that finetuning the pre-trained
LLMs improves their performance in classifying the
ARG sequences into their antibiotic categories. Fine-
tuning helps the model to adapt to the specific charac-
teristics of a new, smaller dataset relevant to the target
task. Similarly, from Table 2, it is clear that ensem-
bling the three LLMs led to a significant improvement
in performance.
7 CONCLUSION
We propose a multi-task ensemble model of finetuned LLMs to predict ARGs and to further identify which antibiotic family a sequence confers resistance to. The experimental results confirm the reli-
ability of the proposed model in identifying ARGs.
The comparison results show that finetuning a pre-
trained model with a task-specific dataset improves
the model’s performance. Additionally, ensemble
prediction with the fine-tuned LLMs further enhanced
the performance of the proposed model. The out-
comes of this experimentation have powerful impli-
cations for researchers and practitioners engaged in
ARG identification tasks. The proposed model can be
a powerful tool to alleviate the global threat of antibi-
otic resistance. In the future, the ARG structural in-
formation can be incorporated with the sequence fea-
tures to improve the performance of the model.
REFERENCES
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and
Lipman, D. J. (1990). Basic local alignment search
tool. Journal of molecular biology, 215(3):403–410.
Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S.,
Vikesland, P., and Zhang, L. (2018a). Deeparg: a deep
learning approach for predicting antibiotic resistance
genes from metagenomic data. Microbiome, 6:1–15.
Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S.,
Vikesland, P., and Zhang, L. (2018b). Deeparg: a deep
learning approach for predicting antibiotic resistance
genes from metagenomic data. Microbiome, 6:1–15.
Bepler, T. and Berger, B. (2021). Learning the protein lan-
guage: Evolution, structure, and function. Cell sys-
tems, 12(6):654–669.
Bradley, P., Gordon, N. C., Walker, T. M., Dunn, L., Heys,
S., Huang, B., Earle, S., Pankhurst, L. J., Anson,
L., De Cesare, M., et al. (2015). Rapid antibiotic-
resistance predictions from genome sequence data for
staphylococcus aureus and mycobacterium tuberculo-
sis. Nature communications, 6(1):10063.
Buchfink, B., Xie, C., and Huson, D. H. (2015). Fast and
sensitive protein alignment using diamond. Nature
methods, 12(1):59–60.
Chowdhury, A. S., Call, D. R., and Broschat, S. L. (2019).
Antimicrobial resistance prediction for gram-negative
bacteria via game theory-based feature evaluation.
Scientific reports, 9(1):14487.
Consortium, U. (2015). Uniprot: a hub for protein informa-
tion. Nucleic acids research, 43(D1):D204–D212.
Davis, J. J., Boisvert, S., Brettin, T., Kenyon, R. W., Mao,
C., Olson, R., Overbeek, R., Santerre, J., Shukla, M.,
Wattam, A. R., et al. (2016). Antimicrobial resis-
tance prediction in patric and rast. Scientific reports,
6(1):27930.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding.
Dietterich, T. G. (2000). Ensemble methods in machine
learning. In International workshop on multiple clas-
sifier systems, pages 1–15. Springer.
Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G.,
Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C.,
Steinegger, M., Bhowmik, D., and Rost, B. (2021).
Prottrans: Towards cracking the language of life’s
code through self-supervised deep learning and high
performance computing.
Feldgarden, M., Brover, V., Haft, D. H., Prasad, A. B.,
Slotta, D. J., Tolstoy, I., Tyson, G. H., Zhao, S., Hsu,
C.-H., McDermott, P. F., et al. (2019). Validating the
amrfinder tool and resistance gene database by using
antimicrobial resistance genotype-phenotype correla-
tions in a collection of isolates. Antimicrobial agents
and chemotherapy, 63(11):10–1128.
Gibson, M. K., Forsberg, K. J., and Dantas, G. (2015).
Improved annotation of antibiotic resistance determi-
nants reveals microbial resistomes cluster by ecology.
The ISME journal, 9(1):207–216.
Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo,
P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira,
S., Sharma, A. N., et al. (2016). Card 2017: expan-
sion and model-centric curation of the comprehensive
antibiotic resistance database. Nucleic acids research,
page gkw1004.
Kleinheinz, K. A., Joensen, K. G., and Larsen, M. V.
(2014). Applying the resfinder and virulencefinder
web-services for easy identification of acquired antibi-
otic resistance and e. coli virulence genes in bacterio-
phage and prophage nucleotide sequences. Bacterio-
phage, 4(2):e27943.
Lakin, S. M., Dean, C., Noyes, N. R., Dettenwanger, A.,
Ross, A. S., Doster, E., Rovira, P., Abdo, Z., Jones,
K. L., Ruiz, J., et al. (2017). Megares: an antimicro-
bial resistance database for high throughput sequenc-
ing. Nucleic acids research, 45(D1):D574–D580.
Lázár, V. and Kishony, R. (2019). Transient antibiotic resistance calls for attention. Nature microbiology, 4(10):1606–1607.
Li, H. and Durbin, R. (2009). Fast and accurate short read
alignment with burrows–wheeler transform. bioinfor-
matics, 25(14):1754–1760.
Li, Y., Wang, S., Umarov, R., Xie, B., Fan, M., Li, L., and
Gao, X. (2018). Deepre: sequence-based enzyme ec
number prediction by deep learning. Bioinformatics,
34(5):760–769.
Li, Y., Xu, Z., Han, W., Cao, H., Umarov, R., Yan, A.,
Fan, M., Chen, H., Duarte, C. M., Li, L., et al. (2021).
Hmd-arg: hierarchical multi-task deep learning for an-
notating antibiotic resistance genes. Microbiome, 9:1–
12.
Mao, D., Yu, S., Rysz, M., Luo, Y., Yang, F., Li, F., Hou,
J., Mu, Q., and Alvarez, P. (2015). Prevalence and
proliferation of antibiotic resistance genes in two mu-
nicipal wastewater treatment plants. Water research,
85:458–466.
McArthur, A. G. and Tsang, K. K. (2017). Antimicrobial
resistance surveillance in the genomic age. Annals of
the New York Academy of Sciences, 1388(1):78–91.
McArthur, A. G., Waglechner, N., Nizam, F., Yan, A., Azad,
M. A., Baylay, A. J., Bhullar, K., Canova, M. J.,
De Pascale, G., Ejim, L., et al. (2013). The compre-
hensive antibiotic resistance database. Antimicrobial
agents and chemotherapy, 57(7):3348–3357.
Mendelson, M. and Matsoso, M. P. (2015). The world health organization global action plan for antimicrobial resistance. S Afr Med J, 105(5):325.
Miah, M. S. U., Kabir, M. M., Sarwar, T. B., Safran, M.,
Alfarhood, S., and Mridha, M. (2024). A multimodal
approach to cross-lingual sentiment analysis with en-
semble of transformer and llm. Scientific Reports,
14(1):9603.
Murray, C. J., Ikuta, K. S., Sharara, F., Swetschinski, L.,
Aguilar, G. R., Gray, A., Han, C., Bisignano, C., Rao,
P., Wool, E., et al. (2022). Global burden of bacterial
antimicrobial resistance in 2019: a systematic analy-
sis. The lancet, 399(10325):629–655.
Pehrsson, E. C., Tsukayama, P., Patel, S., Mejía-Bautista, M., Sosa-Soto, G., Navarrete, K. M., Calderon,
M., Cabrera, L., Hoyos-Arango, W., Bertoli, M. T.,
et al. (2016). Interconnected microbiomes and re-
sistomes in low-income human habitats. Nature,
533(7602):212–216.
Pham, V. H. and Kim, J. (2012). Cultivation of unculturable
soil bacteria. Trends in biotechnology, 30(9):475–484.
Rao, R. M., Liu, J., Verkuil, R., Meier, J., Canny, J. F.,
Abbeel, P., Sercu, T., and Rives, A. (2021). Trans-
former protein language models are unsupervised
structure learners. bioRxiv.
Ruppé, E., Ghozlane, A., Tap, J., Pons, N., Alvarez, A.-S., Maziers, N., Cuesta, T., Hernando-Amado, S., Clares, I., Martínez, J. L., et al. (2019). Prediction of the
intestinal resistome by a three-dimensional structure-
based method. Nature microbiology, 4(1):112–123.
Steinegger, M. and Söding, J. (2018). Clustering huge pro-
tein sequence sets in linear time. Nature communica-
tions, 9(1):2542.
Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B.,
Wu, C. H., and Consortium, U. (2015). Uniref clus-
ters: a comprehensive and scalable alternative for im-
proving sequence similarity searches. Bioinformatics,
31(6):926–932.
Taylor, W. L. (1953). “cloze procedure”: A new tool
for measuring readability. Journalism quarterly,
30(4):415–433.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wang, Z., Li, S., You, R., Zhu, S., Zhou, X. J., and
Sun, F. (2021). Arg-shine: improve antibiotic resis-
tance class prediction by integrating sequence homol-
ogy, functional information and deep convolutional
neural network. NAR Genomics and Bioinformatics,
3(3):lqab066.
Wu, J., Ouyang, J., Qin, H., Zhou, J., Roberts, R., Siam, R.,
Wang, L., Tong, W., Liu, Z., and Shi, T. (2023). Plm-
arg: antibiotic resistance gene identification using a
pretrained protein language model. Bioinformatics,
39(11):btad690.
Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M.,
Rasmussen, S., Lund, O., Aarestrup, F. M., and
Larsen, M. V. (2012). Identification of acquired an-
timicrobial resistance genes. Journal of antimicrobial
chemotherapy, 67(11):2640–2644.