RUDEUS: A Machine Learning Classification System to Study DNA-Binding Proteins

David Medina-Ortiz (1,2,a), Gabriel Cabas-Mora (1,b), Iván Moya (1,c), Nicole Soto-García (1,d) and Roberto Uribe-Paredes

1 Departamento de Ingeniería En Computación, Universidad de Magallanes, Avenida Bulnes 01855, Punta Arenas, Chile

2 Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Beauchef 851, Santiago, Chile

Keywords: DNA-Binding Proteins, Single-Stranded and Double-Stranded DNA, Machine Learning, Protein Language Models.

Abstract: DNA-binding proteins play crucial roles in biological processes such as replication, transcription, packaging, and chromatin remodeling. Their study has gained importance across scientific fields, with computational biology complementing traditional methods. While machine learning has advanced bioinformatics, generalizable pipelines for identifying DNA-binding proteins and their specific interactions remain scarce. We present RUDEUS, a Python library with hierarchical classification models to identify DNA-binding proteins and distinguish between single- and double-stranded DNA interactions. RUDEUS integrates protein language models, supervised learning, and Bayesian optimization, achieving 95% precision in DNA-binding identification and 89% accuracy in distinguishing interaction types. The library also includes tools for annotating unknown sequences and validating DNA-protein interactions through molecular docking. RUDEUS delivers competitive performance and is easily integrated into protein engineering workflows. It is available under the MIT License, with the source code and models available on the GitHub repository https://github.com/ProteinEngineering-PESB2/RUDEUS.

1 INTRODUCTION

DNA-protein interactions are fundamental to numerous cellular processes critical for biological functions. Approximately 6-7% of eukaryotic proteins interact with DNA, utilizing specific DNA-binding domains and varying affinities for single- and double-stranded DNA (Attali et al., 2021; Gupta et al., 2021). These interactions are driven by direct base–amino acid recognition and indirect forces from DNA conformational changes (Arora et al., 2023).

DNA-binding proteins (DBPs) play key roles in processes like DNA replication, transcription, packaging, and chromatin remodeling (Kabir et al., 2024). They aid in strand separation, maintain DNA integrity, regulate gene expression, and influence chromatin structure. Understanding DBPs is essential for insights into gene regulation and links between mutations and genetic diseases (Zhang et al., 2022; Kabir et al., 2024). Recent studies on proteins such as TDP-43 and chromodomain helicase DNA-binding proteins have advanced knowledge in fields like neurodegeneration and cancer (Lye and Chen, 2022; Alendar and Berns, 2021; Wang et al., 2022a).

a https://orcid.org/0000-0002-8369-5746
b https://orcid.org/0009-0004-2344-9860
c https://orcid.org/0000-0002-0458-378X
d https://orcid.org/0009-0001-1438-1938

Computational biology, bolstered by AI and machine learning, has enhanced the discovery of DBPs by predicting interaction sites and transcription factor binding hotspots (Wang et al., 2022b). While many machine learning models have been applied, including deep learning, comparing them is difficult due to variations in datasets and validation methods (Shadab et al., 2020; Zhang et al., 2020; Ali et al., 2022; Banjar et al., 2022; Barukab et al., 2022). Recent approaches have employed large protein language models for more robust numerical representations (Medina et al., 2023; Medina-Ortiz et al., 2024; Fernández et al., 2023).

This paper introduces RUDEUS, a Python library designed for DNA-binding classification and distinguishing between single- and double-stranded interactions. RUDEUS combines protein language models, supervised learning algorithms, and Bayesian hyperparameter tuning to build predictive models. Achieving precision rates of 95% for DNA-binding identification and 89% for interaction type evaluation, RUDEUS demonstrates strong performance. It annotated over 20,000 protein sequences and was validated using structural bioinformatics. The library's flexibility and ease of use make it a valuable tool for exploring latent space and mutation landscapes in DBPs.

Medina-Ortiz, D., Cabas-Mora, G., Moya, I., Soto-García, N. and Uribe-Paredes, R.
RUDEUS: A Machine Learning Classification System to Study DNA-Binding Proteins.
DOI: 10.5220/0012946500003838
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2024) - Volume 1: KDIR, pages 302-310
ISBN: 978-989-758-716-0; ISSN: 2184-3228

2 METHODS

2.1 Collecting and Processing Protein Sequences

All protein sequences were sourced from the literature, including datasets from Hu et al. (2019), Shadab et al. (2020), Sharma et al. (2021), and Wang et al. (2017). After collection, a preprocessing step was applied to merge and clean the datasets and to remove redundancy and inconsistencies. Filters were then applied to exclude non-canonical sequences and to select sequences within a length range of 50 to 1024 amino acids. Additionally, homology redundancy was eliminated using CD-HIT (Fu et al., 2012).
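The length and canonical-residue filters described above can be sketched as follows. This is a minimal illustration, not the RUDEUS implementation: the function names are hypothetical, "non-canonical" is assumed to mean any character outside the 20 standard amino acids, and only exact duplicates are dropped (homology reduction with CD-HIT is a separate external step that is not reproduced here).

```python
# The 20 standard amino acids (assumed definition of "canonical").
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def keep_sequence(seq, min_len=50, max_len=1024):
    """Return True if seq has an allowed length and only canonical residues."""
    seq = seq.strip().upper()
    return min_len <= len(seq) <= max_len and set(seq) <= CANONICAL


def clean_dataset(sequences):
    """Drop exact duplicates (keeping first occurrences) and apply the filters."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if seq not in seen and keep_sequence(seq):
            seen.add(seq)
            kept.append(seq)
    return kept
```

Calling `clean_dataset` on the merged list of sequences returns the subset that passes both filters, in its original order.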

2.2 Numerical Representation Strategies

This work explores different pre-trained protein language models, including ProtTrans (Elnaggar et al., 2020) and ESM (Rives et al., 2021; Meier et al., 2021). All pre-trained models were applied through the bio-embedding tool, combined with a reduction process to obtain one-dimensional (1-D) vectors (Dallago et al., 2021). Moreover, physicochemical-based approaches and Fourier transforms were also explored (Medina-Ortiz et al., 2022, 2020a).
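To make the reduction step concrete, the sketch below collapses a per-residue embedding matrix (one vector per amino acid) into a single fixed-length vector by averaging over residues. Mean pooling is an assumption here; the text does not state which reduction was used, and `mean_pool` is a hypothetical helper rather than a function of the bio-embedding tool.

```python
def mean_pool(residue_embeddings):
    """Average a list of per-residue vectors into one 1-D vector."""
    if not residue_embeddings:
        raise ValueError("empty embedding matrix")
    dim = len(residue_embeddings[0])
    if any(len(row) != dim for row in residue_embeddings):
        raise ValueError("inconsistent embedding dimension")
    n = len(residue_embeddings)
    # Column-wise mean: one value per embedding dimension.
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]
```

With this kind of reduction, sequences of different lengths map to vectors of the same size, which is what classical supervised learning algorithms expect as input.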

2.3 Training Predictive Models and Tuning Optimization

A classic machine learning pipeline was employed to train predictive models (Medina-Ortiz et al., 2020b). The datasets were first split into training (70%), validation (20%), and testing (10%) sets. The models were then trained using the strategies proposed in Medina-Ortiz et al. (2024), which included an exploration phase, statistical methods to select the best combinations of numerical representation strategies and machine learning algorithms, and Bayesian approaches for hyperparameter tuning (Akiba et al., 2019). Once the models were trained, the testing datasets were used for benchmarking, and the models were deployed to predict unknown protein sequences (see Figure 3 of the Appendix for a schematic representation of the pipeline employed to train the predictive models).
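The 70/20/10 partition can be illustrated with a short standard-library sketch. It assumes a simple shuffled split with a fixed seed; whether the original split was stratified by class is not stated, and `split_dataset` is a hypothetical name.

```python
import random


def split_dataset(items, train=0.7, val=0.2, seed=42):
    """Shuffle items reproducibly and split them into train/validation/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    # Whatever remains after the first two partitions is the test set.
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )
```

Keeping the test partition untouched until benchmarking, as the pipeline describes, is what allows the final metrics to be read as an estimate of performance on unseen sequences.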

2.4 Structural Bioinformatics Approaches

RUDEUS incorporates a structural bioinformatics pipeline to validate model predictions using DNA-protein molecular docking via LightDock v9.4 (Roel-Touris et al., 2020). The pipeline prepares protein structures by applying protonation, hydrogen deletion, structure rebuilding with the Reduce library, and modifying atoms to comply with the AMBER94 force field. After preparation, molecular docking is performed with 400 swarms, 200 glowworms, and 100 steps. The resulting conformers are clustered using the RMSD metric with the BSAS function, and the best pose is selected based on the highest docking score.
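The clustering and selection step can be sketched with the Basic Sequential Algorithmic Scheme (BSAS), a one-pass clustering method. This is an illustration under stated assumptions, not LightDock's code: poses are plain lists of 3-D points that are already superimposed, the RMSD threshold is a free parameter, the first member of each cluster acts as its representative, and the best pose is taken among representatives. All function names are hypothetical.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of 3-D points."""
    if len(a) != len(b) or not a:
        raise ValueError("poses must be non-empty and have the same atom count")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def bsas(poses, threshold):
    """One-pass BSAS: join the nearest cluster or open a new one."""
    clusters = []  # each cluster is a list of pose indices
    for i, pose in enumerate(poses):
        best, best_d = None, float("inf")
        for cluster in clusters:
            d = rmsd(pose, poses[cluster[0]])  # distance to the representative
            if d < best_d:
                best, best_d = cluster, d
        if best is None or best_d > threshold:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def best_pose(scores, clusters):
    """Index of the highest-scoring cluster representative."""
    return max((c[0] for c in clusters), key=lambda i: scores[i])
```

Because BSAS depends on the order of the input, sorting the poses by descending docking score before clustering makes each representative the top-scoring member of its cluster.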

2.5 Availability and Implementation Strategies

All source code was implemented in the Python programming language v3.9.16, including the modules, libraries, and demonstration scripts in RUDEUS. The main libraries employed to develop the predictive models were scikit-learn (Pedregosa et al., 2011) and Optuna (Akiba et al., 2019). Furthermore, the Pandas library was employed to process and compile all datasets (McKinney et al., 2011). Finally, a conda environment was constructed to facilitate the deployment of the library, combined with different Jupyter Notebooks, to ensure the replicability of the presented work. All source code, environment configuration, datasets, and created models are available for non-commercial uses in the GitHub repository under the MIT licence https://github.com/ProteinEngineering-PESB2/RUDEUS.

3 RESULTS AND DISCUSSIONS

3.1 RUDEUS Achieves High Performances in Its Classification Models

Two classification tasks were explored in RUDEUS: DNA-binding protein classification and the identification of single- versus double-stranded DNA interactions. For each task, over 10,000 combinations of numerical representation strategies and supervised learning algorithms were evaluated. The models' performance was measured using accuracy, precision, recall, and F-score.

Figure 4 of the Appendix displays the recall metric distributions for the training process. On average, the models achieved 83% precision for DNA-binding classification and 82% precision for DNA strand type prediction. The highest-performing DNA-binding models were based on the pre-trained ProtTrans Uniref, BDF, and XLU50 models, independent of the learning algorithm. For DNA strand type classification, the best results came from the ProtTrans XLU50, Uniref, t5bdf, ESM1B, and ESM1V models. Ensemble methods such as Random Forest, Gradient Boosting, and ExtraTrees, together with KNN, consistently delivered the top results for both tasks.

A statistical selection process identified the best combinations of representation strategies and algorithms. Sixteen Bernoulli events were evaluated using two filters: i) top-performing models above the 90th quantile and ii) models with standard deviations below the 10th quantile. A binomial distribution was then applied to detect outliers, with a success threshold of more than 12 events, representing a success probability below 0.01. This stringent selection yielded five optimal combinations for DNA-binding classification and four for DNA strand type prediction, as summarized in Table 1. While the selected models exhibited strong performance, overfitting was observed, with differences between training and validation metrics.

Table 1: Selected combinations of supervised learning algorithms and numerical representation approaches for all tasks explored in this work.

Task                                 Algorithm       Encoder        Recall
DNA-binding classification           ExtraTrees      prot. Uniref   0.93
                                     ExtraTrees      prot. bdf      0.93
                                     Gradient B      prot. Uniref   0.91
                                     KNeighbors      prot. Uniref   0.93
                                     RandomForest    prot. Uniref   0.93
Single-stranded or double-stranded   ExtraTrees      prot. Uniref   0.90
                                     ExtraTrees      prot. XLU50    0.90
                                     Gaussian Pro.   prot. XLU50    0.89
                                     SVC             prot. XLU50    0.90
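The binomial outlier test used in the statistical selection step can be explored with the sketch below, which computes the upper-tail probability of observing at least k successes in n Bernoulli events. The per-event success probability p is not reported in the text, so it is left as an explicit parameter; the function names and the default values in `is_selected` (16 events, threshold of 12) simply mirror the description and are otherwise hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_selected(successes, threshold=12):
    """A combination passes when it succeeds in more than `threshold` events."""
    return successes > threshold
```

For example, `binomial_tail(16, 13, p)` gives the chance that a combination passes the "more than 12 of 16" rule purely by chance for a chosen null probability p, which is the quantity the threshold is meant to keep small.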

All selected combinations of supervised learning algorithms and numerical representation strategies were optimized using the Optuna library (Akiba et al., 2019). Two models were then selected based on the criteria outlined in the pipeline. For DNA-binding prediction, the ExtraTrees algorithm combined with the ProtTrans Uniref model was chosen, while for single-stranded or double-stranded DNA interaction, the same algorithm was used, but the ProtTrans XLU50 model was selected. The DNA-binding model achieved 95% precision with a Matthews correlation coefficient (MCC) of 0.89, and the single-stranded/double-stranded model achieved 89% precision with an MCC of 0.81.
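For reference, the metrics reported here (precision, sensitivity, specificity, and MCC) all derive from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; `binary_metrics` is a hypothetical helper and the counts in the usage note are made up for illustration, not taken from the paper.

```python
import math


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),  # recall on the positive class
        "specificity": _ratio(tn, tn + fp),  # recall on the negative class
        "precision": _ratio(tp, tp + fp),
        "mcc": _ratio(tp * tn - fp * fn, denom),
    }
```

With illustrative counts `binary_metrics(tp=40, fp=10, tn=40, fn=10)`, sensitivity, specificity, and precision are each 0.8 and the MCC is 0.6, which shows why MCC is a stricter summary than any single rate.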

Figure 1 summarizes both models' performance. The confusion matrices (Figure 1 A and 1 C) indicate strong performance in identifying positive and negative classes, with the DNA-binding model outperforming the single/double-stranded model in distinguishing interactions. Precision-recall curves (Figure 1 B and 1 D) showed average precision values of 0.98 and 0.96, respectively, aligning with the confusion matrices and demonstrating the greater difficulty in classifying interaction types. ROC curves, calculated using 5-fold cross-validation (k = 5), revealed area under the curve (AUC) scores of 0.98 for DNA-binding and 0.97 for the interaction model, confirming the models' robust predictive capabilities.

Table 2 compares the RUDEUS models with state-of-the-art methods. For DNA-binding, RUDEUS achieved a specificity of 95.5% and an MCC of 0.89, values exceeded in Table 2 only by the method of Rahman et al. (2018), while its sensitivity (93.3%) differed by only 0.1% from the method in (Zhang et al., 2021). For the single-stranded/double-stranded task, RUDEUS achieved the highest MCC (0.81), although other methods reported higher sensitivity (Ali et al., 2020) and specificity (Tan et al., 2019). However, these methods showed signs of overfitting, as indicated by large gaps between sensitivity and specificity and lower MCC values compared to RUDEUS.

Table 2: State-of-the-art comparison for DNA-binding classification models and single-stranded or double-stranded interaction models.

Task                                 Classifier   SN(%)   SP(%)   MCC    Reference
DNA-binding                          RF           79.3    89.0    0.69   (Kumar et al., 2009)
                                     RF           83.7    90.0    0.72   (Ma et al., 2016)
                                     SVM          87.0    85.5    0.72   (Zaman et al., 2017)
                                     SVM          89.1    88.8    0.78   (Ali et al., 2018)
                                     SVM          94.1    97.6    0.92   (Rahman et al., 2018)
                                     SVM          91.1    88.8    0.79   (Mishra et al., 2019)
                                     SVM          91.8    93.0    0.84   (Ali et al., 2019)
                                     SVM          93.4    93.4    0.86   (Zhang et al., 2021)
                                     ExtraTrees   93.3    95.5    0.89   This work
Single-stranded or double-stranded   RF           90.8    78.8    0.64   (Wang et al., 2017)
                                     SVM          94.2    80.33   0.72   (Ali et al., 2020)
                                     GTB          78.4    97.5    0.79   (Tan et al., 2019)
                                     HMM          85.3    92.8    0.78   (Sharma et al., 2021)
                                     ExtraTrees   87.8    91.6    0.81   This work

Figure 1: Performance visualizations of the selected and optimized models for both tasks explored in this work. A-D Confusion matrices estimated during the validation process for the DNA-binding task and the single-stranded or double-stranded task, respectively. B-E Precision-recall curves estimated during the validation process for the DNA-binding task and the single-stranded or double-stranded task, respectively. The average precision (AP) was 0.98 and 0.96, respectively. C-F Receiver operating characteristic (ROC) curves estimated during the training process for the DNA-binding task and the single-stranded or double-stranded task, respectively. The area under the curve (AUC) was 0.98 and 0.97, respectively.

3.2 RUDEUS Facilitates the Exploration of Single-Stranded or Double-Stranded Interaction Evaluation

More than 20,000 DNA-binding protein sequences were classified as either single- or double-stranded using the exploration module in RUDEUS. First, the sequences were numerically represented using the pre-trained models selected for strand interaction classification. The predictions showed that over 18,000 proteins were classified as double-stranded, while around 2,000 were identified as single-stranded, reflecting proportions similar to the dataset used for model training.
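The hierarchical use of the two models (first DNA-binding identification, then strand type) can be sketched as a two-stage annotation routine. This is only an outline of the control flow: `is_dna_binding` and `strand_type` stand in for the trained classifiers and are hypothetical callables, not the RUDEUS API.

```python
from collections import Counter


def annotate(embedding, is_dna_binding, strand_type):
    """Stage 1 decides DNA binding; stage 2 runs only for predicted binders."""
    if not is_dna_binding(embedding):
        return "non-DNA-binding"
    # Expected to return "single-stranded" or "double-stranded".
    return strand_type(embedding)


def annotate_many(embeddings, is_dna_binding, strand_type):
    """Annotate a collection of encoded sequences and count each label."""
    return Counter(annotate(e, is_dna_binding, strand_type) for e in embeddings)
```

When the input is already known to contain only DNA-binding proteins, as in the exploration described above, the first stage can be replaced by a callable that always returns True so that only the strand-type model is applied.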

Three DNA-binding proteins with identified strand interactions were further evaluated using the structural bioinformatics pipeline. Figure 2 provides molecular docking visualizations and detailed interaction site analyses for these proteins, all of which were previously reported in the literature.

Figure 2 A illustrates the molecular docking of protein 1BNZ, a hyperthermophile chromosomal protein that binds double-stranded DNA (Gao et al., 1998; Guagliardi et al., 2002). Key hydrophobic residues (TRP24, VAL26, MET29, and ALA45) play a significant role in DNA binding (Figure 2 B). Interactions occur via hydrogen bonds, salt bridges, and van der Waals contacts, consistent with previous reports (Gao et al., 1998).

Similarly, Figure 2 C shows the docking of protein 1HRY, which is involved in sexual differentiation by regulating the gene responsible for Müllerian duct regression in male embryos (Werner et al., 1995). Six residues (ASN10, PHE12, ILE13, SER33, ILE35, SER36, TYR74) interact with DNA bases, forming hydrogen bonds and electrostatic interactions (Figure 2 D), as described in (Werner et al., 1995).

In contrast, Figure 2 E presents the docking of protein 3ULP, known as Pf-SSB, a single-stranded DNA-binding protein crucial for DNA metabolism in the malaria-causing parasite (Antony et al., 2012). The homotetramer structure of 3ULP features identical DNA-contacting residues (S110, N114, T129) across all four subunits (Figure 2 F), which form part of the replication and maintenance machinery in the apicoplast (Antony et al., 2012).

Figure 2: Structural bioinformatics validation through DNA-protein molecular docking for three DNA-binding proteins and their interaction type identified with the models available in RUDEUS. A-B DNA-protein molecular docking and the most relevant identified residues for the DNA interaction for the protein 1BNZ. C-D DNA-protein molecular docking and the most relevant identified residues for the interaction for the protein 1HRY. E-F DNA-protein molecular docking and the most relevant identified residues for the interaction for the protein 3ULP.

4 CONCLUSIONS

This work introduces RUDEUS, a Python library specifically designed for the investigation and classification of DNA-binding proteins, as well as the identification of DNA strand interaction types. The methodology incorporates a flexible pipeline that leverages protein language models, supervised learning algorithms, and Bayesian optimization to train high-performance classification models. These models are competitive with state-of-the-art methods in sensitivity, specificity, and MCC scores, while maintaining the simplicity and replicability of existing methods.

An extensive exploration process highlighted the utility of RUDEUS, enabling the annotation of over 20,000 protein sequences as single- or double-stranded, validated through structural bioinformatic approaches and DNA-protein molecular docking. RUDEUS's intuitive interface and powerful features make it highly applicable for integration into broader protein design pipelines, including landscape reconstruction, directed evolution, and latent space exploration using deep generative models.

COMPETING INTERESTS

The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS STATEMENT

IM-B and DM-O: conceptualization. DM-O, GC-M, and NS-G: methodology. DM-O and RU-P: validation. IM-B, GC-M, and NS-G: investigation. DM-O, IM-B, RU-P, and GC-M: writing, review, and editing. DM-O and RU-P: supervision and funding resources. DM-O: project administration.

ACKNOWLEDGEMENTS

This research has been financed mainly by the Centre for Biotechnology and Bioengineering - CeBiB (PIA project FB0001, Conicyt, Chile). DM-O acknowledges ANID for the project “SUBVENCIÓN A INSTALACIÓN EN LA ACADEMIA CONVOCATORIA AÑO 2022”, Folio 85220004. RU-P acknowledges ANID for the grant Fondecyt 1230298.

REFERENCES

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631.

Alendar, A. and Berns, A. (2021). Sentinels of chromatin: chromodomain helicase DNA-binding proteins in development and disease. Genes & Development, 35(21-22):1403–1430.

Ali, F., Ahmed, S., Swati, Z. N. K., and Akbar, S. (2019). Dp-binder: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information. Journal of Computer-Aided Molecular Design, 33:645–658.

Ali, F., Arif, M., Khan, Z. U., Kabir, M., Ahmed, S., and Yu, D.-J. (2020). Sdbp-pred: Prediction of single-stranded and double-stranded DNA-binding proteins by extending consensus sequence and k-segmentation strategies into PSSM. Analytical biochemistry, 589:113494.

Ali, F., Kabir, M., Arif, M., Swati, Z. N. K., Khan, Z. U., Ullah, M., and Yu, D.-J. (2018). Dbppred-pdsd: Machine learning approach for prediction of DNA-binding proteins using discrete wavelet transform and optimized integrated features space. Chemometrics and Intelligent Laboratory Systems, 182:21–30.

Ali, F., Kumar, H., Patil, S., Ahmed, A., Banjar, A., and Daud, A. (2022). Dbp-deepcnn: prediction of DNA-binding proteins using wavelet-based denoising and deep learning. Chemometrics and Intelligent Laboratory Systems, 229:104639.

Antony, E., Weiland, E. A., Korolev, S., and Lohman, T. M. (2012). Plasmodium falciparum SSB tetramer wraps single-stranded DNA with similar topology but opposite polarity to E. coli SSB. Journal of molecular biology, 420(4-5):269–283.

Arora, S., Gupta, S., Verma, S., and Malik, I. (2023). Prediction of DNA interacting residues. In 2023 International Conference on Computational Intelligence, Communication Technology and Networking (CICTN), pages 54–57. IEEE.

Attali, I., Botchan, M. R., and Berger, J. M. (2021). Structural mechanisms for replicating DNA in eukaryotes. Annual review of biochemistry, 90:77–106.

Banjar, A., Ali, F., Alghushairy, O., and Daud, A. (2022). idbp-pbmd: A machine learning model for detection of DNA-binding proteins by extending compression techniques into evolutionary profile. Chemometrics and Intelligent Laboratory Systems, 231:104697.

Barukab, O., Ali, F., Alghamdi, W., Bassam, Y., and Khan, S. A. (2022). Dbp-cnn: Deep learning-based prediction of DNA-binding proteins by coupling discrete cosine transform with two-dimensional convolutional neural network. Expert Systems with Applications, 197:116729.

Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., Yang, K. K., Min, S., Yoon, S., Morton, J. T., and Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1(5):e113.

Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., and Rost, B. (2020). Prottrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing.

Fernández, D., Olivera-Nappa, Á., Uribe-Paredes, R., and Medina-Ortiz, D. (2023). Exploring machine learning algorithms and protein language models strategies to develop enzyme classification systems. In International Work-Conference on Bioinformatics and Biomedical Engineering, pages 307–319. Springer.

Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152.

Gao, Y.-G., Su, S.-Y., Robinson, H., Padmanabhan, S., Lim, L., McCrary, B. S., Edmondson, S. P., Shriver, J. W., and Wang, A. H.-J. (1998). The crystal structure of the hyperthermophile chromosomal protein Sso7d bound to DNA. Nature structural biology, 5(9):782–786.

Guagliardi, A., Cerchia, L., Rossi, M., et al. (2002). The Sso7d protein of Sulfolobus solfataricus: in vitro relationship among different activities. Archaea, 1:87–93.

Gupta, N. K., Wilkinson, E. A., Karuppannan, S. K., Bailey, L., Vilan, A., Zhang, Z., Qi, D.-C., Tadich, A., Tuite, E. M., Pike, A. R., et al. (2021). Role of order in the mechanism of charge transport across single-stranded and double-stranded DNA monolayers in tunnel junctions. Journal of the American Chemical Society, 143(48):20309–20319.

Hu, S., Ma, R., and Wang, H. (2019). An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences. PLoS one, 14(11):e0225317.

Kabir, A., Bhattarai, M., Rasmussen, K. O., Shehu, A., Bishop, A. R., Alexandrov, B. S., and Usheva, A. (2024). Advancing transcription factor binding site prediction using DNA breathing dynamics and sequence transformers via cross attention. bioRxiv, pages 2024–01.

Kumar, K. K., Pugalenthi, G., and Suganthan, P. N. (2009). Dna-prot: identification of DNA binding proteins from protein sequence information using random forest. Journal of Biomolecular Structure and Dynamics, 26(6):679–686.

Lye, Y. S. and Chen, Y.-R. (2022). TAR DNA-binding protein 43 oligomers in physiology and pathology. IUBMB life, 74(8):794–811.

Ma, X., Guo, J., and Sun, X. (2016). Dnabp: Identification of DNA-binding proteins based on feature selection using a random forest and predicting binding residues. PloS one, 11(12):e0167345.


McKinney, W. et al. (2011). pandas: a foundational Python library for data analysis and statistics. Python for high performance and scientific computing, 14(9):1–9.

Medina, D., Sepulveda-Yanez, J., Alvarez-Saravia, D., Uribe-Paredes, R., Veelken, H., and Navarrete, M. (2023). Artificial intelligence approach for the discovery of autoantigen recognition by B-cell lymphomas. Blood, 142:125.

Medina-Ortiz, D., Contreras, S., Amado-Hinojosa, J., Torres-Almonacid, J., Asenjo, J. A., Navarrete, M., and Olivera-Nappa, A. (2020a). Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins. arXiv preprint arXiv:2010.03516.

Medina-Ortiz, D., Contreras, S., Amado-Hinojosa, J., Torres-Almonacid, J., Asenjo, J. A., Navarrete, M., and Olivera-Nappa, Á. (2022). Generalized property-based encoders and digital signal processing facilitate predictive tasks in protein engineering. Frontiers in Molecular Biosciences, 9.

Medina-Ortiz, D., Contreras, S., Fernández, D., Soto-García, N., Moya, I., Cabas-Mora, G., and Olivera-Nappa, Á. (2024). Protein language models and machine learning facilitate the identification of antimicrobial peptides. International Journal of Molecular Sciences, 25(16):8851.

Medina-Ortiz, D., Contreras, S., Quiroz, C., and Olivera-Nappa, Á. (2020b). Development of supervised learning predictive models for highly non-linear biological, biomedical, and general datasets. Frontiers in molecular biosciences, 7:13.

Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., and Rives, A. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 29287–29303. Curran Associates, Inc.

Mishra, A., Pokhrel, P., and Hoque, M. T. (2019). Stackdppred: a stacking based prediction of DNA-binding protein from sequence. Bioinformatics, 35(3):433–441.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12:2825–2830.

Rahman, M. S., Shatabda, S., Saha, S., Kaykobad, M., and Rahman, M. S. (2018). Dpp-pseaac: a DNA-binding protein prediction model using Chou's general PseAAC. Journal of theoretical biology, 452:22–34.

Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., and Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15).

Roel-Touris, J., Bonvin, A. M., and Jiménez-García, B. (2020). Lightdock goes information-driven. Bioinformatics, 36(3):950–952.

Shadab, S., Khan, M. T. A., Neezi, N. A., Adilina, S., and Shatabda, S. (2020). Deepdbp: deep neural networks for identification of DNA-binding proteins. Informatics in Medicine Unlocked, 19:100318.

Sharma, R., Kumar, S., Tsunoda, T., Kumarevel, T., and Sharma, A. (2021). Single-stranded and double-stranded DNA-binding protein prediction using HMM profiles. Analytical biochemistry, 612:113954.

Tan, C., Wang, T., Yang, W., and Deng, L. (2019). Predpsd: a gradient tree boosting approach for single-stranded and double-stranded DNA binding protein prediction. Molecules, 25(1):98.

Wang, W., Sun, L., Zhang, S., Zhang, H., Shi, J., Xu, T., and Li, K. (2017). Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences. BMC bioinformatics, 18:1–10.

Wang, Y., Zhang, L., Huang, T., Wu, G.-R., Zhou, Q., Wang, F.-X., Chen, L.-M., Sun, F., Lv, Y., Xiong, F., et al. (2022a). The methyl-CpG-binding domain 2 facilitates pulmonary fibrosis by orchestrating fibroblast to myofibroblast differentiation. European Respiratory Journal, 60(3).

Wang, Z., Gong, M., Liu, Y., Xiong, S., Wang, M., Zhou, J., and Zhang, Y. (2022b). Towards a better understanding of TF-DNA binding prediction from genomic features. Computers in Biology and Medicine, 149:105993.

Werner, M. H., Huth, J. R., Gronenborn, A. M., and Clore, G. M. (1995). Molecular basis of human 46X,Y sex reversal revealed from the three-dimensional solution structure of the human SRY-DNA complex. Cell, 81(5):705–714.

Zaman, R., Chowdhury, S. Y., Rashid, M. A., Sharma, A., Dehzangi, A., Shatabda, S., et al. (2017). Hmm-binder: DNA-binding protein prediction using HMM profile based features. BioMed research international, 2017.

Zhang, J., Chen, Q., and Liu, B. (2020). idrbp mmc: identifying DNA-binding proteins and RNA-binding proteins based on multi-label learning model and motif-based convolutional neural network. Journal of molecular biology, 432(22):5860–5875.

Zhang, Q., Liu, P., Wang, X., Zhang, Y., Han, Y., and Yu, B. (2021). Stackpdb: predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Applied Soft Computing, 99:106921.

Zhang, Y., Bao, W., Cao, Y., Cong, H., Chen, B., and Chen, Y. (2022). A survey on protein–DNA-binding sites in computational biology. Briefings in Functional Genomics, 21(5):357–375.


APPENDIX

Figure 3 panel labels: Datasets with DNA-binding protein sequences; Extracted and filtered protein sequences; Numerical representation strategies (Prottrans, Bepler, ESM); Peptide numerical representations; Exploring ML algorithms; Analyze performance distributions (Recall, Precision, and Accuracy against Count); Selection of top models via statistical approaches; Best performing models; Tuning model's hyperparameters; Select and export model.

Figure 3: The designed and implemented pipeline to train predictive models for DNA-binding identification incorporated in RUDEUS. The pipeline first collects and processes the protein sequences by applying length filters and removing non-canonical residues. Then, numerical representation strategies are applied to obtain encoded vectors through pre-trained protein language models, including ProtTrans family models, ESM family models, Bepler, Glove, and the other pre-trained models available in the bio-embedding library. Next, different supervised learning algorithms are explored with default hyperparameters on all datasets generated in the previous step. Statistical approaches are then applied to filter and select the best combinations of supervised learning algorithms and numerical representation approaches. A Bayesian approach, implemented through the Optuna library, guides the hyperparameter tuning of the selected combinations, and ensemble learning is explored to evaluate different combinations of the individually optimized models. Finally, the best strategy is selected based on performance, including training and validation metrics and the overfitting ratio.


Figure 4: Recall distribution performances for all explored tasks in this work evaluated by numerical representation strategies and supervised learning algorithms. A Recall distribution for the DNA-binding classification task grouped by the pre-trained model employed as numerical representation strategy. B Recall distribution for the DNA-binding classification task grouped by supervised learning algorithm. C Recall distribution for the single-stranded or double-stranded DNA type interaction task grouped by the pre-trained model employed as numerical representation strategy. D Recall distribution for the single-stranded or double-stranded DNA type interaction task grouped by supervised learning algorithm.
