RUDEUS: A Machine Learning Classification System to Study
DNA-Binding Proteins
David Medina-Ortiz (1,2), Gabriel Cabas-Mora (1), Iván Moya (1), Nicole Soto-García (1) and Roberto Uribe-Paredes
(1) Departamento de Ingeniería en Computación, Universidad de Magallanes, Avenida Bulnes 01855, Punta Arenas, Chile
(2) Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Beauchef 851, Santiago, Chile
Keywords: DNA-Binding Proteins, Single-Stranded and Double-Stranded DNA, Machine Learning,
Protein Language Models.
DNA-binding proteins play crucial roles in biological processes such as replication, transcription, pack-
aging, and chromatin remodeling. Their study has gained importance across scientific fields, with com-
putational biology complementing traditional methods. While machine learning has advanced bioinfor-
matics, generalizable pipelines for identifying DNA-binding proteins and their specific interactions remain
scarce. We present RUDEUS, a Python library with hierarchical classification models to identify
DNA-binding proteins and distinguish between single- and double-stranded DNA interactions. RUDEUS
integrates protein language models, supervised learning, and Bayesian optimization, achieving 95% precision
in DNA-binding identification and 89% precision in distinguishing interaction types. The library also includes
tools for annotating unknown sequences and validating DNA-protein interactions through molecular dock-
ing. RUDEUS delivers competitive performance and is easily integrated into protein engineering workflows.
It is available under the MIT License, with the source code and models available on the GitHub repository
DNA-protein interactions are fundamental to numer-
ous cellular processes critical for biological func-
tions. Approximately 6-7% of eukaryotic proteins in-
teract with DNA, utilizing specific DNA-binding do-
mains and varying affinities for single- and double-
stranded DNA (Attali et al., 2021; Gupta et al., 2021).
These interactions are driven by direct base–amino
acid recognition and indirect forces from DNA con-
formational changes (Arora et al., 2023).
DNA-binding proteins (DBPs) play key roles in
processes like DNA replication, transcription, pack-
aging, and chromatin remodeling (Kabir et al., 2024).
They aid in strand separation, maintain DNA in-
tegrity, regulate gene expression, and influence chro-
matin structure. Understanding DBPs is essential
for insights into gene regulation and links between
mutations and genetic diseases (Zhang et al., 2022;
Kabir et al., 2024). Recent studies on proteins such
as TDP-43 and helicase chromodomain proteins have
advanced knowledge in fields like neurodegeneration
and cancer (Lye and Chen, 2022; Alendar and Berns,
2021; Wang et al., 2022a).
Computational biology, bolstered by AI and ma-
chine learning, has enhanced the discovery of DBPs
by predicting interaction sites and transcription factor
binding hotspots (Wang et al., 2022b). While many
machine learning models have been applied, includ-
ing deep learning, comparing them is difficult due to
variations in datasets and validation methods (Shadab
et al., 2020; Zhang et al., 2020; Ali et al., 2022; Ban-
jar et al., 2022; Barukab et al., 2022). Recent ap-
proaches have employed large protein language mod-
els for more robust numerical representations (Med-
ina et al., 2023; Medina-Ortiz et al., 2024; Fern
et al., 2023).
This paper introduces RUDEUS, a Python library
designed for DNA-binding classification and distin-
guishing between single- and double-stranded inter-
actions. RUDEUS combines protein language mod-
els, supervised learning algorithms, and Bayesian
hyperparameter tuning to build predictive models.
Achieving precision rates of 95% for DNA-binding
identification and 89% for interaction type evaluation,
RUDEUS demonstrates strong performance. It anno-
tated over 20,000 protein sequences and was validated
using structural bioinformatics. The library’s flexibil-
ity and ease of use make it a valuable tool for explor-
ing latent space and mutation landscapes in DBPs.
2.1 Collecting and Processing Protein Sequences
All protein sequences were sourced from the litera-
ture, including datasets from Hu et al. (2019); Shadab
et al. (2020); Sharma et al. (2021); Wang et al. (2017).
After collection, a preprocessing step was applied to
merge, clean, and remove redundancy and inconsis-
tencies. Filters were then applied to exclude non-
canonical sequences and select sequences within a
length range of 50 to 1024 amino acids. Additionally,
homology redundancy was eliminated using CD-HIT
(Fu et al., 2012).
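The cleaning and filtering steps described above can be sketched as follows. This is a minimal illustration, not the RUDEUS implementation: the function names `is_valid` and `clean_dataset` are hypothetical, and the homology-reduction step with CD-HIT is not reproduced here.

```python
# Hypothetical helpers illustrating the filters described in the text.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids


def is_valid(sequence, min_len=50, max_len=1024):
    """Return True for canonical sequences with length in [min_len, max_len]."""
    seq = sequence.strip().upper()
    return min_len <= len(seq) <= max_len and set(seq) <= CANONICAL


def clean_dataset(sequences):
    """Drop exact duplicates and invalid sequences, preserving input order."""
    seen = set()
    kept = []
    for raw in sequences:
        seq = raw.strip().upper()
        if seq not in seen and is_valid(seq):
            seen.add(seq)
            kept.append(seq)
    return kept


# Toy input: a duplicate (different case), a non-canonical residue, a short sequence.
cleaned = clean_dataset(["A" * 60, "a" * 60, "B" * 60, "C" * 10])
```

Only exact duplicates are removed in this sketch; sequences that are similar but not identical would still need a dedicated clustering tool such as the one cited in the text.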
2.2 Numerical Representation
This work explores different pre-trained protein
language models, including ProtTrans (Elnaggar et al.,
2020) and ESM (Rives et al., 2021; Meier et al., 2021).
All pre-trained models were applied through the
bio-embeddings tool, combined with a reduction step
to obtain one-dimensional vectors (Dallago et al.,
2021). Physicochemical-based approaches and
Fourier transforms were also explored
(Medina-Ortiz et al., 2022, 2020a).
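The reduction step can be illustrated with a short sketch. Protein language models typically return one vector per residue, so a sequence of length L yields an L x D matrix; pooling over the residue axis gives a single D-dimensional vector. The text does not state which reduction was used, so mean pooling is an assumption here, and `pool_embedding` is a hypothetical helper, not part of the cited tool.

```python
import numpy as np


def pool_embedding(per_residue, method="mean"):
    """Collapse a (length, embedding_dim) matrix into one 1D vector."""
    per_residue = np.asarray(per_residue, dtype=float)
    if per_residue.ndim != 2:
        raise ValueError("expected an array of shape (length, embedding_dim)")
    if method == "mean":
        return per_residue.mean(axis=0)
    if method == "max":
        return per_residue.max(axis=0)
    raise ValueError(f"unknown pooling method: {method}")


# Random stand-in for a real embedding: 120 residues, 1024 features (assumed size).
rng = np.random.default_rng(0)
fake_embedding = rng.normal(size=(120, 1024))
vector = pool_embedding(fake_embedding)  # shape (1024,), independent of length
```

Because the pooled vector has the same size for every sequence, proteins of different lengths can be stacked into one feature matrix for supervised learning.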
2.3 Training Predictive Models and
Tuning Optimization
A classic machine learning pipeline was employed to
train predictive models (Medina-Ortiz et al., 2020b).
The datasets were first split into training (70%),
validation (20%), and testing (10%) sets. The models
were then trained using the strategies proposed in
Medina-Ortiz et al. (2024), which included an
exploration phase, statistical methods to select the
best combinations of numerical representation
strategies and machine learning algorithms, and
Bayesian approaches for hyperparameter tuning
(Akiba et al., 2019). Once the models were trained,
the testing datasets were used for benchmarking, and
the models were deployed to predict unknown protein
sequences (see Figure 3 in the Appendix for a
schematic representation of the pipeline employed to
train the predictive models).
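The 70/20/10 partition can be sketched with a shuffled index split. This is only an illustration under stated assumptions: the text does not say whether the split was stratified by class or which random seed was used, and `split_indices` is a hypothetical helper.

```python
import numpy as np


def split_indices(n_samples, fractions=(0.7, 0.2, 0.1), seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

The test indices are set aside and used only for the final benchmark, while the validation indices guide model selection and hyperparameter tuning.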
2.4 Structural Bioinformatics
RUDEUS incorporates a structural bioinformatics
pipeline to validate model predictions using
DNA-protein molecular docking via LightDock v9.4
(Roel-Touris et al., 2020). The pipeline prepares
protein structures by applying protonation, hydrogen
deletion, structure rebuilding with the Reduce library,
and modifying atoms to comply with the AMBER94
force field. After preparation, molecular docking is
performed with 400 swarms, 200 glowworms, and 100
steps. The resulting conformers are clustered using
the RMSD metric with the BSAS function, and the
best pose is selected based on the highest docking
score.
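The clustering step can be illustrated with a generic version of the Basic Sequential Algorithmic Scheme (BSAS). The sketch below is not the docking tool's own code: the 4.0 Å threshold is an assumed value, RMSD is computed without superposition, and the names `rmsd` and `bsas` are hypothetical.

```python
import numpy as np


def rmsd(a, b):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def bsas(poses, scores, threshold=4.0):
    """Cluster poses sequentially, visiting them from best to worst score.

    Each cluster is a list of pose indices; its first element is the
    representative. A pose joins the nearest representative if the RMSD
    is within the threshold, otherwise it starts a new cluster.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    clusters = []
    for idx in order:
        best_cluster, best_dist = None, None
        for cluster in clusters:
            dist = rmsd(poses[idx], poses[cluster[0]])
            if best_dist is None or dist < best_dist:
                best_cluster, best_dist = cluster, dist
        if best_cluster is not None and best_dist <= threshold:
            best_cluster.append(idx)
        else:
            clusters.append([idx])
    return clusters


# Toy poses: two near the origin, two shifted by about 10 units.
base = np.zeros((5, 3))
poses = [base, base + 0.5, base + 10.0, base + 10.2]
scores = [1.0, 3.0, 2.5, 0.4]
clusters = bsas(poses, scores)
best_pose = clusters[0][0]  # representative of the first cluster
```

Because poses are visited in descending score order, the representative of the first cluster is always the highest-scoring pose.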
2.5 Availability and Implementation
All source code was implemented in Python v3.9.16,
including the modules, libraries, and demonstration
scripts in RUDEUS. The main libraries employed to
develop the predictive models were scikit-learn
(Pedregosa et al., 2011) and Optuna (Akiba et al.,
2019). The Pandas library was employed to process
and compile all datasets (McKinney et al., 2011).
Finally, a conda environment was constructed to
facilitate the deployment of the library, combined
with different Jupyter Notebooks, to ensure the
replicability of the presented work. All source code,
environment configuration, datasets, and created
models are available for non-commercial uses in the
GitHub repository under the MIT licence.
3.1 RUDEUS Achieves High
Performances in Its Classification
Two classification tasks were explored in RUDEUS:
DNA-binding protein classification and the identifi-
cation of single- versus double-stranded DNA inter-
actions. For each task, over 10,000 combinations
of numerical representation strategies and supervised
learning algorithms were evaluated. The models’ per-
formance was measured using accuracy, precision, re-
call, and F-score.
Figure 4 in the Appendix displays the recall
distributions for the training process. On average,
the models achieved 83% precision for DNA-binding
classification and 82% precision for DNA strand type
prediction. The highest-performing DNA-binding
models were based on the pre-trained ProtTrans
Uniref, BDF, and XLU50 models, independent of the
learning algorithm. For DNA strand type
classification, the best results came from the
ProtTrans XLU50, Uniref, and t5bdf models and the
ESM1B and ESM1V models. Ensemble methods such as
Random Forest, Gradient Boosting, and ExtraTrees,
together with KNN, consistently delivered the top
results for both tasks.
A statistical selection process identified the best
combinations of representation strategies and
algorithms. Sixteen Bernoulli events were evaluated
using two filters: i) top-performing models above the
90th quantile and ii) models with standard deviations
below the 10th quantile. A binomial distribution was
then applied to detect outliers, with a success
threshold of more than 12 events, representing a
success probability below 0.01. This stringent
selection yielded five optimal combinations for
DNA-binding classification and four for DNA strand
type prediction, as summarized in Table 1. While the
selected models exhibited strong performance,
overfitting was observed, with differences between
training and validation metrics.
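The binomial criterion can be made concrete with a short calculation. This is a sketch, not the authors' code: the per-event success probability `p` is not given in the text and is left as a parameter, and the function names are hypothetical.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """Return P(X > k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


def is_selected(successes, threshold=12):
    """A combination is kept when it passes more than `threshold` events."""
    return successes > threshold


# Tail probability of passing more than 12 of 16 events, for an assumed p.
assumed_p = 0.5  # illustrative value only
tail = binomial_upper_tail(16, 12, assumed_p)
```

The smaller the tail probability, the less likely it is that a combination passes that many events by chance, which is why a high count is treated as an outlier worth selecting.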
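Table 1: Selected combinations of supervised learning algorithms and numerical representation approaches for all tasks explored in this work.
Task                                Algorithm         Encoder            Recall
DNA-binding                         ExtraTrees        ProtTrans Uniref   0.93
DNA-binding                         ExtraTrees        ProtTrans bdf      0.93
DNA-binding                         GradientBoosting  ProtTrans Uniref   0.91
DNA-binding                         KNeighbors        ProtTrans Uniref   0.93
DNA-binding                         RandomForest      ProtTrans Uniref   0.93
Single-stranded or double-stranded  ExtraTrees        ProtTrans Uniref   0.90
Single-stranded or double-stranded  ExtraTrees        ProtTrans XLU50    0.90
Single-stranded or double-stranded  GaussianProcess   ProtTrans XLU50    0.89
Single-stranded or double-stranded  SVC               ProtTrans XLU50    0.90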
All selected combinations of supervised learn-
ing algorithms and numerical representation strate-
gies were optimized using the Optuna library (Akiba
et al., 2019). Two models were then selected based
on the criteria outlined in the pipeline. For DNA-
binding prediction, the ExtraTrees algorithm com-
bined with the ProtTrans Uniref model was chosen,
while for single-stranded or double-stranded DNA in-
teraction, the same algorithm was used, but the Prot-
Trans XLU50 model was selected. The DNA-binding
model achieved 95% precision with a Matthews cor-
relation coefficient (MCC) of 0.89, and the single-
stranded/double-stranded model achieved 89% preci-
sion with an MCC of 0.81.
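For reference, precision, sensitivity, specificity, and the MCC can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts in the example are invented for illustration and are not results from this work.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty classes

    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, denom),
    }


# Invented counts, for illustration only.
example = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

Unlike precision alone, the MCC uses all four cells, so it stays informative when the two classes are imbalanced.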
Figure 1 summarizes both models’ performance.
The confusion matrices (Figure 1 A and D) indicate
strong performance in identifying positive and
negative classes, with the DNA-binding model
outperforming the single-/double-stranded model.
Precision-recall curves (Figure 1 B and E) showed
average precision values of 0.98 and 0.96,
respectively, aligning with the confusion matrices
and reflecting the greater difficulty of classifying
interaction types. ROC curves (Figure 1 C and F),
calculated using 5-fold cross-validation, revealed
area under the curve (AUC) scores of 0.98 for
DNA-binding and 0.97 for the interaction model,
confirming the models’ robust predictive capabilities.
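The AUC values reported above have a simple interpretation: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; `roc_auc` is a hypothetical helper written for clarity rather than speed, and the scores are toy values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy scores: one positive example is ranked below one negative example.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

With cross-validation, this quantity would be computed on each held-out fold and then averaged.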
Table 2 compares the RUDEUS models with
state-of-the-art methods. For DNA-binding, RUDEUS
achieved a specificity of 95.5% and an MCC of 0.89,
values exceeded in Table 2 only by the method of
Rahman et al. (2018), which also reports the highest
sensitivity (94.1%); the sensitivity of the method in
Zhang et al. (2021) differs by only 0.1 percentage
points from that of RUDEUS. For the
single-stranded/double-stranded task, RUDEUS
achieved the highest MCC (0.81), although other
methods reported higher sensitivity (Ali et al., 2020)
and specificity (Tan et al., 2019). However, these
methods showed signs of overfitting, as indicated by
large gaps between sensitivity and specificity and
lower MCC values compared to RUDEUS.
Table 2: State-of-the-art comparison for DNA-binding classification models and single-stranded or double-stranded interaction models (SN: sensitivity; SP: specificity; MCC: Matthews correlation coefficient).
Task                                Classifier  SN(%)  SP(%)  MCC   Reference
DNA-binding                         RF          79.3   89.0   0.69  (Kumar et al., 2009)
DNA-binding                         RF          83.7   90.0   0.72  (Ma et al., 2016)
DNA-binding                         SVM         87.0   85.5   0.72  (Zaman et al., 2017)
DNA-binding                         SVM         89.1   88.8   0.78  (Ali et al., 2018)
DNA-binding                         SVM         94.1   97.6   0.92  (Rahman et al., 2018)
DNA-binding                         SVM         91.1   88.8   0.79  (Mishra et al., 2019)
DNA-binding                         SVM         91.8   93.0   0.84  (Ali et al., 2019)
DNA-binding                         SVM         93.4   93.4   0.86  (Zhang et al., 2021)
DNA-binding                         ExtraTrees  93.3   95.5   0.89  This work
Single-stranded or double-stranded  RF          90.8   78.8   0.64  (Wang et al., 2017)
Single-stranded or double-stranded  SVM         94.2   80.33  0.72  (Ali et al., 2020)
Single-stranded or double-stranded  GTB         78.4   97.5   0.79  (Tan et al., 2019)
Single-stranded or double-stranded  HMM         85.3   92.8   0.78  (Sharma et al., 2021)
Single-stranded or double-stranded  ExtraTrees  87.8   91.6   0.81  This work
3.2 RUDEUS Facilitates the Exploration
of Single-Stranded or
Double-Stranded Interaction
Figure 1: Performance visualizations for the selected and optimized models for both tasks explored in this work.
A, D: Confusion matrices estimated during the validation process for the DNA-binding task and the single-stranded
or double-stranded task, respectively. B, E: Precision-recall curves estimated during the validation process for the
DNA-binding task and the single-stranded or double-stranded task, respectively. The average precision (AP) was 0.98
and 0.96, respectively. C, F: Receiver operating characteristic (ROC) curves estimated during the training process for
the DNA-binding task and the single-stranded or double-stranded task, respectively. The area under the curve (AUC)
was 0.98 and 0.97, respectively.
More than 20,000 DNA-binding protein sequences
were classified as either single- or double-stranded
using the exploration module in RUDEUS. The
sequences were first numerically represented using
the pre-trained models selected for strand interaction
classification. The predictions showed that over
18,000 proteins were classified as double-stranded,
while around 2,000 were identified as single-stranded,
reflecting proportions similar to the dataset used for
model training.
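The abstract describes the RUDEUS models as hierarchical: a sequence is first screened for DNA binding, and only predicted binders are assigned a strand type. A minimal sketch of such a cascade follows, assuming two already-trained predictors exposed as plain callables. The function `annotate` and both toy predictors are hypothetical stand-ins with no biological meaning, not the RUDEUS interface.

```python
from collections import Counter


def annotate(sequences, is_dna_binding, strand_type):
    """Two-level annotation: binding first, strand type only for binders."""
    labels = {}
    for seq in sequences:
        if not is_dna_binding(seq):
            labels[seq] = "non-binding"
        else:
            labels[seq] = strand_type(seq)
    return labels


# Toy stand-ins for trained models (arbitrary rules, for illustration only).
def toy_binding(seq):
    return seq.count("K") + seq.count("R") >= 3


def toy_strand(seq):
    return "single-stranded" if "W" in seq else "double-stranded"


labels = annotate(["KKRAA", "KRKWA", "AAAAA"], toy_binding, toy_strand)
summary = Counter(labels.values())  # class proportions of the annotated set
```

Counting the predicted labels, as in `summary`, gives the kind of class proportions that the text compares against the training data.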
Three DNA-binding proteins with identified strand
interactions were further evaluated using the
structural bioinformatics pipeline. Figure 2 provides
molecular docking visualizations and detailed
interaction site analyses for these proteins, all of
which were previously reported in the literature.
Figure 2 A illustrates the molecular docking
of protein 1BNZ, a hyperthermophile chromoso-
mal protein that binds double-stranded DNA (Gao
et al., 1998; Guagliardi et al., 2002). Key hy-
drophobic residues—TRP24, VAL26, MET29, and
ALA45—play a significant role in DNA binding (Fig-
ure 2 B). Interactions occur via hydrogen bonds, salt
bridges, and van der Waals contacts, consistent with
previous reports (Gao et al., 1998).
Similarly, Figure 2 C shows the docking of protein
1HRY, which is involved in sexual differentiation by
regulating the gene responsible for Müllerian duct
regression in male embryos (Werner et al., 1995).
Seven residues (ASN10, PHE12, ILE13, SER33, ILE35,
SER36, TYR74) interact with DNA bases, forming
hydrogen bonds and electrostatic interactions (Figure
2 D), as described by Werner et al. (1995).
In contrast, Figure 2 E presents the docking of
protein 3ULP, known as Pf-SSB, a single-stranded
DNA-binding protein crucial for DNA metabolism
in the malaria-causing parasite (Antony et al., 2012).
The homotetramer structure of 3ULP features
identical DNA-contacting residues (S110, N114, T129)
across all four subunits (Figure 2 F), which form part
of the replication and maintenance machinery in the
apicoplast (Antony et al., 2012).
Figure 2: Structural bioinformatics validation through DNA-protein molecular docking for three DNA-binding proteins
and their interaction type identified with the models available in RUDEUS. A-B DNA-protein molecular docking
and the most relevant identified residues for the DNA interaction for the protein 1BNZ. C-D DNA-protein molecular docking
and the most relevant identified residues for the interaction for the protein 1HRY. E-F DNA-protein molecular docking and
the most relevant identified residues for the interaction for the protein 3ULP.
This work introduces RUDEUS, a Python library
specifically designed for the investigation and clas-
sification of DNA-binding proteins, as well as the
identification of DNA strand interaction types. The
methodology incorporates a flexible pipeline that
leverages protein language models, supervised learn-
ing algorithms, and Bayesian optimization to train
high-performance classification models. These mod-
els surpass state-of-the-art benchmarks in sensitiv-
ity, specificity, and MCC scores, demonstrating
RUDEUS’s superiority in this domain, while main-
taining the simplicity and replicability of existing
An extensive exploration process highlighted the
utility of RUDEUS, enabling the annotation of
over 20,000 protein sequences as single- or double-
stranded, validated through structural bioinformatic
approaches and DNA-protein molecular docking.
RUDEUS’s intuitive interface and powerful features
make it highly applicable for integration into broader
protein design pipelines, including landscape recon-
struction, directed evolution, and latent space explo-
ration using deep generative models.
The authors declare that the research was conducted
without any commercial or financial relationships that
could be construed as a potential conflict of interest.
IM-B and DM-O: conceptualization. DM-O, GC-M,
and NS-G: methodology. DM-O and RU-P: valida-
tion. IM-B, GC-M, and NS-G: investigation. DM-O,
IM-B, RU-P, and GC-M: writing, review, and editing.
DM-O and RU-P: supervision and funding resources.
DM-O: project administration.
This research has been financed mainly by the Centre
for Biotechnology and Bioengineering - CeBiB (PIA
project FB0001, Conicyt, Chile). DM-O acknowl-
edges ANID for the project “SUBVENCI
NO 2022”, Folio 85220004. RU-P acknowl-
edges ANID for the grant Fondecyt 1230298.
[Figure 3: pipeline diagram. Stage labels recovered from the figure: datasets with DNA-binding protein sequences; extracted and filtered protein sequences; numerical representation; exploring ML; selection of top models via statistical approaches; model tuning; analyze performance; select and export model.]
Figure 3: The designed and implemented pipeline to train predictive models for DNA-binding identification incorporated
in RUDEUS. The pipeline first collects and processes the protein sequences by applying length filters and removing
non-canonical residues. Numerical representation strategies are then applied to obtain encoded vectors through
pre-trained protein language models, including the ProtTrans family, the ESM family, Bepler, GloVe, and the other
pre-trained models available in the bio-embeddings library. Next, different supervised learning algorithms are explored
with default hyperparameters on all datasets generated in the previous step. Statistical approaches are then applied to
filter and select the best combinations of supervised learning algorithms and numerical representation approaches.
A Bayesian approach, implemented with the Optuna library, guides hyperparameter tuning for the selected combinations,
and ensemble learning is explored to evaluate different combinations of the individually optimized models. Finally, the
best strategy is selected based on performance, including training and validation metrics and the overfitting ratio.
Figure 4: Recall distribution performances for all explored tasks in this work evaluated by numerical representa-
tion strategies and supervised learning algorithms. A Recall distribution for DNA-binding classification task grouped by
pre-trained model employed as numerical representation strategy. B Recall distribution for DNA-binding classification task
grouped by supervised learning algorithm. C Recall distribution for single-stranded or double-stranded DNA type interaction
task grouped by pre-trained model employed as numerical representation strategy. D Recall distribution for single-stranded
or double-stranded DNA type interaction task grouped by supervised learning algorithm.
