Machine Learning Methods for Phenotype Prediction from High-Dimensional, Low Population Aquaculture Data

Giovanni Faldani, Enrico Rossignolo, Eleonora Signor, Alessio Longo, Sara Faggion, Luca Bargelloni, Matteo Comin and Cinzia Pizzi

Department of Information Engineering, University of Padova, Padova, 35131, Italy

Department of Comparative Biomedicine and Food Science, University of Padova, Legnaro (PD), 35020, Italy

{matteo.comin, cinzia.pizzi}@unipd.it

Keywords:

High-Dimensional, Low Population, SNP Data, Machine Learning Classiﬁcation, Phenotype Prediction.

Abstract:

Recent research has increasingly focused on classiﬁcation rules within the big data framework, yet many bioin-

formatics applications still address prediction problems that involve small-sample, high-dimensional data. In

phenotype prediction, especially with the rise of large-scale genomic data, a central challenge arises from han-

dling high-dimensional datasets where the number of genetic features (such as SNPs) far exceeds the sample

size. A signiﬁcant example of such high-dimensional, low-sample datasets is found in aquaculture, a rapidly

growing sector within global food production and a crucial source of high-quality protein. This study uses data

from an experiment performed on European seabass as a test case, focusing on predicting resistance to Viral

Nervous Necrosis (VNN) as a speciﬁc phenotype of interest. We explore a range of machine learning tech-

niques to address the complexities of high-dimensional data, from established methods like gradient boosting,

SVM, and deep learning to newer approaches. This paper evaluates various methods for associating SNPs

with phenotypic traits, benchmarking their performance on challenging aquaculture genomic data to provide

insight into the effectiveness of these techniques.

1 INTRODUCTION

The exploration of genotype-phenotype relationships has seen a growing number of studies focused on identifying genetic variants linked to various diseases. Most single nucleotide polymorphisms (SNPs), which are used as markers for specific genomic regions, exert minimal biological effects. Finding the SNPs that affect biological functions is generally very challenging (Uffelmann et al., 2021).

In recent years, genome-wide association stud-

ies (GWAS) have signiﬁcantly expanded our under-

standing of SNP roles and associations, shedding light

on the genetic impact on diseases (Uffelmann et al.,

2021). Through GWAS, SNPs can be identiﬁed as

candidate biomarkers, potentially indicating suscepti-

bility to complex diseases. Despite GWAS’ success

in pinpointing disease-related SNPs, unique chal-

lenges arise, particularly in the context of big genomic

data where high-dimensional datasets often feature

far more genetic variables than samples (Uppu et al.,

2018). A tightly related problem is the phenotype pre-

diction of a disease from this high-dimensional, low-

population SNP data. In phenotype prediction, where

uncovering gene-disease associations is key, datasets

typically contain a vast number of SNPs (e.g. 10

)

against relatively small sample sizes (e.g. 10

). Nav-

igating this high-dimensional space to identify rep-

resentative SNPs is a persistent challenge in under-

standing the genetic foundations of disease.

In this paper, we will use as a test case data from

an experimental challenge test performed on Euro-

pean sea bass, and in particular the prediction of a

speciﬁc phenotype, resistance to viral nervous necro-

sis (VNN).

Aquaculture is a key source of high-quality protein worldwide and has become one of the fastest-growing sectors in global food production (Burić et al., 2020; You et al., 2020). European sea bass is a highly valued species across Europe and the Mediterranean area, carrying substantial economic and cultural importance (Vandeputte et al., 2019). In the past two decades, global aquaculture production of European sea bass has grown significantly, rising from 7,694 tons in 2000 to 299,810 tons in 2021.

However, the industry faces increasing challenges from infectious diseases, which threaten both the sustainability of sea bass farming and the health of cultured populations.

Faldani, G., Rossignolo, E., Signor, E., Longo, A., Faggion, S., Bargelloni, L., Comin, M. and Pizzi, C.

Machine Learning Methods for Phenotype Prediction from High-Dimensional, Low Population Aquaculture Data.

DOI: 10.5220/0013248000003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 1, pages 638-646

ISBN: 978-989-758-731-3; ISSN: 2184-4305

Viral Nervous Necrosis is a major viral disease

impacting global aquaculture, affecting numerous

farmed and ecologically vital species. It is the pri-

mary viral infectious disease in European sea bass, re-

sponsible for 15% of all on-farm disease-related mor-

talities (Muniesa et al., 2020). VNN resistance in Eu-

ropean sea bass is characterized by signiﬁcant addi-

tive genetic variation and recently one genomic re-

gion has been detected as signiﬁcantly associated with

this trait (Mukiibi et al., 2024), yet the speciﬁc causal

gene(s) and mutation(s) underlying this resistance re-

main unknown.

In this paper, we exploit three different machine learning approaches to predict VNN resistance of about a thousand individuals through the analysis of several SNP datasets. Machine learning provides

a versatile and extensive set of techniques suited to

tackle the challenges of these high-dimensional low-

population SNP datasets. Among the several machine

learning approaches, we selected XGBoost (Chen

and Guestrin, 2016), and COMBI (both SVM and

Deep Learning versions) (Mieth et al., 2016; Mi-

eth et al., 2021). Moreover, we designed an ad-hoc

Chaos Game Representation (CGR) approach (Jef-

frey, 1990) that maps sequences of SNPs into im-

ages which are then classiﬁed using a Convolutional

Neural Network. This choice of tools covers both

“classic” machine learning approaches (such as SVM

and gradient boosting) and more recent deep learn-

ing approaches, including an original CGR encoding

scheme for SNPs. In our experiments, we assessed

these machine learning methods for SNP and pheno-

type association and evaluated their prediction per-

formance on a challenging bank of high dimensional

aquacultural genomic data to provide insight on their

efﬁcacy and efﬁciency.

2 METHODS

Our research aims to investigate the mortality of a small sea bass population by exploiting its high-dimensional SNP data. SNPs act as biological markers and can identify genes associated with a disease.

We explore different machine learning approaches, from boosted trees to support vector machines and neural networks. Among these, we chose two well-known machine learning algorithms already applied in bioinformatics for classifying SNP data: XGBoost and COMBI, both characterized by good model explainability and classification performance. Furthermore, we introduce a novel approach for classifying SNPs by adapting Chaos Game Representation (CGR), an alignment-free sequence algorithm, to make it suitable for representing SNPs. The resulting representation is then fed to a convolutional neural network classifier. These three approaches are presented in the next sections.

2.1 XGBoost

XGBoost (eXtreme Gradient Boosting) is a widely

used and powerful machine learning algorithm based

on the gradient boosting framework (Chen and

Guestrin, 2016). It has gained signiﬁcant popularity

due to its efﬁciency, scalability and effectiveness in

a wide variety of data science applications, including

bioinformatics and genomics.

In bioinformatics, XGBoost has been applied in

tasks such as predicting gene expression values (Li

et al., 2019), identifying disease from biomarkers

(Sharma and Verbeke, 2020), and classifying complex

biological data, such as those coming from SNP data

(Medvedev et al., 2022). One of the major advantages

of XGBoost is its ability to handle sparse data efﬁ-

ciently, which is especially useful when dealing with

medical and biological datasets that require data col-

lection, where missing values are common.

A model trained with XGBoost can be explained easily and effectively. Gradient boosting is based on decision trees, which are far easier to interpret than more complex models such as neural networks. Each decision tree represents a series of decisions (or splits) based on feature values, and these decisions can be examined visually to understand how the model arrives at its predictions. Operationally, each decision node tests a feature. In the simplest case, depending on the value of this feature, the tree branches into two leaves, each containing a specific value that is added to the model's output.
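To make the additive reading of a boosted ensemble concrete, the following minimal sketch sums the leaf values selected by a few depth-one trees for a single genotype vector and maps the total to a probability. The SNP indices, thresholds and leaf values are invented for illustration and do not come from any trained model; trees in a real XGBoost model are usually deeper.

```python
import math

# Each hypothetical stump: (SNP index, threshold, leaf value if genotype < threshold,
# leaf value otherwise). All numbers are illustrative assumptions.
STUMPS = [
    (0, 1, -0.4, 0.3),
    (2, 2, 0.1, -0.5),
    (3, 1, 0.2, -0.2),
]

def raw_score(genotype, stumps=STUMPS):
    """Sum the leaf value selected by each stump for one genotype vector."""
    total = 0.0
    for snp, threshold, left, right in stumps:
        total += left if genotype[snp] < threshold else right
    return total

def predict_proba(genotype, stumps=STUMPS):
    """Map the summed leaf values to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-raw_score(genotype, stumps)))

fish = [2, 0, 1, 0]  # allele counts (0, 1 or 2) at four hypothetical SNPs
print(raw_score(fish), predict_proba(fish))
```

Reading the stumps one by one shows which SNPs push the score up or down for a given fish, which is the kind of inspection the paragraph above refers to.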

XGBoost’s feature importance scores are used to

rank the most inﬂuential features contributing to the

prediction. The ability to quantify feature importance

is one of XGBoost’s strengths, allowing us to interpret

which features have the most signiﬁcant impact on the

model’s predictions. This capability is often used in

feature selection before training the real model. The

actual model is trained using the plain genotype coded

as an integer vector (see Section 3.1) without any ad-

ditional preprocessing.

2.1.1 Training Parameters

The hyperparameters tuned for the model were: the number of trees (from 10 to 100); the grow policy (either loss-guide or depth-wise); the learning rate (from 0.01 to 0.2); the maximum tree depth (between 4 and 6); the minimum child weight (between 1 and 3); and the regularization parameters λ (from 0 to 5) and γ (from 0 to 5), which can be tuned to add complexity or limit overfitting. The best hyperparameters were chosen through a grid search.
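As an illustration of what such a grid search enumerates, the sketch below builds the Cartesian product of candidate values and picks the best configuration under a scoring function. The candidate values are assumptions chosen inside the reported ranges (the text gives only the ranges, not the grid points), the parameter names follow the XGBoost scikit-learn wrapper, and `score_fn` is a hypothetical placeholder for whatever validation score is used.

```python
from itertools import product

# Candidate values are assumptions chosen inside the ranges reported in the text.
GRID = {
    "n_estimators": [10, 50, 100],
    "grow_policy": ["depthwise", "lossguide"],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [4, 5, 6],
    "min_child_weight": [1, 2, 3],
    "reg_lambda": [0, 1, 5],
    "gamma": [0, 1, 5],
}

def iter_configs(grid):
    """Yield one dict per point of the Cartesian product of the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(grid, score_fn):
    """Return the configuration with the highest score under score_fn."""
    return max(iter_configs(grid), key=score_fn)

configs = list(iter_configs(GRID))
print(len(configs))  # 3 * 2 * 3 * 3 * 3 * 3 * 3 = 1458
```

Even this coarse grid yields 1,458 configurations, so each added candidate value has a multiplicative cost when every configuration requires training a model.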

2.2 COMBI

The two methods that carry the name COMBI aim at examining the relation between SNPs and phenotypic traits (Mieth et al., 2016) and represent the basis of the interpretable machine learning paradigms in bioinformatics for the analysis of human DNA. These paradigms focus on the explainability of certain traits while still offering predictive capability, and aim at maximizing both of these aspects of classification. To this end, COMBI uses a support vector machine model (Cortes and Vapnik, 1995), originally applied to the Wellcome Trust Case Control Consortium (WTCCC) data of human genome-disease association (Jones et al., 2007), taking advantage of the direct mathematical relationship it provides between inputs and outputs.

The decision-making process of machine learn-

ing algorithms is usually black-box, limiting the in-

terpretability of results in complex contexts such as

SNP data and other biological data. COMBI has

proved extremely useful in providing an answer to

this problem, like detecting genetic risk scores for

quantifying patients’ predisposition to disease on the

WTCCC (Marigorta et al., 2018), advancing precision

medicine in the ﬁeld of oncology for therapy targeted

to each patient (Asada et al., 2021), and predicting

susceptibility to asthma based on SNP information of

individuals (Gaudillo et al., 2019).

Recently, deep learning has emerged as a powerful classification tool, leading to the development of DeepCOMBI (Mieth et al., 2021), a neural network-based classifier that uses layer-wise relevance propagation to achieve the same level of explainability as the original COMBI model, with increased performance on the same WTCCC dataset. DeepCOMBI has been successfully applied to the study of the response of rheumatoid arthritis patients to certain medications based on their genome data, helping to better identify non-responders (Lim et al., 2022), and to improving risk prediction for schizophrenia, a highly heritable disorder whose genetic markers are still unclear (Martins et al., 2024).

In this study, the COMBI framework consists of

the testing and adaptation of the methods used by

COMBI (Mieth et al., 2016) with the Support Vec-

tor Machine (SVM) model and DeepCOMBI (Mieth

et al., 2021) with the Multilayer Perceptron (MLP)

model.

2.2.1 Training Parameters

The following parameters were obtained with a grid-

search selection process, where the search space ex-

tremities are listed in parentheses after the chosen

value.

The hyperparameters used for the SVM model

were: L2 regularization with C = 100 (1, 100) and

squared hinge loss (Lee and Lin, 2013). The SVM

model was trained until convergence and its optimal

hyperparameters were found through a grid-search

procedure.
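For reference, the L2-regularized linear SVM with squared hinge loss is commonly written as the primal problem below. The notation is introduced here for illustration and is not taken from the COMBI papers: $w$ is the weight vector, $x_i$ the genotype vector of individual $i$, $y_i \in \{-1, +1\}$ its phenotype label, and $N$ the number of training individuals.

```latex
\min_{w} \; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{N} \max\left(0,\; 1 - y_i\, w^{\top} x_i\right)^{2}
```

In this form a larger $C$ gives more weight to the training errors relative to the regularization term, so the chosen value $C = 100$ corresponds to comparatively weak regularization.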

The hyperparameters used for the MLP model were: one hidden layer of 128 neurons (128, 512); a dropout rate of 0.3 between each layer (0.0, 0.5); ReLU activation function; L1 and L2 regularization weighted at 0.1 and 0.01 respectively (0.0001, 0.1); learning rate $10^{-12}$ ($10^{-14}$, $10^{-3}$) with binary cross-entropy loss; and 1,000 epochs of training on the data (100, 1,000).

2.3 Chaos Game Representation

Chaos Game Representation (CGR) is an iterative mapping technique that transforms a sequence defined over an alphabet Σ into an image. In CGR, a sequence is represented as a unique pattern and is mapped to unique coordinates. For any sequence, regardless of its length and background, CGR can encode it into an image by representing each feature through a point identified by coordinates; furthermore, by knowing the coordinates of a feature, CGR allows the inference of the input sequence.

The application of CGR to bioinformatics was first proposed in (Jeffrey, 1990), which introduced an encoding scheme for mapping genomic sequences into squares. In this representation, each vertex of the square corresponds to one of the four DNA nucleotides, with Σ = {A, C, G, T}. Extensions to the framework have followed, involving also RNA, proteins and physicochemical properties (Kania and Sarapata, 2022; Dick and Green, 2020; Akbari Rokn Abadi et al., 2023).

2.3.1 Encoder Unit

To apply the Chaos Game Representation algorithm to sea bass genetic data, we adapted its genomic formulation to genotype sequences. Our proposed CGR encoding for genotypes keeps the square representation, assigning a genotype to each of the vertices except one for backward compatibility with genomic sequences, and keeps the distribution of genotypes within the image clear (Figure 1).

BIOINFORMATICS 2025 - 16th International Conference on Bioinformatics Models, Methods and Algorithms

Let $h = h_1 \ldots h_n$ be a sequence defined over the alphabet $\Sigma = \{0, 1, 2\}$. Then the CGR encoding of the sequence $h$ is the bidimensional representation of the ordered set of pairs $\{(x_i, y_i),\ 0 \le i \le n\}$, where the pair $(x_i, y_i)$ is iteratively defined as:

$$(x_i, y_i) = \frac{1}{2}\left((x_{i-1}, y_{i-1}) + g(h_i)\right) \quad \text{with } i \ge 1 \tag{1}$$

where the origin $O(x_0, y_0) = (0, 0)$ and

$$g(h_i) = \begin{cases} (-1, 1) & h_i = 0 \\ (-1, -1) & h_i = 1 \\ (1, 1) & h_i = 2 \end{cases} \tag{2}$$

Figure 1: The Chaos Game Representation applied to geno-

type sequence. The graphical-conceptual application of

CGR for genotype (Left column) and image conversion for

a sea bass with Active10 SNP features (Right column).
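The encoding of Equations (1) and (2) can be sketched in a few lines, assuming the usual CGR midpoint rule (a factor of 1/2 in the update). The `rasterize` helper, its grid size and the use of point counts as pixel intensities are illustrative assumptions and not the image-generation settings used in this study.

```python
# Corner assigned to each genotype value, as in Equation (2).
CORNERS = {0: (-1.0, 1.0), 1: (-1.0, -1.0), 2: (1.0, 1.0)}

def cgr_points(genotypes):
    """Return the CGR trajectory [(x_0, y_0), ..., (x_n, y_n)] of a genotype sequence."""
    x, y = 0.0, 0.0                                # origin (x_0, y_0)
    points = [(x, y)]
    for h in genotypes:
        gx, gy = CORNERS[h]
        x, y = (x + gx) / 2.0, (y + gy) / 2.0      # midpoint towards the genotype's corner
        points.append((x, y))
    return points

def rasterize(points, size=8):
    """Count points per pixel of a size x size grid covering [-1, 1] x [-1, 1]."""
    image = [[0] * size for _ in range(size)]
    for x, y in points[1:]:                        # skip the origin
        col = min(int((x + 1.0) / 2.0 * size), size - 1)
        row = min(int((1.0 - y) / 2.0 * size), size - 1)   # row 0 is the top edge (y = 1)
        image[row][col] += 1
    return image

print(cgr_points([0, 2, 1]))
```

Because each step halves the distance to a corner, the last genotypes of a sequence determine the coarse position of a point, while earlier genotypes only refine it.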

2.3.2 Classiﬁer Unit

Our classiﬁer unit consists of a Deep Convolutional

Neural Network (DCNN). We chose to use DCNN

due to the large number of SNP features selected

against the small number of sea basses in the ana-

lyzed datasets. We instantiated and tested several net-

works: AlexNet (Krizhevsky et al., 2017), ResNet50

and ResNet101 (He et al., 2016).

AlexNet and ResNet are three-channel architectures and require a fixed image size for training: 227 × 227 for AlexNet and 224 × 224 for ResNet. We replicated the content of the single channel in each of the three channels and resized the images to the dimensions required by each network using bicubic interpolation (Lundh et al., 2024).

2.3.3 Training Parameters

The hyperparameters used for the model were: a final dense layer with one neuron for the binary classification; the SGD optimizer with learning rate $10^{-3}$ and the sparse categorical cross-entropy loss function; and 5-fold cross-validation during training, with a number of epochs from 30 to 120 and batch sizes from 15 to 30, chosen in view of the high dimensionality of the SNP features and the low-rank data treated in this study.

3 EXPERIMENTS AND

DISCUSSION

This section presents the aquaculture data we used in

our research, the experimental setup instantiated on

our classiﬁcation models, and the phenotype classiﬁ-

cation results we obtained on each model and experi-

ment performed.

3.1 Datasets

Data used in our research refer to the study (Mukiibi et al., 2024). In that study, 990 European sea bass, produced in a full-factorial mating scheme using 25 sires and 25 dams, were subjected to a 29-day NNV (nervous necrosis virus) challenge test. Mortality was individually recorded as a binary trait (alive/dead). Phenotypes were fairly balanced, with 54.24% of the fish alive and 45.76% dead (Table 2). Sires, dams and 40 offspring were whole-genome sequenced, whereas the remaining offspring were genotyped using a commercial SNP array consisting of about 30K SNPs. These animals were then imputed to whole-genome sequence, obtaining high-dimensional genotype data consisting of 6,072,853 SNPs for each fish. Since European sea bass is a diploid organism, it has two alleles at each SNP and two possible variations from the reference allele. Genomic data included:

• the SNP feature identifier, composed of the chromosome number and the location of each SNP in base pairs;

• the number of copies of the reference allele (in this case, the minor allele), which can be 0, 1 or 2.
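The integer coding described in the list above can be sketched as follows. The SNP identifiers, alleles and genotypes are hypothetical, and the function names are introduced only for this example.

```python
def count_allele(genotype, counted_allele):
    """Number of copies (0, 1 or 2) of counted_allele in a diploid genotype."""
    if len(genotype) != 2:
        raise ValueError("a diploid genotype has exactly two alleles")
    return sum(1 for allele in genotype if allele == counted_allele)

def encode_fish(genotypes, counted_alleles):
    """Encode one fish as an integer vector, one entry per SNP."""
    return [count_allele(g, a) for g, a in zip(genotypes, counted_alleles)]

# Hypothetical SNP identifiers (chromosome, position in base pairs) and counted alleles.
snp_ids = [("1", 10523), ("1", 88310), ("7", 402)]
counted = ["G", "T", "A"]
fish = [("A", "G"), ("T", "T"), ("C", "C")]   # heterozygous, homozygous counted, homozygous other
print(dict(zip(snp_ids, encode_fish(fish, counted))))
```

Each fish thus becomes a vector over {0, 1, 2} with one entry per SNP, which is the input format shared by all models in this study.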

Having only 990 samples, the feature matrix is

low-rank, making it exceptionally difﬁcult to analyze.

To overcome the data's high dimensionality, we use three groups of features selected on the basis of functional genomic information: 1) Tissue-specific, 2) Active, and 3) Control. Tissue-specific refers to genetic variants located in open chromatin regions based on ATAC-seq data obtained from two key tissues in VNN, brain and head kidney, sampling 10 fish either after infection or mock-infected. Active refers to active regulatory regions based on ATAC-seq and ChIP-seq data from several different tissue types (brain, gill, liver, gonads, skeletal muscle and head kidney) in control fish. Active datasets consist of SNPs included in regulatory regions that were found to be active across at least 80%, 50% and 10% of analysed tissues. Control datasets, matched in size to the Active ones, contain randomly selected SNP features located within non-active regions (i.e. quiescent regions).

Tissue-specific datasets are very large, with about a million SNPs each, whereas Active and Control datasets have far fewer SNPs (thousands or tens of thousands). More details are reported in Table 1.

Table 1: Information on the composition of the datasets in Tissue-specific, Active and Control categories.

Category          Dataset      # SNPs
Tissue-specific   Hk NNV       1,193,048
Tissue-specific   Hk mock      1,082,100
Tissue-specific   Br NNV       775,840
Tissue-specific   Br mock      832,801
Active            Active80     6,862
Active            Active50     11,130
Active            Active10     80,768
Control           Control80    6,862
Control           Control50    11,130
Control           Control10    80,768

All fish were divided into two genomically distant populations through k-means clustering (cluster 0 and cluster 1), minimizing intra-class relatedness and maximizing inter-class genomic distance.

Genomic prediction across genomically divergent populations represents a significant challenge (Amariuta et al., 2020). The two clusters, each representing a genomically distant population, were created by classifying each sea bass according to its genomic information. Clustering animals into only two clusters on the basis of their genomic relationships implies that some genomic diversity remains within each cluster; this is a consequence of the low number of individuals, which does not allow a complete mapping of relationships within each cluster. The clustering of animals into genomically distant populations and their classification by phenotype are shown in Table 2.

Table 2: Distribution of sea bass by genomically distant populations and phenotypes.

            # sea bass   Alive   Dead
Cluster 0   589          373     216
Cluster 1   401          164     237
Total       990          537     453
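The class balance implied by Table 2 can be checked with a short calculation; the counts are taken from the table, while the helper name is introduced only for this example.

```python
# Counts from Table 2: cluster -> (alive, dead).
COUNTS = {"Cluster 0": (373, 216), "Cluster 1": (164, 237)}

def alive_fraction(alive, dead):
    """Fraction of surviving fish in a group."""
    return alive / (alive + dead)

total_alive = sum(a for a, _ in COUNTS.values())
total_dead = sum(d for _, d in COUNTS.values())

for name, (alive, dead) in COUNTS.items():
    print(f"{name}: {alive_fraction(alive, dead):.2%} alive")
print(f"Total: {alive_fraction(total_alive, total_dead):.2%} alive")
```

Overall, 54.24% of the fish survived, but the survival rate is about 63% in cluster 0 and about 41% in cluster 1, so the two clusters differ in phenotype distribution as well as in genomic background.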

3.2 Experimental Setup

We carried out several tests on a random partition and

on the genomically distant population data.

In the random partition test, the data are parti-

tioned using training and testing sets with an 80%-

20% split. Given the high dimensionality of the data,

we used the features selected based on biological

signiﬁcance. We considered ﬁrst the Tissue-speciﬁc

datasets, which are more extensive. Then we move

to the Active datasets that contain far fewer features

compared to the Tissue-speciﬁc ones, helping to as-

sess whether the number of SNPs inﬂuences classi-

ﬁcation performance. To assess the effectiveness of

the selected SNPs for classiﬁcation, we compared the

models trained on the Active selections with Control

models using random SNP selections of the same size.

These Control datasets allowed us to verify whether

the SNPs chosen based on biological relevance were

genuinely contributing to improved classiﬁcation per-

formance.

We also set up the genomically distant population test. The models were trained on the larger partition (cluster 0) and tested on the other (cluster 1), since the larger of the two clusters makes for a more adequate training set.
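The two partitioning strategies can be sketched as follows. The seed, the individual ids and the assignment of ids to clusters are assumptions made for the example; only the cohort size and the cluster sizes come from Table 2.

```python
import random

def random_split(ids, test_fraction=0.2, seed=0):
    """Shuffle ids reproducibly and split them into (train, test)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def cluster_split(ids, cluster_of, train_cluster=0):
    """Train on one cluster and test on all remaining individuals."""
    train = [i for i in ids if cluster_of[i] == train_cluster]
    test = [i for i in ids if cluster_of[i] != train_cluster]
    return train, test

# Toy cohort with the cluster sizes of Table 2; the mapping of ids to clusters
# is made up, since the real clustering depends on the genomic data.
ids = list(range(990))
cluster_of = {i: 0 if i < 589 else 1 for i in ids}

train, test = random_split(ids)
print(len(train), len(test))      # 792 198
train, test = cluster_split(ids, cluster_of)
print(len(train), len(test))      # 589 401
```

The cluster-based split trains on 589 fish instead of 792, so the genomically distant test combines a smaller training set with a change of population between training and testing.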

Four machine learning models were employed:

XGBoost, COMBI SVM, DeepCOMBI, and CGR.

The models were evaluated using accuracy, precision,

recall, and F1-score.
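For reference, the four metrics can be computed from the confusion counts as in the sketch below. Which phenotype is treated as the positive class, and the convention of returning 0 when a denominator is zero, are assumptions of this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0]
print(binary_metrics(y_true, [1, 1, 0, 1, 0, 0]))
# A model that never predicts the positive class has F1 = 0 despite 50% accuracy.
print(binary_metrics(y_true, [0, 0, 0, 0, 0, 0]))
```

An F1-score of 0.00 therefore arises whenever no positive instance is predicted correctly, even if accuracy remains close to the share of the negative class.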

3.3 Results and Discussion

Here we report the results obtained by each model on the data splits described above. The main observation is that the task at hand is quite challenging: the mortality phenotype under examination is not exclusively determined by the genotype, and all results reflect this difficulty. Because of space limitations, this section lists only the F1-score as a performance metric; the complete set of results can be found in the Supplemental Material.

3.3.1 Random Partition Tests

The ﬁrst benchmark used was a random ﬁxed 80%

training and 20% test partition for the individuals in

the Tissue-speciﬁc datasets, as is standard practice in

many machine learning applications and tasks. Tests

on this split are a useful metric to compare the efﬁ-

cacy of these approaches with regard to the rest of the

literature on the subject, minding the challenge of the

data at hand. The results of these tests are listed in

Table 3.


Table 3: F1-scores obtained on the Tissue-specific tests using the random partition split.

Dataset    XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV     0.53      N/A         0.53        0.15
Hk mock    0.53      N/A         0.61        0.58
Br NNV     0.58      0.62        0.61        0.58
Br mock    0.58      0.62        0.48        0.26

Note that the COMBI SVM method uses the un-

derlying LIBLINEAR (Fan et al., 2008) software li-

brary for its implementation, which has a maximum

allowed variable size that is exceeded by the amount

of data in the head-kidney tissues.

The best performance we were able to achieve on the random partition tests was obtained with the COMBI SVM approach on the sets that allowed it, reaching a 62% F1-score, while the head-kidney datasets proved more challenging for XGBoost. Both neural network-based methods, DeepCOMBI and CGR, encounter significantly more difficulty on the Hk NNV and Br mock datasets, but achieve 61% (DeepCOMBI) and 58% (CGR) F1-score on both Hk mock and Br NNV.

The head-kidney and brain tissue data contain an extremely large number of SNPs, since all four datasets represent accessible, but not necessarily active, genomic regions. To significantly reduce the number of features, we decided to use more detailed functional information, including ChIP-seq data, which allowed us to identify active regulatory elements. Despite the inclusion of a larger number of tissues, focusing on active regions only enabled a drastic reduction of features while preserving core information on biological importance. In this way, three datasets containing Active SNPs were tested. To ascertain the quality of these selections, randomly sampled datasets of the same size were used as Control sets, with the expectation that the Active SNP selections would yield higher performance than the Control sets because of their careful filtering process. The comparison of these tests, using the same fixed random 80%-20% split as above, can be found in Table 4.

Table 4: F1-scores obtained on the Active and Control tests using the random partition split.

Dataset     XGBoost   COMBI SVM   DeepCOMBI   CGR
Active80    0.60      0.62        0.42        0.58
Control80   0.51      0.63        0.00        0.10
Active50    0.57      0.60        0.62        0.48
Control50   0.57      0.61        0.04        0.40
Active10    0.58      0.62        0.53        0.52
Control10   0.61      0.63        0.39        0.00

The tests using Active and Control subsets reveal how well each model is able to distinguish between high-quality SNP data and random noise. XGBoost seems to distinguish well between Active80 and Control80 (0.60 versus 0.51), but loses this ability on the 50% and 10% variants, while COMBI SVM seems unable to make meaningful distinctions between any pair of Active and Control sets. Interestingly, both neural network-based approaches display large gaps in F1-score between Active and Control sets, in several cases exceeding 0.4. This suggests that these methods are better suited to distinguishing the random noise of the Control sets from more meaningful SNP data.
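The Active-minus-Control gaps discussed above can be recomputed directly from Table 4. The scores below are copied from the table; the function name is introduced only for this example.

```python
# F1-scores from Table 4 (random partition): dataset -> scores per model.
MODELS = ["XGBoost", "COMBI SVM", "DeepCOMBI", "CGR"]
F1 = {
    "Active80": [0.60, 0.62, 0.42, 0.58], "Control80": [0.51, 0.63, 0.00, 0.10],
    "Active50": [0.57, 0.60, 0.62, 0.48], "Control50": [0.57, 0.61, 0.04, 0.40],
    "Active10": [0.58, 0.62, 0.53, 0.52], "Control10": [0.61, 0.63, 0.39, 0.00],
}

def active_minus_control(f1, level):
    """Per-model F1 gap between the Active and Control sets of one level."""
    active, control = f1[f"Active{level}"], f1[f"Control{level}"]
    return {m: round(a - c, 2) for m, a, c in zip(MODELS, active, control)}

for level in (80, 50, 10):
    print(level, active_minus_control(F1, level))
```

On the 80% sets, for instance, the gap is 0.09 for XGBoost and -0.01 for COMBI SVM, against 0.42 for DeepCOMBI and 0.48 for CGR.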

3.3.2 Genomically Distant Population Tests

The results of the tests performed using the partition of genomically distant individuals into clusters are summarized in Table 5. The clusters were selected to be as genomically distant from each other as possible, so we expected overall performance on this task to be worse than on the random partition tests. In these tests, the COMBI SVM framework was able to process all data thanks to the smaller training sets.

Table 5: F1-scores obtained on the Tissue-specific tests using the genomically distant split.

Dataset    XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV     0.60      0.41        0.09        0.40
Hk mock    0.49      0.41        0.34        0.00
Br NNV     0.59      0.42        0.71        0.31
Br mock    0.51      0.41        0.46        0.67

On these data splits we can see how, as expected, the difficulty of the problem notably increases, owing to the much smaller training set and to groups specifically selected to contain genomically different individuals, which makes phenotype prediction much harder overall. In spite of this, some results go even beyond what the 80%-20% tests were able to achieve, with markedly high F1-scores of 60% on Hk NNV by XGBoost, 71% on Br NNV by DeepCOMBI, and 67% on Br mock by CGR. The Hk mock dataset becomes very challenging for all methods, and COMBI SVM performs badly on all Tissue-specific datasets.

Lastly, the same genomically distant test was performed on the Active and Control datasets, with results listed in Table 6. Using these clusters, the Active and Control tests show more ambiguous results than before. On the 80% sets, all methods except COMBI SVM struggle to distinguish between the random and meaningful data, while the 50% sets show good discriminatory capabilities on all models, with gaps between Active and Control ranging from 0.04 to 0.28. Active10 and Control10 are interestingly only a few hundredths apart on every test, but with Active10 always in the lead, giving the impression that there is just enough difference to meaningfully distinguish the two. When it comes to overall classification performance, the CGR model outperforms all the others on the Active sets, often nearing or exceeding a 70% F1-score, indicating that despite the diversity of the clusters, there are features highlighted by the CGR representation that can distinguish between the two phenotypes well.

Table 6: F1-scores obtained on the Active and Control tests using the genomically distant split.

Dataset     XGBoost   COMBI SVM   DeepCOMBI   CGR
Active80    0.58      0.50        0.25        0.62
Control80   0.63      0.43        0.44        0.74
Active50    0.66      0.49        0.70        0.74
Control50   0.62      0.43        0.42        0.55
Active10    0.64      0.42        0.72        0.76
Control10   0.61      0.39        0.68        0.74

4 CONCLUSIONS

In summary, XGBoost does not often perform the best, but among all models it is the one that most consistently obtained reliable results, often reaching around 60% F1-score across the above tests. COMBI SVM reaches a similar level of reliability on the 80%-20% split tests, but it finds significantly more difficulty in classification across the two clusters, while DeepCOMBI's performance is inconsistent, ranging from very good at over 70% F1-score to extremely poor at under 10%. CGR is similarly inconsistent in most cases, with high peaks and low valleys, but shines when used on the Active sets for the genomically distant population tests.
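The consistency of each model can be summarized by the range of its F1-scores across Tables 3 to 6, as in the sketch below. The scores are copied from the tables; the summary statistic (minimum, maximum and their difference) is a choice made for this example.

```python
# F1-scores per model, in the row order of Tables 3, 4, 5 and 6.
# COMBI SVM has no score for Hk NNV and Hk mock in Table 3 (N/A).
SCORES = {
    "XGBoost": [0.53, 0.53, 0.58, 0.58, 0.60, 0.51, 0.57, 0.57, 0.58, 0.61,
                0.60, 0.49, 0.59, 0.51, 0.58, 0.63, 0.66, 0.62, 0.64, 0.61],
    "COMBI SVM": [0.62, 0.62, 0.62, 0.63, 0.60, 0.61, 0.62, 0.63,
                  0.41, 0.41, 0.42, 0.41, 0.50, 0.43, 0.49, 0.43, 0.42, 0.39],
    "DeepCOMBI": [0.53, 0.61, 0.61, 0.48, 0.42, 0.00, 0.62, 0.04, 0.53, 0.39,
                  0.09, 0.34, 0.71, 0.46, 0.25, 0.44, 0.70, 0.42, 0.72, 0.68],
    "CGR": [0.15, 0.58, 0.58, 0.26, 0.58, 0.10, 0.48, 0.40, 0.52, 0.00,
            0.40, 0.00, 0.31, 0.67, 0.62, 0.74, 0.74, 0.55, 0.76, 0.74],
}

def summarize(scores):
    """Return (minimum, maximum, spread) of a list of F1-scores."""
    low, high = min(scores), max(scores)
    return low, high, round(high - low, 2)

for model, scores in SCORES.items():
    print(model, summarize(scores))
```

XGBoost stays between 0.49 and 0.66 on every test, whereas DeepCOMBI and CGR span 0.00 to 0.72 and 0.00 to 0.76 respectively.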

ACKNOWLEDGMENTS

Authors are supported by the Project funded under

the National Recovery and Resilience Plan (NRRP),

Mission 4 Component 2 Investment 1.4 - Call for

tender No. 3138 of 16 December 2021, rectiﬁed by

Decree n.3175 of 18 December 2021 of Italian Min-

istry of University and Research funded by the Euro-

pean Union – NextGenerationEU; Project code CN-

00000033, Concession Decree No. 1034 of 17 June

2022 adopted by the Italian Ministry of University

and Research, CUP D33C22000960007, Project title

“National Biodiversity Future Center - NBFC”.

REFERENCES

Akbari Rokn Abadi, S., Mohammadi, A., and Koohi, S. (2023). A new profiling approach for DNA sequences based on the nucleotides' physicochemical features for accurate analysis of SARS-CoV-2 genomes. BMC Genomics, 24(1):266.

Amariuta, T., Ishigaki, K., Sugishita, H., Ohta, T., Koido,

M., Dey, K., Matsuda, K., Murakami, Y., Price,

A., Kawakami, E., Terao, C., and Raychaudhuri, S.

(2020). Improving the trans-ancestry portability of

polygenic risk scores by prioritizing variants in pre-

dicted cell-type-speciﬁc regulatory elements. Nature

Genetics, 52:1–9.

Asada, K., Kaneko, S., Takasawa, K., Machino, H., Takahashi, S., Shinkai, N., Shimoyama, R., Komatsu, M., and Hamamoto, R. (2021). Integrated analysis of whole genome and epigenome data using machine learning technology: toward the establishment of precision oncology. Frontiers in Oncology, 11:666937.

Burić, M., Bavčević, L., Grgurić, S., Vresnik, F., Križan, J., and Antonić, O. (2020). Modelling the environmental footprint of sea bream cage aquaculture in relation to spatial stocking design. Journal of Environmental Management, 270:110811.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

Cortes, C. and Vapnik, V. (1995). Support-vector networks.

Mach. Learn., 20(3):273–297.

Dick, K. and Green, J. R. (2020). Chaos game representations & deep learning for proteome-wide protein prediction. In 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE), pages 115–121.

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874.

Gaudillo, J., Rodriguez, J. J. R., Nazareno, A., Baltazar, L. R., Vilela, J., Bulalacao, R., Domingo, M., and Albia, J. (2019). Machine learning approach to single nucleotide polymorphism-based asthma prediction. PLOS ONE, 14(12):1–12.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Jeffrey, H. (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18(8):2163–2170.

Jones, R. W., McArdle, W. L., Ring, S. M., Strachan, D. P., Pembrey, M., Clayton, D. G., Dunger, D. B., Nutland, S., Stevens, H. E., Walker, N. M., Widmer, B., Todd, J. A., et al. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678.

BIOINFORMATICS 2025 - 16th International Conference on Bioinformatics Models, Methods and Algorithms

Kania, A. and Sarapata, K. (2022). Multifarious aspects of the chaos game representation and its applications in biological sequence analysis. Computers in Biology and Medicine, 151:106243.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90.

Lee, C. P. and Lin, C. J. (2013). A study on L2-loss (squared hinge-loss) multiclass SVM. Neural Computation, 25(5):1302–1323.

Li, W., Yin, Y., Quan, X., and Zhang, H. (2019). Gene expression value prediction based on XGBoost algorithm. Frontiers in Genetics, 10.

Lim, A. J., Lim, L. J., Ooi, B. N., Koh, E. T., Tan, J. W. L., Chong, S. S., Khor, C. C., Tucker-Kellogg, L., Leong, K. P., and Lee, C. G. (2022). Functional coding haplotypes and machine-learning feature elimination identifies predictors of methotrexate response in rheumatoid arthritis patients. EBioMedicine, 75.

Lundh, F., Clark, J. A., and contributors (2024). Image module - Pillow (PIL fork) 10.4.0 documentation. Last accessed 23 September 2024.

Marigorta, U. M., Rodríguez, J. A., Gibson, G., and Navarro, A. (2018). Replicability and prediction: Lessons and challenges from GWAS. Trends in Genetics, 34(7):504–517.

Martins, D., Abbasi, M., Egas, C., and Arrais, J. P. (2024). Enhancing schizophrenia phenotype prediction from genotype data through knowledge-driven deep neural network models. Genomics, 116(5):110910.

Medvedev, A., Mishra Sharma, S., Tsatsorin, E., Nabieva, E., and Yarotsky, D. (2022). Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models. PLOS ONE, 17(8):e0273293.

Mieth, B., Kloft, M., Rodríguez, J. A., Sonnenburg, S., Vobruba, R., Morcillo-Suárez, C., Farré, X., Marigorta, U. M., Fehr, E., Dickhaus, T., et al. (2016). Combining multiple hypothesis testing with machine learning increases the statistical power of genome-wide association studies. Scientific Reports, 6(1):36671.

Mieth, B., Rozier, A., Rodriguez, J. A., Höhne, M. M. C., Görnitz, N., and Müller, K.-R. (2021). DeepCOMBI: explainable artificial intelligence for the analysis and discovery in genome-wide association studies. NAR Genomics and Bioinformatics, 3(3):lqab065.

Mukiibi, R., Ferraresso, S., Franch, R., Peruzza, L., Rovere, G. D., Babbucci, M., Bertotto, D., Toffan, A., Pascoli, F., Faggion, S., Peñaloza, C., Tsigenopoulos, C. S., Houston, R. D., Bargelloni, L., and Robledo, D. (2024). Integrated functional genomic analysis identifies the regulatory variants underlying a major QTL for disease resistance in European sea bass. bioRxiv.

Muniesa, A., Basurco, B., Aguilera, C., Furones, D., Reverté, C., Sanjuan-Vilaplana, A., Jansen, M. D., Brun, E., and Tavornpanich, S. (2020). Mapping the knowledge of the main diseases affecting sea bass and sea bream in Mediterranean. Transboundary and Emerging Diseases, 67(3):1089–1100.

Sharma, A. and Verbeke, W. J. M. I. (2020). Improving diagnosis of depression with XGBoost machine learning model and a large biomarkers Dutch dataset (n = 11,081). Frontiers in Big Data, 3.

Uffelmann, E., Huang, Q., Munung, N., De Vries, J., Okada, Y., Martin, A., Martin, H., Lappalainen, T., and Posthuma, D. (2021). Genome-wide association studies. Nature Reviews Methods Primers, 1:1–21.

Uppu, S., Krishna, A., and Gopalan, R. P. (2018). A review on methods for detecting SNP interactions in high-dimensional genomic data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15:599–612.

Vandeputte, M., Gagnaire, P.-A., and Allal, F. (2019). The European sea bass: a key marine fish model in the wild and in aquaculture. Animal Genetics, 50(3):195–206.

You, X., Shan, X., and Shi, Q. (2020). Research advances in the genomics and applications for molecular breeding of aquaculture animals. Aquaculture, 526:735357.

APPENDIX

Supplemental Material

Table 7: Accuracy scores obtained using the random partition split.

Accuracy    XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV      0.58      N/A         0.53        0.55
Hk mock     0.58      N/A         0.46        0.53
Br NNV      0.64      0.66        0.60        0.56
Br mock     0.62      0.66        0.57        0.55
Active80    0.65      0.65        0.55        0.59
Control80   0.57      0.67        0.53        0.54
Active50    0.61      0.64        0.48        0.56
Control50   0.64      0.65        0.55        0.56
Active10    0.60      0.66        0.49        0.65
Control10   0.61      0.66        0.52        0.53

Table 8: Accuracy scores obtained using the genomically distant split.

Accuracy    XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV      0.58      0.52        0.41        0.47
Hk mock     0.52      0.52        0.43        0.41
Br NNV      0.55      0.53        0.55        0.45
Br mock     0.49      0.53        0.45        0.56
Active80    0.55      0.56        0.45        0.56
Control80   0.60      0.54        0.46        0.59
Active50    0.61      0.57        0.56        0.59
Control50   0.57      0.53        0.49        0.53
Active10    0.59      0.53        0.57        0.66
Control10   0.57      0.52        0.55        0.59

Table 9: Precision scores obtained using the random partition split.

Precision   XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV      0.58      N/A         0.49        1.00
Hk mock     0.55      N/A         0.46        0.50
Br NNV      0.62      0.63        0.55        0.52
Br mock     0.59      0.63        0.54        0.45
Active80    0.64      0.62        0.51        0.56
Control80   0.54      0.64        0.00        0.45
Active50    0.58      0.62        0.47        0.52
Control50   0.63      0.63        0.50        0.54
Active10    0.56      0.63        0.46        0.66
Control10   0.57      0.64        0.46        0.00

Table 10: Precision scores obtained using the genomically distant split.

Precision   XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV      0.68      0.76        0.52        0.61
Hk mock     0.65      0.75        0.55        0.00
Br NNV      0.64      0.77        0.58        0.61
Br mock     0.59      0.77        0.54        0.60
Active80    0.64      0.77        0.65        0.64
Control80   0.70      0.78        0.57        0.60
Active50    0.68      0.81        0.58        0.59
Control50   0.64      0.75        0.63        0.63
Active10    0.66      0.78        0.59        0.65
Control10   0.66      0.78        0.58        0.59

Table 11: Recall scores obtained using the random partition split.

Recall      XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV      0.59      N/A         0.57        0.01
Hk mock     0.51      N/A         0.90        0.73
Br NNV      0.55      0.62        0.68        0.65
Br mock     0.57      0.62        0.44        0.15
Active80    0.56      0.62        0.35        0.55
Control80   0.49      0.62        0.00        0.05
Active50    0.56      0.58        0.91        0.43
Control50   0.53      0.59        0.01        0.32
Active10    0.59      0.60        0.62        0.51
Control10   0.65      0.62        0.34        0.00

Table 12: Recall scores obtained using the genomically distant split.

Recall      XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV      0.54      0.28        0.05        0.30
Hk mock     0.39      0.29        0.25        0.00
Br NNV      0.54      0.29        0.92        0.21
Br mock     0.45      0.28        0.40        0.77
Active80    0.53      0.37        0.15        0.60
Control80   0.57      0.30        0.36        0.96
Active50    0.65      0.35        0.88        0.99
Control50   0.59      0.30        0.32        0.49
Active10    0.62      0.29        0.90        0.90
Control10   0.56      0.26        0.82        1.00
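Several entries in the precision and recall tables are extreme, for example a precision of 1.00 paired with a recall of 0.01, or 0.00 for both. As a reading aid, the following minimal Python sketch shows how accuracy, precision, and recall are commonly computed from binary predictions, and how a classifier that predicts almost only one class can yield such values. The function name `binary_metrics`, the toy labels, and the convention of reporting 0.0 when a denominator is empty are assumptions made for illustration; the text does not state which class was treated as positive or how undefined precision was handled.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision and recall for 0/1 labels, with 1 as the positive class.

    Assumption: a ratio with an empty denominator is reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy, balanced labels (hypothetical, not taken from the paper's data).
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A classifier that never predicts the positive class.
never_positive = binary_metrics(labels, [0] * 8)
print(never_positive)  # accuracy 0.5, precision 0.0, recall 0.0

# A classifier that predicts the positive class exactly once, correctly.
one_positive = binary_metrics(labels, [1, 0, 0, 0, 0, 0, 0, 0])
print(one_positive)  # accuracy 0.625, precision 1.0, recall 0.25
```

Under these assumptions, near-chance accuracy combined with precision or recall close to 0 or 1 is consistent with a model that collapses to a single predicted class, so the three tables are best read together rather than metric by metric.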
