
Kania, A. and Sarapata, K. (2022). Multifarious aspects of
the chaos game representation and its applications in
biological sequence analysis. Computers in Biology
and Medicine, 151:106243.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). Im-
agenet classification with deep convolutional neural
networks. Commun. ACM, 60(6):84–90.
Lee, C. P. and Lin, C. J. (2013). A study on l2-loss
(squared hinge-loss) multiclass svm. Neural Compu-
tation, 25(5):1302–1323.
Li, W., Yin, Y., Quan, X., and Zhang, H. (2019). Gene ex-
pression value prediction based on xgboost algorithm.
Frontiers in Genetics, 10.
Lim, A. J., Lim, L. J., Ooi, B. N., Koh, E. T., Tan, J. W. L.,
Chong, S. S., Khor, C. C., Tucker-Kellogg, L., Leong,
K. P., and Lee, C. G. (2022). Functional coding haplo-
types and machine-learning feature elimination identi-
fies predictors of methotrexate response in rheumatoid
arthritis patients. EBioMedicine, 75.
Lundh, F., Clark, J. A., and contributors (2024). Image
module - pillow (pil fork) 10.4.0 documentation. Last
accessed 23 September 2024.
Marigorta, U. M., Rodríguez, J. A., Gibson, G., and
Navarro, A. (2018). Replicability and prediction:
Lessons and challenges from gwas. Trends in
Genetics, 34(7):504–517.
Martins, D., Abbasi, M., Egas, C., and Arrais, J. P. (2024).
Enhancing schizophrenia phenotype prediction from
genotype data through knowledge-driven deep neural
network models. Genomics, 116(5):110910.
Medvedev, A., Mishra Sharma, S., Tsatsorin, E., Nabieva,
E., and Yarotsky, D. (2022). Human genotype-to-
phenotype predictions: Boosting accuracy with non-
linear models. PloS one, 17(8):e0273293.
Mieth, B., Kloft, M., Rodríguez, J. A., Sonnenburg, S.,
Vobruba, R., Morcillo-Suárez, C., Farré, X., Marigorta,
U. M., Fehr, E., Dickhaus, T., et al. (2016). Combining
multiple hypothesis testing with machine learning
increases the statistical power of genome-wide
association studies. Scientific reports, 6(1):36671.
Mieth, B., Rozier, A., Rodriguez, J. A., Höhne, M. M. C.,
Görnitz, N., and Müller, K.-R. (2021). DeepCOMBI:
explainable artificial intelligence for the analysis and
discovery in genome-wide association studies. NAR
Genomics and Bioinformatics, 3(3):lqab065.
Mukiibi, R., Ferraresso, S., Franch, R., Peruzza, L.,
Rovere, G. D., Babbucci, M., Bertotto, D., Toffan, A.,
Pascoli, F., Faggion, S., Peñaloza, C., Tsigenopoulos,
C. S., Houston, R. D., Bargelloni, L., and Robledo, D.
(2024). Integrated functional genomic analysis
identifies the regulatory variants underlying a major qtl for
disease resistance in european sea bass. bioRxiv.
Muniesa, A., Basurco, B., Aguilera, C., Furones, D.,
Reverté, C., Sanjuan-Vilaplana, A., Jansen, M. D., Brun,
E., and Tavornpanich, S. (2020). Mapping the
knowledge of the main diseases affecting sea bass and sea
bream in mediterranean. Transboundary and
Emerging Diseases, 67(3):1089–1100.
Sharma, A. and Verbeke, W. J. M. I. (2020). Improving
diagnosis of depression with xgboost machine learn-
ing model and a large biomarkers dutch dataset (n =
11,081). Frontiers in Big Data, 3.
Uffelmann, E., Huang, Q., Munung, N., De Vries, J.,
Okada, Y., Martin, A., Martin, H., Lappalainen, T.,
and Posthuma, D. (2021). Genome-wide association
studies. Nature Reviews Methods Primers, 1:1–21.
Uppu, S., Krishna, A., and Gopalan, R. P. (2018). A
review on methods for detecting snp interactions in
high-dimensional genomic data. IEEE/ACM Transac-
tions on Computational Biology and Bioinformatics,
15:599–612.
Vandeputte, M., Gagnaire, P.-A., and Allal, F. (2019). The
european sea bass: a key marine fish model in the wild
and in aquaculture. Animal Genetics, 50(3):195–206.
You, X., Shan, X., and Shi, Q. (2020). Research advances in
the genomics and applications for molecular breeding
of aquaculture animals. Aquaculture, 526:735357.
APPENDIX
Supplemental Material
Table 7: Accuracy scores obtained using the random partition split.

Accuracy     XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV       0.58      N/A         0.53        0.55
Hk mock      0.58      N/A         0.46        0.53
Br NNV       0.64      0.66        0.60        0.56
Br mock      0.62      0.66        0.57        0.55
Active80     0.65      0.65        0.55        0.59
Control80    0.57      0.67        0.53        0.54
Active50     0.61      0.64        0.48        0.56
Control50    0.64      0.65        0.55        0.56
Active10     0.60      0.66        0.49        0.65
Control10    0.61      0.66        0.52        0.53
Table 8: Accuracy scores obtained using the genomically distant split.

Accuracy     XGBoost   COMBI SVM   DeepCOMBI   CGR
Hk NNV       0.58      0.52        0.41        0.47
Hk mock      0.52      0.52        0.43        0.41
Br NNV       0.55      0.53        0.55        0.45
Br mock      0.49      0.53        0.45        0.56
Active80     0.55      0.56        0.45        0.56
Control80    0.60      0.54        0.46        0.59
Active50     0.61      0.57        0.56        0.59
Control50    0.57      0.53        0.49        0.53
Active10     0.59      0.53        0.57        0.66
Control10    0.57      0.52        0.55        0.59
Machine Learning Methods for Phenotype Prediction from High-Dimensional, Low Population Aquaculture Data