Authors:
Giovanni Faldani 1; Enrico Rossignolo 1; Eleonora Signor 1; Alessio Longo 2; Sara Faggion 2; Luca Bargelloni 2; Matteo Comin 1 and Cinzia Pizzi 1
Affiliations:
1 Department of Information Engineering, University of Padova, Padova, 35131, Italy; 2 Department of Comparative Biomedicine and Food Science, University of Padova, Legnaro (PD), 35020, Italy
Keyword(s):
High-Dimensional, Low Population, SNP Data, Machine Learning Classification, Phenotype Prediction.
Abstract:
Recent research has increasingly focused on classification rules within the big data framework, yet many bioinformatics applications still address prediction problems that involve small-sample, high-dimensional data. In phenotype prediction, especially with the rise of large-scale genomic data, a central challenge arises from handling high-dimensional datasets where the number of genetic features (such as SNPs) far exceeds the sample size. A significant example of such high-dimensional, low-sample datasets is found in aquaculture, a rapidly growing sector within global food production and a crucial source of high-quality protein. This study uses data from an experiment performed on European seabass as a test case, focusing on predicting resistance to Viral Nervous Necrosis (VNN) as a specific phenotype of interest. We explore a range of machine learning techniques to address the complexities of high-dimensional data, from established methods like gradient boosting, SVM, and deep learning to newer approaches. This paper evaluates various methods for associating SNPs with phenotypic traits, benchmarking their performance on challenging aquaculture genomic data to provide insight into the effectiveness of these techniques.