Authors:
Luise Odenthal 1; Jens Allmer 2 and Malik Yousef 3,4
Affiliations:
1 Department of Machine Learning, Bielefeld University, Bielefeld, Germany
2 Medical Informatics and Bioinformatics, Hochschule Ruhr West, University of Applied Sciences, Mülheim an der Ruhr, Germany
3 Department of Information Systems, Zefat Academic College, Zefat, 13206, Israel
4 Galilee Digital Health Research Center (GDH), Zefat Academic College, Israel
Keyword(s):
MicroRNA, miRNA, Machine Learning, Bioinformatics, Categorization.
Abstract:
Many diseases are driven by dysregulated gene expression, and microRNAs (miRNAs) are key players in post-transcriptional gene regulation. miRBase contains miRNAs from about 200 species organized into about 70 clades. It has been shown that not all miRNAs collected in the database are likely to be real; therefore, novel routes to distinguish correct from false miRNAs should be explored. A simple way to filter new data would be to ensure that a novel miRNA categorizes closely to the species it is said to originate from. Here, a novel approach is presented that automatically assigns an unknown miRNA to its most likely clade/species of origin. For that, an ensemble classifier of multiple two-class random forests was designed, where each random forest was trained on one species/clade pair. The approach was tested with different sampling methods on a dataset taken from miRBase, and it was evaluated using a hierarchical F-measure. The approach predicted 81% to 94% of the test data correctly, depending on the sampling method. This is the first classifier that can classify miRNAs to their species of origin.
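The hierarchical F-measure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which variant was used, so the code below assumes the common ancestor-set definition: each label is expanded to itself plus all its ancestors in the clade tree (root excluded), and precision and recall are computed on the overlap of these sets. The function names, the `parent` mapping, and the three-level toy hierarchy are hypothetical and chosen only for illustration; `hsa`, `ptr`, and `mmu` are the miRBase codes for human, chimpanzee, and mouse.

```python
def ancestors(node, parent):
    """Return the node plus all its ancestors, excluding the root."""
    path = set()
    while node in parent:  # the root has no entry in `parent`
        path.add(node)
        node = parent[node]
    return path


def hierarchical_f_measure(true_labels, predicted_labels, parent):
    """Ancestor-set hierarchical F-measure (assumed variant, see lead-in)."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, predicted_labels):
        t_set = ancestors(t, parent)
        p_set = ancestors(p, parent)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / true_total
    return 2 * precision * recall / (precision + recall)


# Toy hierarchy (hypothetical, heavily simplified): child -> parent.
# "Metazoa" is the root and therefore never appears as a key.
parent = {
    "Vertebrata": "Metazoa",
    "Primates": "Vertebrata",
    "Rodentia": "Vertebrata",
    "hsa": "Primates",
    "ptr": "Primates",
    "mmu": "Rodentia",
}

true_labels = ["hsa", "hsa", "mmu"]
predicted_labels = ["hsa", "ptr", "hsa"]
score = hierarchical_f_measure(true_labels, predicted_labels, parent)
```

Under this definition, predicting chimpanzee for a human miRNA is penalized less than predicting mouse, because the two primates share more of their ancestor path. This partial credit is what distinguishes a hierarchical measure from flat accuracy.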