Authors: Nic Herndon and Doina Caragea
Affiliation: Kansas State University, United States
Keyword(s): Splice Site Prediction, Domain Adaptation, Imbalanced Data, Logistic Regression, Naïve Bayes.
Related Ontology Subjects/Areas/Topics: Bioinformatics; Biomedical Engineering; Data Mining and Machine Learning; Sequence Analysis
Abstract:
Next-generation sequencing (NGS) technologies have made it affordable to sequence any organism, opening the door to assembling and annotating new genomes, even for non-model organisms. One option for annotating a genome is to assemble RNA-Seq reads into a transcriptome and align the transcriptome to the genome assembly to identify the protein-encoding genes. However, this approach has two problems. RNA-Seq is error prone, so the gene models generated with this technique need to be validated. In addition, this method can only capture the genes expressed at the time of sequencing. Machine learning can help address both problems by generating ab initio gene models that provide supporting evidence for the models generated with RNA-Seq and that predict additional genes not expressed during sequencing. However, machine learning algorithms need large amounts of labeled data to learn accurate classifiers, and newly sequenced, non-model organisms have insufficient labeled data. This can be addressed by using the abundant labeled data from a related model organism (the source domain) in conjunction with the little labeled data from the organism of interest (the target domain) to train a classifier in a domain adaptation setting. The method we propose uses this approach and generates accurate classifications on the task of splice site prediction, a difficult and essential step in gene prediction. The method is simple: it combines source and target labeled data, with different weights, into one dataset, and then trains a supervised classifier on the combined dataset. Despite its simplicity, it is surprisingly accurate, with highest areas under the precision-recall curve between 53.33% and 83.57%. Out of the domain adaptation classifiers evaluated (SVM, naïve Bayes, and logistic regression), this method produced the best results in 12 out of the 16 cases studied.
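The idea of pooling source and target data with different weights and training one supervised classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the weights are chosen, how sequences are encoded as features, or which optimizer is used, so the function names (`combine_domains`, `train_weighted_logreg`, `predict_proba`), the weight values, the learning rate, and the two-dimensional toy features below are all assumptions made for illustration. The sketch uses logistic regression trained by plain gradient descent on a per-instance weighted log-loss.

```python
import math


def combine_domains(Xs, ys, Xt, yt, source_weight, target_weight):
    """Pool source and target examples into one dataset with per-instance weights.

    The weight values are hypothetical; the abstract only says the two
    domains receive different weights.
    """
    X = Xs + Xt
    y = ys + yt
    w = [source_weight] * len(Xs) + [target_weight] * len(Xt)
    return X, y, w


def train_weighted_logreg(X, y, w, lr=0.1, epochs=500):
    """Fit logistic regression by gradient descent on a weighted log-loss."""
    n_features = len(X[0])
    theta = [0.0] * n_features
    bias = 0.0
    total = sum(w)
    for _ in range(epochs):
        grad = [0.0] * n_features
        grad_b = 0.0
        for xi, yi, wi in zip(X, y, w):
            z = bias + sum(t * v for t, v in zip(theta, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = wi * (p - yi)  # each example's gradient is scaled by its weight
            for j, v in enumerate(xi):
                grad[j] += err * v
            grad_b += err
        theta = [t - lr * g / total for t, g in zip(theta, grad)]
        bias -= lr * grad_b / total
    return theta, bias


def predict_proba(theta, bias, x):
    """Return the predicted probability of the positive class for one example."""
    z = bias + sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Toy numeric features standing in for encoded sequence windows (assumption).
    Xs = [[1.0, 1.2], [0.9, 0.8], [-1.0, -1.1], [-0.8, -0.9]]  # abundant source data
    ys = [1, 1, 0, 0]
    Xt = [[1.1, 0.5], [-1.0, -0.6]]  # scarce target data
    yt = [1, 0]

    # Give the few target examples more influence than the source examples.
    X, y, w = combine_domains(Xs, ys, Xt, yt, source_weight=0.5, target_weight=2.0)
    theta, bias = train_weighted_logreg(X, y, w)
    print(predict_proba(theta, bias, [1.0, 1.0]))
```

In this sketch, raising `target_weight` relative to `source_weight` lets the small target set pull the decision boundary toward the organism of interest, while the source data still regularizes the fit. In practice the two weights would presumably be tuned on held-out target data.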