Authors:
Ana Stanescu and Doina Caragea
Affiliation:
Kansas State University, United States
Keyword(s):
Semi-supervised learning, Expectation maximization, Alternative splicing.
Related Ontology Subjects/Areas/Topics:
Bioinformatics; Biomedical Engineering; Data Mining and Machine Learning; Genomics and Proteomics; Model Design and Evaluation; Pattern Recognition, Clustering and Classification
Abstract:
Advances in DNA sequencing technologies have made it possible to obtain tremendous amounts of data quickly and inexpensively. As a consequence, the associated genome annotation has become the bottleneck in our understanding of genes and their functions. Traditionally, data from biological domains have been analyzed using supervised learning techniques. However, given the large amounts of unlabeled genomics data available, together with only small amounts of labeled data, the use of semi-supervised learning algorithms is desirable. Our purpose is to study the applicability of semi-supervised learning frameworks to DNA prediction problems, with a focus on alternative splicing, a natural biological process that contributes to protein diversity. More specifically, we address the problem of predicting alternatively spliced exons. To utilize the unlabeled data, we train classifiers via the Expectation Maximization method and variants of this method. The experiments show an increase in the quality of the prediction models when unlabeled data are used in the training phase, compared to supervised prediction models that do not make use of the unlabeled data.