and P.pacificus)), using logistic regression with
nucleotide and trimer features. It also produced
the best results when the domains are distant
(for D.melanogaster and A.thaliana), using naïve
Bayes with nucleotide and trimer features, in five
out of eight cases. This behavior is similar to the
one observed by Ng and Jordan (2001), namely
that a generative classifier performs better than a
discriminative one when little labeled training data
is available. For domain adaptation, when the
domains are close, the source labeled data contributes
substantially to the classifier, so a discriminative
classifier performs better than a generative one.
When the domains are distant, the source labeled
data contributes less, and a generative classifier
performs better than a discriminative one. Another
case for which our method produced the best
results is for very distant domains (A.thaliana),
using logistic regression with nucleotide features,
when there is somewhat scarce target labeled data
(6,500 instances). There are only two cases in
which another domain adaptation classifier, the
SVM proposed by Schweikert et al. (2009),
outperformed our proposed method.
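To make the two feature sets named above concrete, the following minimal sketch shows one way nucleotide and trimer features could be extracted from a sequence window. The encoding is an assumption, since this section does not specify it: nucleotides are represented as position-specific indicators and trimers as position-independent counts. The function names and the example window are hypothetical.

```python
from collections import Counter


def nucleotide_features(seq):
    """One indicator feature per (position, nucleotide) pair (assumed encoding)."""
    return {f"pos{i}={base}": 1 for i, base in enumerate(seq.upper())}


def trimer_features(seq):
    """Counts of overlapping 3-mers, ignoring position (assumed encoding)."""
    seq = seq.upper()
    return dict(Counter(seq[i:i + 3] for i in range(len(seq) - 2)))


# Hypothetical window around a candidate splice site.
window = "CAGGTAAGT"

# Combined representation: nucleotide indicators plus trimer counts.
features = {**nucleotide_features(window), **trimer_features(window)}
```

The trimer space has 64 possible values, so trimer features are sparser than single-nucleotide features for the same window, which is the trade-off between simple and complex features discussed in this paper.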
5 CONCLUSIONS AND FUTURE WORK
In this paper we proposed a simple domain adaptation
method to address the absence or scarcity of labeled
data for a target domain by leveraging the large amount
of labeled data from a related domain. We evaluated
this method on a biological problem, splice site
prediction, which is a critical step in gene annotation:
many organisms have little or no labeled data, whereas
related, better-studied model organisms have large
amounts of labeled data.
From our experimental results we made several
observations: in some cases simple features are
preferable to complex ones, when the latter lead to
sparse representations and decreased accuracy, and
vice versa; using more labeled data increases the
accuracy of the classifier; and as the distance between
the domains increases, the contribution of the source
data decreases. More importantly, we observed that our
proposed method performed better than previously
proposed methods, with only a couple of exceptions,
which recommends it for ab initio splice site prediction.
For future work we would like to explore ways to
further increase the accuracy of our method. For
example, we would like to create balanced subsamples
through undersampling and then train an ensemble of
classifiers on these subsamples. In addition, we would
like to experiment with ensembles of classifiers
produced by the different methods proposed, on balanced
datasets. Another direction for future work is to
combine data from multiple organisms and train a
classifier for a target organism, i.e., use multiple
source domains.
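The undersampling idea mentioned above can be sketched as follows. This is only an illustration of the general technique, not an implementation from this work: it assumes one class outnumbers the other, and the helper names, the `fit` callable, and the toy one-dimensional classifier are all hypothetical.

```python
import random


def balanced_subsamples(positives, negatives, n_subsamples, seed=0):
    """Draw n_subsamples balanced training sets by undersampling.

    Each set contains k positives and k negatives, where k is the size
    of the smaller class, sampled without replacement.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    subsamples = []
    for _ in range(n_subsamples):
        data = [(x, 1) for x in rng.sample(positives, k)]
        data += [(x, 0) for x in rng.sample(negatives, k)]
        rng.shuffle(data)
        subsamples.append(data)
    return subsamples


def train_ensemble(subsamples, fit):
    """Train one model per balanced subsample.

    `fit` is a hypothetical callable that takes [(x, label), ...] and
    returns a function mapping x to an estimate of P(label = 1).
    """
    return [fit(data) for data in subsamples]


def ensemble_score(models, x):
    """Average the scores of the ensemble members."""
    return sum(model(x) for model in models) / len(models)


def fit_midpoint(data):
    """Toy stand-in for a real classifier on 1-D inputs (assumption)."""
    pos = [x for x, label in data if label == 1]
    neg = [x for x, label in data if label == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x > threshold else 0.0


# Toy imbalanced data: 3 positives versus 30 negatives.
positives = [5.0, 6.0, 7.0]
negatives = [float(i % 3) for i in range(30)]

subsamples = balanced_subsamples(positives, negatives, n_subsamples=4, seed=1)
models = train_ensemble(subsamples, fit_midpoint)
```

In practice `fit_midpoint` would be replaced by any classifier that returns a probability; averaging the member scores is one simple way to combine them, and other combination rules (for example, majority voting) would fit the same structure.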
ACKNOWLEDGEMENTS
This work was supported by an Institutional Devel-
opment Award (IDeA) from the National Institute of
General Medical Sciences of the National Institutes
of Health under grant number P20GM103418. The
content is solely the responsibility of the authors and
does not necessarily represent the official views of
the National Institute of General Medical Sciences or
the National Institutes of Health. The computing for
this project was performed on the Beocat Research
Cluster at Kansas State University, which is funded
in part by grants MRI-1126709, CC-NIE-1341026,
MRI-1429316, and CC-IIE-1440548.
REFERENCES
Arita, M., Tsuda, K., and Asai, K. (2002). Modeling splic-
ing sites with pairwise correlations. Bioinformatics,
18(suppl 2):S27–S34.
Baten, A. K., Chang, B. C., Halgamuge, S. K., and Li, J.
(2006). Splice site identification using probabilistic
parameters and SVM classification. BMC Bioinformatics,
7(Suppl 5):S15.
Baten, A. K., Halgamuge, S. K., Chang, B., and Wickrama-
rachchi, N. (2007). Biological sequence data prepro-
cessing for classification: A case study in splice site
identification. In Advances in Neural Networks–ISNN
2007, pages 1221–1230. Springer.
Cai, D., Delcher, A., Kao, B., and Kasif, S. (2000).
Modeling splice sites with Bayes networks. Bioinformatics,
16(2):152–158.
Davis, J. and Goadrich, M. (2006). The relationship
between precision-recall and ROC curves. In Proceedings
of the 23rd International Conference on Machine
Learning, pages 233–240. ACM.
Giannoulis, G., Krithara, A., Karatsalos, C., and Paliouras,
G. (2014). Splice site recognition using transfer learn-
ing. In SETN, pages 341–353. Springer.
Gross, S. S., Do, C. B., Sirota, M., and Batzoglou, S.
(2007). CONTRAST: a discriminative, phylogeny-free
approach to multiple informant de novo gene prediction.
Genome Biology, 8(12):R269.
Herndon, N. and Caragea, D. (2014a). Empirical Study of
Domain Adaptation Algorithms on the Task of Splice
Site Prediction. Communications in Computer and In-
formation Science (CCIS 2014). Springer-Verlag.
Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers