when the class distribution is skewed, which is the
case with this dataset.
The results for this dataset were very poor: our
algorithm consistently gravitated towards classifying
each instance as not containing a splice site. We believe
that this is mainly because the k-mers indicating a splice
site occur with low frequency and because their position
relative to the splice site is important. We
will discuss in Section 4 how we propose to address
this issue in future work.
4 CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a domain adaptation classifier
for biological sequences. This algorithm showed
promising classification performance in our experiments.
Our analysis indicates that the closer the target
domain is to the source domain, the better the learned
classifier. Other conclusions drawn from our
observations are: using 2-mers or 3-mers results in better
prediction, with small differences between them;
removing features from the target domain reduces the
accuracy of the classifier; having more labeled target data
increases the accuracy of the classifier; and adding too
much unlabeled target data decreases the accuracy of
the classifier.
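As an illustration of the k-mer features referred to above, the following minimal sketch counts the k-mers in a sequence. It assumes overlapping k-mers over a plain string; the function name `kmer_counts` is hypothetical and this is not necessarily the exact feature extraction used in our experiments.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence (assumed representation)."""
    # Slide a window of width k over the sequence, one position at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Example: the 2-mers of a short DNA fragment.
counts = kmer_counts("ACGAC", 2)
```

With k = 2 or k = 3 the feature space stays small (at most 16 or 64 features for DNA), which may partly explain why these values behave similarly.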
In future work, we would like to investigate how
assigning different weights to the data used for
training influences the prediction accuracy of the algorithm.
We would like to assign a higher weight to the
labeled data from the target domain, since it is more
likely to correctly predict the class of the target test
data than the labeled data from the source domain or
the unlabeled data from the target domain.
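One way such weighting could be realized is sketched below: a multinomial Naive Bayes classifier whose counts are scaled by a per-instance weight. This is only a sketch under stated assumptions, not the algorithm evaluated in this paper: the names `train_weighted_nb` and `predict`, the Laplace smoothing parameter `alpha`, and the weight values are all hypothetical, and the use of unlabeled target data is omitted.

```python
import math
from collections import defaultdict


def train_weighted_nb(instances, alpha=1.0):
    """Train multinomial Naive Bayes from (feature_counts, label, weight) triples."""
    class_weight = defaultdict(float)
    feat_weight = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for counts, label, w in instances:
        class_weight[label] += w              # weighted class count
        for f, c in counts.items():
            feat_weight[label][f] += w * c    # weighted feature count
            vocab.add(f)
    total = sum(class_weight.values())
    log_prior = {y: math.log(cw / total) for y, cw in class_weight.items()}
    log_lik = {}
    for y in class_weight:
        # Laplace-smoothed denominator over the whole vocabulary.
        denom = sum(feat_weight[y].values()) + alpha * len(vocab)
        log_lik[y] = {f: math.log((feat_weight[y][f] + alpha) / denom)
                      for f in vocab}
    return log_prior, log_lik


def predict(model, counts):
    """Return the class with the highest posterior log-score."""
    log_prior, log_lik = model
    scores = {y: log_prior[y] + sum(c * log_lik[y][f]
                                    for f, c in counts.items()
                                    if f in log_lik[y])
              for y in log_prior}
    return max(scores, key=scores.get)


# Hypothetical toy data: two source instances and one labeled target instance.
source = [({"AA": 2}, "pos", 1.0), ({"CC": 2}, "neg", 1.0)]
equal = train_weighted_nb(source + [({"AA": 2}, "neg", 1.0)])
boosted = train_weighted_nb(source + [({"AA": 2}, "neg", 5.0)])
```

In this toy setting, the single labeled target instance only overrides the conflicting source evidence once its weight is raised, which is the effect the proposed weighting is meant to exploit.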
We would also like to evaluate other methods for
selecting the generalizable features. For example,
we would like to investigate whether selecting generalizable
features using the mutual information of the features
instead of their probabilities, in Equation (1), leads to
better classification accuracy.
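For concreteness, the mutual information between a feature and the class label could be estimated as in the sketch below. It assumes each feature has been reduced to a binary presence indicator and uses empirical frequencies; the function name `mutual_information` is hypothetical, and how this score would replace the probabilities in Equation (1) is left open here.

```python
import math
from collections import Counter


def mutual_information(presence, labels):
    """Empirical mutual information (in bits) between a feature and the class."""
    n = len(labels)
    joint = Counter(zip(presence, labels))  # joint counts of (feature, class)
    px = Counter(presence)                  # marginal counts of the feature
    py = Counter(labels)                    # marginal counts of the class
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# A feature that perfectly tracks the class, and one independent of it.
informative = mutual_information([1, 1, 0, 0], ["pos", "pos", "neg", "neg"])
uninformative = mutual_information([1, 0, 1, 0], ["pos", "pos", "neg", "neg"])
```

Features could then be ranked by this score, keeping those that score highly in both the source and the target domain.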
Another aspect we would like to improve is the
accuracy of this classifier on the splice site dataset, so that
it approaches that of state-of-the-art splice
site classifiers, e.g., SVM classifiers. We would like
to reduce the number of motifs using different clustering
strategies, and to identify more discriminative motifs
using Gibbs sampling or MEME. In addition, we
would like to run experiments on smaller splice site
datasets to better understand the characteristics of this
problem.
ACKNOWLEDGEMENTS
The computing for this project was performed on the
Beocat Research Cluster at Kansas State University,
which is funded in part by NSF grants CNS-1006860,
EPS-1006860, EPS-0919443, and MRI-1126709.
Naïve Bayes Domain Adaptation for Biological Sequences