Using a Random Forest Classifier to Find Nuclear Export Signals in Proteins of Arabidopsis thaliana

Claudia Rubiano, Thomas Merkle, Tim W. Nattkemper


This paper presents a new computational strategy, based on a random forest classifier, for predicting nuclear export signals (NESs) in proteins of the model plant Arabidopsis thaliana. NESs are amino acid sequences that enable a protein to interact with a nuclear export receptor and thereby be exported from the nucleus to the cytoplasm. The proposed classifier uses two kinds of features: the sequence of the NES, expressed as the score obtained from an HMM profile, and the physicochemical properties of the amino acid residues, expressed as amino acid index values. Around 5000 of the Arabidopsis protein sequences were predicted to contain NESs. A small group of these proteins was tested experimentally for the actual presence of an NES: 11 of the 13 tested proteins interacted with the Arabidopsis receptor Exportin 1 (XPO1a) in yeast two-hybrid assays, which indicates that they contain NESs. This experimental validation of nuclear export activity in a selected group of proteins is an indicator of the potential usefulness of the tool. From a biological perspective, the nuclear export activity observed in these proteins strongly suggests that nucleo-cytoplasmic partitioning could be involved in regulating their functions.
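
The two feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window width, the choice of the Kyte-Doolittle hydropathy scale as the amino acid index, the example peptide, and the helper names `index_features`, `feature_vector`, and `sliding_windows` are all assumptions made for illustration. The HMM profile score is taken as a precomputed number, since the abstract does not say how it was obtained.

```python
# Kyte-Doolittle hydropathy scale, used here only as one example of an
# amino acid index (assumption: the abstract does not name the indices used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def index_features(window, index=KYTE_DOOLITTLE):
    """Map each residue of a candidate window to its amino acid index value."""
    return [index[residue] for residue in window]


def feature_vector(window, profile_score, index=KYTE_DOOLITTLE):
    """Concatenate a precomputed profile score with per-residue index values.

    `profile_score` is a hypothetical input: it stands for the score a
    candidate window would receive from an HMM profile computed elsewhere.
    """
    return [profile_score] + index_features(window, index)


def sliding_windows(sequence, width):
    """Yield (start, window) for every window of `width` residues."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


if __name__ == "__main__":
    peptide = "LALKLAGLDI"  # illustrative leucine-rich peptide
    for start, window in sliding_windows(peptide, 5):
        # 0.0 is a placeholder for the HMM profile score of this window.
        print(start, window, feature_vector(window, profile_score=0.0))
```

Each resulting vector has one entry for the profile score plus one per residue, so fixed-width windows give fixed-length inputs that any random forest implementation could be trained on.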



Paper Citation

in Harvard Style

Rubiano C., Merkle T. and Nattkemper T. W. (2013). Using a Random Forest Classifier to Find Nuclear Export Signals in Proteins of Arabidopsis thaliana. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013), ISBN 978-989-8565-35-8, pages 98-104. DOI: 10.5220/0004192200980104
