Predicting Molecular Functions in Plants using Wavelet-based Motifs

G. Arango-Argoty, A. F. Giraldo-Forero, J. A. Jaramillo-Garzón, L. Duque-Muñoz, G. Castellanos-Dominguez

Abstract

Predicting molecular functions of proteins is a fundamental challenge in bioinformatics. Commonly used algorithms are based on sequence alignments and fail when the training sequences have low percentages of identity with query proteins, as is the case for non-model organisms such as land plants. Machine learning-based algorithms offer a good alternative for prediction, but most of them ignore that molecular functions are conditioned by functional domains rather than by global features of the whole sequence. This work presents a novel application of the wavelet transform to detect discriminant sub-sequences (motifs) and use them as input for a pattern recognition classifier. The results show that the continuous wavelet transform is a suitable tool for the identification and characterization of motifs. The proposed classification methodology also shows good prediction capabilities for datasets with a low percentage of identity among sequences, outperforming BLAST2GO by about 11.5% and PEPSTATS-SVM by 16.4%. In addition, it offers greater interpretability of the obtained results.
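
The idea of scanning a protein with a continuous wavelet transform can be illustrated with a minimal sketch. It is not the authors' implementation: the Kyte-Doolittle hydropathy encoding, the Ricker (Mexican hat) wavelet, the scales, and the function names `encode`, `ricker`, `cwt` and `strongest_window` are all assumptions chosen for illustration, and the classifier stage is left out.

```python
import math

# Kyte-Doolittle hydropathy scale (assumed encoding; the abstract does not
# say which physicochemical property the authors used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Map residues to numbers and subtract the mean of the sequence."""
    values = [scale[aa] for aa in sequence]
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def ricker(t):
    """Ricker (Mexican hat) mother wavelet, an assumed choice."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-t * t / 2.0)


def cwt(signal, scales):
    """Direct-sum CWT: one row of coefficients per scale, one column per position."""
    n = len(signal)
    return [
        [
            sum(signal[k] * ricker((k - b) / a) for k in range(n)) / math.sqrt(a)
            for b in range(n)
        ]
        for a in scales
    ]


def strongest_window(sequence, scales=(2, 3, 4)):
    """Return (start, end, subsequence) around the largest |coefficient|.

    The window half-width is set to the winning scale, a hypothetical
    heuristic for turning a coefficient peak into a candidate motif.
    """
    coeffs = cwt(encode(sequence), scales)
    _, i, b = max(
        (abs(c), i, b) for i, row in enumerate(coeffs) for b, c in enumerate(row)
    )
    half = scales[i]
    start, end = max(0, b - half), min(len(sequence), b + half + 1)
    return start, end, sequence[start:end]


if __name__ == "__main__":
    # Toy sequence: a hydrophobic block flanked by charged residues.
    print(strongest_window("DDDDDDDDIIIIIDDDDDDDD"))
```

In this toy setting the largest coefficient falls on the hydrophobic block, so the returned window is a candidate sub-sequence that could then be turned into features for a classifier.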



Paper Citation


in Harvard Style

Arango-Argoty G., Giraldo-Forero A. F., Jaramillo-Garzón J. A., Duque-Muñoz L. and Castellanos-Dominguez G. (2013). Predicting Molecular Functions in Plants using Wavelet-based Motifs. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 140-145. DOI: 10.5220/0004234201400145


in Bibtex Style

@conference{bioinformatics13,
author={G. Arango-Argoty and A. F. Giraldo-Forero and J. A. Jaramillo-Garzón and L. Duque-Muñoz and G. Castellanos-Dominguez},
title={Predicting Molecular Functions in Plants using Wavelet-based Motifs},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},
year={2013},
pages={140-145},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004234201400145},
isbn={978-989-8565-35-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - Predicting Molecular Functions in Plants using Wavelet-based Motifs
SN - 978-989-8565-35-8
AU - Arango-Argoty G.
AU - Giraldo-Forero A. F.
AU - Jaramillo-Garzón J. A.
AU - Duque-Muñoz L.
AU - Castellanos-Dominguez G.
PY - 2013
SP - 140
EP - 145
DO - 10.5220/0004234201400145