REACTION KERNELS - Structured Output Prediction Approaches for Novel Enzyme Function

Katja Astikainen, Esa Pitkänen, Juho Rousu, Liisa Holm, Sándor Szedmák


The enzyme function prediction problem is usually solved using annotation transfer methods. These methods are suitable when the function of the new protein has been previously characterized and is included in a taxonomy such as the EC hierarchy. However, given a new function that has not been previously described, these approaches arguably do not offer adequate support for the human expert. In this paper, we explore a structured output learning approach, where enzyme function (an enzymatic reaction) is described in a fine-grained fashion with so-called reaction kernels, which allow interpolation and extrapolation in the output (reaction) space. Two structured output models are learned via Kernel Density Estimation and Maximum Margin Regression to predict enzymatic reactions from sequence motifs. We put forward two choices for constructing reaction kernels and experiment with them in the remote homology case, where the functions in the test set have not been seen in the training phase. Our experiments demonstrate the viability of our approach.
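To make the idea of predicting in a kernel-defined output space concrete, the following is a minimal sketch of one generic kernel-density-style scoring rule: a candidate reaction y is scored by summing, over training pairs, the product of an input kernel value and an output kernel value. This is an illustrative assumption, not necessarily the formulation used in the paper; the linear kernels, feature vectors, and names such as `kde_scores` and `reaction_c` are all hypothetical.

```python
# Illustrative sketch only: the kernels, data, and names are assumptions,
# not the paper's actual reaction kernels or models.

def linear_kernel(u, v):
    """Dot product between two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def kde_scores(x_new, train_inputs, train_outputs, candidates,
               input_kernel=linear_kernel, output_kernel=linear_kernel):
    """Score each candidate y as sum_i k_in(x_new, x_i) * k_out(y_i, y)."""
    weights = [input_kernel(x_new, x_i) for x_i in train_inputs]
    return {
        name: sum(w * output_kernel(y_i, y)
                  for w, y_i in zip(weights, train_outputs))
        for name, y in candidates.items()
    }

def predict(x_new, train_inputs, train_outputs, candidates):
    """Return the name of the highest-scoring candidate reaction."""
    scores = kde_scores(x_new, train_inputs, train_outputs, candidates)
    return max(scores, key=scores.get)

# Toy training set: motif indicator vectors (inputs) paired with
# reaction feature vectors (outputs).
train_inputs = [[1, 0, 1], [0, 1, 0]]
train_outputs = [[1, 1, 0], [0, 0, 1]]

# Candidate reactions; reaction_c never appears among the training outputs.
candidates = {
    "reaction_a": [1, 1, 0],
    "reaction_b": [0, 0, 1],
    "reaction_c": [1, 0, 1],
}

x_new = [1, 0, 0]
scores = kde_scores(x_new, train_inputs, train_outputs, candidates)
best = predict(x_new, train_inputs, train_outputs, candidates)
```

In this toy setup, `reaction_c` receives a nonzero score even though it is absent from the training outputs, because it shares features with a training reaction. That is the sense in which a kernel on the output space lets a model rank functions it has not seen before.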



Paper Citation

in Harvard Style

Astikainen K., Pitkänen E., Rousu J., Holm L. and Szedmák S. (2010). REACTION KERNELS - Structured Output Prediction Approaches for Novel Enzyme Function. In Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010) ISBN 978-989-674-019-1, pages 48-55. DOI: 10.5220/0002741700480055
